Project setup

How to start your project

Now that you have an assigned user, you can get started with your first project. Following this guide from start to finish will give you all the information you need to apply kern to your data challenges.

📘

Need support?

Please do not hesitate to contact our support if you need assistance. We will be happy to help you. The easiest way to do this is via the in-app chat, which you can find in the bottom right corner. You will also find a ? at the bottom of the left sidebar. When you click on it, tooltips will pop up to give you more information.

📂 Creating projects

There are three ways to create a project:

  • Create a new project (default).
  • Use our sample project, i.e. one that we have prepared and that you can use to better understand our features.
  • Upload a project that you have previously exported.

If you decide to create a new project, you will be asked for the project name and description, which you can change later. You will also be asked to enter a language. This is only necessary for text applications because we parse the texts with a language-specific tokenizer.
The tokenizer can't be changed afterwards!

🚧

Missing a language?

Contact us and we'll be happy to extend the list of available tokenizers with the languages you need.

If you want to follow along with this guide, you can just choose to load a sample project. If that's the case, feel free to skip over the upload section.

💾 Uploading data

We have set up the upload process to work similarly to pandas. You can choose from CSV, Excel, JSON, or even HTML files. When you upload them, you can pass any of the arguments described in the pandas documentation. For example, if you upload a JSON file, you can specify a record orientation by typing orient=records. If you do not specify one, we assume you want the default configuration.
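To see how such an upload option maps onto pandas, here is a minimal sketch: the `orient=records` option corresponds directly to the `orient` argument of `pandas.read_json`.

```python
import io

import pandas as pd

# A record-oriented JSON payload, as you might upload it.
raw = '[{"running_id": 4711, "headline": "AI breakthroughs!"}]'

# The upload option `orient=records` maps directly onto the
# pandas argument of the same name:
df = pd.read_json(io.StringIO(raw), orient="records")

print(df.shape)  # one record, two columns
```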

The file you want to upload may already contain pre-labeled data. To parse this correctly, we expect the file to contain the attributes and labeling tasks (optional, see the section below). Attributes are the value columns of your raw data sets, while labeling tasks contain the sets of labels you want for a particular data set. Imagine you have a headline with a running ID and you want to label the sentiment. A JSON file containing this information might look like this:

[{
  "running_id": 4711,
  "headline": "Recent studies showed breakthroughs in the area of Artificial Intelligence!",
  "headline__sentiment": "Positive"
}]

Please note the two underscores in headline__sentiment. If you want to assign labels on a global record level (i.e. do not assign them to a specific attribute), you can simply omit the attribute (e.g. __global). Of course, you do not have to provide pre-labeled data.
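Putting the naming convention together, a pre-labeled record could be built like this (the task names sentiment and topic are made up for illustration):

```python
import json

record = {
    "running_id": 4711,
    "headline": "Recent studies showed breakthroughs in Artificial Intelligence!",
    # Attribute-level label: <attribute>__<labeling task>
    "headline__sentiment": "Positive",
    # Record-level label: omit the attribute, keep the double underscore
    "__topic": "Science",
}

# The upload file is a JSON array of such records.
payload = json.dumps([record])
```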

When you upload data, you can also distinguish between scale and test datasets. scale datasets are intended for training and validation, and the goal is to scale their size. test datasets, on the other hand, must have the highest possible data quality and should not contain any potential errors. Therefore, we do not apply Weak Supervision to them.
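One way to prepare such a split before uploading is sketched below; the 90/10 ratio and the record fields are arbitrary assumptions, and in practice you would hand-check the test portion.

```python
import random

records = [{"running_id": i, "headline": f"headline {i}"} for i in range(100)]

random.seed(42)  # deterministic shuffle for reproducibility
random.shuffle(records)

split = int(len(records) * 0.9)
scale_set = records[:split]  # for training/validation; grow this over time
test_set = records[split:]   # highest quality; no weak supervision applied
```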

📘

Data types

We infer the data types of your attributes automatically, so that we can help guide you e.g. in the process of creating embeddings or labeling tasks (see sections below).

Once you have uploaded your first dataset, the upload process can be repeated. With each additional upload, your data will be extended or updated. To ensure this process works, you need to provide an appropriate (unique) primary key. This can be a composition of multiple columns, but it should always identify a unique record so that the update process targets the correct data.
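The update semantics described above can be sketched as a simple upsert on a (possibly composite) primary key; the function and field names here are hypothetical, not the actual upload API.

```python
def upsert(existing, incoming, key_columns):
    """Merge incoming records into existing ones, matching on a
    (possibly composite) primary key."""
    index = {tuple(r[c] for c in key_columns): r for r in existing}
    for record in incoming:
        index[tuple(record[c] for c in key_columns)] = record
    return list(index.values())

rows = [{"source": "a", "running_id": 1, "headline": "old"}]
update = [
    {"source": "a", "running_id": 1, "headline": "new"},    # updates a record
    {"source": "b", "running_id": 1, "headline": "added"},  # extends the data
]

merged = upsert(rows, update, key_columns=["source", "running_id"])
```

Note that only the combination of source and running_id is unique here, which is why both columns are needed in the key.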

👍

Record Limits

To provide you with the best possible experience regarding performance, we decided to add some limitations: 50,000 records per project, 25 attributes per project, and 100,000 characters per record. We noticed that these limits ensure a smooth experience while still providing you with meaningful insights regarding the overall goal.

✄ Tokenization

If you have ever worked with natural language processing (NLP), you have most certainly come across tokenization. And who would have thought? We use it too. During the project setup phase, you were asked to select a tokenizer for exactly that reason. The general idea is to use everything at our disposal to support you. If you have ever worked with spaCy, you will be familiar with the given functionality. For example, every text is provided as a spaCy Document within your labeling functions. Furthermore, the information extraction (labeling page) uses tokenized data to display DOM elements that you can select and mark with a label.

To make this work, we need to generate tokenized versions of your records, which takes some time. You can still use the app without a problem, potentially with a reduced set of records for your labeling functions. Once the process is finished, you will be notified.
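As a rough sketch of what a labeling function receives, assuming spaCy is installed: a blank pipeline only tokenizes, but that is enough to show token-level access on a Doc. The function name and label are made up for illustration.

```python
import spacy

# A blank English pipeline tokenizes without requiring a model download.
nlp = spacy.blank("en")
doc = nlp("Recent studies showed breakthroughs in Artificial Intelligence!")

def contains_exclamation(doc):
    """Hypothetical labeling function operating on a spaCy Doc."""
    return "Positive" if any(t.text == "!" for t in doc) else None

label = contains_exclamation(doc)
```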

🖍️ Creating labeling tasks

Labeling tasks help you structure the labeling process for your project. They represent what you want to predict at a specific time in the future. For example, you might want to predict the sentiment of a sentence. Or you might want to extract names so you can anonymize sentences, etc.

You can create as many labeling tasks for a project as you want. Each of these tasks has its own set of labels that you can assign in the Labeling workflow. Currently, the following labeling tasks are available:

  • Multiclass classification: general classification labeling, e.g. sentiment analysis.
  • Information extraction: named entity recognition, e.g. pseudonymization.

We uploaded data with two attributes: "running_id" and "headline". We selected the running_id to be the primary key, which allows us to upload more records later on without the danger of duplicates. We also added the labeling tasks we want to work on with this particular data, which is one classification and one information extraction task.
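The setup described above (two attributes, running_id as primary key, one classification and one extraction task) could be summarized roughly as follows; this dictionary is purely illustrative and not the actual project API.

```python
# Hypothetical summary of the project setup, for illustration only.
project = {
    "primary_key": ["running_id"],
    "attributes": ["running_id", "headline"],
    "labeling_tasks": [
        {
            "name": "sentiment",
            "type": "multiclass classification",
            "attribute": "headline",
            "labels": ["Positive", "Neutral", "Negative"],
        },
        {
            "name": "entities",
            "type": "information extraction",
            "attribute": "headline",
            "labels": ["Person", "Organization"],
        },
    ],
}
```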

⚗️ Creating embeddings

Embeddings are the foundation for our Active Transfer Learning. We apply large, pre-trained machine learning models to your datasets, such as the transformer models from Hugging Face. To create an embedding for an attribute, you can simply select it and then enter the configuration string of the particular model, e.g. distilbert-base-uncased. Additionally, you can select a granularity:

  • Attribute: A vector is calculated for each record containing this attribute. It can be used for similarity search or classification with our Active Transfer Learning module.
  • Token: Uses the project tokenizer in combination with the pre-trained model to compute token-level embeddings, i.e. a matrix is created for each record. This is computationally more complex than attribute-level embeddings.
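The shape difference between the two granularities can be illustrated with random numbers standing in for real model outputs; the dimensions and the mean-pooling step are assumptions (768 is the hidden size of distilbert-base-uncased, and mean pooling is just one common way to reduce a token matrix to a single vector).

```python
import numpy as np

dim = 768        # hidden size of e.g. distilbert-base-uncased
num_tokens = 12  # number of tokens in one record's attribute

rng = np.random.default_rng(0)

# Token granularity: one vector per token -> a matrix per record.
token_embedding = rng.normal(size=(num_tokens, dim))

# Attribute granularity: one vector per record, here via mean pooling
# (an assumption for illustration; the actual pooling may differ).
attribute_embedding = token_embedding.mean(axis=0)

print(token_embedding.shape, attribute_embedding.shape)
```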

Please keep in mind that such models are very large and need to be downloaded (if not already cached) and initialized first. We will keep you informed about the creation process with a progress bar, and you can continue working on other tasks in your project. Once the process is complete, you will be notified in the application.

🚧

No GPU support yet

We currently only offer CPU instances, but we are already working on GPU acceleration. This means that the process may take some time (depending on your workload, it may take a few hours). If you have any problems with the embedding creation, please do not hesitate to contact us at any time.
