First 10 minutes in Kern Refinery

Your first steps in the application

To get you started with the application, we will look into one exemplary use case: analyzing clickbait data.


Also available on YouTube 📺

Click here to see a video explanation of how you can spend the first ten minutes in the application.

Creating the Clickbait project

For this, we will use the Clickbait dataset, for which we created a sample project. If you want to follow along with this quick start, choose the "Clickbait" option from the "Sample projects" button on the start screen.

Once you are in the settings screen, you can start by creating an embedding. As we have generic English text, we can choose the distilbert-base-uncased transformer model from 🤗 Hugging Face. Create the embedding as in the following screenshot by clicking on "Generate embedding":
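Under the hood, an embedding maps each record's text to one fixed-size vector. The snippet below is only a rough sketch of that idea, with random numbers standing in for the transformer's per-token output (the real work is done for you by refinery and distilbert-base-uncased):

```python
import numpy as np

rng = np.random.default_rng(42)

def embed(tokens, dim=768):
    # A transformer such as distilbert-base-uncased produces one vector per
    # token; here random vectors stand in for that output. Averaging them
    # (mean pooling) yields a single embedding for the whole record.
    token_vectors = rng.normal(size=(len(tokens), dim))  # stand-in for model output
    return token_vectors.mean(axis=0)

vec = embed("Should You Eat Breakfast?".split())
print(vec.shape)  # (768,)
```

Every record thus becomes a point in the same vector space, which is what both neural search and the active learner below operate on.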

Why do we do that? Well, we want to first show you our active transfer learning and neural search. With active transfer learning, we'll be able to automate large parts of the data labeling simply by labeling some reference data. And the neural search engine helps us a lot to navigate through our unlabeled data. You will see this in a second.

While our embeddings are initializing, we can already create a new labeling task. Simply click on "Add labeling task" and input data as in the following screenshot into the modal:

Now, by clicking on the "+" icon, you can create new labels. Insert the following:

  • positive
  • neutral
  • negative

If you want to, you can change the label color and set a hotkey by clicking on the pipette.

Ok, we can now label a few samples manually. We know that this is not the magic part - but it is still a labeling tool after all. Let's quickly label 30 samples per class (our general minimum recommendation for active learning), and you'll see how Kern Refinery helps you build high-quality training data quickly. We head over to the labeling tab and begin:

Neural search

Great, we now have some labeled samples. Let's do the cool stuff! First, we'll use the neural search to find records similar to an existing sample via the given embedding. Let's see an example by clicking on the data browser in the sidebar, and then on "Find similar records" in the record card we want to run similarity search for:
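Conceptually, "Find similar records" ranks all records by how close their embeddings are to the selected one. A minimal sketch with cosine similarity on toy vectors (refinery's actual search index may work differently):

```python
import numpy as np

def most_similar(query_vec, embeddings, top_k=3):
    """Rank records by cosine similarity to the query embedding."""
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query_vec)
    scores = embeddings @ query_vec / norms
    return np.argsort(scores)[::-1][:top_k]  # highest similarity first

# Toy corpus of 5 "record embeddings" (dimension 4 for readability)
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(5, 4))
query = embeddings[2]  # "Find similar records" on record 2

print(most_similar(query, embeddings))  # record 2 ranks itself first
```

The top hits are the records you would see in the data browser after running the similarity search.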

This will become super helpful in cases in which we want to find records that most likely have a similar category or characteristics.

Active transfer learning

Now, that is already cool, but we still need to automate our labeling, right? So let's jump into the active transfer learning. With the few labeled samples, we can create a very basic first heuristic simply by fitting a logistic regression on the embedded and labeled data. Let's create a new active learning heuristic by clicking on "New heuristic" and "Active learning":

A modal will open up, which asks you to insert the settings for your active learner.

Here, we have some template code, which connects a logistic regression with our embedding. Of course, you could choose any other scikit-learn classifier for this. For extraction tasks, you can use our own library sequence-learn.
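To illustrate the idea behind the template (this is not refinery's exact code), here is a standalone sketch that fits a logistic regression on embedded, labeled data and abstains on low-confidence predictions; all data here is synthetic, and the 0.6 threshold is an arbitrary assumption:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Stand-ins for the embedded records: 30 labeled vectors per class
X_labeled = rng.normal(size=(90, 16))
y_labeled = np.repeat(["positive", "neutral", "negative"], 30)

clf = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)

# Predict labels for unlabeled embeddings. A heuristic may abstain, so
# low-confidence predictions are dropped instead of forced.
X_unlabeled = rng.normal(size=(5, 16))
probs = clf.predict_proba(X_unlabeled)
for record_probs in probs:
    best = record_probs.argmax()
    label = clf.classes_[best] if record_probs[best] > 0.6 else None  # abstain
    print(label, round(float(record_probs[best]), 2))
```

The abstain option is what makes this a heuristic rather than a final model: it only votes where it is confident, and weak supervision fills in the rest.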

The rest looks good for now. Let's run the module!

If you scroll to the statistics, you can see that the results are already really good. We can still improve the accuracy afterward, but for now, it is perfectly fine.

Applying weak supervision

Now, this active learner can be used as a heuristic. Using weak supervision, we can integrate as many heuristics as we want in order to create automated and denoised labels. A heuristic neither needs to be 100% precise nor needs to cover all the data. This makes it really easy to come up with different types of heuristics (see also how to build labeling functions).
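To make the denoising idea concrete, here is a deliberately naive sketch that merges the votes of several heuristics by majority vote; refinery's actual weak supervision is more sophisticated (for instance, weighing heuristics by their estimated quality), and the votes below are made up:

```python
from collections import Counter

# Each heuristic labels some records and abstains (None) on the rest.
heuristic_votes = [
    ["positive", None,      "negative", "positive"],  # active learner
    ["positive", "neutral", None,       "negative"],  # labeling function
    [None,       "neutral", "negative", "positive"],  # zero-shot classifier
]

def weakly_supervise(votes_per_heuristic):
    """Naive majority vote per record across all heuristics."""
    merged = []
    for record_votes in zip(*votes_per_heuristic):
        counts = Counter(v for v in record_votes if v is not None)
        merged.append(counts.most_common(1)[0][0] if counts else None)
    return merged

print(weakly_supervise(heuristic_votes))
# → ['positive', 'neutral', 'negative', 'positive']
```

Note how no single heuristic covers every record, yet the merged result does: that is exactly why adding more imperfect heuristics keeps improving the final labels.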

We can just select our one heuristic and use it to apply weak supervision, or build additional heuristics as in the screenshot below. Mark the checkbox, and hit "Weak supervision".

This will take only a few seconds: your heuristics are applied to the available data, and denoised labels are computed. Since we currently have only one heuristic, weak supervision provides the same results as that heuristic. But this is really where weak supervision shines, because it is so simple to add new heuristics and improve the results. For instance, from here we can decide to:

  • Label more records to improve our active learner
  • Implement labeling functions; make sure to check out the lookup lists that are automatically created from span labeling, which are super helpful for labeling functions
  • Build zero-shot classifiers
  • Use the heuristic data to manage our data, because neural search is not the only feature you can use for data management
  • Play around with other embeddings
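As a taste of the labeling-function option from the list above, here is a made-up example; the `headline` field, the cue list, and the returned label are purely illustrative and not taken from the sample project:

```python
# Hypothetical lookup list of clickbait cue phrases (illustrative only)
CLICKBAIT_CUES = {"you won't believe", "shocking", "this one trick", "top 10"}

def lf_contains_clickbait_cue(record):
    """Vote 'positive' when the headline contains a cue phrase, abstain otherwise."""
    headline = record["headline"].lower()
    if any(cue in headline for cue in CLICKBAIT_CUES):
        return "positive"
    return None  # abstain — other heuristics may cover this record

print(lf_contains_clickbait_cue({"headline": "Top 10 Shocking Facts"}))   # positive
print(lf_contains_clickbait_cue({"headline": "Parliament passes budget"}))  # None
```

Like the active learner, such a function only votes where it has evidence, and weak supervision combines its votes with the other heuristics.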

This is it for now with our quick start. We hope it helps you to get started with Kern Refinery. If you have any questions, join our Discord channel.

Cheers, and happy labeling ✌️