Key concepts

How we aim to build the development environment for AI training data

It is no longer a secret that large amounts of training data are an important prerequisite for reliable and precise supervised learning models, whether for classification, extraction, or generation. At kern, we want to make it much easier for data scientists to obtain exactly this training data.

In doing so, we follow some core concepts that we would like to share with you in advance.

🍃 Greenfield Labeling is about data quantity

Have you ever heard of greenfield and brownfield projects? These are common terms, for example when new IT systems are introduced. In greenfield projects, the systems are developed for a completely new environment; you start from scratch, so to speak. That's why they are also associated with higher risks since the results are usually not yet known.

This also applies to data science and artificial intelligence. Such projects usually require a proof of concept so that the results of a prototype can be validated at an early stage. We coin the term Greenfield Labeling for this.

📘

Greenfield Labeling

Rapid scaling of the amount of training data. The quality of the training data remains important but is not the most important factor. This is NOT to be confused with the quality of the test data, which should always be of the highest possible quality!

Greenfield Labeling is implemented with heuristics, so-called information sources, which are combined through Weak Supervision, a framework for synthesizing them into cleaned-up programmatic labels. kern uses this framework together with Active Transfer Learning, i.e. it integrates pre-trained models and uses their embeddings to actively train lightweight models that act as implicit information sources. Properly applied, this can help you turn raw data into good-quality training data within hours. It's all about getting your project off the ground.
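To make the idea of synthesizing information sources more concrete, here is a minimal sketch. It is not kern's implementation: it uses a plain majority vote as a stand-in for Weak Supervision, and the three sources (`mentions_refund`, `mentions_thanks`, `has_many_exclamations`) as well as the label names are hypothetical examples.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple

ABSTAIN = None  # a source returns this when it has no opinion on a record


# Hypothetical information sources: each maps a text to a label or abstains.
def mentions_refund(text: str) -> Optional[str]:
    return "complaint" if "refund" in text.lower() else ABSTAIN


def mentions_thanks(text: str) -> Optional[str]:
    return "praise" if "thank" in text.lower() else ABSTAIN


def has_many_exclamations(text: str) -> Optional[str]:
    return "complaint" if text.count("!") >= 2 else ABSTAIN


SOURCES: List[Callable[[str], Optional[str]]] = [
    mentions_refund,
    mentions_thanks,
    has_many_exclamations,
]


def weak_label(text: str, sources=SOURCES) -> Tuple[Optional[str], float]:
    """Combine the votes of all sources by simple majority.

    Returns (label, agreement), where agreement is the share of
    non-abstaining sources that voted for the winning label.
    Returns (None, 0.0) if every source abstains or the vote is tied.
    """
    votes = [vote for vote in (source(text) for source in sources) if vote is not ABSTAIN]
    if not votes:
        return None, 0.0
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None, 0.0  # tie between labels: better to abstain than to guess
    label, count = ranked[0]
    return label, count / len(votes)


print(weak_label("I want a refund now!!"))        # ('complaint', 1.0)
print(weak_label("Thanks, but I want a refund"))  # (None, 0.0), the sources disagree
```

A majority vote treats every source as equally reliable. A real Weak Supervision setup would typically also estimate how accurate each source is and weight its votes accordingly.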

We believe that Greenfield Labeling is what enables quick prototyping.


🍂 Brownfield Labeling is about data quality

Unlike Greenfield, Brownfield is much more about improving existing systems. You have to think about integrating legacy systems or third-party vendors, but you also know a lot more about the risks.

And that's true for AI as well:

📘

Brownfield Labeling

Improving the quality of existing training data, i.e. finding potential labeling errors and debugging your data.

For Brownfield Labeling, it is already a given that you have some training data available. Now you need to make sure that it is of the best quality you can afford for your project. This can be done by monitoring and managing your data appropriately. The data needs to be enriched with valuable metadata so you can find data slices that contain systematic errors. You can also use techniques from areas like Confident Learning to help you identify weaknesses in your training data. Your goal is to get from, say, 90% performance to 95% or even 99%.
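As an illustration of how such a check can work, the following sketch flags potentially mislabeled records in the spirit of Confident Learning. It is a simplified version under stated assumptions, not kern's implementation: `find_label_issues` is a hypothetical helper, and `pred_probs` is assumed to hold out-of-sample class probabilities from any classifier you have trained on the data.

```python
from collections import defaultdict
from typing import Dict, List


def find_label_issues(given_labels: List[str], pred_probs: List[Dict[str, float]]) -> List[int]:
    """Return the indices of records whose given label looks suspicious.

    Per-class threshold: the mean predicted probability of class c over all
    records that were labeled c. A record is flagged when the model is
    confident (probability >= threshold) about a class that differs from
    the given label.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for label, probs in zip(given_labels, pred_probs):
        sums[label] += probs[label]
        counts[label] += 1
    thresholds = {c: sums[c] / counts[c] for c in counts}

    issues = []
    for i, (label, probs) in enumerate(zip(given_labels, pred_probs)):
        # Classes the model is confident about for this record.
        confident = {c: p for c, p in probs.items() if c in thresholds and p >= thresholds[c]}
        if confident:
            likely = max(confident, key=confident.get)
            if likely != label:
                issues.append(i)
    return issues


labels = ["complaint", "complaint", "feedback", "feedback", "complaint"]
probs = [
    {"complaint": 0.9, "feedback": 0.1},
    {"complaint": 0.8, "feedback": 0.2},
    {"complaint": 0.1, "feedback": 0.9},
    {"complaint": 0.2, "feedback": 0.8},
    {"complaint": 0.1, "feedback": 0.9},  # labeled "complaint", but looks like feedback
]
print(find_label_issues(labels, probs))  # [4]
```

Flagged records are candidates for a second look, not confirmed errors. Reviewing them manually is what actually improves the data.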

We believe that Brownfield Labeling is what enables reliable and precise predictions.


✨ Labeling = enrichments

The easiest way to explain this is with an example. Let's say you are labeling the intent of an email, i.e. a classification. Now, during your labeling session, you might recognize some patterns. For example, if a certain regular expression or list of terms occurs, it is product feedback. If the sentiment is negative and based on facts, it is most likely a complaint.
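The patterns described above could be written down as small functions. This is only a sketch: the regular expression, the term list, the label names, and the `sentiment` and `is_factual` inputs are hypothetical, and the sentiment score is assumed to come from some upstream enrichment.

```python
import re
from typing import Optional

# Hypothetical pattern and term list that hint at product feedback.
FEEDBACK_PATTERN = re.compile(r"\b(feature request|would be (nice|great) if)\b", re.IGNORECASE)
FEEDBACK_TERMS = {"suggestion", "feedback", "improvement"}


def feedback_by_regex(email: str) -> Optional[str]:
    """Vote 'product feedback' if the pattern occurs, otherwise abstain."""
    return "product feedback" if FEEDBACK_PATTERN.search(email) else None


def feedback_by_terms(email: str) -> Optional[str]:
    """Vote 'product feedback' if any term from the list occurs as a word."""
    tokens = set(re.findall(r"[a-z]+", email.lower()))
    return "product feedback" if tokens & FEEDBACK_TERMS else None


def complaint_by_sentiment(sentiment: float, is_factual: bool) -> Optional[str]:
    """Vote 'complaint' for negative, fact-based emails.

    sentiment is assumed to lie in [-1, 1], with negative values
    meaning negative sentiment.
    """
    return "complaint" if sentiment < 0 and is_factual else None


print(feedback_by_regex("It would be great if the app had a dark mode."))  # product feedback
print(feedback_by_terms("Where is my parcel?"))                            # None
print(complaint_by_sentiment(-0.7, is_factual=True))                       # complaint
```

Each function either votes for a label or abstains by returning `None`, so none of them has to cover every email on its own.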

In a typical project, you would just set the intent of an email. With kern, you have labeling tasks instead - and you can use them 100% dynamically. Integrate extractions with classifications, label multiple classifications at once, or augment your data with third-party applications. Information sources are also enrichments. So, in essence, the goal is to gather as much information as possible about your training data. kern is designed to do just that.
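One way to picture a record that carries several labeling tasks and enrichments at once is the sketch below. The `Record` class and its field names are purely illustrative assumptions and do not reflect kern's actual data model.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Record:
    """Hypothetical container for one email and everything known about it."""

    text: str
    # Classification task name -> label, e.g. "intent" -> "complaint".
    classifications: Dict[str, str] = field(default_factory=dict)
    # Extraction task name -> list of (start, end, label) character spans.
    extractions: Dict[str, List[Tuple[int, int, str]]] = field(default_factory=dict)
    # Free-form metadata, e.g. from third-party applications.
    enrichments: Dict[str, object] = field(default_factory=dict)


record = Record(text="My order 4711 arrived broken.")

# Several classification tasks on the same record.
record.classifications["intent"] = "complaint"
record.classifications["urgency"] = "high"

# An extraction task alongside the classifications.
start = record.text.index("4711")
record.extractions["entities"] = [(start, start + 4, "order_id")]

# An enrichment that is not a label in the narrow sense.
record.enrichments["language"] = "en"

print(record.text[start:start + 4])  # 4711
```

Keeping all of this on the same record is what allows you to filter and slice the data by any combination of labels and metadata later on.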

By scaling data enrichment, we enable data management that helps you turn large amounts of raw data into high-quality training data.

👍

With that being said, we're happy that you chose kern as your dev environment for AI training data!

