Knowledge bases

How to create and maintain knowledge bases in kern

Knowledge bases are collections of terms that are saved for later use. You can create them manually in the knowledge bases tab, or they are created automatically for every label in an information extraction task (e.g. for NER tasks, there will be a knowledge base for every entity that you can label). If you created a knowledge base manually, you have to add its terms manually. If it was created automatically by an information extraction task, every label you set adds the labeled word to the respective knowledge base.

🧮 Knowledge bases

Often, a good indication of a record's label is the presence of certain keywords in your texts. If you want to create labeling functions that look for those keywords, you don't have to type them manually every time; you can use the knowledge bases directly in your code. That way, your labeling functions always use the latest set of keywords. For an actual code reference, see below.

Let's say you have a binary text classification task in which you want to classify whether the text in a record's "content" (which is an attribute of a record) refers to buying or selling in an auction house. You created an information extraction task on the record's content and labeled a lot of terms relating to the buying process, which were added to the related knowledge base automatically. You can now use those terms like this:

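import knowledge # module containing your knowledge bases

def terms_auction_knowledge_base_lookup(record):
    # Lowercase the text once so the comparison is case-insensitive.
    text = record["content"].text.lower()
    for keyword in knowledge.auction_terms:
        if keyword.lower() in text:
            return "Buy"
    # If no keyword matches, the function returns nothing (no label).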

For more information, see the next page.
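To try the lookup outside of kern, here is a minimal sketch that uses stand-ins. The `knowledge` module is imitated with `types.SimpleNamespace`, and `FakeDoc` is a hypothetical class that only provides the `.text` attribute the function reads. The terms and records are invented for illustration and are not taken from a real project.

```python
from types import SimpleNamespace

# Hypothetical stand-in for kern's `knowledge` module; the terms are made up.
knowledge = SimpleNamespace(auction_terms=["bid", "purchase", "Buy now"])


class FakeDoc:
    """Hypothetical stand-in exposing only the `.text` attribute used below."""

    def __init__(self, text):
        self.text = text


def terms_auction_knowledge_base_lookup(record):
    # Case-insensitive substring match against every term in the knowledge base.
    text = record["content"].text.lower()
    for keyword in knowledge.auction_terms:
        if keyword.lower() in text:
            return "Buy"
    # Falling through returns None, i.e. the function assigns no label.


buy_record = {"content": FakeDoc("I want to place a bid on lot 12.")}
other_record = {"content": FakeDoc("The weather was pleasant.")}

buy_label = terms_auction_knowledge_base_lookup(buy_record)
other_label = terms_auction_knowledge_base_lookup(other_record)
print(buy_label, other_label)
```

Note that this is a plain substring match, so a short term can also match inside a longer word. Whether that is acceptable depends on your terms.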

❗️ Blacklisting terms

Blacklisting is useful if you want to exclude certain terms from your labeling functions, but do not want to delete the labels from which those terms originated. For example, if you have terms with multiple meanings (such as "bank"), you can exclude them. For such ambiguous terms, it is often better to use Active Transfer Learning instead, because transformer-based embeddings contain contextual information that a labeling function does not normally analyze.
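The effect of blacklisting can be pictured with a small sketch. It assumes, purely for illustration, that a knowledge base is a list of terms and the blacklist is a set. The names `lookup_terms`, `label_finance`, `finance_terms`, and `blacklisted`, as well as the "Finance" label, are hypothetical and not part of kern's API.

```python
# Hypothetical data: terms collected from labels, and terms the user blacklisted.
finance_terms = ["bank", "loan", "interest rate"]
blacklisted = {"bank"}  # ambiguous: river bank vs. financial institution


def lookup_terms(terms, blacklist):
    """Return the terms a labeling function should still match on."""
    blocked = {term.lower() for term in blacklist}
    return [term for term in terms if term.lower() not in blocked]


def label_finance(text, terms):
    """Return "Finance" if any active term occurs in the text, else None."""
    lowered = text.lower()
    for term in terms:
        if term.lower() in lowered:
            return "Finance"
    return None


active_terms = lookup_terms(finance_terms, blacklisted)
river_label = label_finance("We sat on the bank of the river.", active_terms)
loan_label = label_finance("She applied for a loan.", active_terms)
print(active_terms, river_label, loan_label)
```

The blacklisted term stays in the original list, so the labels it came from are untouched. It is only skipped when matching.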

