Labeling functions

Ideas for kickstarting your automation

We have collected some common labeling functions for you as a guide. You can either copy these or use them for brainstorming 😄

📘

Not sure what labeling functions are?

To better understand how to best use them, take a look at our guide.

#️⃣ Keyword lookups

We have found that keyword lookups make excellent labeling functions in many text use cases. The text is searched for specific terms associated with one or more labels. Such functions are easy to write and often have high precision.

There are two common ways to implement them:

  • As a list lookup
def terms_auction_list_lookup(record):
    keywords = ["forced sale", "foreclosure sale", "court order", "compulsory auction"]
    for keyword in keywords:
        if keyword in record["content"].text.lower():
            return "Buy"
  • As a knowledge base lookup
import knowledge  # module containing your knowledge bases

def terms_auction_knowledge_base_lookup(record):
    for keyword in knowledge.auction_terms:
        if keyword in record["content"].text.lower():
            return "Buy"
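
Since keywords can be associated with more than one label, the same idea extends to several labels at once. The following is a minimal sketch, not a fixed recipe: the `Sell` label and its keywords are made up for illustration, and `SimpleNamespace` only stands in for the record attribute so the function can be tried outside the application.

```python
from types import SimpleNamespace

# Hypothetical keyword lists per label; replace them with your own terms.
KEYWORDS_BY_LABEL = {
    "Buy": ["forced sale", "foreclosure sale", "court order", "compulsory auction"],
    "Sell": ["price increase", "overvalued"],
}

def terms_multi_label_lookup(record):
    text = record["content"].text.lower()  # lowercase once, not once per keyword
    for label, keywords in KEYWORDS_BY_LABEL.items():
        if any(keyword in text for keyword in keywords):
            return label  # the first label with a matching keyword wins
    # no keyword matched: the function returns None, like the lookups above

# Stand-in for a record; only the `.text` attribute used above is modelled.
record = {"content": SimpleNamespace(text="Compulsory auction of a flat in Cologne")}
print(terms_multi_label_lookup(record))
```

Because the first match wins, the order of the labels in the dictionary decides ties, so put the label you trust most first.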

You can also use them for extracting entities:

import knowledge

def terms_names_knowledge_base_lookup(record):
    for token in record["content"]:
        if token.text in knowledge.names:
            yield "Name", token.i, token.i
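
If you want to try an extraction function like this before adding it to your project, simple stand-in tokens are enough. The sketch below rests on assumptions: `NAMES` replaces `knowledge.names`, and `fake_tokens` is a hypothetical helper that splits on whitespace, which is much cruder than real tokenization.

```python
from types import SimpleNamespace

# Hypothetical stand-in for `knowledge.names`; a set gives fast membership checks.
NAMES = {"Johannes", "Maria"}

def terms_names_lookup(record):
    for token in record["content"]:
        if token.text in NAMES:
            yield "Name", token.i, token.i

def fake_tokens(text):
    # Whitespace split as a rough stand-in for real tokens with `.text` and `.i`.
    return [SimpleNamespace(text=word, i=index) for index, word in enumerate(text.split())]

record = {"content": fake_tokens("Yesterday Maria met Johannes")}
print(list(terms_names_lookup(record)))
```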

🌀 Regular expressions

When keywords aren't sufficient, you can switch to regular expressions, which work for both classification and extraction. Common patterns include:

  • Extracting links from raw texts:
import re

# you can scroll to the right to see all the endings
pattern = re.compile(r'((?<=[^a-zA-Z0-9])(?:https?\:\/\/|[a-zA-Z0-9]{1,}\.{1}|\b)(?:\w{1,}\.{1}){1,5}(?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw){1}(?:\/[a-zA-Z0-9]{1,})*)')

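def extract_links(record):
    for token in record["text"]:
        # the pattern opens with a lookbehind, so it can never match at the very start
        # of a string; prefixing a space lets it match tokens that are links
        if pattern.search(" " + token.text):
            yield "Link", token.i, token.i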
  • Extracting HTML tags:
import re

pattern = re.compile(r'<(.*?)>')

def extract_tags(record):
    for token in record["raw_html"]:
        if pattern.match(token.text):
            yield "Tag", token.i, token.i
  • Mentions of a name (e.g. within Slack, like "Today was great @Johannes what about you?")
import re

# an "@" followed by one or more word characters, e.g. "@Johannes"
pattern = re.compile(r"@\w+")

def extract_mentions(record):
    for token in record["message"]:
        if pattern.match(token.text):
            yield "Mention", token.i, token.i + 1
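
The examples above are all extractions, so here is a minimal sketch of a regular expression used for classification instead. It reuses two of the auction terms from the keyword section; the pattern itself and the function name are our own illustration, and `SimpleNamespace` again only stands in for the record attribute.

```python
import re
from types import SimpleNamespace

# Word boundaries and flexible whitespace catch variants that a plain
# substring check would miss or match by accident.
auction_pattern = re.compile(
    r"\b(?:forced|foreclosure)\s+sales?\b|\bcompulsory\s+auctions?\b",
    re.IGNORECASE,
)

def terms_auction_regex(record):
    if auction_pattern.search(record["content"].text):
        return "Buy"

# Matches despite the capital letter, the double space and the plural.
hit = {"content": SimpleNamespace(text="Foreclosure  sales are listed weekly")}
# "enforced sales" contains the substring "forced sale", but the word boundary rejects it.
miss = {"content": SimpleNamespace(text="He enforced sales targets")}

print(terms_auction_regex(hit), terms_auction_regex(miss))
```

The second record shows where this differs from a keyword lookup: `"forced sale" in text` would fire on "enforced sales", while the regular expression does not.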

It is really easy to build functions on top of regular expressions. Here you can find some additional regular expressions we find useful from time to time:

  • r'[^\x00-\x7F]+'
    Meaning: non-ASCII characters such as emojis and other high Unicode characters
    Example: "Let's talk about regular expressions 😉"

  • r'\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b'
    Meaning: IP addresses (IPv4)
    Example: "We might need to contact 172.217.0.0 to find out more."

  • r'([a-zA-Z0-9_]+)\.(jpg|png|gif|jpeg|pdf|ipynb|py)'
    Meaning: file names
    Example: "So where was the file epic_journey.png located?"

  • r'\[(.*?)\]'
    Meaning: content between brackets (here: [, ])
    Example: "Veni, Vidi, Vici [Julius Caesar ~ 46 b.c.]"
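
To see what these expressions return, you can run them on the example sentences with nothing but Python's `re` module. This is a small sketch; the dictionary keys are names we picked for illustration, and the file name pattern escapes the dot so that it only matches a literal ".".

```python
import re

patterns = {
    "non_ascii": re.compile(r'[^\x00-\x7F]+'),
    "ip_address": re.compile(
        r'\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}'
        r'(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b'
    ),
    "file_name": re.compile(r'([a-zA-Z0-9_]+)\.(jpg|png|gif|jpeg|pdf|ipynb|py)'),
    "brackets": re.compile(r'\[(.*?)\]'),
}

examples = {
    "non_ascii": "Let's talk about regular expressions 😉",
    "ip_address": "We might need to contact 172.217.0.0 to find out more.",
    "file_name": "So where was the file epic_journey.png located?",
    "brackets": "Veni, Vidi, Vici [Julius Caesar ~ 46 b.c.]",
}

for name, text in examples.items():
    match = patterns[name].search(text)
    # group(0) is the full match; groups() holds the captured parts, if any
    print(name, repr(match.group(0)), match.groups())
```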

