Now, who got the Friends reference? That's right. It's all about insights into our peers. With our previously introduced multi-user labeling workflow, it's now time to gain those insights in the form of a brand-new graph: the inter-annotator agreement. And as if that wasn't enough, we also overhauled our embedding creation and Active Learning module.
You know you're right. I mean, it's common sense, right? In some cases, maybe even in most, this might be true. But the really interesting cases are the ones where you and your colleagues disagree. Not just from a personal standpoint, but for the neural network as well. These are the cases that push you from mediocrity to excellence, which is what all of us aim for. With that said, let's take a look at the new graph.
As you can see, the graph lists each pair of users along with their agreement on the records they both labeled. If both answered the same, the agreement is (as expected) 100%. Other pairs score much lower; for example, Jens and Moritz should talk about their differences and why they decided to choose label A or B.
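Under the hood, this kind of pairwise agreement boils down to a simple ratio: for each pair of annotators, look only at the records both labeled and count how often their labels match. Here's a minimal sketch of that idea (the function name and data layout are our own illustration, not the app's internal API):

```python
from itertools import combinations

def pairwise_agreement(votes):
    """Compute simple pairwise agreement between annotators.

    `votes` maps annotator name -> {record_id: label}. For each pair of
    annotators, we look only at the records both labeled (the intersection)
    and report the share of identical labels.
    """
    result = {}
    for a, b in combinations(sorted(votes), 2):
        shared = votes[a].keys() & votes[b].keys()
        if not shared:
            continue  # no overlap, nothing to compare
        matches = sum(votes[a][r] == votes[b][r] for r in shared)
        result[(a, b)] = matches / len(shared)
    return result

votes = {
    "Jens":   {1: "A", 2: "B", 3: "A"},
    "Moritz": {1: "A", 2: "A", 3: "B"},
}
print(pairwise_agreement(votes))  # {('Jens', 'Moritz'): 0.333...}
```

Jens and Moritz agree on only one of their three shared records, so their agreement is about 33% — exactly the kind of pair that should sit down and talk.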
In a previous version, we had to disable the token embeddings (for NER) to improve the underlying algorithms - and we're happy to tell you that they are back! By introducing standard dimensionality reduction algorithms like PCA, we can now reduce the size of our embeddings by up to 90%. This decreases the amount of space required to store the embeddings and speeds up your workflow substantially. Of course, reducing the overall size introduces a slight accuracy loss, but after some intensive testing, we can firmly tell you: it's not to an extent where you'll notice any shortcomings.
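To make the 90% figure concrete: PCA projects each embedding onto the directions of greatest variance, so a 768-dimensional vector can be stored with only its top ~77 components. Here's a plain-NumPy sketch of that projection via SVD (a real pipeline would typically use a library implementation such as scikit-learn's PCA; the data here is random stand-in values):

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project X onto its top principal components.

    Center the data, take the SVD, and keep only the first
    `n_components` right singular vectors as the projection basis.
    """
    X_centered = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 768))  # stand-in for transformer embeddings
reduced = pca_reduce(embeddings, 77)      # keep ~10% of the dimensions
print(reduced.shape)  # (200, 77)
```

Storage drops by the same factor as the dimensionality, which is where the "up to 90%" comes from.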
This might be a good time to introduce you to our embedders repository. You can take a look at what is happening underneath if you're curious, and integrate the library for your custom applications!
Now back to embeddings and the newly introduced suggestions.
Have you ever heard of Bag of Words? In short, it's a simplifying representation of the given data based on counting the words in a document. Let's say we build a vocabulary over all the text data we have; Bag of Words then builds a vector for each document by counting the occurrences of each token.
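In code, that's just a shared vocabulary plus one count vector per document. A minimal pure-Python sketch (real pipelines would use something like scikit-learn's `CountVectorizer`):

```python
def bag_of_words(documents):
    """Build a shared vocabulary and one count vector per document."""
    vocab = sorted({tok for doc in documents for tok in doc.lower().split()})
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = []
    for doc in documents:
        counts = [0] * len(vocab)
        for tok in doc.lower().split():
            counts[index[tok]] += 1  # count each token occurrence
        vectors.append(counts)
    return vocab, vectors

docs = ["the cat sat", "the cat and the dog"]
vocab, vectors = bag_of_words(docs)
print(vocab)    # ['and', 'cat', 'dog', 'sat', 'the']
print(vectors)  # [[0, 1, 0, 1, 1], [1, 1, 1, 0, 2]]
```

Note that word order is thrown away entirely - hence the "bag".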
Pretty similar to the previous one: Bag of Characters. The only difference: it counts characters rather than words. This is especially helpful in scenarios such as OCR-parsed texts, where mismatched characters are common, or when you want to find units (kg, temperatures, or timestamps) in your text.
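The character-level variant is the same idea with a different unit of counting, which makes it robust to single-character OCR glitches - two vectors stay nearly identical even when one character was misread. A small sketch:

```python
from collections import Counter

def bag_of_characters(documents):
    """Count characters instead of words - useful for noisy OCR text."""
    alphabet = sorted({ch for doc in documents for ch in doc})
    vectors = [[Counter(doc)[ch] for ch in alphabet] for doc in documents]
    return alphabet, vectors

docs = ["5 kg", "5 kq"]  # "kq" could be an OCR misread of "kg"
alphabet, vectors = bag_of_characters(docs)
print(alphabet)  # [' ', '5', 'g', 'k', 'q']
print(vectors)   # [[1, 1, 1, 1, 0], [1, 1, 0, 1, 1]]
```

The two vectors differ in only two positions, so downstream similarity measures still see them as close.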
Last but certainly not least: TF-IDF, also known as term frequency - inverse document frequency. The benefit of TF-IDF lies in identifying how important each token in a document is for a certain classification. You can easily use it to play around with some classifiers.
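The idea in miniature: a token's score is its frequency within a document, weighted down by how many documents contain it, so corpus-wide filler words like "the" score zero while distinctive tokens stand out. A plain-Python sketch:

```python
import math

def tf_idf(documents):
    """TF-IDF: term frequency weighted by inverse document frequency."""
    tokenized = [doc.lower().split() for doc in documents]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n = len(tokenized)
    # df: in how many documents does each token appear?
    df = {t: sum(t in doc for doc in tokenized) for t in vocab}
    idf = {t: math.log(n / df[t]) for t in vocab}
    scores = []
    for doc in tokenized:
        tf = {t: doc.count(t) / len(doc) for t in vocab}
        scores.append({t: tf[t] * idf[t] for t in vocab if tf[t] > 0})
    return scores

docs = ["the cat sat", "the dog ran"]
scores = tf_idf(docs)
# "the" appears in every document, so its idf - and therefore its score - is 0.
print(scores[0]["the"])  # 0.0
```

Library implementations (e.g. scikit-learn's `TfidfVectorizer`) add smoothing and normalization on top, but the core weighting is this.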
The Active Learning module now also comes with a few more customization options. Don't worry; you don't have to use them if you don't want to. New possibilities include setting a minimum confidence that has to be reached for a label to be considered relevant. We also included an option to filter suitable labels, so if you want the module to only return a subset of your labeling task's labels, you can do so.
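Conceptually, both options are simple post-filters on the model's predictions. Here's a hypothetical sketch of how such filtering could look (the function name, parameters, and data layout are our own illustration, not the module's actual API):

```python
def filter_predictions(predictions, min_confidence=0.8, allowed_labels=None):
    """Keep only predictions above a confidence threshold and,
    optionally, within an allowed subset of labels."""
    kept = []
    for label, confidence in predictions:
        if confidence < min_confidence:
            continue  # below the minimum confidence -> not relevant
        if allowed_labels is not None and label not in allowed_labels:
            continue  # outside the requested label subset
        kept.append((label, confidence))
    return kept

preds = [("A", 0.95), ("B", 0.60), ("C", 0.85)]
print(filter_predictions(preds, min_confidence=0.8, allowed_labels={"A", "B"}))
# [('A', 0.95)]
```

"B" is dropped for low confidence and "C" for being outside the allowed subset, leaving only the prediction that satisfies both options.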
- You can now decide which Project Overview graphs are displayed, so the view isn't cluttered
- Project Overview now remembers your previous selections per project
- Labels are now being displayed on the Labeling Function page for easier access
- Embeddings are now being displayed on the Active Learning page for easier access
- Fixed information source overview date issue with specific region settings
- Projects that are being deleted are now excluded on the overview page
- Running an Active Learning module now displays the correct placeholder
- Running multiple functions won't result in overlapping issues
- Overall stability for the database session was increased
- The embedding state is now displayed during the creation
- Rebranding of the main app is now complete - the login page is up next
- Weak Supervision performance boost
- Stat calculation performance boost
- Last Weak Supervision run is now accessible for some additional information
- Tokenization progress visible in settings
- Empty Projects now disable all tabs but settings
- Empty Projects now open in the settings tab
- Fixed a NER Confusion Matrix issue for projects with multiple labeling tasks
- Fixed a small date display issue for the notification center