v1.5.0 - Export to Label Studio, new data types, massive size reduction and nicer looking comments

For this new version of refinery, we set out to improve some existing parts of the application and to connect it to existing labeling tools. Outside of the application, we reduced the overall size of refinery: all docker images combined are now just 5.2 GB. In the application, you can now select data types for calculated attributes to better reflect how that data should be treated. The global comment system we introduced in v1.4.0 now looks a lot better and slides in smoothly from the side of the screen. If you want to export your records, you can now precisely select the format, attributes, labeling tasks, and much more.

Reduction of refinery's total size

Every time you start refinery, you start more than 20 separate services that are containerized with docker. Those services depend on large libraries, e.g. the npm modules for our frontend or PyTorch for transformers. Before this change, every service was containerized separately, so if two services required PyTorch, each of them installed it in its own virtual environment. This resulted in a total size of 10.96 GB (arm) or 15.32 GB (amd) for refinery v1.4.0, which was quite a lot to pull for every new update. Thanks to some excellent engineering, refinery now only requires you to download 5.2 GB, a size reduction of more than 50 percent. Updates also make full use of the layer structure, so oftentimes only the last layer of an image needs to be pulled, which ranges from a few KB up to ~500 MB depending on the container.
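To make the "more than 50 percent" figure concrete, here is a small calculation based on the sizes quoted above. It assumes that the 5.2 GB total applies to both architectures, which the numbers above do not state explicitly.

```python
# Total image sizes in GB, taken from the figures quoted above.
OLD_SIZES_GB = {"arm": 10.96, "amd": 15.32}  # refinery v1.4.0
NEW_SIZE_GB = 5.2  # refinery v1.5.0 (assumed to hold for both architectures)


def reduction_percent(old_gb: float, new_gb: float) -> float:
    """Return the relative size reduction in percent."""
    return (old_gb - new_gb) / old_gb * 100


# Reduction per architecture, rounded to one decimal place.
reductions = {
    arch: round(reduction_percent(old_gb, NEW_SIZE_GB), 1)
    for arch, old_gb in OLD_SIZES_GB.items()
}

for arch, percent in reductions.items():
    print(f"{arch}: {percent}% smaller")
```

Under that assumption, the reduction is roughly 52.6 percent on arm and 66.1 percent on amd.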

This was achieved by two optimizations: choosing smaller parent images and sharing layers between different images. If you're interested in a more elaborate explanation, we will probably do a blog post about that soon.
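As a rough illustration of why sharing layers helps, the following sketch models images as collections of named layers and compares the total size with and without sharing. The image names, layer names, and sizes are all made up for this example and are not measurements of refinery's images. Docker identifies layers by content digest; the layer name stands in for that digest here.

```python
# Hypothetical images with their layers and layer sizes in MB.
# None of these names or numbers come from refinery.
IMAGES = {
    "service-a": {"slim-python-base": 120, "torch": 800, "service-a-code": 5},
    "service-b": {"slim-python-base": 120, "torch": 800, "service-b-code": 8},
    "service-c": {"slim-python-base": 120, "service-c-code": 2},
}


def size_without_sharing(images: dict) -> int:
    """Every image ships its own copy of every layer."""
    return sum(sum(layers.values()) for layers in images.values())


def size_with_sharing(images: dict) -> int:
    """Identical layers are stored and pulled only once."""
    unique_layers = {}
    for layers in images.values():
        unique_layers.update(layers)  # same layer name means same layer
    return sum(unique_layers.values())


print("without sharing:", size_without_sharing(IMAGES), "MB")
print("with sharing:   ", size_with_sharing(IMAGES), "MB")
```

In this toy setup, the shared base and PyTorch layers are counted once instead of two or three times, which is where most of the saving comes from.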

Comparison of the previous version to the latest one for three docker images used in refinery. The individual size was reduced drastically and the shared size was further optimized.

While we're on the topic, we want to take this opportunity to remind you that removing outdated versions of refinery from docker is not part of our update procedure, so you have to remove the versions you no longer need yourself! If you have questions regarding this, take a look at docker prune or reach out to us during our office hours.

Customizable record export - Supporting Label Studio import format

We are excited to bring you a new record export functionality that lets you customize the export to your needs. You can choose from different presets or make all the decisions yourself. This is great news if you need a specific format or just a selection of attributes. If you're exporting to Label Studio, there is also a neat little feature that prepares your labeling interface based on your refinery project settings.
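To give an idea of what a Label Studio compatible export can look like, here is a minimal sketch that turns plain records into Label Studio's JSON task format, where each task carries its attributes under a "data" key and optional pre-annotations under "predictions". The function name, the record layout, and the tag names "label" and "headline" are assumptions made for this example; the file that refinery actually produces may be structured differently.

```python
import json


def to_label_studio_tasks(records, attributes, label_attribute=None):
    """Convert plain dict records into Label Studio style tasks.

    Only the selected attributes end up in each task's "data" object.
    If label_attribute is given, its value is attached as a pre-annotation.
    """
    tasks = []
    for record in records:
        task = {"data": {name: record[name] for name in attributes}}
        if label_attribute and record.get(label_attribute) is not None:
            task["predictions"] = [{
                "result": [{
                    "from_name": "label",      # hypothetical control tag name
                    "to_name": attributes[0],  # hypothetical object tag name
                    "type": "choices",
                    "value": {"choices": [record[label_attribute]]},
                }]
            }]
        tasks.append(task)
    return tasks


# Made-up example records; "sentiment" plays the role of a labeling task.
records = [
    {"running_id": 1, "headline": "Café opens in Zürich", "sentiment": "positive"},
    {"running_id": 2, "headline": "Markets close lower", "sentiment": None},
]
tasks = to_label_studio_tasks(records, ["headline"], label_attribute="sentiment")
print(json.dumps(tasks, ensure_ascii=False, indent=2))
```

The names used in "from_name" and "to_name" have to match the tags of the labeling interface configured in Label Studio, which is why a labeling interface derived from the project settings is handy.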

Screenshot of the new record export modal, in which you can fully customize the data you want to download.

![](https://files.readme.io/e043fe3-lstudio-example.png)

Screenshot of the Labeling Interface generated by kern depending on your refinery project settings

Specify data types in your attribute calculation

Screenshot of the modal that appears when you add a new attribute on the settings page.

We introduced attribute calculation in v1.3.0, which allows you to create new attributes programmatically directly in refinery. As the potential of this is huge, we wanted to further enhance the usability. That is why you can now specify the attribute type at creation time.

Screenshot of the container logs after the function was tested on ten randomly sampled records. The returned value was "some string", which did not match the required boolean type.

There are two main advantages of specifying the data type of your calculated attribute:

  • better documentation
  • reliability through type safety

The type of the returned values from your function is checked against the type that you specified and if something does not match, you will get a ValueError stating what went wrong. This is especially useful with the "Run on 10" functionality, which can be used to check the correctness of your function on ten randomly sampled records.
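The check described above can be pictured roughly like the following sketch. It is not refinery's implementation: the data type names, the mapping to Python types, and the helper names `check_return_type` and `run_on_sample` are assumptions made for illustration.

```python
import random

# Hypothetical mapping from data type names to Python types.
PYTHON_TYPES = {
    "category": str,
    "text": str,
    "integer": int,
    "float": float,
    "boolean": bool,
}


def check_return_type(value, data_type):
    """Return value unchanged if it matches data_type, else raise ValueError."""
    expected = PYTHON_TYPES[data_type]
    # bool is a subclass of int in Python, so True/False must not pass as "integer".
    matches = isinstance(value, expected) and not (
        expected is int and isinstance(value, bool)
    )
    if not matches:
        raise ValueError(
            f"Expected {data_type} ({expected.__name__}), "
            f"got {type(value).__name__}: {value!r}"
        )
    return value


def run_on_sample(function, records, data_type, sample_size=10, seed=0):
    """Run function on up to sample_size random records and check each result."""
    rng = random.Random(seed)
    sample = rng.sample(records, min(sample_size, len(records)))
    return [check_return_type(function(record), data_type) for record in sample]


# Made-up records and a function that returns the wrong type for "boolean".
records = [{"headline": f"headline {i}"} for i in range(25)]
try:
    run_on_sample(lambda record: "some string", records, "boolean")
except ValueError as error:
    print(error)
```

Running the function on a small random sample first surfaces a type mismatch quickly, before the function is applied to every record.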

We recommend utilizing this typing functionality as future versions of refinery will be able to use this information for much more, e.g. better filtering in the data browser.

Screenshot of the details page of the attribute calculation function. The available attributes are now color-coded to represent their data type.

To get you started, the selected data type also determines the example code that is generated, so you get a sense of what your function could look like.
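As a rough idea of how such functions can differ by data type, here are two sketches, one returning a boolean and one returning an integer. The function names, the `record` dictionary, and the `headline` attribute are assumptions for this example and not the template that refinery generates.

```python
def ac_is_question(record: dict) -> bool:
    """Boolean attribute: does the headline contain a question mark?"""
    return "?" in record["headline"]


def ac_word_count(record: dict) -> int:
    """Integer attribute: number of whitespace-separated words in the headline."""
    return len(record["headline"].split())


# Made-up record with a single text attribute.
example_record = {"headline": "Is refinery getting smaller?"}
print(ac_is_question(example_record))
print(ac_word_count(example_record))
```

In both cases, the value that is returned has to match the data type selected when the attribute was created.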

Comment system got a visual upgrade

GIF of the overview page with the comments on the right-hand side. Upon clicking on the comments icon in the top right corner, the comment section appears.

In the previous version, the comments were shown in a modal in the middle of the screen, which made it hard to use them alongside your work. By making the comments an overlay that slides in, you can now work on your tasks and have the comments open at the same time, so you don't miss anything that could be important. As the comments conceal part of the screen, you can decide whether you want to display them on the left or right-hand side.

Minor changes

  • Bugfix import Primary-Key: Issue#105
  • Bugfix save slice on low confidence: Issue#99
  • Bugfix empty columns display in data browser & labeling: Issue#144
  • Bugfix admin dashboard user deletion (managed version): Issue#132
  • Bugfix utf-8 encoding on json export: Issue#89
  • Color coding for attribute types in labeling function execution environment and attribute calculation