
Data export

There are two ways to export your data, each suited to a different purpose: you can download your records, or you can create a project snapshot. Both are accessible on the settings page.

Fig. 1: Screenshot of the settings page where the user is about to create a project snapshot.

Download records

This is the option you will choose most often when you want to use your data outside of refinery. Once you select 'Download records', a modal appears that lets you fully customize the data you want to export (see Fig. 2).

Fig. 2: Modal that appears after clicking on the 'Download records' button.

The default settings are a great starting point, but the "custom" option on the top right lets you adjust every setting to your needs. You can also export your data in the Label Studio format, which comes with custom HTML code to get a refinery-like UI in Label Studio.

If you want to protect your file from unauthorized access, you can specify a password for the resulting zip file. You will then have to enter the password when extracting the data or when importing it back into refinery.

One of the most useful options is the "export amount", which lets you export only a certain data slice. This can help in your data development process when you want to, for example, export only labeled records, export training and testing data separately, or export only data that is above a certain weak supervision confidence threshold.

You don't have to decide that at export time, though. You can also filter the data according to your needs afterward, so let's take a look at the exported file to better understand the schema:

[
    {
        "headline": "$200 Laptops Break a Business Model",
        "running_id": "1479",
        "__clickbait__MANUAL": null,
        "__clickbait__WEAK_SUPERVISION": "yes",
        "__clickbait__WEAK_SUPERVISION__confidence": "0.4685",
        "headline__entities__MANUAL": [
            "O",
            "O",
            "O",
            "O",
            "O",
            "O",
            "O"
        ],
        "headline__entities__WEAK_SUPERVISION": [
            "MONEY",
            "MONEY",
            "O",
            "O",
            "O",
            "O",
            "O"
        ],
        "headline__entities__WEAK_SUPERVISION__confidence": [
            0.83,
            0.83,
            0.0,
            0.0,
            0.0,
            0.0,
            0.0
        ]
    }
]

headline and running_id are the attributes of our records; they were given in the initial data import. Every key that starts with a double underscore __ refers to a full record classification labeling task, which comes with three separate entries:

  • MANUAL: the manually set label for this task.
  • WEAK_SUPERVISION: the weakly supervised label for this task.
  • WEAK_SUPERVISION__confidence: the confidence for the weakly supervised label.

That means the JSON key __clickbait__MANUAL contains the manual label for the full record classification labeling task clickbait as a value.
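Filtering after the export, as mentioned earlier, can be sketched in a few lines of Python. This is a minimal illustration, not part of refinery itself: the helper names filter_by_confidence and manually_labeled are made up for this example, and the record is a shortened copy of the one shown above. Note that the record-level confidence appears as a string in the sample export, so the sketch converts it with float() before comparing.

```python
import json

# Shortened copy of the exported record shown above.
exported = json.loads("""
[
    {
        "headline": "$200 Laptops Break a Business Model",
        "running_id": "1479",
        "__clickbait__MANUAL": null,
        "__clickbait__WEAK_SUPERVISION": "yes",
        "__clickbait__WEAK_SUPERVISION__confidence": "0.4685"
    }
]
""")


def filter_by_confidence(records, task, threshold):
    """Keep records whose weak supervision confidence for a
    full record classification task is at least `threshold`."""
    key = f"__{task}__WEAK_SUPERVISION__confidence"
    kept = []
    for record in records:
        raw = record.get(key)
        if raw is None:
            continue  # no weakly supervised label for this record
        # The sample export stores this value as a string, so convert it.
        if float(raw) >= threshold:
            kept.append(record)
    return kept


def manually_labeled(records, task):
    """Keep records that carry a manual label for the given task."""
    key = f"__{task}__MANUAL"
    return [r for r in records if r.get(key) is not None]


confident = filter_by_confidence(exported, "clickbait", 0.4)
labeled = manually_labeled(exported, "clickbait")
```

With the sample record, a threshold of 0.4 keeps the record, while manually_labeled returns an empty list because __clickbait__MANUAL is null.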

refinery also offers information extraction labeling tasks and classification tasks that are defined on a single attribute rather than on the whole record. Their entries follow the same pattern with three separate entries, but the keys are prefixed with the name of the attribute that the labeling task is defined on. Therefore, headline__entities__MANUAL refers to the manual label for the labeling task entities that is defined on the attribute headline. The labeling task type determines whether the entry is a list (extraction) or a single value (classification).
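The naming pattern described above can be taken apart programmatically. The following is a rough sketch under the assumption that attribute and task names never contain a double underscore themselves; parse_label_key and is_extraction are hypothetical helpers written for this example, not functions provided by refinery.

```python
SOURCES = ("MANUAL", "WEAK_SUPERVISION")


def parse_label_key(key):
    """Split an export key into (attribute, task, source, is_confidence).

    Returns None for plain attribute keys such as "headline".
    `attribute` is None for tasks defined on the whole record.
    Assumes names do not contain "__" themselves.
    """
    parts = key.split("__")
    is_confidence = parts[-1] == "confidence"
    if is_confidence:
        parts = parts[:-1]  # drop the trailing "confidence" marker
    if len(parts) != 3 or parts[2] not in SOURCES:
        return None
    attribute = parts[0] or None  # empty prefix means record-level task
    return attribute, parts[1], parts[2], is_confidence


def is_extraction(value):
    """Extraction entries are lists; classification entries are single values."""
    return isinstance(value, list)


record_level = parse_label_key("__clickbait__MANUAL")
attribute_level = parse_label_key("headline__entities__WEAK_SUPERVISION__confidence")
```

Here record_level has no attribute because the key starts with a double underscore, while attribute_level names headline as the attribute and is flagged as a confidence entry.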

If you need a refresher on the different types of labeling tasks, please have a look at the section labeling tasks.

Creating a project snapshot

This export option is mainly designed as a backup. If you want to export your whole refinery project so that it can be loaded back into refinery at a later point, you should create a project snapshot.

Fig. 3: Modal that appears after selecting the project snapshot creation.

You can customize the export to your needs (see Fig. 3). For example, you can include the embedding tensors, which requires more disk space but saves computation time when importing the snapshot back into refinery. Just like with the "download records" option, you can also specify a password that protects the resulting zip file.