Project creation and data upload

Setting up the first steps in your new project

Project creation

A video walkthrough of project setup, including embedding integration and labeling task creation, is also available on YouTube.

Once you select to start a new project, the following screen opens up:

[Screenshot: project creation screen]

After you give your project a name (and optionally a description), you can choose one of many spaCy tokenizers by selecting the language of your project. The tokenizer helps to:

  • define atomic information units in your texts, making it much easier to label data during span labeling
  • precompute valuable metadata for each token, which you can later on use for labeling functions and other tasks
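
As a rough illustration of both points, the sketch below tokenizes one of the example headlines with spaCy. It assumes a blank English pipeline (tokenizer only); the tokenizer the application loads for your chosen language may differ, and the selection of attributes shown here is only an example of token-level metadata.

```python
import spacy

# Assumption: a blank English pipeline, which provides the tokenizer
# without requiring a trained model download.
nlp = spacy.blank("en")
doc = nlp("Mike Tyson set to retire after loss")

# Each token is an atomic unit with a character offset and simple flags.
tokens = [
    {
        "text": token.text,
        "start": token.idx,          # character offset in the original text
        "is_title": token.is_title,  # True for title-cased tokens
        "is_alpha": token.is_alpha,  # True if the token is purely alphabetic
    }
    for token in doc
]

for token_info in tokens:
    print(token_info)
```

Span labels can then snap to token boundaries, and flags like these are the kind of metadata a labeling function could later rely on.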

Currently, we support data upload via file, and we use JSON to model the records you provide. In the near future, we'll also enable uploads via API (and Python SDK) and database integrations. Our file uploads work using pandas, so you can specify import options for your files just as you would when reading dataframes. If you're not sure which parameters you can specify, have a look at the pandas documentation pages for reading JSON, CSV, and spreadsheet files.
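
To get a feeling for what such import options do, you can try them in plain pandas first. The following is a minimal sketch, assuming a semicolon-separated CSV export; the sample content is hypothetical, and how the options are entered in the upload dialog is not shown here.

```python
import io

import pandas as pd

# Hypothetical semicolon-separated export; in practice this would be a file path.
raw = io.StringIO("headline;running_id\nIraqi vote remains in doubt;1\n")

# `sep` is a regular pandas.read_csv option.
df = pd.read_csv(raw, sep=";")

# One dictionary per row, matching the record structure used on this page.
records = df.to_dict(orient="records")
print(records)
```

If the file reads correctly in pandas with a given set of options, the same options are a good starting point for the upload.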

For instance, a file containing data about headlines collected from newsletters could look as follows:

[
    {
        "headline": "Mike Tyson set to retire after loss",
        "running_id": 0
    },
    {
        "headline": "Iraqi vote remains in doubt",
        "running_id": 1
    },
    {
        "headline": "Conservatives Ponder Way Out of Wilderness",
        "running_id": 2
    },
    {
        "headline": "Final report blames instrument failure for Adam Air Flight 574 disaster",
        "running_id": 3
    }
]
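
If your data lives in Python, a file of this shape can be produced with the standard library. This is a minimal sketch; the file name headlines.json and the temporary output directory are assumptions made for illustration.

```python
import json
import os
import tempfile

headlines = [
    "Mike Tyson set to retire after loss",
    "Iraqi vote remains in doubt",
]

# running_id is a simple counter that keeps every record unique.
records = [
    {"headline": headline, "running_id": index}
    for index, headline in enumerate(headlines)
]

# Hypothetical file name, written to the system's temporary directory.
path = os.path.join(tempfile.gettempdir(), "headlines.json")
with open(path, "w", encoding="utf-8") as f:
    # ensure_ascii=False keeps non-ASCII characters readable in the file.
    json.dump(records, f, indent=4, ensure_ascii=False)
```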

Uploading existing labeled data

If you already have (partially) labeled data, you can add your labels in the upload. Let's say we have a labeling task called sentiment. We could then upload the labels as follows:

[
    {
        "headline": "Mike Tyson set to retire after loss",
        "running_id": 0,
        "headline__sentiment": "positive"
    },
    {
        "headline": "Iraqi vote remains in doubt",
        "running_id": 1
    }
]

In this example, the first record contains a label, whereas the second has none. If you don't want to associate the label with a specific attribute, you can just leave the attribute out ("__sentiment": "positive").
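
The key convention (attribute name, two underscores, task name) is easy to apply programmatically. The helper below is a hypothetical sketch and not part of any SDK; it only builds the key as described above.

```python
def with_label(record, task, label, attribute=None):
    """Return a copy of `record` with a label stored under '<attribute>__<task>'.

    If `attribute` is None, the key is '__<task>', i.e. the label is not
    tied to a specific attribute.
    """
    key = f"{attribute or ''}__{task}"
    return {**record, key: label}


record = {"headline": "Mike Tyson set to retire after loss", "running_id": 0}

# Label attached to the headline attribute.
attribute_level = with_label(record, "sentiment", "positive", attribute="headline")

# Label without an attribute.
record_level = with_label(record, "sentiment", "positive")

print(attribute_level)
print(record_level)
```

Records without a known label are simply left unchanged, as in the second record of the example above.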

Project settings

As soon as you continue, the data integration and tokenization procedure begins, and you are forwarded to the project settings page.

[Screenshot: project settings page]

At the very top, you see the data schema. We infer the data types of your attributes, but you can change them as you like. If you have a primary key (i.e. a unique attribute), please mark it as well; this is required for adding data later. We'll go into detail about what you can do next on the following pages.
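
Before marking an attribute as primary key, it can be worth checking that its values are really unique. The following is a minimal sketch using only the standard library; the function name duplicate_keys is made up for this example.

```python
from collections import Counter


def duplicate_keys(records, key):
    """Return the sorted values of `key` that occur in more than one record."""
    counts = Counter(record[key] for record in records)
    return sorted(value for value, count in counts.items() if count > 1)


records = [
    {"headline": "Mike Tyson set to retire after loss", "running_id": 0},
    {"headline": "Iraqi vote remains in doubt", "running_id": 1},
    {"headline": "Conservatives Ponder Way Out of Wilderness", "running_id": 2},
]

# An empty list means running_id is safe to use as a unique attribute.
duplicates = duplicate_keys(records, "running_id")
print(duplicates)
```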

