Label Studio (Data Labeling)
This project contains the Helm values for Label Studio and a helmfile for deploying Label Studio together with its dependencies.
Installation
Dependencies
Before deploying Label Studio you need at least PostgreSQL.
S3 (MinIO) is optional but recommended, so we will use it. The alternative is managing persistent storage manually.
PostgreSQL requires monitoring-related CRDs as a dependency.
In this repo we'll be using TopoLVM for PostgreSQL storage, but it's optional.
TopoLVM adds cert-manager as a dependency.
Deploy using Helmfile
The helmfile is prepared with all the aforementioned optional dependencies:
cd helmfile
helmfile sync -f helmfile.yaml
Deploy using Helm
Alternatively, you can deploy the Helm charts manually, in case you maintain some of these dependencies yourself.
Dependencies
Here is how to deploy the dependencies:
Monitoring CRDs
helm install monitoringcrds ../monitoring/kube-prometheus-stack/charts/crds \
--create-namespace -n monitoring
Cert-manager and TopoLVM
helm install cert-manager ../cert-manager/cert-manager \
--create-namespace -n kosmos-system \
-f helmfile/values/cert-manager.yaml
helm install topolvm ../lvm-csi/topolvm \
--create-namespace -n kosmos-system \
-f helmfile/values/lvm-csi.yaml
PostgreSQL
helm install cnpg ../postgresql/cloudnative-pg \
--create-namespace -n kosmos-sql \
-f helmfile/values/psql-operator.yaml
# "pgcluster" is an arbitrary release name (helm install requires one); the cluster name comes from the values file
helm install pgcluster ../postgresql/cluster \
--create-namespace -n kosmos-sql \
-f helmfile/values/psql-minimal.yaml
S3 MinIO
helm install operator ../s3/operator \
--create-namespace -n kosmos-s3 \
-f helmfile/values/s3-operator-min.yaml
helm install minio-secrets ../s3/minio-secrets \
--create-namespace -n kosmos-s3
helm install s3-tenant ../s3/tenant-5.0.15.tgz \
--create-namespace -n kosmos-s3 \
-f helmfile/values/s3-tenant-min.yaml
Deploy Label Studio
To deploy Label Studio, we use three Helm charts in order: one that generates the secrets, then one that creates the database, then one that installs Label Studio itself.
cd kosmos-apps/label-studio
# generate secrets
helm install label-studio-secrets label-studio-secrets \
--namespace kosmos-data \
--create-namespace \
--set postgres.rootUser=$(kubectl get secret -n kosmos-sql pgcluster-minimal-superuser -o jsonpath="{.data.username}" | base64 --decode) \
--set postgres.rootPassword=$(kubectl get secret -n kosmos-sql pgcluster-minimal-superuser -o jsonpath="{.data.password}" | base64 --decode) \
--set minio.accessKey=minioadmin \
--set minio.secretKey=minioadmin
# create database
helm install label-studio-initdb ../platform-provisioner/initdb \
--namespace kosmos-data \
--set secret.name=label-studio-secrets \
--set secret.dbUserKey=DB_USER \
--set secret.dbPassKey=DB_PASSWORD \
--set secret.dbRootUserKey=PG_USER \
--set secret.dbRootPassKey=PG_PASSWORD \
--set database=labelstudio \
--set host=$(kubectl get secret -n kosmos-sql pgcluster-minimal-superuser -o jsonpath="{.data.host}" | base64 --decode) \
--set dbNamespace=kosmos-sql
# install label-studio
helm install label-studio label-studio-1.6.3.tgz \
--namespace kosmos-data \
-f values.yaml
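Once the three releases are installed, a quick sanity check that everything came up (pod names depend on the charts' defaults):

kubectl get pods -n kosmos-data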
Configuration
Accessing the UI
If your UI doesn't load correctly, then either LABEL_STUDIO_HOST isn't set for some reason,
or you didn't enter http(s):// at the beginning of your URL. Make sure you explicitly enter the protocol instead of letting the browser infer it.
Main parameters
The first thing to know is that the LABEL_STUDIO_HOST variable is very important in a Kubernetes setup because it holds the external host; it is configured automatically if your ingress is enabled.
When this environment variable is set, all links are generated as absolute URLs using LABEL_STUDIO_HOST as the host. This also affects S3 content, even if you have s3.endpointUrl set. If the variable isn't set, links will be relative and won't work if the browser doesn't have access to the internal host.
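For reference, here is a minimal sketch of enabling the ingress that triggers this automatic configuration. The keys app.ingress.enabled and app.ingress.host are assumptions; check your chart's values.yaml for the exact names:

app:
  ingress:
    enabled: true
    host: label-studio.wip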
You can configure S3 persistence and the admin username/password this way.
We need to set up a default user; remember that the default username has to include @ because Label Studio expects an email address.
We are also using an external domain since we're in a Kubernetes ingress setting.
These are the parameters needed for that:
app:
  extraEnvironmentVars:
    LABEL_STUDIO_USERNAME: admin@localhost # or admin@athea.tech
    LABEL_STUDIO_PASSWORD: password
    LABEL_STUDIO_HOST: https://label-studio.wip
We will be using S3 storage for persistence. One thing to keep in mind is that connections to the S3 backend are not made by the server; they are made directly by the client (the browser), so we need to expose the S3 API through an ingress.
Here is the configuration for that:
global:
  persistence:
    enabled: true
    type: s3 # s3, azure, gcs
    config:
      s3:
        accessKey: "minioadmin"
        secretKey: "minioadmin"
        bucket: "labelstudio"
        folder: "uploads"
        region: "us-east-1"
        endpointUrl: "http://minio.wip"
Considerations
There are things to look out for here.
Label Studio uses S3 storage in two main ways:
- Through the browser, to retrieve annotation files via URLs to fill the labeling interface. This goes through the ingress minio.<domain>.
- Through the container, to retrieve task files, check connections, push outputs, save tasks, etc. This goes through the service minio.<namespace>.svc.cluster.local.
However, the same URL is used for both, and you cannot provide one for each. This can be worked around by adding a rewrite entry to CoreDNS so that MinIO's ingress domain name is redirected to the service whenever it is requested from inside the cluster.
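As an illustration, the rewrite could look like this in the CoreDNS Corefile (edited via the coredns ConfigMap in kube-system). The domain minio.wip and the namespace kosmos-s3 are taken from this repo's examples; adjust them to your setup:

# inside the .:53 server block of the Corefile
rewrite name minio.wip minio.kosmos-s3.svc.cluster.local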
Once this is handled, you'll also have a mixed-content problem, since Label Studio is served over HTTPS while it contacts the MinIO service over HTTP.
Once you work around this by allowing mixed content, you'll also have to provide the ingress CA to Label Studio's trust store via the CUSTOM_CA_CERTS environment variable.
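A minimal sketch of wiring that in, reusing app.extraEnvironmentVars from above; the extraVolumes/extraVolumeMounts keys and the ingress-ca ConfigMap are assumptions to verify against your chart version:

app:
  extraEnvironmentVars:
    CUSTOM_CA_CERTS: /certs/ingress-ca.crt
  extraVolumes:
    # hypothetical ConfigMap holding the ingress CA certificate
    - name: ingress-ca
      configMap:
        name: ingress-ca
  extraVolumeMounts:
    - name: ingress-ca
      mountPath: /certs
      readOnly: true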
Using Label Studio
Importing Tasks
Importing a Task with inline text via UI
Once you have created a project, you can create tasks. Let's take the example of text files that require annotations; the logic is the same for images or other content.
By default, Label Studio expects inline text content in your task file, something like this:
[
{
"data": {
"text": "Hello, this is my Lorem Ipsum"
}
}
]
You can drag and drop this task file into the UI; in our case it will be stored in S3. If you didn't use S3 for persistence, it will be stored on disk.
Importing a Task with URL via UI
Besides supplying the text to be annotated inline, you can also reference it with a URL, such as file://, http(s)://, or s3://.
You can use a task like this:
[
{
"data": {
"text": "s3://textfiles/file.txt"
}
}
]
Using a URL with the s3:// protocol requires having configured a Source Cloud Storage via Settings > Cloud Storage; its configuration is explained below. The browser will then connect to the Source Cloud Storage server using those credentials (but not the bucket or prefix configured in the source), retrieve the file's content, and display it on the page for the user to annotate.
Importing the Task via the UI or the API will assign it an ID.
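For reference, the same import can be done through the API (POST /api/projects/{id}/import); the project ID 1 and the LS_API_TOKEN variable are placeholders here:

curl -X POST "https://label-studio.wip/api/projects/1/import" \
  -H "Authorization: Token $LS_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '[{"data": {"text": "s3://textfiles/file.txt"}}]'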
Importing a Task via Remote Storage
You can connect to a cloud storage in the settings of each project and specify sources for tasks, as well as targets where the outputs of your annotation tasks will be pushed.
When you add a new source storage (in our case a MinIO bucket), all recognized task files are added automatically. You can use a regex filter on the files to process; any file that doesn't match the filter is ignored.
If new files are added to your bucket afterwards, you have to press the "Sync" button manually or use the API to trigger a sync action.
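For example, a sync can be triggered through the API (POST /api/storages/s3/{id}/sync); the storage ID 1 and the LS_API_TOKEN variable are placeholders:

curl -X POST "https://label-studio.wip/api/storages/s3/1/sync" \
  -H "Authorization: Token $LS_API_TOKEN"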
Another thing to keep in mind is that each file must contain a single task, not a list of tasks like the example given above. Even a list containing only one task is rejected; Label Studio expects a JSON dict of a single task, like this one:
{
"data": {
"text": "s3://textfiles/file.txt"
}
}
If a single file matching the filter is invalid, no files at all will be imported by the Sync command. The errors are clear and specify the S3 object key causing the issue.
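To avoid a failed sync, you can pre-validate the files locally before uploading them. A minimal sketch using jq that only checks each file is a single JSON object with a top-level data key:

# flag any .json file that is not a single task object with a "data" key
for f in *.json; do
  jq -e 'type == "object" and has("data")' "$f" > /dev/null || echo "invalid: $f"
done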
Importing a Task with Pre-annotations
Sometimes we want to import already-annotated data and let the user fix or improve it. In that case our task JSON file has to already include the annotations, under the predictions key.
Here is an example of how it should look:
[
{
"data": {
"text": "Enter your text here"
},
"predictions": [
{
"result": [
{
"from_name": "label",
"to_name": "text",
"type": "labels",
"value": {
"start": 0,
"end": 4,
"labels": ["Verb"]
}
},
{
"from_name": "label",
"to_name": "text",
"type": "labels",
"value": {
"start": 11,
"end": 14,
"labels": ["Noun"]
}
}
]
}
]
}
]
Annotation Task Output
After the annotation task is finished, the annotator can submit it. The result of the annotation is pushed to the PostgreSQL table task_completion, and a report can be generated.
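If you want to peek at the raw results, you can query that table directly. The pod name pgcluster-minimal-1 and the column selection are assumptions (the primary pod name follows the CNPG cluster name, and columns vary between Label Studio versions):

kubectl exec -n kosmos-sql pgcluster-minimal-1 -- \
  psql -d labelstudio -c "SELECT id, task_id, created_at FROM task_completion;"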
If your project has a configured Target bucket via Settings > Cloud Storage > Target Cloud Storage, then the report will be generated automatically and pushed there.
Each Task generates one report in minified JSON format, and the report's name will be the Task's ID.
The output of the task only contains the start and end indices of each annotated substring, whatever labels were assigned to them, and the same path that was in the original task file.
Here is an example of the generated output report:
export.json
{
"id": 106,
"result": [
{
"value": {
"start": 0,
"end": 76,
"labels": [
"Privacy"
]
},
"id": "7QKoXPWzTO",
"from_name": "labels",
"to_name": "text",
"type": "labels",
"origin": "manual"
},
{
"value": {
"start": 392,
"end": 822,
"labels": [
"Accountability"
]
},
"id": "-YOm2u2zqd",
"from_name": "labels",
"to_name": "text",
"type": "labels",
"origin": "manual"
}
],
"created_username": " admin@localhost, 1",
"created_ago": "0 minutes",
"completed_by": {
"id": 1,
"first_name": "",
"last_name": "",
"email": "admin@localhost"
},
"task": {
"id": 109,
"data": {
"text": "s3://textfiles/file.txt"
},
"meta": {},
"created_at": "2024-10-02T14:32:21.440188Z",
"updated_at": "2024-10-02T14:32:21.440216Z",
"is_labeled": true,
"overlap": 1,
"inner_id": 109,
"total_annotations": 1,
"cancelled_annotations": 0,
"total_predictions": 0,
"comment_count": 0,
"unresolved_comment_count": 0,
"last_comment_updated_at": null,
"project": 2,
"updated_by": null,
"file_upload": 5,
"comment_authors": []
},
"was_cancelled": false,
"ground_truth": false,
"created_at": "2024-10-02T14:35:10.454111Z",
"updated_at": "2024-10-02T14:35:10.454132Z",
"draft_created_at": "2024-10-02T14:35:09.248094Z",
"lead_time": 9.608,
"import_id": null,
"last_action": null,
"project": 2,
"updated_by": 1,
"parent_prediction": null,
"parent_annotation": null,
"last_created_by": null
}
Besides the report pushed to storage, you can always generate a report for any completed task via the UI or the API.
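Via the API, the export endpoint is GET /api/projects/{id}/export; the project ID 2 matches the example above and LS_API_TOKEN is a placeholder:

curl "https://label-studio.wip/api/projects/2/export?exportType=JSON" \
  -H "Authorization: Token $LS_API_TOKEN" \
  -o export.json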
Using predictions
Tasks need to contain a data.tokens field in order to train the model manually on everything available.
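Purely as an illustration of the shape such a task might take (the exact tokenization format depends on your labeling config and ML backend, so treat this as an assumption):

{
  "data": {
    "text": "Hello, this is my Lorem Ipsum",
    "tokens": ["Hello", ",", "this", "is", "my", "Lorem", "Ipsum"]
  }
}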