Label Studio (Data Labeling)
This project contains the Helm values for Label Studio and a helmfile for deploying Label Studio together with its dependencies.
Installation
Dependencies
Before deploying Label Studio you need at least PostgreSQL.
S3 (MinIO) is optional but recommended, so we will use it. The alternative is managing persistent storage manually.
PostgreSQL requires monitoring-related CRDs as a dependency.
In this repo we'll be using TopoLVM for PostgreSQL storage, but it's optional.
TopoLVM adds cert-manager as a dependency.
Deploy using Helmfile
The helmfile is prepared with all the aforementioned optional dependencies:
cd helmfile
helmfile sync -f helmfile.yaml
Deploy using Helm
Alternatively, you can deploy the Helm charts manually, in case you maintain some of these dependencies yourself.
Dependencies
Here is how to deploy the dependencies:
Monitoring CRDs
helm install monitoringcrds ../monitoring/kube-prometheus-stack/charts/crds \
--create-namespace -n monitoring
Cert-manager and TopoLVM
helm install cert-manager ../cert-manager/cert-manager \
--create-namespace -n kosmos-system \
-f helmfile/values/cert-manager.yaml
helm install topolvm ../lvm-csi/topolvm \
--create-namespace -n kosmos-system \
-f helmfile/values/lvm-csi.yaml
PostgreSQL
helm install cnpg ../postgresql/cloudnative-pg \
--create-namespace -n kosmos-sql \
-f helmfile/values/psql-operator.yaml
# "pgcluster" is an arbitrary release name (helm install requires one); the cluster name comes from the values file
helm install pgcluster ../postgresql/cluster \
--create-namespace -n kosmos-sql \
-f helmfile/values/psql-minimal.yaml
S3 MinIO
helm install operator ../s3/operator \
--create-namespace -n kosmos-s3 \
-f helmfile/values/s3-operator-min.yaml
helm install minio-secrets ../s3/minio-secrets \
--create-namespace -n kosmos-s3
helm install s3-tenant ../s3/tenant-5.0.15.tgz \
--create-namespace -n kosmos-s3 \
-f helmfile/values/s3-tenant-min.yaml
Deploy Label Studio
To deploy Label Studio, we use three Helm charts in order: one that generates the secrets, then one that creates the database, then one that installs Label Studio itself.
cd kosmos-apps/label-studio
# generate secrets
helm install label-studio-secrets label-studio-secrets \
--namespace kosmos-data \
--create-namespace \
--set postgres.rootUser=$(kubectl get secret -n kosmos-sql pgcluster-minimal-superuser -o jsonpath="{.data.username}" | base64 --decode) \
--set postgres.rootPassword=$(kubectl get secret -n kosmos-sql pgcluster-minimal-superuser -o jsonpath="{.data.password}" | base64 --decode) \
--set minio.accessKey=minioadmin \
--set minio.secretKey=minioadmin
# create database
helm install label-studio-initdb ../platform-provisioner/initdb \
--namespace kosmos-data \
--set secret.name=label-studio-secrets \
--set secret.dbUserKey=DB_USER \
--set secret.dbPassKey=DB_PASSWORD \
--set secret.dbRootUserKey=PG_USER \
--set secret.dbRootPassKey=PG_PASSWORD \
--set database=labelstudio \
--set host=$(kubectl get secret -n kosmos-sql pgcluster-minimal-superuser -o jsonpath="{.data.host}" | base64 --decode) \
--set dbNamespace=kosmos-sql
# install label-studio
helm install label-studio label-studio-1.6.3.tgz \
--namespace kosmos-data \
-f values.yaml
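Once the three releases are installed, a quick sanity check that everything came up (pod names depend on the charts' defaults):

kubectl get pods -n kosmos-data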
Configuration
Accessing the UI
If your UI doesn't load correctly, then either LABEL_STUDIO_HOST isn't set for some reason,
or you didn't enter http(s):// at the beginning of your URL. Make sure you explicitly enter the protocol instead of letting the browser infer it.
Main parameters
The first thing to know is that the LABEL_STUDIO_HOST variable is very important in a Kubernetes setup because it holds the external host; it is configured automatically if your ingress is enabled.
When this environment variable is set, all links are generated as absolute URLs using LABEL_STUDIO_HOST as the host. This also affects S3 content, even if you have s3.endpointUrl set. If the variable isn't set, links will be relative and won't work if the browser doesn't have access to the internal host.
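For reference, here is a minimal sketch of enabling the ingress that triggers this automatic configuration. The keys app.ingress.enabled and app.ingress.host are assumptions; check your chart's values.yaml for the exact names:

app:
  ingress:
    enabled: true
    host: label-studio.wip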
You can configure S3 persistence and the admin username/password this way.
We need to set up a default user; remember that the default username has to include @ because Label Studio expects an email address.
We are also using an external domain since we're in a Kubernetes ingress setting.
These are the parameters needed for that:
app:
  extraEnvironmentVars:
    LABEL_STUDIO_USERNAME: admin@localhost # or admin@athea.tech
    LABEL_STUDIO_PASSWORD: password
    LABEL_STUDIO_HOST: https://label-studio.wip
We will be using S3 storage for persistence. One thing to keep in mind is that connections to the S3 backend are not made by the server; they are made directly by the client (the browser), so we need to expose the S3 API through an ingress.
Here is the configuration for that:
global:
  persistence:
    enabled: true
    type: s3 # s3, azure, gcs
    config:
      s3:
        accessKey: "minioadmin"
        secretKey: "minioadmin"
        bucket: "labelstudio"
        folder: "uploads"
        region: "us-east-1"
        endpointUrl: "http://minio.wip"
Considerations
There are things to look out for here.
Label Studio uses S3 storage in two main ways:
- Through the browser, to retrieve annotation files via URLs to fill the labeling interface. This goes through the ingress minio.<domain>.
- Through the container, to retrieve task files, check connections, push outputs, save tasks, etc. This goes through the service minio.<namespace>.svc.cluster.local.
However, the same URL is used for both, and you cannot provide one for each. This can be worked around by adding a rewrite entry to CoreDNS so that MinIO's ingress domain name is redirected to the service whenever it is requested from inside the cluster.
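As an illustration, the rewrite could look like this in the CoreDNS Corefile (edited via the coredns ConfigMap in kube-system). The domain minio.wip and the namespace kosmos-s3 are taken from this repo's examples; adjust them to your setup:

# inside the .:53 server block of the Corefile
rewrite name minio.wip minio.kosmos-s3.svc.cluster.local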
Once this is handled, you'll also have a mixed-content problem, since Label Studio is served over HTTPS while it contacts the MinIO service over HTTP.
Once you work around this by allowing mixed content, you'll also have to provide the ingress CA to Label Studio's trust store via the CUSTOM_CA_CERTS environment variable.
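A minimal sketch of wiring that in, reusing app.extraEnvironmentVars from above; the extraVolumes/extraVolumeMounts keys and the ingress-ca ConfigMap are assumptions to verify against your chart version:

app:
  extraEnvironmentVars:
    CUSTOM_CA_CERTS: /certs/ingress-ca.crt
  extraVolumes:
    # hypothetical ConfigMap holding the ingress CA certificate
    - name: ingress-ca
      configMap:
        name: ingress-ca
  extraVolumeMounts:
    - name: ingress-ca
      mountPath: /certs
      readOnly: true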
Using Label Studio
Importing Tasks
Importing a Task with inline text via UI
Once you have created a project, you can create tasks. Let's take the example of text files that require annotations; the logic is the same for images or other content.
By default, Label Studio expects inline text content in your task file, something like this:
[
{
"data": {
"text": "Hello, this is my Lorem Ipsum"
}
}
]
You can drag and drop this task file into the UI; in our case it will be stored in S3. If you didn't use S3 for persistence, it will be stored on disk.
Importing a Task with URL via UI
Besides supplying the text to be annotated inline, you can also reference it with a URL, such as file://, http(s)://, or s3://.
You can use a task like this:
[
{
"data": {
"text": "s3://textfiles/file.txt"
}
}
]
Using a URL with the s3:// protocol requires having configured a Source Cloud Storage via Settings > Cloud Storage; its configuration is explained below. The browser will then connect to the Source Cloud Storage server using those credentials (but not the bucket or prefix configured in the source), retrieve the file's content, and display it on the page for the user to annotate.
Importing the Task via the UI or the API will assign it an ID.
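For reference, the same import can be done through the API (POST /api/projects/{id}/import); the project ID 1 and the LS_API_TOKEN variable are placeholders here:

curl -X POST "https://label-studio.wip/api/projects/1/import" \
  -H "Authorization: Token $LS_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '[{"data": {"text": "s3://textfiles/file.txt"}}]'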
Importing a Task via Remote Storage
You can connect to a cloud storage in the settings of each project and specify sources for tasks, as well as targets where the outputs of your annotation tasks will be pushed.
When you add a new source storage (in our case a MinIO bucket), all recognized task files are added automatically. You can use a regex filter on the files to process; any file that doesn't match the filter is ignored.
If new files are added to your bucket afterwards, you have to press the "Sync" button manually or use the API to trigger a sync action.
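For example, a sync can be triggered through the API (POST /api/storages/s3/{id}/sync); the storage ID 1 and the LS_API_TOKEN variable are placeholders:

curl -X POST "https://label-studio.wip/api/storages/s3/1/sync" \
  -H "Authorization: Token $LS_API_TOKEN"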
Another thing to keep in mind is that each file must contain a single task, not a list of tasks like the example given above. Even a list containing only one task is rejected; Label Studio expects a JSON dict of a single task, like this one:
{
"data": {
"text": "s3://textfiles/file.txt"
}
}
If a single file matching the filter is invalid, no files at all will be imported by the Sync command. The errors are clear and specify the S3 object key causing the issue.
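To avoid a failed sync, you can pre-validate the files locally before uploading them. A minimal sketch using jq that only checks each file is a single JSON object with a top-level data key:

# flag any .json file that is not a single task object with a "data" key
for f in *.json; do
  jq -e 'type == "object" and has("data")' "$f" > /dev/null || echo "invalid: $f"
done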
Importing a Task with Pre-annotations
Sometimes we want to import already-annotated data and let the user fix or improve it. In that case our task JSON file has to already include the annotations, under the predictions key.
Here is an example of how it should look:
[
{
"data": {
"text": "Enter your text here"
},
"predictions": [
{
"result": [
{
"from_name": "label",
"to_name": "text",
"type": "labels",
"value": {
"start": 0,
"end": 4,
"labels": ["Verb"]
}
},
{
"from_name": "label",
"to_name": "text",
"type": "labels",
"value": {
"start": 11,
"end": 14,
"labels": ["Noun"]
}
}
]
}
]
}
]
Annotation Task Output
After the annotation task is finished, the annotator can submit it. The result of the annotation is pushed to the PostgreSQL table task_completion, and a report can be generated.
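If you want to peek at the raw results, you can query that table directly. The pod name pgcluster-minimal-1 and the column selection are assumptions (the primary pod name follows the CNPG cluster name, and columns vary between Label Studio versions):

kubectl exec -n kosmos-sql pgcluster-minimal-1 -- \
  psql -d labelstudio -c "SELECT id, task_id, created_at FROM task_completion;"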
If your project has a configured Target bucket via Settings > Cloud Storage > Target Cloud Storage, then the report will be generated automatically and pushed there.
Each Task generates one report in minified JSON format, and the report's name will be the Task's ID.
The output of the task only contains the start and end indices of each annotated substring, whatever labels were assigned to them, and the same path that was in the original task file.
Here is an example of the generated output report:
export.json
{
"id": 106,
"result": [
{
"value": {
"start": 0,
"end": 76,
"labels": [
"Privacy"
]
},
"id": "7QKoXPWzTO",
"from_name": "labels",
"to_name": "text",
"type": "labels",
"origin": "manual"
},
{
"value": {
"start": 392,
"end": 822,
"labels": [
"Accountability"
]
},
"id": "-YOm2u2zqd",
"from_name": "labels",
"to_name": "text",
"type": "labels",
"origin": "manual"
}
],
"created_username": " admin@localhost, 1",
"created_ago": "0 minutes",
"completed_by": {
"id": 1,
"first_name": "",
"last_name": "",
"email": "admin@localhost"
},
"task": {
"id": 109,
"data": {
"text": "s3://textfiles/file.txt"
},
"meta": {},
"created_at": "2024-10-02T14:32:21.440188Z",
"updated_at": "2024-10-02T14:32:21.440216Z",
"is_labeled": true,
"overlap": 1,
"inner_id": 109,
"total_annotations": 1,
"cancelled_annotations": 0,
"total_predictions": 0,
"comment_count": 0,
"unresolved_comment_count": 0,
"last_comment_updated_at": null,
"project": 2,
"updated_by": null,
"file_upload": 5,
"comment_authors": []
},
"was_cancelled": false,
"ground_truth": false,
"created_at": "2024-10-02T14:35:10.454111Z",
"updated_at": "2024-10-02T14:35:10.454132Z",
"draft_created_at": "2024-10-02T14:35:09.248094Z",
"lead_time": 9.608,
"import_id": null,
"last_action": null,
"project": 2,
"updated_by": 1,
"parent_prediction": null,
"parent_annotation": null,
"last_created_by": null
}
Besides the report pushed to storage, you can always generate a report for any completed task via the UI or the API.
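Via the API, the export endpoint is GET /api/projects/{id}/export; the project ID 2 matches the example above and LS_API_TOKEN is a placeholder:

curl "https://label-studio.wip/api/projects/2/export?exportType=JSON" \
  -H "Authorization: Token $LS_API_TOKEN" \
  -o export.json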
Using predictions
Tasks need to contain a data.tokens field in order to train the model manually on everything available.
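Purely as an illustration of the shape such a task might take (the exact tokenization format depends on your labeling config and ML backend, so treat this as an assumption):

{
  "data": {
    "text": "Hello, this is my Lorem Ipsum",
    "tokens": ["Hello", ",", "this", "is", "my", "Lorem", "Ipsum"]
  }
}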