
Label Studio Machine Learning Backend (Auto annotation engine)

Label Studio has a feature to generate labels automatically using a model's inference.

However, Label Studio requires a very specific interface, which means we need to code a wrapper around our server's inference for it to function properly.

For this we have coded a sample using the urchade/gliner_small-v2.1 model found on Hugging Face.

It is a Named Entity Recognition (NER) model capable of identifying any entity type using a bidirectional transformer encoder.

Here is how Label Studio interacts with this model:

  • Launch a training session manually using the tasks already submitted as a dataset; the first 10 elements are used as an evaluation dataset, the rest as a training dataset.
  • Launch a training session after every submission using only that new task for finetuning. This also updates the currently loaded model.
  • Create annotations by sending tasks to this model. It will return the annotations it was able to generate. Their quality still has to be measured before submitting the labeling task. Annotations created automatically by a model are called predictions in the Label Studio lexicon.
  • A manual training session will complain about a gateway error, which can be ignored: it just means the model didn't finish the training session yet. The accurate status can be obtained from MLFlow.
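The interaction above can be sketched as a minimal wrapper class. This is a hypothetical, self-contained illustration: the class and method names mirror the predict/fit convention of Label Studio ML backends, but the placeholder entity, the `from_name`/`to_name` values (which must match your labeling config), and the `MODEL` call mentioned in the comments are assumptions, not the sample's actual code.

```python
# Hypothetical sketch of the wrapper interface Label Studio expects.
# The inference call is a placeholder; a real backend would invoke the
# loaded model here (e.g. GLiNER's predict_entities).

EVAL_SIZE = 10  # first 10 submitted tasks serve as the evaluation split


class NERBackend:
    def predict(self, tasks):
        """Return one prediction per task, in Label Studio's result format."""
        predictions = []
        for task in tasks:
            text = task["data"]["text"]
            # Placeholder inference result; a real backend calls the model here.
            entities = [{"label": "PER", "start": 0, "end": 4, "score": 0.9}]
            results = [
                {
                    "from_name": "label",  # must match the labeling config
                    "to_name": "text",
                    "type": "labels",
                    "value": {
                        "start": e["start"],
                        "end": e["end"],
                        "labels": [e["label"]],
                    },
                    "score": e["score"],
                }
                for e in entities
            ]
            predictions.append({"result": results})
        return predictions

    def fit(self, tasks):
        """Split submitted tasks as described above; returns the split sizes."""
        eval_set, train_set = tasks[:EVAL_SIZE], tasks[EVAL_SIZE:]
        # A real backend would fine-tune the model on train_set,
        # evaluate on eval_set, and log metrics to MLFlow.
        return len(train_set), len(eval_set)
```

The fit method shows the 10-element evaluation split used by manual training sessions; the predict method shows the prediction payload shape Label Studio consumes.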

Usage

Dependencies

No required dependencies.

Optional dependencies:

  • Label Studio: To be able to interact with annotation tasks.
  • MLFlow: Used for tracking training and fine tuning experiments, and as a model registry.

Installation

You can deploy the model used by Label Studio via Helm:

cd kosmos-apps/label-studio/ml-backend
helm install ml-backend ml-backend \
--namespace kosmos-data \
-f values.yaml

Modifications to the provided Label Studio Backend

Label Studio limitations

Here is an outline of things we had to work around:

  • Generic interfaces don't work, so we had to code support for a specific API for interaction.
  • Submitting text alone isn't sufficient. We either need to require users to submit an extra 'tokens' field with a list of tokens, or generate it automatically using a whitespace tokenizer.
  • No support for remote URLs. If the Label Studio annotation task contains a URL instead of inline text, the Python wrapper needs to download it first before passing it to the model for predictions.
  • Python libraries related to Hugging Face and transformers store cache files in many locations. That isn't possible with readOnlyRootFS, which means environment variables (e.g. HF_HOME) need to be set to avoid the issue.
  • Downloading a model from Hugging Face sometimes requires downloading other base models; this has to be prepared in advance in the container image in case of an airgapped deployment.
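The whitespace-tokenizer workaround above can be sketched in a few lines. This is an illustrative implementation, not the sample's actual code; the offset-keeping schema of the generated 'tokens' field is an assumption.

```python
import re


def whitespace_tokens(text):
    """Generate the extra 'tokens' field from raw text when the user did
    not submit one. Each token keeps its character offsets so that span
    predictions can be mapped back to the original text.
    NOTE: illustrative sketch; the exact token schema is an assumption."""
    return [
        {"text": m.group(), "start": m.start(), "end": m.end()}
        for m in re.finditer(r"\S+", text)
    ]
```

For example, `whitespace_tokens("John lives in Paris")` yields four tokens, the first being `{"text": "John", "start": 0, "end": 4}`.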

Extra features implemented in the sample

  • Configuration for training hyperparameters wasn't exposed; this was added to the sample.
  • By default, only a single annotation event triggered the fine-tuning; this had to be made configurable. The behavior can be further improved using the Webhook event reference.
  • Added MLFlow support during training, this includes logging system metrics, training metrics, model related artifacts and pushing the model to the MLFlow model registry.
  • The same container image can be used as an API server working with Label Studio, or in a standalone training workload as a pipeline step unrelated to Label Studio.

Adapt to your own model

If you want to use your own model, most of the code remains the same.

You will only have to change:

  • How the model is loaded in memory and assigned to the MODEL variable.
  • Training:
    • dataloader.
    • training and evaluation steps.
    • model signature.
  • Prediction: Only the part where entities is filled. In our sample we use an underlying predict_entities(text, labels, threshold) method. You will have to use the equivalent for your model.

There are other things you may want to adapt to your specific model, such as the optimizer, the scheduler, and the model-fitting logic. You can, however, still change anything you would like.
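The prediction adaptation described above can be sketched as follows. DummyModel and its extract() method are hypothetical stand-ins for your model's equivalent of predict_entities(text, labels, threshold); only the glue code filling the entities list is the point being illustrated.

```python
# Sketch of the one prediction step you need to swap when adapting the
# sample to another model. DummyModel stands in for your own model.


class DummyModel:
    def extract(self, text, labels):
        # Stand-in for real inference: pretend every capitalized word
        # is a PER entity with a fixed confidence.
        out = []
        for word in text.split():
            if word[0].isupper() and "PER" in labels:
                start = text.index(word)
                out.append({"label": "PER", "start": start,
                            "end": start + len(word), "score": 0.8})
        return out


def fill_entities(model, text, labels, threshold=0.5):
    """Equivalent of the sample's predict_entities() call: run the model
    and keep only spans whose confidence clears the threshold."""
    return [e for e in model.extract(text, labels) if e["score"] >= threshold]
```

Swapping in your own model means replacing the DummyModel.extract call with your model's inference method while keeping the entities structure (label, start, end, score) intact.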