MLOps Pipeline
This document describes the end-to-end MLOps pipeline on our platform: retrieving the raw datasets, preparing and annotating them, then training the model and serving it for inference.
Main Steps
- Retrieve raw datasets
- Prepare them for annotation (classical ETL)
- Annotate them using Label Studio
- Prepare annotations for training
- Start training with MLflow tracking enabled
- (Optional) Push the model to the MLflow registry if training is satisfactory
- Deploy in an inference server using KServe. Custom models will require a custom container but will still be deployed using KServe.
- (Optional) Use Open WebUI to interact with the KServe-deployed model.
- Collect user feedback to detect when the model stops being useful.
- Update datasets with new data where the model falls short.
- Go back to step 1 to retrain the model.
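The loop above can be sketched as plain code. This is a minimal, hypothetical orchestration skeleton: every function body is a placeholder standing in for the real step (S3 retrieval, Label Studio annotation, MLflow-tracked training, KServe deployment), and the names, metric, and threshold are assumptions, not part of the platform.

```python
# Hypothetical sketch of one iteration of the pipeline loop.
# Each function is a placeholder for the real step named in its comment.

def retrieve_raw_datasets():
    # In practice: fetch datasets from S3 through the auth proxy.
    return ["raw_sample"]

def prepare_for_annotation(raw):
    # Classical ETL: turn raw records into annotation tasks.
    return [{"task": r} for r in raw]

def annotate(tasks):
    # In practice: done by humans in Label Studio.
    return [{**t, "label": "car"} for t in tasks]

def train(annotations):
    # In practice: training script with MLflow tracking enabled.
    metrics = {"accuracy": 0.92}  # placeholder metric
    return "model-v1", metrics

def satisfactory(metrics, threshold=0.9):
    # Gate before pushing to the MLflow registry and deploying on KServe.
    return metrics["accuracy"] >= threshold

def run_iteration():
    raw = retrieve_raw_datasets()
    tasks = prepare_for_annotation(raw)
    annotations = annotate(tasks)
    model, metrics = train(annotations)
    # Only a satisfactory model is registered and deployed;
    # feedback on the deployed model feeds the next iteration.
    return model if satisfactory(metrics) else None
```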
Retrieving assets
Datasets must be stored in S3, but because access is on a need-to-know basis, users cannot call the S3 API directly.
Instead, requests go through a proxy that authenticates the user with their own credentials and determines whether they are allowed to access the resource.
If they are, the proxy forwards the request to the S3 API with the proper S3 credentials.
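A minimal sketch of that authorizing proxy, under stated assumptions: the ACL table, credential handling, and `proxy_get` function are all hypothetical simplifications (a real proxy would verify signatures and talk to the actual S3 API), but they show the flow of authenticating the caller, checking need-to-know, and forwarding with credentials the user never sees.

```python
# Hypothetical authorizing proxy in front of S3.
# The user authenticates with their own identity; the real S3
# credentials stay inside the proxy and are only used on forwarding.

ACL = {"alice": {"datasets/cars"}}          # user -> allowed S3 prefixes (assumed)
S3_CREDENTIALS = {"key": "internal-only"}   # held by the proxy, never by users

def proxy_get(user, path):
    """Check need-to-know, then forward to S3 with the proxy's credentials."""
    allowed = ACL.get(user, set())
    if not any(path.startswith(prefix) for prefix in allowed):
        return {"status": 403}  # user is not cleared for this resource
    # Forward the request to the S3 API using the internal credentials.
    return {"status": 200, "credentials_used": S3_CREDENTIALS["key"], "path": path}
```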
Data Preparation
Most of the time, datasets are unusable as-is since they're not tailored for your specific training use case.
Examples of preparing data:
- Adapt tasks and pre-annotations to the Label Studio format, e.g. converting XML to JSON.
- Preprocess labeled images to extract bounding boxes into a JSON file so the labels are usable by the training script.
- Remove labels that aren't needed for our use cases (e.g. if we have car-detection models but are not interested in SUVs, we can drop all SUV images and labels).
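Two of the steps above (XML-to-JSON conversion and dropping unwanted labels) can be sketched together. Both the XML layout and the output JSON shape here are simplified assumptions for illustration, not the exact Label Studio task schema.

```python
import json
import xml.etree.ElementTree as ET

# Assumed sample input: XML pre-annotations with bounding boxes.
XML_SAMPLE = """
<annotations>
  <image file="a.jpg"><box label="car" x="1" y="2" w="30" h="40"/></image>
  <image file="b.jpg"><box label="suv" x="5" y="6" w="20" h="10"/></image>
</annotations>
"""

def xml_to_tasks(xml_text, drop_labels=frozenset()):
    """Convert XML pre-annotations to JSON-ready tasks, dropping unwanted labels."""
    tasks = []
    for image in ET.fromstring(xml_text):
        boxes = [
            {"label": b.get("label"),
             "bbox": [int(b.get(k)) for k in ("x", "y", "w", "h")]}
            for b in image
            if b.get("label") not in drop_labels
        ]
        if boxes:  # skip images left with no relevant labels
            tasks.append({"image": image.get("file"), "annotations": boxes})
    return tasks

# Drop SUV labels, as in the example above, then serialize for training.
tasks = xml_to_tasks(XML_SAMPLE, drop_labels={"suv"})
print(json.dumps(tasks, indent=2))
```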
To address