Mlflow (Training monitoring and model registry)
Mlflow provides a suite of tools that help build an ML workflow. We are interested in tracking, the model registry, and evaluation.
We will mainly be using it to compare training executions with different datasets and parameters, and to store the resulting machine learning models in a registry.
- Tracking: Mlflow Tracking provides an API and UI dedicated to logging parameters, code versions, metrics, and artifacts during the ML process. This centralized repository also captures data and environment configurations, giving teams insight into their models' evolution over time. Whether working in standalone scripts, notebooks, or other environments, Tracking logs results either to local files or to a server, making it easier to compare multiple runs across different users.
- Model Registry: A systematic approach to model management, the Model Registry assists in handling different versions of models, discerning their current state, and ensuring smooth productionization. It offers a centralized model store, APIs, and UI to collaboratively manage an Mlflow Model’s full lifecycle, including model lineage, versioning, aliasing, tagging, and annotations.
- Evaluation: Designed for in-depth model analysis, this set of tools facilitates objective model comparison, be it traditional ML algorithms or cutting-edge LLMs.
Installation
Dependencies
Before deploying Mlflow, you need at least PostgreSQL.
S3 MinIO is optional but recommended, so we will use it. The alternative is managing artifacts and models' persistent storage manually.
PostgreSQL requires monitoring-related CRDs as a dependency.
In this repo, we'll be using TopoLVM for PostgreSQL, but it's optional.
TopoLVM will add cert-manager as a dependency.
Deploy using Helmfile
Helmfile is prepared with all the aforementioned optional dependencies.
cd helmfile
helmfile sync -f helmfile.yaml
Deploying using Helm
Alternatively, you can deploy the Helm charts manually, for example if you already have your own dependencies in place.
Dependencies
Here is how to deploy the dependencies:
Monitoring CRDs
helm install monitoringcrds ../monitoring/kube-prometheus-stack/charts/crds \
--create-namespace -n monitoring
Cert manager and TopoLVM
helm install cert-manager ../cert-manager/cert-manager \
--create-namespace -n kosmos-system \
-f helmfile/values/cert-manager.yaml
helm install topolvm ../lvm-csi/topolvm \
--create-namespace -n kosmos-system \
-f helmfile/values/lvm-csi.yaml
PostgreSQL
helm install cnpg ../postgresql/cloudnative-pg \
--create-namespace -n kosmos-sql \
-f helmfile/values/psql-operator.yaml
helm install pgcluster ../postgresql/cluster \
--create-namespace -n kosmos-sql \
-f helmfile/values/psql-minimal.yaml
S3 Minio
helm install operator ../s3/operator \
--create-namespace -n kosmos-s3 \
-f helmfile/values/s3-operator-min.yaml
helm install minio-secrets ../s3/minio-secrets \
--create-namespace -n kosmos-s3
helm install s3-tenant ../s3/tenant-5.0.15.tgz \
--create-namespace -n kosmos-s3 \
-f helmfile/values/s3-tenant-min.yaml
Deploy ML Backend
cd kosmos-apps/mlflow
# generate secrets
helm install mlflow-secrets mlflow-secrets \
--namespace kosmos-data \
--create-namespace \
--set postgres.rootUser=$(kubectl get secret -n kosmos-sql pgcluster-minimal-superuser -o jsonpath="{.data.username}" | base64 --decode) \
--set postgres.rootPassword=$(kubectl get secret -n kosmos-sql pgcluster-minimal-superuser -o jsonpath="{.data.password}" | base64 --decode) \
--set minio.accessKey=minioadmin \
--set minio.secretKey=minioadmin
# create database
helm install mlflow-initdb ../platform-provisioner/initdb \
--namespace kosmos-data \
--set secret.name=mlflow-secrets \
--set secret.dbUserKey=DB_USER \
--set secret.dbPassKey=DB_PASSWORD \
--set secret.dbRootUserKey=PG_USER \
--set secret.dbRootPassKey=PG_PASSWORD \
--set database=mlflow \
--set host=$(kubectl get secret -n kosmos-sql pgcluster-minimal-superuser -o jsonpath="{.data.host}" | base64 --decode) \
--set dbNamespace=kosmos-sql
# install mlflow
helm install mlflow mlflow-1.6.3.tgz \
--namespace kosmos-data \
-f values.yaml
Security
SSO Proxy
You can choose to use the OAuth2 Proxy, which forces users to log in through a Keycloak account before accessing Mlflow. You can also restrict who is allowed in based on realm roles, using these Helm values:
proxy:
  enabled: true
  oidc:
    allowedGroups: [admin, datascientist] # only users with the realm role admin or datascientist will be able to access Mlflow
Any user that is allowed access to Mlflow will have complete access to everything.
Basic Permissions
Mlflow provides mechanisms for rudimentary user and permission management.
It allows filtering permissions according to API paths.
This is useful for limiting permissions for users on experiments, runs, and registry models.
Keep in mind that a single permission has three elements: the user, the resource, and the level of clearance the user has on that specific resource. A permission is not global across all experiments, for example, but tied to one specific experiment ID. This is why it's better to either automate permission management or rely heavily on the default permission.
You can choose a default permission that applies to all new users on all resources not owned by them:
| Permission | Can read | Can update | Can delete | Can manage |
|---|---|---|---|---|
| READ | Yes | No | No | No |
| EDIT | Yes | Yes | No | No |
| MANAGE | Yes | Yes | Yes | Yes |
| NO_PERMISSIONS | No | No | No | No |
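The table above can be sketched as a simple lookup, which is handy when scripting checks against the same semantics (a hypothetical helper, not part of Mlflow):

```python
# Capability matrix mirroring Mlflow's built-in permission levels
PERMISSIONS = {
    "READ":           {"read": True,  "update": False, "delete": False, "manage": False},
    "EDIT":           {"read": True,  "update": True,  "delete": False, "manage": False},
    "MANAGE":         {"read": True,  "update": True,  "delete": True,  "manage": True},
    "NO_PERMISSIONS": {"read": False, "update": False, "delete": False, "manage": False},
}

def can(permission: str, action: str) -> bool:
    """Return whether a permission level allows an action."""
    return PERMISSIONS[permission][action]

print(can("EDIT", "update"))  # True
print(can("EDIT", "delete"))  # False
```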
Configuration
Configuration is done via a configuration file; the permission data itself is stored in the SQL database set under database_uri in that file.
Set the environment variable MLFLOW_AUTH_CONFIG_PATH to the path of your chosen config file.
Here is an example:
[mlflow]
default_permission = READ # default value
database_uri = postgresql://<db_uri>/mlflow_auth
admin_username = admin
admin_password = password
authorization_function = mlflow.server.auth.jwt_auth:authenticate_sso_request # custom authorization function that decodes a jwt token, default is basic auth
[role_mappings]
admin = admin
MANAGE = devops
EDIT = dataingenieur,dataanalyst
READ = datascientist
Mlflow native authentication
You can use Mlflow native authentication alone:
mlflow:
  auth:
    enabled: true
    adminUsername: admin
    adminPassword: password
    defaultPermission: READ
    dbName: mlflow_auth
    sso: false
proxy:
  enabled: false
Or with SSO while keeping the two unrelated, meaning the SSO user and the Mlflow user will not be the same.
Keep mlflow.auth.sso set to false if you don't want to delegate authentication to the OAuth2 proxy. Note that if you rely on the proxy this way, users must already exist in your Mlflow authentication database, otherwise they won't be able to access Mlflow.
mlflow:
  auth:
    enabled: true
    adminUsername: admin
    adminPassword: password
    defaultPermission: READ
    dbName: mlflow_auth
    sso: false
proxy:
  enabled: true
  oidc:
    allowedGroups: [admin, datascientist]
Mlflow SSO authentication with permissions
You can also opt for using the same users between SSO and Mlflow internal authentication.
mlflow:
  auth:
    enabled: true
    adminUsername: admin
    adminPassword: password
    defaultPermission: READ
    dbName: mlflow_auth
    sso: true
    roleMappings:
      admin: [admin]
      MANAGE: [devops]
      EDIT: [dataingenieur, dataanalyst]
      READ: [datascientist]
proxy:
  enabled: true
  oidc:
    allowedGroups: [admin, datascientist]
Mlflow SSO Role Mapping
Mlflow supports implementing custom authentication mechanisms. We leveraged this to add role mapping features.
Each time Mlflow receives a valid authorization token from the SSO Proxy, it creates a new local Mlflow account if one doesn't exist, or updates that account's local permissions according to the roles included in the token.
If a user has no role that is part of the role mapping configuration, they will have no permissions.
Here is an example of a role mapping configuration. The keys are Mlflow local permissions; the values are lists of corresponding Keycloak roles.
mlflow:
  auth:
    roleMappings:
      admin: ['admin']
      MANAGE: ['devops']
      EDIT: ['dataingenieur', 'dataanalyst']
      READ: ['datascientist']
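As a sketch of how such a mapping could resolve at login time (a hypothetical helper; the actual precedence logic lives in our custom authentication function and may differ):

```python
# Same mapping as the example above: Mlflow permission -> Keycloak roles
ROLE_MAPPINGS = {
    "admin": ["admin"],
    "MANAGE": ["devops"],
    "EDIT": ["dataingenieur", "dataanalyst"],
    "READ": ["datascientist"],
}

def resolve_permission(user_roles: list) -> str:
    # Walk permissions from most to least privileged; first match wins
    for permission in ("admin", "MANAGE", "EDIT", "READ"):
        if any(role in user_roles for role in ROLE_MAPPINGS[permission]):
            return permission
    # A user with no mapped role gets no permissions
    return "NO_PERMISSIONS"

print(resolve_permission(["datascientist"]))  # READ
print(resolve_permission(["devops"]))         # MANAGE
print(resolve_permission(["guest"]))          # NO_PERMISSIONS
```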
Details on how the code works
- OAuth2-proxy forwards the connection request to Mlflow.
- Mlflow calls our custom mlflow.server.auth.jwt_auth:authenticate_sso_request function.
- The function retrieves the token from the authorization header.
- The token is decoded:
  - if the token is expired, return error 401;
  - if the token is valid, continue.
- Retrieve the user name and roles from the token.
- Create a user account if it doesn't exist in local Mlflow; its password will be the sub value from the Keycloak token, which is the user's internal ID inside Keycloak. This is needed because Mlflow still manages everything internally as basic authentication.
- Sync permissions:
  - if the user is not an admin, update their permissions. Due to how Mlflow works, any new object gets default permissions that are the same for everyone; it is not possible to give Mlflow roles or groups their own default permissions. This means that whenever a user accesses Mlflow, we have to update their permissions on Mlflow resources to match our role mapping policy.
  - if the user is an admin, nothing is needed, because Mlflow gives the admin user the admin permission on everything created.
- Give access to Mlflow.
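To illustrate the token-decoding step, here is a minimal sketch of extracting the payload from a JWT. It performs no signature verification; the real function must also validate the signature and expiry. The claim names follow Keycloak conventions (preferred_username, realm_access.roles), and the token here is a fake one built for demonstration:

```python
import base64
import json

def decode_jwt_payload(token: str) -> dict:
    # A JWT is header.payload.signature; the payload is base64url-encoded JSON
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload_b64))

# Build a fake token for demonstration (Keycloak puts realm roles under realm_access.roles)
claims = {"sub": "local-user-id", "preferred_username": "alice",
          "realm_access": {"roles": ["datascientist"]}}
payload = base64.urlsafe_b64encode(json.dumps(claims).encode()).decode().rstrip("=")
token = f"eyJhbGciOiJSUzI1NiJ9.{payload}.signature"

decoded = decode_jwt_payload(token)
print(decoded["preferred_username"])     # alice
print(decoded["realm_access"]["roles"])  # ['datascientist']
```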
Mlflow authentication API samples
Create new user
Using UI
You can create a new user using the UI while logged in as admin: https://<mlflow_external_url>/signup
Using Python
First, set environment variables with the admin credentials:
export MLFLOW_TRACKING_USERNAME=admin
export MLFLOW_TRACKING_PASSWORD=password
import mlflow
auth_client = mlflow.server.get_app_client(
"basic-auth", "https://<mlflow_external_url>/"
)
auth_client.create_user(username="username", password="password")
Using API
Finally, you can do so via cURL:
curl -X POST -u admin:password \
-H "Content-Type: application/json" \
-d '{"username": "username", "password": "password"}' \
"https://<mlflow_external_url>/api/2.0/mlflow/users/create"
Add new permissions
Similarly, you can find the rest of the API endpoints for adding or modifying individual permissions in the official Mlflow documentation.
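For example, granting a user EDIT on a specific experiment could look like this (endpoint name taken from the Mlflow authentication REST API; replace the experiment ID and credentials with yours):

```
curl -X POST -u admin:password \
-H "Content-Type: application/json" \
-d '{"experiment_id": "1", "username": "username", "permission": "EDIT"}' \
"https://<mlflow_external_url>/api/2.0/mlflow/experiments/permissions/create"
```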
Usage
Track and push to registry
As explained above, this helps push artifacts and metadata to Mlflow about a specific training session. Artifacts are stored in S3. Metadata, parameters, and general information about runs and experiments are stored in PostgreSQL.
Concepts
- Run: Mlflow Tracking is organized around the concept of runs, which are executions of some piece of data science code, for example a single python train.py execution. Each run records metadata (information about your run such as metrics, parameters, start and end times) and artifacts (output files from the run such as model weights, images, etc.).
- Experiment: An experiment groups together runs for a specific task. You can create and search for experiments using the CLI, API, or UI.
Logging a Run
In order to log an Mlflow run, here is a simple python example:
mlflow_train.py
import os

import mlflow

# Set these environment variables outside of Python in real deployments;
# they are set here only to keep the example self-contained.
os.environ["MLFLOW_TRACKING_URI"] = "http://mlflow.kosmos-data:5000"
# S3 env vars to be able to push artifacts
os.environ["MLFLOW_S3_ENDPOINT_URL"] = "http://minio.kosmos-s3"
os.environ["AWS_ACCESS_KEY_ID"] = "minioadmin"
os.environ["AWS_SECRET_ACCESS_KEY"] = "minioadmin"
# For automatic system metrics logging; you need psutil for system metrics and pynvml for GPU metrics
os.environ["MLFLOW_ENABLE_SYSTEM_METRICS_LOGGING"] = "true"

# Creates the experiment if it doesn't exist
mlflow.set_experiment("Experiment Name")

with mlflow.start_run() as run:
    # You can let Mlflow decide which metrics to compute and push, including parameters and models.
    # This is supported only for popular libraries such as scikit-learn, XGBoost, PyTorch, Keras, or Spark.
    # It has to be called before starting your training session.
    mlflow.autolog()

    # You can also combine autologging with manual logging.
    # Add tags:
    tags = {
        "mlflow.note.content": "This will change the description.\n And it *supports* _markdown_\n",
        "framework": "the one used",
        "base_model": "huggingface/author/name",
    }
    mlflow.set_tags(tags)

    # You can store a single-value parameter
    mlflow.log_param("lr", 0.01)

    # Insert your ML code, which normally logs metrics at multiple steps (not necessarily every one)
    model = ...

    # Use log_metric to display graphs on the UI; the x axis will be steps or time.
    # Logging the same key several times builds up the series:
    mlflow.log_metric("val_loss", 0.0002, step=1)
    mlflow.log_metric("val_loss", 0.0001, step=2)
    mlflow.log_metric("train_loss", 0.0004, step=1)
    mlflow.log_metric("train_loss", 0.0003, step=2)

    # Log the model as an artifact of this run.
    # Make sure to use the right flavor; in this example we are using pytorch.
    # Many other flavors exist: keras, sklearn, tensorflow, spacy, spark, pyfunc, etc.
    mlflow.pytorch.log_model(
        model,
        "pytorch_model",
        signature=model_signature,  # optional; prepare a model signature beforehand if you want one
    )

    # Push the model to the registry; the artifact path must match the one used in log_model
    registered_model = mlflow.register_model(
        model_uri=f"runs:/{run.info.run_id}/pytorch_model",
        name="pytorch_model",
    )

    # Add tags to the registered model
    client = mlflow.tracking.MlflowClient()
    for k, v in tags.items():
        client.set_model_version_tag(registered_model.name, registered_model.version, k, v)

    print(f"Registered model {registered_model.name}:{registered_model.version}")
    print(f"Finished run ID: {run.info.run_id}")
Pull model from registry
In this example, we will pull a PyTorch model that we have pushed previously and run inference.
mlflow_pull.py
import mlflow
model_name = "pytorch_model"
model_version = 1
model = mlflow.pytorch.load_model(f"models:/{model_name}/{model_version}")
text = """
Cristiano Ronaldo dos Santos Aveiro (Portuguese pronunciation: [kɾiʃˈtjɐnu ʁɔˈnaldu]; born 5 February 1985) is a Portuguese professional footballer who plays as a forward for and captains both Saudi Pro League club Al Nassr and the Portugal national team. Widely regarded as one of the greatest players of all time, Ronaldo has won five Ballon d'Or awards,[note 3] a record three UEFA Men's Player of the Year Awards, and four European Golden Shoes, the most by a European player. He has won 33 trophies in his career, including seven league titles, five UEFA Champions Leagues, the UEFA European Championship and the UEFA Nations League. Ronaldo holds the records for most appearances (183), goals (140) and assists (42) in the Champions League, goals in the European Championship (14), international goals (128) and international appearances (205). He is one of the few players to have made over 1,200 professional career appearances, the most by an outfield player, and has scored over 850 official senior career goals for club and country, making him the top goalscorer of all time.
"""
labels = ["person", "award", "date", "competitions", "teams"]
# predict_entities is a method of this specific model class, not a generic Mlflow API
entities = model.predict_entities(text, labels)
for entity in entities:
print(f"{entity['text']} => {entity['label']}")