Mlflow (Training monitoring and model registry)
Mlflow provides a suite of tools that help build an ML workflow. We are interested in tracking, the model registry, and evaluation.
We will mainly be using it to compare training executions with different datasets and parameters, and to store the resulting machine learning models in a registry.
- Tracking: Mlflow Tracking provides an API and UI dedicated to logging parameters, code versions, metrics, and artifacts during the ML process. This centralized repository also captures data and environment configurations, giving teams insight into their models' evolution over time. Whether working in standalone scripts, notebooks, or other environments, Tracking logs results either to local files or to a server, making it easier to compare multiple runs across different users.
- Model Registry: A systematic approach to model management, the Model Registry assists in handling different versions of models, discerning their current state, and ensuring smooth productionization. It offers a centralized model store, APIs, and UI to collaboratively manage an Mlflow Model’s full lifecycle, including model lineage, versioning, aliasing, tagging, and annotations.
- Evaluation: Designed for in-depth model analysis, this set of tools facilitates objective model comparison, be it traditional ML algorithms or cutting-edge LLMs.
Installation
Dependencies
Before deploying Mlflow, you need at least PostgreSQL.
S3 MinIO is optional but recommended, so we will use it. The alternative is managing artifacts and models' persistent storage manually.
PostgreSQL requires monitoring-related CRDs as a dependency.
In this repo, we'll be using TopoLVM for PostgreSQL, but it's optional.
TopoLVM will add cert-manager as a dependency.
Deploy using Helmfile
Helmfile is prepared with all the aforementioned optional dependencies.
cd helmfile
helmfile sync -f helmfile.yaml
Deploying using Helm
Alternatively, you can deploy the Helm charts manually, for example if you already have your own dependencies in place.
Dependencies
Here is how to deploy the dependencies:
Monitoring CRDs
helm install monitoringcrds ../monitoring/kube-prometheus-stack/charts/crds \
--create-namespace -n monitoring
Cert manager and TopoLVM
helm install cert-manager ../cert-manager/cert-manager \
--create-namespace -n kosmos-system \
-f helmfile/values/cert-manager.yaml
helm install topolvm ../lvm-csi/topolvm \
--create-namespace -n kosmos-system \
-f helmfile/values/lvm-csi.yaml
PostgreSQL
helm install cnpg ../postgresql/cloudnative-pg \
--create-namespace -n kosmos-sql \
-f helmfile/values/psql-operator.yaml
helm install pgcluster ../postgresql/cluster \
--create-namespace -n kosmos-sql \
-f helmfile/values/psql-minimal.yaml
S3 Minio
helm install operator ../s3/operator \
--create-namespace -n kosmos-s3 \
-f helmfile/values/s3-operator-min.yaml
helm install minio-secrets ../s3/minio-secrets \
--create-namespace -n kosmos-s3
helm install s3-tenant ../s3/tenant-5.0.15.tgz \
--create-namespace -n kosmos-s3 \
-f helmfile/values/s3-tenant-min.yaml
Deploy ML Backend
cd kosmos-apps/mlflow
# generate secrets
helm install mlflow-secrets mlflow-secrets \
--namespace kosmos-data \
--create-namespace \
--set postgres.rootUser=$(kubectl get secret -n kosmos-sql pgcluster-minimal-superuser -o jsonpath="{.data.username}" | base64 --decode) \
--set postgres.rootPassword=$(kubectl get secret -n kosmos-sql pgcluster-minimal-superuser -o jsonpath="{.data.password}" | base64 --decode) \
--set minio.accessKey=minioadmin \
--set minio.secretKey=minioadmin
# create database
helm install mlflow-initdb ../platform-provisioner/initdb \
--namespace kosmos-data \
--set secret.name=mlflow-secrets \
--set secret.dbUserKey=DB_USER \
--set secret.dbPassKey=DB_PASSWORD \
--set secret.dbRootUserKey=PG_USER \
--set secret.dbRootPassKey=PG_PASSWORD \
--set database=mlflow \
--set host=$(kubectl get secret -n kosmos-sql pgcluster-minimal-superuser -o jsonpath="{.data.host}" | base64 --decode) \
--set dbNamespace=kosmos-sql
# install mlflow
helm install mlflow mlflow-1.6.3.tgz \
--namespace kosmos-data \
-f values.yaml
Security
SSO Proxy
You can choose to use the OAuth2 Proxy, which forces users to log in through a Keycloak account before accessing Mlflow. You can also restrict who is allowed in based on realm roles, using these Helm values:
proxy:
  enabled: true
  oidc:
    allowedGroups: [admin, datascientist] # only users with the realm role admin or datascientist will be able to access Mlflow
Any user that is allowed access to Mlflow will have complete access to everything.
Basic Permissions
Mlflow provides mechanisms for rudimentary user and permission management.
It allows filtering permissions according to API paths.
This is useful for limiting permissions for users on experiments, runs, and registry models.
Keep in mind that a single permission has three elements: the user, the resource, and the level of clearance the user has on that specific resource. A permission is not global across all experiments, for example, but tied to one specific experiment ID. This is why it's better to either automate permission management or rely heavily on the default permission.
You can choose a default permission that applies to all new users on all resources not owned by them:
| Permission | Can read | Can update | Can delete | Can manage |
|---|---|---|---|---|
| READ | Yes | No | No | No |
| EDIT | Yes | Yes | No | No |
| MANAGE | Yes | Yes | Yes | Yes |
| NO_PERMISSIONS | No | No | No | No |
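The table above can be sketched as a simple lookup, which is handy when scripting checks against the same semantics (a hypothetical helper, not part of Mlflow):

```python
# Capability matrix mirroring Mlflow's built-in permission levels
PERMISSIONS = {
    "READ":           {"read": True,  "update": False, "delete": False, "manage": False},
    "EDIT":           {"read": True,  "update": True,  "delete": False, "manage": False},
    "MANAGE":         {"read": True,  "update": True,  "delete": True,  "manage": True},
    "NO_PERMISSIONS": {"read": False, "update": False, "delete": False, "manage": False},
}

def can(permission: str, action: str) -> bool:
    """Return whether a permission level allows an action."""
    return PERMISSIONS[permission][action]

print(can("EDIT", "update"))  # True
print(can("EDIT", "delete"))  # False
```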
Configuration
Configuration is done via a configuration file; the permission data itself is stored in the SQL database set under database_uri in that file.
Set the environment variable MLFLOW_AUTH_CONFIG_PATH to the path of your chosen config file.
Here is an example:
[mlflow]
default_permission = READ # default value
database_uri = postgresql://<db_uri>/mlflow_auth
admin_username = admin
admin_password = password
authorization_function = mlflow.server.auth.jwt_auth:authenticate_sso_request # custom authorization function that decodes a jwt token, default is basic auth
[role_mappings]
admin = admin
MANAGE = devops
EDIT = dataingenieur,dataanalyst
READ = datascientist
Mlflow native authentication
You can use Mlflow native authentication alone:
mlflow:
  auth:
    enabled: true
    adminUsername: admin
    adminPassword: password
    defaultPermission: READ
    dbName: mlflow_auth
    sso: false
proxy:
  enabled: false
Or with SSO while keeping the two unrelated, meaning the SSO user and the Mlflow user will not be the same.
Keep mlflow.auth.sso set to false if you don't want to delegate authentication to the OAuth2 proxy. Note that if you rely on the proxy this way, users must already exist in your Mlflow authentication database, otherwise they won't be able to access Mlflow.
mlflow:
  auth:
    enabled: true
    adminUsername: admin
    adminPassword: password
    defaultPermission: READ
    dbName: mlflow_auth
    sso: false
proxy:
  enabled: true
  oidc:
    allowedGroups: [admin, datascientist]
Mlflow SSO authentication with permissions
You can also opt for using the same users between SSO and Mlflow internal authentication.
mlflow:
  auth:
    enabled: true
    adminUsername: admin
    adminPassword: password
    defaultPermission: READ
    dbName: mlflow_auth
    sso: true
    roleMappings:
      admin: [admin]
      MANAGE: [devops]
      EDIT: [dataingenieur, dataanalyst]
      READ: [datascientist]
proxy:
  enabled: true
  oidc:
    allowedGroups: [admin, datascientist]
Mlflow SSO Role Mapping
Mlflow supports implementing custom authentication mechanisms. We leveraged this to add role mapping features.
Each time Mlflow receives a valid authorization token from the SSO Proxy, it creates a new local Mlflow account if one doesn't exist, or updates that account's local permissions according to the roles included in the token.
If a user has no role that is part of the role mapping configuration, they will have no permissions.
Here is an example of a role mapping configuration. The keys are Mlflow local permissions; the values are lists of corresponding Keycloak roles.
mlflow:
  auth:
    roleMappings:
      admin: ['admin']
      MANAGE: ['devops']
      EDIT: ['dataingenieur', 'dataanalyst']
      READ: ['datascientist']
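As a sketch of how such a mapping could resolve at login time (a hypothetical helper; the actual precedence logic lives in our custom authentication function and may differ):

```python
# Same mapping as the example above: Mlflow permission -> Keycloak roles
ROLE_MAPPINGS = {
    "admin": ["admin"],
    "MANAGE": ["devops"],
    "EDIT": ["dataingenieur", "dataanalyst"],
    "READ": ["datascientist"],
}

def resolve_permission(user_roles: list) -> str:
    # Walk permissions from most to least privileged; first match wins
    for permission in ("admin", "MANAGE", "EDIT", "READ"):
        if any(role in user_roles for role in ROLE_MAPPINGS[permission]):
            return permission
    # A user with no mapped role gets no permissions
    return "NO_PERMISSIONS"

print(resolve_permission(["datascientist"]))  # READ
print(resolve_permission(["devops"]))         # MANAGE
print(resolve_permission(["guest"]))          # NO_PERMISSIONS
```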
Details on how the code works
- OAuth2-proxy forwards the connection request to Mlflow.
- Mlflow calls our custom mlflow.server.auth.jwt_auth:authenticate_sso_request function.
- The function retrieves the token from the authorization header.
- The token is decoded:
  - if the token is expired, return error 401;
  - if the token is valid, continue.
- Retrieve the user name and roles from the token.
- Create a user account if it doesn't exist in local Mlflow; its password will be the sub value from the Keycloak token, which is the user's internal ID inside Keycloak. This is needed because Mlflow still manages everything internally as basic authentication.
- Sync permissions:
  - if the user is not an admin, update their permissions. Due to how Mlflow works, any new object gets default permissions that are the same for everyone; it is not possible to give Mlflow roles or groups their own default permissions. This means that whenever a user accesses Mlflow, we have to update their permissions on Mlflow resources to match our role mapping policy.
  - if the user is an admin, nothing is needed, because Mlflow gives the admin user the admin permission on everything created.
- Give access to Mlflow.
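To illustrate the token-decoding step, here is a minimal sketch of extracting the payload from a JWT. It performs no signature verification; the real function must also validate the signature and expiry. The claim names follow Keycloak conventions (preferred_username, realm_access.roles), and the token here is a fake one built for demonstration:

```python
import base64
import json

def decode_jwt_payload(token: str) -> dict:
    # A JWT is header.payload.signature; the payload is base64url-encoded JSON
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload_b64))

# Build a fake token for demonstration (Keycloak puts realm roles under realm_access.roles)
claims = {"sub": "local-user-id", "preferred_username": "alice",
          "realm_access": {"roles": ["datascientist"]}}
payload = base64.urlsafe_b64encode(json.dumps(claims).encode()).decode().rstrip("=")
token = f"eyJhbGciOiJSUzI1NiJ9.{payload}.signature"

decoded = decode_jwt_payload(token)
print(decoded["preferred_username"])     # alice
print(decoded["realm_access"]["roles"])  # ['datascientist']
```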
Mlflow authentication API samples
Create new user
Using UI
You can create a new user using the UI while logged in as admin: https://<mlflow_external_url>/signup
Using Python
First, set environment variables with the admin credentials:
export MLFLOW_TRACKING_USERNAME=admin
export MLFLOW_TRACKING_PASSWORD=password
import mlflow
auth_client = mlflow.server.get_app_client(
"basic-auth", "https://<mlflow_external_url>/"
)
auth_client.create_user(username="username", password="password")
Using API
Finally, you can do so via cURL:
curl -X POST -u admin:password \
-H "Content-Type: application/json" \
-d '{"username": "username", "password": "password"}' \
"https://<mlflow_external_url>/api/2.0/mlflow/users/create"
Add new permissions
Similarly, you can find the rest of the API endpoints for adding or modifying individual permissions in the official Mlflow documentation.
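For example, granting a user EDIT on a specific experiment could look like this (endpoint name taken from the Mlflow authentication REST API; replace the experiment ID and credentials with yours):

```
curl -X POST -u admin:password \
-H "Content-Type: application/json" \
-d '{"experiment_id": "1", "username": "username", "permission": "EDIT"}' \
"https://<mlflow_external_url>/api/2.0/mlflow/experiments/permissions/create"
```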
Usage
Track and push to registry
As explained above, this helps push artifacts and metadata to Mlflow about a specific training session. Artifacts are stored in S3. Metadata, parameters, and general information about runs and experiments are stored in PostgreSQL.
Concepts
- Run: Mlflow Tracking is organized around the concept of runs, which are executions of some piece of data science code, for example a single python train.py execution. Each run records metadata (information about your run such as metrics, parameters, start and end times) and artifacts (output files from the run such as model weights, images, etc.).
- Experiment: An experiment groups together runs for a specific task. You can create and search for experiments using the CLI, API, or UI.
Logging a Run
In order to log an Mlflow run, here is a simple python example:
mlflow_train.py
import os

import mlflow

# Set these environment variables outside of Python in real deployments;
# they are set here only to keep the example self-contained.
os.environ["MLFLOW_TRACKING_URI"] = "http://mlflow.kosmos-data:5000"
# S3 env vars to be able to push artifacts
os.environ["MLFLOW_S3_ENDPOINT_URL"] = "http://minio.kosmos-s3"
os.environ["AWS_ACCESS_KEY_ID"] = "minioadmin"
os.environ["AWS_SECRET_ACCESS_KEY"] = "minioadmin"
# For automatic system metrics logging; you need psutil for system metrics and pynvml for GPU metrics
os.environ["MLFLOW_ENABLE_SYSTEM_METRICS_LOGGING"] = "true"

# Creates the experiment if it doesn't exist
mlflow.set_experiment("Experiment Name")

with mlflow.start_run() as run:
    # You can let Mlflow decide which metrics to compute and push, including parameters and models.
    # This is supported only for popular libraries such as scikit-learn, XGBoost, PyTorch, Keras, or Spark.
    # It has to be called before starting your training session.
    mlflow.autolog()

    # You can also combine autologging with manual logging.
    # Add tags:
    tags = {
        "mlflow.note.content": "This will change the description.\n And it *supports* _markdown_\n",
        "framework": "the one used",
        "base_model": "huggingface/author/name",
    }
    mlflow.set_tags(tags)

    # You can store a single-value parameter
    mlflow.log_param("lr", 0.01)

    # Insert your ML code, which normally logs metrics at multiple steps (not necessarily every one)
    model = ...

    # Use log_metric to display graphs on the UI; the x axis will be steps or time.
    # Logging the same key several times builds up the series:
    mlflow.log_metric("val_loss", 0.0002, step=1)
    mlflow.log_metric("val_loss", 0.0001, step=2)
    mlflow.log_metric("train_loss", 0.0004, step=1)
    mlflow.log_metric("train_loss", 0.0003, step=2)

    # Log the model as an artifact of this run.
    # Make sure to use the right flavor; in this example we are using pytorch.
    # Many other flavors exist: keras, sklearn, tensorflow, spacy, spark, pyfunc, etc.
    mlflow.pytorch.log_model(
        model,
        "pytorch_model",
        signature=model_signature,  # optional; prepare a model signature beforehand if you want one
    )

    # Push the model to the registry; the artifact path must match the one used in log_model
    registered_model = mlflow.register_model(
        model_uri=f"runs:/{run.info.run_id}/pytorch_model",
        name="pytorch_model",
    )

    # Add tags to the registered model
    client = mlflow.tracking.MlflowClient()
    for k, v in tags.items():
        client.set_model_version_tag(registered_model.name, registered_model.version, k, v)

    print(f"Registered model {registered_model.name}:{registered_model.version}")
    print(f"Finished run ID: {run.info.run_id}")
Pull model from registry
In this example, we will pull a PyTorch model that we have pushed previously and run inference.
mlflow_pull.py
import mlflow
model_name = "pytorch_model"
model_version = 1
model = mlflow.pytorch.load_model(f"models:/{model_name}/{model_version}")
text = """
Cristiano Ronaldo dos Santos Aveiro (Portuguese pronunciation: [kɾiʃˈtjɐnu ʁɔˈnaldu]; born 5 February 1985) is a Portuguese professional footballer who plays as a forward for and captains both Saudi Pro League club Al Nassr and the Portugal national team. Widely regarded as one of the greatest players of all time, Ronaldo has won five Ballon d'Or awards,[note 3] a record three UEFA Men's Player of the Year Awards, and four European Golden Shoes, the most by a European player. He has won 33 trophies in his career, including seven league titles, five UEFA Champions Leagues, the UEFA European Championship and the UEFA Nations League. Ronaldo holds the records for most appearances (183), goals (140) and assists (42) in the Champions League, goals in the European Championship (14), international goals (128) and international appearances (205). He is one of the few players to have made over 1,200 professional career appearances, the most by an outfield player, and has scored over 850 official senior career goals for club and country, making him the top goalscorer of all time.
"""
labels = ["person", "award", "date", "competitions", "teams"]
# predict_entities is a method of this specific model class, not a generic Mlflow API
entities = model.predict_entities(text, labels)
for entity in entities:
print(f"{entity['text']} => {entity['label']}")