Logging metrics and tags of multiple models into one MLflow experiment in Databricks


There are situations where we need to run multiple notebooks or algorithms that are closely related. In this case, it makes sense to log all the information into one central place.

We can leverage MLflow with Databricks to achieve this goal.

First, let’s look at what information we can log with MLflow; a short example follows the list below.

Information to log in MLflow

  • mlflow.log_param() logs a single key-value param in the currently active run. The key and value are both strings. Use mlflow.log_params() to log multiple params at once.

  • mlflow.log_metric() logs a single key-value metric. The value must always be a number. MLflow remembers the history of values for each metric. Use mlflow.log_metrics() to log multiple metrics at once.

  • mlflow.set_tag() sets a single key-value tag in the currently active run. The key and value are both strings. Use mlflow.set_tags() to set multiple tags at once.

  • mlflow.log_artifact() logs a local file or directory as an artifact, optionally taking an artifact_path to place it within the run’s artifact URI. Run artifacts can be organized into directories, so you can place the artifact in a directory this way.

  • mlflow.log_artifacts() logs all the files in a given directory as artifacts, again taking an optional artifact_path.
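As a quick illustration of these functions, here is a minimal sketch of a single run that logs params, metrics, tags, and an artifact together (the file name metrics.json, the tag values, and the numbers are placeholders, not taken from the example further down):

import json
import mlflow

with mlflow.start_run():
    # Log several params at once; keys and values are stored as strings
    mlflow.log_params({"n_estimators": 100, "max_depth": 6})

    # Log several metrics at once; values must be numeric
    mlflow.log_metrics({"mse": 0.25, "r2": 0.87})

    # Set several tags at once, e.g. to record which algorithm produced the run
    mlflow.set_tags({"model_name": "random_forest", "stage": "dev"})

    # Write a local file and log it as an artifact under the "reports" directory
    with open("metrics.json", "w") as f:
        json.dump({"mse": 0.25}, f)
    mlflow.log_artifact("metrics.json", artifact_path="reports")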

To distinguish between algorithms, we can use the mlflow.set_tag() function to log each algorithm’s name into the experiment.

Example code to use in Databricks notebooks


import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error


# Import the dataset from scikit-learn and create the training and test datasets.
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_diabetes

db = load_diabetes()
X = db.data
y = db.target
X_train, X_test, y_train, y_test = train_test_split(X, y)

# This run uses mlflow.set_experiment() to specify an experiment in the workspace where runs should be logged.
# If the experiment specified by experiment_name does not exist in the workspace, MLflow creates it.
# Access these runs using the experiment name in the workspace file tree.

experiment_name = "/xxxx/test-experiment" # plug in your workspace path in Databricks, where "test-experiment" is the actual experiment name
mlflow.set_experiment(experiment_name)

with mlflow.start_run():
    n_estimators = 100
    max_depth = 6
    max_features = 3

    # Create and train the model
    rf = RandomForestRegressor(n_estimators=n_estimators, max_depth=max_depth, max_features=max_features)
    rf.fit(X_train, y_train)

    # Make predictions
    predictions = rf.predict(X_test)

    # Log the algorithm name as a tag
    mlflow.set_tag("model_name", "user_test1")

    # Log parameters
    mlflow.log_param("num_trees", n_estimators)
    mlflow.log_param("maxdepth", max_depth)
    mlflow.log_param("max_feat", max_features)

    # Log the model
    mlflow.sklearn.log_model(rf, "random-forest-model")

    # Compute and log metrics
    mse = mean_squared_error(y_test, predictions)
    mlflow.log_metric("mse", mse)
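
The same pattern extends to a second algorithm, typically run from another notebook. Pointing mlflow.set_experiment() at the same path and setting a different model_name tag makes both runs appear side by side in one experiment. A sketch, where GradientBoostingRegressor and the tag value "user_test2" are just illustrative choices:

from sklearn.ensemble import GradientBoostingRegressor

# In a second notebook, target the SAME experiment path as before
mlflow.set_experiment(experiment_name)

with mlflow.start_run():
    gb = GradientBoostingRegressor(n_estimators=200, max_depth=3)
    gb.fit(X_train, y_train)
    predictions = gb.predict(X_test)

    # A different tag value distinguishes this algorithm within the shared experiment
    mlflow.set_tag("model_name", "user_test2")

    # Log parameters, model, and metrics just as before
    mlflow.log_param("num_trees", 200)
    mlflow.log_param("maxdepth", 3)
    mlflow.sklearn.log_model(gb, "gradient-boosting-model")
    mlflow.log_metric("mse", mean_squared_error(y_test, predictions))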


Results in MLflow experiment dashboard
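
After the runs finish, the experiment page in the Databricks workspace shows all models side by side, with their tags, parameters, and metrics as columns. We can also pull the same information back into a notebook; a minimal sketch assuming the experiment_name variable from above (columns in the returned DataFrame follow MLflow’s params./metrics./tags. naming convention):

import mlflow

# Look up the experiment by the same name used when logging
exp = mlflow.get_experiment_by_name(experiment_name)

# Fetch every run in the experiment as a pandas DataFrame
runs = mlflow.search_runs(experiment_ids=[exp.experiment_id])
print(runs[["run_id", "tags.model_name", "params.num_trees", "metrics.mse"]])

# Or keep only the runs of a single algorithm by filtering on the tag
rf_runs = mlflow.search_runs(
    experiment_ids=[exp.experiment_id],
    filter_string="tags.model_name = 'user_test1'",
)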


Author: robot learner