The Challengers tab is a feature exclusive to DataRobot MLOps users. Contact your DataRobot representative for information on enabling it.
During model development, many models are often compared to one another until one is chosen for deployment to a production environment. The Challengers tab provides a way to continue model comparison after deployment. You can submit challenger models that shadow a deployed model and replay predictions made against the deployed model. This allows you to compare the predictions made by the challenger models with those of the currently deployed model (the "champion") to determine whether another DataRobot model would be a better fit.
Enable challenger models¶
To enable challenger models for a deployment, you must enable both the Challengers tab and prediction row storage. To do so, adjust the deployment's data drift settings either when creating the deployment or on the Settings > Data tab. When you enable challenger models, prediction row storage is enabled automatically for the deployment and cannot be turned off, because challengers require it.
Select a challenger model¶
Before adding a challenger model to a deployment, you must first build and select the model to be added as a challenger. Complete the modeling process and choose a model from the Leaderboard, or deploy a custom model as a model package. When selecting a challenger model, consider the following:
- It must have the same target type as the champion model.
- It does not need to be trained on the same feature list as the champion model, but it must share some features. To successfully replay predictions, however, prediction requests must include the union of all features required by the champion and its challengers.
- It does not have to be built from the same project as the champion model.
- Feature Discovery models cannot be used as challenger models.
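The feature requirement above can be checked before you register a challenger. The following is a minimal sketch in plain Python, not a DataRobot API: the function name, the model names, and the feature names are all hypothetical. It rejects a challenger that shares no features with the champion and returns the union of features that each prediction request would need to carry for replay to succeed.

```python
def replay_feature_plan(champion_features, challenger_feature_lists):
    """Return the sorted union of features needed to replay predictions.

    Raises ValueError if a challenger shares no features with the champion.
    """
    champion = set(champion_features)
    required = set(champion)
    for name, features in challenger_feature_lists.items():
        features = set(features)
        if not champion & features:
            raise ValueError(f"{name} shares no features with the champion")
        required |= features
    return sorted(required)


# Hypothetical feature lists for one champion and two challengers.
champion = ["age", "income", "tenure"]
challengers = {
    "reduced feature list": ["age", "income"],
    "new data source": ["age", "region"],
}
print(replay_feature_plan(champion, challengers))
```

In this example, requests would need to include `region` even though the champion never uses it, because one challenger requires it.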
When you have selected a model to serve as a challenger, from the Leaderboard, navigate to Predict > Deploy and select Add model package to registry. This creates a model package for the selected model in the Model Registry, so you can add the model to a deployment as a challenger.
Add challengers to a deployment¶
To add a challenger model to a deployment, navigate to the Challengers tab and select Add challenger model. You can add up to 5 challengers to each deployment.
The selection list contains only model packages where the target type and name are the same as the champion model.
The modal prompts you to select a model package from the registry to serve as a challenger model. Choose the model to add and click Select model package.
DataRobot verifies that the model shares features and a target type with the champion model. Once verified, click Add Challenger. The model is now added to the deployment as a challenger.
After adding a challenger model, you can replay stored predictions made with the champion model for all challengers, allowing you to compare performance metrics such as predicted values, accuracy, and data errors across each model.
To replay predictions, select Update challenger predictions.
If you aren't in the Organization associated with the deployment, you don't have the required permissions to replay predictions against challenger models. This restriction also applies to deployment Owners.
The champion model computes and stores up to 100,000 prediction rows per hour. The challengers replay the first 10,000 rows of the prediction requests made for each hour within the time range specified by the date slider. Note that for time series deployments, this limit does not apply. All prediction data is used by the challengers to compare statistics.
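To estimate how many rows a replay covers, the limits above can be applied hour by hour. This is an illustrative sketch only: the function name is hypothetical, and it assumes that for time series deployments "this limit does not apply" means every row is replayed.

```python
STORED_ROWS_PER_HOUR = 100_000   # rows the champion stores per hour (from the text)
REPLAYED_ROWS_PER_HOUR = 10_000  # rows the challengers replay per hour (from the text)


def rows_replayed(rows_per_hour, time_series=False):
    """Estimate the number of rows replayed across a range of hours."""
    total = 0
    for rows in rows_per_hour:
        if time_series:
            # Assumption: no per-hour cap applies to time series deployments.
            total += rows
        else:
            stored = min(rows, STORED_ROWS_PER_HOUR)
            total += min(stored, REPLAYED_ROWS_PER_HOUR)
    return total


hourly_requests = [4_000, 25_000, 150_000]  # hypothetical traffic for three hours
print(rows_replayed(hourly_requests))
print(rows_replayed(hourly_requests, time_series=True))
```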
After predictions are made, click Refresh on the date slider to view an updated display of performance metrics for the challenger models.
Scheduled replay of predictions¶
You can replay predictions with challengers on a periodic schedule instead of doing so manually. Navigate to a deployment's Settings > Challengers tab. Turn on the toggle to automatically replay challengers. Scheduled replay can only be configured by the Owner of a deployment.
Configure the preferred cadence and time of day for replaying predictions.
Once enabled, the replay triggers at the configured time for all challengers. If you add challengers to a deployment that already has past prediction requests, the scheduled job scores the newly added challenger models on its next run cycle.
Challenger models overview¶
The Challengers tab displays information about the champion model and each challenger.
|Element||Description|
|Display Name||The display name for each model. Use the pencil icon to edit the display name. This field is useful for describing the purpose or strategy of each challenger (e.g., "reference model," "former champion," "reduced feature list").|
|Challenger models||The list of challenger models. Each model is associated with a color. These colors allow you to compare the models using visualization tools.|
|Model data||The metadata for each model, including the project name, model name, and the execution environment type.|
|Prediction Environment||The external environment the model uses to manage deployment predictions on a system outside of DataRobot. For more information, see Prediction environments.|
|Accuracy||The model's accuracy metric calculation for the selected date range and, for challengers, a comparison with the champion's accuracy metric calculation. Use the Accuracy metric dropdown menu to compare different metrics. For more information on model accuracy, see the Accuracy chart.|
|Training Data||The filename of the data used to train the model.|
|Actions||The actions available for each model:
Challenger performance metrics¶
After prediction data is replayed for challenger models, you can examine the charts displayed below that capture the various performance metrics recorded for each model.
Each model is listed with its corresponding color. Uncheck a model's box to stop displaying the model's performance data on the charts.
The Predictions chart records the average predicted value of the target for each model over time. Hover over a point to compare the average value for each model at a specific point in time.
For binary classification projects, use the Class dropdown to select the class for which you want to analyze the average predicted values. The chart also includes a toggle that allows you to switch between continuous and binary modes. Continuous mode shows the positive class predictions as probabilities between 0 and 1 without taking the prediction threshold into account. Binary mode takes the prediction threshold into account and shows, for all predictions made, the percentage for each possible class.
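The difference between the two modes can be illustrated with a few lines of Python. This is a sketch of the idea rather than DataRobot's implementation: the function names are hypothetical, and treating a probability equal to the threshold as positive is an assumption.

```python
def continuous_mode(probabilities):
    """Average positive-class probability, ignoring the prediction threshold."""
    return sum(probabilities) / len(probabilities)


def binary_mode(probabilities, threshold=0.5):
    """Percentage of predictions assigned to each class after thresholding."""
    # Assumption: a probability equal to the threshold counts as positive.
    positives = sum(1 for p in probabilities if p >= threshold)
    pct_positive = 100.0 * positives / len(probabilities)
    return {"positive": pct_positive, "negative": 100.0 - pct_positive}


probs = [0.2, 0.3, 0.4, 0.9]  # hypothetical positive-class probabilities
print(continuous_mode(probs))
print(binary_mode(probs, threshold=0.5))
```

Here the average probability is close to 0.45, yet only one of the four predictions crosses the threshold, so the two modes can tell quite different stories about the same data.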
The Accuracy chart records the change in a selected accuracy metric value (LogLoss in this example) over time. These metrics are identical to those used for the evaluation of the model before deployment. Use the dropdown to change the accuracy metric. You can select from any of the supported metrics for the deployment's modeling type.
Data Errors chart¶
The Data Errors chart records the data error rate for each model over time. Data error rate measures the percentage of requests that result in a 4xx error (problems with the prediction request submission).
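As a worked example of the definition above, the data error rate for a batch of requests can be computed from their HTTP status codes. The function name and the sample status codes are hypothetical.

```python
def data_error_rate(status_codes):
    """Percentage of prediction requests that returned a 4xx status."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if 400 <= code < 500)
    return 100.0 * errors / len(status_codes)


# Two of these eight hypothetical requests failed with a 4xx error.
codes = [200, 200, 422, 200, 400, 200, 200, 200]
print(data_error_rate(codes))
```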
Challenger model comparisons¶
MLOps allows you to compare challenger models against each other and against the currently deployed model (the "champion") to ensure that your deployment uses the best model for your needs. After evaluating DataRobot's model comparison visualizations, you can replace the champion model with a better-performing challenger.
DataRobot renders visualizations based on a dedicated comparison dataset, which you select, ensuring that you're comparing predictions based on the same dataset and partition while still allowing you to train champion and challenger models on different datasets. For example, you may train a challenger model on an updated snapshot of the same data source used by the champion.
Make sure your comparison dataset is out-of-sample for the models being compared (i.e., it doesn't include the training data from any models included in the comparison).
Generate model comparisons¶
On the Deployments page, locate and expand the deployment with the champion and challenger models you want to compare.
Click the Challengers tab.
Click the Model Comparison tab.
The following table describes the elements of the Model Comparison tab:
|Element||Description|
|Model 1||Defaults to the champion model (the currently deployed model). Click to select a different model to compare.|
|Model 2||Defaults to the first challenger model in the list. Click to select a different model to compare. If the list doesn't contain a model you want to compare to Model 1, click the Challengers Summary tab to add a new challenger.|
|Open model package||Click to view the model's details. The details display in the Model Packages tab in the Model Registry.|
|Promote to champion||If the challenger model in the comparison is the best model, click Promote to champion to replace the deployed model (the "champion") with this model.|
|Add comparison dataset||Select a dataset for generating insights on both models. Be sure to select a dataset that is out-of-sample for both models (see stacked predictions). Holdout and validation partitions for Model 1 and Model 2 are available as options if these partitions exist for the original model. By default, the holdout partition for Model 1 is selected. To specify a different dataset, click + Add comparison dataset and choose a local file or a snapshotted dataset from the AI Catalog.|
|Prediction environment||Select a prediction environment for scoring both models.|
|Model Insights||Compare model predictions, metrics, and more.|
Scroll to the Model Insights section of the Challengers page and click Compute insights.
You can generate new insights using a different dataset by clicking + Add comparison dataset, then selecting Compute insights again.
View model comparisons¶
Once you compute model insights, the Model Insights page displays the following tabs, depending on the project type:
|Accuracy||Dual lift||Lift||ROC||Predictions Difference|
Multiclass classification projects only support accuracy comparison.
After DataRobot computes model insights for the deployment, you can compare model accuracy.
Under Model Insights, click the Accuracy tab to compare accuracy metrics:
The two columns show the metrics for each model. Highlighted numbers represent favorable values. In this example, the champion, Model 1, outperforms Model 2 for most metrics shown.
For time series projects, you can evaluate accuracy metrics by applying the following filters:
For all x series: View accuracy scores by metric. This view reports scores in all available accuracy metrics for both models across the entire time series range (x).
Per series: View accuracy scores by series within a multiseries comparison dataset. This view reports scores in a single accuracy metric (selected in the Metric dropdown menu) for each Series ID (e.g., store number) in the dataset for both models.
For multiclass projects, you can evaluate accuracy metrics by applying the following filters:
For all x classes: View accuracy scores by metric. This view reports scores in all available accuracy metrics for both models across the entire multiclass range (x).
Per class: View accuracy scores by class within a multiclass classification problem. This view reports scores in a single accuracy metric (selected in the Metric dropdown menu) for each Class (e.g., buy, sell, or hold) in the dataset for both models.
A dual lift chart is a visualization comparing two selected models against each other. This visualization can reveal how models underpredict or overpredict the actual values across the distribution of their predictions. The prediction data is sorted in increasing order and divided into equal-size bins.
To view the dual lift chart for the two models being compared, under Model Insights, click the Dual lift tab:
The curves for the two models represented on this chart maintain the color they were assigned when added to the deployment (as either a champion or challenger). To interact with the dual lift chart, you can hide the model curves and the actual curve.
- The + icons in the plot area of the chart represent the models' predicted values. Click the + icon next to a model name in the header to hide or show the curve for a particular model.
- The orange o icons in the plot area of the chart represent the actual values. Click the orange o icon next to Actual to hide or show the curve representing the actual values.
A lift chart depicts how well a model segments the target population and how capable it is of predicting the target, allowing you to visualize the model's effectiveness.
To view the lift chart for the models being compared, under Model Insights, click the Lift tab:
The curves for the two models represented on this chart maintain the color they were assigned when added to the deployment (as either a champion or challenger).
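The binning idea behind a lift chart can be sketched as follows. This is an illustration under stated assumptions, not DataRobot's implementation: the function name is hypothetical, rows are assumed to be sorted by predicted value, and the bin count is a free parameter.

```python
def lift_chart_bins(predictions, actuals, n_bins=10):
    """Return (mean predicted, mean actual) for each equal-size bin.

    Rows are sorted by predicted value in increasing order before binning.
    """
    rows = sorted(zip(predictions, actuals), key=lambda row: row[0])
    n = len(rows)
    bins = []
    for i in range(n_bins):
        chunk = rows[i * n // n_bins:(i + 1) * n // n_bins]
        if not chunk:
            continue  # fewer rows than bins: skip empty bins
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        mean_actual = sum(a for _, a in chunk) / len(chunk)
        bins.append((mean_pred, mean_actual))
    return bins


# Hypothetical predictions and binary actuals for six rows.
predictions = [0.9, 0.1, 0.4, 0.2, 0.8, 0.3]
actuals = [1, 0, 1, 0, 1, 0]
for mean_pred, mean_actual in lift_chart_bins(predictions, actuals, n_bins=3):
    print(round(mean_pred, 2), mean_actual)
```

When the mean actual value rises steadily along with the mean predicted value, the model is ranking the target population well. Running the same function on each model's predictions gives the two curves being compared.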
The ROC tab is only available for binary classification projects.
An ROC curve plots the true-positive rate against the false-positive rate for a given data source. Use the ROC curve to explore classification, performance, and statistics for the models you're comparing.
To view the ROC curves for the models being compared, under Model Insights, click the ROC tab:
The curves for the two models represented on this chart maintain the color they were assigned when added to the deployment (as either a champion or challenger). You can update the prediction thresholds for the models by clicking the pencil icons.
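For reference, the points on an ROC curve can be computed directly from actual labels and predicted scores. The sketch below uses a hypothetical function name and sample data, and assumes that both classes are present in the data and that a score equal to the threshold is classified as positive.

```python
def roc_points(actuals, scores):
    """Return (false-positive rate, true-positive rate) pairs.

    One point is produced per distinct score used as a threshold,
    starting from (0.0, 0.0). Both classes must be present in actuals.
    """
    positives = sum(actuals)
    negatives = len(actuals) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for a, s in zip(actuals, scores) if s >= threshold and a == 1)
        fp = sum(1 for a, s in zip(actuals, scores) if s >= threshold and a == 0)
        points.append((fp / negatives, tp / positives))
    return points


labels = [1, 0, 1, 0]          # hypothetical actual classes
scores = [0.9, 0.8, 0.7, 0.1]  # hypothetical positive-class probabilities
print(roc_points(labels, scores))
```

Changing a model's prediction threshold selects a different point on the same curve; it does not change the curve itself.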
Click the Predictions Difference tab to compare the predictions of two models on a row-by-row basis. The histogram shows the percentage of predictions that fall within the match threshold you specify in the Prediction match threshold field (along with the corresponding numbers of rows).
The header of the histogram displays the percentage of predictions:
- Between the positive and negative values of the match threshold (shown in green)
- Greater than the upper (positive) match threshold (shown in red)
- Less than the lower (negative) match threshold (shown in red)
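The three header percentages can be reproduced with a short calculation. This is a sketch with a hypothetical function name; it assumes the difference is computed as Model 2 minus Model 1 and that differences exactly equal to the threshold count as matches.

```python
def match_summary(preds_model_1, preds_model_2, threshold=0.0025):
    """Percentage of rows within, above, and below the match threshold."""
    # Assumption: difference is Model 2 minus Model 1.
    diffs = [b - a for a, b in zip(preds_model_1, preds_model_2)]
    n = len(diffs)
    within = sum(1 for d in diffs if -threshold <= d <= threshold)
    above = sum(1 for d in diffs if d > threshold)
    below = sum(1 for d in diffs if d < -threshold)
    return {
        "within": 100.0 * within / n,
        "above": 100.0 * above / n,
        "below": 100.0 * below / n,
    }


model_1 = [0.50, 0.20, 0.70, 0.40]   # hypothetical predictions
model_2 = [0.501, 0.30, 0.60, 0.40]
print(match_summary(model_1, model_2))
```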
How are bin sizes calculated?
The size of the Predictions Difference bins in the histogram depends on the Prediction match threshold you set. The width of the center bin equals the difference between the upper (positive) match threshold and the lower (negative) match threshold. The default prediction match threshold value is 0.0025, so for that value, the center bin is 0.005 wide (0.0025 + |-0.0025|). Moving outward from the center bin, each bin boundary is ten times the previous boundary. The last bin on either end expands to fit the full Prediction Difference range. For example, based on the default Prediction match threshold, the bin sizes would be as follows (where x is the difference between 250 and the maximum Prediction Difference):
| ||Bin -5||Bin -4||Bin -3||Bin -2||Bin -1||Bin 0||Bin 1||Bin 2||Bin 3||Bin 4||Bin 5|
|Range||−(250 + x) to −25||−25 to −2.5||−2.5 to −0.25||−0.25 to −0.025||−0.025 to −0.0025||−0.0025 to +0.0025||+0.0025 to +0.025||+0.025 to +0.25||+0.25 to +2.5||+2.5 to +25||+25 to +(250 + x)|
|Size||225 + x||22.5||2.25||0.225||0.0225||0.005||0.0225||0.225||2.25||22.5||225 + x|
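The bin boundaries in the example can be derived from the match threshold alone. The following sketch uses a hypothetical function name and assumes five bins on each side of the center bin, as in the example; the outermost boundaries widen symmetrically only when the largest absolute difference exceeds them.

```python
def difference_bin_edges(threshold=0.0025, max_abs_difference=None, bins_per_side=5):
    """Return the sorted bin boundaries for the Predictions Difference histogram."""
    # Positive boundaries: threshold, 10x threshold, 100x threshold, ...
    positive = [threshold * 10 ** k for k in range(bins_per_side + 1)]
    # The last bin on either end expands to fit the full difference range.
    if max_abs_difference is not None and max_abs_difference > positive[-1]:
        positive[-1] = max_abs_difference
    return [-edge for edge in reversed(positive)] + positive


edges = difference_bin_edges()
sizes = [upper - lower for lower, upper in zip(edges, edges[1:])]
print([round(edge, 4) for edge in edges])
print([round(size, 4) for size in sizes])
```

Twelve boundaries define eleven bins. Passing `max_abs_difference=300`, for instance, moves only the two outermost boundaries, to -300 and +300.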
If many matches dilute the histogram, you can toggle Scale y-axis to ignore perfect matches to focus on the mismatches.
The bottom section of the Predictions Difference tab shows the 1000 most divergent predictions (in terms of absolute value).
The Difference column shows how far apart the predictions are.
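Ranking rows by divergence amounts to sorting on the absolute difference between the two predictions. The sketch below uses a hypothetical function name and row layout, and assumes the difference is computed as Model 2 minus Model 1.

```python
def most_divergent(rows, limit=1000):
    """Return the rows with the largest absolute prediction difference.

    rows: iterable of (row_id, prediction_model_1, prediction_model_2).
    Each result is (row_id, prediction_model_1, prediction_model_2, difference).
    """
    # Assumption: difference is Model 2 minus Model 1.
    scored = [(row_id, p1, p2, p2 - p1) for row_id, p1, p2 in rows]
    scored.sort(key=lambda row: abs(row[3]), reverse=True)
    return scored[:limit]


sample_rows = [("a", 0.5, 0.6), ("b", 0.9, 0.2), ("c", 0.3, 0.3)]  # hypothetical
for row_id, p1, p2, diff in most_divergent(sample_rows, limit=2):
    print(row_id, round(diff, 4))
```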
Replace champion with challenger¶
After comparing models, if you find a model that outperforms the deployed model, you can set it as the new champion.
Evaluate the comparison model insights to determine the best-performing model.
If a challenger model outperforms the deployed model, click Promote to champion.
Select a Replacement Reason and click Accept and Replace.
The challenger model is now the champion (deployed) model.
Challengers for external deployments¶
External deployments with remote prediction environments can also use the Challengers tab. Remote models can serve as the champion model, and you can compare them to DataRobot and custom models serving as challengers.
The workflow for adding challenger models is largely the same; the differences specific to external deployments are outlined below.
Add challenger models to external deployments¶
To enable challenger support, access an external deployment (one created with an external model package). In the Settings tab, under the Data Drift header, enable challenger models and prediction row storage.
The Challengers tab is now accessible. To add challenger models to the deployment, navigate to the tab and select Add challenger model.
Select a model package for the challenger you want to add (custom and DataRobot models only). You must also specify the prediction environment used by the model package, which determines where the model runs predictions. As challengers, DataRobot and custom models can only use a DataRobot prediction environment (unlike the champion model, which is deployed to an external prediction environment). When you have chosen the desired prediction environment, click Select.
The tab updates to display the model package you wish to add, verifying that the features used in the model package match the deployed model. Select Add challenger.
The model package is now serving as a challenger model for the remote deployment.
Add external challenger comparison dataset¶
To compare an external model challenger, you need to provide a dataset that includes the actuals and the prediction results. When you upload the comparison dataset, you can specify a column containing the prediction results.
To add a comparison dataset for an external model challenger, follow the Generate model comparisons process, and on the Model Comparison tab, upload your comparison dataset with a Prediction column identifier. Make sure the prediction dataset you provide includes the prediction results generated by the external model at the location identified by the Prediction column.
Manage challengers for external deployments¶
You can manage challenger models for remote deployments with various actions:
To edit the prediction environment used by a challenger, select the pencil icon and choose a new prediction environment from the dropdown.
To replace the deployed model with a challenger, the challenger must have a compatible prediction environment. Once replaced, the champion does not become a challenger because remote models are ineligible.
Challenger promotion to champion¶
A deployment's champion can't switch between an external prediction environment and a DataRobot prediction environment. When a challenger replaces a champion running in an external prediction environment, that challenger inherits the external environment of the former champion. If the Management Agent isn't configured in the external prediction environment, you must manually deploy the new champion in the external environment to continue making predictions.
Champion demotion to challenger¶
If the former champion isn't an external model package, it is compatible with DataRobot hosting and can become a challenger. In that scenario, the former champion moves to a DataRobot prediction environment where the deployment can replay the champion's predictions against it.