Evaluate experiments (Comparison tab)¶
Model Comparison is a Preview feature that is on by default.
Feature flag: Enable Use Case Leaderboard Compare
There are two Leaderboard display options available in Workbench:
The Comparison tab (this page), which allows you to compare up to three models of the same type—for example, all binary or all regression—from any number of experiments within a single Use Case. (Models must be of the same type because accuracy metrics between different types are not directly comparable.)
The Experiment tab page, which helps you understand and evaluate models from a single non-time-aware experiment.
The model comparison tool, accessed by clicking the Comparison tab at the top of the Leaderboard, provides a filtered list of models from one or more Experiments in your Use Case. You can change the filtering logic, select the Leaderboard sorting order, and choose up to three models of the same target type to view side-by-side. The tab provides both model insight and model lineage comparison.
See the associated considerations for important additional information.
The following sections describe the major components of the Comparison Leaderboard:
Set up filtering¶
Once you click Comparison, from the tab or the breadcrumb dropdown, the initial display shows the most accurate model from each experiment in the active Use Case. There are three basic controls for setting up a comparison:
|Breadcrumbs||Click to display a list of up to 50 experiments in the Use Case that are available for comparison, sorted by creation date.|
|Filters||Set criteria for the Leaderboard model list.|
|Checkbox selectors||Select the models—up to three—to be used in the model insight and model lineage comparison.|
Select Filter models to set the criteria for the models that Workbench displays on the Comparison Leaderboard.
The Comparison Leaderboard lists all models that meet your filtering criteria, grouped by target type (binary classification, regression). Note that the standard sorting logic applies.
There are three ways to determine which models from all your experiments are added to the Comparison Leaderboard:
Comparison filters are saved when you navigate to another page, allowing you to resume your comparison when you return. If a selected model does not appear in the current Leaderboard model list, DataRobot displays the message "Model does not match applied filters." Use the Remove from comparison option in the model's actions menu to remove it from the comparison.
Filter by experiment details¶
When toggled on, Filter by experiment details limits the Leaderboard to show only models from experiments that were trained on the selected dataset or target. This filter is applied after any additional filters are applied. Use this filter, for example, if you want to compare only models that use one or more specific datasets.
Filter by accuracy¶
Enabling Filter by accuracy adds the most accurate models, per experiment, to the Leaderboard. Select additional filters within this category to constrain how DataRobot selects the most accurate models. Accuracy rankings are based on the configured optimization metric (e.g., LogLoss for binary classification models, RMSE for regression models).
Additional criteria are described below:
|Return up to||Sets the number of models to compare from each experiment. DataRobot returns models based on highest accuracy.|
|Model sample size||Sets the sample size that returned models were trained on.|
|Model family||Returns only models that are part of the selected model family. Start typing to autocomplete the model family name.|
|Must support Scoring Code||When checked, returns only models that support Scoring Code export, allowing you to use DataRobot-generated models outside the platform.|
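The selection behavior described above can be sketched in code: for each experiment, rank models by the optimization metric and keep the top N. This is an illustrative sketch only; the model names, score values, and data structures are invented for the example and are not DataRobot internals.

```python
# Hypothetical sketch of "Filter by accuracy": keep the top-N most
# accurate models from each experiment, ranked by the optimization
# metric (e.g., LogLoss, where lower is better).

def top_models_per_experiment(models, n=1, lower_is_better=True):
    """Return up to `n` most accurate models from each experiment."""
    by_experiment = {}
    for m in models:
        by_experiment.setdefault(m["experiment"], []).append(m)
    selected = []
    for exp_models in by_experiment.values():
        ranked = sorted(exp_models, key=lambda m: m["score"],
                        reverse=not lower_is_better)
        selected.extend(ranked[:n])
    return selected

# Invented example data: two experiments, LogLoss scores.
models = [
    {"experiment": "exp-1", "name": "XGBoost", "score": 0.31},
    {"experiment": "exp-1", "name": "ElasticNet", "score": 0.38},
    {"experiment": "exp-2", "name": "LightGBM", "score": 0.29},
]
best = top_models_per_experiment(models, n=1)
```

With Return up to set to 1, this returns one model per experiment (here, XGBoost and LightGBM), mirroring the default Comparison Leaderboard display.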
Model family selections¶
The Comparison Leaderboard can return models based on their "family." Below is a list of families run during Autopilot, with example (but not exhaustive) member models.
|Gradient Boosting Machine||Light Gradient Boosting on ElasticNet Predictions, eXtreme Gradient Boosted Trees Classifier|
|ElasticNet Generalized Linear Model||Elastic-Net Classifier, Generalized Additive2|
|Rule induction||RuleFit Classifier|
|Random Forest||RandomForest Classifier or Regressor|
The following families of models, built via the Repository, are available as filters:
- Adaptive Boosting
- Decision Tree
- K Nearest Neighbors
- Naive Bayes
- Support Vector Machine
- Two Stage
Filter by starred models¶
Enabling Filter by starred adds all starred models from experiments, unless the set of experiments is reduced by Filter by experiment details. This filter does not impact any selections from Filter by accuracy.
Once filtering is applied, the Leaderboard redisplays to show the top 20 results from each target type, inclusive across experiments. If more than 20 models meet filtering criteria, a Load more link is available and will load up to 20 additional models (with an option to load more if required).
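The display rule above can be sketched as a simple group-and-page operation: group the filtered models by target type, show the first 20 of each group, and have Load more fetch the next 20. The data structures here are assumptions for illustration, not DataRobot's implementation.

```python
# Illustrative sketch of the Comparison Leaderboard paging rule:
# up to 20 models per target type, with "Load more" advancing the page.

def paged_leaderboard(models, page=0, page_size=20):
    groups = {}
    for m in models:
        groups.setdefault(m["target_type"], []).append(m)
    start = page * page_size
    return {t: ms[start:start + page_size] for t, ms in groups.items()}

# Invented example: 25 binary and 25 regression models pass the filters.
models = [{"target_type": "binary" if i % 2 else "regression", "id": i}
          for i in range(50)]
first_page = paged_leaderboard(models)           # 20 per target type
second_page = paged_leaderboard(models, page=1)  # "Load more": the next 5
```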
In the example below, filtering included the top three most accurate models from each experiment in the Use Case. The Use Case includes the 10K diabetes data trained as 1) a binary classification and 2) a regression experiment.
The Leaderboard separates binary and regression models in the list, displaying the applied partition and metric for the grouping and the source experiment for each model. Changing either value updates the display of models for that target type.
Use the checkboxes to the left of the model name to select that model for the comparison display. You can compare up to three models, but they must be of the same target type (all binary or all regression).
Model comparison display¶
The Comparison page, which begins to populate when you select the first model, shows up to three selected models, side-by-side. Once selected, models will remain on this page until removed, even if Leaderboard filtering is changed. DataRobot provides a warning, "Model does not match applied filters," if subsequent filtering excludes a selected model.
The following sections describe the actions available from the Comparison page:
|Model actions||Take actions on an individual model.|
|Insights||View insights for up to three models of the same target type side-by-side.|
|Lineage||View general model and performance information.|
Use the three dots next to the model name to take one of the following actions for the selected model:
|Open in experiment||Navigates to the Model overview page.|
|Make predictions||Navigates to the Make Predictions page.|
|Create app||Builds a model package and then opens the tools for creating an application.|
|Generate compliance report||Creates an editable compliance template for the model, providing documentation that the model works as intended, is appropriate for its intended business purpose, and is conceptually sound.|
|Remove from comparison||Removes the selected model from the Comparison page. If you change your filters and no longer see the model on the Leaderboard, you can still remove it using this action.|
The comparison view displays supported insights for up to three models. Choose the models to compare using the checkboxes in the Leaderboard listing. Note that not all insights are available for every model and some insights require additional computation before displaying.
For classification experiments, the ROC Curve tab plots the true positive rate against the false positive rate for each of the selected models in a single plot. The accompanying table, a confusion matrix, helps evaluate accuracy by comparing actual versus predicted values.
Optionally, you can:
- Select a different data partition. If scoring has not been computed for that partition (for a given model), a Score link becomes available in the matrix to initiate computations.
- Adjust metric display units to change the display in the confusion matrix between absolute numbers and percentages.
- Adjust the display threshold used to compute metrics. Any changes only impact the ROC plot and the confusion matrix; they do not change the prediction threshold for the models.
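The relationship between the display threshold and the confusion matrix can be illustrated with a short sketch: each predicted probability is mapped to a positive or negative class at the chosen threshold, then tallied against the actual value. The probabilities and actuals below are invented example data.

```python
# Sketch of how a display threshold produces the confusion matrix values
# shown alongside the ROC plot. Changing the threshold reclassifies rows
# but does not alter the underlying model predictions.

def confusion_matrix(probs, actuals, threshold=0.5):
    tp = fp = tn = fn = 0
    for p, y in zip(probs, actuals):
        predicted = 1 if p >= threshold else 0
        if predicted and y:
            tp += 1
        elif predicted and not y:
            fp += 1
        elif not predicted and not y:
            tn += 1
        else:
            fn += 1
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}

probs = [0.9, 0.7, 0.4, 0.2]    # invented predicted probabilities
actuals = [1, 0, 1, 0]          # invented actual classes
cm = confusion_matrix(probs, actuals, threshold=0.5)
```

Raising the threshold to 0.8 would reclassify the 0.7 prediction as negative, turning its false positive into a true negative.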
To help visualize model effectiveness, the Lift Chart depicts how well a model segments the target population and how well the model performs for different ranges of values of the target variable.
- Hover on any point to display the predicted and actual scores for rows in that bin.
- Use the controls to change the criteria for the display.
The Lift Chart shows side-by-side lift charts for up to three models. Optionally, select a data partition, number of bins, and sort bin order.
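The binning behind a lift chart can be sketched as follows: sort rows by predicted value, split them into equal-sized bins, and compare the mean predicted value to the mean actual value in each bin. The data and bin count here are illustrative assumptions.

```python
# Sketch of lift-chart binning: a well-segmenting model shows mean
# predicted values tracking mean actual values as bins increase.

def lift_bins(preds, actuals, n_bins=2):
    rows = sorted(zip(preds, actuals), key=lambda r: r[0])
    size = len(rows) // n_bins
    bins = []
    for i in range(n_bins):
        chunk = rows[i * size:(i + 1) * size]
        bins.append({
            "mean_predicted": sum(p for p, _ in chunk) / len(chunk),
            "mean_actual": sum(a for _, a in chunk) / len(chunk),
        })
    return bins

preds = [0.1, 0.9, 0.4, 0.8]   # invented predictions
actuals = [0, 1, 1, 1]         # invented actuals
bins = lift_bins(preds, actuals, n_bins=2)
```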
Feature Impact provides a high-level visualization that identifies which features are most strongly driving model decisions. It is available for all model types and is an on-demand feature: except for models prepared for deployment, you must initiate a calculation to see the results.
- Hover on feature names and bars for additional information.
- Use Sort by to change the display to sort by impact or feature name. When using the comparison tool, DataRobot calculates impact for any uncalculated models when opening the insight.
Accuracy Metrics displays a table of accuracy scores for each calculated partition of each model. You can change the applied metric from the dropdown above the display.
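As a reference for the two default metrics mentioned earlier (LogLoss for binary classification, RMSE for regression), here is a minimal sketch of how each score is computed. The input values are invented example data.

```python
# Minimal sketches of LogLoss (binary classification) and RMSE
# (regression), the default optimization metrics referenced above.
import math

def log_loss(probs, actuals, eps=1e-15):
    total = 0.0
    for p, y in zip(probs, actuals):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(probs)

def rmse(preds, actuals):
    return math.sqrt(
        sum((p - a) ** 2 for p, a in zip(preds, actuals)) / len(preds)
    )
```

For both metrics, lower is better, which is why accuracy filtering ranks models ascending by score.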
The Lineage section provides metadata for the models selected for comparison. Options provide details of model training input:
|Datasets||Metadata on the dataset, features, and feature list.|
|Experiments||Metadata on experiment settings and creation.|
|Model blueprints||Visualizations of each model's preprocessing steps (tasks), modeling algorithms, and post-processing steps.|
|Model info||Additional general and performance information about the model.|
|Hide lineage values that are the same for selected models||Toggle to control lineage output.|
Datasets in lineage¶
The Datasets tab provides a variety of data-related information:
|Dataset name||The name of the dataset used for the model, including a link to preview the data in the dataset explorer.|
|Added to Use Case||The date the dataset was added to the Use Case and the user who added it.|
|Rows||The number of rows in the dataset.|
|Dataset size||The size of the dataset.|
|Feature list name and description||The feature list used to build the model as well as a description.|
|Feature list content||The total number of features in the applied feature list as well as a breakdown, and count, of features by type.|
|Missing values||Metadata of the missing values for the data used to build the model, including the total number of missing values and affected features, as well as counts for individual features.|
Experiments in lineage¶
The Experiments tab provides a variety of experiment setup-related information, similar to the information displayed in the Setup tab:
|Experiment||The name of the experiment.|
|Created||The experiment's creation date and creator.|
|Target feature||Target feature information, including the target, resulting project type, and the positive class (for binary classification). Additionally, it shows the mean (average) score for regression projects and, for binary classification, the mean of the target after the target was converted to a numerical target. For example, if the target is [yes, no, no] and the positive class is yes, the target converts to [1, 0, 0] and the mean is 0.33.|
|Partitioning method||Details of the partitioning done for the experiment, either the default or modified.|
|Optimization metric||The optimization metric used to define how to score the experiment's models. You can change the metric the Leaderboard is sorted by, but the metric displayed in the summary is the one used in the experiment as the optimization metric.|
|Additional settings||Settings for the configuration parameters available under Additional settings—monotonic constraints, weight, offset, exposure, and count of events.|
Model info in lineage¶
The Model info tab provides a variety of general model and performance information:
|Blueprint description||Lists the pre- and post-processing tasks visualized in the model blueprint, with an option to view a graphical representation of the blueprint.|
|Blueprint family||Lists the blueprint family, which can be used as part of comparison board filtering.|
|Model size||Reports the total size of files DataRobot uses to store the model. This number can be especially useful for Self-Managed AI Platform deployments, which require you to download and store the model.|
|Sample size||Reports the size, represented as a number of rows and a percentage, used to train the model. When DataRobot has downsampled the project, Sample Size reports the number of rows in the minority class rather than the total number of rows used to train the model.|
|Time to predict 1,000 rows||Displays the estimated time, in seconds, to score 1,000 rows from the dataset. The actual prediction time can vary depending on how and where the model is deployed.|
Consider the following when working with the model comparison tool functionality:
- Time-aware projects are not supported.
- A maximum of 10 models is returned per experiment, based on model filter settings.
- Models with "N/A" scores or uncalculated scores (such as cross-validation scores that have not been computed) are sorted at the bottom of the models list.
- Each target-type section has a limit of 100 models.
- If the same model type from different experiments has the exact same score, the model from the most recently created experiment appears first in the sort order.
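The sort behavior in the last two considerations can be expressed as a compound sort key: missing scores sort last, then lower (better) scores first, with ties broken in favor of the most recently created experiment. The field names and data are illustrative assumptions.

```python
# Sketch of the Comparison Leaderboard sort order: N/A scores at the
# bottom, best scores first, newest experiment wins ties.

def sort_key(model):
    has_score = model["score"] is not None
    return (
        not has_score,                          # missing scores sort last
        model["score"] if has_score else 0.0,   # lower (better) first
        -model["experiment_created"],           # newest experiment on ties
    )

# Invented example: A and C tie on score; B has no score.
models = [
    {"name": "A", "score": 0.30, "experiment_created": 1},
    {"name": "B", "score": None, "experiment_created": 3},
    {"name": "C", "score": 0.30, "experiment_created": 2},
]
ranked = sorted(models, key=sort_key)  # C before A (newer), B last
```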