This section describes how to work with models after your initial model build.
Add models from the Leaderboard¶
Once the Leaderboard is populated, you can create additional models by creating a new model or retraining an existing model. In both cases, once you submit your changes you can see the request's progress in the Worker Queue.
Use Add New Model¶
To create a new model from the Leaderboard:
Click the Add New Model link at the top of the Leaderboard.
Note that this method functions the same as the Run Task button in the Repository.
Retrain a model¶
You can retrain an existing Leaderboard model by changing the sample size or feature list.
Change sample size¶
To use a different number of rows or percentage of data, click the plus sign next to the reported sample size:
Set a new value and click Run with new sample size. Note that above a certain sample size (determined by the size of the dataset), DataRobot forces a frozen run. To increase sample size in larger datasets without a frozen run, create the new model from the Repository. Be aware that this method may use more RAM and affect system performance.
When configuring sample sizes to retrain models in projects with large row counts, DataRobot recommends requesting sample sizes using integer row counts or the Snap To Validation option instead of percentages. This is because a single percentage maps to many possible row counts, only one of which is the actual sample size for “up to validation.” For example, if a project has 199,408 rows and you request a 64% sample size, any number of rows between 126,625 and 128,618 maps to 64% of the data. Using integer row counts or the “Snap-to” options avoids ambiguity around how many rows of data you want the model to use.
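The arithmetic behind that example can be checked with a short sketch. It assumes that a row count "maps to" a percentage when its share of the data rounds (half up) to that whole percent; the function name is hypothetical and is not part of any DataRobot API.

```python
def ceil_div(numerator, denominator):
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-numerator // denominator)


def rows_matching_percentage(total_rows, pct):
    """Return (min_rows, max_rows) whose share of total_rows rounds to pct.

    Assumption: percentages are rounded half up to a whole percent, so a
    row count n matches when pct - 0.5 <= 100 * n / total_rows < pct + 0.5.
    """
    lowest = ceil_div((2 * pct - 1) * total_rows, 200)
    highest = ceil_div((2 * pct + 1) * total_rows, 200) - 1
    # Clamp to the rows that actually exist in the project.
    return max(lowest, 0), min(highest, total_rows)


low, high = rows_matching_percentage(199_408, 64)
print(low, high, high - low + 1)
```

Under that assumption, almost two thousand distinct row counts all display as 64%, which is why an explicit row count is less ambiguous.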
Change feature list¶
To change the feature list for a specific model, click the feature list icon and select a new list. You can also rerun Autopilot on a new feature list to rebuild all models.
Blending lets you combine the predictions of multiple models, which may lead to better results than running models individually. DataRobot can automatically create blender models as part of Autopilot if the Create blenders from top model advanced option is enabled. This option is off by default.
Why create blended models?
- To create more accurate models.
- To use multiple blueprints.
- To leverage the wisdom of crowds principle.
Consider the following before creating a blended model:
- Blend between two and eight models, based on different algorithms and with high accuracy.
- Although blenders often increase accuracy, they also require more time to create and score.
- Because the final model is more complex, blended models can be more difficult to interpret and communicate. Use the Understand tab insights to aid in interpretation.
DataRobot supports the following blending methods for non-time aware projects:
|Blender (Code)||Project type||Notes|
|Average Blend (AVG)||Regression, Binary Classification, Multiclass||N/A|
|Median Blend (MED)||Regression, Binary Classification, Multiclass||N/A|
|Partial Least Squares Blend (PLS)||Regression, Binary Classification||Not available on large datasets (slim run)|
|Generalized Linear Model Blend (GLM)||Regression, Binary Classification||Not available on large datasets (slim run)|
|Elastic Net Blend (ENET)||Regression, Binary Classification, Multiclass||Not available on large datasets (slim run)|
|Mean Absolute Error-Minimizing Weighted Average Blend (MAE)||Regression||Only available for projects using MAE as the project metric; not available on large datasets (slim run)|
|Mean Absolute Error-Minimizing Weighted Average Blend with L1 Penalty (MAEL1)||Regression||Only available for projects using MAE as the project metric; not available on large datasets (slim run)|
|Random Forest Blend (RF)||Multiclass||Deprecated|
|Light Gradient Boosting Machine Blend (LGBM)||Regression, Binary Classification, Multiclass||Deprecated|
A single-model blender acts like a "calibration" step. Calibration is DataRobot's attempt to improve the model so that the distribution and behavior of predicted probability values are close to the distribution and behavior of the probability values observed in the training data.
GLM, ENET, and PLS blenders learn an intercept and a coefficient. That is, they "add a number to every prediction" and "multiply every prediction by a number." Sometimes, a simple addition or multiplication can yield a small improvement in a model's results. Blenders that require training (all except AVG and MED) use stacking to ensure an out-of-sample prediction (and avoid misleadingly high accuracy). For LGBM or RF blenders, an entire model is fit with a single prediction as input, and from that it can learn a complex non-linear transform of the single prediction. For AVG and MED blenders, creating a single-model blender is not useful because it results in an exact duplicate of the parent model.
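To make "add a number and multiply by a number" concrete, the following is a minimal sketch that fits an intercept and a coefficient to one model's predictions by ordinary least squares. It is an illustration only, not DataRobot's implementation: the function name and data are hypothetical, and the stacking step that keeps predictions out-of-sample is omitted.

```python
from statistics import fmean


def fit_single_model_blender(predictions, actuals):
    """Fit actual ~ intercept + coefficient * prediction by least squares."""
    if len(predictions) != len(actuals) or len(predictions) < 2:
        raise ValueError("need at least two matching prediction/actual pairs")
    mean_pred = fmean(predictions)
    mean_actual = fmean(actuals)
    spread = sum((p - mean_pred) ** 2 for p in predictions)
    if spread == 0:
        raise ValueError("predictions are constant; coefficient is undefined")
    covariance = sum(
        (p - mean_pred) * (a - mean_actual)
        for p, a in zip(predictions, actuals)
    )
    coefficient = covariance / spread
    intercept = mean_actual - coefficient * mean_pred
    return intercept, coefficient


# Hypothetical parent model that systematically under-predicts.
predictions = [1.0, 2.0, 3.0, 4.0]
actuals = [3.0, 5.0, 7.0, 9.0]

intercept, coefficient = fit_single_model_blender(predictions, actuals)
# Apply the learned shift and scale to every prediction.
calibrated = [intercept + coefficient * p for p in predictions]
print(intercept, coefficient, calibrated)
```

In this toy data the actuals are an exact linear function of the predictions, so the fitted shift and scale recover them completely; real data would leave residual error.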
See below for time-aware blender information.
For each target point, the Average and Median blenders calculate the average or median of the predictions from the selected individual models. GLM, Elastic Net, and PLS blenders are essentially a second layer of models on top of the existing models. They use the predictions of the selected models as predictors, while keeping the same target as the individual models.
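The Average and Median blends described above can be sketched in a few lines. The model names and prediction values are hypothetical, and the `blend` helper is illustrative rather than a DataRobot function.

```python
from statistics import fmean, median

# Maps a blender code to the function that combines one row of predictions.
COMBINERS = {"AVG": fmean, "MED": median}


def blend(predictions_by_model, method="AVG"):
    """Combine per-row predictions from several models into one prediction."""
    columns = list(predictions_by_model.values())
    if not columns:
        raise ValueError("select at least one model")
    if len({len(column) for column in columns}) != 1:
        raise ValueError("all models must score the same rows")
    combine = COMBINERS[method]
    # zip(*columns) yields, for each row, the predictions of every model.
    return [combine(list(row)) for row in zip(*columns)]


predictions_by_model = {
    "model_a": [0.10, 0.80, 0.40],
    "model_b": [0.20, 0.60, 0.50],
    "model_c": [0.60, 0.70, 0.90],
}

avg_blend = blend(predictions_by_model, "AVG")
med_blend = blend(predictions_by_model, "MED")
print(avg_blend, med_blend)
```

The median is less sensitive than the average to a single model with an outlying prediction, as the first and third rows show.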
Create a blended model¶
Follow these steps to create a blended model.
Using the checkboxes on the left side of the model Leaderboard, select two or more models. (See the note on single-model blenders, above, to use blending as an additional calibration method.)
Click the model menu icon at the top left of the Leaderboard, then select one of the blending options listed under Blending. (Hovering over a menu item displays a description of the blending option.)
A new job appears in the Worker Queue while the blended model is processed. The name indicates the blender type and the models selected to create the blender.
When processing is complete, the new blended model displays in the list on the Leaderboard.
Alternatively, make changes to feature lists and sample size from the Leaderboard:
Blenders for time-aware projects¶
Time-aware models, because they do not use stacked predictions, have different blenders available:
|Blender (Code)||Project type||Description|
|Average Blend (AVG)||OTV, time series||Average of prediction between different models|
|Median Blend (MED)||OTV, time series||Median of prediction between different models|
|Average Blend by Forecast Distance (FD_AVG)||Time series||For each forecast distance, averages the predictions of the three top models among those selected. Only available for projects with two or more forecast distances; at least four models must be selected to blend by forecast distance|
|ENET Blend by Forecast Distance (FD_ENET)||Time series||Elastic Net model per forecast distance to combine predictions; only available for projects with two or more forecast distances|
Because some models do better on short-term predictions (the next few steps into the future) and others do better on long-term predictions (further into the future), time series projects add forecast distance blending options. With forecast distance blenders, DataRobot blends models differently for each forecast distance in order to use the best blueprints for each.
The forecast distance blenders are disabled when the forecast distance is equal to 1.
When using Average Blender by Forecast Distance, you must select four or more models. If fewer than four are selected, the blender averages model predictions instead of forecast-distance based predictions.
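The selection logic can be sketched as follows. This is an illustration under stated assumptions, not DataRobot's implementation: it assumes the "top" models at each forecast distance are those with the lowest error at that distance, and all names and numbers are hypothetical.

```python
from statistics import fmean


def fd_avg_blend(predictions, errors, top_n=3, min_models=4):
    """Average predictions per forecast distance.

    predictions[model][fd] is a forecast; errors[model][fd] is that model's
    error at the same forecast distance (assumed: lower is better).
    """
    models = sorted(predictions)
    distances = sorted(predictions[models[0]])
    blended = {}
    for fd in distances:
        if len(models) < min_models:
            # Too few models selected: fall back to a plain average.
            chosen = models
        else:
            # Keep the top_n models with the lowest error at this distance.
            chosen = sorted(models, key=lambda m: errors[m][fd])[:top_n]
        blended[fd] = fmean([predictions[m][fd] for m in chosen])
    return blended


predictions = {
    "a": {1: 10.0, 2: 20.0},
    "b": {1: 12.0, 2: 26.0},
    "c": {1: 14.0, 2: 22.0},
    "d": {1: 30.0, 2: 24.0},
}
errors = {
    "a": {1: 1.0, 2: 4.0},
    "b": {1: 2.0, 2: 5.0},
    "c": {1: 3.0, 2: 1.0},
    "d": {1: 9.0, 2: 2.0},
}

blended = fd_avg_blend(predictions, errors)
print(blended)
```

Here model "d" is excluded at forecast distance 1 but included at distance 2, while model "b" is the reverse, so each distance is blended from a different set of models.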
Add models from selected¶
After DataRobot creates models and populates the Leaderboard, you can retrain selected models using different settings. For example, you can run using a different feature list or sample size, or select either single-fold or up to five-fold cross-validation.
Use the following steps to re-run one or more selected models.
You cannot use Add models from selected to change settings on blended models.
Use the checkboxes on the left side of the model names on the Leaderboard to select one or more models.
From the menu, select Model processing > Add models from selected.
Use the resulting box at the top of the Leaderboard to specify a feature list, sample size, and/or number of cross-validation (CV) runs.
Click Run Models to retrain the selected models with the specified parameters. New jobs appear in the Worker Queue while DataRobot processes the models.
You can delete models listed on the Leaderboard using the instructions below. Note that when you delete a model in this way, it is deleted from the Leaderboard but not from the underlying project database. Because of this, the model remains available to any parent project components (for example, blender models or Word Cloud).
To delete models:
Using the checkboxes on the left side of the model Leaderboard, select one or more models.
From the menu, select Model processing > Delete selected model.
Click Delete to confirm model deletion.