This section describes how to work with models after your initial model build.
Add models from the Leaderboard
Once the Leaderboard is populated, there are two ways to create additional models from the board: using the Add New Model menu option or retraining an existing model at a different sample size. In both cases, once you submit your changes, you can see the request's progress in the Worker Queue.
Use Add New Model
You can create a new model from the Leaderboard. This mechanism is the same as the functionality available using the Run button in the Repository. To create a new model from the Leaderboard:
Click Add New Model at the top of the Leaderboard.
The new model appears in the list on the Leaderboard.
Retrain a model
To retrain a Leaderboard model using a different number of rows or percentage of data, click the plus sign next to the reported sample size:
Set a new value and click Run with new sample size. Note that above a certain sample size (determined by the size of the dataset), DataRobot forces a frozen run. To increase sample size in larger datasets without a frozen run, create the new model from the Repository. Be aware that this method may use more RAM and affect system performance.
When configuring sample sizes to retrain models in projects with large row counts, DataRobot recommends requesting sample sizes using integer row counts or the Snap To Validation option instead of percentages. This is because a single percentage maps to many possible row counts, only one of which is the actual sample size for “up to validation.” For example, if a project has 199,408 rows and you request a 64% sample size, any number of rows between 126,625 and 128,618 maps to 64% of the data. Using integer row counts or the “Snap-to” options avoids ambiguity around how many rows of data you want the model to use.
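The ambiguity in the example can be checked with a little arithmetic. The following is a minimal sketch, assuming that a row count is reported as the nearest whole percent of the dataset (an assumption that reproduces the figures above, not a documented rounding rule). The function name `rows_matching_percent` is hypothetical.

```python
def rows_matching_percent(total_rows: int, percent: int) -> tuple[int, int]:
    """Return the inclusive range of row counts that round to `percent`.

    Assumption: percentages are rounded to the nearest whole percent,
    with exact halves rounding up. Integer arithmetic avoids float error.
    """
    # Smallest row count that is at least (percent - 0.5)% of the data.
    lowest = -(-total_rows * (2 * percent - 1) // 200)
    # Largest row count that is strictly below (percent + 0.5)% of the data.
    highest = (total_rows * (2 * percent + 1) - 1) // 200
    return lowest, highest


lowest, highest = rows_matching_percent(199_408, 64)
print(f"64% covers {lowest:,} to {highest:,} rows "
      f"({highest - lowest + 1:,} possible row counts)")
```

Under that assumption, a 64% request on 199,408 rows is compatible with almost two thousand distinct row counts, which is why an integer row count is the less ambiguous request.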
Blending lets you combine the predictions of multiple models, which often leads to better results than running models individually. DataRobot automatically creates blender models when you run Autopilot.
Why create blended models?
- To create more accurate models.
- To use multiple blueprints.
- To leverage the wisdom of crowds principle.
Consider the following before creating a blended model:
- It is recommended to blend models with different algorithms and high accuracy.
- Although blenders often increase accuracy, they also require more time to create and score.
- Because the final model is more complex, blended models can be more difficult to interpret and communicate. Use the Understand tab insights to aid in interpretation.
DataRobot supports the following blending methods for non-time-aware projects:

| Blender (code) | Project type | Notes |
|---|---|---|
| Average Blend (AVG) | Regression, Binary Classification, Multiclass | N/A |
| Median Blend (MED) | Regression, Binary Classification, Multiclass | N/A |
| Partial Least Squares Blend (PLS) | Regression, Binary Classification | Not available on large datasets (slim run) |
| Generalized Linear Model Blend (GLM) | Regression, Binary Classification | Not available on large datasets (slim run) |
| Elastic Net Blend (ENET) | Regression, Binary Classification, Multiclass | Not available on large datasets (slim run) |
| Mean Absolute Error-Minimizing Weighted Average Blend (MAE) | Regression | Only available for projects using MAE as the project metric; not available on large datasets (slim run) |
| Mean Absolute Error-Minimizing Weighted Average Blend with L1 Penalty (MAEL1) | Regression | Only available for projects using MAE as the project metric; not available on large datasets (slim run) |
| Random Forest Blend (RF) | Multiclass | Not available on large datasets (slim run) |
| TensorFlow Blend (TF) | Regression, Binary Classification, Multiclass | Not available on large datasets (slim run) |
| Light Gradient Boosting Machine Blend (LGBM) | Regression, Binary Classification, Multiclass | Not available on large datasets (slim run) |
| Advanced Average Blend (Advanced AVG) | Regression, Binary Classification, Multiclass | Can only be run by Autopilot; not available on large datasets (slim run) |
| Advanced Generalized Linear Model Blend (Advanced GLM) | Regression, Binary Classification | Can only be run by Autopilot; not available on large datasets (slim run) |
| Advanced Elastic Net Blend (Advanced ENET) | Regression, Binary Classification, Multiclass | Can only be run by Autopilot; not available on large datasets (slim run) |
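The restrictions in the table above can be read as a simple lookup. The following sketch transcribes the table into a dictionary and filters it by project type, project metric, and whether the run is a slim run. The names `BlenderRule`, `BLENDER_RULES`, and `selectable_blenders` are hypothetical, and treating "N/A" as "no restriction, including on slim runs" is an interpretation of the table rather than a documented rule.

```python
from typing import NamedTuple


class BlenderRule(NamedTuple):
    project_types: frozenset
    mae_only: bool = False        # requires MAE as the project metric
    autopilot_only: bool = False  # cannot be started manually
    slim_ok: bool = False         # assumed True only where the table says N/A


ALL = frozenset({"Regression", "Binary Classification", "Multiclass"})
REG_BIN = frozenset({"Regression", "Binary Classification"})
REG = frozenset({"Regression"})

# Transcribed from the table; keys are the blender codes.
BLENDER_RULES = {
    "AVG": BlenderRule(ALL, slim_ok=True),
    "MED": BlenderRule(ALL, slim_ok=True),
    "PLS": BlenderRule(REG_BIN),
    "GLM": BlenderRule(REG_BIN),
    "ENET": BlenderRule(ALL),
    "MAE": BlenderRule(REG, mae_only=True),
    "MAEL1": BlenderRule(REG, mae_only=True),
    "RF": BlenderRule(frozenset({"Multiclass"})),
    "TF": BlenderRule(ALL),
    "LGBM": BlenderRule(ALL),
    "Advanced AVG": BlenderRule(ALL, autopilot_only=True),
    "Advanced GLM": BlenderRule(REG_BIN, autopilot_only=True),
    "Advanced ENET": BlenderRule(ALL, autopilot_only=True),
}


def selectable_blenders(project_type, metric, slim_run=False):
    """List blender codes a user could start manually under the table's rules."""
    return [
        code
        for code, rule in BLENDER_RULES.items()
        if project_type in rule.project_types
        and not rule.autopilot_only
        and (not rule.mae_only or metric == "MAE")
        and (rule.slim_ok or not slim_run)
    ]


print(selectable_blenders("Multiclass", "LogLoss"))
print(selectable_blenders("Regression", "MAE", slim_run=True))
```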
A single-model blender acts like a "calibration" step. Calibration is DataRobot's attempt to improve the model so that the distribution and behavior of predicted probability values are close to the distribution and behavior of probability values observed in the training data.
GLM, ENET, and PLS blenders learn an intercept and a coefficient. That is, they "add a number to every prediction" and "multiply every prediction by a number." Sometimes, a simple addition or multiplication can yield a small improvement in a model's results. Blenders that require training (all except AVG and MED) use stacking to ensure an out-of-sample prediction (and avoid misleadingly high accuracy). An entire LGBM or TF model is fit with a single prediction input and from that can learn a complex non-linear transform of the single prediction. For AVG and MED blenders, creating a single model blender is not useful as it results in an exact duplicate of the parent model.
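To make the "add a number, multiply by a number" idea concrete, the following minimal sketch fits an intercept and a coefficient to a single model's out-of-sample predictions using ordinary least squares. It is an illustration only: the function names are hypothetical, and it does not claim to reproduce how DataRobot's GLM, ENET, or PLS blenders are implemented (for example, it ignores link functions and regularization).

```python
def fit_linear_calibration(predictions, actuals):
    """Fit actual ~ intercept + coefficient * prediction by least squares.

    `predictions` should be out-of-sample (stacked) predictions so the
    fit is not flattered by values the parent model saw in training.
    """
    n = len(predictions)
    mean_p = sum(predictions) / n
    mean_y = sum(actuals) / n
    spread = sum((p - mean_p) ** 2 for p in predictions)
    co_movement = sum(
        (p - mean_p) * (y - mean_y) for p, y in zip(predictions, actuals)
    )
    coefficient = co_movement / spread
    intercept = mean_y - coefficient * mean_p
    return intercept, coefficient


def apply_calibration(predictions, intercept, coefficient):
    """Multiply every prediction by a number, then add a number."""
    return [intercept + coefficient * p for p in predictions]


# Toy data: the parent model is consistently off by a scale and an offset.
raw = [1.0, 2.0, 3.0, 4.0]
observed = [3.0, 5.0, 7.0, 9.0]
intercept, coefficient = fit_linear_calibration(raw, observed)
print(intercept, coefficient)
print(apply_calibration(raw, intercept, coefficient))
```

On this toy data the fit recovers the offset and scale exactly; on real predictions the correction is usually much smaller, which matches the "small improvement" described above.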
See below for time-aware blender information.
For each target point, the Average and Median blenders calculate the average or median of the predictions of the selected individual models. GLM, Elastic Net, and PLS blenders are essentially a second layer of models on top of the existing models. They use the predictions of the selected models as predictors, while keeping the same target as the individual models.
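The Average and Median blenders described above amount to a row-wise mean or median across models. The following is a minimal sketch using only the Python standard library; the function name `blend_predictions` and the input layout (one list of predictions per model, all in the same row order) are assumptions made for illustration.

```python
from statistics import mean, median


def blend_predictions(predictions_by_model, method="AVG"):
    """Combine per-model predictions row by row.

    predictions_by_model: one list of predictions per selected model,
    all aligned to the same rows. method: "AVG" or "MED".
    """
    combine = {"AVG": mean, "MED": median}[method]
    # zip(*...) regroups the values so each tuple holds one row's predictions.
    return [combine(row) for row in zip(*predictions_by_model)]


model_a = [0.2, 0.9]
model_b = [0.4, 0.8]
model_c = [0.9, 0.1]

print(blend_predictions([model_a, model_b, model_c], "AVG"))
print(blend_predictions([model_a, model_b, model_c], "MED"))

# With a single model, the blend is a copy of the parent's predictions.
print(blend_predictions([model_a], "AVG") == model_a)
```

The median is less sensitive than the average to one model that disagrees strongly with the others, as the second row of the toy data shows.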
Create a blended model
Follow these steps to create a blended model.
Using the checkboxes on the left side of the model Leaderboard, select two or more models. (See the note on single-model blenders, above, to use blending as an additional calibration method.)
Click the model menu icon at the top left of the Leaderboard, then select one of the blending options listed under Blending. (Hovering over a menu item displays a description of the blending option.)
A new job appears in the Worker Queue while the blended model is processed. The name indicates the blender type and the models selected to create the blender.
When processing is complete, the new blended model displays in the list on the Leaderboard.
You can re-run blended models with a new sample size and different feature lists. Mark the checkbox for your blended model, choose the "Run Selected Model" option from the Menu, and make your desired changes before selecting Run Model.
Alternatively, make your changes to feature lists and sample size from the Leaderboard:
Blenders for time-aware projects
Time-aware models, because they do not use stacked predictions, have different blenders available:
| Blender (code) | Project type | Description |
|---|---|---|
| Average Blend (AVG) | OTV, time series | Average of the predictions of the selected models |
| Median Blend (MED) | OTV, time series | Median of the predictions of the selected models |
| Average Blend by Forecast Distance (FD_AVG) | Time series | From the selected models, provides per-forecast-distance averages for the three top models. Only available for projects with two or more forecast distances; to blend by forecast distance, at least four models must be selected |
| ENET Blend by Forecast Distance (FD_ENET) | Time series | Elastic Net model per forecast distance to combine predictions; only available for projects with two or more forecast distances |
Because some models do better on short-term predictions (the next few steps into the future) and others do better on long-term predictions (further into the future), time series projects add forecast distance blending options. With forecast distance blenders, DataRobot blends models differently for each forecast distance in order to use the best blueprints for each.
The forecast distance blenders are disabled when the forecast distance is equal to 1.
When using Average Blender by Forecast Distance, you must select four or more models. If fewer than four are selected, the blender averages model predictions instead of forecast-distance based predictions.
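The per-distance behavior can be sketched as follows. This section does not say how the "three top models" are ranked, so the sketch assumes a per-forecast-distance error score where lower is better. The function name `fd_avg_blend`, the input layout, and the ranking rule are all assumptions for illustration, not DataRobot's implementation.

```python
from statistics import mean


def fd_avg_blend(predictions, errors, top_n=3, min_models=4):
    """Average the top models separately at each forecast distance.

    predictions[model][distance] -> that model's prediction
    errors[model][distance]      -> assumed error score (lower is better)
    Returns {distance: blended prediction}.
    """
    models = list(predictions)
    distances = list(predictions[models[0]])

    if len(models) < min_models:
        # Fewer than four models: plain average of all selected models.
        return {d: mean(predictions[m][d] for m in models) for d in distances}

    blended = {}
    for d in distances:
        # Rank models by their error at this specific forecast distance.
        best = sorted(models, key=lambda m: errors[m][d])[:top_n]
        blended[d] = mean(predictions[m][d] for m in best)
    return blended


predictions = {
    "A": {1: 10.0, 2: 20.0},
    "B": {1: 12.0, 2: 22.0},
    "C": {1: 14.0, 2: 24.0},
    "D": {1: 100.0, 2: 26.0},
}
errors = {
    "A": {1: 1, 2: 9},  # strong at distance 1, weak at distance 2
    "B": {1: 2, 2: 2},
    "C": {1: 3, 2: 3},
    "D": {1: 9, 2: 1},  # weak at distance 1, strong at distance 2
}
print(fd_avg_blend(predictions, errors))
```

In the toy data, model D is excluded at distance 1 and model A is excluded at distance 2, so each distance is blended from a different set of three models.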
Run selected models
After DataRobot creates your models and populates the Leaderboard, you can retrain selected models using different settings. For example, you can run using a different feature list or sample size, or select either single fold or up to five-fold cross-validation.
You cannot use Run Selected Models on blended models.
Use the following steps to run one or more selected models.
Use the checkboxes on the left side of the model names on the Leaderboard to select one or more models.
Click the model menu icon at the top left of the Leaderboard, then select Run Selected Model(s).
Use the resulting box at the top of the Leaderboard to specify a feature list, sample size, and/or number of cross-validation (CV) runs.
Click Run Task(s) to retrain the selected models with the specified parameters. New jobs appear in the Worker Queue while DataRobot processes the models.
You can delete models listed on the Leaderboard using the instructions below. Note that when you delete a model in this way, it is deleted from the Leaderboard but not from the underlying project database. Because of this, the model remains available to any parent project components (for example, blender models or Word Cloud).
To delete models:
Using the checkboxes on the left side of the model Leaderboard, select one or more models.
Click the model menu icon at the top left of the Leaderboard.
Click Delete Selected Model(s).
Click Delete to confirm model deletion.