DataRobot provides a mechanism to select the partitioning method and parameters used for model validation. By default, DataRobot selects the “optimal” partitioning method based on the size of your data. It is generally best to keep the default selection, but you can change the method through the Advanced options link.
Partitioning describes the method DataRobot uses to “clump” observations (or rows) together for evaluation and model building. DataRobot supports the following partitioning methods, described below:
- Random
- Partition Feature
- Group
- Date/Time for OTV or for time series
- Stratified
View the reference documentation for examples of each partitioning method.
If you chose to set up time-aware modeling on the Start screen, all partitioning methods except Date/Time are disabled. Additionally, not all partition types support smart downsampling.
There are two validation types available: k-fold cross-validation and training/validation/holdout. See the data partitioning explanation for a detailed description of each validation type.
Configure model validation¶
To use partitioning and model validation, follow these general steps:
1. Select a target variable (what you want to predict). Once you enter the variable, the Advanced options link becomes available.
2. Click the Advanced options link to display the selections.
3. In the Partitioning Options section, choose and, if required, configure the partitioning method. (The methods are described below.)
4. Select a modeling option (Run models using). As applicable, type in the box or use the sliders to change the number of cross-validation folds, the validation percentage, and the holdout percentage.
See below for a description of available partitioning methods.
The following sections provide background detail on partitioning methods. See above for instructions on how to configure partitioning. See also the information on stacked predictions and how DataRobot selects the validation partition.
The section on validation types describes methods for using your data to validate models; the sections below describe options for partitioning your data. Note that the choice of partitioning method and validation type is dependent on the target feature and/or partition column. In other words, not all selections will always display.
For all partitioning methods except Partition Feature and Date/Time, the following table describes the meaning of the model validation types. Type in the box or use the sliders to change the number of cross-validation folds, the validation percentage, and the holdout percentage.

| Validation type | Description |
|---|---|
| Cross-Validation | Specifies the number of folds and the holdout percentage. The cross-validation score is the average of the scores for the individual partitions. |
| Training-Validation-Holdout | Specifies percentages for the training, validation, and holdout splits. |
Random partitioning (Random)¶
With Random partitioning, DataRobot randomly assigns observations (rows) to the training, validation, and holdout sets.
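The idea of random assignment can be sketched in a few lines of Python. This is only an illustration of the concept, not DataRobot's implementation; the function name `random_partition`, the seed, and the percentages used here are assumptions chosen for the example.

```python
import random


def random_partition(n_rows, validation_pct=16, holdout_pct=20, seed=0):
    """Randomly assign row indices to training, validation, and holdout sets.

    The percentages are illustrative values, not product defaults.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible

    n_holdout = n_rows * holdout_pct // 100
    n_validation = n_rows * validation_pct // 100
    return {
        "holdout": sorted(indices[:n_holdout]),
        "validation": sorted(indices[n_holdout:n_holdout + n_validation]),
        "training": sorted(indices[n_holdout + n_validation:]),
    }


parts = random_partition(100)
print({name: len(rows) for name, rows in parts.items()})
```

Every row lands in exactly one set, and the set sizes follow the requested percentages.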
Column-based partitioning (Partition Feature)¶
The Partition Feature option creates a 1:1 mapping between values of this feature and validation partitions. Each unique value receives its own partition, and all rows with that value are placed in that partition. Although date cannot be selected as the target for a project, it can be selected as a partition feature. The column or feature you select must have at least two and no more than 100 unique values; features with only one unique value cannot be used.
DataRobot recommends the use of the Partition Feature option for features that have no more than 25 unique values. For features with more than 25 unique values, use Group Partitioning.
You can, however, manually re-group large sets of unique values into a new feature in order to use that data with the Partition Feature option. For example, if you have a feature with 20,000 unique user IDs, you can group those IDs into 25 regions. As a new feature, those regions are your 25 unique values. You can then use the Partition Feature option with your new feature (which associates those regions with your 20,000 user IDs).
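As a rough sketch of this kind of re-grouping, the snippet below derives a low-cardinality `region` column from a high-cardinality `user_id` column before the data is uploaded. The column names, the `user_region` lookup, and the helper `add_region` are all hypothetical; in practice the mapping would come from your own data.

```python
# Hypothetical lookup from user ID to region (assumed for illustration).
user_region = {"u001": "north", "u002": "south", "u003": "north", "u004": "west"}

rows = [
    {"user_id": "u001", "amount": 10.0},
    {"user_id": "u002", "amount": 12.5},
    {"user_id": "u003", "amount": 7.0},
    {"user_id": "u004", "amount": 3.2},
]


def add_region(rows, lookup, default="unknown"):
    """Return copies of the rows with a new 'region' column derived from 'user_id'."""
    return [{**row, "region": lookup.get(row["user_id"], default)} for row in rows]


regrouped = add_region(rows, user_region)
unique_regions = {row["region"] for row in regrouped}

# A partition feature needs at least two and no more than 100 unique values.
assert 2 <= len(unique_regions) <= 100
print(sorted(unique_regions))
```

The new column can then be selected as the partition feature, while the original ID column stays in the dataset.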
Additionally, the recommended modeling validation type depends on how many unique values your feature has. If your partition feature contains 2-3 unique values, use the training/validation/holdout split. If your partition feature contains closer to 10-25 unique values, DataRobot recommends using cross-validation instead.
The modeling validation types for Partition Feature have a slightly different meaning:
| Validation type | Description |
|---|---|
| Cross-Validation | Select a value from the selected partition column that will specify the holdout set. DataRobot uses the split with the largest number of samples for the validation partition (the computed Validation score on the Leaderboard). The Cross-Validation score, evaluated on all partitions that are not part of the holdout, is the average of those individual partition scores. If the partition column has fewer than three values, the holdout set is disabled. |
| Training-Validation-Holdout | For training, validation, and holdout, set the value from the selected partition column that specifies that set. |
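The Cross-Validation behavior described in the table can be illustrated with a small sketch. It is not DataRobot's code: the function `partition_feature_cv`, the `region` column, and the per-partition scores are assumptions made up for the example.

```python
from collections import defaultdict
from statistics import mean


def partition_feature_cv(rows, column, holdout_value):
    """Map each unique value of `column` to its own partition (1:1)."""
    partitions = defaultdict(list)
    for index, row in enumerate(rows):
        partitions[row[column]].append(index)

    # Holdout is disabled when the column has fewer than three values.
    if len(partitions) < 3:
        holdout_value = None

    cv_values = [value for value in partitions if value != holdout_value]
    # The largest non-holdout partition serves as the validation partition.
    validation_value = max(cv_values, key=lambda value: len(partitions[value]))
    return {
        "partitions": dict(partitions),
        "holdout": holdout_value,
        "cv_values": cv_values,
        "validation": validation_value,
    }


rows = [{"region": r} for r in "aaabbcccc"]
plan = partition_feature_cv(rows, "region", holdout_value="c")

# Hypothetical per-partition scores, for illustration only.
partition_scores = {"a": 0.75, "b": 0.5}
cv_score = mean(partition_scores[value] for value in plan["cv_values"])
print(plan["validation"], cv_score)
```

Here the value `c` is held out, `a` becomes the validation partition because it is the largest remaining split, and the cross-validation score is the plain average over the non-holdout partitions.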
Group partitioning (Group)¶
With the Group partitioning method, all rows with the same value for the selected feature are guaranteed to be in the same training or test set. Each partition can contain more than one value for the feature, but DataRobot automatically keeps all rows for each individual value together. The application returns an error message if your Group ID feature will not provide an informative result. The error occurs when the feature chosen for group partitioning has a cardinality of less than three times the number of cross-validation folds selected.
DataRobot recommends the use of the Group partitioning option for features with more than 25 unique values. For features with 25 or fewer unique values, use Partition Feature. Additionally, DataRobot recommends an evenly distributed set of unique values for group partitioning.
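A minimal sketch of the grouping guarantee and the cardinality check follows. It assumes a simple round-robin assignment of shuffled groups to folds, which is a stand-in technique and not necessarily how DataRobot distributes groups; the function name `group_folds` is hypothetical.

```python
import random


def group_folds(group_ids, n_folds=5, seed=0):
    """Assign a fold to every row so that rows sharing a group ID share a fold."""
    unique_groups = sorted(set(group_ids))

    # Mirror the documented check: cardinality must be at least 3x the fold count.
    if len(unique_groups) < 3 * n_folds:
        raise ValueError(
            f"{len(unique_groups)} unique groups is fewer than 3 x {n_folds} folds"
        )

    random.Random(seed).shuffle(unique_groups)
    # Round-robin over the shuffled groups (an assumed, simplistic strategy).
    fold_of_group = {group: i % n_folds for i, group in enumerate(unique_groups)}
    return [fold_of_group[group] for group in group_ids]


group_ids = [f"g{i % 12}" for i in range(60)]  # 12 unique groups, 5 rows each
folds = group_folds(group_ids, n_folds=4)
print(sorted(set(zip(group_ids, folds)))[:3])
```

With 12 unique groups, four folds pass the check, while five folds would raise the error because 12 is less than 15.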
Date/time partitioning (Date/Time)¶
Date/time partitioning allows you to order partitions based on time and is part of DataRobot's time-aware modeling capabilities. See the Out-of-Time Validation (OTV) or time series sections for a more complete description of date/time partitioning.
Ratio-preserved partitioning (Stratified)¶
With Stratified partitioning, observations (rows) are randomly assigned to training, validation, and holdout sets while preserving, as closely as possible, the same ratio of values for the prediction target as in the original data. If Run models using is set to Train-Validate-Holdout, each partition is assigned the same ratio. If set to Cross-Validation, the ratio is preserved both 1) across each CV fold and 2) relative to the training partition. This selection is available for zero-boosted regression problems and binary classification problems.
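For the Train-Validate-Holdout case with a binary target, ratio preservation can be sketched by splitting each target class separately and then combining the pieces. This is an illustrative approach under assumed percentages, not DataRobot's implementation; `stratified_partition` is a hypothetical helper.

```python
import random
from collections import defaultdict


def stratified_partition(targets, validation_pct=20, holdout_pct=20, seed=0):
    """Split row indices so each set keeps roughly the original target ratio."""
    by_class = defaultdict(list)
    for index, target in enumerate(targets):
        by_class[target].append(index)

    rng = random.Random(seed)
    parts = {"training": [], "validation": [], "holdout": []}
    for indices in by_class.values():
        rng.shuffle(indices)  # random assignment within each class
        n_holdout = round(len(indices) * holdout_pct / 100)
        n_validation = round(len(indices) * validation_pct / 100)
        parts["holdout"] += indices[:n_holdout]
        parts["validation"] += indices[n_holdout:n_holdout + n_validation]
        parts["training"] += indices[n_holdout + n_validation:]
    return parts


targets = [1] * 20 + [0] * 80  # 20% positive class overall
parts = stratified_partition(targets)
for name, rows in parts.items():
    positives = sum(targets[i] for i in rows)
    print(name, len(rows), positives)
```

Because each class is split with the same percentages, every set ends up with about the same share of positive rows as the full dataset.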