Data partitioning and validation

Partitioning divides project data into segments so that models can be trained on one portion of the data and evaluated on data they have not seen, maximizing the reliability of accuracy estimates. The following sections describe the segments that DataRobot creates.

Note

To evaluate and select models, consider only the Validation and Cross-Validation scores. Use the Holdout score for a final estimate of model performance only after you have selected your best model. (DataRobot Classic only: to make sure that the Holdout score does not inadvertently affect your model selection, DataRobot "hides" the score behind the padlock icon.)

Validation types

To maximize accuracy, DataRobot separates data into training, validation, and holdout data. The segments (splits) of the dataset are defined as follows:

| Split | Description |
|-------|-------------|
| Training | The training set is the data used to build the models. Things such as linear model coefficients and the splits of a decision tree are derived from information in the training set. |
| Validation | The validation (or testing) set is data that is not part of the training set; it is used to evaluate a model's performance on data it has not seen before. Since this data was not used to build the model, it can provide an unbiased estimate of a model's accuracy. You often compare validation results when selecting a model. |
| Holdout | Because the process of training a series of models and then selecting the "best" based on the validation score can yield an overly optimistic estimate of a model's performance, DataRobot uses the holdout set as an extra check against this selection bias. The holdout data is unavailable to models during the training and validation process. After selecting a model, you can score it on this data as another check. |
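
Outside the platform, an equivalent three-way split can be sketched with scikit-learn. This is a minimal illustration on synthetic stand-in data; the 20% holdout and 16% validation proportions mirror DataRobot's defaults:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)  # stand-in data

# Carve out a 20% holdout set first (DataRobot's default).
X_rest, X_holdout, y_rest, y_holdout = train_test_split(
    X, y, test_size=0.20, random_state=42
)

# Split the remainder into training and validation sets.
# 20% of the remaining 80% equals 16% of the full dataset (one of five folds).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.20, random_state=42
)
```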

When creating splits, DataRobot uses five folds by default. With each round of training, DataRobot uses increasing amounts of data, split across those folds. For full Autopilot, the first round uses a random sample comprising 16% of the training data, drawn from the first four partitions; the next round uses 32% of the training data, and the final round 64%. The Validation and Holdout partitions never change. (Other modeling modes may use different sample sizes.)
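
As a quick check of where those sample sizes come from (assuming the default 20% holdout and five folds):

```python
holdout_pct = 20                               # default holdout share
n_folds = 5
fold_pct = (100 - holdout_pct) / n_folds       # each fold is 16% of the data

# Autopilot rounds double the sample: one, two, then four folds' worth of data.
rounds = [fold_pct * k for k in (1, 2, 4)]
print(rounds)                                  # [16.0, 32.0, 64.0]
```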

[Figure: data partitioning into training, validation, and holdout segments]

[Figure: cross-validation fold assignment]

DataRobot uses "stacked predictions" for the validation partition when creating "out-of-sample" predictions on training data.

What are stacked predictions?

Without some kind of manipulation, predictions from training data would appear to have misleadingly high accuracy. To address this, DataRobot uses a technique called stacked predictions for the training dataset.

With stacked predictions, DataRobot builds multiple models on different subsets of the training data. The prediction for any row is made using a model that excluded that data from training. In this way, each prediction is effectively an "out-of-sample" prediction.

To do this, DataRobot runs cross-validation for each model and then "stacks" the out-of-fold predictions. With three-fold cross-validation, for example, each fold is predicted by a model trained on the other two folds. Every row from the training data is present in the stack, but because of this methodology, the "stack" provides out-of-sample predictions on the training data.
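
The same out-of-fold stacking can be reproduced with scikit-learn's cross_val_predict, which trains one model per fold and predicts each row with the model that never saw it. A minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=300, random_state=0)  # stand-in data

# Each row is predicted by the fold-model that excluded it from training,
# so the resulting "stack" is out-of-sample for every training row.
stacked = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y, cv=3, method="predict_proba"
)
print(stacked.shape)  # (300, 2): one out-of-sample probability pair per row
```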

In a sample of downloaded predictions, DataRobot makes it obvious which rows belong to the holdout partition; rows from the validation partition are labeled as 0.
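
If you download those predictions as a CSV, the labels can be inspected directly. A sketch with pandas; the file name and the Partition column name here are illustrative assumptions about the export format:

```python
import pandas as pd

preds = pd.read_csv("training_predictions.csv")  # hypothetical export file

# Validation rows carry the fold label 0 (it may be rendered as 0.0);
# holdout rows are labeled explicitly.
print(preds["Partition"].value_counts())
holdout_rows = preds[preds["Partition"] == "Holdout"]
validation_rows = preds[preds["Partition"].astype(str).isin(["0", "0.0"])]
```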

Stacked prediction considerations

Any dataset that exceeds 800MB results in a project whose models do not have stacked predictions (that is, no out-of-sample predictions on training data). For models that have not been trained into either Validation or Holdout, all scores and insights are available. Otherwise:

  • Validation scores are not available for models trained into Validation; Holdout scores are not available for models trained into Holdout.

  • For models trained into Validation but not into Holdout:

    • Holdout score is available, Validation score is not.
    • Lift Chart (and other charts in the Evaluate tab) is available for Holdout, but is not available for Validation.
    • Prediction Explanation preview is available but does not show a distribution chart.
  • For models trained into both Validation and Holdout (a model trained on 100% of the data):

    • No metric scores are available.
    • No Lift Chart (or other charts in the Evaluate tab) is available.
  • Whether insights are computed with in-sample predictions depends on a variety of criteria; DataRobot displays a message explaining what is affected, based on the sample size and the partitions into which the model was trained.

  • For models trained into Validation only or Validation and Holdout, the Prediction Explanation preview is available but does not show a prediction distribution chart (because there are no stacked predictions, only in-sample predictions).

Validation scores

The Leaderboard lists all models that DataRobot created (automatically or manually) and each model's scores. Scores are displayed in all or some of the following columns:

  • Validation
  • Cross-Validation (CV)
  • Holdout

The presence or absence of a particular column depends on the type of validation partition that you chose at the start of the project.

By default, DataRobot creates a 20% holdout and five-fold cross-validation. If you use these defaults, DataRobot displays values in all three columns.

Scores in the Validation column are calculated by scoring a model's predictions against the first validation partition; that is, using a single fold of data. The Cross-Validation score is the mean of the (by default) five scores calculated on five different training/validation splits.
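
In miniature, with made-up fold scores:

```python
# Hypothetical scores (e.g., AUC) from five training/validation splits.
fold_scores = [0.81, 0.79, 0.83, 0.80, 0.82]

validation_score = fold_scores[0]               # single first-fold score
cv_score = sum(fold_scores) / len(fold_scores)  # mean across all five folds
print(validation_score, round(cv_score, 2))     # 0.81 0.81
```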

Understand validation types

Model validation has two important purposes. First, you use validation to pick the best model from all the models built for a given dataset. Then, once picked, validation helps you to decide whether the model is accurate enough to suit your needs. The following sections describe methods for using your data to validate models.

K-fold cross-validation (CV)

Note

DataRobot disables cross-validation when the dataset size is greater than 800MB. As a result, train-validation-holdout (TVH) is the only supported partitioning method for these datasets.

The performance of a predictive model usually increases as the size of the training set increases. Also, model performance estimates are more consistent if the validation set is large. Therefore, it is best to use as much data as possible for both training and validation. Use the cross-validation method to maximize the data available for each of these sets. This process involves:

  1. Separating the data into two or more sections, called "folds".
  2. Creating one model per fold, where the data assigned to that fold is used for validation and the rest of the data is used for training.

The benefit of this approach is that all of the data is used for scoring and, if enough folds are used, most of the data is used for training.

  • Pros: This method provides a better estimate of model performance.
  • Cons: Because of its multiple passes, CV is computationally intensive and takes longer to run.
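
The two-step procedure above can be sketched with scikit-learn's KFold; one model is fit per fold, and every row is scored exactly once:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=500, random_state=0)  # stand-in data

scores = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # The held-out fold validates a model trained on the other four folds.
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[val_idx], model.predict(X[val_idx])))

print(scores, sum(scores) / len(scores))  # per-fold scores and their mean
```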

To compensate for the overhead when working with large datasets, DataRobot first trains models on a smaller part of the data and uses only one cross-validation fold to evaluate model performance.

Then, for the highest-performing models, DataRobot increases the subset sizes. In the end, only the best models are trained on the full cross-validation partition. For those models, DataRobot completes k-fold cross-validation training and scoring, and the mean score across all folds is displayed in the Cross-Validation column. Models that did not perform well do not have a cross-validation score; because they were evaluated on only a single fold, their score is reported in the Validation column. You can initiate complete CV evaluation for those models manually by clicking Run in the model's Cross-Validation column.

If the dataset contains 50,000 rows or more, DataRobot does not run cross-validation automatically. To initiate it, click Run in the model's Cross-Validation column. If the dataset is larger than 800MB, cross-validation is not allowed; instead, DataRobot uses TVH (described below). CV is generally most useful for smaller datasets, where TVH would not leave enough data for reliable training and validation.

Training, validation, and holdout (TVH)

With the TVH method, the default validation method for datasets larger than 800MB, DataRobot builds and evaluates predictive models by partitioning datasets into three distinct sections: training, validation, and holdout. Predictions are based on a single pass over the data.

  • Pros: This method is faster than cross-validation because it only makes one pass on each dataset to score the data.

  • Cons: For the same reason that it is faster (it evaluates on a single split), its performance estimate is moderately less reliable.

For projects larger than 800MB (non-time-aware only), the training partition percentage is not scaled down. The validation and holdout partitions default to fixed sizes of 80MB and 100MB, respectively, and do not change unless you modify them manually (both have a maximum size of 400MB). The validation and holdout percentages therefore shrink as the training partition grows. The training partition comprises whatever percentage remains after accounting for the validation and holdout percentages.

For example, say you have a 900MB project. If the validation and holdout partitions are at their default sizes of 80MB and 100MB respectively, the validation percentage is roughly 8.9% and the holdout percentage roughly 11.1%. The training partition comprises the remaining 720MB: 80%.
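
The example's percentages as a worked calculation:

```python
project_mb = 900
validation_mb, holdout_mb = 80, 100  # default fixed sizes for >800MB projects

validation_pct = 100 * validation_mb / project_mb      # ~8.9%
holdout_pct = 100 * holdout_mb / project_mb            # ~11.1%
training_mb = project_mb - validation_mb - holdout_mb  # 720MB
training_pct = 100 * training_mb / project_mb          # 80%
print(round(validation_pct, 1), round(holdout_pct, 1), training_pct)
```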

Note that the TVH method is not applicable to time-aware projects; they use date/time partitioning instead.

Examples: partitioning methods

The examples below illustrate how different partitioning methods work in non-time-aware DataRobot projects. All examples describe a binary classification problem: predicting loan defaults.

Random partitioning

Rows for each partition are selected at random, without taking target values into account.

| State | Loan_purpose | Is_bad_loan (target) | Possible outcome (TVH) | Possible outcome (5-fold CV) |
|-------|--------------|----------------------|------------------------|------------------------------|
| AR | debt consolidation | 0 | Training | Fold 1 |
| AZ | debt consolidation | 0 | Training | Fold 5 |
| AZ | home improvement | 1 | Validation | Fold 4 |
| AZ | credit card | 1 | Training | Fold 4 |
| CO | credit card | 0 | Training | Fold 3 |
| CO | home improvement | 0 | Training | Fold 2 |
| CO | home improvement | 0 | Validation | Fold 1 |
| CT | small business | 1 | Training | Holdout |
| GA | credit card | 0 | Training | Fold 3 |
| ID | small business | 0 | Training | Fold 2 |
| IL | small business | 0 | Training | Holdout |
| IN | home improvement | 1 | Holdout | Fold 5 |
| IN | debt consolidation | 1 | Holdout | Fold 3 |
| KY | credit card | 0 | Training | Holdout |
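
Random assignment like the table above can be sketched with NumPy; the partition proportions here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 14  # one entry per row of the example table

# Assign each row to a partition at random, ignoring the target entirely.
partitions = rng.choice(
    ["Training", "Validation", "Holdout"], size=n_rows, p=[0.64, 0.16, 0.20]
)
print(partitions)
```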

Stratified partitioning

For stratified partitioning, each partition (T, V, H, or each CV fold) has a similar proportion of positive and negative target examples, unlike the previous example with random partitioning.

| State | Loan_purpose | Is_bad_loan (target) | Possible outcome (TVH) | Possible outcome (5-fold CV) |
|-------|--------------|----------------------|------------------------|------------------------------|
| AR | debt consolidation | 1 | Training | Fold 1 |
| AZ | debt consolidation | 0 | Training | Fold 5 |
| AZ | home improvement | 1 | Validation | Fold 4 |
| AZ | credit card | 0 | Training | Fold 4 |
| CO | credit card | 1 | Training | Fold 3 |
| CO | home improvement | 0 | Training | Fold 2 |
| CO | home improvement | 0 | Validation | Fold 1 |
| CT | small business | 1 | Training | Holdout |
| GA | credit card | 0 | Training | Fold 3 |
| ID | small business | 1 | Training | Fold 2 |
| IL | small business | 0 | Training | Holdout |
| IN | home improvement | 1 | Holdout | Fold 5 |
| IN | debt consolidation | 1 | Training | Fold 3 |
| KY | credit card | 0 | Holdout | Holdout |
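
Stratification can be sketched with scikit-learn's StratifiedKFold, which keeps the share of positive targets roughly equal across folds:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Is_bad_loan values from the example table (7 positives, 7 negatives).
y = np.array([1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0])
X = np.zeros((len(y), 1))  # features are irrelevant to the fold assignment

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (_, val_idx) in enumerate(skf.split(X, y), start=1):
    print(f"Fold {fold}: positives={y[val_idx].sum()} of {len(val_idx)} rows")
```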

Group partitioning

In this example of group partitioning, State is used as the group column. Note how the rows for the same state always end up in the same partition.

| State | Loan_purpose | Is_bad_loan (target) | Possible outcome (TVH) | Possible outcome (5-fold CV) |
|-------|--------------|----------------------|------------------------|------------------------------|
| AR | debt consolidation | 1 | Training | Fold 1 |
| AZ | debt consolidation | 0 | Training | Fold 5 |
| AZ | home improvement | 1 | Training | Fold 5 |
| AZ | credit card | 0 | Training | Fold 5 |
| CO | credit card | 1 | Validation | Fold 3 |
| CO | home improvement | 0 | Validation | Fold 3 |
| CO | home improvement | 0 | Validation | Fold 3 |
| CT | small business | 1 | Training | Fold 1 |
| GA | credit card | 0 | Training | Fold 2 |
| ID | small business | 1 | Training | Fold 2 |
| IL | small business | 0 | Holdout | Holdout |
| IN | home improvement | 1 | Holdout | Fold 4 |
| IN | debt consolidation | 1 | Training | Fold 4 |
| KY | credit card | 0 | Training | Holdout |
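
Group partitioning can be sketched with scikit-learn's GroupKFold, which guarantees that no group (here, State) is split across folds:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# State values from the example table; rows sharing a state stay together.
states = np.array(["AR", "AZ", "AZ", "AZ", "CO", "CO", "CO",
                   "CT", "GA", "ID", "IL", "IN", "IN", "KY"])
X = np.zeros((len(states), 1))  # features are irrelevant to the assignment

for fold, (_, val_idx) in enumerate(
    GroupKFold(n_splits=5).split(X, groups=states), start=1
):
    print(f"Fold {fold}: {sorted(set(states[val_idx]))}")
```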

Partition feature partitioning

The partition feature method uses either TVH or CV.

TVH: The three unique values of “My_partition_id” directly correspond to assigned partitions.

| State | Loan_purpose | Is_bad_loan (target) | My_partition_id (partition feature) | Outcome |
|-------|--------------|----------------------|-------------------------------------|---------|
| AR | debt consolidation | 1 | my_train | Training |
| AZ | debt consolidation | 0 | my_train | Training |
| AZ | home improvement | 1 | my_train | Training |
| AZ | credit card | 0 | my_val | Validation |
| CO | credit card | 1 | my_val | Validation |
| CO | home improvement | 0 | my_val | Validation |
| CO | home improvement | 0 | my_train | Training |
| CT | small business | 1 | my_train | Training |
| GA | credit card | 0 | my_train | Training |
| ID | small business | 1 | my_train | Training |
| IL | small business | 0 | my_holdout | Holdout |
| IN | home improvement | 1 | my_holdout | Holdout |
| IN | debt consolidation | 1 | my_holdout | Holdout |
| KY | credit card | 0 | my_holdout | Holdout |

CV: The six unique values of My_partition_id directly correspond to six created partitions.

| State | Loan_purpose | Is_bad_loan (target) | My_partition_id (partition feature) | Outcome |
|-------|--------------|----------------------|-------------------------------------|---------|
| AR | debt consolidation | 1 | P1 | Fold 1 |
| AZ | debt consolidation | 0 | P1 | Fold 1 |
| AZ | home improvement | 1 | P2 | Fold 2 |
| AZ | credit card | 0 | P2 | Fold 2 |
| CO | credit card | 1 | P3 | Fold 3 |
| CO | home improvement | 0 | P3 | Fold 3 |
| CO | home improvement | 0 | P4 | Fold 4 |
| CT | small business | 1 | P4 | Fold 4 |
| GA | credit card | 0 | P5 | Fold 5 |
| ID | small business | 1 | P5 | Fold 5 |
| IL | small business | 0 | P6 | Fold 6 |
| IN | home improvement | 1 | P6 | Fold 6 |
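
A partition feature can be applied outside the platform with plain pandas; this sketch assumes a DataFrame with a My_partition_id column like the one above:

```python
import pandas as pd

df = pd.DataFrame({
    "State": ["AR", "AZ", "AZ", "AZ", "CO", "CO"],
    "Is_bad_loan": [1, 0, 1, 0, 1, 0],
    "My_partition_id": ["P1", "P1", "P2", "P2", "P3", "P3"],
})

# Every unique value of the partition feature becomes its own fold.
for partition_id, fold in df.groupby("My_partition_id"):
    print(partition_id, f"{len(fold)} rows")
```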

Updated December 21, 2023