Data partitioning and validation

Partitioning divides project data into segments so that models can be trained on one portion of the data and evaluated on data they have not seen, maximizing the reliability of accuracy estimates. The following sections describe the segments that DataRobot creates.

Note

To evaluate and select models, consider only the Validation and Cross-Validation scores. Use the Holdout score for a final estimate of model performance only after you have selected your best model. (DataRobot Classic only: to make sure that the Holdout score does not inadvertently affect your model selection, DataRobot "hides" the score behind the padlock icon.)

Validation types

To maximize accuracy, DataRobot separates data into training, validation, and holdout data. The segments (splits) of the dataset are defined as follows:

| Split | Description |
|-------|-------------|
| Training | The training set is the data used to build the models. Things such as linear model coefficients and the splits of a decision tree are derived from information in the training set. |
| Validation | The validation (or testing) set is data that is not part of the training set; it is used to evaluate a model's performance on data it has not seen before. Since this data was not used to build the model, it can provide an unbiased estimate of a model's accuracy. You often compare validation results when selecting a model. |
| Holdout | Because the process of training a series of models and then selecting the "best" based on the validation score can yield an overly optimistic estimate of a model's performance, DataRobot uses the holdout set as an extra check against this selection bias. The holdout data is unavailable to models during the training and validation process. After selecting a model, you can score it on this data as another check. |
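
Outside the platform, an equivalent three-way split can be sketched with scikit-learn. This is a minimal illustration on synthetic stand-in data; the 20% holdout and 16% validation proportions mirror DataRobot's defaults:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)  # stand-in data

# Carve out a 20% holdout set first (DataRobot's default).
X_rest, X_holdout, y_rest, y_holdout = train_test_split(
    X, y, test_size=0.20, random_state=42
)

# Split the remainder into training and validation sets.
# 20% of the remaining 80% equals 16% of the full dataset (one of five folds).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.20, random_state=42
)
```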

When creating splits, DataRobot uses five folds by default. With each round of training, DataRobot uses increasing amounts of data, split across those folds. For full Autopilot, the first round uses a random sample comprising 16% of the training data, drawn from the first four partitions; the next round uses 32% of the training data, and the final round 64%. The Validation and Holdout partitions never change. (Other modeling modes may use different sample sizes.)
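
As a quick check of where those sample sizes come from (assuming the default 20% holdout and five folds):

```python
holdout_pct = 20                               # default holdout share
n_folds = 5
fold_pct = (100 - holdout_pct) / n_folds       # each fold is 16% of the data

# Autopilot rounds double the sample: one, two, then four folds' worth of data.
rounds = [fold_pct * k for k in (1, 2, 4)]
print(rounds)                                  # [16.0, 32.0, 64.0]
```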

[Figure: data partitioning into training, validation, and holdout segments]

[Figure: cross-validation fold assignment]

DataRobot uses "stacked predictions" for the validation partition when creating "out-of-sample" predictions on training data.

What are stacked predictions?

Without some kind of manipulation, predictions from training data would appear to have misleadingly high accuracy. To address this, DataRobot uses a technique called stacked predictions for the training dataset.

With stacked predictions, DataRobot builds multiple models on different subsets of the training data. The prediction for any row is made using a model that excluded that data from training. In this way, each prediction is effectively an "out-of-sample" prediction.

To do this, DataRobot runs cross-validation for each model and then "stacks" the out-of-fold predictions. With three-fold cross-validation, for example, each fold is predicted by a model trained on the other two folds. Every row from the training data is present in the stack, but because of this methodology, the "stack" provides out-of-sample predictions on the training data.
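
The same out-of-fold stacking can be reproduced with scikit-learn's cross_val_predict, which trains one model per fold and predicts each row with the model that never saw it. A minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=300, random_state=0)  # stand-in data

# Each row is predicted by the fold-model that excluded it from training,
# so the resulting "stack" is out-of-sample for every training row.
stacked = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y, cv=3, method="predict_proba"
)
print(stacked.shape)  # (300, 2): one out-of-sample probability pair per row
```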

In a sample of downloaded predictions, DataRobot makes it obvious which rows belong to the holdout partition; rows from the validation partition are labeled as 0.
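
If you download those predictions as a CSV, the labels can be inspected directly. A sketch with pandas; the file name and the Partition column name here are illustrative assumptions about the export format:

```python
import pandas as pd

preds = pd.read_csv("training_predictions.csv")  # hypothetical export file

# Validation rows carry the fold label 0 (it may be rendered as 0.0);
# holdout rows are labeled explicitly.
print(preds["Partition"].value_counts())
holdout_rows = preds[preds["Partition"] == "Holdout"]
validation_rows = preds[preds["Partition"].astype(str).isin(["0", "0.0"])]
```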

Stacked prediction considerations

Any dataset that exceeds 800MB results in a project whose models do not have stacked predictions (that is, no out-of-sample predictions on training data). For models that have not been trained into either Validation or Holdout, all scores and insights are available. Otherwise:

  • Validation scores are not available for models trained into Validation; Holdout scores are not available for models trained into Holdout.

  • For models trained into Validation but not into Holdout:

    • Holdout score is available, Validation score is not.
    • Lift Chart (and other charts in the Evaluate tab) is available for Holdout, but is not available for Validation.
    • Prediction Explanation preview is available but does not show a distribution chart.
  • For models trained into both Validation and Holdout (a model trained on 100% of the data):

    • No metric scores are available.
    • No Lift Chart (or other charts in the Evaluate tab) is available.
  • Whether insights are computed with in-sample predictions depends on a variety of criteria; DataRobot displays a message explaining what is affected, based on the sample size and the partitions into which the model was trained.

  • For models trained into Validation only or Validation and Holdout, the Prediction Explanation preview is available but does not show a prediction distribution chart (because there are no stacked predictions, only in-sample predictions).

Validation scores

The Leaderboard lists all models that DataRobot created (automatically or manually) and each model's scores. Scores are displayed in all or some of the following columns:

  • Validation
  • Cross-Validation (CV)
  • Holdout

The presence or absence of a particular column depends on the type of validation partition that you chose at the start of the project.

By default, DataRobot creates a 20% holdout and five-fold cross-validation. If you use these defaults, DataRobot displays values in all three columns.

Scores in the Validation column are calculated by scoring a model's predictions against the first validation partition; that is, using a single fold of data. The Cross-Validation score is the mean of the (by default) five scores calculated on five different training/validation splits.
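
In miniature, with made-up fold scores:

```python
# Hypothetical scores (e.g., AUC) from five training/validation splits.
fold_scores = [0.81, 0.79, 0.83, 0.80, 0.82]

validation_score = fold_scores[0]               # single first-fold score
cv_score = sum(fold_scores) / len(fold_scores)  # mean across all five folds
print(validation_score, round(cv_score, 2))     # 0.81 0.81
```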

Understand validation types

Model validation has two important purposes. First, you use validation to pick the best model from all the models built for a given dataset. Then, once picked, validation helps you to decide whether the model is accurate enough to suit your needs. The following sections describe methods for using your data to validate models.

K-fold cross-validation (CV)

Note

DataRobot disables cross-validation when the dataset size is greater than 800MB. As a result, train-validation-holdout (TVH) is the only supported partitioning method for these datasets.

The performance of a predictive model usually increases as the size of the training set increases. Also, model performance estimates are more consistent if the validation set is large. Therefore, it is best to use as much data as possible for both training and validation. Use the cross-validation method to maximize the data available for each of these sets. This process involves:

  1. Separating the data into two or more sections, called "folds".
  2. Creating one model per fold, where the data assigned to that fold is used for validation and the rest of the data is used for training.

The benefit of this approach is that all of the data is used for scoring and, if enough folds are used, most of the data is used for training.

  • Pros: This method provides a better estimate of model performance.
  • Cons: Because of its multiple passes, CV is computationally intensive and takes longer to run.
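
The two-step procedure above can be sketched with scikit-learn's KFold; one model is fit per fold, and every row is scored exactly once:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=500, random_state=0)  # stand-in data

scores = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # The held-out fold validates a model trained on the other four folds.
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[val_idx], model.predict(X[val_idx])))

print(scores, sum(scores) / len(scores))  # per-fold scores and their mean
```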

To compensate for the overhead when working with large datasets, DataRobot first trains models on a smaller part of the data and uses only one cross-validation fold to evaluate model performance.

Then, for the highest-performing models, DataRobot increases the subset sizes. In the end, only the best models are trained on the full cross-validation partition. For those models, DataRobot completes k-fold cross-validation training and scoring, and the mean score across all folds is displayed in the Cross-Validation column. Models that did not perform well do not have a cross-validation score; because they were evaluated on only a single fold, their score is reported in the Validation column. You can initiate complete CV evaluation for those models manually by clicking Run in the model's Cross-Validation column.

If the dataset contains 50,000 rows or more, DataRobot does not run cross-validation automatically. To initiate it, click Run in the model's Cross-Validation column. If the dataset is larger than 800MB, cross-validation is not allowed; instead, DataRobot uses TVH (described below). CV is generally most useful for smaller datasets, where TVH would not leave enough data for reliable training and validation.

Training, validation, and holdout (TVH)

With the TVH method, the default validation method for datasets larger than 800MB, DataRobot builds and evaluates predictive models by partitioning datasets into three distinct sections: training, validation, and holdout. Predictions are based on a single pass over the data.

  • Pros: This method is faster than cross-validation because it only makes one pass on each dataset to score the data.

  • Cons: For the same reason that it is faster (it evaluates on a single split), its performance estimate is moderately less reliable.

For projects larger than 800MB (non-time-aware only), the training partition percentage is not scaled down. The validation and holdout partitions default to fixed sizes of 80MB and 100MB, respectively, and do not change unless you modify them manually (both have a maximum size of 400MB). The validation and holdout percentages therefore shrink as the training partition grows. The training partition comprises whatever percentage remains after accounting for the validation and holdout percentages.

For example, say you have a 900MB project. If the validation and holdout partitions are at their default sizes of 80MB and 100MB respectively, the validation percentage is roughly 8.9% and the holdout percentage roughly 11.1%. The training partition comprises the remaining 720MB: 80%.
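
The example's percentages as a worked calculation:

```python
project_mb = 900
validation_mb, holdout_mb = 80, 100  # default fixed sizes for >800MB projects

validation_pct = 100 * validation_mb / project_mb      # ~8.9%
holdout_pct = 100 * holdout_mb / project_mb            # ~11.1%
training_mb = project_mb - validation_mb - holdout_mb  # 720MB
training_pct = 100 * training_mb / project_mb          # 80%
print(round(validation_pct, 1), round(holdout_pct, 1), training_pct)
```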

Note that the TVH method is not applicable to time-aware projects; they use date/time partitioning instead.

Examples: partitioning methods

The examples below illustrate how different partitioning methods work in non-time-aware DataRobot projects. All examples describe a binary classification problem: predicting loan defaults.

Random partitioning

Rows for each partition are selected at random, without taking target values into account.

| State | Loan_purpose | Is_bad_loan (target) | Possible outcome (TVH) | Possible outcome (5-fold CV) |
|-------|--------------|----------------------|------------------------|------------------------------|
| AR | debt consolidation | 0 | Training | Fold 1 |
| AZ | debt consolidation | 0 | Training | Fold 5 |
| AZ | home improvement | 1 | Validation | Fold 4 |
| AZ | credit card | 1 | Training | Fold 4 |
| CO | credit card | 0 | Training | Fold 3 |
| CO | home improvement | 0 | Training | Fold 2 |
| CO | home improvement | 0 | Validation | Fold 1 |
| CT | small business | 1 | Training | Holdout |
| GA | credit card | 0 | Training | Fold 3 |
| ID | small business | 0 | Training | Fold 2 |
| IL | small business | 0 | Training | Holdout |
| IN | home improvement | 1 | Holdout | Fold 5 |
| IN | debt consolidation | 1 | Holdout | Fold 3 |
| KY | credit card | 0 | Training | Holdout |
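
Random assignment like the table above can be sketched with NumPy; the partition proportions here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 14  # one entry per row of the example table

# Assign each row to a partition at random, ignoring the target entirely.
partitions = rng.choice(
    ["Training", "Validation", "Holdout"], size=n_rows, p=[0.64, 0.16, 0.20]
)
print(partitions)
```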

Stratified partitioning

For stratified partitioning, each partition (T, V, H, or each CV fold) has a similar proportion of positive and negative target examples, unlike the previous example with random partitioning.

| State | Loan_purpose | Is_bad_loan (target) | Possible outcome (TVH) | Possible outcome (5-fold CV) |
|-------|--------------|----------------------|------------------------|------------------------------|
| AR | debt consolidation | 1 | Training | Fold 1 |
| AZ | debt consolidation | 0 | Training | Fold 5 |
| AZ | home improvement | 1 | Validation | Fold 4 |
| AZ | credit card | 0 | Training | Fold 4 |
| CO | credit card | 1 | Training | Fold 3 |
| CO | home improvement | 0 | Training | Fold 2 |
| CO | home improvement | 0 | Validation | Fold 1 |
| CT | small business | 1 | Training | Holdout |
| GA | credit card | 0 | Training | Fold 3 |
| ID | small business | 1 | Training | Fold 2 |
| IL | small business | 0 | Training | Holdout |
| IN | home improvement | 1 | Holdout | Fold 5 |
| IN | debt consolidation | 1 | Training | Fold 3 |
| KY | credit card | 0 | Holdout | Holdout |
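
Stratification can be sketched with scikit-learn's StratifiedKFold, which keeps the share of positive targets roughly equal across folds:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Is_bad_loan values from the example table (7 positives, 7 negatives).
y = np.array([1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0])
X = np.zeros((len(y), 1))  # features are irrelevant to the fold assignment

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (_, val_idx) in enumerate(skf.split(X, y), start=1):
    print(f"Fold {fold}: positives={y[val_idx].sum()} of {len(val_idx)} rows")
```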

Group partitioning

In this example of group partitioning, State is used as the group column. Note how the rows for the same state always end up in the same partition.

| State | Loan_purpose | Is_bad_loan (target) | Possible outcome (TVH) | Possible outcome (5-fold CV) |
|-------|--------------|----------------------|------------------------|------------------------------|
| AR | debt consolidation | 1 | Training | Fold 1 |
| AZ | debt consolidation | 0 | Training | Fold 5 |
| AZ | home improvement | 1 | Training | Fold 5 |
| AZ | credit card | 0 | Training | Fold 5 |
| CO | credit card | 1 | Validation | Fold 3 |
| CO | home improvement | 0 | Validation | Fold 3 |
| CO | home improvement | 0 | Validation | Fold 3 |
| CT | small business | 1 | Training | Fold 1 |
| GA | credit card | 0 | Training | Fold 2 |
| ID | small business | 1 | Training | Fold 2 |
| IL | small business | 0 | Holdout | Holdout |
| IN | home improvement | 1 | Holdout | Fold 4 |
| IN | debt consolidation | 1 | Training | Fold 4 |
| KY | credit card | 0 | Training | Holdout |
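
Group partitioning can be sketched with scikit-learn's GroupKFold, which guarantees that no group (here, State) is split across folds:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# State values from the example table; rows sharing a state stay together.
states = np.array(["AR", "AZ", "AZ", "AZ", "CO", "CO", "CO",
                   "CT", "GA", "ID", "IL", "IN", "IN", "KY"])
X = np.zeros((len(states), 1))  # features are irrelevant to the assignment

for fold, (_, val_idx) in enumerate(
    GroupKFold(n_splits=5).split(X, groups=states), start=1
):
    print(f"Fold {fold}: {sorted(set(states[val_idx]))}")
```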

Partition feature partitioning

The partition feature method uses either TVH or CV.

TVH: The three unique values of “My_partition_id” directly correspond to assigned partitions.

| State | Loan_purpose | Is_bad_loan (target) | My_partition_id (partition feature) | Outcome |
|-------|--------------|----------------------|-------------------------------------|---------|
| AR | debt consolidation | 1 | my_train | Training |
| AZ | debt consolidation | 0 | my_train | Training |
| AZ | home improvement | 1 | my_train | Training |
| AZ | credit card | 0 | my_val | Validation |
| CO | credit card | 1 | my_val | Validation |
| CO | home improvement | 0 | my_val | Validation |
| CO | home improvement | 0 | my_train | Training |
| CT | small business | 1 | my_train | Training |
| GA | credit card | 0 | my_train | Training |
| ID | small business | 1 | my_train | Training |
| IL | small business | 0 | my_holdout | Holdout |
| IN | home improvement | 1 | my_holdout | Holdout |
| IN | debt consolidation | 1 | my_holdout | Holdout |
| KY | credit card | 0 | my_holdout | Holdout |

CV: The six unique values of My_partition_id directly correspond to six created partitions.

| State | Loan_purpose | Is_bad_loan (target) | My_partition_id (partition feature) | Outcome |
|-------|--------------|----------------------|-------------------------------------|---------|
| AR | debt consolidation | 1 | P1 | Fold 1 |
| AZ | debt consolidation | 0 | P1 | Fold 1 |
| AZ | home improvement | 1 | P2 | Fold 2 |
| AZ | credit card | 0 | P2 | Fold 2 |
| CO | credit card | 1 | P3 | Fold 3 |
| CO | home improvement | 0 | P3 | Fold 3 |
| CO | home improvement | 0 | P4 | Fold 4 |
| CT | small business | 1 | P4 | Fold 4 |
| GA | credit card | 0 | P5 | Fold 5 |
| ID | small business | 1 | P5 | Fold 5 |
| IL | small business | 0 | P6 | Fold 6 |
| IN | home improvement | 1 | P6 | Fold 6 |
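
A partition feature can be applied outside the platform with plain pandas; this sketch assumes a DataFrame with a My_partition_id column like the one above:

```python
import pandas as pd

df = pd.DataFrame({
    "State": ["AR", "AZ", "AZ", "AZ", "CO", "CO"],
    "Is_bad_loan": [1, 0, 1, 0, 1, 0],
    "My_partition_id": ["P1", "P1", "P2", "P2", "P3", "P3"],
})

# Every unique value of the partition feature becomes its own fold.
for partition_id, fold in df.groupby("My_partition_id"):
    print(partition_id, f"{len(fold)} rows")
```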

Updated December 21, 2023