

Data partitioning and validation

You should evaluate and select models using only the Validation and Cross-Validation scores. Use the Holdout score for a final estimate of model performance only after you have selected your best model. To make sure that the Holdout score does not inadvertently affect your model selection, DataRobot “hides” the score behind the padlock icon. After you have selected your optimal model, you can score it using the holdout data. The following section describes the dataset segments.

Validation types

To maximize accuracy, DataRobot separates data into training, validation, and holdout data. The segments (splits) of the dataset are defined as follows:

| Split | Description |
| --- | --- |
| Training | The training set is data used to build the models. Things such as linear model coefficients and the splits of a decision tree are derived from information in the training set. |
| Validation | The validation (or testing) set is data that is not part of the training set; it is used to evaluate a model’s performance on data it has not seen before. Since this data was not used to build the model, it can provide an unbiased estimate of the model’s accuracy. You typically compare validation results when selecting a model. |
| Holdout | Because the process of training a series of models and then selecting the “best” based on the validation score can yield an overly optimistic estimate of a model’s performance, DataRobot uses the holdout set as an extra check against this selection bias. The holdout data is unavailable to models during the training and validation process. After selecting a model, you can score it on this data as a final check. |
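The three-way split can be sketched with scikit-learn. This is a generic illustration, not DataRobot's internal implementation; the 64/16/20 row counts mirror the default 20% holdout with the remaining 80% divided into five folds, one of which serves as validation:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)          # 100 toy rows
y = np.random.randint(0, 2, size=100)      # toy binary target

# First carve out a 20% holdout, then a validation set from what remains.
X_rest, X_holdout, y_rest, y_holdout = train_test_split(
    X, y, test_size=0.20, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.20, random_state=42)

print(len(X_train), len(X_val), len(X_holdout))  # 64 16 20
```

The holdout rows never touch model fitting or selection, which is exactly the property the padlock on the Leaderboard protects.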

(Figure: visualization of the training, validation, and holdout partitions.)

See the description of stacked predictions for an understanding of which fold DataRobot uses for the validation partition when creating "out-of-sample" predictions on training data.

Validation scores

The Leaderboard lists all models that DataRobot created (automatically or manually), along with each model's scores. Scores are displayed in all or some of the following columns:

  • Validation
  • Cross-Validation (CV)
  • Holdout

The presence or absence of a particular column depends on the type of validation partition that you chose at the start of the project.

By default, DataRobot creates a 20% holdout and five-fold cross-validation. If you use these defaults, DataRobot displays values in all three columns.

Scores in the Validation column are calculated from a model's predictions against the first validation partition; that is, a single fold of data. The Cross-Validation score is the mean of the (by default) five scores calculated on five different training/validation partitions.
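As a rough analogy using scikit-learn (not DataRobot's internals), the Validation column corresponds to the score on a single fold and the Cross-Validation column to the mean over all five folds:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, random_state=0)

# One score per fold; "neg_log_loss" stands in for a project metric like LogLoss.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=5, scoring="neg_log_loss")

single_fold = scores[0]   # analogous to the Validation column (one fold)
cv_mean = scores.mean()   # analogous to the Cross-Validation column (mean of 5)
```

Because the CV figure averages five independent evaluations, it is typically a more stable estimate than any single fold's score.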

Understand validation types

Model validation has two important purposes. First, you use validation to pick the best model from all the models built for a given dataset. Then, once picked, validation helps you to decide whether the model is accurate enough to suit your needs. The following sections describe methods for using your data to validate models.

K-fold cross-validation (CV)

Performance of a predictive model usually increases as the size of the training set increases. Also, model performance estimates are more consistent if the validation set is large. Therefore, it is best to use as much data as possible for both training and validation. Use the cross-validation method to maximize the data available for each of these sets. This process involves:

  1. Separating the data into two or more sections, called “folds.”
  2. Creating one model per fold, with the data assigned to that fold used for validation and the rest of the data used for training.

The benefit of this approach is that all of the data is used for scoring, and if enough folds are used, most of the data is used for training.
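The two steps above can be sketched with scikit-learn's KFold (an illustrative analogy, not DataRobot's code):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(-1, 1)  # 20 toy rows
kf = KFold(n_splits=5, shuffle=True, random_state=0)

val_counts = np.zeros(len(X), dtype=int)
for train_idx, val_idx in kf.split(X):
    # One model per fold: train on 4/5 of the rows, validate on the held-out 1/5.
    val_counts[val_idx] += 1

# Across the 5 folds, every row is used for validation exactly once
# (and for training in the other 4 folds).
print(val_counts.tolist())
```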

  • Pros: This method provides a better estimate of model performance.
  • Cons: Because of its multiple passes, CV is computationally intensive and takes longer to run.

To compensate for the overhead when working with large datasets, DataRobot first trains models on a smaller part of the data and uses only one cross-validation fold to evaluate model performance.

Then, for the highest performing models, DataRobot increases the subset sizes. In the end, only the best models are trained on the total cross-validation partition. For those models, DataRobot completes k-fold cross-validation training and scoring. As a result, the mean score of complete cross-validation for a model is displayed in the Cross-Validation column. Those models that did not perform well will not have a cross-validation score. Instead, because they only had a "one-fold" validation, their score is reported in the Validation column. You can initiate complete CV model evaluation manually for those models by clicking Run in the model's Cross-Validation column.

If the dataset contains 50,000 rows or more, DataRobot does not run cross-validation automatically; to initiate it, click Run in the model's Cross-Validation column. If the dataset is larger than 800MB, cross-validation is not allowed and DataRobot uses TVH (described below) instead. CV is generally most useful for smaller datasets, where TVH would not leave enough data for training and validation.

Training, validation, and holdout (TVH)

With the TVH method, the default validation method for datasets larger than 800MB, DataRobot builds and evaluates predictive models by partitioning the dataset into three distinct sections: training, validation, and holdout. Predictions are based on a single pass over the data.

  • Pros: This method is faster than cross-validation because it only makes one pass on each dataset to score the data.

  • Cons: For the same reason that it is faster, it is also moderately less accurate.

For projects larger than 800MB (non-time-aware only), the training partition percentage is not scaled down. The validation and holdout partitions default to 80MB and 100MB respectively, and do not change unless you resize them manually (both have a maximum size of 400MB). The validation and holdout percentages therefore shrink as the training partition grows. The training partition comprises whatever percentage remains after accounting for the validation and holdout percentages.

For example, say you have a 900MB project. If the validation and holdout partitions are at the default sizes of 80MB and 100MB respectively, then the validation percentage will be about 8.9% and the holdout percentage about 11.1%. The training partition will comprise the remaining 720MB, or 80%.
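The arithmetic in this example can be checked directly:

```python
# Reproduce the 900MB example: validation and holdout have fixed default sizes,
# and training takes whatever remains.
total_mb = 900
validation_mb, holdout_mb = 80, 100  # DataRobot defaults per the text

validation_pct = 100 * validation_mb / total_mb  # ~8.9%
holdout_pct = 100 * holdout_mb / total_mb        # ~11.1%
training_mb = total_mb - validation_mb - holdout_mb
training_pct = 100 * training_mb / total_mb      # 80.0%

print(training_mb, round(validation_pct, 1), round(holdout_pct, 1))
```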

Note that the TVH method is not applicable to time-aware projects; they use date/time partitioning instead.

Examples: partitioning methods

The examples below illustrate how the different partitioning methods work in non-time-aware DataRobot projects. All examples describe a binary classification problem: predicting loan defaults.

Random partitioning

Rows for each partition are selected at random, without taking target values into account.

| State | Loan_purpose | Is_bad_loan (target) | Possible outcome (TVH) | Possible outcome (5-fold CV) |
| --- | --- | --- | --- | --- |
| AR | debt consolidation | 0 | Training | Fold 1 |
| AZ | debt consolidation | 0 | Training | Fold 5 |
| AZ | home improvement | 1 | Validation | Fold 4 |
| AZ | credit card | 1 | Training | Fold 4 |
| CO | credit card | 0 | Training | Fold 3 |
| CO | home improvement | 0 | Training | Fold 2 |
| CO | home improvement | 0 | Validation | Fold 1 |
| CT | small business | 1 | Training | Holdout |
| GA | credit card | 0 | Training | Fold 3 |
| ID | small business | 0 | Training | Fold 2 |
| IL | small business | 0 | Training | Holdout |
| IN | home improvement | 1 | Holdout | Fold 5 |
| IN | debt consolidation | 1 | Holdout | Fold 3 |
| KY | credit card | 0 | Training | Holdout |
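A minimal sketch of random TVH assignment with NumPy (illustrative probabilities only; the actual fractions depend on your partition settings):

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 14  # same number of rows as the example table

# Each row is assigned independently at random, ignoring the target value.
# 64/16/20 mirrors a 20% holdout with the rest split 4:1 train/validation.
partitions = rng.choice(["Training", "Validation", "Holdout"],
                        size=n_rows, p=[0.64, 0.16, 0.20])
print(partitions.tolist())
```

Because the target is ignored, the class balance within each partition can drift from the overall balance, which is what stratified partitioning (next) corrects.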

Stratified partitioning

For stratified partitioning, each partition (T, V, H, or each CV fold) has a similar proportion of positive and negative target examples, unlike the previous example with random partitioning.

| State | Loan_purpose | Is_bad_loan (target) | Possible outcome (TVH) | Possible outcome (5-fold CV) |
| --- | --- | --- | --- | --- |
| AR | debt consolidation | 1 | Training | Fold 1 |
| AZ | debt consolidation | 0 | Training | Fold 5 |
| AZ | home improvement | 1 | Validation | Fold 4 |
| AZ | credit card | 0 | Training | Fold 4 |
| CO | credit card | 1 | Training | Fold 3 |
| CO | home improvement | 0 | Training | Fold 2 |
| CO | home improvement | 0 | Validation | Fold 1 |
| CT | small business | 1 | Training | Holdout |
| GA | credit card | 0 | Training | Fold 3 |
| ID | small business | 1 | Training | Fold 2 |
| IL | small business | 0 | Training | Holdout |
| IN | home improvement | 1 | Holdout | Fold 5 |
| IN | debt consolidation | 1 | Training | Fold 3 |
| KY | credit card | 0 | Holdout | Holdout |
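Stratified partitioning behaves like scikit-learn's StratifiedKFold, which preserves the class ratio in every fold (an analogy, not DataRobot's implementation):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([0] * 10 + [1] * 10)  # balanced binary target, 20 rows
X = np.zeros((len(y), 1))          # features are irrelevant to the split

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for _, fold_idx in skf.split(X, y):
    # Each fold of 4 rows contains exactly 2 positives and 2 negatives,
    # matching the 50/50 class ratio of the full dataset.
    assert y[fold_idx].mean() == 0.5
```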

Group partitioning

In this example of group partitioning, State is used as the group column. Note how the rows for the same state always end up in the same partition.

| State | Loan_purpose | Is_bad_loan (target) | Possible outcome (TVH) | Possible outcome (5-fold CV) |
| --- | --- | --- | --- | --- |
| AR | debt consolidation | 1 | Training | Fold 1 |
| AZ | debt consolidation | 0 | Training | Fold 5 |
| AZ | home improvement | 1 | Training | Fold 5 |
| AZ | credit card | 0 | Training | Fold 5 |
| CO | credit card | 1 | Validation | Fold 3 |
| CO | home improvement | 0 | Validation | Fold 3 |
| CO | home improvement | 0 | Validation | Fold 3 |
| CT | small business | 1 | Training | Fold 1 |
| GA | credit card | 0 | Training | Fold 2 |
| ID | small business | 1 | Training | Fold 2 |
| IL | small business | 0 | Holdout | Holdout |
| IN | home improvement | 1 | Holdout | Fold 4 |
| IN | debt consolidation | 1 | Training | Fold 4 |
| KY | credit card | 0 | Training | Holdout |
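The same-group-stays-together behavior is what scikit-learn's GroupKFold provides (again an analogy, not DataRobot's implementation). Here State is the group column, and no state ever spans two folds:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# A subset of the example rows, with State as the group column.
states = np.array(["AR", "AZ", "AZ", "AZ", "CO", "CO", "CO", "CT", "GA", "ID"])
X = np.zeros((len(states), 1))

gkf = GroupKFold(n_splits=3)
fold_state_sets = [set(states[idx]) for _, idx in gkf.split(X, groups=states)]

# No state appears in more than one fold.
for i in range(len(fold_state_sets)):
    for j in range(i + 1, len(fold_state_sets)):
        assert fold_state_sets[i].isdisjoint(fold_state_sets[j])
```

Keeping all rows of a group in one partition prevents leakage when rows within a group are correlated.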

Partition feature partitioning

The partition feature method works with either TVH or CV.

TVH: The unique values of “My_partition_id” correspond directly to the assigned partitions.

| State | Loan_purpose | Is_bad_loan (target) | My_partition_id (partition feature) | Outcome |
| --- | --- | --- | --- | --- |
| AR | debt consolidation | 1 | my_train | Training |
| AZ | debt consolidation | 0 | my_train | Training |
| AZ | home improvement | 1 | my_train | Training |
| AZ | credit card | 0 | my_val | Validation |
| CO | credit card | 1 | my_val | Validation |
| CO | home improvement | 0 | my_val | Validation |
| CO | home improvement | 0 | my_train | Training |
| CT | small business | 1 | my_train | Training |
| GA | credit card | 0 | my_train | Training |
| ID | small business | 1 | my_train | Training |
| IL | small business | 0 | my_holdout | Holdout |
| IN | home improvement | 1 | my_holdout | Holdout |
| IN | debt consolidation | 1 | HO | Holdout |
| KY | credit card | 0 | HO | Holdout |
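A minimal pandas sketch of how a partition feature maps row labels to partitions. The label-to-partition mapping below is hypothetical, chosen to match the example table; in DataRobot you designate these roles when configuring the partition feature in advanced options:

```python
import pandas as pd

df = pd.DataFrame({
    "State": ["AR", "AZ", "IL", "IN"],
    "My_partition_id": ["my_train", "my_val", "my_holdout", "HO"],
})

# Hypothetical mapping from the user-supplied labels to partitions.
label_to_partition = {
    "my_train": "Training",
    "my_val": "Validation",
    "my_holdout": "Holdout",
    "HO": "Holdout",
}
df["Outcome"] = df["My_partition_id"].map(label_to_partition)
print(df["Outcome"].tolist())
```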

CV: The six unique values of My_partition_id correspond directly to six created partitions.

| State | Loan_purpose | Is_bad_loan (target) | My_partition_id (partition feature) | Outcome |
| --- | --- | --- | --- | --- |
| AR | debt consolidation | 1 | P1 | Fold 1 |
| AZ | debt consolidation | 0 | P1 | Fold 1 |
| AZ | home improvement | 1 | P2 | Fold 2 |
| AZ | credit card | 0 | P2 | Fold 2 |
| CO | credit card | 1 | P3 | Fold 3 |
| CO | home improvement | 0 | P3 | Fold 3 |
| CO | home improvement | 0 | P4 | Fold 4 |
| CT | small business | 1 | P4 | Fold 4 |
| GA | credit card | 0 | P5 | Fold 5 |
| ID | small business | 1 | P5 | Fold 5 |
| IL | small business | 0 | P6 | Fold 6 |
| IN | home improvement | 1 | P6 | Fold 6 |

Updated October 26, 2021