

Add training data to a custom model

To enable feature drift tracking for a model deployment, you must add training data by assigning it to a model version. For unstructured custom inference models, you must upload the training and holdout datasets separately, and these datasets cannot include a partition column.

File size warning

The file size limit for custom model training data uploaded to DataRobot is 1.5GB.
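A pre-upload check can catch oversized files before the assignment fails. The following is a minimal sketch, not part of DataRobot's tooling: the function names are hypothetical, and it assumes "1.5GB" means 1.5 × 10^9 bytes (the stricter, decimal reading) and that a file exactly at the limit is accepted.

```python
import os

# Assumption: "1.5GB" is read as 1.5 * 10**9 bytes (decimal gigabytes).
# This is the stricter interpretation; the exact byte limit is not stated.
MAX_TRAINING_DATA_BYTES = 1_500_000_000


def is_within_size_limit(size_bytes: int,
                         limit_bytes: int = MAX_TRAINING_DATA_BYTES) -> bool:
    """Return True if a file of size_bytes fits within the assumed limit."""
    return 0 <= size_bytes <= limit_bytes


def check_training_file(path: str) -> bool:
    """Check an on-disk training file against the assumed limit (hypothetical helper)."""
    return is_within_size_limit(os.path.getsize(path))
```

For example, `check_training_file("training.csv")` returns `False` for a file larger than the assumed limit, in which case you might sample or trim the dataset before uploading.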

Considerations for training data prediction rows count

Training data uploaded to a custom model is used to compute Feature Impact, drift baselines, and Prediction Explanation previews. To perform these calculations, DataRobot automatically splits the uploaded training data into partitions for training, validation, and holdout (i.e., T/V/H) in a 60/20/20 ratio. Alternatively, you can manually provide a partition column in the training dataset to assign each row to the training (T), validation (V), or holdout (H) partition.

Prediction Explanations require 100 rows in the validation partition. If you don't define your own partitioning, the provided training dataset must therefore contain at least 500 rows, because the automatic validation partition is 20% of the data. If the training data and partition ratio (defined automatically or manually) result in a validation partition containing fewer than 100 rows, Prediction Explanations are not calculated. You can still register and deploy the model, and the deployment can make predictions; however, if you request predictions with explanations, the deployment returns an error.
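The row-count arithmetic above can be sketched as follows. This is an illustration only: the function names are hypothetical, and how DataRobot rounds partition sizes for row counts that don't divide evenly is an assumption (here, validation and holdout are rounded down and training takes the remainder).

```python
MIN_VALIDATION_ROWS = 100  # from the text: rows needed for Prediction Explanations


def partition_sizes(total_rows: int,
                    validation_pct: int = 20,
                    holdout_pct: int = 20) -> tuple[int, int, int]:
    """Approximate (training, validation, holdout) row counts for a 60/20/20 split.

    Assumption: validation and holdout round down; training gets the rest.
    """
    validation = total_rows * validation_pct // 100
    holdout = total_rows * holdout_pct // 100
    training = total_rows - validation - holdout
    return training, validation, holdout


def min_total_rows(min_validation: int = MIN_VALIDATION_ROWS,
                   validation_pct: int = 20) -> int:
    """Smallest dataset whose validation share reaches min_validation rows."""
    # Integer ceiling of min_validation * 100 / validation_pct.
    return -(-min_validation * 100 // validation_pct)


def explanations_available(total_rows: int) -> bool:
    """True if the automatic split leaves enough validation rows."""
    return partition_sizes(total_rows)[1] >= MIN_VALIDATION_ROWS
```

With the default ratio, `partition_sizes(500)` gives 300 training, 100 validation, and 100 holdout rows, which is why 500 is the minimum. With a manual partition column, only the count of validation rows matters.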

Prediction Explanation support for custom models

For custom models, only XEMP explanations are supported. See the XEMP considerations for more requirements.

To assign training data to a custom model version:

  1. In Model Registry > Custom Model Workshop, in the Models list, select the model you want to add training data to.

  2. On the Assemble tab, next to Datasets:

    • If the model version doesn't have training data assigned, click Assign.

    • If the model version does have training data assigned, click the edit icon, and in the Change Training Data dialog box, click the delete icon to remove the existing training data.

  3. In the Add Training Data (or Change Training Data) dialog box, click and drag a training dataset file into the Training Data box, or click Choose file and do either of the following:

    • Click Local file, select a file from your local storage, and then click Open.

    • Click AI Catalog, select a training dataset you previously uploaded to DataRobot, and click Use this dataset.

    Include features required for scoring

    The columns in a custom model's training data indicate which features are included in scoring requests to the deployed custom model; therefore, once training data is available, any features not included in the training dataset aren't sent to the model. When you assemble a custom model in the NextGen experience, you can disable this behavior using the Column filtering setting, which is available as a preview feature.

  4. (Optional) Specify the name of the column containing partitioning information for your data (based on training/validation/holdout partitioning). If you plan to deploy the custom model and monitor its data drift and accuracy, specify the holdout partition in the column to establish an accuracy baseline.

  5. When the upload is complete, click Add Training Data.

    Training data assignment error

    If the training data assignment fails, an error message appears in the new custom model version under Datasets. While this error is active, you can't create a model package to deploy the affected version. To resolve the error and deploy the model package, reassign training data to create a new version, or create a new version and then assign training data.
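If you want to supply your own partition column in step 4, you can add it to the dataset before uploading. The sketch below is one possible way to do that with the Python standard library. The column name `partition`, the helper names, and the literal labels `T`, `V`, and `H` are assumptions for illustration (the labels are taken from the abbreviations used on this page); confirm the values DataRobot accepts before relying on them. As noted above, this does not apply to unstructured models, whose datasets cannot include a partition column.

```python
import csv
import io
import random


def add_partition_column(rows, column="partition",
                         validation_pct=20, holdout_pct=20, seed=0):
    """Return copies of rows (list of dicts) with a T/V/H label added.

    Assumption: the labels "T", "V", "H" and the column name are illustrative.
    Rows are assigned at random (seeded, so the result is repeatable).
    """
    n = len(rows)
    order = list(range(n))
    random.Random(seed).shuffle(order)
    n_val = n * validation_pct // 100
    n_hold = n * holdout_pct // 100
    labels = {}
    for rank, idx in enumerate(order):
        if rank < n_val:
            labels[idx] = "V"
        elif rank < n_val + n_hold:
            labels[idx] = "H"
        else:
            labels[idx] = "T"  # training takes the remainder
    return [{**row, column: labels[i]} for i, row in enumerate(rows)]


def to_csv(rows):
    """Serialize labeled rows to CSV text, ready to write to a file."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

For a dataset of 500 rows, this labels 100 rows for validation and 100 for holdout, which satisfies the 100-row validation requirement for Prediction Explanations and provides a holdout partition for the accuracy baseline.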


Updated October 28, 2024