Add training data to a custom model

Deprecation notice

Currently, you assign training data directly to a custom model, meaning every version of that model uses the same data. This assignment method is deprecated and scheduled for removal. It remains the default during the deprecation period, even for newly created models, to support backward compatibility. To prepare for its removal, convert your custom models to assign training data to a model version.

To enable feature drift tracking for a model deployment, you must add training data by assigning it to a model version. For unstructured custom inference models, you must upload the training and holdout datasets separately, and these datasets cannot include a partition column.
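If an existing dataset for an unstructured model carries a partition column, one way to prepare the two separate uploads is to split it locally first. The following is a minimal sketch using only the Python standard library. The function names are hypothetical, and it assumes that rows marked H form the holdout dataset and all other rows form the training dataset.

```python
import csv


def split_by_partition(rows, partition_column, holdout_value="H"):
    """Split rows into (training, holdout) lists and drop the partition column.

    Assumption: rows whose partition value equals `holdout_value` are holdout
    rows; every other row is treated as training data.
    """
    training, holdout = [], []
    for row in rows:
        # Copy the row without the partition column, leaving the input intact.
        remaining = {k: v for k, v in row.items() if k != partition_column}
        if row[partition_column] == holdout_value:
            holdout.append(remaining)
        else:
            training.append(remaining)
    return training, holdout


def write_csv(path, rows):
    """Write a list of dict rows to a CSV file with a header row."""
    if not rows:
        raise ValueError("no rows to write")
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
```

You could read the source file with `csv.DictReader`, pass the rows to `split_by_partition`, and write each result with `write_csv` before uploading the two files.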

File size warning

When adding training data to a custom model, the training data can be subject to a frozen run to conserve RAM and CPU resources, limiting the file size of the training dataset to 1.5GB.
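A quick local check before uploading can catch an oversized file early. This is a minimal sketch, not part of DataRobot. The names are hypothetical, and it assumes that "1.5GB" means 1.5 × 10^9 bytes, which the documentation does not specify.

```python
import os

# Assumption: the 1.5GB limit is interpreted as 1.5 * 10**9 bytes.
MAX_TRAINING_BYTES = int(1.5 * 10**9)


def exceeds_size_limit(size_in_bytes, limit=MAX_TRAINING_BYTES):
    """Return True if a file of this size is larger than the limit."""
    return size_in_bytes > limit


def training_file_too_large(path):
    """Check the size of a file on disk against the limit."""
    return exceeds_size_limit(os.path.getsize(path))
```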

To convert your custom models to assign training data to a model version:

  1. In Model Registry > Custom Model Workshop, in the Models list, select the model you want to add training data to.

  2. To assign training data to a custom model's versions, you must convert the model. On the Assemble tab, locate the Training data for model versions alert and click Permanently convert:

    Training data assignment method conversion

    Converting a model's training data assignment method is a one-way action. It cannot be reverted. After conversion, you can't assign training data at the model level. This change applies to the UI and the API. If your organization has any automation depending on "per model" training data assignment, before you convert a model, you should update any related automation to support the new workflow. As an alternative, you can create a new custom model to convert to the "per version" training data assignment method and maintain the deprecated "per model" method on the model required for the automation; however, you should update your automation before the deprecation process is complete to avoid gaps in functionality.

    If the model was already assigned training data, after you convert the model, the Datasets section contains information about the existing training dataset.

  3. On the Assemble tab, next to Datasets:

    • If the model version doesn't have training data assigned, click Assign.

    • If the model version does have training data assigned, click the edit icon, and then, in the Change Training Data dialog box, click the delete icon to remove the existing training data.

  4. In the Add Training Data (or Change Training Data) dialog box, click and drag a training dataset file into the Training Data box, or click Choose file and do either of the following:

    • Click Local file, select a file from your local storage, and then click Open.

    • Click AI Catalog, select a training dataset you previously uploaded to DataRobot, and click Use this dataset.

    Include features required for scoring

    The columns in a custom model's training data indicate which features are included in scoring requests to the deployed custom model; therefore, once training data is available, any features not included in the training dataset aren't sent to the model. This requirement does not apply to predictions made while testing a custom model. Available as a preview feature, when you assemble a custom model in the NextGen experience, you can disable this behavior using the Column filtering setting.

  5. (Optional) Specify the name of the column containing the partitioning information for your data (based on training/validation/holdout partitioning). If you plan to deploy the custom model and monitor its data drift and accuracy, specify the holdout partition in the column to establish an accuracy baseline.

    Specify partition column

    You can track data drift and accuracy without specifying a partition column; however, in that scenario, DataRobot won't have baseline values. The selected partition column should include only the values T, V, or H (training, validation, or holdout).

  6. When the upload is complete, click Add Training Data.

    Training data assignment error

    If the training data assignment fails, an error message appears in the new custom model version under Datasets. While this error is active, you can't create a model package to deploy the affected version. To resolve the error and deploy the model package, reassign training data to create a new version, or create a new version and then assign training data.
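Because the partition column in step 5 should contain only T, V, or H, it can help to inspect the column locally before uploading. The sketch below uses only the Python standard library. The function name is hypothetical, and the check is an illustration, not DataRobot's own validation.

```python
import csv
from collections import Counter

# The values the partition column is expected to contain.
ALLOWED_PARTITION_VALUES = {"T", "V", "H"}


def summarize_partition_column(csv_file, partition_column):
    """Count partition values in an open CSV file and list unexpected ones.

    Returns (counts, unexpected, has_holdout), where `has_holdout` reports
    whether any row is marked H, which is needed for an accuracy baseline.
    """
    reader = csv.DictReader(csv_file)
    if reader.fieldnames is None or partition_column not in reader.fieldnames:
        raise KeyError(f"column not found: {partition_column}")
    counts = Counter(row[partition_column] for row in reader)
    unexpected = sorted(set(counts) - ALLOWED_PARTITION_VALUES)
    has_holdout = counts["H"] > 0
    return counts, unexpected, has_holdout
```

Pass a file opened with `open(path, newline="")`. A non-empty `unexpected` list or a false `has_holdout` suggests the column needs attention before you assign the dataset.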

Deprecation notice

Currently, you assign training data directly to a custom model, meaning every version of that model uses the same data; however, this assignment method is deprecated and scheduled for removal. It remains the default method during the deprecation period, even for newly created models, to support backward compatibility.

This deprecated workflow is scheduled for removal and should not be used:

  1. In Model Registry > Custom Model Workshop, in the Models list, select the model you want to add training data to.

  2. Click the Model Info tab and then click Add Training Data. Because this method is scheduled for removal, you should instead prepare to permanently convert the custom model.

    The Add Training Data dialog box appears, prompting you to upload training data.

  3. Click Choose file to upload training data.

    (Optional) You can specify the name of the column containing the partitioning information for your data (based on training/validation/holdout partitioning). If you plan to deploy the custom model and monitor its accuracy, specify the holdout partition in the column to establish an accuracy baseline. You can still track accuracy without specifying a partition column; however, there will be no accuracy baseline.

    When the upload is complete, click Add Training Data.

    Include features required for scoring

    The columns in a custom model's training data indicate which features are included in scoring requests to the deployed custom model; therefore, once training data is available, any features not included in the training dataset aren't sent to the model. This requirement does not apply to predictions made while testing a custom model. Available as a preview feature, when you assemble a custom model in the NextGen experience, you can disable this behavior using the Column filtering setting.
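To make the effect of the note above concrete, the following sketch mimics the described behavior: features absent from the training data are dropped from a scoring request. It illustrates the concept only and is not DataRobot's implementation. All names are hypothetical.

```python
def filter_scoring_row(row, training_columns):
    """Keep only the features that also appear in the training data."""
    allowed = set(training_columns)
    return {name: value for name, value in row.items() if name in allowed}


def dropped_features(row, training_columns):
    """List the features in a scoring row that the training data lacks."""
    return sorted(set(row) - set(training_columns))
```

Running `dropped_features` against a sample scoring row and the header of your training dataset shows which features the deployed model would not receive, so you can add them to the training data first.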


Updated March 22, 2024