Add training data to a custom model
To enable feature drift tracking for a model deployment, you must assign training data to the custom model version. For unstructured custom inference models, upload the training and holdout datasets separately; these datasets cannot include a partition column.
File size warning
The file size limit for custom model training data uploaded to DataRobot is 1.5GB.
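Checking the file size locally before uploading can save a failed upload. The following is a minimal sketch, not part of DataRobot: the helper name `is_within_size_limit` is hypothetical, and because the text above doesn't say whether "1.5GB" is decimal or binary, the sketch assumes the more conservative decimal reading (1.5 × 10^9 bytes).

```python
import os

# Assumption: the 1.5GB limit is interpreted as 1.5 * 10**9 bytes (decimal),
# the stricter of the two common readings. Adjust if your limit differs.
MAX_TRAINING_DATA_BYTES = 1_500_000_000


def is_within_size_limit(path, limit_bytes=MAX_TRAINING_DATA_BYTES):
    """Return True if the file at `path` is no larger than `limit_bytes`."""
    return os.path.getsize(path) <= limit_bytes
```

For example, `is_within_size_limit("training.csv")` returns `False` for a file larger than the assumed limit, which signals that the dataset should be reduced before upload.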
To assign training data to a custom model version:
- In Model Registry > Custom Model Workshop, in the Models list, select the model you want to add training data to.
- On the Assemble tab, next to Datasets:
  - If the model version doesn't have training data assigned, click Assign.
  - If the model version already has training data assigned, click the edit icon, and in the Change Training Data dialog box, click the delete icon to remove the existing training data.
- In the Add Training Data (or Change Training Data) dialog box, drag a training dataset file into the Training Data box, or click Choose file and do either of the following:
  - Click Local file, select a file from your local storage, and then click Open.
  - Click AI Catalog, select a training dataset you previously uploaded to DataRobot, and click Use this dataset.
  Include features required for scoring
  The columns in a custom model's training data determine which features are included in scoring requests to the deployed custom model. Once training data is available, any feature that isn't in the training dataset isn't sent to the model. When you assemble a custom model in the NextGen experience, you can disable this behavior with the Column filtering setting, which is available as a preview feature.
- (Optional) Specify the name of the column containing partitioning information for your data (based on training/validation/holdout partitioning). If you plan to deploy the custom model and monitor its data drift and accuracy, specify the holdout partition in the column to establish an accuracy baseline.
  Features requiring partition columns
  In the following situations, specifying a partition column is required:
  - To enable Prediction Explanations for a custom model.
    Only XEMP explanations are supported, and they require at least 100 non-duplicated rows in the validation set to compute. For more Prediction Explanation requirements, see the XEMP considerations.
  - To provide a baseline for drift and accuracy tracking.
    You can track data drift and accuracy without specifying a partition column, but in that scenario DataRobot won't have baseline values. The selected partition column should include only the values T, V, or H.
- When the upload is complete, click Add Training Data.
Training data assignment error
If the training data assignment fails, an error message appears in the new custom model version under Datasets. While this error is active, you can't create a model package to deploy the affected version. To resolve the error and deploy the model package, reassign training data to create a new version, or create a new version and then assign training data.
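The partition column rules described in the steps above can be checked locally before you assign a dataset. The following is a minimal sketch using only the Python standard library. It is not a DataRobot API: the function name `check_partition_column` is hypothetical, the input is assumed to be a CSV file with a header row, and T, V, and H are assumed to stand for the training, validation, and holdout partitions.

```python
import csv
from collections import Counter

# Values the partition column may contain, per the steps above.
# Assumption: T = training, V = validation, H = holdout.
ALLOWED_PARTITION_VALUES = {"T", "V", "H"}
# XEMP explanations need at least this many non-duplicated validation rows.
MIN_UNIQUE_VALIDATION_ROWS = 100


def check_partition_column(csv_path, partition_column):
    """Return a list of problems found in the partition column (empty if none)."""
    counts = Counter()
    unique_validation_rows = set()
    with open(csv_path, newline="") as handle:
        reader = csv.DictReader(handle)
        fieldnames = reader.fieldnames or []
        if partition_column not in fieldnames:
            return [f"column {partition_column!r} not found"]
        feature_names = [n for n in fieldnames if n != partition_column]
        for row in reader:
            value = row[partition_column]
            counts[value] += 1
            if value == "V":
                # Deduplicate validation rows on their feature values only.
                unique_validation_rows.add(tuple(row[n] for n in feature_names))

    problems = []
    unexpected = sorted(v for v in counts if v not in ALLOWED_PARTITION_VALUES)
    if unexpected:
        problems.append(f"unexpected partition values: {unexpected}")
    if counts["H"] == 0:
        problems.append("no holdout (H) rows, so no accuracy baseline")
    if len(unique_validation_rows) < MIN_UNIQUE_VALIDATION_ROWS:
        problems.append(
            f"only {len(unique_validation_rows)} non-duplicated validation (V) "
            f"rows; XEMP needs at least {MIN_UNIQUE_VALIDATION_ROWS}"
        )
    return problems
```

An empty list from `check_partition_column("training.csv", "partition")` means the column passed these local checks. It doesn't guarantee that the assignment succeeds, because DataRobot may apply further validation that this sketch doesn't cover.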