The model-metadata.yaml file is used to specify additional information about a custom task or a custom inference model, such as:
Supported input/output data types, which DataRobot uses when you compose a blueprint to validate whether a task's input/output requirements match those of the neighboring tasks.
The environment ID/model ID of a task or model when running drum push.
To define metadata, create a model-metadata.yaml file at the top level of the task/model directory, in the same folder as custom.py. In most cases the file can be omitted, but it is required for custom transform tasks that output non-numeric data.
The sections below show how to define metadata for custom models and tasks. For more information, you can review complete examples in the DRUM repository for custom models and tasks.
The following table describes options that are available to tasks and/or inference models. The parameters are required when using drum push to supply information about the model/task/version to create. Some of the parameters are also required outside of drum push for compatibility reasons.
Note
The modelID parameter adds a new version to a pre-existing custom model or task with the specified ID. Because of this, all options that configure a new base-level custom model or task are ignored when passed alongside this parameter. However, at this time, these parameters still must be included.
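As a minimal sketch of the note above, the following metadata adds a new version to an existing custom model rather than creating a new one. All names and IDs are hypothetical placeholders, and the top-level placement of modelID and majorVersion is an assumption to verify against the DRUM examples. The base-level options remain in the file even though they are ignored when modelID is set.

```yaml
# Hypothetical example: every name and ID below is a placeholder.
name: my-regression-model
type: inference
targetType: regression
environmentID: "<environment-id>"
# Link to the existing custom model; drum push then creates a new version.
modelID: "<existing-custom-model-id>"
# False requests a minor version update (for example, 2.3 -> 2.4).
majorVersion: False
inferenceModel:
  targetName: "<target-column>"
```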
| Option | When required | Task or inference model | Description |
|---|---|---|---|
| name | Always | Both | A string, preferably unique for easy searching, that drum push uses as the custom model title. |
| type | Always | Both | A string, either training (for custom tasks) or inference (for custom inference models). |
| environmentID | Always | Both | A hash of the execution environment to use while running your custom model or task. You can find a list of available execution environments in Model Registry > Custom Model Workshop > Environments. Expand the environment and click on the Environment Info tab to view and copy the file ID. Required for drum push only. |
| targetType | Always | Both | A string indicating the type of target. Must be one of: binary, regression, anomaly, unstructured (inference models only), multiclass, textgeneration (inference models only), transform (transform tasks only). |
| modelID | Optional | Both | After creating a model or task, it is best practice to use versioning to add code while iterating. To create a new version instead of a new model or task, use this field to link the custom model/task you created. The ID (hash) is available from the UI, via the URL of the custom model or task. Used with drum push only. |
| description | Optional | Both | A searchable field. If modelID is set, use the UI to change a model/task description. Used with drum push only. |
| majorVersion | Optional | Both | Specifies whether the model version you are creating should be a major (True, the default) or minor (False) version update. For example, if the previous model version is 2.3, a major version update would create version 3.0; a minor version update would create version 2.4. Used with drum push only. |
| targetName | Always | Model | In inferenceModel, a string indicating the column in your data that the model is predicting. |
| positiveClassLabel / negativeClassLabel | For binary classification models | Model | In inferenceModel, when your model predicts probability, the positiveClassLabel dictates what class the prediction corresponds to. |
| predictionThreshold | Optional (binary classification models only) | Model | In inferenceModel, the cutoff point between 0 and 1 that dictates which label will be chosen as the predicted label. |
| trainOnProject | Optional | Task | A hash with the ID of the project (PID) to train the model or version on. When using drum push to test and upload a custom estimator task, you have an option to train a single-task blueprint immediately after the estimator is successfully uploaded into DataRobot. The trainOnProject option specifies the project on which to train that blueprint. |
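To illustrate how the options in the table fit together, here is a minimal sketch of a model-metadata.yaml for a binary classification inference model. The model name, description, labels, and IDs are hypothetical placeholders. The model-specific options are nested under inferenceModel, following the "In inferenceModel" wording in the table.

```yaml
# Hypothetical example: every name, label, and ID below is a placeholder.
name: my-binary-classifier
type: inference
targetType: binary
environmentID: "<environment-id>"
description: Example binary classifier uploaded with drum push
inferenceModel:
  targetName: "<target-column>"
  # Quoted so that YAML 1.1 parsers read the labels as strings, not booleans.
  positiveClassLabel: "yes"
  negativeClassLabel: "no"
  # Cutoff between 0 and 1 used to choose the predicted label.
  predictionThreshold: 0.5
```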
The schema validation system, defined under the typeSchema field in model-metadata.yaml, specifies the expected input and output data requirements for a custom task. By including the optional input_requirements and output_requirements fields, you can specify exactly the kind of data a custom task expects or outputs. DataRobot displays the specified conditions in the blueprint editor to indicate whether the neighboring tasks match, and uses them during blueprint training to validate that the task's data format matches the conditions. Supported conditions include:
data type
data sparsity
number of columns
support of missing values
Note
Be aware that output_requirements are only supported for custom transform tasks and must be omitted for estimators.
The sections below describe allowed conditions and values. Unless noted otherwise, a single entry is all that is required for input and/or output requirements.
The data_types field specifies the data types that are expected, or those that are specifically disallowed. A single data type or a list is allowed for input_requirements; only a single data type is allowed for output_requirements.
Allowed values are NUM, TXT, IMG, DATE, CAT, DATE_DURATION, COUNT_DICT, and GEO.
The conditions used for data_types are:
EQUALS: All of the listed data types are required in the dataframe. Missing or unexpected types raise an error.
IN: All of the listed data types are supported, but not all are required to be present.
NOT_EQUALS: The input dataframe must not contain the specified data type.
NOT_IN: None of the listed data types are supported by the task.
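The conditions above might be expressed as follows. This is a sketch that assumes each requirement is written as a list entry with field, condition, and value keys, as in the DRUM repository examples; verify the key names against those examples before relying on them.

```yaml
typeSchema:
  input_requirements:
    # IN: numeric and categorical columns are both supported,
    # but neither is required to be present.
    - field: data_types
      condition: IN
      value:
        - NUM
        - CAT
```

Replacing IN with EQUALS in this entry would instead require both NUM and CAT to be present in the dataframe.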
The number_of_columns field specifies whether a specific minimum or maximum number of columns is required. The value should be a non-negative integer.
For time-consuming tasks, specifying a maximum number of columns can help keep performance reasonable. The number_of_columns field allows multiple entries to create ranges of allowed values. Some conditions only allow a single entry (see the example).
The conditions used for number_of_columns in a dataframe are:
EQUALS: The number of columns must exactly match the value. No additional conditions allowed.
IN: The number of columns must match one of several acceptable values, provided as a list in the value field. No additional conditions allowed.
NOT_EQUALS: The number of columns must not be the specified value.
GREATER_THAN: The number of columns must be greater than the value provided.
LESS_THAN: The number of columns must be less than the value provided.
NOT_GREATER_THAN: The number of columns must be less than or equal to the value provided.
NOT_LESS_THAN: The number of columns must be greater than or equal to the value provided.
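As a sketch of how a range could be built from two entries, the following requires at least 2 and fewer than 100 input columns. The bounds are arbitrary illustrative values, and the field, condition, and value key names are assumed to match the DRUM repository examples.

```yaml
typeSchema:
  input_requirements:
    # Lower bound: number of columns >= 2.
    - field: number_of_columns
      condition: NOT_LESS_THAN
      value: 2
    # Upper bound: number of columns < 100.
    - field: number_of_columns
      condition: LESS_THAN
      value: 100
```

EQUALS and IN cannot be combined in this way, because each allows only a single number_of_columns entry.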
The default output data type is NUM. If the defaults are not appropriate for the task, supply a schema in model-metadata.yaml; a schema is required for custom transform tasks that output non-numeric data.
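For instance, a custom transform task that emits categorical data might declare its output type as sketched below. The task name and environment ID are hypothetical placeholders, and the field, condition, and value key names are assumed to match the DRUM repository examples.

```yaml
# Hypothetical example: the name and ID below are placeholders.
name: my-categorical-transform
type: training
targetType: transform
environmentID: "<environment-id>"
typeSchema:
  input_requirements:
    - field: data_types
      condition: IN
      value:
        - NUM
        - CAT
  # output_requirements is supported for transform tasks only,
  # and accepts a single data type.
  output_requirements:
    - field: data_types
      condition: EQUALS
      value: CAT
```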
When you run drum fit or drum push, the full set of validations runs automatically. Verification first checks that the supplied typeSchema items meet the required format. Any format issues must be addressed before the task can be trained locally or on DataRobot. After format validation, the input dataset used for fit is compared against the supplied input_requirements specifications. After task training, the output of the task is compared to the output_requirements, and an error is reported if there is a mismatch.