# Model metadata and validation schema

> Model metadata and validation schema - How to use the model-metadata.yaml file to specify additional
> information about a custom task or a custom inference model.

This Markdown file sits beside the HTML page at the same path (with a `.md` suffix). It summarizes the topic and lists links for tools and LLM context.

Companion generated at `2026-05-01T23:10:48.111052+00:00` (UTC).

## Primary page

- [Model metadata and validation schema](https://docs.datarobot.com/en/docs/reference/pred-ai-ref/cml-ref/cml-validation.html): Full documentation for this topic (HTML).

## Sections on this page

- [General metadata parameters](https://docs.datarobot.com/en/docs/reference/pred-ai-ref/cml-ref/cml-validation.html#general-metadata-parameters): In-page section heading.
- [Inference model metadata (inferenceModel)](https://docs.datarobot.com/en/docs/reference/pred-ai-ref/cml-ref/cml-validation.html#inference-model-metadata): In-page section heading.
- [Validation schema and fields](https://docs.datarobot.com/en/docs/reference/pred-ai-ref/cml-ref/cml-validation.html#validation-schema-and-fields): In-page section heading.
- [data_types](https://docs.datarobot.com/en/docs/reference/pred-ai-ref/cml-ref/cml-validation.html#data-types): In-page section heading.
- [sparse](https://docs.datarobot.com/en/docs/reference/pred-ai-ref/cml-ref/cml-validation.html#sparse): In-page section heading.
- [number_of_columns](https://docs.datarobot.com/en/docs/reference/pred-ai-ref/cml-ref/cml-validation.html#number-of-columns): In-page section heading.
- [contains_missing](https://docs.datarobot.com/en/docs/reference/pred-ai-ref/cml-ref/cml-validation.html#contains-missing): In-page section heading.
- [Default schema](https://docs.datarobot.com/en/docs/reference/pred-ai-ref/cml-ref/cml-validation.html#default-schema): In-page section heading.
- [Running checks locally](https://docs.datarobot.com/en/docs/reference/pred-ai-ref/cml-ref/cml-validation.html#running-checks-locally): In-page section heading.
- [Ignore validation](https://docs.datarobot.com/en/docs/reference/pred-ai-ref/cml-ref/cml-validation.html#ignore-validation): In-page section heading.

## Related documentation

- [Reference documentation](https://docs.datarobot.com/en/docs/reference/index.html): Linked from this page.
- [Predictive AI reference](https://docs.datarobot.com/en/docs/reference/pred-ai-ref/index.html): Linked from this page.
- [Composable ML reference](https://docs.datarobot.com/en/docs/reference/pred-ai-ref/cml-ref/index.html): Linked from this page.

## Documentation content

# Model metadata and validation schema

The `model-metadata.yaml` file is used to specify additional information about a custom task or a custom inference model, such as:

- Supported input/output data types that validate, when composing a blueprint, whether a task's input/output requirements match the neighboring tasks.
- The environment ID/model ID of a task or model when running `drum push`.

To define metadata, create a `model-metadata.yaml` file in the top level of the task or model directory, the same folder that contains `custom.py`. In most cases the file is optional, but it is required for custom transform tasks that output non-numeric data.

The sections below show how to define metadata for custom models and tasks. For more information, you can review complete examples in the DRUM repository for [custom models](https://github.com/datarobot/datarobot-user-models/blob/master/model_templates/python3_sklearn/model-metadata.yaml) and [tasks](https://github.com/datarobot/datarobot-user-models/blob/master/task_templates/1_transforms/1_python_missing_values/model-metadata.yaml).

## General metadata parameters

The following table describes options that are available to tasks and/or inference models. The parameters are required when using `drum push` to supply information about the model/task/version to create. Some of the parameters are also required outside of `drum push` for compatibility reasons.

> [!NOTE] Note
> The `modelID` parameter adds a new version to a pre-existing custom model or task with the specified ID. Because of this, all options that configure a new base-level custom model or task are ignored when passed alongside this parameter. However, at this time, these parameters still must be included.

| Option | When required | Task or inference model | Description |
| --- | --- | --- | --- |
| name | Always | Both | A string, preferably unique for easy searching, that drum push uses as the custom model title. |
| type | Always | Both | A string, either training (for custom tasks) or inference (for custom inference models). |
| environmentID | Always | Both | A hash of the execution environment to use while running your custom model or task. You can find a list of available execution environments in Model Registry > Custom Model Workshop > Environments. Expand the environment and click on the Environment Info tab to view and copy the file ID. Required for drum push only. |
| targetType | Always | Both | A string indicating the type of target. Must be one of: binary, regression, anomaly, unstructured (inference models only), multiclass, textgeneration (inference models only), agenticworkflow (inference models only), transform (transform tasks only). |
| modelID | Optional | Both | After creating a model or task, it is best practice to use versioning to add code while iterating. To create a new version instead of a new model or task, use this field to link the custom model/task you created. The ID (hash) is available from the UI, via the URL of the custom model or task. Used with drum push only. |
| description | Optional | Both | A searchable field. If modelID is set, use the UI to change a model/task description. Used with drum push only. |
| majorVersion | Optional | Both | Specifies whether the model version you are creating should be a major (True, the default) or minor (False) version update. For example, if the previous model version is 2.3, a major version update would create version 3.0; a minor version update would create version 2.4. Used for drum push only. |
| targetName | For binary and multiclass (in inferenceModel) | Model | In inferenceModel, the name of the column the model predicts. For multiclass, use the same name as Target name in the Workshop and the same order of classes as Target classes for classLabels. |
| positiveClassLabel / negativeClassLabel | For binary classification models | Model | In inferenceModel, when your model predicts probability, the positiveClassLabel dictates what class the prediction corresponds to. |
| classLabels | For multiclass classification models | Model | In inferenceModel, a list of class names (strings). The list order must match the order of predicted class probabilities your model returns (for example, the column order of probability outputs). Use the same labels as the Target classes you configure for the custom model in the Workshop. |
| predictionThreshold | Optional (binary classification models only). | Model | In inferenceModel, the cutoff point between 0 and 1 that dictates which label will be chosen as the predicted label. |
| trainOnProject | Optional | Task | A hash with the ID of the project (PID) to train the model or version on. When using drum push to test and upload a custom estimator task, you have an option to train a single-task blueprint immediately after the estimator is successfully uploaded into DataRobot. The trainOnProject option specifies the project on which to train that blueprint. |
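
As an illustration of the options in the table above, the following is a minimal sketch of a `model-metadata.yaml` for a custom inference model. The model name, description, and both IDs are hypothetical placeholders, not values from DataRobot; replace them with your own.

```yaml
# model-metadata.yaml for a hypothetical custom inference model.
name: example-regression-model        # used by drum push as the custom model title
type: inference                       # "training" would mark a custom task instead
targetType: regression
environmentID: "YOUR_ENVIRONMENT_ID"  # placeholder: copy the ID from the Environment Info tab
description: Example regression model uploaded with drum push.
majorVersion: true                    # set to false for a minor version update
# modelID: "YOUR_MODEL_ID"            # placeholder: uncomment to add a version to an existing model
```

Leaving `modelID` commented out corresponds to creating a new custom model; setting it corresponds to adding a version to an existing one, as described in the note above.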

## Inference model metadata (inferenceModel)

For structured inference models, target and class-label settings belong under the top-level key `inferenceModel` in `model-metadata.yaml`. If you omit fields that DataRobot or DRUM require for your `targetType`, builds, tests, or deployments can fail.

| targetType | Required under inferenceModel | Notes |
| --- | --- | --- |
| binary | targetName, positiveClassLabel, negativeClassLabel | Optional: predictionThreshold. |
| multiclass | targetName, classLabels | classLabels is a YAML list of class names in the same order as your model’s probability outputs. |
| regression | (often none) | Many regression templates work without an inferenceModel block; follow your environment and DRUM requirements. |
| anomaly, unstructured, textgeneration, … | Follow template / DRUM | See examples for your target type. |
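
The binary row of the table above could look like the following sketch. The model name, target column, class labels, and environment ID are assumptions made for illustration only.

```yaml
# Hypothetical binary classification model; all names and labels are placeholders.
name: example-binary-model
type: inference
targetType: binary
environmentID: "YOUR_ENVIRONMENT_ID"
inferenceModel:
  targetName: churned
  # Quoted so that YAML 1.1 parsers read them as strings rather than booleans.
  positiveClassLabel: "yes"
  negativeClassLabel: "no"
  predictionThreshold: 0.5   # optional cutoff between 0 and 1
```

Quoting labels such as `yes` and `no` is a defensive choice: some YAML parsers interpret the unquoted forms as boolean values.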

Workshop-generated file: On the Registry Workshop Assemble tab, Create model-metadata.yaml produces a starter file for your model’s target type. For multiclass, that file includes `inferenceModel` with `targetName` and `classLabels` (aligned with your Target classes), matching what you need for a successful deployment.
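
For multiclass models, an `inferenceModel` block might resemble the sketch below. The target name and class names are hypothetical; in practice they should match the Target name and Target classes configured in the Workshop, in the same order as the model's probability outputs.

```yaml
# Hypothetical multiclass model; target and class names are placeholders.
name: example-multiclass-model
type: inference
targetType: multiclass
environmentID: "YOUR_ENVIRONMENT_ID"
inferenceModel:
  targetName: species
  classLabels:        # order must match the model's predicted probability columns
    - setosa
    - versicolor
    - virginica
```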

## Validation schema and fields

The schema validation system, defined under the `typeSchema` field in `model-metadata.yaml`, specifies the input and output data requirements for a custom task. By including the optional `input_requirements` and `output_requirements` fields, you can specify exactly what kind of data a custom task expects or outputs. DataRobot displays the specified conditions in the blueprint editor to indicate whether neighboring tasks match, and uses them during blueprint training to validate that the task's data format matches the conditions. Supported conditions include:

- data type
- data sparsity
- number of columns
- support of missing values

> [!NOTE] Note
> Be aware that `output_requirements` are only supported for custom transform tasks and must be omitted for estimators.

The sections below describe allowed conditions and values. Unless noted otherwise, a single entry is all that is required for input and/or output requirements.
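
As a rough sketch of the overall shape, an estimator task might declare only `input_requirements`, since `output_requirements` must be omitted for estimators. The task name and the specific conditions chosen here are assumptions for illustration.

```yaml
# Hypothetical custom estimator task: input requirements only.
name: example-regression-estimator
type: training
targetType: regression
typeSchema:
  input_requirements:
    - field: data_types
      condition: IN
      value:
        - NUM
    - field: contains_missing
      condition: EQUALS
      value: FORBIDDEN
  # No output_requirements: they are only supported for custom transform tasks.
```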

### data_types

The `data_types` field specifies the data types that are expected, or those that are specifically disallowed. A single data type or a list is allowed for `input_requirements`; only a single data type is allowed for `output_requirements`.

Allowed values are NUM, TXT, IMG, DATE, CAT, DATE_DURATION, COUNT_DICT, and GEO.

The conditions used for `data_types` are:

- EQUALS: All of the listed data types are required in the dataframe. Missing or unexpected types raise an error.
- IN: All of the listed data types are supported, but not all are required to be present.
- NOT_EQUALS: The data type for the input dataframe may not be this value.
- NOT_IN: None of the listed data types are supported by the task.
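
To make the conditions above concrete, here is a minimal sketch of a `typeSchema` fragment for a hypothetical transform task that rejects image and geospatial input and produces numeric output. The choice of excluded types is an assumption for illustration.

```yaml
typeSchema:
  input_requirements:
    # The task does not support image or geospatial columns.
    - field: data_types
      condition: NOT_IN
      value:
        - IMG
        - GEO
  output_requirements:
    # Only a single data type is allowed for output.
    - field: data_types
      condition: EQUALS
      value: NUM
```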

### sparse

The `sparse` field defines whether the task supports sparse data as an input or if the task can create output data that is in a sparse format.

- A condition of EQUALS must always be included in sparsity specifications.

For input, the following values apply:

- FORBIDDEN: The task cannot handle a sparse matrix format, and will fail if one is provided.
- SUPPORTED: The task must support both a dense dataframe and a sparse dataframe in CSR format. Either could be passed in from preceding tasks.
- REQUIRED: This task only supports a sparse matrix as input and cannot use a dense matrix. DRUM will load the matrix into a sparse dataframe.

For task output, the following values apply:

- NEVER: The task can never output a sparse dataframe.
- DYNAMIC: The task can output either a dense or sparse matrix.
- ALWAYS: The task will always output a sparse matrix.
- IDENTITY: The task can output either a sparse or dense matrix, and the sparsity will match the input matrix.
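
The following fragment sketches how the sparsity values above might be combined for a hypothetical transform task that accepts either format and preserves whichever one it receives.

```yaml
typeSchema:
  input_requirements:
    # Accepts both dense dataframes and sparse dataframes in CSR format.
    - field: sparse
      condition: EQUALS
      value: SUPPORTED
  output_requirements:
    # Output sparsity matches the sparsity of the input.
    - field: sparse
      condition: EQUALS
      value: IDENTITY
```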

### number_of_columns

The `number_of_columns` field specifies whether a specific minimum or maximum number of columns is required. The value should be a non-negative integer.

For time-consuming tasks, specifying a maximum number of columns can help keep performance reasonable. The `number_of_columns` field allows multiple entries to create ranges of allowed values. Some conditions only allow a single entry (see the [example](https://docs.datarobot.com/en/docs/reference/pred-ai-ref/cml-ref/cml-validation.html#typeSchema-example)).

The conditions used for `number_of_columns` in a dataframe are:

- EQUALS: The number of columns must exactly match the value. No additional conditions allowed.
- IN: Multiple acceptable values are allowed. The values are provided as a list in the value field. No additional conditions allowed.
- NOT_EQUALS: The number of columns must not be the specified value.
- GREATER_THAN: The number of columns must be greater than the value provided.
- LESS_THAN: The number of columns must be less than the value provided.
- NOT_GREATER_THAN: The number of columns must be less than or equal to the value provided.
- NOT_LESS_THAN: The number of columns must be greater than or equal to the value provided.

The value must be a non-negative integer.
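
As a sketch of how multiple entries can form a range, the fragment below accepts between 2 and 500 columns inclusive. The bounds are arbitrary values chosen for illustration.

```yaml
typeSchema:
  input_requirements:
    # Lower bound: at least 2 columns.
    - field: number_of_columns
      condition: NOT_LESS_THAN
      value: 2
    # Upper bound: at most 500 columns.
    - field: number_of_columns
      condition: NOT_GREATER_THAN
      value: 500
```

A condition such as `EQUALS` or `IN` would instead stand alone as the only `number_of_columns` entry, because those conditions do not allow additional conditions.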

### contains_missing

The `contains_missing` field specifies whether a task can accept missing data or whether a task can output missing values.

- A condition of EQUALS must always be used.

For input, the following values apply to the input dataframe:

- FORBIDDEN: The task cannot accept missing values/NA.
- SUPPORTED: The task is capable of dealing with missing values.

For task output, the following values apply:

- NEVER: The task can never output missing values.
- DYNAMIC: The task can output missing values.
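
For example, a hypothetical imputation transform that accepts missing values and guarantees none in its output could be sketched as:

```yaml
typeSchema:
  input_requirements:
    # The task can handle missing values in its input.
    - field: contains_missing
      condition: EQUALS
      value: SUPPORTED
  output_requirements:
    # The task never emits missing values.
    - field: contains_missing
      condition: EQUALS
      value: NEVER
```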

### Default schema

When a schema isn't supplied for a task, DataRobot uses the default schema, which allows [sparse data](https://docs.datarobot.com/en/docs/reference/pred-ai-ref/cml-ref/cml-validation.html#sparse-data) and [missing values](https://docs.datarobot.com/en/docs/reference/pred-ai-ref/cml-ref/cml-validation.html#contains-missing) in the input. By default:

```
name: default-transform-model-metadata
type: training
targetType: transform
typeSchema:
  input_requirements:
    - field: data_types
      condition: IN
      value:
        - NUM
        - CAT
        - TXT
        - DATE
        - DATE_DURATION
    - field: sparse
      condition: EQUALS
      value: SUPPORTED
    - field: contains_missing
      condition: EQUALS
      value: SUPPORTED

  output_requirements:
    - field: data_types
      condition: EQUALS
      value: NUM
    - field: sparse
      condition: EQUALS
      value: DYNAMIC
    - field: contains_missing
      condition: EQUALS
      value: DYNAMIC
```

The default output data type is NUM. If any of these values are not appropriate for the task, a schema must be supplied in `model-metadata.yaml` (which is required for custom transform tasks that output non-numeric data).
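
As a sketch of supplying a schema when the defaults do not fit, the file below describes a hypothetical transform that takes text and outputs categorical data. The task name and the chosen values are assumptions. Whether omitted fields fall back to the default values is not stated here, so the sketch lists every field explicitly.

```yaml
# Hypothetical transform task with non-numeric output.
name: example-text-to-category-transform
type: training
targetType: transform
typeSchema:
  input_requirements:
    - field: data_types
      condition: EQUALS
      value: TXT
    - field: sparse
      condition: EQUALS
      value: FORBIDDEN
    - field: contains_missing
      condition: EQUALS
      value: SUPPORTED
  output_requirements:
    - field: data_types
      condition: EQUALS
      value: CAT          # non-numeric output, so the default schema does not apply
    - field: sparse
      condition: EQUALS
      value: NEVER
    - field: contains_missing
      condition: EQUALS
      value: NEVER
```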

### Running checks locally

When you run `drum fit` or `drum push`, the full set of validation checks runs automatically. Verification first checks that the supplied `typeSchema` items meet the required format; any format issues must be addressed before the task can be trained locally or on DataRobot. After format validation, the input dataset used for fit is compared against the supplied `input_requirements`. After task training, the task's output is compared against the `output_requirements`, and an error is reported if there is a mismatch.

#### Ignore validation

During task development, it might be useful to disable validation. To ignore errors, use the following with `drum fit` or `drum push`:

```
--disable-strict-validation
```