# Configure data sources

> Configure data sources - The DataRobotSource and BatchDataRobotSource methods connect to DataRobot
> to fetch selected data from the DataRobot platform. The DataFrameSource method wraps any
> pd.DataFrame to create a library-compatible source.

This Markdown file sits beside the HTML page at the same path (with a `.md` suffix). It summarizes the topic and lists links for tools and LLM context.

Companion generated at `2026-04-24T16:03:56.254688+00:00` (UTC).

## Primary page

- [Configure data sources](https://docs.datarobot.com/en/docs/api/code-first-tools/dr-model-metrics/dmm-data-sources.html): Full documentation for this topic (HTML).

## Sections on this page

- [Configure a DataRobot source](https://docs.datarobot.com/en/docs/api/code-first-tools/dr-model-metrics/dmm-data-sources.html#configure-a-datarobot-source): In-page section heading.
- [Export prediction data](https://docs.datarobot.com/en/docs/api/code-first-tools/dr-model-metrics/dmm-data-sources.html#export-prediction-data): In-page section heading.
- [Export actuals data](https://docs.datarobot.com/en/docs/api/code-first-tools/dr-model-metrics/dmm-data-sources.html#export-actuals-data): In-page section heading.
- [Export training data](https://docs.datarobot.com/en/docs/api/code-first-tools/dr-model-metrics/dmm-data-sources.html#export-training-data): In-page section heading.
- [Export combined data](https://docs.datarobot.com/en/docs/api/code-first-tools/dr-model-metrics/dmm-data-sources.html#export-combined-data): In-page section heading.
- [Configure a DataRobot batch deployment source](https://docs.datarobot.com/en/docs/api/code-first-tools/dr-model-metrics/dmm-data-sources.html#configure-a-datarobot-batch-deployment-source): In-page section heading.
- [Configure a DataFrame source](https://docs.datarobot.com/en/docs/api/code-first-tools/dr-model-metrics/dmm-data-sources.html#configure-a-dataframe-source): In-page section heading.
- [Set the TimeBucket](https://docs.datarobot.com/en/docs/api/code-first-tools/dr-model-metrics/dmm-data-sources.html#set-the-timebucket): In-page section heading.
- [Provide additional DataRobot deployment properties](https://docs.datarobot.com/en/docs/api/code-first-tools/dr-model-metrics/dmm-data-sources.html#provide-additional-datarobot-deployment-properties): In-page section heading.

## Related documentation

- [Developer documentation](https://docs.datarobot.com/en/docs/api/index.html): Linked from this page.
- [Code-first tools](https://docs.datarobot.com/en/docs/api/code-first-tools/index.html): Linked from this page.
- [Model Metrics](https://docs.datarobot.com/en/docs/api/code-first-tools/dr-model-metrics/index.html): Linked from this page.
- [API keys and tools](https://docs.datarobot.com/en/docs/platform/acct-settings/api-key-mgmt.html#api-key-management): Linked from this page.
- [Metric Evaluator](https://docs.datarobot.com/en/docs/api/code-first-tools/dr-model-metrics/dmm-metric-evaluator.html): Linked from this page.

## Documentation content

# Configure data sources

The most commonly used data source is `DataRobotSource`, which connects to DataRobot to fetch selected prediction data from the DataRobot platform. In total, three default data sources are available: `DataRobotSource`, `BatchDataRobotSource`, and `DataFrameSource`.

> [!WARNING] Time series support
> The [DataRobot Model Metrics (DMM)](https://docs.datarobot.com/en/docs/api/code-first-tools/dr-model-metrics/index.html) library does not support time series models, specifically [data export](https://docs.datarobot.com/en/docs/api/code-first-tools/dr-model-metrics/dmm-data-sources.html#export-prediction-data) for time series models. To export and retrieve data, use the [DataRobot API client](https://datarobot-public-api-client.readthedocs-hosted.com/en/latest-release/reference/mlops/data_exports.html).

## Configure a DataRobot source

`DataRobotSource` connects to DataRobot to fetch selected prediction data from the DataRobot platform. Initialize `DataRobotSource` with the following mandatory parameters:

```
from dmm.data_source import DataRobotSource

source = DataRobotSource(
    base_url=DATAROBOT_ENDPOINT,
    token=DATAROBOT_API_TOKEN,
    deployment_id=deployment_id,
    start=start_of_export_window,
    end=end_of_export_window,
)
```

You can also provide the `base_url` and `token` parameters as environment variables, `os.environ['DATAROBOT_ENDPOINT']` and `os.environ['DATAROBOT_API_TOKEN']`:

```
from dmm.data_source import DataRobotSource

source = DataRobotSource(
    deployment_id=deployment_id,
    start=start_of_export_window,
    end=end_of_export_window,
)
```

The following example initializes `DataRobotSource` with all parameters:

```
from dmm.data_source import DataRobotSource

source = DataRobotSource(
    base_url=DATAROBOT_ENDPOINT,
    token=DATAROBOT_API_TOKEN,
    client=None,
    deployment_id=deployment_id,
    model_id=model_id,
    start=start_of_export_window,
    end=end_of_export_window,
    max_rows=10000,
    delete_exports=False,
    use_cache=False,
    actuals_with_matched_predictions=True,
)
```

| Parameter | Description |
| --- | --- |
| `base_url: str` | The DataRobot API URL; for example, `https://app.datarobot.com/api/v2`. |
| `token: str` | A DataRobot API token from API keys and tools. |
| `client: Optional[DataRobotClient]` | A `DataRobotClient` object to use instead of `base_url` and `token`. |
| `deployment_id: str` | The ID of the deployment evaluated by the custom metric. |
| `model_id: Optional[str]` | The ID of the model evaluated by the custom metric. If you don't specify a model ID, the champion model's ID is used. |
| `start: datetime` | The start of the export window; the date you want to start retrieving data from. |
| `end: datetime` | The end of the export window; the date you want to retrieve data until. |
| `max_rows: Optional[int]` | The maximum number of rows to fetch at once when the requested data doesn't fit into memory. |
| `delete_exports: Optional[bool]` | Whether to automatically delete datasets with exported data created in the AI Catalog. `True` configures them for deletion; the default is `False`. |
| `use_cache: Optional[bool]` | Whether to reuse datasets stored in the AI Catalog for time ranges included in previous exports. `True` reuses those datasets; the default is `False`. |
| `actuals_with_matched_predictions: Optional[bool]` | Whether to allow actuals export without matched predictions. `False` does not allow unmatched export; the default is `True`. |

### Export prediction data

The `get_prediction_data` method returns a chunk of prediction data with the appropriate chunk ID; the returned data chunk is a pandas DataFrame with the number of rows respecting the `max_rows` parameter. This method returns data until the data source is exhausted.

```
prediction_df_1, prediction_chunk_id_1 = source.get_prediction_data()

print(prediction_df_1.head(5).to_string())
print(f"chunk id: {prediction_chunk_id_1}")

   DR_RESERVED_PREDICTION_TIMESTAMP  DR_RESERVED_PREDICTION_VALUE_high  DR_RESERVED_PREDICTION_VALUE_low date_non_unique date_random  id       年月日
0  2023-09-13 11:02:51.248000+00:00                           0.697782                          0.302218      1950-10-01  1949-01-27   1  1949-01-01
1  2023-09-13 11:02:51.252000+00:00                           0.581351                          0.418649      1959-04-01  1949-02-03   2  1949-02-01
2  2023-09-13 11:02:51.459000+00:00                           0.639347                          0.360653      1954-05-01  1949-03-28   3  1949-03-01
3  2023-09-13 11:02:51.459000+00:00                           0.627727                          0.372273      1951-09-01  1949-04-07   4  1949-04-01
4  2023-09-13 11:02:51.664000+00:00                           0.591612                          0.408388      1951-03-01  1949-05-16   5  1949-05-01
chunk id: 0
```

When the data source is exhausted, `None` and `-1` are returned:

```
prediction_df_2, prediction_chunk_id_2 = source.get_prediction_data()

print(prediction_df_2)
print(f"chunk id: {prediction_chunk_id_2}")

None
chunk id: -1
```
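
The `(None, -1)` sentinel makes it straightforward to drain a source in a loop. The following self-contained sketch demonstrates the pattern using a hypothetical stand-in object (`FakeSource`) rather than a live `DataRobotSource`; its behavior mirrors the outputs shown above:

```
class FakeSource:
    """Hypothetical stand-in that mimics the chunked export contract:
    each call returns (chunk, chunk_id) until exhaustion, then (None, -1)."""

    def __init__(self, chunks):
        self._chunks = chunks
        self._next = 0

    def get_prediction_data(self):
        if self._next >= len(self._chunks):
            return None, -1  # exhausted, like DataRobotSource
        chunk_id = self._next
        self._next += 1
        return self._chunks[chunk_id], chunk_id


source = FakeSource(chunks=[["row0", "row1"], ["row2"]])
collected = []
while True:
    df, chunk_id = source.get_prediction_data()
    if chunk_id == -1:
        break
    collected.append((chunk_id, df))
# collected: [(0, ["row0", "row1"]), (1, ["row2"])]
```

With a real `DataRobotSource`, the loop body would process each pandas DataFrame chunk before requesting the next one.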

The `reset` method resets the exhausted data source, allowing it to iterate from the beginning:

```
source.reset()
```

The `get_all_prediction_data` method returns all prediction data available for a data source object in a single DataFrame:

```
prediction_df = source.get_all_prediction_data()
```

### Export actuals data

The `get_actuals_data` method returns a chunk of actuals data with the appropriate chunk ID; the returned data chunk is a pandas DataFrame with the number of rows respecting the `max_rows` parameter. This method returns data until the data source is exhausted.

```
actuals_df_1, actuals_chunk_id_1 = source.get_actuals_data()

print(actuals_df_1.head(5).to_string())
print(f"chunk id: {actuals_chunk_id_1}")

     association_id                  timestamp label  actuals  predictions predicted_class
0                 1  2023-09-13 11:00:00+00:00   low        0     0.302218            high
194              57  2023-09-13 11:00:00+00:00   low        1     0.568564             low
192              56  2023-09-13 11:00:00+00:00   low        1     0.569865             low
190              55  2023-09-13 11:00:00+00:00   low        0     0.473282            high
196              58  2023-09-13 11:00:00+00:00   low        1     0.573861             low
chunk id: 0
```

To return raw data with the original column names, set the `return_original_column_names` parameter to `True`:

```
actuals_df_1, actuals_chunk_id_1 = source.get_actuals_data(return_original_column_names=True)

print(actuals_df_1.head(5).to_string())
print(f"chunk id: {actuals_chunk_id_1}")

     id                  timestamp label  actuals         y predicted_class
0     1  2023-09-13 11:00:00+00:00   low        0  0.302218            high
194  57  2023-09-13 11:00:00+00:00   low        1  0.568564             low
192  56  2023-09-13 11:00:00+00:00   low        1  0.569865             low
190  55  2023-09-13 11:00:00+00:00   low        0  0.473282            high
196  58  2023-09-13 11:00:00+00:00   low        1  0.573861             low
chunk id: 0
```

To return all actuals data available for a source object in a single DataFrame, use the `get_all_actuals_data` method:

```
actuals_df = source.get_all_actuals_data()
```

When the data source is exhausted, `None` and `-1` are returned:

```
actuals_df_2, actuals_chunk_id_2 = source.get_actuals_data()

print(actuals_df_2)
print(f"chunk id: {actuals_chunk_id_2}")

None
chunk id: -1
```

The `reset` method resets the exhausted data source, allowing it to iterate from the beginning:

```
source.reset()
```

### Export training data

The `get_training_data` method returns all data used for training in one call. The returned data is a pandas DataFrame:

```
train_df = source.get_training_data()
print(train_df.head(5).to_string())

      y date_random date_non_unique       年月日
0  high  1949-01-27      1950-10-01  1949-01-01
1  high  1949-02-03      1959-04-01  1949-02-01
2   low  1949-03-28      1954-05-01  1949-03-01
3  high  1949-04-07      1951-09-01  1949-04-01
4  high  1949-05-16      1951-03-01  1949-05-01
```

### Export combined data

The `get_data` method returns `combined_data`, which includes merged scoring data, predictions, and matched actuals. The [Metric Evaluator](https://docs.datarobot.com/en/docs/api/code-first-tools/dr-model-metrics/dmm-metric-evaluator.html) uses this method as its main data export method.

```
df, chunk_id_1 = source.get_data()
print(df.head(5).to_string())
print(f"chunk id: {chunk_id_1}")

                          timestamp  predictions date_non_unique date_random  association_id       年月日 predicted_class label  actuals
0  2023-09-13 11:02:51.248000+00:00     0.302218      1950-10-01  1949-01-27               1  1949-01-01            high   low        0
1  2023-09-13 11:02:51.252000+00:00     0.418649      1959-04-01  1949-02-03               2  1949-02-01            high   low        0
2  2023-09-13 11:02:51.459000+00:00     0.360653      1954-05-01  1949-03-28               3  1949-03-01            high   low        1
3  2023-09-13 11:02:51.459000+00:00     0.372273      1951-09-01  1949-04-07               4  1949-04-01            high   low        0
4  2023-09-13 11:02:51.664000+00:00     0.408388      1951-03-01  1949-05-16               5  1949-05-01            high   low        0
chunk id: 0
```

The `get_all_data` method returns all combined data available for that source object in a single DataFrame:

```
df = source.get_all_data()
```

## Configure a DataRobot batch deployment source

The `BatchDataRobotSource` interface is for batch deployments. The following example initializes `BatchDataRobotSource` with all parameters:

```
from dmm.data_source import BatchDataRobotSource

source = BatchDataRobotSource(
    base_url=DATAROBOT_ENDPOINT,
    token=DATAROBOT_API_TOKEN,
    client=None,
    deployment_id=deployment_id,
    model_id=model_id,
    batch_ids=batch_ids,
    max_rows=10000,
    delete_exports=False,
    use_cache=False,
)
```

The parameters for this source are analogous to those for `DataRobotSource`. The most important difference is that you must provide batch IDs instead of a time range (`start` and `end`). In addition, a batch source doesn't support actuals export.

The `get_prediction_data` method returns a chunk of prediction data with the appropriate chunk ID; the returned data chunk is a pandas DataFrame with the number of rows respecting the `max_rows` parameter. This method returns data until the data source is exhausted.

```
prediction_df_1, prediction_chunk_id_1 = source.get_prediction_data()
print(prediction_df_1.head(5).to_string())
print(f"chunk id: {prediction_chunk_id_1}")

    AGE       B  CHAS     CRIM     DIS                  batch_id    DR_RESERVED_BATCH_NAME                         timestamp   INDUS  LSTAT  MEDV    NOX  PTRATIO  RAD     RM  TAX    ZN  id
0  65.2  396.90     0  0.00632  4.0900                <batch_id>                    batch1  2023-06-23 09:47:47.060000+00:00    2.31   4.98  24.0  0.538     15.3    1  6.575  296  18.0   1
1  78.9  396.90     0  0.02731  4.9671                <batch_id>                    batch1  2023-06-23 09:47:47.060000+00:00    7.07   9.14  21.6  0.469     17.8    2  6.421  242   0.0   2
2  61.1  392.83     0  0.02729  4.9671                <batch_id>                    batch1  2023-06-23 09:47:47.060000+00:00    7.07   4.03  34.7  0.469     17.8    2  7.185  242   0.0   3
3  45.8  394.63     0  0.03237  6.0622                <batch_id>                    batch1  2023-06-23 09:47:47.060000+00:00    2.18   2.94  33.4  0.458     18.7    3  6.998  222   0.0   4
4  54.2  396.90     0  0.06905  6.0622                <batch_id>                    batch1  2023-06-23 09:47:47.060000+00:00    2.18   5.33  36.2  0.458     18.7    3  7.147  222   0.0   5
chunk id: 0

prediction_df = source.get_all_prediction_data()

source.reset()

df, chunk_id_1 = source.get_data()
```

The `get_training_data` method returns all data used for training in one call. The returned data is a pandas DataFrame:

```
train_df = source.get_training_data()
```

## Configure a DataFrame source

If you aren't exporting data directly from DataRobot and instead have the data available locally (for example, as a CSV file), you can load the dataset into `DataFrameSource`. The `DataFrameSource` method wraps any `pd.DataFrame` to create a library-compatible source. This is the easiest way to interact with the library when bringing your own data:

```
import pandas as pd

from dmm.data_source import DataFrameSource

source = DataFrameSource(
    df=pd.read_csv("./data_hour_of_week.csv"),
    max_rows=10000,
    timestamp_col="date",
)

df, chunk_id_1 = source.get_data()
print(df.head(5).to_string())
print(f"chunk id: {chunk_id_1}")

                  date         y
0  1959-12-31 23:59:57 -0.183669
1  1960-01-01 01:00:02  0.283993
2  1960-01-01 01:59:52  0.020663
3  1960-01-01 03:00:14  0.404304
4  1960-01-01 03:59:58  1.005252
chunk id: 0
```

In addition, you can create new data source definitions. To define a new data source, implement the `DataSourceBase` interface.
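
As a rough illustration of what a custom source involves, the sketch below implements the chunking contract shown throughout this page (`get_data` returning `(chunk, chunk_id)`, `reset`, and the `(None, -1)` exhaustion sentinel) over an in-memory list of rows. The method names mirror the examples above, but the exact `DataSourceBase` abstract methods and their signatures are assumptions; consult the library source before relying on them:

```
class ListSource:
    """Hypothetical custom source over in-memory rows; a real implementation
    would subclass dmm's DataSourceBase and return pandas DataFrames."""

    def __init__(self, rows, max_rows=10000):
        self._rows = rows
        self._max_rows = max_rows
        self._pos = 0
        self._chunk_id = 0

    def get_data(self):
        if self._pos >= len(self._rows):
            return None, -1  # exhausted
        chunk = self._rows[self._pos : self._pos + self._max_rows]
        chunk_id = self._chunk_id
        self._pos += self._max_rows
        self._chunk_id += 1
        return chunk, chunk_id

    def reset(self):
        # Allow iterating from the beginning again
        self._pos = 0
        self._chunk_id = 0


source = ListSource(rows=list(range(5)), max_rows=2)
first_chunk, first_id = source.get_data()    # ([0, 1], 0)
second_chunk, second_id = source.get_data()  # ([2, 3], 1)
```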

## Set the TimeBucket

The `TimeBucket` enumeration (enum) defines the required granularity of data aggregation over time. By default, `TimeBucket` is set to `TimeBucket.ALL`. You can specify any of the following values: `SECOND`, `MINUTE`, `HOUR`, `DAY`, `WEEK`, `MONTH`, `QUARTER`, or `ALL`. To change the `TimeBucket` value, use the `init` method, `source.init(time_bucket)`:

```
# Generate a dummy DataFrame with 2 rows per time bucket (Hour in this scenario)
test_df = gen_dataframe_for_accuracy_metric(
    nr_rows=10,
    rows_per_time_bucket=2,
    prediction_value=1,
    with_actuals=True,
    with_predictions=True,
    time_bucket=TimeBucket.HOUR,
)
print(test_df)
                    timestamp  predictions  actuals
0  01/06/2005 13:00:00.000000            1    0.999
1  01/06/2005 13:00:00.000000            1    0.999
2  01/06/2005 14:00:00.000000            1    0.999
3  01/06/2005 14:00:00.000000            1    0.999
4  01/06/2005 15:00:00.000000            1    0.999
5  01/06/2005 15:00:00.000000            1    0.999
6  01/06/2005 16:00:00.000000            1    0.999
7  01/06/2005 16:00:00.000000            1    0.999
8  01/06/2005 17:00:00.000000            1    0.999
9  01/06/2005 17:00:00.000000            1    0.999

# Use DataFrameSource and load created DataFrame
source = DataFrameSource(
    df=test_df,
    max_rows=10000,
    timestamp_col="timestamp",
)
# Init source with the selected TimeBucket
source.init(TimeBucket.HOUR)
df, _ = source.get_data()
print(df)
                    timestamp predictions actuals
0  01/06/2005 13:00:00.000000           1   0.999
1  01/06/2005 13:00:00.000000           1   0.999
df, _ = source.get_data()
print(df)
                    timestamp predictions actuals
2  01/06/2005 14:00:00.000000           1   0.999
3  01/06/2005 14:00:00.000000           1   0.999

source.init(TimeBucket.DAY)
df, _ = source.get_data()
print(df)
                    timestamp predictions actuals
0  01/06/2005 13:00:00.000000           1   0.999
1  01/06/2005 13:00:00.000000           1   0.999
2  01/06/2005 14:00:00.000000           1   0.999
3  01/06/2005 14:00:00.000000           1   0.999
4  01/06/2005 15:00:00.000000           1   0.999
5  01/06/2005 15:00:00.000000           1   0.999
6  01/06/2005 16:00:00.000000           1   0.999
7  01/06/2005 16:00:00.000000           1   0.999
8  01/06/2005 17:00:00.000000           1   0.999
9  01/06/2005 17:00:00.000000           1   0.999
```

The returned data chunks follow the selected `TimeBucket`. This is helpful in the `MetricEvaluator`. In addition to `TimeBucket`, the source respects the `max_rows` parameter when generating data chunks; for example, using the same dataset as in the example above (but with `max_rows` set to `3`):

```
source = DataFrameSource(
    df=test_df,
    max_rows=3,
    timestamp_col="timestamp",
)
source.init(TimeBucket.DAY)
df, chunk_id = source.get_data()
print(df)
                    timestamp predictions actuals
0  01/06/2005 13:00:00.000000           1   0.999
1  01/06/2005 13:00:00.000000           1   0.999
2  01/06/2005 14:00:00.000000           1   0.999
```

In `DataRobotSource`, you can specify the `TimeBucket` and `max_rows` parameters for all export types except training data export, which is returned in one chunk.

## Provide additional DataRobot deployment properties

The `Deployment` class is a helper class that provides access to relevant deployment properties. It is used inside `DataRobotSource` to select the appropriate workflow for working with data.

```
import datarobot as dr
from dmm.data_source.datarobot.deployment import Deployment
DataRobotClient()
deployment = Deployment(deployment_id=deployment_id)

deployment_type = deployment.type()
target_column = deployment.target_column()
positive_class_label = deployment.positive_class_label()
negative_class_label = deployment.negative_class_label()
prediction_threshold = deployment.prediction_threshold()
...
```
