
Configure data sources

The most commonly used data source is DataRobotSource, which connects to DataRobot to fetch selected prediction data from the DataRobot platform. Three default data sources are available: DataRobotSource, BatchDataRobotSource, and DataFrameSource.

Configure a DataRobot source

DataRobotSource connects to DataRobot to fetch selected prediction data from the DataRobot platform. Initialize DataRobotSource with the following mandatory parameters:

from dmm.data_source import DataRobotSource

source = DataRobotSource(
    base_url=DATAROBOT_ENDPOINT,
    token=DATAROBOT_API_TOKEN,
    deployment_id=deployment_id,
    start=start_of_export_window,
    end=end_of_export_window,
)

You can also provide the base_url and token parameters as environment variables, os.environ['DATAROBOT_ENDPOINT'] and os.environ['DATAROBOT_API_TOKEN']:

from dmm.data_source import DataRobotSource

source = DataRobotSource(
    deployment_id=deployment_id,
    start=start_of_export_window,
    end=end_of_export_window,
)
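
You can set these environment variables before constructing the source; for example (the endpoint and token values below are placeholders):

import os

os.environ["DATAROBOT_ENDPOINT"] = "https://app.datarobot.com/api/v2"
os.environ["DATAROBOT_API_TOKEN"] = "<your API token>"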

The following example initializes DataRobotSource with all parameters:

from dmm.data_source import DataRobotSource

source = DataRobotSource(
    base_url=DATAROBOT_ENDPOINT,
    token=DATAROBOT_API_TOKEN,
    client=None,
    deployment_id=deployment_id,
    model_id=model_id,
    start=start_of_export_window,
    end=end_of_export_window,
    max_rows=10000,
    delete_exports=False,
    use_cache=False,
    actuals_with_matched_predictions=True,
)
| Parameter | Description |
|-----------|-------------|
| base_url: str | The DataRobot API URL; for example, https://app.datarobot.com/api/v2. |
| token: str | A DataRobot API token from Developer Tools. |
| client: Optional[dr.Client] | A dr.Client object to use instead of base_url and token. |
| deployment_id: str | The ID of the deployment evaluated by the custom metric. |
| model_id: Optional[str] | The ID of the model evaluated by the custom metric. If you don't specify a model ID, the champion model ID is used. |
| start: datetime | The start of the export window; the date from which to start retrieving data. |
| end: datetime | The end of the export window; the date until which to retrieve data. |
| max_rows: Optional[int] | The maximum number of rows to fetch at once when the requested data doesn't fit into memory. |
| delete_exports: Optional[bool] | Whether to automatically delete datasets with exported data created in the AI Catalog. True configures deletion; the default value is False. |
| use_cache: Optional[bool] | Whether to reuse datasets stored in the AI Catalog for time ranges included in previous exports. The default value is False. |
| actuals_with_matched_predictions: Optional[bool] | Whether to allow actuals export without matched predictions. False disallows unmatched export; the default value is True. |

Export prediction data

The get_prediction_data method returns a chunk of prediction data with the appropriate chunk ID; the returned data chunk is a pandas DataFrame with the number of rows respecting the max_rows parameter. This method returns data until the data source is exhausted.

prediction_df_1, prediction_chunk_id_1 = source.get_prediction_data()

print(prediction_df_1.head(5).to_string())
print(f"chunk id: {prediction_chunk_id_1}")

   DR_RESERVED_PREDICTION_TIMESTAMP  DR_RESERVED_PREDICTION_VALUE_high  DR_RESERVED_PREDICTION_VALUE_low date_non_unique date_random  id       年月日
0  2023-09-13 11:02:51.248000+00:00                           0.697782                          0.302218      1950-10-01  1949-01-27   1  1949-01-01
1  2023-09-13 11:02:51.252000+00:00                           0.581351                          0.418649      1959-04-01  1949-02-03   2  1949-02-01
2  2023-09-13 11:02:51.459000+00:00                           0.639347                          0.360653      1954-05-01  1949-03-28   3  1949-03-01
3  2023-09-13 11:02:51.459000+00:00                           0.627727                          0.372273      1951-09-01  1949-04-07   4  1949-04-01
4  2023-09-13 11:02:51.664000+00:00                           0.591612                          0.408388      1951-03-01  1949-05-16   5  1949-05-01
chunk id: 0

When the data source is exhausted, None and -1 are returned:

prediction_df_2, prediction_chunk_id_2 = source.get_prediction_data()

print(prediction_df_2)
print(f"chunk id: {prediction_chunk_id_2}")

None
chunk id: -1

The reset method resets the exhausted data source, allowing it to iterate from the beginning:

source.reset()
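
Together, the chunked export and reset support a simple iteration pattern. The following minimal sketch processes every chunk until the source is exhausted (handle_chunk is a hypothetical placeholder for your own processing):

source.reset()
while True:
    prediction_df, chunk_id = source.get_prediction_data()
    if chunk_id == -1:  # source exhausted; prediction_df is None
        break
    handle_chunk(prediction_df)  # hypothetical: process the chunk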

The get_all_prediction_data method returns all prediction data available for a data source object in a single DataFrame:

prediction_df = source.get_all_prediction_data()

Export actuals data

The get_actuals_data method returns a chunk of actuals data with the appropriate chunk ID; the returned data chunk is a pandas DataFrame with the number of rows respecting the max_rows parameter. This method returns data until the data source is exhausted.

actuals_df_1, actuals_chunk_id_1 = source.get_actuals_data()

print(actuals_df_1.head(5).to_string())
print(f"chunk id: {actuals_chunk_id_1}")

     association_id                  timestamp label  actuals  predictions predicted_class
0                 1  2023-09-13 11:00:00+00:00   low        0     0.302218            high
194              57  2023-09-13 11:00:00+00:00   low        1     0.568564             low
192              56  2023-09-13 11:00:00+00:00   low        1     0.569865             low
190              55  2023-09-13 11:00:00+00:00   low        0     0.473282            high
196              58  2023-09-13 11:00:00+00:00   low        1     0.573861             low
chunk id: 0

To return raw data in the format stored in PostgreSQL (that is, with the original column names), set the return_original_column_names parameter to True:

actuals_df_1, actuals_chunk_id_1 = source.get_actuals_data(return_original_column_names=True)

print(actuals_df_1.head(5).to_string())
print(f"chunk id: {actuals_chunk_id_1}")

     id                  timestamp label  actuals         y predicted_class
0     1  2023-09-13 11:00:00+00:00   low        0  0.302218            high
194  57  2023-09-13 11:00:00+00:00   low        1  0.568564             low
192  56  2023-09-13 11:00:00+00:00   low        1  0.569865             low
190  55  2023-09-13 11:00:00+00:00   low        0  0.473282            high
196  58  2023-09-13 11:00:00+00:00   low        1  0.573861             low
chunk id: 0

To return all actuals data available for a source object in a single DataFrame, use the get_all_actuals_data method:

actuals_df = source.get_all_actuals_data()

When the data source is exhausted, None and -1 are returned:

actuals_df_2, actuals_chunk_id_2 = source.get_actuals_data()

print(actuals_df_2)
print(f"chunk id: {actuals_chunk_id_2}")

None
chunk id: -1

The reset method resets the exhausted data source, allowing it to iterate from the beginning:

source.reset()

Export training data

The get_training_data method returns all data used for training in one call. The returned data is a pandas DataFrame:

train_df = source.get_training_data()
print(train_df.head(5).to_string())

      y date_random date_non_unique       年月日
0  high  1949-01-27      1950-10-01  1949-01-01
1  high  1949-02-03      1959-04-01  1949-02-01
2   low  1949-03-28      1954-05-01  1949-03-01
3  high  1949-04-07      1951-09-01  1949-04-01
4  high  1949-05-16      1951-03-01  1949-05-01

Export combined data

The get_data method returns combined_data, which merges scoring data, predictions, and matched actuals. The MetricEvaluator uses this method as its main data export method.

df, chunk_id_1 = source.get_data()
print(df.head(5).to_string())
print(f"chunk id: {chunk_id_1}")

                          timestamp  predictions date_non_unique date_random  association_id       年月日 predicted_class label  actuals
0  2023-09-13 11:02:51.248000+00:00     0.302218      1950-10-01  1949-01-27               1  1949-01-01            high   low        0
1  2023-09-13 11:02:51.252000+00:00     0.418649      1959-04-01  1949-02-03               2  1949-02-01            high   low        0
2  2023-09-13 11:02:51.459000+00:00     0.360653      1954-05-01  1949-03-28               3  1949-03-01            high   low        1
3  2023-09-13 11:02:51.459000+00:00     0.372273      1951-09-01  1949-04-07               4  1949-04-01            high   low        0
4  2023-09-13 11:02:51.664000+00:00     0.408388      1951-03-01  1949-05-16               5  1949-05-01            high   low        0
chunk id: 0

The get_all_data method returns all combined data available for that source object in a single DataFrame:

df = source.get_all_data()

Configure a DataRobot batch deployment source

The BatchDataRobotSource interface is for batch deployments. The following example initializes BatchDataRobotSource with all parameters:

from dmm.data_source import BatchDataRobotSource

source = BatchDataRobotSource(
    base_url=DATAROBOT_ENDPOINT,
    token=DATAROBOT_API_TOKEN,
    client=None,
    deployment_id=deployment_id,
    model_id=model_id,
    batch_ids=batch_ids,
    max_rows=10000,
    delete_exports=False,
    use_cache=False,
)

The parameters are analogous to those of DataRobotSource. The most important difference is that you must provide batch IDs instead of a time range (start and end). In addition, batch sources don't support actuals export.

The get_prediction_data method returns a chunk of prediction data with the appropriate chunk ID; the returned data chunk is a pandas DataFrame with the number of rows respecting the max_rows parameter. This method returns data until the data source is exhausted.

prediction_df_1, prediction_chunk_id_1 = source.get_prediction_data()
print(prediction_df_1.head(5).to_string())
print(f"chunk id: {prediction_chunk_id_1}")

    AGE       B  CHAS     CRIM     DIS                  batch_id    DR_RESERVED_BATCH_NAME                         timestamp   INDUS  LSTAT  MEDV    NOX  PTRATIO  RAD     RM  TAX    ZN  id
0  65.2  396.90     0  0.00632  4.0900                <batch_id>                    batch1  2023-06-23 09:47:47.060000+00:00    2.31   4.98  24.0  0.538     15.3    1  6.575  296  18.0   1
1  78.9  396.90     0  0.02731  4.9671                <batch_id>                    batch1  2023-06-23 09:47:47.060000+00:00    7.07   9.14  21.6  0.469     17.8    2  6.421  242   0.0   2
2  61.1  392.83     0  0.02729  4.9671                <batch_id>                    batch1  2023-06-23 09:47:47.060000+00:00    7.07   4.03  34.7  0.469     17.8    2  7.185  242   0.0   3
3  45.8  394.63     0  0.03237  6.0622                <batch_id>                    batch1  2023-06-23 09:47:47.060000+00:00    2.18   2.94  33.4  0.458     18.7    3  6.998  222   0.0   4
4  54.2  396.90     0  0.06905  6.0622                <batch_id>                    batch1  2023-06-23 09:47:47.060000+00:00    2.18   5.33  36.2  0.458     18.7    3  7.147  222   0.0   5
chunk id: 0

# Return all prediction data available for the source in a single DataFrame
prediction_df = source.get_all_prediction_data()

# Reset the exhausted data source to iterate from the beginning
source.reset()

# Return combined data (scoring data merged with predictions; batch sources have no actuals)
df, chunk_id_1 = source.get_data()

The get_training_data method returns all data used for training in one call. The returned data is a pandas DataFrame:

train_df = source.get_training_data()

Configure a DataFrame source

If you aren't exporting data directly from DataRobot and instead have it available locally (for example, as a downloaded CSV file), you can load the dataset into DataFrameSource. DataFrameSource wraps any pd.DataFrame to create a library-compatible source. This is the easiest way to interact with the library when bringing your own data:

import pandas as pd

from dmm.data_source import DataFrameSource

source = DataFrameSource(
    df=pd.read_csv("./data_hour_of_week.csv"),
    max_rows=10000,
    timestamp_col="date",
)

df, chunk_id_1 = source.get_data()
print(df.head(5).to_string())
print(f"chunk id: {chunk_id_1}")

                  date         y
0  1959-12-31 23:59:57 -0.183669
1  1960-01-01 01:00:02  0.283993
2  1960-01-01 01:59:52  0.020663
3  1960-01-01 03:00:14  0.404304
4  1960-01-01 03:59:58  1.005252
chunk id: 0

In addition, you can create new data source definitions by implementing the DataSourceBase interface.
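
For illustration, the following minimal sketch mimics the chunked contract used by the built-in sources (get_data returns a DataFrame and a chunk ID, then None and -1 once exhausted, and reset starts over). The exact abstract methods of DataSourceBase depend on your library version, so treat this as an outline rather than a drop-in implementation:

import pandas as pd

class ListSource:  # in practice, subclass DataSourceBase
    """Serve a list of pre-built DataFrames, one chunk per call."""

    def __init__(self, chunks):
        self._chunks = chunks
        self._next = 0

    def get_data(self):
        # Follow the library convention: return (None, -1) when exhausted
        if self._next >= len(self._chunks):
            return None, -1
        chunk_id = self._next
        self._next += 1
        return self._chunks[chunk_id], chunk_id

    def reset(self):
        self._next = 0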

Set the TimeBucket

The TimeBucket enumeration (enum) defines the required granularity of data aggregation over time. By default, TimeBucket is set to TimeBucket.ALL. You can specify any of the following values: SECOND, MINUTE, HOUR, DAY, WEEK, MONTH, QUARTER, or ALL. To change the TimeBucket value, use the init method, source.init(time_bucket):

# Generate a dummy DataFrame with 2 rows per time bucket (HOUR in this scenario);
# gen_dataframe_for_accuracy_metric is a helper used in the library's examples
test_df = gen_dataframe_for_accuracy_metric(
    nr_rows=10,
    rows_per_time_bucket=2,
    prediction_value=1,
    with_actuals=True,
    with_predictions=True,
    time_bucket=TimeBucket.HOUR,
)
print(test_df)
                    timestamp  predictions  actuals
0  01/06/2005 13:00:00.000000            1    0.999
1  01/06/2005 13:00:00.000000            1    0.999
2  01/06/2005 14:00:00.000000            1    0.999
3  01/06/2005 14:00:00.000000            1    0.999
4  01/06/2005 15:00:00.000000            1    0.999
5  01/06/2005 15:00:00.000000            1    0.999
6  01/06/2005 16:00:00.000000            1    0.999
7  01/06/2005 16:00:00.000000            1    0.999
8  01/06/2005 17:00:00.000000            1    0.999
9  01/06/2005 17:00:00.000000            1    0.999

# Use DataFrameSource and load created DataFrame
source = DataFrameSource(
    df=test_df,
    max_rows=10000,
    timestamp_col="timestamp",
)
# Init source with the selected TimeBucket
source.init(TimeBucket.HOUR)
df, _ = source.get_data()
print(df)
                    timestamp predictions actuals
0  01/06/2005 13:00:00.000000           1   0.999
1  01/06/2005 13:00:00.000000           1   0.999
df, _ = source.get_data()
print(df)
                    timestamp predictions actuals
2  01/06/2005 14:00:00.000000           1   0.999
3  01/06/2005 14:00:00.000000           1   0.999

source.init(TimeBucket.DAY)
df, _ = source.get_data()
print(df)
                    timestamp predictions actuals
0  01/06/2005 13:00:00.000000           1   0.999
1  01/06/2005 13:00:00.000000           1   0.999
2  01/06/2005 14:00:00.000000           1   0.999
3  01/06/2005 14:00:00.000000           1   0.999
4  01/06/2005 15:00:00.000000           1   0.999
5  01/06/2005 15:00:00.000000           1   0.999
6  01/06/2005 16:00:00.000000           1   0.999
7  01/06/2005 16:00:00.000000           1   0.999
8  01/06/2005 17:00:00.000000           1   0.999
9  01/06/2005 17:00:00.000000           1   0.999

The returned data chunks follow the selected TimeBucket. This is helpful in the MetricEvaluator. In addition to TimeBucket, the source respects the max_rows parameter when generating data chunks; for example, using the same dataset as in the example above (but with max_rows set to 3):

source = DataFrameSource(
    df=test_df,
    max_rows=3,
    timestamp_col="timestamp",
)
source.init(TimeBucket.DAY)
df, chunk_id = source.get_data()
print(df)
                    timestamp predictions actuals
0  01/06/2005 13:00:00.000000           1   0.999
1  01/06/2005 13:00:00.000000           1   0.999
2  01/06/2005 14:00:00.000000           1   0.999

In DataRobotSource, you can specify the TimeBucket and max_rows parameters for all export types except training data export, which is returned in one chunk.
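
For example, a short sketch (reusing the variables defined earlier, and assuming TimeBucket is importable from the package root as in the examples above):

from dmm import TimeBucket
from dmm.data_source import DataRobotSource

source = DataRobotSource(
    deployment_id=deployment_id,
    start=start_of_export_window,
    end=end_of_export_window,
    max_rows=1000,
)
source.init(TimeBucket.DAY)

# Each returned chunk now spans at most one day and at most 1,000 rows
df, chunk_id = source.get_data()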

Provide additional DataRobot deployment properties

The Deployment class is a helper that provides access to relevant deployment properties. DataRobotSource uses this class internally to select the appropriate workflow for the deployment's data.

import datarobot as dr
from dmm.data_source.datarobot.deployment import Deployment
dr.Client()
deployment = Deployment(deployment_id=deployment_id)

deployment_type = deployment.type()
target_column = deployment.target_column()
positive_class_label = deployment.positive_class_label()
negative_class_label = deployment.negative_class_label()
prediction_threshold = deployment.prediction_threshold()
...

Updated July 24, 2024