Three default data sources are available: DataRobotSource, BatchDataRobotSource, and DataFrameSource. The most commonly used is DataRobotSource, which connects to DataRobot to fetch selected prediction data from the DataRobot platform.
Initialize DataRobotSource with the following parameters; those not typed Optional are mandatory (see the example after this list):
base_url: str
The DataRobot API base URL.
token: str
The DataRobot API token used for authentication.
You can also provide base_url and token as the environment variables os.environ['DATAROBOT_ENDPOINT'] and os.environ['DATAROBOT_API_TOKEN'], or pass a DataRobotClient object in place of base_url and token.
deployment_id: str
The ID of the deployment evaluated by the custom metric.
model_id: Optional[str]
The ID of the model evaluated by the custom metric. If you don't specify a model ID, the champion model ID is used.
start: datetime
The start of the export window. Define the date you want to start retrieving data from.
end: datetime
The end of the export window. Define the date you want to retrieve data until.
max_rows: Optional[int]
The maximum number of rows to fetch at once when the requested data doesn't fit into memory.
delete_exports: Optional[bool]
Whether to automatically delete the datasets created in the AI Catalog during data export. True enables automatic deletion; the default value is False.
use_cache: Optional[bool]
Whether to reuse datasets already stored in the AI Catalog for time ranges covered by previous exports. True reuses those datasets; the default value is False.
actuals_with_matched_predictions: Optional[bool]
Whether to allow exporting actuals that have no matched predictions. False prevents unmatched actuals from being exported; the default value is True.
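For example, a minimal initialization might look like the following sketch; the import path is an assumption about the package layout, and the deployment ID is a placeholder:

```python
import os
from datetime import datetime

from dmm.data_source import DataRobotSource  # import path is an assumption

# Credentials read from the environment variables mentioned above.
source = DataRobotSource(
    base_url=os.environ["DATAROBOT_ENDPOINT"],
    token=os.environ["DATAROBOT_API_TOKEN"],
    deployment_id="<DEPLOYMENT_ID>",  # placeholder, not a real ID
    start=datetime(2024, 1, 1),       # start of the export window
    end=datetime(2024, 1, 8),         # end of the export window
)
```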
The get_prediction_data method returns a chunk of prediction data with the appropriate chunk ID; the returned data chunk is a pandas DataFrame with the number of rows respecting the max_rows parameter. This method returns data until the data source is exhausted.
The get_actuals_data method returns a chunk of actuals data with the appropriate chunk ID; the returned data chunk is a pandas DataFrame with the number of rows respecting the max_rows parameter. This method returns data until the data source is exhausted.
The get_data method returns combined_data, which includes merged scoring data, predictions, and matched actuals. The MetricEvaluator uses this method as its main data export method.
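All three methods follow the same chunked-iteration pattern. A minimal sketch, assuming each call returns a (DataFrame, chunk ID) tuple as in the DataFrameSource example later in this section; the exhaustion signal shown here (an empty or missing DataFrame) is an assumption:

```python
# Fetch combined data chunk by chunk until the source is exhausted.
while True:
    df, chunk_id = source.get_data()
    if df is None or len(df) == 0:
        # No more data to export; the exact sentinel is an assumption.
        break
    print(f"chunk {chunk_id}: {len(df)} rows")
```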
The BatchDataRobotSource parameters are analogous to those of DataRobotSource. The most important difference is that instead of a time range (start and end), you must provide batch IDs. In addition, the batch source doesn't support actuals export.
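For example, a sketch of batch-source initialization; the import path and the batch_ids parameter name are assumptions based on the description above, and the IDs are placeholders:

```python
import os

from dmm.data_source import BatchDataRobotSource  # import path is an assumption

source = BatchDataRobotSource(
    base_url=os.environ["DATAROBOT_ENDPOINT"],
    token=os.environ["DATAROBOT_API_TOKEN"],
    deployment_id="<DEPLOYMENT_ID>",              # placeholder
    batch_ids=["<BATCH_ID_1>", "<BATCH_ID_2>"],   # batch IDs instead of start/end
    max_rows=10000,
)
```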
The get_prediction_data method returns a chunk of prediction data with the appropriate chunk ID; the returned data chunk is a pandas DataFrame with the number of rows respecting the max_rows parameter. This method returns data until the data source is exhausted.
If you aren't exporting data directly from DataRobot and instead have it available locally (for example, as a downloaded file), you can load the dataset into DataFrameSource. The DataFrameSource class wraps any pd.DataFrame to create a library-compatible source. This is the easiest way to interact with the library when bringing your own data:
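For instance, using the constructor arguments shown in the TimeBucket example below; the import path is an assumption, and the file name is illustrative:

```python
import pandas as pd

from dmm.data_source import DataFrameSource  # import path is an assumption

# Wrap a locally loaded DataFrame as a library-compatible data source.
source = DataFrameSource(
    df=pd.read_csv("your_data.csv"),  # illustrative local file
    max_rows=10000,
    timestamp_col="timestamp",        # column holding prediction timestamps
)
```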
In addition, you can create new data source definitions by implementing the DataSourceBase interface.
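As a loose sketch, assuming DataSourceBase expects the init and get_data methods used in the examples in this section (the actual abstract interface may require additional methods):

```python
from dmm.data_source import DataSourceBase  # import path is an assumption


class MyCustomSource(DataSourceBase):
    """Illustrative only: the abstract methods of DataSourceBase are
    assumed from this section's examples, not from the library API."""

    def init(self, time_bucket):
        # Remember the requested aggregation granularity.
        self._time_bucket = time_bucket

    def get_data(self):
        # Return one (DataFrame, chunk ID) tuple per call, honoring
        # self._time_bucket and the configured max_rows; return no data
        # once the source is exhausted.
        raise NotImplementedError
```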
The TimeBucket enumeration (enum) defines the required data aggregation granularity over time. By default, TimeBucket is set to TimeBucket.ALL. You can specify any of the following values: SECOND, MINUTE, HOUR, DAY, WEEK, MONTH, QUARTER, or ALL. To change the TimeBucket value, use the init method, source.init(time_bucket), as in the following example:
```python
# Generate a dummy DataFrame with 2 rows per time bucket (Hour in this scenario)
test_df = gen_dataframe_for_accuracy_metric(
    nr_rows=10,
    rows_per_time_bucket=2,
    prediction_value=1,
    with_actuals=True,
    with_predictions=True,
    time_bucket=TimeBucket.HOUR,
)
print(test_df)
```

```
                    timestamp  predictions  actuals
0  01/06/2005 13:00:00.000000            1    0.999
1  01/06/2005 13:00:00.000000            1    0.999
2  01/06/2005 14:00:00.000000            1    0.999
3  01/06/2005 14:00:00.000000            1    0.999
4  01/06/2005 15:00:00.000000            1    0.999
5  01/06/2005 15:00:00.000000            1    0.999
6  01/06/2005 16:00:00.000000            1    0.999
7  01/06/2005 16:00:00.000000            1    0.999
8  01/06/2005 17:00:00.000000            1    0.999
9  01/06/2005 17:00:00.000000            1    0.999
```

```python
# Use DataFrameSource and load the created DataFrame
source = DataFrameSource(
    df=test_df,
    max_rows=10000,
    timestamp_col="timestamp",
)

# Init the source with the selected TimeBucket
source.init(TimeBucket.HOUR)

df, _ = source.get_data()
print(df)
```

```
                    timestamp  predictions  actuals
0  01/06/2005 13:00:00.000000            1    0.999
1  01/06/2005 13:00:00.000000            1    0.999
```

```python
df, _ = source.get_data()
print(df)
```

```
                    timestamp  predictions  actuals
2  01/06/2005 14:00:00.000000            1    0.999
3  01/06/2005 14:00:00.000000            1    0.999
```

```python
source.init(TimeBucket.DAY)
df, _ = source.get_data()
print(df)
```

```
                    timestamp  predictions  actuals
0  01/06/2005 13:00:00.000000            1    0.999
1  01/06/2005 13:00:00.000000            1    0.999
2  01/06/2005 14:00:00.000000            1    0.999
3  01/06/2005 14:00:00.000000            1    0.999
4  01/06/2005 15:00:00.000000            1    0.999
5  01/06/2005 15:00:00.000000            1    0.999
6  01/06/2005 16:00:00.000000            1    0.999
7  01/06/2005 16:00:00.000000            1    0.999
8  01/06/2005 17:00:00.000000            1    0.999
9  01/06/2005 17:00:00.000000            1    0.999
```
The returned data chunks follow the selected TimeBucket, which is helpful in the MetricEvaluator. In addition to TimeBucket, the source respects the max_rows parameter when generating data chunks; for example, using the same dataset as above but with max_rows set to 3:
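A sketch of that behavior, reusing test_df from the example above; the chunk sizes noted in the comments are illustrative of the rule that a chunk never exceeds max_rows:

```python
# Same dataset as above, but each returned chunk is capped at 3 rows.
source = DataFrameSource(
    df=test_df,
    max_rows=3,
    timestamp_col="timestamp",
)
source.init(TimeBucket.DAY)

# With all 10 rows falling in the same day, the expected chunking under
# the rule described above is 3, 3, 3, and 1 rows (illustrative).
df, _ = source.get_data()
print(df)  # first chunk: rows 0-2
```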
In DataRobotSource, you can specify the TimeBucket and max_rows parameters for all export types except training data export, which is returned in one chunk.
Provide additional DataRobot deployment properties
The Deployment class is a helper class that provides access to relevant deployment properties. DataRobotSource uses this class internally to select the appropriate workflow for the exported data.
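For illustration only, direct use might look like the following sketch; the import path and constructor arguments are assumptions mirrored from the DataRobotSource parameters above, not a confirmed signature:

```python
import os

from dmm import Deployment  # import path is an assumption

# Assumed constructor, mirroring DataRobotSource's connection parameters.
deployment = Deployment(
    base_url=os.environ["DATAROBOT_ENDPOINT"],
    token=os.environ["DATAROBOT_API_TOKEN"],
    deployment_id="<DEPLOYMENT_ID>",  # placeholder, not a real ID
)
```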