Configure data sources¶
The most commonly used data source is DataRobotSource, which connects to DataRobot to fetch selected prediction data from the DataRobot platform. Three default data sources are available: DataRobotSource, BatchDataRobotSource, and DataFrameSource.
Configure a DataRobot source¶
DataRobotSource connects to DataRobot to fetch selected prediction data from the DataRobot platform. Initialize DataRobotSource with the following mandatory parameters:
from dmm.data_source import DataRobotSource
source = DataRobotSource(
base_url=DATAROBOT_ENDPOINT,
token=DATAROBOT_API_TOKEN,
deployment_id=deployment_id,
start=start_of_export_window,
end=end_of_export_window,
)
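The start and end values are datetime objects. A minimal sketch of building an export window with the standard library follows. The variable names match the example above; the seven-day length and the use of timezone-aware UTC datetimes are assumptions chosen for illustration, not requirements stated by the library.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a 7-day export window that ends now, expressed in UTC.
end_of_export_window = datetime.now(timezone.utc)
start_of_export_window = end_of_export_window - timedelta(days=7)

print(start_of_export_window.isoformat(), "->", end_of_export_window.isoformat())
```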
You can also provide the base_url and token parameters as environment variables, os.environ['DATAROBOT_ENDPOINT'] and os.environ['DATAROBOT_API_TOKEN'], and omit them from the call:
from dmm.data_source import DataRobotSource
source = DataRobotSource(
deployment_id=deployment_id,
start=start_of_export_window,
end=end_of_export_window,
)
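When relying on environment variables, it can help to confirm they are set before creating the source. The sketch below is one possible check; the helper name missing_credentials is hypothetical and is not part of the library.

```python
import os

# The two variable names come from the text above.
REQUIRED_VARS = ("DATAROBOT_ENDPOINT", "DATAROBOT_API_TOKEN")


def missing_credentials(env=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_credentials()
if missing:
    print("Set these environment variables first:", ", ".join(missing))
```

Passing a plain dictionary as env makes the check easy to exercise without touching the real environment.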
The following example initializes DataRobotSource with all parameters:
from dmm.data_source import DataRobotSource
source = DataRobotSource(
base_url=DATAROBOT_ENDPOINT,
token=DATAROBOT_API_TOKEN,
client=None,
deployment_id=deployment_id,
model_id=model_id,
start=start_of_export_window,
end=end_of_export_window,
max_rows=10000,
delete_exports=False,
use_cache=False,
actuals_with_matched_predictions=True,
)
Parameter | Description |
---|---|
base_url: str | The DataRobot API URL; for example, https://app.datarobot.com/api/v2. |
token: str | A DataRobot API token from Developer Tools. |
client: Optional[dr.Client] | Use the dr.Client object instead of base_url and token. |
deployment_id: str | The ID of the deployment evaluated by the custom metric. |
model_id: Optional[str] | The ID of the model evaluated by the custom metric. If you don't specify a model ID, the champion model ID is used. |
start: datetime | The start of the export window. Define the date you want to start retrieving data from. |
end: datetime | The end of the export window. Define the date you want to retrieve data until. |
max_rows: Optional[int] | The maximum number of rows to fetch at once when the requested data doesn't fit into memory. |
delete_exports: Optional[bool] | Whether to automatically delete datasets with exported data created in the AI Catalog. True configures for deletion; the default value is False. |
use_cache: Optional[bool] | Whether to use existing datasets stored in the AI Catalog for time ranges included in previous exports. True uses datasets used in previous exports; the default value is False. |
actuals_with_matched_predictions: Optional[bool] | Whether to allow actuals export without matched predictions. False does not allow unmatched export; the default value is True. |
Export prediction data¶
The get_prediction_data method returns a chunk of prediction data with the appropriate chunk ID; the returned data chunk is a pandas DataFrame with the number of rows respecting the max_rows parameter. This method returns data until the data source is exhausted.
prediction_df_1, prediction_chunk_id_1 = source.get_prediction_data()
print(prediction_df_1.head(5).to_string())
print(f"chunk id: {prediction_chunk_id_1}")
DR_RESERVED_PREDICTION_TIMESTAMP DR_RESERVED_PREDICTION_VALUE_high DR_RESERVED_PREDICTION_VALUE_low date_non_unique date_random id 年月日
0 2023-09-13 11:02:51.248000+00:00 0.697782 0.302218 1950-10-01 1949-01-27 1 1949-01-01
1 2023-09-13 11:02:51.252000+00:00 0.581351 0.418649 1959-04-01 1949-02-03 2 1949-02-01
2 2023-09-13 11:02:51.459000+00:00 0.639347 0.360653 1954-05-01 1949-03-28 3 1949-03-01
3 2023-09-13 11:02:51.459000+00:00 0.627727 0.372273 1951-09-01 1949-04-07 4 1949-04-01
4 2023-09-13 11:02:51.664000+00:00 0.591612 0.408388 1951-03-01 1949-05-16 5 1949-05-01
chunk id: 0
When the data source is exhausted, None and -1 are returned:
prediction_df_2, prediction_chunk_id_2 = source.get_prediction_data()
print(prediction_df_2)
print(f"chunk id: {prediction_chunk_id_2}")
None
chunk id: -1
The reset method resets the exhausted data source, allowing it to iterate from the beginning:
source.reset()
The get_all_prediction_data method returns all prediction data available for a data source object in a single DataFrame:
prediction_df = source.get_all_prediction_data()
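The chunked contract described in this section (a data chunk plus a chunk ID, then None and -1 once the source is exhausted) lends itself to a simple loop. The sketch below shows one way to write it. StubSource is a hypothetical stand-in that only mimics that contract with plain lists, so the example runs without a DataRobot connection; with a real source, each chunk would be a pandas DataFrame.

```python
def iter_chunks(get_chunk):
    """Yield (chunk_id, chunk) pairs until the source signals exhaustion."""
    while True:
        chunk, chunk_id = get_chunk()
        if chunk is None:  # exhausted: the method returned (None, -1)
            return
        yield chunk_id, chunk


class StubSource:
    """Hypothetical stand-in that mimics the (data, chunk_id) contract."""

    def __init__(self, chunks):
        self._chunks = list(chunks)
        self._position = 0

    def get_prediction_data(self):
        if self._position >= len(self._chunks):
            return None, -1
        chunk = self._chunks[self._position]
        chunk_id = self._position
        self._position += 1
        return chunk, chunk_id

    def reset(self):
        # Allow iteration to start again from the first chunk.
        self._position = 0


source = StubSource([[0.70, 0.58], [0.64]])
for chunk_id, chunk in iter_chunks(source.get_prediction_data):
    print(chunk_id, len(chunk))
```

Because iter_chunks takes the bound method rather than the source itself, the same loop could be pointed at any of the chunked export methods described on this page.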
Export actuals data¶
The get_actuals_data method returns a chunk of actuals data with the appropriate chunk ID; the returned data chunk is a pandas DataFrame with the number of rows respecting the max_rows parameter. This method returns data until the data source is exhausted.
actuals_df_1, actuals_chunk_id_1 = source.get_actuals_data()
print(actuals_df_1.head(5).to_string())
print(f"chunk id: {actuals_chunk_id_1}")
association_id timestamp label actuals predictions predicted_class
0 1 2023-09-13 11:00:00+00:00 low 0 0.302218 high
194 57 2023-09-13 11:00:00+00:00 low 1 0.568564 low
192 56 2023-09-13 11:00:00+00:00 low 1 0.569865 low
190 55 2023-09-13 11:00:00+00:00 low 0 0.473282 high
196 58 2023-09-13 11:00:00+00:00 low 1 0.573861 low
chunk id: 0
To return raw data in the format of data from PostgreSQL, set the return_original_column_names parameter to True:
actuals_df_1, actuals_chunk_id_1 = source.get_actuals_data(return_original_column_names=True)
print(actuals_df_1.head(5).to_string())
print(f"chunk id: {actuals_chunk_id_1}")
id timestamp label actuals y predicted_class
0 1 2023-09-13 11:00:00+00:00 low 0 0.302218 high
194 57 2023-09-13 11:00:00+00:00 low 1 0.568564 low
192 56 2023-09-13 11:00:00+00:00 low 1 0.569865 low
190 55 2023-09-13 11:00:00+00:00 low 0 0.473282 high
196 58 2023-09-13 11:00:00+00:00 low 1 0.573861 low
chunk id: 0
To return all actuals data available for a source object in a single DataFrame, use the get_all_actuals_data method:
actuals_df = source.get_all_actuals_data()
When the data source is exhausted, None and -1 are returned:
actuals_df_2, actuals_chunk_id_2 = source.get_actuals_data()
print(actuals_df_2)
print(f"chunk id: {actuals_chunk_id_2}")
None
chunk id: -1
The reset method resets the exhausted data source, allowing it to iterate from the beginning:
source.reset()
Export training data¶
The get_training_data method returns all data used for training in one call. The returned data is a pandas DataFrame:
train_df = source.get_training_data()
print(train_df.head(5).to_string())
y date_random date_non_unique 年月日
0 high 1949-01-27 1950-10-01 1949-01-01
1 high 1949-02-03 1959-04-01 1949-02-01
2 low 1949-03-28 1954-05-01 1949-03-01
3 high 1949-04-07 1951-09-01 1949-04-01
4 high 1949-05-16 1951-03-01 1949-05-01
Export combined data¶
The get_data method returns combined_data, which includes merged scoring data, predictions, and matched actuals. The MetricEvaluator uses this method as its main data export method.
df, chunk_id_1 = source.get_data()
print(df.head(5).to_string())
print(f"chunk id: {chunk_id_1}")
timestamp predictions date_non_unique date_random association_id 年月日 predicted_class label actuals
0 2023-09-13 11:02:51.248000+00:00 0.302218 1950-10-01 1949-01-27 1 1949-01-01 high low 0
1 2023-09-13 11:02:51.252000+00:00 0.418649 1959-04-01 1949-02-03 2 1949-02-01 high low 0
2 2023-09-13 11:02:51.459000+00:00 0.360653 1954-05-01 1949-03-28 3 1949-03-01 high low 1
3 2023-09-13 11:02:51.459000+00:00 0.372273 1951-09-01 1949-04-07 4 1949-04-01 high low 0
4 2023-09-13 11:02:51.664000+00:00 0.408388 1951-03-01 1949-05-16 5 1949-05-01 high low 0
chunk id: 0
The get_all_data method returns all combined data available for that source object in a single DataFrame:
df = source.get_all_data()
Configure a DataRobot batch deployment source¶
The BatchDataRobotSource interface is for batch deployments. The following example initializes BatchDataRobotSource with all parameters:
from dmm.data_source import BatchDataRobotSource
source = BatchDataRobotSource(
base_url=DATAROBOT_ENDPOINT,
token=DATAROBOT_API_TOKEN,
client=None,
deployment_id=deployment_id,
model_id=model_id,
batch_ids=batch_ids,
max_rows=10000,
delete_exports=False,
use_cache=False,
)
The parameters for this class are analogous to those for DataRobotSource. The most important difference is that instead of the time range (start and end), you must provide batch IDs. In addition, a batch source doesn't support actuals export.
The get_prediction_data method returns a chunk of prediction data with the appropriate chunk ID; the returned data chunk is a pandas DataFrame with the number of rows respecting the max_rows parameter. This method returns data until the data source is exhausted.
prediction_df_1, prediction_chunk_id_1 = source.get_prediction_data()
print(prediction_df_1.head(5).to_string())
print(f"chunk id: {prediction_chunk_id_1}")
AGE B CHAS CRIM DIS batch_id DR_RESERVED_BATCH_NAME timestamp INDUS LSTAT MEDV NOX PTRATIO RAD RM TAX ZN id
0 65.2 396.90 0 0.00632 4.0900 <batch_id> batch1 2023-06-23 09:47:47.060000+00:00 2.31 4.98 24.0 0.538 15.3 1 6.575 296 18.0 1
1 78.9 396.90 0 0.02731 4.9671 <batch_id> batch1 2023-06-23 09:47:47.060000+00:00 7.07 9.14 21.6 0.469 17.8 2 6.421 242 0.0 2
2 61.1 392.83 0 0.02729 4.9671 <batch_id> batch1 2023-06-23 09:47:47.060000+00:00 7.07 4.03 34.7 0.469 17.8 2 7.185 242 0.0 3
3 45.8 394.63 0 0.03237 6.0622 <batch_id> batch1 2023-06-23 09:47:47.060000+00:00 2.18 2.94 33.4 0.458 18.7 3 6.998 222 0.0 4
4 54.2 396.90 0 0.06905 6.0622 <batch_id> batch1 2023-06-23 09:47:47.060000+00:00 2.18 5.33 36.2 0.458 18.7 3 7.147 222 0.0 5
chunk id: 0
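# Fetch all prediction data in a single DataFrame
prediction_df = source.get_all_prediction_data()
# Reset the source so iteration restarts from the first chunk
source.reset()
# Fetch the first chunk of combined data
df, chunk_id_1 = source.get_data()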
The get_training_data method returns all data used for training in one call. The returned data is a pandas DataFrame:
train_df = source.get_training_data()
Configure a DataFrame source¶
If you aren't exporting data directly from DataRobot and instead have it available locally (for example, as a downloaded file), you can load the dataset into DataFrameSource. The DataFrameSource class wraps any pd.DataFrame to create a library-compatible source. This is the easiest way to interact with the library when bringing your own data:
import pandas as pd
from dmm.data_source import DataFrameSource
source = DataFrameSource(
df=pd.read_csv("./data_hour_of_week.csv"),
max_rows=10000,
timestamp_col="date"
)
df, chunk_id_1 = source.get_data()
print(df.head(5).to_string())
print(f"chunk id: {chunk_id_1}")
date y
0 1959-12-31 23:59:57 -0.183669
1 1960-01-01 01:00:02 0.283993
2 1960-01-01 01:59:52 0.020663
3 1960-01-01 03:00:14 0.404304
4 1960-01-01 03:59:58 1.005252
chunk id: 0
In addition, it is possible to create new data source definitions. To define a new data source, you can customize and implement the DataSourceBase interface.
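As a rough illustration of what a custom source needs to do, the sketch below serves in-memory records in chunks. It is a duck-typed stand-in, not a DataSourceBase subclass: the abstract methods that DataSourceBase requires are not listed on this page, so the class name RecordListSource and the choice of methods (reset, get_data, get_all_data, mirroring the methods used in this document) are assumptions.

```python
class RecordListSource:
    """Hypothetical source over in-memory records; not a DataSourceBase subclass."""

    def __init__(self, records, max_rows=10000):
        if max_rows < 1:
            raise ValueError("max_rows must be a positive integer")
        self._records = list(records)
        self._max_rows = max_rows
        self._offset = 0
        self._chunk_id = 0

    def reset(self):
        """Restart iteration from the first record."""
        self._offset = 0
        self._chunk_id = 0

    def get_data(self):
        """Return (chunk, chunk_id), or (None, -1) once exhausted."""
        if self._offset >= len(self._records):
            return None, -1
        chunk = self._records[self._offset:self._offset + self._max_rows]
        chunk_id = self._chunk_id
        self._offset += len(chunk)
        self._chunk_id += 1
        return chunk, chunk_id

    def get_all_data(self):
        """Return every record at once, ignoring max_rows."""
        return list(self._records)


records = [{"date": "1960-01-01", "y": 0.28}, {"date": "1960-01-02", "y": 0.02},
           {"date": "1960-01-03", "y": 0.40}]
custom_source = RecordListSource(records, max_rows=2)
print(custom_source.get_data())
```

A real implementation would subclass DataSourceBase and return pandas DataFrames instead of lists.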
Set the TimeBucket¶
The TimeBucket class enumeration (enum) defines the required data aggregation granularity over time. By default, TimeBucket is set to TimeBucket.ALL. You can specify any of the following values: SECOND, MINUTE, HOUR, DAY, WEEK, MONTH, QUARTER, or ALL. To change the TimeBucket value, use the init method, source.init(time_bucket):
# Generate a dummy DataFrame with 2 rows per time bucket (Hour in this scenario)
test_df = gen_dataframe_for_accuracy_metric(
nr_rows=10,
rows_per_time_bucket=2,
prediction_value=1,
with_actuals=True,
with_predictions=True,
time_bucket=TimeBucket.HOUR,
)
print(test_df)
timestamp predictions actuals
0 01/06/2005 13:00:00.000000 1 0.999
1 01/06/2005 13:00:00.000000 1 0.999
2 01/06/2005 14:00:00.000000 1 0.999
3 01/06/2005 14:00:00.000000 1 0.999
4 01/06/2005 15:00:00.000000 1 0.999
5 01/06/2005 15:00:00.000000 1 0.999
6 01/06/2005 16:00:00.000000 1 0.999
7 01/06/2005 16:00:00.000000 1 0.999
8 01/06/2005 17:00:00.000000 1 0.999
9 01/06/2005 17:00:00.000000 1 0.999
# Use DataFrameSource and load created DataFrame
source = DataFrameSource(
df=test_df,
max_rows=10000,
timestamp_col="timestamp",
)
# Init source with the selected TimeBucket
source.init(TimeBucket.HOUR)
df, _ = source.get_data()
print(df)
timestamp predictions actuals
0 01/06/2005 13:00:00.000000 1 0.999
1 01/06/2005 13:00:00.000000 1 0.999
df, _ = source.get_data()
print(df)
timestamp predictions actuals
2 01/06/2005 14:00:00.000000 1 0.999
3 01/06/2005 14:00:00.000000 1 0.999
source.init(TimeBucket.DAY)
df, _ = source.get_data()
print(df)
timestamp predictions actuals
0 01/06/2005 13:00:00.000000 1 0.999
1 01/06/2005 13:00:00.000000 1 0.999
2 01/06/2005 14:00:00.000000 1 0.999
3 01/06/2005 14:00:00.000000 1 0.999
4 01/06/2005 15:00:00.000000 1 0.999
5 01/06/2005 15:00:00.000000 1 0.999
6 01/06/2005 16:00:00.000000 1 0.999
7 01/06/2005 16:00:00.000000 1 0.999
8 01/06/2005 17:00:00.000000 1 0.999
9 01/06/2005 17:00:00.000000 1 0.999
The returned data chunks follow the selected TimeBucket. This is helpful in the MetricEvaluator. In addition to TimeBucket, the source respects the max_rows parameter when generating data chunks; for example, using the same dataset as in the example above (but with max_rows set to 3):
source = DataFrameSource(
df=test_df,
max_rows=3,
timestamp_col="timestamp",
)
source.init(TimeBucket.DAY)
df, chunk_id = source.get_data()
print(df)
timestamp predictions actuals
0 01/06/2005 13:00:00.000000 1 0.999
1 01/06/2005 13:00:00.000000 1 0.999
2 01/06/2005 14:00:00.000000 1 0.999
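To make the interaction between the time bucket and max_rows concrete, the sketch below reproduces the chunk sizes shown above using only the standard library. It illustrates the observed behavior and is not the library's implementation: the helper names bucket_key and chunk_rows are hypothetical, only HOUR and DAY are handled, and the rule that a chunk closes when the bucket changes or max_rows is reached is an assumption inferred from the outputs in this section.

```python
from datetime import datetime, timedelta


def bucket_key(ts, bucket):
    """Truncate a timestamp to the start of its bucket ("HOUR" or "DAY")."""
    if bucket == "HOUR":
        return ts.replace(minute=0, second=0, microsecond=0)
    if bucket == "DAY":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    raise ValueError(f"unsupported bucket: {bucket}")


def chunk_rows(rows, bucket, max_rows):
    """Split time-sorted (timestamp, value) rows into chunks.

    A chunk closes when the bucket changes or when it reaches max_rows.
    """
    chunks, current, current_key = [], [], None
    for row in rows:
        key = bucket_key(row[0], bucket)
        if current and (key != current_key or len(current) >= max_rows):
            chunks.append(current)
            current = []
        current.append(row)
        current_key = key
    if current:
        chunks.append(current)
    return chunks


# Ten rows, two per hour, starting at 13:00; the calendar date is arbitrary.
start = datetime(2005, 6, 1, 13)
rows = [(start + timedelta(hours=i // 2), 1) for i in range(10)]

print([len(c) for c in chunk_rows(rows, "HOUR", 10000)])
print([len(c) for c in chunk_rows(rows, "DAY", 3)])
```

Under these assumptions, the hourly bucket yields five chunks of two rows, and the daily bucket with max_rows set to 3 yields a first chunk of three rows, matching the outputs above. How the library divides the remaining rows is not shown on this page.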
In DataRobotSource, you can specify the TimeBucket and max_rows parameters for all export types except training data export, which is returned in one chunk.
Provide additional DataRobot deployment properties¶
The Deployment class is a helper class that provides access to relevant deployment properties. DataRobotSource uses this class internally to select the appropriate workflow for working with the data.
import datarobot as dr
from dmm.data_source.datarobot.deployment import Deployment
dr.Client()
deployment = Deployment(deployment_id=deployment_id)
deployment_type = deployment.type()
target_column = deployment.target_column()
positive_class_label = deployment.positive_class_label()
negative_class_label = deployment.negative_class_label()
prediction_threshold = deployment.prediction_threshold()
.
.
.
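As one example of how these properties might be used downstream, the sketch below maps a positive-class probability to a class label for a binary deployment. The function name predicted_label is hypothetical, the values are placeholders standing in for the Deployment properties above, and treating a probability equal to the threshold as positive is an assumption.

```python
def predicted_label(positive_probability, threshold, positive_label, negative_label):
    """Map a positive-class probability to a class label."""
    # Assumption: probabilities at or above the threshold map to the positive class.
    if positive_probability >= threshold:
        return positive_label
    return negative_label


# Placeholder values standing in for the Deployment properties above.
prediction_threshold = 0.5
positive_class_label = "high"
negative_class_label = "low"

for probability in (0.697782, 0.408388):
    label = predicted_label(
        probability, prediction_threshold, positive_class_label, negative_class_label
    )
    print(probability, label)
```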