Using the batch prediction API¶
This notebook demonstrates how to use DataRobot's batch prediction API to score large datasets with a deployed DataRobot model.
The batch prediction API provides flexible intake and output options when scoring large datasets using prediction servers. The API is exposed through the DataRobot Public API and can be consumed using a REST-enabled client or Public API bindings for DataRobot's Python client.
Some features of the batch prediction API include:
- Intake and output configuration.
- Support for streaming local files.
- The ability to initiate scoring while still uploading data and simultaneously downloading the results.
- Scoring large datasets from and to Amazon S3, Azure Blob, and Google Cloud Storage.
- Connecting to external data sources using JDBC with bidirectional streaming of scoring data and results.
- A mix of intake and output options; for example, the ability to score from a local file and return results to an S3 target.
- Protection against prediction server overload with options to control concurrency and request size.
- Prediction Explanations (with an option to add thresholds).
- Support for passthrough columns to correlate scored data with source data.
- Prediction warnings in the output.
Requirements¶
- Python version 3.7.3
- DataRobot API version 2.26.0
- A deployed DataRobot model object
Small adjustments may be required depending on the versions of Python and the DataRobot API you are using.
You can also access full documentation of the Python package.
Connect to DataRobot¶
To initiate scoring jobs through the batch prediction API, you need to connect to DataRobot using the datarobot.Client command. DataRobot recommends providing a configuration file containing your credentials (endpoint and API key) to connect to DataRobot. For more information about authentication, reference the API Quickstart guide.
import datarobot as dr
# If the config file is not in the default location described in the API Quickstart guide, '~/.config/datarobot/drconfig.yaml', then you will need to call
# dr.Client(config_path='path-to-drconfig.yaml')
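If you prefer not to use a configuration file, you can also pass your endpoint and API key directly to the client. A minimal sketch with placeholder values:

dr.Client(
    endpoint="https://app.datarobot.com/api/v2",  # Replace with your DataRobot endpoint
    token="YOUR_API_KEY",  # Replace with your API key
)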
Set the deployment ID¶
Before proceeding, provide the deployed model's deployment ID (retrieved from the deployment's Overview tab).
deployment_id = "YOUR_DEPLOYMENT_ID"
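To confirm that the ID refers to the intended deployment, you can retrieve the deployment object and inspect its label. A minimal sketch:

deployment = dr.Deployment.get(deployment_id)
print(deployment.label)  # The deployment's human-readable name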
Determine input and output options¶
DataRobot's batch prediction API allows you to score data from and to multiple sources. You can take advantage of credentials and data sources already established through the UI for easy scoring. Credentials are usernames and passwords, while data sources are any databases with which you have previously established a connection (e.g., Snowflake). View the example code below outlining how to query credentials and data sources.
You can reference the full list of DataRobot's supported input and output options.
The snippet below shows how you can query all credentials tied to a DataRobot account.
dr.Credential.list()
[Credential('5e6696ff820e737a5bd78430', 'adam', 'basic'), Credential('5ed55704397e667bb0caf1c8', 'DATAROBOT', 'basic'), Credential('5ed557e8ae4c4f7ccd1f0fda', 'ta_admin', 'basic'), Credential('5ed55e08397e667c2bcaf137', 'SourceCredentials_PredicitonJob_5ed55e07397e667c2bcaf134', 'basic'), Credential('5ed55e08397e667c2bcaf139', 'TargetCredentials_PredicitonJob_5ed55e07397e667c2bcaf134', 'basic'), Credential('5ed6ba3c397e6611f9caf27d', 'SourceCredentials_PredicitonJob_5ed6ba3c397e6611f9caf27a', 'basic'), Credential('5ed6ba3d397e6611f9caf27f', 'TargetCredentials_PredicitonJob_5ed6ba3c397e6611f9caf27a', 'basic')]
The output above lists multiple sets of credentials. The alphanumeric string in each item of the list is the credential ID. You can use that ID to access credentials through the API.
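Once you know a credential ID, you can retrieve the corresponding credential object directly; the snippet below is a minimal sketch using a placeholder ID.

cred = dr.Credential.get("YOUR_CREDENTIAL_ID")  # Replace with a credential ID from the list above
print(cred.name)  # The name associated with the stored credentials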
The snippet below shows how you can query all data sources tied to a DataRobot account. The second line prints the alphanumeric ID of the first data store in the list; that is the data store ID.
dr.DataStore.list()
print(dr.DataStore.list()[0].id)
5e6696ff820e737a5bd78430
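Once you know a data store ID, you can retrieve the data store object itself; the snippet below is a minimal sketch using a placeholder ID. The cred and data_store objects retrieved here are reused in the JDBC scoring example below.

data_store = dr.DataStore.get("YOUR_DATASTORE_ID")  # Replace with a data store ID from the list above
print(data_store.canonical_name)  # The name the data store was registered with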
Batch prediction scoring examples¶
The snippets below demonstrate how to score data with the batch prediction API. Edit the intake_settings and output_settings to suit your needs. You can mix and match intake and output types until you get the outcome you prefer.
Score from CSV to CSV¶
# Scoring without Prediction Explanations
job = dr.BatchPredictionJob.score(
    deployment_id,
    intake_settings={
        "type": "localFile",
        "file": "inputfile.csv",  # Provide the filepath, pandas DataFrame, or file-like object here
    },
    output_settings={"type": "localFile", "path": "outputfile.csv"},
)

# Scoring with Prediction Explanations
job = dr.BatchPredictionJob.score(
    deployment_id,
    intake_settings={
        "type": "localFile",
        "file": "inputfile.csv",  # Provide the filepath, pandas DataFrame, or file-like object here
    },
    output_settings={"type": "localFile", "path": "outputfile.csv"},
    max_explanations=3,  # Compute Prediction Explanations for the number of features indicated here
)
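Because intake_settings accepts a pandas DataFrame as well as a filepath, you can also score in-memory data directly. The snippet below is a minimal sketch; it assumes inputfile.csv exists locally and contains the feature columns the deployment expects.

import pandas as pd

scoring_df = pd.read_csv("inputfile.csv")  # Any DataFrame with the required feature columns works

job = dr.BatchPredictionJob.score(
    deployment_id,
    intake_settings={"type": "localFile", "file": scoring_df},  # Pass the DataFrame directly
    output_settings={"type": "localFile", "path": "outputfile.csv"},
)
job.wait_for_completion()  # Block until scoring finishes and the results are written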
Score from S3 to S3¶
job = dr.BatchPredictionJob.score(
    deployment_id,
    intake_settings={
        "type": "s3",
        "url": "s3://theos-test-bucket/lending_club_scoring.csv",  # Provide the URL of the file to score
        "credential_id": "YOUR_CREDENTIAL_ID_FROM_ABOVE",  # Provide your credential ID here
    },
    output_settings={
        "type": "s3",
        "url": "s3://theos-test-bucket/lending_club_scored2.csv",  # Provide the URL where results should be written
        "credential_id": "YOUR_CREDENTIAL_ID_FROM_ABOVE",
    },
)
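If you have not yet stored S3 credentials in DataRobot, you can create them with the Python client. The snippet below is a minimal sketch with placeholder AWS keys; the credential name is arbitrary.

s3_cred = dr.Credential.create_s3(
    name="my_s3_credentials",  # Hypothetical name for the stored credentials
    aws_access_key_id="YOUR_AWS_ACCESS_KEY_ID",
    aws_secret_access_key="YOUR_AWS_SECRET_ACCESS_KEY",
)
print(s3_cred.credential_id)  # Use this ID as the credential_id in the intake and output settings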
Score from JDBC to JDBC¶
job = dr.BatchPredictionJob.score(
    deployment_id,
    intake_settings={
        "type": "jdbc",
        "table": "table_name",
        "schema": "public",
        "dataStoreId": data_store.id,  # Provide the ID of your data store here
        "credentialId": cred.credential_id,  # Provide your credential ID here
    },
    output_settings={
        "type": "jdbc",
        "table": "table_name",
        "schema": "public",
        "statementType": "insert",
        "dataStoreId": data_store.id,
        "credentialId": cred.credential_id,
    },
)
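Each score() call returns a BatchPredictionJob object that you can use to monitor the job. The snippet below is a minimal sketch, assuming job is the object returned by one of the calls above.

job.wait_for_completion()  # Block until the job reaches a terminal state
print(job.get_status())  # Inspect the final job state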