Portable batch predictions¶
Portable batch predictions (PBP) let you score large amounts of data in disconnected environments.
Before you can use portable batch predictions, you need to configure the Portable Prediction Server (PPS), a DataRobot execution environment for DataRobot model packages (.mlpkg files) distributed as a self-contained Docker image. Portable batch predictions use the same Docker image as the PPS but run it in a different mode.
Availability information
The Portable Prediction Server is a feature exclusive to DataRobot MLOps. Contact your DataRobot representative for information on enabling it.
Scoring methods¶
Portable batch predictions can use the following adapters to score datasets:
- Filesystem
- JDBC
- AWS S3
- Azure Blob
- GCS
- Snowflake
- Synapse
To run portable batch predictions, you need the following artifacts:

- The PPS Docker image, available from Developer Tools.
- One or more model packages (.mlpkg files) downloaded from your deployment.
- A job definition JSON file, described below.
- A credentials.env file, if you score over JDBC or private cloud storage.

After you prepare these artifacts, you can run portable batch predictions. See also the additional examples of running portable batch predictions below.
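If it helps to see the pieces together, here is a rough, hypothetical preparation sketch; the file names and paths are placeholders, and it assumes the PPS image was downloaded from Developer Tools as a Docker image archive:

# Load the PPS Docker image downloaded from Developer Tools (archive name is a placeholder).
docker load -i datarobot-portable-prediction-api-<version>.tar.gz
# Stage the model package(s) downloaded from your deployment.
mkdir -p /host/filesystem/path/mlpkgs
cp my_model.mlpkg /host/filesystem/path/mlpkgs/
# job_definition.json and credentials.env are described in the sections below.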
Job definitions¶
You can define jobs using a JSON config file in which you describe prediction_endpoint, intake_settings, output_settings, timeseries_settings (optional, for time series scoring), and jdbc_settings (optional, for JDBC scoring).
Self-Managed AI Platform only: Prediction endpoint SSL configuration
If you need to disable SSL verification for the prediction_endpoint, you can set ALLOW_SELF_SIGNED_CERTS to True. This configuration disables SSL certificate verification for requests made by the application to the web server. This is useful if you have SSL encryption enabled on your cluster and are using certificates that are not signed by a globally trusted Certificate Authority (self-signed certificates).
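How you set this variable depends on your environment; as a hedged sketch, assuming it is read from the container's environment, you could pass it to the batch container on the docker run command line (the paths here are placeholders; see the run examples below):

# Sketch: pass ALLOW_SELF_SIGNED_CERTS as a container environment variable (assumed mechanism).
docker run --rm \
--network host \
-e ALLOW_SELF_SIGNED_CERTS=True \
-v /host/filesystem/path/job_definition.json:/tmp/job_definition.json \
datarobot/datarobot-portable-prediction-api:<version> batch /tmp/job_definition.json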
The prediction_endpoint describes how to access the PPS and is constructed as <schema>://<hostname>:<port>, where you define the following attributes:
| Attribute | Description |
|---|---|
| schema | http or https |
| hostname | The hostname of the instance where your PPS is running |
| port | The port of the prediction API running inside the PPS |
The jdbc_settings parameter has the following attributes:
| Attribute | Description |
|---|---|
| url | The URL to connect to via the JDBC interface |
| class_name | The class name used as an entry point for JDBC communication |
| driver_path | The path to the JDBC driver on your filesystem (available inside the PBP container) |
| template_name | The name of the template in case of write-back. To obtain the names of the supported templates, contact your DataRobot representative. |
All other parameters are the same as for regular batch predictions.
The following JDBC example scores to and from Snowflake using a single-model mode PPS running locally, and can be defined in a job_definition_jdbc.json file:
{
    "prediction_endpoint": "http://127.0.0.1:8080",
    "intake_settings": {
        "type": "jdbc",
        "table": "SCORING_DATA",
        "schema": "PUBLIC"
    },
    "output_settings": {
        "type": "jdbc",
        "table": "SCORED_DATA",
        "statement_type": "create_table",
        "schema": "PUBLIC"
    },
    "passthrough_columns_set": "all",
    "include_probabilities": true,
    "jdbc_settings": {
        "url": "jdbc:snowflake://my_account.snowflakecomputing.com/?warehouse=WH&db=DB&schema=PUBLIC",
        "class_name": "net.snowflake.client.jdbc.SnowflakeDriver",
        "driver_path": "/tmp/portable_batch_predictions/jdbc/snowflake-jdbc-3.12.0.jar",
        "template_name": "Snowflake"
    }
}
Credentials environment variables¶
If you are using JDBC or private containers in cloud storage, you can specify the required credentials as environment variables. The following table shows which variable names are used:
| Name | Type | Description |
|---|---|---|
| AWS_ACCESS_KEY_ID | string | AWS access key ID |
| AWS_SECRET_ACCESS_KEY | string | AWS secret access key |
| AWS_SESSION_TOKEN | string | AWS session token |
| GOOGLE_STORAGE_KEYFILE_PATH | string | Path to the GCP credentials file |
| AZURE_CONNECTION_STRING | string | Azure connection string |
| JDBC_USERNAME | string | Username for JDBC |
| JDBC_PASSWORD | string | Password for JDBC |
| SNOWFLAKE_USERNAME | string | Username for Snowflake |
| SNOWFLAKE_PASSWORD | string | Password for Snowflake |
| SYNAPSE_USERNAME | string | Username for Azure Synapse |
| SYNAPSE_PASSWORD | string | Password for Azure Synapse |
Here's an example of the credentials.env file used for JDBC scoring:
JDBC_USERNAME=TEST_USER
JDBC_PASSWORD=SECRET
Run portable batch predictions¶
Portable batch predictions run inside a Docker container. You need to mount your job definition, credentials file, and datasets (if you score from the host filesystem) into the Docker container, and reference their in-container paths in the job definition. Using the JDBC job definition and credentials from the previous examples, the following is a complete example of how to start a portable batch prediction job that scores to and from Snowflake.
docker run --rm \
-v /host/filesystem/path/job_definition_jdbc.json:/docker/container/filesystem/path/job_definition_jdbc.json \
--network host \
--env-file /host/filesystem/path/credentials.env \
datarobot/datarobot-portable-prediction-api:<version> batch /docker/container/filesystem/path/job_definition_jdbc.json
Here is another example of how to run a complete end-to-end flow, including starting the PPS and writing job status back to the DataRobot platform for progress monitoring.
#!/bin/bash
# This snippet starts both the PPS service and PBP job using the same PPS docker image
# available from Developer Tools.
#################
# Configuration #
#################
# Specify path to directory with mlpkg(s) which you can download from deployment
MLPKG_DIR='/host/filesystem/path/mlpkgs'
# Specify job definition path
JOB_DEFINITION_PATH='/host/filesystem/path/job_definition.json'
# Specify path to file with credentials if needed (for cloud storage adapters or JDBC)
CREDENTIALS_PATH='/host/filesystem/path/credentials.env'
# For DataRobot integration, specify API host and Token
API_HOST='https://app.datarobot.com'
API_TOKEN='XXXXXXXX'
# Run PPS service in the background
PPS_CONTAINER_ID=$(docker run --rm -d -p 127.0.0.1:8080:8080 -v $MLPKG_DIR:/opt/ml/model datarobot/datarobot-portable-prediction-api:<version>)
# Wait some time for the PPS to start up
sleep 15
# Run PPS in batch mode to start PBP job
docker run --rm -v $JOB_DEFINITION_PATH:/tmp/job_definition.json \
--network host \
--env-file $CREDENTIALS_PATH \
datarobot/datarobot-portable-prediction-api:<version> batch /tmp/job_definition.json \
--api_host $API_HOST --api_token $API_TOKEN
# Stop PPS service
docker stop $PPS_CONTAINER_ID
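The fixed sleep 15 above is a simple heuristic. As an alternative sketch that assumes nothing beyond the prediction endpoint accepting HTTP connections, you could poll the PPS until it responds before launching the batch job:

# Sketch: wait until the PPS answers on 127.0.0.1:8080 instead of sleeping a fixed amount of time.
# Any HTTP response (even an error status) means the server is accepting connections.
until curl --silent --output /dev/null http://127.0.0.1:8080; do
    sleep 1
done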
More examples¶
In all of the following examples, assume that PPS is running locally on port 8080, and the filesystem structure has the following format:
/host/filesystem/path/portable_batch_predictions/
├── job_definition.json
├── credentials.env
├── datasets
│   └── intake_dataset.csv
├── output
└── jdbc
    └── snowflake-jdbc-3.12.0.jar
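If you want to recreate this layout quickly, here is a minimal sketch (the dataset, JDBC driver, job definition, and credentials file are artifacts you supply):

# Sketch: create the host-side directory layout used by the examples below.
BASE=/host/filesystem/path/portable_batch_predictions
mkdir -p "$BASE/datasets" "$BASE/output" "$BASE/jdbc"
# Copy in your own artifacts, for example:
# cp intake_dataset.csv "$BASE/datasets/"
# cp snowflake-jdbc-3.12.0.jar "$BASE/jdbc/"
# cp job_definition.json credentials.env "$BASE/"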
Filesystem scoring with single-model mode PPS¶
job_definition.json file:
{
    "prediction_endpoint": "http://127.0.0.1:8080",
    "intake_settings": {
        "type": "filesystem",
        "path": "/tmp/portable_batch_predictions/datasets/intake_dataset.csv"
    },
    "output_settings": {
        "type": "filesystem",
        "path": "/tmp/portable_batch_predictions/output/results.csv"
    }
}
#!/bin/bash
docker run --rm \
--network host \
-v /host/filesystem/path/portable_batch_predictions:/tmp/portable_batch_predictions \
datarobot/datarobot-portable-prediction-api:<version> batch \
/tmp/portable_batch_predictions/job_definition.json
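Because the output directory is bind-mounted from the host, the scored file lands at /host/filesystem/path/portable_batch_predictions/output/results.csv; a quick sanity check might be:

# Sketch: inspect the first few scored rows on the host after the job completes.
head /host/filesystem/path/portable_batch_predictions/output/results.csv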
Filesystem scoring with multi-model mode PPS¶
job_definition.json file:
{
    "prediction_endpoint": "http://127.0.0.1:8080",
    "deployment_id": "lending_club",
    "intake_settings": {
        "type": "filesystem",
        "path": "/tmp/portable_batch_predictions/datasets/intake_dataset.csv"
    },
    "output_settings": {
        "type": "filesystem",
        "path": "/tmp/portable_batch_predictions/output/results.csv"
    }
}
#!/bin/bash
docker run --rm \
--network host \
-v /host/filesystem/path/portable_batch_predictions:/tmp/portable_batch_predictions \
datarobot/datarobot-portable-prediction-api:<version> batch \
/tmp/portable_batch_predictions/job_definition.json
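In multi-model mode, the PPS serves several model packages at once, and the deployment_id in the job definition selects which one scores the data. A hypothetical layout of the model directory mounted into the PPS (the lending_club name only mirrors the example above; the exact packaging of your MLPKGs may differ):

# Hypothetical multi-model layout mounted at /opt/ml/model inside the PPS container.
/host/filesystem/path/mlpkgs/
├── lending_club
│   └── model.mlpkg
└── another_model
    └── model.mlpkg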
Filesystem scoring with multi-model mode PPS and integration with DR job status tracking¶
job_definition.json file:
{
    "prediction_endpoint": "http://127.0.0.1:8080",
    "deployment_id": "lending_club",
    "intake_settings": {
        "type": "filesystem",
        "path": "/tmp/portable_batch_predictions/datasets/intake_dataset.csv"
    },
    "output_settings": {
        "type": "filesystem",
        "path": "/tmp/portable_batch_predictions/output/results.csv"
    }
}
For the PPS MLPKG, in config.yaml, specify the deployment ID of the deployment for which you are running the portable batch prediction job.
#!/bin/bash
docker run --rm \
--network host \
-v /host/filesystem/path/portable_batch_predictions:/tmp/portable_batch_predictions \
datarobot/datarobot-portable-prediction-api:<version> batch \
/tmp/portable_batch_predictions/job_definition.json \
--api_host https://app.datarobot.com --api_token XXXXXXXXXXXXXXXXXXX
JDBC scoring with single-model mode PPS¶
job_definition.json file:
{
    "prediction_endpoint": "http://127.0.0.1:8080",
    "deployment_id": "lending_club",
    "intake_settings": {
        "type": "jdbc",
        "table": "INTAKE_TABLE"
    },
    "output_settings": {
        "type": "jdbc",
        "table": "OUTPUT_TABLE",
        "statement_type": "create_table"
    },
    "passthrough_columns_set": "all",
    "include_probabilities": true,
    "jdbc_settings": {
        "url": "jdbc:snowflake://your_account.snowflakecomputing.com/?warehouse=SOME_WH&db=MY_DB&schema=MY_SCHEMA",
        "class_name": "net.snowflake.client.jdbc.SnowflakeDriver",
        "driver_path": "/tmp/portable_batch_predictions/jdbc/snowflake-jdbc-3.12.0.jar",
        "template_name": "Snowflake"
    }
}
credentials.env file:
JDBC_USERNAME=TEST
JDBC_PASSWORD=SECRET
#!/bin/bash
docker run --rm \
--network host \
-v /host/filesystem/path/portable_batch_predictions:/tmp/portable_batch_predictions \
--env-file /host/filesystem/path/credentials.env \
datarobot/datarobot-portable-prediction-api:<version> batch \
/tmp/portable_batch_predictions/job_definition.json
S3 scoring with single-model mode PPS¶
job_definition.json file:
{
    "prediction_endpoint": "http://127.0.0.1:8080",
    "intake_settings": {
        "type": "s3",
        "url": "s3://intake/dataset.csv",
        "format": "csv"
    },
    "output_settings": {
        "type": "s3",
        "url": "s3://output/result.csv",
        "format": "csv"
    }
}
credentials.env file:
AWS_ACCESS_KEY_ID=XXXXXXXXXXXX
AWS_SECRET_ACCESS_KEY=XXXXXXXXXXX
#!/bin/bash
docker run --rm \
--network host \
-v /host/filesystem/path/portable_batch_predictions:/tmp/portable_batch_predictions \
--env-file /host/filesystem/path/credentials.env \
datarobot/datarobot-portable-prediction-api:<version> batch \
/tmp/portable_batch_predictions/job_definition.json
Snowflake scoring with multi-model mode PPS¶
job_definition.json file:
{
    "prediction_endpoint": "http://127.0.0.1:8080",
    "deployment_id": "lending_club",
    "intake_settings": {
        "type": "snowflake",
        "table": "INTAKE_TABLE",
        "schema": "MY_SCHEMA",
        "external_stage": "MY_S3_STAGE_IN_SNOWFLAKE"
    },
    "output_settings": {
        "type": "snowflake",
        "table": "OUTPUT_TABLE",
        "schema": "MY_SCHEMA",
        "external_stage": "MY_S3_STAGE_IN_SNOWFLAKE",
        "statement_type": "insert"
    },
    "passthrough_columns_set": "all",
    "include_probabilities": true,
    "jdbc_settings": {
        "url": "jdbc:snowflake://your_account.snowflakecomputing.com/?warehouse=SOME_WH&db=MY_DB&schema=MY_SCHEMA",
        "class_name": "net.snowflake.client.jdbc.SnowflakeDriver",
        "driver_path": "/tmp/portable_batch_predictions/jdbc/snowflake-jdbc-3.12.0.jar",
        "template_name": "Snowflake"
    }
}
credentials.env file:
# Snowflake creds for JDBC connectivity
SNOWFLAKE_USERNAME=TEST
SNOWFLAKE_PASSWORD=SECRET
# AWS creds needed to access external stage
AWS_ACCESS_KEY_ID=XXXXXXXXXXXX
AWS_SECRET_ACCESS_KEY=XXXXXXXXXXX
#!/bin/bash
docker run --rm \
--network host \
-v /host/filesystem/path/portable_batch_predictions:/tmp/portable_batch_predictions \
--env-file /host/filesystem/path/credentials.env \
datarobot/datarobot-portable-prediction-api:<version> batch \
/tmp/portable_batch_predictions/job_definition.json
Time series scoring over Azure Blob with multi-model mode PPS¶
job_definition.json file:
{
    "prediction_endpoint": "http://127.0.0.1:8080",
    "deployment_id": "euro_date_ts_mlpkg",
    "intake_settings": {
        "type": "azure",
        "url": "https://batchpredictionsdev.blob.core.windows.net/datasets/euro_date.csv",
        "format": "csv"
    },
    "output_settings": {
        "type": "azure",
        "url": "https://batchpredictionsdev.blob.core.windows.net/results/output_ts.csv",
        "format": "csv"
    },
    "timeseries_settings": {
        "type": "forecast",
        "forecast_point": "2007-11-14",
        "relax_known_in_advance_features_check": true
    }
}
credentials.env file:
# Azure Blob connection string
AZURE_CONNECTION_STRING=DefaultEndpointsProtocol=https;AccountName=myaccount;AccountKey=XXX;EndpointSuffix=core.windows.net
#!/bin/bash
docker run --rm \
--network host \
-v /host/filesystem/path/portable_batch_predictions:/tmp/portable_batch_predictions \
--env-file /host/filesystem/path/credentials.env \
datarobot/datarobot-portable-prediction-api:<version> batch \
/tmp/portable_batch_predictions/job_definition.json