Portable batch predictions

Portable batch predictions (PBP) provide the ability to score large amounts of data in disconnected environments using the Portable Prediction Server (PPS).

Obtain the PBP Docker image

Portable batch predictions use the same Docker image as the PPS but run it in a different mode. See the documentation for retrieving a Portable Prediction Server Docker image before proceeding.
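
Once retrieved, the image can be loaded into your local Docker daemon. A minimal sketch, assuming the image was downloaded as a tarball from Developer Tools (the filename is illustrative and varies by version):

# Load the PPS image tarball downloaded from Developer Tools
docker load -i datarobot-portable-prediction-api-<version>.tar.gz

# Confirm the image is now available locally
docker images datarobot/datarobot-portable-prediction-api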

Scoring methods

Portable batch predictions can use the following adapters to score datasets:

  • Filesystem
  • JDBC
  • AWS S3
  • Azure Blob
  • GCS
  • Snowflake
  • Synapse

To run portable batch predictions, you need the artifacts described in the sections below.

Job definitions

You can define jobs using a JSON config file that describes prediction_endpoint, intake_settings, output_settings, timeseries_settings (optional, for time series scoring), and jdbc_settings (optional, for JDBC scoring).

prediction_endpoint describes how to access the PPS and takes the form schema://hostname:port, where:

  • schema is http|https
  • hostname is the hostname of the instance where your PPS is running
  • port is the port of the prediction API running inside PPS
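
For example, for a PPS running locally on port 8080:

"prediction_endpoint": "http://127.0.0.1:8080"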

jdbc_settings has the following attributes:

  • url - The URL to connect via the JDBC interface
  • class_name - The class name which is used as an entry point for JDBC communication
  • driver_path - The path to the JDBC driver on your filesystem (available inside the PBP container)
  • template_name - The name of the template to use for write-back. To obtain the names of the supported templates, contact your DataRobot representative.

All other parameters are the same as for regular Batch Predictions.

The following JDBC example, which can be saved as a job_definition_jdbc.json file, scores to and from Snowflake using a single-model PPS running locally:

{
    "prediction_endpoint": "http://127.0.0.1:8080",
    "intake_settings": {
        "type": "jdbc",
        "table": "SCORING_DATA",
        "schema": "PUBLIC"
    },
    "output_settings": {
        "type": "jdbc",
        "table": "SCORED_DATA",
        "statement_type": "create_table",
        "schema": "PUBLIC"
    },
    "passthrough_columns_set": "all",
    "include_probabilities": true,
    "jdbc_settings": {
        "url": "jdbc:snowflake://my_account.snowflakecomputing.com/?warehouse=WH&db=DB&schema=PUBLIC",
        "class_name": "net.snowflake.client.jdbc.SnowflakeDriver",
        "driver_path": "/tmp/portable_batch_predictions/jdbc/snowflake-jdbc-3.12.0.jar",
        "template_name": "Snowflake"
    }
}
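
Before launching a job, it can help to confirm that the file parses as valid JSON, for example with jq (assuming jq is installed on the host):

jq . job_definition_jdbc.json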

Credentials environment variables

If you are using JDBC or private containers in cloud storage, you can specify the required credentials as environment variables. The following table shows which variable names are used:

Name                         Type    Description
AWS_ACCESS_KEY_ID            string  AWS access key ID
AWS_SECRET_ACCESS_KEY        string  AWS secret access key
AWS_SESSION_TOKEN            string  AWS session token
GOOGLE_STORAGE_KEYFILE_PATH  string  Path to the GCP credentials file
AZURE_CONNECTION_STRING      string  Azure connection string
JDBC_USERNAME                string  Username for JDBC
JDBC_PASSWORD                string  Password for JDBC
SNOWFLAKE_USERNAME           string  Username for Snowflake
SNOWFLAKE_PASSWORD           string  Password for Snowflake
SYNAPSE_USERNAME             string  Username for Azure Synapse
SYNAPSE_PASSWORD             string  Password for Azure Synapse

Here's an example of a credentials.env file used for JDBC scoring (note that docker --env-file expects plain KEY=VALUE lines, without export):

JDBC_USERNAME=TEST_USER
JDBC_PASSWORD=SECRET
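
To verify that Docker picks up the variables from the file, you can print them from a throwaway container (busybox here is just an illustrative image):

docker run --rm --env-file credentials.env busybox env | grep JDBC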

Run portable batch predictions

Portable batch predictions run inside a Docker container. You need to mount the job definition file, and, if you are scoring from the host filesystem, the datasets themselves, into the container (setting the corresponding in-container paths in the job definition). Using the JDBC job definition and credentials from the previous examples, the following shows how to start a portable batch predictions job that scores to and from Snowflake.

$ docker run --rm \
    -v /host/filesystem/path/job_definition_jdbc.json:/docker/container/filesystem/path/job_definition_jdbc.json \
    --network host \
    --env-file /host/filesystem/path/credentials.env \
    datarobot/datarobot-portable-prediction-api:<version> batch /docker/container/filesystem/path/job_definition_jdbc.json

Here is another example that runs a complete end-to-end flow, starting the PPS and writing job status back to the DataRobot platform so you can monitor progress.

#!/bin/bash

# This snippet starts both the PPS service and PBP job using the same PPS docker image
# available from Developer Tools.

#################
# Configuration #
#################

# Specify the path to the directory with the MLPKG(s), which you can download from the deployment
MLPKG_DIR='/path/to/your/mlpkgs'
# Specify the job definition path
JOB_DEFINITION_PATH='/path/to/your/job_definition.json'
# Specify the path to the credentials file, if needed (for cloud storage adapters or JDBC)
CREDENTIALS_PATH='/path/to/credentials.env'
# For DataRobot integration, specify the API host and token
API_HOST='https://app.datarobot.com'
API_TOKEN='XXXXXXXX'

# Run PPS service in background
PPS_CONTAINER_ID=$(docker run --rm -d -p 127.0.0.1:8080:8080 -v $MLPKG_DIR:/opt/ml/model datarobot/datarobot-portable-prediction-api:<version>)
# Give the PPS some time to start up
sleep 15
# Run PPS in batch mode to start PBP job
docker run --rm -v $JOB_DEFINITION_PATH:/tmp/job_definition.json \
           --network host \
           --env-file $CREDENTIALS_PATH datarobot/datarobot-portable-prediction-api:<version> batch /tmp/job_definition.json \
           --api_host $API_HOST --api_token $API_TOKEN
# Stop PPS service
docker stop $PPS_CONTAINER_ID
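
Instead of a fixed sleep, you could poll until the PPS accepts connections. A minimal sketch, assuming the service responds over HTTP on port 8080 once ready (the root path here is an assumption):

# Poll for up to 60 seconds until the PPS answers
for i in $(seq 1 60); do
    curl -s -o /dev/null http://127.0.0.1:8080/ && break
    sleep 1
done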

More examples

In all of the following examples, assume that PPS is running locally on port 8080, and the filesystem structure has the following format:

/path/to/portable_batch_predictions/
|-- job_definition.json
|-- credentials.env
|-- datasets
|   `-- intake_dataset.csv
|-- output
|-- jdbc
|   `-- snowflake-jdbc-3.12.0.jar

Filesystem scoring with single-model mode PPS

job_definition.json file:

{
    "prediction_endpoint": "http://127.0.0.1:8080",
    "intake_settings": {
        "type": "filesystem",
        "path": "/tmp/portable_batch_predictions/datasets/intake_dataset.csv",
    },
    "output_settings": {
        "type": "filesystem",
        "path": "/tmp/portable_batch_predictions/output/results.csv"
    }
}

#!/bin/bash

docker run --rm \
           --network host \
           -v /path/to/portable_batch_predictions:/tmp/portable_batch_predictions \
           datarobot/datarobot-portable-prediction-api:<version> batch /tmp/portable_batch_predictions/job_definition.json

Filesystem scoring with multi-model mode PPS

job_definition.json file:

{
    "prediction_endpoint": "http://127.0.0.1:8080",
    "deployment_id": "lending_club",
    "intake_settings": {
        "type": "filesystem",
        "path": "/tmp/portable_batch_predictions/datasets/intake_dataset.csv",
    },
    "output_settings": {
        "type": "filesystem",
        "path": "/tmp/portable_batch_predictions/output/results.csv"
    }
}

#!/bin/bash

docker run --rm \
           --network host \
           -v /path/to/portable_batch_predictions:/tmp/portable_batch_predictions \
           datarobot/datarobot-portable-prediction-api:<version> batch /tmp/portable_batch_predictions/job_definition.json

Filesystem scoring with multi-model mode PPS and integration with DataRobot job status tracking

job_definition.json file:

{
    "prediction_endpoint": "http://127.0.0.1:8080",
    "deployment_id": "lending_club",
    "intake_settings": {
        "type": "filesystem",
        "path": "/tmp/portable_batch_predictions/datasets/intake_dataset.csv",
    },
    "output_settings": {
        "type": "filesystem",
        "path": "/tmp/portable_batch_predictions/output/results.csv"
    }
}

For the PPS MLPKG, specify in config.yaml the deployment ID of the deployment for which you are running the portable batch prediction job, as shown in the sketch below.
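
A minimal sketch of the relevant config.yaml entry (the exact schema may vary by version, so check the config.yaml shipped in your MLPKG):

# config.yaml inside the MLPKG (deployment ID value is illustrative)
deployment_id: lending_club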

#!/bin/bash

docker run --rm \
           --network host \
           -v /path/to/portable_batch_predictions:/tmp/portable_batch_predictions \
           datarobot/datarobot-portable-prediction-api:<version> batch /tmp/portable_batch_predictions/job_definition.json \
           --api_host https://app.datarobot.com --api_token XXXXXXXXXXXXXXXXXXX

JDBC scoring with single-model mode PPS

job_definition.json file:

{
    "prediction_endpoint": "http://127.0.0.1:8080",
    "intake_settings": {
        "type": "jdbc",
        "table": "INTAKE_TABLE"
    },
    "output_settings": {
        "type": "jdbc",
        "table": "OUTPUT_TABLE",
        "statement_type": "create_table"
    },
    "passthrough_columns_set": "all",
    "include_probabilities": true,
    "jdbc_settings": {
        "url": "jdbc:snowflake://your_account.snowflakecomputing.com/?warehouse=SOME_WH&db=MY_DB&schema=MY_SCHEMA",
        "class_name": "net.snowflake.client.jdbc.SnowflakeDriver",
        "driver_path": "/tmp/portable_batch_predictions/jdbc/snowflake-jdbc-3.12.0.jar",
        "template_name": "Snowflake"
    }
}

credentials.env file:

JDBC_USERNAME=TEST
JDBC_PASSWORD=SECRET

#!/bin/bash

docker run --rm \
           --network host \
           -v /path/to/portable_batch_predictions:/tmp/portable_batch_predictions \
           --env-file /path/to/portable_batch_predictions/credentials.env \
           datarobot/datarobot-portable-prediction-api:<version> batch /tmp/portable_batch_predictions/job_definition.json

S3 scoring with single-model mode PPS

job_definition.json file:

{
    "prediction_endpoint": "http://127.0.0.1:8080",
    "intake_settings": {
        "type": "s3",
        "url": "s3://intake/dataset.csv",
        "format": "csv"
    },
    "output_settings": {
        "type": "s3",
        "url": "s3://output/result.csv",
        "format": "csv"
    }
}

credentials.env file:

AWS_ACCESS_KEY_ID=XXXXXXXXXXXX
AWS_SECRET_ACCESS_KEY=XXXXXXXXXXX
   #!/bin/bash

    docker run --rm
               --network host
               -v /path/to/portable_batch_predictions:/tmp/portable_batch_predictions
               --env-file ~/DevApps/datarobot/demo/portable_batch_predictions/credentials.env
               datarobot/datarobot-portable-prediction-api:<version> batch /tmp/portable_batch_predictions/job_definition.json

Snowflake scoring with multi-model mode PPS

job_definition.json file:

{
    "prediction_endpoint": "http://127.0.0.1:8080",
    "deployment_id": "lending_club",
    "intake_settings": {
        "type": "snowflake",
        "table": "INTAKE_TABLE",
        "schema": "MY_SCHEMA",
        "external_stage": "MY_S3_STAGE_IN_SNOWFLAKE"
    },
    "output_settings": {
        "type": "snowflake",
        "table": "OUTPUT_TABLE",
        "schema": "MY_SCHEMA",
        "external_stage": "MY_S3_STAGE_IN_SNOWFLAKE",
        "statement_type": "insert"
    },
    "passthrough_columns_set": "all",
    "include_probabilities": true,
    "jdbc_settings": {
        "url": "jdbc:snowflake://your_account.snowflakecomputing.com/?warehouse=SOME_WH&db=MY_DB&schema=MY_SCHEMA"
        "class_name": "net.snowflake.client.jdbc.SnowflakeDriver",
        "driver_path": "/Users/andriy.popovych/DevApps/datarobot/demo/portable_batch_predictions/jdbc/snowflake-jdbc-3.12.0.jar",
        "template_name": "Snowflake"
    }
}

credentials.env file:

# Snowflake creds for JDBC connectivity
SNOWFLAKE_USERNAME=TEST
SNOWFLAKE_PASSWORD=SECRET
# AWS creds needed to access external stage
AWS_ACCESS_KEY_ID=XXXXXXXXXXXX
AWS_SECRET_ACCESS_KEY=XXXXXXXXXXX

#!/bin/bash

docker run --rm \
           --network host \
           -v /path/to/portable_batch_predictions:/tmp/portable_batch_predictions \
           --env-file /path/to/portable_batch_predictions/credentials.env \
           datarobot/datarobot-portable-prediction-api:<version> batch /tmp/portable_batch_predictions/job_definition.json

Time series scoring over Azure Blob with multi-model mode PPS

job_definition.json file:

{
    "prediction_endpoint": "http://127.0.0.1:8080",
    "deployment_id": "euro_date_ts_mlpkg",
    "intake_settings": {
        "type": "azure",
        "url": "https://batchpredictionsdev.blob.core.windows.net/datasets/euro_date.csv",
        "format": "csv"
    },
    "output_settings": {
        "type": "azure",
        "url": "https://batchpredictionsdev.blob.core.windows.net/results/output_ts.csv",
        "format": "csv"
    },
    "timeseries_settings":{
        "type": "forecast",
        "forecast_point": "2007-11-14",
        "relax_known_in_advance_features_check": true
    }
}

credentials.env file:

# Azure Blob connection string
AZURE_CONNECTION_STRING=DefaultEndpointsProtocol=https;AccountName=myaccount;AccountKey=XXX;EndpointSuffix=core.windows.net

#!/bin/bash

docker run --rm \
           --network host \
           -v /path/to/portable_batch_predictions:/tmp/portable_batch_predictions \
           --env-file /path/to/portable_batch_predictions/credentials.env \
           datarobot/datarobot-portable-prediction-api:<version> batch /tmp/portable_batch_predictions/job_definition.json

Updated December 2, 2021