
Using the batch prediction API

This notebook demonstrates how to use DataRobot's batch prediction API to score large datasets with a deployed DataRobot model.

The batch prediction API provides flexible intake and output options when scoring large datasets using prediction servers. The API is exposed through the DataRobot Public API and can be consumed using a REST-enabled client or the Public API bindings of DataRobot's Python client.

Some features of the batch prediction API include:

  • Intake and output configuration.
  • Support for streaming local files.
  • The ability to initiate scoring while still uploading data and simultaneously downloading the results.
  • Scoring large datasets from and to Amazon S3, Azure Blob, and Google Cloud Storage.
  • Connecting to external data sources using JDBC with bidirectional streaming of scoring data and results.
  • A mix of intake and output options; for example, the ability to score from a local file and return results to an S3 target.
  • Protection against prediction server overload with concurrency and request size control options.
  • Prediction Explanations (with an option to add thresholds).
  • Support for passthrough columns to correlate scored data with source data.
  • Prediction warnings in the output.

Requirements

Small adjustments may be required depending on the versions of Python and the DataRobot Python client you are using.

You can also access the full documentation for the Python package.
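
For example, a quick way to check the versions in your environment (a minimal sketch, assuming Python 3.8+ for importlib.metadata):

import sys
from importlib.metadata import version

# Print the Python and DataRobot client versions in use so you can
# match them against the documentation for your release.
print(f"Python version: {sys.version}")
print(f"datarobot client version: {version('datarobot')}")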

Connect to DataRobot

To initiate scoring jobs through the batch prediction API, you need to connect to DataRobot through the datarobot.Client command. DataRobot recommends providing a configuration file containing your credentials (endpoint and API key) to connect to DataRobot. For more information about authentication, reference the API Quickstart guide.

import datarobot as dr

dr.Client(config_path='/path/to/drconfig.yaml')
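
If you prefer not to use a configuration file, you can also pass the endpoint and API token directly; the values below are placeholders.

# Alternative: connect by passing the endpoint and API token directly.
dr.Client(
    endpoint='https://app.datarobot.com/api/v2',  # Replace with your DataRobot endpoint
    token='YOUR_API_TOKEN'                        # Replace with your API token
)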

Set the deployment ID

Before proceeding, provide the deployed model's deployment ID (retrieved from the deployment's Overview tab).

deployment_id = "YOUR_DEPLOYMENT_ID"
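
If you do not have the ID handy, one way to look it up programmatically is to list your deployments and match on the label; the label below is hypothetical.

# List deployments to find the one you want to score against.
for deployment in dr.Deployment.list():
    print(deployment.id, deployment.label)

# Or select a deployment ID by its (hypothetical) label.
deployment_id = next(
    d.id for d in dr.Deployment.list() if d.label == 'My deployed model'
)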

Determine input and output options

DataRobot's batch prediction API allows you to score data from and to multiple sources. You can take advantage of the credentials and data sources you have already established through the UI for easy scoring. Credentials are stored usernames and passwords, while data sources are any databases with which you have previously established a connection (for example, Snowflake). The example code below outlines how to query credentials and data sources.

You can reference the full list of DataRobot's supported input and output options.

The snippet below shows how you can query all credentials tied to a DataRobot account.

dr.Credential.list()
[Credential('5e6696ff820e737a5bd78430', 'adam', 'basic'),
 Credential('5ed55704397e667bb0caf1c8', 'DATAROBOT', 'basic'),
 Credential('5ed557e8ae4c4f7ccd1f0fda', 'ta_admin', 'basic'),
 Credential('5ed55e08397e667c2bcaf137', 'SourceCredentials_PredicitonJob_5ed55e07397e667c2bcaf134', 'basic'),
 Credential('5ed55e08397e667c2bcaf139', 'TargetCredentials_PredicitonJob_5ed55e07397e667c2bcaf134', 'basic'),
 Credential('5ed6ba3c397e6611f9caf27d', 'SourceCredentials_PredicitonJob_5ed6ba3c397e6611f9caf27a', 'basic'),
 Credential('5ed6ba3d397e6611f9caf27f', 'TargetCredentials_PredicitonJob_5ed6ba3c397e6611f9caf27a', 'basic')]

The output above lists multiple sets of credentials. The alphanumeric string included in each item of the list is the credential ID; you can use that ID to reference the credentials through the API.

The snippet below shows how you can query all data stores tied to a DataRobot account. The second line prints the ID of the first data store in the list; that alphanumeric string is the data store ID.

dr.DataStore.list()
print(dr.DataStore.list()[0].id)
5e6696ff820e737a5bd78430
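
Rather than copying IDs manually, you can also look entries up by name; the sketch below assumes a credential named 'DATAROBOT' and a data store named 'Snowflake' exist in your account.

# Look up a credential and a data store by name (example names; adjust to your account).
cred = next(c for c in dr.Credential.list() if c.name == 'DATAROBOT')
data_store = next(ds for ds in dr.DataStore.list() if ds.canonical_name == 'Snowflake')

print(cred.credential_id, data_store.id)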

Batch prediction scoring examples

The snippets below demonstrate how to score data with the batch prediction API. Edit the intake_settings and output_settings to suit your needs; you can mix and match intake and output types until you get the outcome you prefer.

Score from CSV to CSV

# Scoring without Prediction Explanations
dr.BatchPredictionJob.score(
    deployment_id,
    intake_settings={
        'type': 'localFile',
        'file': 'inputfile.csv'  # Provide a file path, pandas DataFrame, or file-like object here
    },
    output_settings={
        'type': 'localFile',
        'file': 'outputfile.csv'
    }
)

# Scoring with Prediction Explanations
dr.BatchPredictionJob.score(
    deployment_id,
    intake_settings={
        'type': 'localFile',
        'file': 'inputfile.csv'  # Provide a file path, pandas DataFrame, or file-like object here
    },
    output_settings={
        'type': 'localFile',
        'file': 'outputfile.csv'
    },
    max_explanations=3  # Return up to this many Prediction Explanations per prediction
)
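
score() returns a BatchPredictionJob object that you can use to track progress. A minimal sketch, assuming the job methods available in recent client versions:

# Keep a reference to the job so you can monitor it.
job = dr.BatchPredictionJob.score(
    deployment_id,
    intake_settings={'type': 'localFile', 'file': 'inputfile.csv'},
    output_settings={'type': 'localFile', 'file': 'outputfile.csv'}
)

# Wait for the job to finish (a no-op if it already completed), then inspect its status.
job.wait_for_completion()
print(job.get_status())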

Score from S3 to S3

dr.BatchPredictionJob.score(
    deployment_id,
    intake_settings={
        'type': 's3',
        'url': 's3://theos-test-bucket/lending_club_scoring.csv',  # Provide the URL of the S3 file to score
        'credential_id': 'YOUR_CREDENTIAL_ID_FROM_ABOVE'  # Provide the credential ID retrieved above
    },
    output_settings={
        'type': 's3',
        'url': 's3://theos-test-bucket/lending_club_scored2.csv',
        'credential_id': 'YOUR_CREDENTIAL_ID_FROM_ABOVE'
    }
)
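
If you have not yet stored S3 credentials in DataRobot, you can create them through the client and reuse the returned credential ID; the keys below are placeholders.

# Store AWS keys in DataRobot once, then reuse the credential ID in scoring jobs.
credential = dr.Credential.create_s3(
    name='my-s3-credentials',
    aws_access_key_id='YOUR_AWS_ACCESS_KEY_ID',
    aws_secret_access_key='YOUR_AWS_SECRET_ACCESS_KEY'
)
print(credential.credential_id)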

Score from JDBC to JDBC

dr.BatchPredictionJob.score(
    deployment_id,
    intake_settings={
        'type': 'jdbc',
        'table': 'table_name',
        'schema': 'public',
        'dataStoreId': data_store.id,  # Provide the ID of your data store here
        'credentialId': cred.credential_id  # Provide the ID of your credentials here
    },
    output_settings={
        'type': 'jdbc',
        'table': 'table_name',
        'schema': 'public',
        'statementType': 'insert',
        'dataStoreId': data_store.id,
        'credentialId': cred.credential_id
    }
)
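
As noted above, intake and output types can be mixed. The sketch below scores a local file and writes the results to S3, carrying a passthrough column through to the output so scored rows can be matched back to the source data; the bucket URL and column name are placeholders.

job = dr.BatchPredictionJob.score(
    deployment_id,
    intake_settings={
        'type': 'localFile',
        'file': 'inputfile.csv'
    },
    output_settings={
        'type': 's3',
        'url': 's3://your-bucket/scored/outputfile.csv',
        'credential_id': 'YOUR_CREDENTIAL_ID_FROM_ABOVE'
    },
    passthrough_columns=['customer_id']  # Hypothetical column to copy from the input to the output
)
job.wait_for_completion()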
