Batch Prediction API

The Batch Prediction API provides flexible options for intake and output when scoring large datasets using the prediction servers you have already deployed. The API is exposed through the DataRobot Public API. To access the API documentation, sign in to DataRobot, click the question mark on the upper right, and select API Documentation. The API can be consumed using any REST-enabled client or the DataRobot Python Public API bindings.

The main features of the API are:

  • Flexible options for intake and output:
    • Stream local files: scoring starts while the upload is still in progress, and results can be downloaded at the same time.
    • Score large datasets from and to S3.
    • Read datasets from the AI Catalog.
    • Connect to external data sources using JDBC with bidirectional streaming of scoring data and results.
    • Mix intake and output options, for example scoring from a local file to an S3 target.
  • Protection against prediction server overload with a concurrency control level option.
  • Inclusion of Prediction Explanations (with an option to add thresholds).
  • Support for passthrough columns to correlate scored data with source data.
  • Addition of prediction warnings in the output.
  • The ability to make predictions with files greater than 1 GB via the API.

Limits

Item | Managed AI Cloud | On-premise or private/hybrid cloud
Job runtime limit | 4 hours | Unlimited
Local file intake size | 10 GB | Unlimited
Local file write size | Unlimited | Unlimited
S3 intake size | Unlimited | Unlimited
S3 write size | 100 GB | 100 GB (configurable)
Azure intake size | 4.75 TB | 4.75 TB
Azure write size | 195 GB | 195 GB
GCP intake size | 5 TB | 5 TB
GCP write size | 5 TB | 5 TB
JDBC intake size | Unlimited | Unlimited
JDBC output size | Unlimited | Unlimited
Concurrent jobs | 1 per prediction instance | 1 per installation
Stored data retention time (for local file adapters) | 48 hours | 48 hours (configurable)

Concurrent jobs

To ensure that the prediction server does not get overloaded, DataRobot will only run one job per prediction instance. Further jobs are queued and started as soon as previous jobs complete.

Data pipeline

A Batch Prediction job is a data pipeline consisting of:

Data Intake > Concurrent Scoring > Data Output

On creation, the job's intakeSettings and outputSettings define the data intake and data output parts of the pipeline. You can configure any combination of intake and output options. For both, the default is local file, meaning you will have to issue a separate PUT request with the data to score and subsequently download the scored data.
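
As a minimal sketch of the defaults described above, the helper below assembles a job-creation payload in which both ends of the pipeline fall back to local file unless other settings are supplied. The helper name build_job_payload and the deployment ID are hypothetical placeholders; only the deploymentId, intakeSettings, and outputSettings keys and the localFile type are taken from this page.

```python
import copy

# Default adapter for both ends of the pipeline, per the paragraph above.
LOCAL_FILE = {"type": "localFile"}


def build_job_payload(deployment_id, intake_settings=None, output_settings=None):
    """Assemble a job-creation payload (hypothetical helper, not part of any client)."""
    return {
        "deploymentId": deployment_id,
        # Copy so that callers can modify one side without affecting the other.
        "intakeSettings": copy.deepcopy(intake_settings or LOCAL_FILE),
        "outputSettings": copy.deepcopy(output_settings or LOCAL_FILE),
    }


# "my-deployment-id" is a placeholder, not a real deployment ID.
payload = build_job_payload("my-deployment-id")
print(payload)
```

With both sides left at local file, the payload only creates the job; the scoring data still has to be uploaded and the results downloaded in separate requests.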

Data sources supported for batch predictions

The following table shows the data source support for batch predictions.

Name | Driver version | Intake support | Output support | DataRobot version validated
AWS Athena 2.0 | 2.0.5 | yes | no | 7.3
Google BigQuery | 1.2.4 | yes | yes | 7.3
InterSystems | 3.2.0 | yes | no | 7.3
kdb+ | - | yes | yes | 7.3
Microsoft SQL Server | 7.4.1 | yes | yes | 6.0
MySQL | 5.1.44 | yes | yes | 6.0
Oracle | 11.2.0 | yes | yes | 7.3
PostgreSQL | 42.2.20 | yes | yes | 6.0
Presto | 0.263.1 | yes | no | 7.3
Redshift | 1.2.10.1009 | yes | yes | 6.0
SAP HANA | 2.4.70 | yes | no | 7.3
Snowflake | 3.12.0 | yes | yes | 6.2
Synapse | 8.4.1 | yes | yes | 7.3
Teradata | 17.10.00.23 | yes | yes | 7.3
TreasureData | 0.5.10 | yes | yes | 7.3

Concurrent scoring

When scoring, the data you supply is split into chunks and scored concurrently on the prediction instance specified by the deployment. To control the level of concurrency, modify the numConcurrent parameter at job creation.
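
A minimal sketch of setting the concurrency level at job creation might look like the following. The helper with_concurrency and the deployment ID are hypothetical, and the positive-integer check is an assumption made for illustration rather than a documented validation rule; only the numConcurrent parameter name comes from this page.

```python
def with_concurrency(job_details, num_concurrent):
    """Return a copy of job_details with numConcurrent set (hypothetical helper)."""
    # Assumption: a sensible value is a positive integer; bool is excluded
    # because Python treats True/False as integers.
    if (
        not isinstance(num_concurrent, int)
        or isinstance(num_concurrent, bool)
        or num_concurrent < 1
    ):
        raise ValueError("num_concurrent must be a positive integer")
    updated = dict(job_details)  # shallow copy; the original is left untouched
    updated["numConcurrent"] = num_concurrent
    return updated


job_details = {
    "deploymentId": "my-deployment-id",  # placeholder, not a real ID
    "intakeSettings": {"type": "localFile"},
    "outputSettings": {"type": "localFile"},
}
job_with_limit = with_concurrency(job_details, 4)
print(job_with_limit["numConcurrent"])
```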

Job states

When working with batch predictions, each prediction job can be in one of four states:

  • INITIALIZING: The job has been successfully created and is either:
    • Waiting for CSV data to be pushed (if local file intake).
    • Waiting for a processing slot on the prediction server.
  • RUNNING: Scoring the dataset on prediction servers has started.
  • ABORTED: The job was aborted because either:
    • It had an invalid configuration.
    • DataRobot encountered 20% or 100 MB of invalid scoring data that resulted in a prediction error.
  • COMPLETED: The dataset has been scored and:
    • You can now download the scored data (if local file output).
    • Otherwise the data has been written to the destination.
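
The four states listed above can be modeled in client code roughly as follows. This is a sketch only: the JobState enum and the two helper functions are hypothetical names, and treating ABORTED and COMPLETED as the states in which a job stops changing is an interpretation of the list, not a documented guarantee.

```python
from enum import Enum


class JobState(Enum):
    """The four job states named in the list above."""

    INITIALIZING = "INITIALIZING"
    RUNNING = "RUNNING"
    ABORTED = "ABORTED"
    COMPLETED = "COMPLETED"


# Assumption: a job in one of these states will not change state again.
TERMINAL_STATES = {JobState.ABORTED, JobState.COMPLETED}


def is_finished(status):
    """Return True when a status string names a terminal state."""
    return JobState(status) in TERMINAL_STATES


def results_ready(status):
    """Return True only when scoring completed and results are available."""
    return JobState(status) is JobState.COMPLETED


for status in ("INITIALIZING", "RUNNING", "ABORTED", "COMPLETED"):
    print(status, is_finished(status), results_ready(status))
```

A polling loop would typically stop as soon as is_finished returns True and only attempt a download when results_ready is also True. An unrecognized status string raises ValueError, which surfaces unexpected states early instead of polling forever.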

Store credentials securely

Some sources or targets for scoring may require DataRobot to authenticate on your behalf (for example, if your database requires that you pass a username and password for login). To ensure proper storage of these credentials, you must have data credentials enabled.

DataRobot uses the following credential types and properties:

Adapter | Credential type | Properties
S3 intake / output | s3 | awsAccessKeyId, awsSecretAccessKey, awsSessionToken (optional)
JDBC intake / output | basic | username, password

To use a stored credential, you must pass the associated credentialId in either intakeSettings or outputSettings as described below for each of the adapters.
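
The sketch below illustrates the idea of referencing a stored credential by ID rather than embedding secrets in the job definition. The helper attach_credential and the credential ID are hypothetical; the adapter type value "s3" is assumed here for illustration, and the other adapter-specific fields are omitted.

```python
def attach_credential(adapter_settings, credential_id):
    """Return a copy of intake or output settings that references a stored credential.

    Hypothetical helper: only the credentialId key is taken from the text above.
    """
    if not credential_id:
        raise ValueError("credential_id must be a non-empty string")
    settings = dict(adapter_settings)  # leave the caller's dict untouched
    settings["credentialId"] = credential_id
    return settings


# Placeholder values; the "s3" type is an assumption, and a real S3 adapter
# needs further adapter-specific fields that are not shown here.
intake_settings = attach_credential({"type": "s3"}, "my-stored-credential-id")
print(intake_settings)
```

Because only the ID travels with the job, values such as awsSecretAccessKey or password never appear in the request body.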

CSV format

For any intake or output options that deal with reading or writing CSV files, you can use a custom format by specifying the following in csvSettings:

Parameter | Example | Description
delimiter | , | Optional. The delimiter character to use. Default: , (comma). To specify TAB as a delimiter, use the string tab.
quotechar | " | Optional. The character to use for quoting fields containing the delimiter. Default: ".
encoding | utf-8 | Optional. Encoding for the CSV file. For example (but not limited to): shift_jis, latin_1 or mskanji. Default: utf-8. Any Python supported encoding can be used.

The same format will be used for both intake and output.
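
As an illustration of the csvSettings parameters, the sketch below builds a settings dictionary and checks the encoding name against the local Python interpreter using codecs.lookup. The helper build_csv_settings is hypothetical, the single-character delimiter check is an assumption, and the local check only approximates what the server accepts, since the two Python versions may differ.

```python
import codecs


def build_csv_settings(delimiter=",", quotechar='"', encoding="utf-8"):
    """Build a csvSettings dict with the documented defaults (hypothetical helper)."""
    # Raises LookupError if the local Python does not know the encoding.
    codecs.lookup(encoding)
    # Assumption: apart from the special string "tab", a delimiter is one character.
    if delimiter != "tab" and len(delimiter) != 1:
        raise ValueError("delimiter must be a single character or the string 'tab'")
    return {"delimiter": delimiter, "quotechar": quotechar, "encoding": encoding}


# Tab-separated, Shift JIS encoded files for both intake and output.
csv_settings = build_csv_settings(delimiter="tab", encoding="shift_jis")
print(csv_settings)
```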

Model monitoring

The Batch Prediction API integrates well with DataRobot's model monitoring capabilities:

  • If you have enabled data drift tracking for your deployment, any predictions run through the Batch Prediction API will be tracked.
  • If you have enabled target drift tracking for your deployment, the output will contain the desired Association ID to be used for reporting actuals.

Should you need to run a non-production dataset against your deployment, you can turn off drift tracking for a single job by providing the following parameter:

Parameter | Example | Description
skipDriftTracking | true | Optional. Skip data and target drift tracking for this job. Default: false.
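
When the job is defined in Python, the boolean is written as True and becomes the JSON literal true once serialized. The following sketch shows this with the standard json module; the deployment ID is a placeholder, and the surrounding job settings simply reuse the local file defaults.

```python
import json

job_details = {
    "deploymentId": "my-deployment-id",  # placeholder, not a real ID
    "intakeSettings": {"type": "localFile"},
    "outputSettings": {"type": "localFile"},
    # Exclude this job from data and target drift tracking.
    "skipDriftTracking": True,
}

# Python's True is serialized as the JSON literal true.
body = json.dumps(job_details)
print(body)
```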

Override the default prediction instance

Under normal circumstances, the prediction server used for scoring is the default prediction server that your model was deployed to. If you have access to multiple prediction servers, you can override this default by using the following properties in the predictionInstance option:

Parameter | Example | Description
hostName | 192.0.2.4 | Sets the hostname to use instead of the default hostname from the prediction server the model was deployed to.
sslEnabled | false | Optional. Use SSL (HTTPS) to access the prediction server. Default: true.
apiKey | NWU...IBn2w | Optional. Use an API key different from the job creator's key to authenticate against the new prediction server.
datarobotKey | 154a8abb-cbde-4e73-ab3b-a46c389c337b | Optional. If running in a Managed AI Cloud environment, specify the per-organization DataRobot key for the prediction server. Find the key on the Deployments > Predictions > Prediction API tab or by contacting your DataRobot representative.

Here's a complete example:

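# Placeholder: replace with the ID of your own deployment.
deployment_id = 'YOUR_DEPLOYMENT_ID'

job_details = {
    'deploymentId': deployment_id,
    'intakeSettings': {'type': 'localFile'},
    'outputSettings': {'type': 'localFile'},
    # Score on a prediction server other than the deployment's default.
    'predictionInstance': {
        'hostName': '192.0.2.4',
        'sslEnabled': False,  # plain HTTP instead of the default HTTPS
        'apiKey': 'NWUQ9w21UhGgerBtOC4ahN0aqjbjZ0NMhL1e5cSt4ZHIBn2w',
        'datarobotKey': '154a8abb-cbde-4e73-ab3b-a46c389c337b',
    },
}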

Consistent scoring with updated model

If you deploy a new model after a job has been queued, DataRobot will still use the model that was deployed at the time of job creation for the entire job. Every row will be scored with the same model.

API Reference

The Public API

The Batch Prediction API is part of the DataRobot Public API, which you can access in DataRobot by clicking the question mark on the upper right and selecting API Documentation.

The Python API Client

You can use the Python Public API Client to interface with the Batch Prediction API.


Updated November 10, 2021