Batch Prediction API

The Batch Prediction API provides flexible options for intake and output when scoring large datasets using the prediction servers you have already deployed. The API is exposed through the DataRobot Public API, and you can consume it with any REST-enabled client or with the DataRobot Python Public API bindings.

For more information about Batch Prediction REST API routes, view the DataRobot REST API reference documentation.

The main features of the API are:

  • Flexible options for intake and output:
    • Stream local files and start scoring while still uploading—while simultaneously downloading the results.
    • Score large datasets from and to S3.
    • Read datasets from the AI Catalog.
    • Connect to external data sources using JDBC with bidirectional streaming of scoring data and results.
    • Mix intake and output options, for example, scoring from a local file to an S3 target.
  • Protection against prediction server overload with a concurrency control level option.
  • Inclusion of Prediction Explanations (with an option to add thresholds).
  • Support for passthrough columns to correlate scored data with source data.
  • Addition of prediction warnings in the output.
  • The ability to make predictions with files greater than 1GB via the API.

For more information about configuring batch predictions for time series, see the time series documentation.

Limits

| Item | AI Platform (SaaS) | Self-managed AI Platform (VPC or on-prem) |
|---|---|---|
| Job runtime limit | 4 hours* | Unlimited |
| Local file intake size | Unlimited | Unlimited |
| Local file write size | Unlimited | Unlimited |
| S3 intake size | Unlimited | Unlimited |
| S3 write size | 100GB | 100GB (configurable) |
| Azure intake size | 4.75TB | 4.75TB |
| Azure write size | 195GB | 195GB |
| GCP intake size | 5TB | 5TB |
| GCP write size | 5TB | 5TB |
| JDBC intake size | Unlimited | Unlimited |
| JDBC output size | Unlimited | Unlimited |
| Concurrent jobs | 1 per prediction instance | 1 per installation |
| Stored data retention time (for local file adapters) | 48 hours | 48 hours (configurable) |

* Feature Discovery projects have a job runtime limit of 6 hours.

Concurrent jobs

To ensure that the prediction server does not get overloaded, DataRobot will only run one job per prediction instance. Further jobs are queued and started as soon as previous jobs complete.

Data pipeline

A Batch Prediction job is a data pipeline consisting of:

Data Intake > Concurrent Scoring > Data Output

On creation, the job's intakeSettings and outputSettings define the data intake and data output part of the pipeline. You can configure any combination of intake and output options. For both, the defaults are local file intake and output, meaning you will have to issue a separate PUT request with the data to score and subsequently download the scored data.
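
As a rough illustration of the defaults described above, the sketch below assembles a job description in which both sides fall back to local file intake and output unless you override them. The helper name build_job_payload, the deployment ID, and the S3 URL are hypothetical; only deploymentId, intakeSettings, outputSettings, and the localFile type are taken from this page.

```python
import json


def build_job_payload(deployment_id, intake_settings=None, output_settings=None):
    """Return a JSON-serializable job description (hypothetical helper).

    Both sides default to local file intake/output, as described above.
    """
    return {
        "deploymentId": deployment_id,
        "intakeSettings": intake_settings or {"type": "localFile"},
        "outputSettings": output_settings or {"type": "localFile"},
    }


# Default pipeline: local file in, local file out.
default_payload = build_job_payload("example-deployment-id")

# Mixed pipeline: local file in, S3 out (the "s3" type and "url" key are assumptions).
mixed_payload = build_job_payload(
    "example-deployment-id",
    output_settings={"type": "s3", "url": "s3://example-bucket/scored.csv"},
)

print(json.dumps(default_payload, indent=2))
```

With the default payload, the scoring data still has to be uploaded in a separate request and the results downloaded afterwards, as noted above.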

Data sources supported for batch predictions

The following table shows the data source support for batch predictions.

| Name | Driver version | Intake support | Output support | DataRobot version validated |
|---|---|---|---|---|
| AWS Athena 2.0 | 2.0.35 | yes | no | 7.3 |
| Databricks | 2.6.40 | yes | yes | 9.2 |
| Exasol | 7.0.14 | yes | yes | 8.0 |
| Google BigQuery | 1.2.4 | yes | yes | 7.3 |
| InterSystems | 3.2.0 | yes | no | 7.3 |
| kdb+ | - | yes | yes | 7.3 |
| Microsoft SQL Server | 12.2.0 | yes | yes | 6.0 |
| MySQL | 8.0.32 | yes | yes | 6.0 |
| Oracle | 11.2.0 | yes | yes | 7.3 |
| PostgreSQL | 42.5.1 | yes | yes | 6.0 |
| Presto* | 0.216 | yes | yes | 8.0 |
| Redshift | 2.1.0.14 | yes | yes | 6.0 |
| SAP HANA | 2.20.17 | yes | yes | 7.3 (intake support only); 10.1 (intake and output support) |
| Snowflake | 3.15.1 | yes | yes | 6.2 |
| Synapse | 12.4.1 | yes | yes | 7.3 |
| Teradata** | 17.10.00.23 | yes | yes | 7.3 |
| TreasureData | 0.5.10 | yes | no | 7.3 |

*Presto requires the use of auto commit: true for many of the underlying connectors which can delay writes.

**For output to Teradata, DataRobot only supports ANSI mode.

Concurrent scoring

When scoring, the data you supply is split into chunks and scored concurrently on the prediction instance specified by the deployment. To control the level of concurrency, modify the numConcurrent parameter at job creation.
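
To make the idea of chunked, concurrent scoring concrete, here is a conceptual sketch using a thread pool whose size plays the role of numConcurrent. It is not DataRobot's implementation: the function names are hypothetical, and score_chunk is a stand-in that doubles each value instead of calling a prediction server.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(rows, chunk_size):
    """Split a list of rows into consecutive chunks of at most chunk_size."""
    return [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]


def score_chunk(chunk):
    """Stand-in for a real prediction request: doubles each value."""
    return [value * 2 for value in chunk]


def score_concurrently(rows, chunk_size, num_concurrent):
    """Score chunks with at most num_concurrent workers running at once."""
    chunks = split_into_chunks(rows, chunk_size)
    with ThreadPoolExecutor(max_workers=num_concurrent) as pool:
        # map() yields results in input order, so row order is preserved.
        scored_chunks = list(pool.map(score_chunk, chunks))
    return [value for chunk in scored_chunks for value in chunk]


scored = score_concurrently(list(range(10)), chunk_size=3, num_concurrent=4)
print(scored)
```

A lower worker count puts less load on the scoring backend at the cost of a longer job, which is the trade-off numConcurrent lets you control.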

Job states

When working with batch predictions, each prediction job can be in one of four states:

  • INITIALIZING: The job has been successfully created and is either:
    • Waiting for CSV data to be pushed (if local file intake).
    • Waiting for a processing slot on the prediction server.
  • RUNNING: Scoring the dataset on prediction servers has started.
  • ABORTED: The job was aborted because either:
    • It had an invalid configuration.
    • DataRobot encountered 20% or 100MB of invalid scoring data that resulted in a prediction error.
  • COMPLETED: The dataset has been scored and:
    • You can now download the scored data (if local file output).
    • Otherwise the data has been written to the destination.
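
The states above suggest a simple polling pattern: keep checking until the job reaches ABORTED or COMPLETED. The sketch below assumes a caller-supplied fetch_status callable that returns the current state as a string; that callable, the helper name wait_for_job, and the polling interval are all hypothetical, and the status sequence is simulated rather than fetched from the API.

```python
import time

# States after which the job will not change again.
TERMINAL_STATES = {"ABORTED", "COMPLETED"}


def wait_for_job(fetch_status, poll_seconds=5.0, max_polls=100):
    """Poll fetch_status() until it returns a terminal state.

    fetch_status is any zero-argument callable returning the job state.
    """
    for _ in range(max_polls):
        status = fetch_status()
        if status in TERMINAL_STATES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("job did not reach a terminal state")


# Simulated status sequence standing in for real status requests.
statuses = iter(["INITIALIZING", "RUNNING", "RUNNING", "COMPLETED"])
final_state = wait_for_job(lambda: next(statuses), poll_seconds=0)
print(final_state)
```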

Store credentials securely

Some sources or targets for scoring may require DataRobot to authenticate on your behalf (for example, if your database requires that you pass a username and password for login). To ensure proper storage of these credentials, you must have data credentials enabled.

DataRobot uses the following credential types and properties:

| Adapter | Credential type | Properties |
|---|---|---|
| S3 intake / output | s3 | awsAccessKeyId, awsSecretAccessKey, awsSessionToken (optional) |
| JDBC intake / output | basic | username, password |

To use a stored credential, you must pass the associated credentialId in either intakeSettings or outputSettings as described below for each of the adapters.
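
A minimal sketch of attaching a stored credential to adapter settings follows. The point is that only the credentialId travels with the job, never the secret itself. The helper with_credential, the credential ID, the "s3" adapter type, and the "url" key are assumptions for illustration; check the adapter documentation for the exact fields.

```python
def with_credential(settings, credential_id):
    """Return a copy of adapter settings with credentialId attached."""
    updated = dict(settings)
    updated["credentialId"] = credential_id
    return updated


# Hypothetical S3 intake settings; the secret keys stay in the credential store.
base_intake = {"type": "s3", "url": "s3://example-bucket/to-score.csv"}
intake_settings = with_credential(base_intake, "example-credential-id")

print(intake_settings)
```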

CSV format

For any intake or output options that deal with reading or writing CSV files, you can use a custom format by specifying the following in csvSettings:

| Parameter | Example | Description |
|---|---|---|
| delimiter | , | (Optional) The delimiter character to use. Default: , (comma). To specify TAB as a delimiter, use the string tab. |
| quotechar | " | (Optional) The character to use for quoting fields containing the delimiter. Default: ". |
| encoding | utf-8 | (Optional) Encoding for the CSV file. For example (but not limited to): shift_jis, latin_1 or mskanji. Default: utf-8. Any Python supported encoding can be used. |

The same format is used for both intake and output.
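
The sketch below shows how a csvSettings value could be mirrored on the client side with Python's csv module, so that the file you upload matches the format you declare. The helper csv_kwargs is hypothetical, and the translation of the string tab into a tab character is done here only for the local csv module.

```python
import csv
import io


def csv_kwargs(csv_settings):
    """Translate csvSettings into keyword arguments for Python's csv module."""
    delimiter = csv_settings.get("delimiter", ",")
    if delimiter == "tab":
        # The API uses the string "tab"; the csv module needs the character.
        delimiter = "\t"
    return {"delimiter": delimiter, "quotechar": csv_settings.get("quotechar", '"')}


csv_settings = {"delimiter": "tab", "quotechar": '"', "encoding": "utf-8"}

# Write a small tab-delimited file in memory using the declared format.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n", **csv_kwargs(csv_settings))
writer.writerow(["id", "comment"])
writer.writerow([1, "contains\ttab"])  # quoted because it holds the delimiter
data = buffer.getvalue().encode(csv_settings["encoding"])

# Read it back with the same settings to confirm the round trip.
text = data.decode(csv_settings["encoding"])
rows = list(csv.reader(io.StringIO(text), **csv_kwargs(csv_settings)))
print(rows)
```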

Model monitoring

The Batch Prediction API integrates well with DataRobot's model monitoring capabilities:

  • If you have enabled data drift tracking for your deployment, any predictions run through the Batch Prediction API will be tracked.
  • If you have enabled target drift tracking for your deployment, the output will contain the desired association ID to be used for reporting actuals.

Should you need to run a non-production dataset against your deployment, you can turn off drift and accuracy tracking for a single job by providing the following parameter:

| Parameter | Example | Description |
|---|---|---|
| skipDriftTracking | true | (Optional) Skip data drift, target drift, and accuracy tracking for this job. Default: false. |

Override the default prediction instance

Under normal circumstances, the prediction server used for scoring is the default prediction server that your model was deployed to. If you have access to multiple prediction servers, you can override this default by using the following properties in the predictionInstance option:

| Parameter | Example | Description |
|---|---|---|
| hostName | 192.0.2.4 | Sets the hostname to use instead of the default hostname from the prediction server the model was deployed to. |
| sslEnabled | false | (Optional) Use SSL (HTTPS) to access the prediction server. Default: true. |
| apiKey | NWU...IBn2w | (Optional) Use an API key different from the job creator's key to authenticate against the new prediction server. |
| datarobotKey | 154a8abb-cbde-4e73-ab3b-a46c389c337b | (Optional) If running in a managed AI Platform environment, specify the per-organization DataRobot key for the prediction server. Find the key on the Deployments > Predictions > Prediction API tab or by contacting your DataRobot representative. |

Here's a complete example:

job_details = {
    'deploymentId': deployment_id,
    'intakeSettings': {'type': 'localFile'},
    'outputSettings': {'type': 'localFile'},
    'predictionInstance': {
        'hostName': '192.0.2.4',
        'sslEnabled': False,
        'apiKey': 'NWUQ9w21UhGgerBtOC4ahN0aqjbjZ0NMhL1e5cSt4ZHIBn2w',
        'datarobotKey': '154a8abb-cbde-4e73-ab3b-a46c389c337b',
    },
}

Consistent scoring with updated model

If you deploy a new model after a job has been queued, DataRobot will still use the model that was deployed at the time of job creation for the entire job. Every row will be scored with the same model.

Template variables

It can be useful to specify dynamic parameters in your batch jobs, for example in Job Definitions. You can use Jinja's variable syntax (double curly braces) to insert the value of the following variables:

| Variable | Description |
|---|---|
| current_run_time | datetime object for current UTC time (datetime.utcnow()) |
| current_run_timestamp | Milliseconds from Unix epoch (integer) |
| last_scheduled_run_time | datetime object for the start of last job instantiated from the same job definition |
| next_scheduled_run_time | datetime object for the next scheduled start of job from the same job definition |
| last_completed_run_time | datetime object for when the previously scheduled job finished scoring |

The above variables can be used in the following fields:

| Field | Condition |
|---|---|
| intake_settings.query | For JDBC, Synapse, and Snowflake adapters |
| output_settings.table | For JDBC, Synapse, Snowflake, and BigQuery adapters, when statement type is create_table or create_table_if_not_exists is marked true |
| output_settings.url | For S3, GCP, and Azure adapters. You should specify the URL as: gs://bucket/output-<added-string-with-double-curly-braces>.csv. |

Note

To ensure that most databases understand the replacements mentioned above, DataRobot strips microseconds off the ISO-8601 format timestamps.
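
To show what the substitution amounts to, the sketch below replaces double-curly-brace placeholders in a URL and a query. It is a simplified, standard-library stand-in for Jinja, not the rendering DataRobot performs: the render_template helper, the table and column names, and the variable values are hypothetical, and the exact timestamp format produced by the platform is an assumption beyond the microsecond stripping noted above.

```python
import re
from datetime import datetime


def render_template(template, variables):
    """Replace {{ name }} placeholders (a simplified stand-in, not Jinja itself)."""

    def substitute(match):
        value = variables[match.group(1)]
        if isinstance(value, datetime):
            # Drop microseconds before formatting, mirroring the note above.
            value = value.replace(microsecond=0).isoformat()
        return str(value)

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


# Fixed example values so the result is reproducible.
variables = {
    "current_run_time": datetime(2024, 8, 15, 12, 30, 45, 123456),
    "current_run_timestamp": 1723725045123,
}

url = render_template("gs://bucket/output-{{ current_run_timestamp }}.csv", variables)
query = render_template(
    "SELECT * FROM events WHERE created_at < '{{ current_run_time }}'", variables
)

print(url)
print(query)
```

Embedding a run timestamp in the output URL gives each scheduled run its own file instead of overwriting the previous result.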

API Reference

The Public API

The Batch Prediction API is part of the DataRobot REST API. Reference this documentation for more information about how to work with batch predictions.

The Python API Client

You can use the Python Public API Client to interface with the Batch Prediction API.


Updated August 15, 2024