
Batch Prediction API

The Batch Prediction API provides flexible options for intake and output when scoring large datasets using the prediction servers you have already deployed. The API is exposed through the DataRobot Public API and can be consumed using any REST-enabled client or the DataRobot Python Public API bindings.

For more information about Batch Prediction REST API routes, view the DataRobot REST API reference documentation.

The main features of the API are:

  • Flexible options for intake and output:
    • Stream local files and start scoring while still uploading—while simultaneously downloading the results.
    • Score large datasets from and to S3.
    • Read datasets from the AI Catalog.
    • Connect to external data sources using JDBC with bidirectional streaming of scoring data and results.
    • Mix intake and output options, for example, scoring from a local file to an S3 target.
  • Protection against prediction server overload with a concurrency control level option.
  • Inclusion of Prediction Explanations (with an option to add thresholds).
  • Support for passthrough columns to correlate scored data with source data.
  • Addition of prediction warnings in the output.
  • The ability to make predictions with files greater than 1GB via the API.

For more information about making batch prediction settings for time series, reference the time series documentation.

Limits

Item AI Platform (SaaS) Self-managed AI Platform (VPC or on-prem)
Job runtime limit 4 hours* Unlimited
Local file intake size Unlimited Unlimited
Local file write size Unlimited Unlimited
S3 intake size Unlimited Unlimited
S3 write size 100GB 100GB (configurable)
Azure intake size 4.75TB 4.75TB
Azure write size 195GB 195GB
GCP intake size 5TB 5TB
GCP write size 5TB 5TB
JDBC intake size Unlimited Unlimited
JDBC output size Unlimited Unlimited
Concurrent jobs 1 per prediction instance 1 per installation
Stored data retention time (local file adapters) 48 hours 48 hours (configurable)

* Feature Discovery projects have a job runtime limit of 6 hours.

Concurrent jobs

To ensure that the prediction server does not get overloaded, DataRobot will only run one job per prediction instance. Further jobs are queued and started as soon as previous jobs complete.

Data pipeline

A Batch Prediction job is a data pipeline consisting of:

Data Intake > Concurrent Scoring > Data Output

On creation, the job's intakeSettings and outputSettings define the data intake and data output parts of the pipeline. You can configure any combination of intake and output options. For both, the default is local file intake and output, meaning you must issue a separate PUT request with the data to score and subsequently download the scored data.
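As a rough sketch of that flow with a plain REST client, a local-file job can be created and fed as shown below. API_ROOT, API_TOKEN, and deployment_id are placeholders, and the 'csvUpload' link name in the job response is an assumption to confirm against the REST API reference.

import requests

API_ROOT = 'https://app.datarobot.com/api/v2'   # adjust for your installation
HEADERS = {'Authorization': 'Bearer {}'.format(API_TOKEN)}  # API_TOKEN is a placeholder

# Create the job; local file intake and output are the defaults.
job = requests.post(
    '{}/batchPredictions/'.format(API_ROOT),
    headers=HEADERS,
    json={
        'deploymentId': deployment_id,
        'intakeSettings': {'type': 'localFile'},
        'outputSettings': {'type': 'localFile'},
    },
).json()

# Push the CSV data to score in a separate PUT request; the upload URL is
# assumed to be returned in the job's links at creation time.
with open('to_score.csv', 'rb') as f:
    requests.put(
        job['links']['csvUpload'],
        headers={**HEADERS, 'Content-Type': 'text/csv'},
        data=f,
    )

# Once the job reaches COMPLETED, download the scored data from the job's
# download link.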

Data sources supported for batch predictions

The following table shows the data source support for batch predictions.

Name Driver version Intake support Output support DataRobot version validated
AWS Athena 2.0 2.0.35 yes no 7.3
Databricks*** 2.6.40 yes yes 9.2
Exasol 7.0.14 yes yes 8.0
Google BigQuery 1.2.4 yes yes 7.3
InterSystems 3.2.0 yes no 7.3
kdb+ - yes yes 7.3
Microsoft SQL Server 12.2.0 yes yes 6.0
MySQL 8.0.32 yes yes 6.0
Oracle 11.2.0 yes yes 7.3
PostgreSQL 42.5.1 yes yes 6.0
Presto* 0.216 yes yes 8.0
Redshift 2.1.0.14 yes yes 6.0
SAP HANA 2.20.17 yes yes 7.3 (intake only); 10.1 (intake and output)
Snowflake 3.15.1 yes yes 6.2
Synapse 12.4.1 yes yes 7.3
Teradata** 17.10.00.23 yes yes 7.3
TreasureData 0.5.10 yes no 7.3

*Presto requires the use of auto commit: true for many of the underlying connectors which can delay writes.

**For output to Teradata, DataRobot only supports ANSI mode.

***Only the Databricks JDBC driver supports batch predictions.

Concurrent scoring

When scoring, the data you supply is split into chunks and scored concurrently on the prediction instance specified by the deployment. To control the level of concurrency, modify the numConcurrent parameter at job creation.
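For example, the concurrency level is set alongside the other options when creating the job; the value below is purely illustrative:

job_details = {
    'deploymentId': deployment_id,
    'intakeSettings': {'type': 'localFile'},
    'outputSettings': {'type': 'localFile'},
    # Score at most four chunks at a time on the prediction instance.
    'numConcurrent': 4,
}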

Job states

When working with batch predictions, each prediction job can be in one of four states:

  • INITIALIZING: The job has been successfully created and is either:
    • Waiting for CSV data to be pushed (if local file intake).
    • Waiting for a processing slot on the prediction server.
  • RUNNING: Scoring the dataset on prediction servers has started.
  • ABORTED: The job was aborted because either:
    • It had an invalid configuration.
    • DataRobot encountered 20% or 100MB of invalid scoring data that resulted in a prediction error.
  • COMPLETED: The dataset has been scored and:
    • You can now download the scored data (if local file output).
    • Otherwise the data has been written to the destination.
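A minimal polling loop for these states might look like the following sketch, reusing the API_ROOT and HEADERS placeholders from the earlier example; the job route and the 'status' field name are assumptions to verify against the REST API reference.

import time
import requests

def wait_for_completion(job_id):
    # Poll the job resource until it leaves the INITIALIZING/RUNNING states.
    while True:
        job = requests.get(
            '{}/batchPredictions/{}/'.format(API_ROOT, job_id),
            headers=HEADERS,
        ).json()
        if job['status'] in ('COMPLETED', 'ABORTED'):
            return job
        time.sleep(10)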

Store credentials securely

Some sources or targets for scoring may require DataRobot to authenticate on your behalf (for example, if your database requires that you pass a username and password for login). To ensure proper storage of these credentials, you must have data credentials enabled.

DataRobot uses the following credential types and properties:

Adapter Credential type Properties
S3 intake / output s3 awsAccessKeyId, awsSecretAccessKey, awsSessionToken (optional)
JDBC intake / output basic username, password

To use a stored credential, you must pass the associated credentialId in either intakeSettings or outputSettings as described below for each of the adapters.
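For instance, a JDBC intake configuration referencing a stored basic credential might look like the sketch below; the data store ID, credential ID, schema, and table names are placeholders.

job_details = {
    'deploymentId': deployment_id,
    'intakeSettings': {
        'type': 'jdbc',
        'dataStoreId': data_store_id,    # external data connection registered in DataRobot
        'credentialId': credential_id,   # stored credential of type "basic"
        'schema': 'public',              # placeholder schema
        'table': 'scoring_input',        # placeholder table
    },
    'outputSettings': {'type': 'localFile'},
}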

CSV format

For any intake or output options that deal with reading or writing CSV files, you can use a custom format by specifying the following in csvSettings:

Parameter Example Description
delimiter , (Optional) The delimiter character to use. Default: , (comma). To specify TAB as a delimiter, use the string tab.
quotechar " (Optional) The character to use for quoting fields containing the delimiter. Default: ".
encoding utf-8 (Optional) Encoding for the CSV file. For example (but not limited to): shift_jis, latin_1 or mskanji. Default: utf-8.

Any encoding supported by Python can be used.

The same format will be used for both intake and output. See a complete example.
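As a sketch, a job reading and writing semicolon-delimited, Latin-1 encoded files could be configured as follows; only csvSettings differs from the defaults:

job_details = {
    'deploymentId': deployment_id,
    'intakeSettings': {'type': 'localFile'},
    'outputSettings': {'type': 'localFile'},
    'csvSettings': {
        'delimiter': ';',
        'quotechar': '"',
        'encoding': 'latin_1',
    },
}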

Model monitoring

The Batch Prediction API integrates well with DataRobot's model monitoring capabilities:

  • If you have enabled data drift tracking for your deployment, any predictions run through the Batch Prediction API will be tracked.
  • If you have enabled target drift tracking for your deployment, the output will contain the desired association ID to be used for reporting actuals.

Should you need to run a non-production dataset against your deployment, you can turn off drift and accuracy tracking for a single job by providing the following parameter:

Parameter Example Description
skipDriftTracking true (Optional) Skip data drift, target drift, and accuracy tracking for this job. Default: false.
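For example, to exclude a single non-production run (deployment_id is a placeholder):

job_details = {
    'deploymentId': deployment_id,
    'intakeSettings': {'type': 'localFile'},
    'outputSettings': {'type': 'localFile'},
    # Exclude this job from drift and accuracy tracking.
    'skipDriftTracking': True,
}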

Override the default prediction instance

Under normal circumstances, the prediction server used for scoring is the default prediction server that your model was deployed to. If you have access to multiple prediction servers, however, you can override the default behavior by using the following properties in the predictionInstance option:

Parameter Example Description
hostName 192.0.2.4 Sets the hostname to use instead of the default hostname from the prediction server the model was deployed to.
sslEnabled false (Optional) Use SSL (HTTPS) to access the prediction server. Default: true.
apiKey NWU...IBn2w (Optional) Use an API key different from the job creator's key to authenticate against the new prediction server.
datarobotKey 154a8abb-cbde-4e73-ab3b-a46c389c337b (Optional) If running in a managed AI Platform environment, specify the per-organization DataRobot key for the prediction server.

Find the key on the Deployments > Predictions > Prediction API tab or by contacting your DataRobot representative.

Here's a complete example:

job_details = {
    'deploymentId': deployment_id,
    'intakeSettings': {'type': 'localFile'},
    'outputSettings': {'type': 'localFile'},
    'predictionInstance': {
        'hostName': '192.0.2.4',
        'sslEnabled': False,
        'apiKey': 'NWUQ9w21UhGgerBtOC4ahN0aqjbjZ0NMhL1e5cSt4ZHIBn2w',
        'datarobotKey': '154a8abb-cbde-4e73-ab3b-a46c389c337b',
    },
}

Consistent scoring with updated model

If you deploy a new model after a job has been queued, DataRobot will still use the model that was deployed at the time of job creation for the entire job. Every row will be scored with the same model.

Template variables

Sometimes it can be useful to specify dynamic parameters in your batch jobs, such as in Job Definitions. You can use jinja's variable syntax (double curly braces) to print the value of the following parameters:

Variable Description
current_run_time datetime object for current UTC time (datetime.utcnow())
current_run_timestamp Milliseconds from Unix epoch (integer)
last_scheduled_run_time datetime object for the start of last job instantiated from the same job definition
next_scheduled_run_time datetime object for the next scheduled start of job from the same job definition
last_completed_run_time datetime object for when the previously scheduled job finished scoring

The above variables can be used in the following fields:

Field Condition
intake_settings.query For JDBC, Synapse, and Snowflake adapters
output_settings.table For JDBC, Synapse, Snowflake, and BigQuery adapters, when the statement type is create_table or when create_table_if_not_exists is set to true
output_settings.url For S3, GCP, and Azure adapters

You should specify the URL with the template variable embedded in it, for example: gs://bucket/output-{{ current_run_time }}.csv.

Note

To ensure that most databases understand the replacements mentioned above, DataRobot strips microseconds off the ISO-8601 format timestamps.
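As a sketch, a job definition could combine these variables in a JDBC intake query and an S3 output URL; the table, column, bucket, and credential names below are placeholders.

job_definition = {
    'deploymentId': deployment_id,
    'intakeSettings': {
        'type': 'jdbc',
        'dataStoreId': data_store_id,
        'credentialId': jdbc_credential_id,
        # Only score rows added since the last scheduled run (placeholder query).
        'query': (
            "SELECT * FROM scoring_input "
            "WHERE updated_at > '{{ last_scheduled_run_time }}'"
        ),
    },
    'outputSettings': {
        'type': 's3',
        'credentialId': s3_credential_id,
        # Write each run to a separate, timestamped object.
        'url': 's3://my-bucket/scored/output-{{ current_run_timestamp }}.csv',
    },
}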

API Reference

The Public API

The Batch Prediction API is part of the DataRobot REST API. Reference this documentation for more information about how to work with batch predictions.

The Python API Client

You can use the Python Public API Client to interface with the Batch Prediction API.
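A minimal sketch with the datarobot package is shown below; the endpoint, token, and file paths are placeholders.

import datarobot as dr

# Connect to DataRobot.
dr.Client(endpoint='https://app.datarobot.com/api/v2', token=API_TOKEN)

# Stream a local CSV through the deployment and write the scored rows locally;
# scoring starts while the upload is still in progress.
dr.BatchPredictionJob.score_to_file(
    deployment_id,
    './to_score.csv',
    './scored.csv',
)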


Updated September 12, 2024