Batch Prediction API¶
The Batch Prediction API provides flexible options for intake and output when scoring large datasets using the prediction servers you have already deployed. The API is exposed through the DataRobot Public API and can be consumed using any REST-enabled client or the DataRobot Python Public API bindings.
For more information about Batch Prediction REST API routes, view the DataRobot REST API reference documentation.
The main features of the API are:
- Flexible options for intake and output:
- Stream local files and start scoring while still uploading—while simultaneously downloading the results.
- Score large datasets from and to S3.
- Read datasets from the AI Catalog.
- Connect to external data sources using JDBC with bidirectional streaming of scoring data and results.
- Mix intake and output options, for example, scoring from a local file to an S3 target.
- Protection against prediction server overload with a concurrency control level option.
- Inclusion of Prediction Explanations (with an option to add thresholds).
- Support for passthrough columns to correlate scored data with source data.
- Addition of prediction warnings in the output.
- The ability to make predictions with files greater than 1GB via the API.
For more information about configuring batch prediction settings for time series, reference the time series documentation.
Limits¶
Item | AI Platform (SaaS) | Self-managed AI Platform (VPC or on-prem)
---|---|---
Job runtime limit | 4 hours* | Unlimited
Local file intake size | Unlimited | Unlimited
Local file write size | Unlimited | Unlimited
S3 intake size | Unlimited | Unlimited
S3 write size | 100GB | 100GB (configurable)
Azure intake size | 4.75TB | 4.75TB
Azure write size | 195GB | 195GB
GCP intake size | 5TB | 5TB
GCP write size | 5TB | 5TB
JDBC intake size | Unlimited | Unlimited
JDBC output size | Unlimited | Unlimited
Concurrent jobs | 1 per prediction instance | 1 per installation
Stored data retention time (local file adapters) | 48 hours | 48 hours (configurable)
* Feature Discovery projects have a job runtime limit of 6 hours.
Concurrent jobs¶
To ensure that the prediction server does not get overloaded, DataRobot will only run one job per prediction instance. Further jobs are queued and started as soon as previous jobs complete.
Data pipeline¶
A Batch Prediction job is a data pipeline consisting of:
Data Intake > Concurrent Scoring > Data Output
On creation, the job's `intakeSettings` and `outputSettings` define the data intake and data output parts of the pipeline. You can configure any combination of intake and output options. For both, the defaults are local file intake and output, meaning you will have to issue a separate `PUT` request with the data to score and subsequently download the scored data.
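For example, the following sketch creates a job over the REST API with the default local file intake and output, then pushes a CSV to it. The endpoint root, API token, deployment ID, and the `csvUpload` link field are assumptions for illustration; check the REST API reference for the exact routes and response schema.

```python
import requests

# Endpoint root and token are placeholders; adjust for your installation.
API_ROOT = "https://app.datarobot.com/api/v2"
HEADERS = {"Authorization": "Bearer YOUR_API_TOKEN"}

# With the default local file intake and output, the job is created first and
# then waits for CSV data to be pushed to it.
job_payload = {
    "deploymentId": "YOUR_DEPLOYMENT_ID",
    "intakeSettings": {"type": "localFile"},
    "outputSettings": {"type": "localFile"},
}

response = requests.post(
    f"{API_ROOT}/batchPredictions/", headers=HEADERS, json=job_payload
)
response.raise_for_status()
job = response.json()

# Push the data to score; scoring starts while the upload is still in progress.
# The "csvUpload" link name is an assumption -- check the REST API reference.
upload_url = job["links"]["csvUpload"]
with open("to_score.csv", "rb") as f:
    upload = requests.put(
        upload_url,
        headers={**HEADERS, "Content-Type": "text/csv"},
        data=f,
    )
upload.raise_for_status()
```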
Data sources supported for batch predictions¶
The following table shows the data source support for batch predictions.
Name | Driver version | Intake support | Output support | DataRobot version validated |
---|---|---|---|---|
AWS Athena 2.0 | 2.0.35 | yes | no | 7.3 |
Databricks*** | 2.6.40 | yes | yes | 9.2 |
Exasol | 7.0.14 | yes | yes | 8.0 |
Google BigQuery | 1.2.4 | yes | yes | 7.3 |
InterSystems | 3.2.0 | yes | no | 7.3 |
kdb+ | - | yes | yes | 7.3 |
Microsoft SQL Server | 12.2.0 | yes | yes | 6.0 |
MySQL | 8.0.32 | yes | yes | 6.0 |
Oracle | 11.2.0 | yes | yes | 7.3 |
PostgreSQL | 42.5.1 | yes | yes | 6.0 |
Presto* | 0.216 | yes | yes | 8.0 |
Redshift | 2.1.0.14 | yes | yes | 6.0 |
SAP HANA | 2.20.17 | yes | yes | 7.3 (intake support only) 10.1 (intake and output support) |
Snowflake | 3.15.1 | yes | yes | 6.2 |
Synapse | 12.4.1 | yes | yes | 7.3 |
Teradata** | 17.10.00.23 | yes | yes | 7.3 |
TreasureData | 0.5.10 | yes | no | 7.3 |
*Presto requires `auto commit: true` for many of the underlying connectors, which can delay writes.
**For output to Teradata, DataRobot only supports ANSI mode.
***Only the Databricks JDBC driver supports batch predictions.
For further information, see:
- Supported intake options
- Supported output options
- Output format schema
Concurrent scoring¶
When scoring, the data you supply is split into chunks and scored concurrently on the prediction instance specified by the deployment.
To control the level of concurrency, modify the `numConcurrent` parameter at job creation.
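For example, a job payload might cap concurrency at four simultaneous scoring requests against the prediction instance (a sketch; the deployment ID is a placeholder):

```python
job_payload = {
    "deploymentId": "YOUR_DEPLOYMENT_ID",
    "intakeSettings": {"type": "localFile"},
    "outputSettings": {"type": "localFile"},
    # Score at most 4 chunks against the prediction instance at the same time.
    "numConcurrent": 4,
}
```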
Job states¶
When working with batch predictions, each prediction job can be in one of four states:
`INITIALIZING`
: The job has been successfully created and is either:

    - Waiting for CSV data to be pushed (if local file intake).
    - Waiting for a processing slot on the prediction server.

`RUNNING`
: Scoring the dataset on prediction servers has started.

`ABORTED`
: The job was aborted because either:

    - It had an invalid configuration.
    - DataRobot encountered 20% or 100MB of invalid scoring data that resulted in a prediction error.

`COMPLETED`
: The dataset has been scored and:

    - You can now download the scored data (if local file output).
    - Otherwise, the data has been written to the destination.
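The sketch below polls a job until it reaches a terminal state (`COMPLETED` or `ABORTED`). The job URL and the `status` field name are assumptions about the response schema; consult the REST API reference for the exact field names.

```python
import time
import requests

def wait_for_terminal_state(job_url: str, headers: dict, poll_seconds: int = 10) -> dict:
    """Poll a batch prediction job until it is COMPLETED or ABORTED."""
    while True:
        job = requests.get(job_url, headers=headers).json()
        if job.get("status") in ("COMPLETED", "ABORTED"):
            return job
        # Still INITIALIZING or RUNNING -- wait and try again.
        time.sleep(poll_seconds)
```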
Store credentials securely¶
Some sources or targets for scoring may require DataRobot to authenticate on your behalf (for example, if your database requires that you pass a username and password for login). To ensure proper storage of these credentials, you must have data credentials enabled.
DataRobot uses the following credential types and properties:
Adapter | Credential Type | Properties
---|---|---
S3 intake / output | `s3` | `awsAccessKeyId`, `awsSecretAccessKey`, `awsSessionToken` (optional)
JDBC intake / output | `basic` | `username`, `password`
To use a stored credential, you must pass the associated `credentialId` in either `intakeSettings` or `outputSettings`, as described below for each of the adapters.
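For example, a JDBC intake configuration might reference a stored `basic` credential like this. Only `type` and `credentialId` come from the table above; the `dataStoreId` and `table` keys are assumptions for illustration, and the IDs are placeholders.

```python
job_payload = {
    "deploymentId": "YOUR_DEPLOYMENT_ID",
    "intakeSettings": {
        "type": "jdbc",
        "dataStoreId": "YOUR_DATA_STORE_ID",   # assumed: a registered JDBC data source
        "credentialId": "YOUR_CREDENTIAL_ID",  # stored 'basic' credential (username/password)
        "table": "scoring_input",              # assumed table name
    },
    "outputSettings": {"type": "localFile"},
}
```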
CSV format¶
For any intake or output options that deal with reading or writing CSV files, you can use a custom format by specifying the following in `csvSettings`:

Parameter | Example | Description
---|---|---
`delimiter` | `,` | (Optional) The delimiter character to use. Default: `,` (comma). To specify TAB as a delimiter, use the string `tab`.
`quotechar` | `"` | (Optional) The character to use for quoting fields containing the delimiter. Default: `"`.
`encoding` | `utf-8` | (Optional) Encoding for the CSV file. For example (but not limited to): `shift_jis`, `latin_1`, or `mskanji`. Default: `utf-8`. Any Python-supported encoding can be used.
The same format will be used for both intake and output. See a complete example.
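As an illustration, a job could declare a semicolon-delimited, Latin-1 encoded format that applies to both the uploaded and the downloaded CSV (the values and deployment ID are illustrative):

```python
job_payload = {
    "deploymentId": "YOUR_DEPLOYMENT_ID",
    "intakeSettings": {"type": "localFile"},
    "outputSettings": {"type": "localFile"},
    "csvSettings": {
        "delimiter": ";",
        "quotechar": '"',
        "encoding": "latin_1",
    },
}
```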
Model monitoring¶
The Batch Prediction API integrates well with DataRobot's model monitoring capabilities:
- If you have enabled data drift tracking for your deployment, any predictions run through the Batch Prediction API will be tracked.
- If you have enabled target drift tracking for your deployment, the output will contain the desired association ID to be used for reporting actuals.
Should you need to run a non-production dataset against your deployment, you can turn off drift and accuracy tracking for a single job by providing the following parameter:
Parameter | Example | Description
---|---|---
`skipDriftTracking` | `true` | (Optional) Skip data drift, target drift, and accuracy tracking for this job. Default: `false`.
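For instance, a job payload could set the flag like this (placeholder deployment ID):

```python
job_payload = {
    "deploymentId": "YOUR_DEPLOYMENT_ID",
    "intakeSettings": {"type": "localFile"},
    "outputSettings": {"type": "localFile"},
    "skipDriftTracking": True,  # exclude this job from drift and accuracy tracking
}
```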
Override the default prediction instance¶
Under normal circumstances, the prediction server used for scoring is the default prediction server that your model was deployed to. If you have access to multiple prediction servers, however, you can override this default by using the following properties in the `predictionInstance` option:
Parameter | Example | Description
---|---|---
`hostName` | `192.0.2.4` | Sets the hostname to use instead of the default hostname of the prediction server the model was deployed to.
`sslEnabled` | `false` | (Optional) Use SSL (HTTPS) to access the prediction server. Default: `true`.
`apiKey` | `NWU...IBn2w` | (Optional) Use an API key different from the job creator's key to authenticate against the new prediction server.
`datarobotKey` | `154a8abb-cbde-4e73-ab3b-a46c389c337b` | (Optional) If running in a managed AI Platform environment, specify the per-organization DataRobot key for the prediction server. Find the key on the Deployments > Predictions > Prediction API tab or by contacting your DataRobot representative.
Here's a complete example:
```python
job_details = {
    'deploymentId': deployment_id,
    'intakeSettings': {'type': 'localFile'},
    'outputSettings': {'type': 'localFile'},
    'predictionInstance': {
        'hostName': '192.0.2.4',
        'sslEnabled': False,
        'apiKey': 'NWUQ9w21UhGgerBtOC4ahN0aqjbjZ0NMhL1e5cSt4ZHIBn2w',
        'datarobotKey': '154a8abb-cbde-4e73-ab3b-a46c389c337b',
    },
}
```
Consistent scoring with updated model¶
If you deploy a new model after a job has been queued, DataRobot will still use the model that was deployed at the time of job creation for the entire job. Every row will be scored with the same model.
Template variables¶
Sometimes it can be useful to specify dynamic parameters in your batch jobs, such as in Job Definitions. You can use Jinja's variable syntax (double curly braces) to insert the value of any of the following parameters:
Variable | Description
---|---
`current_run_time` | `datetime` object for the current UTC time (`datetime.utcnow()`)
`current_run_timestamp` | Milliseconds from the Unix epoch (integer)
`last_scheduled_run_time` | `datetime` object for the start of the last job instantiated from the same job definition
`next_scheduled_run_time` | `datetime` object for the next scheduled start of a job from the same job definition
`last_completed_run_time` | `datetime` object for when the previously scheduled job finished scoring
The above variables can be used in the following fields:
Field | Condition
---|---
`intake_settings.query` | For JDBC, Synapse, and Snowflake adapters
`output_settings.table` | For JDBC, Synapse, Snowflake, and BigQuery adapters, when the statement type is `create_table` or `create_table_if_not_exists` is set to `true`
`output_settings.url` | For S3, GCP, and Azure adapters
Specify the URL with the template variable embedded in the path, for example: `gs://bucket/output-{{ current_run_timestamp }}.csv`.
Note
To ensure that most databases understand the replacements mentioned above, DataRobot strips microseconds off the ISO-8601 format timestamps.
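As a sketch, a job definition's JDBC intake query could use a template variable to score only rows added since the last scheduled run. The table and column names are placeholders and the `dataStoreId`/`credentialId` keys are assumptions; only the `query` field and the variable syntax come from the tables above.

```python
intake_settings = {
    "type": "jdbc",
    "dataStoreId": "YOUR_DATA_STORE_ID",   # assumed field for a registered data source
    "credentialId": "YOUR_CREDENTIAL_ID",
    # The Jinja variable is rendered at run time; microseconds are stripped as noted above.
    "query": (
        "SELECT * FROM scoring_input "
        "WHERE created_at > '{{ last_scheduled_run_time }}'"
    ),
}
```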
API Reference¶
The Public API¶
The Batch Prediction API is part of the DataRobot REST API. Reference this documentation for more information about how to work with batch predictions.
The Python API Client¶
You can use the Python Public API Client to interface with the Batch Prediction API.
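For example, here is a minimal sketch assuming the `datarobot` package's `BatchPredictionJob` helpers; check the client documentation for the exact signatures available in your client version.

```python
import datarobot as dr

# Connect to DataRobot; endpoint and token are placeholders.
dr.Client(endpoint="https://app.datarobot.com/api/v2", token="YOUR_API_TOKEN")

# Stream a local CSV through the deployment and write the scored rows locally.
dr.BatchPredictionJob.score_to_file(
    "YOUR_DEPLOYMENT_ID",
    "./to_score.csv",
    "./scored.csv",
)
```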