The Batch Prediction API provides flexible options for intake and output when scoring large datasets using the prediction servers you have already deployed. The API is exposed through the DataRobot Public API and can be consumed using any REST-enabled client or the DataRobot Python Public API bindings.
To ensure that the prediction server does not get overloaded, DataRobot will only run one job per prediction instance.
Further jobs are queued and started as soon as previous jobs complete.
A Batch Prediction job is a data pipeline consisting of:
Data Intake > Concurrent Scoring > Data Output
On creation, the job's intakeSettings and outputSettings define the data intake and data output part of the pipeline.
You can configure any combination of intake and output options.
For both, the defaults are local file intake and output, meaning you will have to issue a separate PUT request with the data to score and subsequently download the scored data.
When scoring, the data you supply is split into chunks and scored concurrently on the prediction instance specified by the deployment.
To control the level of concurrency, modify the numConcurrent parameter at job creation.
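Below is a minimal sketch of that flow using Python's `requests` library: create a job with the default local-file intake and output, issue the separate PUT with the data to score, and download the result. The base URL, the names of the upload/download links in the job response, and the polling you would normally do before downloading are assumptions or omissions for brevity; consult the REST API reference for the exact contract.

```python
# Minimal sketch of the default local-file flow: create job, upload data, download results.
# Link names ("csvUpload", "download") are assumptions; polling for completion is omitted.
import requests

API_URL = "https://app.datarobot.com/api/v2"   # adjust for your installation
HEADERS = {"Authorization": "Bearer YOUR_API_TOKEN"}

# 1. Create the job; numConcurrent controls how many chunks are scored in parallel.
job = requests.post(
    f"{API_URL}/batchPredictions/",
    headers=HEADERS,
    json={
        "deploymentId": "YOUR_DEPLOYMENT_ID",
        "numConcurrent": 4,
        "intakeSettings": {"type": "localFile"},
        "outputSettings": {"type": "localFile"},
    },
).json()

# 2. Upload the data to score (the separate PUT request mentioned above).
with open("to_score.csv", "rb") as f:
    requests.put(
        job["links"]["csvUpload"],
        headers={**HEADERS, "Content-Type": "text/csv"},
        data=f,
    )

# 3. Once the job completes, download the scored data.
with open("scored.csv", "wb") as out:
    out.write(requests.get(job["links"]["download"], headers=HEADERS).content)
```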
Some sources or targets for scoring may require DataRobot to authenticate on your behalf (for example, if your database requires that you pass a username and password for login). To ensure proper storage of these credentials, you must have data credentials enabled.
DataRobot uses the following credential types and properties:
To use a stored credential, you must pass the associated credentialId in either intakeSettings or outputSettings as described below for each of the adapters.
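For example, a JDBC intake section might reference a stored credential like this (a sketch; the dataStoreId and query fields shown are illustrative placeholders belonging to the JDBC adapter's own option set):

```python
# Sketch of a JDBC intake that authenticates via a stored credential.
intake_settings = {
    "type": "jdbc",
    "dataStoreId": "YOUR_DATA_STORE_ID",      # registered data connection
    "query": "SELECT * FROM scoring_input",   # rows to score
    "credentialId": "YOUR_CREDENTIAL_ID",     # stored username/password credential
}
```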
The Batch Prediction API integrates well with DataRobot's model monitoring capabilities:
If you have enabled data drift tracking for your deployment, any predictions run through the Batch Prediction API will be tracked.
If you have enabled target drift tracking for your deployment, the output will contain the association ID to be used for reporting actuals.
Should you need to run a non-production dataset against your deployment, you can turn off drift and accuracy tracking for a single job by providing the following parameter:
| Parameter | Example | Description |
|-----------|---------|-------------|
| skipDriftTracking | true | (Optional) Skip data drift, target drift, and accuracy tracking for this job. Default: false. |
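For example, when scoring a throwaway test dataset you might create the job with drift and accuracy tracking disabled (a sketch; the remaining fields follow the earlier local-file example):

```python
# Job payload that skips drift and accuracy tracking for this run only.
job_payload = {
    "deploymentId": "YOUR_DEPLOYMENT_ID",
    "skipDriftTracking": True,               # do not record this run in monitoring
    "intakeSettings": {"type": "localFile"},
    "outputSettings": {"type": "localFile"},
}
```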
Under normal circumstances, the prediction server used for scoring is the default prediction server your model was deployed to. However, if you have access to multiple prediction servers, you can override this default behavior using the following properties in the predictionInstance option:
| Parameter | Example | Description |
|-----------|---------|-------------|
| hostName | 192.0.2.4 | Sets the hostname to use instead of the default hostname from the prediction server the model was deployed to. |
| sslEnabled | false | (Optional) Use SSL (HTTPS) to access the prediction server. Default: true. |
| apiKey | NWU...IBn2w | (Optional) Use an API key different from the job creator's key to authenticate against the new prediction server. |
| datarobotKey | 154a8abb-cbde-4e73-ab3b-a46c389c337b | (Optional) If running in a managed AI Platform environment, specify the per-organization DataRobot key for the prediction server. |
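For example, a job that routes scoring to a different prediction server might set predictionInstance like this (a sketch; all values are placeholders):

```python
# Overriding the prediction server for a single job.
job_payload = {
    "deploymentId": "YOUR_DEPLOYMENT_ID",
    "predictionInstance": {
        "hostName": "192.0.2.4",             # alternative prediction server
        "sslEnabled": False,                 # use plain HTTP for this host
        "apiKey": "YOUR_OTHER_API_KEY",      # if the job creator's key is not valid there
        "datarobotKey": "YOUR_PER_ORG_KEY",  # managed AI Platform environments only
    },
    "intakeSettings": {"type": "localFile"},
    "outputSettings": {"type": "localFile"},
}
```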
If you deploy a new model after a job has been queued, DataRobot will still use the model that was deployed at the time of job creation for the entire job. Every row will be scored with the same model.
Sometimes it can be useful to specify dynamic parameters in your batch jobs, such as in Job Definitions. You can use Jinja's variable syntax (double curly braces) to insert the value of any of the following parameters:
| Variable | Description |
|----------|-------------|
| current_run_time | datetime object for the current UTC time (datetime.utcnow()) |
| current_run_timestamp | Milliseconds since the Unix epoch (integer) |
| last_scheduled_run_time | datetime object for the start of the last job instantiated from the same job definition |
| next_scheduled_run_time | datetime object for the next scheduled start of a job from the same job definition |
| last_completed_run_time | datetime object for when the previously scheduled job finished scoring |
The above variables can be used in the following fields:
| Field | Condition |
|-------|-----------|
| intake_settings.query | For the JDBC, Synapse, and Snowflake adapters. |
| output_settings.table | For the JDBC, Synapse, Snowflake, and BigQuery adapters, when the statement type is create_table or create_table_if_not_exists is set to true. |
| output_settings.url | For the S3, GCP, and Azure adapters. Specify the URL with the variable embedded, for example: gs://bucket/output-{{ current_run_time }}.csv. |
Note
To ensure that most databases understand the replacements mentioned above, DataRobot strips microseconds from the ISO-8601 formatted timestamps.
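As an illustration, a job definition might template both an incremental intake query and a timestamped output location (a sketch; the adapter type names, table, and bucket are illustrative):

```python
# Templated job-definition fields: score only rows added since the last
# scheduled run (JDBC intake) and write to a timestamped object (GCP output).
intake_settings = {
    "type": "jdbc",
    "dataStoreId": "YOUR_DATA_STORE_ID",
    "credentialId": "YOUR_CREDENTIAL_ID",
    "query": (
        "SELECT * FROM scoring_input "
        "WHERE updated_at >= '{{ last_scheduled_run_time }}'"
    ),
}

output_settings = {
    "type": "gcp",
    "credentialId": "YOUR_GCP_CREDENTIAL_ID",
    "url": "gs://my-bucket/output-{{ current_run_time }}.csv",
}
```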
The Batch Prediction API is part of the DataRobot REST API. Reference this documentation for more information about how to work with batch predictions.