Prediction intake options

You can configure a prediction source using the Predictions > Job Definitions tab or the Batch Prediction API. This topic describes both the UI and API intake options.

Note

For a complete list of supported intake options, see the data sources supported for batch predictions.

Intake option Description
Local file streaming Stream input data through a URL endpoint for immediate processing when the job moves to a running state.
HTTP scoring Stream input data from an absolute URL for scoring. This option can read data from pre-signed URLs for Amazon S3, Azure, and Google Cloud Platform.
Database connections
JDBC scoring Read prediction data from a JDBC-compatible database with data source details supplied through a job definition or the Batch Prediction API.
SAP Datasphere scoring Read prediction data from a SAP Datasphere database with data source details supplied through a job definition or the Batch Prediction API.
Cloud storage connections
Azure Blob Storage scoring Read input data from Azure Blob Storage with DataRobot credentials consisting of an Azure Connection String.
Google Cloud Storage scoring (GCP) Read input data from Google Cloud Storage with DataRobot credentials consisting of a JSON-formatted account key.
Amazon S3 scoring Read input data from public or private S3 buckets with DataRobot credentials consisting of an access key (ID and key) and, optionally, a session token. This is the preferred intake option for larger files.
Data warehouse connections
BigQuery scoring Score data using BigQuery with data source details supplied through a job definition or the Batch Prediction API.
Snowflake scoring Score data using Snowflake with data source details supplied through a job definition or the Batch Prediction API.
Synapse scoring Score data using Synapse with data source details supplied through a job definition or the Batch Prediction API.
Other connections
AI Catalog / Data Registry dataset scoring Read input data from a dataset snapshot in the DataRobot AI Catalog / Data Registry.
Wrangler Recipe scoring Read input data from a wrangler recipe created in the DataRobot NextGen Workbench from a Snowflake data connection.

If you are using a custom CSV format, any intake option dealing with CSV will adhere to that format.

Local file streaming

Local file intake does not have any special options. This intake option requires you to upload the job's scoring data using a PUT request to the URL specified in the csvUpload link in the job data. This starts the job (or queues it for processing if the prediction instance is already occupied).

If there is no other queued job for the selected prediction instance, scoring will start while you are still uploading.
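
For reference, a minimal intakeSettings sketch for local file intake looks like the following; no additional options are needed because the scoring data itself is sent in the subsequent PUT request:

{
    "intakeSettings": {
        "type": "localFile"
    }
}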

Refer to this sample use case.

Note

If you forget to send scoring data, the job remains in the INITIALIZING state.

Multipart upload

By default, local file intake requires you to upload a job's scoring data in a single PUT request to the URL specified in the csvUpload parameter; that request starts the job (or queues it for processing if the prediction instance is occupied). Multipart upload for batch predictions overrides this default behavior, allowing you to upload scoring data in multiple files. The multipart process requires several PUT requests followed by a single POST request (finalizeMultipart) to finalize the upload manually. This can be helpful when you upload large datasets over a slow connection or experience frequent network instability.

Note

For more information on the batch prediction API and local file intake, see Batch Prediction API and Prediction intake options.

Multipart upload endpoints

This feature adds the following multipart upload endpoints to the batch prediction API:

Endpoint Description
PUT /api/v2/batchPredictions/:id/csvUpload/part/0/ Upload scoring data in multiple parts to the URL specified by csvUpload. Increment 0 by 1 in sequential order for each part of the upload.
POST /api/v2/batchPredictions/:id/csvUpload/finalizeMultipart/ Finalize the multipart upload process. Make sure each part of the upload has finished before finalizing.

Local file intake settings

The intake settings for the local file adapter include two properties that support multipart upload through the batch prediction API:

Property Type Default Description
intakeSettings.multipart boolean false
  • true: Requires you to submit multiple files via PUT requests and finalize the process manually via a POST request (finalizeMultipart).
  • false: Finalizes intake after one file is submitted via a PUT request.
intakeSettings.async boolean true
  • true: Starts the scoring job when the initial PUT request for file intake is made.
  • false: Postpones the scoring job until the PUT request resolves or the POST request for finalizeMultipart resolves.
Multipart intake setting

To enable the new multipart upload workflow, configure the intakeSettings for the localFile adapter as shown in the following sample request:

{
    "intakeSettings": {
        "type": "localFile",
        "multipart": true
    }
}
These properties alter the local file upload workflow, requiring you to:

  • Upload any number of sequentially numbered files.

  • Finalize the upload to indicate that all required files uploaded successfully.

Async intake setting

To use the multipart upload workflow while postponing job submission, set async to false in the intakeSettings for the localFile adapter, as shown in the following sample request:

Note

You can also use the async intake setting independently of the multipart setting.

{
    "intakeSettings": {
        "type": "localFile",
        "multipart": true,
        "async": false
    }
}

A defining feature of batch predictions is that the scoring job starts on the initial file upload, and only one batch prediction job at a time can run for any given prediction instance. This functionality may cause issues when uploading large datasets over a slow connection. In these cases, the client's upload speed could create a bottleneck and block the processing of other jobs. To avoid this potential bottleneck, you can set async to false, as shown in the example above. This configuration postpones submitting the batch prediction job to the queue.

When "async": false, the point at which a job enters the batch prediction queue depends on the multipart setting:

  • If "multipart": true, the job is submitted to the queue after the POST request for finalizeMultipart resolves.

  • If "multipart": false, the job is submitted to the queue after the initial file intake PUT request resolves.

Example multipart upload requests

The batch prediction API requests required to upload a three-part multipart batch prediction job are:

PUT /api/v2/batchPredictions/:id/csvUpload/part/0/

PUT /api/v2/batchPredictions/:id/csvUpload/part/1/

PUT /api/v2/batchPredictions/:id/csvUpload/part/2/

POST /api/v2/batchPredictions/:id/csvUpload/finalizeMultipart/

Each uploaded part is a complete CSV file with a header.

Abort a multipart upload

If you start a multipart upload that you don't want to finalize, you can use a DELETE request to the existing batchPredictions abort route:

DELETE /api/v2/batchPredictions/:id/

HTTP scoring

In addition to the cloud storage adapters, you can also point batch predictions to a regular URL so DataRobot can stream the data for scoring:

Parameter Example Description
type http Use HTTP for intake.
url https://example.com/datasets/scoring.csv An absolute URL for the file to be scored.

The URL can optionally contain a username and password, such as https://username:password@example.com/datasets/scoring.csv.

The http adapter can also ingest data from pre-signed URLs for S3, Azure, or GCP.
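
Based on the parameters above, a minimal intakeSettings sketch for HTTP intake might look like the following; the URL is a placeholder:

{
    "intakeSettings": {
        "type": "http",
        "url": "https://example.com/datasets/scoring.csv"
    }
}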

JDBC scoring

DataRobot supports reading from any JDBC-compatible database for batch predictions. To use JDBC with the Batch Prediction API, specify jdbc as the intake type. Because no file upload (PUT request) is needed, scoring starts immediately, transitioning the job to RUNNING if preliminary validation succeeds. To support this, the Batch Prediction API integrates with external data sources using credentials securely stored in data credentials.

Supply data source details using the Predictions > Job Definitions tab or the Batch Prediction API (intakeSettings) as described in the table below.

UI field Parameter Example Description
Source type type jdbc Use a JDBC data store for intake.
Data connection options
+ Select connection dataStoreId 5e4bc5b35e6e763beb9db14a The ID of an external data source. In the UI, select a data connection or click add a new data connection. Complete account and authorization fields.
Enter credentials credentialId 5e4bc5555e6e763beb9db147 The ID of a stored credential. Refer to storing credentials securely.
Schemas schema public (Optional) The name of the schema containing the table to be scored.
Tables table scoring_data (Optional) The name of the database table containing data to be scored.
SQL query query SELECT feature1, feature2, feature3 AS readmitted FROM diabetes (Optional) A custom query to run against the database.
Deprecated option
Fetch size fetchSize (deprecated) 1000 (Optional) Deprecated: fetchSize is now inferred dynamically for optimal throughput and no longer needs to be set. Previously, it specified the number of rows read at a time (range [1, 100000], default 1000) to balance throughput and memory usage.

Note

You must specify either table and schema or query.

Refer to the example section for a complete API example.
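
As a sketch, JDBC intake settings that score a table by schema and table name might look like the following; the IDs are placeholder values from the table above:

{
    "intakeSettings": {
        "type": "jdbc",
        "dataStoreId": "5e4bc5b35e6e763beb9db14a",
        "credentialId": "5e4bc5555e6e763beb9db147",
        "schema": "public",
        "table": "scoring_data"
    }
}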

Data warehouse connections

Using JDBC to transfer data can be costly in terms of IOPS (input/output operations per second) and expense for data warehouses. The data warehouse adapters reduce the load on database engines during prediction scoring by using cloud storage and bulk insert to create a hybrid JDBC-cloud storage solution. For more information, see the BigQuery, Snowflake, and Synapse data warehouse adapter sections.

Allowed source IP addresses

Any connection initiated from DataRobot originates from one of the following IP addresses:

Host: https://app.datarobot.com
100.26.66.209, 54.204.171.181, 54.145.89.18, 54.147.212.247, 18.235.157.68, 3.211.11.187, 52.1.228.155, 3.224.51.250, 44.208.234.185, 3.214.131.132, 3.89.169.252, 3.220.7.239, 52.44.188.255, 3.217.246.191

Host: https://app.eu.datarobot.com
18.200.151.211, 18.200.151.56, 18.200.151.43, 54.78.199.18, 54.78.189.139, 54.78.199.173, 18.200.127.104, 34.247.41.18, 99.80.243.135, 63.34.68.62, 34.246.241.45, 52.48.20.136

Host: https://app.jp.datarobot.com
52.199.145.51, 52.198.240.166, 52.197.6.249

Note

These IP addresses are reserved for DataRobot use only.

SAP Datasphere scoring

Premium

Support for SAP Datasphere is off by default. Contact your DataRobot representative or administrator for information on enabling the feature.

Feature flag(s): Enable SAP Datasphere Connector, Enable SAP Datasphere Batch Predictions Integration

To use SAP Datasphere for scoring, supply data source details using the Predictions > Job Definitions tab or the Batch Prediction API (intakeSettings) as described in the table below.

UI field Parameter Example Description
Source type type datasphere Use a SAP Datasphere database for intake.
Data connection options
+ Select connection dataStoreId 5e4bc5b35e6e763beb9db14a The ID of an external data source. In the UI, select a data connection or click add a new data connection. Refer to the SAP Datasphere connection documentation.
Enter credentials credentialId 5e4bc5555e6e763beb9db147 The ID of a stored credential for Datasphere. Refer to storing credentials securely.
catalog / The name of the database catalog containing the table to be scored.
Schemas schema public The name of the database schema containing the table to be scored.
Tables table scoring_data The name of the database table containing data to be scored.
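
A minimal intakeSettings sketch for SAP Datasphere intake might look like the following; the IDs are placeholders from the table above, and the catalog value is a hypothetical example:

{
    "intakeSettings": {
        "type": "datasphere",
        "dataStoreId": "5e4bc5b35e6e763beb9db14a",
        "credentialId": "5e4bc5555e6e763beb9db147",
        "catalog": "my_catalog",
        "schema": "public",
        "table": "scoring_data"
    }
}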

Azure Blob Storage scoring

Azure Blob Storage is a scoring option for large files. To score from Azure Blob Storage, you must configure credentials with DataRobot using an Azure Connection String.

UI field Parameter Example Description
Source type type azure Use Azure Blob Storage for intake.
URL url https://myaccount.blob.core.windows.net/datasets/scoring.csv An absolute URL for the file to be scored.
Format format csv (Optional) Select CSV (csv) or Parquet (parquet).
Default value: CSV
+ Add credentials credentialId 5e4bc5555e6e763beb488dba In the UI, enable the + Add credentials field by selecting This URL requires credentials. Required if explicit access credentials are required for this URL; otherwise, optional. Refer to storing credentials securely.

Azure credentials are encrypted and are only decrypted when used to set up the client for communication with Azure during scoring.
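
Based on the parameters above, a sample intakeSettings sketch for Azure Blob Storage intake might look like the following; the URL and credential ID are placeholders:

{
    "intakeSettings": {
        "type": "azure",
        "url": "https://myaccount.blob.core.windows.net/datasets/scoring.csv",
        "format": "csv",
        "credentialId": "5e4bc5555e6e763beb488dba"
    }
}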

Google Cloud Storage scoring

DataRobot supports the Google Cloud Storage adapter. To score from Google Cloud Storage, you must set up a credential with DataRobot consisting of a JSON-formatted account key.

UI field Parameter Example Description
Source type type gcp Use Google Cloud Storage for intake.
URL url gcs://bucket-name/datasets/scoring.csv An absolute URL for the file to be scored.
Format format csv (Optional) Select CSV (csv) or Parquet (parquet).
Default value: CSV
+ Add credentials credentialId 5e4bc5555e6e763beb488dba In the UI, enable the + Add credentials field by selecting This URL requires credentials. Required if explicit access credentials for this URL are required, otherwise optional. Refer to storing credentials securely.

GCP credentials are encrypted and are only decrypted when used to set up the client for communication with GCP during scoring.
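
A sample intakeSettings sketch for Google Cloud Storage intake, using placeholder values from the table above, might look like the following:

{
    "intakeSettings": {
        "type": "gcp",
        "url": "gcs://bucket-name/datasets/scoring.csv",
        "format": "csv",
        "credentialId": "5e4bc5555e6e763beb488dba"
    }
}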

Amazon S3 scoring

For larger files, S3 is the preferred method for intake. DataRobot can ingest files from both public and private buckets. To score from Amazon S3, you must set up a credential with DataRobot consisting of an access key (ID and key) and, optionally, a session token.

UI field Parameter Example Description
Source type type s3 DataRobot recommends S3 for intake.
URL url s3://bucket-name/datasets/scoring.csv An absolute URL for the file to be scored.
Format format csv (Optional) Select CSV (csv) or Parquet (parquet).
Default value: CSV
+ Add credentials credentialId 5e4bc5555e6e763beb488dba In the UI, enable the + Add credentials field by selecting This URL requires credentials. Required if explicit access credentials for this URL are required. Refer to storing credentials securely.

AWS credentials are encrypted and only decrypted when used to set up the client for communication with AWS during scoring.

Note

If running a Private AI Cloud within AWS, it is possible to provide implicit credentials for your application instances using an IAM Instance Profile to access your S3 buckets without supplying explicit credentials in the job data. For more information, see the AWS documentation.
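
A sample intakeSettings sketch for S3 intake, using placeholder values from the table above, might look like the following:

{
    "intakeSettings": {
        "type": "s3",
        "url": "s3://bucket-name/datasets/scoring.csv",
        "format": "csv",
        "credentialId": "5e4bc5555e6e763beb488dba"
    }
}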

BigQuery scoring

To use BigQuery for scoring, supply data source details using the Predictions > Job Definitions tab or the Batch Prediction API (intakeSettings) as described in the table below.

UI field Parameter Example Description
Source type type bigquery Use the BigQuery API to unload data to Google Cloud Storage and use it as intake.
Dataset dataset my_dataset The BigQuery dataset to use.
Table table my_table The BigQuery table or view from the dataset used as intake.
Bucket bucket my-bucket-in-gcs Bucket where data should be exported.
+ Add credentials credentialId 5e4bc5555e6e763beb488dba Required if explicit access credentials for this bucket are required (otherwise optional).

In the UI, enable the + Add credentials field by selecting This connection requires credentials. Refer to storing credentials securely.

Refer to the example section for a complete API example.
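
As a sketch, BigQuery intake settings might look like the following; the dataset, table, bucket, and credential ID are placeholder values from the table above:

{
    "intakeSettings": {
        "type": "bigquery",
        "dataset": "my_dataset",
        "table": "my_table",
        "bucket": "my-bucket-in-gcs",
        "credentialId": "5e4bc5555e6e763beb488dba"
    }
}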

Snowflake scoring

To use Snowflake for scoring, supply data source details using the Predictions > Job Definitions tab or the Batch Prediction API (intakeSettings) as described in the table below.

UI field Parameter Example Description
Source type type snowflake Adapter type.
Data connection options
+ Select connection dataStoreId 5e4bc5b35e6e763beb9db14a ID of Snowflake data source. In the UI, select a Snowflake data connection or click add a new data connection. Complete account and authorization fields.
Enter credentials credentialId 5e4bc5555e6e763beb9db147 The ID of a stored credential for Snowflake.
Tables table SCORING_DATA (Optional) Name of the Snowflake table containing data to be scored.
Schemas schema PUBLIC (Optional) Name of the schema containing the table to be scored.
SQL query query SELECT feature1, feature2, feature3 FROM diabetes (Optional) Custom query to run against the database.
Cloud storage type cloudStorageType s3 Type of cloud storage backend used in the Snowflake external stage. Can be one of three cloud storage providers: s3, azure, or gcp. Default is s3.
External stage externalStage my_s3_stage Snowflake external stage. In the UI, toggle on Use external stage to enable the External stage field.
+ Add credentials cloudStorageCredentialId 6e4bc5541e6e763beb9db15c ID of stored credentials for a storage backend (S3/Azure/GCS) used in Snowflake stage. In the UI, enable the + Add credentials field by selecting This URL requires credentials.

Refer to the example section for a complete API example.
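
One possible intakeSettings sketch for Snowflake intake with an external stage, using placeholder values from the table above, might look like the following:

{
    "intakeSettings": {
        "type": "snowflake",
        "dataStoreId": "5e4bc5b35e6e763beb9db14a",
        "credentialId": "5e4bc5555e6e763beb9db147",
        "table": "SCORING_DATA",
        "schema": "PUBLIC",
        "externalStage": "my_s3_stage",
        "cloudStorageType": "s3",
        "cloudStorageCredentialId": "6e4bc5541e6e763beb9db15c"
    }
}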

Synapse scoring

To use Synapse for scoring, supply data source details using the Predictions > Job Definitions tab or the Batch Prediction API (intakeSettings) as described in the table below.

UI field Parameter Example Description
Source type type synapse Adapter type.
Data connection options
+ Select connection dataStoreId 5e4bc5b35e6e763beb9db14a ID of Synapse data source. In the UI, select a Synapse data connection or click add a new data connection. Complete account and authorization fields.
External data source externalDatasource my_data_source Name of the Synapse external data source.
Tables table SCORING_DATA (Optional) Name of the Synapse table containing data to be scored.
Schemas schema dbo (Optional) Name of the schema containing the table to be scored.
SQL query query SELECT feature1, feature2, feature3 FROM diabetes (Optional) Custom query to run against the database.
Enter credentials credentialId 5e4bc5555e6e763beb9db147 The ID of a stored credential for Synapse. Credentials are required if explicit access credentials for this URL are required, otherwise optional. Refer to storing credentials securely.
+ Add credentials cloudStorageCredentialId 6e4bc5541e6e763beb9db15c ID of a stored credential for Azure Blob storage. In the UI, enable the + Add credentials field by selecting This external data source requires credentials.

Refer to the example section for a complete API example.

Note

Synapse supports fewer collations than the default Microsoft SQL Server. For more information, reference the Synapse documentation.
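
A sample intakeSettings sketch for Synapse intake, using placeholder values from the table above, might look like the following:

{
    "intakeSettings": {
        "type": "synapse",
        "dataStoreId": "5e4bc5b35e6e763beb9db14a",
        "externalDatasource": "my_data_source",
        "table": "SCORING_DATA",
        "schema": "dbo",
        "credentialId": "5e4bc5555e6e763beb9db147",
        "cloudStorageCredentialId": "6e4bc5541e6e763beb9db15c"
    }
}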

AI Catalog / Data Registry dataset scoring

To read input data from an AI Catalog / Data Registry dataset, the following options are available:

UI field Parameter Example Description
Source type type dataset In the UI, select AI Catalog (or Data Registry in NextGen).
+ Select source from AI Catalog datasetId 5e4bc5b35e6e763beb9db14a The AI Catalog dataset ID.

In the UI, search for the dataset, select the dataset, then click Use the dataset (or Confirm in NextGen).
+ Select version datasetVersionId 5e4bc5555e6e763beb488dba (Optional) The AI Catalog dataset version ID.

In the UI, enable the + Select version field by selecting the Use specific version check box. Search for and select the version. If datasetVersionId is not specified, it defaults to the latest version for the specified dataset.

Note

For the specified AI Catalog dataset, the version to be scored must have been successfully ingested, and it must be a snapshot.
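
As a sketch, intake settings for an AI Catalog / Data Registry dataset might look like the following; the IDs are placeholders, and datasetVersionId can be omitted to use the latest version:

{
    "intakeSettings": {
        "type": "dataset",
        "datasetId": "5e4bc5b35e6e763beb9db14a",
        "datasetVersionId": "5e4bc5555e6e763beb488dba"
    }
}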

Wrangler recipe dataset scoring

The following options are available to read input data from a wrangler recipe created in the DataRobot NextGen Workbench from a Snowflake data connection:

Wrangler data connection

Wrangler recipes for batch prediction jobs support data wrangled from a Snowflake data connection or the AI Catalog/Data Registry.

UI field Parameter Example Description
Source type type recipe In the UI, select Wrangler Recipe.
+ Select wrangler recipe recipeId 65fb040a42c170ee46230133 The Wrangler Recipe dataset ID.

In the NextGen prediction jobs UI, search for the wrangled dataset, select the dataset, then click Confirm.
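
A minimal intakeSettings sketch for wrangler recipe intake, using the placeholder recipe ID from the table above:

{
    "intakeSettings": {
        "type": "recipe",
        "recipeId": "65fb040a42c170ee46230133"
    }
}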
