Read input data from public or private S3 buckets using DataRobot credentials consisting of an access key (ID and key) and, optionally, a session token. This is the preferred intake option for larger files.
Local file intake does not have any special options. This intake option requires you to upload the job's scoring data using a PUT request to the URL specified in the csvUpload link in the job data. This starts the job (or queues it for processing if the prediction instance is already occupied).
If there is no other queued job for the selected prediction instance, scoring will start while you are still uploading.
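A minimal sketch of this workflow using the Python requests library; the endpoint URL, API token, deployment ID, and the exact shape of the job response (assumed here to expose the upload URL as a csvUpload link) are placeholders and assumptions rather than a definitive implementation.

```python
import requests

API = "https://app.datarobot.com/api/v2"           # adjust for your DataRobot installation
HEADERS = {"Authorization": "Bearer <API_TOKEN>"}   # placeholder API token

# Create a batch prediction job that expects scoring data via local file intake.
job = requests.post(
    f"{API}/batchPredictions/",
    headers=HEADERS,
    json={
        "deploymentId": "<DEPLOYMENT_ID>",          # placeholder deployment ID
        "intakeSettings": {"type": "localFile"},
        "outputSettings": {"type": "localFile"},
    },
).json()

# PUT the scoring data to the csvUpload link from the job data; this starts the
# job (or queues it if the prediction instance is occupied).
with open("scoring.csv", "rb") as f:
    requests.put(
        job["links"]["csvUpload"],                  # assumed location of the upload URL
        headers={**HEADERS, "Content-Type": "text/csv"},
        data=f,
    )
```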
Because the local file intake process requires you to upload a job's scoring data with a PUT request to the URL specified in the csvUpload link, by default, a single PUT request starts the job (or queues it for processing if the prediction instance is occupied). Multipart upload for batch predictions allows you to override this default behavior and upload the scoring data as multiple files. This upload process requires multiple PUT requests followed by a single POST request (finalizeMultipart) to finalize the multipart upload manually. This feature can be helpful when you upload large datasets over a slow connection or experience frequent network instability.
To enable the multipart upload workflow, configure the intakeSettings for the localFile adapter as shown in the following sample request (a requests-based sketch reusing the API and HEADERS values from the example above; the deployment ID is a placeholder):
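```python
job = requests.post(
    f"{API}/batchPredictions/",
    headers=HEADERS,
    json={
        "deploymentId": "<DEPLOYMENT_ID>",   # placeholder deployment ID
        "intakeSettings": {
            "type": "localFile",
            "multipart": True,               # upload scoring data in multiple parts
            "async": False,                  # queue the job only after finalizeMultipart
        },
        "outputSettings": {"type": "localFile"},
    },
).json()
```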
Note
You can also use the async intake setting independently of the multipart setting.
A defining feature of batch predictions is that the scoring job starts on the initial file upload, and only one batch prediction job at a time can run for any given prediction instance. This functionality may cause issues when uploading large datasets over a slow connection. In these cases, the client's upload speed could create a bottleneck and block the processing of other jobs. To avoid this potential bottleneck, you can set async to false, as shown in the example above. This configuration postpones submitting the batch prediction job to the queue.
When "async": false, the point at which a job enters the batch prediction queue depends on the multipart setting:
If "multipart": true, the job is submitted to the queue after the POST request for finalizeMultipart resolves.
If "multipart": false, the job is submitted to the queue after the initial file intake PUT request resolves.
The Batch Prediction API requests required to upload a three-part multipart batch prediction job are:
```
PUT  /api/v2/batchPredictions/:id/csvUpload/part/0/
PUT  /api/v2/batchPredictions/:id/csvUpload/part/1/
PUT  /api/v2/batchPredictions/:id/csvUpload/part/2/
POST /api/v2/batchPredictions/:id/csvUpload/finalizeMultipart/
```
Each uploaded part is a complete CSV file with a header.
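A sketch of the three-part upload, assuming the API, HEADERS, and multipart job created in the sample request above, plus three complete CSV files that each carry a header row:

```python
job_id = job["id"]   # assumed: the job ID from the creation response

for part, path in enumerate(["scoring_part0.csv", "scoring_part1.csv", "scoring_part2.csv"]):
    with open(path, "rb") as f:
        requests.put(
            f"{API}/batchPredictions/{job_id}/csvUpload/part/{part}/",
            headers={**HEADERS, "Content-Type": "text/csv"},
            data=f,
        )

# With "async": false, the job enters the batch prediction queue once this
# finalizeMultipart request resolves.
requests.post(
    f"{API}/batchPredictions/{job_id}/csvUpload/finalizeMultipart/",
    headers=HEADERS,
)
```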
DataRobot supports reading from any JDBC-compatible database for Batch Predictions. To use JDBC with the Batch Prediction API, specify jdbc as the intake type. Because no file upload (PUT request) is needed, scoring starts immediately and the job transitions to RUNNING if preliminary validation succeeds. To support this, the Batch Prediction API integrates with external data sources using credentials securely stored in data credentials.
| UI field | Parameter | Example | Description |
| --- | --- | --- | --- |
| Schemas | schema | | (Optional) The name of the schema containing the table to be scored. |
| Tables | table | scoring_data | (Optional) The name of the database table containing data to be scored. |
| SQL query | query | SELECT feature1, feature2, feature3 AS readmitted FROM diabetes | (Optional) A custom query to run against the database. |
| Fetch size | fetchSize (deprecated) | 1000 | Deprecated: fetchSize is now inferred dynamically for optimal throughput and is no longer needed. Previously, it set a custom number of rows read at a time to balance throughput and memory usage; must be in the range [1, 100000], default 1000. |
Note
You must specify either table and schema or query.
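As a sketch, JDBC intake settings might look like the following; the dataStoreId and credentialId parameter names (for the stored data connection and database credential) are assumptions not shown in the table above, and the settings are passed as the intakeSettings of the job-creation request:

```python
# Hypothetical JDBC intake settings.
intake_settings = {
    "type": "jdbc",
    "dataStoreId": "<DATA_STORE_ID>",    # assumed parameter: stored data connection ID
    "credentialId": "<CREDENTIAL_ID>",   # assumed parameter: stored database credential
    # Either table and schema ...
    "schema": "public",                  # example schema name
    "table": "scoring_data",
    # ... or a custom query (not both):
    # "query": "SELECT feature1, feature2, feature3 AS readmitted FROM diabetes",
}
```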
Using JDBC to transfer data can be costly in terms of IOPS (input/output operations per second) and expense for data warehouses. The data warehouse adapters reduce the load on database engines during prediction scoring by using cloud storage and bulk insert to create a hybrid JDBC-cloud storage solution. For more information, see the BigQuery, Snowflake, and Synapse data warehouse adapter sections.
Azure Blob Storage is another intake option for large files. To score from Azure Blob Storage, you must configure credentials with DataRobot using an Azure connection string.
| UI field | Parameter | Example | Description |
| --- | --- | --- | --- |
| Format | format | csv | (Optional) Select CSV (csv) or Parquet (parquet). Default value: CSV. |
| + Add credentials | credentialId | 5e4bc5555e6e763beb488dba | In the UI, enable the + Add credentials field by selecting This URL requires credentials. Required if explicit access credentials for this URL are required; otherwise optional. Refer to storing credentials securely. |
Azure credentials are encrypted and are only decrypted when used to set up the client for communication with Azure during scoring.
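A sketch of Azure Blob Storage intake settings; the azure type value and the url parameter are assumed by analogy with the S3 and GCS adapters, and the blob URL is a placeholder:

```python
# Hypothetical Azure Blob Storage intake settings.
intake_settings = {
    "type": "azure",                                 # assumed type value
    "url": "https://<account>.blob.core.windows.net/<container>/scoring.csv",  # placeholder
    "format": "csv",                                 # optional: csv (default) or parquet
    "credentialId": "5e4bc5555e6e763beb488dba",      # stored Azure connection string credential
}
```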
DataRobot supports the Google Cloud Storage adapter. To score from Google Cloud Storage, you must set up a credential with DataRobot consisting of a JSON-formatted account key.
| UI field | Parameter | Example | Description |
| --- | --- | --- | --- |
| Source type | type | gcp | Use Google Cloud Storage for intake. |
| URL | url | gcs://bucket-name/datasets/scoring.csv | An absolute URL for the file to be scored. |
| Format | format | csv | (Optional) Select CSV (csv) or Parquet (parquet). Default value: CSV. |
| + Add credentials | credentialId | 5e4bc5555e6e763beb488dba | In the UI, enable the + Add credentials field by selecting This URL requires credentials. Required if explicit access credentials for this URL are required; otherwise optional. Refer to storing credentials securely. |
GCP credentials are encrypted and are only decrypted when used to set up the client for communication with GCP during scoring.
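Putting the table above together, Google Cloud Storage intake settings might look like this sketch:

```python
# GCS intake settings using the example values from the table above.
intake_settings = {
    "type": "gcp",
    "url": "gcs://bucket-name/datasets/scoring.csv",
    "format": "csv",                                 # optional: csv (default) or parquet
    "credentialId": "5e4bc5555e6e763beb488dba",      # stored JSON account key credential
}
```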
For larger files, S3 is the preferred method for intake. DataRobot can ingest files from both public and private buckets. To score from Amazon S3, you must set up a credential with DataRobot consisting of an access key (ID and key) and, optionally, a session token.
| UI field | Parameter | Example | Description |
| --- | --- | --- | --- |
| Source type | type | s3 | DataRobot recommends S3 for intake. |
| URL | url | s3://bucket-name/datasets/scoring.csv | An absolute URL for the file to be scored. |
| Format | format | csv | (Optional) Select CSV (csv) or Parquet (parquet). Default value: CSV. |
| + Add credentials | credentialId | 5e4bc5555e6e763beb488dba | In the UI, enable the + Add credentials field by selecting This URL requires credentials. Required if explicit access credentials for this URL are required; otherwise optional. Refer to storing credentials securely. |
AWS credentials are encrypted and only decrypted when used to set up the client for communication with AWS during scoring.
Note
If running a Private AI Cloud within AWS, it is possible to provide implicit credentials for your application instances using an IAM Instance Profile to access your S3 buckets without supplying explicit credentials in the job data. For more information, see the AWS documentation.
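A sketch of S3 intake settings based on the table above; credentialId can be omitted when implicit credentials (for example, an IAM Instance Profile) grant access to the bucket:

```python
# S3 intake settings using the example values from the table above.
intake_settings = {
    "type": "s3",
    "url": "s3://bucket-name/datasets/scoring.csv",
    "format": "csv",                                 # optional: csv (default) or parquet
    "credentialId": "5e4bc5555e6e763beb488dba",      # omit to rely on implicit credentials
}
```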
For the Snowflake data warehouse adapter, configure the following intake options:

| UI field | Parameter | Example | Description |
| --- | --- | --- | --- |
| Data connection | | | The ID of the Snowflake data source. In the UI, select a Snowflake data connection or click add a new data connection, then complete the account and authorization fields. |
| Enter credentials | credentialId | 5e4bc5555e6e763beb9db147 | The ID of a stored credential for Snowflake. |
| Tables | table | SCORING_DATA | (Optional) Name of the Snowflake table containing data to be scored. |
| Schemas | schema | PUBLIC | (Optional) Name of the schema containing the table to be scored. |
| SQL query | query | SELECT feature1, feature2, feature3 FROM diabetes | (Optional) Custom query to run against the database. |
| Cloud storage type | cloudStorageType | s3 | (Optional) Type of cloud storage backend used in the Snowflake external stage. One of s3, azure, or gcp. Default: s3. |
| External stage | externalStage | my_s3_stage | Snowflake external stage. In the UI, toggle on Use external stage to enable the External stage field. |
| + Add credentials | cloudStorageCredentialId | 6e4bc5541e6e763beb9db15c | ID of stored credentials for the storage backend (S3/Azure/GCS) used in the Snowflake stage. In the UI, enable the + Add credentials field by selecting This URL requires credentials. |
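A sketch combining the Snowflake options above; the snowflake type value and the dataStoreId parameter (for the data connection) are assumptions not shown in the table:

```python
# Hypothetical Snowflake intake settings with a hybrid external-stage transfer.
intake_settings = {
    "type": "snowflake",                             # assumed type value
    "dataStoreId": "<DATA_STORE_ID>",                # assumed parameter: Snowflake data connection ID
    "credentialId": "5e4bc5555e6e763beb9db147",      # stored Snowflake credential
    "table": "SCORING_DATA",
    "schema": "PUBLIC",
    # Or use a custom query instead of table/schema:
    # "query": "SELECT feature1, feature2, feature3 FROM diabetes",
    "externalStage": "my_s3_stage",                  # Snowflake external stage
    "cloudStorageType": "s3",                        # s3 (default), azure, or gcp
    "cloudStorageCredentialId": "6e4bc5541e6e763beb9db15c",
}
```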
For the Synapse data warehouse adapter, configure the following intake options:

| UI field | Parameter | Example | Description |
| --- | --- | --- | --- |
| Tables | table | | (Optional) Name of the Synapse table containing data to be scored. |
| Schemas | schema | dbo | (Optional) Name of the schema containing the table to be scored. |
| SQL query | query | SELECT feature1, feature2, feature3 FROM diabetes | (Optional) Custom query to run against the database. |
| Enter credentials | credentialId | 5e4bc5555e6e763beb9db147 | The ID of a stored credential for Synapse. Required if explicit access credentials are required; otherwise optional. Refer to storing credentials securely. |
| + Add credentials | cloudStorageCredentialId | 6e4bc5541e6e763beb9db15c | ID of a stored credential for Azure Blob storage. In the UI, enable the + Add credentials field by selecting This external data source requires credentials. |
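A sketch of Synapse intake settings; the synapse type value is an assumption, and the table name is a placeholder:

```python
# Hypothetical Synapse intake settings.
intake_settings = {
    "type": "synapse",                               # assumed type value
    "credentialId": "5e4bc5555e6e763beb9db147",      # stored Synapse credential
    "schema": "dbo",
    "table": "<TABLE_NAME>",                         # placeholder table name
    # Or use a custom query instead of table/schema:
    # "query": "SELECT feature1, feature2, feature3 FROM diabetes",
    "cloudStorageCredentialId": "6e4bc5541e6e763beb9db15c",  # Azure Blob storage credential
}
```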
In the UI, select AI Catalog (or Data Registry in NextGen).
| UI field | Parameter | Example | Description |
| --- | --- | --- | --- |
| + Select source from AI Catalog | datasetId | 5e4bc5b35e6e763beb9db14a | The AI Catalog dataset ID. In the UI, search for and select the dataset, then click Use the dataset (or Confirm in NextGen). |
| + Select version | datasetVersionId | 5e4bc5555e6e763beb488dba | (Optional) The AI Catalog dataset version ID. If not specified, defaults to the latest version of the specified dataset. In the UI, enable the + Select version field by selecting the Use specific version check box, then search for and select the version. |
Note
For the specified AI Catalog dataset, the version to be scored must have been successfully ingested, and it must be a snapshot.
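A sketch of AI Catalog intake settings; the dataset type value is an assumption, and the IDs are the example values from the table above:

```python
# Hypothetical AI Catalog (Data Registry) intake settings.
intake_settings = {
    "type": "dataset",                               # assumed type value
    "datasetId": "5e4bc5b35e6e763beb9db14a",
    "datasetVersionId": "5e4bc5555e6e763beb488dba",  # optional: defaults to the latest version
}
```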