Import data

A typical data pipeline starts with a data read operation. Import modules bring external data into the pipeline and make it available for other modules to consume.

See the section on data processing limits for each module type.

AI Catalog Import module

The AI Catalog Import module imports a snapshotted dataset from the AI Catalog into the pipeline.

Use the following options to configure an AI Catalog Import module in the Details tab:

Dataset: Select a dataset to import from the AI Catalog.
Version: Specify the version of the dataset to import. If "Always use the latest version" is selected, the module runs using the latest version of the dataset.
Force Column Types to String: Treat all imported columns as strings instead of performing type inference to detect other types, such as numeric. This is useful for larger datasets where some column types may be inferred incorrectly.
Chunk Size in Rows: Specify the number of rows processed at a time. DataRobot recommends adjusting this value only if you encounter performance or memory issues with a specific dataset: increase it to address performance issues and decrease it to address memory issues.
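
The effect of Force Column Types to String and Chunk Size in Rows is easiest to picture outside the platform. The sketch below is not the module itself; it only approximates those two behaviors using pandas on a hypothetical local export of the dataset (the file name and chunk size are placeholders).

```python
import pandas as pd

# Illustrative only: approximates "Force Column Types to String" and
# "Chunk Size in Rows" with pandas. "catalog_snapshot.csv" is a placeholder.
CHUNK_SIZE = 100_000  # rows processed per chunk

chunks = []
for chunk in pd.read_csv(
    "catalog_snapshot.csv",
    dtype=str,            # skip type inference; treat every column as a string
    chunksize=CHUNK_SIZE, # read the file a fixed number of rows at a time
):
    chunks.append(chunk)  # a real pipeline would process each chunk here

df = pd.concat(chunks, ignore_index=True)
print(df.dtypes)  # every column is reported as object (string)
```

As in the module, a smaller chunk size keeps peak memory lower at the cost of more iterations, while a larger chunk size does the opposite.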

CSV Reader module

The CSV Reader module is an import module that reads delimited text files from the AWS S3 storage service. Use the following options to configure a CSV Reader module in the Details tab:

File path: Specify the path to the delimited text file, including the bucket name.
S3 Credentials: Use existing credentials from your profile’s “Credential Management” section, or create a new set of credentials by providing the Access Key, Secret Key, and AWS Session Token.
AWS region: Enter the region where the S3 bucket is located. The default is us-east-1.
Treat first row as column header: Indicates whether the first row of the file contains column names. Uncheck this option if the file has no header row.
Delimiter: Specify the field delimiter. The default is a comma.
Encoding: Specify the character encoding of the data. The default is UTF-8.
Force Column Types to String: Treat all imported columns as strings instead of performing type inference to detect other types, such as numeric. This is useful for larger datasets where some column types may be inferred incorrectly.
Parallel Streams: Select the number of parallel processing streams. This option trades ingestion speed against memory use: increase the value for smaller datasets to speed up runs, and keep it low for larger datasets to avoid "Out of Memory" errors.
Size of blocks in bytes: Select the size, in bytes, of the data blocks read at a time. Increasing the block size can speed up the module and downstream modules up to a point, but may cause "Out of Memory" errors for larger datasets. Decreasing it can help avoid "Out of Memory" errors for larger datasets, but setting it too small slows processing.
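
For a concrete picture of how these settings map onto a plain S3 read, the sketch below mirrors them with boto3 and pandas. It is illustrative only, not how the CSV Reader is implemented, and the bucket, key, and credential values are placeholders.

```python
import boto3
import pandas as pd

# Illustrative only: mirrors the CSV Reader settings using boto3 and pandas.
# Bucket, key, and credential values below are placeholders.
s3 = boto3.client(
    "s3",
    region_name="us-east-1",              # AWS region
    aws_access_key_id="<ACCESS_KEY>",
    aws_secret_access_key="<SECRET_KEY>",
    aws_session_token="<SESSION_TOKEN>",  # omit if not using temporary credentials
)

obj = s3.get_object(Bucket="my-bucket", Key="path/to/data.csv")  # File path

df = pd.read_csv(
    obj["Body"],
    sep=",",           # Delimiter
    encoding="utf-8",  # Encoding
    header=0,          # Treat first row as column header (use header=None if absent)
    dtype=str,         # Force Column Types to String
)
```

Parallel Streams and Size of blocks in bytes have no direct equivalent in this single-stream read; they control how the module splits the same work across concurrent readers and how much data each read pulls at once.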

Updated December 1, 2021