This section provides information on dataset requirements:
- General requirements
- Ensuring acceptable file sizes
- AutoML file import sizes
- Time series (AutoTS) file import sizes
- Feature Discovery file import sizes
- Pipeline data requirements
- File formats
- Encodings and character sets
- Special column detection
- Length and name conversion
- File download sizes
See the associated considerations for important additional information.
Consider the following dataset requirements for AutoML, time series, and Visual AI projects. See additional information about preparing your dataset for Visual AI.
|Requirement||Solution||Dataset type||Visual AI|
|Dataset minimum row requirements for non-date/time projects:
||Error displays number of rows found; add rows to the dataset until project meets the minimum rows (plus the header).||Training||Yes|
|Date/time partitioning-based projects (time series and OTV) have specific row requirements.||Error displays number of rows found; add rows to the dataset until project meets the minimum rows (plus the header).||Training||Yes|
|Dataset used for predictions via the GUI must have at least one data row plus header row.||Error displays zero rows found; add a header row and one data row.||Predictions||Yes|
|Dataset cannot have more than 20,000 columns.||Error message displays column count and limit; reduce the number of columns to 20,000 or fewer.||Training, predictions||Yes|
|Dataset must have headers.||Lack of header generally leads to bad predictions or ambiguous column names; add headers.||Training, predictions||Yes for CSV. If ZIP upload contains one folder of images per class, then technically there are not headers and so this is not always true for Visual AI.|
|Dataset must meet deployment type and release size limits.||Error message displays dataset size and configured limit. Contact DataRobot Support for size limits; reduce dataset size by trimming rows and/or columns.||Training, predictions||Yes. Managed AI Platform: 5GB; 100k 224x224 pixels / 50kB images. Self-Managed AI Platform: 10GB; 200k 224x224 pixels / 50kB images.|
|The number of columns in the header row must be greater than or equal to the number of columns in every data row. For any data row with fewer columns than the maximum, DataRobot assumes a value of NA/NULL for the missing columns in that row.||Error displays the line number of the first row that failed to parse; check the row reported in the error message. Quoting around text fields is a common reason for this error.||Training, predictions||Yes|
|Dataset cannot have more than one blank (empty) column name. Typically the first blank column is the first column, due to the way some tools write the index column.||Error displays column index of the second blank column; add a label to the column.||Training, predictions||Yes|
|Dataset cannot have any column names containing only whitespace. A single blank column name (no whitespace) is allowed, but columns such as “(space)” or "(space)(space)" are not allowed.||Error displays index of the column that contained only space(s); remove the space, or rename the column.||Training, predictions||Yes|
|All dataset feature names must be unique. No feature name can be used for more than one column, and feature names must differ from each other beyond just their use of special characters
||Error displays the two columns that resolved to the same name after sanitization; rename one column name. Example:
|Dataset must use a supported encoding. Because UTF-8 processes the fastest, it is the recommended encoding.||Error displays that the detected encoding is not supported or could not be detected; save the dataset to a CSV/delimited format, via another program, and change encoding.||Training, predictions||Yes|
|Dataset files must have one of the following delimiters: comma (,), tab (\t), semicolon (;), or pipe ( | ).||Error displays a malformed CSV/delimited message; save the dataset to another program (e.g., Excel) and modify to use a supported delimiter. A problematic delimiter that is one of the listed values indicates a quoting problem. For text datasets, if strings are not quoted there may be issues detecting the proper delimiter. Example: in a tab separated dataset, if there are commas in text columns that are not quoted, they may be interpreted as a delimiter. See this note for a related file size issue.||Training, predictions||Yes|
|Excel datasets cannot have date times in the header.||Error displays the index of the column and approximation of the column name; rename the column (e.g., “date” or “date-11/2/2016”). Alternatively, save the dataset to CSV/delimited format.||Training, predictions||Yes|
|Dataset must be a single file.||Error displays that the specified file contains more than one dataset. This most commonly occurs with archive files (tar and zip); uncompress the archive and make sure it contains only one file.||Training, predictions||Yes|
|User must have read permissions to the dataset when using URL or HDFS ingest.||Error displays that user does not have permission to access the dataset.||Training, predictions||Yes|
|All values in a date column must have the same format or be a null value.||Error displays the value that did not match and the format itself; find the unmatched value in the date column and change it.||Training, predictions||Yes, this applies to the dataset whenever there is a date column, with no dependence on an image column.|
|Text features can contain up to 5 million characters (for a single cell); in some cases up to 10 million characters are accepted. In other words, there is no practical limit, and the total size of the dataset is more likely to be the limiting factor.||N/A||Training, predictions||Yes, this applies to the dataset whenever there is a text column, with no dependence on an image column.|
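Several of the header rules in the table above can be checked locally before upload. The following is a minimal sketch, not DataRobot's own validation: the function name `check_header_and_rows` and the message wording are hypothetical, and it only detects exact duplicate names because the sanitization rules applied to special characters are not reproduced here.

```python
import csv
import io

MAX_COLUMNS = 20_000  # column limit stated in the requirements table


def check_header_and_rows(csv_text, delimiter=","):
    """Return a list of problems found; an empty list means none were detected."""
    rows = list(csv.reader(io.StringIO(csv_text), delimiter=delimiter))
    if not rows:
        return ["file is empty: a header row is required"]
    header, data = rows[0], rows[1:]
    problems = []

    if len(header) > MAX_COLUMNS:
        problems.append(f"{len(header)} columns exceeds the limit of {MAX_COLUMNS}")

    # One empty column name is tolerated; a second one is not.
    blank = [i for i, name in enumerate(header) if name == ""]
    if len(blank) > 1:
        problems.append(f"second blank column name at index {blank[1]}")

    # Names consisting only of whitespace are not allowed.
    for i, name in enumerate(header):
        if name != "" and name.strip() == "":
            problems.append(f"whitespace-only column name at index {i}")

    # Exact duplicates only (assumption: sanitization is not modeled here).
    seen = {}
    for i, name in enumerate(header):
        if name and name in seen:
            problems.append(
                f"duplicate column name {name!r} at indexes {seen[name]} and {i}"
            )
        seen.setdefault(name, i)

    # No data row may have more fields than the header; report the first one.
    for row_no, row in enumerate(data, start=2):
        if len(row) > len(header):
            problems.append(
                f"row {row_no} has {len(row)} fields, header has {len(header)}"
            )
            break
    return problems
```

Rows with fewer fields than the header are deliberately not reported, since the table states that missing values are treated as NA/NULL.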
Ensure acceptable file import size¶
All file size limits represent the uncompressed size.
When ingesting a dataset, its actual on-disk size might be different inside of DataRobot.
If the original dataset source is a CSV, then the size may differ slightly from the original size due to data preprocessing performed by DataRobot.
If the original dataset source is not a CSV (e.g., is SAS7BDAT, JDBC, XLSX, GEOJSON, Shapefile, etc.), the on-disk size will be that of the dataset when converted to a CSV. SAS7BDAT, for example, is a binary format that supports different encoding types. As a result, it is difficult to estimate the size of data when converted to CSV based only on the input size as a SAS7BDAT file.
XLSX, due to its structure, is read in as a single, whole document, which can cause out-of-memory (OOM) issues during parsing. CSV, by contrast, is read in chunks to reduce memory usage and prevent errors. Best practice recommends not exceeding 150MB for XLSX files.
If the original dataset source is an archive or a compressed CSV (e.g., .gzip, .bzip2, .zip, .tar, .tgz), the actual on-disk size will be that of the uncompressed CSV after preprocessing is performed.
Keep the following in mind when considering file size:
Some of the preprocessing steps that are applied consist of converting the dataset encoding to UTF-8, adding quotation marks for the field data, normalizing missing value representation, converting geospatial fields, and sanitizing column names.
In the case of image archives or other similar formats, additional preprocessing is done to add the image file contents to the resulting CSV. This can make the size of the final CSV drastically different from that of the original uploaded file.
File size limits are applied to files after they have been converted to CSV. If you upload a zipped file into DataRobot, the extracted file must be smaller than the file size limit.
If the size of a delimited dataset (CSV, TSV, etc.) is close to the upload limit prior to ingest, it is best to do the conversion outside of DataRobot. This helps ensure that the file import does not exceed the limit. If a non-comma-delimited file is near the size limit, it may be best to convert it to a comma-delimited CSV outside of DataRobot as well.
When converting to CSV outside of DataRobot, be sure to use commas as the delimiter, newline as the record separator, and UTF-8 as the encoding type to avoid discrepancies in uploaded file size and size counted against DataRobot's maximum file size limit.
Consider modifying optional feature flags in some cases:
- Disable Early Limit Checking: Selecting this option disables the estimate-based early limit checker and uses an exact limit checker instead. This may allow ingestion of files that are close to the limit in case the estimate is slightly off. Note, however, that if the limit is exceeded, projects will fail later in the ingest process.
- Enable Minimal CSV Quoting: Sets the conversion process to be more conservative when quoting the converted CSV, allowing the CSV to be smaller. Be aware, however, that doing so may make projects non-repeatable. This is because if you ingest the dataset with and without this setting enabled, the EDA samples and/or partitioning may differ, which can lead to subtle differences in the project. (By contrast, ingesting the same dataset with the same setting will result in a repeatable project.)
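The conversion guidance above (comma delimiter, newline record separator, UTF-8 encoding) can be sketched with Python's standard `csv` module. This is an illustrative sketch only: the function name `convert_to_comma_csv` and its parameters are hypothetical, and minimal quoting is a choice made here, so the size DataRobot counts after its own preprocessing may still differ.

```python
import csv


def convert_to_comma_csv(src_path, dst_path, src_delimiter="\t", src_encoding="utf-8"):
    """Rewrite a delimited file as comma-delimited, newline-terminated UTF-8 CSV.

    Returns the number of rows written, including the header row.
    """
    rows = 0
    with open(src_path, newline="", encoding=src_encoding) as src, \
            open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src, delimiter=src_delimiter)
        # Quote only fields that need it (embedded commas, quotes, or newlines).
        writer = csv.writer(
            dst, delimiter=",", lineterminator="\n", quoting=csv.QUOTE_MINIMAL
        )
        for row in reader:
            writer.writerow(row)
            rows += 1
    return rows
```

After converting, `os.path.getsize(dst_path)` gives the on-disk size to compare against the limit for your package.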
AutoML file import sizes¶
The following sections describe file import size requirements based on deployment type.
File size upload is dependent on your DataRobot package, and in some cases the number and size of servers deployed. See tips to ensure acceptable file size for more assistance.
|File type||Maximum size||Notes|
|CSV (training)||2GB||Base Package|
|CSV (training)||5GB||Premium Package|
|CSV (training)||5GB||Enterprise Package|
|CSV (training)||Up to 10GB*||Business Critical Package|
|XLSX||150MB||See note|
* Up to 10GB applies to AutoML projects; considerations apply.
|File type||Maximum size||Release availability||Notes|
|CSV (training)||Up to 10GB||All||Varies based on your DataRobot package and available hardware resources.|
|XLS||150MB||3.0.1 and later|
For out-of-time validation (OTV) modeling, the maximum dataset size is less than 5GB.
OTV backtests require at least 20 rows in each of the validation and holdout folds and at least 100 rows in each training fold. If you set a number of backtests that results in any of the runs not meeting those criteria, DataRobot only runs the number of backtests that do meet the minimums (and marks the display with an asterisk). For example:
- With one backtest, no holdout, minimum 100 training rows and 20 validation rows (120 total).
- With one backtest and holdout, minimum 100 training rows, 20 validation rows, 20 holdout rows (140 total).
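The arithmetic behind the two examples above can be written out as a short sketch. The constant and function names are hypothetical, and only the single-backtest case is covered because the text does not state how folds combine across multiple backtests.

```python
MIN_TRAINING_ROWS = 100    # per training fold
MIN_VALIDATION_ROWS = 20   # per validation fold
MIN_HOLDOUT_ROWS = 20      # holdout fold, when one is used


def minimum_rows_one_backtest(with_holdout):
    """Minimum total rows for an OTV project with a single backtest."""
    total = MIN_TRAINING_ROWS + MIN_VALIDATION_ROWS
    if with_holdout:
        total += MIN_HOLDOUT_ROWS
    return total


def backtest_meets_minimums(training_rows, validation_rows):
    """True when one backtest satisfies both fold minimums."""
    return (training_rows >= MIN_TRAINING_ROWS
            and validation_rows >= MIN_VALIDATION_ROWS)
```

`minimum_rows_one_backtest(False)` gives 120 and `minimum_rows_one_backtest(True)` gives 140, matching the two bullet points.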
Prediction file import sizes¶
|Prediction method||Details||File size limit|
|Leaderboard predictions||To make predictions on a non-deployed model using the UI, expand the model on the Leaderboard and select Predict > Make Predictions. Upload predictions from a local file, URL, data source, or the AI Catalog. You can also upload predictions using the modeling predictions API, also called the "V2 predictions API." Use this API to test predictions using your modeling workers on small datasets. Predictions can be limited to 100 requests per user, per hour, depending on your DataRobot package.||1GB|
|Batch predictions (UI)||To make batch predictions using the UI, deploy a model and navigate to the deployment's Make Predictions tab (requires MLOps).||5GB|
|Batch predictions (API)||The Batch Prediction API is optimized for high throughput and contains production-grade connectivity options that allow you to not only push data through the API, but also connect to the AI Catalog, cloud storage, databases, or data warehouses (requires MLOps).||Unlimited|
|Prediction API (real-time)||To make real-time predictions on a deployed model, use the Prediction API.||50 MB|
Time series file import sizes¶
When using time series, datasets must meet the following size requirements:
|File type||Single series maximum size||Multiseries/segmented maximum size||Release availability||Notes|
|CSV (training)||500MB||5GB||N/A||Managed AI Platform (SaaS)|
|CSV (training)||500MB||1GB||5.3||30GB modeler configuration|
|CSV (training)||500MB||2.5GB||6.0||30GB modeler configuration|
|CSV (training)||500MB||5.0GB||6.0||60GB modeler configuration|
If you set a number of backtests that results in any of the runs not meeting those criteria, DataRobot only runs the number of backtests that do meet the minimums (and marks the display with an asterisk). Specific features of time series:
|Minimum rows per backtest|
|Data ingest: Regression||20 rows for training and 4 rows for validation|
|Data ingest: Classification||75 rows for training and 12 rows for validation|
|Post-feature derivation: Regression||35 rows|
|Post-feature derivation: Classification||100 rows|
|Calendar event files||Less than 1MB and 10K rows|
|External baseline files for model comparison||Less than 5GB|
* Self-Managed AI Platform versions 5.0 or later are limited to 100,000 series; versions 5.3 or later are limited to 1,000,000 series.
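The per-backtest minimums in the table above can be kept as a small lookup for pre-upload checks. This is a sketch under stated assumptions: the dictionary layout and function names are hypothetical, and the project type is passed as a lowercase string.

```python
# Minimum rows per backtest at data ingest, taken from the table above.
INGEST_MINIMUMS = {
    "regression": {"training": 20, "validation": 4},
    "classification": {"training": 75, "validation": 12},
}

# Minimum rows per backtest after feature derivation.
POST_DERIVATION_MINIMUMS = {"regression": 35, "classification": 100}


def meets_ingest_minimum(project_type, training_rows, validation_rows):
    """Check one backtest against the data-ingest minimums."""
    minimums = INGEST_MINIMUMS[project_type]
    return (training_rows >= minimums["training"]
            and validation_rows >= minimums["validation"])


def meets_post_derivation_minimum(project_type, rows):
    """Check one backtest against the post-feature-derivation minimum."""
    return rows >= POST_DERIVATION_MINIMUMS[project_type]
```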
There are times that you may want to partition without holdout, which changes the minimum ingest rows and also the output of various visualizations.
For releases 4.5, 4.4 and 4.3, datasets must be less than 500MB. For releases 4.2 and 4.0, datasets must be less than 10MB for time series and less than 500MB for OTV. Datasets must be less than 5MB for projects using Date/Time partitioning in earlier releases.
Feature Discovery file import sizes¶
When using Feature Discovery, the following requirements apply:
Secondary datasets must be either uploaded files or JDBC sources registered in the AI Catalog.
You can have a maximum of 30 datasets per project.
The sum of all dataset sizes (both primary and secondary) cannot exceed 100GB, and individual dataset sizes cannot exceed 11GB. See the download limits mentioned below.
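The three numeric limits above (dataset count, per-dataset size, combined size) can be checked together. The following is a minimal sketch with hypothetical names; it assumes 1GB means 1024³ bytes, which the text does not specify.

```python
GB = 1024 ** 3  # assumption: binary gigabytes; the text does not say which unit applies

MAX_DATASETS = 30
MAX_SINGLE_BYTES = 11 * GB
MAX_TOTAL_BYTES = 100 * GB


def check_feature_discovery_sizes(sizes_by_name):
    """sizes_by_name maps dataset name to size in bytes; returns a list of problems."""
    problems = []
    if len(sizes_by_name) > MAX_DATASETS:
        problems.append(
            f"{len(sizes_by_name)} datasets exceeds the limit of {MAX_DATASETS}"
        )
    for name, size in sizes_by_name.items():
        if size > MAX_SINGLE_BYTES:
            problems.append(f"{name} exceeds the 11GB per-dataset limit")
    if sum(sizes_by_name.values()) > MAX_TOTAL_BYTES:
        problems.append("combined size exceeds the 100GB limit")
    return problems
```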
DataRobot supports the following formats and types for data ingestion. See also the supported data types.
- .csv, .dsv, or .tsv* (preferred formats)
- database tables
*The file must be a comma-, tab-, semicolon-, or pipe-delimited file with a header for each data column. Each row must have the same number of fields, some of which may be blank.
**These file types are supported only if enabled for users in your organization. Contact your DataRobot representative for more information.
Location AI file formats¶
The following Location AI file types are supported only if enabled for users in your organization:
- ESRI Shapefiles
- ESRI File Geodatabase
- Well Known Text (embedded in table column)
- PostGIS Databases (The file must be a comma-delimited, tab-delimited, semicolon-delimited, or pipe-delimited file and must have a header for each data column. Each row must have the same number of fields (columns), some of which may be blank.)
Compression and archive formats¶
Both compression and archive are accepted. Archive is preferred, however, because it allows DataRobot to know the uncompressed data size and therefore to be more efficient during data intake.
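Because archive formats record the uncompressed size of each member, that size can be read locally before upload without extracting anything. The sketch below uses Python's standard `zipfile` module on a ZIP archive; the function name `archive_summary` is hypothetical. It also reports the file count, which helps with the single-file requirement for non-image datasets.

```python
import zipfile


def archive_summary(path):
    """Return (number_of_files, total_uncompressed_bytes) for a ZIP archive."""
    with zipfile.ZipFile(path) as archive:
        # Skip directory entries; only real files count toward the totals.
        files = [info for info in archive.infolist() if not info.is_dir()]
        return len(files), sum(info.file_size for info in files)
```

Compare the second value, not the size of the `.zip` file itself, against the import limit, since all limits refer to the uncompressed size.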
The period (.) character is the only supported decimal separator; DataRobot does not support locale-specific decimal separators such as the comma (,). In other words, a value of 1.000 is equal to one (1), and cannot be used to represent one thousand (1000). If a different character is used as the separator, the value is treated as categorical.
A numeric feature can be positive, negative, or zero, and must meet one of the following criteria:
- Contains no periods or commas.
- Contains a single period (values with more than one period are treated as categorical).
The table below provides sample values and their corresponding variable type:
|Feature value||Data type|
Attempting a feature transformation (on features considered categorical based on the separator) from categorical to numeric will result in an empty numeric feature.
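The numeric criteria above can be approximated with a regular expression to spot values that would fall back to categorical. This is a rough sketch, not DataRobot's type detection: the name `looks_numeric` is hypothetical, and forms the text does not mention (such as scientific notation) are rejected here by assumption.

```python
import re

# Optional sign, then digits with at most one period; commas are never accepted.
_NUMERIC = re.compile(r"[+-]?(\d+(\.\d*)?|\.\d+)")


def looks_numeric(value):
    """True when the value has no commas and at most one period."""
    return _NUMERIC.fullmatch(value.strip()) is not None
```

For example, `looks_numeric("1.000")` is true and the value parses as one, while `"1,000"`, `"3,14"`, and `"1.000.000"` are rejected and would be treated as categorical under the rules above.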
Encodings and character sets¶
Datasets must adhere to the following encoding requirements:
The data file cannot have any extraneous characters or escape sequences (from URLs).
Encoding must be consistent throughout the entire dataset. For example, if a data file is encoded as UTF-8 for the first 100MB, but there are non-UTF-8 characters later in the file, ingestion can fail because the encoding was detected from the first 100MB only.
Data must adhere to one of the following encodings:
Special column detection¶
Note that these special columns will be detected if they meet the criteria described below, but date cannot be selected as the target for a project. However, date can be selected as a partition feature.
Date and time formats¶
Columns are detected as date fields if they match any of the formats containing a date listed below. If they are strictly time formats (for example, %H:%M:%S), they are detected as time. See the Python definition table for descriptions of the directives. The following table provides examples using the date and time January 25, 1999 at 1:01 p.m. (specifically, 59 seconds and 000001 microseconds past 1:01 p.m.).
|%I:%M %p||01:01 PM|
|%I:%M:%S %p||01:01:59 PM|
|%Y %m %d||1999 01 25|
|%Y %m %d %H %M %S||1999 01 25 13 01 59|
|%Y %m %d %I %M %S %p||1999 01 25 01 01 59 PM|
|%Y-%m-%d %H:%M:%S||1999-01-25 13:01:59|
|%Y-%m-%d %H:%M:%S.%f||1999-01-25 13:01:59.000000|
|%Y-%m-%d %I:%M:%S %p||1999-01-25 01:01:59 PM|
|%Y-%m-%d %I:%M:%S.%f %p||1999-01-25 01:01:59.000000 PM|
|%Y-%m-%dT%I:%M:%S %p||1999-01-25T01:01:59 PM|
|%Y-%m-%dT%I:%M:%S.%f %p||1999-01-25T01:01:59.000000 PM|
|%Y-%m-%dT%I:%M:%S.%fZ %p||1999-01-25T01:01:59.000000Z PM|
|%Y-%m-%dT%I:%M:%SZ %p||1999-01-25T01:01:59Z PM|
|%Y/%d/%m %H:%M:%S.%f||1999/25/01 13:01:59.000000|
|%Y/%d/%m %H:%M:%S.%fZ||1999/25/01 13:01:59.000000Z|
|%Y/%d/%m %I:%M:%S.%f %p||1999/25/01 01:01:59.000000 PM|
|%Y/%d/%m %I:%M:%S.%fZ %p||1999/25/01 01:01:59.000000Z PM|
|%Y/%m/%d %H:%M:%S||1999/01/25 13:01:59|
|%Y/%m/%d %H:%M:%S.%f||1999/01/25 13:01:59.000000|
|%Y/%m/%d %H:%M:%S.%fZ||1999/01/25 13:01:59.000000Z|
|%Y/%m/%d %I:%M:%S %p||1999/01/25 01:01:59 PM|
|%Y/%m/%d %I:%M:%S.%f %p||1999/01/25 01:01:59.000000 PM|
|%Y/%m/%d %I:%M:%S.%fZ %p||1999/01/25 01:01:59.000000Z PM|
|%d/%m/%Y %H:%M||25/01/1999 13:01|
|%d/%m/%Y %H:%M:%S||25/01/1999 13:01:59|
|%d/%m/%Y %I:%M %p||25/01/1999 01:01 PM|
|%d/%m/%Y %I:%M:%S %p||25/01/1999 01:01:59 PM|
|%d/%m/%y %H:%M||25/01/99 13:01|
|%d/%m/%y %H:%M:%S||25/01/99 13:01:59|
|%d/%m/%y %I:%M %p||25/01/99 01:01 PM|
|%d/%m/%y %I:%M:%S %p||25/01/99 01:01:59 PM|
|%m %d %Y %H %M %S||01 25 1999 13 01 59|
|%m %d %Y %I %M %S %p||01 25 1999 01 01 59 PM|
|%m %d %y %H %M %S||01 25 99 13 01 59|
|%m %d %y %I %M %S %p||01 25 99 01 01 59 PM|
|%m-%d-%Y %H:%M:%S||01-25-1999 13:01:59|
|%m-%d-%Y %I:%M:%S %p||01-25-1999 01:01:59 PM|
|%m-%d-%y %H:%M:%S||01-25-99 13:01:59|
|%m-%d-%y %I:%M:%S %p||01-25-99 01:01:59 PM|
|%m/%d/%Y %H:%M||01/25/1999 13:01|
|%m/%d/%Y %H:%M:%S||01/25/1999 13:01:59|
|%m/%d/%Y %I:%M %p||01/25/1999 01:01 PM|
|%m/%d/%Y %I:%M:%S %p||01/25/1999 01:01:59 PM|
|%m/%d/%y %H:%M||01/25/99 13:01|
|%m/%d/%y %H:%M:%S||01/25/99 13:01:59|
|%m/%d/%y %I:%M %p||01/25/99 01:01 PM|
|%m/%d/%y %I:%M:%S %p||01/25/99 01:01:59 PM|
|%y %m %d||99 01 25|
|%y %m %d %H %M %S||99 01 25 13 01 59|
|%y %m %d %I %M %S %p||99 01 25 01 01 59 PM|
|%y-%m-%d %H:%M:%S||99-01-25 13:01:59|
|%y-%m-%d %H:%M:%S.%f||99-01-25 13:01:59.000000|
|%y-%m-%d %I:%M:%S %p||99-01-25 01:01:59 PM|
|%y-%m-%d %I:%M:%S.%f %p||99-01-25 01:01:59.000000 PM|
|%y-%m-%dT%I:%M:%S %p||99-01-25T01:01:59 PM|
|%y-%m-%dT%I:%M:%S.%f %p||99-01-25T01:01:59.000000 PM|
|%y-%m-%dT%I:%M:%S.%fZ %p||99-01-25T01:01:59.000000Z PM|
|%y-%m-%dT%I:%M:%SZ %p||99-01-25T01:01:59Z PM|
|%y/%d/%m %H:%M:%S.%f||99/25/01 13:01:59.000000|
|%y/%d/%m %H:%M:%S.%fZ||99/25/01 13:01:59.000000Z|
|%y/%d/%m %I:%M:%S.%f %p||99/25/01 01:01:59.000000 PM|
|%y/%d/%m %I:%M:%S.%fZ %p||99/25/01 01:01:59.000000Z PM|
|%y/%m/%d %H:%M:%S||99/01/25 13:01:59|
|%y/%m/%d %H:%M:%S.%f||99/01/25 13:01:59.000000|
|%y/%m/%d %H:%M:%S.%fZ||99/01/25 13:01:59.000000Z|
|%y/%m/%d %I:%M:%S %p||99/01/25 01:01:59 PM|
|%y/%m/%d %I:%M:%S.%f %p||99/01/25 01:01:59.000000 PM|
|%y/%m/%d %I:%M:%S.%fZ %p||99/01/25 01:01:59.000000Z PM|
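Because the directives in the table are Python `strptime` directives, a column can be tested locally against a handful of them with `datetime.strptime`. The sketch below is illustrative only: `CANDIDATE_FORMATS` is a small hand-picked subset of the table, the function name is hypothetical, and empty strings and `None` are assumed to stand for null values.

```python
from datetime import datetime

# A few formats copied from the table above; extend as needed.
CANDIDATE_FORMATS = [
    "%Y-%m-%d %H:%M:%S",
    "%m/%d/%Y %H:%M",
    "%d/%m/%Y %H:%M:%S",
    "%Y %m %d",
]


def matching_formats(values, formats=CANDIDATE_FORMATS):
    """Return the formats that parse every non-null value in the column."""
    present = [v for v in values if v not in ("", None)]
    matches = []
    for fmt in formats:
        try:
            for value in present:
                datetime.strptime(value, fmt)
        except ValueError:
            continue  # at least one value does not fit this format
        matches.append(fmt)
    return matches
```

An empty result for a non-empty column indicates mixed or unsupported formats, which corresponds to the requirement that all values in a date column share one format or be null.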
Columns that have numeric values ending with % are treated as percentages.
Columns that contain values with the following currency symbols are treated as currency.
- EUR, USD, GBP
- ￡ (fullwidth)
- ￥ (fullwidth)
Also, note the following regarding currency interpretation:
- The currency symbol can be preceding ($1) or following (1EUR) the text but must be consistent across the feature.
- Both comma (,) and period (.) can be used as a separator for thousands or cents, but must be consistent across the feature (e.g., 1000 dollars and 1 cent can be represented as 1,000.01 or 1.000,01).
-symbols are allowed.
Columns that contain values matching the convention <feet>’ <inches>” are displayed as variable type length on the Data page. DataRobot converts the length to a number in inches and then treats the value as a numeric in blueprints. If your dataset has other length values (for example, 12cm), the feature is treated as categorical. If a feature has mixed values that show the measurement (5m, 72in, and 12cm, for example), it is best to clean and normalize the dataset before uploading.
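The feet-and-inches conversion described above amounts to feet × 12 + inches. The following is a minimal sketch with a hypothetical function name; it assumes whole-number feet and inches and accepts both straight and typographic quote marks, since the exact accepted variants are not specified.

```python
import re

# <feet>' <inches>" with either straight or curly quote characters.
_FEET_INCHES = re.compile(r"\s*(\d+)\s*['’]\s*(\d+)\s*[\"”]\s*")


def length_to_inches(value):
    """Convert a <feet>' <inches>" string to inches, or return None if it does not match."""
    match = _FEET_INCHES.fullmatch(value)
    if match is None:
        return None
    feet, inches = int(match.group(1)), int(match.group(2))
    return feet * 12 + inches
```

Values such as `12cm` return `None`, mirroring the point that other units are not converted and should be normalized before upload.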
Column name conversions¶
During data ingestion, DataRobot converts the following characters to underscores (
File download sizes¶
Consider the following when downloading datasets:
- There is a 10GB file size limit.
- Datasets are downloaded as CSV files.
- The downloaded dataset may differ from the one initially imported because DataRobot applies the conversions mentioned above.