Dataset requirements

This section provides information on dataset requirements.

Note

The maximum upload file size depends on your DataRobot deployment and, in some cases, on the number and size of the servers deployed. See the tips for ensuring an acceptable file size for more assistance.

General requirements

Consider the following dataset requirements for AutoML, time series, and Visual AI projects. See additional information about preparing your dataset for Visual AI.

Requirement Solution Dataset type Visual AI
Dataset minimum row requirements for non-date/time projects:
  • For regression projects, 20 data rows plus header row
  • For classification projects:
    • minimum 20 minority- and 20 majority-class rows
    • minimum 100 total data rows, plus header row
Error displays number of rows found; add rows to the dataset until project meets the minimum rows (plus the header). Training Yes
Date/time partitioning-based projects (time series and OTV) have specific row requirements. Error displays number of rows found; add rows to the dataset until the project meets the minimum rows (plus the header). Training Yes
Dataset used for predictions via the GUI must have at least one data row plus header row. Error displays zero rows found; add a header row and one data row. Predictions Yes
Dataset cannot have more than 20,000 columns. Error message displays column count and limit; reduce the number of columns to less than 20,000. Training, predictions Yes
Dataset must have headers. Lack of headers generally leads to bad predictions or ambiguous column names; add headers. Training, predictions Yes for CSV. If a ZIP upload contains one folder of images per class, there are technically no headers, so this is not always true for Visual AI.
Dataset must meet deployment type and release size limits. Error message displays dataset size and configured limit. Contact DataRobot Support for size limits; reduce dataset size by trimming rows and/or columns. Training, predictions Yes
Cloud: 5GB; 100k 224x224 pixels / 50kB images
On-premise: 10GB; 200k 224x224 pixels / 50kB images
Number of columns in the header row must be greater than or equal to the number of columns in all data rows. For any data row with fewer columns than the maximum, DataRobot assumes a value of NA/NULL for that row. Error displays the line number of the first row that failed to parse; check the row reported in the error message. Quoting around text fields is a common reason for this error. Training, predictions Yes
Dataset cannot have more than one blank (empty) column name. Typically the first blank column is the first column, due to the way some tools write the index column. Error displays column index of the second blank column; add a label to the column. Training, predictions Yes
Dataset cannot have any column names containing only whitespace. A single blank column name (no whitespace) is allowed, but columns such as "(space)" or "(space)(space)" are not allowed. Error displays index of the column that contained only space(s); remove the space, or rename the column. Training, predictions Yes
All dataset feature names must be unique. No feature name can be used for more than one column, and feature names must differ from each other beyond just their use of special characters (e.g., -, $, ., {, }, \n, \r, ", or '). Error displays the two columns that resolved to the same name after sanitization; rename one column name. Example: robot.bar and robot$bar both resolve to robot_bar. Training, predictions Yes
Dataset must use a supported encoding. Because UTF-8 processes the fastest, it is the recommended encoding. Error displays that the detected encoding is not supported or could not be detected; save the dataset to a CSV/delimited format, via another program, and change encoding. Training, predictions Yes
Dataset files must use one of the following delimiters: comma (,), tab (\t), semicolon (;), or pipe (|). Error displays a malformed CSV/delimited message; open the dataset in another program (e.g., Excel) and re-save it using a supported delimiter. A problematic delimiter that is one of the listed values indicates a quoting problem. For text datasets, if strings are not quoted there may be issues detecting the proper delimiter. Example: in a tab-separated dataset, commas in unquoted text columns may be interpreted as delimiters. See this note for a related file size issue. Training, predictions Yes
Excel datasets cannot have date times in the header. Error displays the index of the column and approximation of the column name; rename the column (e.g., “date” or “date-11/2/2016”). Alternatively, save the dataset to CSV/delimited format. Training, predictions Yes
Dataset must be a single file. Error displays that the specified file contains more than one dataset. This most commonly occurs with archive files (tar and zip); uncompress the archive and make sure it contains only one file. Training, predictions Yes
User must have read permissions to the dataset when using URL or HDFS ingest. Error displays that user does not have permission to access the dataset. Training, predictions Yes
All values in a date column must have the same format or be a null value. Error displays the value that did not match and the format itself; find the unmatched value in the date column and change it. Training, predictions Yes, this applies to the dataset whenever there is a date column, with no dependence on an image column.
Text features can contain up to 5 million characters; in some cases up to 10 million characters are accepted. In practice there is no limit, and the total size of the dataset is more likely the limiting factor. N/A Training, predictions Yes, this applies to the dataset whenever there is a text column, with no dependence on an image column.
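Several of the requirements above can be checked programmatically before upload. A minimal sketch, assuming a simple in-memory header/rows representation; the name-sanitization rule is approximated from the robot.bar/robot$bar example in the table, and the classification class-balance check (20 minority/20 majority rows) is omitted for brevity:

```python
import re

MAX_COLUMNS = 20_000

def sanitize(name):
    # Approximation of DataRobot's sanitization: special characters such as
    # - $ . { } " ' \n \r become underscores (per the example in the table).
    return re.sub(r'[-$.{}"\'\n\r]', "_", name)

def validate(header, rows, problem_type="regression"):
    """Return a list of problems found; an empty list means the checks pass."""
    problems = []
    if len(header) > MAX_COLUMNS:
        problems.append("more than %d columns: %d" % (MAX_COLUMNS, len(header)))
    # 20 data rows minimum for regression, 100 total for classification
    min_rows = 20 if problem_type == "regression" else 100
    if len(rows) < min_rows:
        problems.append("need at least %d data rows, found %d" % (min_rows, len(rows)))
    seen = {}
    for col in header:
        key = sanitize(col)
        if key in seen:
            problems.append("'%s' and '%s' both resolve to '%s'" % (seen[key], col, key))
        else:
            seen[key] = col
    return problems
```

For example, a header of `["robot.bar", "robot$bar"]` is flagged because both names resolve to `robot_bar` after sanitization.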

Ensure acceptable file import size

Note

All file size limits represent the uncompressed size.

When ingesting a dataset, its actual on-disk size might be different inside of DataRobot.

  • If the original dataset source is a CSV, then the size may differ slightly from the original size due to data preprocessing performed by DataRobot.

  • If the original dataset source is not a CSV (e.g., is SAS7BDAT, JDBC, XLSX, GEOJSON, Shapefile, etc.), the on-disk size will be that of the dataset when converted to a CSV. SAS7BDAT, for example, is a binary format that supports different encoding types. As a result, it is difficult to estimate the size of data when converted to CSV based only on the input size as a SAS7BDAT file.

  • If the original dataset source is an archive or a compressed CSV (e.g., .gzip, .bzip2, .zip, .tar, .tgz), the actual on-disk size will be that of the uncompressed CSV after preprocessing is performed.

Keep the following in mind when considering file size:

  • Some of the preprocessing steps that are applied consist of converting the dataset encoding to UTF-8, adding quotation marks for the field data, normalizing missing value representation, converting geospatial fields, and sanitizing column names.

  • In the case of image archives or similar formats, additional preprocessing adds the image file contents to the resulting CSV. This can make the size of the final CSV drastically different from the size of the original uploaded file.

  • File size limitations are applied to files once they have been converted to CSV. If you upload a zipped file into DataRobot, the extracted file must be within the file size limits.

  • If a delimited dataset (CSV, TSV, etc.) is close to the upload limit prior to ingest, it is best to do the conversion outside of DataRobot; this helps ensure that the file import does not exceed the limit. Likewise, if a non-comma-delimited file is near the size limit, it may be best to convert it to a comma-delimited CSV outside of DataRobot.

  • When converting to CSV outside of DataRobot, be sure to use commas as the delimiter, newline as the record separator, and UTF-8 as the encoding type to avoid discrepancies in uploaded file size and size counted against DataRobot's maximum file size limit.

  • Consider modifying optional feature flags in some cases:

    • Disable Early Limit Checking: Selecting this option disables the estimate-based early limit checker and instead uses an exact limit checker. This may allow ingestion of files that are close to the limit in case the estimate is slightly off. Note, however, that if the limit is exceeded, projects will fail later in the ingest process.
    • Enable Minimal CSV Quoting: Sets the conversion process to be more conservative when quoting the converted CSV, allowing the CSV to be smaller. Be aware, however, that doing so may make projects non-repeatable. This is because if you ingest the dataset with and without this setting enabled, the EDA samples and/or partitioning may differ, which can lead to subtle differences in the project. (By contrast, ingesting the same dataset with the same setting will result in a repeatable project.)
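When converting outside of DataRobot, Python's standard csv module already defaults to the recommended settings (comma delimiter). A minimal sketch that re-saves a pipe-delimited file as a comma-delimited, UTF-8 encoded CSV with newline record separators; the file paths and the source delimiter/encoding are illustrative assumptions:

```python
import csv

def convert_to_csv(src_path, dst_path, src_delimiter="|", src_encoding="cp1252"):
    """Re-save a delimited file as a comma-delimited, UTF-8 encoded CSV."""
    with open(src_path, newline="", encoding=src_encoding) as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src, delimiter=src_delimiter)
        # lineterminator="\n" uses newline as the record separator
        writer = csv.writer(dst, lineterminator="\n")
        writer.writerows(reader)
```

The csv writer quotes fields that contain commas, which avoids the unquoted-delimiter ambiguity described in the requirements table.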

AutoML file import sizes

The following sections describe file import size requirements based on deployment type.

DataRobot Cloud

File type Maximum size Notes
CSV (training) 2GB Base Cloud Package
CSV (training) 5GB Premium Cloud Package
CSV (training) 5GB Enterprise Package
CSV (training) Up to 10GB* Business Critical Package
XLS 150MB See note above
CSV (prediction) 1GB

* Up to 10GB applies to AutoML projects.

On-premise (non-Hadoop)

File type Maximum size Release availability
CSV (training) 5GB All
XLS 150MB 3.0.1 and later
CSV (prediction) 1GB All

On-premise (Hadoop)

File type Maximum size Release availability Notes
Without scalable ingest
CSV (training) 10GB All
XLS 150MB 3.0.1 and later
CSV (prediction) 1GB All
With scalable ingest
CSV (training) Greater than 12GB, up to 100GB 3.1 and later Some file types may be larger when converted to CSV. For example, a Parquet file may be 6GB on disk but ~40GB when represented as CSV, which would initiate downsampling. Or, a Parquet file may be 60GB on disk but exceed the 100GB limit when represented as CSV.
XLS 150MB 3.1 and later
Apache Parquet, Apache Avro, Apache ORC, Multifile CSV Greater than 12GB, up to 100GB 3.1 and later DataRobot converts on ingest
CSV (prediction) 1GB

OTV requirements

For out-of-time validation (OTV) modeling, the maximum dataset size is 5GB.

OTV backtests require at least 20 rows in each of the validation and holdout folds and at least 100 rows in each training fold. If you set a number of backtests that results in any of the runs not meeting those criteria, DataRobot only runs the number of backtests that do meet the minimums (and marks the display with an asterisk). For example:

  • With one backtest, no holdout, minimum 100 training rows and 20 validation rows (120 total).
  • With one backtest and holdout, minimum 100 training rows, 20 validation rows, 20 holdout rows (140 total).
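The two examples above are simple arithmetic on the documented fold minimums. A sketch covering only the single-backtest cases listed (multi-backtest minimums depend on how folds overlap and are not derived here):

```python
# Documented minimums: 100 rows per training fold, 20 per validation fold,
# and 20 per holdout fold when a holdout is configured.
def otv_minimum_rows(holdout=False):
    return 100 + 20 + (20 if holdout else 0)
```

This reproduces the 120-row (no holdout) and 140-row (with holdout) totals above.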

Time series file import sizes

When using time series, datasets must meet the following size requirements:

File type Single series maximum size Multiseries maximum size Release availability Notes
CSV (training) 500MB 1GB Managed AI Cloud
CSV (training) 500MB 1GB 5.3 30GB modeler configuration
CSV (training) 500MB 2.5GB 6.0 30GB modeler configuration
CSV (training) 500MB 5.0GB 6.0 60GB modeler configuration

Specific features of time series:

Feature Requirement
Minimum rows per backtest
Data ingest: Regression 20 rows for training and 4 rows for validation
Data ingest: Classification 75 rows for training and 12 rows for validation
Post-feature derivation Regression: Minimum 35 rows
Post-feature derivation Classification:
Calendars
Calendar event files Less than 1MB and 10K rows
Multiseries modeling
On-premise versions 5.0 or later 100,000 series
On-premise versions 5.3 or later 1,000,000 series
External baseline files for model comparison Less than 5GB
Predictions
Predictions, drag-and-drop Maximum dataset size: Less than 10MB
Predictions, API (predAPI) Maximum dataset size: Less than 50MB

Note

If you set a number of backtests that results in any of the runs not meeting those criteria, DataRobot only runs the number of backtests that do meet the minimums (and marks the display with an asterisk).

For releases 4.5, 4.4 and 4.3, datasets must be less than 500MB. For releases 4.2 and 4.0, datasets must be less than 10MB for time series and less than 500MB for OTV. Datasets must be less than 5MB for projects using Date/Time partitioning in earlier releases.

Feature Discovery file import sizes

When using Feature Discovery, the following requirements apply:

  • Secondary datasets must be either uploaded files or JDBC sources registered in the AI Catalog.

  • You can have a maximum of 30 datasets per project.

  • The sum of all dataset sizes (both primary and secondary) cannot exceed 100GB. Individual dataset size limits are based on AI Catalog import limits mentioned above for AutoML or time series and download limits mentioned below.

Data formats

DataRobot supports the following formats and types for data ingestion. See also the supported data types.

File formats

  • .csv, .dsv, or .tsv* (preferred)
  • database tables
  • .xls/.xlsx
  • .sas7bdat
  • .parquet⁺
  • .avro⁺

Location AI file formats

These file types are supported only if enabled for users in your organization.

  • ESRI Shapefiles
  • GeoJSON
  • ESRI File Geodatabase
  • Well Known Text (embedded in table column)
  • PostGIS Databases (The file must be a comma-delimited, tab-delimited, semicolon-delimited, or pipe-delimited file and must have a header for each data column. Each row must have the same number of fields (columns), some of which may be blank.)

Compression formats

  • .gz
  • .bz2

Archive format

  • .tar

Compression and archive formats

  • .zip
  • .tar.gz/.tgz
  • .tar.bz2

Both compression and archive are accepted. Archive is preferred, however, because it allows DataRobot to know the uncompressed data size and therefore to be more efficient during data intake.

Decimal separator

The period (.) character is the only supported decimal separator—DataRobot does not support locale-specific decimal separators such as the comma (,). In other words, a value of 1.000 is equal to one (1), and cannot be used to represent one thousand (1000). If a different character is used as the separator, the value is treated as categorical.

A numeric feature can be positive, negative, or zero, and must meet one of the following criteria:

  • Contains no periods or commas
  • Contains a single period (values with more than one period are treated as categorical)

The table below provides sample values and their corresponding variable type:

Feature value Data type
1000000 Numeric
0.1 Numeric
1,000.000 Categorical
1.000.000 Categorical
1,000,000 Categorical
0,1000 Categorical
1000.000… Categorical
1000,000… Categorical
(0,100) Categorical
(0.100) Categorical
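The classifications above follow mechanically from the two criteria (no separators at all, or exactly one period). A regex sketch that reproduces them, illustrative only:

```python
import re

# Numeric: optional sign, then digits with either no separator at all or a
# single period as the decimal separator. Anything else (commas, multiple
# periods, parentheses) is treated as categorical.
NUMERIC = re.compile(r"^[+-]?(\d+|\d*\.\d+|\d+\.\d*)$")

def variable_type(value):
    return "Numeric" if NUMERIC.match(value) else "Categorical"
```

Note that under these rules a value of 1.000 is Numeric (and equal to one), as described above.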

Tip

Attempting a feature transformation (on features considered categorical based on the separator) from categorical to numeric will result in an empty numeric feature.

Encodings and character sets

Datasets must adhere to the following encoding requirements:

  • The data file cannot have any extraneous characters or escape sequences (from URLs).

  • Encoding must be consistent throughout the entire dataset. For example, if a file is encoded as UTF-8 for the first 100MB but non-UTF-8 characters appear later in the file, ingestion can fail because the encoding is detected from the first 100MB.

Data must adhere to one of the following encodings:

  • ascii
  • cp1252
  • utf-8
  • utf-8-sig
  • utf-16
  • utf-16-le
  • utf-16-be
  • utf-32
  • utf-32-le
  • utf-32-be
  • Shift-JIS
  • ISO-2022-JP
  • EUC-JP
  • CP932
  • ISO-8859-1
  • ISO-8859-2
  • ISO-8859-5
  • ISO-8859-6
  • ISO-8859-7
  • ISO-8859-8
  • ISO-8859-9
  • windows-1251
  • windows-1256
  • KOI8-R
  • GB18030
  • Big5
  • ISO-2022-KR
  • IBM424
  • windows-1252
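As a pre-upload sanity check, you can confirm that the whole file decodes under one supported encoding; Python codec names map directly onto most of the list above. A sketch using a subset of those encodings, ordered strictest first because permissive single-byte codecs such as ISO-8859-1 accept any byte sequence:

```python
# Subset of the supported encodings, strictest first; iso-8859-1 last
# because it decodes any byte sequence and would otherwise mask errors.
CANDIDATES = ["ascii", "utf-8", "utf-16", "shift_jis", "cp1252", "iso-8859-1"]

def detect_supported_encoding(path, candidates=CANDIDATES):
    """Return the first candidate that decodes the entire file, or None.
    Decoding the whole file (not just a prefix) avoids the
    inconsistent-encoding failure described above."""
    with open(path, "rb") as f:
        raw = f.read()
    for enc in candidates:
        try:
            raw.decode(enc)
            return enc
        except (UnicodeDecodeError, UnicodeError):
            continue
    return None
```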

Special column detection

Note that these special columns will be detected if they meet the criteria described below, but currency, length, percent, and date cannot be selected as the target for a project. However, date can be selected as a partition feature.

Date and time formats

Columns are detected as date fields if they match any of the formats containing a date listed below. If they are strictly time formats (for example, %H:%M:%S), they are detected as time. See the Python definition table for descriptions of the directives. The following table provides examples using the date and time January 25, 1999 at 1:01 p.m. (specifically, 59 seconds and 000000 microseconds past 1:01 p.m.).

String Example
%H:%M 13:01
%H:%M:%S 13:01:59
%I:%M %p 01:01 PM
%I:%M:%S %p 01:01:59 PM
%M:%S 01:59
%Y %m %d 1999 01 25
%Y %m %d %H %M %S 1999 01 25 13 01 59
%Y %m %d %I %M %S %p 1999 01 25 01 01 59 PM
%Y%m%d 19990125
%Y-%d-%m 1999-25-01
%Y-%m-%d 1999-01-25
%Y-%m-%d %H:%M:%S 1999-01-25 13:01:59
%Y-%m-%d %H:%M:%S.%f 1999-01-25 13:01:59.000000
%Y-%m-%d %I:%M:%S %p 1999-01-25 01:01:59 PM
%Y-%m-%d %I:%M:%S.%f %p 1999-01-25 01:01:59.000000 PM
%Y-%m-%dT%H:%M:%S 1999-01-25T13:01:59
%Y-%m-%dT%H:%M:%S.%f 1999-01-25T13:01:59.000000
%Y-%m-%dT%H:%M:%S.%fZ 1999-01-25T13:01:59.000000Z
%Y-%m-%dT%H:%M:%SZ 1999-01-25T13:01:59Z
%Y-%m-%dT%I:%M:%S %p 1999-01-25T01:01:59 PM
%Y-%m-%dT%I:%M:%S.%f %p 1999-01-25T01:01:59.000000 PM
%Y-%m-%dT%I:%M:%S.%fZ %p 1999-01-25T01:01:59.000000Z PM
%Y-%m-%dT%I:%M:%SZ %p 1999-01-25T01:01:59Z PM
%Y.%d.%m 1999.25.01
%Y.%m.%d 1999.01.25
%Y/%d/%m %H:%M:%S.%f 1999/25/01 13:01:59.000000
%Y/%d/%m %H:%M:%S.%fZ 1999/25/01 13:01:59.000000Z
%Y/%d/%m %I:%M:%S.%f %p 1999/25/01 01:01:59.000000 PM
%Y/%d/%m %I:%M:%S.%fZ %p 1999/25/01 01:01:59.000000Z PM
%Y/%m/%d 1999/01/25
%Y/%m/%d %H:%M:%S 1999/01/25 13:01:59
%Y/%m/%d %H:%M:%S.%f 1999/01/25 13:01:59.000000
%Y/%m/%d %H:%M:%S.%fZ 1999/01/25 13:01:59.000000Z
%Y/%m/%d %I:%M:%S %p 1999/01/25 01:01:59 PM
%Y/%m/%d %I:%M:%S.%f %p 1999/01/25 01:01:59.000000 PM
%Y/%m/%d %I:%M:%S.%fZ %p 1999/01/25 01:01:59.000000Z PM
%d.%m.%Y 25.01.1999
%d.%m.%y 25.01.99
%d/%m/%Y 25/01/1999
%d/%m/%Y %H:%M 25/01/1999 13:01
%d/%m/%Y %H:%M:%S 25/01/1999 13:01:59
%d/%m/%Y %I:%M %p 25/01/1999 01:01 PM
%d/%m/%Y %I:%M:%S %p 25/01/1999 01:01:59 PM
%d/%m/%y 25/01/99
%d/%m/%y %H:%M 25/01/99 13:01
%d/%m/%y %H:%M:%S 25/01/99 13:01:59
%d/%m/%y %I:%M %p 25/01/99 01:01 PM
%d/%m/%y %I:%M:%S %p 25/01/99 01:01:59 PM
%m %d %Y %H %M %S 01 25 1999 13 01 59
%m %d %Y %I %M %S %p 01 25 1999 01 01 59 PM
%m %d %y %H %M %S 01 25 99 13 01 59
%m %d %y %I %M %S %p 01 25 99 01 01 59 PM
%m-%d-%Y 01-25-1999
%m-%d-%Y %H:%M:%S 01-25-1999 13:01:59
%m-%d-%Y %I:%M:%S %p 01-25-1999 01:01:59 PM
%m-%d-%y 01-25-99
%m-%d-%y %H:%M:%S 01-25-99 13:01:59
%m-%d-%y %I:%M:%S %p 01-25-99 01:01:59 PM
%m.%d.%Y 01.25.1999
%m.%d.%y 01.25.99
%m/%d/%Y 01/25/1999
%m/%d/%Y %H:%M 01/25/1999 13:01
%m/%d/%Y %H:%M:%S 01/25/1999 13:01:59
%m/%d/%Y %I:%M %p 01/25/1999 01:01 PM
%m/%d/%Y %I:%M:%S %p 01/25/1999 01:01:59 PM
%m/%d/%y 01/25/99
%m/%d/%y %H:%M 01/25/99 13:01
%m/%d/%y %H:%M:%S 01/25/99 13:01:59
%m/%d/%y %I:%M %p 01/25/99 01:01 PM
%m/%d/%y %I:%M:%S %p 01/25/99 01:01:59 PM
%y %m %d 99 01 25
%y %m %d %H %M %S 99 01 25 13 01 59
%y %m %d %I %M %S %p 99 01 25 01 01 59 PM
%y-%d-%m 99-25-01
%y-%m-%d 99-01-25
%y-%m-%d %H:%M:%S 99-01-25 13:01:59
%y-%m-%d %H:%M:%S.%f 99-01-25 13:01:59.000000
%y-%m-%d %I:%M:%S %p 99-01-25 01:01:59 PM
%y-%m-%d %I:%M:%S.%f %p 99-01-25 01:01:59.000000 PM
%y-%m-%dT%H:%M:%S 99-01-25T13:01:59
%y-%m-%dT%H:%M:%S.%f 99-01-25T13:01:59.000000
%y-%m-%dT%H:%M:%S.%fZ 99-01-25T13:01:59.000000Z
%y-%m-%dT%H:%M:%SZ 99-01-25T13:01:59Z
%y-%m-%dT%I:%M:%S %p 99-01-25T01:01:59 PM
%y-%m-%dT%I:%M:%S.%f %p 99-01-25T01:01:59.000000 PM
%y-%m-%dT%I:%M:%S.%fZ %p 99-01-25T01:01:59.000000Z PM
%y-%m-%dT%I:%M:%SZ %p 99-01-25T01:01:59Z PM
%y.%d.%m 99.25.01
%y.%m.%d 99.01.25
%y/%d/%m %H:%M:%S.%f 99/25/01 13:01:59.000000
%y/%d/%m %H:%M:%S.%fZ 99/25/01 13:01:59.000000Z
%y/%d/%m %I:%M:%S.%f %p 99/25/01 01:01:59.000000 PM
%y/%d/%m %I:%M:%S.%fZ %p 99/25/01 01:01:59.000000Z PM
%y/%m/%d 99/01/25
%y/%m/%d %H:%M:%S 99/01/25 13:01:59
%y/%m/%d %H:%M:%S.%f 99/01/25 13:01:59.000000
%y/%m/%d %H:%M:%S.%fZ 99/01/25 13:01:59.000000Z
%y/%m/%d %I:%M:%S %p 99/01/25 01:01:59 PM
%y/%m/%d %I:%M:%S.%f %p 99/01/25 01:01:59.000000 PM
%y/%m/%d %I:%M:%S.%fZ %p 99/01/25 01:01:59.000000Z PM
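Since all values in a date column must share one format (or be null), a pre-check can run each value through `datetime.strptime` with the column's directive string. A minimal sketch:

```python
from datetime import datetime

def first_mismatch(values, fmt):
    """Return (index, value) of the first value that does not match the
    format; nulls/empty strings are allowed. Returns None if all match."""
    for i, v in enumerate(values):
        if v in ("", None):
            continue  # null values are permitted in a date column
        try:
            datetime.strptime(v, fmt)
        except ValueError:
            return (i, v)
    return None
```

For example, checking `["1999-01-25", "25/01/1999"]` against `%Y-%m-%d` reports the second value as the mismatch to fix.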

Percentages

Columns that have numeric values ending with “%” are treated as percentages.

Currencies

Columns that contain values with the following currency symbols are treated as currency.

  • $
  • EUR, USD, GBP
  • £
  • ¥

Also, note the following regarding currency interpretation:

  • The currency symbol can precede ($1) or follow (1EUR) the value, but its placement must be consistent across the feature.
  • Both comma (,) and period (.) can be used as a separator for thousands or cents, but must be consistent across the feature (e.g., 1000 dollars and 1 cent can be represented as 1,000.01 or 1.000,01).
  • Leading + and - symbols are allowed.

Column length

Columns that contain values matching the convention <feet>’ <inches>” are displayed as variable type length on the Data page. DataRobot converts the length to a number in inches and then treats the value as a numeric in blueprints. If your dataset has other length values (for example, 12cm), the feature is treated as categorical. If a feature has mixed values that show the measurement (5m, 72in, and 12cm, for example), it is best to clean and normalize the dataset before uploading.
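The feet-and-inches conversion described above can be sketched as follows; the regex for the <feet>' <inches>" convention is an assumption, and DataRobot's exact matching rules may differ:

```python
import re

# Matches the <feet>' <inches>" convention, e.g. 5' 10"
LENGTH = re.compile(r"^(\d+)'\s*(\d+(?:\.\d+)?)\"$")

def to_inches(value):
    """Convert a <feet>' <inches>" string to a numeric value in inches.
    Returns None for values that don't match the convention (which
    DataRobot treats as categorical, e.g. 12cm)."""
    m = LENGTH.match(value)
    if not m:
        return None
    feet, inches = m.groups()
    return int(feet) * 12 + float(inches)
```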

Column name conversions

During data ingestion, DataRobot converts the following characters to underscores (_): -, $, ., {, }, ", \n, and \r.
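A minimal sketch of that conversion, using the character list above:

```python
import re

# Characters converted to underscores during ingestion: - $ . { } " \n \r
SPECIAL = re.compile(r'[-$.{}"\n\r]')

def sanitize_column_name(name):
    return SPECIAL.sub("_", name)
```

This matches the earlier example in the requirements table, where robot.bar and robot$bar both resolve to robot_bar.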

File download sizes

Consider the following when downloading datasets:

  • There is a 10GB file size limit.
  • Datasets are downloaded as CSV files.
  • The downloaded dataset may differ from the one initially imported because DataRobot applies the conversions mentioned above.

Updated November 10, 2021