

Dataset requirements

This section provides information on dataset requirements.

See the associated considerations for important additional information.

General requirements

Consider the following dataset requirements for AutoML, time series, and Visual AI projects. See additional information about preparing your dataset for Visual AI. A simple pre-upload row-count check is sketched after the table.

Requirement Solution Dataset type Visual AI
Dataset minimum row requirements for non-date/time projects:
  • For regression projects, 20 data rows plus header row
  • For binary classification projects:
    • minimum 20 minority- and 20 majority-class rows
    • minimum 100 total data rows, plus header row
  • For multiclass classification projects, minimum 100 total data rows, plus header row
Error displays the number of rows found; add rows to the dataset until the project meets the minimum row count (plus the header row). Training Yes
Date/time partitioning-based projects (time series and OTV) have specific row requirements. Error displays the number of rows found; add rows to the dataset until the project meets the minimum row count (plus the header row). Training Yes
Dataset used for predictions via the GUI must have at least one data row plus header row. Error displays zero rows found; add a header row and one data row. Predictions Yes
Dataset cannot have more than 20,000 columns. Error message displays column count and limit; reduce the number of columns to less than 20,000. Training, predictions Yes
Dataset must have headers. Lack of headers generally leads to bad predictions or ambiguous column names; add headers. Training, predictions Yes for CSV. If a ZIP upload contains one folder of images per class, there are technically no headers, so this does not always apply to Visual AI.
Dataset must meet deployment type and release size limits. Error message displays dataset size and configured limit. Contact DataRobot Support for size limits; reduce dataset size by trimming rows and/or columns. Training, predictions Yes. Managed AI Platform: 5GB; 100k 224x224-pixel / 50kB images. Self-Managed AI Platform: 10GB; 200k 224x224-pixel / 50kB images.
Number of columns in the header row must be greater than or equal to the number of columns in all data rows. For any data row with fewer columns than the maximum, DataRobot assumes a value of NA/NULL for that row. Error displays the line number of the first row that failed to parse; check the row reported in the error message. Quoting around text fields is a common reason for this error. Training, predictions Yes
Dataset cannot have more than one blank (empty) column name. Typically the first blank column is the first column, due to the way some tools write the index column. Error displays column index of the second blank column; add a label to the column. Training, predictions Yes
Dataset cannot have any column names containing only whitespace. A single blank column name (no whitespace) is allowed, but columns such as “(space)” or "(space)(space)" are not allowed. Error displays index of the column that contained only space(s); remove the space, or rename the column. Training, predictions Yes
All dataset feature names must be unique. No feature name can be used for more than one column, and feature names must differ from each other beyond just their use of special characters (e.g., -, $, ., {, }, \n, \r, ", or '). Error displays the two columns that resolved to the same name after sanitization; rename one of the columns. Example: robot.bar and robot$bar both resolve to robot_bar. Training, predictions Yes
Dataset must use a supported encoding. Because UTF-8 processes the fastest, it is the recommended encoding. Error displays that the detected encoding is not supported or could not be detected; save the dataset to a CSV/delimited format, via another program, and change encoding. Training, predictions Yes
Dataset files must have one of the following delimiters: comma (,), tab (\t), semicolon (;), or pipe (|). Error displays a malformed CSV/delimited message; open the dataset in another program (e.g., Excel) and modify it to use a supported delimiter. A problematic delimiter that is one of the listed values indicates a quoting problem. For text datasets, if strings are not quoted, there may be issues detecting the proper delimiter. Example: in a tab-separated dataset, commas in unquoted text columns may be interpreted as delimiters. See this note for a related file size issue. Training, predictions Yes
Excel datasets cannot have datetimes in the header. Error displays the index of the column and an approximation of the column name; rename the column (e.g., “date” or “date-11/2/2016”). Alternatively, save the dataset to CSV/delimited format. Training, predictions Yes
Dataset must be a single file. Error displays that the specified file contains more than one dataset. This most commonly occurs with archive files (tar and zip); uncompress the archive and make sure it contains only one file. Training, predictions Yes
User must have read permissions to the dataset when using URL or HDFS ingest. Error displays that user does not have permission to access the dataset. Training, predictions Yes
All values in a date column must have the same format or be a null value. Error displays the value that did not match and the format itself; find the unmatched value in the date column and change it. Training, predictions Yes, this applies to the dataset whenever there is a date column, with no dependence on an image column.
Text features can contain up to 5 million characters (for a single cell); in some cases, up to 10 million characters are accepted. In practice there is no limit; the total size of the dataset is more likely the limiting factor. N/A Training, predictions Yes, this applies to the dataset whenever there is a text column, with no dependence on an image column.
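
The row minimums above can be verified before upload. The following is a minimal pre-upload check using pandas; the file name, target column, and project type are hypothetical examples.

    import pandas as pd

    # The header row is required; pandas treats the first row as the header,
    # so len(df) counts data rows only.
    df = pd.read_csv("training.csv")

    def check_minimum_rows(df: pd.DataFrame, target: str, project_type: str) -> None:
        n = len(df)
        if project_type == "regression" and n < 20:
            raise ValueError(f"Regression needs at least 20 data rows; found {n}")
        if project_type == "binary":
            class_counts = df[target].value_counts()
            if n < 100 or class_counts.min() < 20:
                raise ValueError(
                    f"Binary classification needs at least 100 rows and 20 per class; "
                    f"found {n} rows, smallest class has {class_counts.min()}"
                )
        if project_type == "multiclass" and n < 100:
            raise ValueError(f"Multiclass classification needs at least 100 data rows; found {n}")

    check_minimum_rows(df, target="label", project_type="binary")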

Ensure acceptable file import size

Note

All file size limits represent the uncompressed size.

When ingesting a dataset, its actual on-disk size inside DataRobot might differ from the original:

  • If the original dataset source is a CSV, then the size may differ slightly from the original size due to data preprocessing performed by DataRobot.

  • If the original dataset source is not a CSV (e.g., is SAS7BDAT, JDBC, XLSX, GEOJSON, Shapefile, etc.), the on-disk size will be that of the dataset when converted to a CSV. SAS7BDAT, for example, is a binary format that supports different encoding types. As a result, it is difficult to estimate the size of data when converted to CSV based only on the input size as a SAS7BDAT file.

  • XLSX, due to its structure, is read in as a single, whole document, which can cause out-of-memory (OOM) issues during parsing. CSV, by contrast, is read in chunks to reduce memory usage and prevent errors. Best practice is to keep XLSX files under 150MB.

  • If the original dataset source is an archive or a compressed CSV (e.g., .gzip, .bzip2, .zip, .tar, .tgz), the actual on-disk size will be that of the uncompressed CSV after preprocessing is performed.

Keep the following in mind when considering file size:

  • Preprocessing steps include converting the dataset encoding to UTF-8, adding quotation marks around field data, normalizing missing-value representation, converting geospatial fields, and sanitizing column names.

  • For image archives and similar formats, additional preprocessing adds the image file contents to the resulting CSV. This can make the size of the final CSV drastically different from the original uploaded file.

  • File size limits are applied to files once they have been converted to CSV. If you upload a zipped file, the extracted file must be within the file size limits.

  • If a delimited dataset (CSV, TSV, etc.) is close to the upload limit prior to ingest, it is best to do the conversion outside of DataRobot; this helps ensure that the file import does not exceed the limit. Likewise, if a non-comma-delimited file is near the size limit, it may be best to convert it to a comma-delimited CSV outside of DataRobot as well.

  • When converting to CSV outside of DataRobot, be sure to use commas as the delimiter, newline as the record separator, and UTF-8 as the encoding type to avoid discrepancies between the uploaded file size and the size counted against DataRobot's maximum file size limit. A conversion sketch follows this list.

  • Consider modifying optional feature flags in some cases:

    • Disable Early Limit Checking: When selected, this disables the estimate-based early limit checker and uses an exact limit checker instead. This may allow ingestion of files that are close to the limit if the estimate is slightly off. Note, however, that if the limit is exceeded, projects fail later in the ingest process.
    • Enable Minimal CSV Quoting: Sets the conversion process to be more conservative when quoting the converted CSV, allowing the CSV to be smaller. Be aware, however, that doing so may make projects non-repeatable. This is because if you ingest the dataset with and without this setting enabled, the EDA samples and/or partitioning may differ, which can lead to subtle differences in the project. (By contrast, ingesting the same dataset with the same setting will result in a repeatable project.)
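
The conversion recommendations above can be applied with pandas before upload. This is a minimal sketch assuming a tab-delimited, cp1252-encoded source file; the file names and source encoding are hypothetical examples.

    import pandas as pd

    # Read the original delimited file with its actual delimiter and encoding.
    df = pd.read_csv("source.tsv", sep="\t", encoding="cp1252")

    # Write a comma-delimited, UTF-8 encoded CSV with newline record separators
    # (lineterminator requires pandas 1.5+; older versions use line_terminator).
    df.to_csv("converted.csv", sep=",", index=False, encoding="utf-8", lineterminator="\n")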

AutoML file import sizes

The following tables describe supported file import sizes.

Note

File upload size depends on your DataRobot package and, in some cases, the number and size of servers deployed. See tips to ensure acceptable file size for more assistance.

Managed AI Platform (SaaS)
File type Maximum size Notes
CSV 5GB Essentials package
CSV Up to 10GB* Business Critical package
CSV Up to 20GB See considerations
XLSX 150MB See guidance

* Up to 10GB applies to AutoML projects; considerations apply.

Self-Managed AI Platform
File type Maximum size Release availability Notes
CSV (training) Up to 10GB* All Varies based on your DataRobot package and available hardware resources.
XLS 150MB 3.0.1 and later  

* Up to 20GB in some instances; contact DataRobot Support for more information.

Beyond 10GB ingest (SaaS only)

Ingest of up to 20GB training datasets provides large-scale modeling capabilities. When enabled, the file ingest limit is increased from 10GB to 20GB.

Availability information

Ingest of up to 20GB training data is a preview feature, off by default. Contact your DataRobot representative or administrator for information on enabling the feature.

Feature flags: Enable 20GB Scaleup Modeling Optimization

Consider the following when training with 20GB:

  • Available for binary classification and regression projects only.
  • No support for Visual AI or Location AI projects.
  • Ingestion is only available from an external source (data connection or URL); training data must be registered in the AI Catalog (20GB datasets cannot be uploaded directly from a local computer). See the sketch after this list.
  • Sliced insights are disabled.
  • Feature Discovery is disabled.
  • By default, Feature Effects generates insights for the top 500 features (ranked by feature impact). With projects greater than 10GB, in consideration of runtime performance, Feature Effects generates insights for the top 100 features.
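
For illustration, a minimal sketch of this workflow using the DataRobot Python client follows; the API token, URL, and project name are hypothetical examples, and exact method availability depends on your client version.

    import datarobot as dr

    dr.Client(token="YOUR_API_TOKEN", endpoint="https://app.datarobot.com/api/v2")

    # Register the training data in the AI Catalog from an external URL
    # (20GB datasets cannot be uploaded directly from a local computer).
    dataset = dr.Dataset.create_from_url("https://example.com/large_training_data.csv")

    # Create a project from the registered dataset; set the target and start
    # modeling as you would for any other project.
    project = dr.Project.create_from_dataset(dataset.id, project_name="20GB ingest example")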

OTV requirements

For out-of-time validation (OTV) modeling, datasets must be less than 5GB.

OTV backtests require at least 20 rows in each of the validation and holdout folds and at least 100 rows in each training fold. If you set a number of backtests that results in any of the runs not meeting these criteria, DataRobot only runs the number of backtests that do meet the minimums (and marks the display with an asterisk). For example:

  • With one backtest, no holdout, minimum 100 training rows and 20 validation rows (120 total).
  • With one backtest and holdout, minimum 100 training rows, 20 validation rows, 20 holdout rows (140 total).

Prediction file import sizes

Prediction method Details File size limit
Leaderboard predictions To make predictions on a non-deployed model using the UI, expand the model on the Leaderboard and select Predict > Make Predictions. Upload predictions from a local file, URL, data source, or the AI Catalog. You can also upload predictions using the modeling predictions API, also called the "V2 predictions API." Use this API to test predictions using your modeling workers on small datasets. Predictions can be limited to 100 requests per user, per hour, depending on your DataRobot package. 1GB
Batch predictions (UI) To make batch predictions using the UI, deploy a model and navigate to the deployment's Make Predictions tab (requires MLOps). 5GB
Batch predictions (API) The Batch Prediction API is optimized for high throughput and contains production-grade connectivity options that allow you to not only push data through the API, but also connect to the AI Catalog, cloud storage, databases, or data warehouses (requires MLOps). A Python client sketch follows this table. Unlimited
Prediction API (real-time) To make real-time predictions on a deployed model, use the Prediction API. 50MB
Prediction monitoring While the Batch Prediction API isn't limited to a specific file size, prediction monitoring is still subject to an hourly rate limit. 100MB / hour
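
As an illustration of the Batch Prediction API row above, this is a minimal sketch using the DataRobot Python client to score a local file against a deployment; the API token, deployment ID, and file names are hypothetical examples.

    import datarobot as dr

    dr.Client(token="YOUR_API_TOKEN", endpoint="https://app.datarobot.com/api/v2")

    # Score a local file against an existing deployment and write results locally.
    job = dr.BatchPredictionJob.score(
        deployment="5fabc1234567890abcdef123",
        intake_settings={"type": "localFile", "file": "to_score.csv"},
        output_settings={"type": "localFile", "path": "predictions.csv"},
    )
    job.wait_for_completion()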

Time series file requirements

When using time series, datasets must be in CSV format and meet the following size requirements:

Max file size: single series Max file size: multiseries/segmented Notes
500MB 5GB SaaS
500MB 2.5GB Self-Managed 6.0+, 30GB modeler configuration
500MB 5GB Self-Managed 6.0+, 60GB modeler configuration

As with OTV, if you set a number of backtests that results in any of the runs not meeting the minimum row criteria, DataRobot only runs the number of backtests that do meet the minimums (and marks the display with an asterisk). Specific requirements for time series features:

Feature Requirement
Minimum rows per backtest
Data ingest: Regression 20 rows for training and 4 rows for validation
Data ingest: Classification 75 rows for training and 12 rows for validation
Post-feature derivation: Regression Minimum 35 rows
Post-feature derivation: Classification 100 rows
Calendars
Calendar event files Less than 1MB and 10K rows
Multiseries modeling*
External baseline files for model comparison Less than 5GB

* Self-Managed AI Platform versions 5.0 or later are limited to 100,000 series; versions 5.3 or later are limited to 1,000,000 series.

Note

There are times that you may want to partition without holdout, which changes the minimum ingest rows and also the output of various visualizations.

For releases 4.5, 4.4 and 4.3, datasets must be less than 500MB. For releases 4.2 and 4.0, datasets must be less than 10MB for time series and less than 500MB for OTV. Datasets must be less than 5MB for projects using Date/Time partitioning in earlier releases.

Feature Discovery file import sizes

When using Feature Discovery, the following requirements apply:

  • Secondary datasets must be either uploaded files or JDBC sources registered in the AI Catalog.

  • You can have a maximum of 30 datasets per project.

  • The sum of all dataset sizes (both primary and secondary) should not exceed 40GB, and individual dataset sizes should not exceed 20GB. Using larger datasets may impact performance and result in errors. See the download limits mentioned below.

Data formats

DataRobot supports the following formats and types for data ingestion. See also the supported data types.

File formats

  • .csv, .dsv, or .tsv* (preferred formats)
  • database tables
  • .xls/.xlsx
  • PDF**
  • .sas7bdat
  • .parquet***
  • .avro**

*The file must be a comma-, tab-, semicolon-, or pipe-delimited file with a header for each data column. Each row must have the same number of fields, some of which may be blank.

**These file types are preview. Contact your DataRobot representative for more information.

***Parquet files are typed data; if the file contains a string field with numeric values, DataRobot treats this field as categorical.

Location AI file formats

The following Location AI file types are supported only if enabled for users in your organization:

  • ESRI Shapefiles
  • GeoJSON
  • ESRI File Geodatabase
  • Well Known Text (embedded in table column)
  • PostGIS Databases (The file must be a comma-delimited, tab-delimited, semicolon-delimited, or pipe-delimited file and must have a header for each data column. Each row must have the same number of fields (columns), some of which may be blank.)

Compression formats

  • .gz
  • .bz2

Archive format

  • .tar

Compression and archive formats

  • .zip
  • .tar.gz/.tgz
  • .tar.bz2

Both compression and archive formats are accepted. Archive is preferred, however, because it allows DataRobot to determine the uncompressed data size and therefore be more efficient during data intake.
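
For example, a minimal sketch for packaging a CSV as a tar archive before upload (file names are hypothetical examples):

    import tarfile

    # Plain .tar preserves the uncompressed file size in the archive header;
    # use mode "w:gz" to produce a .tar.gz instead.
    with tarfile.open("training.tar", "w") as archive:
        archive.add("training.csv", arcname="training.csv")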

Decimal separator

The period (.) character is the only supported decimal separator; DataRobot does not support locale-specific decimal separators such as the comma (,). In other words, a value of 1.000 is equal to one (1) and cannot be used to represent one thousand (1000). If a different character is used as the separator, the value is treated as categorical (a normalization sketch follows the table below).

A numeric feature can be positive, negative, or zero, and must meet one of the following criteria:

  • Contains no periods or commas.
  • Contains a single period (values with more than one period are treated as categorical).

The table below provides sample values and their corresponding variable type:

Feature value Data type
1000000 Numeric
0.1 Numeric
1,000.000 Categorical
1.000.000 Categorical
1,000,000 Categorical
0,1000 Categorical
1000.000… Categorical
1000,000… Categorical
(0,100) Categorical
(0.100) Categorical
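
If a dataset uses a locale-specific separator, values can be normalized before upload. This is a minimal pandas sketch assuming a column that uses periods for thousands and a comma for decimals; the file and column names are hypothetical examples.

    import pandas as pd

    df = pd.read_csv("raw.csv", dtype={"amount": str})

    # "1.000,01" -> 1000.01: drop the thousands separators, then convert the
    # decimal comma to the supported period.
    df["amount"] = (
        df["amount"]
        .str.replace(".", "", regex=False)
        .str.replace(",", ".", regex=False)
        .astype(float)
    )

    df.to_csv("normalized.csv", index=False)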

Tip

Attempting a feature transformation (on features considered categorical based on the separator) from categorical to numeric will result in an empty numeric feature.

Encodings and character sets

Datasets must adhere to the following encoding requirements:

  • The data file cannot have any extraneous characters or escape sequences (from URLs).

  • Encoding must be consistent throughout the entire dataset. For example, if a data file is UTF-8 for its first 100MB but contains non-UTF-8 characters later in the file, ingest can fail because encoding detection is based on the first 100MB.
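
If a file uses another supported encoding, it can be rewritten as UTF-8 (the recommended encoding) before upload. This is a minimal sketch assuming a cp1252 source file; the file names and source encoding are hypothetical examples.

    # A UnicodeDecodeError here indicates the mixed-encoding problem described
    # above: the file is not consistently encoded.
    with open("source.csv", "r", encoding="cp1252") as src, \
            open("utf8.csv", "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(line)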

Data must adhere to one of the following encodings:

  • ascii
  • cp1252
  • utf-8
  • utf-8-sig
  • utf-16
  • utf-16-le
  • utf-16-be
  • utf-32
  • utf-32-le
  • utf-32-be
  • Shift-JIS
  • ISO-2022-JP
  • EUC-JP
  • CP932
  • ISO-8859-1
  • ISO-8859-2
  • ISO-8859-5
  • ISO-8859-6
  • ISO-8859-7
  • ISO-8859-8
  • ISO-8859-9
  • windows-1251
  • windows-1256
  • KOI8-R
  • GB18030
  • Big5
  • ISO-2022-KR
  • IBM424
  • windows-1252

Special column detection

These special columns are detected if they meet the criteria described below, but currency, length, percent, and date cannot be selected as the target for a project. However, date can be selected as a partition feature.

Date and time formats

Columns are detected as date fields if they match any of the formats containing a date listed below. If they are strictly time formats (for example, %H:%M:%S), they are detected as time. See the Python definition table for descriptions of the directives. The following table provides examples using the date and time January 25, 1999 at 1:01 p.m. (specifically, 59 seconds and 000000 microseconds past 1:01 p.m.). A format-validation sketch follows the table.

String Example
%H:%M 13:01
%H:%M:%S 13:01:59
%I:%M %p 01:01 PM
%I:%M:%S %p 01:01:59 PM
%M:%S 01:59
%Y %m %d 1999 01 25
%Y %m %d %H %M %S 1999 01 25 13 01 59
%Y %m %d %I %M %S %p 1999 01 25 01 01 59 PM
%Y%m%d 19990125
%Y-%d-%m 1999-25-01
%Y-%m-%d 1999-01-25
%Y-%m-%d %H:%M:%S 1999-01-25 13:01:59
%Y-%m-%d %H:%M:%S.%f 1999-01-25 13:01:59.000000
%Y-%m-%d %I:%M:%S %p 1999-01-25 01:01:59 PM
%Y-%m-%d %I:%M:%S.%f %p 1999-01-25 01:01:59.000000 PM
%Y-%m-%dT%H:%M:%S 1999-01-25T13:01:59
%Y-%m-%dT%H:%M:%S.%f 1999-01-25T13:01:59.000000
%Y-%m-%dT%H:%M:%S.%fZ 1999-01-25T13:01:59.000000Z
%Y-%m-%dT%H:%M:%SZ 1999-01-25T13:01:59Z
%Y-%m-%dT%I:%M:%S %p 1999-01-25T01:01:59 PM
%Y-%m-%dT%I:%M:%S.%f %p 1999-01-25T01:01:59.000000 PM
%Y-%m-%dT%I:%M:%S.%fZ %p 1999-01-25T01:01:59.000000Z PM
%Y-%m-%dT%I:%M:%SZ %p 1999-01-25T01:01:59Z PM
%Y.%d.%m 1999.25.01
%Y.%m.%d 1999.01.25
%Y/%d/%m %H:%M:%S.%f 1999/25/01 13:01:59.000000
%Y/%d/%m %H:%M:%S.%fZ 1999/25/01 13:01:59.000000Z
%Y/%d/%m %I:%M:%S.%f %p 1999/25/01 01:01:59.000000 PM
%Y/%d/%m %I:%M:%S.%fZ %p 1999/25/01 01:01:59.000000Z PM
%Y/%m/%d 1999/01/25
%Y/%m/%d %H:%M:%S 1999/01/25 13:01:59
%Y/%m/%d %H:%M:%S.%f 1999/01/25 13:01:59.000000
%Y/%m/%d %H:%M:%S.%fZ 1999/01/25 13:01:59.000000Z
%Y/%m/%d %I:%M:%S %p 1999/01/25 01:01:59 PM
%Y/%m/%d %I:%M:%S.%f %p 1999/01/25 01:01:59.000000 PM
%Y/%m/%d %I:%M:%S.%fZ %p 1999/01/25 01:01:59.000000Z PM
%d.%m.%Y 25.01.1999
%d.%m.%y 25.01.99
%d/%m/%Y 25/01/1999
%d/%m/%Y %H:%M 25/01/1999 13:01
%d/%m/%Y %H:%M:%S 25/01/1999 13:01:59
%d/%m/%Y %I:%M %p 25/01/1999 01:01 PM
%d/%m/%Y %I:%M:%S %p 25/01/1999 01:01:59 PM
%d/%m/%y 25/01/99
%d/%m/%y %H:%M 25/01/99 13:01
%d/%m/%y %H:%M:%S 25/01/99 13:01:59
%d/%m/%y %I:%M %p 25/01/99 01:01 PM
%d/%m/%y %I:%M:%S %p 25/01/99 01:01:59 PM
%m %d %Y %H %M %S 01 25 1999 13 01 59
%m %d %Y %I %M %S %p 01 25 1999 01 01 59 PM
%m %d %y %H %M %S 01 25 99 13 01 59
%m %d %y %I %M %S %p 01 25 99 01 01 59 PM
%m-%d-%Y 01-25-1999
%m-%d-%Y %H:%M:%S 01-25-1999 13:01:59
%m-%d-%Y %I:%M:%S %p 01-25-1999 01:01:59 PM
%m-%d-%y 01-25-99
%m-%d-%y %H:%M:%S 01-25-99 13:01:59
%m-%d-%y %I:%M:%S %p 01-25-99 01:01:59 PM
%m.%d.%Y 01.25.1999
%m.%d.%y 01.25.99
%m/%d/%Y 01/25/1999
%m/%d/%Y %H:%M 01/25/1999 13:01
%m/%d/%Y %H:%M:%S 01/25/1999 13:01:59
%m/%d/%Y %I:%M %p 01/25/1999 01:01 PM
%m/%d/%Y %I:%M:%S %p 01/25/1999 01:01:59 PM
%m/%d/%y 01/25/99
%m/%d/%y %H:%M 01/25/99 13:01
%m/%d/%y %H:%M:%S 01/25/99 13:01:59
%m/%d/%y %I:%M %p 01/25/99 01:01 PM
%m/%d/%y %I:%M:%S %p 01/25/99 01:01:59 PM
%y %m %d 99 01 25
%y %m %d %H %M %S 99 01 25 13 01 59
%y %m %d %I %M %S %p 99 01 25 01 01 59 PM
%y-%d-%m 99-25-01
%y-%m-%d 99-01-25
%y-%m-%d %H:%M:%S 99-01-25 13:01:59
%y-%m-%d %H:%M:%S.%f 99-01-25 13:01:59.000000
%y-%m-%d %I:%M:%S %p 99-01-25 01:01:59 PM
%y-%m-%d %I:%M:%S.%f %p 99-01-25 01:01:59.000000 PM
%y-%m-%dT%H:%M:%S 99-01-25T13:01:59
%y-%m-%dT%H:%M:%S.%f 99-01-25T13:01:59.000000
%y-%m-%dT%H:%M:%S.%fZ 99-01-25T13:01:59.000000Z
%y-%m-%dT%H:%M:%SZ 99-01-25T13:01:59Z
%y-%m-%dT%I:%M:%S %p 99-01-25T01:01:59 PM
%y-%m-%dT%I:%M:%S.%f %p 99-01-25T01:01:59.000000 PM
%y-%m-%dT%I:%M:%S.%fZ %p 99-01-25T01:01:59.000000Z PM
%y-%m-%dT%I:%M:%SZ %p 99-01-25T01:01:59Z PM
%y.%d.%m 99.25.01
%y.%m.%d 99.01.25
%y/%d/%m %H:%M:%S.%f 99/25/01 13:01:59.000000
%y/%d/%m %H:%M:%S.%fZ 99/25/01 13:01:59.000000Z
%y/%d/%m %I:%M:%S.%f %p 99/25/01 01:01:59.000000 PM
%y/%d/%m %I:%M:%S.%fZ %p 99/25/01 01:01:59.000000Z PM
%y/%m/%d 99/01/25
%y/%m/%d %H:%M:%S 99/01/25 13:01:59
%y/%m/%d %H:%M:%S.%f 99/01/25 13:01:59.000000
%y/%m/%d %H:%M:%S.%fZ 99/01/25 13:01:59.000000Z
%y/%m/%d %I:%M:%S %p 99/01/25 01:01:59 PM
%y/%m/%d %I:%M:%S.%f %p 99/01/25 01:01:59.000000 PM
%y/%m/%d %I:%M:%S.%fZ %p 99/01/25 01:01:59.000000Z PM
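
Because all values in a date column must share a single format (see the general requirements above), it can help to validate the column before upload. This is a minimal sketch; the file name, column name, and format are hypothetical examples.

    from datetime import datetime

    import pandas as pd

    df = pd.read_csv("training.csv", dtype={"order_date": str})
    fmt = "%Y-%m-%d"  # the one format every non-null value must match

    def matches(value: str) -> bool:
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            return False

    bad = [v for v in df["order_date"].dropna() if not matches(v)]
    if bad:
        print(f"{len(bad)} value(s) do not match {fmt}, for example {bad[:5]}")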

Percentages

Columns that have numeric values ending with % are treated as percentages.

Currencies

Columns that contain values with the following currency symbols are treated as currency.

  • $
  • EUR, USD, GBP
  • £
  • ￡ (fullwidth)
  • ¥
  • ￥ (fullwidth)

Also, note the following regarding currency interpretation:

  • The currency symbol can precede ($1) or follow (1EUR) the value, but its placement must be consistent across the feature.
  • Both comma (,) and period (.) can be used as a separator for thousands or cents, but must be consistent across the feature (e.g., 1000 dollars and 1 cent can be represented as 1,000.01 or 1.000,01).
  • Leading + and - symbols are allowed.

Length

Columns that contain values matching the convention <feet>’ <inches>” are displayed as variable type length on the Data page. DataRobot converts the length to a number in inches and then treats the value as a numeric in blueprints. If your dataset has other length values (for example, 12cm), the feature is treated as categorical. If a feature has mixed values that show the measurement (5m, 72in, and 12cm, for example), it is best to clean and normalize the dataset before uploading.
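
For example, mixed or non-standard length values can be normalized to inches before upload. This is a minimal sketch for the <feet>' <inches>" convention only; the regex, file name, and column names are hypothetical examples, not DataRobot's internal conversion.

    import re

    import pandas as pd

    def to_inches(value: str) -> float:
        # Matches strings such as 5' 10" and returns total inches.
        match = re.fullmatch(r"""\s*(\d+)'\s*(\d+(?:\.\d+)?)"\s*""", value)
        if match is None:
            raise ValueError(f"Unrecognized length value: {value!r}")
        feet, inches = match.groups()
        return int(feet) * 12 + float(inches)

    df = pd.read_csv("raw.csv", dtype={"height": str})
    df["height_in"] = df["height"].map(to_inches)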

Column name conversions

During data ingestion, DataRobot converts the following characters to underscores (_): -, $, ., {, }, ", \n, and \r. Additionally, DataRobot removes all leading and trailing spaces.
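
The effect of this conversion on duplicate names (see the unique feature name requirement above) can be previewed before upload. This is a minimal, illustrative sketch, not DataRobot's internal implementation; the column names are hypothetical examples.

    import re
    from collections import Counter

    def sanitize(name: str) -> str:
        # Mirror the conversion described above: strip leading/trailing spaces
        # and replace the listed special characters with underscores.
        return re.sub(r'[-$.{}"\n\r]', "_", name.strip())

    columns = ["robot.bar", "robot$bar", " price "]
    sanitized = [sanitize(c) for c in columns]
    collisions = [name for name, count in Counter(sanitized).items() if count > 1]
    print(collisions)  # ['robot_bar'] -> rename one of the source columns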

File download sizes

Consider the following when downloading datasets:

  • There is a 10GB file size limit.
  • Datasets are downloaded as CSV files.
  • The downloaded dataset may differ from the one initially imported because DataRobot applies the conversions mentioned above.

Updated October 24, 2024