Hadoop scalable ingest¶
The scalable ingest feature is only available on Hadoop installations.
For files that reside in HDFS (Hadoop Distributed File System) and are larger than the 12GB (as CSV)* file size limit for project creation, DataRobot offers a scalable ingest option. With this option, you can ingest a file of up to 100GB. DataRobot downsamples the file, using up to 10GB for model training plus an additional 2GB for validation and holdout (for a total of 12GB). Scalable ingest is only enabled for project creation; prediction dataset file size requirements remain unchanged.
* Some file types may be below the 12GB limit on disk but larger when converted to CSV. For example, a Parquet file may be 6GB on disk but about 40GB when represented as CSV.
Scalable ingest details¶
The following describes the requirements and capabilities of the scalable ingest option:
- Files must live in a Hadoop file system and be loaded via the HDFS browser. For non-Hadoop file formats, the ingest process, file formats, and size requirements are unchanged.
- Scalable ingest supports all formats supported by existing ingest except SAS (
Scalable ingest supports additional formats not supported by standard file ingest. Note that these formats display as directories in the file browser, but you can select and import them just like you would a single file:
- Apache Parquet
- Apache Avro
- Apache ORC
- Multifile CSV
When the file type is Parquet, Avro, ORC, or multifile CSV, files (or directories of files) can be less than 12GB as CSV. In that case, DataRobot does not downsample and instead converts the data to CSV and then applies the standard file ingest process.
You cannot specify validation and holdout sizes for downsampled data. These values are determined by the application configuration; the default is 1GB for holdout and 1GB for validation.
You cannot set advanced (Group, Partition Feature, or Date/Time) partitioning options. DataRobot automatically sets the partitioning method based on the downsampling strategy. If you select representative downsampling, DataRobot uses random partitioning; if you choose smart downsampling, DataRobot uses stratified partitioning.
With scalable ingest, you select a target prior to EDA1. You cannot change the target once ingest completes.
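The size rules above can be summarized in a short sketch. This is only an illustration, not DataRobot code: the function name `ingest_path`, the return labels, and the treatment of a file at exactly 12GB are assumptions, and it assumes both limits are measured against the CSV-equivalent size.

```python
# Illustrative only: names and boundary handling are assumptions, not DataRobot's API.
STANDARD_LIMIT_GB = 12.0   # standard project-creation limit, measured as CSV
SCALABLE_LIMIT_GB = 100.0  # largest file scalable ingest accepts

# Partitioning follows the downsampling method and cannot be set manually.
PARTITIONING_BY_METHOD = {
    "representative": "random",
    "smart": "stratified",
}


def ingest_path(csv_size_gb: float) -> str:
    """Return which ingest path applies to an HDFS file of the given CSV size."""
    if csv_size_gb <= 0:
        raise ValueError("csv_size_gb must be positive")
    if csv_size_gb > SCALABLE_LIMIT_GB:
        return "too_large"       # beyond what scalable ingest handles
    if csv_size_gb > STANDARD_LIMIT_GB:
        return "downsample"      # scalable ingest downsamples the file
    return "standard"            # converted to CSV, then standard ingest


if __name__ == "__main__":
    # A Parquet file that is 6GB on disk but about 40GB as CSV.
    print(ingest_path(40.0))
    print(PARTITIONING_BY_METHOD["smart"])
```

The point of the sketch is that the decision depends on the CSV-equivalent size rather than the on-disk size, which is why a compact Parquet file can still trigger downsampling.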
Understanding the large data workflow¶
The following steps illustrate the large file, scalable ingest workflow:
From the project creation page, select the HDFS file import method. If file size is greater than the standard file size limit, DataRobot displays a warning indicating that the dataset will be downsampled.
The warning note references a file size greater than 13GB because DataRobot rounds up to 11GB (to account for a file size of, for example, 10.2GB) and adds 1GB each for holdout and validation.
From the HDFS file browser, select a dataset and click NEXT. DataRobot then displays the Fast EDA page for target selection. The page also reports all columns, the number of unique and missing values, and statistics for numeric columns (generated from a pseudo-random selection of 1M rows from the dataset).
Select a target. DataRobot runs target validation and disallows any target value that will result in an error.
In cases where an error is possible but not certain, you receive a warning but can continue with that selection as the target.
If you continue and the target does result in an error, DataRobot returns a message at the top of the screen indicating that auto-start has failed.
After you select your target, DataRobot displays a review page showing the dataset, target, and warning (if applicable).
On this page you can:
a. Set the downsampling method.
b. Choose to have DataRobot automatically start full Autopilot (after downsampling, partitioning and model building start automatically). If toggled on, DataRobot uses the default recommended metric, partitioning method, and feature list.
Click CREATE PROJECT. DataRobot begins uploading and reading the raw data. Note that for scalable ingest projects, EDA1 includes a downsampling phase (step 1) and reports progress status.
Once ingest completes, DataRobot begins model building (if you selected that option) or displays the standard page for you to start your build.
Downsampling with scalable ingest¶
Downsampling with large data generally follows the same operating characteristics as smart downsampling. That is, DataRobot reads the data to determine the majority/minority class ratio and then uses the downsampled data for the three stages of Autopilot (the rest of the data is not used for training but is available in an overflow file for validation and holdout models). To use more data, you can unlock holdout and retrain the model on a larger sample percent.
With scalable ingest you can select a downsampling method, either Representative or Smart:
Representative (random) is available for all project types. DataRobot keeps a total of 12GB.
Smart (balanced) is only available for binary classification and zero-boosted regression projects. With scalable ingest smart downsampling, you can let the application automatically determine the best ratio or specify a ratio manually. See below for more information and examples of smart downsampling.
For binary classification problems and zero-boosted regression problems in which DataRobot estimates a majority/minority ratio less than 6, or for non-binary numeric columns, representative downsampling is recommended. This is also true for regression problems (a numeric target with more than two unique values). When the majority/minority ratio is greater than or equal to 6, smart downsampling is a better choice.
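The recommendation above reduces to a single threshold check. The sketch below is an illustration under stated assumptions: the function name `recommend_downsampling` and the project-type labels are hypothetical, and only the threshold of 6 and the eligibility rules come from the text.

```python
from typing import Optional

# Hypothetical labels for the project types that allow smart downsampling.
SMART_ELIGIBLE = {"binary", "zero_boosted_regression"}
RATIO_THRESHOLD = 6.0  # majority/minority ratio at which smart is preferred


def recommend_downsampling(project_type: str,
                           majority_minority_ratio: Optional[float] = None) -> str:
    """Return 'smart' or 'representative' following the guidance in the text."""
    if project_type not in SMART_ELIGIBLE:
        # Smart downsampling is not offered for other project types.
        return "representative"
    if majority_minority_ratio is None:
        # No ratio estimate available: fall back to the always-available method.
        return "representative"
    if majority_minority_ratio >= RATIO_THRESHOLD:
        return "smart"
    return "representative"


if __name__ == "__main__":
    print(recommend_downsampling("binary", 9.0))
    print(recommend_downsampling("binary", 3.0))
    print(recommend_downsampling("regression"))
```

Falling back to representative downsampling when no ratio estimate is available is a choice made for this sketch, not documented behavior.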
Understanding smart downsampling¶
Smart downsampling is "class aware," meaning that the downsampled data reflects the minority-to-majority ratio. DataRobot uses weights in the modeling phase, according to the original distribution of data, to ensure that the ratios in the downsampled data correctly represent the values in the dataset. You can elect to let DataRobot choose the ratio (Auto) or you can specify a ratio (Manual):
Auto (toggle off): DataRobot selects as balanced a ratio as possible, based on the majority/minority class counts from the entire dataset and resulting in as close as possible to 12GB available for in-memory modeling.
Manual: You specify a value (the number of majority rows in relation to minority rows) and DataRobot selects only as much downsampled data as required (up to 12GB) to satisfy the specified ratio. The ratio cannot be less than 1. If you specify a ratio greater than the actual maximum ratio in the dataset, DataRobot uses the maximum ratio from the dataset.
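The constraints on a manual ratio can be sketched as a small validation step. This is a hedged illustration: `effective_ratio` is a hypothetical helper, and it assumes the "actual maximum ratio in the dataset" means the majority row count divided by the minority row count.

```python
def effective_ratio(requested_ratio: float,
                    majority_rows: int,
                    minority_rows: int) -> float:
    """Return the ratio that would actually apply to a manual request.

    Assumption: the dataset's maximum ratio is majority_rows / minority_rows.
    """
    if requested_ratio < 1:
        raise ValueError("ratio cannot be less than 1")
    if minority_rows <= 0 or majority_rows <= 0:
        raise ValueError("both classes must contain rows")
    max_ratio = majority_rows / minority_rows
    # A request above the dataset's own ratio falls back to that ratio.
    return min(requested_ratio, max_ratio)


if __name__ == "__main__":
    # 900,000 majority rows and 100,000 minority rows give a maximum ratio of 9.
    print(effective_ratio(2, 900_000, 100_000))
    print(effective_ratio(50, 900_000, 100_000))
```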
Examples of ratios¶
If you have a minority class of 6GB or larger, you will always have the full 12GB available. If the minority class is less than 6GB, you may end up with less than 12GB in your downsampled dataset.
Let's say you have a 100GB dataset with 3GB of minority class data:
- If you use Auto, DataRobot uses all 3GB of the minority class and 9GB of the majority. It then applies weights to better represent the data in the final training dataset.
- If you manually specify a ratio of 2, DataRobot keeps 9GB of data: 3GB of minority class data and 6GB of majority class data. From that, 7GB is used for training, 1GB for holdout, and 1GB for validation.
If your 100GB dataset has a 20GB minority class:
- If you manually specify a ratio of 2, DataRobot keeps 12GB of data: 4GB of minority class data and 8GB of majority class data.
- If you set the ratio to 1, the downsampled dataset includes 6GB minority and 6GB majority class data. Weighting ensures correct representation of the full dataset.
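The arithmetic behind these examples can be reproduced with a short sketch. It is a simplified model built only from the figures above, not DataRobot's implementation: the function name `smart_downsample` is hypothetical, sizes are treated as continuous GB values, and weighting is ignored.

```python
from typing import Optional, Tuple

CAP_GB = 12.0  # total data kept after downsampling


def smart_downsample(minority_gb: float,
                     majority_gb: float,
                     ratio: Optional[float] = None) -> Tuple[float, float]:
    """Return (minority_kept_gb, majority_kept_gb).

    ratio=None models Auto; a number models Manual (majority per minority).
    """
    if ratio is None:
        # Auto: keep as balanced a split as possible while filling the cap.
        minority_kept = min(minority_gb, CAP_GB / 2)
        majority_kept = min(majority_gb, CAP_GB - minority_kept)
    else:
        if ratio < 1:
            raise ValueError("ratio cannot be less than 1")
        # Manual: keep only as much data as the ratio requires, up to the cap.
        minority_kept = min(minority_gb, CAP_GB / (1 + ratio))
        majority_kept = min(majority_gb, ratio * minority_kept)
    return minority_kept, majority_kept


if __name__ == "__main__":
    print(smart_downsample(3, 97))       # Auto, 3GB minority class
    print(smart_downsample(3, 97, 2))    # Manual ratio 2, 3GB minority class
    print(smart_downsample(20, 80, 2))   # Manual ratio 2, 20GB minority class
    print(smart_downsample(20, 80, 1))   # Manual ratio 1, 20GB minority class
```

Under these assumptions the four calls give 3GB/9GB, 3GB/6GB, 4GB/8GB, and 6GB/6GB of minority/majority data, matching the examples in the list.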
DataRobot also calculates the "traditional" 20% values for holdout and validation when the dataset is less than 12GB. When the 20% value is greater than 1GB, DataRobot uses just 1GB and selects the corresponding partitions accordingly. When it is less than 1GB, DataRobot uses 20%. For example, with a 9GB dataset a 20% partition would result in 1.8GB of data. Since that is larger than 1GB, DataRobot uses just 1GB. However, if the dataset is 4GB (1GB minority, 3GB majority), 20% would result in 800MB and DataRobot would use that amount instead.
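The partition sizing rule amounts to taking the smaller of 20% and 1GB. The sketch below illustrates it under the default 1GB configuration; `partition_sizes` is a hypothetical helper, and the assumption that training receives whatever remains after validation and holdout is inferred from the 7GB/1GB/1GB example earlier in this section.

```python
from typing import Tuple

PARTITION_FRACTION = 0.20  # the "traditional" share for each partition
PARTITION_CAP_GB = 1.0     # default cap for holdout and for validation


def partition_sizes(dataset_gb: float) -> Tuple[float, float, float]:
    """Return (training_gb, validation_gb, holdout_gb) for a downsampled dataset."""
    if dataset_gb <= 0:
        raise ValueError("dataset_gb must be positive")
    # Each partition gets 20% of the data, but never more than the cap.
    partition_gb = min(PARTITION_CAP_GB, PARTITION_FRACTION * dataset_gb)
    # Assumption: training uses the remainder.
    training_gb = dataset_gb - 2 * partition_gb
    return training_gb, partition_gb, partition_gb


if __name__ == "__main__":
    print(partition_sizes(9.0))   # 20% would be 1.8GB, so the 1GB cap applies
    print(partition_sizes(4.0))   # 20% is 0.8GB, below the cap
```

For the 9GB case this yields 7GB for training and 1GB each for validation and holdout, and for the 4GB case each partition is about 0.8GB.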