Early target selection¶
The data ingestion process for large datasets can, optionally, differ from the one used for smaller datasets. (You can also use the same process by letting the project complete EDA1.) When DataRobot optimizes for larger datasets, it launches "Fast (or preliminary) EDA," a subset of the full EDA1 process. The optimized process looks like this:
- Dataset import begins.
- DataRobot detects the need for, and launches, Fast EDA.
- When Fast EDA completes, there is a window of time in which you can choose to use early target selection. The window lasts only from the time Fast EDA completes until EDA1 completes. As a result, for smaller datasets (less than 200MB), the window may be too short to be useful.
- If early target selection was enabled, DataRobot completes EDA1, partitions the data, and launches EDA2 using project criteria for early target selection. If it was not enabled, the standard ingest process resumes (select a target and options and press Start).
When working with large datasets, you cannot create GLM, ENET, or PLS blender models. Median and average blenders are available. Also, Fast EDA is disabled in some cases, such as when the dataset has too many columns or uses too much RAM during ingest.
Fast EDA application¶
A dataset qualifies for Fast EDA if it is larger than 5MB, has fewer than 10,000 columns, and its ingestion is less than 75% complete 10 seconds after you begin to load data. Note that the ingestion process is internal to DataRobot and may appear differently to you in the status bar. Fast EDA allows you to see preliminary EDA1 results and explore your data shortly after upload begins and while ingestion continues. Once Fast EDA completes, DataRobot continues calculating until full EDA1 completes.
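The three qualification criteria above can be sketched as a simple check. This is only an illustration, not DataRobot's implementation: the `IngestSnapshot` class, the `qualifies_for_fast_eda` function, and the decimal definition of a megabyte are all assumptions, and the sketch does not model the other cases in which Fast EDA is disabled (for example, too much RAM used during ingest).

```python
from dataclasses import dataclass

# Assumption: decimal megabytes. The documentation does not define the unit.
MB = 1_000_000


@dataclass
class IngestSnapshot:
    """Hypothetical view of an ingest, taken 10 seconds after loading begins."""
    size_bytes: int
    n_columns: int
    fraction_ingested_at_10s: float  # 0.0 (nothing loaded) to 1.0 (complete)


def qualifies_for_fast_eda(snap: IngestSnapshot) -> bool:
    """Apply the three documented criteria; nothing else is modeled."""
    return (
        snap.size_bytes > 5 * MB
        and snap.n_columns < 10_000
        and snap.fraction_ingested_at_10s < 0.75
    )


# A 10GB file that is barely started after 10 seconds qualifies.
large = IngestSnapshot(size_bytes=10_000 * MB, n_columns=250,
                       fraction_ingested_at_10s=0.02)
# A 3MB file is below the size threshold and has already finished loading.
small = IngestSnapshot(size_bytes=3 * MB, n_columns=40,
                       fraction_ingested_at_10s=1.0)

print(qualifies_for_fast_eda(large))
print(qualifies_for_fast_eda(small))
```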
Fast EDA is particularly helpful with large datasets because it allows you to:
- explore your data while ingestion continues. For example, it may take 15 minutes to ingest a 10GB file, but with Fast EDA you can see data information much more quickly.
- use early target selection, described below, to set the target variable and advanced option settings earlier on in the upload process.
Fast EDA is calculated on the first X rows of the dataset, not a random sample.
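To see why the distinction between "first rows" and "random sample" can matter, consider a file that happens to be sorted. The sketch below uses a made-up dataset and a hypothetical `first_rows` helper; it says nothing about how DataRobot reads data internally, and the row count of 100 is arbitrary. It only shows that statistics computed on the leading rows of a sorted file can differ from those of the full file.

```python
import random
from itertools import islice
from statistics import mean


def first_rows(rows, n):
    """Take the first n rows in file order (hypothetical helper)."""
    return list(islice(rows, n))


# Made-up dataset: 1,000 rows whose "amount" grows with row position,
# as might happen in a file sorted by date.
rows = [{"row": i, "amount": float(i)} for i in range(1000)]

head = first_rows(rows, 100)
rng = random.Random(0)           # seeded so the run is repeatable
sample = rng.sample(rows, 100)   # a random sample, for comparison only

head_mean = mean(r["amount"] for r in head)
sample_mean = mean(r["amount"] for r in sample)
full_mean = mean(r["amount"] for r in rows)

print(f"first 100 rows: {head_mean:.1f}")
print(f"random 100 rows: {sample_mean:.1f}")
print(f"all rows: {full_mean:.1f}")
```

In this contrived case the mean of the first 100 rows is far from the mean of the full file, so preliminary values for a sorted dataset may shift once full EDA1 completes.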
Fast EDA and early target selection¶
Fast EDA paves the way for early target selection. Once you have chosen the target, DataRobot populates the project options (partitioning, downsampling, number of workers, etc.) with default values based on your Fast EDA data. You can change and save the options, then set the project to auto-start at the completion of full EDA1. In this way, you do not have to check repeatedly for ingestion completion, which can be time-consuming with very large datasets. If there is any kind of error in the settings or ingest, DataRobot notifies you by email with an informative error message (if configured to do so). Once you set the target and any advanced options, DataRobot saves your project selections, even if you close your browser.
You can set the following at the completion of Fast EDA:
- Initial number of workers
- Modeling mode
- Advanced options
Until full EDA1 completes, you cannot:
Apply early target selection¶
To use early target selection, keep an eye on the Start screen and the processing status reported in the right sidebar. Fast EDA is part of the ingestion process, but if your dataset is too small for early target selection to make sense, you won't be able to modify these selections and EDA1 will go on to complete. If early target selection is applicable to your project, you will see a change in the start screen that indicates early target selection is an option:
To use early target selection:
1. Import your dataset to DataRobot.
2. When Fast EDA completes (part-way through the full EDA1 process), you can enter a target variable. Scroll down to explore your data, where a yellow informational message indicates, approximately, the amount of data used for the preliminary results:
   The informational message disappears after completion of EDA1.
3. Enter a target variable. The Data page displays the auto-start toggle:
4. Click the Show Advanced options link to set additional parameters.
5. If you choose to auto-start the model build process, toggle the auto-start and select a modeling mode:
6. When full EDA1 completes, DataRobot launches the model building process using the criteria you set.
When working with large datasets, there are some differences in behaviors that you should note.
Train into validation and holdout¶
If, when training models, you trained into the validation and/or holdout sets, those scores display N/A on the Leaderboard:
With large datasets, DataRobot disables internal cross-validation when you train models into validation/holdout. For anything over 800MB, DataRobot uses TVH as the default validation method. Because the validation/holdout rows are used for training the model, the scores are not likely to be an accurate representation of the model performance on unseen data (and thus display N/A).
Some additional considerations for models displaying N/A:
- they are not represented in the Learning Curves or Speed vs Accuracy tabs
- the Lift Chart, Feature Fit, and Feature Effects tabs are unavailable
- you cannot compute predictions using the Make Predictions tab
- you cannot run DataRobot Prime
Change model sample size¶
You can change model sample size from either the Leaderboard or the Repository. You cannot, however, change the sample size for blender models. If you want a blender at a different sample size, you can:
- Retrain each constituent model at the new sample size.
- Blend the constituent models to make a new blender.
Understand the messages¶
DataRobot provides some notifications to help you interpret the preliminary data displayed and used for early target selection. For example:
The Smart Downsampling setting is available (for binary classification or zero-boosted regression problems) after you set your target. The notification about the feature indicates the subset of data, in number of rows, that DataRobot will use in modeling. You can change the value in Advanced options:
The dataset notification tells you the number of rows in your dataset that are missing the target variable (and are therefore excluded from model building/predictions):
Additionally, auto-start can return a partitioning error. Because DataRobot bases preliminary calculations for partitioning settings on a subset of your dataset, the partitioning column may look valid during Fast EDA; if its cardinality is outside the given range, auto-start returns an error.
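The following sketch illustrates how a partitioning column can look acceptable in a subset and still fall outside a range once all rows are counted. The data, the `partition_cardinality` and `cardinality_in_range` helpers, and the range of 2 to 10 are all hypothetical; the actual range DataRobot enforces is not stated here.

```python
def partition_cardinality(rows, column):
    """Count distinct, non-missing values in the partitioning column."""
    return len({row[column] for row in rows
                if row.get(column) not in (None, "")})


def cardinality_in_range(cardinality, minimum, maximum):
    """Return True if the cardinality lies within the inclusive range."""
    return minimum <= cardinality <= maximum


# Made-up dataset: 1,000 rows grouped into blocks of 50 by "region".
rows = [{"region": f"r{i // 50}"} for i in range(1000)]
subset = rows[:200]  # stands in for the rows seen by preliminary calculations

# Hypothetical limits, chosen only for this example.
MIN_CARDINALITY, MAX_CARDINALITY = 2, 10

subset_card = partition_cardinality(subset, "region")
full_card = partition_cardinality(rows, "region")

print(subset_card, cardinality_in_range(subset_card, MIN_CARDINALITY, MAX_CARDINALITY))
print(full_card, cardinality_in_range(full_card, MIN_CARDINALITY, MAX_CARDINALITY))
```

Here the subset contains 4 distinct values, which is inside the hypothetical range, while the full dataset contains 20, which is not.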