Smart Downsampling is a technique to reduce total dataset size by reducing the size of the majority class, enabling you to build models faster without sacrificing accuracy. When enabled, all analysis and model building are based on the new dataset size after Smart Downsampling.
When setting the downsampling percentage rate, you are specifying the size of the majority class after Smart Downsampling. For example, a 70% Smart Downsampling rate would downsample a majority class of 100 rows to 70 rows.
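As a quick check of the arithmetic, the sketch below computes the class sizes after downsampling. The helper name `size_after_downsampling` is hypothetical, and rounding a fractional row count to the nearest whole row is an assumption; this page does not say how DataRobot handles fractions.

```python
def size_after_downsampling(majority_rows, minority_rows, rate_percent):
    """Return (majority_kept, total_rows) after downsampling the majority class.

    Hypothetical helper for illustration; not part of any DataRobot API.
    """
    if not 0 < rate_percent <= 100:
        raise ValueError("rate_percent must be in (0, 100]")
    # Assumption: fractional row counts are rounded to the nearest row.
    majority_kept = round(majority_rows * rate_percent / 100)
    # The minority class is never downsampled.
    return majority_kept, majority_kept + minority_rows


majority_kept, total_rows = size_after_downsampling(100, 10, 70)
print(majority_kept, total_rows)  # 70 80
```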
When to downsample
There are two types of problems that benefit from Smart Downsampling:
Imbalanced classification: This is a problem in which one of the two target classes occurs far more frequently than the other in the dataset. For example, a direct mail response dataset might consist of negative responses on 99.5% of the records and positive responses on only 0.5%.
Is imbalanced data ok?
There is a myth that you must balance data for binary classification problems, leading many data scientists to mistakenly resample their data. While upsampling is the worst mistake you can make here, downsampling can cause problems too. Remember, while humans have trouble understanding imbalanced data, computers do not. For example, it is not intuitive that a model that predicts a constant value can be 99% accurate, but that is exactly what happens when 99% of the records share a single target value.
Most classification models optimize for LogLoss, which naturally handles class balance issues. If downsampling is applied, it affects only the majority class. Once downsampled, DataRobot weights the data to correct for the sampling and ensure predicted probabilities are correct.
If there is no need to downsample (or upsample, for that matter), why does DataRobot do it? Downsampling can result in much faster modeling with very similar accuracy.
Zero-inflated regression: This is a problem in which the value zero appears in more than 50% of the dataset. A common example of this is within insurance claim data where, for example, 90% of policies may generate zero loss while the other 10% generate claims of various amounts.
In both cases, DataRobot first downsamples the majority class to make the classes balanced, then adds a weight so that the effect of the resulting dataset mimics the original balance of the classes. The applicable optimization metric indicates that the classes are weighted.
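The downsample-then-weight idea can be sketched as follows. This is a minimal illustration, not DataRobot's internal implementation: the function name `downsample_with_weights` is hypothetical, and weighting each kept majority row by the inverse of the sampling fraction is an assumption (a common way to restore the original class balance).

```python
import random


def downsample_with_weights(targets, majority_value, rate_percent, seed=0):
    """Return (target, weight) pairs after downsampling the majority class.

    Hypothetical sketch: kept majority rows are weighted so that their
    total weight equals the original majority row count.
    """
    rng = random.Random(seed)
    majority = [t for t in targets if t == majority_value]
    minority = [t for t in targets if t != majority_value]
    keep = round(len(majority) * rate_percent / 100)
    if keep < 1:
        raise ValueError("rate_percent leaves no majority rows")
    kept = rng.sample(majority, keep)
    # Assumption: each kept row stands in for (original / kept) rows.
    weight = len(majority) / keep
    return [(t, weight) for t in kept] + [(t, 1.0) for t in minority]


# 990 negatives and 10 positives; keep 10% of the negatives.
targets = [0] * 990 + [1] * 10
sample = downsample_with_weights(targets, majority_value=0, rate_percent=10)

weighted_neg = sum(w for t, w in sample if t == 0)
weighted_pos = sum(w for t, w in sample if t == 1)
print(len(sample))                                   # rows actually modeled
print(weighted_pos / (weighted_neg + weighted_pos))  # weighted positive rate
```

In this example the model would train on 109 rows instead of 1,000, while the weighted positive rate stays at the original 1%, which is why predicted probabilities remain calibrated to the original data.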
Conditions for Smart Downsampling
Consider the following when using Smart Downsampling:
The dataset must be larger than 500 MB.
The target variable must take only two values (binary classification) or it must be numeric with more than 50% of values being exactly zero (zero-boosted regression). With time series projects, modeling with many zeros uses a different calculation.
You cannot select Random Partitioning (it is automatically disabled when you enable Smart Downsampling).
DataRobot will not create anomaly models when Smart Downsampling is enabled.
Once enabled, the selected downsampling percentage rate cannot result in the majority class becoming smaller than the minority class.
If the conditions are not met, you cannot enable the feature. The Smart Downsampling option displays a message indicating that the current target is not a binary classification or zero-boosted regression problem.
When you use simple (binary) classification, DataRobot downsamples the majority class. When you use regression, DataRobot downsamples the zero values. Smart Downsampling is selected by default when both of the following conditions are met:
- The majority class is at least twice the size of the minority class.
- The dataset is larger than 500 MB.
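The size-related conditions above can be expressed as a short sketch. The function names are hypothetical and the logic only restates the rules on this page: the lowest allowed rate is the one at which the majority class shrinks to the size of the minority class, and the feature is on by default for datasets over 500 MB with at least a 2:1 class ratio.

```python
SIZE_THRESHOLD_MB = 500  # dataset size threshold stated in the conditions


def minimum_rate_percent(majority_rows, minority_rows):
    """Lowest rate that keeps the majority class at least as large as the minority."""
    if majority_rows <= 0:
        raise ValueError("majority_rows must be positive")
    return 100 * minority_rows / majority_rows


def enabled_by_default(dataset_mb, majority_rows, minority_rows):
    """True when both default-selection conditions hold."""
    return dataset_mb > SIZE_THRESHOLD_MB and majority_rows >= 2 * minority_rows


print(minimum_rate_percent(8000, 2000))     # 25.0
print(enabled_by_default(750, 8000, 2000))  # True
print(enabled_by_default(400, 8000, 2000))  # False
```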
Enable Smart Downsampling
Enable Smart Downsampling and specify a sampling percentage from the Advanced options link on the Data page:
Import a dataset or open a project for which models have not yet been built and enter a target variable that results in a binary classification or zero-boosted regression problem.
Click the Show advanced options link and select the Smart Downsampling option.
Toggle Downsample Data to ON:
By typing in the box or using the slider, enter the majority class downsampling percentage rate. Note the following:
The minimum percentage is the smallest percentage allowed. Any rate below the indicated minimum will result in a majority class that is smaller than the minority class.
As you change the percentage rate, the majority row count listed under “Results of downsampling…” updates to indicate the new size of the majority class.
Scroll to the top of the page, choose a modeling mode, and click Start to begin modeling.
When model building is complete, select Models from the toolbar. The Leaderboard displays an icon indicating that model results are based on downsampling.
Click the icon for a report of the downsampling results:
From the report, you can see that readmitted=true, the minority class, was not modified by downsampling. The majority class, readmitted=false, was reduced by 25%. In other words, the percentage of the majority class that was maintained was 75%.