Modeling process¶
This section provides more detail to help understand DataRobot's initial model building process.
- More on modeling modes, such as small datasets and Quick Autopilot
- Two-stage models (Frequency and Severity models).
- Data summary information
- Handling project build failure
- Working with missing values
See also:
- The data quality assessment, which automatically detects, and in some cases addresses, data quality issues.
- The basic modeling process section for a workflow overview.
- The list of modeling algorithms used by DataRobot.
Modeling modes¶
The exact actions and options for a modeling mode depend on your data. In addition to the standard descriptions of mode behavior, the following sections describe modeling behavior that depends on the circumstances of your project.
Small datasets¶
Autopilot for AutoML changes the sample percentages run depending on the number of rows in the dataset. The following table describes the criteria:
Number of rows | Percentages run |
---|---|
Less than 2000 | Final Autopilot stage only (64%) |
2000 to 3999 | Final two Autopilot stages (32% and 64%) |
4000 and larger | All stages of Autopilot (16%, 32%, and 64%) |
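As a quick illustration of the table above, the following sketch (plain Python, with a made-up helper name) maps a row count to the sample percentages run:

```python
def autopilot_stages(n_rows: int) -> list[int]:
    """Return the Autopilot sample percentages run for a dataset of n_rows.

    Mirrors the table above: smaller datasets skip the earlier,
    smaller-sample stages and run only the final stage(s).
    """
    if n_rows < 2000:
        return [64]            # final stage only
    elif n_rows < 4000:
        return [32, 64]        # final two stages
    else:
        return [16, 32, 64]    # all stages

print(autopilot_stages(1500))   # [64]
print(autopilot_stages(2500))   # [32, 64]
print(autopilot_stages(10000))  # [16, 32, 64]
```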
Quick Autopilot¶
Quick Autopilot is the default modeling mode. It is optimized so that, typically, more accurate models are available without sacrificing the variety of tested options; as a result, reference models are not run. DataRobot runs supported models on a sample of data that depends on the project type:
Project type | Sample size |
---|---|
AutoML | Typically 64% of data or 500MB, whichever is smaller. |
OTV | 100% of each backtest. |
Time series | Maximum training size for each backtest defined in the date/time partitioning. |
With this shortened version of the full Autopilot, DataRobot selects models to run based on a variety of criteria, including target and performance metric, but as its name suggests, chooses only models with relatively short training runtimes to support quicker experimentation. The specific number of Quick models run varies by project and target type (e.g., some blueprints are only available for a specific target/target distribution). The Average blender, when enabled, is created from the top two models. To maximize runtime efficiency in Quick mode, DataRobot automatically creates the DR Reduced Features list but does not automatically fit the recommended (or any) model to it (fitting the reduced list requires retraining models).
The steps involved in Quick mode depend on whether the Recommend and prepare a model for deployment option is checked.
Option state | Action |
---|---|
Checked | Run Quick mode at 64%, then run the model recommendation process to prepare the most accurate model for deployment. |
Unchecked | Run Quick mode at 64%. |
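If you prefer to work programmatically, the following sketch shows one way to start a project in Quick mode with the DataRobot Python client; the endpoint, token, file path, and target name are placeholders, and exact client options may vary by version:

```python
import datarobot as dr

# Connect using an API token and endpoint (placeholders).
dr.Client(endpoint="https://app.datarobot.com/api/v2", token="YOUR_API_TOKEN")

# Create a project from a local file (placeholder path and name).
project = dr.Project.create("claims.csv", project_name="Quick mode example")

# Start modeling in Quick mode (the default modeling mode).
project.set_target(
    target="claim_amount",          # placeholder target column
    mode=dr.AUTOPILOT_MODE.QUICK,
)
project.wait_for_autopilot()
```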
For single column text datasets, DataRobot runs the following models:
- Elastic Net (text-capable)
- Single-column text models (for word clouds)
- SVM on the document-term matrix
For projects with Offset or Exposure set, DataRobot runs the following:
- XGBoost
- Elastic Net (text-capable)
- LightGBM
- ASVM
- Scikit learn GBM
- GA2M + rating table
- Eureqa GAM
- Single-column text models (for word clouds)
Two-stage models¶
Some datasets result in a two-stage modeling process; these projects create additional models not otherwise available: Frequency and Severity models. This two-stage process, and the resulting additional model types, occurs in regression projects when the target is zero-inflated (that is, greater than 50% of rows in the dataset have a value of 0 for the target feature). These methods are most frequently applicable in insurance and in operational risk and loss modeling, for example, insurance claims modeling, foreclosure frequency with loss severity, and frequent flyer points redemption activity.
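As a rough illustration (not DataRobot's internal check), the zero-inflation condition described above could be tested like this:

```python
import pandas as pd

def is_zero_inflated(target: pd.Series) -> bool:
    """Return True if more than 50% of target values are exactly 0."""
    return (target == 0).mean() > 0.5

# Example: a claim payout column where most policyholders filed no claim.
claims = pd.Series([0, 0, 0, 0, 1200.0, 0, 350.0, 0])
print(is_zero_inflated(claims))  # True: 6 of 8 rows are 0
```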
For qualifying models (see below), you can view stage-related information in the following tabs:
- In the Coefficients tab, DataRobot graphs the parameters corresponding to the selected stage for linear models. Additionally, if you export the coefficients, two additional columns, `Frequency_Coefficient` and `Severity_Coefficient`, provide the coefficients at each stage.
- In the Advanced Tuning tab, DataRobot graphs the parameters corresponding to the selected stage.
DataRobot automatically runs some models built to support the frequency/severity methods as part of Autopilot; additional models are available in the Repository. The models in which staging is available can be identified by the prefix "Frequency-Cost" or "Frequency-Severity" and include the following:
- XGBoost*
- LightGBM*
- Generalized Additive Models
- Elastic Net
* Coefficients are not available for these models
Example use case: insurance
Zach is building an insurance claim model based on frequency (the number of times a policyholder made a claim) and severity (the cost of the claim). Zach wants to predict the payout amount of claims for a potential policyholder in the coming year. Generally, most policyholders don't have accidents and so don't file claims. Therefore, in a dataset where each row represents one policyholder and the target is claim payout, the target column for most rows will be $0. Zach's dataset has a zero-inflated target: most policyholders represented in the training data have $0 as their target value. In this project, DataRobot will build several Frequency-Cost and Frequency-Severity models.
Data summary information¶
The following information assumes that you have selected a target feature and started the modeling process.
Once modeling begins, DataRobot analyzes the data and presents this information in the Project Data tab of the Data page. Data features are listed in order of importance in predicting the target variable. DataRobot also detects the data (variable) type of each feature; supported data types are:
- numeric
- categorical
- date
- percentage
- currency
- length
- text
- summarized categorical
- multicategorical
- multilabel
- date duration (OTV projects)
- location (Location AI projects)
Additional information on the Data page includes:
- Unique and missing values
- Mean, median, standard deviation, and minimum and maximum values
- Informational tags
- Feature importance
- Access to tabs that allow you to work with feature lists and investigate feature associations
Importance score¶
The Importance bars show the degree to which a feature is correlated with the target. These bars are based on "Alternating Conditional Expectations" (ACE) scores. ACE scores are capable of detecting non-linear relationships with the target, but as they are univariate, they are unable to detect interaction effects between features. Importance is calculated using an algorithm that measures the information content of the variable; this calculation is done independently for each feature in the dataset. The importance score has two components, `Value` and `Normalized Value`:

- `Value`: This shows the metric score you should expect (more or less) if you build a model using only that variable. For multiclass projects, `Value` is calculated as the weighted average from the binary univariate models for each class. For binary classification and regression, `Value` is calculated from a univariate model evaluated on the validation set using the selected project metric.
- `Normalized Value`: `Value` normalized so that scores range up to 1 (higher scores are better). A score of 0 means accuracy is the same as predicting the training target mean. Scores of less than 0 mean the ACE model prediction is worse than the target mean model (overfitting).
These scores represent a measure of predictive power for a simple model using only that variable to predict the target. (The score is adjusted by exposure if you set the Exposure parameter.) Scores are measured using the project's accuracy metric.
Features are ranked from most important to least important. The length of the green bar next to each feature indicates its relative importance: the amount of green compared to the total length of the bar, which shows the maximum potential feature importance (and is proportional to the `Normalized Value`). The more green in the bar, the more important the feature. Hovering over the green bar shows both scores. These numbers represent the score, in relation to the project metric, for a model that uses only that feature (the metric selected when the project was run). Changing the metric on the Leaderboard has no effect on the tooltip scores.
Click a feature name to view details of the data values. While the values change between EDA1 and EDA2 (e.g., rows are removed because they are part of the holdout or have missing values), the meaning of the charts and the options are the same.
Automated feature transformations¶
Feature engineering is a key part of the modeling process. After pressing Start, DataRobot performs automated feature engineering on the given dataset to create derived variables in order to enhance model accuracy. See the table below for a list of feature engineering tasks DataRobot may perform during modeling for each feature type:
Feature type | Automated transformations |
---|---|
Numeric and categorical | |
Date | |
Text | |
Images | DataRobot uses featurizers to turn images into numbers. DataRobot also allows you to fine-tune these featurizers. |
Geospatial | DataRobot uses several techniques to automatically derive spatially-lagged features from the input dataset. DataRobot also derives features for several geometric properties. |
Text vs. categorical features¶
DataRobot runs heuristics to differentiate text from categorical features, including the following (a simplified sketch follows this list):

- If the number of unique rows is less than 5% of the column size, or if there are fewer than 60 unique rows, the column is classified as categorical.
- Using the Python language identifier `langid`, DataRobot attempts to detect a language. If no language is detected, the column is classified as categorical.
- Languages are categorized as either Japanese/Chinese/Korean or English and all other languages ("English+"). If at least three of the following checks pass, the feature is classified as text:

    English+
    - (Number of unique lines / total number of lines) > 0.3, or the number of unique lines > 1000.
    - The mean number of spaces per line is at least 1.5.
    - 10% or more of lines have at least 4 words.
    - The longest line has at least 6 words.

    Japanese/Chinese/Korean
    - (Number of unique lines / total number of lines) > 0.3, or the number of unique lines > 1000.
    - The mean line length is at least 4 characters.
    - 10% or more of lines have at least 7 characters.
    - The longest line has at least 12 characters.
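The following simplified sketch (illustrative only, not DataRobot's implementation) shows how the per-language checks above might be counted:

```python
def looks_like_text(lines: list[str], cjk: bool = False) -> bool:
    """Count how many of the four checks pass; three or more means text."""
    n = len(lines)
    unique = len(set(lines))
    checks = [unique / n > 0.3 or unique > 1000]

    if cjk:  # Japanese/Chinese/Korean rules are character-based
        checks += [
            sum(len(line) for line in lines) / n >= 4,
            sum(len(line) >= 7 for line in lines) / n >= 0.10,
            max(len(line) for line in lines) >= 12,
        ]
    else:    # "English+" rules are word/space-based
        checks += [
            sum(line.count(" ") for line in lines) / n >= 1.5,
            sum(len(line.split()) >= 4 for line in lines) / n >= 0.10,
            max(len(line.split()) for line in lines) >= 6,
        ]
    return sum(checks) >= 3

print(looks_like_text([
    "the quick brown fox jumps",
    "a longer sentence with many words",
    "short",
]))  # True
```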
Manual feature transformations allow you to override the automated assignment, but because this can cause errors, DataRobot provides a warning during the transformation process.
Missing values¶
DataRobot handles missing values differently, depending on the model and/or value type. The following are the codes DataRobot recognizes and treats as missing values:
Special NaN values for all feature types:

- `null`, `NULL`
- `na`, `NA`, `n/a`, `#N/A`, `N/A`, `#NA`, `#N/A N/A`
- `1.#IND`, `-1.#IND`
- `NaN`, `nan`, `-NaN`, `-nan`
- `1.#QNAN`, `-1.#QNAN`
- `?`
- `.`
- `Inf`, `INF`, `inf`, `-Inf`, `-INF`, `-inf`
- `None`
- One or more whitespace characters and empty cells are also treated as missing values.
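For reference, the list above can be reproduced as a Python set to pre-screen data before upload (an illustrative sketch, not DataRobot code):

```python
# Tokens treated as missing, per the list above,
# plus whitespace-only and empty cells.
NA_TOKENS = {
    "null", "NULL",
    "na", "NA", "n/a", "#N/A", "N/A", "#NA", "#N/A N/A",
    "1.#IND", "-1.#IND",
    "NaN", "nan", "-NaN", "-nan",
    "1.#QNAN", "-1.#QNAN",
    "?", ".",
    "Inf", "INF", "inf", "-Inf", "-INF", "-inf",
    "None",
}

def is_missing(cell: str) -> bool:
    return cell in NA_TOKENS or cell.strip() == ""

print([is_missing(v) for v in ["42", "n/a", "  ", "Inf"]])  # [False, True, True, True]
```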
The following notes describe some specifics of DataRobot's value handling.
Note
The missing value imputation method is fixed at training time. Either the median or the arbitrary value set during training is used at prediction time.
- Some models natively handle missing values, so no special preprocessing is needed.
- For linear models (such as linear regression or an SVM), DataRobot's handling depends on the case:
    - Median imputation: DataRobot imputes missing values using the median of the non-missing training data. This effectively handles data that are missing-at-random.
    - Missing value flag: DataRobot adds a binary "missing value flag" for each variable with any missing values, allowing the model to recognize the pattern in structurally missing values and learn from it. This effectively handles data that are missing-not-at-random.
- For tree-based models, DataRobot imputes with an arbitrary value (e.g., -9999) rather than the median. This method is faster and gives just as accurate a result.
- For categorical variables in all models, DataRobot treats missing values as another level in the categories.
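The sketch below (pandas, illustrative only) contrasts the strategies described above: median imputation plus a missing-value flag for linear-style models, and an arbitrary sentinel value for tree-based models.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [52000, np.nan, 61000, np.nan, 48000]})

# Flag which rows were missing (lets a model learn from the "missingness" itself).
df["income_missing_flag"] = df["income"].isna().astype(int)

# Median imputation for linear-style models...
median = df["income"].median()   # learned at training time, reused at prediction time
df["income_median_imputed"] = df["income"].fillna(median)

# ...versus an arbitrary sentinel value (e.g., -9999) for tree-based models.
df["income_tree_imputed"] = df["income"].fillna(-9999)

print(df)
```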
Numeric columns¶
DataRobot assigns a var type to a value during EDA. For numeric columns, there are three types of values:
- Numeric values: these can be integers or floating point numbers.
- Special NaN values (listed in the table above): these are not numeric, but are recognized as representative of NaN.
- All other values: for example, string or text data.
Following are the rules DataRobot uses when determining if a particular column is treated as numeric, and how it handles the column at prediction time:
- Strict Numeric: If a column has only numeric and special NaN values, DataRobot treats the column as numeric. At prediction time, DataRobot accepts any of the same special NaN values as missing and makes predictions. If any other value is present, DataRobot errors.
- Permissive Numeric: If a column has numeric values, special NaN values, and one (and only one) other value, DataRobot treats that other value as missing and treats the column as numeric. At prediction time, all other values are treated as missing (regardless of whether they differ from the first one).
- Categorical: If DataRobot finds two or more other (non-numeric and non-NaN) values in a column during EDA, it treats the feature as categorical instead of numeric.
- If DataRobot does not encounter any other value during EDA sampling and categorizes the feature as numeric, it checks the full dataset for that column before training (but after EDA). If any other values appear in the full dataset, the column is treated as permissive numeric; if not, it is strict numeric.
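As a rough sketch of the column-typing rules above (illustrative only, using a small sample of the NA tokens), a column of raw strings could be classified like this:

```python
def classify_column(values: list[str], na_tokens: set[str]) -> str:
    """Classify a raw string column per the rules above (simplified)."""
    def is_numeric(v: str) -> bool:
        try:
            float(v)
            return True
        except ValueError:
            return False

    # "Other" values: not numeric, not a recognized NA token, not blank.
    other = {v for v in values
             if v not in na_tokens and v.strip() != "" and not is_numeric(v)}
    if len(other) == 0:
        return "strict numeric"
    if len(other) == 1:
        return "permissive numeric"   # the single other value is treated as missing
    return "categorical"

na = {"NA", "n/a", "None"}
print(classify_column(["1", "2.5", "NA", "3"], na))        # strict numeric
print(classify_column(["1", "2.5", "unknown", "3"], na))   # permissive numeric
print(classify_column(["1", "low", "high", "3"], na))      # categorical
```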