Modeling process¶
This section provides more detail to help understand DataRobot's initial model building process.
 More on modeling modes, such as small datasets and Quick Autopilot
 Two-stage models (Frequency and Severity models)
 Data summary information
 Handling project build failure
 Working with missing values
DataRobot also runs a complete data quality assessment that automatically detects, and in some cases addresses, data quality issues. See also the basic modeling process section for a workflow overview.
Modeling modes¶
The exact action and options for a modeling mode are dependent on your data. In addition to the basic description, the following sections describe circumstantial modeling behavior.
Small datasets¶
Autopilot for AutoML changes the sample percentages run depending on the number of rows in the dataset. The following table describes the criteria:
Number of rows  Percentages run 

Less than 2000  Final Autopilot stage only (64%) 
Between 2001 and 3999  Final two Autopilot stages (32% and 64%) 
4000 and larger  All stages of Autopilot (16%, 32%, and 64%) 
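The table above can be read as a simple lookup. The following is a minimal sketch, not DataRobot's implementation; the function name is hypothetical, and because the table does not say which band a dataset of exactly 2000 rows falls into, the sketch assumes it belongs to the middle band.

```python
def autopilot_sample_percentages(n_rows: int) -> list[int]:
    """Return the sample percentages run for a dataset with n_rows rows."""
    if n_rows < 2000:
        # Final Autopilot stage only.
        return [64]
    if n_rows < 4000:
        # Final two stages. Placing exactly 2000 rows here is an assumption;
        # the table leaves that value unspecified.
        return [32, 64]
    # All stages of Autopilot.
    return [16, 32, 64]


print(autopilot_sample_percentages(3000))
```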
Quick Autopilot¶
Quick Autopilot is the default modeling mode. It has been optimized to typically produce more accurate models without sacrificing the variety of options tested. As a result, reference models are not run.
When used, DataRobot runs the following models during the first stage of Quick (32%) on a "typical" dataset (other cases are described below):
 TensorFlow
 XGBoost
 LightGBM
 Elastic Net, Ridge regressor, or Lasso regressor (text-capable)
 Nystroem Kernel SVM
 Random Forest
 Vowpal Wabbit
 Spark Random Forest
 GA2M
 Single-column text models (for word clouds)
 R GBM
 RuleFit
DataRobot then initiates stage 2, running at 64%, on the top four models from stage 1. The "average" blender is created from the top two models of stage 2.
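The promotion and blending steps described above can be sketched as follows. This is an illustration only: the function names are hypothetical, the sketch assumes a metric where lower scores are better, and it assumes the "average" blender is a plain mean of the two models' predictions.

```python
from statistics import mean


def promote_to_stage_two(stage1_scores, top_n=4):
    """Return the names of the top_n models from stage 1 (lower score assumed better)."""
    ranked = sorted(stage1_scores, key=stage1_scores.get)
    return ranked[:top_n]


def average_blend(stage2_scores, predictions, top_n=2):
    """Average, row by row, the predictions of the top_n stage 2 models."""
    best = sorted(stage2_scores, key=stage2_scores.get)[:top_n]
    rows = zip(*(predictions[name] for name in best))
    return [mean(row) for row in rows]


stage1 = {"a": 0.3, "b": 0.1, "c": 0.5, "d": 0.2, "e": 0.4}
print(promote_to_stage_two(stage1))
```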
For single-column text datasets, DataRobot runs the following models:
 Elastic Net (text-capable)
 Single-column text models (for word clouds)
 SVM on the document-term matrix
For projects with Offset or Exposure set, DataRobot runs the following:
 XGBoost
 Elastic Net (text-capable)
 LightGBM
 ASVM
 Scikit-learn GBM
 GA2M + rating table
 Eureqa GAM
 Single-column text models (for word clouds)
Two-stage models¶
Some datasets result in a two-stage modeling process; these projects create additional models not otherwise available: Frequency and Severity models. The two-stage process, and the resulting additional model types, occurs in regression projects when the target is zero-inflated (that is, more than 50% of rows in the dataset have a value of 0 for the target feature). These methods are most frequently applicable in insurance and in operational risk and loss modeling, for example insurance claims, foreclosure frequency with loss severity, and frequent flyer points redemption activity.
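The zero-inflation criterion can be checked directly on a target column. The following is a minimal sketch of that check; the function name and the `threshold` parameter are hypothetical and only restate the 50% rule given above.

```python
def is_zero_inflated(target_values, threshold=0.5):
    """Return True if more than `threshold` of the target values equal 0."""
    values = list(target_values)
    if not values:
        return False
    zero_share = sum(1 for v in values if v == 0) / len(values)
    # Strictly greater than the threshold, matching "greater than 50% of rows".
    return zero_share > threshold


print(is_zero_inflated([0, 0, 0, 120.0]))
```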
For qualifying models (see below), you can view stage-related information in the following tabs:
 In the Coefficients tab, DataRobot graphs parameters corresponding to the selected stage for linear models. Additionally, if you export the coefficients, two additional columns, Frequency_Coefficient and Severity_Coefficient, provide the coefficients at each stage.
 In the Advanced Tuning tab, DataRobot graphs the parameters corresponding to the selected stage.
DataRobot automatically runs some models built to support the frequency/severity methods as part of Autopilot; additional models are available in the Repository. Models for which staging is available are identified by the prefix "Frequency-Cost" or "Frequency-Severity" and include the following:
 XGBoost*
 LightGBM*
 Generalized Additive Models
 Elastic Net
* Coefficients are not available for these models
Example use case: insurance
Zach is building an insurance claim model based on frequency (the number of times that a policyholder made a claim) and severity (the cost of the claim). Zach wants to predict the payout amount of claims for a potential policyholder in the coming year. Generally, most policyholders don't have accidents and so don't file claims. Therefore, in a dataset where each row represents one policyholder and the target is claim payouts, the target column for most rows will be $0. Zach's dataset has a zero-inflated target: most policyholders represented in the training data have $0 as their target value. In this project, DataRobot will build several Frequency-Cost and Frequency-Severity models.
Data summary information¶
The following information assumes that you have selected a target feature and started the modeling process.
After you select a target variable and begin modeling, DataRobot analyzes the data and presents this information in the Project Data tab of the Data page. Data features are listed in order of importance in predicting the target variable. DataRobot also detects the data (variable) type of each feature; supported data types are:
 numeric
 categorical
 date
 percentage
 currency
 length
 text
 summarized categorical
 multicategorical
 multilabel
 date duration (OTV projects)
 location (Location AI projects)
Text vs. categorical features¶
DataRobot runs heuristics to differentiate text from categorical features, including the following:

If the number of unique rows is less than 5% of the column size, or if there are fewer than 60 unique rows, the column is classified as categorical.

Using the Python language identifier langid, DataRobot attempts to detect a language. If no language is detected, the column is classified as categorical.

Languages are categorized as either Japanese/Chinese/Korean or English and all other languages ("English+"). If at least three of the following checks pass, the feature is classified as text:
English+
 (Number of unique lines / total number of lines > 0.3) or number of unique lines > 1000.
 The mean number of spaces per line is at least 1.5.
 10% or more lines have at least 4 words.
 The longest line has at least 6 words.
Japanese/Chinese/Korean
 (Number of unique lines / total number of lines > 0.3) or number of unique lines > 1000.
 The mean line length is at least 4 characters.
 10% or more lines have at least 7 characters.
 The longest line has at least 12 characters.
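The heuristics above can be sketched in a few lines. This is an illustration under stated assumptions, not DataRobot's code: the function names and the `language_group` values ("english_plus", "cjk", or None when no language is detected) are hypothetical, language detection itself is left to the caller, whitespace splitting stands in for word counting, and a column that passes fewer than three checks is assumed to fall back to categorical.

```python
def passes_text_checks(lines, language_group):
    """Return True if at least three of the four checks for the language group pass."""
    total = len(lines)
    n_unique = len(set(lines))
    checks = [n_unique / total > 0.3 or n_unique > 1000]
    if language_group == "cjk":
        lengths = [len(line) for line in lines]
        checks += [
            sum(lengths) / total >= 4,                          # mean line length
            sum(1 for n in lengths if n >= 7) / total >= 0.10,  # share of long lines
            max(lengths) >= 12,                                 # longest line
        ]
    else:
        words = [len(line.split()) for line in lines]
        checks += [
            sum(line.count(" ") for line in lines) / total >= 1.5,  # mean spaces
            sum(1 for n in words if n >= 4) / total >= 0.10,         # share with 4+ words
            max(words) >= 6,                                         # longest line
        ]
    return sum(checks) >= 3


def classify_column(values, language_group):
    """Classify a column as 'text' or 'categorical' using the rules above."""
    lines = [str(v) for v in values]
    n_unique = len(set(lines))
    if not lines or n_unique < 0.05 * len(lines) or n_unique < 60:
        return "categorical"
    if language_group is None:  # no language detected
        return "categorical"
    # Falling back to categorical when the checks fail is an assumption.
    return "text" if passes_text_checks(lines, language_group) else "categorical"


notes = [f"customer {i} reported a problem with the device" for i in range(100)]
print(classify_column(notes, "english_plus"))
```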
Manual feature transformations allow you to override the automated assignment, but because this can cause errors, DataRobot provides a warning during the transformation process.
Additional information on the Data page includes:
 Unique and missing values
 Mean, median, standard deviation, and minimum and maximum values
 Informational tags
 Feature importance
 Access to tabs that allow you to work with feature lists and investigate feature associations
Importance score¶
The Importance bars show the degree to which a feature is correlated with the target. These bars are based on "Alternating Conditional Expectations" (ACE) scores. ACE scores are capable of detecting nonlinear relationships with the target, but as they are univariate, they are unable to detect interaction effects between features. Importance is calculated using an algorithm that measures the information content of the variable; this calculation is done independently for each feature in the dataset. The importance score has two components, Value and Normalized Value:
 Value: For binary classification and regression, predictions from a univariate model are evaluated on the validation set using the selected project metric. This is useful because it shows roughly the metric score you should expect if you build a model using only that variable. For multiclass, Value is calculated as the weighted average from the binary univariate models for each class.
 Normalized Value: Value normalized; scores range up to 1 (higher scores are better). 0 means accuracy is the same as predicting the training target mean. Scores of less than 0 mean the ACE model prediction is worse than the target mean model (overfitting).
These scores represent a measure of predictive power of a simple model using only that variable to predict the target. (The score is adjusted by exposure if you set the Exposure parameter.) Scores are measured using the project's accuracy metric.
Features are ranked from most important (most green in a bar) to least important (least green in a bar). The length of the green bar next to each feature indicates its relative importance. The length of the green bar on the EDA screen is proportional to the Normalized Value and is adjusted so that maximum feature importance is represented by the full bar. Hovering on the green bar shows both scores. These numbers represent the score in relation to the project metric for a model that uses only that feature (the metric selected when the project was run). Changing the metric in the Leaderboard has no effect on the tooltip scores.
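The bar scaling described above amounts to dividing each score by the largest one. The following is a minimal sketch with hypothetical names; drawing negative scores as an empty bar is an assumption, since the text does not say how they are displayed.

```python
def bar_fractions(normalized_values):
    """Map each feature to the fraction of the full bar it occupies."""
    top = max(normalized_values.values(), default=0.0)
    if top <= 0:
        # No feature beats the target-mean baseline; draw empty bars.
        return {name: 0.0 for name in normalized_values}
    # Scale so the most important feature fills the bar. Clipping negative
    # scores to zero is an assumption of this sketch.
    return {name: max(value, 0.0) / top for name, value in normalized_values.items()}


print(bar_fractions({"age": 0.4, "income": 0.2, "noise": -0.1}))
```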
Click a feature name to view details of the data values. While the values change between EDA1 and EDA2 (e.g., rows are removed because they are part of holdout or they are missing values), the meaning of the charts and the options are the same.
Missing values¶
DataRobot handles missing values differently, depending on the model and/or value type. The following are the codes DataRobot recognizes and treats as missing values:
Special NaN Values for all feature types
null, NULL
na, NA, n/a, #N/A, N/A, #NA, #N/A N/A
1.#IND, -1.#IND
NaN, nan, -NaN, -nan
1.#QNAN, -1.#QNAN
?
.
Inf, INF, inf, -Inf, -INF, -inf
<space>
None
<blank>
Special NaN Values for numeric only
Infinity, INFINITY, -Infinity, -INFINITY
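A lookup against the codes listed above might look like the following. It is a sketch only: the names are hypothetical, matching is assumed to be exact (no trimming or case folding), `<space>` and `<blank>` are taken to mean a single space and an empty string, and the negative-signed variants are assumed to belong to the list.

```python
# Codes treated as missing for every feature type.
MISSING_ALL_TYPES = {
    "null", "NULL",
    "na", "NA", "n/a", "#N/A", "N/A", "#NA", "#N/A N/A",
    "1.#IND", "-1.#IND",
    "NaN", "nan", "-NaN", "-nan",
    "1.#QNAN", "-1.#QNAN",
    "?", ".",
    "Inf", "INF", "inf", "-Inf", "-INF", "-inf",
    " ",      # <space>
    "None",
    "",       # <blank>
}

# Codes treated as missing for numeric features only.
MISSING_NUMERIC_ONLY = {"Infinity", "INFINITY", "-Infinity", "-INFINITY"}


def is_missing(raw, numeric=False):
    """Return True if the raw string is one of the recognized missing-value codes."""
    if raw is None or raw in MISSING_ALL_TYPES:
        return True
    return numeric and raw in MISSING_NUMERIC_ONLY


print(is_missing("n/a"), is_missing("Infinity"), is_missing("Infinity", numeric=True))
```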
The following notes describe some specifics of DataRobot's value handling:

Some models natively handle missing values so that no special preprocessing is needed.

For linear models (such as linear regression or an SVM), DataRobot's handling depends on the case:
 median imputation: DataRobot imputes missing values, using the median of the non-missing training data. This effectively handles data that are missing-at-random.
 missing value flag: DataRobot adds a binary "missing value flag" for each variable with any missing values, allowing the model to recognize the pattern in structurally missing values and learn from it. This effectively handles data that are missing-not-at-random.

For tree-based models, DataRobot imputes with an arbitrary value (e.g., -9999) rather than the median. This method is faster and gives just as accurate a result.

For categorical variables in all models, DataRobot treats missing values as another level in the categories.
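The numeric strategies in the notes above can be illustrated with plain Python. This is a sketch under assumptions, not DataRobot's preprocessing: the function names are hypothetical, missing values are represented as None, the first function applies both median imputation and the flag together, and -9999 is used only as an example sentinel.

```python
from statistics import median


def impute_for_linear(train_column):
    """Median-impute a numeric column and build a binary missing-value flag."""
    observed = [v for v in train_column if v is not None]
    fill = median(observed)  # median of the non-missing training data
    imputed = [fill if v is None else v for v in train_column]
    flags = [1 if v is None else 0 for v in train_column]
    return imputed, flags


def impute_for_trees(column, sentinel=-9999):
    """Replace missing values with an arbitrary out-of-range value."""
    return [sentinel if v is None else v for v in column]


print(impute_for_linear([1.0, None, 3.0, 5.0]))
print(impute_for_trees([1.0, None]))
```

The flag column lets a linear model assign its own weight to "this value was missing", which a median fill alone cannot express.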
Numeric columns¶
DataRobot assigns a variable type (var type) to each column during EDA. For numeric columns, there are three types of values:
 Numeric values: these can be integers or floating point numbers.
 Special NaN values (listed in the table above): these are not numeric, but are recognized as representative of NaN.
 All other values: for example, string or text data.
Following are the rules DataRobot uses when determining if a particular column is treated as numeric, and how it handles the column at prediction time:

Strict Numeric: If a column has only numeric and special NaN values, DataRobot treats the column as numeric. At prediction time, DataRobot accepts any of the same special NaN values as missing and makes predictions. If any other value is present, DataRobot returns an error.

Permissive Numeric: If a column has numeric values, special NaN values and one (and only one) other value, DataRobot treats that other value as missing and treats the column as numeric. At prediction time, all other values are treated as missing (regardless of whether they differ from the first one).

Categorical: If DataRobot finds two or more other (non-numeric and non-NaN) values in a column during EDA, it treats the feature as categorical instead of numeric.

If DataRobot does not process any other value during EDA sampling and categorizes the feature as numeric, before training (but after EDA) it "looks" at the full dataset for that column. If any other values are seen for the full dataset, the column is treated as permissive numeric. If not, it is strict numeric.
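The three rules above reduce to counting the distinct "other" values in a column. The following is a minimal sketch: the names are hypothetical, `SPECIAL_NAN` is an abbreviated stand-in for the full list of special NaN values, values are assumed to arrive as strings, and anything Python's `float()` can parse is treated as numeric.

```python
# Abbreviated subset of the special NaN codes, for illustration only.
SPECIAL_NAN = {"", " ", "NA", "N/A", "null", "NULL", "NaN", "nan", "None", "?"}


def classify_numeric_column(values):
    """Return 'strict numeric', 'permissive numeric', or 'categorical'."""
    other_values = set()
    for raw in values:
        if raw in SPECIAL_NAN:
            continue  # recognized as missing
        try:
            float(raw)  # integers and floating point numbers
        except ValueError:
            other_values.add(raw)  # neither numeric nor a special NaN code
    if len(other_values) >= 2:
        return "categorical"
    if len(other_values) == 1:
        return "permissive numeric"
    return "strict numeric"


print(classify_numeric_column(["1", "2", "unknown", "NA"]))
```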