

Modeling process

This section provides more detail to help understand DataRobot's initial model building process.

DataRobot also runs a complete data quality assessment that automatically detects, and in some cases addresses, data quality issues. See also the basic modeling process section for a workflow overview.

Modeling modes

The exact action and options for a modeling mode are dependent on your data. In addition to the standard description of mode behavior, the following sections describe circumstantial modeling behavior.

DataRobot supports Tree-based models, Deep Learning models, Support Vector Machines (SVM), Generalized Linear Models, Anomaly Detection models, Text Mining models, and many more. Specifically:

  • TensorFlow
  • XGBoost
  • LightGBM
  • Elastic Net, Ridge regressor, or Lasso regressor (text-capable)
  • Nystroem Kernel SVM
  • Random Forest
  • GA2M
  • Single-column text models (for word clouds)
  • Rulefit

Small datasets

Autopilot for AutoML adjusts which sample percentages it runs based on the number of rows in the dataset. The following table describes the criteria:

Number of rows            Percentages run
Less than 2000            Final Autopilot stage only (64%)
Between 2001 and 3999     Final two Autopilot stages (32% and 64%)
4000 and larger           All stages of Autopilot (16%, 32%, and 64%)
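
The row-count thresholds in the table can be expressed as a small lookup. This is a minimal sketch, not DataRobot's implementation: the function name is hypothetical, and because the table does not say what happens at exactly 2000 rows, grouping that case with the middle band is an assumption.

```python
def autopilot_sample_percentages(n_rows: int) -> list[int]:
    """Return the sample percentages run for a dataset of n_rows rows.

    Mirrors the table above. The function name is hypothetical.
    """
    if n_rows < 2000:
        # Final Autopilot stage only.
        return [64]
    if n_rows < 4000:
        # Final two stages. The table lists 2001 to 3999; treating exactly
        # 2000 rows as part of this band is an assumption of this sketch.
        return [32, 64]
    # All stages of Autopilot.
    return [16, 32, 64]


print(autopilot_sample_percentages(1500))
print(autopilot_sample_percentages(3000))
print(autopilot_sample_percentages(10000))
```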

Quick Autopilot

Quick Autopilot is the default modeling mode. It has been optimized to typically produce more accurate models without sacrificing the variety of options tested. As a result, reference models are not run.

On a "typical" dataset (other cases are described below), the first stage of Quick runs the supported models at a 32% sample size. DataRobot then initiates stage 2, running the top four models from stage 1 at 64% (or 500MB of data, whichever is smaller). The "average" blender is created from the top two models of stage 2.

For single column text datasets, DataRobot runs the following models:

  • Elastic Net (text-capable)
  • Single-column text models (for word clouds)
  • SVM on the document-term matrix

For projects with Offset or Exposure set, DataRobot runs the following:

  • XGBoost
  • Elastic Net (text-capable)
  • LightGBM
  • ASVM
  • Scikit learn GBM
  • GA2M + rating table
  • Eureqa GAM
  • Single-column text models (for word clouds)

Two-stage models

Some datasets result in a two-stage modeling process; these projects create additional models not otherwise available: Frequency and Severity models. The two-stage process, and the resulting additional model types, is used in regression projects when the target is zero-inflated (that is, more than 50% of rows in the dataset have a value of 0 for the target feature). These methods are most frequently applied in insurance and in operational risk and loss modeling, for example insurance claims, foreclosure frequency with loss severity, and frequent flyer points redemption activity.
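
The zero-inflation condition described above (more than 50% of target values equal to 0) can be checked before starting a project. The following is a minimal sketch; the function name and the `threshold` parameter are hypothetical and are not part of any DataRobot API.

```python
def is_zero_inflated(target_values, threshold=0.5):
    """Return True when more than `threshold` of the target values are 0.

    Hypothetical helper illustrating the rule in the text; DataRobot's own
    check may differ in details such as how missing values are counted.
    """
    values = list(target_values)
    if not values:
        return False
    zero_share = sum(1 for v in values if v == 0) / len(values)
    return zero_share > threshold


claim_payouts = [0, 0, 0, 1250.0, 0, 430.0, 0, 0]
print(is_zero_inflated(claim_payouts))
```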

For qualifying models (see below), you can view stage-related information in the following tabs:

  • In the Coefficients tab, DataRobot graphs parameters corresponding to the selected stage for linear models. Additionally, if you export the coefficients, two additional columns—Frequency_Coefficient and Severity_Coefficient—provide the coefficients at each stage.
  • In the Advanced Tuning tab, DataRobot graphs the parameters corresponding to the selected stage.

DataRobot automatically runs some models built to support the frequency/severity methods as part of Autopilot; additional models are available in the Repository. Models that support staging can be identified by the prefix "Frequency-Cost" or "Frequency-Severity" and include the following:

  • XGBoost*
  • LightGBM*
  • Generalized Additive Models
  • Elastic Net

* Coefficients are not available for these models

Example use case: insurance

Zach is building an insurance claim model based on frequency (the number of times that a policyholder made a claim) and severity (the cost of the claim). Zach wants to predict the payout amount of claims for a potential policyholder in the coming year. Generally, most policyholders don't have accidents and so don't file claims. Therefore, in a dataset where each row represents one policyholder and the target is claim payouts, the target column for most rows will be $0. Zach's dataset has a zero-inflated target: most policyholders represented in the training data have $0 as their target value. In this project, DataRobot will build several Frequency-Cost and Frequency-Severity models.

Data summary information

The following information assumes that you have selected a target feature and started the modeling process.

After you select a target variable and begin modeling, DataRobot analyzes the data and presents this information in the Project Data tab of the Data page. Data features are listed in order of importance in predicting the target variable. DataRobot also detects the data (variable) type of each feature.

Additional information on the Data page includes:

Importance score

The Importance bars show the degree to which a feature is correlated with the target. These bars are based on "Alternating Conditional Expectations" (ACE) scores. ACE scores are capable of detecting non-linear relationships with the target, but as they are univariate, they are unable to detect interaction effects between features. Importance is calculated using an algorithm that measures the information content of the variable; this calculation is done independently for each feature in the dataset. The importance score has two components—Value and Normalized Value:

  • Value: This shows the metric score you should expect (more or less) if you build a model using only that variable. For Multiclass, Value is calculated as the weighted average from the binary univariate models for each class. For binary classification and regression, Value is calculated from a univariate model evaluated on the validation set using the selected project metric.
  • Normalized Value: Value normalized; scores up to 1 (higher scores are better). 0 means accuracy is the same as predicting the training target mean. Scores of less than 0 mean the ACE model prediction is worse than the target mean model (overfitting).

These scores represent a measure of predictive power of a simple model using only that variable to predict the target. (The score is adjusted by exposure if you set the Exposure parameter.) Scores are measured using the project's accuracy metric.
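
To make the Normalized Value scale concrete, the sketch below compares a single-feature model against a baseline that always predicts the training target mean. The formula `1 - model_error / baseline_error` is an assumption chosen because it reproduces the properties described above (1 is the best score, 0 matches the mean baseline, negative is worse than the baseline); the source does not state DataRobot's exact formula, and RMSE stands in for the project metric.

```python
from statistics import mean


def rmse(actual, predicted):
    """Root mean squared error, used here as a stand-in project metric."""
    return (sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)) ** 0.5


def normalized_score(model_error, baseline_error):
    """Assumed normalization: 1 is perfect, 0 equals the mean baseline."""
    if baseline_error == 0:
        raise ValueError("baseline error is zero; normalization is undefined")
    return 1.0 - model_error / baseline_error


# Illustrative numbers only, not taken from any real project.
train_target = [10.0, 12.0, 14.0, 16.0]
valid_target = [11.0, 13.0, 15.0]

# Baseline: always predict the training target mean.
baseline_pred = [mean(train_target)] * len(valid_target)
# Predictions from some univariate model (hypothetical values).
univariate_pred = [11.5, 13.0, 14.5]

baseline_error = rmse(valid_target, baseline_pred)
model_error = rmse(valid_target, univariate_pred)
score = normalized_score(model_error, baseline_error)
print(round(score, 4))
```

A univariate model whose validation error exceeds the baseline error yields a negative score under this normalization, which corresponds to the overfitting case noted above.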

Features are ranked from most important to least important. The green bar next to each feature indicates its relative importance: the full length of the bar shows the maximum potential feature importance, and the green portion is proportional to the Normalized Value, so the more green in the bar, the more important the feature. Hovering on the green bar shows both scores. These numbers represent the score, in terms of the project metric (the metric selected when the project was run), for a model that uses only that feature. Changing the metric in the Leaderboard has no effect on the tooltip scores.

Click a feature name to view details of the data values. While the values change between EDA1 and EDA2 (e.g., rows are removed because they are part of holdout or they are missing values), the meaning of the charts and the options are the same.

Automated feature transformations

Feature engineering is a key part of the modeling process. After pressing Start, DataRobot performs automated feature engineering on the given dataset to create derived variables in order to enhance model accuracy. See the table below for a list of feature engineering tasks DataRobot may perform during modeling for each feature type:

Feature type Automated transformations
Numeric and categorical
  • Missing Imputation (Median, Arbitrary)
  • Standardization
  • Search for ratios
  • Search for differences
  • Ridit Transform
  • DataRobot Smart Binning using a second model
  • Principal Components Analysis
  • K-Means Clustering
  • One hot encoding
  • Ordinal encoding
  • Credibility intervals
  • Category counts
  • Variational Autoencoder
  • Custom Feature Engineering for Numerics
Date
  • Month-of-year
  • Day of week
  • Day of year
  • Day of month
  • Hour of day
  • Year
  • Month
  • Week
Text
  • Character / word ngrams
  • Pretrained tinyBERT featurizer
  • Stopword removal
  • Part of Speech Tagging / Removal
  • TF-IDF scaling (optional sublinear scaling and binormal separation scaling)
  • Hashing vectorizers for big data
  • SVD preprocessing
  • Cosine similarity between pairs of text columns (on datasets with 2+ text columns)
  • Support for multiple languages, including English, Japanese, French, Korean, Spanish, Chinese, and Portuguese
Images
DataRobot uses featurizers to turn images into numbers:
  • Resnet50
  • Xception
  • Squeezenet
  • Efficientnet
  • PreResnet
  • Darknet
  • MobileNet

DataRobot also allows you to fine-tune these featurizers.
Geospatial
DataRobot uses several techniques to automatically derive spatially-lagged features from the input dataset:
  • Spatial Lag: A k-nearest neighbor approach that calculates mean neighborhood values of numeric features at varying spatial lags and neighborhood sizes.
  • Spatial Kernel: Characterizes spatial dependence structure for all numeric variables using a spatial kernel neighborhood technique with varying kernel sizes, weighting by distance.
DataRobot also derives local autocorrelation features using local indicators of spatial association to capture hot and cold spots of spatial similarity within the context of the entire input dataset.
DataRobot derives features for the following geometric properties:
  • MultiPoints: Centroid
  • Lines/MultiLines: Centroid, Length, Minimum bounding rectangle area
  • Polygons/MultiPolygons: Centroid, Perimeter, Area, Minimum bounding rectangle area
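
As an illustration of the Date row in the table above, the following sketch derives comparable calendar features from a single timestamp using only the Python standard library. The function name and dictionary keys are hypothetical, and the numbering conventions (ISO week numbers, Monday as day 0) are assumptions; DataRobot's derived features may use different conventions.

```python
from datetime import datetime


def derive_date_features(ts: datetime) -> dict:
    """Derive simple calendar features from one timestamp (hypothetical helper)."""
    return {
        "year": ts.year,
        "month": ts.month,                     # 1 to 12
        "week": ts.isocalendar()[1],           # ISO week number (assumption)
        "day_of_year": ts.timetuple().tm_yday,
        "day_of_month": ts.day,
        "day_of_week": ts.weekday(),           # 0 = Monday (assumption)
        "hour_of_day": ts.hour,
    }


features = derive_date_features(datetime(2022, 6, 7, 14, 30))
for name, value in features.items():
    print(f"{name}: {value}")
```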

Text vs. categorical features

DataRobot runs heuristics to differentiate text from categorical features, including the following:

  1. If the number of unique rows is less than 5% of the column size, or if there are fewer than 60 unique rows, the column is classified as categorical.

  2. Using the Python language identifier langid, DataRobot attempts to detect a language. If no language is detected, the column is classified as categorical.

  3. Languages are categorized as either Japanese/Chinese/Korean or English and all other languages ("English+"). If at least three of the following checks pass, the feature is classified as text:

    English+

    • (Number of unique lines / total number of lines > 0.3) or number of unique lines > 1000.
    • The mean number of spaces per line is at least 1.5.
    • 10% or more lines have at least 4 words.
    • The longest line has at least 6 words.

    Japanese/Chinese/Korean

    • (Number of unique lines / total number of lines > 0.3) or number of unique lines > 1000.
    • The mean line length is at least 4 characters.
    • 10% or more lines have at least 7 characters.
    • The longest line has at least 12 characters.
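
The three heuristics above can be sketched as a single function. This is an illustration under stated assumptions, not DataRobot's code: the function name is hypothetical, language detection is left to the caller (pass `None` when no language is detected), the language codes `ja`, `zh`, and `ko` are assumed identifiers for the Japanese/Chinese/Korean group, and empty values are ignored.

```python
CJK_LANGUAGES = {"ja", "zh", "ko"}  # assumed codes for Japanese/Chinese/Korean


def classify_text_or_categorical(values, language):
    """Classify a column as 'text' or 'categorical' using the listed checks.

    `language` is the detected language code, or None if detection failed.
    """
    lines = [v for v in values if v]
    total = len(lines)
    unique = len(set(lines))

    # Step 1: few unique values means categorical.
    if unique < 0.05 * total or unique < 60:
        return "categorical"

    # Step 2: no detected language means categorical.
    if language is None:
        return "categorical"

    # Step 3: at least three of four checks must pass.
    uniqueness_check = (unique / total > 0.3) or unique > 1000
    if language in CJK_LANGUAGES:
        lengths = [len(line) for line in lines]
        checks = [
            uniqueness_check,
            sum(lengths) / total >= 4,
            sum(1 for n in lengths if n >= 7) / total >= 0.10,
            max(lengths) >= 12,
        ]
    else:
        word_counts = [len(line.split()) for line in lines]
        checks = [
            uniqueness_check,
            sum(line.count(" ") for line in lines) / total >= 1.5,
            sum(1 for n in word_counts if n >= 4) / total >= 0.10,
            max(word_counts) >= 6,
        ]
    return "text" if sum(checks) >= 3 else "categorical"


comments = [f"customer {i} reported a billing problem yesterday" for i in range(100)]
print(classify_text_or_categorical(comments, "en"))
print(classify_text_or_categorical(["red", "green", "blue"] * 50, "en"))
```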

Manual feature transformations allow you to override the automated assignment, but because this can cause errors, DataRobot provides a warning during the transformation process.

Missing values

DataRobot handles missing values differently, depending on the model and/or value type. The following are the codes DataRobot recognizes and treats as missing values:

Special NaN Values for all feature types

  • null, NULL
  • na, NA, n/a, #N/A, N/A, #NA, #N/A N/A
  • 1.#IND, -1.#IND
  • NaN, nan, -NaN, -nan
  • 1.#QNAN, -1.#QNAN
  • ?
  • .
  • Inf, INF, inf, -Inf, -INF, -inf
  • None
  • One or more whitespace characters and empty cells are also treated as missing values.
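
When preparing data outside DataRobot, the codes listed above can be collected into a lookup to preview which cells would be treated as missing. This is a minimal sketch with a hypothetical function name; it assumes exact, case-sensitive matching against the listed codes, which the source does not confirm.

```python
MISSING_TOKENS = {
    "null", "NULL",
    "na", "NA", "n/a", "#N/A", "N/A", "#NA", "#N/A N/A",
    "1.#IND", "-1.#IND",
    "NaN", "nan", "-NaN", "-nan",
    "1.#QNAN", "-1.#QNAN",
    "?", ".",
    "Inf", "INF", "inf", "-Inf", "-INF", "-inf",
    "None",
}


def is_missing(raw) -> bool:
    """Return True if a raw cell value matches one of the listed missing codes."""
    if raw is None:
        return True
    text = str(raw)
    # Empty cells and cells containing only whitespace count as missing.
    if text.strip() == "":
        return True
    # Exact, case-sensitive match (an assumption of this sketch).
    return text in MISSING_TOKENS


row = ["42", "N/A", "   ", "?", "0", "-inf"]
print([is_missing(cell) for cell in row])
```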

The following notes describe some specifics of DataRobot's value handling.

Note

The missing value imputation method is fixed at training time. Whichever value was set during training (the median or the arbitrary value) is used at prediction time.

  • Some models natively handle missing values so that no special preprocessing is needed.

  • For linear models (such as linear regression or an SVM), DataRobot's handling depends on the case:

    • median imputation: DataRobot imputes missing values, using the median of the non-missing training data. This effectively handles data that are missing-at-random.
    • missing value flag: DataRobot adds a binary "missing value flag" for each variable with any missing values, allowing the model to recognize the pattern in structurally missing values and learn from it. This effectively handles data that are missing-not-at-random.
  • For tree-based models, DataRobot imputes with an arbitrary value (e.g., -9999) rather than the median. This method is faster and gives just as accurate a result.

  • For categorical variables in all models, DataRobot treats missing values as another level in the categories.
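
The imputation strategies above can be sketched in a few lines. The function names are hypothetical and `None` stands for a missing value; the point is that the fill value is computed once from the training data and then reused unchanged at prediction time.

```python
from statistics import median


def fit_median(train_values):
    """Compute the median of the non-missing training values."""
    return median(v for v in train_values if v is not None)


def impute_with_flag(values, fill_value):
    """Replace missing values with fill_value and return a binary missing flag."""
    imputed = [fill_value if v is None else v for v in values]
    flags = [1 if v is None else 0 for v in values]
    return imputed, flags


train = [3.0, None, 5.0, 9.0]
fill = fit_median(train)  # fixed at training time

# Linear-model style: median imputation plus a missing value flag.
new_rows = [None, 4.0]
imputed, flags = impute_with_flag(new_rows, fill)
print(imputed, flags)

# Tree-model style: an arbitrary out-of-range value such as -9999.
tree_imputed, _ = impute_with_flag(new_rows, -9999)
print(tree_imputed)
```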

Numeric columns

DataRobot assigns a variable type to a value during EDA. For numeric columns, there are three types of values:

  1. Numeric values: these can be integers or floating point numbers.
  2. Special NaN values (listed above): these are not numeric, but are recognized as representative of NaN.
  3. All other values: for example, string or text data.

Following are the rules DataRobot uses when determining if a particular column is treated as numeric, and how it handles the column at prediction time:

  • Strict Numeric: If a column has only numeric and special NaN values, DataRobot treats the column as numeric. At prediction time, DataRobot accepts any of the same special NaN values as missing and makes predictions. If any other value is present, DataRobot returns an error.

  • Permissive Numeric: If a column has numeric values, special NaN values and one (and only one) other value, DataRobot treats that other value as missing and treats the column as numeric. At prediction time, all other values are treated as missing (regardless of whether they differ from the first one).

  • Categorical: If DataRobot finds two or more other (non-numeric and non-NaN) values in a column during EDA, it treats the feature as categorical instead of numeric.

  • If DataRobot does not process any other value during EDA sampling and categorizes the feature as numeric, before training (but after EDA) it "looks" at the full dataset for that column. If any other values are seen for the full dataset, the column is treated as permissive numeric. If not, it is strict numeric.
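
The rules above can be summarized as a decision procedure over the distinct "other" values found in a column. This is a sketch under assumptions: the function names are hypothetical, `NAN_TOKENS` is an abbreviated subset of the special NaN values listed earlier, and Python's `float()` stands in for whatever numeric parsing DataRobot performs.

```python
# Abbreviated subset of the special NaN values, for illustration only.
NAN_TOKENS = {
    "null", "NULL", "NA", "N/A", "n/a", "?", ".", "None",
    "NaN", "nan", "-NaN", "-nan",
    "Inf", "INF", "inf", "-Inf", "-INF", "-inf",
}


def value_kind(raw) -> str:
    """Classify one raw value as 'nan', 'numeric', or 'other'."""
    text = str(raw).strip()
    if text == "" or text in NAN_TOKENS:
        return "nan"
    try:
        float(text)
        return "numeric"
    except ValueError:
        return "other"


def classify_column(eda_sample, full_column=None) -> str:
    """Apply the strict / permissive / categorical rules to a column."""
    others = {str(v).strip() for v in eda_sample if value_kind(v) == "other"}
    if len(others) >= 2:
        return "categorical"
    if len(others) == 1:
        return "permissive numeric"
    # No other values in the EDA sample: check the full dataset before training.
    full = eda_sample if full_column is None else full_column
    if any(value_kind(v) == "other" for v in full):
        return "permissive numeric"
    return "strict numeric"


print(classify_column(["1", "2.5", "NA"]))
print(classify_column(["1", "unknown", "3"]))
print(classify_column(["1", "low", "high"]))
print(classify_column(["1", "2"], full_column=["1", "2", "n/k"]))
```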


Updated June 7, 2022