The following questions and answers cover modeling in general, and then more specifically building models and using model insights.
What types of models does DataRobot build?
DataRobot supports Tree-based models, Deep Learning models, Support Vector Machines (SVM), Generalized Linear Models, Anomaly Detection models, Text Mining models, and more. See the list of specific model types for more information.
What are modeling workers?
DataRobot uses different types of workers for different types of jobs; modeling workers train models and create insights. You can adjust the number of these workers in the Worker Queue, which can speed model building and lets you allocate resources across projects.
Why can't I add more workers?
You may have reached your maximum; if workers come from a shared pool, your coworkers may be using them; or they may be in use by another of your projects. See the troubleshooting tips for more information.
What is the difference between a model and a blueprint?
A modeling algorithm fits a model to data, which is just one component of a blueprint. A blueprint represents the high-level end-to-end procedure for fitting the model, including any preprocessing steps, modeling, and post-processing steps. Read about accessing the graphical representation of a blueprint.
What is smart downsampling?
Smart downsampling is a technique to reduce the total size of the dataset by reducing the size of the majority class, enabling you to build models faster without sacrificing accuracy.
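The mechanics can be sketched in a few lines: drop a random fraction of majority-class rows, then weight the survivors so class totals are preserved in expectation. The `smart_downsample` helper below is purely illustrative; DataRobot's actual implementation differs.

```python
import random

def smart_downsample(rows, labels, majority_label, keep_ratio, seed=0):
    """Drop a fraction of majority-class rows; weight survivors to compensate.

    Illustrative sketch only -- not DataRobot's implementation.
    """
    rng = random.Random(seed)
    out_rows, out_labels, weights = [], [], []
    for row, label in zip(rows, labels):
        if label == majority_label:
            if rng.random() < keep_ratio:
                out_rows.append(row)
                out_labels.append(label)
                weights.append(1.0 / keep_ratio)  # compensate for dropped rows
        else:
            out_rows.append(row)
            out_labels.append(label)
            weights.append(1.0)
    return out_rows, out_labels, weights
```

Because each kept majority-class row carries weight `1 / keep_ratio`, the weighted class totals match the original dataset in expectation, which is why accuracy need not suffer.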
What are EDA1 and EDA2?
Exploratory data analysis, or EDA, is DataRobot's approach to analyzing datasets and summarizing their main characteristics. It consists of two phases. EDA1 describes the state of your project after data finishes uploading, providing summary statistics based on up to 500MB of your data. In EDA2, DataRobot does additional calculations on the target column using the entire dataset (excluding holdout) and recalculates summary statistics and ACE scores.
What does a Leaderboard asterisk mean?
What does the Leaderboard snowflake icon mean?
The snowflake next to a model indicates that the model is the result of a frozen run. In other words, DataRobot “froze” the parameter settings from a model’s earlier run on a small sample. Because parameter settings tuned on smaller samples tend to also perform well on larger samples of the same data, DataRobot can piggyback on its early experimentation.
What is cross-validation?
Cross-validation is a partitioning method for evaluating model performance. It runs automatically for datasets with fewer than 50,000 rows and can be started manually from the Leaderboard for larger datasets.
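The idea can be sketched in plain Python: split the rows into k folds, hold each fold out in turn, and average the fold scores. This is a generic illustration, not DataRobot's internal code; `score_fn` stands in for any train-and-score routine.

```python
def k_fold_indices(n_rows, k=5):
    """Split row indices into k roughly equal folds (no shuffling shown)."""
    folds = [[] for _ in range(k)]
    for i in range(n_rows):
        folds[i % k].append(i)
    return folds

def cross_validate(score_fn, n_rows, k=5):
    """Average a scoring function over k train/validate splits."""
    folds = k_fold_indices(n_rows, k)
    scores = []
    for held_out in range(k):
        train = [i for f in range(k) if f != held_out for i in folds[f]]
        validate = folds[held_out]
        scores.append(score_fn(train, validate))
    return sum(scores) / k
```

Every row is used for validation exactly once, which is what makes the averaged score a more stable estimate than a single validation split.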
What do the modes and sample sizes mean?
There are several modeling mode options; the selected mode determines the sample size(s) of the run. Autopilot is DataRobot's "survival of the fittest" modeling mode that automatically selects the best predictive models for the specified target feature and runs them at ever-increasing sample sizes.
Why are the sample sizes shown in the repository not the standard Autopilot sizes?
The sample size available when adding models from the Repository depends on the size of the dataset. It defaults to the last Autopilot stage: either 64% or 500MB of data, whichever is smaller. In other words, it is the maximum training size that does not dip into the validation partition.
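As a sketch of that rule, the hypothetical helper below computes the final-stage training percentage as the smaller of 64% and whatever share of the rows fits in roughly 500MB (the function name and the exact cap arithmetic are illustrative assumptions):

```python
def final_autopilot_sample_pct(dataset_mb):
    """Final-stage training share: 64% of the data, capped so the sample
    stays under roughly 500MB. Illustrative sketch of the rule above."""
    pct_for_500mb = min(100.0, 500.0 / dataset_mb * 100.0)
    return min(64.0, pct_for_500mb)
```

For a 100MB dataset the answer is the usual 64%; for a 2GB dataset the 500MB cap wins and the share drops to 25%.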
Are there modeling guardrails?
DataRobot provides guardrails to help ensure ML best practices and instill confidence in DataRobot models. Some examples include a substantive data quality assessment, a feature list with target leakage features removed, and automated data drift tracking.
How are missing values handled?
Does DataRobot support feature transformations?
In AutoML, DataRobot performs automatic feature transformations for features recognized as type “date,” adding these new features to the modeling dataset. Additionally, you can create manual transformations and change variable type. For image datasets, the train-time image augmentation process creates new training images. The time series feature derivation process creates a new modeling dataset. Feature Discovery discovers and generates new features from multiple datasets to consolidate datasets. Or, use a Spark SQL query from the AI Catalog to prepare a new dataset from a single dataset or blend two or more datasets. Transformed features are marked with an info icon on the data page.
Can I choose which optimization metric to use?
The optimization metric defines how to score models. DataRobot selects a metric best-suited for your data from a comprehensive set of choices, but also computes alternative metrics. After EDA1 completes, you can change the selection from the Advanced Options > Additional tab. After EDA2 completes, you can redisplay the Leaderboard listing based on a different computed metric.
Can I change the project type?
Once you enter a target feature, DataRobot automatically analyzes the training dataset, determines the project type (classification if the target has categories or regression if the target is numerical), and displays the distribution of the target feature. If the project is classified as regression and eligible for multiclass conversion, you can change the project to a classification project, and DataRobot will interpret values as classes instead of continuous values.
How do I control how to group or partition my data for model training?
By default, DataRobot splits your data into a 20% holdout (test) partition and an 80% cross-validation (training and validation) partition, which is divided into five sub-partitions. You can change these values after loading data and selecting a target from the Advanced Options > Partitioning tab. From there, you can set the method, sizes for data partitions, number of partitions for cross-validation, and the method by which those partitions are created.
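A minimal sketch of those defaults, assuming simple random partitioning (the product also supports stratified, grouped, and date-based methods):

```python
import random

def default_partitions(n_rows, holdout_pct=20, n_folds=5, seed=0):
    """Shuffle row indices, carve off a holdout, split the rest into CV folds.

    Mirrors the 20% holdout / 5-fold defaults described above; sketch only.
    """
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    n_holdout = n_rows * holdout_pct // 100
    holdout, rest = idx[:n_holdout], idx[n_holdout:]
    folds = [rest[f::n_folds] for f in range(n_folds)]
    return holdout, folds
```

With 100 rows this yields a 20-row holdout and five 16-row folds, matching the 20%/80% split described above.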
What do the green "importance" bars represent on the Data tab?
The Importance green bars, based on "Alternating Conditional Expectations" (ACE) scores, show the degree to which a feature is correlated with the target. Importance has two components—Value and Normalized Value—and is calculated independently for each feature in the dataset.
Does DataRobot handle natural language processing (NLP)?
When text fields are detected in your data, DataRobot automatically detects the language and applies appropriate preprocessing. This may include advanced tokenization, data cleaning (stop word removal, stemming, etc.), and vectorization methods. DataRobot supports n-gram matrix (bag-of-words, bag-of-characters) analysis as well as word embedding techniques such as Word2Vec and fastText with both CBOW and Skip-Gram learning methods. Additional capabilities include Naive Bayes SVM and cosine similarity analysis. For visualization, there are per-class word clouds for text analysis. You can see the applied language preprocessing steps in the model blueprint.
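As a toy illustration of the bag-of-words step only (real preprocessing adds language detection, stop-word removal, stemming, and so on), counting word n-grams can look like this:

```python
from collections import Counter

def ngram_counts(text, n=1):
    """Lowercase, whitespace-tokenize, and count word n-grams.

    A toy stand-in for the bag-of-words matrix construction; not the
    tokenization DataRobot actually applies.
    """
    tokens = text.lower().split()
    grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return Counter(grams)
```

Setting `n=2` produces bigrams ("the cat", "cat sat"), the building blocks of an n-gram matrix.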
How do I restart a project with the same data?
If your data is stored in the AI Catalog, you can create and recreate projects from that dataset. To recreate a project—using either just the data or the data and the settings (i.e., to duplicate the project)—use the Actions menu in the project control center.
Do I have to use the UI or can I interact programmatically?
DataRobot provides both a UI and a REST API. The UI and REST API provide nearly matching functionality. Additionally, Python and R clients provide a subset of what you can do with the full API.
Does DataRobot provide partner integrations?
DataRobot offers an Alteryx add-in and a Tableau extension. A Snowflake integration allows joint users to execute Feature Discovery projects in DataRobot while performing computations in Snowflake for minimized data movement.
What is the difference between prediction and modeling servers?
Modeling servers power all the creation and model analysis done from the UI and from the R and Python clients; modeling worker resources are reported in the Resource Monitor. Prediction servers are used solely for making predictions and handling prediction requests on deployed models.
How do I directly compare model performance?
There are many ways to compare model performance. Some starting points:
- Look at the Leaderboard to compare Validation, Cross-Validation, and/or Holdout scores.
- Use Learning Curves to help determine whether it is worthwhile to increase the size of your dataset for a given model. The results help identify which models may benefit from being trained into the Validation or Holdout partition.
- Speed vs Accuracy compares multiple models on the tradeoff between runtime and predictive accuracy. If prediction latency is important for model deployment, this comparison helps you find the most effective model.
- Model Comparison lets you select a pair of models and compare a variety of insights (Lift Charts, Profit Curve, ROC Curves).
How does DataRobot choose the recommended model?
As part of the Autopilot modeling process, DataRobot identifies the most accurate non-blender model and prepares it for deployment. Although Autopilot recommends and prepares a single model for deployment, you can initiate the Autopilot recommendation and deployment preparation stages for any Leaderboard model.
Why not always use the most accurate model?
There could be several reasons, but the two most common are:
- Prediction latency—the speed at which predictions are made. Some business applications of a model require very fast predictions on new data. The most accurate models are often blender models, which are usually slower at making predictions.
- Organizational readiness—Some organizations favor linear models and/or decision trees for perceived interpretability reasons. Additionally, there may be compliance reasons for favoring certain types of models over others.
Why doesn’t the recommended model have text insights?
One common reason that text models are not built is because DataRobot removes single-character "words" when model building, a common practice in text mining. If this causes a problem, look at your model log and consider the documented workarounds.
What is model lift?
Lift is the ratio of points correctly classified as positive in a model versus the 45-degree line (or baseline model) represented in the Cumulative Gain plot. The cumulative charts show, for a given % of top predictions, how much more effective the selected model is at identifying the positive class versus the baseline model.
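The lift calculation can be sketched directly: rank rows by model score, take the top slice, and compare its positive rate to the overall positive rate (the baseline). This generic sketch assumes binary labels and higher-is-better scores:

```python
def lift_at(y_true, scores, top_pct):
    """Lift of the model's top-scoring slice vs. the baseline positive rate.

    y_true: 1 for the positive class, 0 otherwise; scores: model outputs.
    Generic illustration of the Cumulative Gain idea described above.
    """
    ranked = [y for _, y in sorted(zip(scores, y_true), key=lambda p: -p[0])]
    k = max(1, int(len(ranked) * top_pct / 100))
    top_rate = sum(ranked[:k]) / k
    base_rate = sum(y_true) / len(y_true)
    return top_rate / base_rate
```

A lift of 2.0 at the top 10% means the model finds positives twice as often in that slice as random selection would.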
What is the ROC Curve chart?
The ROC Curve tab provides extensive tools for exploring classification, performance, and statistics related to a selected model at any point on the probability scale. Documentation discusses prediction thresholds, the Matthews Correlation Coefficient (MCC), as well as interpreting the ROC Curve, Cumulative Gain, and Profit Curve charts.
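For reference, the MCC mentioned above has a closed form over the four confusion-matrix counts; a minimal implementation:

```python
import math

def mcc(y_true, y_pred):
    """Matthews Correlation Coefficient from binary labels and predictions.

    Ranges from -1 (total disagreement) through 0 (chance) to +1 (perfect).
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

Sweeping the prediction threshold and recomputing MCC at each point is one way to pick the threshold the ROC Curve tab lets you explore.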
Can I tune model hyperparameters?
You can tune model hyperparameters in the Advanced Tuning tab for a particular model. From here, you can manually set model parameters, overriding the DataRobot selections. However, consider whether it is instead better to spend “tuning” time doing feature engineering, for example using Feature Discovery for automated feature engineering.
How is Tree-Based Variable Importance different from Feature Impact?
Feature Impact shows, at a high level, which features are driving model decisions. It is computed by permuting the rows of a given feature while leaving the rows of the other features unchanged, and measuring the impact of the permutation on the model's predictive performance. Tree-based Variable Importance shows how much gain each feature adds to a model (the relative importance of the key features). It is only available for tree/forest models (for example, Gradient Boosted Trees Classifier or Random Forest).
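The permutation idea behind Feature Impact can be sketched generically; this is an illustration, not DataRobot's exact computation, and `predict` and `metric` are stand-ins for any model and any higher-is-better score:

```python
import random

def permutation_impact(predict, X, y, feature, metric, seed=0):
    """Drop in performance after shuffling one feature's column.

    predict: function mapping rows to predictions; metric: higher is better.
    Sketch of the permutation approach described above.
    """
    baseline = metric(y, predict(X))
    shuffled = [row[:] for row in X]
    col = [row[feature] for row in shuffled]
    random.Random(seed).shuffle(col)
    for row, v in zip(shuffled, col):
        row[feature] = v
    return baseline - metric(y, predict(shuffled))
```

A feature the model ignores yields an impact of zero, since shuffling it cannot change the predictions.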
How can I find models that produced coefficients?
Any model that produces coefficients can be identified on the Leaderboard with a Beta tag. Those models allow you to export the coefficients and transformation parameters necessary to verify steps and make predictions outside of DataRobot. When a blueprint has coefficients but is not marked with the Beta tag, it indicates that the coefficients are not exact (e.g., they may be rounded).
What is the difference between "text mining" and "word clouds"?
The Text Mining and Word Cloud insights demonstrate text importance in different formats. Text Mining shows text coefficient effect (numeric value) and direction (positive=red or negative=blue) in a bar graph format. The Word Cloud shows the normalized version of those coefficients in a cloud format using text size and color.
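One simple way to normalize coefficients for display is to scale by the largest magnitude, so that sign (color) and relative size survive; the exact scaling DataRobot uses may differ:

```python
def normalize_coefficients(coefs):
    """Scale text coefficients to [-1, 1] by the largest magnitude.

    A sketch of the normalization idea only, not DataRobot's exact method:
    sign picks the color, magnitude picks the text size.
    """
    peak = max(abs(v) for v in coefs.values())
    return {term: v / peak for term, v in coefs.items()}
```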
Why are there variables in some insights that are not in the dataset?
DataRobot performs a variety of data preprocessing, such as automatic transformations and deriving features (for example, ratios and differences). When building models, it uses all useful features, which includes both original and derived variables.
Why does Feature Effects show missing partial dependence values when my dataset has none?
Partial dependence (PD) is reported as part of Feature Effects. It shows how dependent the prediction value is on different values of the selected feature. Prediction values are affected by all features, though, not just the selected feature, so PD must measure how predictions change given different values of the other features as well. When computing, DataRobot adds “missing” as one of the values calculated for the selected feature, to show how the absence of a value will affect the prediction. The end result is the average effect of each value on the prediction, given other values, and following the distribution of the training data.
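One-feature partial dependence can be sketched as: force the selected feature to each candidate value (including a missing marker), predict, and average. This is a generic illustration; DataRobot samples according to the training distribution rather than using every row:

```python
def partial_dependence(predict, X, feature, values):
    """Average prediction when `feature` is forced to each candidate value,
    leaving the other features as observed.

    Include None among `values` to probe the effect of a missing value,
    mirroring the behavior described above. Sketch only.
    """
    pd = {}
    for value in values:
        modified = [dict(row, **{feature: value}) for row in X]
        preds = predict(modified)
        pd[value] = sum(preds) / len(preds)
    return pd
```

Because the other features keep their observed values, the curve reflects the feature's average effect given the rest of the data, not its effect in isolation.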
How do I determine how long it will take to calculate Feature Effects?
It can take a long time to compute Feature Effects, particularly if blenders are involved. As a rough estimate of the runtime, use the Model Info tab to check the time, in seconds, that your model takes to score 1000 rows; multiply that number by 0.5-1.0 hours. Note that the actual runtime may be longer if you don’t assign enough workers to run all Feature Effects sub-jobs simultaneously.
Why is a feature’s impact different depending on the model?
Autopilot builds a wide selection of models to capture varying degrees of underlying complexity and each model has its strengths and weaknesses in addressing that complexity. A feature’s impact shouldn't be drastically different, however, so while the ordering of features will change, the overall inference is often not impacted. Examples:
- A model that is not capable of detecting nonlinear relationships or interactions will use the variables one way, while a model that can detect these relationships will use the variables another way. The result is different feature impacts from different models.
- If two variables are highly correlated, a regularized linear model will tend to use only one of them, while a tree-based method will tend to use both, and at different splits. With the linear model, one of these variables will show up high in feature importance and the other will be low, while with the tree-based model, both will be closer to the middle.