Fundamentals of modeling

DataRobot uses automated machine learning (AutoML) to build models that solve real-world problems across domains and industries. DataRobot takes the data you provide, generates multiple machine learning (ML) models, and recommends the best model to put into use. You don't need to be a data scientist to build ML models using DataRobot, but an understanding of the basics will help you build better models. Your domain knowledge and DataRobot's AI expertise will lead to successful models that solve problems with speed and accuracy.

DataRobot supports many different approaches to ML modeling—supervised learning, unsupervised learning, time series modeling, segmented modeling, multimodal modeling, and more. This section describes these approaches and also provides tips for analyzing and selecting the best models for deployment.

Modeling methods

ML modeling is the process of developing algorithms that learn by example from historical data. These algorithms predict outcomes and uncover patterns not easily discerned.

Supervised and unsupervised learning

The most basic form of machine learning is supervised learning.

With supervised learning, you provide "labeled" data. A label in a dataset provides information to help the algorithm learn from the data. The label—also called the target—is what you're trying to predict.

  • In a regression project, the target is a numeric value. A regression model estimates a continuous dependent variable given a list of input variables (also referred to as features or columns). Examples of regression problems include financial forecasting, time series forecasting, maintenance scheduling, and weather analysis.

  • In a classification project, the target is a category. A classification model groups observations into categories by identifying shared characteristics of certain classes. It compares those characteristics to the data you're classifying and estimates how likely it is that the observation belongs to a particular class. Classification projects can be binary (two classes) or multiclass (three or more classes). For classification, DataRobot also supports multilabel modeling where the target feature has a variable number of classes or labels; each row of the dataset is associated with one, several, or zero labels.
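
To make the distinction between these project types concrete, the sketch below guesses a project type from a column of target values. It is only an illustration of the definitions above, not how DataRobot detects the project type; the function name `infer_project_type` and its rules (strings mean categories, lists mean multilabel rows) are assumptions made for this example.

```python
from typing import Any, Sequence


def infer_project_type(targets: Sequence[Any]) -> str:
    """Guess a supervised project type from target values (illustrative heuristic)."""
    # Multilabel: each row carries a collection of zero, one, or several labels.
    if any(isinstance(t, (list, tuple, set, frozenset)) for t in targets):
        return "multilabel"
    # Classification: the target is a category (assumed here to be a string or bool).
    if any(isinstance(t, (str, bool)) for t in targets):
        n_classes = len(set(targets))
        if n_classes < 2:
            raise ValueError("a classification target needs at least two classes")
        return "binary" if n_classes == 2 else "multiclass"
    # Otherwise the target is numeric and is treated as a regression target.
    # Assumption: numeric class codes such as 0/1 are not detected as classes.
    return "regression"


print(infer_project_type([10.5, 3.2, 7.0]))          # numeric target
print(infer_project_type(["churn", "stay", "stay"]))  # two categories
print(infer_project_type([["red", "blue"], [], ["green"]]))  # label sets per row
```

A real dataset needs more care than this heuristic provides; for example, a numeric column with only two distinct values is often a binary classification target.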

Another form of machine learning is unsupervised learning.

With unsupervised learning, the dataset is unlabeled and the algorithm must infer patterns in the data.

  • In an anomaly detection project, the algorithm detects unusual data points in your dataset. Use cases include detection of fraudulent transactions, faults in hardware, and human error during data entry.

  • In a clustering project, the algorithm splits the dataset into groups according to similarity. Clustering is useful for gaining intuition about your data. The clusters can also help label your data so that you can then use a supervised learning method on the dataset.
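
As a rough illustration of anomaly detection on unlabeled data, the sketch below flags values that lie far from the mean of a single numeric feature. The z-score rule and the `threshold` default are assumptions chosen for simplicity; DataRobot's anomaly detection blueprints use their own algorithms, which this example does not reproduce.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def find_anomalies(values: Sequence[float], threshold: float = 2.0) -> List[int]:
    """Return the indices of values whose z-score exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values are identical, so nothing stands out
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


# Hypothetical transaction amounts; the last one is unusually large.
amounts = [10, 11, 9, 10, 12, 10, 95]
print(find_anomalies(amounts))
```

No labels are involved: the rule infers what is "unusual" from the distribution of the data alone.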

Time-aware modeling

Time data is a crucial component in solving prediction and forecasting problems. DataRobot provides several methods and tools for time-aware modeling.

  • With time series modeling, you can generate a forecast: a series of predictions for a period of time in the future. You train time series models on past data to predict future events. You can predict a range of future values or use nowcasting to make a prediction at the current point in time. Use cases for time series modeling include predicting pricing and demand in domains such as finance, healthcare, and retail; essentially, any domain where problems have a time component.

  • You can use time series modeling for a dataset containing a single series, but you can also build a model for a dataset that contains multiple series. For this type of multiseries project, one feature serves as the series identifier. An example is a "store location" identifier that essentially divides the dataset into multiple series, one for each location. So you might have four store locations (Paris, Milan, Dubai, and Tokyo) and therefore four series for modeling.

  • With a multiseries project, you can choose to generate a model for each series using segmented modeling. In this case, DataRobot creates a deployment using the best model for each segment.

  • Sometimes, the dataset for the problem you're solving contains date and time information, but instead of generating a forecast as you do with time series modeling, you predict a target value on each individual row. This approach is called out-of-time validation (OTV).

  • Along with supervised learning models, you can also develop time series anomaly detection models.

See What is time-aware modeling for an in-depth discussion of these strategies.
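
Two of the ideas above can be sketched in a few lines: dividing a multiseries dataset by its series identifier, and holding out the most recent rows for out-of-time validation. The row layout, the field names (`store`, `day`, `sales`), and the helper functions are hypothetical and exist only for this example; they are not part of DataRobot.

```python
from collections import defaultdict
from datetime import date
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def split_by_series(rows: List[Row], series_id_key: str) -> Dict[Any, List[Row]]:
    """Group rows into one list per value of the series identifier."""
    series: Dict[Any, List[Row]] = defaultdict(list)
    for row in rows:
        series[row[series_id_key]].append(row)
    return dict(series)


def out_of_time_split(rows: List[Row], date_key: str, cutoff: date) -> Tuple[List[Row], List[Row]]:
    """Train on rows before the cutoff date; hold out rows on or after it."""
    ordered = sorted(rows, key=lambda r: r[date_key])
    train = [r for r in ordered if r[date_key] < cutoff]
    holdout = [r for r in ordered if r[date_key] >= cutoff]
    return train, holdout


# Hypothetical sales data: four store locations, two days each.
rows = [
    {"store": store, "day": date(2022, 1, day), "sales": 100 + 10 * day}
    for day in (1, 2)
    for store in ("Paris", "Milan", "Dubai", "Tokyo")
]

series = split_by_series(rows, "store")
train, holdout = out_of_time_split(rows, "day", cutoff=date(2022, 1, 2))
print(sorted(series), len(train), len(holdout))
```

Splitting by date rather than at random keeps the holdout strictly later than the training data, which is the point of validating "out of time".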

Specialized modeling workflows

DataRobot provides specialized workflows to help you address a wide range of problems.

  • Visual AI allows you to include images as features in your datasets. Use the image data alongside other data types to improve outcomes for various types of modeling projects—regression, classification, anomaly detection, clustering, and more.

  • With Composable ML, you can build and edit your own ML blueprints, incorporating DataRobot preprocessing and modeling algorithms, as well as your own models.

  • For text features in your data, use Text AI insights like Word Clouds and Text Mining to understand the impact of the text features.

  • Location AI supports geospatial analysis of modeling data. Use geospatial features to gain insights and visualize data using interactive maps before and after modeling.

Together, these workflows extend automated modeling to images, text, geospatial data, and custom blueprints.

ML modeling workflow

This section walks you through the steps for implementing a DataRobot modeling project.

  1. To begin the modeling process, import your data.

  2. DataRobot conducts the first stage of exploratory data analysis (EDA1), where it analyzes data features.

  3. Next, you select your target and a modeling mode, then start modeling.

    DataRobot generates feature lists from which to build models. By default, it uses the feature list with the most informative features. Alternatively, you can select different generated feature lists or customize your own.

  4. DataRobot performs EDA2 and further evaluates the data, determining which features correlate with the target (feature importance) and which features are informative, among other information.

    The application performs feature engineering—transforming, generating, and reducing the feature set depending on the project type and selected settings.

  5. DataRobot selects blueprints based on the project type and builds candidate models.
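
The idea of scoring features against the target and keeping the informative ones can be sketched as follows. This example uses plain Pearson correlation and an arbitrary cutoff (`min_abs_corr`), both of which are assumptions for illustration; it is not the importance measure DataRobot computes during EDA2, and the feature names are made up.

```python
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 when either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_features(
    features: Dict[str, Sequence[float]],
    target: Sequence[float],
    min_abs_corr: float = 0.3,
) -> Tuple[List[Tuple[str, float]], List[str]]:
    """Rank features by |correlation| with the target and list the informative ones."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    informative = [name for name, score in ranked if score >= min_abs_corr]
    return ranked, informative


# Hypothetical dataset: one useful feature, one noisy feature, one constant.
target = [1.0, 2.0, 3.0, 4.0, 5.0]
features = {
    "square_feet": [10.0, 20.0, 30.0, 40.0, 50.0],
    "noise": [3.0, 1.0, 3.0, 1.0, 3.0],
    "constant": [7.0, 7.0, 7.0, 7.0, 7.0],
}
ranked, informative = rank_features(features, target)
print(informative)
```

The `informative` list plays the role of a feature list: a subset of columns passed on to model building, which you could replace with a list of your own.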

Analyze and select a model

DataRobot automatically generates models and displays them on the Leaderboard. The recommended model displays at the top with a Recommended for Deployment indicator, but you can select any of the models to deploy.

To analyze and select a model:

  1. Compare models by selecting an optimization metric from the Metric dropdown, for example RMSE (root mean squared error) for a regression project.

  2. Analyze the model using the visualization tools that are best suited for the type of model you are building.

    See the list of project types and associated visualizations below.

  3. Experiment with modeling settings to potentially improve the accuracy of your model. You can try rerunning Autopilot using a different feature list or use a different modeling mode like Comprehensive Autopilot.

  4. After analyzing your models, select the best for deployment.

    Tip

    It's recommended that you test predictions before deploying. If you aren't satisfied with the results, you can revisit the modeling process and further experiment with feature lists and optimization settings. You might also find that gathering more informative data features can improve outcomes.

  5. As part of the deployment process, you upload data and make predictions. You can also set up a recurring batch prediction job.

  6. DataRobot monitors your deployment. Use the application's visualizations to track data (feature) drift, accuracy, bias, and service health. You can set up notifications so that you are regularly informed of the model's status.

    Tip

    Consider enabling automatic retraining to automate an end-to-end workflow. With automatic retraining, DataRobot regularly tests challenger models against the current best model (the champion model) and replaces the champion if a challenger outperforms it.
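
The metric comparison in step 1 and the champion/challenger idea from the tip above can be sketched together. The example computes RMSE for a few sets of predictions, sorts them as a leaderboard would, and keeps the current champion unless another model scores better. The model names, the data, and both helper functions are hypothetical; this is not DataRobot's retraining logic.

```python
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error between actual and predicted values."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))


def build_leaderboard(
    actual: Sequence[float], predictions: Dict[str, Sequence[float]]
) -> List[Tuple[str, float]]:
    """Score every model and sort from lowest (best) to highest RMSE."""
    scores = [(name, rmse(actual, preds)) for name, preds in predictions.items()]
    return sorted(scores, key=lambda item: item[1])


def select_champion(current: str, leaderboard: List[Tuple[str, float]]) -> str:
    """Replace the current champion only if another model has a lower RMSE."""
    scores = dict(leaderboard)
    best_name, best_score = leaderboard[0]
    return best_name if best_score < scores[current] else current


# Hypothetical validation data and predictions from three models.
actual = [3.0, 5.0, 7.0]
predictions = {
    "linear_model": [4.0, 6.0, 8.0],
    "boosted_trees": [3.0, 5.0, 7.0],
    "baseline": [3.0, 5.0, 10.0],
}
leaderboard = build_leaderboard(actual, predictions)
print(leaderboard[0][0], select_champion("linear_model", leaderboard))
```

For RMSE, lower is better, so the leaderboard sorts in ascending order; a metric such as AUC would need the opposite ordering.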

Which visualizations should I use?

DataRobot provides many visualizations for analyzing models. Not all visualization tools are applicable to all modeling projects—the visualizations you can access depend on your project type. The following table lists project types and examples of visualizations that are suited to their analysis:

Project type Analysis tools
All models
  • Feature Impact: Provides a high-level visualization of which features are most strongly driving model decisions (Understand > Feature Impact).
  • Feature Fit: Provides feature details ranked in order of model-agnostic importance (Evaluate > Feature Fit).
  • Feature Effects: Visualizes the effect of changes in the value of each feature on the model’s predictions (Understand > Feature Effects).
  • Prediction Explanations: Illustrates what drives predictions on a row-by-row basis, answering why a given model made a certain prediction (Understand > Prediction Explanations).
Regression
  • Lift Chart: Shows how well a model segments the target population and how capable it is of predicting the target (Evaluate > Lift Chart).
  • Residuals plot: Depicts the predictive performance and validity of a regression model by showing how linearly your models scale relative to the actual values of the dataset used (Evaluate > Residuals).
Classification
  • ROC Curve: Explores classification, performance, and statistics related to a selected model at any point on the probability scale (Evaluate > ROC Curve).
  • Confusion Matrix (binary projects): Compares actual data values with predicted data values in binary projects (Evaluate > ROC Curve).
  • Confusion Matrix (multiclass projects): Compares actual data values with predicted data values in multiclass projects (Evaluate > Confusion Matrix).
Time-aware modeling (time series and out-of-time validation)
  • Accuracy Over Time: Visualizes how predictions change over time (Evaluate > Accuracy Over Time).
  • Forecast vs Actual: Compares how predictions made from different forecast points behave at different times in the future (Evaluate > Forecast vs Actual).
  • Forecasting Accuracy: Provides a visual indicator of how well a model predicts at each forecast distance in the project’s forecast window (Evaluate > Forecasting Accuracy).
  • Stability: Provides an at-a-glance summary of how well a model performs on different backtests (Evaluate > Stability).
  • Over Time chart: Identifies trends and potential gaps in your data by visualizing how features change over the primary date/time feature (Data > Over Time).
Multiseries
  • Series Insights: Provides a histogram and table for series-specific information (Evaluate > Series Insights).
Segmented modeling
  • Segmentation tab: Displays data about each segment of a Combined Model (Describe > Segmentation).
Multilabel modeling
  • Feature Statistics: Helps evaluate a dataset with multilabel characteristics, providing a pairwise matrix so that you can visualize correlations, joint probability, and conditional probability of feature pairs (Data > Feature Statistics).
Visual AI
  • Image Embeddings: Displays a projection of images onto a two-dimensional space defined by similarity (Understand > Image Embeddings).
  • Activation Maps: Visualizes areas of images that a model is using when making predictions (Insights > Activation Maps).
Text AI
  • Word Cloud: Visualizes variable keyword relevancy (Understand > Word Cloud).
  • Text Mining: Visualizes relevancy of words and short phrases (Insights > Text Mining).
Geospatial AI
  • Geospatial Map: Provides exploratory spatial data analysis (ESDA) by visualizing the spatial distribution of observations (Data > Geospatial Map).
  • Accuracy Over Space: Provides a spatial residual mapping within an individual model (Evaluate > Accuracy Over Space).
Clustering
  • Cluster Insights: Captures latent features in your data, surfacing and communicating actionable insights and identifying segments for further modeling (Understand > Cluster Insights).
  • Image Embeddings: Displays a projection of images onto a two-dimensional space defined by similarity (Understand > Image Embeddings).
  • Activation Maps: Visualizes areas of images that a model is using when making predictions (Understand > Activation Maps).
Anomaly detection
  • Anomaly Over Time: Plots how anomalies occur across the timeline of your data (Evaluate > Anomaly Over Time).
  • Anomaly Assessment: Plots data for the selected backtest and provides SHAP explanations for up to 500 anomalous points (Evaluate > Anomaly Assessment).
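
To show what the Confusion Matrix and ROC Curve entries in the table summarize for a binary classification project, the sketch below counts outcomes at one probability threshold and derives the corresponding point on a ROC curve. The function names, the `threshold` default, and the sample data are assumptions for this example, not DataRobot output.

```python
from typing import Dict, Sequence, Tuple


def confusion_matrix(
    actual: Sequence[int], probabilities: Sequence[float], threshold: float = 0.5
) -> Dict[str, int]:
    """Count true/false positives and negatives at a probability threshold."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for label, prob in zip(actual, probabilities):
        predicted = 1 if prob >= threshold else 0
        if predicted == 1:
            counts["tp" if label == 1 else "fp"] += 1
        else:
            counts["fn" if label == 1 else "tn"] += 1
    return counts


def roc_point(counts: Dict[str, int]) -> Tuple[float, float]:
    """Return (false positive rate, true positive rate) for one threshold."""
    positives = counts["tp"] + counts["fn"]
    negatives = counts["fp"] + counts["tn"]
    tpr = counts["tp"] / positives if positives else 0.0
    fpr = counts["fp"] / negatives if negatives else 0.0
    return fpr, tpr


# Hypothetical labels (1 = positive class) and predicted probabilities.
actual = [1, 1, 0, 0, 1, 0]
probabilities = [0.9, 0.4, 0.6, 0.2, 0.8, 0.1]
counts = confusion_matrix(actual, probabilities, threshold=0.5)
print(counts, roc_point(counts))
```

Repeating the calculation across many thresholds traces out the full ROC curve, which is why the binary confusion matrix appears alongside it.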

Updated May 6, 2022