Fundamentals of DataRobot Classic¶
DataRobot uses automated machine learning (AutoML) to build models that solve real-world problems across domains and industries. DataRobot takes the data you provide, generates multiple machine learning (ML) models, and recommends the best model to put into use. You don't need to be a data scientist to build ML models using DataRobot, but an understanding of the basics will help you build better models. Combining your domain knowledge with DataRobot's automation helps you build accurate models quickly.
DataRobot supports many different approaches to ML modeling—supervised learning, unsupervised learning, time series modeling, segmented modeling, multimodal modeling, and more. This section describes these approaches and also provides tips for analyzing and selecting the best models for deployment. You can begin this process by logging in to DataRobot Classic.
If your organization is using an external account management system for single sign-on:
- If using LDAP, note that your username is not necessarily your registered email address. Contact your DataRobot administrator to obtain your username, if necessary.
- If using a SAML-based system, on the login page, ignore the entry box for credentials. Instead, click Single Sign-On and enter credentials on the resulting page.
ML modeling is the process of developing algorithms that learn by example from historical data. These algorithms predict outcomes and uncover patterns not easily discerned.
Supervised and unsupervised learning¶
The most basic form of machine learning is supervised learning.
With supervised learning, you provide "labeled" data. A label in a dataset provides information to help the algorithm learn from the data. The label—also called the target—is what you're trying to predict.
In a regression project, the target is a numeric value. A regression model estimates a continuous dependent variable given a list of input variables (also referred to as features or columns). Examples of regression problems include financial forecasting, time series forecasting, maintenance scheduling, and weather analysis.
In a classification project, the target is a category. A classification model groups observations into categories by identifying shared characteristics of certain classes. It compares those characteristics to the data you're classifying and estimates how likely it is that the observation belongs to a particular class. Classification projects can be binary (two classes) or multiclass (three or more classes). For classification, DataRobot also supports multilabel modeling where the target feature has a variable number of classes or labels; each row of the dataset is associated with one, several, or zero labels.
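The distinction between regression, binary, and multiclass targets can be sketched in a few lines of Python. This is an illustrative heuristic only, not DataRobot's actual target-detection logic. The function name `infer_project_type` is hypothetical, and treating any two-valued target as binary is an assumption made for the example. Multilabel targets are not covered.

```python
def infer_project_type(target_values):
    """Guess the supervised project type from a target column.

    Assumptions for this sketch: a target with exactly two distinct values is
    binary, any other numeric target is regression, and any other
    non-numeric target is multiclass.
    """
    values = [v for v in target_values if v is not None]
    distinct = set(values)
    if len(distinct) < 2:
        raise ValueError("target needs at least two distinct labeled values")
    if len(distinct) == 2:
        return "binary"
    # bool is a subclass of int, so exclude it from the numeric check.
    is_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    return "regression" if is_numeric else "multiclass"


print(infer_project_type([10.5, 3.2, 7.0]))       # numeric target
print(infer_project_type(["yes", "no", "yes"]))   # two classes
print(infer_project_type(["red", "green", "blue"]))  # three classes
```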
Another form of machine learning is unsupervised learning.
With unsupervised learning, the dataset is unlabeled and the algorithm must infer patterns in the data.
In an anomaly detection project, the algorithm detects unusual data points in your dataset. Use cases include detection of fraudulent transactions, faults in hardware, and human error during data entry.
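To make the idea of "unusual data points" concrete, here is a minimal sketch that flags outliers with a modified z-score based on the median absolute deviation. This is one simple, common technique chosen for illustration. It is not the set of anomaly detection algorithms that DataRobot runs, and the function name, threshold, and transaction amounts are all hypothetical.

```python
import statistics


def flag_anomalies(values, threshold=3.5):
    """Return indices of values whose modified z-score exceeds the threshold."""
    median = statistics.median(values)
    abs_dev = [abs(v - median) for v in values]
    mad = statistics.median(abs_dev)
    if mad == 0:
        # Simplification: with no spread around the median, nothing is scored.
        return []
    # 0.6745 rescales the MAD so the score is comparable to a standard z-score.
    return [i for i, d in enumerate(abs_dev) if 0.6745 * d / mad > threshold]


# Hypothetical transaction amounts with one suspicious entry.
amounts = [20.0, 22.5, 19.0, 21.0, 500.0, 20.5, 23.0]
suspicious = flag_anomalies(amounts)
```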
In a clustering project, the algorithm splits the dataset into groups according to similarity. Clustering is useful for gaining intuition about your data. The clusters can also help label your data so that you can then use a supervised learning method on the dataset.
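Grouping by similarity can be illustrated with a small one-dimensional k-means sketch (Lloyd's algorithm). It is a toy example under stated assumptions: the function name `kmeans_1d` is hypothetical, the data is made up, and DataRobot's clustering blueprints are not limited to, or necessarily implemented like, this approach.

```python
def kmeans_1d(values, k, iterations=20):
    """Assign each value to one of k clusters; returns a cluster index per value."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centroids across the sorted values.
    centroids = [ordered[(i * (len(ordered) - 1)) // max(k - 1, 1)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return labels


# Hypothetical measurements that form two obvious groups.
readings = [1.0, 1.2, 0.8, 10.0, 10.5, 9.5]
cluster_labels = kmeans_1d(readings, k=2)
```

The resulting cluster indices could then serve as labels for a follow-up supervised project, as described above.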
Time data is a crucial component in solving prediction and forecasting problems. DataRobot provides several methods and tools for time-aware modeling.
With time series modeling, you can generate a forecast: a series of predictions for a period of time in the future. You train time series models on past data to predict future events. You can predict a range of values in the future or use nowcasting to make a prediction at the current point in time. Use cases for time series modeling include predicting pricing and demand in domains such as finance, healthcare, and retail; in general, any domain where problems have a time component.
You can use time series modeling for a dataset containing a single series, but you can also build a model for a dataset that contains multiple series. For this type of multiseries project, one feature serves as the series identifier. An example is a "store location" identifier that essentially divides the dataset into multiple series, one for each location. So you might have four store locations (Paris, Milan, Dubai, and Tokyo) and therefore four series for modeling.
With a multiseries project, you can choose to generate a model for each series using segmented modeling. In this case, DataRobot creates a deployment using the best model for each segment.
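The role of the series identifier can be sketched as a simple grouping step. The example below splits a dataset into one time-ordered series per store location. The column names and sales figures are hypothetical, and the code only illustrates the concept. It does not reproduce how DataRobot partitions or models multiseries data.

```python
from collections import defaultdict


def split_into_series(rows, series_id, date_col):
    """Group rows by the series identifier and sort each group by date."""
    series = defaultdict(list)
    for row in rows:
        series[row[series_id]].append(row)
    return {
        key: sorted(group, key=lambda r: r[date_col])
        for key, group in series.items()
    }


# Hypothetical daily sales; ISO date strings sort chronologically.
sales = [
    {"date": "2024-01-02", "store_location": "Paris", "sales": 120},
    {"date": "2024-01-01", "store_location": "Paris", "sales": 100},
    {"date": "2024-01-01", "store_location": "Milan", "sales": 80},
    {"date": "2024-01-01", "store_location": "Dubai", "sales": 95},
    {"date": "2024-01-01", "store_location": "Tokyo", "sales": 110},
]
by_store = split_into_series(sales, series_id="store_location", date_col="date")
```

With four store locations, `by_store` holds four series, each of which could be treated as its own segment.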
Sometimes, the dataset for the problem you're solving contains date and time information, but instead of generating a forecast as you do with time series modeling, you predict a target value on each individual row. This approach is called out-of-time validation (OTV).
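The core idea of out-of-time validation is that the model trains on older rows and is evaluated on newer ones. The sketch below shows a single date-based split. It is a simplification, assuming one cutoff date and hypothetical loan records, and it does not represent DataRobot's date partitioning settings.

```python
from datetime import date


def out_of_time_split(rows, date_col, cutoff):
    """Train on rows strictly before the cutoff; validate on rows at or after it."""
    train = [r for r in rows if r[date_col] < cutoff]
    validation = [r for r in rows if r[date_col] >= cutoff]
    return train, validation


# Hypothetical loan records; the target is predicted for each individual row.
loans = [
    {"issued": date(2023, 11, 5), "amount": 5000, "defaulted": 0},
    {"issued": date(2024, 2, 14), "amount": 12000, "defaulted": 1},
    {"issued": date(2023, 6, 30), "amount": 7500, "defaulted": 0},
    {"issued": date(2024, 4, 1), "amount": 3000, "defaulted": 0},
]
train, validation = out_of_time_split(loans, "issued", cutoff=date(2024, 1, 1))
```

Unlike a random split, this keeps every validation row later in time than every training row, which mirrors how the model is used after deployment.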
In addition to supervised time series models, you can develop time series anomaly detection models.
See What is time-aware modeling for an in-depth discussion of these strategies.
Specialized modeling workflows¶
DataRobot provides specialized workflows to help you address a wide range of problems.
Visual AI allows you to include images as features in your datasets. Use the image data alongside other data types to improve outcomes for various types of modeling projects—regression, classification, anomaly detection, clustering, and more.
Location AI supports geospatial analysis of modeling data. Use geospatial features to gain insights and visualize data using interactive maps before and after modeling.
Together, these modeling strategies let you apply automated modeling to a wide range of projects.
ML modeling workflow¶
This section walks you through the steps for implementing a DataRobot modeling project.
To begin the modeling process, import your data.
DataRobot generates feature lists from which to build models. By default, it uses the feature list with the most informative features. Alternatively, you can select different generated feature lists or customize your own.
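A feature list is a named subset of the dataset's columns. As a rough illustration of what "informative" might mean, the sketch below drops columns that are constant or that look like unique row identifiers. These two rules are assumptions chosen for the example, and the function name is hypothetical. The criteria DataRobot applies when generating its feature lists are not reproduced here.

```python
def informative_features(rows, target):
    """Return feature names that vary across rows and are not ID-like."""
    keep = []
    for col in rows[0]:
        if col == target:
            continue
        distinct = {row[col] for row in rows}
        is_constant = len(distinct) == 1
        # Assumption: a text column with a unique value per row is an identifier.
        is_id_like = len(distinct) == len(rows) and all(
            isinstance(v, str) for v in distinct
        )
        if not (is_constant or is_id_like):
            keep.append(col)
    return keep


# Hypothetical customer records with "churned" as the target.
customers = [
    {"customer_id": "C001", "country": "FR", "age": 34, "plan": "basic", "churned": 0},
    {"customer_id": "C002", "country": "FR", "age": 51, "plan": "premium", "churned": 1},
    {"customer_id": "C003", "country": "FR", "age": 27, "plan": "basic", "churned": 0},
]
feature_list = informative_features(customers, target="churned")
```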
The application performs feature engineering—transforming, generating, and reducing the feature set depending on the project type and selected settings.
DataRobot selects blueprints based on the project type and builds candidate models.
Analyze and select a model¶
DataRobot automatically generates models and displays them on the Leaderboard. The recommended model displays at the top with a Recommended for Deployment indicator, but you can select any of the models to deploy.
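Conceptually, a leaderboard is a list of candidate models ordered by a validation metric. The sketch below ranks hypothetical models by a hypothetical LogLoss score. It only illustrates the ordering, under the assumption that a single metric decides the rank, and does not describe how DataRobot chooses its recommended model.

```python
def rank_models(scores, lower_is_better=True):
    """Sort (model_name, validation_score) pairs from best to worst."""
    return sorted(scores, key=lambda pair: pair[1], reverse=not lower_is_better)


# Hypothetical candidates scored with LogLoss, where lower is better.
candidates = [
    ("Gradient boosted trees", 0.312),
    ("Regularized logistic regression", 0.341),
    ("Random forest", 0.327),
]
leaderboard = rank_models(candidates, lower_is_better=True)
best_name, best_score = leaderboard[0]
```

For metrics where higher is better, such as AUC, pass `lower_is_better=False`.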
To analyze and select a model:
- Analyze the model using the visualization tools that are best suited for the type of model you are building. See the list of project types and associated visualizations below.
- Experiment with modeling settings to potentially improve the accuracy of your model. You can try rerunning Autopilot using a different feature list or use a different modeling mode like Comprehensive Autopilot.
- After analyzing your models, select the best one for deployment.
Test predictions before deploying. If you aren't satisfied with the results, you can revisit the modeling process and further experiment with feature lists and optimization settings. You might also find that gathering more informative features can improve outcomes.
DataRobot monitors your deployment. Use the application's visualizations to track data (feature) drift, accuracy, bias, and service health. You can set up notifications so that you are regularly informed of the model's status.
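Data drift means that the values a deployed model sees differ from the values it was trained on. One widely used way to quantify this for a numeric feature is the Population Stability Index (PSI). The sketch below is a minimal version that assumes equal-width bins taken from the training data. The function name, bin count, and sample data are hypothetical, and the code is not a description of the deployment monitoring internals.

```python
import math


def population_stability_index(expected, actual, bins=10):
    """Compare two samples of one numeric feature; 0.0 means identical bin shares."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the training data is constant

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Clamp values outside the training range into the edge bins.
            counts[min(max(idx, 0), bins - 1)] += 1
        # A small floor avoids log(0) and division by zero for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    p_exp, p_act = proportions(expected), proportions(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(p_exp, p_act))


# Hypothetical feature values at training time and after deployment.
training = [float(v) for v in range(100)]
scoring = [v + 50.0 for v in training]
drift = population_stability_index(training, scoring)
```

A larger PSI indicates a larger shift between the training and scoring distributions. Thresholds for "significant" drift vary by team and are not implied by this example.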
Which visualizations should I use?¶
DataRobot provides many visualizations for analyzing models. Not all visualization tools are applicable to all modeling projects—the visualizations you can access depend on your project type. The following table lists project types and examples of visualizations that are suited to their analysis:
|Project type|Analysis tools|
|---|---|
|Time-aware modeling (time series and out-of-time validation)| |
|Multiseries|Series Insights: Provides a histogram and table for series-specific information (Evaluate > Series Insights).|
|Segmented modeling|Segmentation tab: Displays data about each segment of a Combined Model (Describe > Segmentation).|
|Multilabel modeling|Feature Statistics: Helps evaluate a dataset with multilabel characteristics, providing a pairwise matrix so that you can visualize correlations, joint probability, and conditional probability of feature pairs (Data > Feature Statistics).|