
Unsupervised time-aware modeling

Unsupervised learning uses unlabeled data to surface insights about patterns in your data. Supervised learning, by contrast, requires a labeled target and uses the other features of your dataset to make forecasts and predictions. The unsupervised learning setup is described below.

Note

There is extensive material available about the fundamentals of time-aware modeling. While the instructions largely represent the workflow as applied in DataRobot Classic, the reference material describing the framework, feature derivation process, and more is still applicable.

Create a basic experiment

Follow the steps below to create a new experiment from within a Use Case.

Note

You can also start modeling directly from a dataset by clicking the Start modeling button. The Set up new experiment page opens. From there, the instructions follow the flow described below.

Create a feature list

Before modeling, you can create a custom feature list from the data explore page. You can then select that list during modeling setup to create the modeling data using only the features in that list.

DataRobot automatically creates new feature lists after the feature derivation process. Once modeling completes, you can train new models using the time-aware lists; see the feature list documentation for details on working with lists post-modeling.

Add experiment

From within a Use Case, click Add and select Experiment. The Set up new experiment page opens, which lists all data previously loaded to the Use Case.

Add data

Add data to the experiment, either by adding new data (1) or selecting a dataset that has already been loaded to the Use Case (2).

Once the data is loaded to the Use Case (option 2 above), click to select the dataset you want to use in the experiment. Workbench opens a preview of the data.

From here, you can:

Option Action
1 Click to return to the data listing and choose a different dataset.
2 Click the icon to proceed and set the learning type and target.
3 Click Next to proceed and set the learning type and target.

Start modeling setup

After you proceed, Workbench prepares the dataset for modeling (EDA1).

Note

From this point forward in experiment creation, you can either continue setting up your experiment (Next) or exit. If you click Exit, you are prompted to either discard changes or save all progress as a draft. In either case, on exit you are returned to the point where you began experiment setup and EDA1 processing is lost. If you choose Exit and save draft, the draft is available in the Use Case directory.

If you open a Workbench draft in DataRobot Classic and make changes that introduce features not supported in Workbench, the draft will be listed in your Use Case but will not be accessible except through the classic interface.

Set learning type

Typically DataRobot works with labeled data, using supervised learning methods for model building. With supervised learning, you specify a target and DataRobot builds models using the other features of your dataset to make predictions.

In unsupervised learning, no target is specified and the data is unlabeled. Instead of generating predictions, unsupervised learning surfaces insights about patterns in your data, answering questions like "Are there anomalies in my data?" and "Are there natural clusters?"

To create an unsupervised learning experiment after EDA1 completes, from the Learning type dropdown, choose one of:

Learning type Description
Supervised Builds models using the other features of your dataset to make forecasts and predictions; this is the default learning type.
Clustering (unsupervised) Using no target and unlabeled data, builds models that group similar data and identify segments.
Anomaly detection (unsupervised) Using no target and unlabeled data, builds models that detect abnormalities in the dataset.

See the time series-specific feature considerations for things to know when working with clustering.

Note

Time series clustering requires multiseries datasets. Also, non-time series date/time partitioned clustering is not available.
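The steps above use the Workbench UI, but the same learning-type choice can be scripted with the DataRobot Python client. The following is a minimal sketch assuming the `datarobot` package and a recent client version; the `unsupervised_type` argument and the `UnsupervisedTypeEnum` name are assumptions to verify against your client's documentation.

```python
# Minimal sketch: creating an unsupervised experiment with the DataRobot
# Python client. Endpoint, token, and file name are placeholders.
import datarobot as dr

dr.Client(endpoint="https://app.datarobot.com/api/v2", token="YOUR_API_TOKEN")

project = dr.Project.create("multiseries_sales.csv", project_name="unsupervised-demo")

# unsupervised_mode=True tells DataRobot not to expect a target.
# unsupervised_type chooses between anomaly detection and clustering in
# recent client versions (an assumption; verify in your client docs).
project.set_target(
    unsupervised_mode=True,
    unsupervised_type=dr.enums.UnsupervisedTypeEnum.ANOMALY,
    mode=dr.AUTOPILOT_MODE.QUICK,
)
```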

Clustering

Clustering lets you explore your data by grouping and identifying natural segments from many types of data—numeric, categorical, text, image, and geospatial data—independently or combined. In clustering mode, DataRobot captures a latent behavior that's not explicitly captured by a column in the dataset. It is useful when data doesn't come with explicit labels and you have to determine what they should be. Examples of clustering include:

  • Detecting topics, types, taxonomies, and languages in a text collection. You can apply clustering to datasets containing a mix of text features and other feature types or a single text feature for topic modeling.

  • Segmenting a customer base before running a predictive marketing campaign. Identify key groups of customers and send different messages to each group.

  • Capturing latent categories in an image collection.

Configure clustering

To set up a clustering experiment, set the Learning type to Clustering. Because unsupervised experiments do not specify a target, the Target feature field is removed and the other basic settings become available.

The table below describes each field:

Field Description
Modeling mode The modeling mode, which influences the blueprints DataRobot chooses to train. Comprehensive Autopilot, the default, runs all repository blueprints on the maximum Autopilot sample size to provide the most accurate similarity groupings.
Optimization metric Defines how DataRobot scores clustering models. For clustering experiments, Silhouette score is the only supported metric.
Training feature list Defines the subset of features that DataRobot uses to build models.

Set the number of clusters

DataRobot trains multiple models, one for each algorithm that supports setting a fixed number of clusters (such as K-Means or Gaussian Mixture Model). The number trained is based on what is specified in Number of clusters, with default values based on the number of rows in the dataset.

For example, if the values are set as shown above, DataRobot runs clustering algorithms using 3, 5, 7, and 10 clusters.

To customize the number of clusters that DataRobot trains, expand Show additional automation settings and enter values within the provided range.
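For intuition, the sketch below uses plain scikit-learn (not the DataRobot API) to mirror this behavior: it trains one K-Means model per candidate cluster count and scores each with the Silhouette metric, the same metric DataRobot uses to rank clustering models.

```python
# Conceptual sketch (scikit-learn, not DataRobot): train one clustering
# model per candidate cluster count and score each with Silhouette.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))      # stand-in for your modeling data

for k in (3, 5, 7, 10):            # the default candidate counts above
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(f"k={k:2d}  silhouette={silhouette_score(X, labels):.3f}")
```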

Enable time series clustering

When initial settings are complete:

  1. Enable time series modeling.
  2. Set a series ID.
  3. Click Edit selection to select at least one clustering feature. Any feature you add will be in addition to the ordering feature and series ID, which DataRobot automatically includes. Be aware that each feature added will increase modeling time, so best practice recommends you:

    • Choose features whose values change over time
    • Avoid selecting low-importance features

  4. Review the setup in the left panel, which summarizes the configuration and shows that DataRobot has applied a special time series clustering feature list (it cannot be changed once clustering is configured). Click Partitioning to change the clustering buffer setting, if desired, or click Start modeling. A scripted equivalent of this setup is sketched below.
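For reference, a comparable time series clustering setup can be sketched with the Python client. The column names ("date", "store_id") are placeholders, and `unsupervised_type` support varies by client version; treat this as an outline, not a recipe.

```python
# Sketch: time series clustering setup via the DataRobot Python client.
import datarobot as dr

# The ordering feature and series ID from steps 1-2 above.
spec = dr.DatetimePartitioningSpecification(
    datetime_partition_column="date",
    use_time_series=True,
    multiseries_id_columns=["store_id"],
)

project = dr.Project.create("multiseries_sales.csv", project_name="ts-clustering")

# unsupervised_type is an assumption based on recent client versions;
# verify against your client's documentation.
project.set_target(
    unsupervised_mode=True,
    unsupervised_type=dr.enums.UnsupervisedTypeEnum.CLUSTERING,
    partitioning_method=spec,
)
```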

Change clustering partitioning

On the Partitioning tab, you cannot change the number of backtest partitions; only one backtest is allowed with clustering. Clustering does not set aside rows for holdout. Instead, it provides an option to include a clustering buffer. Toggle the buffer on or off to change the durations: when a clustering buffer is included, the training duration is smaller and validation is unchanged.

Anomaly detection

Anomaly detection, also referred to as outlier or novelty detection, is an application of unsupervised learning. It can be used, for example, in cases where there are thousands of normal transactions with a low percentage of abnormalities, such as network and cyber security, insurance fraud, or credit card fraud. Although supervised methods are very successful at predicting these abnormal, minority cases, it can be expensive and very time-consuming to label the relevant data. See the feature considerations for important information about working with anomaly detection.

Configure anomaly detection

To set up an anomaly detection experiment, set the Learning type to Anomaly detection. No target feature is required.

The table below describes each field:

Field Description
Modeling mode The modeling mode, which influences the blueprints DataRobot chooses to train. Quick Autopilot, the default, provides a base set of models that build and provide insights quickly.
Optimization metric Defines how DataRobot scores anomaly detection models. For anomaly detection experiments, Synthetic AUC is the default, and recommended, metric.
Training feature list Defines the subset of features that DataRobot uses to build models.

Enable date/time anomaly detection

To use anomaly detection for time-aware projects, change the partitioning method on the Data partitioning tab. Then configure date/time partitioning as with any other time-aware experiment (ordering feature and backtest partition configuration).

Enable anomaly detection for time series

To use anomaly detection for time series:

  1. Enable time series. The ordering feature is carried over from the date/time partitioning configuration.
  2. Set the series ID.
  3. Review the window settings. Note that for anomaly detection, only the days in advance of the forecast point in the feature derivation window can be changed.

When settings are complete, click Start modeling.
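The equivalent setup can be sketched with the Python client. Column names, file names, and window values below are illustrative assumptions; the forecast-window-at-zero convention reflects the UI behavior described above, where only the days in advance of the forecast point can be changed.

```python
# Sketch: time series anomaly detection via the DataRobot Python client.
import datarobot as dr

spec = dr.DatetimePartitioningSpecification(
    datetime_partition_column="date",
    use_time_series=True,
    multiseries_id_columns=["sensor_id"],
    # Only the feature derivation window start (days in advance of the
    # forecast point) is typically adjusted for anomaly detection; the
    # forecast window collapses to the forecast point (an assumption
    # drawn from the UI behavior described above).
    feature_derivation_window_start=-28,
    feature_derivation_window_end=0,
    forecast_window_start=0,
    forecast_window_end=0,
)

project = dr.Project.create("sensor_readings.csv", project_name="ts-anomaly")
project.set_target(
    unsupervised_mode=True,
    partitioning_method=spec,
    mode=dr.AUTOPILOT_MODE.QUICK,
)
```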

Unsupervised insights

After you start modeling, DataRobot populates the Leaderboard with models as they complete. The following table describes the insights available for unsupervised anomaly detection (AD) and clustering for date/time-partitioned experiments.

Insight AD for OTV AD for time series Clustering for time series
Anomaly Assessment N Y N
Anomaly Over Time Y Y N
Blueprint Y Y Y
Feature Effects Y Y Y
Feature Impact Y Y Y
Prediction Explanations Y Y N
Stability Y Y N
Series Insights N N Y

Feature considerations

Unsupervised learning availability is license-dependent:

Feature Predictive Date/time partitioned Time series
Anomaly detection Generally available Generally available Premium (time series license)
Clustering Premium (Clustering license) Not available Premium (time series license)

Clustering considerations

When using clustering, consider the following:

  • Datasets for clustering projects must be less than 5GB.
  • The following is not supported:

    • Relational data (summarized categorical features, for example)
    • Word Clouds
    • Feature Discovery projects
    • Prediction Explanations
    • Scoring Code
    • Composable ML
  • Clustering models can be deployed to dedicated prediction servers, but Portable Prediction Servers (PPS) and monitoring agents are not supported.

  • The maximum number of clusters is 100.

Time series-specific considerations

  • Clustering is only available for multiseries time series projects. Your data must contain a time index and at least 10 series.

  • To create X clusters, you need at least X series, each with 20+ time steps. (For example, if you specify 3 clusters, at least three of your series must be a length of 20 time steps or more.)

  • The union of all selected series must collectively span at least 35 time steps.

  • At least two clusters must be discovered for the clustering model to be used in a segmented modeling run.
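As a quick pre-flight check before configuring clustering, you can verify these requirements with a few lines of pandas. This is a sketch, not a DataRobot API; column names are placeholders.

```python
# Check a multiseries dataset against the time series clustering
# requirements listed above: 10+ series, at least n_clusters series
# with 20+ time steps, and a union spanning 35+ time steps.
import pandas as pd

def check_ts_clustering_ready(df: pd.DataFrame, series_col: str,
                              date_col: str, n_clusters: int) -> list[str]:
    problems = []
    lengths = df.groupby(series_col)[date_col].nunique()
    if len(lengths) < 10:
        problems.append(f"only {len(lengths)} series; need at least 10")
    if (lengths >= 20).sum() < n_clusters:
        problems.append(
            f"need at least {n_clusters} series with 20+ time steps, "
            f"found {(lengths >= 20).sum()}"
        )
    if df[date_col].nunique() < 35:
        problems.append(
            f"union of series spans {df[date_col].nunique()} time steps; need 35+"
        )
    return problems

# Example usage:
# df = pd.read_csv("multiseries_sales.csv", parse_dates=["date"])
# print(check_ts_clustering_ready(df, "store_id", "date", n_clusters=3) or "OK")
```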

What does it mean to "discover" clusters?

To build clusters, DataRobot must be able to group data into two or more distinct groups. For example, if a dataset has 10 series but they are all copies of the same single series, DataRobot would not be able to discover more than one cluster. In a more realistic example, very slight time shifts of the same data will also not be discoverable as separate clusters. If the data is so mathematically similar that it cannot be separated into different clusters, it cannot subsequently be used for segmentation.

The "closeness" of the data is model-dependent—the convergence conditions are different. Velocity clustering would not converge if a project has 10 series, all with the same means. That, however, does not imply that K-means itself wouldn't converge.

Note, however, the restrictions are less strict if clusters are not being used for segmentation.

Anomaly detection considerations

Consider the following when working with anomaly detection projects:

  • In the case of numeric missing values, DataRobot supplies the imputed median (which, by definition, is non-anomalous).

  • The higher the number of features in a dataset, the longer it takes DataRobot to detect anomalies and the more difficult it is to interpret results. If you have more than 1000 features, be aware that the anomaly score becomes difficult to interpret, making it potentially difficult to identify the root cause of anomalies.

  • If you train an anomaly detection model on more than 1000 features, insights in the Understand tab are not available. These include Feature Impact, Feature Effects, Prediction Explanations, Word Cloud, and Document Insights (if applicable).

  • Because anomaly scores are normalized, DataRobot labels some rows as anomalies even if they are not far from normal. For training data, the most anomalous row will have a score of 1. For some models, test data and external data can have anomaly score predictions greater than 1 if a row is more anomalous than any row in the training data (see the sketch after this list).

  • Synthetic AUC is an approximation based on creating synthetic anomalies and inliers from the training data.

  • Synthetic AUC scores are not available for blenders that contain image features.

  • Feature Impact for anomaly detection models trained from DataRobot blueprints is always computed using SHAP. For anomaly detection models from user blueprints, Feature Impact is computed using the permutation-based approach.

  • Because time series anomaly detection is not yet optimized for pure text data anomalies, data must contain some numerical or categorical columns.

  • The following methods are implemented and tunable:

Method Details
Isolation Forest
  • Up to 2 million rows
  • Dataset < 500 MB
  • Number of numerical + categorical + text columns > 2
  • Up to 26 text columns
Double Mean Absolute Deviation (MAD)
  • Any number of rows
  • Datasets of all sizes
  • Up to 26 text columns
One Class Support Vector Machine (SVM)
  • Up to 10,000 rows
  • Dataset < 500 MB
  • Number of numerical + categorical + text columns < 500
Local outlier factor
  • Up to 500,001 rows
  • Dataset < 500 MB
  • Up to 26 text columns
Mahalanobis Distance
  • Any number of rows
  • Datasets of all sizes
  • Up to 26 text columns
  • At least one numerical or categorical column
  • The following is not supported:

    • Projects with weights or offsets, including smart downsampling

    • Scoring Code

  • Anomaly detection does not consider geospatial data (that is, models will build but those data types will not be present in blueprints).
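To illustrate the score normalization described above, here is a sketch using scikit-learn's IsolationForest. This is an analogy under stated assumptions, not DataRobot's internal implementation: training scores are scaled so the most anomalous training row gets 1.0, and unseen rows can score above 1.0.

```python
# Illustration of normalized anomaly scores (scikit-learn, not DataRobot).
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 3))
X_new = np.array([[8.0, 8.0, 8.0]])   # an extreme point not seen in training

model = IsolationForest(random_state=0).fit(X_train)

# score_samples returns higher values for inliers, so negate it to get
# "larger = more anomalous", then scale by the training maximum.
raw_train = -model.score_samples(X_train)
raw_new = -model.score_samples(X_new)
scale = raw_train.max()

print("max normalized training score:", (raw_train / scale).max())  # exactly 1.0
print("normalized score for new row: ", raw_new / scale)            # may exceed 1.0
```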

Additionally, for time series projects:

  • Millisecond data is the lower limit of data granularity.
  • Datasets must be less than 1GB.
  • Some blueprints don’t run on purely categorical data.
  • Some blueprints are tied to feature lists and expect certain features (e.g., Bollinger Band rolling must be run on a feature list with robust z-score features only).
  • For time series projects with periodicity, applying periodicity affects feature reduction and processing priorities; if there are too many features, seasonal features are not included in the Time Series Extracted and Time Series Informative Features lists.

Additionally, the time series considerations apply.


Updated August 28, 2024