Clustering

Time series clustering is an out-of-the-box solution unique to DataRobot that enables you to easily identify and group similar series across a multiseries dataset. Instead of manually running a time series clustering technique outside the platform and then using the cluster assignments as a segmenting feature, the entire process is contained within the time series workflow. You do not need to be familiar with advanced concepts like Dynamic Time Warping (DTW), or be code-savvy, to use the clustering capability, because DataRobot builds both DTW and Velocity clustering models (see the detailed descriptions here).
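For readers curious about what DTW measures, the following is a minimal textbook sketch of a DTW distance in plain Python. It is not DataRobot's implementation; the function name `dtw_distance` and the absolute-difference step cost are assumptions chosen for illustration. The idea is that two series with the same shape score as close even when one is stretched or shifted in time.

```python
def dtw_distance(a, b):
    """Classic dynamic-programming DTW with an absolute-difference step cost.

    Illustrative only: `dtw_distance` is a hypothetical helper, not a DataRobot API.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] holds the cheapest alignment of a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: advance in a only, in b only, or in both.
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# The second series repeats one value; DTW aligns the repeat at no cost.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
# A series shifted up by one unit cannot be warped away completely.
print(dtw_distance([1, 2, 3], [2, 3, 4]))
```

Because the warping path may repeat points, series of different lengths can be compared directly, which a point-by-point distance does not allow.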

Note

Clustering is also available for non-time-aware projects, although segmented modeling is not.

Example: You are predicting shoe sales across your North American stores. With clustering, DataRobot can automatically group all stores in San Francisco and Cleveland into one cluster because the sales profiles for these locations are similar.

Simply put, clustering is a mechanism for grouping series together. Discovered clusters can then be used as input to time series segmented modeling, or simply to get a better understanding of the data. Without clustering, you define how to group the series based on a configured segment ID. Clustering, on the other hand, automatically groups series by looking at the data and determining which series look most similar. Once clusters are established, you can explore them, use them in a segmented modeling project, or save the clustering model to the Model Registry for later use.

When you cluster, there is no target ("output") variable. DataRobot groups series together based on their similarity. However, you must think about the target variable you will use in segmented modeling. DataRobot recommends using the variable you plan to select as the target in your segmented modeling project as the output variable for clustering.
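To make "grouping series by similarity" concrete, here is a small sketch that clusters the target values of a few stores by the shape of their sales curves. Everything in it is an assumption for illustration: the store data is invented, the helper names (`z_normalize`, `cluster_series`) are hypothetical, and the method (z-normalization plus a simple k-medoids loop with Euclidean distance) is a stand-in, not the algorithm DataRobot runs.

```python
from math import sqrt


def z_normalize(values):
    """Rescale a series to mean 0 and unit variance so shape, not scale, drives similarity."""
    mean = sum(values) / len(values)
    std = sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0:
        return [0.0] * len(values)
    return [(v - mean) / std for v in values]


def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_series(series_by_id, k, max_iterations=10):
    """Assign each series ID a cluster index using a basic k-medoids loop."""
    ids = sorted(series_by_id)
    if not 1 <= k <= len(ids):
        raise ValueError("k must be between 1 and the number of series")
    shapes = {sid: z_normalize(series_by_id[sid]) for sid in ids}

    def nearest(sid, medoids):
        return min(medoids, key=lambda m: euclidean(shapes[sid], shapes[m]))

    # Farthest-point initialization keeps the sketch deterministic.
    medoids = [ids[0]]
    while len(medoids) < k:
        candidates = [s for s in ids if s not in medoids]
        medoids.append(max(
            candidates,
            key=lambda s: min(euclidean(shapes[s], shapes[m]) for m in medoids),
        ))

    for _ in range(max_iterations):
        assignment = {sid: nearest(sid, medoids) for sid in ids}
        updated = []
        for m in medoids:
            members = [s for s in ids if assignment[s] == m]
            if not members:
                updated.append(m)
                continue
            # The new medoid is the member closest to all other members.
            updated.append(min(
                members,
                key=lambda c: sum(euclidean(shapes[c], shapes[o]) for o in members),
            ))
        if updated == medoids:
            break
        medoids = updated

    return {sid: medoids.index(nearest(sid, medoids)) for sid in ids}


# Invented weekly sales: two stores trend up, two trend down, at different scales.
sales = {
    "san_francisco": [10, 12, 15, 20],
    "cleveland": [100, 120, 150, 200],
    "miami": [20, 15, 12, 10],
    "dallas": [200, 150, 120, 100],
}
print(cluster_series(sales, k=2))
```

Only the single value column per series is used, which mirrors the recommendation above to cluster on the variable you intend to use as the segmented modeling target.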

See also the time series clustering considerations.

Cluster discovery

To allow DataRobot to discover clusters:

  1. Upload data, click No target?, and select Clusters.

    Modeling Mode defaults to Comprehensive and Optimization Metric defaults to Silhouette Score.

  2. Click Set up time-aware modeling and select the primary date/time feature. (Modeling Mode switches from Comprehensive to Autopilot.)

  3. Set the Series ID. DataRobot launches the time-aware clustering workflow—an unsupervised project with the Clusters option enabled.

  4. Set the feature(s) you want to cluster on. Note that only the selected features will be available for modeling. DataRobot automatically adds the date/time feature and series ID.

    • To use clusters in segmented modeling, add only the intended output variable ("target"). DataRobot recommends using the variable you plan to select as the target in your segmented modeling project as the output variable for clustering.

    • To cluster without segmentation, add any features.

    Click Set Cluster features.

    Info

    DataRobot does not use features created during the feature derivation process when clustering.

  5. (Optional) Change the number of clusters that DataRobot discovers. Click Clustering in the help text to open the advanced options Clustering tab. If you are using Manual mode, you can instead set the number from the Repository.

    Deep dive: Clustering buffer

    A clustering model has a start and end timestamp. The difference between start and end is the clustering training duration. Any time after the end is considered the holdout buffer.

    If there is enough data available, DataRobot creates a clustering buffer that can be seen in the Partitioning section of advanced options. The clustering buffer is a section of data that DataRobot calculates to represent what the holdout would be in a subsequent segmentation project. It then shifts the training data dates back to account for the holdout period, to prevent data leakage and to ensure that you are not training a clustering model into what will be the holdout partition in segmentation.

    To remove the buffer, toggle Include clustering buffer to off.

  6. Click Start to begin Autopilot.
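The clustering buffer described in the deep dive under step 5 can be pictured as simple date arithmetic. The sketch below assumes you already know the training duration and the holdout duration the later segmentation project would use; how DataRobot actually sizes the buffer is not specified here, so `holdout_duration` is an input, and the function name `clustering_window` is hypothetical.

```python
from datetime import date, timedelta


def clustering_window(data_end, training_duration, holdout_duration, include_buffer=True):
    """Return (train_start, train_end) for a clustering model.

    With the buffer on, training ends before the span that a later
    segmentation project would hold out, so the clustering model never
    sees those rows. Illustrative only, not DataRobot's partitioning code.
    """
    buffer = holdout_duration if include_buffer else timedelta(0)
    train_end = data_end - buffer
    train_start = train_end - training_duration
    return train_start, train_end


data_end = date(2024, 6, 30)
training = timedelta(days=90)
holdout = timedelta(days=30)

# Buffer on: training is shifted back by the assumed 30-day holdout.
print(clustering_window(data_end, training, holdout))
# Buffer off (Include clustering buffer toggled off): training runs to the end of the data.
print(clustering_window(data_end, training, holdout, include_buffer=False))
```

The training duration is the same in both cases; only its position on the timeline moves.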

You can use the discovered clusters for exploration, because clusters can capture latent behaviors that are not explicitly captured by a column in the dataset. Alternatively, continue the workflow to use the clusters in a segmented modeling project, or save the model to the Model Registry for later use.
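The Silhouette Score that clustering projects optimize by default (step 1 above) rewards clusters whose members are close to each other and far from other clusters. The following is a minimal sketch of the standard definition in plain Python, shown on invented one-dimensional points; the function name and the convention of scoring singleton clusters as 0 are assumptions, and DataRobot's own computation may differ in detail.

```python
def silhouette_score(points, labels, distance):
    """Mean silhouette coefficient over all points (range -1 to 1)."""
    scores = []
    for i, point in enumerate(points):
        # Collect distances from this point to every other point, per cluster.
        by_cluster = {}
        for j, other in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(distance(point, other))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            scores.append(0.0)  # singleton cluster or only one cluster
            continue
        a = sum(own) / len(own)  # mean distance within the point's own cluster
        b = min(sum(d) / len(d) for d in by_cluster.values())  # nearest other cluster
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


def absolute_difference(x, y):
    return abs(x - y)


points = [0, 1, 10, 11]
# A sensible grouping scores close to 1.
print(silhouette_score(points, [0, 0, 1, 1], absolute_difference))
# A grouping that splits neighbors apart scores below 0.
print(silhouette_score(points, [0, 1, 0, 1], absolute_difference))
```

Any pairwise distance can be passed in, so the same score could be computed over series using a series-level distance instead of the scalar difference used here.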

Use cluster models now

Once Autopilot completes, you can view the Series Insights tab for cluster and series distribution information. To create a segmented modeling project that uses the newly found clusters to define the segments:

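  1. Select a model from the Leaderboard and click Predict; the tab opens to Use for Segmentation. On this tab, you can create a segmented modeling project right away or save the clustering model to the Model Registry for later use.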

  2. Enter the target feature for the segmented modeling project in the What would you like the new project to predict? field.

  3. Click Create project and save to Model Registry.

    To save the clustering model and create the project later

    Instead of creating a segmentation project now, you can save the clustering model as a model package by selecting Save to Model Registry.

    Later you can build a segmented modeling project using the clustering model.

  4. Click Go to project.

    Your segmentation method is configured with the clustering model.

  5. Click Start to build your segmented model. At the prompt, confirm that you want to run a segmentation project.

    After modeling is complete, a Combined Model displays on the Leaderboard where you can explore the results and the model segments.

Tip

This procedure saves the time series clustering model as a model package. You can later create new segmented modeling projects using the saved clustering model package.

Use cluster models from the Model Registry

After you save a time series clustering model as a model package, you can use it in a new segmented modeling project.

Note

When building a segmented modeling project from a clustering project, you must use the same dataset that was used to generate the clusters.

  1. Use the standard workflow to set up a time series project:

    • Enter the target that you specified for What would you like the new project to predict? when you created clusters in the steps above.
    • Set Automated time series forecasting as the modeling method.
    • Set the series ID.
  2. Modify window settings as needed and click the pencil next to Segmentation method.

  3. Confirm building models per segment. Then, choose to use an Existing clustering model and click + Browse model registry in the definitions section.

  4. In the resulting popup window, select a time series clustering model package and click Select model package.

  5. The package is now listed as part of the segmentation definition screen. DataRobot applies the training length window from the clustering project to the segmentation project, which ensures that the clusters used for segmentation are the ones that were evaluated in the clustering project. Click Set segmentation method.

  6. Click Start to build your segmented model. At the prompt, confirm that you want to run a segmentation project.

    After modeling is complete, a Combined Model displays on the Leaderboard. You can explore the results and the segment models.


Updated July 3, 2024