# Clustering algorithms¶

Time series clustering groups the most similar series within a multiseries dataset so that those clusters can be applied directly as segment IDs in a segmented modeling project. Grouping similar series together produces more accurate segmented models, reducing time-to-value when creating complex multiseries projects.

DataRobot uses two algorithms for time series clustering: Velocity clustering and K-means clustering with dynamic time warping (DTW).

The following table compares Velocity and K-means:

Characteristic | Velocity clustering | K-means clustering |
---|---|---|
Speed | Fast | Slow |
Robust to irregular time steps | Yes | No |
Clusters based on series shape | No | Yes |
Series need to be the same length | No | No |

While a single series may contain many unique values that vary over time, the time series variant looks for similarities across the different series to identify which series most closely relate to one another. Series are classified by being most closely associated with a specific barycenter (the time series clustering equivalent of a centroid), which is derived from the original dataset.

## Velocity clustering¶

Velocity clustering is an unsupervised learning technique that splits series up based on summary statistics. The goal, as with most clustering, is to put similar series in the same cluster and series that differ significantly in different clusters. Specifically, it groups time series based on statistical properties such as the mean, standard deviation, and the percentage of zeros. The benefit of this approach is that time series with similar values within the feature derivation window are grouped together so that during segmented modeling, these features within the FDW have more signal.

Calculation for each clustering feature is as follows:

- Perform the given aggregation (e.g., mean, standard deviation) on all series.
- Divide the resulting aggregations into quantiles representing the number of desired clusters.
- Determine in which quantile the feature’s aggregation falls and assign the feature to that cluster.

The cluster assigned the most features is the final cluster for the series.
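The quantile-based assignment described above can be sketched as follows. This is a simplified, single-aggregation illustration rather than DataRobot's implementation, and `velocity_clusters` is a hypothetical helper name:

```python
import numpy as np

def velocity_clusters(series_list, n_clusters, agg=np.mean):
    """Assign each series to a cluster by quantile-binning an aggregation.

    Sketch of mean-aggregation Velocity clustering: aggregate each
    series, split the aggregations into quantile bins, and use the
    bin index as the cluster ID.
    """
    aggs = np.array([agg(s) for s in series_list])
    # Quantile edges that divide the aggregations into n_clusters bins.
    edges = np.quantile(aggs, np.linspace(0, 1, n_clusters + 1))
    # Map each aggregation to its bin; side="left" keeps boundary
    # values in the lower bin, and clip keeps indices in range.
    return np.clip(np.searchsorted(edges, aggs, side="left") - 1,
                   0, n_clusters - 1)

series = [np.array([1.0, 2.0, 1.0]),     # low mean
          np.array([10.0, 12.0, 11.0]),  # high mean
          np.array([1.5, 1.0, 2.0])]     # low mean
print(velocity_clusters(series, 2))      # the low-mean series share a cluster
```

In the full algorithm, each aggregation type votes in this way and the series takes the cluster with the most votes.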

DataRobot implements four types of Velocity clustering:

- Mean Aggregations
- Standard Deviation Aggregations
- Zero-Inflated & Mean Aggregations
- Zero-Inflated & Standard Deviation Aggregations

## K-means with DTW¶

Understanding the implementation of K-means clustering requires understanding how DataRobot measures the distance between two time series. This is done with Dynamic Time Warping (DTW). See the K-Means deep dive (as it relates to clustering and DTW), below.

### Single-feature DTW¶

DTW produces a similarity metric between two series by attempting to match the "shapes." The graph below shows an example where the matching between the indices is not 1-to-1—it's warped.

### Multi-feature DTW¶

The simple diagram above illustrates how to measure the distance between series focusing on a single feature. However, what if there are multiple features? For example, to calculate the distance between store A and store B using daily sales data and daily customer count data, do you calculate the average of the "DTW distance of sales data" and the "DTW distance of customer count data"?

DTW distance is calculated independently for each feature. The resulting values are then paired with K-means to do the actual clustering.

Think of a series as a matrix of shape `m x n`, where `m` is each feature represented and `n` is each point in time. Each series has its own `m x n` matrix, and a cluster contains a collection of such series. The distance matrix is calculated for each `m` and `n` point across the series. Note that the distance matrix itself is *not* the error metric; the error metric is calculated on top of the distance matrix.

Features are kept independent within DTW, resulting in a `2D` DTW representation instead of the `1D` representation in the image above. The actual K-means is an optimization of the resulting `2D` distance representations for each cluster.

### DTW match requirements¶

There are four requirements to create a match:

- Every index from series `A` must be matched with at least one index from series `B`, and vice versa.
- The first index of sequence `A` must be matched with the first index of sequence `B` (it may also match later indices).
- The last index of sequence `A` must be matched with the last index of sequence `B` (it may also match earlier indices).
- The mapping of indices from sequence `A` to sequence `B` (and vice versa) must be monotonically increasing, to prevent index mappings from crossing.

How does DataRobot know if a match is accurate? To calculate distance with DTW:

- Identify all valid matchings between the two time series.
- For each matching, calculate the sum of the squared differences between the matched values.
- Take the minimum of these sums across all matchings; that minimum is the DTW distance.
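The steps above are usually computed with the standard DTW dynamic program rather than by enumerating matchings. The following is a minimal sketch, not DataRobot's implementation; `dtw_distance` is a hypothetical helper name:

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic-programming DTW: cumulative cost of the best warping path.

    cost[i, j] holds the minimum sum of squared differences over all
    valid matchings that align a[:i] with b[:j].
    """
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            # A step may repeat an index of either series (warping),
            # but indices only move forward (monotonicity).
            cost[i, j] = d + min(cost[i - 1, j],      # repeat b[j-1]
                                 cost[i, j - 1],      # repeat a[i-1]
                                 cost[i - 1, j - 1])  # advance both
    return cost[n, m]

a = np.array([0.0, 0.0, 1.0, 0.0, 0.0])
b = np.array([0.0, 1.0, 0.0, 0.0, 0.0])  # same shape, peak shifted one step
print(dtw_distance(a, b))                # 0.0: warping absorbs the shift
```

The two series have identical shapes offset by one time step, so the best warping path matches the peaks directly and the distance is zero.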

## Deep dive: K-Means DTW clustering¶

K-means is an unsupervised learning algorithm that clusters data by trying to separate samples into *K* groups of equal variance, minimizing a criterion known as “inertia” or “within-cluster sum-of-squares.” This algorithm requires the number of clusters to be specified in advance. It scales well to large numbers of samples and has been used across a broad range of application areas in many different fields.

To apply the K-means algorithm to time series, DataRobot initializes clusters by selecting series to serve as barycenters. Specifically, DataRobot:

1. Identifies how many clusters to create (K).
2. Initializes clusters by randomly selecting K series as the barycenters.
3. Calculates the squared distance from each time series to each barycenter.
4. Assigns each time series to the cluster whose barycenter is at the smallest squared distance.
5. Recalculates the barycenter of each cluster.
6. Repeats steps 3–5 until convergence or the maximum number of iterations is reached.
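The loop above can be sketched as follows. To keep the sketch short and self-contained, it assumes equal-length series, uses Euclidean distance in place of DTW, the pointwise mean in place of a DTW barycenter, and a deterministic farthest-point initialization in place of random selection; `kmeans_series` is a hypothetical name:

```python
import numpy as np

def kmeans_series(X, k, n_iter=20):
    """Sketch of the K-means loop for equal-length series.

    X has shape (n_series, n_timesteps). The DTW variant swaps in DTW
    distances and a DTW barycenter.
    """
    # Initialize barycenters: the first series, then repeatedly the
    # series farthest from the already-chosen barycenters.
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        # Squared distance from every series to every barycenter,
        # then assign each series to its nearest barycenter.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each barycenter from its cluster's members
        # (keep the old barycenter if a cluster is empty).
        new_centers = np.array([X[labels == c].mean(axis=0)
                                if np.any(labels == c) else centers[c]
                                for c in range(k)])
        if np.allclose(new_centers, centers):   # converged
            break
        centers = new_centers
    return labels, centers

# Two obvious groups: flat series near 0 and flat series near 10.
X = np.vstack([np.zeros((5, 4)), np.full((5, 4), 10.0)])
labels, _ = kmeans_series(X, 2)
```

The two flat groups end up in separate clusters after the first pass, and the loop exits on the convergence check.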

### DTW distance calculations¶

A barycenter is a time series that is derived from other series, with the explicit goal of minimizing the distance between itself and those series. Given a time series dataset (`x`), a barycenter (`B`) is mathematically defined as the reference time series (`b`) that minimizes the sum of the squared distances (`d`) to the series in the dataset:

`B = argmin_b Σᵢ d(b, xᵢ)²`
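For Euclidean distance this minimization has a closed form: the barycenter of equal-length series is their pointwise mean (DTW barycenters instead typically require iterative methods such as DTW barycenter averaging). A small numeric check of that claim, with `total_sq_dist` as a hypothetical helper name:

```python
import numpy as np

# Three equal-length series and their pointwise mean.
X = np.array([[1.0, 2.0, 3.0],
              [3.0, 2.0, 1.0],
              [2.0, 2.0, 2.0]])
barycenter = X.mean(axis=0)   # [2., 2., 2.]

def total_sq_dist(b, X):
    # Sum over the dataset of the squared Euclidean distance to b.
    return ((X - b) ** 2).sum()

# Perturbing the mean in any direction increases the objective,
# consistent with the mean being the Euclidean barycenter.
print(total_sq_dist(barycenter, X))
print(total_sq_dist(barycenter + 0.1, X))
```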

Time series clustering algorithms such as K-means compute these distances using different distance measures, such as Euclidean distance and DTW.

With DTW K-means, DTW is used as the similarity measure between different series. One major advantage of DTW is that series do not need to be the same length or aligned with one another in time. Instead, two series can be compared after minimizing the temporal misalignment between them.

Compare this to Euclidean distance, where each point is compared with the point that occupies the same position in time in the other series (`T_0` is always compared with `T_0`). If `T_10` is a peak in series A and `T_30` is a peak in series B, then DTW compares `T_10` with `T_30`, while Euclidean distance compares `T_10` with `T_10`. While DTW is slightly slower than other methods, it provides a much more robust similarity metric. This is especially important given that most multiseries datasets have a wide variety of characteristics, including differing start and stop times for each series.