
Cluster the data

The next step in the demand forecasting workflow is to cluster the data. If your dataset has a large number of series, it can be useful to break the dataset up into smaller datasets that group similar items together.
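
The code in this section assumes the imports, constants, and training data (df) set up earlier in the workflow. A minimal setup sketch is shown below; the column names and seed are placeholders for your own data.

# Assumed imports and configuration for this section (values are illustrative)
from datetime import timedelta

import pandas as pd
import plotly.express as px
import umap
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

RANDOM_STATE = 42          # fixed seed used throughout for reproducibility
DATE_COL = 'Date'          # timestamp column in df
SERIES_ID = 'Series_ID'    # column that identifies each series
TARGET = 'Sales'           # value being forecast

# df is the long-format training data; the pivots below expect a numeric 'Month' column
df['Month'] = pd.to_datetime(df[DATE_COL]).dt.month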

Get monthly sales

# Option 1: aggregate only the last 12 months of sales
# agg = df.loc[df[DATE_COL] >= df[DATE_COL].max() - timedelta(days=365)].pivot_table(values=TARGET, index=SERIES_ID, columns='Month', aggfunc='sum').fillna(value=0)
# agg.reset_index(inplace=True)

# Option 2: aggregate all data
agg = df.pivot_table(values=TARGET, index=SERIES_ID, columns='Month', aggfunc='sum').fillna(value=0)
agg.reset_index(inplace=True)

# Rename the pivoted month columns and inspect the result
agg.columns = [SERIES_ID, 'January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
agg.head(2)

# Scale the monthly totals to a common range so that every series contributes equally to the clustering
X = agg[['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']].copy()
X = MinMaxScaler(feature_range=(0, 5)).fit_transform(X)

Configure a cluster

Use the code below to set the desired cluster size, which indicates how many series you want to include in each cluster. The actual size of each cluster can fluctuate based on its density.

UNIQUE_SERIES = df[SERIES_ID].nunique()
EPS_TRUNCATE = int(UNIQUE_SERIES * 0.05)
CLUSTER_SIZE = 50  # Set the desired number of series per cluster
NUM_CLUSTERS = int(UNIQUE_SERIES / CLUSTER_SIZE)

print("Number of Unique Series: {}".format(UNIQUE_SERIES))
print("Number of Clusters: {}".format(NUM_CLUSTERS))

Dimensionality reduction and clustering

Note that the code below sets random_state so that the dimensionality reduction and clustering results are reproducible.

# Reduce the 12 scaled monthly features to three UMAP components
pca = umap.UMAP(n_neighbors=10,
                n_components=3, 
                metric='euclidean',   
                output_metric='euclidean',  
                learning_rate=1.0, 
                init='spectral', 
                min_dist=0.5, 
                spread=1.0,  
                set_op_mix_ratio=1.0,
                local_connectivity=1.0, 
                repulsion_strength=1.0, 
                negative_sample_rate=5, 
                transform_queue_size=4.0,  
                angular_rp_forest=False, 
                target_n_neighbors=-1, 
                target_metric='categorical', 
                target_weight=0.5,
                random_state=RANDOM_STATE
                )

principalComponents = pca.fit_transform(X)
embedding = pd.DataFrame(data = principalComponents, columns = ['PC_1', 'PC_2', 'PC_3'])

# Cluster the series in the three-component UMAP embedding
clusterer = KMeans(n_clusters=NUM_CLUSTERS, random_state=RANDOM_STATE)
clusterer.fit(embedding)

# Attach the cluster labels and embedding components back to the monthly aggregates
labels = pd.DataFrame(clusterer.labels_, columns=['Cluster'])
principalDf = agg.join([labels, embedding])
principalDf = principalDf[[SERIES_ID, 'Cluster', 'January', 'February', 
                           'March', 'April', 'May', 'June', 'July', 'August', 
                           'September', 'October', 'November', 'December', 
                           'PC_1', 'PC_2', 'PC_3']]

# 2D
# fig = px.scatter(principalDf, x='PC_1', y='PC_2', color='Cluster', width=600, height=600)

# 3D
fig = px.scatter_3d(principalDf, x='PC_1', y='PC_2', z='PC_3', color='Cluster', 
                    hover_data=['Cluster'], width=800, height=800)
fig.show()

View a sample image of the chart below.


# Confirm that no series were dropped during clustering
print(len(df[SERIES_ID].unique()))
print(len(principalDf[SERIES_ID].unique()))

Retrieve counts for each cluster

Use the code below to determine the number of series in each cluster.

pd.DataFrame(principalDf.groupby(['Cluster'])['Cluster'].count())
        Cluster
Cluster
0           163
1           201
2            12

Plot clusters at the series level

The code block below plots a sample of the series in each cluster so that you can visualize the clustering results. Note that random_state is set for reproducibility.

# Sorted list of cluster labels
clusters_1 = principalDf.Cluster.unique().tolist()
clusters_1.sort()
clusters_1

for c in clusters_1:
    # Sample 20 series from the cluster (with replacement, in case a cluster holds fewer than 20 series)
    series_items = principalDf.loc[principalDf['Cluster'] == int(c)][SERIES_ID].sample(20,
                                                                                       random_state=RANDOM_STATE,
                                                                                       replace=True).tolist()
    pdf = df[df[SERIES_ID].isin(series_items)]
    pdf = pdf[[TARGET, DATE_COL, SERIES_ID]]

    # Plot the sampled series over time, one line per series
    pdf.sort_values(by=[DATE_COL, SERIES_ID], ascending=True, inplace=True)
    pd.pivot_table(pdf,
                   aggfunc='sum',
                   values=TARGET,
                   index=[DATE_COL],
                   columns=[SERIES_ID]).plot(figsize=(20, 8), legend=False, title="Cluster {0}".format(c))
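
With the cluster labels assigned, you can break the original dataset into smaller per-cluster datasets, as described at the start of this section. Below is a minimal sketch under the assumption that each series maps to exactly one cluster; the output file names are illustrative.

# Join the cluster labels back onto the original long-format data
df_clustered = df.merge(principalDf[[SERIES_ID, 'Cluster']], on=SERIES_ID, how='left')

# Write one file per cluster (file naming is illustrative)
for c, cluster_df in df_clustered.groupby('Cluster'):
    cluster_df.drop(columns=['Cluster']).to_csv('cluster_{}.csv'.format(c), index=False)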