PCA and K-Means clustering

Robot 1

What is the impact of principal component analysis (PCA) on K-Means clustering?

Hi team, a customer is asking how exactly the PCA > K-Means step is used during modeling. I see that we create a CLUSTER_ID feature in the transformed dataset, and I assume it comes from K-Means. My question is: if we are creating this feature, why aren't we tracking it in, for example, feature impact?

Robot 2

Feature impact operates at the level of the original dataset features, not derived features. For example, if a categorical feature CAT1 is one-hot encoded, we still calculate feature impact for CAT1 as a whole, not for CAT1-Value1, CAT1-Value2, and so on.

Permuting an original feature also changes the K-Means results that are computed from it, so if the cluster features matter to the model, their impact is attributed to the original columns.
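
To make the point above concrete, here is a minimal sketch using scikit-learn rather than DataRobot itself. The pipeline, the synthetic data, and the choice of distance-to-centroid cluster features are all assumptions made for illustration; they are not the actual blueprint. The model sees the raw columns plus the derived cluster features, but permutation importance is computed by shuffling raw columns only, so it returns one score per original feature.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline, make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Synthetic stand-in for a customer dataset (assumption): 6 raw features.
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=4, random_state=0
)

# Hypothetical PCA > K-Means branch. KMeans.transform returns the distance
# to each of the 4 cluster centers, so this branch adds 4 derived features.
cluster_branch = make_pipeline(
    PCA(n_components=3),
    KMeans(n_clusters=4, n_init=10, random_state=0),
)

model = Pipeline([
    ("scale", StandardScaler()),
    ("features", FeatureUnion([
        ("raw", FunctionTransformer()),   # identity: passes raw columns through
        ("clusters", cluster_branch),     # derived cluster features
    ])),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X, y)

# Shuffling a raw column changes both the raw input and the cluster features
# derived from it, so the cluster contribution is folded into the raw score.
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
importances = result.importances_mean

# The classifier was trained on 6 raw + 4 derived features...
n_model_inputs = model.named_steps["clf"].coef_.shape[1]
# ...but importance is reported for the 6 raw columns only.
for i, score in enumerate(importances):
    print(f"raw feature {i}: {score:.4f}")
```

In this sketch there is no separate score for the cluster features: whatever they contribute shows up in the scores of the raw columns they were computed from.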

Robot 3

Some blueprints use the one-hot-encoded cluster ID as features, and other blueprints use the cluster probabilities as features.

If you want to assess the impact of the K-Means step on the model's outcome, delete the K-Means branch in Composable ML and use the Leaderboard to compare how the model changed.
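
The same ablation idea can be sketched outside the platform. The example below is a scikit-learn analogy, not DataRobot code: `build_model` is a hypothetical helper, the data is synthetic, and cross-validated accuracy stands in for the Leaderboard metric. It fits one pipeline with the PCA > K-Means branch and one without, then compares the scores.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import FeatureUnion, Pipeline, make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Synthetic stand-in for a customer dataset (assumption).
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=4, random_state=0
)


def build_model(use_kmeans: bool) -> Pipeline:
    """Hypothetical helper: build the pipeline with or without the cluster branch."""
    branches = [("raw", FunctionTransformer())]  # identity pass-through
    if use_kmeans:
        branches.append((
            "clusters",
            make_pipeline(
                PCA(n_components=3),
                KMeans(n_clusters=4, n_init=10, random_state=0),
            ),
        ))
    return Pipeline([
        ("scale", StandardScaler()),
        ("features", FeatureUnion(branches)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])


# Mean 5-fold accuracy for each variant; the difference is the estimated
# contribution of the K-Means branch on this dataset.
score_with = cross_val_score(build_model(True), X, y, cv=5).mean()
score_without = cross_val_score(build_model(False), X, y, cv=5).mean()
delta = score_with - score_without

print(f"with K-Means branch:    {score_with:.4f}")
print(f"without K-Means branch: {score_without:.4f}")
print(f"difference:             {delta:+.4f}")
```

A difference near zero would suggest the cluster features add little for that model; the size and sign depend on the data and the model.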

As Robot 2 says, feature impact operates on the RAW data and is inclusive of both the preprocessing AND the modeling.

Updated February 20, 2023