Defining redundant features
What makes a feature redundant?
The docs say:
If two features change predictions in a similar way, DataRobot recognizes them as correlated and identifies the feature with lower feature impact as redundant.
How do we quantify or measure "similar way"?
If two features are highly correlated, their prediction differences (prediction before feature shuffle minus prediction after feature shuffle) should also be correlated, so the prediction difference can be used to evaluate pairwise feature correlation. When two features turn out to be highly correlated in this sense, the one with lower feature impact is identified as the redundant feature.
Do we consider two features redundant when their prediction differences are the same/between
We look at the correlation coefficient between the prediction differences, and if it's above a certain threshold, we call the less important feature (according to the model's feature impact) redundant.
Calculate prediction difference before and after feature shuffle:
(pred_diff[i] = pred_before[i] - pred_after[i])
Calculate pairwise feature correlation (top 50 features, according to model feature impact) based on the prediction differences.
Identify redundant features (high correlation based on our threshold), then test that removal does not affect accuracy significantly.
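The steps above can be sketched roughly as follows. This is an illustration only, not DataRobot's implementation: the function names, the 0.85 threshold, the use of Pearson correlation, and the choice to shuffle every feature with the same row permutation are all assumptions made for the example. The final step (checking that removal does not hurt accuracy) is left out.

```python
import numpy as np

def prediction_differences(predict, X, rng):
    """Return an (n_rows, n_features) array of pred_before - pred_after.

    `predict` is any callable mapping a 2-D array to 1-D predictions.
    Assumption: the same row permutation is reused for every feature.
    """
    pred_before = predict(X)
    perm = rng.permutation(X.shape[0])
    diffs = np.empty(X.shape, dtype=float)
    for j in range(X.shape[1]):
        X_shuffled = X.copy()
        X_shuffled[:, j] = X[perm, j]  # shuffle one feature at a time
        diffs[:, j] = pred_before - predict(X_shuffled)
    return diffs

def find_redundant(diffs, impact, threshold=0.85, top_k=50):
    """Flag the lower-impact feature of each highly correlated pair.

    `threshold` is a hypothetical value; the source does not state one.
    """
    impact = np.asarray(impact)
    # Keep only the top_k features, ordered by descending impact.
    top = np.argsort(impact)[::-1][:top_k]
    if len(top) < 2:
        return []
    corr = np.corrcoef(diffs[:, top], rowvar=False)
    redundant = set()
    for i in range(len(top)):
        for j in range(i + 1, len(top)):
            if corr[i, j] > threshold:
                # top is sorted by impact, so top[j] is the weaker feature.
                redundant.add(int(top[j]))
    return sorted(redundant)
```

Here `impact` can be any per-feature importance score, such as the model's feature impact values. The function returns the column indices flagged as redundant.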
Thank you, Robot 2! Super helpful.