
Feature Reduction with FIRE

Access this AI accelerator on GitHub

At the heart of machine learning is the "art" of providing a model with the features or variables that are useful for making good predictions. Including redundant or extraneous features can lead to overly complex models that have less predictive power. Striking the right balance is known as feature selection. This page proposes a novel method for feature selection that ensembles the feature impact scores (also known as feature importance scores in the wider industry) from several different models, which leads to a more robust and powerful result. This accelerator outlines how feature importance rank ensembling (FIRE) can be used to reduce the number of features, based on the feature impact scores DataRobot returns, while maintaining predictive performance.

During feature selection, data scientists try to keep the "three Rs" in mind:

  • Relevant: To reduce generalization risk, the features should be relevant to the business problem at hand.
  • Redundant: Avoid the use of redundant features—they weaken the interpretability of the model and its predictions.
  • Reduction: Fewer features mean less complexity, which translates to less time required for model training or inference. Using fewer features decreases the risk of overfitting and may even boost model performance.

The chart below shows an example of how feature selection is used to improve a model’s performance.

As the number of features is reduced from 501 to 13, the model’s performance improves, as indicated by a higher area under the curve (AUC). This visualization is known as a feature selection curve.

Feature selection approaches

There are three approaches to feature selection.

Filter methods select features on the basis of statistical tests. DataRobot users often do this by filtering a dataset from 10,000 features to 1,000 using the feature impact score. This score is based on the alternating conditional expectations (ACE) algorithm and conceptually shows the correlation between the target and the feature. The features are ranked and the top features are retained. One limitation of the DataRobot feature impact score is that it only accounts for the relationship between that feature in isolation and the target.
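DataRobot computes its feature impact score internally, but to illustrate the filter approach in general, the sketch below ranks features by a univariate statistic (mutual information, standing in for the ACE-based score) and keeps the top features. The dataset, feature names, and cutoff of 20 are assumptions for the example, not part of the DataRobot workflow.

```python
# Illustrative filter method: rank features by a univariate score and keep the top K.
# Mutual information is used here as a stand-in for DataRobot's ACE-based score.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=1000, n_features=100, n_informative=15, random_state=0)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(X.shape[1])])

# Score each feature against the target, rank, and keep the 20 best.
scores = pd.Series(mutual_info_classif(X, y, random_state=0), index=X.columns)
top_features = scores.sort_values(ascending=False).head(20).index.tolist()
X_filtered = X[top_features]  # the reduced dataset passed on to modeling
```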

Embedded methods are algorithms that incorporate their own feature selection process. DataRobot uses embedded methods in approaches that include ElasticNet and a proprietary machine learning algorithm, Eureqa.

Wrapper methods are model-agnostic and typically include modeling on a subset of features to help identify which are most impactful. Wrapper methods are widely used and include forward selection, backward elimination, recursive feature elimination, and more sophisticated stochastic techniques, such as random hill climbing and simulated annealing.

While wrapper methods tend to provide a more optimal feature list than filter or embedded methods, they are more time-consuming, especially on datasets with hundreds of features.

The recursive feature elimination wrapper method is widely used in machine learning to reduce the feature list. A common criterion for removing features is the feature impact score, calculated via permutation impact: features with the worst scores are removed and a new model is built. The recursive feature elimination approach was used to build the feature selection curve in the chart above.
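The sketch below shows the idea behind such an elimination loop, using scikit-learn's permutation importance as a stand-in for DataRobot's permutation-based Feature Impact. The dataset, model, and 20% elimination step are illustrative assumptions; the (feature count, score) pairs it records are what a feature selection curve plots.

```python
# Illustrative recursive feature elimination driven by permutation importance.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=50, n_informative=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
features = list(range(X.shape[1]))

curve = []  # (number of features, validation AUC) pairs -> the feature selection curve
while len(features) >= 5:
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    model.fit(X_train[:, features], y_train)
    auc = roc_auc_score(y_val, model.predict_proba(X_val[:, features])[:, 1])
    curve.append((len(features), auc))

    # Score the remaining features by permutation importance on the validation set,
    # then drop the worst-scoring 20% before refitting.
    result = permutation_importance(
        model, X_val[:, features], y_val, n_repeats=5, random_state=0
    )
    worst_first = np.argsort(result.importances_mean)
    n_drop = max(1, len(features) // 5)
    drop = set(worst_first[:n_drop])
    features = [f for i, f in enumerate(features) if i not in drop]
```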

Feature importance rank ensembling

Building on the recursive feature elimination approach, DataRobot combines the feature impact of multiple diverse models. This approach, known as model ensembling, is based on aggregating ranks of features using the Feature Impact score from several Leaderboard blueprints, as described below.

You can apply model ensembling, which provides improved accuracy and robustness, to feature selection. Selecting lists of important features from multiple models and combining them in a way that produces a more robust feature list is the foundation of feature importance rank ensembling (FIRE). While there are many ways to aggregate the results, the following general steps are recommended (a code sketch follows the list). See the accelerator for details on completing each step:

  1. Calculate feature impact for the top N Leaderboard models against the selected metric. You can calculate feature impact using permutation or SHAP impact.
  2. For each model with computed feature impact, get the feature ranking.
  3. Compute the median rank of each feature by aggregating the ranks of the features across all models.
  4. Sort the aggregated list by the computed median rank.
  5. Define the threshold number of features to select.
  6. Create a feature list based on the newly selected features.
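A minimal sketch of these steps using the DataRobot Python client is shown below. It assumes an existing project whose Autopilot run has finished; the project ID, the choice of five models, and the fixed threshold of 25 features are illustrative placeholders (the accelerator itself implements the full workflow).

```python
# Illustrative FIRE workflow with the DataRobot Python client.
import pandas as pd
import datarobot as dr

project = dr.Project.get("YOUR_PROJECT_ID")      # placeholder project ID
top_models = project.get_models()[:5]            # step 1: top N Leaderboard models

ranks = {}
for model in top_models:
    # Steps 1-2: compute (or fetch) permutation-based Feature Impact and rank it.
    impact = pd.DataFrame(model.get_or_request_feature_impact())
    impact["rank"] = impact["impactNormalized"].rank(ascending=False)
    ranks[model.id] = impact.set_index("featureName")["rank"]

# Steps 3-4: median rank of each feature across models, sorted best first.
median_rank = pd.DataFrame(ranks).median(axis=1).sort_values()

# Steps 5-6: keep the top K features and register them as a new feature list.
K = 25
selected = median_rank.head(K).index.tolist()
fire_list = project.create_featurelist(name="FIRE reduced", features=selected)
```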

To understand the effect of aggregating features, the graphic below shows the variation in feature impact across four different models trained on the readmission dataset. The aggregated feature impact is derived from these four models:

  • LightGBM
  • XGBoost
  • Elastic net linear model
  • Keras deep learning model

As indicated by their high Normalized Impact score, the features at the top have consistently performed well across many models. The features at the bottom consistently have little signal (they perform poorly across many models). Some features with wide ranges, like num_lab_procedures and diag_x_desc, performed well in some models, but not in others.

This variation is due to multicollinearity and to the differing strengths of model types: linear models are good at finding linear relationships, while tree-based models are good at finding nonlinear relationships. Ensembling the feature impact scores helps identify which features are most important across each model's view of the dataset. By iterating with FIRE, you can continue to reduce the feature list and build a feature selection curve. FIRE works best when you use models with good performance, to ensure that the feature impact is useful.

Results

The example below shows results on several wide datasets: four internal datasets used for illustrative purposes and two publicly available datasets, Madelon and KDD 1998. Use the AI accelerator linked at the top of this page to try this.

  • AE: internal dataset, 374 features, 25,000 rows; regression; DataRobot recommended metric: Gamma deviance.

  • AF: internal dataset, 203 features, 400,000 rows; regression; DataRobot recommended metric: RMSE.

  • G: internal dataset, 478 features, 2,500 rows; binary classification; DataRobot recommended metric: LogLoss.

  • IH: internal dataset, 283 features, 200,000 rows; binary classification; DataRobot recommended metric: LogLoss.

  • KDD 1998: publicly available dataset, 477 features, 76,000 rows; regression; DataRobot recommended metric: Tweedie deviance.

  • Madelon: publicly available dataset, 501 features, 2,000 rows; binary classification; DataRobot recommended metric: LogLoss.

The example uses Autopilot to build competing models and reports the score of the best-performing model; the metrics were selected based on the problem type and the distribution of the target. It then uses FIRE to develop new feature lists. The results show the performance of each feature list on the best-performing model, along with the standard deviation from 10-fold cross-validation.

For feature lists, Informative Features is the default list that includes all features that pass a "reasonableness" check. DR Reduced Features is a one-time reduced feature list based on permutation impact. FIRE aggregates feature impact by median rank and applies an adaptive threshold (the N features that account for 95% of total feature impact).
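As an illustration of the adaptive threshold, the sketch below keeps the smallest set of features whose aggregated normalized impact accounts for 95% of the total. The impact values shown are made up for the example; in practice they come from the Feature Impact of several Leaderboard models, as in the sketch earlier on this page.

```python
import pandas as pd

# Illustrative aggregated impact scores (feature -> mean normalized impact across models).
agg_impact = pd.Series({
    "number_inpatient": 1.00, "num_lab_procedures": 0.42, "num_medications": 0.31,
    "diag_1_desc": 0.18, "age": 0.07, "weight": 0.02,
}).sort_values(ascending=False)

share = agg_impact / agg_impact.sum()       # each feature's share of total impact
cumulative = share.cumsum()
# Keep features up to and including the one that pushes the cumulative share past 95%.
n_keep = int((cumulative < 0.95).sum()) + 1
selected = agg_impact.index[:n_keep].tolist()
```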

The table below shows feature selection results on six wide datasets. The bold formatting in the result rows indicates the best-performing result in terms of the mean cross-validation score (lower values indicate a better score).

The chart below compares the performance of feature selection methods.

Across all of these datasets, FIRE consistently performed as well as or better than using all features. It also outperformed the DR Reduced Features method, reducing the feature set further without any loss in accuracy. For the Madelon dataset, the feature selection curve in the graphic at the top of the page shows how reducing the number of features improved performance: as FIRE pared the features down from 501 to 18, the model’s performance improved.

Note that if you use FIRE, you must build more models during model training so that you have feature impact across diverse models for ensembling. The information from these other models is very useful for feature selection.

Conclusion

Improved accuracy and parsimony are possible when you use the FIRE method for feature selection. There are many variations left to validate: feature impact (SHAP, permutation), choice of models, perturbing the data or model, and the method for aggregating (median rank, unnormalized impact).


Updated December 3, 2024