

Feature Impact

Availability information

Support for the new Feature Impact in Workbench is on by default.

Feature flag: Universal SHAP in NextGen

Feature Impact, available for all model types, provides a high-level visualization that identifies which features are most strongly driving model decisions. It helps answer questions such as:

  • Which features are the most important—is it demographic data, transaction data, or something else driving model results? Does it align with the knowledge of industry experts? By understanding which features are important to model outcomes, you can more easily validate if the model complies with business rules.

  • Are there opportunities to improve the model? For example, there may be features with negative impact scores. Dropping them by creating a new feature list might increase model accuracy and speed. Some features may have unexpectedly low importance, which may be worth investigating. Is there a problem in the data? Were data types defined incorrectly?

Note

Feature Impact differs from the feature importance measure shown in the Data page. The green bars displayed in the Importance column of the Data page are a measure of how much a feature, by itself, is correlated with the target variable. By contrast, Feature Impact measures how important a feature is in the context of a model.

Use the controls in the insight to change the display:

Option Description
Data slice Select, or create (by selecting Manage slices), a data slice that allows you to view a subpopulation of a model's data based on feature value.
Compute method Choose the compute method that is the basis of the insight, either SHAP or permutation. This is an on-demand feature for all but the recommended model, which computes permutation impact by default.
Sort by Set the sort method—either by impact (importance) or alphabetically by name—and the sort order. The default is sorting by decreasing impact, that is, most impactful features first.
Use quick-compute Control the sample size used in the chart.
Search Update the chart to include only those features matching the search string.
Actions dropdown Either:
  • Export a CSV containing each feature and its relative importance, a PNG of the chart, or a ZIP file containing both.
  • Create a feature list from the top-ranked features.
Load more features Expands the chart to display all features used in the experiment, loading 25 features with each click. By default, the chart shows the 25 highest-impact features. Leaving the insight returns the display to the top 25.

Select a data slice

Sliced insights provide the option to view a subpopulation of a model's data based on feature values—either raw or derived.

Use the segment-based accuracy information gleaned from sliced insights, or compare segments to the "global" slice (all data), to improve training data. Initially, each feature shows a blue bar indicating its importance to the target, calculated on all the data used to train the model. If you select or create a new slice, you must first recompute the insight to reflect only the values from the identified subpopulation. The chart then updates to show the same top 25 features (or more, if loaded), with the blue bar now representing the subpopulation's importance to the target and a yellow marker showing the value in the context of all the data for comparison. Hover on a feature for more detail.

Slices are, in effect, a filter for categorical, numeric, or both types of features. See the full documentation on creating, comparing, and using data slices.
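Conceptually, a slice is just a set of conditions on feature values. A minimal sketch in plain Python (the feature names and values here are hypothetical, for illustration only):

```python
# Toy rows standing in for a model's data; "state" (categorical) and
# "age" (numeric) are made-up features, not from any particular dataset.
rows = [
    {"state": "MA", "age": 34},
    {"state": "NY", "age": 51},
    {"state": "MA", "age": 67},
]

# A slice combining one categorical and one numeric condition:
subpopulation = [r for r in rows if r["state"] == "MA" and r["age"] < 50]
# subpopulation is [{"state": "MA", "age": 34}]
```

Recomputing the insight against `subpopulation` instead of `rows` is what produces the sliced view described above.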

Select a compute method

You can select either SHAP or permutation impact as the computation methodology. By default, DataRobot calculates permutation impact for the recommended model. To see SHAP results for the recommended model, or results from either method for any other model, you must compute the insight on demand.

  • SHAP-based shows how much, on average, each feature affects training data prediction values. For supervised projects, SHAP is available for AutoML projects only. See also the SHAP reference and SHAP considerations.

  • Permutation-based shows how much the error of a model would increase, based on a sample of the training data, if values in the column are shuffled.

Some notable characteristics of the methodologies:

  • SHAP- and permutation-based impact both offer a model-agnostic approach that works for all modeling techniques.

  • SHAP Feature Impact is faster and more robust on a smaller sample size than permutation-based Feature Impact.

Quick-compute

When working with Feature Impact, the Use quick-compute option controls the sample size used in the visualization. The row count used to build the visualization is based on the toggle setting and whether a data slice is applied.

For unsliced Feature Impact, when toggled:

  • On: DataRobot uses 2500 rows or the number of rows in the model training sample size, whichever is smaller.

  • Off: DataRobot uses 100,000 rows or the number of rows in the model training sample size, whichever is smaller.

When a data slice is applied, when toggled:

  • On: DataRobot uses 2500 rows or the number of rows available after a slice is applied, whichever is smaller.

  • Off: DataRobot uses 100,000 rows or the number of rows available after a slice is applied, whichever is smaller.

You may want to toggle this option off, for example, to compute Feature Impact at a sample size higher than the default 2500 rows (or fewer, if the data is downsampled) in order to get more accurate and stable results.
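The row counts above reduce to a simple rule. A sketch as a small helper (illustrative only; the function name and signature are not part of any DataRobot API):

```python
# Caps taken from the documentation: 2,500 rows with quick-compute on,
# 100,000 rows with it off. This is an illustration, not DataRobot code.
QUICK_COMPUTE_CAP = 2_500
FULL_COMPUTE_CAP = 100_000

def feature_impact_sample_size(available_rows: int, quick_compute: bool = True) -> int:
    """Rows used for Feature Impact: the cap, or the rows available
    (the model training sample, or the rows remaining after a data
    slice is applied), whichever is smaller."""
    cap = QUICK_COMPUTE_CAP if quick_compute else FULL_COMPUTE_CAP
    return min(cap, available_rows)
```

For example, with quick-compute on and a 10,000-row training sample, 2,500 rows are used; with it off, all 10,000 are used.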

Note

When you run Feature Effects before Feature Impact, DataRobot initiates the Feature Impact calculation first. In that case, the quick-compute option is available on the Feature Effects screen and sets the basis of the Feature Impact calculation.

Create a feature list

You can export Feature Impact data or create a feature list based on the relative impact of features. To create a feature list, choose + Create impact-based feature lists from the Actions dropdown.

  1. In the Select features for new list modal, select the number of features to include in the new list and click Next.

  2. Use the Show features from dropdown to change the displayed features that are available for selection. The default display lists features from the Raw Features list. All automatically generated and custom lists are available from the dropdown.

  3. Select the box next to each feature you want to include.

  4. Optionally, use the search field to update the display to show only those features, within the Show features from selection, that match the search string.

  5. Save the list.

Bulk feature list actions

To add multiple features at a time, choose a method from the Bulk selection dropdown:

Use Select by variable type to create a list containing all features from the dataset that are of the selected variable type. While you can only select one variable type, you can individually add other features (of any type) after selection.

Use Select by existing feature list to add all features in the chosen list.

Note that the bulk actions are secondary to the Show features from dropdown. For example, showing features from "Top5" lists the five features added in your custom list. If you then use Select by existing feature list > Informative features, all features in "Top5" that are also in "Informative Features" are selected. Conversely, if you show informative features and select by "Top5" feature list, those five features are selected.

Use Select N most important to add the specified number of "most important" features from the features available in the list selected in the Show features from dropdown. The importance score indicates the degree to which a feature is correlated with the target—representing a measure of predictive power if you were to use only that variable to predict the target.

Save feature list

Once all features for the list are selected, optionally rename the list and provide a description in the Feature list summary. The summary also provides the count and types of features included in the list.

Then, click Create feature list to save the information. The new list will display in the listing on the Feature lists tab.

Feature Impact deep dive

Feature Impact is an on-demand feature, meaning that you must initiate a calculation for each model to see the results. The exception is the "recommended for deployment" model, for which permutation-based results are calculated as part of the model recommendation process. Impact is calculated using training data, sorted from most to least important by default, and the impact of the most important feature is always normalized to 1.

Method calculations

This section contains technical details on computation for each of the two available methodologies:

  • Permutation-based Feature Impact
  • SHAP-based Feature Impact

Permutation-based Feature Impact

Permutation-based Feature Impact measures a drop in model accuracy when feature values are shuffled. To compute values, DataRobot:

  1. Makes predictions on a sample of training records—2500 rows by default, maximum 100,000 rows.
  2. Alters the training data (shuffles values in a column).
  3. Makes predictions on the new (shuffled) training data and computes a drop in accuracy that resulted from shuffling.
  4. Computes the average drop.
  5. Repeats steps 2-4 for each feature.
  6. Normalizes the results (i.e., the top feature has an impact of 100%).
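The steps above can be sketched in plain Python. This is an illustration of the idea, not DataRobot's implementation; `permutation_impact` and its parameters are hypothetical names:

```python
import random

def permutation_impact(rows, target, predict, metric, seed=0):
    """Shuffle one column at a time, measure the resulting drop in the
    metric, and normalize so the top feature has an impact of 1.0 (100%)."""
    rng = random.Random(seed)
    baseline = metric(target, [predict(r) for r in rows])       # step 1
    drops = {}
    for col in rows[0]:                                         # step 5: each feature
        shuffled = [r[col] for r in rows]
        rng.shuffle(shuffled)                                   # step 2: shuffle column
        perturbed = [dict(r, **{col: v}) for r, v in zip(rows, shuffled)]
        drops[col] = baseline - metric(                         # steps 3-4: drop in accuracy
            target, [predict(r) for r in perturbed])
    top = max(drops.values()) or 1.0
    return {col: d / top for col, d in drops.items()}           # step 6: normalize
```

A feature the model ignores produces a drop of zero, while the most impactful feature normalizes to 1.0, matching the normalized results described above.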

The sampling process corresponds to one of the following criteria:

  • For balanced data, random sampling is used.
  • For imbalanced binary data, smart downsampling is used; DataRobot attempts to make the distribution for imbalanced binary targets closer to 50/50 and adjusts the sample weights used for scoring.
  • For zero-inflated regression data, smart downsampling is used; DataRobot groups the non-zero elements into the minority class.
  • For imbalanced multiclass data, random sampling is used.
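For the imbalanced binary case, the idea behind smart downsampling can be sketched as follows (illustrative only; `smart_downsample` is a hypothetical helper, not DataRobot code):

```python
import random

def smart_downsample(labels, seed=0):
    """Keep every minority-class row, sample the majority class down to
    the same count (moving the distribution toward 50/50), and weight
    the kept majority rows so the original class balance is preserved
    when scoring."""
    rng = random.Random(seed)
    ones = [i for i, y in enumerate(labels) if y == 1]
    zeros = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (ones, zeros) if len(ones) <= len(zeros) else (zeros, ones)
    kept = rng.sample(majority, len(minority))
    weights = {i: 1.0 for i in minority}                 # minority rows keep weight 1
    weights.update({i: len(majority) / len(kept) for i in kept})
    return sorted(weights), weights                      # kept row indices and weights
```

For example, with 8 negative and 2 positive rows, the sketch keeps both positives and 2 negatives, giving each kept negative a weight of 4 so the weighted totals still reflect the original 8:2 balance.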

SHAP-based Feature Impact

SHAP-based Feature Impact measures how much, on average, each feature affects training data prediction values. To compute values, DataRobot:

  1. Takes a sample of records from the training data (5000 rows by default, with a maximum of 100,000 rows).
  2. Computes SHAP values for each record in the sample, generating the local importance of each feature in each record.
  3. Computes global importance by taking the average of abs(SHAP values) for each feature in the sample.
  4. Normalizes the results (i.e., the top feature has an impact of 100%).
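Steps 3 and 4 amount to averaging absolute local SHAP values per feature and rescaling. A minimal sketch, assuming the per-record SHAP values (a rows × features matrix, e.g. from a SHAP explainer) are already computed; the function name is hypothetical:

```python
def shap_global_importance(shap_values):
    """Average abs(SHAP value) per feature across the sample (step 3),
    then normalize so the top feature has an impact of 1.0 (step 4)."""
    n_rows = len(shap_values)
    means = [sum(abs(row[j]) for row in shap_values) / n_rows
             for j in range(len(shap_values[0]))]  # mean |SHAP| per feature
    top = max(means) or 1.0
    return [m / top for m in means]
```

Taking absolute values first means features that push predictions strongly in either direction rank as important, even if their positive and negative contributions would otherwise cancel out.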

Feature Impact considerations

Consider the following when evaluating Feature Impact:

  • Feature Impact is calculated using a sample of the model's training data. Because sample size can affect results, you may want to recompute the values on a larger sample size.

  • Occasionally, due to random noise in the data, some features may have negative impact scores. In extremely unbalanced data, these scores may be strongly negative. Consider removing such features.

  • The choice of project metric can have a significant effect on permutation-based Feature Impact results. Some metrics, such as AUC, are less sensitive to small changes in model output and, therefore, may be less optimal for assessing how changing features affect model accuracy.

  • Under some conditions, Feature Impact results can vary due to the function of the algorithm used for modeling. This could happen, for example, in the case of multicollinearity. In this case, for algorithms using L1 penalty—such as some linear models—the impact will be concentrated on one signal only, while for trees, the impact will be spread uniformly over the correlated signals.


Updated July 8, 2024