SHAP Distributions: Per Feature¶
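Tab | Description |
---|---|
Explanations | Estimates how much each feature contributes to a given prediction differing from the average, which helps you understand, row by row, what drives a prediction. |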
Two insights are available to provide alternative visualizations of this impact:
Insight | Description |
---|---|
SHAP Distributions: Per Feature (this page) | Shows the distribution and density of scores per feature using a violin plot for the visualization. |
SHAP-based Individual Prediction Explanations | Shows the effect of each feature on the prediction on a row-by-row basis. |
See the deep dive for more information on methodology and interpretability. For more details about working with SHAP in DataRobot, see the related considerations and SHAP reference.
Visualization overview¶
SHAP Distributions: Per Feature, sometimes called a violin plot, is a statistical graphic for comparing the probability distributions of a dataset across different categories. Violin plots show the smooth, continuous shape of your data. The vertical axis represents the features in the dataset while the horizontal axis represents the SHAP score. As a result, the plotted density shapes represent the distribution, based on a sampling of up to 1,000 rows, of the data at different values. The width of the "violin" at a given value shows how many data points fall around that value: do they cluster, peak, or spread evenly? See the distribution specifics for examples.
The visualization shows the top 10 features, based on the sort order, with an option to Load more features.
Insight filters¶
Use the insight controls to modify the prediction distribution chart.
Option | Description |
---|---|
Data slice | Select or create (by selecting Create slice) a data slice to see how a specific cohort impacts the prediction outcome. |
Compare to All Data | Show the feature's impact for a subpopulation (slice). Results are overlaid with the impact when the full population is considered. |
Sort by | Set the sort method (impact or alphabetical by name) and the sort order. The default is descending impact, so the most impactful features appear first. |
Auto-scale Y-axis | Scale feature distributions relative to the feature with the most rows present. |
Show outliers | Adjust the scale to include outlier values. |
Search | Adjust the display to show only features matching the search string. |
Export | Download the data, image, or both for the visualization. |
Data slices¶
Data slices are a way to view a subpopulation of a model's data based on feature value. When a slice is selected, the order of features changes to reflect the sort applied to the new population. For example:
Without a slice, the plot shows:
Showing males over the age of 70:
Compare to All Data¶
When one or more slices are configured, enable the Compare to All Data toggle. This allows you to compare a selected slice against the full dataset, helping to visualize how the distribution differs (or doesn't) when compared to unsliced data.
When toggled on, the transparent violin with a white outline represents the unsliced data, and the colored violin within represents the selected slice.
If a slice matches All data very closely, the white outline of the violin appears transparent because it closely matches the sliced subset.
Auto-scale Y-axis¶
Use the toggle to normalize vertical scaling. The setting controls whether the insight shows all distribution detail for each violin or shows a distribution that scales proportionally to the feature with the least distributed (most consistent) value count. At a glance, you can see the shape of each distribution: tall, short, wide, or narrow. This option provides a rough idea of how impactful the distribution is. The more spread out a violin, the more dispersed the values. For example:
See more about auto-scaling below.
Show outliers¶
Where auto-scaling applies to the vertical axis (distribution height), outliers apply to the horizontal axis. Outliers are values that are relatively far from the main violin; they are calculated after binning. For example:
See more about calculating outliers below.
Deep dive: Interpret SHAP distributions¶
Each violin shown is based on up to 1,000 rows of SHAP values. If there are fewer than 1,000 rows, either in the entire dataset or in a sliced subset of the dataset, the violin is based on fewer rows. For each feature, DataRobot then divides the data points (up to 1,000) into uniform bins. In general, use the insight to compare shapes, which represent the distributions. Assessing whether a violin forms clusters (cohorts) or is uniform tells you whether a feature has groups of values that are strongly, positively, or barely impactful.
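The sampling and binning described above can be sketched in a few lines of Python. This is only an illustration of the idea, not DataRobot's implementation: the function name `bin_shap_values`, the bin count `n_bins`, and the use of simple random sampling are all assumptions, since the source states only that up to 1,000 rows are placed into uniform bins.

```python
import random


def bin_shap_values(values, n_bins=20, max_rows=1000, seed=0):
    """Sample up to max_rows SHAP values and count them in uniform-width bins.

    n_bins and the random sampling strategy are assumptions for illustration.
    """
    if not values:
        raise ValueError("values must not be empty")
    rng = random.Random(seed)
    # Use every row when there are fewer than max_rows; otherwise sample.
    sample = list(values) if len(values) <= max_rows else rng.sample(list(values), max_rows)
    lo, hi = min(sample), max(sample)
    # Fall back to a width of 1.0 when all values are identical.
    width = (hi - lo) / n_bins or 1.0
    counts = [0] * n_bins
    for v in sample:
        # Clamp so that the maximum value lands in the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts
```

The resulting per-bin counts are what determine the thickness of a violin at each point along the horizontal (SHAP score) axis.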
The SHAP per-feature visualization uses color to represent the density and distribution of features that have impact on the predictions:
- Numeric and binary (continuous) features are plotted on a color spectrum of purple (low frequency) to yellow (high frequency), indicating where higher and lower feature values lie.
- Categorical (discrete) features are shown as gray, with violins embedded within each violin, representing the different categories.
- All other values are shown as blue.
The horizontal axis represents the effect of features on prediction outcomes, with 0 representing no effect. A large cohort of the violin on the left side of zero (i.e., less than zero) means the feature subtracts from the prediction outcome; cohorts on the right of zero add to the prediction outcome. Features often fall on both sides because of the values of individual rows. See below for further help interpreting the visualization.
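To make the sign convention concrete, the sketch below relies on the additive property of SHAP values: for a single row, the average (base) prediction plus the sum of that row's per-feature SHAP values gives the row's prediction, in whatever output space the SHAP values were computed. The function names and all numbers are invented for illustration, and `age` is a hypothetical feature name.

```python
def reconstruct_prediction(base_value, shap_values):
    """Base (average) prediction plus per-feature SHAP contributions for one row."""
    return base_value + sum(shap_values.values())


def split_by_direction(shap_values):
    """Separate features that add to the prediction from those that subtract."""
    adds = {name: v for name, v in shap_values.items() if v > 0}
    subtracts = {name: v for name, v in shap_values.items() if v < 0}
    return adds, subtracts


# Invented values for a single row; not taken from any real model.
row_shap = {"num_lab_procedures": 0.12, "age": -0.05, "admission_type_id": 0.03}
prediction = reconstruct_prediction(0.30, row_shap)
adds, subtracts = split_by_direction(row_shap)
```

In the plot, each row in `adds` for a feature contributes a point to the right of zero in that feature's violin, and each row in `subtracts` contributes a point to the left.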
Visualize feature distribution¶
In the example below, for the continuous feature num_lab_procedures, you can see that the majority of the cohort is on the right side, meaning it adds to the prediction outcome.
- The more purple seen in the full violin, the lower the number of lab procedures; the more yellow, the higher the number of lab procedures.
- While it is not all purple or all yellow on either side of zero, the coloration indicates that a higher number of lab procedures tends to add to the prediction value. Hover on the feature to understand its individual distribution (up to 1,000 rows) in the context of the displayed data (all data or a selected slice).
Hover on a discrete feature (gray on the full plot) to see up to the top seven classes in the full distribution. All others are represented as gray. For example, from admission_type_id for patients under 40, you can see that the value emergency seems to have little impact on the outcome, whereas urgent more strongly influences prediction outcomes.
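The "top seven classes, everything else gray" grouping can be approximated as follows. The source does not say how classes are ranked, so this sketch assumes ranking by row count; the function name `top_classes` and the `"(other)"` label are hypothetical.

```python
from collections import Counter


def top_classes(categories, top_n=7, other_label="(other)"):
    """Keep the top_n most frequent classes and relabel the rest.

    Ranking by frequency is an assumption; the source does not specify it.
    """
    counts = Counter(categories)
    keep = {name for name, _ in counts.most_common(top_n)}
    return [c if c in keep else other_label for c in categories]
```

For example, calling `top_classes(values, top_n=7)` on a column with a dozen distinct admission types would leave the seven most frequent types unchanged and fold the remainder into a single group.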
Auto-scaling¶
DataRobot bins the same number of rows (up to 1,000) to plot a violin, and each violin row uses the same vertical pixel height. The maximum height on the plot reflects the bin containing the most rows, i.e., it is the "tallest peak".
The distribution of rows differs for each violin row, sometimes by quite large amounts. One row may have the majority of its values in a (horizontally) narrow span, which in turn results in a tall peak in the height of the distribution. Another may have rows distributed more broadly along the horizontal axis, which in turn means a smaller peak in the height of the distribution.
The maximum height is dynamically computed from the features present in the visualization at any given time. If you change the display—with slices, search, loading more features, for example—that changes the maximum height and other violins will be rescaled accordingly.
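The rescaling behavior can be illustrated with a small sketch. It contrasts two strategies: scaling every violin against the tallest bin among the features currently displayed, and scaling each violin against its own tallest bin. Which of these corresponds to which toggle state is not asserted here; the function name `scale_heights`, the `shared_scale` flag, and the 40-pixel row height are assumptions.

```python
def scale_heights(bin_counts_by_feature, row_height_px=40.0, shared_scale=True):
    """Convert per-bin row counts into pixel heights for each violin.

    shared_scale=True  -> all violins are scaled against the tallest bin on display.
    shared_scale=False -> each violin is scaled against its own tallest bin.
    """
    global_peak = max(
        (max(counts) for counts in bin_counts_by_feature.values()), default=0
    ) or 1  # avoid division by zero when every bin is empty
    scaled = {}
    for name, counts in bin_counts_by_feature.items():
        peak = global_peak if shared_scale else (max(counts) or 1)
        scaled[name] = [row_height_px * c / peak for c in counts]
    return scaled


# Hypothetical bin counts for two displayed features.
displayed = {"narrow_feature": [10, 80, 10], "broad_feature": [30, 40, 30]}
shared = scale_heights(displayed)
```

With a shared scale, removing `narrow_feature` from the display (for example, through search) changes the global peak from 80 to 40, so `broad_feature` is redrawn at full row height. This mirrors how changing the displayed features rescales the other violins.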
Calculating outliers¶
A bin is considered an outlier if both of the following conditions are met:
- The bin contains less than 2% of the membership of the maximum bin.
- The bin is at least one empty bin away from the nearest bin.
Outliers are always marked with a circle of the same size regardless of the number of outliers in the outlier bin. That is, there is no distinction between a bin with 0.3% membership and a bin with 0.7% membership.
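The two conditions can be expressed as a short check over the binned counts. This is one reading of the rule, not a confirmed implementation: "at least one empty bin away from the nearest bin" is interpreted as the bin having no non-empty immediate neighbor, and the function name `find_outlier_bins` is hypothetical.

```python
def find_outlier_bins(counts, threshold=0.02):
    """Return indices of bins that satisfy both outlier conditions.

    Assumed interpretation:
      1. The bin is non-empty and holds less than `threshold` of the largest bin's count.
      2. Its immediate neighbors (where they exist) are empty.
    """
    peak = max(counts, default=0)
    if peak == 0:
        return []
    outliers = []
    for i, c in enumerate(counts):
        if c == 0 or c >= threshold * peak:
            continue
        left_empty = i == 0 or counts[i - 1] == 0
        right_empty = i == len(counts) - 1 or counts[i + 1] == 0
        if left_empty and right_empty:
            outliers.append(i)
    return outliers


# Hypothetical bin counts: small isolated bins at both ends of a main violin.
example_counts = [3, 0, 0, 200, 500, 300, 5, 0, 4]
outlier_bins = find_outlier_bins(example_counts)
```

In this example, the bin holding 5 rows is small enough but sits directly next to the main violin, so it is not flagged; the isolated bins at either end are. Because only the bin index is returned, every outlier bin would be drawn with the same marker regardless of how many rows it holds.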