Generate SHAP-based Prediction Explanations¶
One of the most useful features of DataRobot is the ability to generate specific Prediction Explanations for any prediction. An AI model doesn’t need to be a black box; DataRobot’s Prediction Explanations allow you to observe why a certain prediction is being made.
This notebook builds a home value scorecard that informs homeowners of the most valuable contributors to, and biggest detractors from, their home's sale price. DataRobot provides a native way to generate Prediction Explanations that is well suited to use cases like this one.
DataRobot supports two techniques to generate Prediction Explanations. The default is XEMP, a technique developed by DataRobot as an improvement on the academic LIME (Local Interpretable Model-agnostic Explanations) framework; it is the default algorithm because it can be run on all project types, including multiclass and unsupervised models.
The other technique is SHAP (SHapley Additive exPlanations). SHAP values are the average marginal contribution a feature makes to the overall prediction. Since SHAP can directly estimate the contribution of every input feature, it is ideal for helping you understand and make recommendations using the model. (This concept is explored throughout this notebook.) SHAP-based Prediction Explanations have three clear advantages:
- They are faster to calculate than XEMP.
- SHAP values are additive: in most cases, the feature strengths (plus a base value) sum to the predicted value, so you can see exactly how much each feature contributes (see the sketch after this list).
- The SHAP algorithm is entirely open-source, which may be preferred by compliance or audit teams.
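To see that additivity concretely, below is a minimal sketch using the open-source `shap` package with a scikit-learn model. This is an illustration only; the synthetic dataset, model, and variable names are placeholders, not DataRobot output.
import numpy as np
import shap
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Fit a small model on synthetic data
X, y = make_regression(n_samples=200, n_features=5, random_state=0)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# TreeSHAP: per-feature contributions for the first row
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:1])

# Additivity: the base (expected) value plus the sum of contributions equals the prediction
base_value = float(np.ravel(explainer.expected_value)[0])
print(np.isclose(base_value + shap_values.sum(), model.predict(X[:1])[0]))  # True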
The following sections outline how you can use SHAP values in a DataRobot project.
Import libraries and data¶
This notebook uses the Housing Prices dataset from OpenML. It's based on assessor records in Ames, IA, and has 79 features describing the homes, with SalePrice as the target.
from IPython.display import display, Markdown
import altair as alt
import datarobot as dr
import pandas as pd
from sklearn.datasets import fetch_openml
%load_ext lab_black
# If the config file is not in the default location described in the API Quickstart guide, '~/.config/datarobot/drconfig.yaml', then you will need to call
# dr.Client(config_path='path-to-drconfig.yaml')
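# Alternatively, you can connect by passing credentials to dr.Client directly; the endpoint URL
# and token below are placeholders:
# dr.Client(endpoint="https://app.datarobot.com/api/v2", token="<YOUR_API_TOKEN>")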
# Obtain the raw data
raw_data = fetch_openml(data_id="42563")
raw_data["data"].describe()
 | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | BsmtFinSF2 | ... | GarageArea | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 1460.000000 | 1201.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1452.000000 | 1460.000000 | 1460.000000 | ... | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 |
mean | 56.897260 | 70.049958 | 10516.828082 | 6.099315 | 5.575342 | 1971.267808 | 1984.865753 | 103.685262 | 443.639726 | 46.549315 | ... | 472.980137 | 94.244521 | 46.660274 | 21.954110 | 3.409589 | 15.060959 | 2.758904 | 43.489041 | 6.321918 | 2007.815753 |
std | 42.300571 | 24.284752 | 9981.264932 | 1.382997 | 1.112799 | 30.202904 | 20.645407 | 181.066207 | 456.098091 | 161.319273 | ... | 213.804841 | 125.338794 | 66.256028 | 61.119149 | 29.317331 | 55.757415 | 40.177307 | 496.123024 | 2.703626 | 1.328095 |
min | 20.000000 | 21.000000 | 1300.000000 | 1.000000 | 1.000000 | 1872.000000 | 1950.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 2006.000000 |
25% | 20.000000 | 59.000000 | 7553.500000 | 5.000000 | 5.000000 | 1954.000000 | 1967.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 334.500000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 5.000000 | 2007.000000 |
50% | 50.000000 | 69.000000 | 9478.500000 | 6.000000 | 5.000000 | 1973.000000 | 1994.000000 | 0.000000 | 383.500000 | 0.000000 | ... | 480.000000 | 0.000000 | 25.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 6.000000 | 2008.000000 |
75% | 70.000000 | 80.000000 | 11601.500000 | 7.000000 | 6.000000 | 2000.000000 | 2004.000000 | 166.000000 | 712.250000 | 0.000000 | ... | 576.000000 | 168.000000 | 68.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 8.000000 | 2009.000000 |
max | 190.000000 | 313.000000 | 215245.000000 | 10.000000 | 9.000000 | 2010.000000 | 2010.000000 | 1600.000000 | 5644.000000 | 1474.000000 | ... | 1418.000000 | 857.000000 | 547.000000 | 552.000000 | 508.000000 | 480.000000 | 738.000000 | 15500.000000 | 12.000000 | 2010.000000 |
8 rows × 36 columns
Create a project and initiate Autopilot¶
After fetching the raw data, create a DataRobot project with "SHAP only" selected. Then, begin building models.
advanced_options = dr.AdvancedOptions(
    shap_only_mode=True, blend_best_models=False
)  # Blender models don't support SHAP, so disable them
dataset = dr.Dataset.create_from_in_memory_data(
raw_data["data"].assign(SalePrice=raw_data["target"])
)
dataset.modify(name="SHAP Home Sales Sample")
project = dr.Project.create_from_dataset(
dataset_id=dataset.id, project_name="SHAP Home Sales Example"
)
project.analyze_and_model(
target="SalePrice",
advanced_options=advanced_options,
)
project.wait_for_autopilot()
In progress: 0, queued: 0 (waited: 0s)
raw_data["data"].assign(SalePrice=raw_data["target"]).columns
Index(['MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual', 'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC', 'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType', 'SaleCondition', 'SalePrice'], dtype='object')
Calculate feature effects¶
Before examining SHAP values, generate Feature Effects for the top-performing model and request SHAP Feature Impact. SHAP values can be used to recreate feature effects.
# Retrieve the top model
top_model = project.get_top_model("R Squared")
top_model.request_feature_effect()
dr.ShapImpact.create(project.id, top_model.id)
{'validation': 0.92356, 'crossValidation': 0.877534, 'holdout': 0.89028, 'training': None, 'backtestingScores': None, 'backtesting': None}
Job(shapImpact, status=COMPLETED)
Make predictions¶
Next, score a sample of the data and request SHAP explanations along with the predictions. The output is then merged back into the original data for comparison.
# Score a random half of the homes
sample_df = (
    raw_data["data"]
    .assign(SalePrice=raw_data["target"])
    .sample(round(raw_data["data"].shape[0] / 2))
)
sample_dataset = dr.Dataset.create_from_in_memory_data(sample_df)
project_dataset = project.upload_dataset_from_catalog(sample_dataset.id)
# Request predictions along with SHAP-based Prediction Explanations
predictions_job = top_model.request_predictions(
    dataset_id=project_dataset.id,
    explanation_algorithm="shap",
    max_explanations=None,
)
predictions = predictions_job.get_result_when_complete()
# Join the predictions and explanations back onto the sampled rows
review_data = sample_df.reset_index().join(predictions)
review_data.head()
 | index | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | Explanation_37_feature_value | Explanation_37_strength | Explanation_38_feature_name | Explanation_38_feature_value | Explanation_38_strength | Explanation_39_feature_name | Explanation_39_feature_value | Explanation_39_strength | shap_remaining_total | shap_base_value
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 225 | 160.0 | RM | 21.0 | 1680.0 | Pave | NaN | Reg | Lvl | AllPub | ... | 0.0 | -0.007096 | OpenPorchSF | 0.0 | -0.003912 | ScreenPorch | 0.0 | -0.003247 | None | 12.029002 |
1 | 280 | 60.0 | RL | 82.0 | 11287.0 | Pave | NaN | Reg | Lvl | AllPub | ... | 0.0 | -0.007130 | OpenPorchSF | 84.0 | 0.005903 | ScreenPorch | 0.0 | -0.003221 | None | 12.029002 |
2 | 945 | 50.0 | RM | 98.0 | 8820.0 | Pave | NaN | Reg | Lvl | AllPub | ... | 48.0 | -0.009160 | OpenPorchSF | 0.0 | -0.003873 | ScreenPorch | 0.0 | -0.003123 | None | 12.029002 |
3 | 790 | 120.0 | RL | 43.0 | 3182.0 | Pave | NaN | Reg | Lvl | AllPub | ... | 100.0 | -0.004616 | OpenPorchSF | 16.0 | -0.005681 | ScreenPorch | 0.0 | -0.003124 | None | 12.029002 |
4 | 410 | 20.0 | RL | 68.0 | 9571.0 | Pave | NaN | Reg | Lvl | AllPub | ... | 0.0 | -0.011766 | OpenPorchSF | 0.0 | -0.004172 | ScreenPorch | 0.0 | -0.002619 | None | 12.029002 |
5 rows × 202 columns
View Prediction Explanations¶
DataRobot outputs the SHAP values into a table. The code below reformats the table into a form which is easier to analyze:
def parse_row(index, row, explanation_col=1):
    # Pull a single explanation (feature name, strength, and value) out of a wide prediction row
    return [
        index,
        row[f"Explanation_{explanation_col}_feature_name"],
        row[f"Explanation_{explanation_col}_strength"],
        row[f"Explanation_{explanation_col}_feature_value"],
    ]

# The prediction output contains 39 explanation columns; stack them into a long format
data_container = []
for r in range(39):
    for index, row in review_data.iterrows():
        data_container.append(parse_row(index, row, r + 1))
features_by_home = pd.DataFrame(
data_container,
columns=["row_number", "feature_name", "feature_strength", "feature_value"],
).dropna()
features_by_home.sort_values(by=["row_number", "feature_strength"]).head()
 | row_number | feature_name | feature_strength | feature_value
---|---|---|---|---|
17520 | 0 | TotalBsmtSF | -0.061025 | 630.0 |
12410 | 0 | LotArea | -0.056740 | 1680.0 |
18250 | 0 | 1stFlrSF | -0.056393 | 630.0 |
10220 | 0 | SaleCondition | -0.049429 | Abnorml |
13140 | 0 | OverallQual | -0.041500 | 5.0 |
SHAP values are additive, meaning they sum to the prediction, but on the scale the model works in. When a blueprint transforms the target, it uses a "link" function, and the SHAP values are additive on that link scale. For example, with the `logit` link, the inverse transformation maps values from (-inf, +inf) back to probabilities in (0, 1). To calculate the actual prediction, apply the inverse of the link function. For a regression model like this one, the link function is generally the natural logarithm `log`, so the inverse function is `exp`. There is also a base prediction (`shap_base_value`), representing the average prediction across the dataset, which is added to the feature strengths before applying the inverse link.
You can observe how this works by following the example below:
from math import exp

home_number = 150
# Retrieve the prediction and SHAP explanations for a single home
home_value_prediction = review_data.iloc[home_number].prediction
home_feature_strengths = features_by_home[features_by_home.row_number == home_number]
# The SHAP base value is identical for every row, so take it from this home's row
shap_base_value = review_data.iloc[home_number]["shap_base_value"]
Markdown(
    f"""
#### The Actual Prediction
The actual prediction of home number {home_number} is ${round(review_data['prediction'].loc[home_number], 2)}.
Applying the Python `exp` function (the inverse of the `log` link) to the sum of the SHAP strengths plus the base value yields the same value as the prediction.
Inverting the link function on the summed SHAP values gives {round(exp(home_feature_strengths["feature_strength"].sum() + shap_base_value), 2)}
"""
)
The Actual Prediction¶
The actual prediction of home number 150 is $162925.19.
Applying the Python exp function (the inverse of the log link) to the sum of the SHAP strengths plus the base value yields the same value as the prediction.
Inverting the link function on the summed SHAP values gives 162925.19
DataRobot uses two link functions in SHAP predictions:
- `log`: the inverse is the exponential, calculated in Python as `exp(pred)`.
- `logit`: the inverse is `exp(pred) / (1 + exp(pred))`.

DataRobot blueprints may or may not apply a link function to the target value. In practice, almost all binary classification problems use the `logit` link function. For more information, view the SHAP reference documentation.
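If you want to apply these inverses programmatically, a small helper like the sketch below can be used. The function name `invert_link` is illustrative and is not part of the DataRobot client:
from math import exp

def invert_link(value_on_link_scale, link="log"):
    # Map a value (for example, shap_base_value plus the summed strengths) back to the prediction scale
    if link == "log":
        return exp(value_on_link_scale)
    if link == "logit":
        return exp(value_on_link_scale) / (1 + exp(value_on_link_scale))
    return value_on_link_scale  # no link function applied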
To create a scorecard, convert the SHAP strength scores into dollar amounts using the following steps.
1. Calculate the share of a feature's contribution to the sum of the home's SHAP feature strengths.
feature_strength = home_feature_strengths[home_feature_strengths.feature_name == "LotArea"][
"feature_strength"
]
share_of_feature_strength = feature_strength / home_feature_strengths.feature_strength.sum()
share_of_feature_strength
12560 -0.035705 Name: feature_strength, dtype: float64
2. Calculate the difference between the predicted value and the average predicted value (the inverse-linked base value). In this case, the home's predicted price is about $4,600 less than the average home in Ames, IA.
prediction_distance = home_value_prediction - exp(shap_base_value)
prediction_distance
-4618.900248170481
3. Use the share calculated in step 1 to estimate that feature's contribution to the prediction difference from step 2.
share_of_feature_strength * prediction_distance
12560 164.916169 Name: feature_strength, dtype: float64
Notice that the lot area of this home raises its predicted sale price by roughly $165 relative to the average. You can now generalize this approach to the entire dataset using the functions below:
def estimate_shap_strengths(
    shap_values_frame: pd.Series,
    shap_base_value: float,
    actual_prediction: float,
    link_function=exp,
):
    # Distribute the prediction's distance from the base prediction across features,
    # proportionally to each feature's SHAP strength
    if link_function is None:
        link_function = lambda x: x
    sum_of_shap_strengths = shap_values_frame.sum()
    base_prediction = link_function(shap_base_value)
    prediction_distance = actual_prediction - base_prediction
    shap_value_share = shap_values_frame.apply(
        lambda shap_strength_value: (shap_strength_value / sum_of_shap_strengths)
        * prediction_distance
    )
    return shap_value_share
def grouper(df: pd.DataFrame):
df = df.set_index("feature_name")
return estimate_shap_strengths(
shap_values_frame=df["feature_strength"],
shap_base_value=df["shap_base_value"].iloc[0],
actual_prediction=df["prediction"].iloc[0],
).reset_index()
feature_strengths_in_real_dollars = (
features_by_home.join(review_data[["prediction", "shap_base_value"]], on="row_number")
.groupby("row_number")
.apply(lambda df: grouper(df))
)
Create a home value scorecard¶
This is the data that can power your scorecard. Populate a report using the newly calculated data and the template below:
from IPython.display import display, HTML, Markdown
home_features_in_dollars = (
feature_strengths_in_real_dollars.loc[(home_number,)]
.sort_values(by="feature_strength")
.set_index("feature_name")
)
display(
Markdown(
f"""
### Home value scorecard
<img src="https://upload.wikimedia.org/wikipedia/commons/a/a3/Home_in_Pella%2C_Iowa_in_Winter_%2824594812035%29.jpg" height="250px" width="300px">
**Predicted Price: $162,925**
### Current best qualities for this home
|Feature|Price impact|
|-------|------------|
|Finished Basement Size|<span style="color: green;">{round(home_features_in_dollars.loc['BsmtFinSF1']['feature_strength'])}</span>|
|Total Living Area|<span style="color: green;">{round(home_features_in_dollars.loc['GrLivArea']['feature_strength'])}</span>|
|Basement Full Bath|<span style="color: green;">{round(home_features_in_dollars.loc['BsmtFullBath']['feature_strength'])}</span>|
### Areas of concern
|Feature|Price impact|
|-------|------------|
|Second floor size|<span style="color: red;">{round(home_features_in_dollars.loc['2ndFlrSF']['feature_strength'])}</span>|
|Neighborhood|<span style="color: red;">{round(home_features_in_dollars.loc['Neighborhood']['feature_strength'])}</span>|
|Fireplaces|<span style="color: red;">{round(home_features_in_dollars.loc['Fireplaces']['feature_strength'])}</span>|
|Overall Paint Quality|<span style="color: red;">{round(home_features_in_dollars.loc['OverallQual']['feature_strength'])}</span>|
"""
)
)
Home Value Scorecard¶
Predicted Price: $162,925
Current Best Qualities for This Home¶
Feature | Price Impact |
---|---|
Finished Basement Size | 10669 |
Total Living Area | 7300 |
Basement Full Bath | 2286 |
Areas of Concern¶
Feature | Price Impact |
---|---|
Second floor size | -5609 |
Neighborhood | -4715 |
Fireplaces | -3832 |
Overall Paint Quality | -7665 |
SHAP Prediction Explanation clustering¶
You can get a lot of value from grouping predictions by their Prediction Explanation values. Consider a health insurer identifying patients at risk of future hospitalization: clustering on explanations can help differentiate patients whose risk stems from taking multiple medications daily (known as polypharmacy) from patients whose conditions could be exacerbated by hot, hazy weather.
SHAP values, due to their additive nature, are useful for clustering.
Returning to the home sale use case, you can see the utility of this approach by looking at the top features used in the model. The chart below shows feature importance as calculated by DataRobot:
feature_effects_data = [
(fe["feature_name"], fe["feature_impact_score"])
for fe in top_model.get_feature_effect(source="training").feature_effects
]
alt.Chart(
pd.DataFrame(feature_effects_data, columns=["Feature Name", "Feature Impact Score"])
.sort_values("Feature Impact Score")
.tail(15),
title="Feature Importance of key Numerical Features",
width=350,
height=450,
).mark_bar().encode(y=alt.Y("Feature Name:N", sort="-x"), x="Feature Impact Score:Q")
The most important features are OverallQual (overall home quality), GrLivArea (above-grade living area), 2ndFlrSF (second-floor square footage), and TotalBsmtSF (total basement square footage).
Next, use these high-importance features to look at potential clusters in the data.
top_features = ["OverallQual", "GrLivArea", "2ndFlrSF", "TotalBsmtSF"]
chart_data = (
feature_strengths_in_real_dollars[
feature_strengths_in_real_dollars.feature_name.isin(top_features)
]
.reset_index()[["row_number", "feature_name", "feature_strength"]]
.set_index(["row_number", "feature_name"])
.unstack()["feature_strength"]
)
chart_data["above_average"] = review_data["prediction"] > exp(shap_base_value)
alt.Chart(chart_data, width=150, height=150).mark_point().encode(
alt.X(alt.repeat("column"), type="quantitative"),
alt.Y("OverallQual:Q"),
alt.Color("above_average", title="Above Average Predicted Price"),
).repeat(column=top_features[1:])
You can see some clusters start to form. This row of charts compares overall quality (OverallQual) with living area (GrLivArea), second-floor square footage (2ndFlrSF), and basement square footage (TotalBsmtSF). In each case, there are clusters of homes with lower quality but larger living areas; these might be targets for investment or renovation.
This is a very simple test for clusters. Using DataRobot, you can automatically cluster homes by their Prediction Explanations. First, assemble a training dataset from the sample scored previously. Then, run an unsupervised clustering project to find common groups among the homes.
clustering_train_data = (
feature_strengths_in_real_dollars.reset_index()[
["row_number", "feature_name", "feature_strength"]
]
.set_index(["row_number", "feature_name"])
.unstack()["feature_strength"]
)
clustering_train_data.head()
feature_name | 1stFlrSF | 2ndFlrSF | BsmtExposure | BsmtFinSF1 | BsmtFinType1 | BsmtFullBath | BsmtQual | Condition1 | ExterQual | Exterior1st | ... | OpenPorchSF | OverallCond | OverallQual | SaleCondition | ScreenPorch | TotRmsAbvGrd | TotalBsmtSF | WoodDeckSF | YearBuilt | YearRemodAdd |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
row_number | |||||||||||||||||||||
0 | -4864.412466 | 8318.669683 | 1167.712273 | -7128.366814 | -303.705816 | -1398.859873 | -93.676896 | 1030.275956 | -437.636247 | 183.058581 | ... | 1739.112240 | -5072.552694 | 8588.601149 | 251.847563 | -589.154691 | 1602.871030 | -2392.637449 | -436.865815 | 4598.017824 | 1265.036695 |
1 | 377.039109 | 12390.197968 | -1651.889720 | 1253.800734 | -2254.803171 | -1268.625300 | -1458.751706 | -6458.263645 | -403.516312 | 1610.439662 | ... | -631.327110 | -4494.773379 | -7642.660179 | 228.400425 | -534.304086 | 2393.297967 | 1933.905169 | -957.052433 | -3011.197277 | -2621.767936 |
2 | 6612.957378 | 4683.927144 | -1708.318748 | -7012.302307 | -284.839438 | -1311.961906 | -1382.710958 | 972.453930 | -417.300545 | -2993.981573 | ... | 125.019205 | 14566.189749 | 8236.270875 | 236.202650 | -540.429039 | 1919.767194 | -6874.544475 | -1143.954450 | -10795.126293 | 1204.752171 |
3 | 9105.013762 | -6009.236170 | -1853.510365 | -7244.091932 | -309.048210 | -1423.466782 | -50.223149 | 1041.302861 | -452.767311 | 148.536967 | ... | -447.098427 | -5029.804347 | 9421.857124 | 256.277735 | -598.374190 | 103.922677 | 10715.081166 | -1305.711755 | 4686.829664 | 1089.874286 |
4 | -1917.996341 | 8951.753337 | -1927.297336 | 5288.131184 | 1432.447402 | 2082.421990 | -87.560816 | -5435.143204 | 130.814098 | 806.096103 | ... | -708.545779 | -5275.930552 | 9846.145139 | 266.479975 | -621.818136 | 2246.006525 | -181.978763 | -874.704637 | 4736.449385 | 2090.240300 |
5 rows × 39 columns
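Before launching the DataRobot clustering project, you can optionally sanity-check that cluster structure exists by clustering the SHAP dollar strengths directly with scikit-learn. This is an illustrative sketch; KMeans and the choice of four clusters are assumptions rather than part of the DataRobot workflow:
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Standardize the per-home SHAP dollar strengths and fit a simple k-means model
scaled_strengths = StandardScaler().fit_transform(clustering_train_data.fillna(0))
quick_cluster_labels = KMeans(n_clusters=4, random_state=0, n_init=10).fit_predict(scaled_strengths)
pd.Series(quick_cluster_labels).value_counts()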
clustering_project = dr.Project.start(
clustering_train_data,
unsupervised_mode=True,
unsupervised_type=dr.enums.UnsupervisedTypeEnum.CLUSTERING,
)
clustering_project.wait_for_autopilot()
In progress: 8, queued: 4 (waited: 0s) In progress: 8, queued: 4 (waited: 1s) In progress: 8, queued: 4 (waited: 1s) In progress: 8, queued: 4 (waited: 2s) In progress: 8, queued: 4 (waited: 4s) In progress: 8, queued: 4 (waited: 6s) In progress: 8, queued: 4 (waited: 10s) In progress: 8, queued: 4 (waited: 17s) In progress: 4, queued: 0 (waited: 30s) In progress: 0, queued: 0 (waited: 50s)
Review the most impactful features that distinguish the clusters in the sale data.
# Select the clustering model with the highest (best) Silhouette Score
cluster_top_model = clustering_project.get_models()
cluster_top_model.sort(key=lambda m: m.metrics["Silhouette Score"]["training"])
cluster_top_model = cluster_top_model[-1]
try:
    cluster_top_model.request_feature_impact().wait_for_completion()
except dr.errors.JobAlreadyRequested:
    pass
cluster_feature_impact = pd.DataFrame(cluster_top_model.get_feature_impact())
alt.Chart(
cluster_feature_impact.head(15), title="Top 15 Features to Determine Clusters"
).mark_bar().encode(y=alt.Y("featureName:N", sort="-x"), x=alt.X("impactNormalized:Q"))
The results indicate targets for renovation: kitchen quality (KitchenQual) and basement quality (BsmtQual).
chart_data = clustering_train_data.assign(
    # Summing the dollar-scale strengths gives each home's distance from the base prediction;
    # add the base prediction back to approximate the predicted value
    predicted_value=lambda df: df.sum(axis=1) + exp(shap_base_value),
    above_average=lambda df: df.predicted_value > exp(shap_base_value),
)
alt.Chart(chart_data, title="Cohorts based on Kitchen and Overall Home Quality").mark_point(
filled=True
).encode(
x="OverallQual:Q",
y=alt.Y("KitchenQual:Q", scale=alt.Scale(type="sqrt")),
color="above_average",
)