Generate SHAP-based Prediction Explanations¶
One of the most useful features of DataRobot is the ability to generate specific Prediction Explanations for any prediction. An AI model doesn’t need to be a black box; DataRobot’s Prediction Explanations allow you to observe why a certain prediction is being made.
This notebook builds a home value scorecard that informs homeowners of the most valuable contributors to, and biggest detractors from, their home's sale price. DataRobot provides a native way to generate Prediction Explanations that is well suited to use cases like this one.
DataRobot supports two techniques to generate Prediction Explanations. The default is XEMP, a technique developed by DataRobot as an improvement on the academic LIME (Local Interpretable Model-agnostic Explanations) framework; it is the default algorithm because it can be run on all project types, including multiclass and unsupervised models.
The other technique is SHAP (SHapley Additive exPlanations). SHAP values are the average marginal contribution a feature makes to the overall prediction. Since SHAP can directly estimate the contribution of every input feature, it is ideal for helping you understand and make recommendations using the model. (This concept is explored throughout this notebook.) SHAP-based Prediction Explanations have three clear advantages:
- They are faster to calculate than XEMP.
- SHAP values are additive: in most cases, the feature strengths (plus a base value) sum to the predicted value, so you can see exactly how much each feature contributes (see the sketch after this list).
- The SHAP algorithm is entirely open-source, which may be preferred by compliance or audit teams.
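To see that additivity concretely, below is a minimal sketch using the open-source `shap` package with a scikit-learn model. This is an illustration only; the synthetic dataset, model, and variable names are placeholders, not DataRobot output.
import numpy as np
import shap
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Fit a small model on synthetic data
X, y = make_regression(n_samples=200, n_features=5, random_state=0)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# TreeSHAP: per-feature contributions for the first row
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:1])

# Additivity: the base (expected) value plus the sum of contributions equals the prediction
base_value = float(np.ravel(explainer.expected_value)[0])
print(np.isclose(base_value + shap_values.sum(), model.predict(X[:1])[0]))  # True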
The following sections outline how you can use SHAP values in a DataRobot project.
Import libraries and data¶
This notebook uses the Housing Prices dataset from OpenML. It's based on assessor records in Ames, IA, and has 79 features describing the homes, with SalePrice as the target.
from IPython.display import display, Markdown
import altair as alt
import datarobot as dr
import pandas as pd
from sklearn.datasets import fetch_openml
%load_ext lab_black
# If the config file is not in the default location described in the API Quickstart guide, '~/.config/datarobot/drconfig.yaml', then you will need to call
# dr.Client(config_path='path-to-drconfig.yaml')
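# Alternatively, you can connect by passing credentials to dr.Client directly; the endpoint URL
# and token below are placeholders:
# dr.Client(endpoint="https://app.datarobot.com/api/v2", token="<YOUR_API_TOKEN>")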
# Obtain the raw data
raw_data = fetch_openml(data_id="42563")
raw_data["data"].describe()
 | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | BsmtFinSF2 | ... | GarageArea | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 1460.000000 | 1201.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1452.000000 | 1460.000000 | 1460.000000 | ... | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 |
mean | 56.897260 | 70.049958 | 10516.828082 | 6.099315 | 5.575342 | 1971.267808 | 1984.865753 | 103.685262 | 443.639726 | 46.549315 | ... | 472.980137 | 94.244521 | 46.660274 | 21.954110 | 3.409589 | 15.060959 | 2.758904 | 43.489041 | 6.321918 | 2007.815753 |
std | 42.300571 | 24.284752 | 9981.264932 | 1.382997 | 1.112799 | 30.202904 | 20.645407 | 181.066207 | 456.098091 | 161.319273 | ... | 213.804841 | 125.338794 | 66.256028 | 61.119149 | 29.317331 | 55.757415 | 40.177307 | 496.123024 | 2.703626 | 1.328095 |
min | 20.000000 | 21.000000 | 1300.000000 | 1.000000 | 1.000000 | 1872.000000 | 1950.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 2006.000000 |
25% | 20.000000 | 59.000000 | 7553.500000 | 5.000000 | 5.000000 | 1954.000000 | 1967.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 334.500000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 5.000000 | 2007.000000 |
50% | 50.000000 | 69.000000 | 9478.500000 | 6.000000 | 5.000000 | 1973.000000 | 1994.000000 | 0.000000 | 383.500000 | 0.000000 | ... | 480.000000 | 0.000000 | 25.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 6.000000 | 2008.000000 |
75% | 70.000000 | 80.000000 | 11601.500000 | 7.000000 | 6.000000 | 2000.000000 | 2004.000000 | 166.000000 | 712.250000 | 0.000000 | ... | 576.000000 | 168.000000 | 68.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 8.000000 | 2009.000000 |
max | 190.000000 | 313.000000 | 215245.000000 | 10.000000 | 9.000000 | 2010.000000 | 2010.000000 | 1600.000000 | 5644.000000 | 1474.000000 | ... | 1418.000000 | 857.000000 | 547.000000 | 552.000000 | 508.000000 | 480.000000 | 738.000000 | 15500.000000 | 12.000000 | 2010.000000 |
8 rows × 36 columns
Create a project and initiate Autopilot¶
After fetching the raw data, create a DataRobot project with "SHAP only" selected. Then, begin building models.
advanced_options = dr.AdvancedOptions(
    shap_only_mode=True, blend_best_models=False
)  # Blender models don't support SHAP, so disable them
dataset = dr.Dataset.create_from_in_memory_data(
raw_data["data"].assign(SalePrice=raw_data["target"])
)
dataset.modify(name="SHAP Home Sales Sample")
project = dr.Project.create_from_dataset(
dataset_id=dataset.id, project_name="SHAP Home Sales Example"
)
project.analyze_and_model(
target="SalePrice",
advanced_options=advanced_options,
)
project.wait_for_autopilot()
In progress: 0, queued: 0 (waited: 0s)
raw_data["data"].assign(SalePrice=raw_data["target"]).columns
Index(['MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual', 'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC', 'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType', 'SaleCondition', 'SalePrice'], dtype='object')
Calculate feature effects¶
Before examining SHAP values, generate Feature Effects for the top-performing model and request SHAP Feature Impact. SHAP values can be used to recreate feature effects.
# Retrieve the top model
top_model = project.get_top_model("R Squared")
top_model.request_feature_effect()
dr.ShapImpact.create(project.id, top_model.id)
{'validation': 0.92356, 'crossValidation': 0.877534, 'holdout': 0.89028, 'training': None, 'backtestingScores': None, 'backtesting': None}
Job(shapImpact, status=COMPLETED)
Make predictions¶
Next, score a sample of the data and request SHAP explanations along with the predictions. The output is then merged back into the original data for comparison.
# Score a random half of the homes
sample_df = (
    raw_data["data"]
    .assign(SalePrice=raw_data["target"])
    .sample(round(raw_data["data"].shape[0] / 2))
)
sample_dataset = dr.Dataset.create_from_in_memory_data(sample_df)
project_dataset = project.upload_dataset_from_catalog(sample_dataset.id)
# Request predictions along with SHAP-based Prediction Explanations
predictions_job = top_model.request_predictions(
    dataset_id=project_dataset.id,
    explanation_algorithm="shap",
    max_explanations=None,
)
predictions = predictions_job.get_result_when_complete()
# Join the predictions and explanations back onto the sampled rows
review_data = sample_df.reset_index().join(predictions)
review_data.head()
 | index | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | Explanation_37_feature_value | Explanation_37_strength | Explanation_38_feature_name | Explanation_38_feature_value | Explanation_38_strength | Explanation_39_feature_name | Explanation_39_feature_value | Explanation_39_strength | shap_remaining_total | shap_base_value
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 225 | 160.0 | RM | 21.0 | 1680.0 | Pave | NaN | Reg | Lvl | AllPub | ... | 0.0 | -0.007096 | OpenPorchSF | 0.0 | -0.003912 | ScreenPorch | 0.0 | -0.003247 | None | 12.029002 |
1 | 280 | 60.0 | RL | 82.0 | 11287.0 | Pave | NaN | Reg | Lvl | AllPub | ... | 0.0 | -0.007130 | OpenPorchSF | 84.0 | 0.005903 | ScreenPorch | 0.0 | -0.003221 | None | 12.029002 |
2 | 945 | 50.0 | RM | 98.0 | 8820.0 | Pave | NaN | Reg | Lvl | AllPub | ... | 48.0 | -0.009160 | OpenPorchSF | 0.0 | -0.003873 | ScreenPorch | 0.0 | -0.003123 | None | 12.029002 |
3 | 790 | 120.0 | RL | 43.0 | 3182.0 | Pave | NaN | Reg | Lvl | AllPub | ... | 100.0 | -0.004616 | OpenPorchSF | 16.0 | -0.005681 | ScreenPorch | 0.0 | -0.003124 | None | 12.029002 |
4 | 410 | 20.0 | RL | 68.0 | 9571.0 | Pave | NaN | Reg | Lvl | AllPub | ... | 0.0 | -0.011766 | OpenPorchSF | 0.0 | -0.004172 | ScreenPorch | 0.0 | -0.002619 | None | 12.029002 |
5 rows × 202 columns
View Prediction Explanations¶
DataRobot outputs the SHAP values into a table. The code below reformats the table into a form which is easier to analyze:
def parse_row(index, row, explanation_col=1):
    # Pull a single explanation (feature name, strength, and value) out of a wide prediction row
    return [
        index,
        row[f"Explanation_{explanation_col}_feature_name"],
        row[f"Explanation_{explanation_col}_strength"],
        row[f"Explanation_{explanation_col}_feature_value"],
    ]

# The prediction output contains 39 explanation columns; stack them into a long format
data_container = []
for r in range(39):
    for index, row in review_data.iterrows():
        data_container.append(parse_row(index, row, r + 1))
features_by_home = pd.DataFrame(
data_container,
columns=["row_number", "feature_name", "feature_strength", "feature_value"],
).dropna()
features_by_home.sort_values(by=["row_number", "feature_strength"]).head()
 | row_number | feature_name | feature_strength | feature_value
---|---|---|---|---|
17520 | 0 | TotalBsmtSF | -0.061025 | 630.0 |
12410 | 0 | LotArea | -0.056740 | 1680.0 |
18250 | 0 | 1stFlrSF | -0.056393 | 630.0 |
10220 | 0 | SaleCondition | -0.049429 | Abnorml |
13140 | 0 | OverallQual | -0.041500 | 5.0 |
SHAP values are additive, meaning they sum to the prediction, but on the scale the model works in. When a blueprint transforms the target, it uses a "link" function, and the SHAP values are additive on that link scale. For example, with the `logit` link, the inverse transformation maps values from (-inf, +inf) back to probabilities in (0, 1). To calculate the actual prediction, apply the inverse of the link function. For a regression model like this one, the link function is generally the natural logarithm `log`, so the inverse function is `exp`. There is also a base prediction (`shap_base_value`), representing the average prediction across the dataset, which is added to the feature strengths before applying the inverse link.
You can observe how this works by following the example below:
from math import exp

home_number = 150
# Retrieve the prediction and SHAP explanations for a single home
home_value_prediction = review_data.iloc[home_number].prediction
home_feature_strengths = features_by_home[features_by_home.row_number == home_number]
# The SHAP base value is identical for every row, so take it from this home's row
shap_base_value = review_data.iloc[home_number]["shap_base_value"]
Markdown(
    f"""
#### The Actual Prediction
The actual prediction of home number {home_number} is ${round(review_data['prediction'].loc[home_number], 2)}.
Applying the Python `exp` function (the inverse of the `log` link) to the sum of the SHAP strengths plus the base value yields the same value as the prediction.
Inverting the link function on the summed SHAP values gives {round(exp(home_feature_strengths["feature_strength"].sum() + shap_base_value), 2)}
"""
)
The Actual Prediction¶
The actual prediction of home number 150 is $162925.19.
Applying the Python exp function (the inverse of the log link) to the sum of the SHAP strengths plus the base value yields the same value as the prediction.
Inverting the link function on the summed SHAP values gives 162925.19
DataRobot uses two link functions in SHAP predictions:
- `log`: the inverse is the exponential, calculated in Python as `exp(pred)`.
- `logit`: the inverse is `exp(pred) / (1 + exp(pred))`.

DataRobot blueprints may or may not apply a link function to the target value. In practice, almost all binary classification problems use the `logit` link function. For more information, view the SHAP reference documentation.
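If you want to apply these inverses programmatically, a small helper like the sketch below can be used. The function name `invert_link` is illustrative and is not part of the DataRobot client:
from math import exp

def invert_link(value_on_link_scale, link="log"):
    # Map a value (for example, shap_base_value plus the summed strengths) back to the prediction scale
    if link == "log":
        return exp(value_on_link_scale)
    if link == "logit":
        return exp(value_on_link_scale) / (1 + exp(value_on_link_scale))
    return value_on_link_scale  # no link function applied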
To create a scorecard, convert the SHAP strength scores into dollar amounts using the following steps.
1. Calculate the share of a feature's contribution to the sum of the home's SHAP feature strengths.
feature_strength = home_feature_strengths[home_feature_strengths.feature_name == "LotArea"][
"feature_strength"
]
share_of_feature_strength = feature_strength / home_feature_strengths.feature_strength.sum()
share_of_feature_strength
12560 -0.035705 Name: feature_strength, dtype: float64
2. Calculate the difference between the predicted value and the average predicted value (the inverse-linked base value). In this case, the home's predicted price is about $4,600 less than the average home in Ames, IA.
prediction_distance = home_value_prediction - exp(shap_base_value)
prediction_distance
-4618.900248170481
3. Use the share calculated in step 1 to estimate that feature's contribution to the prediction difference from step 2.
share_of_feature_strength * prediction_distance
12560 164.916169 Name: feature_strength, dtype: float64
Notice that the lot area of this home raises its predicted sale price by roughly $165 relative to the average. You can now generalize this approach to the entire dataset using the functions below:
def estimate_shap_strengths(
    shap_values_frame: pd.Series,
    shap_base_value: float,
    actual_prediction: float,
    link_function=exp,
):
    # Distribute the prediction's distance from the base prediction across features,
    # proportionally to each feature's SHAP strength
    if link_function is None:
        link_function = lambda x: x
    sum_of_shap_strengths = shap_values_frame.sum()
    base_prediction = link_function(shap_base_value)
    prediction_distance = actual_prediction - base_prediction
    shap_value_share = shap_values_frame.apply(
        lambda shap_strength_value: (shap_strength_value / sum_of_shap_strengths)
        * prediction_distance
    )
    return shap_value_share
def grouper(df: pd.DataFrame):
df = df.set_index("feature_name")
return estimate_shap_strengths(
shap_values_frame=df["feature_strength"],
shap_base_value=df["shap_base_value"].iloc[0],
actual_prediction=df["prediction"].iloc[0],
).reset_index()
feature_strengths_in_real_dollars = (
features_by_home.join(review_data[["prediction", "shap_base_value"]], on="row_number")
.groupby("row_number")
.apply(lambda df: grouper(df))
)
Create a home value scorecard¶
This is the data that can power your scorecard. Populate a report using the newly calculated data and the template below:
from IPython.display import display, HTML, Markdown
home_features_in_dollars = (
feature_strengths_in_real_dollars.loc[(home_number,)]
.sort_values(by="feature_strength")
.set_index("feature_name")
)
display(
Markdown(
f"""
### Home value scorecard
<img src="https://upload.wikimedia.org/wikipedia/commons/a/a3/Home_in_Pella%2C_Iowa_in_Winter_%2824594812035%29.jpg" height="250px" width="300px">
**Predicted Price: $162,925**
### Current best qualities for this home
|Feature|Price impact|
|-------|------------|
|Finished Basement Size|<span style="color: green;">{round(home_features_in_dollars.loc['BsmtFinSF1']['feature_strength'])}</span>|
|Total Living Area|<span style="color: green;">{round(home_features_in_dollars.loc['GrLivArea']['feature_strength'])}</span>|
|Basement Full Bath|<span style="color: green;">{round(home_features_in_dollars.loc['BsmtFullBath']['feature_strength'])}</span>|
### Areas of concern
|Feature|Price impact|
|-------|------------|
|Second floor size|<span style="color: red;">{round(home_features_in_dollars.loc['2ndFlrSF']['feature_strength'])}</span>|
|Neighborhood|<span style="color: red;">{round(home_features_in_dollars.loc['Neighborhood']['feature_strength'])}</span>|
|Fireplaces|<span style="color: red;">{round(home_features_in_dollars.loc['Fireplaces']['feature_strength'])}</span>|
|Overall Paint Quality|<span style="color: red;">{round(home_features_in_dollars.loc['OverallQual']['feature_strength'])}</span>|
"""
)
)
Home Value Scorecard¶
Predicted Price: $162,925
Current Best Qualities for This Home¶
Feature | Price Impact |
---|---|
Finished Basement Size | 10669 |
Total Living Area | 7300 |
Basement Full Bath | 2286 |
Areas of Concern¶
Feature | Price Impact |
---|---|
Second floor size | -5609 |
Neighborhood | -4715 |
Fireplaces | -3832 |
Overall Paint Quality | -7665 |
SHAP Prediction Explanation clustering¶
You can get a lot of value from grouping predictions by their Prediction Explanation values. Consider a health insurer identifying patients at risk of future hospitalization: clustering on explanations can help differentiate patients whose risk stems from taking multiple medications daily (known as polypharmacy) from patients whose conditions could be exacerbated by hot, hazy weather.
SHAP values, due to their additive nature, are useful for clustering.
Returning to the home sale use case, you can see the utility of this approach by looking at the top features used in the model. The chart below shows feature importance as calculated by DataRobot:
feature_effects_data = [
(fe["feature_name"], fe["feature_impact_score"])
for fe in top_model.get_feature_effect(source="training").feature_effects
]
alt.Chart(
pd.DataFrame(feature_effects_data, columns=["Feature Name", "Feature Impact Score"])
.sort_values("Feature Impact Score")
.tail(15),
title="Feature Importance of key Numerical Features",
width=350,
height=450,
).mark_bar().encode(y=alt.Y("Feature Name:N", sort="-x"), x="Feature Impact Score:Q")
The most important features are OverallQual (overall home quality), GrLivArea (above-grade living area), 2ndFlrSF (second-floor square footage), and TotalBsmtSF (total basement square footage).
Next, use these high-importance features to look at potential clusters in the data.
top_features = ["OverallQual", "GrLivArea", "2ndFlrSF", "TotalBsmtSF"]
chart_data = (
feature_strengths_in_real_dollars[
feature_strengths_in_real_dollars.feature_name.isin(top_features)
]
.reset_index()[["row_number", "feature_name", "feature_strength"]]
.set_index(["row_number", "feature_name"])
.unstack()["feature_strength"]
)
chart_data["above_average"] = review_data["prediction"] > exp(shap_base_value)
alt.Chart(chart_data, width=150, height=150).mark_point().encode(
alt.X(alt.repeat("column"), type="quantitative"),
alt.Y("OverallQual:Q"),
alt.Color("above_average", title="Above Average Predicted Price"),
).repeat(column=top_features[1:])
You can see some clusters start to form. This row of charts compares overall quality (OverallQual) with living area (GrLivArea), second-floor square footage (2ndFlrSF), and basement square footage (TotalBsmtSF). In each case, there are clusters of homes with lower quality but larger living areas; these might be targets for investment or renovation.
This is a very simple test for clusters. Using DataRobot, you can automatically cluster homes by their Prediction Explanations. First, assemble a training dataset from the sample scored previously. Then, run an unsupervised clustering project to find common groups among the homes.
clustering_train_data = (
feature_strengths_in_real_dollars.reset_index()[
["row_number", "feature_name", "feature_strength"]
]
.set_index(["row_number", "feature_name"])
.unstack()["feature_strength"]
)
clustering_train_data.head()
feature_name | 1stFlrSF | 2ndFlrSF | BsmtExposure | BsmtFinSF1 | BsmtFinType1 | BsmtFullBath | BsmtQual | Condition1 | ExterQual | Exterior1st | ... | OpenPorchSF | OverallCond | OverallQual | SaleCondition | ScreenPorch | TotRmsAbvGrd | TotalBsmtSF | WoodDeckSF | YearBuilt | YearRemodAdd |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
row_number | |||||||||||||||||||||
0 | -4864.412466 | 8318.669683 | 1167.712273 | -7128.366814 | -303.705816 | -1398.859873 | -93.676896 | 1030.275956 | -437.636247 | 183.058581 | ... | 1739.112240 | -5072.552694 | 8588.601149 | 251.847563 | -589.154691 | 1602.871030 | -2392.637449 | -436.865815 | 4598.017824 | 1265.036695 |
1 | 377.039109 | 12390.197968 | -1651.889720 | 1253.800734 | -2254.803171 | -1268.625300 | -1458.751706 | -6458.263645 | -403.516312 | 1610.439662 | ... | -631.327110 | -4494.773379 | -7642.660179 | 228.400425 | -534.304086 | 2393.297967 | 1933.905169 | -957.052433 | -3011.197277 | -2621.767936 |
2 | 6612.957378 | 4683.927144 | -1708.318748 | -7012.302307 | -284.839438 | -1311.961906 | -1382.710958 | 972.453930 | -417.300545 | -2993.981573 | ... | 125.019205 | 14566.189749 | 8236.270875 | 236.202650 | -540.429039 | 1919.767194 | -6874.544475 | -1143.954450 | -10795.126293 | 1204.752171 |
3 | 9105.013762 | -6009.236170 | -1853.510365 | -7244.091932 | -309.048210 | -1423.466782 | -50.223149 | 1041.302861 | -452.767311 | 148.536967 | ... | -447.098427 | -5029.804347 | 9421.857124 | 256.277735 | -598.374190 | 103.922677 | 10715.081166 | -1305.711755 | 4686.829664 | 1089.874286 |
4 | -1917.996341 | 8951.753337 | -1927.297336 | 5288.131184 | 1432.447402 | 2082.421990 | -87.560816 | -5435.143204 | 130.814098 | 806.096103 | ... | -708.545779 | -5275.930552 | 9846.145139 | 266.479975 | -621.818136 | 2246.006525 | -181.978763 | -874.704637 | 4736.449385 | 2090.240300 |
5 rows × 39 columns
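Before launching the DataRobot clustering project, you can optionally sanity-check that cluster structure exists by clustering the SHAP dollar strengths directly with scikit-learn. This is an illustrative sketch; KMeans and the choice of four clusters are assumptions rather than part of the DataRobot workflow:
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Standardize the per-home SHAP dollar strengths and fit a simple k-means model
scaled_strengths = StandardScaler().fit_transform(clustering_train_data.fillna(0))
quick_cluster_labels = KMeans(n_clusters=4, random_state=0, n_init=10).fit_predict(scaled_strengths)
pd.Series(quick_cluster_labels).value_counts()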
clustering_project = dr.Project.start(
clustering_train_data,
unsupervised_mode=True,
unsupervised_type=dr.enums.UnsupervisedTypeEnum.CLUSTERING,
)
clustering_project.wait_for_autopilot()
In progress: 8, queued: 4 (waited: 0s) In progress: 8, queued: 4 (waited: 1s) In progress: 8, queued: 4 (waited: 1s) In progress: 8, queued: 4 (waited: 2s) In progress: 8, queued: 4 (waited: 4s) In progress: 8, queued: 4 (waited: 6s) In progress: 8, queued: 4 (waited: 10s) In progress: 8, queued: 4 (waited: 17s) In progress: 4, queued: 0 (waited: 30s) In progress: 0, queued: 0 (waited: 50s)
Review the most impactful features that distinguish the clusters in the sale data.
# Select the clustering model with the highest (best) Silhouette Score
cluster_top_model = clustering_project.get_models()
cluster_top_model.sort(key=lambda m: m.metrics["Silhouette Score"]["training"])
cluster_top_model = cluster_top_model[-1]
try:
    cluster_top_model.request_feature_impact().wait_for_completion()
except dr.errors.JobAlreadyRequested:
    pass
cluster_feature_impact = pd.DataFrame(cluster_top_model.get_feature_impact())
alt.Chart(
cluster_feature_impact.head(15), title="Top 15 Features to Determine Clusters"
).mark_bar().encode(y=alt.Y("featureName:N", sort="-x"), x=alt.X("impactNormalized:Q"))
The results indicate targets for renovation: kitchen quality (KitchenQual) and basement quality (BsmtQual).
chart_data = clustering_train_data.assign(
    # Summing the dollar-scale strengths gives each home's distance from the base prediction;
    # add the base prediction back to approximate the predicted value
    predicted_value=lambda df: df.sum(axis=1) + exp(shap_base_value),
    above_average=lambda df: df.predicted_value > exp(shap_base_value),
)
alt.Chart(chart_data, title="Cohorts based on Kitchen and Overall Home Quality").mark_point(
filled=True
).encode(
x="OverallQual:Q",
y=alt.Y("KitchenQual:Q", scale=alt.Scale(type="sqrt")),
color="above_average",
)