%matplotlib inline
import datarobot as dr
from datarobot.enums import AUTOPILOT_MODE
from datarobot.errors import ClientError
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import numpy as np
import pandas as pd
Configure DataRobot API authentication¶
Read more about the different options for connecting to the DataRobot API from the client.
# If the config file is not in the default location described in the API Quickstart guide, '~/.config/datarobot/drconfig.yaml', then you will need to call
# dr.Client(config_path='path-to-drconfig.yaml')
Import data¶
data_path = "10k_diabetes.csv"
df = pd.read_csv(data_path)
df.head()
|   | race | gender | age | weight | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | payer_code | medical_specialty | ... | glipizide_metformin | glimepiride_pioglitazone | metformin_rosiglitazone | metformin_pioglitazone | change | diabetesMed | readmitted | diag_1_desc | diag_2_desc | diag_3_desc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Caucasian | Female | [50-60) | ? | Elective | Discharged to home | Physician Referral | 1 | CP | Surgery-Neuro | ... | No | No | No | No | No | No | False | Spinal stenosis in cervical region | Spinal stenosis in cervical region | Effusion of joint, site unspecified |
| 1 | Caucasian | Female | [20-30) | [50-75) | Urgent | Discharged to home | Physician Referral | 2 | UN | ? | ... | No | No | No | No | No | No | False | First-degree perineal laceration, unspecified ... | Diabetes mellitus of mother, complicating preg... | Sideroblastic anemia |
| 2 | Caucasian | Male | [80-90) | ? | Not Available | Discharged/transferred to home with home healt... | NaN | 7 | MC | Family/GeneralPractice | ... | No | No | No | No | No | Yes | True | Pneumococcal pneumonia [Streptococcus pneumoni... | Congestive heart failure, unspecified | Hyperosmolality and/or hypernatremia |
| 3 | AfricanAmerican | Female | [50-60) | ? | Emergency | Discharged to home | Transfer from another health care facility | 4 | UN | ? | ... | No | No | No | No | No | Yes | False | Cellulitis and abscess of face | Streptococcus infection in conditions classifi... | Diabetes mellitus without mention of complicat... |
| 4 | AfricanAmerican | Female | [50-60) | ? | Emergency | Discharged to home | Emergency Room | 5 | ? | Psychiatry | ... | No | No | No | No | Ch | Yes | False | Bipolar I disorder, single manic episode, unsp... | Diabetes mellitus without mention of complicat... | Depressive type psychosis |
5 rows × 51 columns
project = dr.Project.create(data_path, project_name="10K Diabetes Adv Modeling")
print("Project ID: {}".format(project.id))
Project ID: 635c2f3ba5c95929466f3cb7
Start Autopilot¶
project.analyze_and_model(
    target="readmitted",
    worker_count=-1,
)
Project(10K Diabetes Adv Modeling)
project.wait_for_autopilot()
In progress: 14, queued: 0 (waited: 0s)
In progress: 14, queued: 0 (waited: 1s)
In progress: 14, queued: 0 (waited: 1s)
In progress: 14, queued: 0 (waited: 2s)
In progress: 14, queued: 0 (waited: 3s)
In progress: 14, queued: 0 (waited: 5s)
In progress: 11, queued: 0 (waited: 9s)
In progress: 10, queued: 0 (waited: 16s)
In progress: 6, queued: 0 (waited: 29s)
In progress: 1, queued: 0 (waited: 49s)
In progress: 7, queued: 0 (waited: 70s)
In progress: 1, queued: 0 (waited: 90s)
In progress: 16, queued: 0 (waited: 111s)
In progress: 10, queued: 0 (waited: 131s)
In progress: 6, queued: 0 (waited: 151s)
In progress: 2, queued: 0 (waited: 172s)
In progress: 0, queued: 0 (waited: 192s)
In progress: 5, queued: 0 (waited: 213s)
In progress: 1, queued: 0 (waited: 233s)
In progress: 4, queued: 0 (waited: 253s)
In progress: 1, queued: 0 (waited: 274s)
In progress: 1, queued: 0 (waited: 294s)
In progress: 0, queued: 0 (waited: 315s)
In progress: 0, queued: 0 (waited: 335s)
Get the top-performing model¶
model = project.get_top_model()
Model insights¶
The following sections outline the various model insights DataRobot has to offer. Before proceeding, set color constants to replicate the visual style of DataRobot.
dr_dark_blue = "#08233F"
dr_blue = "#1F77B4"
dr_orange = "#FF7F0E"
dr_red = "#BE3C28"
Feature Impact¶
Feature Impact measures how important a feature is in the context of a model. It measures how much the accuracy of a model would decrease if that feature were removed.
Feature Impact is available for all model types and works by altering input data and observing the effect on a model’s score. It is an on-demand feature, meaning that you must initiate a calculation to see the results. Once DataRobot computes the feature impact for a model, that information is saved with the project.
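To make "altering input data and observing the effect on a model's score" concrete, here is a minimal permutation-style sketch in plain Python. It is not DataRobot's implementation: the toy model, the accuracy metric, and the helper names (`toy_model`, `accuracy`, `permutation_impact`) are all assumptions made for illustration. Dividing by the largest impact is assumed to mirror the scaling of the `impactNormalized` values plotted in this section, where the top feature sits at 1.

```python
import random


def toy_model(row):
    # Hypothetical model: predicts True when feature "a" exceeds 0.5 and ignores "b"
    return row["a"] > 0.5


def accuracy(rows, labels):
    # Fraction of rows where the toy model agrees with the label
    hits = sum(toy_model(row) == label for row, label in zip(rows, labels))
    return hits / len(rows)


def permutation_impact(rows, labels, feature, seed=0):
    # Shuffle one feature's column, then measure how much the score drops
    rng = random.Random(seed)
    shuffled = [row[feature] for row in rows]
    rng.shuffle(shuffled)
    altered = [dict(row, **{feature: value}) for row, value in zip(rows, shuffled)]
    return accuracy(rows, labels) - accuracy(altered, labels)


# Synthetic data: the label depends only on feature "a"
data_rng = random.Random(42)
rows = [{"a": data_rng.random(), "b": data_rng.random()} for _ in range(500)]
labels = [row["a"] > 0.5 for row in rows]

raw = {name: permutation_impact(rows, labels, name) for name in ("a", "b")}
top = max(raw.values())
normalized = {name: value / top for name, value in raw.items()}
print(normalized)  # {'a': 1.0, 'b': 0.0}
```

Shuffling `a` breaks the link between the input and the label, so the score drops; shuffling `b` changes nothing because the toy model never reads it.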
feature_impacts = model.get_or_request_feature_impact()
# Formats the ticks from a float into a percent
percent_tick_fmt = mtick.PercentFormatter(xmax=1.0)
impact_df = pd.DataFrame(feature_impacts)
impact_df.sort_values(by="impactNormalized", ascending=True, inplace=True)
# Positive values are blue, negative are red
bar_colors = impact_df.impactNormalized.apply(lambda x: dr_red if x < 0 else dr_blue)
ax = impact_df.plot.barh(
    x="featureName", y="impactNormalized", legend=False, color=bar_colors, figsize=(10, 14)
)
ax.xaxis.set_major_formatter(percent_tick_fmt)
ax.xaxis.set_tick_params(labeltop=True)
ax.xaxis.grid(True, alpha=0.2)
ax.set_facecolor(dr_dark_blue)
plt.ylabel("")
plt.xlabel("Effect")
plt.xlim((None, 1)) # Allow for negative impact
plt.title("Feature Impact", y=1.04)
Text(0.5, 1.04, 'Feature Impact')
Histogram¶
The histogram chart "buckets" numeric feature values into equal-sized ranges to show the frequency distribution of the variable, with the value ranges on the X-axis and the row counts on the Y-axis. The height of each bar represents the number of rows with values in that range.
The helper function below, matplotlib_pair_histogram, draws histograms paired with the project's target feature (readmitted in this case). The function overlays an orange line whose point in each histogram bin indicates the average target feature value for the rows in that bin.
def matplotlib_pair_histogram(labels, counts, target_avgs, bin_count, ax1, feature):
    # Rotate categorical labels
    if feature.feature_type in ["Categorical", "Text"]:
        ax1.tick_params(axis="x", rotation=45)
    ax1.set_ylabel(feature.name, color=dr_blue)
    ax1.bar(labels, counts, color=dr_blue)
    # Instantiate a second axes that shares the same x-axis
    ax2 = ax1.twinx()
    ax2.set_ylabel(target_feature_name, color=dr_orange)
    ax2.plot(labels, target_avgs, marker="o", lw=1, color=dr_orange)
    ax1.set_facecolor(dr_dark_blue)
    title = "Histogram for {} ({} bins)".format(feature.name, bin_count)
    ax1.set_title(title)
The next function, draw_feature_histogram, gets the histogram data and draws the histogram using the previous helper function.
Before using the function, you can retrieve downsampled histogram data using the snippet below:
feature = dr.Feature.get(project.id, "num_lab_procedures")
feature.get_histogram(bin_limit=6).plot
[{'label': '1.0', 'count': 755, 'target': 0.36026490066225164}, {'label': '14.5', 'count': 895, 'target': 0.3240223463687151}, {'label': '28.0', 'count': 1875, 'target': 0.3744}, {'label': '41.5', 'count': 2159, 'target': 0.38490041685965726}, {'label': '55.0', 'count': 1603, 'target': 0.45414847161572053}, {'label': '68.5', 'count': 557, 'target': 0.5080789946140036}]
For best accuracy, DataRobot recommends using divisors of 60 for bin_limit. Any value less than or equal to 60 can be used.
The target values are the average values of the project target for the rows in a given bin.
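As a rough illustration of what each histogram entry holds, the sketch below buckets a small list of numeric values into equal-width bins and records the row count and average target per bin. It is a plain-Python approximation, not the server-side algorithm: the helper name `pair_histogram` is hypothetical, and using each bin's lower edge as its label is an assumption based on the sample output shown earlier.

```python
def pair_histogram(values, targets, bin_count):
    # Equal-width bins spanning the observed range of the feature
    low, high = min(values), max(values)
    width = (high - low) / bin_count or 1.0  # avoid zero width when all values match
    bins = [
        {"label": str(low + i * width), "count": 0, "target_sum": 0.0}
        for i in range(bin_count)
    ]
    for value, target in zip(values, targets):
        # The maximum value would land one past the end, so clamp it into the last bin
        index = min(int((value - low) / width), bin_count - 1)
        bins[index]["count"] += 1
        bins[index]["target_sum"] += target
    return [
        {
            "label": b["label"],
            "count": b["count"],
            # Average target for the rows in this bin (None for an empty bin)
            "target": b["target_sum"] / b["count"] if b["count"] else None,
        }
        for b in bins
    ]


values = [1, 2, 3, 4, 5, 6, 7, 8]
targets = [0, 0, 1, 0, 1, 1, 1, 1]
histogram = pair_histogram(values, targets, bin_count=2)
for row in histogram:
    print(row)
```

The first bin holds the values 1 through 4 with one positive target out of four, so its average target is 0.25; the second bin holds 5 through 8, all positive.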
def draw_feature_histogram(feature_name, bin_count):
    feature = dr.Feature.get(project.id, feature_name)
    # Retrieve downsampled histogram data from server
    # based on desired bin count
    data = feature.get_histogram(bin_count).plot
    labels = [row["label"] for row in data]
    counts = [row["count"] for row in data]
    target_averages = [row["target"] for row in data]
    f, axarr = plt.subplots()
    f.set_size_inches((10, 4))
    matplotlib_pair_histogram(labels, counts, target_averages, bin_count, axarr, feature)
Lastly, specify the feature name, target, and desired bin count to create the feature histograms. You can view an example below:
feature_name = "num_lab_procedures"
target_feature_name = "readmitted"
draw_feature_histogram(feature_name, 12)
Categorical and other feature types are supported as well:
feature_name = "medical_specialty"
draw_feature_histogram(feature_name, 10)
Lift Chart¶
A lift chart shows you how close model predictions are to the actual values of the target in the training data. The lift chart data includes the average predicted value and the average actual values of the target, sorted by the prediction values in ascending order and split into up to 60 bins.
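The sketch below shows one way lift chart bins like these could be built from raw predictions: sort rows by predicted value, split them into roughly equal groups, and average the predicted and actual values in each group. It is an illustration under assumptions, not DataRobot's implementation; the helper name `lift_bins` is hypothetical, and `bin_weight` is assumed to be the number of rows in the bin.

```python
def lift_bins(predictions, actuals, bin_count):
    # Sort rows by predicted value, ascending
    pairs = sorted(zip(predictions, actuals), key=lambda pair: pair[0])
    total = len(pairs)
    bins = []
    for i in range(bin_count):
        # Integer arithmetic keeps the groups as even as possible
        chunk = pairs[i * total // bin_count : (i + 1) * total // bin_count]
        if not chunk:
            continue
        bins.append(
            {
                "actual": sum(actual for _, actual in chunk) / len(chunk),
                "predicted": sum(pred for pred, _ in chunk) / len(chunk),
                "bin_weight": float(len(chunk)),
            }
        )
    return bins


predictions = [0.9, 0.1, 0.4, 0.8, 0.2, 0.6]
actuals = [1, 0, 0, 1, 0, 1]
toy_bins = lift_bins(predictions, actuals, bin_count=3)
for toy_bin in toy_bins:
    print(toy_bin)
```

In a well-calibrated model the `predicted` and `actual` averages rise together from the first bin to the last, which is what the plotted lift chart makes visible.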
lc = model.get_lift_chart("validation")
lc
LiftChart(validation)
bins_df = pd.DataFrame(lc.bins)
bins_df.head()
|   | actual | predicted | bin_weight |
|---|---|---|---|
| 0 | 0.000000 | 0.076155 | 27.0 |
| 1 | 0.148148 | 0.117283 | 27.0 |
| 2 | 0.076923 | 0.146873 | 26.0 |
| 3 | 0.148148 | 0.168664 | 27.0 |
| 4 | 0.111111 | 0.182873 | 27.0 |
The following snippet defines functions for rebinning and plotting.
def rebin_df(raw_df, number_of_bins):
    cols = ["bin", "actual_mean", "predicted_mean", "bin_weight"]
    rows = []
    current_prediction_total = 0
    current_actual_total = 0
    current_row_total = 0
    bin_size = 60 / number_of_bins
    for rowId, data in raw_df.iterrows():
        # Accumulate weighted totals so merged means respect each bin's row count
        current_prediction_total += data["predicted"] * data["bin_weight"]
        current_actual_total += data["actual"] * data["bin_weight"]
        current_row_total += data["bin_weight"]
        # Close the merged bin once enough source bins have been collected
        if (rowId + 1) % bin_size == 0:
            rows.append(
                {
                    "bin": (rowId + 1) / 60 * number_of_bins,
                    "actual_mean": current_actual_total / current_row_total,
                    "predicted_mean": current_prediction_total / current_row_total,
                    "bin_weight": current_row_total,
                }
            )
            current_prediction_total = 0
            current_actual_total = 0
            current_row_total = 0
    # DataFrame.append was removed in pandas 2.0, so build the frame in one step
    return pd.DataFrame(rows, columns=cols)
def matplotlib_lift(bins_df, bin_count, ax):
    grouped = rebin_df(bins_df, bin_count)
    x = range(1, len(grouped) + 1)
    # Label both series so that ax.legend() has entries to display
    ax.plot(x, grouped["predicted_mean"], marker="+", lw=1, color=dr_blue, label="Predicted")
    ax.plot(x, grouped["actual_mean"], marker="*", lw=1, color=dr_orange, label="Actual")
    ax.set_xlim([0, len(grouped) + 1])
    ax.set_facecolor(dr_dark_blue)
    ax.legend(loc="best")
    ax.set_title("Lift chart {} bins".format(bin_count))
    ax.set_xlabel("Sorted Prediction")
    ax.set_ylabel("Value")
    return grouped
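The rebinning step multiplies each bin's mean by its bin_weight before summing because source bins do not all hold the same number of rows. The toy check below merges the first three rows of the lift chart table shown earlier; the helper name `merge_bins` is hypothetical and exists only for this illustration.

```python
# First three bins from the lift chart table above
source_bins = [
    {"actual": 0.000000, "predicted": 0.076155, "bin_weight": 27.0},
    {"actual": 0.148148, "predicted": 0.117283, "bin_weight": 27.0},
    {"actual": 0.076923, "predicted": 0.146873, "bin_weight": 26.0},
]


def merge_bins(bins):
    # Weighted mean: each bin contributes in proportion to its row count
    total_weight = sum(b["bin_weight"] for b in bins)
    return {
        "actual_mean": sum(b["actual"] * b["bin_weight"] for b in bins) / total_weight,
        "predicted_mean": sum(b["predicted"] * b["bin_weight"] for b in bins) / total_weight,
        "bin_weight": total_weight,
    }


merged = merge_bins(source_bins)
unweighted_actual = sum(b["actual"] for b in source_bins) / len(source_bins)
print(merged)
print(unweighted_actual)
```

Here the weighted actual mean works out to roughly 6 positive rows out of 80, while a plain average of the three bin means drifts slightly because the third bin has one row fewer.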
Note that while this method works for any bin count less than 60, the most reliable results are achieved when the number of bins is a divisor of 60.
Additionally, this visualization method does not work for a bin count greater than 60 because DataRobot does not provide enough information for a larger resolution.
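The bin counts that divide 60 evenly can be listed with a short helper. This is a small sketch; the name `even_bin_counts` is hypothetical, and the default of 60 source bins comes from the lift chart description above.

```python
def even_bin_counts(source_bins=60):
    # A bin count merges cleanly only when every new bin covers
    # the same whole number of source bins
    return [n for n in range(1, source_bins + 1) if source_bins % n == 0]


print(even_bin_counts())  # [1, 2, 3, 4, 5, 6, 10, 12, 15, 20, 30, 60]
```

For any other bin count, 60 divided by the count is not a whole number, so the merged bins cannot all cover the same number of source bins.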
bin_counts = [10, 12, 15, 20, 30, 60]
f, axarr = plt.subplots(len(bin_counts))
f.set_size_inches((8, 4 * len(bin_counts)))
rebinned_dfs = []
for i in range(len(bin_counts)):
    rebinned_dfs.append(matplotlib_lift(bins_df, bin_counts[i], axarr[i]))
plt.tight_layout()