No-show forecasting notebook¶
In this use case you will build a model that identifies the patients most likely to miss appointments, along with the reasons driving those predictions. Staff can then use this data to target outreach to those patients and, additionally, to understand and perhaps address the associated issues.
Download the sample training dataset here.
Import libraries¶
# Install the shapely and datarobot libraries
!pip install shapely --quiet
!pip install datarobot --quiet
# First load the needed libraries
import datarobot as dr
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime
import shapely.wkt
import shapely.geometry
import os
from IPython.display import HTML
import json
import requests
import yaml
%matplotlib inline
Connect to DataRobot¶
Use the snippet below to authenticate and connect to DataRobot. You can read more about different options for connecting to DataRobot from the client.
dr.Client()
# The `config_path` should only be specified if the config file is not in the default location described in the API Quickstart guide
# dr.Client(config_path = 'path-to-drconfig.yaml')
<datarobot.rest.RESTClientObject at 0x7f9d40f31160>
Define functions¶
# Define some useful functions, primarily used at prediction time
def predict_deployment_expl(deployment, data):
# Get the DataRobot key, prediction server URL, and deployment ID
pred_server = deployment.default_prediction_server
datarobot_key = pred_server['datarobot-key']
deployment_url = pred_server['url']
deployment_id = deployment.id
# Set HTTP headers. Match the charset to the contents of the file
headers = {'Content-Type': 'application/json; charset=UTF-8', 'datarobot-key': datarobot_key}
# Set Prediction Explanations parameters
params = {
'maxCodes': 3,
'thresholdHigh': 0.1,
'thresholdLow': 0.01,
}
url = f'{deployment_url}/predApi/v1.0/deployments/{deployment_id}/predictionExplanations'
# Make an API request for predictions (the `creds` dictionary is loaded from the DataRobot config file in the 'Load the data' section)
predictions_response = requests.post(
url,
auth=(creds['username'], creds['token']),
data=data,
headers=headers,
params=params
)
return predictions_response.json()
# The prediction API returns values in nested JSON objects. The function below flattens the results to tabular format.
def flatten_json(y):
out = {}
def flatten(x, name=''):
if type(x) is dict:
for a in x:
flatten(x[a], name + a + '_')
elif type(x) is list:
i = 0
for a in x:
flatten(a, name + str(i) + '_')
i += 1
else:
out[name[:-1]] = x
flatten(y)
return out
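To see how the naming scheme works, here is a self-contained sanity check of the same flattening logic, run on a fragment shaped like the prediction API response (the sample values are made up):

```python
# Standalone copy of the flattening logic for a quick sanity check:
# nested dict keys are joined with "_", and list elements get their index.
def flatten_json(y):
    out = {}
    def flatten(x, name=''):
        if isinstance(x, dict):
            for a in x:
                flatten(x[a], name + a + '_')
        elif isinstance(x, list):
            for i, a in enumerate(x):
                flatten(a, name + str(i) + '_')
        else:
            out[name[:-1]] = x
    flatten(y)
    return out

example = {'predictionValues': [{'value': 0.12, 'label': True}]}
print(flatten_json(example))
# {'predictionValues_0_value': 0.12, 'predictionValues_0_label': True}
```

This is exactly how column names such as predictionValues_0_value arise in the prediction DataFrame above.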
def make_prediction(deployment, df_scoring):
# Convert to JSON format
data_to_pred = json.dumps(df_scoring.to_dict(orient='records'))
# Get and process predictions with explanations
preds_raw = predict_deployment_expl(deployment, data_to_pred)
preds_lst = [flatten_json(row) for row in preds_raw['data']]
df_preds = pd.DataFrame(preds_lst)
cols_to_rename = {'predictionValues_0_value': 'Prediction',
'predictionExplanations_0_feature': 'Primary Feature',
'predictionExplanations_0_featureValue': 'Primary Feature Value',
'predictionExplanations_0_qualitativeStrength':'Primary Feature Strength',
'predictionExplanations_1_feature': 'Secondary Feature',
'predictionExplanations_1_featureValue': 'Secondary Feature Value',
'predictionExplanations_1_qualitativeStrength':'Secondary Feature Strength',
'predictionExplanations_2_feature': 'Tertiary Feature',
'predictionExplanations_2_featureValue': 'Tertiary Feature Value',
'predictionExplanations_2_qualitativeStrength':'Tertiary Feature Strength'}
df_preds = df_preds.rename(columns=cols_to_rename)
return df_preds
def merge_results(df_scoring, df_preds):
cols_to_add = ['passthroughValues_Appointment ID', 'Prediction',
'Primary Feature Value', 'Primary Feature Strength', 'Primary Feature',
'Secondary Feature Value', 'Secondary Feature Strength', 'Secondary Feature',
'Tertiary Feature Value', 'Tertiary Feature Strength', 'Tertiary Feature',
]
df_scoring = df_scoring.merge(df_preds[cols_to_add], left_on='Appointment ID', right_on='passthroughValues_Appointment ID')
df_scoring['no_show'] = df_scoring.Prediction.apply(lambda x: True if x > 0.5 else False)
df_scoring = df_scoring.drop(columns=['passthroughValues_Appointment ID'])
return df_scoring
def data_prep_sample(sample_df):
sample_df['Schedule Date'] = pd.to_datetime(sample_df['Schedule Date'])
sample_df['Appointment Date'] = pd.to_datetime(sample_df['Appointment Date'])
# Calculate the delta between scheduled day and appointment day. Remove time to return in days only.
sample_df["Appointment/Schedule Delta"] = (sample_df['Appointment Date'] - sample_df['Schedule Date'].apply(lambda t : t.replace(minute=0, hour=0, second=0, microsecond=0))).apply(lambda t: t.days)
# Calculate a new column for the distance between the neighborhood and the clinic
    sample_df['Distance'] = sample_df.apply(lambda l: calcDistance(l['Neighborhood'], l['Clinic Location']), axis=1)
sample_df['Appointment Date'] = sample_df['Appointment Date'].apply(lambda l: str(l))
sample_df['Schedule Date'] = sample_df['Schedule Date'].apply(lambda l: str(l))
return sample_df
def load_file_and_make_predictions(file_name):
print("loading file " + file_name + "...")
df_scoring_csv = pd.read_csv(file_name)
df_scoring = data_prep_sample(df_scoring_csv)
print("loading file " + file_name + "...done")
print("Generating predictions...")
df_preds = make_prediction(deployment, df_scoring)
results_df = merge_results(df_scoring, df_preds)
print("Generating predictions...done")
print("No shows predicted for " + file_name)
display(results_df.loc[results_df['no_show'] == True])
print()
return results_df
Import data¶
df = pd.read_csv('no_show.csv')
df.info()
df.head(2)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 80000 entries, 0 to 79999
Data columns (total 14 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   no_show              80000 non-null  bool
 1   Patient ID           80000 non-null  object
 2   Appointment ID       80000 non-null  object
 3   Gender               80000 non-null  object
 4   Age                  80000 non-null  int64
 5   Alcohol Consumption  80000 non-null  object
 6   Hypertension         80000 non-null  bool
 7   Diabetes             80000 non-null  bool
 8   Appointment Date     80000 non-null  object
 9   Schedule Date        80000 non-null  object
 10  Appointment Reason   80000 non-null  object
 11  Clinic Location      80000 non-null  object
 12  Specialty            80000 non-null  object
 13  Neighborhood         80000 non-null  object
dtypes: bool(3), int64(1), object(10)
memory usage: 6.9+ MB
no_show | Patient ID | Appointment ID | Gender | Age | Alcohol Consumption | Hypertension | Diabetes | Appointment Date | Schedule Date | Appointment Reason | Clinic Location | Specialty | Neighborhood | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | False | 649e3901-e56b-41d9-b2d3-f61ce708a415 | 659a5257-c2f0-4eda-a2bb-ebc4bd9ce4e4 | F | 43 | 5/week | False | False | 2021-01-14T10:30:00 | 2020-10-26T00:00:00 | CHIROPRACT MANJ 3-4 REGIONS | Mission Bay | Human Performance Center | Russian Hill |
1 | False | 3028fd02-a20a-4233-ac16-b571dde4540c | 7ae6e7f8-3788-48d2-9fbc-11114ec28bfe | F | 37 | 0/week | False | True | 2021-02-17T14:00:00 | 2021-01-25T00:00:00 | OFFICE/OUTPATIENT VISIT EST | Mission Bay | Endocrine, Diabetes & Pregnancy Program | Ocean View |
The sample data contains the following features:
- no_show: Target variable. True if the patient missed, False if they attended.
- Patient ID: Unique identifier for the patient.
- Appointment ID: Unique identifier for the appointment.
- Gender: Gender of the patient.
- Age: Age of the patient.
- Alcohol Consumption: A categorical variable representing alcohol consumption per week.
- Hypertension: Boolean, True if the patient is hypertensive.
- Diabetes: Boolean, True if the patient is diabetic.
- Appointment Date: Appointment date and time.
- Schedule Date: The date the appointment was scheduled.
- Appointment Reason: Text field containing reason(s) for the appointment.
- Clinic Location: Location of the clinic (in this sample data, locations in San Francisco).
- Specialty: Specialist the patient is seeing.
- Neighborhood: Neighborhood the patient lives in (in this sample data, locations in San Francisco).
The following code returns some interesting facts found in the data.
# Return the attendance representation of all patients in the sample. True for patients who missed, False for those who attended.
print("no_show:")
print(df['no_show'].value_counts(normalize=True))
# Return the number of clinic locations.
print()
print('Number of Clinic Locations: ' + str(df['Clinic Location'].nunique()))
# Return the number of neighborhoods represented by the patient sample.
print()
print('Number of Neighborhoods: ' + str(df['Neighborhood'].nunique()))
# Plot the number of appointments per patient. Notice that most appointments are from patients that are new or have few historical appointments.
print()
print('Patient ID:')
df['Patient ID'].value_counts().plot(kind='hist', logy=True)
no_show:
False    0.959825
True     0.040175
Name: no_show, dtype: float64

Number of Clinic Locations: 9

Number of Neighborhoods: 37

Patient ID:
<AxesSubplot:ylabel='Frequency'>
In the graph above, the Y axis is the number of patients and the X axis is the number of visits. The graph illustrates that many patients have had (or scheduled) more than one visit. This matters because you want to ensure that patients don't appear in both the training and validation sets; it informs the group partitioning setup used for modeling later in the process.
Looking at the output returned, you can note a few interesting facts:
- no_show, the target variable, shows that no-shows account for 4% of all visits in the dataset.
- Patient ID represents the patient attending the appointment. Because the data contains multiple visits per patient, choose the Group partitioning method when setting up modeling in DataRobot.
- Number of Clinic Locations reports that 9 clinic locations are represented in the data.
- Number of Neighborhoods reports that 37 San Francisco neighborhoods are represented in the data. (Later, you will calculate the distance between the patient's neighborhood and the clinic.)
- Appointment Date and Schedule Date are event dates (date and time, in the case of the appointment date). Engineering a feature that calculates the difference between these dates would be useful as well.
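To illustrate why group partitioning matters, the sketch below (hypothetical data, not part of this project's code) contrasts a naive row-wise split, which can leak a patient across sets, with a group-aware split that assigns whole patients to one side, as DataRobot's Group partitioning does with Patient ID as the group feature:

```python
import pandas as pd

# Hypothetical visit-level data: repeated visits per patient
visits = pd.DataFrame({
    'Patient ID': ['p1', 'p1', 'p2', 'p2', 'p3'],
    'no_show':    [False, True, False, False, True],
})

# A naive row-wise split leaks patient p2 into both train and validation
train, valid = visits.iloc[:3], visits.iloc[3:]
leaked = set(train['Patient ID']) & set(valid['Patient ID'])
print(leaked)  # {'p2'}

# A group-aware split keeps every patient entirely on one side
groups = visits['Patient ID'].unique()
train_g = visits[visits['Patient ID'].isin(groups[:2])]
valid_g = visits[visits['Patient ID'].isin(groups[2:])]
assert set(train_g['Patient ID']).isdisjoint(valid_g['Patient ID'])
```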
Feature engineering¶
Engineering some features for DataRobot to use during model building helps to identify both likelihood and cause. The sections below calculate the following new features using the included supplemental datasets:
- Elapsed time, in days, between the appointment date and the schedule date (this often provides significant signal).
- Distance between the patient's home neighborhood and the clinic. Two additional datasets included with this accelerator help enrich the data:
  - planning_neighborhoods—neighborhoods of San Francisco and their WKT geodata polygons.
  - clinics—latitude and longitude of the identified clinics.
- Historical no-show rate per patient, using:
  - no_show_historical—historical patient-by-patient data.
There are many tools available to calculate these features. This use case uses Python and Pandas.
Elapsed time¶
The following calculates the difference, in days, between the date the appointment was made (Schedule Date) and the date it is scheduled for (Appointment Date). This feature is stored in a new column, Appointment/Schedule Delta.
# Change the scheduled and appointment dates from string to date
df['Schedule Date'] = pd.to_datetime(df['Schedule Date'])
df['Appointment Date'] = pd.to_datetime(df['Appointment Date'])
# Add a new column that calculates the delta between the scheduled day and the appointment day. Remove time to return days only.
df["Appointment/Schedule Delta"] = (df['Appointment Date'] - df['Schedule Date'].apply(lambda t : t.replace(minute=0, hour=0, second=0, microsecond=0))).apply(lambda t: t.days)
Distance to appointment¶
Now, calculate the distance between the patient's home and the clinic using the included datasets. The distance is calculated as the geographic distance from the polygon of the neighborhood the patient lives in to the location (latitude/longitude) of the clinic. The neighborhood and clinic names remain in the final dataset because they may still carry predictive signal. The following adds a new column, Distance.
# Create a map of the neighborhood to the geometry of that neighborhood
geo_df = pd.read_csv('planning_neighborhoods.csv')
geo_df = geo_df.reset_index()
geo = {}
for index, row in geo_df.iterrows():
geo[row['neighborho']] = row['the_geom']
# Create a map of the clinic location to the latitude/longitude of the clinic
clinic_df = pd.read_csv('clinics.csv')
clinic_df = clinic_df.reset_index()
clinic = {}
for index, row in clinic_df.iterrows():
clinic[row['name']] = [row['lat'], row['long']]
# Calculate the distance between the neighborhood and the clinic
def calcDistance(neigh, loc):
# Load the geometry (WKT format).
p = shapely.wkt.loads(geo[neigh])
lat = clinic[loc][0]
long = clinic[loc][1]
# Convert the lat/long into a point.
point = shapely.geometry.Point(long, lat)
    # Calculate the distance between the neighborhood polygon and the clinic point (0 if the clinic lies inside the neighborhood)
return p.distance(point)
# Add a new column that is the distance between the neighborhood and the clinic
df['Distance'] = df.apply(lambda l: calcDistance(l['Neighborhood'], l['Clinic Location']), axis=1)
Historical no-show rates¶
Next, load and calculate the historical no-show rate for each patient. For this example, assume that all historical data is stored in a CSV file keyed by Patient ID. (In practice, you would likely export this data to a CSV or query it from a database.) This task adds a new column, Hist No Show.
The primary dataset is at the visit level. The historical dataset is aggregated at the patient level.
The historical no-show rate is calculated as #no-shows / #visits over a fixed period of time in the past (for example, the past 6 months). In other words, does a patient's previous tendency to miss appointments help predict the current visit?
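The no_show_historical.csv file ships pre-aggregated, but if you only had raw visit-level history, the per-patient rate could be computed with a groupby. A sketch (the data and column names here are illustrative):

```python
import pandas as pd

# Hypothetical raw history: one row per past visit
history = pd.DataFrame({
    'Patient ID': ['p1', 'p1', 'p1', 'p2', 'p2'],
    'no_show':    [True, False, False, False, False],
})

# Per-patient rate: (# no-shows) / (# visits); the mean of a boolean
# column is exactly that ratio
rates = history.groupby('Patient ID')['no_show'].mean()
print(rates.to_dict())  # p1 -> 1/3, p2 -> 0.0
```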
# Create a map of patient ID to historical no show rates
hist_no_show_df = pd.read_csv('no_show_historical.csv')
hist_no_show_df = hist_no_show_df.reset_index()
hist_no_show = {}
for index, row in hist_no_show_df.iterrows():
hist_no_show[row['Patient ID']] = row['no_show']
# Add a new column that has the historical no show rate for that patient, or 0.0 if there is no data
df['Hist No Show'] = df['Patient ID'].apply(lambda l: hist_no_show.get(l, 0.0))
# Drop the column "Appointment ID" as it will have no meaning in DataRobot
df = df.drop(columns=['Appointment ID'])
Final data¶
The final dataset, including engineered features, can be seen below:
df.info()
pd.set_option('display.max_rows', 5)
display(df.loc[df['no_show'] == False])
display(df.loc[df['no_show'] == True])
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 80000 entries, 0 to 79999
Data columns (total 16 columns):
 #   Column                      Non-Null Count  Dtype
---  ------                      --------------  -----
 0   no_show                     80000 non-null  bool
 1   Patient ID                  80000 non-null  object
 2   Gender                      80000 non-null  object
 3   Age                         80000 non-null  int64
 4   Alcohol Consumption         80000 non-null  object
 5   Hypertension                80000 non-null  bool
 6   Diabetes                    80000 non-null  bool
 7   Appointment Date            80000 non-null  datetime64[ns]
 8   Schedule Date               80000 non-null  datetime64[ns]
 9   Appointment Reason          80000 non-null  object
 10  Clinic Location             80000 non-null  object
 11  Specialty                   80000 non-null  object
 12  Neighborhood                80000 non-null  object
 13  Appointment/Schedule Delta  80000 non-null  int64
 14  Distance                    80000 non-null  float64
 15  Hist No Show                80000 non-null  float64
dtypes: bool(3), datetime64[ns](2), float64(2), int64(2), object(7)
memory usage: 8.2+ MB
no_show | Patient ID | Gender | Age | Alcohol Consumption | Hypertension | Diabetes | Appointment Date | Schedule Date | Appointment Reason | Clinic Location | Specialty | Neighborhood | Appointment/Schedule Delta | Distance | Hist No Show | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | False | 649e3901-e56b-41d9-b2d3-f61ce708a415 | F | 43 | 5/week | False | False | 2021-01-14 10:30:00 | 2020-10-26 | CHIROPRACT MANJ 3-4 REGIONS | Mission Bay | Human Performance Center | Russian Hill | 80 | 0.035496 | 0.00 |
1 | False | 3028fd02-a20a-4233-ac16-b571dde4540c | F | 37 | 0/week | False | True | 2021-02-17 14:00:00 | 2021-01-25 | OFFICE/OUTPATIENT VISIT EST | Mission Bay | Endocrine, Diabetes & Pregnancy Program | Ocean View | 23 | 0.072859 | 0.00 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
79998 | False | 3570b112-021d-44da-be37-d2f76274ac90 | F | 41 | 5/week | False | False | 2020-07-06 15:30:00 | 2020-06-17 | ELECTROCARDIOGRAM REPORT | Mission Bay | Antenatal Testing Center | Ocean View | 19 | 0.072859 | 0.00 |
79999 | False | d066a33a-f9b8-40e9-9188-704e243202f8 | F | 36 | 5/week | False | False | 2021-09-08 10:30:00 | 2021-07-20 | OFFICE/OUTPATIENT VISIT EST,LOCM 300-399mg/ml ... | Laurel Village | Primary Care at Laurel Village | Outer Mission | 50 | 0.051993 | 0.26 |
76786 rows × 16 columns
no_show | Patient ID | Gender | Age | Alcohol Consumption | Hypertension | Diabetes | Appointment Date | Schedule Date | Appointment Reason | Clinic Location | Specialty | Neighborhood | Appointment/Schedule Delta | Distance | Hist No Show | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
89 | True | bdf1e07e-2e60-44d7-8ccb-0a485a0bdb5b | F | 58 | 5/week | True | False | 2021-05-14 15:45:00 | 2021-02-17 | OFFICE/OUTPATIENT VISIT EST | Mount Zion | High-Risk Skin Cancer Clinic | Excelsior | 86 | 0.054118 | 0.00 |
153 | True | 74152ac2-673f-4a12-9062-e01122d2af0f | M | 47 | > 14/week | False | False | 2021-04-27 11:00:00 | 2021-02-01 | SUBSEQUENT HOSPITAL CARE | Mission Bay | Human Performance Center | Presidio | 85 | 0.061707 | 0.18 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
79958 | True | f7f896bf-266f-42f0-acb8-9f1844c61cec | F | 36 | 0/week | False | False | 2020-10-02 16:00:00 | 2020-06-15 | Chest x-ray 2vw frontal&latl | Parnassus | California Center for Pituitary Disorders | Lakeshore | 109 | 0.028724 | 0.30 |
79961 | True | 715f74a8-325b-4b39-b28b-7722111abb58 | M | 56 | 0/week | False | False | 2020-07-28 09:45:00 | 2020-03-17 | OFFICE/OUTPATIENT VISIT EST,Ferumoxytol non-esrd | Mount Zion | Pain Management Center | Excelsior | 133 | 0.054118 | 0.25 |
3214 rows × 16 columns
Load the data¶
You can now load the final dataset into DataRobot. The example below uses the API to do this (see the documentation on installing and using the Python client). You could also modify the code above to save the final data as a CSV and use that to create the project in the DataRobot UI.
# Create a client connection to DR
# WARNING: Change the following file path to your config.yaml location
path = os.path.abspath(os.path.expanduser(os.path.expandvars("~/.config/datarobot/drconfig.yaml")))
with open(path, 'r') as stream:
creds = yaml.safe_load(stream)
dr.Client(config_path = path)
<datarobot.rest.RESTClientObject at 0x11d6add80>
# Write out the final version of the data
ct = datetime.datetime.now()
file_name = f"no_show_enriched_{int(ct.timestamp())}.csv"
dataset = dr.Dataset.create_from_in_memory_data(df)
dataset.modify(name=file_name)
dataset
#df.to_csv("./" + file_name, index=False)
Dataset(name='no_show_enriched_1655491559.csv', id='62accbf1ca7d695f57eea2cf')
Modeling and insights¶
The following steps use code to create the project, and then use the DataRobot UI to interpret results based on Leaderboard visualizations.
Create the project¶
The following code creates the project using the DataRobot API. To create the project from the UI, use the following settings:
Setting | Value |
---|---|
Target | no_show |
Partitioning | Group, with Group ID Feature set to Patient ID |
Feature List | Create a list that excludes Appointment Date and Schedule Date from modeling. The list keeps the date components that DataRobot derives (such as month, day of week, and time of day) but drops the raw dates themselves; the raw dates make poor modeling features because dates in future scoring data fall outside the range seen in training. |
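DataRobot derives date components automatically during EDA; conceptually, the derived features resemble what this hypothetical pandas sketch produces (the sample timestamps are taken from the rows shown earlier):

```python
import pandas as pd

# Illustrative only: extract the kind of components that are kept
# when the raw date columns are excluded from the feature list
dates = pd.to_datetime(pd.Series(['2021-01-14T10:30:00', '2021-02-17T14:00:00']))
components = pd.DataFrame({
    'Month': dates.dt.month,             # 1, 2
    'Day of Week': dates.dt.day_name(),  # Thursday, Wednesday
    'Hour of Day': dates.dt.hour,        # 10, 14
})
print(components)
```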
The following code creates the project in DataRobot using the above settings.
# Create the project
ct = datetime.datetime.now()
project_name = f"no_show_enriched_{int(ct.timestamp())}.csv"
project = dataset.create_project(project_name=project_name)
# Create a new Feature List that contains all of the features in the "Informative Features" and then removes some features
flists = project.get_featurelists()
flist = next(x for x in flists if x.name == "Informative Features")
# Remove the date features from the list, but retain the components created by DataRobot
new_features = [x for x in flist.features if x not in ['Appointment Date','Schedule Date']]
# Create the new Feature List
fl = project.create_featurelist(name='No Dates', features=new_features)
# Uncomment the next line to print out a link that can be used to access the project from the UI.
#display(HTML(f'You can <a target="_blank" rel="noopener noreferrer" href="https://app.datarobot.com/projects/{project.id}/eda">click into the project</a> and see the new Feature List. '))
# Set the project options--target, partitioning method, and Feature List. Then, start Quick Autopilot.
group_cv = dr.GroupCV(20, 5, ['Patient ID'])
project.analyze_and_model(
# Set the target as the no_show column
target='no_show',
# Don't limit the worker count
    worker_count=-1,
# Group partition on the Patient ID column
partitioning_method=group_cv,
# Set the feature list that was created above
featurelist_id=fl.id
)
# Uncomment the next line to print out a link that can be used to access the project from the UI.
#display(HTML(f'Now you can navigate to the models page of the new project or <a target="_blank" rel="noopener noreferrer" href="https://app.datarobot.com/projects/{project.id}/models">click this link</a>'))
project.wait_for_autopilot(verbosity=0)
Navigate to project in the UI¶
Once the project is created and Autopilot has finished, open the project in the UI. Find the name of the project with the following code, and then open it from the Manage Projects page:
print(project_name)
no_show_enriched_1655491640.csv
Exploratory Data Analysis (EDA)¶
From the UI, navigate to the Data tab to learn more about your data.
Click each feature to see a variety of information, including a histogram that represents the relationship of the feature with the target.
Use the Feature Associations tab to visualize clusters of related features.
Leaderboard evaluation¶
Click on the Models tab to view the Leaderboard, which populates with models as they build and ranks them based on the chosen optimization metric. Autopilot will continue building models until it selects the best predictive model for the specified target feature. This model is at the top of the Leaderboard, marked with the Recommended for Deployment badge.
Note: See a complete list of visualizations created by DataRobot (availability depends on project type). The visualizations below are those most helpful to this use case.
Blueprint tab¶
To investigate models, first click on any model (even when Autopilot is still running) to reveal the model Blueprint—the pipeline of preprocessing steps, modeling algorithms, and post-processing steps used to create the model.
ROC Curve tools¶
Next, view the tools on the ROC Curve tab. The information here can be used to calculate the ROI of the solution, which is directly related to how you operationalize the model.
For this use case, the goal is to build a call list that changes the outcome for patients who might otherwise miss their appointments, while minimizing calls to people who would have attended without prompting. Achieve this balance by assigning payoffs in the Profit Curve's Payoff Matrix. Choose Add Payoff in the Matrix pane and enter values:
Field | Description | Example |
---|---|---|
True Positive | Reflects the savings gained for every patient that attends once called. | TP=$120 represents a saving of $120. |
False Positive | Reflects the cost of making the call on what would have been a "false alarm." | FP=-$5 represents a false positive payout of -$5. |
Metrics | Provides standard statistics to help describe model performance at the selected display threshold. | Select False Positive Rate (Fallout), Positive Predictive Value (Precision), and Average Profit. |
Based on the metrics, consider the following:
The decision boundary suggested for maximizing profit is not practical, because almost 20% of appointments (the Fallout rate) would require a personalized reminder. Instead, adjust where on the curve to set the display threshold for the model.
Click on different points of the ROC curve to see the recall rate with Fallout values of 5% and 10%. Notice:
- At 5% Fallout, precision is roughly 26% and average profit is roughly $3.
- At 10% Fallout, precision is roughly 21% and average profit is roughly $3.40.
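Average profit figures like these follow directly from the payoff matrix and the confusion-matrix counts at a chosen threshold. A sketch of the arithmetic (the counts below are hypothetical, not taken from this project):

```python
# Average profit per appointment from the payoff matrix:
# profit = (TP_count * payoff_TP + FP_count * payoff_FP) / total
def average_profit(tp, fp, total, payoff_tp=120, payoff_fp=-5):
    return (tp * payoff_tp + fp * payoff_fp) / total

# Hypothetical counts for 10,000 appointments: 300 true positives and
# 960 false positives (a 10% Fallout on roughly 9,600 attendees)
print(round(average_profit(300, 960, 10_000), 2))  # 3.12
```

Raising the display threshold lowers both counts; the profit-maximizing point balances the $120 saved per averted no-show against the $5 cost of each unnecessary call.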