Loan default notebook¶
Many credit decisioning systems are driven by scorecards: simple, rules-based systems built from industry knowledge or basic statistical methods. Some organizations go a step further and obtain scorecards from third parties, which may not be tuned to an individual organization's loan book. An AI-based approach lets financial institutions learn signals from their own book and assess risk at a more granular level. Once risk is quantified, a strategy can put that information to work: if you can predict that a borrower is likely to default, you can intervene earlier, for example by sending notices sooner or rejecting risky loan applications.
Setup¶
This example assumes that the DataRobot Python client package has been installed and configured with the credentials of a DataRobot user with API access permissions. Retrieve your DataRobot API Token by logging into DataRobot and navigating to the Developer Tools in your profile.
!pip install datarobot umap-learn nbformat hdbscan
import datarobot as dr
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
light_blue = "#598fd6"
grey_blue = "#5f728b"
orange = "#dd6b3d"
Connect to DataRobot¶
Read more about different options for connecting to DataRobot from the client.
dr.Client(config_path='<path-to-drconfig.yaml>')
<datarobot.rest.RESTClientObject at 0x7fdc4898fe50>
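You can also pass credentials directly instead of using a configuration file. A minimal sketch; the endpoint shown is the DataRobot multi-tenant SaaS endpoint and the token is a placeholder:
# Connect by supplying the endpoint and API token directly
dr.Client(endpoint='https://app.datarobot.com/api/v2',
          token='<YOUR_API_TOKEN>')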
Import data¶
The data file is hosted by DataRobot at the URL in the following cell. Read the data directly from the URL into a pandas DataFrame and display the results to verify that the data looks correct. If you have your own data files, you can access them in several ways; one option is sketched after the preview below.
data_path = "https://s3.amazonaws.com/datarobot-use-case-datasets/Lending+Club+Dataset+Train.csv"
pathfinder_df = pd.read_csv(data_path, encoding="ISO-8859-1")
pathfinder_df.rename(columns={'loan_is_bad': 'is_bad'}, inplace=True)
pathfinder_df.head(100)
|  | id | member_id | loan_amnt | funded_amnt | installment | grade | sub_grade | emp_title | emp_length | home_ownership | ... | revol_util | total_acc | initial_list_status | collections_12_mths_ex_med | mths_since_last_major_derog | application_type | acc_now_delinq | tot_coll_amt | tot_cur_bal | is_bad |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3296446 | 4068857 | 11200 | 11200 | 343.89 | A | A2 | Nokia Siemens Network | 10.0 | OWN | ... | 66.20% | 21 | f | 0 | NaN | INDIVIDUAL | 0 | 0.0 | 187717.0 | False |
| 1 | 3286412 | 4058853 | 10000 | 10000 | 328.06 | B | B2 | creative financial group | 2.0 | MORTGAGE | ... | 74.20% | 11 | f | 0 | NaN | INDIVIDUAL | 0 | 0.0 | 16623.0 | True |
| 2 | 3286406 | 4058848 | 8000 | 8000 | 282.41 | C | C4 | Techtron Systems | 7.0 | RENT | ... | 72% | 17 | w | 0 | NaN | INDIVIDUAL | 0 | 0.0 | 17938.0 | False |
| 3 | 3296434 | 4068843 | 16000 | 16000 | 500.65 | A | A4 | Bristol Hospital | 10.0 | MORTGAGE | ... | 75.20% | 56 | f | 0 | NaN | INDIVIDUAL | 0 | 0.0 | 372771.0 | False |
| 4 | 3286395 | 4058836 | 4000 | 4000 | 125.17 | A | A4 | Aspen Skiing Company | 10.0 | MORTGAGE | ... | 95.50% | 21 | w | 0 | NaN | INDIVIDUAL | 0 | 0.0 | 331205.0 | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 3286021 | 4058369 | 4000 | 4000 | 146.12 | D | D3 | Morton Plant Hospital | 6.0 | MORTGAGE | ... | 84.80% | 21 | f | 0 | NaN | INDIVIDUAL | 0 | 0.0 | 112600.0 | False |
| 96 | 2835185 | 3417435 | 8500 | 8500 | 264.88 | A | A3 | The Adocate/Hearst News | 5.0 | MORTGAGE | ... | 28.70% | 19 | w | 0 | NaN | INDIVIDUAL | 0 | 0.0 | 112238.0 | False |
| 97 | 3241124 | 3984059 | 13000 | 13000 | 432.54 | B | B3 | Southwest ISD | 3.0 | MORTGAGE | ... | 57.90% | 18 | f | 0 | NaN | INDIVIDUAL | 0 | 202.0 | 120076.0 | False |
| 98 | 3198040 | 3930968 | 20000 | 20000 | 608.72 | A | A1 | west texas a&m university | 5.0 | MORTGAGE | ... | 39.90% | 27 | f | 0 | NaN | INDIVIDUAL | 0 | 0.0 | 230748.0 | False |
| 99 | 3188257 | 3921251 | 16000 | 16000 | 387.40 | C | C3 | Antea Group | 1.0 | MORTGAGE | ... | 54.60% | 28 | f | 0 | NaN | INDIVIDUAL | 0 | 0.0 | 194506.0 | False |

100 rows × 34 columns
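If your data is stored locally rather than hosted at a URL, the same call accepts a file path. A minimal sketch; the filename below is a placeholder:
# Read a local copy of the training data; the path is a placeholder
pathfinder_df = pd.read_csv("Lending_Club_Dataset_Train.csv", encoding="ISO-8859-1")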
Visualize data¶
Use the following snippets to display different aspects of the data. The first cell computes the average annual income for loans that default and for those that do not. The second cell shows how often a loan defaults for each emp_length value. The third cell shows the average default rate for each state; a quick visualization of those state-level rates follows the tables.
df1 = pathfinder_df.groupby('is_bad').agg({'annual_inc': 'mean'}).reset_index()
df1
|  | is_bad | annual_inc |
|---|---|---|
| 0 | False | 72630.592408 |
| 1 | True | 64442.019078 |
avg_value_df = pathfinder_df.groupby("emp_length").agg({'is_bad':'mean'}).reset_index()
avg_value_df
|  | emp_length | is_bad |
|---|---|---|
| 0 | 1.0 | 0.150927 |
| 1 | 2.0 | 0.154332 |
| 2 | 3.0 | 0.164575 |
| 3 | 4.0 | 0.156938 |
| 4 | 5.0 | 0.149490 |
| 5 | 6.0 | 0.162082 |
| 6 | 7.0 | 0.168053 |
| 7 | 8.0 | 0.160577 |
| 8 | 9.0 | 0.169727 |
| 9 | 10.0 | 0.149395 |
avg_value_df = pathfinder_df.groupby("addr_state").agg({'is_bad':'mean'}).reset_index()
avg_value_df
|  | addr_state | is_bad |
|---|---|---|
| 0 | AK | 0.098837 |
| 1 | AL | 0.188563 |
| 2 | AR | 0.165803 |
| 3 | AZ | 0.168299 |
| 4 | CA | 0.151439 |
| 5 | CO | 0.138211 |
| 6 | CT | 0.155527 |
| 7 | DC | 0.091549 |
| 8 | DE | 0.157895 |
| 9 | FL | 0.180175 |
| 10 | GA | 0.153999 |
| 11 | HI | 0.192926 |
| 12 | IL | 0.129351 |
| 13 | IN | 0.166667 |
| 14 | KS | 0.127615 |
| 15 | KY | 0.140515 |
| 16 | LA | 0.161716 |
| 17 | MA | 0.150280 |
| 18 | MD | 0.165319 |
| 19 | MI | 0.173405 |
| 20 | MN | 0.138249 |
| 21 | MO | 0.153558 |
| 22 | MT | 0.131579 |
| 23 | NC | 0.147039 |
| 24 | NE | 1.000000 |
| 25 | NH | 0.101695 |
| 26 | NJ | 0.181864 |
| 27 | NM | 0.160000 |
| 28 | NV | 0.175192 |
| 29 | NY | 0.173567 |
| 30 | OH | 0.150431 |
| 31 | OK | 0.163121 |
| 32 | OR | 0.136858 |
| 33 | PA | 0.159287 |
| 34 | RI | 0.200855 |
| 35 | SC | 0.145000 |
| 36 | SD | 0.104839 |
| 37 | TN | 0.000000 |
| 38 | TX | 0.133150 |
| 39 | UT | 0.161290 |
| 40 | VA | 0.164849 |
| 41 | VT | 0.208696 |
| 42 | WA | 0.159657 |
| 43 | WI | 0.149385 |
| 44 | WV | 0.110092 |
| 45 | WY | 0.103448 |
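The color constants defined during setup can be used to chart these rates. A minimal sketch, assuming avg_value_df still holds the per-state default rates from the cell above:
# Plot the average default rate by state, using the setup colors
fig, ax = plt.subplots(figsize=(14, 5))
ax.bar(avg_value_df['addr_state'], avg_value_df['is_bad'], color=light_blue)
ax.yaxis.set_major_formatter(mtick.PercentFormatter(xmax=1.0))
ax.set_xlabel('State')
ax.set_ylabel('Average default rate')
ax.set_title('Default rate by state')
plt.tight_layout()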
Initiate modeling¶
Create a DataRobot project to train models against the assembled dataset.
# Create the project and upload data
project = dr.Project.create(sourcedata=pathfinder_df,
                            project_name='Predict loan defaults')
# Set the project target and run Autopilot in quick mode on a subset of models.
# For a binary target, DataRobot defaults to the LogLoss metric to measure performance.
project.set_target(target='is_bad',
                   mode=dr.AUTOPILOT_MODE.QUICK,
                   worker_count=-1)
# Uncomment and replace the project ID if the project already exists
# project = dr.Project.get("612cb904ce5d5617d67af394")
# Get the project metric (e.g., LogLoss, RMSE)
metric = project.metric
# Get project URL
project_url = project.get_leaderboard_ui_permalink()
# Get project ID
project_id = project.id
View project in UI¶
If you want to view any aspects of the project in the DataRobot UI, you can retrieve the URL for the project with the snippet below and use it to navigate to the DataRobot application in your browser.
# Display project URL
project_url
'https://app.datarobot.com/projects/62cda041ab0bc3275f7a4a86/models'
Wait for Autopilot to complete¶
project.wait_for_autopilot(check_interval=30)
In progress: 0, queued: 0 (waited: 0s)
Evaluate model performance¶
To measure model performance, first select the top model based on a performance metric (e.g., LogLoss), then evaluate several chart types, such as the Lift Chart, ROC Curve, and Feature Impact. Two helper functions (detailed below) assist in producing these charts.
You can reference more information about how to evaluate model performance in the DataRobot platform documentation.
In the snippet below, use the models built during Autopilot to create a list sorted by the project metric and select the top performer.
# Sort models by their score on the given test set, skipping models without a
# score. Ascending order puts the best model first for error metrics like LogLoss.
def sorted_by_metric(models, test_set, metric):
    models_with_score = [model for model in models if
                         model.metrics[metric][test_set] is not None]
    return sorted(models_with_score,
                  key=lambda model: model.metrics[metric][test_set])
models = project.get_models()
# Uncomment if this is not set above in the create project cell
metric = project.metric
# Get the top-performing model
model_top = sorted_by_metric(models, 'crossValidation', metric)[0]
print('The top performing model is {model} using metric, {metric}'.format(model=str(model_top), metric=metric))
The top performing model is Model('Elastic-Net Classifier (L2 / Binomial Deviance)') using metric, LogLoss
# Set styling
dr_dark_blue = '#08233F'
dr_blue = '#1F77B4'
dr_orange = '#FF7F0E'
dr_red = '#BE3C28'
# Function to merge the 60 lift chart bins into a smaller number of bins
def rebin_df(raw_df, number_of_bins):
    cols = ['bin', 'actual_mean', 'predicted_mean', 'bin_weight']
    rows = []
    current_prediction_total = 0
    current_actual_total = 0
    current_row_total = 0
    bin_size = 60 / number_of_bins
    for row_id, data in raw_df.iterrows():
        current_prediction_total += data['predicted'] * data['bin_weight']
        current_actual_total += data['actual'] * data['bin_weight']
        current_row_total += data['bin_weight']
        # Close out a merged bin once bin_size source bins have been accumulated
        if (row_id + 1) % bin_size == 0:
            rows.append({
                'bin': (row_id + 1) / bin_size,
                'actual_mean': current_actual_total / current_row_total,
                'predicted_mean': current_prediction_total / current_row_total,
                'bin_weight': current_row_total,
            })
            current_prediction_total = 0
            current_actual_total = 0
            current_row_total = 0
    # Build the DataFrame once at the end (DataFrame.append was removed in pandas 2.0)
    return pd.DataFrame(rows, columns=cols)
Lift chart¶
A lift chart shows you how close model predictions are to the actual values of the target in the training data. The lift chart data includes the average predicted value and the average actual values of the target, sorted by the prediction values in ascending order and split into up to 60 bins.
# Function to build lift charts
def matplotlib_lift(bins_df, bin_count, ax):
grouped = rebin_df(bins_df, bin_count)
ax.plot(range(1, len(grouped) + 1), grouped['predicted_mean'],
marker='+', lw=1, color=dr_blue, label='predicted')
ax.plot(range(1, len(grouped) + 1), grouped['actual_mean'],
marker='*', lw=1, color=dr_orange, label='actual')
ax.set_xlim([0, len(grouped) + 1])
ax.set_facecolor(dr_dark_blue)
ax.legend(loc='best')
ax.set_title('Lift chart {} bins'.format(bin_count))
ax.set_xlabel('Sorted Prediction')
ax.set_ylabel('Value')
return grouped
lift_chart = model_top.get_lift_chart('validation')
# Save the result into a Pandas dataframe
lift_df = pd.DataFrame(lift_chart.bins)
bin_counts = [10, 15]
f, axarr = plt.subplots(len(bin_counts))
f.set_size_inches((8, 4 * len(bin_counts)))
rebinned_dfs = []
for bin_count, ax in zip(bin_counts, axarr):
    rebinned_dfs.append(matplotlib_lift(lift_df, bin_count, ax))
plt.tight_layout()
ROC Curve¶
The receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied. The curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings.
roc = model_top.get_roc_curve('validation')
# Save the result into a pandas DataFrame
roc_df = pd.DataFrame(roc.roc_points)
roc_df
|  | accuracy | f1_score | false_negative_score | true_negative_score | true_positive_score | false_positive_score | true_negative_rate | false_positive_rate | true_positive_rate | matthews_correlation_coefficient | positive_predictive_value | negative_predictive_value | threshold | fraction_predicted_as_positive | fraction_predicted_as_negative | lift_positive | lift_negative |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.843687 | 0.000000 | 1238 | 6682 | 0 | 0 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.843687 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 1.000000 |
| 1 | 0.844066 | 0.004835 | 1235 | 6682 | 3 | 0 | 1.000000 | 0.000000 | 0.002423 | 0.045224 | 1.000000 | 0.844007 | 0.514413 | 0.000379 | 0.999621 | 6.397415 | 1.000379 |
| 2 | 0.843939 | 0.006431 | 1234 | 6680 | 4 | 2 | 0.999701 | 0.000299 | 0.003231 | 0.038695 | 0.666667 | 0.844074 | 0.477247 | 0.000758 | 0.999242 | 4.264943 | 1.000459 |
| 3 | 0.844318 | 0.014388 | 1229 | 6678 | 9 | 4 | 0.999401 | 0.000599 | 0.007270 | 0.059846 | 0.692308 | 0.844568 | 0.449702 | 0.001641 | 0.998359 | 4.428980 | 1.001045 |
| 4 | 0.844444 | 0.025316 | 1222 | 6672 | 16 | 10 | 0.998503 | 0.001497 | 0.012924 | 0.072549 | 0.615385 | 0.845199 | 0.423901 | 0.003283 | 0.996717 | 3.936871 | 1.001792 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 110 | 0.184848 | 0.276071 | 7 | 233 | 1231 | 6449 | 0.034870 | 0.965130 | 0.994346 | 0.061893 | 0.160286 | 0.970833 | 0.043229 | 0.969697 | 0.030303 | 1.025419 | 1.150703 |
| 111 | 0.175253 | 0.274061 | 5 | 155 | 1233 | 6527 | 0.023197 | 0.976803 | 0.995961 | 0.049450 | 0.158892 | 0.968750 | 0.039698 | 0.979798 | 0.020202 | 1.016497 | 1.148234 |
| 112 | 0.165657 | 0.272086 | 3 | 77 | 1235 | 6605 | 0.011523 | 0.988477 | 0.997577 | 0.033049 | 0.157526 | 0.962500 | 0.033700 | 0.989899 | 0.010101 | 1.007756 | 1.140826 |
| 113 | 0.156439 | 0.270394 | 0 | 1 | 1238 | 6681 | 0.000150 | 0.999850 | 1.000000 | 0.004837 | 0.156333 | 1.000000 | 0.019673 | 0.999874 | 0.000126 | 1.000126 | 1.185274 |
| 114 | 0.156313 | 0.270365 | 0 | 0 | 1238 | 6682 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 0.156313 | 0.000000 | 0.000044 | 1.000000 | 0.000000 | 1.000000 | 0.000000 |

115 rows × 17 columns
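The ROC points retrieved above can be plotted directly with matplotlib. A minimal sketch, reusing the styling colors defined earlier:
# Plot the ROC curve against a random-classifier diagonal
fig, ax = plt.subplots(figsize=(6, 6))
ax.plot(roc_df['false_positive_rate'], roc_df['true_positive_rate'],
        color=dr_blue, label='ROC curve')
ax.plot([0, 1], [0, 1], color=dr_orange, linestyle='--', label='Random classifier')
ax.set_facecolor(dr_dark_blue)
ax.set_xlabel('False positive rate')
ax.set_ylabel('True positive rate')
ax.set_title('ROC curve (validation)')
ax.legend(loc='lower right')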
Feature Impact¶
Feature Impact measures how important a feature is in the context of a model. It measures how much the accuracy of a model would decrease if that feature was removed.
Feature Impact is available for all model types and works by altering input data and observing the effect on a model’s score. It is an on-demand feature, meaning that you must initiate a calculation to see the results. Once DataRobot computes the feature impact for a model, that information is saved with the project.
feature_impacts = model_top.get_or_request_feature_impact()
# Limit the chart to the top 25 features so it remains readable
if len(feature_impacts) > 25:
    feature_impacts = feature_impacts[0:25]
# Formats the ticks from a float into a percent
percent_tick_fmt = mtick.PercentFormatter(xmax=1.0)
impact_df = pd.DataFrame(feature_impacts)
impact_df.sort_values(by='impactNormalized', ascending=True, inplace=True)
# Positive values are blue, negative are red
bar_colors = impact_df.impactNormalized.apply(lambda x: dr_red if x < 0
else dr_blue)
ax = impact_df.plot.barh(x='featureName', y='impactNormalized',
legend=False,
color=bar_colors,
figsize=(10, 8))
ax.xaxis.set_major_formatter(percent_tick_fmt)
ax.xaxis.set_tick_params(labeltop=True)
ax.xaxis.grid(True, alpha=0.2)
ax.set_facecolor(dr_dark_blue)
plt.ylabel('')
plt.xlabel('Effect')
plt.xlim((None, 1)) # Allow for negative impact
plt.title('Feature Impact', y=1.04);
Text(0.5, 1.04, 'Feature Impact')
Deploy a model¶
Deploy the top-performing model in order to make predictions in a production environment.
# Create a prediction server
prediction_server = dr.PredictionServer.list()[0]
# Get the top-performing model. Uncomment if this did not execute in the previous cell
# model_top = sorted_by_metric(models, 'crossValidation', metric)[0]
deployment = dr.Deployment.create_from_learning_model(
model_top.id, label='Predicting Loan Defaults', description='Predicting Loan Defaults',
default_prediction_server_id=prediction_server.id)
deployment
data_path_scoring = "https://s3.amazonaws.com/datarobot-use-case-datasets/Lending+Club+Dataset+Pred.csv"
scoring_df = pd.read_csv(data_path_scoring, encoding="ISO-8859-1")
scoring_df.rename(columns={'loan_is_bad': 'is_bad'}, inplace=True)
scoring_df
Make predictions¶
After the model has been deployed, DataRobot creates an endpoint for real-time scoring. The cells below score the uploaded dataset with the top model and display the joined results, which you can then write to an external data store (S3, Snowflake, etc.); a sketch of scoring through the deployment itself follows the results.
prediction_dataset = project.upload_dataset(scoring_df)
predict_job = model_top.request_predictions(prediction_dataset.id)
prediction_dataset.id
predictions = predict_job.get_result_when_complete()
pd.concat([scoring_df, predictions], axis=1)
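Because the model is deployed, you can alternatively score through the deployment with the client's Batch Prediction API instead of the project. A minimal sketch, assuming the deployment created above; the output filename is a placeholder:
# Score the scoring DataFrame against the deployment and
# write the predictions to a local CSV file
job = dr.BatchPredictionJob.score(
    deployment.id,
    intake_settings={'type': 'localFile', 'file': scoring_df},
    output_settings={'type': 'localFile', 'path': 'deployment_predictions.csv'},
)
job.wait_for_completion()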
Export predictions to Snowflake¶
You may want to write the predicted results back to a Snowflake data warehouse, which you can do with the code below. This approach is common when the predictions are consumed by other tools, such as BI solutions for visualization. The cells assume an open Snowflake cursor named cur; a connection sketch follows.
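A minimal connection sketch using the snowflake-connector-python package; every credential value below is a placeholder:
import snowflake.connector

# Open a connection and cursor; all values are placeholders
con = snowflake.connector.connect(user='<USER>',
                                  password='<PASSWORD>',
                                  account='<ACCOUNT_IDENTIFIER>',
                                  warehouse='<WAREHOUSE>',
                                  database='<DATABASE>',
                                  schema='<SCHEMA>')
cur = con.cursor()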
# Write the predictions to a CSV file
predictions.to_csv("predictions.csv", sep=',', encoding='utf-8', index=False)
# Upload the file to the table stage, then copy it into the table
# (assumes a PREDICTIONS table already exists in Snowflake)
cur.execute("PUT file://predictions.csv @%PREDICTIONS;")
cur.execute("COPY INTO PREDICTIONS FROM @%PREDICTIONS FILE_FORMAT=(TYPE=CSV FIELD_DELIMITER=',' SKIP_HEADER=1) PURGE=TRUE;")
print("Predictions updated back in Snowflake")