{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Lead scoring\n",
"\n",
"This notebook outlines a lead scoring use case that predicts whether a prospect will become a customer. You can frame this use case as a binary classification problem.\n",
"\n",
"The dataset used in this notebook is from the UCI Machine Learning Repository and includes information from a direct telemarketing campaign of a Portuguese bank. It was published in a paper by Sérgio Moro and colleagues in 2014. The target is indicated by the feature “**y**”; a “yes” means that the prospect purchased the product being offered and “no” means that they did not.\n",
" \n",
"\n",
"*[Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014*\n",
"\n",
"## Prerequisites\n",
"\n",
"* A DataRobot login \n",
"* A DataRobot API key \n",
"* [The sample training dataset](bank-full.csv)\n",
"* Python 3.7+\n",
"* DataRobot API version 2.21+"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup\n",
"\n",
"### Import Libraries"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"import datarobot as dr\n",
"import matplotlib.pyplot as plt\n",
"import pandas as pd\n",
"import seaborn as sns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Connect to DataRobot"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# If the config file is not in the default location described in the API Quickstart guide ('~/.config/datarobot/drconfig.yaml'), call:\n",
"# dr.Client(config_path='path-to-drconfig.yaml')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Read more about different options for [connecting to DataRobot from the client](https://docs.datarobot.com/en/docs/api/api-quickstart/api-qs.html)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Upload a dataset"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" age job marital education default balance housing loan \\\n",
"0 58 management married tertiary no 2143 yes no \n",
"1 44 technician single secondary no 29 yes no \n",
"2 33 entrepreneur married secondary no 2 yes yes \n",
"3 47 blue-collar married unknown no 1506 yes no \n",
"4 33 unknown single unknown no 1 no no \n",
"\n",
" contact day month duration campaign pdays previous poutcome y \n",
"0 unknown 5 may 261 1 -1 0 unknown no \n",
"1 unknown 5 may 151 1 -1 0 unknown no \n",
"2 unknown 5 may 76 1 -1 0 unknown no \n",
"3 unknown 5 may 92 1 -1 0 unknown no \n",
"4 unknown 5 may 198 1 -1 0 unknown no "
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data_path = \"https://docs.datarobot.com/en/docs/api/guide/common-case/bank-full.csv\"\n",
"\n",
"df = pd.read_csv(data_path) # Add your dataset here\n",
"df.head()"
]
},
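{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before modeling, it can help to check what share of prospects converted. The cell below is a minimal sketch: `target_rate` is a hypothetical helper (not part of the DataRobot client), and `sample` reuses only the `y` values of the five rows displayed above, so its result says nothing about the full dataset. Calling `target_rate(df)` on the full DataFrame would give the actual conversion rate."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"\n",
"def target_rate(frame, target=\"y\", positive=\"yes\"):\n",
"    \"\"\"Return the fraction of rows whose target equals the positive label.\"\"\"\n",
"    return float((frame[target] == positive).mean())\n",
"\n",
"\n",
"# Illustration only: the `y` values of the five rows displayed above\n",
"sample = pd.DataFrame({\"y\": [\"no\", \"no\", \"no\", \"no\", \"no\"]})\n",
"print(target_rate(sample))\n",
"# With the full dataset loaded earlier: target_rate(df)"
]
},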
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create a project\n",
"\n",
"Start the project with the dataset **bank-full.csv** and indicate the target as “**y**”. Set the Autopilot modeling mode to \"**Quick**\".\n"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"# Create the project from the DataFrame loaded above\n",
"project = dr.Project.create(project_name=\"Lead-Scoring\", sourcedata=df)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"project.set_target(target=\"y\", worker_count=-1, mode=dr.AUTOPILOT_MODE.QUICK)\n",
"\n",
"project.wait_for_autopilot() # Wait for autopilot to complete"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It can be onerous to rerun Autopilot every time you run the notebook. If your project already exists, comment out the project creation and Autopilot cells above so that you do not rerun Autopilot. You can then retrieve the project with `dr.Project.get` (uncomment and use the code below).\n"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"# project = dr.Project.get(project_id='YOUR_PROJECT_ID')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Select a model to evaluate\n",
"\n",
"DataRobot recommends evaluating the 80% version of the top model using the code below."
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [],
"source": [
"models = project.get_models(\n",
" search_params={\n",
" \"sample_pct__gt\": 80,\n",
" }\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"model = models[1]\n",
"model.id"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Get validation scores\n",
"\n",
"You can get the validation and cross-validation scores for every possible metric of the model using the code below. You can also pull these scores for multiple models if you want to compare them programmatically."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"model.metrics"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### ROC Curve\n",
"\n",
"After obtaining the overall performance of the model, you can plot the ROC curve using the code below."
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" accuracy f1_score false_negative_score true_negative_score \\\n",
"0 0.883021 0.000000 4231 31938 \n",
"1 0.883049 0.000473 4230 31938 \n",
"2 0.884542 0.029289 4168 31930 \n",
"3 0.888772 0.110939 3980 31895 \n",
"4 0.889657 0.131069 3930 31877 \n",
"\n",
" true_positive_score false_positive_score true_negative_rate \\\n",
"0 0 0 1.000000 \n",
"1 1 0 1.000000 \n",
"2 63 8 0.999750 \n",
"3 251 43 0.998654 \n",
"4 301 61 0.998090 \n",
"\n",
" false_positive_rate true_positive_rate matthews_correlation_coefficient \\\n",
"0 0.000000 0.000000 0.000000 \n",
"1 0.000000 0.000236 0.014447 \n",
"2 0.000250 0.014890 0.106300 \n",
"3 0.001346 0.059324 0.207523 \n",
"4 0.001910 0.071142 0.223533 \n",
"\n",
" positive_predictive_value negative_predictive_value threshold \\\n",
"0 0.000000 0.883021 1.000000 \n",
"1 1.000000 0.883046 0.980059 \n",
"2 0.887324 0.884537 0.930287 \n",
"3 0.853741 0.889059 0.875943 \n",
"4 0.831492 0.890245 0.862702 \n",
"\n",
" fraction_predicted_as_positive fraction_predicted_as_negative \\\n",
"0 0.000000 1.000000 \n",
"1 0.000028 0.999972 \n",
"2 0.001963 0.998037 \n",
"3 0.008129 0.991871 \n",
"4 0.010009 0.989991 \n",
"\n",
" lift_positive lift_negative \n",
"0 0.000000 1.000000 \n",
"1 8.548570 1.000028 \n",
"2 7.585351 1.001716 \n",
"3 7.298269 1.006838 \n",
"4 7.108065 1.008180 "
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"roc = model.get_roc_curve(\"crossValidation\")\n",
"\n",
"# Save the result into a pandas dataframe\n",
"df = pd.DataFrame(roc.roc_points)\n",
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dr_roc_green = \"#03c75f\"\n",
"white = \"#ffffff\"\n",
"dr_purple = \"#65147D\"\n",
"dr_dense_green = \"#018f4f\"\n",
"dr_dark_blue = \"#08233F\"\n",
"\n",
"fig = plt.figure(figsize=(8, 8))\n",
"axes = fig.add_subplot(1, 1, 1, facecolor=dr_dark_blue)\n",
"\n",
"plt.scatter(df.false_positive_rate, df.true_positive_rate, color=dr_roc_green)\n",
"plt.plot(df.false_positive_rate, df.true_positive_rate, color=dr_roc_green)\n",
"plt.plot([0, 1], [0, 1], color=white, alpha=0.25)\n",
"plt.title(\"ROC curve\")\n",
"plt.xlabel(\"False Positive Rate (Fallout)\")\n",
"plt.xlim([0, 1])\n",
"plt.ylabel(\"True Positive Rate (Sensitivity)\")\n",
"plt.ylim([0, 1])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"View a sample ROC Curve below."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
""
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from IPython import display\n",
"\n",
"display.Image(\"./roccurve.png\")"
]
},
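{
"cell_type": "markdown",
"metadata": {},
"source": [
"The ROC points can also be reduced to a single number. The cell below is a rough sketch that approximates the area under the curve with the trapezoidal rule. `auc_from_roc_points` is a hypothetical helper, not a DataRobot client method, and `toy_points` are made-up values used only for illustration. The result may differ slightly from the AUC reported in `model.metrics`, because it depends on how finely the curve is sampled. With a live connection, `auc_from_roc_points(roc.roc_points)` would apply it to the points retrieved earlier."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"\n",
"def auc_from_roc_points(points):\n",
"    \"\"\"Approximate the area under the ROC curve with the trapezoidal rule.\"\"\"\n",
"    pts = pd.DataFrame(points).sort_values([\"false_positive_rate\", \"true_positive_rate\"])\n",
"    fpr = pts[\"false_positive_rate\"].to_numpy(dtype=float)\n",
"    tpr = pts[\"true_positive_rate\"].to_numpy(dtype=float)\n",
"    # Sum the trapezoids between consecutive points\n",
"    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))\n",
"\n",
"\n",
"# Hypothetical points lying on the chance-level diagonal\n",
"toy_points = [\n",
"    {\"false_positive_rate\": 0.0, \"true_positive_rate\": 0.0},\n",
"    {\"false_positive_rate\": 0.5, \"true_positive_rate\": 0.5},\n",
"    {\"false_positive_rate\": 1.0, \"true_positive_rate\": 1.0},\n",
"]\n",
"print(auc_from_roc_points(toy_points))"
]
},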
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Get Feature Impact\n",
"\n",
"Use the code below to understand which features have the highest impact on driving model decisions."
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" redundantWith featureName impactNormalized impactUnnormalized\n",
"0 None duration 1.000000 0.256413\n",
"1 None month 0.370865 0.095095\n",
"2 None day 0.186467 0.047813\n",
"3 None contact 0.116615 0.029902\n",
"4 None poutcome 0.086397 0.022153\n",
"5 None balance 0.080238 0.020574\n",
"6 None age 0.070169 0.017992\n",
"7 None housing 0.065196 0.016717\n",
"8 None pdays 0.055162 0.014144\n",
"9 None campaign 0.053662 0.013760\n",
"10 None education 0.024921 0.006390\n",
"11 None job 0.023866 0.006120\n",
"12 None marital 0.018290 0.004690\n",
"13 None previous 0.010864 0.002786\n",
"14 None loan 0.008976 0.002302\n",
"15 None default 0.001796 0.000461"
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Get Feature Impact\n",
"feature_impact = model.get_or_request_feature_impact()\n",
"\n",
"# Save feature impact in pandas dataframe\n",
"fi_df = pd.DataFrame(feature_impact)\n",
"fi_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Feature Impact is calculated using [permutation](https://docs.datarobot.com/en/docs/modeling/analyze-models/understand/feature-impact.html#shared-permutation-based-feature-impact). In the example output above, the most impactful feature is **duration**, followed by **month** and **day**. To plot these Feature Impact scores:\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 55,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "",
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"fig, ax = plt.subplots(figsize=(12, 5))\n",
"\n",
"# Plot feature impact\n",
"sns.barplot(x=\"impactNormalized\", y=\"featureName\", data=fi_df, color=\"#2D8FE2\")"
]
},
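{
"cell_type": "markdown",
"metadata": {},
"source": [
"One informal way to read the Feature Impact table is to ask how few features account for most of the total unnormalized impact. The cell below sketches that heuristic. `features_covering` is a hypothetical helper rather than a DataRobot feature, and `example` copies only the first four rows of the example output above, so the shares it computes are relative to those four rows. Passing `fi_df` instead would rank all features."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"\n",
"def features_covering(impact_df, share=0.9):\n",
"    \"\"\"Return the top features whose combined share of total impact reaches `share`.\"\"\"\n",
"    ranked = impact_df.sort_values(\"impactUnnormalized\", ascending=False)\n",
"    cumulative = ranked[\"impactUnnormalized\"].cumsum() / ranked[\"impactUnnormalized\"].sum()\n",
"    # Keep every feature up to and including the one that crosses the threshold\n",
"    n_keep = int((cumulative < share).sum()) + 1\n",
"    return ranked[\"featureName\"].head(n_keep).tolist()\n",
"\n",
"\n",
"# First four rows of the example output above, for illustration only\n",
"example = pd.DataFrame(\n",
"    {\n",
"        \"featureName\": [\"duration\", \"month\", \"day\", \"contact\"],\n",
"        \"impactUnnormalized\": [0.256413, 0.095095, 0.047813, 0.029902],\n",
"    }\n",
")\n",
"print(features_covering(example, share=0.8))"
]
},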
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Unlock holdout\n",
"\n",
"By default, DataRobot uses five-fold cross-validation with a 20% holdout partition. The holdout data is not used during model training; however, you can unlock it and pull the new scores to see how your model performs on new data."
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [],
"source": [
"training_predictions_job = model.request_training_predictions(dr.enums.DATA_SUBSET.HOLDOUT)\n",
"training_predictions = training_predictions_job.get_result_when_complete()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use the code below to download the predictions as a CSV."
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {},
"outputs": [],
"source": [
"training_predictions.download_to_csv(\"predictions.csv\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.8"
}
},
"nbformat": 4,
"nbformat_minor": 4
}