Anti-money laundering notebook¶
In this use case, you will build a model that uses historical data, including customer and transactional information, to identify which alerts resulted in a Suspicious Activity Report (SAR). The model can then be used to assign a suspicious activity score to future alerts, improving the efficiency of an AML compliance program by rank-ordering alerts by score.
Download the sample training dataset here.
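To make the rank-ordering idea concrete, here is a minimal sketch with made-up alert scores (not output from the model built in this notebook): sorting alerts by a suspicious-activity score lets investigators work the highest-risk alerts first.

```python
import pandas as pd

# Hypothetical alerts with model-assigned suspicious-activity scores
alerts = pd.DataFrame({
    "alert_id": [1, 2, 3, 4, 5, 6],
    "score":    [0.92, 0.15, 0.78, 0.05, 0.60, 0.33],
    "is_sar":   [1, 0, 1, 0, 0, 0],  # known outcomes, for illustration only
})

# Rank alerts by score; investigators work the queue from the top
ranked = alerts.sort_values("score", ascending=False).reset_index(drop=True)

# In this toy example, reviewing only the top half of the queue
# still captures every SAR
top_half = ranked.head(len(ranked) // 2)
capture_rate = top_half["is_sar"].sum() / alerts["is_sar"].sum()
print(f"SARs captured in top 50% of queue: {capture_rate:.0%}")
```

With a well-calibrated model, a compliance team can concentrate investigation effort on the top of the queue instead of reviewing every alert with equal priority.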
Import libraries¶
!pip install datarobot --quiet
import datarobot as dr
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import datetime
import os
import sys
from IPython.display import HTML
import json
import requests
import yaml
%matplotlib inline
light_blue = "#598fd6"
grey_blue = "#5f728b"
orange = "#dd6b3d"
Connect to DataRobot¶
dr.Client()
# The `config_path` should only be specified if the config file is not in the default location described in the API Quickstart guide
# dr.Client(config_path = 'path-to-drconfig.yaml')
<datarobot.rest.RESTClientObject at 0x7fb7a067f940>
Import data¶
# Load the training dataset.
df = pd.read_csv("https://s3.amazonaws.com/datarobot-use-case-datasets/DR_Demo_AML_Alert_train.csv",
encoding = "ISO-8859-1")
df.info()
df.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 31 columns):
 #   Column                            Non-Null Count  Dtype
---  ------                            --------------  -----
 0   ALERT                             10000 non-null  int64
 1   SAR                               10000 non-null  int64
 2   kycRiskScore                      10000 non-null  int64
 3   income                            9800 non-null   float64
 4   tenureMonths                      10000 non-null  int64
 5   creditScore                       10000 non-null  int64
 6   state                             10000 non-null  object
 7   nbrPurchases90d                   10000 non-null  int64
 8   avgTxnSize90d                     10000 non-null  float64
 9   totalSpend90d                     10000 non-null  float64
 10  csrNotes                          10000 non-null  object
 11  nbrDistinctMerch90d               10000 non-null  int64
 12  nbrMerchCredits90d                10000 non-null  int64
 13  nbrMerchCreditsRndDollarAmt90d    10000 non-null  int64
 14  totalMerchCred90d                 10000 non-null  float64
 15  nbrMerchCreditsWoOffsettingPurch  10000 non-null  int64
 16  nbrPayments90d                    10000 non-null  int64
 17  totalPaymentAmt90d                10000 non-null  float64
 18  overpaymentAmt90d                 10000 non-null  float64
 19  overpaymentInd90d                 10000 non-null  int64
 20  nbrCustReqRefunds90d              10000 non-null  int64
 21  indCustReqRefund90d               10000 non-null  int64
 22  totalRefundsToCust90d             10000 non-null  float64
 23  nbrPaymentsCashLike90d            10000 non-null  int64
 24  maxRevolveLine                    10000 non-null  int64
 25  indOwnsHome                       10000 non-null  int64
 26  nbrInquiries1y                    10000 non-null  int64
 27  nbrCollections3y                  10000 non-null  int64
 28  nbrWebLogins90d                   10000 non-null  int64
 29  nbrPointRed90d                    10000 non-null  int64
 30  PEP                               10000 non-null  int64
dtypes: float64(7), int64(22), object(2)
memory usage: 2.4+ MB
| | ALERT | SAR | kycRiskScore | income | tenureMonths | creditScore | state | nbrPurchases90d | avgTxnSize90d | totalSpend90d | ... | indCustReqRefund90d | totalRefundsToCust90d | nbrPaymentsCashLike90d | maxRevolveLine | indOwnsHome | nbrInquiries1y | nbrCollections3y | nbrWebLogins90d | nbrPointRed90d | PEP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 0 | 3 | 110300.0 | 5 | 757 | PA | 10 | 153.80 | 1538.00 | ... | 1 | 45.82 | 5 | 6000 | 0 | 3 | 0 | 6 | 1 | 0 |
| 1 | 1 | 0 | 2 | 107800.0 | 6 | 715 | NY | 22 | 1.59 | 34.98 | ... | 1 | 67.40 | 0 | 10000 | 1 | 3 | 0 | 87 | 0 | 0 |
| 2 | 1 | 0 | 1 | 74000.0 | 13 | 751 | MA | 7 | 57.64 | 403.48 | ... | 1 | 450.69 | 0 | 10000 | 0 | 3 | 0 | 6 | 0 | 0 |
| 3 | 1 | 0 | 0 | 57700.0 | 1 | 659 | NJ | 14 | 29.52 | 413.28 | ... | 1 | 71.43 | 0 | 8000 | 1 | 5 | 0 | 7 | 2 | 0 |
| 4 | 1 | 0 | 1 | 59800.0 | 3 | 709 | PA | 54 | 115.77 | 6251.58 | ... | 1 | 2731.39 | 3 | 7000 | 1 | 1 | 0 | 8 | 1 | 0 |

5 rows × 31 columns
The sample data contains the following features:
- ALERT: Alert Indicator
- SAR: Target variable, SAR Indicator
- kycRiskScore: Account relationship (Know Your Customer) score used at time of account opening
- income: Annual income
- tenureMonths: Account tenure in months
- creditScore: Credit bureau score
- state: Account billing address state
- nbrPurchases90d: Number of purchases in last 90 days
- avgTxnSize90d: Average transaction size in last 90 days
- totalSpend90d: Total spend in last 90 days
- csrNotes: Customer Service Representative notes and codes based on conversations with customer
- nbrDistinctMerch90d: Number of distinct merchants purchased at in last 90 days
- nbrMerchCredits90d: Number of credits from merchants in last 90 days
- nbrMerchCreditsRndDollarAmt90d: Number of credits from merchants in round dollar amounts in last 90 days
- totalMerchCred90d: Total merchant credit amount in last 90 days
- nbrMerchCreditsWoOffsettingPurch: Number of merchant credits without an offsetting purchase in last 90 days
- nbrPayments90d: Number of payments in last 90 days
- totalPaymentAmt90d: Total payment amount in last 90 days
- overpaymentAmt90d: Total amount overpaid in last 90 days
- overpaymentInd90d: Indicator that account was overpaid in last 90 days
- nbrCustReqRefunds90d: Number of refund requests by the customer in last 90 days
- indCustReqRefund90d: Indicator that customer requested a refund in last 90 days
- totalRefundsToCust90d: Total refund amount in last 90 days
- nbrPaymentsCashLike90d: Number of cash-like payments (e.g., money orders) in last 90 days
- maxRevolveLine: Maximum revolving line of credit
- indOwnsHome: Indicator that the customer owns a home
- nbrInquiries1y: Number of credit inquiries in the past year
- nbrCollections3y: Number of collections in the past 3 years
- nbrWebLogins90d: Number of logins to the bank website in the last 90 days
- nbrPointRed90d: Number of loyalty point redemptions in the last 90 days
- PEP: Politically Exposed Person indicator
Analyze, clean, and curate data¶
Preparing data is an iterative process. Even if you clean and prep your training data prior to uploading it, you can still improve its quality by performing Exploratory Data Analysis.
First, get a feel for the data. Understanding the distribution of each feature is important, and this is an opportunity to ask questions when a feature doesn't make sense. Look at the distribution of the target variable and double-check that the whole population contains only alerts (since this is the only population you care about).
print("ALERT:")
print(df['ALERT'].value_counts(normalize=True))
print("SAR:")
print(df['SAR'].value_counts(normalize=True))
ALERT:
1    1.0
Name: ALERT, dtype: float64
SAR:
0    0.8974
1    0.1026
Name: SAR, dtype: float64
From the output you can see that the false positive rate (SAR=0) is roughly 90%, which is expected for AML problems.
Next, assess the data quality by checking if any missing values are present in the data.
df.isnull().sum()/len(df)
ALERT                               0.00
SAR                                 0.00
kycRiskScore                        0.00
income                              0.02
tenureMonths                        0.00
creditScore                         0.00
state                               0.00
nbrPurchases90d                     0.00
avgTxnSize90d                       0.00
totalSpend90d                       0.00
csrNotes                            0.00
nbrDistinctMerch90d                 0.00
nbrMerchCredits90d                  0.00
nbrMerchCreditsRndDollarAmt90d      0.00
totalMerchCred90d                   0.00
nbrMerchCreditsWoOffsettingPurch    0.00
nbrPayments90d                      0.00
totalPaymentAmt90d                  0.00
overpaymentAmt90d                   0.00
overpaymentInd90d                   0.00
nbrCustReqRefunds90d                0.00
indCustReqRefund90d                 0.00
totalRefundsToCust90d               0.00
nbrPaymentsCashLike90d              0.00
maxRevolveLine                      0.00
indOwnsHome                         0.00
nbrInquiries1y                      0.00
nbrCollections3y                    0.00
nbrWebLogins90d                     0.00
nbrPointRed90d                      0.00
PEP                                 0.00
dtype: float64
The data looks to be in good shape, with only 2% of values missing from the income variable. Still, it is worth asking why: is income missing because the person is unemployed, or is it a system error?
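One hedged way to probe that question is to check whether the missingness itself correlates with the target. The snippet below uses a small synthetic frame as a stand-in for the real data; with the actual dataset you would apply the same grouping to df.

```python
import pandas as pd
import numpy as np

# Synthetic stand-in for the real data: income missing for some rows
demo = pd.DataFrame({
    "income": [55000.0, np.nan, 72000.0, np.nan, 61000.0, 48000.0],
    "SAR":    [0, 1, 0, 0, 0, 1],
})

# Flag missingness explicitly, then compare SAR rates across the two groups
demo["income_missing"] = demo["income"].isna()
print(demo.groupby("income_missing")["SAR"].mean())
```

If the SAR rate differs markedly between the missing and non-missing groups, the missingness is informative and worth encoding as its own feature rather than silently imputing it away.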
Next, use the following code to visualize the data. Visualization is good practice that can help you spot anomalies and build intuition. The code below selects the numerical features (categorical features can be examined later).
plt_hist = df.select_dtypes(include='number').hist(figsize=(20, 20), xlabelsize=8, ylabelsize=8)
Some quick observations based on the histograms:
- Some features do not provide any useful information because they are all zeroes or have a single value (example: PEP).
- Some features are zero-inflated (example: nbrPaymentsCashLike90d).
- Some numerical features can be turned into categorical features (example: indOwnsHome).
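These checks can be automated. The sketch below, run on a small synthetic frame standing in for the real data (the 0.7 zero-share threshold is an arbitrary choice), flags single-valued columns and zero-inflated numeric columns, and casts a 0/1 indicator to a categorical dtype.

```python
import pandas as pd

# Synthetic stand-in mirroring three columns from the real dataset
demo = pd.DataFrame({
    "PEP":                    [0, 0, 0, 0, 0],  # single value: uninformative
    "nbrPaymentsCashLike90d": [0, 0, 0, 3, 0],  # zero-inflated
    "indOwnsHome":            [1, 0, 1, 1, 0],  # 0/1 indicator
})

# Columns with a single unique value carry no signal
constant_cols = [c for c in demo.columns if demo[c].nunique() == 1]

# Numeric columns dominated by zeroes (threshold is arbitrary)
zero_inflated = [c for c in demo.select_dtypes("number").columns
                 if (demo[c] == 0).mean() > 0.7]

# Binary indicators can be treated as categorical features
demo["indOwnsHome"] = demo["indOwnsHome"].astype("category")

print(constant_cols, zero_inflated, demo["indOwnsHome"].dtype)
```

Constant columns can simply be dropped (DataRobot's Informative Features list excludes them automatically), while zero-inflated and indicator columns may benefit from explicit encoding decisions.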
Use the API to predict a SAR¶
Once you have configured your API credentials, endpoints, and environment, you can use the DataRobot API to do the following:
- Upload a dataset.
- Train a model to learn from the dataset on the Informative Features feature list.
- Test prediction outcomes on the model with new data.
- Deploy the model.
- Predict outcomes on the deployed model using new data.
# Upload a dataset
ct = datetime.datetime.now()
file_name = f"AML_Alert_train_{int(ct.timestamp())}.csv"
dataset = dr.Dataset.create_from_in_memory_data(df)
dataset.modify(name=file_name)
dataset
Dataset(name='AML_Alert_train_1658253378.csv', id='62d6f04257fa4f9a325372d5')
# Create a new project based on dataset
ct = datetime.datetime.now()
project_name = f"Anti Money Laundering Alert Scoring_{int(ct.timestamp())}"
# The following steps use code to create the project and then use the DataRobot user interface (UI) to interpret results based on Leaderboard visualizations.
# If you use the UI to create the project, choose `SAR` as the target.
project = dataset.create_project(project_name=project_name)
project_url = project.get_leaderboard_ui_permalink()
# Display the project ID and name. Your URL and project ID will be different.
print(f'''Project Details
Project URL: {project_url}
Project ID: {project.id}
Project Name: {project.project_name}
''')
Project Details Project URL: https://app.datarobot.com/projects/62d6f07b5956dedb4c536e79/models Project ID: 62d6f07b5956dedb4c536e79 Project Name: Anti Money Laundering Alert Scoring_1658253435
# Feature lists control the subset of features that DataRobot uses to build models
# You can use one of the automatically created lists, e.g. "Informative Features"
flists = project.get_featurelists()
flist = next(x for x in flists if x.name == "Informative Features")
flist.features
['SAR', 'kycRiskScore', 'income', 'tenureMonths', 'creditScore', 'state', 'nbrPurchases90d', 'avgTxnSize90d', 'totalSpend90d', 'csrNotes', 'nbrDistinctMerch90d', 'nbrMerchCredits90d', 'nbrMerchCreditsRndDollarAmt90d', 'totalMerchCred90d', 'nbrMerchCreditsWoOffsettingPurch', 'nbrPayments90d', 'totalPaymentAmt90d', 'overpaymentAmt90d', 'overpaymentInd90d', 'nbrCustReqRefunds90d', 'totalRefundsToCust90d', 'nbrPaymentsCashLike90d', 'maxRevolveLine', 'indOwnsHome', 'nbrInquiries1y', 'nbrCollections3y', 'nbrWebLogins90d', 'nbrPointRed90d']
# Select modeling parameters and start the modeling process
project.set_target(target='SAR',
mode=dr.AUTOPILOT_MODE.QUICK,
featurelist_id=flist.id,
worker_count='-1')
display(HTML(f'For a full reference of available parameters, see <a target="_blank" rel="noopener noreferrer" href="https://datarobot-public-api-client.readthedocs-hosted.com/page/autodoc/api_reference.html#datarobot.models.Project.set_target">Project.set_target.</a>'))
project.wait_for_autopilot(check_interval=20.0, timeout=86400, verbosity=0)
Models in the UI¶
Once the project has started, you can begin exploring in the UI, even while Autopilot is running and building models. You can open the project from the Manage Projects center, or retrieve the project URL with the snippet below and use the output to navigate to the DataRobot application.
# Display project ID and URL
project_url
'https://app.datarobot.com/projects/62d6f07b5956dedb4c536e79/models'
Exploratory Data Analysis (EDA)¶
Navigate to the Data tab to learn more about your data.
- Click each feature to see a variety of information, including a histogram that represents the relationship of the feature with the target.