Anti-Money Laundering (AML) Alert Scoring
In this use case you will build a model that uses historical data, including customer and transactional information, to identify which alerts resulted in a Suspicious Activity Report (SAR). The model can then be used to assign a suspicious activity score to future alerts and improve the efficiency of an AML compliance program using rank ordering by score.
Download the sample training dataset here.
Click here to jump directly to the notebook. Otherwise, the following several paragraphs describe the business justification and problem framing for this use case.
A key pillar of any AML compliance program is to monitor transactions for suspicious activity. The scope of transactions is broad, including deposits, withdrawals, fund transfers, purchases, merchant credits, and payments. Typically, monitoring starts with a rules-based system that scans customer transactions for red flags consistent with money laundering. When a transaction matches a predetermined rule, an alert is generated and the case is referred to the bank’s internal investigation team for manual review. If the investigators conclude the behavior is indicative of money laundering, then the bank will file a Suspicious Activity Report (SAR) with FinCEN.
Unfortunately, the standard transaction monitoring system described above has costly drawbacks. In particular, the rate of false-positives (cases incorrectly flagged as suspicious) generated by this rules-based system can reach 90% or more. Since the system is rules-based and rigid, it cannot dynamically learn the complex interactions and behaviors behind money laundering. The prevalence of false-positives makes investigators less efficient as they have to manually weed out cases that the rules-based system incorrectly marked as suspicious.
Compliance teams at financial institutions can have hundreds or even thousands of investigators, and the current systems prevent investigators from becoming more effective and efficient in their investigations. The cost of reviewing an alert ranges between $30~$70. For a bank that receives 100,000 alerts a year, this is a substantial sum; on average, penalties imposed for proven money laundering amount to $145 million per case. A reduction in false positives could result in savings between $600,000~$4.2 million per year.
This use case builds a model that dynamically learns patterns in complex data and reduces false positive alerts. Financial crime compliance teams can then prioritize the alerts that legitimately require manual review and dedicate more resources to those cases most likely to be suspicious. By learning from historical data to uncover patterns related to money laundering, AI also helps identify which customer data and transaction activities are indicative of a high risk for potential money laundering.
The primary issues and corresponding opportunities that this use case addresses include:
|Potential regulatory fine||Mitigate the risk of missing suspicious activities due to lack of competency with alert investigations. Use alert scores to more effectively assign alerts—high risk alerts to more experienced investigators, low risk alerts to more junior team members.|
|Investigation productivity||Increase investigators' productivity by making the review process more effective and efficient, and by providing a more holistic view when assessing cases.|
Strategy/challenge: Help investigators focus their attention on cases that have the highest risk of money laundering while minimizing the time they spend reviewing false-positive cases.
For banks with large volumes of daily transactions, improvements in the effectiveness and efficiency of their investigations ultimately result in fewer cases of money laundering going unnoticed. This allows banks to enhance their regulatory compliance and reduce the volume of financial crime present within their network.
Business driver: Improve the efficiency of AML transaction monitoring and lower operational costs.
With its ability to dynamically learn patterns in complex data, AI significantly improves accuracy in predicting which cases will result in a SAR filing. AI models for anti-money laundering can be deployed into the review process to score and rank all new cases.
Model solution: Assign a suspicious activity score to each AML alert, improving the efficiency of an AML compliance program.
Any case that exceeds a predetermined threshold of risk is sent to the investigators for manual review. Meanwhile, any case that falls below the threshold can be automatically discarded or sent to a lighter review. Once AI models are deployed into production, they can be continuously retrained on new data to capture any novel behaviors of money laundering. This data will come from the feedback of investigators.
Specifically, the model will use rules that trigger an alert whenever a customer requests a refund of any amount since small refund requests could be the money launderer’s way of testing the refund mechanism or trying to establish refund requests as a normal pattern for their account.
The following table summarizes aspects of this use case.
|Use case type||Anti-money laundering (false positive reduction)|
|Target audience||Data Scientist, Financial Crime Compliance Team|
The target variable for this use case is whether or not the alert resulted in a SAR after manual review by investigators, making this a binary classification problem. The unit of analysis is an individual alert—the model will be built on the alert level—and each alert will receive a score ranging from 0 to 1. The score indicates the probability of being a SAR.
The goal of applying a model to this use case is to lower the false positive rate, which means resources are not spent reviewing cases that are eventually determined not to be suspicious after an investigation.
In this use case, the False Positive Rate of the rules engine on the validation sample (1600 records) is:
The number of SAR=0 divided by the total number of records =
ROI can be calculated as follows:
Avoided potential regulatory fine + Annual alert volume * false positive reduction rate * cost per alert
A high-level measurement of the ROI equation involves two parts.
The total amount of avoided potential regulatory fines will vary depending on the nature of the bank and must be estimated on a case-by-case basis.
The second part of the equation is where AI can have a tangible impact on improving investigation productivity and reducing operational costs. Consider this example:
- A bank generates 100,000 AML alerts every year.
- DataRobot achieves a 70% false positive reduction rate without losing any historical suspicious activities.
- The average cost per alert is
Result: The annual ROI of implementing the solution will be 100,000 * 70% * ($30~$70) = $2.1MM~$4.9MM.
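The arithmetic in this example can be checked with a short sketch. The function and variable names are illustrative assumptions, and the avoided regulatory fine is set to 0 here only because it has to be estimated separately for each bank.

```python
def annual_alert_savings(alert_volume, fp_reduction_rate, cost_per_alert):
    """Operational savings from alerts that no longer need manual review."""
    return alert_volume * fp_reduction_rate * cost_per_alert


def annual_roi(avoided_fine, alert_volume, fp_reduction_rate, cost_per_alert):
    """ROI = avoided potential regulatory fine + operational savings."""
    return avoided_fine + annual_alert_savings(
        alert_volume, fp_reduction_rate, cost_per_alert
    )


# Figures from the example: 100,000 alerts, 70% reduction, $30 to $70 per alert.
# The avoided fine is left at 0 (assumption) because it is case specific.
low = annual_roi(0, 100_000, 0.70, 30)
high = annual_roi(0, 100_000, 0.70, 70)
print(f"${low:,.0f} to ${high:,.0f} per year")
```

Keeping the avoided fine as a separate argument makes it easy to add a bank-specific estimate later without changing the operational part of the calculation.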
Working with data
The linked synthetic dataset illustrates a credit card company’s AML compliance program. Specifically, the model detects the following money-laundering scenarios:
- The customer spends on the card but overpays their credit card bill and seeks a cash refund for the difference.
- The customer receives credits from a merchant without offsetting transactions and either spends the money or requests a cash refund from the bank.
The unit of analysis in this dataset is an individual alert, meaning a rule-based engine is in place to produce an alert to detect potentially suspicious activity consistent with the above scenarios.
Consider the following when working with data:
Define the scope of analysis: Collect alerts from a specific analytical window to start with; it’s recommended that you use 12–18 months of alerts for model building.
Define the target: Depending on the investigation processes, the target definition could be flexible. In this walkthrough, alerts are classified as
Level3-confirmed. These labels indicate at which level of the investigation the alert was closed (i.e., confirmed as a SAR). To create a binary target, treat Level3-confirmed as SAR (denoted by 1) and the remaining levels as non-SAR alerts (denoted by 0).
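As a minimal sketch of this target definition, the mapping could look like the following. Only `Level3-confirmed` comes from the text; the other level labels (`Level1`, `Level2`) and the field names `alert_id` and `closure_level` are hypothetical placeholders.

```python
def to_binary_target(closure_level):
    """Return 1 for alerts confirmed as a SAR, 0 for every other closure level."""
    return 1 if closure_level == "Level3-confirmed" else 0


# Hypothetical alert records; labels other than "Level3-confirmed" are assumed.
alerts = [
    {"alert_id": 1, "closure_level": "Level1"},
    {"alert_id": 2, "closure_level": "Level3-confirmed"},
    {"alert_id": 3, "closure_level": "Level2"},
]

for alert in alerts:
    alert["SAR"] = to_binary_target(alert["closure_level"])

print([(a["alert_id"], a["SAR"]) for a in alerts])
```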
Consolidate information from multiple data sources: Below is a sample entity-relationship diagram indicating the relationship between the data tables used for this use case.
Some features are static information (for example, state of residence); these can be fetched directly from the reference tables.
For transaction behavior and payment history, the information will be derived from a specific time window prior to the alert generation date. This case uses 90 days as the time window to obtain the dynamic customer behavior, such as
Below is an example of one row in the training data after it is merged and aggregated (it is broken into multiple lines for easier visualization).
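The 90-day look-back aggregation described above can be sketched as follows. This is an illustration under assumptions, not the pipeline used to build the sample dataset: the transaction fields (`date`, `type`, `amount`) and the transaction type labels are hypothetical, while the output feature names follow the sample feature list.

```python
from datetime import date, timedelta


def aggregate_window(transactions, alert_date, window_days=90):
    """Aggregate a customer's transactions in the window before the alert date."""
    start = alert_date - timedelta(days=window_days)
    # Keep only activity that happened before the alert was generated.
    recent = [t for t in transactions if start <= t["date"] < alert_date]
    purchases = [t for t in recent if t["type"] == "purchase"]
    refunds = [t for t in recent if t["type"] == "refund"]
    total_spend = sum(t["amount"] for t in purchases)
    return {
        "nbrPurchases90d": len(purchases),
        "totalSpend90d": total_spend,
        "avgTxnSize90d": total_spend / len(purchases) if purchases else 0.0,
        "nbrCustReqRefunds90d": len(refunds),
        "totalRefundsToCust90d": sum(t["amount"] for t in refunds),
    }


# Hypothetical transactions for one customer.
transactions = [
    {"date": date(2024, 6, 1), "type": "purchase", "amount": 20.0},
    {"date": date(2024, 5, 15), "type": "purchase", "amount": 30.0},
    {"date": date(2024, 6, 10), "type": "refund", "amount": 10.0},
    {"date": date(2024, 1, 5), "type": "purchase", "amount": 500.0},  # outside window
]

features = aggregate_window(transactions, alert_date=date(2024, 6, 30))
print(features)
```

Filtering on dates strictly before the alert date keeps information that was not available when the alert fired out of the training row.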
Features and sample data
The features in the sample dataset consist of KYC (Know-Your-Customer) information, demographic information, transactional behavior, and free-form text information from notes taken by customer service representatives. To apply this use case in your organization, your dataset should contain, at a minimum, the following features:
- Alert ID
- Binary classification target (
- Date/time of the alert
- "Know Your Customer" score used at the time of account opening
- Account tenure, in months
- Total merchant credit in the last 90 days
- Number of refund requests by the customer in the last 90 days
- Total refund amount in the last 90 days
Other helpful features to include are:
- Annual income
- Credit bureau score
- Number of credit inquiries in the past year
- Number of logins to the bank website in the last 90 days
- Indicator that the customer owns a home
- Maximum revolving line of credit
- Number of purchases in the last 90 days
- Total spend in the last 90 days
- Number of payments in the last 90 days
- Number of cash-like payments (e.g., money orders) in last 90 days
- Total payment amount in last 90 days
- Number of distinct merchants purchased from in the last 90 days
- Customer Service Representative notes and codes based on conversations with customer (cumulative)
The table below shows a sample feature list:
|Feature name||Data type||Description||Data source||Example|
|SAR||Binary(Target)||SAR Indicator (Binary Target)||tbl_alert||0|
|kycRiskScore||Numeric||Account relationship (Know Your Customer) score used at time of account opening||tbl_customer||2|
|tenureMonths||Numeric||Account tenure in months||tbl_customer||13|
|creditScore||Numeric||Credit bureau score||tbl_customer||780|
|state||Categorical||Account billing address state||tbl_account||VT|
|nbrPurchases90d||Numeric||Number of purchases in last 90 days||tbl_transaction||4|
|avgTxnSize90d||Numeric||Average transaction size in last 90 days||tbl_transaction||28.61|
|totalSpend90d||Numeric||Total spend in last 90 days||tbl_transaction||114.44|
|csrNotes||Text||Customer Service Representative notes and codes based on conversations with customer (cumulative)||tbl_customer_misc||call back password call back card password replace atm call back|
|nbrDistinctMerch90d||Numeric||Number of distinct merchants purchased at in last 90 days||tbl_transaction||1|
|nbrMerchCredits90d||Numeric||Number of credits from merchants in last 90 days||tbl_transaction||0|
|nbrMerchCredits-RndDollarAmt90d||Numeric||Number of credits from merchants in round dollar amounts in last 90 days||tbl_transaction||0|
|totalMerchCred90d||Numeric||Total merchant credit amount in last 90 days||tbl_transaction||0|
|nbrMerchCredits-WoOffsettingPurch||Numeric||Number of merchant credits without an offsetting purchase in last 90 days||tbl_transaction||0|
|nbrPayments90d||Numeric||Number of payments in last 90 days||tbl_transaction||3|
|totalPaymentAmt90d||Numeric||Total payment amount in last 90 days||tbl_account_bill||114.44|
|overpaymentAmt90d||Numeric||Total amount overpaid in last 90 days||tbl_account_bill||0|
|overpaymentInd90d||Numeric||Indicator that account was overpaid in last 90 days||tbl_account_bill||0|
|nbrCustReqRefunds90d||Numeric||Number of refund requests by the customer in last 90 days||tbl_transaction||1|
|indCustReqRefund90d||Binary||Indicator that customer requested a refund in last 90 days||tbl_transaction||1|
|totalRefundsToCust90d||Numeric||Total refund amount in last 90 days||tbl_transaction||56.01|
|nbrPaymentsCashLike90d||Numeric||Number of cash-like payments (e.g., money orders) in last 90 days||tbl_transaction||0|
|maxRevolveLine||Numeric||Maximum revolving line of credit||tbl_account||14000|
|indOwnsHome||Numeric||Indicator that the customer owns a home||tbl_transaction||1|
|nbrInquiries1y||Numeric||Number of credit inquiries in the past year||tbl_transaction||0|
|nbrCollections3y||Numeric||Number of collections in the past 3 years||tbl_collection||0|
|nbrWebLogins90d||Numeric||Number of logins to the bank website in the last 90 days||tbl_account_login||7|
|nbrPointRed90d||Numeric||Number of loyalty point redemptions in the last 90 days||tbl_transaction||2|
|PEP||Binary||Politically Exposed Person indicator||tbl_customer||0|
Modeling and insights
DataRobot automates many parts of the modeling pipeline, including processing and partitioning the dataset, as described here. This document starts with the visualizations available once modeling has started.
Exploratory Data Analysis (EDA)
Navigate to the Data tab to learn more about your data—summary statistics based on sampled data known as EDA. Click each feature to see a variety of information, including a histogram that represents the relationship of the feature with the target.
While DataRobot is running Autopilot to find the champion model, use the Data > Feature Associations tab to view the feature association matrix and understand the correlations between each pair of input features. For example, the features
nbrDistinctMerch90d (top-left corner) have strong associations and are, therefore, ‘clustered’ together (where each color block in this matrix is a cluster).
DataRobot provides a variety of insights to interpret results and evaluate accuracy.
After Autopilot completes, the Leaderboard ranks each model based on the selected optimization metrics (LogLoss in this case).
The outcome of Autopilot is not only a selection of best-suited models, but also the identification of a recommended model: the model that best understands how to predict the target feature SAR. Choosing the best model is a balance of accuracy, metric performance, and model simplicity. See the model recommendation process description for more detail.
Autopilot will continue building models until it selects the best predictive model for the specified target feature. This model is at the top of the Leaderboard, marked with the Recommended for Deployment badge.
To reduce false positives, you can choose other metrics like Gini Norm to sort the Leaderboard based on how good the models are at giving SAR a higher rank than the non-SAR alerts.
There are many visualizations within DataRobot that provide insight into why an alert might be SAR. Below are the most relevant for this use case.
Click on a model to reveal the model blueprint—the pipeline of preprocessing steps, modeling algorithms, and post-processing steps used to create the model.
Feature Impact reveals the association between each feature and the target. DataRobot identifies the top three most impactful features (which enable the machine to differentiate SAR from non-SAR alerts) as total merchant credit in the last 90 days, number of refund requests by the customer in the last 90 days, and total refund amount in the last 90 days.
To understand the direction of impact and the SAR risk at different levels of the input feature, DataRobot provides partial dependence graphs (within the Feature Effects tab) to depict how the likelihood of being a SAR changes when the input feature takes different values. In this example, the total merchant credit amount in the last 90 days is the most impactful feature, but the SAR risk is not linearly increasing when the amount increases.
- When the amount is below $1000, the SAR risk remains relatively low.
- SAR risk surges significantly when the amount is above $1000.
- SAR risk increase slows when the amount approaches $1500.
- SAR risk rises again until it peaks and plateaus at around $2200.
The partial dependence graph makes it very straightforward to interpret the SAR risk at different levels of the input features. This could also be converted to a data-driven framework to set up risk-based thresholds that augment the traditional rule-based system.
To turn the machine-made decisions into human-interpretable rationale, DataRobot provides Prediction Explanations for each alert scored and prioritized by the machine learning model. In the example below, the record with ID=1269 has a very high likelihood of being a suspicious activity (prediction=90.2%), and the three main reasons are:
- Total merchant credit amount in the last 90 days is significantly greater than the others.
- Total spend in the last 90 days is much higher than average.
- Total payment amount in the last 90 days is much higher than average.
Prediction Explanations can also be used to cluster alerts into subgroups with different types of transactional behaviors, which could help triage alerts to different investigation approaches.
The Word Cloud allows you to explore how text fields affect predictions. The Word Cloud uses a color spectrum to indicate the word's impact on the prediction. In this example, red words indicate the alert is more likely to be associated with a SAR.
The following insights help evaluate accuracy.
The Lift Chart shows how effective the model is at separating the SAR and non-SAR alerts. After an alert in the out-of-sample partition gets scored by the model, it is assigned a risk score that measures the likelihood of the alert being a SAR risk or becoming a SAR. In the Lift Chart, alerts are sorted based on the SAR risk, broken down into 10 deciles, and displayed from lowest to the highest. For each decile, DataRobot computes the average predicted SAR risk (blue plus) as well as the average actual SAR event (orange circle) and depicts the two lines together. For the champion model built for this false positive reduction use case, the SAR rate of the top decile is 55%, which is a significant lift from the ~10% SAR rate in the training data. The top three deciles capture almost all SARs, which means that the 70% of alerts with very low predicted SAR risk rarely result in SAR.
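The binning behind a lift chart can be illustrated with a short sketch. It is not DataRobot's implementation: it simply sorts alerts by predicted score, splits them into equal-sized bins, and averages predicted and actual outcomes per bin. The toy scores and labels are assumptions made for the example.

```python
def lift_table(scores, actuals, n_bins=10):
    """Average predicted and actual SAR rate per bin, lowest scores first."""
    pairs = sorted(zip(scores, actuals))  # ascending by predicted score
    n = len(pairs)
    rows = []
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:
            continue  # fewer alerts than bins
        rows.append({
            "bin": b + 1,
            "avg_predicted": sum(s for s, _ in chunk) / len(chunk),
            "avg_actual": sum(a for _, a in chunk) / len(chunk),
        })
    return rows


# Toy data (assumed): 100 alerts, and only the 10 highest-scored are SARs.
scores = [i / 100 for i in range(100)]
actuals = [1 if i >= 90 else 0 for i in range(100)]

rows = lift_table(scores, actuals)
for row in rows:
    print(row)
```

A model that ranks well shows `avg_actual` concentrated in the last bins, which is the pattern the text describes for the top deciles.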
Once you know the model is performing well, you select an explicit threshold to make a binary decision based on the continuous SAR risk predicted by DataRobot. The ROC Curve tools provide a variety of information to help make some of the important decisions in selecting the optimal threshold:
The false negative rate has to be as small as possible. False negatives are the alerts that DataRobot determines are not SARs which then turn out to be true SARs. Missing a true SAR is very dangerous and would potentially result in an MRA (matter requiring attention) or regulatory fine.
This case takes a conservative approach. To have a false negative rate of 0, the threshold has to be low enough to capture all the SARs.
Keep the alert volume as low as possible to reduce enough false positives. In this context, all alerts generated in the past that are not SARs are the de-facto false positives. The machine learning model is likely to assign a lower score to those non-SAR alerts; therefore, pick a high-enough threshold to reduce as many false positive alerts as possible.
Ensure the selected threshold is not only working on the seen data, but also on the unseen data, so that when the model gets deployed to the transaction monitoring system for ongoing scoring, it could still reduce false positives without missing any SARs.
Comparing different thresholds on the cross-validation data (the data used for model training and validation) shows that 0.03 is the optimal threshold since it satisfies the first two criteria. On the one hand, the false negative rate is 0; on the other hand, the alert volume is reduced from 8000 to 2142, reducing false positive alerts by 73% (5858/8000) without missing any SARs.
For the third criterion (does the threshold also work on unseen alerts?), you can quickly validate it in DataRobot. By changing the data selection to Holdout and applying the same threshold (0.03), the false negative rate remains 0, and the false positive reduction rate remains at 73% (1457/2000). This indicates that the model generalizes well and can be expected to perform similarly on unseen data.
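The three criteria can be sketched in code: pick the highest threshold that still flags every known SAR, measure how many alerts it removes, then apply the same threshold to held-out data. The scores and labels below are toy values chosen for illustration, not the figures from this use case, and the function names are assumptions.

```python
def zero_miss_threshold(scores, actuals):
    """Highest threshold that still flags every known SAR (score >= threshold).

    Raises ValueError if the data contains no SARs.
    """
    return min(s for s, a in zip(scores, actuals) if a == 1)


def evaluate(scores, actuals, threshold):
    """Count kept, removed, and missed alerts at a given threshold."""
    kept = sum(1 for s in scores if s >= threshold)
    missed = sum(1 for s, a in zip(scores, actuals) if a == 1 and s < threshold)
    removed = len(scores) - kept
    return {
        "alerts_kept": kept,
        "alerts_removed": removed,
        "missed_sars": missed,
        "reduction_rate": removed / len(scores),
    }


# Toy cross-validation and holdout predictions (assumed values).
cv_scores = [0.01, 0.02, 0.02, 0.05, 0.40, 0.90, 0.01, 0.03]
cv_actuals = [0, 0, 0, 0, 1, 1, 0, 1]
holdout_scores = [0.01, 0.02, 0.50, 0.04]
holdout_actuals = [0, 0, 1, 0]

threshold = zero_miss_threshold(cv_scores, cv_actuals)
cv_result = evaluate(cv_scores, cv_actuals, threshold)
holdout_result = evaluate(holdout_scores, holdout_actuals, threshold)
print(threshold, cv_result, holdout_result)
```

The threshold is chosen on cross-validation data only; the holdout evaluation reuses it unchanged, which is what makes the holdout check meaningful.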
From the Profit Curve tab, use the Payoff Matrix to set thresholds based on simulated profit. If the bank has a specific risk tolerance for missing a small portion of historical SAR, they can also apply the Payoff Matrix to pick up the optimal threshold for the binary cutoff. For example:
||Reflects the cost of remediating a SAR that was not detected.|
||Reflects the cost of investigating an alert that proved a "false alarm."|
|Metrics||False Positive Rate, False Negative Rate, and Average Profit||Provides standard statistics to help describe model performance at the selected display threshold.|
By setting the cost per false positive to $50 (cost of investigating an alert) and the cost per false negative to $200 (cost of remediating a SAR that was not detected), the threshold is optimized at 0.1183 which gives a minimum cost of $53k ($6.6 * 8000) out of 8000 alerts and the highest ROI of $347k ($50 * 8000 - $53k).
On the one hand, the false negative rate remains low (only 5 SARs were not detected); on the other hand, the alert volume is reduced from 8000 to 1988, meaning the number of investigations is reduced by more than 75% (6012/8000).
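A cost-based threshold search of this kind can be sketched as below. This is a simplified illustration, not the Payoff Matrix implementation: it only counts a cost per false positive and per false negative, uses the $50 and $200 figures from the text as defaults, and runs on toy scores that are assumptions.

```python
def total_cost(scores, actuals, threshold, fp_cost=50, fn_cost=200):
    """Cost of investigating false alarms plus cost of missed SARs."""
    fp = sum(1 for s, a in zip(scores, actuals) if s >= threshold and a == 0)
    fn = sum(1 for s, a in zip(scores, actuals) if s < threshold and a == 1)
    return fp * fp_cost + fn * fn_cost


def best_threshold(scores, actuals, fp_cost=50, fn_cost=200):
    """Try every observed score (and 'flag nothing') and keep the cheapest."""
    candidates = sorted(set(scores)) + [float("inf")]
    return min(
        candidates,
        key=lambda t: total_cost(scores, actuals, t, fp_cost, fn_cost),
    )


# Toy predictions (assumed values).
scores = [0.05, 0.10, 0.20, 0.30, 0.60, 0.90]
actuals = [0, 0, 1, 0, 1, 1]

threshold = best_threshold(scores, actuals)
cost = total_cost(scores, actuals, threshold)
# Mirrors the arithmetic in the text: review-everything cost minus optimized cost.
savings = 50 * len(scores) - cost
print(threshold, cost, savings)
```

Raising the false negative cost relative to the false positive cost pushes the chosen threshold down, which is how a bank's risk tolerance enters the decision.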
The threshold is optimized at 0.0619, which gives the highest ROI of $300k out of 8000 alerts. By setting this threshold, the bank will reduce false positives by 74.3% (5940/8000) at the risk of missing only 3 SARs.
See the deep dive for information on handling class imbalance problems.
Once the modeling team decides on the champion model, they can download compliance documentation for the model. The resulting Microsoft Word document provides a 360-degree view of the entire model-building process, as well as all the challenger models that are compared to the champion model. Most of the machine learning models used for the Financial Crime Compliance domain require approval from the Model Risk Management (MRM) team. The compliance document provides comprehensive evidence and rationale for each step in the model development process.
Predict and deploy
Once you identify the model that best learns patterns in your data to predict SARs, you can deploy it into your desired decision environment. Decision environments are the ways in which the predictions generated by the model will be consumed by the appropriate organizational stakeholders, and how these stakeholders will make decisions using the predictions to impact the overall process. This is a critical step for implementing the use case, as it ensures that predictions are used in the real world to reduce false positives and improve efficiency in the investigation process.
The following applications of the alert-prioritization score from the false positive reduction model both automate and augment the existing rule-based transaction monitoring system.
If the FCC (Financial Crime Compliance) team is comfortable with removing the low-risk alerts (very low prioritization score) from the scope of investigation, then the binary threshold selected during the model-building stage will be used as the cutoff to remove those no-risk alerts. The investigation team will only investigate alerts above the cutoff, which will still capture all the SARs based on what was learned from the historical data.
Often regulatory agencies will consider auto-closure or auto-removal as an aggressive treatment for production alerts. If auto-closing is not the ideal way to use the model output, the alert prioritization score can still be used to triage alerts into different investigation processes, improving the operational efficiency.
The following table lists potential decision stakeholders:
|Decision Executors||Financial Crime Compliance Team|
|Decision Managers||Chief Compliance Officer|
|Decision Authors||Data scientists or business analysts|
Currently, the review process consists of a deep-dive analysis by investigators. The data related to the case is made available for review so that the investigators can develop a 360° view of the customer, including their profile, demographic, and transaction history. Additional data from third-party data providers and web crawling can supplement this information to complete the picture.
For transactions that do not get auto-closed or auto-removed, the model can help the compliance team create a more effective and efficient review process by triaging their reviews. The predictions and their explanations also give investigators a more holistic view when assessing cases.
Risk-based Alert Triage: Based on the prioritization score, the investigation team can take different investigation strategies.
For no-risk or low-risk alerts—alerts can be reviewed on a quarterly basis, instead of monthly. The frequently alerted entities without any SAR risk will be reviewed once every three months, which will significantly reduce the time of investigation.
For high-risk alerts with higher prioritization scores—investigations can fast-forward to the final stage in the alert escalation path. This will significantly reduce the effort spent on level 1 and level 2 investigations.
For medium-risk alerts—the standard investigation process can still be applied.
Smart Alert Assignment: For an alert investigation team that is geographically dispersed, the alert prioritization score can be used to assign alerts to different teams in a more effective manner. High-risk alerts can be assigned to the team with the most experienced investigators, while low-risk alerts are assigned to the less-experienced team. This will mitigate the risk of missing suspicious activities due to a lack of competency during alert investigations.
For both approaches, the definition of high/medium/low risk could be either a set of hard thresholds (for example, High: score>=0.5, Medium: 0.5>score>=0.3, Low: score<0.3), or based on the percentile of the alert scores on a monthly basis (for example, High: above 80th percentile, Medium: between 50th and 80th percentile, Low: below 50th percentile).
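Both tiering schemes can be sketched in a few lines. The hard cutoffs (0.5 and 0.3) and percentile cutoffs (80th and 50th) come from the examples above; the function names and the handling of scores that fall exactly on a percentile boundary are assumptions.

```python
from bisect import bisect_left


def tier_by_threshold(score, high=0.5, medium=0.3):
    """Hard thresholds: High >= 0.5, Medium >= 0.3, otherwise Low."""
    if score >= high:
        return "High"
    if score >= medium:
        return "Medium"
    return "Low"


def tier_by_percentile(scores, high_pct=80, medium_pct=50):
    """Tier each alert by its percentile rank within the batch (e.g., a month)."""
    ordered = sorted(scores)
    n = len(ordered)
    tiers = []
    for s in scores:
        # Percentage of alerts in the batch scored strictly below this one.
        pct_below = 100 * bisect_left(ordered, s) / n
        if pct_below >= high_pct:
            tiers.append("High")
        elif pct_below >= medium_pct:
            tiers.append("Medium")
        else:
            tiers.append("Low")
    return tiers


monthly_scores = [i / 10 for i in range(10)]  # toy batch of 10 alert scores
tiers = tier_by_percentile(monthly_scores)
print([tier_by_threshold(s) for s in monthly_scores])
print(tiers)
```

Hard thresholds keep the meaning of each tier stable over time, while percentile-based tiers keep the workload per tier stable even if the score distribution shifts.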
The predictions generated from DataRobot can be integrated with an alert management system which will let the investigation team know of high-risk transactions.
DataRobot will continuously monitor the model deployed on the dedicated prediction server. With DataRobot MLOps, the modeling team can monitor and manage the alert prioritization model by tracking the distribution drift of the input features as well as the performance degradation over time.
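To make the idea of feature drift concrete, the sketch below computes the Population Stability Index (PSI), one commonly used drift statistic, between a training sample and a recent scoring sample of a single numeric feature. It is an illustration only and makes no claim about how DataRobot MLOps computes drift; the bin count, the smoothing constant, and the sample values are assumptions.

```python
import math


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between two samples of one numeric feature."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant feature

    def shares(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1  # clamp out-of-range values
        # eps avoids log(0) when a bin is empty in one of the samples.
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Toy samples (assumed): the recent data is shifted upward by 50.
training = [float(i) for i in range(100)]
recent = [float(i + 50) for i in range(100)]

print(psi(training, training))  # identical samples give no drift
print(psi(training, recent))    # shifted sample gives a positive value
```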
When operationalizing this use case, consider the following, which may impact outcomes and require model re-evaluation:
- Change in the transactional behavior of the money launderers.
- Novel information introduced to the transaction and customer records that the machine learning models have not seen.
Deep dive: Imbalanced targets
In AML and Transaction Monitoring, the SAR rate is usually very low (1%–5%, depending on the detection scenarios); sometimes it could be even lower than 1% in extremely unproductive scenarios. In machine learning, such a problem is called class imbalance. The question becomes, how can you mitigate the risk of class imbalance and let the machine learn as much as possible from the limited known-suspicious activities?
DataRobot offers different techniques to handle class imbalance problems. Some techniques:
Evaluate the model with different metrics. For binary classification (the false positive reduction model here, for example), LogLoss is used as the default metric to rank models on the Leaderboard. Since the rule-based system is often unproductive, which leads to a very low SAR rate, it’s reasonable to take a look at a different metric, such as the SAR rate in the top 5% of alerts in the prioritization list. The objective of the model is to assign a higher prioritization score to a high-risk alert, so it’s ideal to have a higher rate of SAR in the top tier of the prioritization score. In the example shown in the image below, the SAR rate in the top 5% of prioritization score is more than 70% (the original SAR rate is less than 10%), which indicates that the model is very effective in ranking the alerts based on the SAR risk.
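The "SAR rate in the top 5%" metric is simple to compute by hand, as the sketch below shows. The function name and the toy data are assumptions; the values are not the results reported above.

```python
def sar_rate_at_top(scores, actuals, fraction=0.05):
    """Share of true SARs among the highest-scored fraction of alerts."""
    ranked = sorted(zip(scores, actuals), key=lambda p: p[0], reverse=True)
    k = max(1, round(len(ranked) * fraction))
    return sum(a for _, a in ranked[:k]) / k


# Toy data (assumed): 100 alerts, 8 SARs, 4 of them among the top 5 scores.
scores = [i / 100 for i in range(100)]
sar_positions = {99, 98, 97, 96, 70, 55, 40, 20}
actuals = [1 if i in sar_positions else 0 for i in range(100)]

overall_rate = sum(actuals) / len(actuals)
top_rate = sar_rate_at_top(scores, actuals)
print(overall_rate, top_rate)
```

Comparing `top_rate` with `overall_rate` gives a quick read on how much the ranking concentrates SARs at the top of the list.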
DataRobot also provides flexibility for modelers when tuning hyperparameters which could also help with the class imbalance problem. In the example below, the Random Forest Classifier is tuned by enabling the balance_boostrap (a random sample with an equal amount of SAR and non-SAR alerts in each decision tree in the forest); you can see the validation score of the new ‘Balanced Random Forest Classifier’ model is slightly better than the parent model.
- You can also use Smart Downsampling (from the Advanced Options tab) to intentionally downsample the majority class (i.e., non-SAR alerts) in order to build faster models with similar accuracy.
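The general idea of downsampling the majority class can be sketched as follows. This is a generic illustration, not DataRobot's Smart Downsampling implementation: it keeps every SAR, samples a fraction of non-SAR alerts, and attaches a weight so the retained non-SAR rows still represent the original volume. The `weight` field, the sampling rate, and the toy rows are assumptions.

```python
import random


def downsample_majority(rows, target_key="SAR", majority_rate=0.2, seed=0):
    """Keep all minority rows and a weighted random sample of majority rows."""
    rng = random.Random(seed)
    minority = [r for r in rows if r[target_key] == 1]
    majority = [r for r in rows if r[target_key] == 0]
    k = max(1, round(len(majority) * majority_rate))
    kept = rng.sample(majority, k)
    # Each kept majority row stands in for len(majority) / k original rows.
    weight = len(majority) / k
    sample = [dict(r, weight=1.0) for r in minority]
    sample += [dict(r, weight=weight) for r in kept]
    return sample


# Toy alerts (assumed): 10 SARs and 90 non-SARs.
rows = [{"alert_id": i, "SAR": 1 if i < 10 else 0} for i in range(100)]
sample = downsample_majority(rows)

non_sar_weight = sum(r["weight"] for r in sample if r["SAR"] == 0)
print(len(sample), non_sar_weight)
```

The weights let a downstream model or metric account for the removed rows, so predicted probabilities are not inflated by the artificially higher SAR share in the sample.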
See the notebook here.