Anti-Money Laundering (AML) Alert Scoring

In this use case you will build a model that uses historical data, including customer and transactional information, to identify which alerts resulted in a Suspicious Activity Report (SAR). The model can then be used to assign a suspicious activity score to future alerts and improve the efficiency of an AML compliance program using rank ordering by score.

Download the sample training dataset here.

Click here to jump directly to the notebook. Otherwise, the following several paragraphs describe the business justification and problem framing for this use case.


A key pillar of any AML compliance program is to monitor transactions for suspicious activity. The scope of transactions is broad, including deposits, withdrawals, fund transfers, purchases, merchant credits, and payments. Typically, monitoring starts with a rules-based system that scans customer transactions for red flags consistent with money laundering. When a transaction matches a predetermined rule, an alert is generated and the case is referred to the bank’s internal investigation team for manual review. If the investigators conclude the behavior is indicative of money laundering, then the bank will file a Suspicious Activity Report (SAR) with FinCEN.

Unfortunately, the standard transaction monitoring system described above has costly drawbacks. In particular, the rate of false positives (cases incorrectly flagged as suspicious) generated by this rules-based system can reach 90% or more. Because the system is rules-based and rigid, it cannot dynamically learn the complex interactions and behaviors behind money laundering. The prevalence of false positives makes investigators less efficient, as they have to manually weed out cases that the rules-based system incorrectly marked as suspicious.

Compliance teams at financial institutions can have hundreds or even thousands of investigators, and the current systems prevent investigators from becoming more effective and efficient in their investigations. The cost of reviewing an alert ranges from $30 to $70. For a bank that receives 100,000 alerts a year, this is a substantial sum. In addition, penalties imposed for proven money laundering average $145 million per case. A reduction in false positives could result in savings of between $600,000 and $4.2 million per year.

Key takeaways:

  • Strategy/challenge: Help investigators focus their attention on cases that have the highest risk of money laundering while minimizing the time they spend reviewing false-positive cases.

    For banks with large volumes of daily transactions, improvements in the effectiveness and efficiency of their investigations ultimately result in fewer cases of money laundering that go unnoticed. This allows banks to enhance their regulatory compliance and reduce the volume of financial crime present within their network.

  • Business driver: Improve efficiency of AML transaction monitoring and lower operational costs.

    With its ability to dynamically learn patterns in complex data, AI significantly improves accuracy in predicting which cases will result in a SAR filing. AI models for anti-money laundering can be deployed into the review process to score and rank all new cases.

  • Model solution: Assign a suspicious activity score to each AML alert, improving the efficiency in an AML compliance program.

    Any case that exceeds a predetermined threshold of risk is sent to the investigators for manual review. Meanwhile, any case that falls below the threshold can be automatically discarded or sent to a lighter review. Once AI models are deployed into production, they can be continuously retrained on new data to capture any novel behaviors of money laundering. This data will come from the feedback of investigators.

    Specifically, the model will use rules that trigger an alert whenever a customer requests a refund of any amount, since small refund requests could be the money launderer’s way of testing the refund mechanism or trying to establish refund requests as a normal pattern for their account.

Using this notebook

The following table summarizes aspects of this use case.

Topic | Description
Use case type | Anti-money laundering (false positive reduction)
Target audience | Data Scientist, Financial Crime Compliance Team
Desired outcomes |
  • Identify which customer data and transaction activity are indicative of a high risk for potential money laundering.
  • Detect anomalous changes in behavior or nascent money laundering patterns before they spread.
  • Reduce the false positive rate for the cases selected for manual review.
Metrics/KPIs |
  • Annual alert volume
  • Cost per alert
  • False positive reduction rate
Sample dataset

Solution value

This use case builds a model that dynamically learns patterns in complex data and reduces false positive alerts. Then, financial crime compliance teams can prioritize the alerts that legitimately require manual review and dedicate more resources to those cases most likely to be suspicious. By learning from historical data to uncover patterns related to money laundering, AI also helps identify which customer data and transaction activity are indicative of a high risk for potential money laundering.

The primary issues and corresponding opportunities that this use case addresses include:

Issue | Opportunity
Potential regulatory fine | Mitigate the risk of missing suspicious activities due to lack of competency with alert investigations. Use alert scores to more effectively assign alerts: high risk alerts to more experienced investigators, low risk alerts to more junior team members.
Investigation productivity | Increase investigators' productivity by making the review process more effective and efficient, and by providing a more holistic view when assessing cases.

Calculating ROI

ROI can be calculated as follows:

Avoided potential regulatory fine + Annual alert volume * false positive reduction rate * cost per alert

A high-level measurement of the ROI equation involves two parts.

  1. The total amount of avoided potential regulatory fines will vary depending on the nature of the bank and must be estimated on a case-by-case basis.

  2. The second part of the equation is where AI can have a tangible impact on improving investigation productivity and reducing operational costs. Consider this example:

    • A bank generates 100,000 AML alerts every year.
    • DataRobot achieves a 70% false positive reduction rate without losing any historical suspicious activities.
    • The average cost per alert is $30~$70.

    Result: The annual ROI of implementing the solution will be 100,000 * 70% * ($30~$70) = $2.1MM~$4.9MM.
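The arithmetic in this example can be reproduced with a short sketch. The function and variable names are illustrative and not part of DataRobot; the avoided regulatory fine is left as an input because it must be estimated case by case.

```python
def annual_alert_savings(annual_alert_volume, fp_reduction_rate, cost_per_alert):
    """Operational savings from alerts that no longer require manual review."""
    return annual_alert_volume * fp_reduction_rate * cost_per_alert


def annual_roi(avoided_fine, annual_alert_volume, fp_reduction_rate, cost_per_alert):
    """ROI = avoided potential regulatory fine + alert-review savings."""
    return avoided_fine + annual_alert_savings(
        annual_alert_volume, fp_reduction_rate, cost_per_alert
    )


# Figures from the example above: 100,000 alerts, 70% reduction, $30 to $70 per alert.
low_savings = annual_alert_savings(100_000, 0.70, 30)
high_savings = annual_alert_savings(100_000, 0.70, 70)
print(f"Savings range: ${low_savings:,.0f} to ${high_savings:,.0f}")

# The avoided fine is set to 0 here only because it is bank-specific.
roi_low = annual_roi(0, 100_000, 0.70, 30)
```

Passing a non-zero `avoided_fine` adds the first part of the equation once the bank has an estimate for it.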

Work with data

The linked synthetic dataset illustrates a credit card company’s AML compliance program. Specifically, the model detects the following money-laundering scenarios:

  • Customer spends on the card but overpays their credit card bill and seeks a cash refund for the difference.
  • Customer receives credits from a merchant without offsetting transactions and either spends the money or requests a cash refund from the bank.

The unit of analysis in this dataset is an individual alert, meaning that a rule-based engine is already in place and produces an alert whenever it detects potentially suspicious activity consistent with the above scenarios.

Problem framing

The target variable for this use case is whether or not the alert resulted in a SAR after manual review by investigators, making this a binary classification problem. The unit of analysis is an individual alert—the model will be built on the alert level—and each alert will receive a score ranging from 0 to 1. The score indicates the probability of being a SAR.

The goal of applying a model to this use case is to lower the false positive rate, which means resources are not spent reviewing cases that are eventually determined to not be suspicious after an investigation.

In this use case, the false positive rate of the rules engine on the validation sample (1600 records) is:

Number of SAR=0 records divided by the total number of records = 1436/1600 ≈ 90%.
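As a minimal sketch, the same ratio can be computed from a list of 0/1 SAR labels. The function name is hypothetical, and the labels below are synthetic, built only to match the counts quoted above (1436 non-SAR alerts out of 1600).

```python
def rules_engine_false_positive_rate(sar_labels):
    """Share of alerts that did not result in a SAR (label 0)."""
    non_sar = sum(1 for label in sar_labels if label == 0)
    return non_sar / len(sar_labels)


# Synthetic labels matching the validation sample counts: 1436 non-SAR, 164 SAR.
validation_labels = [0] * 1436 + [1] * 164
fp_rate = rules_engine_false_positive_rate(validation_labels)
print(f"{fp_rate:.2%}")
```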

Data preparation

Consider the following when working with data:

  • Define the scope of analysis: Collect alerts from a specific analytical window to start with; it’s recommended that you use 12–18 months of alerts for model building.

  • Define the target: Depending on the investigation processes, the target definition could be flexible. In this walkthrough, alerts are classified as Level1, Level2, Level3, and Level3-confirmed. These labels indicate the level of the investigation at which the alert was closed; only Level3-confirmed alerts were confirmed as a SAR. To create a binary target, treat Level3-confirmed as SAR (denoted by 1) and the remaining levels as non-SAR alerts (denoted by 0).

  • Consolidate information from multiple data sources: Below is a sample entity-relationship diagram indicating the relationship between the data tables used for this use case.

Some features are static information, such as kyc_risk_score and state of residence. These can be fetched directly from the reference tables.

For transaction behavior and payment history, the information will be derived from a specific time window prior to the alert generation date. This case uses 90 days as the time window to obtain the dynamic customer behavior, such as nbrPurchases90d, avgTxnSize90d, or totalSpend90d.
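The target definition and the 90-day aggregation described above might be sketched as follows. This is an illustration only: the record layout (dictionaries with `date` and `amount` keys) and the function names are assumptions, not the schema of the sample dataset. Only the label values and the three feature names come from the text.

```python
from datetime import date, timedelta

WINDOW_DAYS = 90  # look-back window used in this use case


def binary_sar_target(closure_level):
    """Map the investigation closure level to a binary target (1 = SAR)."""
    return 1 if closure_level == "Level3-confirmed" else 0


def purchase_features_90d(alert_date, purchases):
    """Aggregate purchases made in the 90 days before the alert date.

    `purchases` is a list of dicts with hypothetical keys "date" and "amount".
    Only transactions strictly before the alert date are used, so the
    features never include information from after the alert was generated.
    """
    window_start = alert_date - timedelta(days=WINDOW_DAYS)
    amounts = [
        p["amount"] for p in purchases if window_start <= p["date"] < alert_date
    ]
    count = len(amounts)
    total = sum(amounts)
    return {
        "nbrPurchases90d": count,
        "totalSpend90d": total,
        "avgTxnSize90d": total / count if count else 0.0,
    }


purchases = [
    {"date": date(2023, 1, 10), "amount": 120.0},
    {"date": date(2022, 12, 1), "amount": 80.0},
    {"date": date(2022, 9, 1), "amount": 500.0},  # older than 90 days, excluded
]
features = purchase_features_90d(date(2023, 1, 31), purchases)
target = binary_sar_target("Level3-confirmed")
```

Restricting the window to dates before the alert is a design choice made here to avoid target leakage; the source does not spell out the boundary handling.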

Below is an example of one row in the training data after it is merged and aggregated (it is broken into multiple lines for easier viewing).

Features and sample data

The features in the sample dataset consist of KYC (Know-Your-Customer) information, demographic information, transactional behavior, and free-form text information from the customer service representatives’ notes. To apply this use case in your organization, your dataset should contain, minimally, the following features:

  • Alert ID
  • Binary classification target (SAR/no-SAR, 1/0, True/False, etc.)
  • Date/time of the alert
  • "Know Your Customer" score used at time of account opening
  • Account tenure, in months
  • Total merchant credit in the last 90 days
  • Number of refund requests by the customer in the last 90 days
  • Total refund amount in the last 90 days

Other helpful features to include are:

  • Annual income
  • Credit bureau score
  • Number of credit inquiries in the past year
  • Number of logins to the bank website in the last 90 days
  • Indicator that the customer owns a home
  • Maximum revolving line of credit
  • Number of purchases in the last 90 days
  • Total spend in the last 90 days
  • Number of payments in the last 90 days
  • Number of cash-like payments (e.g., money orders) in last 90 days
  • Total payment amount in last 90 days
  • Number of distinct merchants purchased at in the last 90 days
  • Customer Service Representative notes and codes based on conversations with customer (cumulative)

Implementation risks

When operationalizing this use case, consider the following, which may impact outcomes and require model re-evaluation:

  • Change in the transactional behavior of the money launderers.
  • Novel information introduced to the transaction and customer records that the machine learning models have not seen.

Predict and deploy

Once you identify the model that best learns patterns in your data to predict SARs, DataRobot makes it easy to deploy the model into your alert investigation process. This is a critical step for implementing the use case, as it ensures that predictions are used in the real world to reduce false positives and improve efficiency in the investigation process. The following sections describe activities related to preparing and then deploying a model.

The following applications of the alert-prioritization score from the false positive reduction model both automate and augment the existing rule-based transaction monitoring system.

  • If the FCC (Financial Crime Compliance) team is comfortable with removing the low-risk alerts (very low prioritization score) from the scope of investigation, then the binary threshold selected during the model building stage will be used as the cutoff to remove those no-risk alerts. The investigation team will only investigate alerts above the cutoff, which will still capture all the SARs based on what was learned from the historical data.

  • Often regulatory agencies will consider auto-closure or auto-removal as an aggressive treatment to production alerts. If auto-closing is not the ideal way to use the model output, the alert prioritization score can still be used to triage alerts into different investigation processes, hence improving the operational efficiency.
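One way to reason about the first option is to find the highest cutoff that would still have kept every historical SAR, then measure how many alerts fall below it. The sketch below is an assumption about how such a cutoff could be derived from scored historical alerts; it is not DataRobot's threshold-selection procedure, and the scores and labels are made up for illustration.

```python
def highest_cutoff_keeping_all_sars(scores, labels):
    """Largest cutoff such that every historical SAR scores at or above it."""
    sar_scores = [s for s, y in zip(scores, labels) if y == 1]
    if not sar_scores:
        raise ValueError("no SAR alerts in the historical sample")
    return min(sar_scores)


def share_of_alerts_removed(scores, cutoff):
    """Fraction of alerts scoring below the cutoff (candidates for removal)."""
    return sum(1 for s in scores if s < cutoff) / len(scores)


# Illustrative scores and outcomes for eight historical alerts.
scores = [0.02, 0.05, 0.10, 0.35, 0.40, 0.62, 0.81, 0.90]
labels = [0, 0, 0, 1, 0, 0, 1, 1]

cutoff = highest_cutoff_keeping_all_sars(scores, labels)
removed = share_of_alerts_removed(scores, cutoff)
print(cutoff, removed)
```

A cutoff derived this way only reflects historical data, so in practice it would likely be set more conservatively and revisited as new investigator feedback arrives.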

Deep dive: Imbalanced targets

In AML and Transaction Monitoring, the SAR rate is usually very low (1%–5%, depending on the detection scenarios); sometimes it could be even lower than 1% in extremely unproductive scenarios. In machine learning, such a problem is called class imbalance. The question becomes, how can you mitigate the risk of class imbalance and let the machine learn as much as possible from the limited known-suspicious activities?

DataRobot offers different techniques to handle class imbalance problems. Some techniques:

  • Evaluate the model with different metrics. For binary classification (the false positive reduction model here, for example), LogLoss is the default metric used to rank models on the Leaderboard. Because the rule-based system is often unproductive, which leads to a very low SAR rate, it’s reasonable to look at a different metric, such as the SAR rate in the top 5% of alerts in the prioritization list. The objective of the model is to assign a higher prioritization score to a higher-risk alert, so it’s ideal to have a higher SAR rate in the top tier of prioritization scores. In the example shown in the image below, the SAR rate in the top 5% of prioritization scores is more than 70% (the original SAR rate is less than 10%), which indicates that the model is very effective at ranking alerts by SAR risk.

  • DataRobot also gives modelers flexibility when tuning hyperparameters, which can also help with the class imbalance problem. In the example below, the Random Forest Classifier is tuned by enabling balance_boostrap (randomly sampling an equal number of SAR and non-SAR alerts for each decision tree in the forest); the validation score of the new ‘Balanced Random Forest Classifier’ model is slightly better than that of the parent model.

  • You can also use Smart Downsampling (from the Advanced Options tab) to intentionally downsample the majority class (i.e., non-SAR alerts) in order to build faster models with similar accuracy.
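The "SAR rate in the top 5%" metric from the first technique can be computed outside the platform from any list of scores and outcomes. The following is a minimal sketch with a hypothetical function name and synthetic data; it is not the DataRobot implementation of the metric.

```python
def sar_rate_in_top_fraction(scores, labels, fraction=0.05):
    """SAR rate among the highest-scored `fraction` of alerts."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_n = max(1, round(len(ranked) * fraction))  # always inspect at least one alert
    top = ranked[:top_n]
    return sum(label for _, label in top) / top_n


# Synthetic example: 40 alerts with distinct scores, 3 of which are SARs.
scores = [i / 40 for i in range(40)]
labels = [0] * 40
for sar_index in (39, 37, 20):
    labels[sar_index] = 1

overall_rate = sum(labels) / len(labels)
top_rate = sar_rate_in_top_fraction(scores, labels, fraction=0.05)
print(overall_rate, top_rate)
```

Comparing `top_rate` with `overall_rate` shows how strongly the ranking concentrates SARs at the top of the list, which is the property the text asks the model to have.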

Deep dive: Decision process

A review process typically consists of a deep-dive analysis by investigators. The data related to the case is made available for review so that the investigators can develop a 360-degree view of the customer, including their profile, demographic, and transaction history. Additional data from third-party data providers, and web crawling, can supplement this information to complete the picture.

For transactions that do not get auto-closed or auto-removed, the model can help the compliance team create a more effective and efficient review process by triaging their reviews. The predictions and their explanations also give investigators a more holistic view when assessing cases.

Risk-based alert triage

Based on the prioritization score, the investigation team could take different investigation strategies. For example:

  • No-risk or low-risk alerts can be reviewed on a quarterly basis, instead of monthly. The frequently alerted entities without any SAR risk can then be reviewed once every three months, which will significantly reduce the time spent on investigation.

  • High-risk alerts with higher prioritization scores can have their investigation fast-tracked to the final stage in the alert escalation path. This will significantly reduce the effort spent on level 1 and level 2 investigation.

  • Medium-risk alerts can use the standard investigation process.

Smart alert assignment

For an alert investigation team that is geographically dispersed, the alert prioritization score can be used to assign alerts to different teams in a more effective manner. High-risk alerts can be assigned to the team with the most experienced investigators, while low-risk alerts can be handled by a less experienced team. This mitigates the risk of missing suspicious activities due to lack of competency with alert investigations.

For both approaches, the definition of high/medium/low risk could be either a set of hard thresholds (for example, High: score>=0.5, Medium: 0.5>score>=0.3, Low: score<0.3), or based on the percentile of the alert scores on a monthly basis (for example, High: above 80th percentile, Medium: between 50th and 80th percentile, Low: below 50th percentile).
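Both tiering schemes can be sketched in a few lines. The function names are illustrative, and the treatment of scores that fall exactly on a percentile boundary is an assumption, since the text does not specify it.

```python
from bisect import bisect_right


def tier_by_threshold(score, high=0.5, medium=0.3):
    """Hard thresholds: High >= 0.5, Medium in [0.3, 0.5), Low < 0.3."""
    if score >= high:
        return "High"
    if score >= medium:
        return "Medium"
    return "Low"


def tiers_by_percentile(scores, high_pct=80, medium_pct=50):
    """Tier each alert by its percentile rank within the batch (e.g. one month).

    The percentile rank is the share of alerts scoring at or below the alert.
    Integer comparisons avoid floating-point noise at the boundaries.
    """
    ordered = sorted(scores)
    n = len(ordered)
    tiers = []
    for score in scores:
        at_or_below = bisect_right(ordered, score)
        if at_or_below * 100 > high_pct * n:
            tiers.append("High")
        elif at_or_below * 100 > medium_pct * n:
            tiers.append("Medium")
        else:
            tiers.append("Low")
    return tiers


monthly_scores = [0.9, 0.1, 0.4, 0.7, 0.2, 0.6, 0.3, 0.8, 0.5, 0.95]
threshold_tiers = [tier_by_threshold(s) for s in monthly_scores]
percentile_tiers = tiers_by_percentile(monthly_scores)
```

Hard thresholds keep the meaning of a tier stable over time, while percentile tiers keep the workload per tier stable from month to month; which matters more depends on how the investigation team is staffed.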


See the notebook here.

Updated February 1, 2023