Reduce False Positives for Anti Money Laundering (AML)

Risk-prioritize alerts generated from rule-based transaction monitoring systems to reduce the number of false-positive alerts and increase the efficiency of the alert investigation process.

Business Problem

A key pillar of any AML compliance program is to monitor transactions for suspicious activity. The scope of transactions is broad, including deposits, withdrawals, fund transfers, purchases, merchant credits, and payments. Typically, monitoring starts with a rules-based system that scans customer transactions for red flags consistent with money laundering. When a transaction matches a predetermined rule, an alert is generated and the case is referred to the bank’s internal investigation team for manual review. If the investigators conclude the behavior is indicative of money laundering, then the bank will file a Suspicious Activity Report (SAR) with FinCEN.

Unfortunately, the standard transaction monitoring system described above has costly drawbacks. In particular, the rate of false positives (cases incorrectly flagged as suspicious) generated by this rules-based system can reach 90% or more. Since the system is rules-based and rigid, it cannot dynamically learn the complex interactions and behaviors behind money laundering. The prevalence of false positives makes investigators less efficient, as they have to manually weed out cases that the rules-based system incorrectly marked as suspicious. Compliance teams at financial institutions can have hundreds or even thousands of investigators, and the current systems meant to help these investigators instead keep them from becoming more effective and efficient in their investigations.

Intelligent Solution

With its ability to dynamically learn patterns in complex data, AI can significantly improve accuracy in predicting which cases will result in a SAR filing. AI models for anti-money laundering can be deployed into the review process to score and rank all new cases. Any case that exceeds a predetermined threshold of risk is sent to the investigators for manual review. Meanwhile, any case that falls below the threshold can be automatically discarded or sent to a lighter review. Once AI models are deployed into production, they can be continuously retrained on new data to capture any novel behaviors of money laundering. This data will come from the feedback investigators provide.

In general, AI helps investigators focus their attention on cases that have the highest risk of money laundering while minimizing the time they spend reviewing false-positive cases. For banks with large volumes of daily transactions, improvements in the effectiveness and efficiency of their investigations ultimately result in fewer cases of money laundering going unnoticed. This allows banks to enhance their regulatory compliance and reduce the volume of financial crime present within their network.

Value Estimation

How would I measure ROI for my use case?

ROI = Avoided potential regulatory fine + (Annual alert volume * False positive reduction rate * Cost per alert)

A high-level measurement of ROI involves two parts. The first part, the total amount of avoided regulatory fines, will vary depending on the nature of the bank and will need to be estimated on a case-by-case basis. The second part of the equation is where AI can have a tangible impact on improving investigation productivity and reducing operational costs. For example, if a bank generates 100,000 AML alerts every year, DataRobot achieves a 70% false positive reduction rate without missing any historical suspicious activity, and the average cost per alert is $30 to $70, then the annual ROI from this second part will be 100,000 * 70% * ($30 to $70) = $2.1MM to $4.9MM.
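The arithmetic above can be reproduced with a short sketch. The function names are hypothetical and the avoided-fine term is left as an input, since the text says it must be estimated case by case.

```python
def annual_alert_savings(alert_volume, fp_reduction_rate, cost_per_alert):
    """Operational savings from alerts that no longer need manual review."""
    return alert_volume * fp_reduction_rate * cost_per_alert


def annual_roi(avoided_fine, alert_volume, fp_reduction_rate, cost_per_alert):
    """ROI = avoided regulatory fine + alert-handling savings."""
    return avoided_fine + annual_alert_savings(
        alert_volume, fp_reduction_rate, cost_per_alert
    )


# Figures from the example: 100,000 alerts, 70% reduction, $30 to $70 per alert.
low = annual_alert_savings(100_000, 0.70, 30)
high = annual_alert_savings(100_000, 0.70, 70)
print(f"Savings range: ${low:,.0f} to ${high:,.0f}")
```

Passing an estimated fine as `avoided_fine` to `annual_roi` gives the full two-part figure.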

Technical Implementation

About the Data

For illustrative purposes, in this tutorial we are using a synthetic dataset that illustrates a credit card company’s AML compliance program, specifically with the intent of detecting the following money-laundering scenarios:

  • Customer spends on the card but overpays their credit card bill and seeks a cash refund for the difference.
  • Customer receives credits from a merchant without offsetting transactions and either spends the money or requests a cash refund from the bank.

A rule-based engine is in place to generate alerts for potentially suspicious activity consistent with the above scenarios. The rule triggers an alert whenever a customer requests a refund of any amount, since small refund requests could be the money launderer’s way of testing the refund mechanism or of establishing refund requests as a normal pattern for their account.

Problem Framing

The target variable for this use case is whether or not the alert resulted in a SAR after manual review by investigators. Selecting this variable as target makes this a binary classification problem. The unit of analysis is an individual alert, which means the model will be built on the alert level and each alert will get a score ranging from 0 to 1 which indicates the probability of being a SAR.

The False Positive Rate of the rules engine on the Validation sample is the number of alerts with SAR=0 divided by the total number of alerts: 1436/1600, or approximately 90%. The goal of applying a model to this use case is to lower the false positive rate, so that resources are not spent reviewing cases that are eventually determined not to be suspicious after an investigation.

The list below represents the features used for the false positive reduction use case. It consists of KYC (Know-Your-Customer) information, demographic information, transactional behavior, and even free-form text information from the customer service representatives’ notes. 

Sample Feature List
| Feature Name | Data Type | Description | Data Source | Example Value |
| --- | --- | --- | --- | --- |
| ALERT | Binary | Alert Indicator | tbl_alert | 1 |
| SAR | Binary (Target) | SAR Indicator (Binary Target) | tbl_alert | 0 |
| kycRiskScore | Numeric | Account relationship (Know Your Customer) score used at time of account opening | tbl_customer | 2 |
| income | Numeric | Annual income | tbl_customer | 32600 |
| tenureMonths | Numeric | Account tenure in months | tbl_customer | 13 |
| creditScore | Numeric | Credit bureau score | tbl_customer | 780 |
| state | Categorical | Account billing address state | tbl_account | VT |
| nbrPurchases90d | Numeric | Number of purchases in last 90 days | tbl_transaction | 4 |
| avgTxnSize90d | Numeric | Average transaction size in last 90 days | tbl_transaction | 28.61 |
| totalSpend90d | Numeric | Total spend in last 90 days | tbl_transaction | 114.44 |
| csrNotes | Text | Customer Service Representative notes and codes based on conversations with customer (cumulative) | tbl_customer_misc | call back password call back card password replace atm call back |
| nbrDistinctMerch90d | Numeric | Number of distinct merchants purchased at in last 90 days | tbl_transaction | 1 |
| nbrMerchCredits90d | Numeric | Number of credits from merchants in last 90 days | tbl_transaction | 0 |
| | Numeric | Number of credits from merchants in round dollar amounts in last 90 days | tbl_transaction | 0 |
| totalMerchCred90d | Numeric | Total merchant credit amount in last 90 days | tbl_transaction | 0 |
| | Numeric | Number of merchant credits without an offsetting purchase in last 90 days | tbl_transaction | 0 |
| nbrPayments90d | Numeric | Number of payments in last 90 days | tbl_transaction | 3 |
| totalPaymentAmt90d | Numeric | Total payment amount in last 90 days | tbl_account_bill | 114.44 |
| overpaymentAmt90d | Numeric | Total amount overpaid in last 90 days | tbl_account_bill | 0 |
| overpaymentInd90d | Numeric | Indicator that account was overpaid in last 90 days | tbl_account_bill | 0 |
| nbrCustReqRefunds90d | Numeric | Number refund requests by the customer in last 90 days | tbl_transaction | 1 |
| indCustReqRefund90d | Binary | Indicator that customer requested a refund in last 90 days | tbl_transaction | 1 |
| totalRefundsToCust90d | Numeric | Total refund amount in last 90 days | tbl_transaction | 56.01 |
| nbrPaymentsCashLike90d | Numeric | Number of cash like payments (e.g., money orders) in last 90 days | tbl_transaction | 0 |
| maxRevolveLine | Numeric | Maximum revolving line of credit | tbl_account | 14000 |
| indOwnsHome | Numeric | Indicator that the customer owns a home | tbl_transaction | 1 |
| nbrInquiries1y | Numeric | Number of credit inquiries in the past year | tbl_transaction | 0 |
| nbrCollections3y | Numeric | Number of collections in the past year | tbl_collection | 0 |
| nbrWebLogins90d | Numeric | Number of logins to the bank website in the last 90 days | tbl_account_login | 7 |
| nbrPointRed90d | Numeric | Number of loyalty point redemptions in the last 90 days | tbl_transaction | 2 |
| PEP | Binary | Politically Exposed Person indicator | tbl_customer | 0 |

Data Preparation
  • Define the scope of analysis: Collect alerts from a specific analytical window to start with; it’s recommended that you use 12–18 months of alerts for model building.
  • Define the target: Depending on the investigation processes, the target definition could be flexible. In this walkthrough, alerts are classified as ‘Level1’, ‘Level2’, ‘Level3’, and ‘Level3-confirmed’, which indicates the level of the investigation at which the alert was closed (‘Level3-confirmed’ meaning confirmed as a SAR). To create the binary target, we treat ‘Level3-confirmed’ as SAR (denoted by 1) and the remaining levels as non-SAR alerts (denoted by 0).
  • Consolidate information from multiple data sources: Below is a sample entity-relationship diagram indicating the relationship between the data tables used for this use case.

Some features are static information, such as kyc_risk_score and state of residence, which can be fetched directly from the reference tables.

Transaction behavior and payment history are derived from a specific time window prior to the alert generation date. In this case, we use a 90-day window to capture dynamic customer behavior, such as nbrPurchases90d, avgTxnSize90d, or totalSpend90d.
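As a rough illustration of the target definition and the 90-day aggregation described above, here is a minimal sketch in plain Python. The function names, the input layout (a list of date and amount pairs per customer), and the choice to exclude the alert date itself from the window are all assumptions made for the example, not details from the walkthrough.

```python
from datetime import date, timedelta


def sar_target(closure_level):
    """Binary target: 1 only for alerts closed as 'Level3-confirmed'."""
    return 1 if closure_level == "Level3-confirmed" else 0


def window_features(purchases, alert_date, window_days=90):
    """Aggregate one customer's purchases in the window before the alert.

    purchases: list of (purchase_date, amount) tuples (hypothetical layout).
    The window covers [alert_date - window_days, alert_date); excluding the
    alert date itself is an assumption.
    """
    start = alert_date - timedelta(days=window_days)
    amounts = [amt for d, amt in purchases if start <= d < alert_date]
    total = sum(amounts)
    n = len(amounts)
    return {
        f"nbrPurchases{window_days}d": n,
        f"avgTxnSize{window_days}d": total / n if n else 0.0,
        f"totalSpend{window_days}d": total,
    }


purchases = [
    (date(2023, 1, 5), 40.00),   # falls outside the 90-day window
    (date(2023, 3, 1), 20.00),
    (date(2023, 4, 10), 30.00),
]
features = window_features(purchases, alert_date=date(2023, 5, 1))
print(features, sar_target("Level3-confirmed"))
```

In practice this aggregation would typically be done in SQL or a dataframe library against the source tables; the sketch only shows the windowing logic.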

Below is an example of one row in the training data after it is merged and aggregated (broken into multiple lines for easier visualization).

Model Training

DataRobot Automated Machine Learning automates many parts of the modeling pipeline. Instead of having to hand-code and manually test dozens of models to find the one that best fits your needs, DataRobot automatically runs dozens of models and finds the best performing one for you, all in a matter of minutes. In addition to training the models, DataRobot automates other steps in the modeling process such as processing and partitioning the dataset. 

We will jump straight to interpreting the model results. Take a look here to see how to use DataRobot from start to finish and how to understand the data science methodologies embedded in its automation.

While DataRobot is running Autopilot to find the champion model, you can go to Data tab > Feature Associations to view the feature association matrix and understand the correlations between each pair of the input features. For example as shown in the following image, the features nbrPurchases90d and nbrDistinctMerch90d (top-left corner) have strong associations and are therefore ‘clustered’ together (where each color block in this matrix is a cluster).

Interpret Results

After automated modeling is completed, the Leaderboard ranks the models (in order of recommendation for deployment) so you can evaluate them and select the one you want to use. By default, DataRobot uses LogLoss as the evaluation metric to measure how well each model distinguishes SAR from non-SAR alerts.

To reduce false positives, you can choose other metrics like Gini Norm to sort the Leaderboard based on how good the models are at giving SAR a higher rank than the non-SAR alerts. 
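For intuition about what a ranking metric like Gini Norm rewards, the sketch below uses the common identity for a binary target, normalized Gini = 2 * AUC - 1, where AUC is the probability that a randomly chosen SAR scores higher than a randomly chosen non-SAR. Whether DataRobot computes its Gini Norm in exactly this way is an assumption; the function names and the toy data are illustrative.

```python
def auc(labels, scores):
    """Share of (SAR, non-SAR) pairs where the SAR scores higher; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gini_norm(labels, scores):
    """Normalized Gini for a binary target: 0 is random ranking, 1 is perfect."""
    return 2 * auc(labels, scores) - 1


labels = [0, 0, 1, 0, 1]                    # 1 = alert became a SAR
scores = [0.02, 0.10, 0.35, 0.40, 0.90]     # model-predicted SAR risk
print(round(gini_norm(labels, scores), 3))
```

Unlike LogLoss, this metric depends only on the ordering of the scores, which is why it suits an alert-prioritization use case.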

Feature Impact (for a specific model, select Understand > Feature Impact) reveals the association between each feature and the target. The RandomForest Classifier (Gini) model (in the following image) is labeled ‘Fast & Accurate’. DataRobot identifies the three most impactful features (which enable the machine to differentiate SAR from non-SAR alerts) as: total merchant credit in the last 90 days, number of refund requests by the customer in the last 90 days, and total refund amount in the last 90 days.

In order to understand the direction of impact and the SAR risk at different levels of the input feature, DataRobot provides partial dependence graphs (within the Feature Effects tab) to depict how the likelihood of being a SAR is changing when the input feature takes different values. In this example, the total merchant credit amount in the last 90 days is the most impactful feature, but the SAR risk is not linearly increasing when the amount increases; when the amount is below $1000, the SAR risk stays on a relatively low level and surges significantly when the amount is above $1000. The SAR risk increase slows down when the amount is around $1500 and then tilts again until it hits the peak and plateaus out at around $2200. By looking at the partial dependence graph, it’s very straightforward to interpret the SAR risk at different levels of the input features, which could also be converted to a data-driven framework to set up risk-based thresholds to augment the traditional rule-based system.

In order to turn the machine-made decisions into human interpretable rationale, DataRobot provides Prediction Explanations for each alert scored and prioritized by the machine learning model. In the example shown in the image below, the record with ID=1269 has a very high likelihood of being a suspicious activity (prediction=90.2%), and the top three main reasons are:

  • Total merchant credit amount in the last 90 days is significantly greater than the others.
  • Total spend in the last 90 days is much higher than average.
  • Total payment amount in the last 90 days is much higher than average.

Prediction Explanations can also be used to cluster alerts into subgroups with different types of transactional behaviors, which could help triage alerts to different investigation approaches.

Evaluate Accuracy

The Lift Chart shows you how effective the model is at separating the SAR and non-SAR alerts. After an alert in the out-of-sample partition is scored by the trained model, it is assigned a risk score that measures the likelihood of the alert becoming a SAR. In the Lift Chart, alerts are sorted by SAR risk, split into 10 deciles, and displayed from lowest to highest. For each decile, DataRobot computes the average predicted SAR risk (blue plus) as well as the average actual SAR rate (orange circle) and plots the two lines together. For the champion model we built for the false positive reduction use case, the SAR rate of the top decile is 55%, which is a significant lift from the ~10% SAR rate in the training data. The top three deciles capture almost all SARs, which means that the 70% of alerts with very low predicted SAR risk rarely result in a SAR.
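The decile calculation behind a lift chart can be sketched as follows. This is a simplified illustration, not DataRobot's implementation; `lift_table` and the toy data are hypothetical.

```python
def lift_table(labels, scores, n_bins=10):
    """Sort alerts by predicted risk, split into equal-sized bins, and return
    (mean predicted risk, actual SAR rate) per bin, lowest risk first."""
    ranked = sorted(zip(scores, labels))
    n = len(ranked)
    rows = []
    for b in range(n_bins):
        chunk = ranked[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:
            continue  # fewer alerts than bins
        mean_pred = sum(s for s, _ in chunk) / len(chunk)
        sar_rate = sum(y for _, y in chunk) / len(chunk)
        rows.append((mean_pred, sar_rate))
    return rows


# Toy data: 20 alerts, with the two highest-scored alerts being SARs.
scores = [i / 20 for i in range(20)]
labels = [1 if i >= 18 else 0 for i in range(20)]

rows = lift_table(labels, scores)
base_rate = sum(labels) / len(labels)
top_decile_lift = rows[-1][1] / base_rate
print(rows[-1], top_decile_lift)
```

Dividing the top-decile SAR rate by the overall SAR rate gives the lift figure discussed above.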

Now that we know the model is performing very well, we have to select an explicit threshold to make a binary decision based on the continuous SAR risk predicted by DataRobot. To pick the optimal threshold, there are three important criteria:

  • The false negative rate has to be as small as possible. False negatives are the alerts that DataRobot determines are not SARs which then turn out to be true SARs. Missing a true SAR is very dangerous and could result in an MRA (matter requiring attention) or a regulatory fine. Here we take a conservative approach and require a false negative rate of 0, meaning all true SARs are captured. To achieve this, the threshold has to be low enough to capture all the SARs.
  • We also want to keep the alert volume as low as possible to reduce enough false positives. In this context, all alerts generated in the past that are not SARs are the de-facto false positives. The machine learning model is likely to assign a lower score to those non-SAR alerts; therefore we want to pick a high-enough threshold to reduce as many false positive alerts as possible.
  • We also want to ensure the selected threshold is not only working on the seen data, but also on the unseen data, so that when the model gets deployed to the transaction monitoring system for on-going scoring, it could still reduce false positives without missing any SARs.

By trying different choices of thresholds using the cross-validation data (the data used for model training and validation), we decided 0.03 is the optimal threshold since it satisfies the first two criteria. On one hand, the false negative rate is 0; on the other hand, the alert volume is reduced from 8000 to 2142, which means we reduce false positive alerts by 73% (5858/8000) without missing any SAR.

How about the third criterion? Does the threshold also work on unseen alerts? We can quickly validate it in DataRobot. By changing the data selection to Holdout and applying the same threshold (0.03), the false negative rate remains 0 and the false positive reduction rate is still 73% (1457/2000). This indicates that the model generalizes well and should perform as expected on unseen data.
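The threshold search described above can be sketched as follows: the highest cutoff that still captures every known SAR is the lowest score assigned to any SAR, and the same cutoff is then checked on held-out alerts. The function names and the tiny datasets are hypothetical, and treating alerts scoring at or above the cutoff as retained is an assumption.

```python
def evaluate_threshold(labels, scores, threshold):
    """Alerts scoring below the threshold are removed from manual review."""
    removed = [y for y, s in zip(labels, scores) if s < threshold]
    total_sars = sum(labels)
    return {
        "alerts_removed": len(removed),
        "reduction_rate": len(removed) / len(labels),
        "false_negative_rate": sum(removed) / total_sars if total_sars else 0.0,
    }


def highest_zero_miss_threshold(labels, scores):
    """Largest cutoff that keeps every known SAR: the minimum SAR score."""
    return min(s for y, s in zip(labels, scores) if y == 1)


# Hypothetical cross-validation and holdout samples (1 = SAR).
cv_labels = [0, 0, 0, 0, 1, 0, 1, 0]
cv_scores = [0.01, 0.02, 0.01, 0.05, 0.03, 0.02, 0.60, 0.10]
holdout_labels = [0, 1, 0, 0]
holdout_scores = [0.01, 0.45, 0.02, 0.08]

threshold = highest_zero_miss_threshold(cv_labels, cv_scores)
cv_result = evaluate_threshold(cv_labels, cv_scores, threshold)
holdout_result = evaluate_threshold(holdout_labels, holdout_scores, threshold)
print(threshold, cv_result, holdout_result)
```

A cutoff chosen this way is fitted to the known SARs, so the holdout check is what shows whether it transfers to unseen alerts.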

If the bank has a specific risk tolerance for missing a small portion of historical SARs, they can also use the Payoff Matrix to select the optimal threshold for the binary cutoff. Here is an example: by setting the gain per true negative to $50 (cost reduction per alert) and the loss per false negative to $1000 (potential regulatory fine per missed SAR), the threshold is optimized at 0.0619, which gives the highest ROI of about $300k out of 8000 alerts. By setting this threshold, the bank will reduce false positives by 74.3% (5940/8000) at the risk of missing only 3 SARs.
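A payoff-based threshold search can be sketched like this. It is a simplified stand-in for the Payoff Matrix feature, using only the two entries named above (gain per true negative, loss per false negative); the function names, candidate thresholds, and toy data are hypothetical.

```python
def payoff(labels, scores, threshold, gain_per_tn=50.0, loss_per_fn=1000.0):
    """Net payoff of removing every alert that scores below the threshold."""
    tn = sum(1 for y, s in zip(labels, scores) if s < threshold and y == 0)
    fn = sum(1 for y, s in zip(labels, scores) if s < threshold and y == 1)
    return tn * gain_per_tn - fn * loss_per_fn


def best_threshold(labels, scores, candidates, **payoff_kwargs):
    """Candidate cutoff with the highest payoff (first one wins on ties)."""
    return max(candidates, key=lambda t: payoff(labels, scores, t, **payoff_kwargs))


labels = [0, 0, 1, 0, 0, 1]
scores = [0.01, 0.02, 0.04, 0.05, 0.06, 0.50]
best = best_threshold(labels, scores, candidates=[0.03, 0.07, 0.60])
print(best, payoff(labels, scores, best))

# Rough check of the figures in the text, assuming the 5940 removed alerts
# include the 3 missed SARs (so 5937 true negatives).
source_check = 5937 * 50 - 3 * 1000
print(source_check)
```

Under that assumption the payoff comes to $293,850, consistent with the roughly $300k quoted above.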

Advanced Tuning for Class Imbalanced Target:

In AML and Transaction Monitoring, the SAR rate is usually very low (1%–5%, depending on the detection scenarios); sometimes it could be even lower than 1% for the extremely unproductive scenarios. In machine learning, such a problem is called class imbalance. How could we mitigate the risk of class imbalance and let the machine learn as much as possible from the limited known-suspicious activities?

DataRobot offers different techniques to handle class imbalance problems. Below are some examples of advanced tuning techniques:

  • Evaluate the model with different metrics. In DataRobot you can find many evaluation metrics for the models you build. For binary classification, like the false positive reduction model we are building here, LogLoss is used as the default metric to rank models on the Leaderboard. Since the rule-based system is often unproductive, which leads to a very low SAR rate, it’s reasonable to look at a different metric, such as the SAR rate in the top 5% of alerts in the prioritization list. The objective of the model is to assign higher prioritization scores to higher-risk alerts, so it’s ideal to have a higher SAR rate in the top tier of the prioritization score. In the example shown in the image below, the SAR rate in the top 5% of prioritization scores is more than 70% (the original SAR rate is less than 10%), which indicates that the model is very effective at ranking alerts by SAR risk.
  • DataRobot also provides flexibility for modelers when tuning hyperparameters, which can also help with the class imbalance problem. In the example below, the Random Forest Classifier is tuned by enabling balance_boostrap (randomly sampling an equal number of SAR and non-SAR alerts for each decision tree in the forest); you can see the validation score of the new ‘Balanced Random Forest Classifier’ model is slightly better than the parent model.
  • You can also use the Smart Downsampling technique (from the Advanced Options tab) to intentionally downsample the majority class (i.e., non-SAR alerts) in order to build faster models with similar accuracy.
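Two of the ideas in the list above can be illustrated in a few lines: the SAR rate among the top 5% of scored alerts, and the balanced bootstrap idea of drawing equal numbers of SAR and non-SAR rows for each tree. Both functions are hypothetical sketches of the concepts, not DataRobot's implementation.

```python
import random


def sar_rate_in_top(labels, scores, fraction=0.05):
    """SAR rate among the highest-scored `fraction` of alerts."""
    ranked = sorted(zip(scores, labels), reverse=True)
    k = max(1, round(len(ranked) * fraction))
    return sum(y for _, y in ranked[:k]) / k


def balanced_bootstrap(labels, rng):
    """Row indices for one tree: equal numbers of SAR and non-SAR alerts,
    each drawn with replacement from its own class."""
    sar_idx = [i for i, y in enumerate(labels) if y == 1]
    non_idx = [i for i, y in enumerate(labels) if y == 0]
    n = min(len(sar_idx), len(non_idx))
    return rng.choices(sar_idx, k=n) + rng.choices(non_idx, k=n)


# Toy data: 100 alerts, 4 SARs, all near the top of the ranking.
scores = [i / 100 for i in range(100)]
labels = [1 if i >= 96 else 0 for i in range(100)]

top_rate = sar_rate_in_top(labels, scores)
sample = balanced_bootstrap(labels, random.Random(0))
print(top_rate, len(sample), sum(labels[i] for i in sample))
```

With only 4 SARs among 100 alerts, each balanced sample contains 4 SAR rows and 4 non-SAR rows, so every tree sees the minority class at a 50% rate.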

Once the modeling team decides on the champion model, they can simply click a button to download the compliance documentation for the model. The document will be a Microsoft Word file that gives a 360-degree view of the entire model-building process, as well as all the challenger models that are compared to the champion model. It’s also feasible to apply a user-defined template in this process to even further accelerate the model documentation process. Most of the machine learning models used for the Financial Crime Compliance domain require approval from the Model Risk Management (MRM) team. The compliance document provides comprehensive evidence and rationale for each step in the model development process.

Business Implementation

Decision Environment

After you are able to find the right model that best learns patterns in your data to predict SAR, DataRobot makes it easy to deploy the model into your alert investigation process. This is a critical step for implementing the use case as it ensures that predictions are used in the real world to reduce false positives and improve efficiency in the investigation process. 

Decision Maturity 

Automation | Augmentation | Blend 

There are multiple applications of the alert-prioritization score from the false positive reduction model, which can both automate and augment the existing rule-based transaction monitoring system.

  • If the FCC (Financial Crime Compliance) team is comfortable with removing the low-risk alerts with very low prioritization scores from the scope of investigation, then the binary threshold selected during the model building stage will be used as the cutoff to remove those no-risk alerts. The investigation team will only investigate the alerts above the cutoff, which will still capture all the SARs based on what we learned from the historical data.
  • Regulatory agencies often consider auto-closing or auto-removal an aggressive treatment of production alerts. When auto-closing is not the ideal way to use the model output, the alert prioritization score can still be used to triage alerts into different investigation processes, improving operational efficiency.

Model Deployment

The predictions generated from DataRobot can be integrated with an alert management system which will let the investigation team know of high risk transactions.

Decision Stakeholders
  • Decision Executors: Financial Crime Compliance Team
  • Decision Manager: Chief Compliance Officer
  • Decision Author: Data Scientists or Business Analysts

Decision Process

Currently, the review process consists of a deep-dive analysis by investigators. The data related to the case is made available for review so that the investigators can develop a 360° view of the customer, including their profile, demographic, and transaction history. Additional data from third-party data providers and web crawling can supplement this information to complete the picture.

For transactions that do not get auto-closed or auto-removed, the model can help the compliance team create a more effective and efficient review process by triaging their reviews. The predictions and their explanations also give investigators a more holistic view when assessing cases. 

Risk-based Alert Triage: Based on the prioritization score, the investigation team could take different investigation strategies. 

  • For no-risk or low-risk alerts: the alerts could be reviewed on a quarterly basis, instead of monthly. The frequently alerted entities without any SAR risk will be reviewed once every three months, which will significantly reduce the time of investigation.
  • For high-risk alerts with higher prioritization scores: the investigation could fast forward to the final stage in the alert escalation path. This will significantly reduce the effort spent on level 1 and level 2 investigation.
  • For medium-risk alerts: the standard investigation process will still be applied.

Smart Alert Assignment: For an alert investigation team that is geographically dispersed, the alert prioritization score could be used to assign alerts to different teams in a more effective manner. The high risk alerts could be assigned to the team with the most experienced investigators while the low risk alerts get assigned to the less experienced team. This will mitigate the risk of missing suspicious activities due to lack of competency with alert investigations.

  • For both approaches, the definition of high/medium/low risk can be either a set of hard thresholds (High: score >= 0.5, Medium: 0.5 > score >= 0.3, Low: score < 0.3) or based on the percentile of the alert scores on a monthly basis (High: above the 80th percentile, Medium: between the 50th and 80th percentiles, Low: below the 50th percentile).
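Both tiering schemes can be sketched as follows, using the cutoffs quoted above. The function names are hypothetical, and treating an alert that sits exactly at a percentile boundary as belonging to the higher tier is an assumption.

```python
from bisect import bisect_left


def triage_by_cutoff(score, high=0.5, medium=0.3):
    """Hard thresholds: High >= 0.5, Medium >= 0.3, otherwise Low."""
    if score >= high:
        return "High"
    if score >= medium:
        return "Medium"
    return "Low"


def triage_by_percentile(scores, high_pct=80, medium_pct=50):
    """Tier each alert by the share of that month's alerts scoring below it."""
    ranked = sorted(scores)
    n = len(ranked)
    tiers = []
    for s in scores:
        pct = 100 * bisect_left(ranked, s) / n
        if pct >= high_pct:
            tiers.append("High")
        elif pct >= medium_pct:
            tiers.append("Medium")
        else:
            tiers.append("Low")
    return tiers


monthly_scores = [0.05 * i for i in range(10)]   # hypothetical month of alerts
tiers = triage_by_percentile(monthly_scores)
print(triage_by_cutoff(0.42), tiers)
```

The percentile scheme keeps the share of alerts in each tier stable from month to month, while hard thresholds let tier volumes move with the score distribution.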

Model Monitoring

DataRobot will continuously monitor the model deployed on the dedicated prediction server. With DataRobot MLOps, the modeling team can monitor and manage the alert prioritization model by tracking the distribution drift of the input features as well as performance degradation over time.
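To make "distribution drift" concrete, the sketch below computes the Population Stability Index (PSI), one common drift measure, between a feature's training values and its recent scoring values. PSI is used here purely as an illustration; the text does not say which drift statistic DataRobot MLOps uses, and the function name and data are hypothetical.

```python
import math


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between training and recent values.

    Bins are equal-width over the training range; recent values outside
    that range are clipped into the first or last bin.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0

    def shares(values):
        counts = [0] * n_bins
        for v in values:
            b = int((v - lo) / width)
            counts[min(max(b, 0), n_bins - 1)] += 1
        # eps avoids log(0) when a bin is empty
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical totalSpend90d values at training time and after a shift.
training = [float(i % 100) for i in range(1000)]
recent = [v + 30.0 for v in training]

print(psi(training, training), psi(training, recent))
```

A PSI of zero means the binned distributions match; larger values indicate that recent inputs have moved away from what the model saw in training.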

Implementation Risks
  • Changes in the transactional behavior of money launderers.
  • Novel information introduced to the transaction and customer records that was not seen by the machine learning models.