Machine Learning in AML and Fraud Prevention

Hristo Iliev

Author: Dr Hristo Iliev – Chief Data Scientist at NOTO

Hristo is a High-Performance Computing specialist and data science enthusiast with a doctoral degree in atomic and molecular physics from the University of Sofia, as well as a software developer and systems administrator. He specialises in data analytics and predictive modelling at scale and in the development and performance tuning of parallel applications, and is currently the only Stack Overflow user to simultaneously hold gold badges for the two major parallel programming paradigms – OpenMP and the Message Passing Interface.

Before venturing into the startup world, Hristo worked for six years at RWTH Aachen University, where he optimised various scientific and engineering applications, taught message passing to users of the university supercomputer, and served as co-organiser and co-editor of the proceedings of the first HPC symposium of the Jülich Aachen Research Alliance. After leaving academia, he became the principal data scientist of a Dutch startup, where he developed, within the EU-funded DataPitch programme, an interactive advertising solution for IPTV as well as the company’s internal product analytics.

What is the reality when the sales pitch is over?
Detecting and preventing fraud or money laundering of any kind is a never-ending battle fought on a constantly shifting battleground. Criminals regularly change their behaviour to avoid automated detection, and the operators of monitoring systems must promptly discover and react to the new fraud patterns. In recent years, Artificial Intelligence (AI) has been taking the industry by storm. Machine Learning (ML) classifiers, which automatically learn patterns from existing data and then apply them to classify new data samples, have emerged as a viable alternative to fixed programming in the form of expert-developed rulesets and scorecards. Their successful application to hard problems previously thought all but impossible for computers, such as computer vision, has prompted many vendors of ML solutions to seek entry into the fraud detection and prevention market. Unfortunately, many of them underestimate (or are outright unaware of) the specifics of the various kinds of fraud and advertise almost magical qualities of their solutions, often in an attempt to ride the market hype and lure less tech-savvy clients. Navigating the ecosystem of ML-based AML and fraud prevention systems can therefore be hard.

No matter which provider one chooses, the data-driven nature of Machine Learning presents some unique challenges, especially for businesses that haven’t dealt with ML before. Here is what you need to know.

Got data? Adequate Training Data?

Machine learning models can only be trained to recognise patterns that already exist in the data, which places strong quality requirements on the training and validation data sets. They must contain a balanced, unbiased mix of representative samples of both genuine and fraudulent transactions. Balance is hard to achieve with fraud and AML data, which is by nature highly skewed in favour of genuine transactions.

Another challenge, especially for small or new businesses that do not yet have an extensive database of recorded transactions, is providing data sets large enough for training and validation. Models trained on insufficient data will overfit and will not generalise well to transactions outside the training set. For that reason, some ML solution providers have started offering pre-trained models. While such models can be appealing from a business perspective, since they remove the burden of gathering the initial data, they might contain hidden biases. That alone can make one liable under the terms of the EU’s General Data Protection Regulation (GDPR) and its upcoming sister regulation of high-risk systems employing AI, which puts very strong requirements on the quality of the data used to train models employed in financial systems. What is even more important, fraud patterns and AML risk exposure vary widely between business sectors and even between businesses in the same sector. Proper normalisation is needed to obtain neutral features, which again requires access to sufficient amounts of historical data.
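To make the skew problem concrete, the sketch below generates a synthetic transaction set (the 10,000-sample size and roughly 1% fraud rate are invented for illustration) and computes inverse-frequency class weights, a common way to compensate for imbalance during training:

```python
import random
from collections import Counter

random.seed(0)

# Hypothetical data set skewed heavily in favour of genuine transactions.
labels = ["fraud" if random.random() < 0.01 else "genuine"
          for _ in range(10_000)]

def balanced_class_weights(labels):
    """Inverse-frequency class weights (the 'balanced' heuristic used by
    several ML libraries): weight = n_samples / (n_classes * class_count).
    Rare classes receive proportionally larger weights during training."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

weights = balanced_class_weights(labels)
print(weights)  # the "fraud" weight comes out far larger than the "genuine" one
```

Weighting is only one option; under- or over-sampling the training set are alternatives, and none of them removes the need for enough representative fraud examples in the first place.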

The Black Box Problem
Learning from data presents another challenge – a lack of clear explainability of how the models actually work. Unlike explicitly programmed systems that follow predefined logic steps, ML solutions learn to approximate existing patterns in the data by adjusting various opaque internal model parameters during the training phase. That makes it hard to reason about why they produce a certain output in any given case. Although explainable models such as decision trees exist, they are rarely employed directly due to their poor ability to generalise. Instead, ensembles of hundreds or thousands of decision trees, in which each tree corrects the mistakes of the previous ones (the popular XGBoost, for example), are used, and these are nowhere near as transparent as a single decision tree.
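One crude way to peek inside a black box is to perturb each input feature towards a baseline value and measure how much the score drops. The sketch below illustrates the idea; the `risk_score` function, the transaction fields, and the baseline values are all invented stand-ins for a real trained model:

```python
# A toy "black box" scorer standing in for a trained ML model
# (assumption: any callable returning a risk score in [0, 1]).
def risk_score(tx):
    return 0.7 * (tx["amount"] > 1000) + 0.3 * tx["new_device"]

def attribution(model, tx, baseline):
    """Crude per-feature attribution: replace each feature with a
    neutral baseline value and record how much the score changes."""
    full = model(tx)
    return {f: full - model({**tx, f: baseline[f]}) for f in tx}

tx = {"amount": 5000, "new_device": 1}
baseline = {"amount": 50, "new_device": 0}
attr = attribution(risk_score, tx, baseline)
print(attr)  # "amount" contributes more to this score than "new_device"
```

Production-grade explanation methods (e.g. SHAP values) are considerably more principled, but the underlying intuition – attributing a score to individual inputs – is the same.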

Explainability is crucial for businesses operating from within the EU or serving EU customers. GDPR and the AI regulation mandate that one should be able to explain decisions made by automated systems. In similar regimes, reliance on black box ML solutions could result in long-running inquiries by the respective regulatory bodies, especially during the initial implementation phases of the new regulation when the reviewers are still gaining experience.

Feedback And Human Oversight
ML-based anti-fraud and AML solutions gain knowledge from the training data and apply it to predict the fraudulent status of new, previously unseen transactions, but they do not ultimately know whether those predictions turn out to be true. It is up to a human user to determine the truth and feed it back to the ML system, closing a feedback loop. With that information available, model performance can be evaluated and compared against preset benchmarks such as an acceptable false positive ratio. In case of significant underperformance, the model is no longer adequate to the current pattern of fraudulent activity and needs to be retrained or replaced by an entirely different model. Feedback is a crucial element of any ML system; if a product advertises ML capabilities but offers no clear way to provide feedback, it is likely either a fixed model or one whose knowledge is updated by some other opaque mechanism. Such solutions that “just work” should be avoided.
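A minimal sketch of such a feedback-driven check, in Python (the labels, the 5% benchmark, and the function names are illustrative assumptions, not part of any particular product):

```python
def false_positive_rate(predictions, truths):
    """Share of genuine transactions wrongly flagged as fraud,
    computed from analyst-confirmed ground truth (the feedback)."""
    genuine = [p for p, t in zip(predictions, truths) if t == "genuine"]
    if not genuine:
        return 0.0
    return sum(p == "fraud" for p in genuine) / len(genuine)

def needs_retraining(predictions, truths, max_fpr=0.05):
    """Flag the model for retraining once it exceeds the agreed
    false-positive benchmark (max_fpr is an illustrative threshold)."""
    return false_positive_rate(predictions, truths) > max_fpr

preds  = ["fraud", "genuine", "fraud", "genuine", "genuine"]
truths = ["fraud", "genuine", "genuine", "genuine", "genuine"]
print(false_positive_rate(preds, truths))  # 0.25
print(needs_retraining(preds, truths))     # True
```

In practice the same feedback stream would feed several metrics (recall, precision, value of losses prevented) rather than a single ratio, but the retrain-or-replace decision follows the same pattern.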

The quality of the feedback is very important. If one simply loops back all reported fraud, chargebacks, or other negative data without any human oversight, that will lead both to wrong estimates of model performance and to models with inflated false positive rates. As with any other data-driven solution, ML is governed by the GIGO principle: “garbage in, garbage out”. Therefore, one should be prepared to invest in human fraud and AML analysts even when employing an ML solution. This is also mandated by the upcoming EU regulation on high-risk AI systems.

The Way Forward – Hybrid Solutions

Even when it is possible to train and retrain ML solutions with custom data, it can still be hard to teach them to incorporate fixed logic, which may be needed, e.g., for regulatory compliance or to address simple, well-known fraud patterns. In such cases, one can augment the ML model with a set of rules or embed it in an existing rule-based system, creating a hybrid solution.

Hybrid solutions have a number of practical advantages over pure ML-based ones:

  • It is easy to start using them right away with simple rules, even if there is not enough data available, then gradually add ML on top of them as data accumulates.
  • ML model decisions can be easily overridden and special cases can be quickly added without retraining the model.
  • It is possible to separate features containing sensitive information and process them with clear hand-written rules.
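A hybrid decision flow of this kind can be sketched in a few lines of Python. Everything here is illustrative: the thresholds, the country code, and the toy `ml_model` are hypothetical stand-ins, not a real ruleset or classifier:

```python
def rule_engine(tx):
    """Explicit, auditable rules for compliance and well-known patterns.
    Returns a decision, or None to defer to the ML model."""
    if tx["amount"] > 10_000:     # e.g. a regulatory reporting threshold
        return "review"
    if tx["country"] in {"XX"}:   # hypothetical high-risk jurisdiction
        return "decline"
    return None

def ml_model(tx):
    """Stand-in for a trained classifier returning a risk score."""
    return 0.9 if tx["new_device"] and tx["amount"] > 500 else 0.1

def hybrid_decision(tx, threshold=0.8):
    decision = rule_engine(tx)    # rules take precedence over the model
    if decision is not None:
        return decision
    return "review" if ml_model(tx) >= threshold else "approve"

print(hybrid_decision({"amount": 20_000, "country": "DE", "new_device": 0}))  # review
print(hybrid_decision({"amount": 600, "country": "DE", "new_device": 1}))     # review
print(hybrid_decision({"amount": 30, "country": "DE", "new_device": 0}))      # approve
```

Note how the rules remain fully explainable and instantly editable, while the model only handles the cases the rules do not cover.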

Instead of an Epilogue

Certainly, there are a ton of considerations to be made before choosing one solution over another, especially if you are looking for the best fit for your organisation. However, here are our top three:

  • Human expertise in detecting fraud and AML is not yet a thing of the past. If anyone knows your business inside out, it is you.
  • ML and AI solutions are certainly the next step but they are not as autonomous as we would like them to be. You will not find a silver bullet in that niche yet.
  • Be realistic about your current state of operations – do you have enough volume to effectively use ML and AI solutions, how much suspicious activity do you have, and how well is your data structured? All these seemingly minor items do matter. A lot!

About NOTO

Notolytix Ltd. was founded in 2015 by a group of fraud prevention and IT veterans from global companies such as Groupon, Paysafe, and Rakuten.

NOTO is an enterprise-grade solution designed to address all financial crime threats. It is a data-agnostic and uniquely flexible solution that empowers its users to efficiently combat fraud and money laundering across any vertical or industry, delivering unsurpassed ROI and truly global capabilities.

One simple integration helps companies transform their approach to fraud, compliance and risk management in any sector or vertical. 

To learn more about NOTO, visit About NOTO