Author: Hristo Iliev, PhD, Chief Data Scientist at NOTO
Hristo is a High-Performance Computing specialist, data science enthusiast, software developer, and systems administrator with a doctoral degree in atomic and molecular physics from the University of Sofia. He specialises in data analytics and predictive modelling at scale and in the development and performance tuning of parallel applications, and is currently the only Stack Overflow user to hold gold badges at once for the two major parallel programming paradigms: OpenMP and the Message Passing Interface.
Before venturing into the startup world, Hristo worked for six years at RWTH Aachen University, where he optimised various scientific and engineering applications, taught message passing to users of the university supercomputer, and served as co-organiser and co-editor of the proceedings of the first HPC symposium of the Jülich Aachen Research Alliance. Once he left academia, he became the principal data scientist of the Dutch startup holler.live, developing within the EU-funded DataPitch programme an interactive advertising solution for IPTV as well as the company’s internal product analytics.
Machine Learning, a branch of Artificial Intelligence, has been around for a while now. Recent advancements in the field have sparked a revolution across a wide range of industries and scientific disciplines, allowing automated computer systems to solve problems that were previously intractable and even surpassing humans in some domains. The demand for machine learning continues to rise on the belief that it will one day solve every problem. New and established businesses alike have responded by marketing a wide range of data-driven products.
The goal of every classification model is to reliably identify the category that a given input belongs to. In the case of transaction monitoring, a model may be used to sort transactions into legitimate and fraudulent categories. A user login classification model might separate legitimate logins from compromised ones. Such examples illustrate binary classification models, which divide data into only two categories. The first category is often known as “negative,” whereas the second is known as “positive.” The accuracy of such a model is measured by the percentage of inputs (events, transactions, logins, etc.) that are properly categorised. Sites marketing machine learning solutions for fraud detection or AML frequently claim an accuracy of 99% or higher. While accuracy is useful in gauging the overall quality of a model, it can be deceptive when trying to spot unusual phenomena like fraud.
Let’s use the example of a high-risk industry where fraud accounts for around one percent of all transactions to see how this works in practice. A completely ignorant “optimistic” model that always predicts that the transaction is not fraudulent will be correct 99% of the time and will therefore have a 99% accuracy. How does that sound?
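The arithmetic behind that 99% figure can be checked in a few lines. This is a minimal sketch: the volume of 10,000 transactions is an assumption made for illustration, and the labels are synthetic.

```python
# Hypothetical data: 10,000 transactions, 1% of them fraudulent (label 1).
labels = [1] * 100 + [0] * 9_900

# The "optimistic" model predicts "not fraudulent" (0) for every input.
predictions = [0] * len(labels)

# Accuracy: share of inputs whose predicted class matches the true class.
correct = sum(1 for y, p in zip(labels, predictions) if y == p)
accuracy = correct / len(labels)

print(f"accuracy = {accuracy:.2%}")  # prints "accuracy = 99.00%"
```

The model never looks at its input, yet it scores 99% simply because 99% of the inputs are legitimate.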
Model selectivity, better known as sensitivity or recall, measures the proportion of positive inputs that were correctly detected and is a far more honest statistic in this scenario. The model has a sensitivity of 33% if it can only identify 1 in 3 fraudulent transactions. The aforementioned “optimistic” model never flags a transaction as fraudulent; hence, its sensitivity is always 0%, exposing the model itself as a fraud. However, as we’ll discover in one of the upcoming sections, the price of striving for high sensitivity is steep.
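As a rough illustration, this proportion can be computed directly from true labels and predictions. The function name and the tiny dataset below are made up for the example.

```python
def sensitivity(labels, predictions):
    """Share of truly positive inputs that the model flagged as positive."""
    true_pos = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    false_neg = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    return true_pos / (true_pos + false_neg)

# Three fraudulent transactions (1) and three legitimate ones (0).
labels = [1, 1, 1, 0, 0, 0]
one_in_three = [1, 0, 0, 0, 0, 0]      # catches only one fraud
always_negative = [0] * len(labels)    # the "optimistic" model

print(f"{sensitivity(labels, one_in_three):.0%}")     # prints "33%"
print(f"{sensitivity(labels, always_negative):.0%}")  # prints "0%"
```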
A business can lose money in two ways: by letting fraudulent transactions through and by rejecting genuine ones. Errors of the second kind are known as false positives. The false positive rate of a model is defined as the number of false positives divided by the total number of negative (legitimate) inputs. Even a small false positive rate can mean many misclassifications when dealing with a high volume of transactions.
Continuing with the example from earlier, a false positive rate of 1% would mean that 1% of legitimate transactions are incorrectly flagged as fraudulent. Since that rejection rate is essentially the same as the fraud rate, any benefit from weeding out fraudulent transactions is offset by the legitimate business that is turned away.
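To put numbers on this, here is a small sketch that assumes a volume of one million transactions (a figure chosen only for illustration) together with the 1% fraud rate and 1% false positive rate from the example.

```python
transactions = 1_000_000   # assumed volume, for illustration only
fraud_rate = 0.01          # 1% of all transactions are fraudulent
fpr = 0.01                 # 1% of legitimate transactions get flagged

fraudulent = round(transactions * fraud_rate)
legitimate = transactions - fraudulent
wrongly_rejected = round(legitimate * fpr)

print(f"fraudulent transactions:       {fraudulent:,}")        # 10,000
print(f"legitimate but rejected (FPs): {wrongly_rejected:,}")  # 9,900
```

Even if the model caught every single fraudulent transaction, it would turn away almost as many good customers as fraudsters.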
The proportion of positive classifications that are actually correct is reported via a statistic with a slightly misleading name: precision. When dealing with uncommon occurrences like fraud, even a moderate number of false positives can significantly reduce the precision. A high precision, however, is not sufficient evidence of a successful model. It might simply indicate that the model is so conservative that it only labels as positive those occurrences that have an extremely high risk of being fraudulent. Such a model would be extremely insensitive and hence useless.
In our fictitious example, a seemingly great model that captures all the fraud (i.e., has 100% sensitivity) but introduces 1% false positives will have approximately two positive identifications per hundred transactions: one truly fraudulent and one incorrectly identified as fraudulent from the remaining ninety-nine good transactions. The model’s precision will equal 50%, or 1 (actually fraudulent) / 2 (identified as fraudulent). Conversely, if your model never incorrectly classifies legitimate transactions but flags only 1 out of every 1,000,000 fraudulent ones, you’ll still have 100% precision.
As seen in the image above, sensitivity and precision capture two distinct facets of the positive identification process, but neither of them provides the whole picture by itself.
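The 50% figure can be reproduced with a short calculation. The sketch below works per 100 transactions and uses expected (fractional) counts, which is why the result comes out slightly above one half.

```python
def precision(true_positives, false_positives):
    """Share of positive classifications that are actually positive."""
    return true_positives / (true_positives + false_positives)

# Per 100 transactions: 1 fraudulent, 99 legitimate.
fraudulent, legitimate = 1, 99
true_positives = fraudulent * 1.0     # all fraud is caught
false_positives = legitimate * 0.01   # 1% false positive rate, about 0.99

p = precision(true_positives, false_positives)
print(f"precision = {p:.1%}")  # prints "precision = 50.3%"

# A model that flags a single fraud and never flags a good transaction:
print(f"precision = {precision(1, 0):.0%}")  # prints "precision = 100%"
```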
There are two distinct kinds of classification models. The first kind, an example of which is a set of decision rules, produces just the class that its input is expected to belong to, such as “good” or “fraudulent” for a transaction or “authentic” or “hacked” for a login. Such models provide no context, such as how certain they are of their prediction. The second type of model, of which XGBoost is one example, does not produce hard predictions but instead returns a score that indicates how confident the model is that the input belongs to a given class.
For models of the second kind, the threshold value used to differentiate between the two categories is left up to the user, with 0.5 (or 50% confidence) being a common starting point. Adjusting the threshold changes the level of certainty required for a positive classification, which in turn affects both the sensitivity and the false positive rate. Raising the threshold reduces false positives, but it also reduces the model’s sensitivity. Lowering it increases the sensitivity, but also the number of false positives. This trade-off is described by the Receiver Operating Characteristic (ROC), which gives the relationship between the sensitivity and the false positive rate for a given threshold value. The ROC curve plots the model’s sensitivity against its false positive rate over the full range of thresholds.
Because the two quantities are coupled, model sensitivity and false positive rate cannot be chosen independently. Given one of them, the model’s Receiver Operating Characteristic determines the value of the other. From the aforementioned ROC curve, we learn that a model sensitivity of 40% comes with an expected false positive rate of roughly 1%. Conversely, if we accept a false positive rate of 1%, we can expect the model to correctly identify around 40% of the fraud.
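The effect of moving the threshold can be seen on a toy example. The scores below are invented for illustration and do not come from any real model; the helper function is likewise hypothetical.

```python
def rates_at_threshold(labels, scores, threshold):
    """Return (sensitivity, false positive rate) when score >= threshold means positive."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    positives = sum(labels)
    negatives = len(labels) - positives
    return tp / positives, fp / negatives

# Made-up scores for 4 fraudulent (1) and 6 legitimate (0) transactions.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.95, 0.80, 0.55, 0.30, 0.70, 0.45, 0.35, 0.20, 0.10, 0.05]

for threshold in (0.25, 0.50, 0.75):
    sens, fpr = rates_at_threshold(labels, scores, threshold)
    print(f"threshold={threshold:.2f}  sensitivity={sens:.2f}  FPR={fpr:.2f}")
```

On this data, going from 0.25 to 0.75 drops the false positive rate from 0.50 to 0.00, but the sensitivity falls from 1.00 to 0.50 along the way.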
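Reading one quantity off the ROC curve given the other can be sketched as follows. The data is again invented, and both function names are hypothetical helpers rather than part of any library.

```python
def roc_points(labels, scores):
    """ROC points (false positive rate, sensitivity), one per distinct threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    # Lowering the threshold step by step moves along the curve from left to right.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
        points.append((fp / negatives, tp / positives))
    return points

def sensitivity_at_fpr(points, max_fpr):
    """Highest sensitivity reachable without exceeding the given false positive rate."""
    return max(tpr for fpr, tpr in points if fpr <= max_fpr)

labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.95, 0.80, 0.55, 0.30, 0.70, 0.45, 0.35, 0.20, 0.10, 0.05]
points = roc_points(labels, scores)

for max_fpr in (0.0, 0.2, 0.5):
    print(f"FPR <= {max_fpr:.1f}: sensitivity {sensitivity_at_fpr(points, max_fpr):.2f}")
```

Fixing the acceptable false positive rate fixes the best achievable sensitivity; on this toy data it rises from 0.50 to 0.75 to 1.00 as the allowed rate goes from 0.0 to 0.2 to 0.5.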
A dependable vendor of ML/AI solutions will either report both metrics simultaneously or display the ROC curve. Area under the ROC curve (AUC-ROC) is a popular metric for assessing model quality, but it can be deceptive because it averages the model’s performance across the entire range of false positive rates, when the focus should be on the narrow region of the ROC curve where the false positive rate is below a predetermined threshold.
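A sketch of why the full area can mislead: the two ROC curves below are hypothetical and given directly as points, and the area is computed with the trapezoidal rule, optionally restricted to false positive rates below a cut-off.

```python
def auc(points, max_fpr=1.0):
    """Trapezoidal area under ROC points (fpr, tpr), restricted to fpr <= max_fpr."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Cut the segment at max_fpr using linear interpolation.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Two hypothetical ROC curves as (false positive rate, sensitivity) points.
model_a = [(0.0, 0.0), (0.01, 0.40), (0.50, 0.70), (1.0, 1.0)]
model_b = [(0.0, 0.0), (0.01, 0.05), (0.20, 0.90), (1.0, 1.0)]

for name, pts in (("A", model_a), ("B", model_b)):
    print(f"model {name}: full AUC = {auc(pts):.4f}, "
          f"area up to 1% FPR = {auc(pts, 0.01):.5f}")
```

Model B wins on the full area (about 0.85 against 0.70), yet at a 1% false positive rate it catches only 5% of the fraud, where model A catches 40%.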
Hyperparameters (as opposed to internal model parameters) are user-supplied controls used in the intricate mathematical process of model training. The maximum depth of a decision tree is an example of a controlling parameter; increasing it increases the accuracy of the tree’s possible decisions but also increases the risk that it may overfit to the training data.
The quality of a model can be greatly improved or rendered useless depending on the hyperparameters used. When it comes to picking the proper values, there is no foolproof method. They depend not just on the kind of model and the purpose of training, but also on the dataset being used. If you’re trying to train the same model on two separate sets of data, say, from two different time periods, you’ll probably need to adjust the hyperparameters.
Even with a thorough understanding of the model’s inner workings and of the data used to train it, selecting the right value for each hyperparameter may require tuning, that is, exploring different value combinations in search of those that produce the best model. Although hyperparameter tuning is an active area of research, it still typically takes a lot of time and resources, because every candidate combination means training a model. Any machine learning service provider who claims to offer fast (and hence inexpensive) training should be questioned about the quality of the resulting models.
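The simplest form of tuning, an exhaustive grid search, can be sketched as below. The scoring function is a cheap stand-in for an actual train-and-validate run, and the `learning_rate` hyperparameter and all grid values are assumptions made for the example.

```python
from itertools import product

def grid_search(param_grid, evaluate):
    """Try every combination in param_grid; return (best_params, best_score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)  # in practice: train a model, score it on held-out data
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Stand-in for an expensive training run (purely illustrative).
def toy_validation_score(params):
    return -(params["max_depth"] - 6) ** 2 - 10 * abs(params["learning_rate"] - 0.1)

param_grid = {"max_depth": [2, 4, 6, 8], "learning_rate": [0.01, 0.1, 0.3]}
best_params, best_score = grid_search(param_grid, toy_validation_score)
n_runs = len(param_grid["max_depth"]) * len(param_grid["learning_rate"])

print(f"{n_runs} training runs, best combination: {best_params}")
```

Even this tiny grid requires 12 training runs, and the count multiplies with every additional hyperparameter or candidate value.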
An ML model is only as good as the data it was trained on. The model will be biased if there are flaws in the data. The model will learn to take advantage of any spurious correlations it finds in the data. Despite significant academic interest, automated debiasing of models remains mostly a research topic.
Bias can manifest itself in many ways. For instance, a fraud detection model trained on European e-commerce data that is unbalanced and overrepresents certain nations would assign a higher probability of fraud to transactions from customers in larger and more populous countries.
Pre-trained models, or models that have already been trained using data from another source, may seem like a good idea, especially for small businesses that don’t have access to enough data to train their own model. However, it’s difficult to know what kind of data those models were trained on, and thus what kind of biases they may have. Under some legislative frameworks, such as the forthcoming European AI Act, this may pose a regulatory compliance risk.
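One simple way to look for this kind of bias is to compare the false positive rate across groups. The sketch below uses invented records and placeholder country names; it is only one possible check, not a complete bias audit.

```python
from collections import defaultdict

def false_positive_rate_by_group(records):
    """records: iterable of (group, label, prediction); returns {group: FPR}."""
    false_pos = defaultdict(int)
    negatives = defaultdict(int)
    for group, label, prediction in records:
        if label == 0:                 # only legitimate inputs count towards the FPR
            negatives[group] += 1
            if prediction == 1:
                false_pos[group] += 1
    return {group: false_pos[group] / negatives[group] for group in negatives}

# Made-up records: (country, is_fraud, flagged_by_model).
records = (
    [("country_A", 0, 1)] * 30 + [("country_A", 0, 0)] * 970
    + [("country_B", 0, 1)] * 1 + [("country_B", 0, 0)] * 99
)

for country, rate in false_positive_rate_by_group(records).items():
    print(f"{country}: false positive rate {rate:.1%}")
```

Here legitimate customers from the overrepresented country A are flagged three times as often as those from country B, a gap that a single overall false positive rate would hide.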
We have demonstrated that there is no one indicator for measuring the efficacy of a machine learning model. Instead, there is an abundance of metrics, each of which measures a discrete subset of what’s going on but never provides a holistic view. Because of this, dishonest service providers might choose to highlight metrics that paint their models in a more favourable light.
One needs to be careful when searching the market for solutions. Always ask for the definition of any metric presented, to guard against deliberately misleading claims such as advertising the model’s accuracy, which, as we have demonstrated above, may be quite high even for terrible models. Once the definition is clear, inquire about the expected false positive rate. For probabilistic fraud detection models, ask to see the ROC curve, paying special attention to the region of the curve that combines high sensitivity with a low false positive rate. Finally, inquire as to what steps, if any, have been taken to keep the model free of bias.