June 15, 2018

Using machine learning to solve data imbalance in AML L1 alerts

Is your data well-balanced to train the machine learning model?

Data in the banking and financial services sectors has grown exponentially with the rise in money laundering and other financial crimes across the globe. Anti-money laundering (AML) data, in particular, has evolved dramatically and grown in volume due to the complexity of existing alerts as well as the generation of new types of alerts.

Understanding customer-level and transaction data is important in model development activities, which are vital to AML programs.

Various studies on financial crime compliance (FCC) have found a growing data imbalance problem between the minority class and the majority class. Here the minority class consists of true matches (true alerts), and the majority class consists of false matches (false positives).

Classical models tend to favour the majority class and usually perform worse on the minority class. Presenting imbalanced data to a classifier can produce undesirable results, such as much lower performance on test data than on training data.
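
The bias toward the majority class can be illustrated with a minimal sketch. The 2% true-alert rate and the always-negative "classifier" below are assumptions chosen for illustration, not figures from this report. The sketch shows how overall accuracy can look strong while no true alert is detected.

```python
# Hypothetical alert labels: 1 = true alert (minority), 0 = false positive (majority).
# The 20 / 980 split is an assumed imbalance, used only for illustration.
labels = [1] * 20 + [0] * 980

# A degenerate classifier that always predicts the majority class.
predictions = [0] * len(labels)

# Overall accuracy: share of alerts labelled correctly.
correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)

# Minority-class recall: share of true alerts that were detected.
true_positives = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = true_positives / sum(labels)

print(f"accuracy = {accuracy:.2f}")        # accuracy = 0.98
print(f"minority recall = {recall:.2f}")   # minority recall = 0.00
```

For this reason, per-class measures such as recall on the minority class are usually more informative than accuracy alone when classes are imbalanced.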

However, a good AML model should perform equally well on both minority and majority classes.

Cost-sensitive learning methods address this imbalance by assigning a higher cost to misclassifying observations in the minority class. However, such methods require knowledge of the misclassification costs, which are often unknown and therefore have to be assumed.
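
One common way to apply misclassification costs is to shift the decision threshold on a model's predicted probability. The sketch below assumes that correct decisions cost nothing; the function names and the 1:9 cost ratio are hypothetical, which reflects the point above that the costs usually have to be assumed.

```python
def cost_sensitive_threshold(cost_fp: float, cost_fn: float) -> float:
    """Probability threshold that minimises expected cost.

    Escalating an alert with true-alert probability p costs (1 - p) * cost_fp
    in expectation; closing it costs p * cost_fn. Escalation is cheaper once
    p >= cost_fp / (cost_fp + cost_fn). Assumes correct decisions cost zero.
    """
    return cost_fp / (cost_fp + cost_fn)


def escalate(p_true_alert: float, cost_fp: float, cost_fn: float) -> bool:
    """Return True if the alert should be escalated under the given costs."""
    return p_true_alert >= cost_sensitive_threshold(cost_fp, cost_fn)


# Assumed costs: missing a true alert is nine times as costly as a false positive.
COST_FP, COST_FN = 1.0, 9.0

score = 0.3  # hypothetical model score for one alert
print(escalate(score, COST_FP, COST_FN))  # True: the threshold drops to 0.1
print(escalate(score, 1.0, 1.0))          # False: equal costs give a 0.5 threshold
```

With equal costs the rule reduces to the usual 0.5 cut-off. Raising the assumed cost of a missed true alert lowers the threshold, so more borderline alerts are escalated.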

Machine learning algorithms and data mining solutions have provided an opportunity to understand the nature of imbalanced data. Machine learning techniques attempt to resolve class imbalance problems using sampling techniques, optimisation of model structure and learning algorithms. Applying traditional methods such as K-Nearest Neighbors and Naive Bayes to imbalanced datasets results in inferior performance.
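
As an illustration of a conventional sampling technique, the following is a minimal sketch of random oversampling, which duplicates minority-class rows until every class matches the size of the largest one. The function name `random_oversample` and the toy data are hypothetical, and this is not presented as the specific method used in the paper.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes
    have as many rows as the largest class. Original rows are kept as-is."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())

    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # All original rows belonging to this class.
        pool = [r for r, y in zip(rows, labels) if y == cls]
        # Draw with replacement to fill the gap to the target size.
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Toy data: 2 true alerts (label 1) and 8 false positives (label 0).
rows = [(i,) for i in range(10)]
labels = [1] * 2 + [0] * 8

balanced_rows, balanced_labels = random_oversample(rows, labels)
print(sorted(Counter(balanced_labels).items()))  # [(0, 8), (1, 8)]
```

Resampling of this kind is normally applied to the training split only, so that the test data keeps the original class distribution.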

In this paper, we focus on the current challenges of applying traditional classification methods, which rely on conventional sampling techniques to balance datasets, to imbalanced data. We also discuss alternative data balancing techniques and a few machine learning classification algorithms that adapt themselves to detecting minority-class data.