
Machine Learning Error Analysis with AMEX

This project used BERT embeddings and clustering algorithms to analyze classification errors, group similar data into meaningful topics, and identify potential issues for corrective action.

  • Classification Error Analysis
  • Clustering Algorithm
  • Embedding Analysis

The Problem

The challenge lies in the immense volume of credit card activity: $4.8 trillion in transactions processed in 2022, 10 billion transactions handled by AmEx in 2023, and 251,952 complaints filed with the Consumer Financial Protection Bureau from 2021 to 2024. Current machine learning models struggle with mislabeled training data and confidently misclassify transactions, leading to unreliable predictions. Tackling these issues requires more effective error analysis to improve accuracy and address customer concerns at scale.

Dataset and Pretrained Model

Dataset: 77 unique labels

Train Set

  • 10,003 utterances
  • Imbalanced label distribution

Test Set

  • 3,080 utterances
  • Balanced label distribution

Pretrained Model: BERT-Banking77

  • 98.34% accuracy on train data
  • 92.76% accuracy on test data

Methodology

Louvain Clustering

  • "Community identifier" that groups nodes by network connectivity, efficiently handling diverse community sizes and complex structures
  • Trained unsupervised, leveraging hierarchical by modularity optimization
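
A minimal sketch of this pipeline, assuming sentence embeddings from a BERT-family encoder and a cosine kNN graph; the encoder name, toy utterances, and graph parameters below are illustrative assumptions, not the project's exact setup:

```python
# Minimal sketch: embed utterances, build a kNN graph, run Louvain.
import networkx as nx
from sentence_transformers import SentenceTransformer
from sklearn.neighbors import kneighbors_graph

utterances = [
    "My card was declined at checkout",
    "How do I dispute a charge on my statement?",
    "I want to report my card as stolen",
    "Why was my payment rejected twice?",
]

# 1. Embed each utterance (encoder choice is illustrative).
encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(utterances)

# 2. Connect each utterance to its nearest neighbors by cosine distance,
#    giving Louvain a graph structure to partition.
adjacency = kneighbors_graph(embeddings, n_neighbors=2, metric="cosine")
graph = nx.from_scipy_sparse_array(adjacency)

# 3. Louvain merges nodes into communities to maximize modularity;
#    it needs no labels and discovers the number of communities itself.
communities = nx.community.louvain_communities(graph, seed=42)
print(f"Found {len(communities)} communities")
```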

Optimization
Tuning Louvain with BERT-Banking77 embeddings produced 77 distinct clusters (a tuning sketch follows this list):

  • Out of 3,080 utterances, only 8 were classified as outliers
  • Resolution = 12.8, modularity = 0.67
  • Identified 181 potential errors
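
The tuning itself can be sketched as a resolution sweep; this hedged example reuses `graph` from the sketch above, and the candidate grid is an assumption. Higher resolution values push Louvain toward more, smaller communities, which is how the cluster count can be steered toward the 77 known labels.

```python
import networkx as nx

# Sweep Louvain's resolution and record cluster count and modularity.
best = None
for resolution in (1.0, 2.0, 4.0, 8.0, 12.8, 16.0):  # illustrative grid
    parts = nx.community.louvain_communities(graph, resolution=resolution, seed=42)
    q = nx.community.modularity(graph, parts, resolution=resolution)
    print(f"resolution={resolution:5.1f}  clusters={len(parts):3d}  modularity={q:.2f}")
    # Keep the setting whose cluster count lands closest to the 77 gold labels.
    if best is None or abs(len(parts) - 77) < abs(len(best[1]) - 77):
        best = (resolution, parts)
```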

Topic Consolidation
Manually grouped similar categories into 12 comprehensive topics:

  • The most common label in each cluster became the cluster name, as sketched below
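
A sketch of that naming rule, assuming `communities` from the Louvain step and a hypothetical `labels` list mapping each utterance index to its gold label:

```python
from collections import Counter

def name_clusters(communities, labels):
    """Name each cluster after the most common gold label among its members."""
    names = {}
    for cluster_id, members in enumerate(communities):
        counts = Counter(labels[i] for i in members)
        names[cluster_id] = counts.most_common(1)[0][0]
    return names
```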

Analyzing Error Reason Distributions

Within-Topic vs. Out-of-Topic

Note: All data shown here is for demonstration purposes only and does not represent proprietary data.

Over 80% of within-topic errors are caused by mislabeling.
'Model Errors' are more common in the out-of-topic category.
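
A breakdown like this can be produced with a row-normalized crosstab; the column names and records below are toy assumptions, not project data:

```python
import pandas as pd

# Toy error records: topic membership of the error and its diagnosed reason.
errors = pd.DataFrame({
    "within_topic": [True, True, True, False, False],
    "reason": ["Mislabel", "Mislabel", "Model Error", "Model Error", "Model Error"],
})
# Each row sums to 1, mirroring the percentages discussed above.
print(pd.crosstab(errors["within_topic"], errors["reason"], normalize="index"))
```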

BERT Confidence in Distinguishing 'Model Error' and 'Mislabel' Reasons

Within-Topic vs. Out-of-Topic

Within-topic errors made at a high BERT confidence level are a strong indicator of the 'Mislabel' type.

'Model Error' peaks in the 0.3-0.4 confidence bin, while 'Mislabeled' errors are more frequent in the high-confidence bins, particularly 0.9-1.0.
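
The underlying binning can be sketched as follows, with toy confidences and hypothetical column names:

```python
import numpy as np
import pandas as pd

# Toy erroneous predictions: BERT softmax confidence and diagnosed reason.
errors = pd.DataFrame({
    "bert_confidence": [0.35, 0.38, 0.55, 0.92, 0.97, 0.99],
    "reason": ["Model Error", "Model Error", "Model Error",
               "Mislabel", "Mislabel", "Mislabel"],
})
# Bucket errors into 0.1-wide confidence bins and count reasons per bin.
bins = np.arange(0.0, 1.01, 0.1)
errors["conf_bin"] = pd.cut(errors["bert_confidence"], bins)
print(errors.groupby(["conf_bin", "reason"], observed=True).size().unstack(fill_value=0))
```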

BERT Confidence vs. Louvain Cluster Confidence

Within-Topic

  • BERT expresses high confidence for a large portion of all erroneous predictions
  • ‘Mislabeled’ errors are concentrated at high BERT confidence levels but lack significant association with Louvain cluster confidence

Out-of-Topic

  • Out-of-topic errors are particularly common below the 0.6 cluster confidence level
  • Otherwise, errors show no clear distribution pattern and are dispersed

Louvain Results

We compared BERT-Banking77 predicted confidence with Louvain cluster confidence for all utterances to analyze the relationship between the two.
The pattern we observed shows:

  • 97% of correctly classified utterances have a predicted confidence above 0.8
  • Less than 1% of correctly classified utterances have a predicted confidence below 0.6; of all utterances in this range, over 73% are errors
  • Real-world application: use a confidence threshold (e.g., 0.6) as an error indicator and streamline QA by automatically flagging these data points for manual review (see the sketch below)
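
A minimal sketch of that QA rule, with hypothetical column names and toy predictions:

```python
import pandas as pd

# Toy predictions; in practice these come from the classifier's output.
predictions = pd.DataFrame({
    "utterance": ["My card was declined", "Close my account please"],
    "predicted_label": ["declined_card_payment", "terminate_account"],
    "confidence": [0.42, 0.95],
})

THRESHOLD = 0.6  # drawn from the analysis above; tune per deployment
review_queue = predictions[predictions["confidence"] < THRESHOLD]
print(f"{len(review_queue)} of {len(predictions)} predictions flagged for review")
```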

Clustering Next Steps

Automate Mislabel Detection

  • Implement systems that automatically detect mislabels in the dataset, reducing the manual review workload and increasing data reliability (a heuristic sketch follows)
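
One possible detector, sketched from the findings above: a within-topic error made at high BERT confidence matches the 'Mislabel' signature. The threshold and column names are assumptions, not the project's implementation.

```python
import pandas as pd

def mislabel_candidates(errors: pd.DataFrame, conf_threshold: float = 0.9) -> pd.DataFrame:
    """Return error rows matching the 'Mislabel' signature for human review."""
    mask = errors["within_topic"] & (errors["bert_confidence"] >= conf_threshold)
    return errors[mask]
```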

Refine Clustering Strategy

  • Investigate hierarchical clustering at different levels of granularity, from detailed intents to meta-topics, to understand patterns in mislabeling (see the sketch below)
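
One hedged way to explore that granularity is networkx's louvain_partitions, which yields the partition at every level of the Louvain hierarchy, from fine-grained communities at early levels to coarse meta-topics at later ones; `graph` is the kNN graph from the Methodology sketch.

```python
import networkx as nx

# Each yielded partition is one level of the Louvain hierarchy.
for level, partition in enumerate(nx.community.louvain_partitions(graph, seed=42)):
    print(f"level {level}: {len(partition)} communities")
```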