
Machine Learning Error Analysis with AMEX

This project used BERT embeddings and clustering algorithms to analyze classification errors, group similar data into meaningful topics, and identify potential issues for corrective action.

  • Classification Error Analysis
  • Clustering Algorithm
  • Embedding Analysis

The Problem

The challenge lies in the immense volume of credit card activity: $4.8 trillion in transactions processed in 2022, 10 billion transactions handled by AmEx in 2023, and 251,952 complaints filed with the Consumer Financial Protection Bureau from 2021 to 2024. Current machine learning models struggle with mislabeled training data and confidently misclassify transactions, leading to unreliable predictions. Tackling these issues requires more effective error analysis to improve accuracy and address customer concerns at scale.

Dataset and Pretrained Model

Dataset: 77 unique labels

Train Set

  • 10,003 utterances
  • Imbalanced label distribution

Test Set

  • 3,080 utterances
  • Balanced label distribution

Pretrained Model: BERT-Banking77

  • 98.34% accuracy on train data
  • 92.76% accuracy on test data

Methodology

Louvain Clustering

  • "Community identifier" that groups nodes by network connectivity, efficiently handling diverse community sizes and complex structures
  • Trained unsupervised, leveraging hierarchical by modularity optimization
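
A minimal sketch of this pipeline, assuming sentence embeddings from a BERT-family encoder and a cosine kNN graph; the encoder name, toy utterances, and graph parameters below are illustrative assumptions, not the project's exact setup:

```python
# Minimal sketch: embed utterances, build a kNN graph, run Louvain.
import networkx as nx
from sentence_transformers import SentenceTransformer
from sklearn.neighbors import kneighbors_graph

utterances = [
    "My card was declined at checkout",
    "How do I dispute a charge on my statement?",
    "I want to report my card as stolen",
    "Why was my payment rejected twice?",
]

# 1. Embed each utterance (encoder choice is illustrative).
encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(utterances)

# 2. Connect each utterance to its nearest neighbors by cosine distance,
#    giving Louvain a graph structure to partition.
adjacency = kneighbors_graph(embeddings, n_neighbors=2, metric="cosine")
graph = nx.from_scipy_sparse_array(adjacency)

# 3. Louvain merges nodes into communities to maximize modularity;
#    it needs no labels and discovers the number of communities itself.
communities = nx.community.louvain_communities(graph, seed=42)
print(f"Found {len(communities)} communities")
```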

Optimization
Tuning Louvain with BERT-Banking77 embeddings produced 77 distinct clusters (a tuning sketch follows this list):

  • Out of 3,080 utterances, only 8 were classified as outliers
  • Resolution = 12.8, modularity = 0.67
  • Identified 181 potential errors
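
The tuning itself can be sketched as a resolution sweep; this hedged example reuses `graph` from the sketch above, and the candidate grid is an assumption. Higher resolution values push Louvain toward more, smaller communities, which is how the cluster count can be steered toward the 77 known labels.

```python
import networkx as nx

# Sweep Louvain's resolution and record cluster count and modularity.
best = None
for resolution in (1.0, 2.0, 4.0, 8.0, 12.8, 16.0):  # illustrative grid
    parts = nx.community.louvain_communities(graph, resolution=resolution, seed=42)
    q = nx.community.modularity(graph, parts, resolution=resolution)
    print(f"resolution={resolution:5.1f}  clusters={len(parts):3d}  modularity={q:.2f}")
    # Keep the setting whose cluster count lands closest to the 77 gold labels.
    if best is None or abs(len(parts) - 77) < abs(len(best[1]) - 77):
        best = (resolution, parts)
```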

Topic Consolidation
Manually grouped similar categories into 12 comprehensive topics:

  • The most common label in each cluster became the cluster name, as sketched below
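
A sketch of that naming rule, assuming `communities` from the Louvain step and a hypothetical `labels` list mapping each utterance index to its gold label:

```python
from collections import Counter

def name_clusters(communities, labels):
    """Name each cluster after the most common gold label among its members."""
    names = {}
    for cluster_id, members in enumerate(communities):
        counts = Counter(labels[i] for i in members)
        names[cluster_id] = counts.most_common(1)[0][0]
    return names
```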

Analyzing Error Reason Distributions

Within-Topic vs. Out-of-Topic

Note: All data shown here is for demonstration purposes only and does not represent proprietary data.

Over 80% of within-topic errors are caused by mislabeling.
'Model Errors' are more common in the out-of-topic category.
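
A breakdown like this can be produced with a row-normalized crosstab; the column names and records below are toy assumptions, not project data:

```python
import pandas as pd

# Toy error records: topic membership of the error and its diagnosed reason.
errors = pd.DataFrame({
    "within_topic": [True, True, True, False, False],
    "reason": ["Mislabel", "Mislabel", "Model Error", "Model Error", "Model Error"],
})
# Each row sums to 1, mirroring the percentages discussed above.
print(pd.crosstab(errors["within_topic"], errors["reason"], normalize="index"))
```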

BERT Confidence in Distinguishing 'Model Error' and 'Mislabel' Reasons

Within-Topic vs. Out-of-Topic

Within-topic errors made at a high BERT confidence level are a strong indicator of the 'Mislabel' type.

'Model Error' peaks in the 0.3-0.4 confidence bin, while 'Mislabeled' errors are more frequent in the high-confidence bins, particularly 0.9-1.0.
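
The underlying binning can be sketched as follows, with toy confidences and hypothetical column names:

```python
import numpy as np
import pandas as pd

# Toy erroneous predictions: BERT softmax confidence and diagnosed reason.
errors = pd.DataFrame({
    "bert_confidence": [0.35, 0.38, 0.55, 0.92, 0.97, 0.99],
    "reason": ["Model Error", "Model Error", "Model Error",
               "Mislabel", "Mislabel", "Mislabel"],
})
# Bucket errors into 0.1-wide confidence bins and count reasons per bin.
bins = np.arange(0.0, 1.01, 0.1)
errors["conf_bin"] = pd.cut(errors["bert_confidence"], bins)
print(errors.groupby(["conf_bin", "reason"], observed=True).size().unstack(fill_value=0))
```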

BERT Confidence vs. Louvain Cluster Confidence

Within-Topic

  • BERT expresses high confidence for a large portion of all erroneous predictions
  • ‘Mislabeled’ errors are concentrated at high BERT confidence levels but lack significant association with Louvain cluster confidence

Out-of-Topic

  • Out-of-topic errors are particularly common below the 0.6 cluster confidence level
  • Otherwise, errors show no clear distribution pattern and are dispersed

Louvain Results

We compared BERT-Banking77 predicted confidence with Louvain cluster confidence for all utterances to analyze the relationship between the two.
The pattern we observed shows:

  • 97% of correctly classified utterances have a predicted confidence above 0.8
  • Less than 1% of correctly classified utterances have a predicted confidence below 0.6; of all utterances in this range, over 73% are errors
  • Real-world application: use a confidence threshold (e.g., 0.6) as an error indicator and streamline QA by automatically flagging these data points for manual review (see the sketch below)
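
A minimal sketch of that QA rule, with hypothetical column names and toy predictions:

```python
import pandas as pd

# Toy predictions; in practice these come from the classifier's output.
predictions = pd.DataFrame({
    "utterance": ["My card was declined", "Close my account please"],
    "predicted_label": ["declined_card_payment", "terminate_account"],
    "confidence": [0.42, 0.95],
})

THRESHOLD = 0.6  # drawn from the analysis above; tune per deployment
review_queue = predictions[predictions["confidence"] < THRESHOLD]
print(f"{len(review_queue)} of {len(predictions)} predictions flagged for review")
```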

Clustering Next Steps

Automate Mislabel Detection

  • Implement systems that automatically detect mislabels in the dataset, reducing the manual review workload and increasing data reliability (a heuristic sketch follows)
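
One possible detector, sketched from the findings above: a within-topic error made at high BERT confidence matches the 'Mislabel' signature. The threshold and column names are assumptions, not the project's implementation.

```python
import pandas as pd

def mislabel_candidates(errors: pd.DataFrame, conf_threshold: float = 0.9) -> pd.DataFrame:
    """Return error rows matching the 'Mislabel' signature for human review."""
    mask = errors["within_topic"] & (errors["bert_confidence"] >= conf_threshold)
    return errors[mask]
```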

Refine Clustering Strategy

  • Investigate hierarchical clustering at different levels of granularity, from detailed intents to meta-topics, to understand patterns in mislabeling (see the sketch below)
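
One hedged way to explore that granularity is networkx's louvain_partitions, which yields the partition at every level of the Louvain hierarchy, from fine-grained communities at early levels to coarse meta-topics at later ones; `graph` is the kNN graph from the Methodology sketch.

```python
import networkx as nx

# Each yielded partition is one level of the Louvain hierarchy.
for level, partition in enumerate(nx.community.louvain_partitions(graph, seed=42)):
    print(f"level {level}: {len(partition)} communities")
```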