Advanced Credit Card Fraud Detection Using Ensemble Machine Learning
Introduction
Credit card fraud represents a significant challenge for financial institutions, with billions lost annually to fraudulent transactions. In this post, I’ll walk through building an advanced fraud detection system using ensemble machine learning techniques. We’ll optimize multiple models using Bayesian optimization and combine their strengths through stacking.
Data Preparation and Preprocessing
First, we download the fraud dataset from Kaggle using kagglehub:
import kagglehub
# Download latest version
path = kagglehub.dataset_download("kartik2112/fraud-detection")
print("Path to dataset files:", path)
With the files downloaded, we load them and prepare them for modeling:
import os
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder
# Load the datasets from the directory returned by kagglehub
train_df = pd.read_csv(os.path.join(path, 'fraudTrain.csv'))
test_df = pd.read_csv(os.path.join(path, 'fraudTest.csv'))
# ...existing code...
# Encode categorical features
label_cols = ['merchant', 'category', 'first', 'last', 'gender', 'street']
le = LabelEncoder()
# ...existing code...
# Feature engineering - convert transaction time to seconds
train_df['trans_date_trans_time'] = pd.to_datetime(train_df['trans_date_trans_time'])
# ...existing code...
The key preprocessing steps include:
- Encoding categorical variables using Label Encoding
- Converting transaction timestamps to a numeric feature (seconds elapsed)
- Scaling the transaction amount using StandardScaler
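The three steps above can be sketched on a toy frame. This is a minimal illustration rather than the elided code from this post: the sample rows are made up, `amt` as the raw amount column is an assumption, and measuring elapsed seconds from the earliest transaction is an assumed way of deriving the `Time` feature. The `Time` and `Amount` names match the feature list used later.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Toy rows standing in for the real dataset (values are made up)
df = pd.DataFrame({
    'trans_date_trans_time': ['2019-01-01 00:00:18', '2019-01-01 00:00:44',
                              '2019-01-01 00:01:00'],
    'category': ['grocery_pos', 'gas_transport', 'grocery_pos'],
    'amt': [4.97, 107.23, 220.11],  # assumed name of the raw amount column
})

# 1. Label-encode a categorical column (use one encoder per column)
encoder = LabelEncoder()
df['category'] = encoder.fit_transform(df['category'])

# 2. Timestamp -> seconds elapsed since the earliest transaction
#    (the reference point is an assumption)
ts = pd.to_datetime(df['trans_date_trans_time'])
df['Time'] = (ts - ts.min()).dt.total_seconds()

# 3. Standardize the amount to zero mean and unit variance
scaler = StandardScaler()
df['Amount'] = scaler.fit_transform(df[['amt']]).ravel()

print(df[['category', 'Time', 'Amount']])
```

On the real data, the encoders and the scaler would normally be fitted on the training set and reused on the test set with `transform`. Note that `LabelEncoder.transform` raises a `ValueError` for labels it did not see during fitting, which can happen with high-cardinality columns such as `merchant` or `street`.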
Handling Class Imbalance
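Fraud detection typically deals with highly imbalanced data: fraudulent transactions make up only a small fraction of all transactions. We use SMOTEENN, which first oversamples the minority class with SMOTE (synthetic minority oversampling) and then cleans the result with Edited Nearest Neighbors, removing samples whose class disagrees with that of their neighbors. This gives a more balanced training dataset: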
from sklearn.model_selection import train_test_split
from imblearn.combine import SMOTEENN
# Prepare features and target
X = train_df[['Time', 'Amount'] + label_cols]
y = train_df['is_fraud']
# ...existing code...
Hyperparameter Optimization with Bayesian Optimization
For each of our base models (XGBoost, LightGBM, and CatBoost), we use Bayesian optimization to find optimal hyperparameters:
XGBoost Optimization
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score
from bayes_opt import BayesianOptimization
def optimize_xgboost(learning_rate, max_depth, n_estimators, gamma):
    max_depth = int(max_depth)
    n_estimators = int(n_estimators)
    # ...existing code...
    return auc
param_space_xgb = {
    'learning_rate': (0.01, 0.2),
    # ...existing code...
}
# ...existing code...
LightGBM Optimization
from lightgbm import LGBMClassifier
def optimize_lightgbm(learning_rate, max_depth, n_estimators, num_leaves):
    # ...existing code...
    return auc
param_space_lgb = {
    # ...existing code...
}
# ...existing code...
CatBoost Optimization
from catboost import CatBoostClassifier
def optimize_catboost(learning_rate, depth, iterations, l2_leaf_reg):
    # ...existing code...
    return auc
param_space_cat = {
    # ...existing code...
}
# ...existing code...
Building the Stacked Ensemble Model
After optimizing each individual model, we combine them using a stacking ensemble approach:
from sklearn.ensemble import StackingClassifier
# Create optimized base models
xgb_best = XGBClassifier(**best_params_xgb, random_state=42)
lgb_best = LGBMClassifier(**best_params_lgb, random_state=42)
cat_best = CatBoostClassifier(**best_params_cat, silent=True, random_state=42)
# Create stacking ensemble
stacking = StackingClassifier(estimators=[('XGB', xgb_best),
                                          ('LGB', lgb_best),
                                          ('Cat', cat_best)],
                              final_estimator=XGBClassifier(random_state=42))
# Train the stacked ensemble
stacking.fit(X_train_balanced, y_train_balanced)
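To show what the stacking step does internally, the sketch below uses scikit-learn stand-ins: a random forest and a gradient boosting model as base learners, and logistic regression as the meta-learner. These are swapped in only to keep the example light; they are not the models used in this post. `StackingClassifier` fits each base model with internal cross-validation and trains the final estimator on the resulting out-of-fold predictions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(n_samples=600, n_features=8,
                                     random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25,
                                          random_state=0)

# Stand-in base models (the post uses tuned XGBoost, LightGBM and CatBoost)
base_models = [
    ('rf', RandomForestClassifier(n_estimators=50, random_state=42)),
    ('gb', GradientBoostingClassifier(n_estimators=50, random_state=42)),
]

demo_stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(),  # meta-learner; a stand-in choice
    cv=5,                                  # folds for out-of-fold predictions
    stack_method='predict_proba',          # pass probabilities to the meta-learner
)
demo_stack.fit(X_tr, y_tr)

# transform() exposes the meta-features the final estimator sees:
# one positive-class probability column per base model in the binary case
meta_features = demo_stack.transform(X_te)
print(meta_features.shape)
print(demo_stack.score(X_te, y_te))
```

Because every base model is refitted once per fold plus once on the full data, training cost grows with `cv`, which is worth keeping in mind on a resampled dataset of this size.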
Evaluation
Finally, we evaluate our model on the test set:
from sklearn.metrics import classification_report, roc_auc_score
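# Make predictions: hard labels for the report, probabilities for ROC AUC
y_pred = stacking.predict(X_test)
y_proba = stacking.predict_proba(X_test)[:, 1]
# Print evaluation metrics
print(classification_report(y_test, y_pred))
print('ROC AUC Score:', roc_auc_score(y_test, y_proba))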
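Ranking metrics such as ROC AUC are meant to be computed from scores or probabilities, while precision and recall depend on the threshold used to turn scores into labels. The small example below uses made-up labels and scores to show the difference. `average_precision_score` is included as an extra metric that is often informative for rare positives; it is an addition here, not something this post reports.

```python
import numpy as np
from sklearn.metrics import (average_precision_score, precision_score,
                             recall_score, roc_auc_score)

# Made-up labels and scores for illustration (1 = fraud)
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
scores = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.70, 0.60, 0.90])

# Threshold-free ranking metrics use the scores directly
roc_auc = roc_auc_score(y_true, scores)
pr_auc = average_precision_score(y_true, scores)

def precision_recall_at(threshold):
    """Precision and recall after thresholding the scores into labels."""
    y_hat = (scores >= threshold).astype(int)
    return precision_score(y_true, y_hat), recall_score(y_true, y_hat)

print('ROC AUC:', roc_auc)
print('Average precision:', pr_auc)
for t in (0.5, 0.65):
    print('threshold', t, '->', precision_recall_at(t))
```

Raising the threshold from 0.5 to 0.65 in this toy case drops one of the two true frauds, which is the kind of trade-off a single summary number does not show.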
Conclusion
In this post, we built an advanced credit card fraud detection system using ensemble learning techniques. We tuned three gradient boosting algorithms (XGBoost, LightGBM, and CatBoost) with Bayesian optimization and combined them in a stacked ensemble, so that the final model can draw on the strengths of each when identifying fraudulent transactions.
Key techniques we utilized:
- Feature engineering and encoding
- Handling imbalanced data with SMOTEENN
- Hyperparameter tuning with Bayesian optimization
- Model stacking for improved performance
This approach can be extended to other fraud detection scenarios and generally serves as a solid framework for tackling complex classification problems.