SMOTE vs SMOTE-ENN: Which is more effective for Churn Prediction in Imbalanced Banking Data
An imbalanced dataset is one in which the number of instances belonging to one class is significantly higher or lower than the number belonging to the other classes; in other words, the distribution of the target variable is not uniform across classes. This can bias learning toward the majority class and also makes the model harder to evaluate.
There are various techniques for handling imbalanced datasets, but SMOTE (Synthetic Minority Oversampling Technique) is the most widely used. It creates artificial data points around the minority class by interpolating between existing minority samples.
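To make that concrete, here is a minimal sketch of SMOTE's core step (the k-nearest-neighbor search over minority samples is omitted for brevity): a synthetic point is placed at a random position on the line segment between a minority sample and one of its minority-class neighbors.

import numpy as np

def smote_point(x_i, x_neighbor, rng=np.random.default_rng(42)):
    # SMOTE interpolation: x_new = x_i + lam * (x_nn - x_i),
    # with lam drawn uniformly from [0, 1]
    lam = rng.uniform(0.0, 1.0)
    return x_i + lam * (x_neighbor - x_i)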
Problem with SMOTE
SMOTE can create noisy and irrelevant synthetic samples in regions of feature space where the minority and majority classes overlap, since it interpolates without considering the majority class. This can result in overfitting and reduced generalization performance of the model.
SMOTE with the Edited Nearest Neighbors (ENN)
SMOTE-ENN is a variant of SMOTE that addresses this limitation by combining SMOTE with the Edited Nearest Neighbors (ENN) technique. ENN is a data-cleaning step that removes samples whose class label disagrees with the majority of their nearest neighbors, filtering out noisy and irrelevant points.
By combining SMOTE with ENN, SMOTE-ENN generates synthetic samples that are more representative of the minority class while removing noisy samples, which can lead to improved generalization performance of the model.
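For reference, ENN is also available as a standalone resampler in imbalanced-learn; a minimal sketch of using it on its own (assuming X_train and y_train are already defined) looks like this:

from imblearn.under_sampling import EditedNearestNeighbours

# Drop samples whose class disagrees with the majority of their
# 3 nearest neighbors
enn = EditedNearestNeighbours(n_neighbors=3)
X_clean, y_clean = enn.fit_resample(X_train, y_train)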
We will assess the effectiveness of SMOTE and SMOTE-ENN on a churn prediction problem: predicting high-risk customers who are likely to leave the bank.
The target class distribution for this problem is roughly 80/20: about 20% of customers have churned, while the remaining 80% are still active. The target class is therefore highly imbalanced.
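As a quick sanity check, this split can be verified directly from the data; a minimal sketch, assuming the dataset lives in a DataFrame df with a target column named Exited (a common name in banking churn datasets, but an assumption here):

# Share of each class in the target: roughly 0.80 active vs 0.20 churned
print(df["Exited"].value_counts(normalize=True))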
The main focus of this problem is to prioritize the detection of customers who are likely to churn over those who are likely to stay. As a result, recall is a crucial metric for this task.
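Recall is the fraction of actual churners the model catches: recall = TP / (TP + FN). With scikit-learn it can be computed directly; y_true and y_pred below are placeholders for the test labels and the model's predictions:

from sklearn.metrics import recall_score

# Fraction of true churners that were correctly flagged
recall = recall_score(y_true, y_pred)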
Baseline Model (Without Balancing the Data)
from sklearn.linear_model import LogisticRegression

# Train a plain logistic regression on the imbalanced training data
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
evaluate_model(log_reg, X_train, y_train, X_test, y_test, fit=True)
Recall: 0.19
It is apparent that the model is unable to detect churned customers when the data is left imbalanced: the recall is only 0.19.
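The evaluate_model call above is a helper this article relies on but never shows; here is a minimal sketch of what such a helper might look like (an assumption, not the author's actual implementation):

from sklearn.metrics import classification_report

def evaluate_model(model, X_train, y_train, X_test, y_test, fit=True):
    # Optionally (re)fit the model, then report precision/recall/F1
    # per class on the held-out test set
    if fit:
        model.fit(X_train, y_train)
    print(classification_report(y_test, model.predict(X_test)))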
SMOTE
from imblearn.over_sampling import SMOTE

# Oversample the minority class in the training set only
oversample = SMOTE(random_state=42)
X_train_sm, y_train_sm = oversample.fit_resample(X_train, y_train)

log_reg_smote = LogisticRegression()
log_reg_smote.fit(X_train_sm, y_train_sm)
evaluate_model(log_reg_smote, X_train_sm, y_train_sm, X_test, y_test, fit=True)
Recall: 0.70
After implementing the SMOTE technique, the recall metric has increased significantly to 0.70. However, there is still room for further improvement in the model’s performance.
SMOTE-ENN
import pandas as pd
from imblearn.combine import SMOTEENN

# SMOTE oversampling followed by ENN cleaning, on the training set only
smt = SMOTEENN(random_state=42)
X_train_enn, y_train_enn = smt.fit_resample(X_train, y_train)

# Restore DataFrame structure (older versions of imbalanced-learn
# return NumPy arrays from fit_resample)
X_train_enn = pd.DataFrame(X_train_enn, columns=X_train.columns)
X_test = pd.DataFrame(X_test, columns=X_train.columns)

log_reg_smote_enn = LogisticRegression()
log_reg_smote_enn.fit(X_train_enn, y_train_enn)
evaluate_model(log_reg_smote_enn, X_train_enn, y_train_enn, X_test, y_test, fit=True)
Recall: 0.77
Upon implementing SMOTE-ENN, we observe a considerable improvement in recall, which has increased to 0.77, outperforming SMOTE.
Conclusion:
By using SMOTE-ENN instead of SMOTE, we were able to improve the recall metric significantly. This is because ENN acts as a filter on top of SMOTE, removing noisy samples and providing a cleaner dataset for modeling. As a result, the model trained on the SMOTE-ENN-balanced dataset better captured the underlying patterns in the data and made more accurate predictions on the minority class.