Machine Learning Framework for customer churn prediction and risk cluster analysis

Citation

Lim, Yi Yang (2025) Machine Learning Framework for customer churn prediction and risk cluster analysis. Masters thesis, Multimedia University.

Full text not available from this repository.
Official URL: http://erep.mmu.edu.my/

Abstract

Customer churn occurs when a customer stops using a company’s services. With increasing competition, customer retention has become an important part of business strategy. This study makes three contributions. First, it introduces a holistic approach to customer churn prediction that addresses common real-world challenges such as missing values and imbalanced datasets. Second, the framework helps to identify the primary factors driving churn across different sectors. Third, it enables businesses to implement targeted retention strategies through churn risk analysis, which not only optimises resource allocation but also uncovers important behavioural patterns within each customer group. Three datasets are used, namely the E-commerce, BigML, and Bank Churners datasets, to ensure that the proposed method can be applied to diverse business sectors. The research begins with data preprocessing to ensure that the data is clean and suitable for further analysis. Exploratory data analysis is then performed to identify potential issues, including missing values and class imbalance. Next, the minimum redundancy maximum relevance (mRMR) feature selection method is applied to identify the features most relevant to the target variable. Subsequently, the Synthetic Minority Oversampling Technique (SMOTE) is employed to handle class imbalance in the dataset. Three prediction models (XGBoost, Logistic Regression, and Random Forest) are used to predict customer churn, with model hyperparameters optimised using Optuna. To ensure that the results are consistent and unbiased, 10-fold cross-validation is applied for model evaluation. Next, Shapley Additive Explanations (SHAP) is used to identify the factors contributing to churn. K-means clustering is then used to segment customers into distinct groups, and finally, a Bayesian logistic regression model is used to analyse the risk profile of each cluster, which helps to identify loyal, medium-risk, and high-risk clusters.
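The oversampling step named above can be illustrated with a minimal sketch. This is not the thesis implementation: the function name `smote_like_oversample`, its parameters, and the toy minority samples are all hypothetical. It only shows the core idea behind SMOTE, which is to create synthetic minority-class rows by interpolating between a minority sample and one of its nearest minority-class neighbours.

```python
import numpy as np


def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """Return n_new synthetic rows built from minority-class rows X_min.

    Hypothetical helper for illustration only; needs at least two rows.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)  # cannot have more neighbours than other points

    # Pairwise Euclidean distances within the minority class.
    diffs = X_min[:, None, :] - X_min[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour

    # Indices of the k nearest minority neighbours of every row.
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a random base row and one of its neighbours for each new sample.
    base = rng.integers(0, n, size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]

    # Place each new sample at a random point on the connecting segment.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])


# Toy minority-class samples (assumed data, not from the thesis datasets).
X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_oversample(X_minority, n_new=6, k=2)
print(synthetic.shape)
```

Because every synthetic row lies on a segment between two real minority rows, it stays within the range spanned by the original minority samples. The abstract does not state at which point in the cross-validation loop oversampling is applied; a common precaution is to oversample only the training folds so that no synthetic rows appear in the evaluation data.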
Based on the performance analysis, the XGBoost model achieves the best performance among the three evaluated models, followed by Random Forest and Logistic Regression. The accuracy achieved is 98.28% for the E-commerce dataset, 93.18% for the BigML dataset, and 97.88% for the Bank Churners dataset. A comparative analysis of the results obtained with and without SMOTE indicates that a balanced dataset is crucial for improving model performance. In particular, the F1 score, recall, and precision improved significantly after applying the proposed method.
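The emphasis on F1 score, recall, and precision alongside accuracy can be illustrated with a small sketch. The labels below are invented toy data and the helper names are hypothetical; nothing here reproduces the thesis results. It shows that, on an imbalanced dataset, a model that never predicts churn can still score high accuracy while its recall on the churn class is zero.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 score for the class labelled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy imbalanced labels (assumed): 90 retained customers (0), 10 churners (1).
y_true = [0] * 90 + [1] * 10
# A trivial model that predicts "no churn" for everyone.
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(acc, precision, recall, f1)
```

In this toy case the accuracy is 0.9 even though no churner is detected, which is why class-sensitive metrics are the more informative comparison when assessing the effect of rebalancing.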

Item Type: Thesis (Masters)
Additional Information: Call No.: Q325.5 .L56 2025
Uncontrolled Keywords: Machine learning
Subjects: Q Science > Q Science (General) > Q300-390 Cybernetics
Divisions: Faculty of Information Science and Technology (FIST)
Depositing User: Ms Nurul Iqtiani Ahmad
Date Deposited: 07 Apr 2026 01:40
Last Modified: 07 Apr 2026 01:40
URI: http://shdl.mmu.edu.my/id/eprint/15699
