Diabetes Prediction Using Feature Selection Algorithms and Boosting-Based Machine Learning Classifiers

Citation

Rahman, Fatima and Hossain, Sheyum and Tiang, Jun Jiat and Nahid, Abdullah-Al (2025) Diabetes Prediction Using Feature Selection Algorithms and Boosting-Based Machine Learning Classifiers. Diagnostics, 15 (20). p. 2622. ISSN 2075-4418

Text
diagnostics-15-02622.pdf - Published Version
Restricted to Repository staff only

Abstract

Background: Diabetes mellitus is a major global health concern that requires accurate early-stage diagnosis to prevent severe complications. However, accurate prediction remains challenging because available datasets are limited, noisy, and imbalanced. This study proposes a novel machine learning framework for improved diabetes prediction that addresses key challenges such as inadequate feature selection, class imbalance, and data preprocessing.

Methods: This work systematically evaluates five feature selection algorithms (Recursive Feature Elimination, Grey Wolf Optimizer, Particle Swarm Optimizer, Genetic Algorithm, and Boruta), using cross-validation and SHAP analysis to enhance feature interpretability. Classification is performed using two boosting algorithms: the Light Gradient Boosting Machine (LightGBM) and Extreme Gradient Boosting (XGBoost).

Results: The proposed framework, using the five most important features selected by the Boruta feature selection algorithm, outperformed other configurations with the LightGBM classifier, achieving an accuracy of 85.16%, an F1-score of 85.41%, and a 54.96% reduction in training time.

Conclusions: We benchmarked our approach against recent studies and validated its effectiveness on both the Pima Indian Diabetes Dataset and the newly released DiaHealth dataset, demonstrating robust and accurate early diabetes detection across diverse clinical datasets. By reducing the number of input features, providing transparent feature importance, and achieving high predictive accuracy with efficient model training, this approach offers a cost-effective, interpretable, and clinically relevant solution for early diabetes detection.
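The abstract reports accuracy, F1-score, and a percentage reduction in training time. The sketch below shows how such figures are conventionally derived from a binary confusion matrix and two timings. It is only an illustration: the function names, the confusion-matrix counts, and the timings are hypothetical and are not taken from the article, and the article does not state here which baseline its training-time reduction is measured against.

```python
def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Fraction of all predictions that are correct."""
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall for the positive class."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def percent_reduction(baseline: float, reduced: float) -> float:
    """Relative reduction of `reduced` with respect to `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - reduced) / baseline


if __name__ == "__main__":
    # Hypothetical counts and timings, chosen only to exercise the formulas.
    tp, fp, fn, tn = 40, 10, 10, 40
    print(f"accuracy: {accuracy(tp, fp, fn, tn):.2%}")
    print(f"F1-score: {f1_score(tp, fp, fn):.2%}")
    print(f"training-time reduction: {percent_reduction(2.0, 1.0):.2f}%")
```

Because F1 ignores true negatives, it can differ noticeably from accuracy on imbalanced data, which is one reason studies of this kind usually report both.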

Item Type: Article
Uncontrolled Keywords: Boosting classifier algorithms, diabetes prediction, feature selection algorithms (FSAs), machine learning, medical diagnostics
Subjects: R Medicine > RC Internal medicine > RC71-78.7 Examination. Diagnosis
Divisions: Faculty of Artificial Intelligence & Engineering (FAIE)
Depositing User: Nor Afiqah Mohd Adnan
Date Deposited: 10 Dec 2025 01:43
Last Modified: 10 Dec 2025 01:43
URI: http://shdl.mmu.edu.my/id/eprint/15002
