Leveraging AutoML to optimize dataset selection for improved breast cancer variants pathogenicity prediction

Citation

Ahmad, Rahaf M. and AlDhaheri, Noura and Mohamad, Mohd Saberi and Ali, Bassam R. (2025) Leveraging AutoML to optimize dataset selection for improved breast cancer variants pathogenicity prediction. Computational and Structural Biotechnology Journal, 27. pp. 4668-4681. ISSN 2001-0370

Text: Leveraging AutoML to optimize dataset selection for improved breast cancer variants pathogenicity prediction.pdf - Published Version (5MB)
Restricted to Repository staff only

Abstract

Breast cancer (BC) remains one of the most prevalent and lethal malignancies worldwide, with its onset shaped by complex interactions between germline predispositions, environmental exposures, and accumulated somatic mutations. Accurate prediction of variant pathogenicity is essential for identifying high-risk individuals, guiding early detection, and tailoring treatment strategies. However, existing computational tools often lack disease-specific training and fail to generalize across diverse variant datasets. To address this gap, we systematically benchmarked the predictive utility of four distinct variant datasets using three Automated Machine Learning (AutoML) frameworks: TPOT, H2O AutoML, and MLJAR. Our goal was to evaluate how dataset composition influences classification performance and to identify the optimal dataset for BC-specific pathogenicity prediction. Among the datasets evaluated, Dataset-2, curated from both cancer-specific and non-cancer databases, consistently yielded the highest predictive performance across all frameworks. H2O AutoML achieved a peak accuracy of 99.99%, while TPOT and MLJAR also exhibited robust generalization on this dataset. Feature importance analyses revealed strong convergence across frameworks, highlighting conservation scores and pathogenicity metrics as dominant predictors. Interpretability techniques including SHAP, permutation importance, and LIME further validated the biological relevance and transparency of the models. This study presents a scalable, interpretable AutoML benchmarking framework tailored to the clinical prioritization of BC variants. By demonstrating the superiority of cancer-specific, disease-relevant datasets, our findings underscore the critical importance of thoughtful dataset design in machine learning pipelines for genomic medicine. Beyond BC, this framework is readily transferable to other genetic disorders, providing a foundational tool for precision diagnostics and the advancement of personalized oncology.
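
Illustrative note: the abstract names permutation importance as one of the interpretability techniques used. The sketch below shows the general idea only and is not the authors' pipeline. The data are synthetic, and the names `toy_predict`, `rows`, and `labels` are hypothetical stand-ins for a trained model and a variant feature table. Feature 0 plays the role of an informative score (for example, a conservation score) and feature 1 is pure noise.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows whose predicted label matches the true label."""
    hits = sum(1 for row, y in zip(rows, labels) if predict(row) == y)
    return hits / len(labels)


def permutation_importance(predict, rows, labels, n_repeats=10, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle only column j, leaving every other feature untouched.
            column = [row[j] for row in rows]
            rng.shuffle(column)
            shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(rows, column)]
            drops.append(baseline - accuracy(predict, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return baseline, importances


# Synthetic toy data (an assumption for illustration, NOT from the paper):
# the label depends only on feature 0; feature 1 is random noise.
data_rng = random.Random(42)
rows = [[data_rng.random(), data_rng.random()] for _ in range(200)]
labels = [1 if row[0] > 0.5 else 0 for row in rows]


def toy_predict(row):
    """Hypothetical stand-in for a trained classifier: thresholds feature 0."""
    return 1 if row[0] > 0.5 else 0


baseline, importances = permutation_importance(toy_predict, rows, labels)
print("baseline accuracy:", baseline)
print("importance per feature:", importances)
```

Shuffling the informative column lowers accuracy sharply, while shuffling the noise column leaves it unchanged. That contrast is what lets the technique rank predictors without depending on any one model's internals.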

Item Type: Article
Uncontrolled Keywords: Automated machine learning, breast cancer, dataset optimization
Subjects: Q Science > QA Mathematics > QA71-90 Instruments and machines > QA75.5-76.95 Electronic computers. Computer science > QA76.75-76.765 Computer software
R Medicine > RC Internal medicine > RC0254 Neoplasms. Tumors. Oncology (including Cancer)
Divisions: Faculty of Engineering and Technology (FET)
Depositing User: Nurin Syazwani Azmi
Date Deposited: 04 Dec 2025 08:45
Last Modified: 13 Dec 2025 07:06
URI: http://shdl.mmu.edu.my/id/eprint/14960
