Citation
Mohd Ali, Nursabillilah (2024) Classification of microarray breast cancer datasets via features selection using binary single solution simulated Kalman Filter. PhD thesis, Multimedia University.
Full text not available from this repository.
Abstract
Gene expression profiling using DNA microarrays can provide useful and independent predictive information for patients recently diagnosed with breast cancer. Researchers are attempting to identify links between genes and this disease, as well as correlations between genes. Microarrays are also used to uncover novel biomarkers, which can provide accurate diagnosis and monitoring tools for early detection of certain subtypes of cancer, or for assessing the effectiveness of specific treatment protocols. Microarray datasets suffer from a huge number of features relative to their sample size. The study aims to introduce a novel feature selection (FS) method for microarray breast cancer data. The primary objective is to achieve maximum accuracy while minimising the number of selected features. Human gene expression profiles on the NCBI website are high-dimensional and imbalanced, with small sample sizes that hinder the search for accurate features. Thus, this work proposes a new FS method termed binarization-ssSKF (bin-ssSKF) to improve testing accuracy, evaluated with 10-fold cross-validation using three machine learning algorithms (SVM, RF, and KNN). The proposed method was applied to five breast cancer datasets, GSE2034, GSE1456, GSE7390_DMFS, GSE7390_RFS, and GSE11121, which contained different numbers of features and sample sizes. These datasets underwent a normalisation process, where each feature was scaled and normalised for the FS process. A binary dataset was applied in the study, and a supervised machine learning method was used. Accuracy was reported under four settings: without FS and without parameter tuning; without FS and with parameter tuning; with FS and without parameter tuning; and with both FS and parameter tuning. With the proposed FS method, the testing accuracy was improved and comparable to the existing results.
For GSE2034 (the Wang dataset), accuracy benchmarked against the existing study increased significantly from 61% to 74.14% (the best accuracy) using bin-ssSKF with the RF classifier after parameter tuning, with a selected feature ratio of 0.264. GSE1456 achieved a testing accuracy of 87.50%. GSE7390_DMFS and GSE7390_RFS achieved 77.5% and 75%, respectively, while GSE11121 scored 82.50%. These testing results surpass the performance of previously used methods.
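
The evaluation described in the abstract (per-feature scaling, a binary feature mask, a selected feature ratio, and 10-fold cross-validation) can be illustrated with a minimal Python sketch. This is not the bin-ssSKF optimiser and does not reproduce the thesis results: it only shows how one candidate binary mask might be scored. The function names, the toy data, and the 1-nearest-neighbour classifier (a stand-in for the KNN, SVM, and RF classifiers named above) are all assumptions made for illustration.

```python
import random

def min_max_scale(X):
    """Scale each feature (column) to [0, 1]; constant columns map to 0."""
    cols = list(zip(*X))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[(v - l) / (h - l) if h > l else 0.0
             for v, l, h in zip(row, lo, hi)] for row in X]

def apply_mask(X, mask):
    """Keep only the columns whose mask bit is 1."""
    return [[v for v, m in zip(row, mask) if m] for row in X]

def selected_ratio(mask):
    """Fraction of features kept by the binary mask."""
    return sum(mask) / len(mask)

def predict_1nn(train_X, train_y, x):
    """Label of the nearest training sample (squared Euclidean distance)."""
    nearest = min(range(len(train_X)),
                  key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)))
    return train_y[nearest]

def cv_accuracy(X, y, mask, k=10, seed=0):
    """Mean k-fold accuracy of 1-NN on the masked features.

    Simplification: scaling is done once over all samples for brevity;
    a stricter protocol would fit the scaling on each training fold.
    Assumes len(X) >= k so that no fold is empty.
    """
    Xm = apply_mask(min_max_scale(X), mask)
    idx = list(range(len(Xm)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        tr_X = [Xm[i] for i in train]
        tr_y = [y[i] for i in train]
        correct = sum(predict_1nn(tr_X, tr_y, Xm[i]) == y[i] for i in fold)
        scores.append(correct / len(fold))
    return sum(scores) / k

# Hypothetical toy data: feature 0 separates the two classes,
# features 1 to 3 are pure noise.
rng = random.Random(1)
y = [i % 2 for i in range(40)]
X = [[label * 10 + rng.random()] + [rng.random() for _ in range(3)] for label in y]

mask = [1, 0, 0, 0]                 # candidate solution: keep feature 0 only
ratio = selected_ratio(mask)        # 1 of 4 features kept
acc = cv_accuracy(X, y, mask)       # 10-fold accuracy with the mask applied
acc_all = cv_accuracy(X, y, [1, 1, 1, 1])  # baseline without feature selection
print(ratio, acc, acc_all)
```

In a wrapper-style FS setting, a search procedure would propose many such masks and prefer those with high cross-validated accuracy and a low selected feature ratio, which matches the two objectives stated in the abstract.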
Item Type: | Thesis (PhD) |
---|---|
Additional Information: | Call No.: QP624.5.D726 N87 2024 |
Uncontrolled Keywords: | DNA microarrays |
Subjects: | Q Science > QP Physiology |
Divisions: | Faculty of Engineering and Technology (FET) |
Depositing User: | Ms Nurul Iqtiani Ahmad |
Date Deposited: | 03 Feb 2025 02:48 |
Last Modified: | 03 Feb 2025 02:48 |
URI: | http://shdl.mmu.edu.my/id/eprint/13340 |