Feature Transformation on Big Data for Species Classification in Machine Learning

Citation

Yow, Li Wen and Ong, Lee Yeng and Tan, Joon Liang (2025) Feature Transformation on Big Data for Species Classification in Machine Learning. Emerging Science Journal, 9 (6). pp. 3017-3041. ISSN 2610-9182

Abstract

Classifying bacterial species, particularly closely related taxa, remains a major challenge in areas such as public health and the food industry. The difficulty stems mainly from overlapping genetic features among organisms and from the complexity of the data. This study introduces a bacterial taxonomic identification framework that integrates genome-derived motif sequences with machine learning. Two hundred and forty genome sequences from Salmonella enterica, representing six subspecies and ten serovars, were used for modelling. Sequence motifs were predicted from single-copy orthologous core genes of the downloaded genomes. Single nucleotide polymorphisms (SNPs) within these motifs were extracted and numerically encoded as machine learning features. The 20 most informative predictors identified by feature selection were used to train Random Forest and Support Vector Machine models. Across the analyses compared, the Random Forest model achieved the highest accuracy, 97.92%, demonstrating reliable differentiation of Salmonella at both the subspecies and serovar levels. This research presents two key innovations: i) the use of sequence motifs as molecular signatures for bacterial classification; and ii) a novel feature engineering method that transforms genome-derived data into machine-learning-readable features. The proposed framework offers a practical and scalable solution for fine-level bacterial classification and has high potential for application to other microbial taxa.
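
The abstract does not specify how SNPs were encoded or which feature selection method was used, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a simple integer code per nucleotide and ranks SNP sites by information gain; the names `NUCLEOTIDE_CODES`, `encode_snps`, `information_gain`, and `top_k_features`, as well as the toy samples and labels, are hypothetical.

```python
from collections import Counter
from math import log2

# Assumed integer codes; the paper's actual encoding is not given in the abstract.
NUCLEOTIDE_CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
UNKNOWN_CODE = 4  # gaps, N, or any other symbol


def encode_snps(alleles):
    """Map a string of SNP alleles (one character per site) to integers."""
    return [NUCLEOTIDE_CODES.get(base.upper(), UNKNOWN_CODE) for base in alleles]


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(column, labels):
    """Reduction in label entropy after grouping samples by the column value."""
    groups = {}
    for value, label in zip(column, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def top_k_features(matrix, labels, k):
    """Return the indices of the k columns with the highest information gain."""
    n_features = len(matrix[0])
    gains = [
        information_gain([row[j] for row in matrix], labels)
        for j in range(n_features)
    ]
    return sorted(range(n_features), key=lambda j: gains[j], reverse=True)[:k]


# Toy data, made up for illustration: four samples, three SNP sites each.
samples = ["ACG", "ACT", "GCG", "GCT"]
labels = ["enterica", "enterica", "arizonae", "arizonae"]

matrix = [encode_snps(s) for s in samples]
selected = top_k_features(matrix, labels, k=1)
print(selected)  # only the first site separates the two toy classes
```

In a fuller pipeline, the selected columns could then be passed to a classifier such as scikit-learn's `RandomForestClassifier` or `SVC`; the study reports using the 20 top-ranked predictors for that step.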

Item Type: Article
Uncontrolled Keywords: Big Data, Bioinformatics, Feature Selection, Machine Learning, Sequence Motifs
Subjects: Q Science > QA Mathematics > QA71-90 Instruments and machines
Divisions: Faculty of Information Science and Technology (FIST)
Depositing User: Ms Suzilawati Abu Samah
Date Deposited: 07 Jan 2026 06:47
Last Modified: 07 Jan 2026 08:43
URI: http://shdl.mmu.edu.my/id/eprint/15182
