Development of KDP-aligned Malay Part-Of-Speech corpus and deep learning based tagger

Citation

Mohamad Ali, Nurulhuda (2025) Development of KDP-aligned Malay Part-Of-Speech corpus and deep learning based tagger. Masters thesis, Multimedia University.

Full text not available from this repository.
Official URL: http://erep.mmu.edu.my/

Abstract

Part-of-Speech (POS) tagging is a fundamental component of Natural Language Processing (NLP). While POS tagging has been extensively studied for high-resource languages supported by large, standardised annotated corpora, resources for the Malay language remain limited. Existing Malay POS-annotated datasets are scarce or not publicly accessible, and many of them are not aligned with authoritative linguistic standards provided by Dewan Bahasa dan Pustaka (DBP). Consequently, the lack of a linguistically validated Malay POS-annotated corpus limits robust POS tagging research and applications. To address these challenges, this thesis proposes a two-stage sequential framework. The first stage focuses on developing the Kamus Dewan Perdana–Malay Part-of-Speech (KDP-MPOS) corpus, aligned with DBP guidelines to ensure grammatical consistency and contemporary lexical coverage. The corpus comprises 464 Malay news articles (197,651 tokens across 8,100 sentences) annotated using a semi-automated approach followed by expert validation by three linguists. Annotation consistency is evaluated using inter-annotator agreement (IAA) metrics. The second stage focuses on the development and evaluation of a Malay POS tagging system (MPOS tagger) trained exclusively on the KDP-MPOS corpus. Multiple deep learning architectures are evaluated, including neural sequence models (RNN, LSTM, BiLSTM, and GRU) combined with Word2Vec and FastText embeddings, as well as transformer-based models (mBERT and XLM-RoBERTa). Experimental results indicate that FastText consistently outperforms Word2Vec for neural models, with the GRU–FastText configuration achieving the highest performance among recurrent models. Transformer-based models further outperform neural models, achieving accuracy above 95% on the held-out test set.
To assess robustness beyond the training domain, an external evaluation using 60 unseen text samples from news articles, advertisements, and informal Reddit posts is conducted with expert review. The analysis shows that transformer-based models exhibit stronger generalisation and lower error rates across both formal and informal text types. This research contributes (1) KDP-MPOS, the first Malay POS-annotated corpus grounded in Kamus Dewan Perdana and DBP standards, and (2) MPOS tagger, a Malay POS tagging system systematically evaluated using neural and transformer-based architectures. These contributions advance Malay NLP by providing a linguistically grounded resource and empirically validated POS tagging models to support future research and applications.

Item Type: Thesis (Masters)
Additional Information: Call No.: QA76.9.N38 N87 2025
Uncontrolled Keywords: Natural language processing (Computer science)
Subjects: Q Science > QA Mathematics > QA71-90 Instruments and machines > QA75.5-76.95 Electronic computers. Computer science > QA76.75-76.765 Computer software
Divisions: Faculty of Computing and Informatics (FCI)
Depositing User: Ms Nurul Iqtiani Ahmad
Date Deposited: 10 Jun 2026 05:22
Last Modified: 10 Jun 2026 05:22
URI: http://shdl.mmu.edu.my/id/eprint/16104
