Citation
Ho, Ian Heng Jin (2022) Sentiment analysis in Malay text from unlabelled data. Masters thesis, Multimedia University. Full text not available from this repository.
Abstract
The Malay language is considered a low-resource language. Many studies in the field of sentiment analysis in the Malay language manually curate and label their own datasets. These datasets are usually small, owing to the manual effort required, and are not publicly available. Due to the lack of training resources, the main approach in this thesis is a lexicon-based method. The common method used to develop Malay lexicons is to translate widely accepted English lexicons and carry over their sentiment labels. However, this leads to a generalised and low-quality lexicon. Additionally, Malay text is often multilingual, particularly in Malaysia where English words are in common use, which makes sentiment analysis even more complex. In this thesis, an end-to-end framework for preprocessing and a two-phase sentiment classification method are presented. The first phase, named MAL (Multilingual Automatic Lexicon Inducer), involves the induction of a domain-specific lexicon from word embeddings. The induced lexicon can then be used in a lexicon-based classifier to label documents. It was found that MAL produced a domain-specific lexicon from a small set of seed words that outperformed generalised and translated lexicons. However, lexicon-based methods face word-match issues when documents contain sentiment words not found in the lexicon. This is mitigated in the second phase, where a supervised classifier was trained on a filtered output of the induced lexicon-based classifier to produce a model that could classify new documents containing unseen sentiment words. Overall, the proposed framework was shown to require no labelled data for sentiment analysis, to be applicable to both the Malay and English languages, and to handle different domains, described as having various writing styles (such as formal and informal) as well as various topics, contexts and platforms.
The framework achieved the following results: (a) MAL induced a lexicon that achieved an F1-score of 0.88 on both the Malay and English datasets, outperforming the baseline F1-score of 0.80. (b) The second-phase supervised classifier achieved lower F1-scores of 0.81 and 0.82 on the Malay and English datasets, respectively, but with a much higher recall than MAL.
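The two-phase idea described in the abstract can be illustrated with a minimal sketch. This is not the thesis's MAL implementation, whose details are not given in this record. It assumes a generic seed-based induction that scores each word by its cosine similarity to positive and negative seed centroids. The toy embeddings, the seed words, the thresholds, and the function names `induce_lexicon`, `score_document` and `pseudo_label` are all hypothetical and chosen only for illustration.

```python
import numpy as np

# Toy 2-D word embeddings mixing Malay and English words.
# The values are made up for illustration, not taken from the thesis.
EMBEDDINGS = {
    "bagus": np.array([0.9, 0.1]),
    "good": np.array([0.8, 0.2]),
    "best": np.array([0.85, 0.05]),
    "teruk": np.array([0.1, 0.9]),
    "bad": np.array([0.2, 0.8]),
    "lambat": np.array([0.15, 0.7]),
    "makan": np.array([0.5, 0.5]),
}


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def induce_lexicon(embeddings, pos_seeds, neg_seeds, threshold=0.1):
    """Phase 1 (assumed approach): score each word by how much closer it
    lies to the positive seed centroid than to the negative one."""
    pos_centroid = np.mean([embeddings[w] for w in pos_seeds], axis=0)
    neg_centroid = np.mean([embeddings[w] for w in neg_seeds], axis=0)
    lexicon = {}
    for word, vec in embeddings.items():
        score = cosine(vec, pos_centroid) - cosine(vec, neg_centroid)
        if abs(score) >= threshold:  # drop near-neutral words
            lexicon[word] = score
    return lexicon


def score_document(tokens, lexicon):
    """Lexicon-based document score: sum of matched word polarities."""
    return sum(lexicon.get(t, 0.0) for t in tokens)


def pseudo_label(documents, lexicon, min_abs_score=0.5):
    """Phase 2 input (assumed filter): keep only documents that the
    lexicon classifier scores confidently, paired with their labels."""
    labelled = []
    for tokens in documents:
        s = score_document(tokens, lexicon)
        if abs(s) >= min_abs_score:
            labelled.append((tokens, "positive" if s > 0 else "negative"))
    return labelled


lexicon = induce_lexicon(EMBEDDINGS, ["bagus", "good"], ["teruk", "bad"])
docs = [["makan", "best"], ["servis", "lambat"], ["makan", "nasi"]]
training_set = pseudo_label(docs, lexicon)
```

In this sketch, the third document contains no lexicon words and is filtered out, which mirrors the word-match issue the abstract mentions. The remaining pseudo-labelled documents could then be used to train any supervised classifier, which may learn to handle sentiment words absent from the lexicon.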
Item Type: | Thesis (Masters) |
---|---|
Additional Information: | Call No.: QA76.9.S57 .H64 2022 |
Uncontrolled Keywords: | Sentiment analysis |
Subjects: | Q Science > QA Mathematics > QA71-90 Instruments and machines > QA75.5-76.95 Electronic computers. Computer science > QA76.75-76.765 Computer software |
Divisions: | Faculty of Engineering (FOE) |
Depositing User: | Ms Nurul Iqtiani Ahmad |
Date Deposited: | 11 Jan 2024 02:03 |
Last Modified: | 11 Jan 2024 02:03 |
URI: | http://shdl.mmu.edu.my/id/eprint/12042 |