Multilingual Question Answering for Malaysia History with Transformer-based Language Model

Citation

Lim, Qi Zhi and Lee, Chin Poo and Lim, Kian Ming and Ng, Jing Xiang and Ooi, Eric Khang Heng and Loh, Nicole Kai Ning (2024) Multilingual Question Answering for Malaysia History with Transformer-based Language Model. Emerging Science Journal, 8 (2). pp. 675-686. ISSN 2610-9182

2198-6961-1-PB.pdf - Published Version
Restricted to Repository staff only


Abstract

In natural language processing (NLP), a Question Answering System (QAS) is a system or model designed to understand and respond to user queries in natural language. Recent advances in QAS reflect a paradigm shift from traditional machine learning and deep learning approaches towards transformer-based language models. While significant progress has been made, the use of these models for historical QAS and the development of QAS for the Malay language remain largely unexplored. This research aims to bridge these gaps by developing a multilingual QAS for the history of Malaysia using a transformer-based language model. The development process encompasses data collection, knowledge representation, data loading and pre-processing, document indexing and storage, and the establishment of a querying pipeline with a retriever and a reader. A dataset of 100 articles, including web blogs related to the history of Malaysia, was constructed to serve as the knowledge base for the proposed QAS. A significant aspect of this research is the use of a dataset translated into English rather than the raw dataset in Malay; this decision was made to leverage well-established retriever and reader models trained on English data. In addition, an evaluation dataset of 100 question-answer pairs was created to assess model performance. A comparative analysis of six transformer-based language models, namely DeBERTaV3, BERT, ALBERT, ELECTRA, MiniLM, and RoBERTa, was conducted, examining their effectiveness through a series of experiments to determine the best reader model for the proposed QAS. The experimental results reveal that the proposed QAS achieved the best performance when employing RoBERTa as the reader model.
Finally, the proposed QAS was deployed on Discord and equipped with multilingual support through the incorporation of language detection and translation modules, enabling it to handle queries in both Malay and English.
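The retriever–reader querying pipeline described in the abstract can be sketched in minimal form as follows. This is an illustrative toy only, not the authors' implementation: the word-overlap `retrieve` and sentence-picking `read` functions below are hypothetical stand-ins for the dense retriever and the RoBERTa reader used in the actual system, and the sample documents are invented for the example.

```python
def retrieve(query, documents, top_k=2):
    """Toy retriever: rank documents by word overlap with the query."""
    q_words = set(query.lower().split())
    scored = sorted(
        documents,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def read(query, document):
    """Toy reader: return the sentence with the highest query overlap."""
    q_words = set(query.lower().split())
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    return max(sentences, key=lambda s: len(q_words & set(s.lower().split())))

# Invented sample knowledge base for illustration.
docs = [
    "Malacca was founded around 1400. It became a major trading port.",
    "Malaysia gained independence in 1957. Tunku Abdul Rahman led the nation.",
]

query = "When did Malaysia gain independence"
top = retrieve(query, docs, top_k=1)   # retriever narrows the candidates
print(read(query, top[0]))             # reader extracts the answer span
```

In the deployed system the same two-stage shape applies, with language detection and translation modules wrapped around it so that Malay queries are translated to English before retrieval and reading.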

Item Type: Article
Uncontrolled Keywords: Question Answering; Historical Knowledge; Natural Language Processing; DeBERTaV3; BERT; ALBERT; ELECTRA; MiniLM; RoBERTa.
Subjects: Q Science > Q Science (General)
Divisions: Faculty of Information Science and Technology (FIST)
Depositing User: Ms Nurul Iqtiani Ahmad
Date Deposited: 30 May 2024 02:42
Last Modified: 30 May 2024 02:42
URI: http://shdl.mmu.edu.my/id/eprint/12483
