Citation
Khan, Asif and Khan, Khairullah and Khan, Wahab and Khan, Sadiq Nawaz and Haq, Rafiul (2024) Knowledge-based Word Tokenization System for Urdu. Journal of Informatics and Web Engineering, 3 (2). pp. 86-97. ISSN 2821-370X
Text
View of Knowledge-based Word Tokenization System for Urdu.pdf - Published Version. Restricted to Repository staff only. Download (2MB)
Abstract
Word tokenization, a foundational step in natural language processing (NLP), is critical for tasks like part-of-speech tagging, named entity recognition, and parsing, as well as various independent NLP applications. In our tech-driven era, the exponential growth of textual data on the World Wide Web demands sophisticated tools for effective processing. Urdu, spoken widely across the globe, is experiencing a surge in speakers, emphasizing the urgent need for advanced Urdu Language Processing (ULP) systems. However, Urdu, labeled as a low-resource language, presents unique challenges due to its distinct writing style, the absence of capitalization features, and the prevalence of compound words. This study introduces a novel knowledge-based word tokenization system tailored for Urdu. Central to this system is a maximum matching model with forward and reverse variants, setting it apart from conventional approaches. The novelty of our system lies in its holistic approach, integrating knowledge-based techniques, dual-variant maximum matching, and heightened adaptability to low-resource language challenges compared to traditional machine learning (ML) approaches. Significantly, our system eliminates the need for a features file and pre-labelled datasets, streamlining the tokenization process. To evaluate the proposed model's efficacy, a comprehensive analysis was conducted on a dataset comprising 100 sentences with 5,000 Urdu words, yielding an impressive accuracy of 97%. This research makes a substantial contribution to Urdu language processing, providing an innovative solution to the complexities posed by the unique linguistic attributes of Urdu tokenization.
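The abstract names a maximum matching model with forward and reverse variants but does not give its details. The following is a minimal sketch of the textbook greedy algorithm only, not the authors' implementation. The function names, the `max_len` parameter, the single-character fallback for unknown text, and the toy lexicon are all assumptions made for illustration. Placeholder Latin strings stand in for Urdu words; the code operates on any Python `str`, including Urdu script.

```python
def forward_max_match(text, lexicon, max_len=None):
    """Greedy left-to-right segmentation: at each position take the
    longest lexicon entry that starts there (hypothetical helper)."""
    if max_len is None:
        max_len = max((len(w) for w in lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a lexicon hit.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                break
        else:
            # Assumed fallback: no entry matches, emit one character.
            j = i + 1
        tokens.append(text[i:j])
        i = j
    return tokens


def reverse_max_match(text, lexicon, max_len=None):
    """Greedy right-to-left segmentation: at each position take the
    longest lexicon entry that ends there (hypothetical helper)."""
    if max_len is None:
        max_len = max((len(w) for w in lexicon), default=1)
    tokens = []
    j = len(text)
    while j > 0:
        # Smallest start index first, i.e. the longest candidate first.
        for i in range(max(0, j - max_len), j):
            if text[i:j] in lexicon:
                break
        else:
            # Assumed fallback: no entry matches, emit one character.
            i = j - 1
        tokens.append(text[i:j])
        j = i
    tokens.reverse()  # tokens were collected from the end of the text
    return tokens


# Toy lexicon of placeholder strings, not real Urdu vocabulary.
lexicon = {"a", "ab", "abc", "bcd", "cd", "d"}
print(forward_max_match("abcd", lexicon))  # ['abc', 'd']
print(reverse_max_match("abcd", lexicon))  # ['a', 'bcd']
```

The two directions can disagree on the same input, as the toy example shows, which is one reason a system might run both variants and compare their outputs. How the published system combines or selects between them is not stated in the abstract.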
Item Type: | Article |
---|---|
Uncontrolled Keywords: | Natural Language Processing |
Subjects: | L Education > LC Special aspects of education |
Depositing User: | Ms Rosnani Abd Wahab |
Date Deposited: | 10 Jul 2025 09:27 |
Last Modified: | 10 Jul 2025 09:27 |
URI: | http://shdl.mmu.edu.my/id/eprint/14238 |