Knowledge-based Word Tokenization System for Urdu

Citation

Khan, Asif and Khan, Khairullah and Khan, Wahab and Khan, Sadiq Nawaz and Haq, Rafiul (2024) Knowledge-based Word Tokenization System for Urdu. Journal of Informatics and Web Engineering, 3 (2). pp. 86-97. ISSN 2821-370X

Abstract

Word tokenization, a foundational step in natural language processing (NLP), is critical for tasks like part-of-speech tagging, named entity recognition, and parsing, as well as various independent NLP applications. In our tech-driven era, the exponential growth of textual data on the World Wide Web demands sophisticated tools for effective processing. Urdu, spoken widely across the globe, is experiencing a surge in speakers, emphasizing the urgent need for advanced Urdu Language Processing (ULP) systems. However, Urdu, labeled as a low-resource language, presents unique challenges due to its distinct writing style, the absence of capitalization features, and the prevalence of compound words. This study introduces a novel knowledge-based word tokenization system tailored for Urdu. Central to this system is a maximum matching model with forward and reverse variants, setting it apart from conventional approaches. The novelty of our system lies in its holistic approach, integrating knowledge-based techniques, dual-variant maximum matching, and heightened adaptability to low-resource language challenges compared to traditional machine learning (ML) approaches. Significantly, our system eliminates the need for a features file and pre-labelled datasets, streamlining the tokenization process. To evaluate the proposed model's efficacy, a comprehensive analysis was conducted on a dataset comprising 100 sentences with 5,000 Urdu words, yielding an accuracy of 97%. This research makes a substantial contribution to Urdu language processing, providing an innovative solution to the complexities posed by the unique linguistic attributes of Urdu tokenization.
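
The abstract names forward and reverse maximum matching but does not give the procedure. The sketch below illustrates the general dictionary-driven technique only; it is not the authors' implementation. The function names, the `max_len` parameter, the single-character fallback for unmatched text, and the toy Latin-letter lexicon are all assumptions made for illustration. A real system would use an Urdu lexicon and Urdu input.

```python
def forward_max_match(text, lexicon, max_len):
    """Greedy left-to-right segmentation: at each position take the
    longest lexicon entry, falling back to one character (assumption)."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


def reverse_max_match(text, lexicon, max_len):
    """Greedy right-to-left segmentation: at each position take the
    longest lexicon entry that ends there, then restore reading order."""
    tokens = []
    j = len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                j -= size
                break
    tokens.reverse()
    return tokens


# Hypothetical toy lexicon; placeholder strings, not Urdu data.
LEXICON = {"ab", "abc", "cd"}

forward_tokens = forward_max_match("abcd", LEXICON, max_len=3)
reverse_tokens = reverse_max_match("abcd", LEXICON, max_len=3)
print(forward_tokens)
print(reverse_tokens)
```

On this toy input the two directions disagree: the forward pass commits to the longest prefix and strands a single character, while the reverse pass finds two lexicon entries. Such disagreements are one reason a system might run both variants and compare their segmentations.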

Item Type: Article
Uncontrolled Keywords: Natural Language Processing
Subjects: L Education > LC Special aspects of education
Depositing User: Ms Rosnani Abd Wahab
Date Deposited: 10 Jul 2025 09:27
Last Modified: 10 Jul 2025 09:27
URI: http://shdl.mmu.edu.my/id/eprint/14238
