Citation
Wibowo, Fahrudin Mukti and Nafan, Muhammad Zidny and Gustalika, Muhamad Azrino and Fernando, Harinda and Hussain, Muhammad and Sahadun, Nur Afiqah (2025) Lightweight String Similarity Approaches for Duplicate Detection in Academic Titles. Journal of Informatics and Web Engineering, 4 (3). pp. 416-426. ISSN 2821-370X
Text
2107-Article Text-21616-3-10-20250928.pdf - Published Version (702kB). Restricted to repository staff only.
Abstract
This study addresses the critical challenge of detecting duplicate final year project (FYP) titles in academic institutions, where minor variations such as reordering, synonyms, and paraphrasing often obscure plagiarism. We systematically evaluate four string similarity algorithms (Jaro-Winkler, Levenshtein Edit Distance, TF-IDF with Cosine Similarity, and Jaccard Similarity) using a synthetic dataset of 250 title pairs representing common duplication patterns. Our experiments reveal that character-based methods (Jaro-Winkler and Edit Distance) achieve perfect detection (F1-score = 1.0) for literal matches, including typographical variations and phrase reordering, while TF-IDF demonstrates strong semantic capability (F1-score = 0.95), albeit with some false positives. Jaccard Similarity performs poorly (Recall = 0.40) because it cannot handle paraphrased content. The analysis of score distributions shows a clear separation between duplicates and non-duplicates for character-based approaches, compared with significant overlap for set-based methods. Based on these findings, we propose a practical two-stage screening framework: initial high-confidence filtering using Jaro-Winkler (threshold > 0.9), followed by semantic validation with TF-IDF (threshold > 0.8). This hybrid approach offers institutions an effective balance between accuracy and computational efficiency for title screening. The study's contribution is to demonstrate how existing string similarity techniques can be orchestrated into a lightweight, two-stage screening framework tailored to academic title duplication, balancing accuracy with deployment feasibility in institutional settings. Future work should explore multilingual extensions and validation with real-world title datasets to further enhance the practical applicability of these findings.
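The two-stage screening idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names (`jaro_winkler`, `tfidf_vectors`, `cosine`, `screen_pair`) are hypothetical, the Winkler prefix scale of 0.1 with a 4-character prefix cap is the conventional default rather than a value given in the record, and whitespace tokenization with a smoothed IDF fitted on the two titles alone is a simplifying assumption (a deployment would presumably fit IDF on the full title corpus). The decision rule is one possible reading of the framework: a pair that passes the Jaro-Winkler threshold is flagged immediately, and any other pair is checked with TF-IDF cosine similarity.

```python
import math
from collections import Counter


def jaro(s1: str, s2: str) -> float:
    """Standard Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted by a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


def tfidf_vectors(docs):
    """Sparse TF-IDF vectors (dicts) with smoothed IDF; assumed weighting."""
    tokenized = [d.split() for d in docs]
    n = len(tokenized)
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}
    return [
        {tok: tf * idf[tok] for tok, tf in Counter(toks).items()}
        for toks in tokenized
    ]


def cosine(u: dict, v: dict) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def screen_pair(a: str, b: str, jw_threshold: float = 0.9,
                tfidf_threshold: float = 0.8) -> str:
    """Assumed rule: stage 1 flags on Jaro-Winkler, otherwise stage 2 on TF-IDF."""
    a, b = a.lower().strip(), b.lower().strip()
    if jaro_winkler(a, b) > jw_threshold:
        return "duplicate (stage 1: Jaro-Winkler)"
    u, v = tfidf_vectors([a, b])
    if cosine(u, v) > tfidf_threshold:
        return "duplicate (stage 2: TF-IDF)"
    return "not flagged"


if __name__ == "__main__":
    print(screen_pair("Student Attendance System Using Face Recognition",
                      "Face Recognition Student Attendance System Using"))
    print(screen_pair("Machine Learning for Crop Yield Prediction",
                      "Blockchain Voting"))
```

Running the cheaper character-based check first means the TF-IDF step is only computed for pairs that are not already flagged, which is in keeping with the lightweight deployment the abstract emphasizes. The thresholds are exposed as parameters so they can be tuned for a given institution's titles.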
| Item Type: | Article |
|---|---|
| Uncontrolled Keywords: | Duplicate detection, string similarity, TF-IDF, lightweight NLP, hybrid models |
| Subjects: | Q Science > QA Mathematics > QA71-90 Instruments and machines > QA75.5-76.95 Electronic computers. Computer science |
| Depositing User: | Nor Afiqah Mohd Adnan |
| Date Deposited: | 11 Nov 2025 02:31 |
| Last Modified: | 11 Nov 2025 02:31 |
| URI: | http://shdl.mmu.edu.my/id/eprint/14878 |
