Lightweight String Similarity Approaches for Duplicate Detection in Academic Titles

Citation

Wibowo, Fahrudin Mukti and Nafan, Muhammad Zidny and Gustalika, Muhamad Azrino and Fernando, Harinda and Hussain, Muhammad and Sahadun, Nur Afiqah (2025) Lightweight String Similarity Approaches for Duplicate Detection in Academic Titles. Journal of Informatics and Web Engineering, 4 (3). pp. 416-426. ISSN 2821-370X

2107-Article Text-21616-3-10-20250928.pdf - Published Version
Restricted to Repository staff only


Abstract

This study addresses the critical challenge of detecting duplicate final year project (FYP) titles in academic institutions, where minor variations like reordering, synonyms, and paraphrasing often obscure plagiarism. We systematically evaluate four string similarity algorithms - Jaro-Winkler, Levenshtein Edit Distance, TF-IDF with Cosine Similarity, and Jaccard Similarity - using a synthetic dataset of 250 title pairs representing common duplication patterns. Our experiments reveal that character-based methods (Jaro-Winkler and Edit Distance) achieve perfect detection (F1-score=1.0) for literal matches, including typographical variations and phrase reordering. TF-IDF demonstrates strong semantic capability (F1-score=0.95), albeit with some false positives. Jaccard Similarity performs poorly (Recall=0.40) due to its inability to handle paraphrased content. The analysis of score distributions shows a clear separation between duplicates and non-duplicates for character-based approaches, compared to significant overlap for set-based methods. Based on these findings, we propose a practical two-stage screening framework: initial high-confidence filtering using Jaro-Winkler (threshold>0.9) followed by semantic validation with TF-IDF (threshold>0.8). This hybrid approach offers institutions an effective balance between accuracy and computational efficiency for title screening. The study's contribution is to demonstrate how existing string similarity techniques can be orchestrated into a lightweight, two-stage screening framework tailored for academic title duplication, balancing accuracy with deployment feasibility in institutional settings. Future work should explore multilingual extensions and validation with real-world title datasets to further enhance the practical applicability of these findings.
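The two-stage screening idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`jaro_winkler`, `tfidf_vectors`, `screen`), the whitespace tokenizer, the smoothed IDF formula, and the rule that a pair failing the Jaro-Winkler check falls through to the TF-IDF check are all assumptions made for illustration. Only the two thresholds (0.9 and 0.8) come from the abstract.

```python
import math
from collections import Counter


def jaro(s1, s2):
    """Jaro similarity between two strings, in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        # Look for an unused equal character within the match window.
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, p=0.1):
    """Jaro similarity boosted by a shared prefix of up to 4 characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


def tfidf_vectors(titles):
    """Sparse TF-IDF vectors (dicts) using a smoothed IDF (an assumption)."""
    docs = [t.lower().split() for t in titles]
    n = len(docs)
    df = Counter(tok for d in docs for tok in set(d))
    idf = {tok: math.log((1 + n) / (1 + c)) + 1 for tok, c in df.items()}
    return [{tok: cnt * idf[tok] for tok, cnt in Counter(d).items()} for d in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def screen(new_title, existing, jw_threshold=0.9, tfidf_threshold=0.8):
    """Return (title, stage, score) for each existing title flagged as a likely duplicate."""
    vectors = tfidf_vectors(existing + [new_title])
    new_vec = vectors[-1]
    flagged = []
    for title, vec in zip(existing, vectors):
        jw = jaro_winkler(new_title.lower(), title.lower())
        if jw > jw_threshold:  # stage 1: high-confidence character match
            flagged.append((title, "stage1", jw))
            continue
        cos = cosine(new_vec, vec)
        if cos > tfidf_threshold:  # stage 2: token-level TF-IDF check
            flagged.append((title, "stage2", cos))
    return flagged
```

In this sketch a submitted title would be checked with `screen(new_title, existing_titles)`, and any returned entry would be passed to a human reviewer. The hypothetical fall-through design keeps the cheaper character-level check first, so the TF-IDF comparison only decides pairs that stage 1 leaves open.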

Item Type: Article
Uncontrolled Keywords: Duplicate detection, string similarity, TF-IDF, lightweight NLP, hybrid models
Subjects: Q Science > QA Mathematics > QA71-90 Instruments and machines > QA75.5-76.95 Electronic computers. Computer science
Depositing User: Nor Afiqah Mohd Adnan
Date Deposited: 11 Nov 2025 02:31
Last Modified: 11 Nov 2025 02:31
URI: http://shdl.mmu.edu.my/id/eprint/14878
