Indonesian-English Textual Similarity Detection Using Universal Sentence Encoder (USE) and Facebook AI Similarity Search (FAISS)

Citation

Krisnawati, Lucia D. and Mahastama, Aditya W. and Haw, Su Cheng and Ng, Kok Why and Naveen, Palanichamy (2024) Indonesian-English Textual Similarity Detection Using Universal Sentence Encoder (USE) and Facebook AI Similarity Search (FAISS). CommIT (Communication and Information Technology) Journal, 18 (2). pp. 183-195. ISSN 1979-2484

Abstract

The tremendous development in Natural Language Processing (NLP) has enabled the detection of bilingual and multilingual textual similarity. One of the main challenges of the Textual Similarity Detection (TSD) system lies in learning effective text representation. The research focuses on identifying similar texts between Indonesian and English across a broad range of semantic similarity spectrums. The primary challenge is generating English and Indonesian dense vector representations, a.k.a. embeddings, that share a single vector space. Through trial and error, the research proposes using the Universal Sentence Encoder (USE) model to construct bilingual embeddings and FAISS to index the bilingual dataset. The comparison between query vectors and index vectors is done using two approaches: the heuristic comparison with Euclidean distance and a clustering algorithm, Approximate Nearest Neighbors (ANN). The system is tested with four different semantic granularities, two text granularities, and evaluation metrics with a cutoff value of k = {2, 10}. The four semantic granularities used are highly similar or near duplicate, Semantic Entailment (SE), Topically Related (TR), and Out of Topic (OOT), while the text granularities take on the sentence and paragraph levels. The experimental results demonstrate that the proposed system successfully ranks similar texts in different languages within the top ten. It has been proven by the highest F1@2 score of 0.96 for the near duplicate category on the sentence level. Unlike the near-duplicate category, the highest F1 scores of 0.77 and 0.89 are shown by the SE and TR categories, respectively. The experiment results also show a high correlation between text and semantic granularity.
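
The retrieval and scoring steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the three-dimensional vectors are made-up stand-ins for USE embeddings, the brute-force `search` function stands in for an exact Euclidean-distance FAISS index, and `f1_at_k` follows a common definition of F1@k (precision = hits / k, recall = hits / number of relevant items) that the abstract does not spell out. All names and identifiers here are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def search(index_vectors, query, k):
    """Return the ids of the k indexed vectors closest to the query.

    Exhaustive comparison; a stand-in for an exact L2 index, not for ANN.
    """
    ranked = sorted(index_vectors,
                    key=lambda doc_id: euclidean(index_vectors[doc_id], query))
    return ranked[:k]


def f1_at_k(retrieved, relevant, k):
    """F1 over the top-k retrieved ids (assumed definition, see lead-in)."""
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    if hits == 0:
        return 0.0
    precision = hits / k
    recall = hits / len(relevant)
    return 2 * precision * recall / (precision + recall)


# Toy "English" index vectors (hypothetical ids and values).
index_vectors = {
    "en-1": [0.9, 0.1, 0.0],
    "en-2": [0.8, 0.2, 0.1],
    "en-3": [0.0, 0.9, 0.4],
    "en-4": [0.1, 0.0, 0.9],
}

# Toy "Indonesian" query vector, assumed to live in the same vector space.
query = [0.88, 0.12, 0.02]
relevant = {"en-1", "en-2"}  # assumed ground-truth matches for the query

top2 = search(index_vectors, query, k=2)
score = f1_at_k(top2, relevant, k=2)
print(top2, score)
```

In a setup closer to the one the abstract describes, the toy vectors would be replaced by sentence or paragraph embeddings from a bilingual encoder and the exhaustive loop by a FAISS index; the F1@k computation would stay the same.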

Item Type: Article
Uncontrolled Keywords: Textual Similarity Detection, Universal Sentence Encoder (USE), Facebook AI Similarity Search (FAISS)
Subjects: Q Science > QA Mathematics > QA71-90 Instruments and machines
Divisions: Faculty of Computing and Informatics (FCI)
Depositing User: Ms Nurul Iqtiani Ahmad
Date Deposited: 04 Dec 2024 02:20
Last Modified: 04 Dec 2024 02:20
URI: http://shdl.mmu.edu.my/id/eprint/13200
