Using the Reddit Corpus for Cyberbully Detection

Citation

Abdur Rakib, Tazeek and Soon, Lay Ki (2018) Using the Reddit Corpus for Cyberbully Detection. INTELLIGENT INFORMATION AND DATABASE SYSTEMS, ACIIDS 2018, PT I, 10751. pp. 180-189. ISSN 0302-9743

Abstract

With the creation of word embeddings, research areas around natural language processing, such as sentiment analysis and machine translation, have improved. This has been made possible by the vast amount of text data available on the internet and the use of a simple, two-layer neural network. However, it remains to be seen whether the domain knowledge used to train word embeddings has an impact on the task the embeddings are used for, given the domain knowledge of the task itself. In this paper, we extracted and cleaned text data from the Reddit database, then trained a word embedding model based on the word2vec skip-gram model. The features of this model were then used to train a random forest classifier for classifying cyberbully comments. Our model was benchmarked against four pre-trained word embeddings, as well as hand-crafted feature extraction methods. The results show that the domain knowledge of word embeddings does play a part in the task they are used for, as our model achieves a 2% improvement in precision over the next best score.
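The abstract does not say how comments were cleaned or how word vectors were turned into classifier features, so the following is only an illustrative sketch under stated assumptions. It shows how skip-gram training pairs can be generated from a cleaned comment, and one common way (averaging word vectors, assumed here, not confirmed by the source) to obtain a fixed-length feature vector for a classifier such as a random forest. The function names `tokenize`, `skipgram_pairs`, and `comment_vector`, the cleaning rule, and the window size are all hypothetical.

```python
import re
from typing import Dict, List, Tuple


def tokenize(comment: str) -> List[str]:
    """Lowercase a comment and keep runs of letters and apostrophes.

    This cleaning rule is an assumption for illustration only.
    """
    return re.findall(r"[a-z']+", comment.lower())


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Return (target, context) pairs for each token and its neighbours
    at most `window` positions away, as used to train a skip-gram model."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


def comment_vector(
    tokens: List[str], embeddings: Dict[str, List[float]], dim: int
) -> List[float]:
    """Average the vectors of tokens found in `embeddings`.

    Returns a zero vector when no token is known. Averaging is an
    assumed pooling strategy, not necessarily the one used in the paper.
    """
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(v[k] for v in known) / len(known) for k in range(dim)]
```

In a full pipeline, the pairs would feed a skip-gram trainer, and the averaged vectors, together with bully or non-bully labels, would form the training matrix for the classifier.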

Item Type: Article
Uncontrolled Keywords: Cyberbully detection · Data preparation · Textual features · Word embedding
Subjects: Q Science > QA Mathematics > QA71-90 Instruments and machines > QA75.5-76.95 Electronic computers. Computer science
Divisions: Faculty of Computing and Informatics (FCI)
Depositing User: Ms Rosnani Abd Wahab
Date Deposited: 31 Mar 2021 22:28
Last Modified: 31 Mar 2021 22:28
URI: http://shdl.mmu.edu.my/id/eprint/7584
