Domain data-driven veracity engine

Citation

Tan, De Zhern (2020) Domain data-driven veracity engine. Masters thesis, Multimedia University.

Full text not available from this repository.
Official URL: http://erep.mmu.edu.my/

Abstract

The exponentially increasing volume of data brings with it dirty data such as duplicates and noise. Dirty data arises from human error and from the format of data generated by machine logging. Its presence degrades the performance of machine learning models in subsequent stages of analysis and hinders the insights and value that can be derived from the data. Hence, methods to detect and remove dirty data are required to unlock those insights. In this research, a framework is proposed that combines several natural language processing methods to create a novel data veracity engine. The research makes two contributions. First, it enables the conversion of an unstructured dataset into a structured dataset in a semi-automated manner. Second, the system detects and removes duplicates, both exact and near duplicates, from the dataset. The natural language processing techniques used in the data veracity engine are a combination of Latent Semantic Analysis and Simhash. Latent Semantic Analysis is used to discover the relation between different rows of a dataset using cosine similarity and to group similar rows into clusters, so that different types of text processing can be applied on a per-cluster basis. Simhash is a locality-sensitive hashing algorithm that utilizes cosine distance to find near-duplicate rows of data so that they can be removed from the dataset with low overhead. For the first part of the research, a semi-structured dataset consisting of actual telco equipment log data was used. For the second part, datasets of varying sizes consisting of telco equipment data with artificially generated noise added were used. The results showed that both types of dirty data, semi-structured data and near duplicates, could be removed effectively from the real-world dataset using the proposed data veracity engine.
It is hoped that this research can contribute to the development of more robust cleaning algorithms.
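
The near-duplicate step described in the abstract can be illustrated with a minimal Simhash sketch. This is not the thesis implementation, whose full text is not available here. It assumes whitespace-and-punctuation tokenization, MD5-derived 64-bit token hashes, equal token weights, and a Hamming-distance threshold of 3. The function names (`tokens`, `simhash`, `hamming`, `drop_near_duplicates`) and the sample log rows are hypothetical.

```python
import hashlib
import re


def tokens(text):
    """Lowercase the row and split it into alphanumeric tokens (an assumed scheme)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def simhash(text, bits=64):
    """Return a `bits`-wide Simhash fingerprint of `text` (bits <= 64 here)."""
    weights = [0] * bits
    for tok in tokens(text):
        # Take the first 8 bytes of the MD5 digest as a 64-bit token hash.
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def drop_near_duplicates(rows, max_distance=3):
    """Keep the first row of each group whose fingerprints lie within max_distance."""
    kept, fingerprints = [], []
    for row in rows:
        fp = simhash(row)
        # Linear scan over kept fingerprints: simple, but quadratic overall.
        if all(hamming(fp, seen) > max_distance for seen in fingerprints):
            kept.append(row)
            fingerprints.append(fp)
    return kept


if __name__ == "__main__":
    logs = [
        "Link down on port 7",
        "link down on port 7!",  # same tokens after normalization
        "Fan speed warning on chassis 2",
    ]
    for row in drop_near_duplicates(logs):
        print(row)
```

Rows with similar token sets tend to produce fingerprints that differ in only a few bits, so a small Hamming distance serves as a cheap proxy for high cosine similarity. The linear scan over kept fingerprints is chosen for brevity. A low-overhead deployment would need an index over fingerprints instead, and the threshold would have to be tuned on the actual data.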

Item Type: Thesis (Masters)
Additional Information: Call No.: QA76.9.B45 T36 2020
Uncontrolled Keywords: Big data
Subjects: Q Science > QA Mathematics > QA71-90 Instruments and machines > QA75-76.95 Calculating machines
Divisions: Faculty of Computing and Informatics (FCI)
Depositing User: Ms Nurul Iqtiani Ahmad
Date Deposited: 26 Sep 2024 03:32
Last Modified: 26 Sep 2024 03:32
URI: http://shdl.mmu.edu.my/id/eprint/12984
