Automatic Dirty Data Cleaning Approach For Data Analytics Using Machine Learning Techniques

Citation

Mohd Zebaral Hoque, Jesmeen (2019) Automatic Dirty Data Cleaning Approach For Data Analytics Using Machine Learning Techniques. Masters thesis, Multimedia University.

Full text not available from this repository.
Official URL: http://erep.mmu.edu.my/

Abstract

Data Analytics (DA) is a technology used to make correct decisions through proper analysis and prediction. Data cleaning is the most important and essential process in DA. Inappropriate data may lead to poor analysis and thus yield unacceptable conclusions. The literature review shows that incompleteness (missing values), duplication, inconsistency and inaccuracy are the four common dimensions of dirty data. Of these four, the first three are considered in this research. Hence, the objective was to design and develop an approach for the cleaning phase in DA that addresses these three dimensions. Python's 'Pandas' and 'NumPy' libraries are used to handle duplicate and inconsistent data. A new architecture to predict missing data in a dataset was developed, which includes data sampling and feature selection. Because the volume of data may be very large, data sampling uses a divide-and-conquer strategy to find the minimum consistent subset. This sampled data with the selected features is then used to train four prediction models until their classification accuracy becomes stable. These four prediction models are selected from eight well-known classification algorithms used by other researchers: Logistic Regression, Linear Regression (LR), Linear SVM, AdaBoost Classifier, K-Nearest Neighbour, SGDClassifier, Gradient Boosting and Random Forest (RF). Their respective performances were compared on accuracy, ROC percentage, and the time taken for training and testing.
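The three dirty-data dimensions named in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the toy records, the column names and the helper functions `clean` and `impute_label` are hypothetical, and the sampling and feature-selection stages are omitted. K-Nearest Neighbour is used here only because it is one of the eight algorithms listed; the abstract does not say which four were finally selected.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy records; column names and values are illustrative only.
raw = pd.DataFrame({
    "city": [" Kuala Lumpur", "kuala lumpur", "Cyberjaya",
             "cyberjaya ", "Cyberjaya", "Kuala Lumpur"],
    "temp_c": [31.0, 31.0, 24.0, 25.0, 24.2, 30.5],
    "humidity": [80, 80, 60, 62, 60, 79],
    "label": ["hot", "hot", "mild", "mild", None, None],
})


def clean(df):
    """Normalise inconsistent text, then drop exact duplicate rows."""
    out = df.copy()
    # Inconsistency: unify stray whitespace and letter case.
    out["city"] = out["city"].str.strip().str.title()
    # Duplication: rows that are identical after normalisation are dropped.
    return out.drop_duplicates().reset_index(drop=True)


def impute_label(df, features, target):
    """Predict missing values of `target` with a classifier trained on
    the rows where `target` is known (assumed approach, 1-nearest neighbour)."""
    out = df.copy()
    known = out[target].notna()
    if known.all():
        return out
    model = KNeighborsClassifier(n_neighbors=1)
    model.fit(out.loc[known, features], out.loc[known, target])
    out.loc[~known, target] = model.predict(out.loc[~known, features])
    return out


cleaned = clean(raw)
filled = impute_label(cleaned, ["temp_c", "humidity"], "label")
print(filled)
```

In this sketch the normalisation step runs before duplicate removal, so that records differing only in spacing or letter case are recognised as duplicates. Any of the other listed classifiers could be substituted for the nearest-neighbour model in `impute_label`.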

Item Type: Thesis (Masters)
Additional Information: Call No.: Q325.5 .J47 2019
Uncontrolled Keywords: Machine learning
Subjects: Q Science > Q Science (General) > Q300-390 Cybernetics
Divisions: Faculty of Engineering and Technology (FET)
Depositing User: Ms Nurul Iqtiani Ahmad
Date Deposited: 21 Sep 2020 05:41
Last Modified: 06 Mar 2023 06:55
URI: http://shdl.mmu.edu.my/id/eprint/7737
