Citation
Lokman, Amar and Wan Ismail, Wan Zakiah and Ab Aziz, Nor Azlina (2025) Water Quality Evaluation and Analysis by Integrating Statistical and Machine Learning Approaches. Algorithms, 18 (8). p. 494. ISSN 1999-4893
Text
Water Quality Evaluation and Analysis by Integrating Statistical and Machine Learning Approaches.pdf - Published Version (3MB). Restricted to Repository staff only.
Abstract
Water quality assessment plays a vital role in environmental monitoring and resource management. This study aims to enhance the predictive modeling of the Water Quality Index (WQI) using a combination of statistical diagnostics and machine learning techniques. The methodology involves collecting water quality data from six river locations in Malaysia, followed by a series of statistical analyses including assumption testing (Shapiro–Wilk and Breusch–Pagan tests), diagnostic evaluations, feature importance analysis, and principal component analysis (PCA). Decision tree regression (DTR) and autoregressive integrated moving average (ARIMA) are employed for regression, while random forest is used for classification. Learning curve analysis is conducted to evaluate model performance and generalization. The results indicate that dissolved oxygen (DO) and ammoniacal nitrogen (AN) are the most influential parameters, with normalized importance scores of 1.000 and 0.565, respectively. The Breusch–Pagan test identifies significant heteroscedasticity (p-value = (Formula presented.)), while the Shapiro–Wilk test confirms non-normality (p-value = 0.0). PCA reduces dimensionality while preserving 95% of dataset variance, which lowers the computational cost of modeling. Among the regression models, ARIMA demonstrates better predictive accuracy than DTR. Meanwhile, random forest achieves high classification performance and shows strong generalization capability with increasing training data. Learning curve analysis reveals overfitting in the regression model, suggesting the need for hyperparameter tuning, while the classification model demonstrates improved generalization with additional training data. Strong correlations among key parameters indicate potential multicollinearity, emphasizing the need for careful feature selection.
These findings highlight the synergy between statistical pre-processing and machine learning, offering a more accurate and efficient approach to water quality prediction for informed environmental policy and real-time monitoring systems.
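As a rough illustration of the PCA step mentioned in the abstract, the sketch below keeps the smallest number of principal components whose cumulative explained variance reaches 95%. It is not the authors' implementation: the data are synthetic, and the function name `pca_retain_variance`, the standardization step, and the six-column layout are assumptions made for the example.

```python
import numpy as np


def pca_retain_variance(X, threshold=0.95):
    """Project X onto the fewest principal components whose cumulative
    explained-variance ratio reaches `threshold` (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    # Standardize each column so that no parameter dominates through its units.
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # SVD of the standardized data; rows of Vt are the principal axes.
    _, s, Vt = np.linalg.svd(Xs, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    cumulative = np.cumsum(explained)
    # First position where the cumulative ratio reaches the threshold.
    k = int(np.searchsorted(cumulative, threshold)) + 1
    k = min(k, len(s))
    scores = Xs @ Vt[:k].T
    return scores, explained[:k]


# Synthetic stand-in for water quality measurements (assumption: 200 samples,
# six correlated parameters driven by two latent factors plus small noise).
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 6))

scores, explained = pca_retain_variance(X, threshold=0.95)
print("components kept:", scores.shape[1])
print("variance retained:", round(float(explained.sum()), 4))
```

Because the synthetic columns are strongly correlated, far fewer than six components are typically needed here, which mirrors the multicollinearity concern raised in the abstract.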
| Item Type: | Article |
|---|---|
| Uncontrolled Keywords: | Machine learning |
| Subjects: | Q Science > Q Science (General) > Q300-390 Cybernetics |
| Divisions: | Faculty of Engineering and Technology (FET) |
| Depositing User: | Ms Rosnani Abd Wahab |
| Date Deposited: | 30 Sep 2025 09:27 |
| Last Modified: | 06 Oct 2025 04:02 |
| URI: | http://shdl.mmu.edu.my/id/eprint/14629 |
