Enhancing phishing website Uniform Resource Locators detection through machine learning algorithm performance analysis with Synthetic Minority Oversampling Technique

Citation

Almas, Anum and Saeed, Ahmad and Ali, Farman and Alshamrani, Ali and Roslee, Mardeni and Junaid, Muhammad Umer and Afsar, Haleem (2026) Enhancing phishing website Uniform Resource Locators detection through machine learning algorithm performance analysis with Synthetic Minority Oversampling Technique. International Communications in Heat and Mass Transfer, 176. p. 111398. ISSN 07351933

Abstract

Phishing attacks that use fraudulent website Uniform Resource Locators (URLs) are recognized as a significant threat across various sectors, exploiting unsolicited emails to obtain personal information illegally. This paper addresses the challenge of identifying phishing URLs, a prevalent form of fraud that compromises user credentials in sectors such as banking, e-commerce, and digital marketing. Current models are found to lack the precision needed for accurate detection. In this study, phishing URL detection is examined under a leakage-aware evaluation framework in which the Synthetic Minority Oversampling Technique (SMOTE) is applied only to the training split, so that no synthetic samples enter the data used for evaluation. The methodology applies feature-selection techniques (FSTs), namely Information Gain (IG), Gain Ratio (GR), Relief-Feature (Relief-F), and correlation, to the phishing dataset. Model performance is then assessed with six representative machine-learning classifiers: Bernoulli Naive Bayes (BNB), Random Forest (RF), K-Nearest Neighbor (KNN), Decision Tree (DT), Support Vector Machine (SVM), and Extreme Gradient Boosting (XGBoost). Results indicate that applying SMOTE to the training dataset, followed by adjusting the RF and XGBoost models, significantly improves accuracy to over 97%. Additionally, hyper-parameter tuning is applied to further improve the models, with XGBoost slightly outperforming RF in accuracy.
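
The leakage-aware step described in the abstract, oversampling the minority class in the training split only, can be illustrated with a minimal sketch. This is not the authors' pipeline: it is a simplified, pure-Python version of SMOTE-style interpolation, and the two numeric features, the class sizes, the 20% test fraction, and all function names are assumptions made for the illustration.

```python
import math
import random


def train_test_split(rows, labels, test_fraction, rng):
    """Shuffle indices and split them into train and test parts."""
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_test = int(len(rows) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = [rows[i] for i in train_idx], [labels[i] for i in train_idx]
    test = [rows[i] for i in test_idx], [labels[i] for i in test_idx]
    return train, test


def smote(minority, n_new, k, rng):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority neighbours."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


def balance_training_split(x_train, y_train, minority_label, k, rng):
    """Oversample the minority class until both classes have equal size."""
    minority = [x for x, y in zip(x_train, y_train) if y == minority_label]
    n_majority = len(y_train) - len(minority)
    new_points = smote(minority, n_majority - len(minority), k, rng)
    return x_train + new_points, y_train + [minority_label] * len(new_points)


rng = random.Random(0)
# Toy data (hypothetical): two numeric URL features per row,
# 40 legitimate rows (label 0) and 12 phishing rows (label 1).
legit = [[rng.gauss(20, 5), rng.gauss(2, 1)] for _ in range(40)]
phish = [[rng.gauss(60, 5), rng.gauss(5, 1)] for _ in range(12)]
rows, labels = legit + phish, [0] * 40 + [1] * 12

# Split first, then oversample the training part only.
(x_train, y_train), (x_test, y_test) = train_test_split(rows, labels, 0.2, rng)
x_bal, y_bal = balance_training_split(x_train, y_train, 1, 3, rng)
print(y_bal.count(0), y_bal.count(1), len(x_test))
```

Splitting before oversampling is the design choice that matters here: the test rows are never used as interpolation endpoints and never receive synthetic neighbours, so the evaluation data stays as it was collected. In practice a maintained implementation such as the SMOTE class in the imbalanced-learn package would likely replace the hand-written `smote` function.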

Item Type: Article
Uncontrolled Keywords: Phishing, K-Nearest Neighbor, Random Forest, Extreme Gradient Boosting, Machine learning, Synthetic minority over-sampling
Subjects: Q Science > QA Mathematics > QA71-90 Instruments and machines
Divisions: Faculty of Artificial Intelligence & Engineering (FAIE)
Depositing User: Ms Suzilawati Abu Samah
Date Deposited: 05 Jun 2026 05:18
Last Modified: 05 Jun 2026 05:18
URI: http://shdl.mmu.edu.my/id/eprint/16025