Evaluating the Impact of Feature Engineering in Phishing URL Detection: A Comparative Study of URL, HTML, and Derived Features

Citation

Kustiawan, Yanche Ari and Ghauth, Khairil Imran (2025) Evaluating the Impact of Feature Engineering in Phishing URL Detection: A Comparative Study of URL, HTML, and Derived Features. IEEE Access. p. 1. ISSN 2169-3536

9.pdf - Published Version
Restricted to Repository staff only

Abstract

Phishing attacks have evolved into sophisticated threats, making effective cybersecurity detection strategies essential. While many studies focus on either URL or HTML features, limited work has explored the comparative impact of engineered feature sets across different machine learning models. This study aims to bridge that empirical gap by evaluating the effectiveness of URL-based, HTML-based, and derived features, individually and in combination, on phishing URL detection. The proposed approach utilizes the PhishOFE dataset of 101,063 phishing and legitimate URLs. Features are organized into four sets: (1) URL only, (2) HTML only, (3) URL + HTML, and (4) URL + HTML + derived features. Ten machine learning models are employed, including Random Forest, k-Nearest Neighbors, Logistic Regression, Support Vector Machine, Naive Bayes, and advanced ensemble methods such as LightGBM, XGBoost, and CatBoost. Performance is assessed using accuracy, precision, recall, and F1-score, while permutation importance is used to evaluate feature significance. Experimental results demonstrate that ensemble models outperform traditional classifiers, with CatBoost achieving the highest accuracy of 99.45% using the complete feature set. Moreover, URL features like URLLength and NoOfSubDomain consistently rank high in importance, while derived features such as SuspiciousCharRatio and URLComplexityScore notably enhance detection performance in specific models.
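
To make the feature names in the abstract concrete, the sketch below computes two URL features (URLLength, NoOfSubDomain) and one derived feature (SuspiciousCharRatio) from a raw URL string. The abstract does not define these features, so everything here is an assumption for illustration: the function name `extract_url_features`, the `SUSPICIOUS_CHARS` set, and the rule that the last two host labels form the registered domain are all hypothetical and may differ from the definitions used in the paper or in the PhishOFE dataset.

```python
from urllib.parse import urlsplit

# Assumed set of "suspicious" characters; the abstract does not say
# which characters the paper counts, so this choice is illustrative only.
SUSPICIOUS_CHARS = set("@%-_=?&~")


def extract_url_features(url: str) -> dict:
    """Return a small dict of illustrative URL and derived features."""
    host = urlsplit(url).hostname or ""
    labels = [part for part in host.split(".") if part]

    # Naive assumption: the last two labels are the registered domain
    # (e.g. "example.com"), so every label before them is a subdomain.
    # This miscounts hosts under multi-part suffixes such as "co.uk".
    n_subdomains = max(len(labels) - 2, 0)

    # Share of characters in the full URL that belong to the assumed set.
    suspicious = sum(ch in SUSPICIOUS_CHARS for ch in url)
    ratio = suspicious / len(url) if url else 0.0

    return {
        "URLLength": len(url),
        "NoOfSubDomain": n_subdomains,
        "SuspiciousCharRatio": ratio,
    }


if __name__ == "__main__":
    print(extract_url_features("http://login.secure.example.com/a?b=1"))
```

A dictionary per URL like this could be stacked into a table and split by column group to mirror the four feature sets described above, with the HTML features coming from a separate extraction step that is not sketched here.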

Item Type: Article
Uncontrolled Keywords: Phishing URL detection, machine learning, feature engineering, URL features, HTML features, derived features, ensemble learning.
Subjects: Q Science > QA Mathematics > QA71-90 Instruments and machines
Divisions: Faculty of Computing and Informatics (FCI)
Depositing User: Ms Suzilawati Abu Samah
Date Deposited: 30 Jun 2025 03:14
Last Modified: 30 Jun 2025 03:14
URI: http://shdl.mmu.edu.my/id/eprint/14151
