Multi-stage spatial temporal ensemble model with integrated learning methods for robust deepfake detection

Citation

Yassin, Warusia and Abdollah, Mohd Faizal and Ismail, Anuar and Kamis, Noor Hisham and Abdul Razak, Siti Fatimah and Joy, Helen K. (2026) Multi-stage spatial temporal ensemble model with integrated learning methods for robust deepfake detection. Discover Computing, 29 (1). ISSN 2948-2992

Text
s10791-026-10093-1.pdf - Published Version
Restricted to Repository staff only

Official URL: https://doi.org/10.1007/s10791-026-10093-1

Abstract

Deepfake detection remains a significant challenge as modern generative models increasingly minimize visible artefacts, and many existing approaches rely solely on either spatial or temporal cues, which limits their robustness and generalization. Many existing hybrid approaches integrate mature learning models in linear or stacked pipelines, which often suffer from error propagation, reduced interpretability, and suboptimal generalization. Unlike prior hybrid approaches that primarily stack spatial-temporal learners, the proposed multi-stage hybrid Integrated Learning Method (ILM) introduces a validation-aware dual-detection mechanism, an independent dual-path spatial-temporal learning design, and a decision-level nonlinear ensemble fusion strategy, explicitly mitigating face mislocalization, temporal dilution, and false-positive propagation observed in existing deepfake detection pipelines. The ILM framework structurally coordinates facial region localization and validation using YOLOv5 and Haar Cascade, deep spatial feature extraction using ResNet-50, frame-level spatial classification via LightGBM, and temporal sequence modeling using LSTM networks. The outputs from the spatial and temporal pathways are subsequently fused using a Random Forest classifier, enabling nonlinear aggregation of complementary evidence while preserving interpretability. Experimental results on the FaceForensics++ and Celeb-DF (v2) benchmark datasets show that ILM achieves 98.30% accuracy, 97.90% precision, and 98.70% recall, outperforming recent state-of-the-art CNN–LSTM, ViT-based, and CNN–Transformer models by 1–6%. Ablation studies confirm that each module contributes incrementally to performance stability and false-positive reduction, demonstrating the importance of ILM’s multi-stage architecture rather than the individual algorithms alone. Overall, ILM provides a modular, accurate, and computationally efficient solution suitable for deployment in digital forensics, media authentication, and AI governance. Future work will extend ILM with transformer-based global encoders and explainable AI techniques to further improve interpretability and robustness against emerging deepfake models.
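
Illustrative note: the abstract mentions a validation-aware dual-detection step in which face regions are localized and validated with two detectors (YOLOv5 and Haar Cascade), but it does not say how the two outputs are reconciled. The sketch below shows one plausible reading, assuming that a primary detection is accepted only when a secondary detector returns a box that overlaps it sufficiently (intersection over union). The function names, the box format, and the 0.5 threshold are hypothetical and are not taken from the article.

```python
from typing import Iterable, Tuple

# Hypothetical box format: (x1, y1, x2, y2) in pixel coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_validated(primary: Box, secondary: Iterable[Box],
                 min_iou: float = 0.5) -> bool:
    """Accept the primary detection only if some secondary box agrees.

    Assumption: agreement means IoU >= min_iou. The article may use a
    different validation rule.
    """
    return any(iou(primary, s) >= min_iou for s in secondary)


if __name__ == "__main__":
    # Toy boxes standing in for the outputs of the two detectors.
    primary_box = (0.0, 0.0, 10.0, 10.0)
    secondary_boxes = [(1.0, 1.0, 11.0, 11.0)]
    print(is_validated(primary_box, secondary_boxes))
```

Under this assumed rule, a frame whose primary detection has no agreeing secondary box would be dropped or flagged before feature extraction, which is one way a pipeline could limit the face mislocalization the abstract refers to.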

Item Type:	Article
Uncontrolled Keywords:	Deepfake detection, Integrated learning, YOLOv5
Subjects:	T Technology > TK Electrical engineering. Electronics. Nuclear engineering > TK5101-6720 Telecommunication. Including telegraphy, telephone, radio, radar, television
Divisions:	Faculty of Information Science and Technology (FIST); Faculty of Artificial Intelligence & Engineering (FAIE)
Depositing User:	Ms Rosnani Abd Wahab
Date Deposited:	05 Jun 2026 01:05
Last Modified:	05 Jun 2026 01:05
URI:	http://shdl.mmu.edu.my/id/eprint/15963
