ASTDT: an Interpretable Adaptive Spectro-Temporal Diffusion Transformer for audio deepfake detection

Citation

Wani, Taiba Majid and Qadri, Syed Asif Ahmad and Ashraf, Arselan and Amerini, Irene (2025) ASTDT: an Interpretable Adaptive Spectro-Temporal Diffusion Transformer for audio deepfake detection. EURASIP Journal on Information Security, 2025 (1). ISSN 2510-523X

s13635-025-00217-3.pdf - Published Version
Restricted to Repository staff only

Abstract

Advances in audio synthesis techniques have led to the creation of highly realistic audio deepfakes, posing growing threats to digital integrity and public trust. These synthetic manipulations mimic natural speech with high fidelity, making detection increasingly challenging and fueling the spread of misinformation, identity fraud, and voice-based attacks. To address these concerns, this study proposes the Adaptive Spectro-Temporal Diffusion Transformer (ASTDT), a novel detection framework that tackles key challenges in generalization, interpretability, and adaptability across diverse audio generation techniques. ASTDT integrates a score-based diffusion model to augment training spectrograms with realistic deepfake variations, improving generalization to unseen text-to-speech and voice conversion attacks. An adaptive spectro-temporal feature extraction mechanism partitions audio into interpretable frequency and temporal segments, while a dual-modal attention fusion module jointly processes magnitude and phase features. These fused features are processed by a transformer encoder with diffusion-aware attention, enabling effective modeling of long-range temporal dependencies. To enhance transparency, ASTDT includes an interpretability module that combines quantitative feature attributions and spatial heatmaps to explain model predictions. Experimental results across four benchmark datasets demonstrate the effectiveness of ASTDT, with the model achieving the lowest equal error rate of 1.20% on the ASVspoof 2019 dataset.
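The abstract reports performance as an equal error rate (EER), the operating point at which the rate of spoofed audio accepted as genuine equals the rate of genuine audio rejected. The following is a minimal sketch of how an EER can be computed from detector scores. It is not the authors' evaluation code: the function name `equal_error_rate`, the convention that higher scores mean "more likely bona fide", and the label encoding (1 = bona fide, 0 = spoof) are all assumptions made for illustration.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Return (eer, threshold) for a binary spoof detector.

    Assumed conventions (illustrative, not taken from the paper):
    higher score = more likely bona fide; labels: 1 = bona fide, 0 = spoof.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)

    bona = scores[labels == 1]
    spoof = scores[labels == 0]

    # Candidate thresholds: every observed score, plus one above the maximum
    # so that the "reject everything" operating point is also considered.
    thresholds = np.unique(scores)
    thresholds = np.concatenate([thresholds, [thresholds[-1] + 1.0]])

    # A trial is accepted as bona fide when score >= threshold.
    far = np.array([(spoof >= t).mean() for t in thresholds])  # spoof accepted
    frr = np.array([(bona < t).mean() for t in thresholds])    # bona fide rejected

    # The EER is taken where the two error rates are closest to each other.
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0, thresholds[idx]


# Toy example with two spoofed and two bona fide trials that overlap in score.
eer, thr = equal_error_rate([0.1, 0.6, 0.4, 0.9], [0, 0, 1, 1])
print(f"EER = {eer:.2%} at threshold {thr}")
```

Under these assumptions, an EER of 1.20% would mean that, at the balancing threshold, roughly 1.2% of spoofed utterances are accepted and roughly 1.2% of genuine utterances are rejected.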

Item Type: Article
Uncontrolled Keywords: Adaptive spectro-temporal diffusion transformer (ASTDT), audio deepfake detection
Subjects: Q Science > QA Mathematics > QA71-90 Instruments and machines > QA75.5-76.95 Electronic computers. Computer science
Divisions: Faculty of Computing and Informatics (FCI)
Depositing User: Nor Afiqah Mohd Adnan
Date Deposited: 09 Dec 2025 03:55
Last Modified: 13 Dec 2025 00:45
URI: http://shdl.mmu.edu.my/id/eprint/14981
