Mel-MViTv2: Enhanced Speech Emotion Recognition With Mel Spectrogram and Improved Multiscale Vision Transformers

Citation

Ong, Kah Liang and Lee, Chin Poo and Lim, Heng Siong and Lim, Kian Ming and Alqahtani, Ali (2023) Mel-MViTv2: Enhanced Speech Emotion Recognition With Mel Spectrogram and Improved Multiscale Vision Transformers. IEEE Access, 11. pp. 108571-108579. ISSN 2169-3536


Abstract

Speech emotion recognition aims to automatically identify and classify emotions from speech signals. It plays a crucial role in applications such as human-computer interaction, affective computing, and social robotics. Over the years, researchers have proposed a range of approaches to speech emotion recognition, leveraging various classifiers and features. Despite these advances, existing methods still have limitations: some rely on handcrafted features that may not capture the full complexity of the emotional information in speech signals, while others lack robustness and generalization when applied to different datasets. To address these challenges, this paper proposes a speech emotion recognition method that combines Mel spectrograms computed via the Short-Time Fourier Transform (Mel-STFT) with the Improved Multiscale Vision Transformer (MViTv2). The Mel-STFT spectrograms capture both the frequency and temporal information of speech signals, providing a more comprehensive representation of the emotional content. The MViTv2 classifier performs multi-scale visual modeling across stages with pooling attention mechanisms. It incorporates relative positional embeddings and a residual pooling connection to model the interactions between tokens in the space-time structure effectively, preserve essential information, and improve the efficiency of the model. Experimental results demonstrate that the proposed method generalizes well across datasets, achieving accuracies of 91.51% on the Emo-DB dataset, 81.75% on the RAVDESS dataset, and 64.03% on the IEMOCAP dataset.
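For illustration, the sketch below shows the two-stage pipeline the abstract describes: a log-scaled Mel spectrogram extracted via the STFT, then fed to an MViTv2 image classifier. The library choices (librosa, timm) and all hyperparameters (sample rate, FFT size, hop length, number of Mel bands, model variant, number of emotion classes) are illustrative assumptions, not the authors' published configuration.

import librosa
import numpy as np
import torch
import timm

def log_mel_spectrogram(path, sr=16000, n_fft=1024, hop_length=256, n_mels=128):
    """Load audio and compute a log-scaled Mel spectrogram (assumed settings)."""
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels
    )
    return librosa.power_to_db(mel, ref=np.max)  # shape: (n_mels, n_frames)

def spectrogram_to_tensor(spec, size=224):
    """Normalize to [0, 1], resize, and replicate to 3 channels for a ViT-style model."""
    spec = (spec - spec.min()) / (spec.max() - spec.min() + 1e-8)
    x = torch.from_numpy(spec).float()[None, None]  # (1, 1, n_mels, n_frames)
    x = torch.nn.functional.interpolate(
        x, size=(size, size), mode="bilinear", align_corners=False
    )
    return x.repeat(1, 3, 1, 1)  # (1, 3, 224, 224)

# Hypothetical usage: 7 emotion classes as in Emo-DB; 'mvitv2_small' is one of
# timm's registered MViTv2 variants, assumed here purely for illustration.
model = timm.create_model("mvitv2_small", pretrained=True, num_classes=7)
model.eval()
x = spectrogram_to_tensor(log_mel_spectrogram("speech.wav"))
with torch.no_grad():
    logits = model(x)
print(logits.argmax(dim=1))  # predicted emotion class index

Treating the spectrogram as a single-channel image replicated across three channels is a common convenience when reusing vision backbones pretrained on RGB data; the paper itself may handle channel adaptation differently.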

Item Type: Article
Uncontrolled Keywords: Speech recognition, Emotion recognition, Feature extraction, Spectrogram, Transformers, Mel frequency cepstral coefficient, Speech enhancement
Subjects: T Technology > TK Electrical engineering. Electronics. Nuclear engineering > TK7800-8360 Electronics > TK7871 Electronics--Materials
Divisions: Faculty of Information Science and Technology (FIST)
Depositing User: Ms Nurul Iqtiani Ahmad
Date Deposited: 07 Dec 2023 01:25
Last Modified: 07 Dec 2023 01:25
URI: http://shdl.mmu.edu.my/id/eprint/11919
