Speech Emotion Recognition with Vision Transformers

Citation

Ong, Kah Liang (2024) Speech Emotion Recognition with Vision Transformers. Masters thesis, Multimedia University.

Full text not available from this repository.
Official URL: http://erep.mmu.edu.my/

Abstract

The study of speech emotion recognition aims to automatically detect and categorise emotions expressed through speech signals, a capability crucial for applications such as human-computer interaction and affective computing. However, existing methods struggle to accurately capture the intricate emotional content within speech signals, often due to limitations in representing detailed emotional patterns and in harnessing the richness of spectrogram features. To overcome these limitations, this thesis introduces three advanced methods leveraging spectrogram features and innovative Vision Transformer architectures. Spectrograms offer detailed representations of speech signals in both the frequency and time domains, capturing acoustic characteristics crucial for emotion recognition. Vision Transformers, employing self-attention mechanisms and hierarchical feature extraction, excel at recognising complex emotional patterns and extracting discriminative representations from spectrogram data. The first method, “CQT-MaxViT”, integrates constant-Q transform (CQT) spectrograms with Multi-axis Vision Transformers (MaxViT), achieving accuracies of 87.74%, 77.54%, and 62.49% on the Emo-DB, RAVDESS, and IEMOCAP datasets, respectively. The second method, “Mel-STFT-MViTv2”, combines mel spectrograms computed via the short-time Fourier transform (Mel-STFT) with Improved Multiscale Vision Transformers (MViTv2), recording accuracies of 91.51%, 81.75%, and 63.12% on the same datasets. Lastly, “MaxMViT-MLP” amalgamates the strengths of the CQT-MaxViT and Mel-STFT-MViTv2 methods, yielding accuracies of 96.23%, 91.23%, and 66.30%. These methods represent significant strides in speech emotion recognition, offering richer representations and improved accuracy in capturing and classifying emotional states from speech signals.
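As a concrete illustration of the first pipeline described above, the sketch below shows one plausible way to turn a speech clip into a CQT spectrogram "image" and classify it with a MaxViT backbone. This is a minimal sketch, not the thesis code: it assumes the librosa and timm libraries, and the model variant (maxvit_tiny_tf_224), input resolution, placeholder file name, and seven-class output (matching Emo-DB) are illustrative choices rather than details taken from the thesis.

    import librosa
    import numpy as np
    import timm
    import torch
    import torch.nn.functional as F

    def cqt_image(path, sr=16000, size=224):
        # Load the clip and compute a constant-Q transform magnitude spectrogram.
        y, _ = librosa.load(path, sr=sr)
        C = np.abs(librosa.cqt(y, sr=sr))
        # Log-scale to decibels, as is conventional for spectrogram "images".
        C_db = librosa.amplitude_to_db(C, ref=np.max)
        # Min-max normalise and resize to the transformer's input resolution.
        img = (C_db - C_db.min()) / (C_db.max() - C_db.min() + 1e-8)
        img = torch.from_numpy(img).float()[None, None]       # (1, 1, freq, time)
        img = F.interpolate(img, size=(size, size), mode="bilinear")
        return img.repeat(1, 3, 1, 1)                         # fake RGB: (1, 3, H, W)

    # Seven emotion classes matches Emo-DB; adjust num_classes for RAVDESS or IEMOCAP.
    model = timm.create_model("maxvit_tiny_tf_224", pretrained=False, num_classes=7)
    model.eval()
    with torch.no_grad():
        logits = model(cqt_image("speech.wav"))   # "speech.wav" is a placeholder path
        predicted_emotion = logits.argmax(dim=1)

Per the abstract, the Mel-STFT-MViTv2 branch would presumably follow the same pattern with librosa.feature.melspectrogram and an MViTv2 backbone, and the MaxMViT-MLP method would combine the two branches' outputs through an MLP head.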

Item Type: Thesis (Masters)
Additional Information: Call No.: Q325.73 .O54 2024
Uncontrolled Keywords: Deep learning (Machine learning)
Subjects: Q Science > Q Science (General) > Q300-390 Cybernetics
Divisions: Faculty of Information Science and Technology (FIST)
Depositing User: Ms Nurul Iqtiani Ahmad
Date Deposited: 03 Feb 2025 04:18
Last Modified: 03 Feb 2025 04:18
URI: http://shdl.mmu.edu.my/id/eprint/13344
