Integrating Facial Emotion Recognition, Speech to Text Transcription, and Natural Language Processing for Customer Satisfaction Analysis from Video Reviews

Citation

Deshpande, Sudhindra B. and Goh, Kah Ong Michael and Deshpande, Uttam U. and Mathad, K. S. and Karekar, N. V. and Tangod, Kiran K. (2026) Integrating Facial Emotion Recognition, Speech to Text Transcription, and Natural Language Processing for Customer Satisfaction Analysis from Video Reviews. Engineering, Technology & Applied Science Research, 16 (2). pp. 34615-34622. ISSN 2241-4487

Text
ETASR_15095.pdf - Published Version
Restricted to Repository staff only

Official URL: https://doi.org/10.48084/etasr.15095

Abstract

Customer satisfaction is a decisive factor in the success of products and services, yet conventional text-based reviews often fail to capture the full spectrum of user emotions needed to assess it. Video reviews of products and services offer a more informative medium for evaluating customer satisfaction. To leverage this, the present study proposes a multimodal machine learning framework for video-based customer feedback analysis that integrates facial emotion recognition, speech-to-text transcription, and Natural Language Processing (NLP). A dataset of 1,000 video reviews was processed through a multistage pipeline comprising frame extraction, face detection, emotion classification, audio transcription, sentiment analysis, and late fusion of modalities. Experimental results highlight the limitations of unimodal models: visual-only sentiment prediction achieved 62.3% accuracy (precision = 0.61, recall = 0.63, F1-score = 0.62, Area Under Curve (AUC) = 0.65), while audio-only sentiment prediction reached 59.5% accuracy (precision = 0.58, recall = 0.59, F1-score = 0.59, AUC = 0.61). The text-based model provided a stronger baseline at 72.1% accuracy (precision = 0.70, recall = 0.72, F1-score = 0.71, AUC = 0.75). In contrast, the multimodal fusion framework substantially outperformed the unimodal approaches, achieving 79.9% accuracy (precision = 0.80, recall = 0.81, F1-score = 0.80) and the highest AUC of 0.86. Additionally, aspect-level analysis revealed that camera quality (+0.16) was the most positively perceived feature, while app performance (-0.33) and delivery (-0.09) emerged as the primary concerns. Temporal analysis showed satisfaction scores fluctuating between 52.1 and 63.4 (on a 0-100 scale) over 20 weeks, underscoring the value of continuous monitoring. These findings demonstrate that multimodal video feedback analysis yields more comprehensive, reliable, and fair assessments than single-channel methods.
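
The abstract names late fusion of modalities but does not state the fusion rule. As a rough illustration only, the sketch below assumes one common form of late fusion: a weighted average of the class probabilities produced separately by the visual, audio, and text models. The label set, the weights, the probability values, and the function names `late_fusion` and `predict_label` are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List

# Assumed three-class sentiment label set (not specified in the abstract).
LABELS = ["negative", "neutral", "positive"]


def late_fusion(probs: Dict[str, List[float]],
                weights: Dict[str, float]) -> List[float]:
    """Fuse per-modality class probabilities by weighted averaging.

    probs maps a modality name to its class-probability vector;
    weights maps the same modality names to non-negative weights.
    """
    total = sum(weights[m] for m in probs)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    fused = [0.0] * len(LABELS)
    for modality, p in probs.items():
        if len(p) != len(LABELS):
            raise ValueError(f"{modality}: expected {len(LABELS)} probabilities")
        w = weights[modality] / total  # normalise so the weights sum to 1
        for i, value in enumerate(p):
            fused[i] += w * value
    return fused


def predict_label(fused: List[float]) -> str:
    """Return the label with the highest fused probability."""
    best = max(range(len(fused)), key=lambda i: fused[i])
    return LABELS[best]


# Made-up outputs of three unimodal models for a single video review.
example_probs = {
    "visual": [0.5, 0.3, 0.2],
    "audio": [0.4, 0.4, 0.2],
    "text": [0.1, 0.2, 0.7],
}
# Hypothetical weights that favour the stronger text baseline.
example_weights = {"visual": 0.3, "audio": 0.2, "text": 0.5}

fused = late_fusion(example_probs, example_weights)
print(predict_label(fused), [round(x, 2) for x in fused])
```

In this made-up case the visual and audio models lean negative, but the more heavily weighted text model moves the fused decision to positive. In practice such weights would typically be tuned on held-out data; the abstract does not say whether or how the authors did so.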

Item Type:	Article
Uncontrolled Keywords:	customer satisfaction, video feedback, emotion recognition, sentiment analysis, facial emotions, product feedback
Subjects:	N Fine Arts > N Visual arts
Divisions:	Faculty of Information Science and Technology (FIST)
Depositing User:	Ms Suzilawati Abu Samah
Date Deposited:	08 Jun 2026 00:15
Last Modified:	08 Jun 2026 00:15
URI:	http://shdl.mmu.edu.my/id/eprint/16085
