Citation
Deshpande, Sudhindra B. and Goh, Kah Ong Michael and Deshpande, Uttam U. and Mathad, K. S. and Karekar, N. V. and Tangod, Kiran K. (2026) Integrating Facial Emotion Recognition, Speech to Text Transcription, and Natural Language Processing for Customer Satisfaction Analysis from Video Reviews. Engineering, Technology & Applied Science Research, 16 (2). pp. 34615-34622. ISSN 2241-4487
Text
ETASR_15095.pdf - Published Version (restricted to repository staff only, 1MB)
Abstract
Customer satisfaction is a decisive factor in the success of products and services provided, yet conventional text-based reviews often fail to capture the full spectrum of user emotions needed to assess satisfaction. On the other hand, video product or service reviews offer a more informative medium for evaluating customer satisfaction. To leverage this, the present study proposes a multimodal machine learning framework for video-based customer feedback analysis, integrating facial emotion recognition, speech-to-text transcription, and Natural Language Processing (NLP). A dataset of 1,000 video reviews was processed through a multistage pipeline that involved frame extraction, face detection, emotion classification, audio transcription, sentiment analysis, and late fusion of modalities. Experimental results highlight the limitations of unimodal models: visual-only sentiment prediction achieved 62.3% accuracy (precision = 0.61, recall = 0.63, F1-score = 0.62, Area Under Curve (AUC) = 0.65), while audio-only sentiment prediction reached 59.5% accuracy (precision = 0.58, recall = 0.59, F1-score = 0.59, AUC = 0.61). The text-based model provided a stronger baseline at 72.1% accuracy (precision = 0.70, recall = 0.72, F1-score = 0.71, AUC = 0.75). In contrast, the multimodal fusion framework substantially outperformed unimodal approaches, achieving 79.9% accuracy, precision = 0.80, recall = 0.81, F1-score = 0.80, and the highest AUC of 0.86. Additionally, aspect-level analysis revealed that camera quality (+0.16) was the most positively perceived feature, while app performance (-0.33) and delivery (-0.09) emerged as primary concerns. Temporal analysis showed satisfaction scores fluctuating between 52.1 and 63.4 (0-100 scale) over 20 weeks, underscoring the value of continuous monitoring. These findings demonstrate that multimodal video feedback analysis yields more comprehensive, reliable, and fair performance than single-channel methods.
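The late-fusion step named in the abstract can be illustrated with a minimal sketch. The abstract does not state the fusion rule, so the weighted average below is an assumption, not the authors' method. The function name `late_fusion`, the modality keys, the weights, the 0.5 threshold, and the binary positive/negative labels are all hypothetical.

```python
from typing import Dict, Optional, Tuple


def late_fusion(
    probs: Dict[str, float],
    weights: Optional[Dict[str, float]] = None,
    threshold: float = 0.5,
) -> Tuple[float, str]:
    """Fuse per-modality positive-sentiment probabilities by weighted average.

    Assumption: each unimodal model (visual, audio, text) has already
    produced a probability in [0, 1] that the review is positive.
    """
    if not probs:
        raise ValueError("at least one modality probability is required")
    if weights is None:
        # Hypothetical default: every modality counts equally.
        weights = {name: 1.0 for name in probs}
    total = sum(weights[name] for name in probs)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Weighted mean of the unimodal scores; fusion happens after each
    # model has made its own prediction, which is what makes it "late".
    fused = sum(weights[name] * p for name, p in probs.items()) / total
    label = "positive" if fused >= threshold else "negative"
    return fused, label


if __name__ == "__main__":
    scores = {"visual": 0.4, "audio": 0.5, "text": 0.9}
    print(late_fusion(scores))
    # Illustrative weights that favour the text channel.
    print(late_fusion(scores, weights={"visual": 1.0, "audio": 1.0, "text": 2.0}))
```

Giving the text channel a larger weight in the second call is only an illustration of how a stronger unimodal baseline could be favoured; the paper may use a different scheme, such as a learned meta-classifier.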
| Item Type: | Article |
|---|---|
| Uncontrolled Keywords: | customer satisfaction, video feedback, emotion recognition, sentiment analysis, facial emotions, product feedback |
| Subjects: | N Fine Arts > N Visual arts |
| Divisions: | Faculty of Information Science and Technology (FIST) |
| Depositing User: | Ms Suzilawati Abu Samah |
| Date Deposited: | 08 Jun 2026 00:15 |
| Last Modified: | 08 Jun 2026 00:15 |
| URI: | http://shdl.mmu.edu.my/id/eprint/16085 |
