Multimodal Emotion Recognition from Videos using Audio-Visual Transformer Fusion: A Deep Learning Approach


Dr. Komil Vora, Prof. Dishita Mashru

Abstract

Emotion recognition from videos has gained significant attention in recent years with the rise of deep learning. This research proposes a multimodal approach that combines audio and visual cues to improve the accuracy of emotion recognition in dynamic video content. We introduce a dual-stream Transformer-based architecture: one stream processes facial features extracted from video frames, while the other processes spectrogram representations of the audio signal. The two streams are fused at the feature level through a cross-attention mechanism that captures inter-modal dependencies. Our model is evaluated on benchmark datasets such as RAVDESS and CREMA-D, achieving state-of-the-art performance in classifying emotions including anger, happiness, sadness, and surprise. The results demonstrate the effectiveness of Transformer-based fusion for multimodal emotion recognition. We also discuss the implications of this model for applications in human-computer interaction, education, and healthcare.
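To make the cross-attention fusion described above concrete, the following minimal PyTorch sketch shows one way a visual token stream (frame-level facial features) and an audio token stream (spectrogram embeddings) could be fused before emotion classification. This is an illustrative sketch, not the authors' implementation: the class name CrossAttentionFusion, the embedding size, number of attention heads, mean pooling, and the linear classifier head are all assumptions made for the example.

# Illustrative sketch only: minimal cross-attention fusion of visual and
# audio token streams. Dimensions, layer choices, and the classifier head
# are hypothetical, not the paper's reported configuration.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 4, num_emotions: int = 4):
        super().__init__()
        # Cross-attention: queries from one modality, keys/values from the other.
        self.vis_to_aud = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.aud_to_vis = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_a = nn.LayerNorm(dim)
        self.classifier = nn.Linear(2 * dim, num_emotions)

    def forward(self, vis_tokens, aud_tokens):
        # vis_tokens: (batch, T_v, dim) frame-level facial feature tokens
        # aud_tokens: (batch, T_a, dim) spectrogram patch/frame embeddings
        v_att, _ = self.vis_to_aud(vis_tokens, aud_tokens, aud_tokens)
        a_att, _ = self.aud_to_vis(aud_tokens, vis_tokens, vis_tokens)
        v = self.norm_v(vis_tokens + v_att).mean(dim=1)  # temporal pooling
        a = self.norm_a(aud_tokens + a_att).mean(dim=1)
        return self.classifier(torch.cat([v, a], dim=-1))  # emotion logits

if __name__ == "__main__":
    model = CrossAttentionFusion()
    vis = torch.randn(2, 16, 256)   # e.g. 16 sampled video frames per clip
    aud = torch.randn(2, 100, 256)  # e.g. 100 spectrogram frames per clip
    print(model(vis, aud).shape)    # torch.Size([2, 4])

In this sketch each modality attends to the other, so visual queries are refined by audio context and vice versa, after which the pooled representations are concatenated and mapped to emotion logits.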


