Multimodal Emotion Recognition from Videos using Audio-Visual Transformer Fusion: A Deep Learning Approach
Abstract
Emotion recognition from videos has gained significant attention in recent years with the rise of deep learning. This research proposes a multimodal approach that combines audio and visual cues to improve the accuracy of emotion recognition in dynamic video content. We introduce a dual-stream Transformer-based architecture: one stream processes facial features extracted from video frames, while the other processes spectrogram representations of the audio signal. The two streams are fused at the feature level through a cross-attention mechanism that captures inter-modal dependencies. Our model is evaluated on benchmark datasets such as RAVDESS and CREMA-D, achieving state-of-the-art performance in classifying emotions such as anger, happiness, sadness, and surprise. The results demonstrate the effectiveness of Transformer-based fusion for multimodal emotion recognition. We also discuss the implications of this model for applications in human-computer interaction, education, and healthcare.
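To make the fusion scheme concrete, the sketch below illustrates one plausible reading of a dual-stream Transformer with feature-level cross-attention fusion, written in PyTorch. It is not the authors' released implementation: the module name (CrossAttentionFusion), embedding dimension, layer counts, and the use of mean pooling with a linear classifier are illustrative assumptions, and the inputs stand in for pre-extracted facial and spectrogram features.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Illustrative dual-stream Transformer with cross-attention fusion (assumed design)."""

    def __init__(self, dim=256, num_heads=4, num_classes=8):
        super().__init__()
        # Each stream encodes its own modality before fusion.
        self.visual_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True),
            num_layers=2,
        )
        self.audio_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True),
            num_layers=2,
        )
        # Cross-attention: each modality queries the other to capture inter-modal dependencies.
        self.v2a_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.a2v_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, visual_tokens, audio_tokens):
        # visual_tokens: (B, T_v, dim) frame-level facial features
        # audio_tokens:  (B, T_a, dim) spectrogram patch embeddings
        v = self.visual_encoder(visual_tokens)
        a = self.audio_encoder(audio_tokens)
        # Visual queries attend over audio keys/values, and vice versa.
        v_fused, _ = self.v2a_attn(query=v, key=a, value=a)
        a_fused, _ = self.a2v_attn(query=a, key=v, value=v)
        # Pool over time, concatenate both fused streams, and classify.
        pooled = torch.cat([v_fused.mean(dim=1), a_fused.mean(dim=1)], dim=-1)
        return self.classifier(pooled)


# Example usage with random tensors standing in for extracted features.
model = CrossAttentionFusion()
logits = model(torch.randn(2, 16, 256), torch.randn(2, 32, 256))
print(logits.shape)  # torch.Size([2, 8])
```

The key design point this sketch highlights is feature-level (rather than decision-level) fusion: each modality's token sequence attends directly to the other's, so inter-modal dependencies are modeled before any pooling or classification.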