Alqurashi, Nawal 2024. Multimodal speech emotion recognition based on audio and text information. PhD Thesis, Cardiff University.
Item availability restricted.

- PDF, Accepted Post-Print Version (5MB). Restricted to repository staff only until 14 July 2026 due to copyright restrictions. Available under a Creative Commons Attribution Non-Commercial No Derivatives licence.
- PDF (Cardiff University Electronic Publication Form), Supplemental Material (211kB). Restricted to repository staff only.
Abstract
Emotion recognition is inherently complex due to its reliance on multiple modalities. While humans can infer emotions despite misalignments between these modalities, automatic Speech Emotion Recognition (SER) systems lack such intuitive capabilities, which impacts their ability to interpret emotions accurately and reliably. This thesis addresses this challenge by developing a multimodal SER framework that integrates speech and text, resolves cross-modal discrepancies, and enhances the interpretability of emotion predictions. To achieve this aim, the research sets out to improve fusion strategies, reduce ambiguity arising from modality mismatches, and enable more context-aware emotion modelling through advanced learning techniques.

In line with these goals, the thesis presents three contributions. Firstly, it proposes a hierarchical classification framework for SER that processes audio and text independently, employing a novel late fusion method to improve recognition accuracy. This approach evaluates emotional cues across multiple levels, providing insight into the relative significance of each modality. Secondly, the thesis adapts this framework to address scenarios in which emotionally charged text is paired with neutral speech, a modality discrepancy that frequently contributes to misclassification. By integrating text as a supportive modality, the adapted framework improves the system's ability to recognise emotional patterns that are often obscured by neutral tones. Finally, a novel augmentation strategy using Artificial Intelligence (AI) voice cloning is introduced to address modality mismatches. This approach generates augmented samples of neutral speech paired with emotional text, enabling the model to learn from such conflicts. Supervised Contrastive Learning (SCL), incorporating the augmentation strategy, is then applied to improve the model's capacity to manage variability and inconsistencies in real-world emotional data.

This research emphasises the importance of integrating speech and text modalities in SER, addressing modality discrepancies, improving the interpretability of emotion predictions, and enriching emotional representations. The experimental results demonstrate the effectiveness of the proposed models in classifying ambiguous emotional samples, managing modality mismatches, and improving contextual understanding. These findings underscore the value of hierarchical modelling, text integration, and AI-driven augmentation in advancing the performance and reliability of SER systems over existing approaches.
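The record gives no implementation details, but the late fusion idea described in the abstract can be illustrated with a minimal sketch: independently trained audio and text classifiers each produce a probability distribution over emotion classes, and their decisions are combined only after classification. The weighted-average fusion rule, the emotion label set, and the example probabilities below are illustrative assumptions, not the thesis's actual (novel) fusion method.

```python
import numpy as np

# Illustrative emotion label set; the thesis's actual label inventory is not
# specified in this record.
EMOTIONS = ["neutral", "happy", "sad", "angry"]

def late_fuse(p_audio: np.ndarray, p_text: np.ndarray, w_audio: float = 0.5) -> np.ndarray:
    """Decision-level (late) fusion of per-modality class probabilities.

    A simple weighted average used here only for illustration; the thesis
    proposes its own late fusion method, whose exact form is not described
    in this record.
    """
    assert p_audio.shape == p_text.shape
    fused = w_audio * p_audio + (1.0 - w_audio) * p_text
    return fused / fused.sum()

# Hypothetical classifier outputs: the transcript strongly suggests "sad"
# while the speech is delivered in a near-neutral tone (a modality mismatch).
p_text = np.array([0.10, 0.05, 0.75, 0.10])
p_audio = np.array([0.60, 0.10, 0.20, 0.10])

fused = late_fuse(p_audio, p_text, w_audio=0.4)
print(EMOTIONS[int(np.argmax(fused))])  # -> "sad"
```

In this hypothetical example the textual evidence outweighs the near-neutral acoustic evidence, mirroring the abstract's use of text as a supportive modality when emotional content is obscured by a neutral speaking tone.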
| Field | Value |
|---|---|
| Item Type | Thesis (PhD) |
| Date Type | Completion |
| Status | Unpublished |
| Schools | Schools > Computer Science & Informatics |
| Subjects | Q Science > QA Mathematics > QA75 Electronic computers. Computer science; Q Science > QA Mathematics > QA76 Computer software |
| Funders | Saudi Arabian Cultural Bureau |
| Date of First Compliant Deposit | 14 July 2025 |
| Last Modified | 15 Jul 2025 09:18 |
| URI | https://orca.cardiff.ac.uk/id/eprint/179779 |