Rethinking the effect of unimodal labels in multimodal sentiment analysis

Zhang, Lingli, Li, Tianrui, Lu, Baiyu, Fang, Junlin, Zheng, Desheng, Zhou, Wei, Liu, Weide and Lv, Fengmao 2026. Rethinking the effect of unimodal labels in multimodal sentiment analysis. ACM Transactions on Multimedia Computing, Communications, and Applications, 3796718. 10.1145/3796718

Full text not available from this repository.

Official URL: https://doi.org/10.1145/3796718

Abstract

Multimodal sentiment analysis aims to comprehensively understand human sentiment by integrating diverse modalities, such as text, audio, and vision. To improve modality complementarity, the recent Multimodal Multi-task Learning (MML) framework jointly trains unimodal and multimodal sentiment analysis tasks using per-modality sub-annotations. In this work, we draw attention to the observation that integrating unimodal tasks may introduce conflicting task information, which degrades performance on the multimodal task. Motivated by this issue, we propose the Multimodal Task Correlation-aware Learning (MTCL) framework to leverage beneficial task correlations and suppress harmful ones. Specifically, MTCL introduces a Correlation-Adaptive Training (CAT) strategy to learn a task-relation-aware unimodal encoder for each modality. First, to determine whether a sample contains conflicting information, the CAT strategy incorporates a Dual-Branch Contrast (DBC) module, which divides the training set into a beneficial subset and a harmful subset. Based on this division, the CAT strategy applies an adaptive training loss to guide the model in understanding nuanced multitask correlations. The adaptive training loss has two components: 1) for the beneficial subset, a contrastive loss improves the model's ability to extract complementary representations; 2) for the harmful subset, a task-correction loss mitigates the negative interference caused by harmful task associations. With the CAT strategy, our framework can effectively distinguish beneficial from harmful task correlations and extract distinctive and robust unimodal representations. The superiority of MTCL is verified via extensive experiments on several multimodal video sentiment analysis benchmarks. Our work is publicly available at https://github.com/tiggers23/MTCL.
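
The two-part adaptive training loss described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how the DBC module decides the split, what form the contrastive and task-correction losses take, or how the two terms are weighted. The function name adaptive_training_loss, the precomputed is_beneficial flags, the placeholder per-sample losses, and the unweighted sum of subset means are all assumptions made for illustration only.

```python
from typing import Callable, List, Sequence

Sample = Sequence[float]


def adaptive_training_loss(
    samples: List[Sample],
    is_beneficial: List[bool],
    beneficial_loss: Callable[[Sample], float],
    harmful_loss: Callable[[Sample], float],
) -> float:
    """Route each sample to the loss of its subset and add the two subset means.

    Hypothetical sketch: `is_beneficial` stands in for the output of the DBC
    module, and the unweighted sum is an assumption, not taken from the paper.
    """
    good = [s for s, flag in zip(samples, is_beneficial) if flag]
    bad = [s for s, flag in zip(samples, is_beneficial) if not flag]

    total = 0.0
    if good:  # beneficial subset: the paper uses a contrastive loss here
        total += sum(beneficial_loss(s) for s in good) / len(good)
    if bad:  # harmful subset: the paper uses a task-correction loss here
        total += sum(harmful_loss(s) for s in bad) / len(bad)
    return total


# Placeholder per-sample losses, chosen only so the sketch runs end to end.
def placeholder_beneficial(sample: Sample) -> float:
    return sum(x * x for x in sample)


def placeholder_harmful(sample: Sample) -> float:
    return sum(abs(x) for x in sample)


features = [[1.0, 0.0], [0.0, 2.0], [3.0, -1.0]]
flags = [True, True, False]
loss = adaptive_training_loss(
    features, flags, placeholder_beneficial, placeholder_harmful
)
```

Passing the two losses in as callables keeps the sketch limited to the routing structure the abstract describes, without guessing at the actual loss definitions; those are available in the authors' repository linked above.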

Item Type:	Article
Date Type:	Published Online
Status:	In Press
Schools:	Schools > Computer Science & Informatics
Publisher:	Association for Computing Machinery (ACM)
ISSN:	1551-6857
Date of Acceptance:	3 January 2026
Last Modified:	23 Mar 2026 12:31
URI:	https://orca.cardiff.ac.uk/id/eprint/185959
