Cardiff University | Prifysgol Caerdydd ORCA
Online Research @ Cardiff 

DADA++: Dual Alignment Domain Adaptation for unsupervised video-text retrieval

Hao, Xiaoshuai, Zhao, Haimei, Diao, Yunfeng, Yin, Rong, Jin, Guangyin, Zhang, Jing, Zhang, Wanqian and Zhou, Wei 2025. DADA++: Dual Alignment Domain Adaptation for unsupervised video-text retrieval. ACM Transactions on Multimedia Computing, Communications and Applications 10.1145/3759252

Full text not available from this repository.

Abstract

Video-text retrieval aims to return the most semantically relevant videos for a given textual query, and is a thriving topic in both the computer vision and natural language processing communities. This paper focuses on a more challenging task, Unsupervised Domain Adaptation Video-text Retrieval (UDAVR), in which training and testing data come from different distributions. Previous approaches are mostly derived from classification-based domain adaptation methods, which are neither multi-modal nor suited to retrieval tasks. They merely alleviate the domain shift while overlooking the pairwise misalignment issue in the target domain, i.e., the absence of semantic relationships between target videos and texts. While foundation models such as CLIP perform well on in-domain video-text retrieval, their effectiveness drops significantly under domain shift because of this lack of alignment. To tackle this, we propose a novel method named Dual Alignment Domain Adaptation (DADA++). Specifically, we first introduce cross-modal semantic embedding to generate discriminative source features in a joint embedding space, and we apply cross-modal domain adaptation to reduce the domain shift smoothly across both modalities. Furthermore, we empirically identify the pairwise misalignment in the target domain and propose the integrated Dual Alignment Consistency (iDAC). iDAC adaptively aligns the video-text pairs that are most likely to be relevant in the target domain by verifying their cross-modal semantic proximity reciprocally, in both hard and soft manners. This enables the set of positive pairs to grow progressively while potentially noisy pairs are aligned throughout the training procedure. We also provide insights into the functionality of DADA++ through the lens of foundation models, giving a theoretical explanation of its effectiveness. Compared with state-of-the-art methods, DADA++ achieves relative R@1 improvements of 9.4% and 8.5% under the TGIF → MSR-VTT and TGIF → MSVD settings respectively, demonstrating its superior performance.
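Since the full text is not available from this repository, the following PyTorch sketch is only one plausible reading of the reciprocal verification the abstract attributes to iDAC: a target video-text pair is kept when each side is the other's nearest neighbour in the joint embedding space (a hard check) and the product of its row-wise and column-wise softmax confidences clears a threshold (a soft check). The function name dual_alignment_pairs and the soft_thresh parameter are hypothetical names introduced here for illustration; the paper's actual formulation may differ.

    import torch
    import torch.nn.functional as F

    def dual_alignment_pairs(video_emb, text_emb, soft_thresh=0.0):
        """Illustrative sketch of a reciprocal (dual) alignment check.

        Hard check: a video and a text are paired only if each is the
        other's nearest neighbour in the joint embedding space.
        Soft check: the pair is kept only if the product of its row-wise
        and column-wise softmax confidences exceeds `soft_thresh`
        (a hypothetical parameter, not from the paper).
        """
        v = F.normalize(video_emb, dim=-1)      # (N, d) video embeddings
        t = F.normalize(text_emb, dim=-1)       # (M, d) text embeddings
        sim = v @ t.T                           # (N, M) cosine similarities

        # Hard verification: mutual nearest neighbours across modalities.
        v2t = sim.argmax(dim=1)                 # best text for each video
        t2v = sim.argmax(dim=0)                 # best video for each text
        videos = torch.arange(sim.size(0))
        mutual = t2v[v2t] == videos             # reciprocal agreement

        # Soft verification: reciprocal softmax confidence of each pair.
        p_v2t = sim.softmax(dim=1)              # video -> text probabilities
        p_t2v = sim.softmax(dim=0)              # text -> video probabilities
        conf = p_v2t[videos, v2t] * p_t2v[videos, v2t]

        keep = mutual & (conf > soft_thresh)
        return videos[keep], v2t[keep], conf[keep]

    # Toy usage with random target-domain embeddings.
    video_emb = torch.randn(8, 512)
    text_emb = torch.randn(8, 512)
    v_idx, t_idx, conf = dual_alignment_pairs(video_emb, text_emb)
    print(v_idx, t_idx, conf)

Raising soft_thresh over the course of training would let the accepted positive set grow only as pairs become confidently aligned, matching the progressive behaviour the abstract describes.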

Item Type: Article
Date Type: Published Online
Status: In Press
Schools: Computer Science & Informatics
Publisher: Association for Computing Machinery (ACM)
ISSN: 1551-6857
Date of Acceptance: 30 July 2025
Last Modified: 21 Aug 2025 09:00
URI: https://orca.cardiff.ac.uk/id/eprint/180585
