Yu, Li, Wang, Situo, Zhou, Wei and Gabbouj, Moncef 2026. DVLTA-VQA: Decoupled vision-language modeling with text-guided adaptation for blind video quality assessment. IEEE Transactions on Circuits and Systems for Video Technology. 10.1109/tcsvt.2026.3657415
Abstract
Inspired by the dual-stream (dorsal and ventral streams) theory of the human visual system (HVS), recent Video Quality Assessment (VQA) methods have integrated Contrastive Language-Image Pretraining (CLIP) to enhance semantic understanding. However, because CLIP was originally designed for images, it cannot adequately capture the temporal dynamics and motion perception (dorsal stream) inherent in videos. To address this limitation, we propose DVLTA-VQA (Decoupled Vision-Language Modeling with Text-Guided Adaptation), which decouples CLIP’s visual and textual components to better align with the no-reference VQA (NR-VQA) pipeline. Specifically, we introduce a Video-Based Temporal CLIP module and a Temporal Context Module to explicitly model motion dynamics, effectively enhancing the dorsal stream representation. Complementing this, a Basic Visual Feature Extraction Module is employed to strengthen spatial detail analysis in the ventral stream. Furthermore, we propose a text-guided adaptive fusion strategy that leverages textual semantics to dynamically weight visual features, facilitating effective spatiotemporal integration. Extensive experiments on multiple public datasets demonstrate that the proposed method achieves state-of-the-art performance, significantly improving prediction accuracy and generalization capability.
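The abstract describes using textual semantics to adaptively weight spatial (ventral) and temporal (dorsal) visual features before fusion. The following is only a minimal illustrative sketch of that general idea, not the authors' implementation: the module name `TextGuidedFusion`, the projection layer, the cosine-similarity softmax weighting, and the feature dimensions are all assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextGuidedFusion(nn.Module):
    """Illustrative sketch (hypothetical): weight spatial and temporal visual
    features by their similarity to a text embedding, then fuse them."""

    def __init__(self, dim: int):
        super().__init__()
        # Hypothetical projection into a shared space with the text embedding.
        self.proj = nn.Linear(dim, dim)

    def forward(self, spatial_feat, temporal_feat, text_emb):
        # spatial_feat, temporal_feat, text_emb: (B, D)
        feats = torch.stack([spatial_feat, temporal_feat], dim=1)  # (B, 2, D)
        feats = self.proj(feats)
        # Cosine similarity of each visual branch to the text embedding.
        sims = F.cosine_similarity(feats, text_emb.unsqueeze(1), dim=-1)  # (B, 2)
        weights = sims.softmax(dim=-1).unsqueeze(-1)  # (B, 2, 1)
        # Weighted sum over the two branches yields the fused representation.
        return (weights * feats).sum(dim=1)  # (B, D)

# Usage sketch with random tensors standing in for real features.
fusion = TextGuidedFusion(dim=512)
spatial = torch.randn(4, 512)   # ventral-stream (spatial detail) features
temporal = torch.randn(4, 512)  # dorsal-stream (motion/temporal) features
text = torch.randn(4, 512)      # text embedding of a quality-related prompt
fused = fusion(spatial, temporal, text)  # (4, 512)
```

This sketch only conveys how textual guidance can produce per-branch weights for spatiotemporal integration; the actual fusion strategy, feature extractors, and dimensions in DVLTA-VQA are described in the paper itself.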
| Item Type: | Article |
|---|---|
| Date Type: | Published Online |
| Status: | In Press |
| Schools: | Schools > Computer Science & Informatics |
| Publisher: | Institute of Electrical and Electronics Engineers (IEEE) |
| ISSN: | 1051-8215 |
| Last Modified: | 02 Feb 2026 13:54 |
| URI: | https://orca.cardiff.ac.uk/id/eprint/184340 |