Li, Zirui, Wu, Siwei, Li, Yizhi, Wang, Xingyu, Zhou, Yi and Lin, Chenghua 2025. DocMMIR: A framework for document multi-modal information retrieval. Presented at: EMNLP 2025, Suzhou, China, 4 - 9 November 2025. Published in: Christodoulopoulos, Christos, Chakraborty, Tanmoy, Rose, Carolyn and Peng, Violet eds. Association for Computational Linguistics, pp. 13117-13130. 10.18653/v1/2025.findings-emnlp.705
Abstract
The rapid advancement of unsupervised representation learning and large-scale pre-trained vision-language models has significantly improved cross-modal retrieval tasks. However, existing multi-modal information retrieval (MMIR) studies lack a comprehensive exploration of document-level retrieval and suffer from the absence of cross-domain datasets at this granularity. To address this limitation, we introduce DocMMIR, a novel multi-modal document retrieval framework designed explicitly to unify diverse document formats and domains—including Wikipedia articles, scientific papers (arXiv), and presentation slides—within a comprehensive retrieval scenario. We construct a large-scale cross-domain multimodal dataset, comprising 450K training, 19.2K validation, and 19.2K test documents, serving as both a benchmark to reveal the shortcomings of existing MMIR models and a training set for further improvement. The dataset systematically integrates textual and visual information. Our comprehensive experimental analysis reveals substantial limitations in current state-of-the-art MLLMs (CLIP, BLIP2, SigLIP-2, ALIGN) when applied to our tasks, with only CLIP (ViT-L/14) demonstrating reasonable zero-shot performance. Through systematic investigation of cross-modal fusion strategies and loss function selection on the CLIP (ViT-L/14) model, we develop an optimised approach that achieves a +31% improvement in MRR@10 metrics from zero-shot baseline to fine-tuned model. Our findings offer crucial insights and practical guidance for future development in unified multimodal document retrieval tasks.
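For readers unfamiliar with the MRR@10 metric mentioned in the abstract, the following is a minimal sketch of how mean reciprocal rank at a cutoff of 10 is commonly computed. It is not taken from the paper: the function names `reciprocal_rank_at_k` and `mrr_at_k` are hypothetical, and the sketch assumes exactly one relevant document per query, which the abstract does not state.

```python
from typing import Hashable, Sequence


def reciprocal_rank_at_k(
    ranked_ids: Sequence[Hashable], relevant_id: Hashable, k: int = 10
) -> float:
    """Return 1/rank of the relevant document if it is in the top k, else 0.0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0


def mrr_at_k(
    rankings: Sequence[Sequence[Hashable]],
    relevant_ids: Sequence[Hashable],
    k: int = 10,
) -> float:
    """Average the reciprocal rank over all queries.

    Assumption (not from the source): each query has exactly one
    relevant document, identified by the matching entry in relevant_ids.
    """
    if len(rankings) != len(relevant_ids):
        raise ValueError("rankings and relevant_ids must have the same length")
    if not rankings:
        return 0.0
    total = sum(
        reciprocal_rank_at_k(ranked, relevant, k)
        for ranked, relevant in zip(rankings, relevant_ids)
    )
    return total / len(rankings)


if __name__ == "__main__":
    # Three toy queries: relevant document at rank 1, at rank 2, and not retrieved.
    toy_rankings = [["a", "b", "c"], ["x", "y", "z"], ["p", "q"]]
    toy_relevant = ["a", "y", "missing"]
    print(mrr_at_k(toy_rankings, toy_relevant, k=10))
```

Under this definition, a query whose relevant document falls outside the top 10 contributes zero, so the score rewards systems that place the correct document near the top of the list.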
| Item Type: | Conference or Workshop Item - published (Paper) |
|---|---|
| Date Type: | Publication |
| Status: | Published |
| Schools: | Schools > Computer Science & Informatics |
| Publisher: | Association for Computational Linguistics |
| ISBN: | 979-8-89176-335-7 |
| Last Modified: | 10 Feb 2026 12:35 |
| URI: | https://orca.cardiff.ac.uk/id/eprint/184572 |