Cardiff University | Prifysgol Caerdydd ORCA
Online Research @ Cardiff 

DocMMIR: A framework for document multi-modal information retrieval

Li, Zirui, Wu, Siwei, Li, Yizhi, Wang, Xingyu, Zhou, Yi and Lin, Chenghua 2025. DocMMIR: A framework for document multi-modal information retrieval. Presented at: EMNLP 2025, Suzhou, China, 4-9 November 2025. Published in: Christodoulopoulos, Christos, Chakraborty, Tanmoy, Rose, Carolyn and Peng, Violet eds. Findings of the Association for Computational Linguistics: EMNLP 2025. Association for Computational Linguistics, pp. 13117-13130. 10.18653/v1/2025.findings-emnlp.705

Full text not available from this repository.

Abstract

The rapid advancement of unsupervised representation learning and large-scale pre-trained vision-language models has significantly improved cross-modal retrieval tasks. However, existing multi-modal information retrieval (MMIR) studies lack a comprehensive exploration of document-level retrieval and suffer from the absence of cross-domain datasets at this granularity. To address this limitation, we introduce DocMMIR, a novel multi-modal document retrieval framework designed explicitly to unify diverse document formats and domains, including Wikipedia articles, scientific papers (arXiv), and presentation slides, within a comprehensive retrieval scenario. We construct a large-scale cross-domain multimodal dataset, comprising 450K training, 19.2K validation, and 19.2K test documents, serving as both a benchmark to reveal the shortcomings of existing MMIR models and a training set for further improvement. The dataset systematically integrates textual and visual information. Our comprehensive experimental analysis reveals substantial limitations in current state-of-the-art MLLMs (CLIP, BLIP2, SigLIP-2, ALIGN) when applied to our tasks, with only CLIP (ViT-L/14) demonstrating reasonable zero-shot performance. Through systematic investigation of cross-modal fusion strategies and loss function selection on the CLIP (ViT-L/14) model, we develop an optimised approach that improves MRR@10 by +31% over the zero-shot baseline after fine-tuning. Our findings offer crucial insights and practical guidance for future development in unified multimodal document retrieval tasks.
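Illustrative note: the abstract describes fusing textual and visual document representations and evaluating retrieval with MRR@10. The Python sketch below is a hypothetical illustration of that general setup, not the paper's actual pipeline. It assumes precomputed CLIP-style text and image embeddings per document, fuses them by simple averaging (one possible cross-modal fusion strategy), scores queries by cosine similarity, and computes MRR@10. All function names and the fusion choice are assumptions made for illustration.

import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    # Row-wise L2 normalisation so dot products equal cosine similarity.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def fuse_document_embeddings(text_emb: np.ndarray, image_emb: np.ndarray) -> np.ndarray:
    # Hypothetical fusion: average the normalised text and image embeddings.
    # The paper studies several fusion strategies; this is only one simple option.
    return l2_normalize(l2_normalize(text_emb) + l2_normalize(image_emb))

def mrr_at_k(query_emb: np.ndarray, doc_emb: np.ndarray,
             gold_doc_idx: np.ndarray, k: int = 10) -> float:
    # Mean reciprocal rank truncated at k for a batch of queries.
    # query_emb:    (num_queries, dim) query embeddings
    # doc_emb:      (num_docs, dim) fused document embeddings
    # gold_doc_idx: (num_queries,) index of the relevant document per query
    scores = l2_normalize(query_emb) @ doc_emb.T        # cosine similarities
    ranking = np.argsort(-scores, axis=1)[:, :k]        # top-k document indices
    reciprocal_ranks = np.zeros(len(query_emb))
    for q, gold in enumerate(gold_doc_idx):
        hits = np.where(ranking[q] == gold)[0]
        if hits.size:                                   # relevant doc found in top-k
            reciprocal_ranks[q] = 1.0 / (hits[0] + 1)
    return float(reciprocal_ranks.mean())

# Tiny synthetic usage example with random embeddings.
rng = np.random.default_rng(0)
doc_text = rng.normal(size=(100, 512))
doc_image = rng.normal(size=(100, 512))
docs = fuse_document_embeddings(doc_text, doc_image)
queries = docs[:20] + 0.1 * rng.normal(size=(20, 512))  # queries near their gold documents
print(f"MRR@10 = {mrr_at_k(queries, docs, np.arange(20)):.3f}")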

Item Type: Conference or Workshop Item - published (Paper)
Date Type: Publication
Status: Published
Schools: Computer Science & Informatics
Publisher: Association for Computational Linguistics
ISBN: 979-8-89176-335-7
Last Modified: 10 Feb 2026 12:35
URI: https://orca.cardiff.ac.uk/id/eprint/184572
