Cardiff University | Prifysgol Caerdydd ORCA
Online Research @ Cardiff 

Multi-level matching network for multimodal entity linking

Hu, Zhiwei, Gutierrez Basulto, Victor ORCID: https://orcid.org/0000-0002-6117-5459, Li, Ru and Pan, Jeff Z. 2025. Multi-level matching network for multimodal entity linking. Presented at: 31st SIGKDD Conference on Knowledge Discovery and Data Mining, Toronto, Canada, 3-7 August 2025. Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining. New York, NY, United States: Association for Computing Machinery, pp. 508-519. 10.1145/3690624.3709306
Item availability restricted.

PDF (KDD2025_MEL_CameraReady.pdf) - Accepted Post-Print Version
Restricted to Repository staff only until 8 August 2025 due to copyright restrictions.

Abstract

Multimodal entity linking (MEL) aims to link ambiguous mentions within multimodal contexts to the corresponding entities in a multimodal knowledge base. Most existing approaches to MEL rely on representation learning or vision-and-language pre-training mechanisms to exploit the complementary effect among multiple modalities. However, these methods suffer from two limitations. On the one hand, they overlook the possibility of considering negative samples from the same modality. On the other hand, they lack mechanisms to capture bidirectional cross-modal interaction. To address these issues, we propose a Multi-level Matching network for Multimodal Entity Linking (M3EL). Specifically, M3EL is composed of three modules: (i) a Multimodal Feature Extraction module, which extracts modality-specific representations with a multimodal encoder and introduces an intra-modal contrastive learning sub-module to obtain more discriminative embeddings based on uni-modal differences; (ii) an Intra-modal Matching Network module, which contains two levels of matching granularity, Coarse-grained Global-to-Global and Fine-grained Global-to-Local, to achieve local- and global-level intra-modal interaction; (iii) a Cross-modal Matching Network module, which applies bidirectional strategies, Textual-to-Visual and Visual-to-Textual matching, to implement bidirectional cross-modal interaction. Extensive experiments conducted on the WikiMEL, RichpediaMEL, and WikiDiverse datasets demonstrate the outstanding performance of M3EL compared to state-of-the-art baselines.
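The two levels of intra-modal matching granularity described in the abstract can be sketched as follows. This is an illustrative outline only, not the authors' implementation: the function names, the max-pooling over local embeddings, and the weighted combination of coarse and fine scores are all assumptions made for the sake of the example.

```python
import math


def cosine(u, v):
    # Cosine similarity between two dense embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def global_to_global(mention_global, entity_global):
    # Coarse-grained matching: a single score between the
    # mention's and the entity's global embeddings.
    return cosine(mention_global, entity_global)


def global_to_local(mention_global, entity_locals):
    # Fine-grained matching: compare the mention's global embedding
    # against each local (e.g. token- or region-level) embedding of
    # the entity, then max-pool the resulting scores.
    return max(cosine(mention_global, loc) for loc in entity_locals)


def match_score(mention, entity, alpha=0.5):
    # Combine coarse- and fine-grained scores; the weighting scheme
    # here is a hypothetical choice, not taken from the paper.
    coarse = global_to_global(mention["global"], entity["global"])
    fine = global_to_local(mention["global"], entity["locals"])
    return alpha * coarse + (1 - alpha) * fine
```

Under this sketch, linking a mention amounts to scoring it against every candidate entity in the knowledge base and selecting the highest-scoring one; the cross-modal module would contribute analogous Textual-to-Visual and Visual-to-Textual scores computed across modalities.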

Item Type: Conference or Workshop Item (Paper)
Date Type: Publication
Status: Published
Schools: Computer Science & Informatics
Publisher: Association for Computing Machinery
ISBN: 9798400712456
Date of First Compliant Deposit: 12 February 2025
Date of Acceptance: 17 November 2024
Last Modified: 16 Apr 2025 09:09
URI: https://orca.cardiff.ac.uk/id/eprint/176144
