Cardiff University | Prifysgol Caerdydd ORCA
Online Research @ Cardiff 

SceneDiff: Generative scene-level image retrieval with text and sketch using diffusion models

Zuo, Ran, Hu, Haoxiang, Deng, Xiaoming, Gao, Cangjun, Zhang, Zhengming, Lai, Yukun ORCID: https://orcid.org/0000-0002-2094-5680, Ma, Cuixia, Liu, Yong-Jin and Wang, Hongan 2024. SceneDiff: Generative scene-level image retrieval with text and sketch using diffusion models. Presented at: International Joint Conference on Artificial Intelligence, Jeju, South Korea, 3-9 August 2024. Published in: Larson, Kate ed. Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence. pp. 1825-1833. 10.24963/ijcai.2024/202

PDF - Accepted Post-Print Version: SceneDiffusion_IJCAI2024.pdf (4MB)

Abstract

Jointly using text and sketch for scene-level image retrieval exploits the complementarity of the two modalities to describe fine-grained scene content and retrieve the target image, which plays a pivotal role in accurate image retrieval. Existing methods directly fuse sketch and text features and thus suffer from limited utilization of crucial semantic and structural information, leading to inaccurate matching with images. In this paper, we propose SceneDiff, a novel retrieval network that leverages a pre-trained diffusion model to establish a shared generative latent space, enabling joint latent representation learning for sketch and text features and precise alignment with the corresponding image. Specifically, we encode text, sketch and image features and project them into the diffusion-based shared space, conditioning the denoising process on the sketch and text features to generate latent fusion features, while employing a pre-trained autoencoder for latent image features. Within this space, we introduce a content-aware feature transformation module that reconciles the encoded sketch and image features with the dimensional requirements of the diffusion latent space while preserving their visual content information. We then augment the representation capability of the generated latent fusion features by integrating multiple samplings with partition attention, and use contrastive learning to align both the direct fusion features and the generated latent fusion features with the corresponding image representations. Extensive experiments show that our method outperforms state-of-the-art works, providing new insight into the related retrieval field.
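The abstract outlines the core pipeline only at a high level; the following is a minimal, illustrative sketch (not the authors' implementation) of two of its ingredients: conditioning a denoising step on sketch and text features in a shared latent space, and aligning the resulting fused latent with the image latent via a contrastive loss. The module names, feature dimensions, single-step denoiser, and InfoNCE formulation are all simplifying assumptions for illustration.

# Hypothetical sketch of diffusion-conditioned fusion + contrastive alignment.
# Not the SceneDiff code; dimensions, modules and the one-step denoiser are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalDenoiser(nn.Module):
    """Predicts a clean fused latent from a noisy latent, conditioned on sketch and text features."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim * 3, dim), nn.GELU(), nn.Linear(dim, dim)
        )

    def forward(self, noisy_latent, sketch_feat, text_feat):
        # Concatenate the noisy latent with the two conditioning modalities.
        cond = torch.cat([noisy_latent, sketch_feat, text_feat], dim=-1)
        return self.net(cond)

def info_nce(query, key, temperature=0.07):
    """Symmetric contrastive loss between fused latents and image latents."""
    query = F.normalize(query, dim=-1)
    key = F.normalize(key, dim=-1)
    logits = query @ key.t() / temperature
    labels = torch.arange(query.size(0), device=query.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

# Toy batch: pretend these come from pre-trained sketch/text/image encoders.
B, D = 8, 256
sketch_feat = torch.randn(B, D)
text_feat = torch.randn(B, D)
image_latent = torch.randn(B, D)   # e.g. from a pre-trained autoencoder

denoiser = ConditionalDenoiser(D)

# Single illustrative denoising step: start from noise, generate a fused latent.
noise = torch.randn(B, D)
fused_latent = denoiser(noise, sketch_feat, text_feat)

# Align fused latents with image latents (matching pairs share a batch index).
loss = info_nce(fused_latent, image_latent)
loss.backward()
print(f"contrastive loss: {loss.item():.4f}")

In the paper's full pipeline, the denoiser would run over multiple sampling steps (with partition attention aggregating several samples), and the content-aware feature transformation module would adapt encoder outputs to the latent space; the sketch above collapses these into a single conditioned step purely to show the data flow.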

Item Type: Conference or Workshop Item (Paper)
Date Type: Publication
Status: Published
Schools: Computer Science & Informatics
ISBN: 9781956792041
Related URLs:
Date of First Compliant Deposit: 11 May 2024
Date of Acceptance: 16 April 2024
Last Modified: 21 Aug 2024 13:54
URI: https://orca.cardiff.ac.uk/id/eprint/168855

