Zuo, Ran, Hu, Haoxiang, Deng, Xiaoming, Gao, Cangjun, Zhang, Zhengming, Lai, Yukun (ORCID: https://orcid.org/0000-0002-2094-5680), Ma, Cuixia, Liu, Yong-Jin and Wang, Hongan 2024. SceneDiff: Generative scene-level image retrieval with text and sketch using diffusion models. Presented at: International Joint Conference on Artificial Intelligence, Jeju, South Korea, 3-9 August 2024. Published in: Larson, Kate ed. Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence. pp. 1825-1833. 10.24963/ijcai.2024/202
Abstract
Jointly using text and sketch for scene-level image retrieval exploits the complementarity between the two modalities to describe fine-grained scene content and retrieve the target image, which plays a pivotal role in accurate image retrieval. Existing methods directly fuse sketch and text features and thus suffer from limited utilization of crucial semantic and structural information, leading to inaccurate matching with images. In this paper, we propose SceneDiff, a novel retrieval network that leverages a pre-trained diffusion model to establish a shared generative latent space, enabling joint latent representation learning for sketch and text features and precise alignment with the corresponding image. Specifically, we encode text, sketch and image features and project them into the diffusion-based shared space, conditioning the denoising process on the sketch and text features to generate latent fusion features, while employing a pre-trained autoencoder to obtain latent image features. Within this space, we introduce a content-aware feature transformation module to reconcile the encoded sketch and image features with the dimensional requirements of the diffusion latent space while preserving their visual content information. We then augment the representation capability of the generated latent fusion features by integrating multiple samplings with partition attention, and use contrastive learning to align both the direct fusion features and the generated latent fusion features with the corresponding image representations. Extensive experiments show that our method outperforms state-of-the-art works, providing new insight into the related retrieval field.
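To make the retrieval pipeline described in the abstract concrete, the snippet below gives a minimal, hypothetical sketch (not the authors' code) of the two core ideas: conditioning a single simplified "denoising" step on projected sketch and text features to produce a fused latent, and aligning that fused latent with the image latent via a symmetric InfoNCE-style contrastive loss. All module names, dimensions, and the one-step denoiser are illustrative assumptions; the paper itself uses a pre-trained diffusion model with iterative sampling, a content-aware feature transformation module, and partition attention over multiple samplings.

```python
# Hypothetical minimal sketch of diffusion-conditioned fusion + contrastive alignment.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyFusionRetrieval(nn.Module):
    def __init__(self, sketch_dim=512, text_dim=512, image_dim=512, latent_dim=256):
        super().__init__()
        # Project each modality into a shared latent space (stand-in for the
        # paper's encoders and content-aware feature transformation).
        self.sketch_proj = nn.Linear(sketch_dim, latent_dim)
        self.text_proj = nn.Linear(text_dim, latent_dim)
        self.image_proj = nn.Linear(image_dim, latent_dim)
        # One conditioned "denoising" step: maps a noisy latent plus the
        # sketch/text condition to a fused latent (assumption: the real method
        # runs iterative sampling with a pre-trained diffusion model).
        self.denoiser = nn.Sequential(
            nn.Linear(latent_dim * 3, latent_dim), nn.GELU(),
            nn.Linear(latent_dim, latent_dim),
        )

    def forward(self, sketch_feat, text_feat, image_feat):
        s = self.sketch_proj(sketch_feat)
        t = self.text_proj(text_feat)
        v = self.image_proj(image_feat)
        noise = torch.randn_like(s)                      # noisy latent to refine
        fused = self.denoiser(torch.cat([noise, s, t], dim=-1))
        return F.normalize(fused, dim=-1), F.normalize(v, dim=-1)

def info_nce(fused, image, temperature=0.07):
    """Symmetric contrastive loss aligning fused (sketch+text) and image latents."""
    logits = fused @ image.t() / temperature
    targets = torch.arange(fused.size(0), device=fused.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Usage: random tensors stand in for real sketch/text/image embeddings.
model = ToyFusionRetrieval()
s, t, v = torch.randn(8, 512), torch.randn(8, 512), torch.randn(8, 512)
fused, image = model(s, t, v)
loss = info_nce(fused, image)
loss.backward()
```

At retrieval time, under the same assumptions, the fused latent would be compared against a gallery of image latents by cosine similarity and the top-ranked images returned.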
Item Type: | Conference or Workshop Item (Paper)
---|---
Date Type: | Publication
Status: | Published
Schools: | Computer Science & Informatics
ISBN: | 9781956792041
Date of First Compliant Deposit: | 11 May 2024
Date of Acceptance: | 16 April 2024
Last Modified: | 21 Aug 2024 13:54
URI: | https://orca.cardiff.ac.uk/id/eprint/168855