Cardiff University | Prifysgol Caerdydd ORCA
Online Research @ Cardiff 

CLIP-Hand: CLIP-based regressor for hand pose estimation and mesh recovery

Zhou, Feng, Ji, Shuang, Shen, Pei, Dai, Ju, Pan, Junjun, Lai, Yu-kun ORCID: https://orcid.org/0000-0002-2094-5680 and Rosin, Paul L. ORCID: https://orcid.org/0000-0002-4965-3884 2025. CLIP-Hand: CLIP-based regressor for hand pose estimation and mesh recovery. Visual Computer 42, 43. 10.1007/s00371-025-04203-1

Full text not available from this repository.

Abstract

Despite significant advances in 3D hand pose estimation, the task remains challenging because of self-occlusion and complex backgrounds. To tackle these issues, we propose a CLIP-based Regressor for Hand Pose Estimation and Mesh Recovery (CLIP-Hand) that operates on a single RGB image. Specifically, our method combines high-resolution feature aggregation with a contrastive language-image pre-trained model (CLIP) to enhance feature representations through language-guided visual prompts. A multi-layer Transformer encoder-decoder module improves the prediction accuracy of the hand mesh and joint positions. To further boost performance, we introduce a predefined 3D joint module and a text dataset that augment the training data and improve the model’s generalization across different scenarios. Extensive experiments on datasets such as FreiHAND, RHD, and Dexter+Object demonstrate the effectiveness of our approach, which improves accuracy and robustness over existing methods. The source code and data will be released once the paper is accepted.
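
As a rough illustration of what language-guided enhancement of visual features can look like, the following minimal sketch weights a set of text embeddings by their cosine similarity to each image patch feature and adds the result back to the patch. It is not the authors' implementation, which has not been released at the time of this record. The function name `language_guided_features`, the residual fusion, the temperature of 0.07, the feature dimension of 512, the 49 patches and the 21 text prompts (one per hand joint) are all assumptions made for illustration, and random arrays stand in for real CLIP outputs.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def language_guided_features(patch_feats, text_embeds, temperature=0.07):
    """Hypothetical fusion step (an assumption, not the paper's module).

    patch_feats: (P, D) image patch features.
    text_embeds: (T, D) text prompt embeddings.
    Returns the guided features (P, D) and the patch-to-text weights (P, T).
    """
    p = l2_normalize(patch_feats)
    t = l2_normalize(text_embeds)

    # Cosine similarity between every patch and every text prompt.
    logits = (p @ t.T) / temperature

    # Numerically stable softmax over the text prompts.
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)

    # Residual fusion: add the similarity-weighted text context to each patch.
    guided = patch_feats + weights @ text_embeds
    return guided, weights


# Random stand-ins for CLIP image and text encoder outputs (assumed shapes).
rng = np.random.default_rng(0)
patch_feats = rng.normal(size=(49, 512))
text_embeds = rng.normal(size=(21, 512))

guided, weights = language_guided_features(patch_feats, text_embeds)
```

In a full pipeline the guided features would then be passed to a regressor such as the Transformer encoder-decoder mentioned in the abstract; that stage is not sketched here.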

Item Type: Article
Date Type: Published Online
Status: Published
Schools: Schools > Computer Science & Informatics
Publisher: Springer
ISSN: 0178-2789
Date of Acceptance: 16 November 2025
Last Modified: 07 Jan 2026 15:45
URI: https://orca.cardiff.ac.uk/id/eprint/183695
