Zhou, Feng, Ji, Shuang, Shen, Pei, Dai, Ju, Pan, Junjun, Lai, Yu-kun (ORCID: https://orcid.org/0000-0002-2094-5680) and Rosin, Paul L. (ORCID: https://orcid.org/0000-0002-4965-3884) 2025. CLIP-Hand: CLIP-based regressor for hand pose estimation and mesh recovery. The Visual Computer 42, 43. doi: 10.1007/s00371-025-04203-1
Abstract
Despite significant advancements in 3D hand pose estimation, the task still faces challenges from self-occlusion and complex backgrounds. To tackle these issues, we propose a CLIP-based Regressor for Hand Pose Estimation and Mesh Recovery (CLIP-Hand) from a single RGB image. Specifically, we propose an innovative method that combines high-resolution feature aggregation with a contrastive language-image pre-trained model (CLIP) to enhance feature representations through language-guided visual prompts. Our approach uses a multi-layer Transformer encoder-decoder module to improve the prediction accuracy of hand meshes and joint positions. To boost performance further, a predefined 3D joint module and a text dataset are proposed to augment the training data and improve the model's generalization across different scenarios. Extensive experiments on the FreiHAND, RHD, and Dexter+Object datasets demonstrate the effectiveness of our approach, showing improved accuracy and robustness compared to existing methods. The source code and data will be released once the paper is accepted.
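To make the described pipeline concrete, the following is a minimal sketch, assuming a standard PyTorch setup, of how language-guided conditioning and a Transformer decoder can be combined to regress 3D hand joints. It is not the authors' implementation: the stand-in CNN backbone, the single-vector text conditioning, the per-joint queries, and all dimensions are illustrative assumptions, and the paper's high-resolution feature aggregation and mesh-recovery head are omitted.

```python
# Illustrative sketch only (not the authors' code): visual tokens are
# conditioned on a CLIP text embedding and refined by a Transformer
# decoder into 21 3D hand joints. All names and sizes are assumptions.
import torch
import torch.nn as nn

class CLIPHandSketch(nn.Module):
    def __init__(self, feat_dim=256, clip_dim=512, num_joints=21, num_layers=3):
        super().__init__()
        # Stand-in visual backbone; the paper's high-resolution feature
        # aggregation is approximated here with a tiny CNN.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Project a (frozen, precomputed) CLIP text embedding into the
        # visual feature space so it can act as a language-guided prompt.
        self.text_proj = nn.Linear(clip_dim, feat_dim)
        # One learnable query per joint, refined by cross-attending to the
        # language-conditioned visual tokens.
        self.joint_queries = nn.Parameter(torch.randn(num_joints, feat_dim))
        decoder_layer = nn.TransformerDecoderLayer(
            d_model=feat_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=num_layers)
        self.joint_head = nn.Linear(feat_dim, 3)  # (x, y, z) per joint

    def forward(self, image, clip_text_emb):
        # image: (B, 3, H, W); clip_text_emb: (B, clip_dim), e.g. produced
        # offline by a pretrained CLIP text encoder over hand prompts.
        B = image.shape[0]
        vis = self.backbone(image)                         # (B, C, h, w)
        tokens = vis.flatten(2).transpose(1, 2)            # (B, h*w, C)
        text = self.text_proj(clip_text_emb).unsqueeze(1)  # (B, 1, C)
        memory = torch.cat([text, tokens], dim=1)          # prompt + visual
        queries = self.joint_queries.unsqueeze(0).expand(B, -1, -1)
        refined = self.decoder(queries, memory)            # (B, num_joints, C)
        return self.joint_head(refined)                    # (B, num_joints, 3)

# Example forward pass with random inputs.
model = CLIPHandSketch()
joints3d = model(torch.randn(2, 3, 224, 224), torch.randn(2, 512))
print(joints3d.shape)  # torch.Size([2, 21, 3])
```

In practice the `clip_text_emb` input would come from a pretrained CLIP text encoder applied to hand-description prompts, and a mesh-recovery head (e.g. MANO parameter regression) would sit alongside the joint head; both are left out here to keep the sketch self-contained.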
| Item Type: | Article |
|---|---|
| Date Type: | Published Online |
| Status: | Published |
| Schools: | Computer Science & Informatics |
| Publisher: | Springer |
| ISSN: | 0178-2789 |
| Date of Acceptance: | 16 November 2025 |
| Last Modified: | 07 Jan 2026 15:45 |
| URI: | https://orca.cardiff.ac.uk/id/eprint/183695 |