Cardiff University | Prifysgol Caerdydd ORCA
Online Research @ Cardiff 

SketchGPT: A sketch-based multimodal interface for application-agnostic LLM interaction

Huang, Zeyuan, Gao, Cangjun, Shan, Yaxian, Hu, Haoxiang, Li, Qingkun, Deng, Xiaoming, Ma, Cuixia, Lai, Yukun ORCID: https://orcid.org/0000-0002-2094-5680, Liu, Yong-Jin, Tian, Feng, Dai, Guozhong and Wang, Hongan 2025. SketchGPT: A sketch-based multimodal interface for application-agnostic LLM interaction. Presented at: UIST '25: The 38th Annual ACM Symposium on User Interface Software and Technology, Busan, Republic of Korea, 28 September - 1 October 2025. Published in: Bianchi, Andrea, Glassman, Elena, Mackay, Wendy E., Zhao, Shengdong, Kim, Jeeeun and Oakley, Ian eds. UIST '25: Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology. New York, NY: Association for Computing Machinery. DOI: 10.1145/3746059.3747598

PDF - Published Version
Available under License Creative Commons Attribution.
Download (24MB)

Abstract

Human interaction with large language models (LLMs) is typically confined to text or image interfaces. Sketches offer a powerful medium for articulating creative ideas and user intentions, yet their potential remains underexplored. We propose SketchGPT, a novel interaction paradigm that integrates sketch and speech input directly over the system interface, facilitating open-ended, context-aware communication with LLMs. By leveraging the complementary strengths of multimodal inputs, expressions are enriched with semantic scope while maintaining efficiency. Interpreting user intentions across diverse contexts and modalities remains a key challenge. To address this, we developed a prototype based on a multi-agent framework that infers user intentions within context and generates executable context-sensitive and toolkit-aware feedback. Using Chain-of-Thought techniques for temporal and semantic alignment, the system understands multimodal intentions and performs operations following human-in-the-loop confirmation to ensure reliability. User studies demonstrate that SketchGPT significantly outperforms unimodal manipulation approaches, offering more intuitive and effective means to interact with LLMs.
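The interaction loop the abstract describes (temporally aligning sketch strokes with speech, inferring the combined intent, then executing only after human-in-the-loop confirmation) can be sketched roughly as follows. This is a minimal illustrative outline, not the paper's implementation: all class names, the nearest-in-time alignment heuristic, and the string-based "intent" are assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical input events; field names are illustrative, not from the paper.
@dataclass
class SketchStroke:
    t: float      # timestamp in seconds
    label: str    # recognized gesture, e.g. "circle", "arrow"

@dataclass
class SpeechSegment:
    t: float
    text: str

def align(strokes, segments, window=2.0):
    """Pair each stroke with the nearest speech segment within `window`
    seconds -- a toy stand-in for the paper's temporal alignment step."""
    pairs = []
    for s in strokes:
        nearby = [g for g in segments if abs(g.t - s.t) <= window]
        if nearby:
            best = min(nearby, key=lambda g: abs(g.t - s.t))
            pairs.append((s, best))
    return pairs

def infer_intent(pair):
    """Toy intent agent: fuse the spoken command with the sketched target."""
    stroke, speech = pair
    return f"{speech.text} (target indicated by {stroke.label})"

def run_pipeline(strokes, segments, confirm=lambda intent: True):
    """Infer intents from aligned multimodal input and execute only those
    the user confirms (the human-in-the-loop gate)."""
    executed = []
    for pair in align(strokes, segments):
        intent = infer_intent(pair)
        if confirm(intent):
            executed.append(intent)
    return executed
```

For example, a circle drawn at t=1.0s paired with the utterance "delete this" at t=1.5s would yield one confirmed intent; in the real system the intent agent would be an LLM call and `confirm` a UI prompt rather than a callback.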

Item Type: Conference or Workshop Item (Paper)
Date Type: Published Online
Status: Published
Schools: Schools > Computer Science & Informatics
Publisher: Association for Computing Machinery
ISBN: 9798400720376
Date of First Compliant Deposit: 27 September 2025
Date of Acceptance: 24 July 2025
Last Modified: 30 Sep 2025 15:00
URI: https://orca.cardiff.ac.uk/id/eprint/181362
