Huang, Zeyuan; Gao, Cangjun; Shan, Yaxian; Hu, Haoxiang; Li, Qingkun; Deng, Xiaoming; Ma, Cuixia; Lai, Yukun
PDF (Published Version), available under License Creative Commons Attribution. Download (24MB)
Abstract
Human interaction with large language models (LLMs) is typically confined to text or image interfaces. Sketches offer a powerful medium for articulating creative ideas and user intentions, yet their potential remains underexplored. We propose SketchGPT, a novel interaction paradigm that integrates sketch and speech input directly over the system interface, facilitating open-ended, context-aware communication with LLMs. By leveraging the complementary strengths of multimodal inputs, user expressions gain semantic richness without sacrificing efficiency. Interpreting user intentions across diverse contexts and modalities remains a key challenge. To address this, we developed a prototype based on a multi-agent framework that infers user intentions within context and generates executable, context-sensitive, toolkit-aware feedback. Using Chain-of-Thought techniques for temporal and semantic alignment, the system understands multimodal intentions and performs operations following human-in-the-loop confirmation to ensure reliability. User studies demonstrate that SketchGPT significantly outperforms unimodal manipulation approaches, offering a more intuitive and effective means of interacting with LLMs.
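The abstract outlines a pipeline: sketch strokes and speech are aligned in time, an agent infers a candidate intent, and execution happens only after human-in-the-loop confirmation. The following minimal Python sketch illustrates that flow under stated assumptions; it is not the paper's implementation. All names here (Stroke, SpeechSegment, infer_intent, and so on) are hypothetical, and the Chain-of-Thought intent agent is replaced by a simple rule-based stand-in for illustration.

```python
# Minimal sketch of the interaction flow described in the abstract.
# Assumptions: strokes and speech segments carry timestamps, overlap in
# time signals that they belong to the same multimodal command, and a
# confirmation prompt gates execution (the human-in-the-loop step).

from dataclasses import dataclass


@dataclass
class Stroke:
    t_start: float  # seconds, start of the pen stroke
    t_end: float
    region: str     # UI region the stroke covers, e.g. "paragraph-2"


@dataclass
class SpeechSegment:
    t_start: float
    t_end: float
    text: str       # transcribed speech


def overlaps(a_start, a_end, b_start, b_end):
    """True if the two time intervals intersect (temporal alignment cue)."""
    return a_start < b_end and b_start < a_end


def align(strokes, segments):
    """Pair each speech segment with the strokes that overlap it in time."""
    pairs = []
    for seg in segments:
        hits = [s for s in strokes
                if overlaps(s.t_start, s.t_end, seg.t_start, seg.t_end)]
        pairs.append((seg, hits))
    return pairs


def infer_intent(pairs):
    """Rule-based stand-in for the intent-inference agent: turn aligned
    multimodal input into human-readable candidate commands."""
    commands = []
    for seg, hits in pairs:
        targets = ", ".join(s.region for s in hits) or "(no sketch target)"
        commands.append(f"{seg.text.strip()} -> targets: {targets}")
    return commands


def confirm_and_execute(commands):
    """Human-in-the-loop gate: each command runs only after confirmation."""
    for cmd in commands:
        answer = input(f"Execute '{cmd}'? [y/N] ")
        if answer.lower().startswith("y"):
            print(f"executing: {cmd}")  # a real system would call a toolkit API
        else:
            print(f"skipped: {cmd}")


if __name__ == "__main__":
    strokes = [Stroke(1.0, 2.5, "paragraph-2")]
    speech = [SpeechSegment(1.2, 3.0, "summarize this")]
    confirm_and_execute(infer_intent(align(strokes, speech)))
```

The confirmation gate reflects the reliability argument in the abstract: because inferred intents can be wrong, the system proposes an interpretation and executes it only once the user approves.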
| Item Type: | Conference or Workshop Item (Paper) |
| --- | --- |
| Date Type: | Published Online |
| Status: | Published |
| Schools: | Schools > Computer Science & Informatics |
| Publisher: | Association for Computing Machinery |
| ISBN: | 9798400720376 |
| Date of First Compliant Deposit: | 27 September 2025 |
| Date of Acceptance: | 24 July 2025 |
| Last Modified: | 30 Sep 2025 15:00 |
| URI: | https://orca.cardiff.ac.uk/id/eprint/181362 |