Cardiff University | Prifysgol Caerdydd ORCA
Online Research @ Cardiff 

Adaptive spatiotemporal graph transformer network for action quality assessment

Liu, Jiang, Wang, Huasheng, Zhou, Wei, Stawarz, Katarzyna (ORCID: https://orcid.org/0000-0001-9021-0615), Corcoran, Padraig (ORCID: https://orcid.org/0000-0001-9731-3385), Chen, Ying and Liu, Hantao (ORCID: https://orcid.org/0000-0003-4544-3481) 2025. Adaptive spatiotemporal graph transformer network for action quality assessment. IEEE Transactions on Circuits and Systems for Video Technology. DOI: 10.1109/TCSVT.2025.3541456

PDF (Accepted Post-Print Version) - Download (5MB)

Abstract

Long video action quality assessment (AQA) aims to evaluate the performance of long-term actions depicted in a video and produce an overall assessment of action quality. A video of long-term actions often contains more complex temporal and spatial information than one of short-term actions. However, existing approaches that segment a video into individual clips for independent analysis can disrupt the narrative flow and diminish contextual detail within and across clips, impeding comprehensive video understanding. To address this challenge, we propose an adaptive spatiotemporal graph transformer network (ASGTN) that combines multiple graph structures with transformer attention mechanisms to capture both local and global contextual information within and across clips in a long video. Specifically, the adaptive spatiotemporal graph (ASG) combines a spatial graph branch, designed to enrich the nuanced local spatiotemporal relations within an individual clip, with a temporal graph branch, tailored to dynamically learn the semantic context across different clips. Furthermore, a transformer encoder is integrated to amplify the global dependencies across clips in the entire video. This structure is designed to preserve narrative coherence and maintain essential contextual details in the video-level features. Finally, we employ a level-focused decoder to predict the action quality score distribution. Experiments demonstrate that our model achieves state-of-the-art results on popular AQA datasets. Our code is available at https://github.com/jiangliu5/ASGTN AQA.
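
To make the pipeline described in the abstract concrete, the following is a minimal PyTorch sketch of an ASGTN-style model, based only on the abstract above. All module names, dimensions, the softmax-normalised adaptive adjacency, and the distribution head are illustrative assumptions, not the authors' implementation; consult the paper and repository for the actual method.

    # Minimal ASGTN-style sketch (assumed structure, not the authors' code):
    # a spatial graph branch within clips, a temporal graph branch across
    # clips, a transformer encoder for global context, and a head that
    # predicts a score distribution.
    import torch
    import torch.nn as nn


    class AdaptiveGraphBranch(nn.Module):
        """GCN-style layer with a learned (adaptive) adjacency over N nodes."""

        def __init__(self, num_nodes: int, dim: int):
            super().__init__()
            # Learnable adjacency logits, softmax-normalised at run time
            # (assumed mechanism for the "adaptive" graphs in the abstract).
            self.adj_logits = nn.Parameter(torch.zeros(num_nodes, num_nodes))
            self.proj = nn.Linear(dim, dim)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, num_nodes, dim)
            adj = torch.softmax(self.adj_logits, dim=-1)
            return torch.relu(self.proj(torch.einsum("ij,bjd->bid", adj, x)))


    class ASGTNSketch(nn.Module):
        def __init__(self, num_clips=8, tokens_per_clip=16, dim=256, num_bins=100):
            super().__init__()
            # Spatial branch: relations among tokens *within* each clip.
            self.spatial = AdaptiveGraphBranch(tokens_per_clip, dim)
            # Temporal branch: semantic context *across* clips.
            self.temporal = AdaptiveGraphBranch(num_clips, dim)
            # Transformer encoder for global cross-clip dependencies.
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=2)
            # "Level-focused" decoder approximated here as a distribution head.
            self.head = nn.Linear(dim, num_bins)

        def forward(self, feats: torch.Tensor) -> torch.Tensor:
            # feats: (batch, num_clips, tokens_per_clip, dim) backbone features
            b, c, t, d = feats.shape
            x = self.spatial(feats.reshape(b * c, t, d)).mean(dim=1)  # pool tokens
            x = self.temporal(x.reshape(b, c, d))                     # clip graph
            x = self.encoder(x).mean(dim=1)                           # global context
            return torch.softmax(self.head(x), dim=-1)  # score distribution


    if __name__ == "__main__":
        model = ASGTNSketch()
        clips = torch.randn(2, 8, 16, 256)  # dummy backbone features
        print(model(clips).shape)           # torch.Size([2, 100])

In this sketch each branch learns its adjacency end to end, loosely mirroring the adaptive spatial and temporal graphs the abstract describes; the paper's graph construction and level-focused decoder may differ substantially.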

Item Type: Article
Date Type: Published Online
Status: In Press
Schools: Computer Science & Informatics
Publisher: Institute of Electrical and Electronics Engineers
ISSN: 1051-8215
Date of First Compliant Deposit: 13 February 2025
Date of Acceptance: 7 February 2025
Last Modified: 20 Feb 2025 15:00
URI: https://orca.cardiff.ac.uk/id/eprint/176178

