Cardiff University | Prifysgol Caerdydd ORCA
Online Research @ Cardiff 

Intra- and inter-observer reliability of ChatGPT-4o in thyroid nodule ultrasound feature analysis based on ACR TI-RADS: an image-based study

Chen, Ziman, Chambara, Nonhlanhla (ORCID: https://orcid.org/0000-0002-3183-883X), Liu, Shirley Yuk Wah, Chow, Tom Chi Man, Lai, Carol Man Sze and Ying, Michael Tin Cheung 2025. Intra- and inter-observer reliability of ChatGPT-4o in thyroid nodule ultrasound feature analysis based on ACR TI-RADS: an image-based study. Diagnostics 15 (20), 2617. 10.3390/diagnostics15202617

PDF - Published Version
Available under License Creative Commons Attribution.

License URL: https://creativecommons.org/licenses/by/4.0/
License Start date: 17 October 2025

Abstract

Background/Objectives: Advances in large language models such as ChatGPT-4o have extended their use to medical image analysis. Accurate assessment of thyroid nodule ultrasound features using ACR TI-RADS is crucial for clinical practice. This study aims to evaluate ChatGPT-4o’s intra-observer consistency and its agreement with an expert when analyzing these features from ultrasound images. Methods: This cross-sectional study used ultrasound images from 100 thyroid nodules collected prospectively between May 2019 and August 2021. ChatGPT-4o analyzed the images following ACR TI-RADS guidelines to assess five thyroid nodule features: composition, echogenicity, shape, margin, and echogenic foci. The analysis was repeated after one week to evaluate intra-observer reliability. The same images were also analyzed by an ultrasound expert to evaluate inter-observer reliability. Agreement was measured using Cohen’s Kappa coefficient, and concordance rates were calculated based on alignment with the expert’s reference classifications. Results: Intra-observer agreement for ChatGPT-4o was moderate for composition (Kappa = 0.449) and echogenic foci (Kappa = 0.404), and substantial for echogenicity (Kappa = 0.795). Agreement was notably low for shape (Kappa = −0.051) and margin (Kappa = 0.154). Inter-observer agreement between ChatGPT-4o and the expert was generally low, with Kappa values ranging from −0.006 to 0.238, the highest being for echogenic foci. Overall concordance rates between ChatGPT-4o and expert evaluations ranged from 46.6% to 48.2%; by feature, concordance was highest for shape (65%) and lowest for echogenicity (29%). Conclusions: ChatGPT-4o showed favorable consistency in assessing some thyroid nodule features in intra-observer analysis, but notable variability in others. Inter-observer comparisons with expert evaluations revealed generally low agreement across all features, despite acceptable concordance for certain imaging characteristics. While promising for specific ultrasound features, ChatGPT-4o’s consistency and accuracy still differ considerably from expert assessments.
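
The two agreement measures named in the Methods can be illustrated with a minimal sketch. This is not the authors' analysis code: the function names `cohens_kappa` and `concordance_rate` and the toy labels are assumptions made for illustration only. Cohen's Kappa is computed as (p_o − p_e) / (1 − p_e), where p_o is the observed proportion of matching ratings and p_e is the agreement expected by chance from each rater's category frequencies. The concordance rate is taken here to be the plain proportion of ratings that match the reference.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's Kappa for two equal-length sequences of categorical ratings."""
    if len(ratings_a) != len(ratings_b) or len(ratings_a) == 0:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)

    # Observed agreement: fraction of items given the same category by both raters.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Chance agreement: sum over categories of the product of marginal proportions.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used one identical category throughout; Kappa is undefined.
        raise ValueError("Kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1.0 - expected)


def concordance_rate(ratings, reference):
    """Proportion of ratings that match the reference classification."""
    if len(ratings) != len(reference) or len(ratings) == 0:
        raise ValueError("ratings must be non-empty and of equal length")
    return sum(r == ref for r, ref in zip(ratings, reference)) / len(ratings)


# Hypothetical composition labels for four nodules (not data from the study).
model_labels = ["solid", "solid", "cystic", "mixed"]
expert_labels = ["solid", "cystic", "cystic", "mixed"]

kappa = cohens_kappa(model_labels, expert_labels)
rate = concordance_rate(model_labels, expert_labels)
print(f"Kappa = {kappa:.3f}, concordance = {rate:.0%}")
```

Because Kappa subtracts chance agreement, it can be low or even negative while the raw concordance rate looks acceptable, which is consistent with the pattern reported for shape (65% concordance alongside low Kappa).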

Item Type: Article
Date Type: Published Online
Status: Published
Schools: Schools > Healthcare Sciences
Publisher: MDPI
Date of First Compliant Deposit: 29 October 2025
Date of Acceptance: 15 October 2025
Last Modified: 30 Oct 2025 14:48
URI: https://orca.cardiff.ac.uk/id/eprint/181966
