Cardiff University | Prifysgol Caerdydd ORCA
Online Research @ Cardiff 

Assessing ChatGPT 4.0's capabilities in the United Kingdom Medical Licensing Examination (UKMLA): A robust categorical analysis

Casals-Farre, Octavi, Baskaran, Ravanth, Singh, Aditya, Kaur, Harmeena, Ul Hoque, Tazim, de Almeida, Andreia ORCID: https://orcid.org/0000-0002-6889-1503, Coffey, Marcus and Hassoulas, Athanasios ORCID: https://orcid.org/0000-0002-1029-1847 2025. Assessing ChatGPT 4.0's capabilities in the United Kingdom Medical Licensing Examination (UKMLA): A robust categorical analysis. Scientific Reports 15 (1) , 13031. 10.1038/s41598-025-97327-2

PDF - Published Version
Available under License Creative Commons Attribution.

Abstract

Advances in the various applications of artificial intelligence will have important implications for medical training and practice. The advances in ChatGPT-4, alongside the introduction of the medical licensing assessment (MLA), provide an opportunity to compare GPT-4's medical competence against the expected level of a United Kingdom junior doctor and to discuss its potential in clinical practice. Using 191 freely available questions in MLA style, we assessed GPT-4's accuracy with and without the multiple-choice options. We compared single-step and multi-step questions, which targeted different points in the clinical process, from diagnosis to management. A chi-squared test was used to assess statistical significance. GPT-4 scored 86.3% and 89.6% in papers one and two respectively. Without the multiple-choice options, GPT-4's performance was 61.5% and 74.7% in papers one and two respectively. There was no significant difference between single-step and multi-step questions, but GPT-4 answered 'management' questions significantly worse than 'diagnosis' questions when no multiple-choice options were provided (p = 0.015). GPT-4's accuracy across categories and question structures suggests that LLMs can competently process clinical scenarios but do not genuinely understand them. Large language models incorporated into practice alongside a trained practitioner may balance risk and benefit while the robust testing that these evolving tools require is conducted. [Abstract copyright: © 2025. The Author(s).]
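As an illustration of the chi-squared comparison described above, the sketch below runs a Pearson chi-squared test (df = 1) on a 2×2 contingency table of correct versus incorrect answers for two question categories. The counts used here are hypothetical, not the paper's actual per-category figures, which are not reproduced on this page.

```python
import math

def chi_squared_2x2(correct_a, total_a, correct_b, total_b):
    """Pearson chi-squared test comparing two proportions (df = 1).

    Returns (statistic, p_value) for a 2x2 contingency table of
    correct vs. incorrect counts in two question categories.
    """
    table = [
        [correct_a, total_a - correct_a],
        [correct_b, total_b - correct_b],
    ]
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (table[i][j] - expected) ** 2 / expected

    # For 1 degree of freedom: P(X > stat) = erfc(sqrt(stat / 2))
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical counts: 50/80 correct ('management') vs 65/80 ('diagnosis')
stat, p = chi_squared_2x2(50, 80, 65, 80)
print(f"chi2 = {stat:.3f}, p = {p:.4f}")
```

With identical proportions the statistic is zero and p = 1; as the two accuracy rates diverge, the statistic grows and p falls below the usual 0.05 threshold, which is the form of comparison the abstract reports for the management-versus-diagnosis result.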

Item Type: Article
Date Type: Published Online
Status: Published
Schools: Medicine
Publisher: Nature Research
Date of First Compliant Deposit: 7 May 2025
Date of Acceptance: 3 April 2025
Last Modified: 07 May 2025 09:52
URI: https://orca.cardiff.ac.uk/id/eprint/178111
