Cardiff University | Prifysgol Caerdydd ORCA
Online Research @ Cardiff 

Performance of large language models for CAD-RADS 2.0 classification derived from cardiac CT reports

Arnold, Philipp Georg, Russe, Maximilian Frederik, Bamberg, Fabian, Emrich, Tilman, Vecsey-Nagy, Milán, Ashi, Ayaat, Kravchenko, Dmitrij, Varga-Szemes, Ákos, Soschynski, Martin, Rau, Alexander, Kotter, Elmar and Hagar, Muhammad Taha 2025. Performance of large language models for CAD-RADS 2.0 classification derived from cardiac CT reports. Journal of Cardiovascular Computed Tomography. doi:10.1016/j.jcct.2025.03.007

PDF - Accepted Post-Print Version
Available under License Creative Commons Attribution.

License URL: http://creativecommons.org/licenses/by/4.0/
License Start date: 29 March 2025

Abstract

Background: The Coronary Artery Disease-Reporting and Data System (CAD-RADS) 2.0 offers standardized guidelines for interpreting coronary artery disease in cardiac CT. Accurate and consistent CAD-RADS 2.0 scoring is crucial for comprehensive disease characterization and clinical decision-making. This study investigates the capability of large language models (LLMs) to autonomously generate CAD-RADS 2.0 scores from cardiac CT reports.

Methods: A dataset of cardiac CT reports was created to evaluate the performance of several state-of-the-art LLMs in generating CAD-RADS 2.0 scores via in-context learning. The tested models comprised GPT-3.5, GPT-4o, Mistral 7b, Mixtral 8×7b, Llama3 8b, Llama3 8b with a 64k context length, and Llama3 70b. The generated scores from each model were compared to the ground truth, which was provided by two board-certified cardiothoracic radiologists in consensus based on the reports.

Results: The final set comprised 200 cardiac CT reports. GPT-4o and Llama3 70b achieved the highest accuracy in generating full CAD-RADS 2.0 scores including all modifiers with a performance rate of 93% and 92.5%, respectively, followed by Mixtral 8×7b with 78%. In contrast, older LLMs, such as Mistral 7b and GPT-3.5 performed poorly (16%) and Llama3 8b demonstrated intermediate results with an accuracy of 41.5%.

Conclusion: LLMs enhanced with in-context learning are capable of autonomously generating CAD-RADS 2.0 scores for cardiac CT reports with excellent accuracy, potentially enhancing both the efficiency and consistency of cardiac CT reporting. Open-source models not only deliver competitive accuracy but also present the benefit of local hosting, mitigating concerns around data security.
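
The accuracy figures in the abstract can be read as exact-match rates: a report counts as correct only when the full generated score, including all modifiers, equals the consensus score. The following is a minimal sketch of that arithmetic, not the authors' evaluation code. The function name `exact_match_accuracy` and the four toy score strings are hypothetical and do not come from the study's dataset; the implied counts assume every model was scored on all 200 reports.

```python
def exact_match_accuracy(predicted, reference):
    """Fraction of reports whose full predicted score equals the reference score."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("at least one report is required")
    # Compare whole score strings, so a missing modifier counts as an error.
    matches = sum(
        p.strip().upper() == r.strip().upper()
        for p, r in zip(predicted, reference)
    )
    return matches / len(reference)


# Hypothetical toy labels, for illustration only (not taken from the study).
reference = ["CAD-RADS 2/P1", "CAD-RADS 4A/P3/HRP", "CAD-RADS 0", "CAD-RADS 3/P2/S"]
predicted = ["CAD-RADS 2/P1", "CAD-RADS 4A/P3", "CAD-RADS 0", "CAD-RADS 3/P2/S"]
accuracy = exact_match_accuracy(predicted, reference)

# Accuracies as reported in the abstract; the counts below are derived under
# the assumption that each model was evaluated on all 200 reports.
N_REPORTS = 200
reported_accuracy = {
    "GPT-4o": 0.93,
    "Llama3 70b": 0.925,
    "Mixtral 8x7b": 0.78,
    "Llama3 8b": 0.415,
}
implied_correct = {
    model: round(acc * N_REPORTS) for model, acc in reported_accuracy.items()
}

print(f"toy accuracy: {accuracy:.2f}")
for model, count in implied_correct.items():
    print(f"{model}: {count} of {N_REPORTS} reports")
```

In the toy example the second prediction drops one modifier, so it is scored as wrong even though the stenosis category and plaque burden agree. That strictness is what "full CAD-RADS 2.0 scores including all modifiers" implies.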

Item Type: Article
Date Type: Published Online
Status: In Press
Schools: Schools > Medicine
Additional Information: License information from Publisher: LICENSE 1: URL: http://creativecommons.org/licenses/by/4.0/, Start Date: 2025-03-29
Publisher: Elsevier
ISSN: 1934-5925
Date of First Compliant Deposit: 16 April 2025
Date of Acceptance: 28 March 2025
Last Modified: 16 Apr 2025 09:30
URI: https://orca.cardiff.ac.uk/id/eprint/177730
