Cardiff University | Prifysgol Caerdydd ORCA
Online Research @ Cardiff 

Validity and reliability of artificial intelligence chatbots as public sources of information on endodontics

Mohammad-Rahimi, Hossein, Ourang, Seyed AmirHossein, Pourhoseingholi, Mohamad Amin, Dianat, Omid, Dummer, Paul Michael Howell ORCID: https://orcid.org/0000-0002-0726-7467 and Nosrat, Ali 2024. Validity and reliability of artificial intelligence chatbots as public sources of information on endodontics. International Endodontic Journal 57 (3) , pp. 305-314. 10.1111/iej.14014

PDF (Accepted Post-Print Version): Manuscript-chatbots-Final76.pdf (482kB)

Abstract

Aim: This study aimed to evaluate and compare the validity and reliability of responses provided by GPT-3.5, Google Bard, and Bing to frequently asked questions (FAQs) in the field of endodontics.

Methodology: FAQs were formulated by expert endodontists (n = 10) and collected through GPT-3.5 queries (n = 10), with every question posed to each chatbot three times. Responses (N = 180) were independently evaluated by two board-certified endodontists using a modified Global Quality Score (GQS) on a 5-point Likert scale (5: strongly agree; 4: agree; 3: neutral; 2: disagree; 1: strongly disagree). Disagreements on scoring were resolved through evidence-based discussion. The validity of responses was analysed by categorizing scores as valid or invalid at two thresholds: the low threshold required a score of ≥4 for all three responses, whilst the high threshold required a score of 5 for all three responses. Fisher's exact test was conducted to compare the validity of responses between chatbots. Cronbach's alpha was calculated to assess reliability, based on the consistency of each chatbot's repeated responses.

Results: All three chatbots provided answers to all questions. Using the low-threshold validity test (GPT-3.5: 95%; Google Bard: 85%; Bing: 75%), there was no significant difference between the platforms (p > .05). Using the high-threshold validity test, the chatbot scores were substantially lower (GPT-3.5: 60%; Google Bard: 15%; Bing: 15%), and the validity of GPT-3.5 responses was significantly higher than that of Google Bard and Bing (p = .008). All three chatbots achieved an acceptable level of reliability (Cronbach's alpha > 0.7).

Conclusions: GPT-3.5 provided more credible information on topics related to endodontics than Google Bard and Bing.
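To illustrate the kind of analysis the abstract describes, the following Python sketch shows one plausible way to compute the validity thresholds, pairwise Fisher's exact tests, and Cronbach's alpha. This is not the authors' analysis code: the data values are hypothetical placeholders, the 20 x 3 score layout (20 FAQs, 3 repeated responses per chatbot) and the pairwise 2x2 comparisons are assumptions for illustration only.

import numpy as np
from scipy.stats import fisher_exact

# Hypothetical GQS ratings: scores[chatbot] is a 20 x 3 array
# (20 FAQs, 3 repeated responses, each rated 1-5).
rng = np.random.default_rng(0)
scores = {name: rng.integers(3, 6, size=(20, 3))
          for name in ("GPT-3.5", "Google Bard", "Bing")}

def n_valid(mat, threshold):
    """Count questions whose three repeated responses all scored >= threshold."""
    return int(np.all(mat >= threshold, axis=1).sum())

def cronbach_alpha(mat):
    """Cronbach's alpha, treating the three repetitions as items across questions."""
    k = mat.shape[1]
    item_var = mat.var(axis=0, ddof=1).sum()
    total_var = mat.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_var / total_var)

n_q = 20
pairs = [("GPT-3.5", "Google Bard"), ("GPT-3.5", "Bing"), ("Google Bard", "Bing")]
for a, b in pairs:
    va, vb = n_valid(scores[a], 5), n_valid(scores[b], 5)   # high threshold
    table = [[va, n_q - va], [vb, n_q - vb]]                 # 2x2 valid/invalid
    _, p = fisher_exact(table)
    print(f"{a} vs {b}: high-threshold validity, p = {p:.3f}")

for name, mat in scores.items():
    print(f"{name}: Cronbach's alpha = {cronbach_alpha(mat):.2f}")

With real ratings in place of the placeholder data, alpha values above 0.7 would correspond to the "acceptable reliability" criterion reported in the abstract.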

Item Type: Article
Date Type: Publication
Status: Published
Schools: Dentistry
Publisher: Wiley
ISSN: 0143-2885
Date of First Compliant Deposit: 20 December 2023
Date of Acceptance: 13 December 2023
Last Modified: 20 Dec 2024 02:45
URI: https://orca.cardiff.ac.uk/id/eprint/164990
