Testing the Ability and Limitations of ChatGPT to Generate Differential Diagnoses from Transcribed Radiologic Findings (Radiology)


Radiology (IF 12.1), Pub Date: 2024-10-01, DOI: 10.1148/radiol.232346
Shawn H Sun, Kenneth Huynh, Gillean Cortes, Robert Hill, Julia Tran, Leslie Yeh, Amanda L Ngo, Roozbeh Houshyar, Vahid Yaghmai, Mark Tran


Background: The burgeoning interest in ChatGPT as a potentially useful tool in medicine highlights the necessity for systematic evaluation of its capabilities and limitations.

Purpose: To evaluate the accuracy, reliability, and repeatability of differential diagnoses produced by ChatGPT from transcribed radiologic findings.

Materials and Methods: Cases selected from a radiology textbook series spanning a variety of imaging modalities, subspecialties, and anatomic pathologies were converted into standardized prompts that were entered into ChatGPT (GPT-3.5 and GPT-4 algorithms; April 3 to June 1, 2023). Responses were analyzed for accuracy via comparison with the final diagnosis and top 3 differential diagnoses provided in the textbook, which served as the ground truth. Reliability, defined based on the frequency of algorithmic hallucination, was assessed through the identification of factually incorrect statements and fabricated references. Comparisons were made between the algorithms using the McNemar test and a generalized estimating equation model framework. Test-retest repeatability was measured by obtaining 10 independent responses from both algorithms for 10 cases in each subspecialty, and calculating the average pairwise percent agreement and Krippendorff α.

Results: A total of 339 cases were collected across multiple radiologic subspecialties. The overall accuracy of GPT-3.5 and GPT-4 for final diagnosis was 53.7% (182 of 339) and 66.1% (224 of 339; P < .001), respectively. The mean differential score (ie, proportion of top 3 diagnoses that matched the original literature differential diagnosis) for GPT-3.5 and GPT-4 was 0.50 and 0.54 (P = .06), respectively. Of the references provided in GPT-3.5 and GPT-4 responses, 39.9% (401 of 1006) and 14.3% (161 of 1124; P < .001), respectively, were fabricated. GPT-3.5 and GPT-4 generated false statements in 16.2% (55 of 339) and 4.7% (16 of 339; P < .001) of cases, respectively. The range of average pairwise percent agreement across subspecialties for the final diagnosis and top 3 differential diagnoses was 59%-98% and 23%-49%, respectively.

Conclusion: ChatGPT achieved the best results when the most up-to-date model (GPT-4) was used and when it was prompted for a single diagnosis. Hallucination frequency was lower with GPT-4 than with GPT-3.5, but repeatability was an issue for both models.

© RSNA, 2024. Supplemental material is available for this article. See also the editorial by Chang in this issue.
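As an illustration of the repeatability metric named in the abstract, the following is a minimal Python sketch of average pairwise percent agreement, together with a check of the percentages reported in the Results. It is not the authors' code. The function names, the placeholder diagnosis labels `"A"` and `"B"`, and the use of exact string equality to decide whether two responses agree are all assumptions made for illustration; the abstract does not say how agreement between responses was judged.

```python
from itertools import combinations


def pairwise_percent_agreement(responses):
    """Fraction of all response pairs for one case that give the same answer.

    Assumption: two responses "agree" when their labels are exactly equal.
    """
    pairs = list(combinations(responses, 2))
    if not pairs:
        raise ValueError("need at least two responses")
    return sum(a == b for a, b in pairs) / len(pairs)


def average_pairwise_agreement(cases):
    """Mean of the per-case pairwise agreement over a list of cases."""
    return sum(pairwise_percent_agreement(r) for r in cases) / len(cases)


def percent(numerator, denominator):
    """Percentage rounded to one decimal place, as in the abstract."""
    return round(100 * numerator / denominator, 1)


# Hypothetical data: 10 repeated responses for each of two cases.
case_1 = ["A"] * 8 + ["B"] * 2   # 29 of 45 pairs agree
case_2 = ["A"] * 10              # all 45 pairs agree

print(average_pairwise_agreement([case_1, case_2]))

# Counts taken from the Results section of the abstract.
print(percent(182, 339), percent(224, 339))    # final-diagnosis accuracy
print(percent(401, 1006), percent(161, 1124))  # fabricated references
print(percent(55, 339), percent(16, 339))      # cases with false statements
```

With 10 responses per case there are 45 pairs, so a single dissenting response already lowers the per-case agreement to 36 of 45 (80%). Krippendorff α, which the study also reports, additionally corrects for agreement expected by chance and is not computed here.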

Updated: 2024-10-01
