Hidden flaws behind expert-level accuracy of multimodal GPT-4 vision in medicine, npj Digital Medicine

Hidden flaws behind expert-level accuracy of multimodal GPT-4 vision in medicine
npj Digital Medicine (IF 12.4), Pub Date: 2024-07-23, DOI: 10.1038/s41746-024-01185-7
Qiao Jin ₁, Fangyuan Chen ₂, Yiliang Zhou ₃, Ziyang Xu ₄, Justin M Cheung ₅, Robert Chen ₆, Ronald M Summers ₇, Justin F Rousseau ₈, Peiyun Ni ₉, Marc J Landsman ₁₀, ₁₁, Sally L Baxter ₁₂, Subhi J Al'Aref ₁₃, Yijia Li ₁₄, Alexander Chen ₁₅, Josef A Brejt ₁₅, Michael F Chiang ₁₆, Yifan Peng ₃, Zhiyong Lu ₁

Affiliations

1. National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.
2. University of Pittsburgh, Pittsburgh, PA, USA.
3. Department of Population Health Sciences, Weill Cornell Medicine, New York, NY, USA.
4. Ronald O. Perelman Department of Dermatology, New York University Grossman School of Medicine, New York City, NY, USA.
5. Department of Medicine, Harvard Medical School and Massachusetts General Hospital, Boston, MA, USA.
6. Pathology & Laboratory Medicine, Weill Cornell Medicine, New York, NY, USA.
7. Imaging Biomarkers and Computer-Aided Diagnosis Laboratory, Department of Radiology and Imaging Sciences, National Institutes of Health Clinical Center, Bethesda, MD, USA.
8. Department of Neurology, Peter O'Donnell Jr. Brain Institute, UT Southwestern Medical Center, Dallas, TX, USA.
9. Division of Gastroenterology, Department of Medicine, Harvard Medical School and Massachusetts General Hospital, Boston, MA, USA.
10. Division of Gastroenterology, Department of Medicine, Metrohealth Medical Center, Cleveland, OH, USA.
11. Case Western Reserve University School of Medicine, Cleveland, OH, USA.
12. Division of Ophthalmology Informatics and Data Science, Viterbi Family Department of Ophthalmology and Shiley Eye Institute, University of California San Diego, La Jolla, CA, USA.
13. Division of Cardiology, Department of Internal Medicine, University of Arkansas for Medical Sciences, Little Rock, AR, USA.
14. University of Pittsburgh Medical Center, Pittsburgh, PA, USA.
15. Department of Internal Medicine, Weill Cornell Medicine, New York, NY, USA.
16. National Eye Institute, National Institutes of Health, Bethesda, MD, USA.

Recent studies indicate that Generative Pre-trained Transformer 4 with Vision (GPT-4V) outperforms human physicians in medical challenge tasks. However, these evaluations focused primarily on multiple-choice accuracy alone. Our study extends this scope with a comprehensive analysis of GPT-4V's rationales in image comprehension, recall of medical knowledge, and step-by-step multimodal reasoning when solving New England Journal of Medicine (NEJM) Image Challenges, an imaging quiz designed to test the knowledge and diagnostic capabilities of medical professionals. Evaluation results confirmed that GPT-4V performs comparably to human physicians in multiple-choice accuracy (81.6% vs. 77.8%). GPT-4V also performs well in cases where physicians answer incorrectly, with over 78% accuracy. However, we discovered that GPT-4V frequently presents flawed rationales in cases where it makes the correct final choice (35.5%), most prominently in image comprehension (27.2%). Despite GPT-4V's high accuracy on multiple-choice questions, our findings emphasize the need for further in-depth evaluation of its rationales before such multimodal AI models are integrated into clinical workflows.
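The abstract reports two different kinds of figures: accuracy over all questions, and the share of flawed rationales among only those questions answered correctly. The sketch below illustrates that distinction in Python. It is not the authors' evaluation code; the `Case` record, its field names, and the toy data are hypothetical, and the toy numbers are not taken from the study.

```python
from dataclasses import dataclass


@dataclass
class Case:
    """One graded quiz question (hypothetical record layout)."""
    answer_correct: bool   # final multiple-choice answer matched the key
    image_ok: bool         # image comprehension rationale judged sound
    knowledge_ok: bool     # medical knowledge recall judged sound
    reasoning_ok: bool     # step-by-step reasoning judged sound


def accuracy(cases):
    """Fraction of all cases with a correct final answer."""
    return sum(c.answer_correct for c in cases) / len(cases)


def flawed_rate_among_correct(cases):
    """Fraction of correctly answered cases with at least one flawed rationale."""
    correct = [c for c in cases if c.answer_correct]
    if not correct:
        return 0.0
    flawed = [
        c for c in correct
        if not (c.image_ok and c.knowledge_ok and c.reasoning_ok)
    ]
    return len(flawed) / len(correct)


# Toy data for illustration only (not study results).
toy_cases = [
    Case(True, True, True, True),
    Case(True, False, True, True),   # right answer, flawed image reading
    Case(True, True, True, True),
    Case(True, True, True, True),
    Case(False, False, True, False),
]

print(f"accuracy: {accuracy(toy_cases):.2f}")
print(f"flawed among correct: {flawed_rate_among_correct(toy_cases):.2f}")
```

The second metric uses the correctly answered cases as its denominator, so a model can score high on accuracy while a sizeable share of its correct answers still rests on a faulty rationale.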


Updated: 2024-07-23
