Unveiling the clinical incapabilities: a benchmarking study of GPT-4V(ision) for ophthalmic multimodal image analysis
British Journal of Ophthalmology (IF 3.7) | Pub Date: 2024-10-01 | DOI: 10.1136/bjo-2023-325054
Pusheng Xu, Xiaolan Chen, Ziwei Zhao, Danli Shi

Purpose To evaluate the capabilities and incapabilities of a GPT-4V(ision)-based chatbot in interpreting ocular multimodal images.

Methods We developed a digital ophthalmologist app using GPT-4V and evaluated its performance on a dataset of 60 images covering 60 ophthalmic conditions and 6 modalities: slit-lamp, scanning laser ophthalmoscopy, fundus photography of the posterior pole (FPP), optical coherence tomography, fundus fluorescein angiography and ocular ultrasound images. The chatbot was tested with ten open-ended questions per image, covering examination identification, lesion detection, diagnosis and decision support. The responses were manually assessed for accuracy, usability, safety and diagnosis repeatability. Auto-evaluation was performed using sentence similarity and GPT-4-based auto-evaluation.

Results Out of 600 responses, 30.6% were accurate, 21.5% were highly usable and 55.6% were deemed to pose no harm. GPT-4V performed best with slit-lamp images, with 42.0%, 38.5% and 68.5% of responses being accurate, highly usable and harmless, respectively. Its performance was weaker on FPP images, with only 13.7%, 3.7% and 38.5% in the same categories. GPT-4V correctly identified 95.6% of the imaging modalities and showed varying accuracy in lesion identification (25.6%), diagnosis (16.1%) and decision support (24.0%). The overall repeatability of GPT-4V in diagnosing ocular images was 63.3% (38/60). The overall sentence similarity between GPT-4V responses and human answers was 55.5%, with Spearman correlations of 0.569 for accuracy and 0.576 for usability.

Conclusion GPT-4V is not yet suitable for clinical decision-making in ophthalmology. Our study serves as a benchmark for enhancing ophthalmic multimodal models.

Data are available in a public, open access repository. The OphthalVQA dataset is freely available at .
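The auto-evaluation described in the methods pairs embedding-based sentence similarity (GPT-4V response vs. human reference answer) with a correlation check against manual ratings. Below is a minimal sketch of that idea; the embedding model (all-MiniLM-L6-v2), the toy responses/references and the rating scale are illustrative assumptions, not the authors' exact configuration.

```python
# Sketch: sentence similarity between model responses and reference answers,
# plus the Spearman correlation of those scores with manual ratings.
# Model choice and example data are assumptions for illustration only.
from sentence_transformers import SentenceTransformer, util
from scipy.stats import spearmanr

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def sentence_similarity(response: str, reference: str) -> float:
    """Cosine similarity between the embeddings of a response and a reference."""
    emb = model.encode([response, reference], convert_to_tensor=True)
    return float(util.cos_sim(emb[0], emb[1]))

# Hypothetical responses, reference answers and manual accuracy ratings.
responses = [
    "This is a slit-lamp photograph showing corneal oedema.",
    "The OCT scan shows a full-thickness macular hole.",
    "Fundus photograph with scattered drusen at the posterior pole.",
]
references = [
    "Slit-lamp image demonstrating corneal oedema.",
    "OCT image showing a macular hole.",
    "Posterior-pole fundus photograph with drusen.",
]
manual_ratings = [2, 2, 1]  # assumed ordinal accuracy scale

auto_scores = [sentence_similarity(r, ref) for r, ref in zip(responses, references)]
rho, p = spearmanr(auto_scores, manual_ratings)
print(f"Mean sentence similarity: {sum(auto_scores) / len(auto_scores):.3f}")
print(f"Spearman correlation with manual ratings: {rho:.3f} (p={p:.3f})")
```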

Updated: 2024-09-20