BI-RADS Category Assignments by GPT-3.5, GPT-4, and Google Bard: A Multilanguage Study
Radiology (IF 12.1), Pub Date: 2024-04-30, DOI: 10.1148/radiol.232133
Andrea Cozzi, Katja Pinker, Andri Hidber, Tianyu Zhang, Luca Bonomo, Roberto Lo Gullo, Blake Christianson, Marco Curti, Stefania Rizzo, Filippo Del Grande, Ritse M Mann, Simone Schiaffino


Background

The performance of publicly available large language models (LLMs) remains unclear for complex clinical tasks.


Purpose

To evaluate the agreement between human readers and LLMs for Breast Imaging Reporting and Data System (BI-RADS) categories assigned based on breast imaging reports written in three languages and to assess the impact of discordant category assignments on clinical management.

Materials and Methods

This retrospective study included reports for women who underwent MRI, mammography, and/or US for breast cancer screening or diagnostic purposes at three referral centers. Reports with findings categorized as BI-RADS 1–5 and written in Italian, English, or Dutch were collected between January 2000 and October 2023. Board-certified breast radiologists and the LLMs GPT-3.5 and GPT-4 (OpenAI) and Bard, now called Gemini (Google), assigned BI-RADS categories using only the findings described by the original radiologists. Agreement between human readers and LLMs for BI-RADS categories was assessed using the Gwet agreement coefficient (AC1 value). Frequencies were calculated for changes in BI-RADS category assignments that would affect clinical management (ie, BI-RADS 0 vs BI-RADS 1 or 2 vs BI-RADS 3 vs BI-RADS 4 or 5) and compared using the McNemar test.
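
The two statistics named above can be illustrated with a minimal Python sketch. It is not the authors' analysis code. It assumes the standard two-rater, unweighted (nominal) form of Gwet's AC1 and an exact binomial form of the McNemar test; the abstract does not say which variants were used. The function names and the toy category lists are hypothetical.

```python
from collections import Counter
from math import comb


def gwet_ac1(rater_a, rater_b, categories):
    """Gwet's AC1 for two raters and nominal categories (assumed variant)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of reports")
    if len(categories) < 2:
        raise ValueError("at least two categories are required")
    n = len(rater_a)
    # Observed agreement: share of reports with identical categories.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: sum of pi_q * (1 - pi_q) over categories, divided by
    # (Q - 1), where pi_q is the mean share of category q across both raters.
    chance = 0.0
    for category in categories:
        share = (counts_a[category] + counts_b[category]) / (2 * n)
        chance += share * (1 - share)
    chance /= len(categories) - 1
    return (observed - chance) / (1 - chance)


def management_group(birads):
    """Map a BI-RADS category to the management groups used in the abstract:
    0 | 1 or 2 | 3 | 4 or 5."""
    if birads == 0:
        return 0
    if birads in (1, 2):
        return 1
    if birads == 3:
        return 2
    if birads in (4, 5):
        return 3
    raise ValueError(f"unexpected BI-RADS category: {birads}")


def mcnemar_exact_p(b, c):
    """Two-sided exact McNemar P value from the two discordant-pair counts."""
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy data (hypothetical, not from the study).
original = [1, 2, 3, 4, 5, 2]
model = [1, 2, 4, 4, 5, 1]
changed = sum(
    management_group(o) != management_group(m) for o, m in zip(original, model)
)
print(gwet_ac1(original, model, categories=[1, 2, 3, 4, 5]), changed)
```

In the toy data, the 3-to-4 change crosses a management boundary, whereas the 2-to-1 change does not, so only one report counts as a management-relevant discordance.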


Results

Across 2400 reports, agreement between the original and reviewing radiologists was almost perfect (AC1 = 0.91), while agreement between the original radiologists and GPT-4, GPT-3.5, and Bard was moderate (AC1 = 0.52, 0.48, and 0.42, respectively). Across human readers and LLMs, differences were observed in the frequency of BI-RADS category upgrades or downgrades that would result in changed clinical management (118 of 2400 [4.9%] for human readers, 611 of 2400 [25.5%] for Bard, 573 of 2400 [23.9%] for GPT-3.5, and 435 of 2400 [18.1%] for GPT-4; P < .001) and that would negatively impact clinical management (37 of 2400 [1.5%] for human readers, 435 of 2400 [18.1%] for Bard, 344 of 2400 [14.3%] for GPT-3.5, and 255 of 2400 [10.6%] for GPT-4; P < .001).
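
The reported percentages follow directly from the counts. The short Python sketch below recomputes them as a consistency check; the counts are those given above, while the variable and function names are illustrative only.

```python
TOTAL_REPORTS = 2400

# Counts reported in the abstract.
changed_management = {"Human readers": 118, "Bard": 611, "GPT-3.5": 573, "GPT-4": 435}
negative_impact = {"Human readers": 37, "Bard": 435, "GPT-3.5": 344, "GPT-4": 255}


def percent(count, total=TOTAL_REPORTS):
    """Percentage of reports, rounded to one decimal place."""
    return round(100 * count / total, 1)


for reader in changed_management:
    print(
        f"{reader}: changed management {percent(changed_management[reader])}%, "
        f"negative impact {percent(negative_impact[reader])}%"
    )
```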


Conclusion

LLMs achieved moderate agreement with human reader–assigned BI-RADS categories across reports written in three languages but also yielded a high percentage of discordant BI-RADS categories that would negatively impact clinical management.

© RSNA, 2024

Supplemental material is available for this article.



