Evaluating the Performance and Bias of Natural Language Processing Tools in Labeling Chest Radiograph Reports.
Radiology (IF 12.1) Pub Date: 2024-10-01, DOI: 10.1148/radiol.232746
Samantha M Santomartino, John R Zech, Kent Hall, Jean Jeudy, Vishwa Parekh, Paul H Yi

Background Natural language processing (NLP) is commonly used to annotate radiology datasets for training deep learning (DL) models. However, the accuracy and potential biases of these NLP methods have not been thoroughly investigated, particularly across different demographic groups.

Purpose To evaluate the accuracy and demographic bias of four NLP radiology report labeling tools on two chest radiograph datasets.

Materials and Methods This retrospective study, performed between April 2022 and April 2024, evaluated chest radiograph report labeling using four NLP tools (CheXpert [rule-based], RadReportAnnotator [RRA; DL-based], OpenAI's GPT-4 [DL-based], cTAKES [hybrid]) on a subset of the Medical Information Mart for Intensive Care (MIMIC) chest radiograph dataset balanced for representation of age, sex, and race and ethnicity (n = 692) and the entire Indiana University (IU) chest radiograph dataset (n = 3665). Three board-certified radiologists annotated the chest radiograph reports for 14 thoracic disease labels. NLP tool performance was evaluated using several metrics, including accuracy and error rate. Bias was evaluated by comparing performance between demographic subgroups using the Pearson χ2 test.

Results The IU dataset included 3665 patients (mean age, 49.7 years ± 17 [SD]; 1963 female), while the MIMIC dataset included 692 patients (mean age, 54.1 years ± 23.1; 357 female). All four NLP tools demonstrated high accuracy across findings in the IU and MIMIC datasets, as follows: CheXpert (92.6% [47 516 of 51 310], 90.2% [8742 of 9688]), RRA (82.9% [19 746 of 23 829], 92.2% [2870 of 3114]), GPT-4 (94.3% [45 586 of 48 342], 91.6% [6721 of 7336]), and cTAKES (84.7% [43 436 of 51 310], 88.7% [8597 of 9688]). RRA and cTAKES had higher accuracy (P < .001) on the MIMIC dataset, while CheXpert and GPT-4 had higher accuracy on the IU dataset.
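The accuracy figures above count agreement between each tool's output and the radiologist consensus over every (report, finding) cell. A minimal sketch of that aggregate-accuracy computation, with illustrative label names and toy data (not the study's datasets or its actual evaluation code):

```python
# Hedged sketch: aggregate accuracy of an NLP labeler against
# radiologist-annotated ground truth. Finding names and counts here
# are illustrative placeholders, not the paper's data.

def labeling_accuracy(predicted, ground_truth):
    """Fraction of (report, finding) cells where the NLP label matches
    the radiologist consensus label."""
    assert len(predicted) == len(ground_truth)
    total = correct = 0
    for pred_row, true_row in zip(predicted, ground_truth):
        for finding, true_label in true_row.items():
            total += 1
            if pred_row.get(finding) == true_label:
                correct += 1
    return correct / total

# Toy example: two reports, three of the 14 thoracic findings.
truth = [
    {"cardiomegaly": 1, "edema": 0, "pneumonia": 0},
    {"cardiomegaly": 0, "edema": 1, "pneumonia": 0},
]
pred = [
    {"cardiomegaly": 1, "edema": 0, "pneumonia": 1},  # one mislabel
    {"cardiomegaly": 0, "edema": 1, "pneumonia": 0},
]
print(f"accuracy = {labeling_accuracy(pred, truth):.3f}")  # 5 of 6 cells
```

Error rate, reported alongside accuracy in the study, is simply the complement (1 − accuracy) under this cell-level counting.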
Differences (P < .001) in error rates were observed across age groups for all NLP tools except RRA on the MIMIC dataset, with the highest error rates for CheXpert, RRA, and cTAKES in patients older than 80 years (mean, 15.8% ± 5.0) and the highest error rate for GPT-4 in patients 60-80 years of age (8.3%).

Conclusion Although commonly used NLP tools for chest radiograph report annotation are accurate when evaluating reports in aggregate, demographic subanalyses showed significant bias, with poorer performance in older patients. © RSNA, 2024. Supplemental material is available for this article. See also the editorial by Cai in this issue.
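The bias analysis compares error counts across demographic subgroups with a Pearson χ2 test of independence. A small self-contained sketch of that test on an illustrative age-group contingency table (the counts are invented for demonstration, not taken from the paper):

```python
# Hedged sketch: Pearson chi-square test of independence between age
# group and labeling error, the subgroup comparison the study applies.
# Contingency counts below are illustrative, not the paper's data.

def pearson_chi2(table):
    """Chi-square statistic for an r x c contingency table of counts:
    sum over cells of (observed - expected)^2 / expected."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed - expected) ** 2 / expected
    return stat

# Rows: age groups (<60, 60-80, >80 y); columns: (errors, correct labels).
contingency = [
    [40, 960],   # <60 y
    [83, 917],   # 60-80 y
    [158, 842],  # >80 y
]
chi2 = pearson_chi2(contingency)
# With df = (3-1)*(2-1) = 2, compare chi2 against the 0.05 critical
# value of about 5.99; larger values indicate error rates differ by age.
print(f"chi2 = {chi2:.1f}")
```

In practice one would use a library routine such as SciPy's `scipy.stats.chi2_contingency`, which also returns the P value; the hand-rolled version above just makes the arithmetic of the test explicit.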

Updated: 2024-10-01