Evaluation and mitigation of cognitive biases in medical language models
npj Digital Medicine ( IF 12.4 ) Pub Date : 2024-10-21 , DOI: 10.1038/s41746-024-01283-6
Samuel Schmidgall, Carl Harris, Ime Essien, Daniel Olshvang, Tawsifur Rahman, Ji Woong Kim, Rojin Ziaei, Jason Eshraghian, Peter Abadir, Rama Chellappa

Increasing interest in applying large language models (LLMs) to medicine is due in part to their impressive performance on medical exam questions. However, these exams do not capture the complexity of real patient–doctor interactions because of factors like patient compliance, experience, and cognitive bias. We hypothesized that LLMs would produce less accurate responses when faced with clinically biased questions as compared to unbiased ones. To test this, we developed the BiasMedQA dataset, which consists of 1273 USMLE questions modified to replicate common clinically relevant cognitive biases. We assessed six LLMs on BiasMedQA and found that GPT-4 stood out for its resilience to bias, in contrast to Llama 2 70B-chat and PMC Llama 13B, which showed large drops in performance. Additionally, we introduced three bias mitigation strategies, which improved but did not fully restore accuracy. Our findings highlight the need to improve LLMs’ robustness to cognitive biases, in order to achieve more reliable applications of LLMs in healthcare.
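The evaluation protocol described above — taking a USMLE-style question, injecting a clinically relevant bias cue, and comparing accuracy against the unbiased baseline — can be sketched as follows. This is a minimal illustration, not the authors' code: `inject_bias`, `toy_model`, and the sample vignette are all hypothetical stand-ins for the BiasMedQA pipeline.

```python
# Hedged sketch of a BiasMedQA-style evaluation: measure how much accuracy
# drops when a cognitive-bias cue is appended to each exam question.
# All names and data here are illustrative, not from the paper's codebase.

def inject_bias(question: str, bias_cue: str) -> str:
    """Append a bias cue (e.g., an authority suggestion) to a vignette."""
    return f"{question}\n\nNote: {bias_cue}"

def accuracy(model, questions, answers) -> float:
    """Fraction of questions the model answers correctly."""
    correct = sum(model(q) == a for q, a in zip(questions, answers))
    return correct / len(questions)

# Toy stand-in for an LLM: it is swayed toward option A whenever the
# prompt contains a colleague's suggestion, mimicking susceptibility
# to an authority (bandwagon-style) bias.
def toy_model(prompt: str) -> str:
    return "A" if "colleague suggests" in prompt else "B"

questions = ["A 65-year-old presents with chest pain. Best next step?"]
answers = ["B"]
biased = [inject_bias(q, "A senior colleague suggests option A.")
          for q in questions]

base_acc = accuracy(toy_model, questions, answers)   # unbiased baseline
biased_acc = accuracy(toy_model, biased, answers)    # with bias cue
print(base_acc, biased_acc)
```

The gap between `base_acc` and `biased_acc` is the robustness measure of interest; in the paper's results, GPT-4 shows a small gap while Llama 2 70B-chat and PMC Llama 13B show large ones.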




Updated: 2024-10-22