Autonomous medical evaluation for guideline adherence of large language models
npj Digital Medicine (IF 12.4) Pub Date: 2024-12-12, DOI: 10.1038/s41746-024-01356-6
Dennis Fast, Lisa C. Adams, Felix Busch, Conor Fallon, Marc Huppertz, Robert Siepmann, Philipp Prucker, Nadine Bayerl, Daniel Truhn, Marcus Makowski, Alexander Löser, Keno K. Bressem

Autonomous Medical Evaluation for Guideline Adherence (AMEGA) is a comprehensive benchmark designed to evaluate large language models’ adherence to medical guidelines across 20 diagnostic scenarios spanning 13 specialties. It includes an evaluation framework and methodology to assess models’ capabilities in medical reasoning, differential diagnosis, treatment planning, and guideline adherence, using open-ended questions that mirror real-world clinical interactions. The benchmark comprises 135 questions and 1337 weighted scoring elements designed to assess comprehensive medical knowledge. In tests of 17 LLMs, GPT-4 scored highest with 41.9/50, followed closely by Llama-3 70B and WizardLM-2-8x22B. For comparison, a recent medical graduate scored 25.8/50. The benchmark introduces novel content to avoid the issue of LLMs memorizing existing medical data. AMEGA’s publicly available code supports further research in AI-assisted clinical decision-making, aiming to enhance patient care by aiding clinicians in diagnosis and treatment under time constraints.




Updated: 2024-12-12