Evaluating large language models on medical, lay-language, and self-reported descriptions of genetic conditions
American Journal of Human Genetics (IF 8.1). Pub Date: 2024-08-14. DOI: 10.1016/j.ajhg.2024.07.011
Kendall A. Flaharty, Ping Hu, Suzanna Ledgister Hanchard, Molly E. Ripper, Dat Duong, Rebekah L. Waikel, Benjamin D. Solomon

Large language models (LLMs) are generating interest in medical settings. For example, LLMs can respond coherently to medical queries by providing plausible differential diagnoses based on clinical notes. However, many questions remain to be explored, such as differences between open- and closed-source LLMs and LLM performance on queries from both medical and non-medical users. In this study, we assessed multiple LLMs, including Llama-2-chat, Vicuna, Medllama2, Bard/Gemini, Claude, ChatGPT-3.5, and ChatGPT-4, as well as non-LLM approaches (Google search and Phenomizer), on their ability to identify genetic conditions from textbook-like clinician questions and corresponding layperson translations covering 63 genetic conditions. Among open-source LLMs, larger models were more accurate than smaller ones: models with 7b, 13b, and more than 33b parameters achieved accuracies of 21%–49%, 41%–51%, and 54%–68%, respectively. Closed-source LLMs outperformed open-source LLMs, with ChatGPT-4 performing best (89%–90%). Three of 11 LLMs and Google search showed significant performance gaps between clinician and layperson prompts. We also evaluated how in-context prompting and keyword removal affected open-source LLM performance. Models were provided with 2 types of in-context prompts: list-type prompts, which improved LLM performance, and definition-type prompts, which did not. We further analyzed removal of rare terms from descriptions, which decreased accuracy for 5 of 7 evaluated LLMs. Finally, we observed much lower performance with real individuals' descriptions; LLMs answered these questions with a maximum 21% accuracy.
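The evaluation described above amounts to prompting each model with a case description and scoring whether the target condition appears in its answer. The sketch below illustrates that kind of accuracy computation; the model responses, condition names, and the substring-match scoring rule are all illustrative assumptions, not the study's actual data or protocol.

```python
# Hypothetical sketch of an LLM diagnosis-accuracy evaluation loop.
# A response counts as correct if it names the target genetic condition;
# the study's real scoring procedure may be more sophisticated.

def is_correct(response: str, condition: str) -> bool:
    """Case-insensitive substring match (an illustrative scoring rule)."""
    return condition.lower() in response.lower()

def accuracy(responses: list[str], gold: list[str]) -> float:
    """Fraction of responses that name the corresponding gold condition."""
    hits = sum(is_correct(r, g) for r, g in zip(responses, gold))
    return hits / len(gold)

# Invented toy cases, standing in for clinician- vs. layperson-style prompts.
gold = ["Marfan syndrome", "Noonan syndrome"]
clinician_responses = [
    "The differential includes Marfan syndrome and homocystinuria.",
    "Findings suggest Noonan syndrome.",
]
layperson_responses = [
    "This could be Marfan syndrome.",
    "Possibly a connective tissue disorder.",
]

print(accuracy(clinician_responses, gold))  # 1.0
print(accuracy(layperson_responses, gold))  # 0.5
```

Comparing the two accuracies per model is how a clinician-vs-layperson performance gap, as reported above, would surface in such a harness.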

Updated: 2024-08-14