Evaluating Artificial Intelligence-Driven Responses to Acute Liver Failure Queries: A Comparative Analysis Across Accuracy, Clarity, and Relevance.
The American Journal of Gastroenterology (IF 8.0) Pub Date: 2024-12-17, DOI: 10.14309/ajg.0000000000003255
Sheza Malik, Lewis J Frey, Jason Gutman, Asim Mushtaq, Fatima Warraich, Kamran Qureshi

INTRODUCTION: Recent advancements in Artificial Intelligence (AI), particularly through the deployment of Large Language Models (LLMs), have profoundly impacted healthcare. This study assesses five LLMs (ChatGPT 3.5, ChatGPT 4, BARD, CLAUDE, and COPILOT) on the accuracy, clarity, and relevance of their responses to queries concerning acute liver failure (ALF). We subsequently compare these results with ChatGPT 4 enhanced with Retrieval-Augmented Generation (RAG) technology.

METHODS: Based on real-world clinical use and the American College of Gastroenterology guidelines, we formulated 16 ALF questions or clinical scenarios to explore the LLMs' ability to handle different clinical questions. Using the "New Chat" functionality, each query was processed individually across the models to reduce any bias. Additionally, we employed the RAG functionality of GPT-4, which integrates external sources as references to ground the results. All responses were rated for accuracy, clarity, and relevance on a Likert scale from 1 to 5 by four independent investigators to ensure impartiality.

RESULTS: ChatGPT 4 augmented with RAG demonstrated superior performance, consistently scoring the highest across all three domains (4.70, 4.89, and 4.78 in accuracy, clarity, and relevance, respectively). ChatGPT 4 exhibited notable proficiency, with scores of 3.67 in accuracy, 4.04 in clarity, and 4.01 in relevance. In contrast, CLAUDE achieved 3.65 in accuracy, 3.04 in clarity, and 3.6 in relevance. BARD and COPILOT exhibited lower performance levels: BARD recorded 2.01 in accuracy and 3.03 in relevance, while COPILOT obtained 2.26 in accuracy and 3.12 in relevance.

CONCLUSION: The study highlights the superior performance of ChatGPT 4 + RAG compared with the other LLMs. By integrating RAG with an LLM, the system combines generative language skills with accurate, up-to-date information, which improves the clarity, relevance, and accuracy of responses and makes them more effective in healthcare. However, AI models must continually evolve and align with medical practice for successful healthcare integration.
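The RAG setup described in the methods follows a standard pattern: retrieve the guideline passages most relevant to a query, then prepend them to the prompt so the model's answer is grounded in those sources. The sketch below is a minimal illustration of that pattern, not the study's implementation; the bag-of-words retriever, the example parameters, and the call_llm stand-in for the GPT-4 chat API are all assumptions for illustration.

```python
# Minimal RAG sketch: rank guideline passages against the query,
# then ground the prompt in the top matches before calling the model.
import math
from collections import Counter

def tokenize(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity over bag-of-words term counts.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, passages: list[str], k: int = 2) -> list[str]:
    q = tokenize(query)
    return sorted(passages, key=lambda p: cosine(q, tokenize(p)), reverse=True)[:k]

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for the actual chat-completion call (e.g., GPT-4);
    # wire up whichever client the deployment uses.
    raise NotImplementedError("connect a real LLM client here")

def rag_answer(query: str, passages: list[str]) -> str:
    context = "\n\n".join(retrieve(query, passages))
    prompt = (
        "Answer the clinical question using only the guideline excerpts below.\n\n"
        f"{context}\n\nQuestion: {query}"
    )
    return call_llm(prompt)
```

Grounding the prompt in retrieved guideline text, rather than relying on the model's parametric memory alone, is what the abstract credits for the accuracy gain of ChatGPT 4 + RAG over the un-augmented models.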

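The reported numbers are domain-wise means of the four investigators' 1-to-5 Likert ratings. A small sketch of that aggregation, assuming ratings are kept per model and domain; the scores below are made-up placeholders, not the study's data:

```python
# Hedged sketch: average Likert ratings per model and domain.
from statistics import mean

# ratings[model][domain] = one 1-5 score per rating event (illustrative values)
ratings = {
    "ChatGPT 4 + RAG": {"accuracy": [5, 5, 4, 5], "clarity": [5, 5, 5, 4], "relevance": [5, 4, 5, 5]},
    "ChatGPT 4":       {"accuracy": [4, 3, 4, 4], "clarity": [4, 4, 4, 4], "relevance": [4, 4, 4, 4]},
}

for model, domains in ratings.items():
    summary = ", ".join(f"{d}: {mean(s):.2f}" for d, s in domains.items())
    print(f"{model} -> {summary}")
```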
Updated: 2024-12-17