Evaluating the effectiveness of large language models in patient education for conjunctivitis, British Journal of Ophthalmology


Evaluating the effectiveness of large language models in patient education for conjunctivitis
British Journal of Ophthalmology (IF 3.7), Pub Date: 2025-02-01, DOI: 10.1136/bjo-2024-325599
Jingyuan Wang ₁, Runhan Shi ₁, Qihua Le ₁, Kun Shan ₁, Zhi Chen ₁, Xujiao Zhou ₁, Yao He ₂, Jiaxu Hong ₃,₄,₅,₆


Aims: To evaluate the quality of responses from large language models (LLMs) to patient-generated conjunctivitis questions.

Methods: A two-phase, cross-sectional study was conducted at the Eye and ENT Hospital of Fudan University. In phase 1, four LLMs (GPT-4, Qwen, Baichuan 2 and PaLM 2) responded to 22 frequently asked conjunctivitis questions. Six expert ophthalmologists assessed these responses using a 5-point Likert scale for correctness, completeness, readability, helpfulness and safety, supplemented by objective readability analysis. Phase 2 involved 30 conjunctivitis patients who interacted with GPT-4 or Qwen, evaluating the LLM-generated responses based on satisfaction, humanisation, professionalism and the same dimensions except for correctness from phase 1. Three ophthalmologists assessed responses using phase 1 criteria, allowing for a comparative analysis between medical and patient evaluations, probing the study’s practical significance.

Results: In phase 1, GPT-4 excelled across all metrics, particularly in correctness (4.39±0.76), completeness (4.31±0.96) and readability (4.65±0.59) while Qwen showed similarly strong performance in helpfulness (4.37±0.93) and safety (4.25±1.03). Baichuan 2 and PaLM 2 were effective but trailed behind GPT-4 and Qwen. The objective readability analysis revealed GPT-4’s responses as the most detailed, with PaLM 2’s being the most succinct. Phase 2 demonstrated GPT-4 and Qwen’s robust performance, with high satisfaction levels and consistent evaluations from both patients and professionals.

Conclusions: Our study showed LLMs effectively improve patient education in conjunctivitis. These models showed considerable promise in real-world patient interactions. Despite encouraging results, further refinement, particularly in personalisation and handling complex inquiries, is essential prior to the clinical integration of these LLMs.
All data relevant to the study are included in the article or uploaded as online supplemental information.


Updated: 2025-01-28
