Evaluating the effectiveness of large language models in patient education for conjunctivitis
British Journal of Ophthalmology (IF 3.7), Pub Date: 2024-08-30, DOI: 10.1136/bjo-2024-325599
Jingyuan Wang 1, Runhan Shi 1, Qihua Le 1, Kun Shan 1, Zhi Chen 1, Xujiao Zhou 1, Yao He 2, Jiaxu Hong 3, 4, 5, 6

Aims
To evaluate the quality of responses from large language models (LLMs) to patient-generated conjunctivitis questions.

Methods
A two-phase, cross-sectional study was conducted at the Eye and ENT Hospital of Fudan University. In phase 1, four LLMs (GPT-4, Qwen, Baichuan 2 and PaLM 2) responded to 22 frequently asked conjunctivitis questions. Six expert ophthalmologists assessed these responses using a 5-point Likert scale for correctness, completeness, readability, helpfulness and safety, supplemented by objective readability analysis. Phase 2 involved 30 conjunctivitis patients who interacted with GPT-4 or Qwen, evaluating the LLM-generated responses based on satisfaction, humanisation, professionalism and the same dimensions except for correctness from phase 1. Three ophthalmologists assessed responses using phase 1 criteria, allowing for a comparative analysis between medical and patient evaluations, probing the study’s practical significance.

Results
In phase 1, GPT-4 excelled across all metrics, particularly in correctness (4.39±0.76), completeness (4.31±0.96) and readability (4.65±0.59), while Qwen showed similarly strong performance in helpfulness (4.37±0.93) and safety (4.25±1.03). Baichuan 2 and PaLM 2 were effective but trailed behind GPT-4 and Qwen. The objective readability analysis revealed GPT-4’s responses as the most detailed, with PaLM 2’s being the most succinct. Phase 2 demonstrated GPT-4 and Qwen’s robust performance, with high satisfaction levels and consistent evaluations from both patients and professionals.

Conclusions
Our study showed LLMs effectively improve patient education in conjunctivitis. These models showed considerable promise in real-world patient interactions. Despite encouraging results, further refinement, particularly in personalisation and handling complex inquiries, is essential prior to the clinical integration of these LLMs.

All data relevant to the study are included in the article or uploaded as online supplemental information.

Updated: 2024-08-31