Closing the gap between open-source and commercial large language models for medical evidence summarization
npj Digital Medicine (IF 12.4) Pub Date: 2024-09-09, DOI: 10.1038/s41746-024-01239-w
Gongbo Zhang 1, Qiao Jin 2, Yiliang Zhou 3, Song Wang 4, Betina Idnay 1, Yiming Luo 5, Elizabeth Park 5, Jordan G Nestor 5, Matthew E Spotnitz 6, Ali Soroush 7, 8, 9, Thomas R Campion 3, 10, Zhiyong Lu 2, Chunhua Weng 1, Yifan Peng 3, 10
Large language models (LLMs) hold great promise for summarizing medical evidence. Most recent studies focus on proprietary LLMs, which introduce multiple risks, including a lack of transparency and vendor dependency. Open-source LLMs allow better transparency and customization, but their performance falls short of proprietary models. In this study, we investigated to what extent fine-tuning open-source LLMs can close that gap. Using MedReview, a benchmark dataset of 8161 pairs of systematic reviews and summaries, we fine-tuned three widely used open-source LLMs: PRIMERA, LongT5, and Llama-2. The performance of all three models improved after fine-tuning, and fine-tuned LongT5 approached the performance of GPT-3.5 under zero-shot settings. Moreover, smaller fine-tuned models sometimes outperformed larger zero-shot models. These improvements were observed in both a human evaluation and a larger-scale GPT-4-simulated evaluation.
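For readers who want a concrete picture of this kind of fine-tuning setup, the sketch below fine-tunes a public LongT5 checkpoint for long-document summarization with Hugging Face Transformers. This is a minimal illustration, not the authors' pipeline: the checkpoint name, the MedReview file paths and field names ("review", "summary"), the sequence lengths, and the hyperparameters are all assumptions chosen for the example.

```python
# Minimal fine-tuning sketch (illustrative; not the paper's exact configuration).
# Assumed data layout: one JSON object per line with "review" (full systematic
# review text) and "summary" (target summary) fields.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
)

checkpoint = "google/long-t5-tglobal-base"  # public LongT5 checkpoint on the Hub
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Hypothetical file names for the MedReview splits.
data = load_dataset(
    "json",
    data_files={"train": "medreview_train.jsonl", "validation": "medreview_val.jsonl"},
)

def preprocess(batch):
    # Tokenize long review inputs and the target summaries.
    model_inputs = tokenizer(batch["review"], max_length=4096, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=512, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = data.map(preprocess, batched=True, remove_columns=data["train"].column_names)

args = Seq2SeqTrainingArguments(
    output_dir="longt5-medreview",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=3e-5,
    num_train_epochs=3,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
# Evaluation (e.g., ROUGE on the validation split, human review, or a
# GPT-4-simulated assessment as in the paper) would be run separately
# on summaries generated from the fine-tuned model.
```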