npj Digital Medicine (IF 12.4) | Pub Date: 2024-11-18 | DOI: 10.1038/s41746-024-01315-1
Eyal Klang, Donald Apakama, Ethan E. Abbott, Akhil Vaid, Joshua Lampert, Ankit Sakhuja, Robert Freeman, Alexander W. Charney, David Reich, Monica Kraft, Girish N. Nadkarni, Benjamin S. Glicksberg
Large language models (LLMs) can optimize clinical workflows; however, the economic and computational challenges of their utilization at the health system scale are underexplored. We evaluated how concatenating queries with multiple clinical notes and tasks into a single prompt affects model performance under increasing computational loads. We assessed ten LLMs of different capacities and sizes using real-world patient data. We conducted >300,000 experiments across various task sizes and configurations, measuring question-answering accuracy and the ability to properly format outputs. Performance deteriorated as the number of questions and notes increased. High-capacity models, such as Llama-3-70b, had low failure rates and high accuracy. GPT-4-turbo-128k was similarly resilient across task burdens, but its performance deteriorated beyond 50 tasks at large prompt sizes. After addressing mitigable failures, these two models could effectively handle up to 50 concatenated tasks, with validation on a public medical question-answering dataset. An economic analysis demonstrated up to a 17-fold cost reduction at 50 tasks using concatenation. These results identify the limits of LLMs for effective utilization and highlight avenues for cost-efficiency at the enterprise scale.
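The cost savings from concatenation come from amortizing the shared context (e.g., a long clinical note) across many tasks instead of re-sending it with every query. A minimal sketch of this arithmetic, with illustrative token counts that are assumptions and not figures from the paper:

```python
# Illustrative sketch (token counts are hypothetical, not from the study):
# estimating the input-token savings from concatenating many
# question-answering tasks that share one clinical note into one prompt.

def input_tokens(note_tokens: int, question_tokens: int,
                 n_tasks: int, concatenated: bool) -> int:
    """Approximate billed input tokens for n_tasks questions on one note."""
    if concatenated:
        # One prompt: the note is sent once, all questions appended.
        return note_tokens + n_tasks * question_tokens
    # Separate prompts: the note is re-sent with every single question.
    return n_tasks * (note_tokens + question_tokens)

def cost_ratio(note_tokens: int, question_tokens: int, n_tasks: int) -> float:
    """How many times cheaper the concatenated prompt is on input tokens."""
    separate = input_tokens(note_tokens, question_tokens, n_tasks, False)
    combined = input_tokens(note_tokens, question_tokens, n_tasks, True)
    return separate / combined

# With a 2,000-token note and 40-token questions, batching 50 tasks
# yields a 25.5x reduction in input tokens under these assumptions.
print(round(cost_ratio(2000, 40, 50), 1))  # → 25.5
```

As the number of tasks grows, the savings approach the ratio of note length to question length, which is why the benefit is largest when a long note dominates the prompt; the study's reported figure (up to 17-fold at 50 tasks) reflects its own note and task sizes.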
Strategies for cost-effective use of large language models at health system scale