Large Language Model Ability to Translate CT and MRI Free-Text Radiology Reports Into Multiple Languages


Radiology (IF 12.1) Pub Date: 2024-12-01, DOI: 10.1148/radiol.241736
Aymen Meddeb, Sophia Lüken, Felix Busch, Lisa Adams, Lorenzo Ugga, Emmanouil Koltsakis, Antonios Tzortzakakis, Soumaya Jelassi, Insaf Dkhil, Michail E Klontzas, Matthaios Triantafyllou, Burak Kocak, Sabahattin Yüzkan, Longjiang Zhang, Bin Hu, Anna Andreychenko, Efimtcev Alexander Yurievich, Tatiana Logunova, Wipawee Morakote, Salita Angkurawaranon, Marcus R Makowski, Mike P Wattjes, Renato Cuocolo, Keno Bressem

Background: High-quality translations of radiology reports are essential for optimal patient care. Because of limited availability of human translators with medical expertise, large language models (LLMs) are a promising solution, but their ability to translate radiology reports remains largely unexplored.

Purpose: To evaluate the accuracy and quality of various LLMs in translating radiology reports across high-resource languages (English, Italian, French, German, and Chinese) and low-resource languages (Swedish, Turkish, Russian, Greek, and Thai).

Materials and Methods: A dataset of 100 synthetic free-text radiology reports from CT and MRI scans was translated by 18 radiologists between January 14 and May 2, 2024, into nine target languages. Ten LLMs, including GPT-4 (OpenAI), Llama 3 (Meta), and Mixtral models (Mistral AI), were used for automated translation. Translation accuracy and quality were assessed with use of BiLingual Evaluation Understudy (BLEU) score, translation error rate (TER), and CHaRacter-level F-score (chrF++) metrics. Statistical significance was evaluated with use of paired t tests with Holm-Bonferroni corrections. Radiologists also conducted a qualitative evaluation of translations with use of a standardized questionnaire.

Results: GPT-4 demonstrated the best overall translation quality, particularly from English to German (BLEU score: 35.0 ± 16.3 [SD]; TER: 61.7 ± 21.2; chrF++: 70.6 ± 9.4), to Greek (BLEU: 32.6 ± 10.1; TER: 52.4 ± 10.6; chrF++: 62.8 ± 6.4), to Thai (BLEU: 53.2 ± 7.3; TER: 74.3 ± 5.2; chrF++: 48.4 ± 6.6), and to Turkish (BLEU: 35.5 ± 6.6; TER: 52.7 ± 7.4; chrF++: 70.7 ± 3.7). GPT-3.5 showed highest accuracy in translations from English to French, and Qwen1.5 excelled in English-to-Chinese translations, whereas Mixtral 8x22B performed best in Italian-to-English translations. The qualitative evaluation revealed that LLMs excelled in clarity, readability, and consistency with the original meaning but showed moderate medical terminology accuracy.

Conclusion: LLMs showed high accuracy and quality for translating radiology reports, although results varied by model and language pair.

© RSNA, 2024. Supplemental material is available for this article.
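
The Methods mention paired t tests with Holm-Bonferroni corrections. The abstract does not say which software the authors used, so the following is only a minimal sketch of how that step-down correction adjusts a family of p values. The function name `holm_adjust` and the example p values are hypothetical and are not taken from the study.

```python
def holm_adjust(p_values):
    """Return Holm-Bonferroni adjusted p values, in the original order."""
    m = len(p_values)
    # Indices of the p values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p value (k = rank + 1) is multiplied by (m - k + 1),
        # capped at 1.0.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: an adjusted p value never drops below
        # the adjusted value of a smaller raw p value.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p values from four pairwise model comparisons.
example_p = [0.01, 0.04, 0.03, 0.005]
adjusted_p = holm_adjust(example_p)
significant = [p < 0.05 for p in adjusted_p]
```

With these made-up inputs, only the two smallest raw p values remain below 0.05 after adjustment, which illustrates why a comparison that looks significant in isolation may not survive correction across many model and language pairs.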


Updated: 2024-12-01
