Vision–language foundation model for echocardiogram interpretation
Nature Medicine (IF 82.9) | Pub Date: 2024-04-30 | DOI: 10.1038/s41591-024-02959-y
Matthew Christensen, Milos Vukadinovic, Neal Yuan, David Ouyang

The development of robust artificial intelligence models for echocardiography has been limited by the availability of annotated clinical data. Here, to address this challenge and improve the performance of cardiac imaging models, we developed EchoCLIP, a vision–language foundation model for echocardiography, that learns the relationship between cardiac ultrasound images and the interpretations of expert cardiologists across a wide range of patients and indications for imaging. After training on 1,032,975 cardiac ultrasound videos and corresponding expert text, EchoCLIP performs well on a diverse range of benchmarks for cardiac image interpretation, despite not having been explicitly trained for individual interpretation tasks. EchoCLIP can assess cardiac function (mean absolute error of 7.1% when predicting left ventricular ejection fraction in an external validation dataset) and identify implanted intracardiac devices (area under the curve (AUC) of 0.84, 0.92 and 0.97 for pacemakers, percutaneous mitral valve repair and artificial aortic valves, respectively). We also developed a long-context variant (EchoCLIP-R) using a custom tokenizer based on common echocardiography concepts. EchoCLIP-R accurately identified unique patients across multiple videos (AUC of 0.86), identified clinical transitions such as heart transplants (AUC of 0.79) and cardiac surgery (AUC 0.77) and enabled robust image-to-text search (mean cross-modal retrieval rank in the top 1% of candidate text reports). These capabilities represent a substantial step toward understanding and applying foundation models in cardiovascular imaging for preliminary interpretation of echocardiographic findings.
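The abstract gives no implementation detail, but the core mechanism it describes is CLIP-style contrastive alignment between cardiac ultrasound embeddings and report-text embeddings, which also underlies the zero-shot interpretation and image-to-text retrieval results. The PyTorch fragment below is a minimal sketch of that idea under stated assumptions: the encoder architectures, embedding dimension, and all function and class names (EchoClipSketch, clip_loss, retrieve_reports) are illustrative placeholders, not the authors' released code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class EchoClipSketch(nn.Module):
    """Toy CLIP-style model: align echo-frame embeddings with report-text embeddings."""
    def __init__(self, image_encoder: nn.Module, text_encoder: nn.Module):
        super().__init__()
        self.image_encoder = image_encoder   # placeholder vision backbone over echo frames
        self.text_encoder = text_encoder     # placeholder text encoder over tokenized reports
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # learnable temperature, log(1/0.07)

    def forward(self, frames: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # Embed each modality and L2-normalize so the dot product is a cosine similarity.
        img = F.normalize(self.image_encoder(frames), dim=-1)
        txt = F.normalize(self.text_encoder(tokens), dim=-1)
        # (batch, batch) similarity matrix between every image and every report in the batch.
        return self.logit_scale.exp() * img @ txt.t()

def clip_loss(logits: torch.Tensor) -> torch.Tensor:
    # Symmetric cross-entropy: each image should rank its own report highest, and vice versa.
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

@torch.no_grad()
def retrieve_reports(model: EchoClipSketch, frames: torch.Tensor,
                     candidate_tokens: torch.Tensor) -> torch.Tensor:
    # Zero-shot cross-modal retrieval: rank candidate report texts by cosine similarity
    # to a single echo embedding; the best-ranked report is the retrieved interpretation.
    img = F.normalize(model.image_encoder(frames), dim=-1)           # (1, d)
    txt = F.normalize(model.text_encoder(candidate_tokens), dim=-1)  # (N, d)
    scores = (img @ txt.t()).squeeze(0)                              # (N,)
    return scores.argsort(descending=True)                           # report indices, best first

The same retrieval readout supports the reported zero-shot style tasks: scoring candidate prompts that differ only in a quantity (for example, ejection fraction) or in the presence of a device yields a prediction without task-specific training.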




Updated: 2024-04-30