Interpretable medical image Visual Question Answering via multi-modal relationship graph learning, Medical Image Analysis



Interpretable medical image Visual Question Answering via multi-modal relationship graph learning
Medical Image Analysis (IF 10.7), Pub Date: 2024-07-20, DOI: 10.1016/j.media.2024.103279
Xinyue Hu₁, Lin Gu₂, Kazuma Kobayashi₃, Liangchen Liu₄, Mengliang Zhang₁, Tatsuya Harada₂, Ronald M Summers₄, Yingying Zhu₁


Medical Visual Question Answering (VQA) is an important task for medical multi-modal Large Language Models (LLMs): it aims to answer clinically relevant questions about input medical images. This technique could improve the efficiency of medical professionals and relieve the burden on public health systems, particularly in resource-poor countries. However, existing medical VQA datasets are small and contain only simple questions (equivalent to classification tasks) that lack semantic reasoning and clinical knowledge. Our previous work proposed a clinical knowledge-driven image difference VQA benchmark using a rule-based approach (Hu et al., 2023). However, given the same breadth of information coverage, the rule-based approach shows an 85% error rate on extracted labels. We trained an LLM-based method that extracts labels with a 62% increase in accuracy. Two clinical experts also evaluated our labels on 100 samples, which helped us fine-tune the LLM. Using the trained LLM, we built a large-scale medical VQA dataset, Medical-CXR-VQA, focused on chest X-ray images. The questions cover detailed information such as abnormalities, locations, levels, and types. Based on this dataset, we proposed a novel VQA method that constructs three relationship graphs (spatial, semantic, and implicit) over the image regions, questions, and semantic labels. We used graph attention to learn the logical reasoning paths for different questions. These learned graph VQA reasoning paths can also be used for LLM prompt engineering and chain-of-thought, which are crucial for further fine-tuning and training of multi-modal large language models. Moreover, we demonstrate that our approach has the qualities of evidence and faithfulness, which are crucial in the clinical field. The code and the dataset are available at https://github.com/Holipori/Medical-CXR-VQA.
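To make the idea of graph attention over a relationship graph more concrete, here is a minimal pure-Python sketch. It is not the authors' implementation (that is in the linked repository). Everything in it is an assumption chosen for illustration: the overlap rule used to build the spatial graph, the additive question conditioning, the toy boxes and features, and the function names `spatial_adjacency` and `graph_attention`.

```python
import math

def spatial_adjacency(boxes):
    """Hypothetical spatial graph: link two regions when their boxes overlap."""
    def iou(a, b):
        # Boxes are (x1, y1, x2, y2).
        ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
        iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = ix * iy
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0
    n = len(boxes)
    # Every node keeps a self-loop so it always has at least one neighbor.
    return [[i == j or iou(boxes[i], boxes[j]) > 0.0 for j in range(n)]
            for i in range(n)]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def graph_attention(features, adj, question):
    """One attention step restricted to graph edges.

    Assumption: the question vector is added to each node feature before
    scoring, so the attention weights depend on the question.
    """
    n, d = len(features), len(question)
    cond = [[f + q for f, q in zip(feat, question)] for feat in features]
    updated, weights = [], []
    for i in range(n):
        nbrs = [j for j in range(n) if adj[i][j]]
        scores = [sum(a * b for a, b in zip(cond[i], cond[j])) / math.sqrt(d)
                  for j in nbrs]
        row = [0.0] * n
        for j, alpha in zip(nbrs, softmax(scores)):
            row[j] = alpha
        weights.append(row)
        updated.append([sum(row[j] * features[j][k] for j in nbrs)
                        for k in range(d)])
    return updated, weights

# Toy input: three image regions, two of which overlap.
boxes = [(0, 0, 10, 10), (5, 5, 15, 15), (20, 20, 30, 30)]
features = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
question = [0.2, -0.1]

adj = spatial_adjacency(boxes)
updated, weights = graph_attention(features, adj, question)
```

Each row of `weights` shows how much one region attends to its neighbors for the given question. In this sketch, inspecting those weights is the simplest stand-in for the kind of reasoning path the abstract describes.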


Updated: 2024-07-20
