International Journal of Computer Vision (IF 11.6), Pub Date: 2024-10-06, DOI: 10.1007/s11263-024-02246-w. Authors: Wenting Chen, Jie Liu, Tianming Liu, Yixuan Yuan
Medical reports contain specific diagnostic results and additional information not present in medical images, so they can be effectively employed to assist image understanding tasks, and the modality gap between vision and language can be bridged by vision-language matching (VLM). However, current vision-language models distort the intra-modal relation and use only the class information in reports, which is insufficient for the segmentation task. In this paper, we introduce a novel Bi-level class-severity-aware Vision-Language Graph Matching (Bi-VLGM) for text guided medical image segmentation, composed of a word-level VLGM module and a sentence-level VLGM module, to exploit the class-severity-aware relation among visual-textual features. In word-level VLGM, to mitigate the distortion of the intra-modal relation during VLM, we reformulate VLM as a graph matching problem and introduce vision-language graph matching (VLGM) to exploit the high-order relation among visual-textual features. Then, we perform VLGM between the local features of each class region and the class-aware prompts to bridge their gap. In sentence-level VLGM, to provide disease severity information for the segmentation task, we introduce severity-aware prompting to quantify the severity level of disease lesions, and perform VLGM between the global features and the severity-aware prompts. By exploiting the relation between the local (global) and class (severity) features, the segmentation model can incorporate class-aware and severity-aware information to improve segmentation performance. Extensive experiments demonstrate the effectiveness of our method and its superiority over existing methods. The source code will be released.
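The abstract does not specify how the graph matching is formulated, so the following is only a minimal sketch of a generic soft graph matching between visual and textual features, not the authors' implementation. It assumes one feature vector per class region and one per prompt, cosine similarity for both the intra-modal edges and the cross-modal node scores, and Sinkhorn normalisation for the soft assignment. All function names, the temperature `tau`, and the loss weighting are hypothetical.

```python
import numpy as np


def cosine_affinity(x, y):
    """Pairwise cosine similarity between rows of x (n, d) and y (m, d)."""
    xn = x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)
    yn = y / (np.linalg.norm(y, axis=1, keepdims=True) + 1e-8)
    return xn @ yn.T


def _logsumexp(a, axis):
    """Numerically stable log-sum-exp along one axis (dimension kept)."""
    m = a.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True))


def sinkhorn(scores, n_iters=20, tau=0.1):
    """Map a square score matrix to an approximately doubly stochastic matrix."""
    log_p = scores / tau
    for _ in range(n_iters):
        log_p = log_p - _logsumexp(log_p, axis=1)  # normalise rows
        log_p = log_p - _logsumexp(log_p, axis=0)  # normalise columns
    return np.exp(log_p)


def graph_matching_loss(visual, text):
    """Soft-match visual nodes to text nodes and score node and edge agreement.

    visual, text: arrays of shape (n, d), one row per graph node (assumption).
    Returns the soft assignment S (n, n) and a scalar loss.
    """
    edges_v = cosine_affinity(visual, visual)  # intra-modal relation, vision
    edges_t = cosine_affinity(text, text)      # intra-modal relation, language
    node_scores = cosine_affinity(visual, text)
    s = sinkhorn(node_scores)
    # Node term: matched visual and text nodes should be similar.
    node_term = -np.mean(np.sum(s * node_scores, axis=1))
    # Edge term: the text graph, mapped through S, should reproduce the
    # visual graph, so the intra-modal structure is preserved by the matching.
    edge_term = np.mean((edges_v - s @ edges_t @ s.T) ** 2)
    return s, node_term + edge_term


# Toy usage with random stand-ins for region features and prompt embeddings.
rng = np.random.default_rng(0)
region_feats = rng.normal(size=(4, 16))
prompt_feats = region_feats + 0.05 * rng.normal(size=(4, 16))
assignment, loss = graph_matching_loss(region_feats, prompt_feats)
```

The edge term is what distinguishes this from plain pairwise feature matching: it penalises assignments that align individual nodes well but scramble the relations among them, which is the kind of intra-modal distortion the abstract refers to.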
Title: Bi-VLGM: Bi-Level Class-Severity-Aware Vision-Language Graph Matching for Text Guided Medical Image Segmentation