SegLD: Achieving universal, zero-shot and open-vocabulary segmentation through multimodal fusion via latent diffusion processes



Information Fusion (IF 14.7), Pub Date: 2024-06-05, DOI: 10.1016/j.inffus.2024.102509
Hongtao Zheng, Yifei Ding, Zilong Wang, Xinyan Huang

Open-vocabulary learning can identify categories labeled during training (seen categories) and generalize to categories not annotated in the training set (unseen categories). In theory, it could extend segmentation systems to more universal applications. However, current open-vocabulary segmentation frameworks are primarily suited to specific tasks or must be retrained for each task, and they significantly underperform fully supervised frameworks on seen categories. We therefore introduce a universal open-vocabulary segmentation framework based on the latent diffusion process (SegLD), which requires only a single training session on a panoptic dataset to perform inference across all open-vocabulary segmentation tasks, and which reaches SOTA segmentation performance on both seen and unseen categories in every task. Specifically, SegLD comprises two stages. In the first stage, we deploy two parallel latent diffusion processes to deeply fuse the text (image captions or category labels) and image information, and then aggregate the multi-scale features output by both latent diffusion processes scale by scale. In the second stage, we introduce text queries, text list queries, and task queries, and compute contrastive losses between them to facilitate learning of inter-category and inter-task differences. The text queries are then fed into a Transformer Decoder to obtain category-agnostic segmentation masks. Finally, we establish classification loss functions matched to the type of text input used during training, whether image captions or category labels, to help assign a category label from the open vocabulary to each predicted binary mask.
Experimental results show that, with just a single training session, SegLD significantly outperforms other contemporary SOTA fully supervised and open-vocabulary segmentation frameworks across almost all evaluation metrics for both seen and unseen categories on the ADE20K, Cityscapes, and COCO datasets. This highlights SegLD’s capability as a universal segmentation framework, with the potential to replace other segmentation frameworks and adapt to various segmentation domains. The project link for SegLD is .
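
As a rough illustration of the last step the abstract describes, assigning a category label from an open vocabulary to each predicted binary mask, the sketch below scores a per-mask embedding against text embeddings of candidate category names and picks the closest one. This is a common pattern in open-vocabulary segmentation and is not confirmed to be how SegLD does it: the function names, the toy embedding vectors, and the choice of cosine similarity are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def assign_labels(mask_embeddings, text_embeddings):
    """Give each mask the category whose text embedding is most similar.

    mask_embeddings: list of vectors, one per predicted binary mask (hypothetical).
    text_embeddings: dict mapping category name -> vector (hypothetical).
    Returns a list of (category_name, similarity) pairs, one per mask.
    """
    results = []
    for mask_vec in mask_embeddings:
        scores = {
            name: cosine_similarity(mask_vec, text_vec)
            for name, text_vec in text_embeddings.items()
        }
        best = max(scores, key=scores.get)
        results.append((best, scores[best]))
    return results


# Toy data: the vocabulary can be changed at inference time without retraining,
# which is the point of the open-vocabulary setting.
text_embeddings = {
    "road": [1.0, 0.0, 0.0],
    "sky": [0.0, 1.0, 0.0],
    "zebra": [0.0, 0.0, 1.0],  # stands in for an unseen category
}
mask_embeddings = [
    [0.9, 0.1, 0.0],
    [0.1, 0.2, 0.95],
]

labels = assign_labels(mask_embeddings, text_embeddings)
for index, (name, score) in enumerate(labels):
    print(f"mask {index}: {name} (similarity {score:.3f})")
```

In a real system the embeddings would come from the trained model and text encoder, and the similarity scores would typically pass through a softmax before the classification loss is computed; both details are outside what the abstract states.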


Updated: 2024-06-05
