WeakCLIP: Adapting CLIP for Weakly-Supervised Semantic Segmentation
International Journal of Computer Vision (IF 11.6), Pub Date: 2024-09-05, DOI: 10.1007/s11263-024-02224-2
Lianghui Zhu, Xinggang Wang, Jiapei Feng, Tianheng Cheng, Yingyue Li, Bo Jiang, Dingwen Zhang, Junwei Han
Contrastive language and image pre-training (CLIP) achieves great success in various computer vision tasks and also presents an opportune avenue for enhancing weakly-supervised image understanding with its large-scale pre-trained knowledge. As an effective way to reduce the reliance on pixel-level human-annotated labels, weakly-supervised semantic segmentation (WSSS) aims to refine the class activation map (CAM) into high-quality pseudo masks, but it heavily relies on inductive biases such as hand-crafted priors and digital image processing methods. For the vision-language pre-trained model, i.e., CLIP, we propose a novel text-to-pixel matching paradigm for WSSS. However, directly applying CLIP to WSSS is challenging due to three critical problems: (1) the task gap between contrastive pre-training and WSSS CAM refinement, (2) the lack of text-to-pixel modeling to fully utilize the pre-trained knowledge, and (3) insufficient detail owing to the \(\frac{1}{16}\) down-sampled resolution of ViT. Thus, we propose WeakCLIP to address these problems and leverage the pre-trained knowledge of CLIP for WSSS. Specifically, we first address the task gap by proposing a pyramid adapter and learnable prompts to extract WSSS-specific representations. We then design a co-attention matching module to model text-to-pixel relationships. Finally, the pyramid adapter and a text-guided decoder are introduced to gather multi-level information and integrate it with text guidance hierarchically. WeakCLIP provides an effective and parameter-efficient way to transfer CLIP knowledge to refine CAM. Extensive experiments demonstrate that WeakCLIP achieves state-of-the-art WSSS performance on standard benchmarks, i.e., 74.0% mIoU on the val set of PASCAL VOC 2012 and 46.1% mIoU on the val set of COCO 2014. The source code and model checkpoints are released at https://github.com/hustvl/WeakCLIP.
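The abstract describes a co-attention matching module that relates CLIP text embeddings (obtained with learnable prompts) to pixel embeddings in order to refine CAMs. The exact design is given in the paper and the released code; below is only a minimal, illustrative PyTorch sketch of the general text-to-pixel matching idea. The class name `CoAttentionMatcher`, the single cross-attention layer, and the tensor shapes are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CoAttentionMatcher(nn.Module):
    """Sketch of text-to-pixel matching: class-wise text embeddings
    attend over pixel embeddings, then the refined text features are
    matched back to every pixel to yield per-class activation maps."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Text embeddings act as queries; pixel embeddings as keys/values.
        self.text_to_pixel = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_emb: torch.Tensor, pixel_emb: torch.Tensor) -> torch.Tensor:
        """
        text_emb:  (B, C, D)   C class/text embeddings from the text encoder
        pixel_emb: (B, N, D)   N = H*W flattened pixel embeddings
        returns:   (B, C, N)   per-class activation over pixels
        """
        # Refine text embeddings with visual context via cross-attention.
        refined, _ = self.text_to_pixel(text_emb, pixel_emb, pixel_emb)
        refined = self.norm(refined + text_emb)

        # Cosine-similarity matching between refined text and each pixel.
        refined = F.normalize(refined, dim=-1)
        pixels = F.normalize(pixel_emb, dim=-1)
        return torch.einsum("bcd,bnd->bcn", refined, pixels)


if __name__ == "__main__":
    B, C, D, H, W = 2, 20, 512, 32, 32
    matcher = CoAttentionMatcher(dim=D)
    cams = matcher(torch.randn(B, C, D), torch.randn(B, H * W, D))
    print(cams.shape)  # torch.Size([2, 20, 1024]) -> reshape to (B, C, H, W)
```

In this simplified view, the output similarity map plays the role of a coarse per-class activation that a multi-level decoder (such as the text-guided decoder mentioned in the abstract) would further refine; consult the repository at https://github.com/hustvl/WeakCLIP for the actual modules.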