Exploring Multi-Modal Contextual Knowledge for Open-Vocabulary Object Detection
IEEE Transactions on Image Processing (IF 10.8) | Pub Date: 2024-10-29 | DOI: 10.1109/tip.2024.3485518 | Yifan Xu, Mengdan Zhang, Xiaoshan Yang, Changsheng Xu
We explore multi-modal contextual knowledge learned through multi-modal masked language modeling to provide explicit localization guidance for novel classes in open-vocabulary object detection (OVD). Intuitively, a well-modeled and correctly predicted masked concept word should effectively capture the textual contexts, visual contexts, and the cross-modal correspondence between texts and regions, thereby automatically activating high attention on corresponding regions. In light of this, we propose a multi-modal contextual knowledge distillation framework, MMC-Det, to explicitly supervise a student detector with the context-aware attention of the masked concept words in a teacher fusion transformer. The teacher fusion transformer is trained with our newly proposed diverse multi-modal masked language modeling (D-MLM) strategy, which significantly enhances the fine-grained region-level visual context modeling in the fusion transformer. The proposed distillation process provides additional contextual guidance to the concept-region matching of the detector, thereby further improving the OVD performance. Extensive experiments performed upon various detection datasets show the effectiveness of our multi-modal context learning strategy.
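To make the distillation idea in the abstract concrete, below is a minimal sketch of how a teacher's context-aware attention over image regions (for a masked concept word) could supervise a student detector's concept-region matching scores. The function name, tensor shapes, temperature, and the KL-divergence objective are illustrative assumptions in a PyTorch-style setup, not the paper's exact MMC-Det formulation.

```python
import torch
import torch.nn.functional as F

def attention_distillation_loss(teacher_attn, student_sim, temperature=1.0):
    """Sketch of region-level attention distillation for OVD.

    teacher_attn : (batch, num_regions) attention of a masked concept word
                   over image regions, taken from a teacher fusion transformer.
    student_sim  : (batch, num_regions) similarity between the student
                   detector's region embeddings and the same concept word.
    """
    # Turn both score vectors into distributions over candidate regions.
    p_teacher = F.softmax(teacher_attn / temperature, dim=-1)
    log_p_student = F.log_softmax(student_sim / temperature, dim=-1)
    # KL divergence pulls the student's concept-region matching toward
    # the teacher's context-aware attention pattern.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")

# Toy usage: random scores for 2 images with 100 candidate regions each.
if __name__ == "__main__":
    teacher_attn = torch.randn(2, 100)
    student_sim = torch.randn(2, 100)
    print(attention_distillation_loss(teacher_attn, student_sim).item())
```

In this sketch, the distillation term would be added to the detector's usual training losses, so the teacher's attention acts as extra localization guidance for novel classes rather than replacing the standard detection supervision.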
Updated: 2024-10-29