CLIP-Powered TASS: Target-Aware Single-Stream Network for Audio-Visual Question Answering
International Journal of Computer Vision (IF 11.6) · Pub Date: 2024-12-05 · DOI: 10.1007/s11263-024-02289-z
Yuanyuan Jiang, Jianqin Yin

While vision-language pretrained models (VLMs) excel in various multimodal understanding tasks, their potential in fine-grained audio-visual reasoning, particularly for audio-visual question answering (AVQA), remains largely unexplored. AVQA poses specific challenges for VLMs because it requires visual understanding at the region level and seamless integration with the audio modality. Previous VLM-based AVQA methods merely used CLIP as a feature encoder, underutilizing its pretrained knowledge, and, like most AVQA methods, treated audio and video as separate entities in a dual-stream framework. This paper proposes a new CLIP-powered target-aware single-stream (TASS) network for AVQA that exploits the pretrained knowledge of the CLIP model through the naturally occurring correspondence between audio and visual signals. It consists of two key components: a target-aware spatial grounding module (TSG+) and a single-stream joint temporal grounding module (JTG). Specifically, the TSG+ module transfers image-text matching knowledge from CLIP to the required region-text matching process without corresponding ground-truth labels. Moreover, unlike previous dual-stream networks, which still required an additional audio-visual fusion module, JTG unifies audio-visual fusion and question-aware temporal grounding in a simplified single-stream architecture. It treats audio and video as a cohesive entity and further extends the image-text matching knowledge to audio-text matching by preserving their temporal correlation through our proposed cross-modal synchrony (CMS) loss. In addition, we propose a simple yet effective preprocessing strategy to optimize the accuracy-efficiency trade-off. Extensive experiments on the MUSIC-AVQA benchmark verify the effectiveness of our proposed method over existing state-of-the-art methods. The code is available at https://github.com/Bravo5542/CLIP-TASS.
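The abstract describes TSG+ as transferring CLIP's image-text matching knowledge to region-text matching without ground-truth region labels. The paper's exact formulation is in the linked repository; as a rough illustration only, the PyTorch sketch below shows one plausible way to reuse CLIP's joint embedding space at the region level: score each visual patch token against a target text embedding and pool patches with the resulting attention map. The function name, tensor shapes, and temperature are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def target_aware_spatial_grounding(patch_feats, target_emb, tau=0.07):
    """Hypothetical sketch of region-text matching in CLIP's space.

    patch_feats: (B, N, D) CLIP patch-token embeddings for one frame
    target_emb:  (B, D)    CLIP text embedding of the target phrase
    tau:         softmax temperature (assumed value)
    """
    p = F.normalize(patch_feats, dim=-1)
    t = F.normalize(target_emb, dim=-1).unsqueeze(-1)    # (B, D, 1)

    # Patch-text cosine similarity, softmaxed into a spatial map,
    # requiring no region-level ground-truth supervision.
    attn = F.softmax((p @ t).squeeze(-1) / tau, dim=-1)  # (B, N)

    # Attention-weighted pooling yields a target-aware frame feature.
    grounded = torch.einsum("bn,bnd->bd", attn, patch_feats)
    return grounded, attn
```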
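The CMS loss is only named in the abstract, not defined. Below is a hedged sketch of one plausible form, assuming it enforces synchrony by aligning the question-conditioned temporal attention distributions of the audio and visual streams with a symmetric KL divergence; all names, shapes, and the temperature are hypothetical.

```python
import torch
import torch.nn.functional as F

def cross_modal_synchrony_loss(audio_feats, video_feats, query, tau=0.1):
    """Hypothetical sketch of a cross-modal synchrony (CMS) loss.

    audio_feats: (B, T, D) per-segment audio embeddings
    video_feats: (B, T, D) per-segment visual embeddings
    query:       (B, D)    question (text) embedding
    """
    a = F.normalize(audio_feats, dim=-1)
    v = F.normalize(video_feats, dim=-1)
    q = F.normalize(query, dim=-1).unsqueeze(-1)  # (B, D, 1)

    # Per-segment relevance of each modality to the question.
    sim_a = (a @ q).squeeze(-1) / tau             # (B, T)
    sim_v = (v @ q).squeeze(-1) / tau             # (B, T)

    # Symmetric KL between the two temporal attention distributions,
    # pushing audio and video to agree on which moments matter.
    log_pa = F.log_softmax(sim_a, dim=-1)
    log_pv = F.log_softmax(sim_v, dim=-1)
    return 0.5 * (
        F.kl_div(log_pa, log_pv.exp(), reduction="batchmean")
        + F.kl_div(log_pv, log_pa.exp(), reduction="batchmean")
    )
```

A divergence over temporal distributions, rather than a hard per-segment match, would tolerate small offsets between when an object sounds and when it is most visible, which fits the abstract's emphasis on preserving temporal correlation.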




Updated: 2024-12-05