Audio-visual event localization with dual temporal-aware scene understanding and image-text knowledge bridging
Complex & Intelligent Systems (IF 5.0), Pub Date: 2024-11-09, DOI: 10.1007/s40747-024-01654-2
Pufen Zhang, Jiaxiang Wang, Meng Wan, Song Zhang, Jie Jing, Lianhong Ding, Peng Shi

The audio-visual event localization (AVEL) task aims to detect and classify events that are both audible and visible. Existing methods pursue this goal by transferring pre-trained knowledge and by modeling the temporal dependencies and cross-modal correlations of the audio-visual scene. However, most works understand the scene from a single, entangled temporal-aware perspective, neglecting to learn temporal dependencies and cross-modal correlations from both forward and backward temporal views. Recently, transferring pre-trained knowledge from the Contrastive Language-Image Pre-training (CLIP) model has shown remarkable results across various tasks. Nevertheless, because a heterogeneous gap exists between the audio-visual knowledge required by AVEL and the image-text alignment knowledge encoded in CLIP, how to transfer CLIP's image-text alignment knowledge to the AVEL field has barely been investigated. To address these challenges, this paper proposes a novel Dual Temporal-aware scene understanding and image-text Knowledge Bridging (DTKB) model. DTKB consists of forward and backward temporal-aware scene understanding streams, in which temporal dependencies and cross-modal correlations are explicitly captured from dual temporal perspectives, enabling fine-grained scene understanding for event localization. In addition, a knowledge bridging (KB) module is proposed to transfer both the image-text representations and the alignment knowledge of CLIP to the AVEL task. This module regulates the ratio between the audio-visual fusion features and CLIP's visual features, thereby bridging CLIP's image-text alignment knowledge with the new audio-visual knowledge for event category prediction. The KB module is also compatible with previous models. Extensive experiments demonstrate that DTKB significantly outperforms state-of-the-art models.
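The dual temporal-aware idea — capturing context from both forward and backward views over the segment sequence — can be sketched roughly as follows. This is a minimal illustration only; the function name, the running-mean aggregation, and the concatenation are assumptions, since the abstract does not specify the streams' internals.

```python
import numpy as np

def dual_temporal_context(segments):
    """Hypothetical sketch: build per-segment context from a forward
    and a backward temporal view over a (T, D) feature sequence."""
    T = len(segments)
    # forward stream: aggregate each segment with its past (running mean)
    fwd = np.array([segments[: t + 1].mean(axis=0) for t in range(T)])
    # backward stream: aggregate each segment with its future
    bwd = np.array([segments[t:].mean(axis=0) for t in range(T)])
    # concatenate the two temporal-aware views per segment
    return np.concatenate([fwd, bwd], axis=1)
```

In the paper's model the two streams would carry learned temporal and cross-modal attention rather than plain means; the sketch only shows why each segment ends up with distinct past-aware and future-aware representations.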
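The knowledge bridging step — regulating the ratio between audio-visual fusion features and CLIP's visual features, then predicting the event category CLIP-style against text embeddings — might look like the sketch below. The linear mixing coefficient `alpha` and the cosine-similarity classifier are assumptions for illustration; the abstract does not give the KB module's exact formulation.

```python
import numpy as np

def knowledge_bridge(av_fused, clip_visual, alpha):
    """Hypothetical KB sketch: alpha regulates the ratio between the
    audio-visual fusion feature and CLIP's visual feature."""
    return alpha * av_fused + (1.0 - alpha) * clip_visual

def predict_event(bridged, text_embeds):
    """CLIP-style category prediction: cosine similarity between the
    bridged feature and per-category text embeddings."""
    b = bridged / np.linalg.norm(bridged)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    return int(np.argmax(t @ b))
```

Keeping the bridged feature in CLIP's visual space (rather than replacing it) is what lets the frozen image-text alignment knowledge still score the event categories.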




Updated: 2024-11-09