Audio-Visual Segmentation with Semantics

International Journal of Computer Vision (IF 11.6). Pub Date: 2024-10-15. DOI: 10.1007/s11263-024-02261-x

Jinxing Zhou, Xuyang Shen, Jianyuan Wang, Jiayi Zhang, Weixuan Sun, Jing Zhang, Stan Birchfield, Dan Guo, Lingpeng Kong, Meng Wang, Yiran Zhong
We propose a new problem called audio-visual segmentation (AVS), in which the goal is to output a pixel-level map of the object(s) that produce sound at the time of the image frame. To facilitate this research, we construct the first audio-visual segmentation benchmark, i.e., AVSBench, providing pixel-wise annotations for sounding objects in audible videos. It contains three subsets: AVSBench-object (Single-source subset, Multi-sources subset) and AVSBench-semantic (Semantic-labels subset). Accordingly, three settings are studied: 1) semi-supervised audio-visual segmentation with a single sound source; 2) fully-supervised audio-visual segmentation with multiple sound sources; and 3) fully-supervised audio-visual semantic segmentation. The first two settings require generating binary masks of sounding objects, indicating the pixels that correspond to the audio, while the third setting further requires generating semantic maps indicating the object category. To address these problems, we propose a new baseline method that uses a temporal pixel-wise audio-visual interaction module to inject audio semantics as guidance for the visual segmentation process. We also design a regularization loss to encourage audio-visual mapping during training. Quantitative and qualitative experiments on the AVSBench dataset compare our approach to several existing methods for related tasks, demonstrating that the proposed method is promising for building a bridge between audio and pixel-wise visual semantics. Code can be found at https://github.com/OpenNLPLab/AVSBench.
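The abstract does not spell out the interaction module or the regularizer; as a rough sketch only, the PyTorch snippet below shows one plausible realization of the two ideas: per-frame audio embeddings injected into the visual feature map via pixel-wise cross-modal attention, plus a regularizer that pulls mask-pooled visual features toward the audio embedding. All class/function names, tensor shapes, and the MSE formulation are illustrative assumptions, not the authors' released implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelwiseAudioVisualFusion(nn.Module):
    """Sketch of a pixel-wise audio-visual interaction block (assumed design).

    One audio embedding per frame is injected into the visual feature map
    via cross-modal attention: every pixel queries the audio stream, so
    sounding regions can be emphasized before segmentation decoding.
    """

    def __init__(self, vis_dim=256, aud_dim=128):
        super().__init__()
        self.q = nn.Linear(vis_dim, vis_dim)   # queries from pixels
        self.k = nn.Linear(aud_dim, vis_dim)   # keys from audio
        self.v = nn.Linear(aud_dim, vis_dim)   # values from audio
        self.scale = vis_dim ** -0.5

    def forward(self, vis, aud):
        # vis: (B, T, C, H, W) visual features over T frames
        # aud: (B, T, Da) one audio embedding per frame (e.g., from VGGish)
        B, T, C, H, W = vis.shape
        pix = vis.permute(0, 1, 3, 4, 2).reshape(B, T * H * W, C)
        q = self.q(pix)                                  # (B, THW, C)
        k = self.k(aud)                                  # (B, T, C)
        v = self.v(aud)                                  # (B, T, C)
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)
        fused = pix + attn @ v                           # residual audio injection
        return fused.reshape(B, T, H, W, C).permute(0, 1, 4, 2, 3)

def audio_visual_map_loss(fused, aud_proj, mask):
    """Hypothetical audio-visual mapping regularizer (not the paper's loss).

    Pools the fused visual features inside the predicted mask and pulls the
    pooled vector toward the (projected) audio embedding of the same frame.
    """
    # fused:    (B, T, C, H, W) audio-conditioned visual features
    # aud_proj: (B, T, C) audio embeddings projected to the visual width
    # mask:     (B, T, 1, H, W) predicted soft masks in [0, 1]
    pooled = (fused * mask).sum(dim=(-2, -1)) / mask.sum(dim=(-2, -1)).clamp(min=1e-6)
    return F.mse_loss(pooled, aud_proj)

# Toy usage with random tensors (all shapes are assumptions):
fusion = PixelwiseAudioVisualFusion()
vis = torch.randn(2, 5, 256, 14, 14)   # 5 frames of visual features
aud = torch.randn(2, 5, 128)           # 5 per-frame audio embeddings
out = fusion(vis, aud)                 # (2, 5, 256, 14, 14)
```

The attention here softmaxes over frames, so each pixel decides which moments of audio to borrow semantics from; a per-pixel spatial attention or a simple feature concatenation would be equally valid sketches given only the abstract's description.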