Efficient audio–visual information fusion using encoding pace synchronization for Audio–Visual Speech Separation
Information Fusion (IF 14.7) Pub Date: 2024-10-23, DOI: 10.1016/j.inffus.2024.102749
Xinmeng Xu, Weiping Tu, Yuhong Yang

Contemporary audio–visual speech separation (AVSS) models typically use encoders that merge audio and visual representations by concatenating them at a specific layer. This approach assumes that both modalities progress at the same pace and that information is adequately encoded at the chosen fusion layer. However, this assumption is often flawed due to inherent differences between the audio and visual modalities. In particular, the audio modality, being more directly tied to the final output (i.e., denoised speech), tends to converge faster than the visual modality. This discrepancy creates a persistent challenge in selecting the appropriate layer for fusion. To address this, we propose the Encoding Pace Synchronization Network (EPS-Net) for AVSS. EPS-Net allows for the independent encoding of the two modalities, enabling each to be processed at its own pace. At the same time, it establishes communication between the audio and visual modalities at corresponding encoding layers, progressively synchronizing their encoding speeds. This approach facilitates the gradual fusion of information while preserving the unique characteristics of each modality. The effectiveness of the proposed method has been validated through extensive experiments on the LRS2, LRS3, and VoxCeleb2 datasets, demonstrating superior performance over state-of-the-art methods.
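To make the described architecture concrete, below is a minimal sketch of a dual-stream encoder in which the audio and visual stacks advance independently but exchange information after every layer. The paper does not specify the communication mechanism here; modelling it as residual cross-attention, and the layer/dimension choices, are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn

class CrossModalSync(nn.Module):
    """Communication step at one encoder depth (assumed cross-attention,
    not necessarily the paper's exact mechanism)."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.a_from_v = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.v_from_a = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, audio, visual):
        # Each stream queries the other and adds the result residually,
        # so both keep their own representation while sharing information.
        a_upd, _ = self.a_from_v(audio, visual, visual)
        v_upd, _ = self.v_from_a(visual, audio, audio)
        return audio + a_upd, visual + v_upd

class DualStreamEncoder(nn.Module):
    """Independent audio/visual encoder stacks with a communication step
    after each layer, illustrating per-layer pace synchronization."""
    def __init__(self, dim=256, depth=4):
        super().__init__()
        self.audio_layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True) for _ in range(depth)])
        self.visual_layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True) for _ in range(depth)])
        self.sync = nn.ModuleList([CrossModalSync(dim) for _ in range(depth)])

    def forward(self, audio, visual):
        # audio: (B, T_audio, dim) speech features; visual: (B, T_video, dim) lip features
        for a_layer, v_layer, sync in zip(self.audio_layers, self.visual_layers, self.sync):
            audio = a_layer(audio)                # each modality encodes at its own pace
            visual = v_layer(visual)
            audio, visual = sync(audio, visual)   # communicate at the matching depth
        return audio, visual

if __name__ == "__main__":
    enc = DualStreamEncoder()
    a = torch.randn(2, 100, 256)   # e.g. 100 audio frames
    v = torch.randn(2, 25, 256)    # e.g. 25 video frames
    a_out, v_out = enc(a, v)
    print(a_out.shape, v_out.shape)
```

The point of the sketch is the structural contrast with single-layer concatenation fusion: instead of merging the two modalities once at a fixed depth, each layer lets the slower-converging visual stream be informed by the audio stream (and vice versa) while both retain separate encoders.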

Updated: 2024-10-23