X-TF-GridNet: A time–frequency domain target speaker extraction network with adaptive speaker embedding fusion
Information Fusion (IF 14.7), Pub Date: 2024-06-28, DOI: 10.1016/j.inffus.2024.102550
Fengyuan Hao, Xiaodong Li, Chengshi Zheng

Target speaker extraction (TSE), which directly extracts the desired speech given enrollment utterances of the target speaker, has attracted increasing attention for its potential to help solve the cocktail-party problem. Existing time-domain methods have made considerable progress and have become the dominant approach for TSE, but their performance often degrades significantly under more realistic conditions. This paper proposes an approach in the time–frequency (T–F) domain, X-TF-GridNet, which uses complex spectral mapping to estimate the real and imaginary (RI) components of the target speech. Specifically, the TF-GridNet block serves as the primary speaker extractor module. The proposed method adds two key extensions. First, a U-Net style network extracts robust fixed speaker embeddings that efficiently capture and represent target speaker information. Second, an adaptive embedding fusion (AEA) mechanism ensures that the target speaker information is used effectively, so that the backbone extractor focuses on the speech of interest. In addition, we introduce a multi-task learning framework with two distinct loss functions to explicitly improve both the discriminability of the speaker embeddings derived from the reference speech and the overall quality of the extracted target speech. We conducted extensive ablation studies and quantitative comparisons against previous TSE methods on WSJ0-2mix and on its noisy and reverberant counterparts. The proposed method achieved an SI-SDR of 19.7 dB with a moderate model size on the WSJ0-2mix dataset, and a larger model raised the SI-SDR to 20.7 dB. Experimental results showed that, compared with existing time-domain approaches, the proposed method not only achieved competitive performance across multiple objective metrics but also reduced speaker confusion errors under more challenging conditions that include interference such as noise and reverberation.
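
The headline numbers above are reported in SI-SDR (scale-invariant signal-to-distortion ratio). As a point of reference, the following is a minimal NumPy sketch of the commonly used definition of that metric; it is not taken from the paper, and the function name `si_sdr` and the `eps` stabilizer are assumptions made here for illustration.

```python
import numpy as np


def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR in dB for two 1-D signals of equal length.

    Illustrative sketch of the standard definition, not the authors' code.
    """
    estimate = np.asarray(estimate, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)

    # Remove the DC offset so the metric ignores constant bias.
    estimate = estimate - estimate.mean()
    target = target - target.mean()

    # Project the estimate onto the target; the optimal scale makes the
    # metric invariant to the overall gain of the estimate.
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = scale * target

    # Everything not explained by the scaled target counts as distortion.
    distortion = estimate - projection

    ratio = (np.sum(projection ** 2) + eps) / (np.sum(distortion ** 2) + eps)
    return 10.0 * np.log10(ratio)
```

Because the estimate is projected onto the target before the energy ratio is taken, rescaling the estimate leaves the score essentially unchanged, which is why the metric is preferred over plain SNR when comparing extraction systems with different output gains.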

Updated: 2024-06-28