Diffusion-based diverse audio captioning with retrieval-guided Langevin dynamics
Information Fusion ( IF 14.7 ) Pub Date : 2024-08-21 , DOI: 10.1016/j.inffus.2024.102643
Yonggang Zhu , Aidong Men , Li Xiao

Audio captioning, a comprehensive task of audio understanding, aims to provide a natural-language description of an audio clip. Beyond accuracy, diversity is also a critical requirement for this task. Human-produced captions possess rich variability due to the ambiguity of audio semantics (such as insects buzzing and electrical humming making similar sounds) and the existence of subjective judgments (metaphor, affections, etc.). However, current diverse audio captioning systems fail to produce captions with near-human diversity. Recently, diffusion models have demonstrated the potential to generate data with diversity while maintaining decent accuracy, yet they have not been explored in audio captioning. On the other hand, diffusion models tend to have low generation accuracy and fluency on text data. Directly applying diffusion models to audio captioning tasks may aggravate this problem due to the small size of annotated datasets and the mutable supervision target caused by the variability in human-produced captions. In this work, we propose a model by incorporating the BART language model into the diffusion model for better utilization of the pre-trained linguistic knowledge. We also propose a retrieval-guided Langevin dynamics module, which enables dynamic run-time alignment between generated captions and the target audio. Extensive experiments on standard audio captioning benchmark datasets (Clotho and AudioCaps) demonstrate that our model can achieve better performance on metrics compared with state-of-the-art diverse audio captioning systems. The implementation is available at .
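The abstract describes a retrieval-guided Langevin dynamics module that nudges generated captions toward the target audio at inference time. The paper's actual formulation is not given here, so the following is only a minimal sketch of the generic guided Langevin update it alludes to: a latent is moved up the gradient of an alignment score plus Gaussian noise. The alignment score below (negative squared distance between a caption latent and an audio embedding) and all names (`langevin_guidance_step`, `grad_log_p`, `audio_emb`) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def langevin_guidance_step(z, grad_log_p, step_size, rng):
    """One Langevin update: drift up the (assumed) caption-audio
    alignment gradient, plus Gaussian exploration noise."""
    noise = rng.standard_normal(z.shape)
    return z + 0.5 * step_size * grad_log_p(z) + np.sqrt(step_size) * noise

# Toy alignment score: negative squared distance to the audio embedding,
# so its gradient pulls the caption latent toward the audio.
# (Stand-in for whatever retrieval-based similarity the paper uses.)
audio_emb = np.ones(8)
grad_log_p = lambda z: 2.0 * (audio_emb - z)

rng = np.random.default_rng(0)
z = np.full(8, 10.0)          # caption latent, initially far from the audio
for _ in range(200):
    z = langevin_guidance_step(z, grad_log_p, step_size=0.05, rng=rng)
```

After repeated steps the latent concentrates near the audio embedding while the injected noise preserves sample-to-sample variability, which is the intuition behind using Langevin dynamics for diverse-yet-aligned generation.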

Updated: 2024-08-21