End-to-End Label Uncertainty Modeling in Speech Emotion Recognition Using Bayesian Neural Networks and Label Distribution Learning, IEEE Transactions on Affective Computing

IEEE Transactions on Affective Computing (IF 9.6), Pub Date: 2023-06-07, DOI: 10.1109/taffc.2023.3283595
Navin Raj Prabhu¹, Nale Lehmann-Willenbrock², Timo Gerkmann¹

To train machine learning algorithms to predict emotional expressions in terms of arousal and valence, annotated datasets are needed. However, as different people perceive others' emotional expressions differently, their annotations are subjective. To account for this, annotations are typically collected from multiple annotators and averaged to obtain ground-truth labels. However, when trained exclusively on this averaged ground truth, the model is agnostic to the inherent subjectivity in emotional expressions. In this work, we therefore propose an end-to-end Bayesian neural network that can be trained on a distribution of annotations to also capture the subjectivity-based label uncertainty. Instead of a Gaussian, we model the annotation distribution using Student's $t$-distribution, which also accounts for the number of annotations available. We derive the corresponding Kullback-Leibler divergence loss and use it to train an estimator for the annotation distribution, from which the mean and uncertainty can be inferred. We validate the proposed method on two in-the-wild datasets. We show that the proposed $t$-distribution-based approach achieves state-of-the-art uncertainty modeling results in speech emotion recognition, as well as consistent results in cross-corpus evaluations. Furthermore, analyses reveal that the advantage of a $t$-distribution over a Gaussian grows with increasing inter-annotator correlation and a decreasing number of available annotations.
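
The abstract states that a Student's $t$-distribution is fitted to the annotations and trained with a Kullback-Leibler divergence loss, but it does not give the formulas. The following is a minimal Python sketch of that idea under stated assumptions, not the authors' derivation. It assumes the label distribution for one utterance has $\nu = n - 1$ degrees of freedom for $n$ annotations, with location equal to the sample mean and scale equal to the sample standard deviation (floored by a hypothetical `min_scale`), and it assumes the loss direction is KL(label || prediction). In place of the KL loss derived in the paper, it uses a plain Monte Carlo estimate. All function names and the example annotations are hypothetical.

```python
import math
import random
import statistics
from typing import Sequence, Tuple

# (degrees of freedom, location, scale) of a location-scale Student's t.
TParams = Tuple[float, float, float]


def annotation_t_params(annotations: Sequence[float], min_scale: float = 1e-3) -> TParams:
    """Hypothetical label distribution built from the n annotations of one sample.

    Assumption (not taken from the paper): df = n - 1, location = sample mean,
    scale = sample standard deviation, floored at min_scale so that unanimous
    annotations do not produce a zero scale.
    """
    n = len(annotations)
    if n < 2:
        raise ValueError("need at least two annotations")
    loc = statistics.fmean(annotations)
    scale = max(statistics.stdev(annotations), min_scale)
    return float(n - 1), loc, scale


def t_logpdf(x: float, params: TParams) -> float:
    """Log-density of a location-scale Student's t-distribution."""
    df, loc, scale = params
    z = (x - loc) / scale
    return (
        math.lgamma((df + 1) / 2)
        - math.lgamma(df / 2)
        - 0.5 * math.log(df * math.pi)
        - math.log(scale)
        - (df + 1) / 2 * math.log1p(z * z / df)
    )


def t_sample(params: TParams, rng: random.Random) -> float:
    """Draw loc + scale * Z / sqrt(V / df) with Z ~ N(0, 1) and V ~ chi-squared(df)."""
    df, loc, scale = params
    z = rng.gauss(0.0, 1.0)
    v = rng.gammavariate(df / 2, 2.0)  # chi-squared(df) == Gamma(shape=df/2, scale=2)
    return loc + scale * z / math.sqrt(v / df)


def kl_t_monte_carlo(label: TParams, pred: TParams,
                     num_samples: int = 10_000, seed: int = 0) -> float:
    """Monte Carlo estimate of KL(label || pred) = E_label[log label(x) - log pred(x)]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        x = t_sample(label, rng)
        total += t_logpdf(x, label) - t_logpdf(x, pred)
    return total / num_samples


if __name__ == "__main__":
    # Five hypothetical arousal annotations for a single utterance.
    label = annotation_t_params([0.2, 0.35, 0.1, 0.3, 0.25])
    near = (label[0], 0.3, label[2])  # hypothetical prediction close to the mean
    far = (label[0], 0.5, label[2])   # hypothetical prediction further away
    print("label t-distribution (df, loc, scale):", label)
    print("KL(label || near):", kl_t_monte_carlo(label, near))
    print("KL(label || far): ", kl_t_monte_carlo(label, far))
```

Because the estimator reuses a fixed seed, KL(label || label) comes out as exactly zero, and a prediction centred further from the annotation mean receives a larger estimate. The Monte Carlo estimate was chosen only because it needs no closed form; training a network on this loss would require writing it in an automatic-differentiation framework, which this standard-library sketch does not attempt.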

Updated: 2023-06-07