A Multi-Level Alignment and Cross-Modal Unified Semantic Graph Refinement Network for Conversational Emotion Recognition
IEEE Transactions on Affective Computing ( IF 9.6 ) Pub Date : 2024-01-31 , DOI: 10.1109/taffc.2024.3354382
Xiaoheng Zhang, Weigang Cui, Bin Hu, Yang Li

Emotion recognition in conversation (ERC) based on multiple modalities has attracted enormous attention. However, most studies simply concatenate multimodal representations, generally neglecting cross-modal correspondences and uncertain factors, which leads to cross-modal misalignment. Furthermore, recent methods consider only simple contextual features, commonly ignoring semantic clues and thus failing to fully capture semantic consistency. To address these limitations, we propose a novel multi-level alignment and cross-modal unified semantic graph refinement network (MA-CMU-SGRNet) for the ERC task. Specifically, a multi-level alignment (MA) module is first designed to bridge the gap between the acoustic and lexical modalities; it contrasts both instance-level and prototype-level relationships, separating the multimodal features in the latent space. Second, a cross-modal uncertainty-aware unification (CMU) is adopted to generate a unified representation in a joint space that accounts for the ambiguity of emotion. Finally, a dual-encoding semantic graph refinement network (SGRNet) is investigated, which includes a syntactic encoder that aggregates information from near neighbors and a semantic encoder that focuses on useful semantically close neighbors. Extensive experiments on three multimodal public datasets demonstrate the effectiveness of our method compared with state-of-the-art approaches, indicating its potential for conversational emotion recognition.
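The abstract does not give implementation details of the multi-level alignment. As a rough, non-authoritative illustration, the instance-level part of such a contrastive alignment between acoustic and lexical utterance embeddings is often realized as a symmetric InfoNCE-style loss; the sketch below assumes paired, fixed-size embeddings and a scalar temperature, and all names and shapes are illustrative assumptions rather than the authors' actual formulation:

```python
import numpy as np

def info_nce_alignment_loss(acoustic, lexical, temperature=0.1):
    """Symmetric InfoNCE-style loss over paired acoustic/lexical
    utterance embeddings (shape (N, D) each): the i-th acoustic and
    i-th lexical vectors (same utterance) are pulled together, all
    mismatched pairs are pushed apart in the shared latent space."""
    # L2-normalize so dot products become cosine similarities
    a = acoustic / np.linalg.norm(acoustic, axis=1, keepdims=True)
    t = lexical / np.linalg.norm(lexical, axis=1, keepdims=True)
    logits = a @ t.T / temperature           # (N, N) similarity matrix
    labels = np.arange(len(a))               # positives lie on the diagonal

    def cross_entropy(l):
        # numerically stable log-softmax per row, then pick the diagonal
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # contrast in both directions: acoustic->lexical and lexical->acoustic
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

In a training loop this term would be combined with the prototype-level (class-centroid) contrast the abstract mentions; well-aligned pairs drive the diagonal similarities up and the loss toward zero.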
