Automatic movie genre classification & emotion recognition via a BiProjection Multimodal Transformer
Information Fusion (IF 14.7), Pub Date: 2024-08-16, DOI: 10.1016/j.inffus.2024.102641
Diego Aarón Moreno-Galván, Roberto López-Santillán, Luis Carlos González-Gurrola, Manuel Montes-Y-Gómez, Fernando Sanchez-Vega, Adrián Pastor López-Monroy

Analyzing, manipulating, and comprehending data from multiple sources (e.g., websites, software applications, files, or databases) and of diverse modalities (e.g., video, images, audio, and text) has become increasingly important in many domains. Despite recent advances in multimodal classification (MC), several challenges remain to be addressed, such as the combination of modalities of very diverse nature, the optimal feature engineering for each modality, and the semantic alignment between text and images. Accordingly, the main motivation of our research lies in devising a neural architecture that effectively processes and combines the text, image, video, and audio modalities, so that it offers noteworthy performance across different MC tasks. In this regard, the Multimodal Transformer (MulT) model is a cutting-edge approach often employed in multimodal supervised tasks which, although effective, has a fixed architecture that limits its performance on specific tasks as well as its contextual understanding: it may struggle to capture fine-grained temporal patterns in audio or to effectively model spatial relationships in images. To address these issues, our research modifies and extends the MulT model in several respects. First, we leverage the Gated Multimodal Unit (GMU) module within the architecture to efficiently and dynamically weigh the modalities at the instance level and to visualize how the modalities are used. Second, to overcome the problem of vanishing and exploding gradients, we strategically place residual connections in the architecture. The proposed architecture is evaluated on two different and complex classification tasks: movie genre categorization (MGC) and multimodal emotion recognition (MER). The results are encouraging, as they indicate that the proposed architecture is competitive against state-of-the-art (SOTA) models in MGC, outperforming them by up to 2% on the Moviescope dataset and by 1% on the MM-IMDB dataset. Furthermore, for the MER task the unaligned version of the datasets was employed, which is considerably more difficult; we improve SOTA accuracy results by up to 1% on the IEMOCAP dataset and attain a competitive outcome on the CMU-MOSEI (Dai et al., 2021) collection, outperforming SOTA results for several emotions.
