Foundations & Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions, ACM Computing Surveys


Foundations & Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions
ACM Computing Surveys (IF 23.8), Pub Date: 2024-06-22, DOI: 10.1145/3656580
Paul Pu Liang, Amir Zadeh, Louis-Philippe Morency


Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design computer agents with intelligent capabilities such as understanding, reasoning, and learning through integrating multiple communicative modalities, including linguistic, acoustic, visual, tactile, and physiological messages. With the recent interest in video understanding, embodied autonomous agents, text-to-image generation, and multisensor fusion in application domains such as healthcare and robotics, multimodal machine learning has brought unique computational and theoretical challenges to the machine learning community given the heterogeneity of data sources and the interconnections often found between modalities. However, the breadth of progress in multimodal research has made it difficult to identify the common themes and open questions in the field. By synthesizing a broad range of application domains and theoretical frameworks from both historical and recent perspectives, this article is designed to provide an overview of the computational and theoretical foundations of multimodal machine learning. We start by defining three key principles of modality heterogeneity, connections, and interactions that have driven subsequent innovations, and propose a taxonomy of six core technical challenges: representation, alignment, reasoning, generation, transference, and quantification covering historical and recent trends. Recent technical achievements will be presented through the lens of this taxonomy, allowing researchers to understand the similarities and differences across new approaches. We end by motivating several open problems for future research as identified by our taxonomy.


Updated: 2024-06-22
