From sight to insight: A multi-task approach with the visual language decoding model
Information Fusion (IF 14.7), Pub Date: 2024-07-05, DOI: 10.1016/j.inffus.2024.102573
Wei Huang, Pengfei Yang, Ying Tang, Fan Qin, Hengjiang Li, Diwei Wu, Wei Ren, Sizhuo Wang, Jingpeng Li, Yucheng Zhu, Bo Zhou, Jingyuan Sun, Qiang Li, Kaiwen Cheng, Hongmei Yan, Huafu Chen

Visual neural decoding aims to unlock how the human brain interprets the visual world by predicting perceived visual information from visual neural activity. While early studies made progress in decoding a single type of information from visual activity, they could not concurrently reveal the multi-level, interwoven linguistic information in the brain. Here, we developed a novel Visual Language Decoding Model (VLDM) that simultaneously decodes the main categories, semantic labels, and textual descriptions of visual stimuli from visual neural activity. The large-scale NSD dataset was used to ensure the model's efficiency in joint training and evaluation across multiple tasks. For category decoding, we achieved effective classification across 12 categories with an accuracy of nearly 70%, significantly surpassing the chance level. For label decoding, we attained precise prediction of 80 specific semantic labels with a 16-fold improvement over the chance level. For text decoding, the decoded text surpassed the corresponding baselines by notable margins on six evaluation metrics. These results highlight the complexity of the brain's visual information processing and the close connection between visual perception and language cognition. This study has broad implications for multi-layered brain-computer interfaces and could lead to more natural and efficient human-computer interaction.
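To make the multi-task setup concrete, below is a minimal, hypothetical sketch of a shared-encoder decoder with three task heads in PyTorch. The voxel dimension, layer sizes, the GRU text decoder, and all identifiers here are illustrative assumptions; the abstract does not describe the actual VLDM architecture.

```python
# Illustrative sketch only, NOT the paper's VLDM: a generic multi-task
# decoding head over flattened fMRI features. All sizes and names are
# hypothetical placeholders.
import torch
import torch.nn as nn

class MultiTaskDecoder(nn.Module):
    def __init__(self, voxel_dim=8000, hidden_dim=512,
                 n_categories=12, n_labels=80, vocab_size=10000):
        super().__init__()
        # Shared encoder maps voxel activity into a common latent space.
        self.encoder = nn.Sequential(
            nn.Linear(voxel_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        # Task-specific heads.
        self.category_head = nn.Linear(hidden_dim, n_categories)  # 12-way classification
        self.label_head = nn.Linear(hidden_dim, n_labels)         # 80 multi-label logits
        self.text_head = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.vocab_proj = nn.Linear(hidden_dim, vocab_size)       # toy token decoder

    def forward(self, voxels, max_len=20):
        z = self.encoder(voxels)                       # (B, hidden_dim)
        cat_logits = self.category_head(z)             # main-category logits
        label_logits = self.label_head(z)              # semantic-label logits
        # Repeat the latent as the input sequence of a toy text decoder.
        seq = z.unsqueeze(1).repeat(1, max_len, 1)     # (B, max_len, hidden_dim)
        out, _ = self.text_head(seq)
        token_logits = self.vocab_proj(out)            # (B, max_len, vocab_size)
        return cat_logits, label_logits, token_logits

# Joint training would sum the three task losses, e.g. cross-entropy for
# categories and tokens plus binary cross-entropy for the 80 labels:
model = MultiTaskDecoder()
voxels = torch.randn(4, 8000)  # a batch of 4 synthetic fMRI patterns
cat_logits, label_logits, token_logits = model(voxels)
```

For scale, assuming roughly balanced categories, chance accuracy over 12 classes is 1/12 ≈ 8.3%, so nearly 70% is a wide margin; likewise, a 16-fold gain over a 1/80 = 1.25% chance level for label decoding corresponds to roughly 20%.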
