Recognizing Video Activities in the Wild via View-to-Scene Joint Learning
IEEE Transactions on Automation Science and Engineering (IF 5.9), Pub Date: 2024-07-23, DOI: 10.1109/tase.2024.3431128
Jiahui Yu, Yifan Chen, Xuna Wang, Xu Cheng, Zhaojie Ju, Yingke Xu

Recognizing video actions in the wild is challenging for visual control systems. In-the-wild videos contain actions unseen in the training data, recorded from varying viewpoints and scenes yet sharing the same labels. Most existing methods address this challenge by developing complex frameworks to extract spatiotemporal features. To achieve view robustness and scene generalization cost-effectively, we explore view consistency and joint scene understanding. Based on this, we propose a neural network (called Wild-VAR) that learns view and scene information jointly without any 3D pose ground-truth labels, a new approach to recognizing video actions in the wild. Unlike most existing methods, we first propose a Cubing module to self-learn body consistency between views rather than comprehensive image features, boosting generalization performance in cross-view settings. Specifically, we map 3D representations to multiple 2D features and then adopt a self-adaptive scheme to constrain the 2D features from different perspectives. Moreover, we propose a temporal neural network (called T-Scene) to build the recognition framework, enabling Wild-VAR to flexibly learn scenes across time, including key interactors and context, in video sequences. Extensive experiments show that Wild-VAR consistently outperforms state-of-the-art methods on four benchmarks. Notably, with only half the computational cost, Wild-VAR improves accuracy by 2.2% and 1.3% on the Kinetics-400 and Something-Something V2 datasets, respectively.

Note to Practitioners—In human-robot interaction tasks, video action recognition is a prerequisite for visual control. In real applications, humans move freely in 3D space, which causes significant changes in the capture viewpoint and constantly changing scenes. Deep neural networks are limited by the perspectives and scenarios contained in the training data, so most existing methods are only effective at identifying actions from two to four fixed views against a single background. As a result, existing models often struggle to generalize to unconstrained application environments. Human view and video scene understanding are usually treated separately. Inspired by the human visual system, this paper proposes a cost-efficient view-to-scene video processing method. In real-world applications, this lightweight method can be integrated into robots to help recognize human behavior in complex environments. Its small parameter count makes the method easy to transfer to different types of behaviors, and its reduced computational cost enables real-time performance on limited hardware.
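To make the view-consistency idea concrete, below is a minimal, hypothetical sketch, not the paper's actual Cubing module: it assumes a learned 3D feature volume from some backbone, projects it along three orthogonal axes to obtain multiple 2D feature maps, and penalizes disagreement between per-view descriptors so that no 3D pose ground truth is needed. The class name, projection scheme, and loss are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ViewConsistency(nn.Module):
    """Illustrative stand-in for view-consistent learning (not the paper's
    Cubing module): project a 3D feature volume onto several 2D views and
    penalize disagreement between them, avoiding 3D pose labels."""

    NUM_VIEWS = 3  # one axis-aligned projection per spatial axis

    def __init__(self, channels: int):
        super().__init__()
        # One lightweight 1x1 projection head per view; the learned weights
        # play the role of a simple per-view "self-adaptive" constraint.
        self.heads = nn.ModuleList(
            [nn.Conv2d(channels, channels, kernel_size=1) for _ in range(self.NUM_VIEWS)]
        )

    def forward(self, volume: torch.Tensor):
        # volume: (B, C, D, H, W) learned 3D representation of the actor.
        # Averaging along the three spatial axes yields three 2D maps,
        # a crude proxy for observing the volume from different viewpoints.
        projections = [volume.mean(dim=2), volume.mean(dim=3), volume.mean(dim=4)]
        feats = [head(p) for head, p in zip(self.heads, projections)]

        # Consistency term: pooled descriptors of different views should agree.
        descriptors = [F.adaptive_avg_pool2d(f, 1).flatten(1) for f in feats]
        pairs = [(i, j) for i in range(len(feats)) for j in range(i + 1, len(feats))]
        loss = sum(F.mse_loss(descriptors[i], descriptors[j]) for i, j in pairs) / len(pairs)
        return feats, loss

# Example: two clips with 64-channel 8x8x8 volumes from any 3D backbone.
volume = torch.randn(2, 64, 8, 8, 8)
module = ViewConsistency(channels=64)
view_feats, consistency_loss = module(volume)  # add the loss to the recognition loss

In this sketch the consistency loss would simply be added to the standard recognition loss during training; the paper's actual Cubing and T-Scene designs are not specified in the abstract.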

Updated: 2024-08-22