NovPhy: A physical reasoning benchmark for open-world AI systems, Artificial Intelligence


Artificial Intelligence (IF 5.1), Pub Date: 2024-08-02, DOI: 10.1016/j.artint.2024.104198
Vimukthini Pinto, Chathura Gamage, Cheng Xue, Peng Zhang, Ekaterina Nikonova, Matthew Stephenson, Jochen Renz

Due to the emergence of AI systems that interact with the physical environment, there is an increased interest in incorporating physical reasoning capabilities into those AI systems. But is it enough to only have physical reasoning capabilities to operate in a real physical environment? In the real world, we constantly face novel situations we have not encountered before. As humans, we are competent at successfully adapting to those situations. Similarly, an agent needs to have the ability to function under the impact of novelties in order to properly operate in an open-world physical environment. To facilitate the development of such AI systems, we propose a new benchmark, NovPhy, that requires an agent to reason about physical scenarios in the presence of novelties and take actions accordingly. The benchmark consists of tasks that require agents to detect and adapt to novelties in physical scenarios. To create tasks in the benchmark, we develop eight novelties representing a diverse novelty space and apply them to five commonly encountered scenarios in a physical environment, related to applying forces and motions such as rolling, falling, and sliding of objects. According to our benchmark design, we evaluate two capabilities of an agent: the performance on a novelty when it is applied to different physical scenarios and the performance on a physical scenario when different novelties are applied to it. We conduct a thorough evaluation with human players, learning agents, and heuristic agents. Our evaluation shows that humans' performance is far beyond the agents' performance. Some agents, even with good normal task performance, perform significantly worse when there is a novelty, and the agents that can adapt to novelties typically adapt slower than humans. We promote the development of intelligent agents capable of performing at the human level or above when operating in open-world physical environments. Benchmark website: https://github.com/phy-q/novphy.
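The two evaluation views described in the abstract (performance on a novelty across scenarios, and performance on a scenario across novelties) can be read as row and column summaries of an 8 x 5 results grid. The following is a minimal sketch of that idea, not the benchmark's own evaluation code: the labels `novelty_1` to `novelty_8` and `scenario_1` to `scenario_5`, the `summarize` helper, the use of a plain mean, and the placeholder pass rates are all assumptions made for illustration. The abstract gives only the counts.

```python
from statistics import mean

# Hypothetical labels: the abstract states only that there are
# eight novelties and five scenarios, not what they are called.
NOVELTIES = [f"novelty_{i}" for i in range(1, 9)]
SCENARIOS = [f"scenario_{j}" for j in range(1, 6)]


def summarize(pass_rates):
    """Summarize a grid of pass rates along both axes.

    pass_rates maps (novelty, scenario) -> pass rate in [0, 1].
    Averaging is an assumption; the benchmark may define its
    measures differently.
    """
    # View 1: one novelty, averaged over every scenario it is applied to.
    per_novelty = {
        n: mean(pass_rates[(n, s)] for s in SCENARIOS) for n in NOVELTIES
    }
    # View 2: one scenario, averaged over every novelty applied to it.
    per_scenario = {
        s: mean(pass_rates[(n, s)] for n in NOVELTIES) for s in SCENARIOS
    }
    return per_novelty, per_scenario


# Placeholder values only, not results from the paper.
placeholder = {(n, s): 0.5 for n in NOVELTIES for s in SCENARIOS}
per_novelty, per_scenario = summarize(placeholder)
```

Keeping both summaries separate mirrors the distinction in the abstract: an agent could do well on one scenario under most novelties while still failing a particular novelty in every scenario.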


Updated: 2024-08-02
