Infinite Time Horizon Maximum Causal Entropy Inverse Reinforcement Learning
IEEE Transactions on Automatic Control (IF 6.2), Pub Date: 2018-09-01, DOI: 10.1109/tac.2017.2775960
Zhengyuan Zhou, Michael Bloem, Nicholas Bambos

Inverse reinforcement learning (IRL) attempts to use demonstrations of “expert” decision making in a Markov decision process to infer a corresponding policy that shares the “structured, purposeful” qualities of the expert's actions. In this paper, we extend the maximum causal entropy framework, a notable paradigm in IRL, to the infinite time horizon setting. We consider two formulations appropriate for the infinite-horizon case, maximum discounted causal entropy and maximum average causal entropy, and show that both yield optimization programs that can be reformulated as convex optimization problems and thus admit efficient computation. We then develop a gradient-based algorithm for the maximum discounted causal entropy formulation that has the desirable feature of being model-agnostic, a property absent from many previous IRL algorithms. We propose the stationary soft Bellman policy, a key building block of the gradient-based algorithm, and study it in depth; this analysis not only yields theoretical insight into its analytical properties but also motivates a broad toolkit of methods for implementing the gradient-based algorithm. Finally, we select three algorithms of this type and apply them to two problem instances involving demonstration data from a simple controlled queuing network model inspired by problems in air traffic management.
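For readers unfamiliar with the stationary soft Bellman policy mentioned in the abstract, the sketch below illustrates the standard discounted soft Bellman recursion commonly associated with maximum causal entropy methods, in which Q(s,a) = r(s,a) + γ E[V(s')], V(s) = log Σ_a exp(Q(s,a)), and π(a|s) = exp(Q(s,a) − V(s)). This is only an illustrative sketch under those standard assumptions; the function name, array conventions, and fixed-point iteration are ours, not the paper's implementation, and in the IRL setting the reward r would itself be parameterized and updated by the gradient-based algorithm.

import numpy as np

def soft_bellman_policy(P, r, gamma, iters=1000, tol=1e-10):
    # P: transition tensor of shape (A, S, S), P[a, s, s2] = Pr(s2 | s, a)
    # r: reward matrix of shape (S, A); gamma: discount factor in (0, 1)
    S, A = r.shape
    V = np.zeros(S)
    for _ in range(iters):
        # Soft Bellman backup: Q(s, a) = r(s, a) + gamma * E_{s' ~ P(.|s,a)}[V(s')]
        Q = r + gamma * np.einsum('asn,n->sa', P, V)
        # Soft value via a numerically stable log-sum-exp over actions
        m = Q.max(axis=1)
        V_new = m + np.log(np.exp(Q - m[:, None]).sum(axis=1))
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    # Stationary soft Bellman policy: pi(a | s) = exp(Q(s, a) - V(s))
    Q = r + gamma * np.einsum('asn,n->sa', P, V)
    m = Q.max(axis=1)
    V = m + np.log(np.exp(Q - m[:, None]).sum(axis=1))
    return np.exp(Q - V[:, None])

# Example (hypothetical data): a tiny 2-state, 2-action MDP with random rewards
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(2), size=(2, 2))   # shape (A, S, S); each P[a, s, :] sums to 1
r = rng.normal(size=(2, 2))                  # shape (S, A)
pi = soft_bellman_policy(P, r, gamma=0.9)
print(pi)                                     # each row is a distribution over actions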

Updated: 2018-09-01