Joint learning of reward machines and policies in environments with partially known semantics
Artificial Intelligence (IF 5.1) | Pub Date: 2024-05-23 | DOI: 10.1016/j.artint.2024.104146 | Authors: Christos K. Verginis, Cevahir Koprulu, Sandeep Chinchali, Ufuk Topcu
We study the problem of reinforcement learning for a task encoded by a reward machine. The task is defined over a set of properties of the environment, called atomic propositions, which are represented by Boolean variables. A common but unrealistic assumption in the literature is that the truth values of these propositions are known exactly. In real settings, however, these truth values are uncertain because they come from imperfect sensors. At the same time, reward machines can be difficult to model explicitly, especially when they encode complicated tasks. We develop a reinforcement-learning algorithm that, despite the uncertainty in the propositions' truth values, infers a reward machine encoding the underlying task while learning how to execute it. To address this uncertainty, the algorithm maintains a probabilistic estimate of the truth values of the atomic propositions and updates this estimate with new sensor measurements gathered while exploring the environment. Additionally, the algorithm maintains a hypothesis reward machine, which serves as an estimate of the reward machine encoding the task to be learned. As the agent explores the environment, the algorithm updates the hypothesis reward machine based on the obtained rewards and the estimated truth values of the atomic propositions. Finally, the algorithm runs a Q-learning procedure over the states of the hypothesis reward machine to determine an optimal policy that accomplishes the task. We prove that the algorithm successfully infers the reward machine and asymptotically learns a policy that accomplishes the respective task.
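The sketch below is a minimal Python illustration of the three ingredients the abstract describes, not the paper's algorithm: (1) a probabilistic (here Beta-Bernoulli) estimate of each atomic proposition's truth value updated from imperfect sensor readings, (2) a hypothesis reward machine represented as transition and reward tables, and (3) Q-learning over the product of environment states and hypothesis-reward-machine states. All names (`PropositionBelief`, `HypothesisRewardMachine`, the `env` interface) are illustrative assumptions, and the inference of the hypothesis reward machine itself from observed rewards is omitted.

```python
from collections import defaultdict
import random


class PropositionBelief:
    """Beta-Bernoulli estimate of Pr[proposition p is true in env state s]."""

    def __init__(self, prior_alpha=1.0, prior_beta=1.0):
        self.alpha = defaultdict(lambda: prior_alpha)  # pseudo-counts of "true" readings
        self.beta = defaultdict(lambda: prior_beta)    # pseudo-counts of "false" readings

    def update(self, state, prop, observed_true):
        """Fold in one noisy sensor reading for proposition `prop` at `state`."""
        if observed_true:
            self.alpha[(state, prop)] += 1.0
        else:
            self.beta[(state, prop)] += 1.0

    def prob_true(self, state, prop):
        a, b = self.alpha[(state, prop)], self.beta[(state, prop)]
        return a / (a + b)


class HypothesisRewardMachine:
    """Hypothesis reward machine: transition and reward tables keyed by
    (rm_state, label), where a label is a frozenset of propositions
    currently believed true."""

    def __init__(self, num_states, initial_state=0):
        self.num_states = num_states
        self.initial_state = initial_state
        self.delta = {}    # (rm_state, label) -> next rm_state
        self.reward = {}   # (rm_state, label) -> reward

    def step(self, rm_state, label):
        key = (rm_state, label)
        return self.delta.get(key, rm_state), self.reward.get(key, 0.0)


def most_likely_label(belief, state, props, threshold=0.5):
    """Thresholded maximum-likelihood label used to drive the hypothesis RM."""
    return frozenset(p for p in props if belief.prob_true(state, p) > threshold)


def q_learning_episode(env, rm, belief, props, q, alpha=0.1, gamma=0.99, eps=0.1):
    """One episode of epsilon-greedy Q-learning on the product (env state, RM state)
    space. `env` is an assumed interface with reset() -> state, actions(state) -> list,
    and step(state, action) -> (next_state, sensor_readings, done)."""
    s, u = env.reset(), rm.initial_state
    done = False
    while not done:
        actions = env.actions(s)
        if random.random() < eps:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda act: q[(s, u, act)])
        s_next, sensor_readings, done = env.step(s, a)
        # Update the proposition belief from the noisy sensor readings.
        for p, reading in sensor_readings.items():
            belief.update(s_next, p, reading)
        label = most_likely_label(belief, s_next, props)
        u_next, r = rm.step(u, label)
        # Standard Q-learning backup on the product state.
        best_next = max(q[(s_next, u_next, act)] for act in env.actions(s_next))
        q[(s, u, a)] += alpha * (r + gamma * best_next - q[(s, u, a)])
        s, u = s_next, u_next
    return q
```

To use this sketch, one would supply a concrete environment implementing the assumed `reset`/`actions`/`step` interface, initialize `q = defaultdict(float)`, and call `q_learning_episode` repeatedly; in the paper's setting the hypothesis reward machine would additionally be revised whenever the observed rewards contradict its predictions, a step that is not shown here.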
Updated: 2024-05-23