Addressing maximization bias in reinforcement learning with two-sample testing


Artificial Intelligence (IF 5.1), Pub Date: 2024-08-16, DOI: 10.1016/j.artint.2024.104204
Martin Waltz, Ostap Okhrin

Value-based reinforcement-learning algorithms have shown strong results in games, robotics, and other real-world applications. Overestimation bias is a known threat to these algorithms and can sometimes lead to dramatic performance decreases or even complete algorithmic failure. We frame the bias problem statistically and consider it an instance of estimating the maximum expected value (MEV) of a set of random variables. We propose the T-Estimator (TE), based on two-sample testing for the mean, which flexibly interpolates between over- and underestimation by adjusting the significance level of the underlying hypothesis tests. We also introduce a generalization, termed the K-Estimator (KE), that obeys the same bias and variance bounds as the TE and relies on a nearly arbitrary kernel function. We introduce modifications of Q-Learning and the Bootstrapped Deep Q-Network (BDQN) using the TE and the KE, and prove convergence in the tabular setting. Furthermore, we propose an adaptive variant of the TE-based BDQN that dynamically adjusts the significance level to minimize the absolute estimation bias. All proposed estimators and algorithms are thoroughly tested and validated on diverse tasks and environments, illustrating the bias control and performance potential of the TE and KE.
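To make the MEV framing concrete, the following is a minimal sketch, not the authors' implementation. It compares the plain maximum of sample means with a test-based estimate that averages every sample mean a one-sided two-sample z-test cannot distinguish from the largest one. The selection rule, the function names `max_estimator` and `two_sample_estimate`, and the simulation settings are illustrative assumptions; the abstract only states that the TE is built on two-sample tests and is tuned through the significance level.

```python
import numpy as np
from statistics import NormalDist


def max_estimator(samples):
    """Naive estimate of max_i E[X_i]: the largest sample mean."""
    return max(float(np.mean(s)) for s in samples)


def two_sample_estimate(samples, alpha=0.05):
    """Average the sample means that a one-sided two-sample z-test
    cannot separate from the largest sample mean.

    Assumed rule for illustration only, not taken from the paper.
    """
    if not 0.0 < alpha <= 0.5:
        raise ValueError("alpha must lie in (0, 0.5]")
    means = np.array([np.mean(s) for s in samples])
    # Squared standard error of each sample mean.
    sq_se = np.array([np.var(s, ddof=1) / len(s) for s in samples])
    best = int(np.argmax(means))
    # Statistic for H0: mu_i >= mu_best; it is <= 0 by construction.
    denom = np.sqrt(sq_se + sq_se[best])
    stats = (means - means[best]) / np.maximum(denom, 1e-12)
    z_alpha = NormalDist().inv_cdf(alpha)  # <= 0 for alpha <= 0.5
    keep = stats >= z_alpha  # H0 not rejected; the best mean is always kept
    return float(means[keep].mean())


# Toy experiment: ten variables that all have true mean 0, so the true MEV is 0
# and the average estimate over many runs equals the estimator's bias.
rng = np.random.default_rng(0)
runs = [[rng.normal(0.0, 1.0, size=20) for _ in range(10)] for _ in range(500)]
bias_max = float(np.mean([max_estimator(r) for r in runs]))
bias_test = float(np.mean([two_sample_estimate(r, alpha=0.05) for r in runs]))
print(f"bias of max estimator:        {bias_max:.3f}")
print(f"bias of test-based estimator: {bias_test:.3f}")
```

In this sketch, `alpha = 0.5` keeps only the largest sample mean and so reproduces the maximum estimator, while a very small `alpha` keeps almost every mean and approaches their plain average. That is one way to picture the interpolation between over- and underestimation that the abstract describes.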


Updated: 2024-08-16
