Moor: Model-based offline policy optimization with a risk dynamics model
Complex & Intelligent Systems (IF 5.0), Pub Date: 2024-11-11, DOI: 10.1007/s40747-024-01621-x
Xiaolong Su, Peng Li, Shaofei Chen

Offline reinforcement learning (RL) has been widely applied in safety-critical domains because it avoids dangerous and costly online interaction. A significant challenge is handling the uncertainty and risk that lie outside the offline data. Risk-sensitive offline RL addresses this through risk aversion. However, current model-based approaches use dynamics models to extract only state-transition and reward information. They cannot capture the risk information implicit in offline data and may therefore misuse high-risk data. In this work, we propose MOOR, a model-based offline policy optimization approach with a risk dynamics model. Specifically, we build a risk dynamics model on a quantile network that learns the risk information of the data. We then reshape the model-generated data according to the errors of the risk dynamics model and the risk information of the data. Finally, we use a risk-averse algorithm to learn the policy on the combined dataset of offline and generated data. We theoretically prove that MOOR can identify the risk information of data and avoid using high-risk data. Our experiments show that MOOR outperforms existing approaches and achieves state-of-the-art results on risk-sensitive D4RL and risky navigation tasks.
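The abstract does not specify MOOR's training loss or its reshaping rule. The sketch below is therefore only a hedged illustration of two generic ingredients that a quantile-network risk model typically relies on. The first is the standard quantile-regression (pinball) loss used to fit predicted quantiles. The second is a lower-tail risk measure (CVaR) read off those quantiles. The choice of pinball loss and CVaR, the midpoint quantile levels, and the function names `pinball_loss` and `cvar_from_quantiles` are assumptions made for illustration, not the authors' implementation.

```python
from typing import Sequence


def pinball_loss(quantiles: Sequence[float], taus: Sequence[float], target: float) -> float:
    """Mean quantile-regression (pinball) loss of predicted quantiles against one sample.

    For quantile level tau and residual u = target - q, the per-quantile loss is
    tau * u when u >= 0 and (tau - 1) * u otherwise, so under-predictions are
    weighted by tau and over-predictions by (1 - tau).
    """
    if len(quantiles) != len(taus) or not quantiles:
        raise ValueError("quantiles and taus must be non-empty and the same length")
    total = 0.0
    for q, tau in zip(quantiles, taus):
        u = target - q
        total += tau * u if u >= 0 else (tau - 1.0) * u
    return total / len(quantiles)


def cvar_from_quantiles(quantiles: Sequence[float], alpha: float) -> float:
    """Approximate CVaR_alpha (mean of the worst alpha-fraction of outcomes).

    Assumes the N values are equally weighted quantile estimates, e.g. at the
    midpoint levels (i + 0.5) / N, so the lower tail is the smallest
    round-down(alpha * N) values (at least one).
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    n = len(quantiles)
    k = max(1, int(alpha * n))  # number of lower-tail quantiles to average
    lower_tail = sorted(quantiles)[:k]
    return sum(lower_tail) / k


if __name__ == "__main__":
    # Hypothetical 4-quantile prediction at midpoint levels 0.125, 0.375, 0.625, 0.875.
    taus = [(i + 0.5) / 4 for i in range(4)]
    predicted = [-2.0, 0.0, 1.0, 3.0]
    print("pinball loss vs. sample 0.5:", pinball_loss(predicted, taus, 0.5))
    print("CVaR_0.5:", cvar_from_quantiles(predicted, 0.5))
```

A lower CVaR marks a transition whose predicted outcomes have a heavier bad tail. A risk-averse learner can use such a quantity to down-weight or penalize high-risk generated data, but the exact way MOOR combines it with model error is described only at a high level in the abstract.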




Updated: 2024-11-11