Descent wins five gold medals at the Computer Olympiad
ICGA Journal (IF 0.2), Pub Date: 2021-10-15, DOI: 10.3233/icg-210192
Quentin Cohen-Solal, Tristan Cazenave

Unlike AlphaZero-like algorithms (Silver et al., 2018), the Descent framework uses a variant of Unbounded Minimax (Korf and Chickering, 1996), instead of Monte Carlo Tree Search, to construct the partial game tree used to determine the best action to play and to collect data for learning. During training, at each move, the best sequences of moves are iteratively extended until terminal states are reached. During evaluation, the safest action is chosen, after the best sequences of moves have each been iteratively extended until a leaf state is reached. Moreover, Descent uses no policy network, only a value network, so actions do not need to be encoded. Unlike the AlphaZero paradigm, with Descent all the data generated during the searches that determine the best actions to play is used for learning. As a result, much more data is generated per game, so training proceeds more quickly and, contrary to AlphaZero, does not require massive parallelization to give good results. Descent can also use end-of-game heuristic evaluations, such as game score or game length (rewarding quick wins and slow losses), to improve its level of play faster.
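
To make the search concrete, here is a minimal Python sketch of one training-time Descent iteration under the negamax convention (values are always from the point of view of the player to move). The `Game` interface (`is_terminal`, `terminal_value`, `legal_moves`, `play`), the `value_net` callable, and hashable states are hypothetical stand-ins; this is a simplification for exposition, not the authors' implementation.

```python
# Illustrative sketch of one Descent training iteration (negamax convention).
# Assumes a hypothetical Game interface: state.is_terminal(),
# state.terminal_value(), state.legal_moves(), state.play(move), with
# hashable states, and a value_net(state) callable returning an estimate
# from the point of view of the player to move.

def descent_iteration(state, value_net, tree, data):
    """Follow the best move according to backed-up values until a terminal
    state, then back minimax values up the explored line. Every state
    visited is recorded as a (state, value) training example."""
    if state.is_terminal():
        v = state.terminal_value()        # exact game outcome, not an estimate
    else:
        children = [state.play(m) for m in state.legal_moves()]

        def value(c):
            # Backed-up tree value if the child was already explored,
            # else the value network's estimate (replacing MCTS rollouts).
            return tree[c] if c in tree else value_net(c)

        # Best child for the player to move: negate the child's value.
        best = max(children, key=lambda c: -value(c))
        # Extend only the best line, all the way down to a terminal state
        # (a real implementation would do this iteratively, not recursively).
        descent_iteration(best, value_net, tree, data)
        # Minimax backup over all children; the best child now carries a
        # refined, backed-up value in the tree.
        v = max(-value(c) for c in children)
    tree[state] = v
    data.append((state, v))               # all search data is kept for learning
```

Repeated at every move of a self-play game, each call adds one (state, value) pair per node it traverses, which is where the much larger amount of training data per game, compared with AlphaZero-style self-play, comes from. At evaluation time the same descent would be truncated at leaf states and the safest root action played instead.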
