Adaptive synchronous strategy for distributed machine learning
International Journal of Intelligent Systems (IF 5.0), Pub Date: 2022-09-20, DOI: 10.1002/int.23060
Miaoquan Tan, Wai‑Xi Liu, Junming Luo, Haosen Chen, Zhen‑Zheng Guo

In distributed machine learning training, bulk synchronous parallel (BSP) and asynchronous parallel (ASP) are the two main synchronization methods for gradient aggregation. However, BSP requires longer training time due to the "straggler" problem, while ASP sacrifices accuracy due to the "gradient staleness" problem. In this article, we propose a distributed training paradigm on the parameter server framework called the adaptive synchronous strategy (A2S), which improves on the BSP and ASP paradigms by adaptively adopting different parallel training schemes for workers with different training speeds. Based on the staleness gap between the fastest and slowest workers, A2S adaptively adds a relaxed synchronous barrier for fast workers to alleviate gradient staleness, where a differentiated weighting gradient aggregation method is used to reduce the impact of stale gradients. Simultaneously, A2S adopts ASP training for slow workers to eliminate stragglers. Hence, A2S not only mitigates the "gradient staleness" and "straggler" problems but also obtains convergence stability and synchronization gains through synchronous and asynchronous parallelism, respectively. In particular, we theoretically prove the convergence of A2S by deriving its regret bound. Moreover, experimental results show that A2S improves accuracy by up to 2.64% and accelerates training by up to 41% compared with the state-of-the-art synchronization methods BSP, ASP, stale synchronous parallel (SSP), dynamic SSP, and Sync-switch.
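For intuition, the sketch below illustrates the kind of server-side logic the abstract describes: splitting workers into a relaxed synchronous group and an ASP group by staleness, and down-weighting stale gradients during aggregation. This is a minimal Python sketch assuming a fixed staleness threshold and a 1/(1 + s) weighting rule; these choices, and all function names, are hypothetical illustrations rather than the paper's actual algorithm.

```python
# Hypothetical sketch of A2S-style parameter-server logic, based only on the
# abstract. The staleness threshold and the 1/(1 + s) weighting rule are
# illustrative assumptions, not taken from the paper.

def staleness(fastest_iter, worker_iter):
    """Iteration gap between the fastest worker and a given worker."""
    return fastest_iter - worker_iter

def split_workers(worker_iters, threshold=3):
    """Fast workers join a relaxed synchronous barrier; workers whose
    staleness exceeds the threshold fall back to ASP (no waiting)."""
    fastest = max(worker_iters.values())
    sync_group = {w for w, it in worker_iters.items()
                  if staleness(fastest, it) <= threshold}
    async_group = set(worker_iters) - sync_group
    return sync_group, async_group

def weighted_aggregate(params, grads, lr=0.01):
    """Differentiated weighting: a gradient with staleness s is scaled by
    1 / (1 + s), so stale gradients contribute less to the update.
    `grads` is a list of (gradient_vector, staleness) pairs."""
    weights = [1.0 / (1.0 + s) for _, s in grads]
    total = sum(weights)
    return [p - lr * sum(w * g[i] for (g, _), w in zip(grads, weights)) / total
            for i, p in enumerate(params)]

# Example: three workers at iterations 10, 9, and 5; worker "c" is a straggler
# and is moved to the asynchronous group.
sync_g, async_g = split_workers({"a": 10, "b": 9, "c": 5})
new_params = weighted_aggregate([0.5, -0.2],
                                [([0.1, 0.3], 0), ([0.2, 0.1], 1)])
```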

Updated: 2022-09-20