AISAW: An adaptive interference-aware scheduling algorithm for acceleration of deep learning workloads training on distributed heterogeneous systems
Future Generation Computer Systems (IF 6.2), Pub Date: 2024-12-06, DOI: 10.1016/j.future.2024.107642. Authors: Yushen Bi, Yupeng Xi, Chao Jing
Owing to the widespread application of artificial intelligence, deep learning (DL) has attracted considerable attention from both academia and industry. Training DL workloads is a key step in determining the quality of DL-based applications. However, because the computational power of conventional centralized clusters is limited, it is more beneficial to accelerate training by placing workloads on distributed heterogeneous systems. Unfortunately, current scheduling algorithms do not account for the varying capabilities of nodes or the limited network bandwidth, which leads to poor performance on distributed heterogeneous systems. To address this problem, we propose an adaptive interference-aware scheduling algorithm for accelerating DL workloads (called AISAW). We first established a predictive model, consisting of a job performance model and an interference-aware model, to reduce the impact of job co-location. Subsequently, to improve system efficiency, we developed an adaptive priority-aware allocation scheme (APS) that adaptively assigns DL jobs to the computing nodes on which they are predicted to perform best. In addition, under the constraint of network bandwidth, we devised a deadline-aware overhead minimization dynamic migration scheme (DOMS) to avoid the high overhead caused by frequent job migration. Finally, we conducted experiments on a real distributed heterogeneous system deployed with several GPU-based servers. The results demonstrate that AISAW improves system efficiency, reducing the makespan and average JCT by at least 23.86% and 13.02%, respectively, compared with state-of-the-art algorithms such as Gandiva, Tiresias, and MLF-H.
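To make the workflow described in the abstract easier to picture, the following minimal Python sketch shows one way such an interference-aware scheduler could be structured: an interference-adjusted completion-time estimate that combines a job performance model with an interference penalty, deadline-ordered allocation, and migration only when the predicted gain outweighs the migration overhead. All names and constants here (Node, Job, predict_runtime, interference_penalty, the 0.15 slowdown factor, etc.) are illustrative assumptions for this sketch and are not taken from the paper.

```python
# Hypothetical sketch of an interference-aware placement loop in the spirit of
# the abstract above; the models and constants are assumptions, not the
# authors' implementation.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: str
    speed: float                      # relative compute capability of the node
    running: List["Job"] = field(default_factory=list)


@dataclass
class Job:
    name: str
    work: float                       # normalized amount of training work
    deadline: float                   # latest acceptable completion time
    placed_on: Optional[Node] = None


def predict_runtime(job: Job, node: Node) -> float:
    """Job performance model (assumed form): base runtime scaled by node speed."""
    return job.work / node.speed


def interference_penalty(node: Node, job: Job) -> float:
    """Interference-aware model (assumed form): slowdown grows with the number
    of other jobs co-located on the node."""
    others = [j for j in node.running if j is not job]
    return 1.0 + 0.15 * len(others)


def predicted_completion(job: Job, node: Node) -> float:
    """Combine both models into one interference-adjusted completion estimate."""
    return predict_runtime(job, node) * interference_penalty(node, job)


def allocate(jobs: List[Job], nodes: List[Node]) -> None:
    """Priority-aware allocation: place the most urgent jobs first, each on the
    node with the lowest interference-adjusted predicted completion time."""
    for job in sorted(jobs, key=lambda j: j.deadline):
        best = min(nodes, key=lambda n: predicted_completion(job, n))
        best.running.append(job)
        job.placed_on = best


def maybe_migrate(job: Job, nodes: List[Node], migration_cost: float) -> bool:
    """Deadline-aware migration: move a job only if the predicted gain on the
    best alternative node outweighs the migration overhead and the deadline
    would still be met."""
    current = job.placed_on
    assert current is not None
    candidates = [n for n in nodes if n is not current]
    best = min(candidates, key=lambda n: predicted_completion(job, n))
    gain = predicted_completion(job, current) - predicted_completion(job, best)
    if gain > migration_cost and predicted_completion(job, best) + migration_cost <= job.deadline:
        current.running.remove(job)
        best.running.append(job)
        job.placed_on = best
        return True
    return False


if __name__ == "__main__":
    nodes = [Node("gpu-a", speed=1.0), Node("gpu-b", speed=1.6)]
    jobs = [Job("resnet", work=8.0, deadline=10.0), Job("bert", work=12.0, deadline=9.0)]
    allocate(jobs, nodes)
    for job in jobs:
        print(job.name, "->", job.placed_on.name,
              "migrated:", maybe_migrate(job, nodes, migration_cost=1.0))
```

The key design point the sketch tries to convey is that placement and migration decisions are both driven by the same interference-adjusted prediction, and that migration is gated by both the overhead and the job's deadline, which is the intuition behind avoiding frequent, costly migrations.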
Updated: 2024-12-06