RAMP: A flat nanosecond optical network and MPI operations for distributed deep learning systems
Optical Switching and Networking (IF 1.9), Pub Date: 2023-08-17, DOI: 10.1016/j.osn.2023.100761
Alessandro Ottino, Joshua Benjamin, Georgios Zervas

Distributed deep learning (DDL) systems strongly depend on network performance. Current electronic packet-switched (EPS) network architectures and technologies suffer from variable-diameter topologies, low bisection bandwidth, and over-subscription, which degrade the completion time of communication and collective operations. We introduce a near-exascale, full-bisection bandwidth, all-to-all, single-hop, all-optical network architecture with nanosecond reconfiguration called RAMP, which supports large-scale distributed and parallel computing systems (12.8 Tbps per node for up to 65,536 nodes). For the first time, a custom RAMP-x MPI strategy and a network transcoder are proposed to run MPI collective operations across the optical circuit-switched (OCS) network in a schedule-less and contention-less manner. RAMP achieves a 7.6-171× speed-up in completion time across all MPI operations compared to realistic EPS and OCS counterparts. It can also deliver 1.3-16× and 7.8-58× reductions in Megatron and DLRM training time, respectively, while offering 38-47× and 6.4-26.5× improvements in energy consumption and cost, respectively.
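The "near-exascale" claim follows directly from the quoted figures: 12.8 Tb/s per node × 65,536 nodes ≈ 8.4 × 10^17 b/s ≈ 0.84 Eb/s of aggregate full-bisection bandwidth.

The MPI collectives that RAMP-x targets are the standard ones used for gradient synchronization in data-parallel training. Below is a minimal, generic mpi4py sketch of MPI_Allreduce, the dominant such collective; it illustrates the operation whose completion time the paper optimizes, not the RAMP-x schedule-less OCS implementation itself. The script name and array sizes are illustrative assumptions.

```python
# Generic sketch of MPI_Allreduce (NOT the paper's RAMP-x strategy):
# the collective that dominates gradient synchronization in data-parallel DDL.
# Requires mpi4py and NumPy; run with e.g.:
#   mpirun -n 4 python allreduce_sketch.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each rank holds a local gradient shard (toy size for illustration).
local_grad = np.full(4, float(rank), dtype=np.float64)
summed = np.empty_like(local_grad)

# All-reduce: every rank ends up with the element-wise sum across ranks.
comm.Allreduce(local_grad, summed, op=MPI.SUM)

# Averaging yields the synchronized gradient applied by every worker.
avg_grad = summed / comm.Get_size()
if rank == 0:
    print("averaged gradient:", avg_grad)
```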


