RAMP: A flat nanosecond optical network and MPI operations for distributed deep learning systems
Optical Switching and Networking (IF 1.9), Pub Date: 2023-08-17, DOI: 10.1016/j.osn.2023.100761
Alessandro Ottino, Joshua Benjamin, Georgios Zervas
Distributed deep learning (DDL) systems strongly depend on network performance. Current electronic packet switched (EPS) network architectures and technologies suffer from variable-diameter topologies, low bisection bandwidth, and over-subscription, which inflate the completion time of communication and collective operations. We introduce RAMP, a near-exascale, full-bisection-bandwidth, all-to-all, single-hop, all-optical network architecture with nanosecond reconfiguration that supports large-scale distributed and parallel computing systems (12.8 Tbps per node for up to 65,536 nodes). For the first time, a custom RAMP-x MPI strategy and a network transcoder are proposed to run MPI collective operations across the optical circuit switched (OCS) network in a schedule-less and contention-less manner. RAMP achieves a 7.6-171× speed-up in completion time across all MPI operations compared to realistic EPS and OCS counterparts. It also delivers a 1.3-16× and 7.8-58× reduction in Megatron and DLRM training time, respectively, while offering 38-47× and 6.4-26.5× improvements in energy consumption and cost, respectively.
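The schedule-less, contention-less collectives the abstract claims for the OCS fabric can be pictured as rounds of perfect matchings: in each round every node holds exactly one circuit, so no two transfers contend for a port, and a nanosecond-scale reconfiguration separates consecutive rounds. The Python sketch below is a toy illustration under these assumptions; the node count, timing constant, and function names are hypothetical and do not reflect the actual RAMP-x implementation.

```python
# Toy model of a single-hop, all-to-all exchange over an optical circuit
# switched (OCS) fabric. All names, node counts, and timing constants here
# are illustrative assumptions, not the paper's RAMP-x implementation.

NODES = 8              # scaled down from the paper's up-to-65,536 nodes
LINE_RATE_TBPS = 12.8  # per-node bandwidth quoted in the abstract
RECONFIG_NS = 1        # "nanosecond reconfiguration" (assumed value)

# Aggregate injection bandwidth at full scale, from the abstract's figures:
# 12.8 Tbps/node * 65,536 nodes ≈ 0.84 Ebps, hence "near-exascale".
print(f"Full-scale aggregate: {LINE_RATE_TBPS * 1e12 * 65_536 / 1e18:.2f} Ebps")

def all_to_all_rounds(n: int):
    """All-to-all traffic as n-1 rounds of perfect matchings.

    In round r, node i holds a circuit to node (i + r) % n. Each round is a
    bijection, so no two circuits share a port, and the pattern is fixed in
    advance, needing no runtime scheduler -- one way to realize the
    schedule-less, contention-less property described in the abstract.
    """
    for r in range(1, n):
        yield [(i, (i + r) % n) for i in range(n)]

rounds = list(all_to_all_rounds(NODES))
for r, matching in enumerate(rounds, start=1):
    assert len({dst for _, dst in matching}) == NODES  # contention-free check
    print(f"round {r}: {matching}")

# One OCS reconfiguration between consecutive rounds.
print(f"{len(rounds)} rounds, ~{(len(rounds) - 1) * RECONFIG_NS} ns reconfiguration overhead")
```

Because each round is a bijection on ports, total exchange time in such a model is dominated by transmission at the full 12.8 Tbps line rate rather than by switching or scheduling overhead, which is the intuition behind the completion-time speed-ups reported above.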