Automated parallel execution of distributed task graphs with FPGA clusters, Future Generation Computer Systems



Automated parallel execution of distributed task graphs with FPGA clusters
Future Generation Computer Systems (IF 6.2), Pub Date: 2024-06-28, DOI: 10.1016/j.future.2024.06.041
Juan Miguel de Haro Ruiz, Carlos Álvarez Martínez, Daniel Jiménez-González, Xavier Martorell, Tomohiro Ueno, Kentaro Sano, Burkhard Ringlein, François Abel, Beat Weiss

Over the years, Field Programmable Gate Arrays (FPGAs) have been gaining popularity in the High Performance Computing (HPC) field, because their reconfigurability enables very fine-grained optimizations at a low energy cost. However, the differing characteristics, architectures, and network topologies of FPGA clusters have hindered the use of FPGAs at a large scale. In this work, we present an evolution of OmpSs@FPGA, a high-level task-based programming model and extension to OmpSs-2, that aims at unifying all FPGA clusters by using a message-passing interface that is compatible with FPGA accelerators. These accelerators are programmed with C/C++ pragmas and synthesized with High-Level Synthesis tools. The new framework includes a custom protocol to exchange messages between FPGAs that is agnostic of the architecture and network type. On top of that, we present a new communication paradigm called Implicit Message Passing (IMP), where the user does not need to call any message-passing API. Instead, the runtime automatically infers data movement between nodes. We test classic message passing and IMP with three benchmarks on two different FPGA clusters. One is cloudFPGA, a disaggregated platform with AMD FPGAs that are connected to the network only through UDP/TCP/IP. The other is ESSPER, composed of CPU-attached Intel FPGAs that have a private network at the Ethernet level. In both cases, we demonstrate that IMP with OmpSs@FPGA can increase the productivity of FPGA programmers at a large scale by simplifying communication between nodes, without limiting the scalability of applications. We implement the N-body, Heat simulation, and Cholesky decomposition benchmarks, and show that FPGA clusters achieve 2.6x and 2.4x better performance per watt than a CPU-only supercomputer for N-body and Heat, respectively.
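
To make the idea behind Implicit Message Passing more concrete, the following is a minimal toy sketch in Python of how a runtime could derive inter-node transfers purely from the data each task declares it reads and writes. It is not the OmpSs@FPGA implementation or API: the `Task` class, the `infer_messages` function, the block names, and the "first valid copy" source-selection rule are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical task descriptor (not an OmpSs@FPGA type)."""
    name: str
    node: int            # node (e.g. FPGA) that executes the task
    reads: tuple = ()    # names of data blocks the task reads
    writes: tuple = ()   # names of data blocks the task writes


def infer_messages(tasks, initial_owner):
    """Return the (src, dst, block) transfers implied by the tasks' data accesses.

    Tasks are visited in program order. A transfer is emitted whenever a task
    reads a block for which its node holds no valid copy.
    """
    # For every block, the set of nodes currently holding a valid copy.
    location = {block: {node} for block, node in initial_owner.items()}
    messages = []
    for task in tasks:
        for block in task.reads:
            if task.node not in location[block]:
                src = min(location[block])  # assumption: pick the lowest-numbered holder
                messages.append((src, task.node, block))
                location[block].add(task.node)
        for block in task.writes:
            # A write makes the writer's copy the only valid one.
            location[block] = {task.node}
    return messages


tasks = [
    Task("update_left", node=0, reads=("A",), writes=("A",)),
    Task("update_right", node=1, reads=("B",), writes=("B",)),
    Task("exchange_halo", node=1, reads=("A", "B"), writes=("B",)),
    Task("reduce", node=0, reads=("A", "B"), writes=("C",)),
]
messages = infer_messages(tasks, {"A": 0, "B": 1, "C": 0})
for src, dst, block in messages:
    print(f"node {src} -> node {dst}: block {block}")
```

In this sketch the user code never names a send or a receive; the two transfers (block `A` to node 1, then block `B` to node 0) follow entirely from the declared accesses, which is the property the abstract attributes to IMP. A real runtime would also have to handle partial overlaps between regions, asynchronous execution, and the underlying network protocol, none of which is modeled here.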


Updated: 2024-06-28
