Linear algebra software for large-scale accelerated multicore computing
Acta Numerica (IF 16.3), Pub Date: 2016-05-27, DOI: 10.1017/s0962492916000015
A. Abdelfattah, H. Anzt, J. Dongarra, M. Gates, A. Haidar, J. Kurzak, P. Luszczek, S. Tomov, I. Yamazaki, A. YarKhan
Many crucial scientific computing applications, ranging from national security to medical advances, rely on high-performance linear algebra algorithms and technologies, underscoring their importance and broad impact. Here we present the state-of-the-art design and implementation practices for the acceleration of the predominant linear algebra algorithms on large-scale accelerated multicore systems. Examples are given with fundamental dense linear algebra algorithms – from the LU, QR, Cholesky, and LDL^T factorizations needed for solving linear systems of equations, to eigenvalue and singular value decomposition (SVD) problems. The implementations presented are readily available via the open-source PLASMA and MAGMA libraries, which represent the next-generation modernization of the popular LAPACK library for accelerated multicore systems.

To generate the extreme level of parallelism needed for the efficient use of these systems, algorithms of interest are redesigned and then split into well-chosen computational tasks. The task execution is scheduled over the computational components of a hybrid system of multicore CPUs with GPU accelerators and/or Xeon Phi coprocessors, using either static scheduling or light-weight runtime systems. The use of light-weight runtime systems keeps scheduling overheads low, similar to static scheduling, while enabling the expression of parallelism through sequential-like code. This simplifies the development effort and allows exploration of the unique strengths of the various hardware components. Finally, we emphasize the development of innovative linear algebra algorithms using three technologies – mixed precision arithmetic, batched operations, and asynchronous iterations – that are currently of high interest for accelerated multicore systems.
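To make the idea of splitting a factorization into tasks expressed through "sequential-like code" more concrete, here is a minimal sketch of a tiled, right-looking Cholesky factorization in plain Python. It is not the PLASMA or MAGMA API: the function names (`potrf`, `trsm`, `update`, `tile_cholesky`) only borrow LAPACK/BLAS kernel names for illustration, the tile size `nb` is a hypothetical parameter, and the tasks simply run in program order. A runtime system of the kind the abstract mentions would presumably take the same task sequence and run independent tasks concurrently, based on which tiles each task reads and writes.

```python
import math

def potrf(A, k0, k1):
    """Unblocked Cholesky of the diagonal tile A[k0:k1][k0:k1] (lower part)."""
    for j in range(k0, k1):
        A[j][j] = math.sqrt(A[j][j] - sum(A[j][p] ** 2 for p in range(k0, j)))
        for i in range(j + 1, k1):
            s = sum(A[i][p] * A[j][p] for p in range(k0, j))
            A[i][j] = (A[i][j] - s) / A[j][j]

def trsm(A, k0, k1, i0, i1):
    """Overwrite the tile A_ik with X, where X * L_kk^T = A_ik."""
    for i in range(i0, i1):
        for j in range(k0, k1):
            s = sum(A[i][p] * A[j][p] for p in range(k0, j))
            A[i][j] = (A[i][j] - s) / A[j][j]

def update(A, i0, i1, j0, j1, k0, k1):
    """Trailing update A_ij -= A_ik * A_jk^T, touching only the lower triangle."""
    for i in range(i0, i1):
        for j in range(j0, min(j1, i + 1)):
            A[i][j] -= sum(A[i][p] * A[j][p] for p in range(k0, k1))

def tile_cholesky(A, nb):
    """Factor SPD matrix A in place (lower triangle holds L); return the task list.

    The loop nest reads like sequential code, but every appended entry is one
    task that a runtime could schedule once its input tiles are ready.
    """
    n = len(A)
    tiles = [(s, min(s + nb, n)) for s in range(0, n, nb)]
    tasks = []
    for k, (k0, k1) in enumerate(tiles):
        potrf(A, k0, k1)
        tasks.append(("POTRF", k))
        for i in range(k + 1, len(tiles)):
            trsm(A, k0, k1, *tiles[i])
            tasks.append(("TRSM", i, k))
        for i in range(k + 1, len(tiles)):
            for j in range(k + 1, i + 1):
                update(A, *tiles[i], *tiles[j], k0, k1)
                tasks.append(("SYRK" if i == j else "GEMM", i, j, k))
    return tasks

# Usage: factor a small SPD matrix with 2x2 tiles and list the generated tasks.
A = [[4.0, 2.0, 0.0, 2.0],
     [2.0, 10.0, 3.0, 1.0],
     [0.0, 3.0, 5.0, 2.0],
     [2.0, 1.0, 2.0, 3.0]]
for task in tile_cholesky(A, nb=2):
    print(task)
```

In this sketch only the lower triangle of `A` is read and overwritten; the strictly upper part is left untouched. The TRSM tasks within one step `k` are independent of each other, as are the trailing updates, which is where a scheduler would find parallelism.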
Updated: 2016-05-27