GRAAFE: GRaph Anomaly Anticipation Framework for Exascale HPC systems
Future Generation Computer Systems (IF 6.2), Pub Date: 2024-06-21, DOI: 10.1016/j.future.2024.06.032
Martin Molan, Mohsen Seyedkazemi Ardebili, Junaid Ahmed Khan, Francesco Beneventi, Daniele Cesarini, Andrea Borghesi, Andrea Bartolini

The main limitations of applying predictive tools to large-scale supercomputers are the complexity of deploying Artificial Intelligence (AI) services in production and the difficulty of modeling heterogeneous data sources while preserving topological information in compact models. This paper proposes GRAAFE, a framework for continuously predicting compute node failures in the Marconi100 supercomputer. The framework consists of (i) an anomaly prediction model based on graph neural networks (GNNs) that leverages the nodes' physical layout in the compute room and (ii) a computationally efficient integration into Marconi100's ExaMon holistic monitoring system using Kubeflow, an MLOps framework for Kubernetes that enables continuous deployment of AI pipelines. The GRAAFE GNN model achieves an area under the curve (AUC) between 0.91 and 0.78, surpassing the state of the art (SoA), which achieves an AUC between 0.64 and 0.5. GRAAFE sustains anomaly prediction for all Marconi100 nodes every 120 s, requiring an additional 30% of CPU resources and less than 5% more RAM compared with monitoring alone.
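To make the idea of "leveraging the nodes' physical layout" more concrete, here is a minimal sketch in plain Python. It is not the GRAAFE model: it assumes, purely for illustration, that compute nodes sit in a line and that each node is linked to its immediate physical neighbours. The function names `build_line_graph`, `aggregate`, and `auc` are hypothetical. The sketch shows one round of neighbour averaging, the basic message-passing step that GNNs build on, and a pairwise definition of the AUC metric quoted in the abstract.

```python
from statistics import mean


def build_line_graph(num_nodes):
    """Adjacency list for an assumed layout: node i touches nodes i-1 and i+1."""
    return {
        i: [j for j in (i - 1, i + 1) if 0 <= j < num_nodes]
        for i in range(num_nodes)
    }


def aggregate(features, graph):
    """One message-passing round: average a node's value with its neighbours'."""
    return [
        mean([features[i]] + [features[j] for j in graph[i]])
        for i in range(len(features))
    ]


def auc(scores, labels):
    """AUC as the fraction of (anomalous, normal) pairs ranked correctly.

    Ties count as half a correct pair. Labels: 1 = anomalous, 0 = normal.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    graph = build_line_graph(4)
    # Made-up per-node sensor readings, for illustration only.
    print(aggregate([0.0, 3.0, 0.0, 3.0], graph))
    print(auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
```

A real GNN would replace the fixed average with learned weights and stack several such rounds. The sketch only shows why the graph structure matters: each node's representation ends up mixing in what its physical neighbours observe.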

Updated: 2024-06-21