Advancing anomaly detection in computational workflows with active learning
Future Generation Computer Systems (IF 6.2), Pub Date: 2024-12-04, DOI: 10.1016/j.future.2024.107608
Krishnan Raghavan, George Papadimitriou, Hongwei Jin, Anirban Mandal, Mariam Kiran, Prasanna Balaprakash, Ewa Deelman

A computational workflow, also known simply as a workflow, consists of tasks that are executed in a certain order to carry out a specific computational campaign. Computational workflows are commonly employed in science domains such as physics, chemistry, and genomics to complete large-scale experiments in distributed and heterogeneous computing environments. However, running computations at such a large scale makes workflow applications prone to failures and performance degradation, which can slow down or stall execution and ultimately lead to workflow failure. Learning how these workflows behave under normal and anomalous conditions can help us identify the causes of degraded performance and subsequently trigger appropriate actions to resolve them. Learning in such circumstances is challenging, however, because training accurate and reliable models requires a large volume of high-quality historical data. Generating such datasets takes considerable time and effort, and it also requires substantial resources to be devoted to data generation for training purposes. Active learning is a promising approach to this problem: the data is generated as required by the machine learning model, which can potentially reduce the training data needed to derive accurate models. In this work, we present an active learning approach that is supported by an experimental framework, Poseidon-X, which utilizes a modern workflow management system and two cloud testbeds. We evaluate our approach using three computational workflows. For one workflow we run an end-to-end live active learning experiment; for the other two we evaluate our active learning algorithms using pre-captured data traces provided by the Flow-Bench benchmark. Our findings indicate that active learning not only saves resources but also improves the accuracy of anomaly detection.
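
To illustrate the general idea of requesting data only where the model needs it, here is a minimal sketch of pool-based active learning with uncertainty sampling. It is not the Poseidon-X implementation or the authors' algorithm. Everything in it is an assumption made for illustration: the one-dimensional "runtime" feature, the hypothetical `oracle` function (standing in for running a workflow and observing whether it behaved anomalously), the true threshold of 60, and the simple midpoint classifier.

```python
import random


def oracle(runtime, true_threshold=60.0):
    """Hypothetical labeler: stands in for executing a workflow and
    observing its outcome. Returns True when the run is anomalous."""
    return runtime > true_threshold


def fit_threshold(labeled):
    """Estimate the decision boundary as the midpoint between the largest
    labeled normal value and the smallest labeled anomalous value."""
    normal = [x for x, is_anomalous in labeled if not is_anomalous]
    anomalous = [x for x, is_anomalous in labeled if is_anomalous]
    return (max(normal) + min(anomalous)) / 2.0


def active_learning(pool, budget):
    """Label at most `budget` extra points, always choosing the unlabeled
    point the current model is least certain about."""
    pool = sorted(pool)
    # Seed with the two extremes, assumed here to cover both classes.
    labeled = [(pool[0], oracle(pool[0])), (pool[-1], oracle(pool[-1]))]
    unlabeled = pool[1:-1]
    for _ in range(budget):
        if not unlabeled:
            break
        threshold = fit_threshold(labeled)
        # Uncertainty sampling: query the point closest to the boundary.
        query = min(unlabeled, key=lambda x: abs(x - threshold))
        unlabeled.remove(query)
        labeled.append((query, oracle(query)))
    return fit_threshold(labeled), len(labeled)


random.seed(0)
pool = [random.uniform(0.0, 120.0) for _ in range(500)]
estimate, labels_used = active_learning(pool, budget=12)
print(f"estimated threshold: {estimate:.1f} using {labels_used} of {len(pool)} labels")
```

In this toy setting the learner labels only 14 of the 500 candidate points, because each query is placed where the current boundary estimate is most uncertain. A real workflow setting would involve many features, noisy labels, and a far more expensive labeling step.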

Updated: 2024-12-04