Formal definition and implementation of reproducibility tenets for computational workflows
Future Generation Computer Systems (IF 6.2), Pub Date: 2024-12-20, DOI: 10.1016/j.future.2024.107684
Nicholas J. Pritchard, Andreas Wicenec

Computational workflow management systems power contemporary data-intensive sciences. The slowly resolving reproducibility crisis presents both a sobering warning and an opportunity to iterate on what science and data processing entail. The Square Kilometre Array (SKA), the world’s largest radio telescope, is among the most extensive scientific projects underway and presents grand scientific collaboration and data-processing challenges. In this work, we aim to improve the ability of workflow management systems to facilitate reproducible, high-quality science. This work presents a scale- and system-agnostic computational workflow model and extends five well-known reproducibility concepts into seven well-defined tenets for this workflow model. Additionally, we present a method to construct workflow execution signatures using cryptographic primitives in amortized constant time. We combine these three concepts and provide a concrete implementation in the Data Activated Flow Graph Engine (DALiuGE), a workflow management system for the SKA, to embed specific provenance information into workflow signatures, demonstrating the possibility of facilitating automatic formal verification of scientific quality in amortized constant time. We validate our approach with a simple yet representative astronomical processing task: filtering a noisy signal with a lowpass filter using CPU and GPU methods. This example shows the practicality and efficacy of combining formal tenet definitions with a workflow signature generation mechanism. Our framework spans a formal UML specification, principled provenance information collection based on the reproducibility tenets, and a concrete example implementation in DALiuGE; it illuminates otherwise obscure scientific discrepancies and similarities between principally identical workflow executions.
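
The signature idea in the abstract can be illustrated with a minimal sketch. This is not the DALiuGE implementation: it assumes a Merkle-style scheme in which each workflow component is hashed once, together with the hashes of its direct predecessors, so the work per component does not grow with the size of the graph. The names `node_hash`, `workflow_signature`, `make_run`, the `fields` parameter, and the provenance keys are all hypothetical. Restricting `fields` stands in for choosing which provenance a given tenet cares about.

```python
import hashlib
import json
from graphlib import TopologicalSorter


def node_hash(record, parent_hashes):
    """Hash one component's provenance record together with its parents' hashes."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(payload)
    for parent in sorted(parent_hashes):  # sorted so parent order cannot matter
        digest.update(bytes.fromhex(parent))
    return digest.hexdigest()


def workflow_signature(nodes, preds, fields):
    """Fold a workflow DAG into a single hex signature.

    nodes:  name -> provenance dict (hypothetical keys)
    preds:  name -> set of predecessor names
    fields: provenance keys included in the signature (assumed tenet selection)
    """
    graph = {name: preds.get(name, set()) for name in nodes}
    hashes = {}
    # Predecessors come first, so every parent hash exists when it is needed.
    for name in TopologicalSorter(graph).static_order():
        record = {k: v for k, v in nodes[name].items() if k in fields}
        hashes[name] = node_hash(record, [hashes[p] for p in graph[name]])
    # Combine the sinks (components with no successors) into one root hash.
    has_successor = set().union(*graph.values())
    sinks = sorted(name for name in nodes if name not in has_successor)
    root = hashlib.sha256()
    for name in sinks:
        root.update(bytes.fromhex(hashes[name]))
    return root.hexdigest()


def make_run(backend):
    """Toy lowpass-filter workflow; only the backend differs between runs."""
    nodes = {
        "signal": {"kind": "data", "summary": "noisy-sine"},
        "lowpass": {"kind": "app", "task": "lowpass", "backend": backend},
        "filtered": {"kind": "data", "summary": "clean-sine"},
    }
    preds = {"lowpass": {"signal"}, "filtered": {"lowpass"}}
    return nodes, preds


if __name__ == "__main__":
    cpu_nodes, preds = make_run("cpu")
    gpu_nodes, _ = make_run("gpu")
    scientific = {"kind", "task", "summary"}
    computational = scientific | {"backend"}
    # Same signature when the backend is ignored, different when it is included.
    print(workflow_signature(cpu_nodes, preds, scientific)
          == workflow_signature(gpu_nodes, preds, scientific))
    print(workflow_signature(cpu_nodes, preds, computational)
          == workflow_signature(gpu_nodes, preds, computational))
```

In this toy setup, the CPU and GPU runs compare equal under the narrower field set and differ once the backend is included, which mirrors the abstract's point that signatures can expose both similarities and discrepancies between principally identical executions.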

Updated: 2024-12-20