A systematic survey on fault-tolerant solutions for distributed data analytics: Taxonomy, comparison, and future directions, Computer Science Review



Computer Science Review (IF 13.3), Pub Date: 2024-08-05, DOI: 10.1016/j.cosrev.2024.100660
Sucharitha Isukapalli, Satish Narayana Srirama

Fault tolerance is becoming increasingly important for upcoming exascale systems that support distributed data processing, because the Mean Time Between Failures (MTBF) is expected to decrease. Addressing the fault tolerance challenge is crucial to ensuring the availability, reliability, dependability, and performance of the system. Fault tolerance aims to keep the distributed system running, possibly at reduced capacity, and to avoid complete data loss even in the presence of faults, with minimal impact on system performance. This survey aims to provide a detailed understanding of the importance of fault tolerance in distributed systems, including a classification of faults, errors, failures, and fault-tolerant techniques (reactive, proactive, and predictive). We collected a corpus of 490 papers published from 2014 to 2023 by searching the Scopus, IEEE Xplore, Springer, and ACM Digital Library databases. After a systematic review, 17 reactive models, 17 proactive models, and 14 predictive models were shortlisted and compared. For each of these categories of fault-tolerant solutions, a taxonomy of the ideas behind the proposed models was also created. Additionally, the survey examines how fault tolerance capability is incorporated into popular big data processing tools such as Apache Hadoop, Spark, and Flink. Finally, promising future research directions in this domain are discussed.


Updated: 2024-08-05
