Caching or not: An online cost optimization algorithm for geodistributed data analysis in cloud environments
Journal of Network and Computer Applications (IF 7.7), Pub Date: 2024-06-22, DOI: 10.1016/j.jnca.2024.103942
Weitao Yang, Li Pan, Shijun Liu

With the wide application of big data technology, large volumes of data are generated every day and stored in data centers across different geographic regions, waiting to be analyzed by big data tasks. Examples of such data analysis tasks include weather prediction and intelligent healthcare applications. More and more enterprises are using clouds because of their nearly infinite resources, ease of scaling, and other characteristics. Organizations often rent StaaS (Storage-as-a-Service) products offered by cloud providers, such as OSS (Object Storage Service), for massive data storage, while building a big data cluster in the cloud to analyze the data collected from various regions. However, when the data to be analyzed are not located in the rented cluster, processing the distributed input data stored in clouds efficiently and economically becomes a pressing problem. A simple approach is to cache only the frequently accessed data from other regions in the cluster, rather than all of it, to reduce total traffic costs. However, it is generally very hard to predict future data access patterns, so a rash caching decision may incur higher costs. To address this problem, in this paper we propose an online algorithm that guides cloud users in making cost-effective caching decisions without requiring any future information. We prove theoretically that the competitive ratio of our online algorithm is less than 2. Finally, we verify the effectiveness of the proposed algorithm through extensive experiments based on the real prices of Alibaba's public IaaS cloud products, using both real-world Yahoo S2 data and synthesized datasets.
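
To illustrate what a competitive ratio below 2 means for a cache-or-not decision, here is a minimal sketch of the classic break-even (ski-rental) rule applied to a single data item. It is not the algorithm proposed in the paper, which the abstract does not describe. The function names, the flat per-access transfer price, the one-time caching price, and the assumption that reads are free once the item is cached are all hypothetical simplifications chosen for illustration.

```python
def online_cost(num_accesses: int, transfer_cost: float, cache_cost: float) -> float:
    """Total cost of the break-even rule, decided without future knowledge.

    Assumed model (hypothetical, not from the paper): each remote read costs
    transfer_cost, caching costs cache_cost once, and cached reads are free.
    """
    spent = 0.0
    for _ in range(num_accesses):
        # Cache as soon as one more remote read would bring the money
        # spent on remote reads up to the price of caching.
        if spent + transfer_cost >= cache_cost:
            return spent + cache_cost
        spent += transfer_cost
    return spent


def offline_cost(num_accesses: int, transfer_cost: float, cache_cost: float) -> float:
    """Cost of the optimal decision when the number of accesses is known in advance."""
    return min(num_accesses * transfer_cost, cache_cost)


if __name__ == "__main__":
    # Compare the online rule with the offline optimum for several access counts.
    for n in (5, 10, 1000):
        on = online_cost(n, 1, 10)
        off = offline_cost(n, 1, 10)
        print(n, on, off, on / off)
```

Under this simplified model the online rule never spends as much as the caching price on remote reads before it caches, so its total cost stays strictly below twice the offline optimum, whatever the access sequence turns out to be.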

Updated: 2024-06-22