Cost-Minimizing Online Algorithms for Geo-Distributed Data Analytics

Modern enterprises often manage geographically distributed datacenters around the globe. In such environment, datasets are naturally collected and stored in different data centers and were later queried for complex analytics. In this paper, we study the Wide-Area Data Analytics problem, which aims t...

Full description

Bibliographic Details
Main Authors: Jiao Huang, Jing Huang, Shang Gao, Bo Yang
Format: Article
Language:English
Published: IEEE 2019-01-01
Series:IEEE Access
Subjects:
Online Access:https://ieeexplore.ieee.org/document/8891688/
Description
Summary:Modern enterprises often manage geographically distributed datacenters around the globe. In such environment, datasets are naturally collected and stored in different data centers and were later queried for complex analytics. In this paper, we study the Wide-Area Data Analytics problem, which aims to efficiently control data movements and achieve low latency for overall queries processing, both constrained by limited and expensive network resources across datacenters. Previous papers focus on offline settings of single analytical queries and do not consider time in optimizing system performance, and therefore ignores the dynamics of data and task placement in terms of inter-DC bandwidth utilization. In this paper, we consider the online setting and formulate a cost-minimizing optimization problem over time for arbitrary Directed Acyclic Graph query processing. Considering dynamics of network resource usage, we developed two online algorithms, Online Switch Resist (OSR) and Most Fixed Horizon Control (MFHC) with good competitive ratios. We performed extensive simulations and comparative studies using the TPC-CH benchmark and verified the efficacy of proposed algorithms. The algorithm we proposed is better than the existing algorithm, and its performance approximates the theoretical optimal value.
ISSN:2169-3536