TY - JOUR
T1 - Efficient Data Placement and Replication for QoS-Aware Approximate Query Evaluation of Big Data Analytics
AU - Xia, Qiufen
AU - Xu, Zichuan
AU - Liang, Weifa
AU - Yu, Shui
AU - Guo, Song
AU - Zomaya, Albert Y.
N1 - Publisher Copyright:
© 1990-2012 IEEE.
PY - 2019/12/1
Y1 - 2019/12/1
N2 - Enterprise users at different geographic locations generate large-volume data that is stored at different geographic datacenters. These users may also perform big data analytics on the stored data to identify valuable information in order to make strategic decisions. However, it is well known that performing big data analytics on data in geographical-located datacenters usually is time-consuming and costly. In some delay-sensitive applications, the query result may become useless if answering a query takes too long time. Instead, sometimes users may only be interested in timely approximate rather than exact query results. When such approximate query evaluation is the case, applications must sacrifice timeliness to get more accurate evaluation results or tolerate evaluation result with a guaranteed error bound obtained from analyzing the samples of the data to meet their stringent timeline. In this paper, we study quality-of-service (QoS)-aware data replication and placement for approximate query evaluation of big data analytics in a distributed cloud, where the original (source) data of a query is distributed at different geo-distributed datacenters. We focus on the problems of placing data samples of the source data at some strategic datacenters to meet stringent query delay requirements of users, by exploring a non-trivial trade-off between the cost of query evaluation and the error bound of the evaluation result. We first propose an approximation algorithm with a provable approximation ratio for a single approximate query. We then develop an efficient heuristic algorithm for evaluating a set of approximate queries with the aim to minimize the evaluation cost while meeting the delay requirements of these queries. We finally demonstrate the effectiveness and efficiency of the proposed algorithms through both experimental simulations and implementations in a real test-bed, real datasets are employed. Experimental results show that the proposed algorithms are promising.
AB - Enterprise users at different geographic locations generate large-volume data that is stored at different geographic datacenters. These users may also perform big data analytics on the stored data to identify valuable information in order to make strategic decisions. However, it is well known that performing big data analytics on data in geographical-located datacenters usually is time-consuming and costly. In some delay-sensitive applications, the query result may become useless if answering a query takes too long time. Instead, sometimes users may only be interested in timely approximate rather than exact query results. When such approximate query evaluation is the case, applications must sacrifice timeliness to get more accurate evaluation results or tolerate evaluation result with a guaranteed error bound obtained from analyzing the samples of the data to meet their stringent timeline. In this paper, we study quality-of-service (QoS)-aware data replication and placement for approximate query evaluation of big data analytics in a distributed cloud, where the original (source) data of a query is distributed at different geo-distributed datacenters. We focus on the problems of placing data samples of the source data at some strategic datacenters to meet stringent query delay requirements of users, by exploring a non-trivial trade-off between the cost of query evaluation and the error bound of the evaluation result. We first propose an approximation algorithm with a provable approximation ratio for a single approximate query. We then develop an efficient heuristic algorithm for evaluating a set of approximate queries with the aim to minimize the evaluation cost while meeting the delay requirements of these queries. We finally demonstrate the effectiveness and efficiency of the proposed algorithms through both experimental simulations and implementations in a real test-bed, real datasets are employed. Experimental results show that the proposed algorithms are promising.
KW - Data replication and placement
KW - algorithm analysis
KW - approximate query evaluation
KW - approximation algorithms
KW - big data analytics
UR - http://www.scopus.com/inward/record.url?scp=85075111334&partnerID=8YFLogxK
U2 - 10.1109/TPDS.2019.2921337
DO - 10.1109/TPDS.2019.2921337
M3 - Article
SN - 1045-9219
VL - 30
SP - 2677
EP - 2691
JO - IEEE Transactions on Parallel and Distributed Systems
JF - IEEE Transactions on Parallel and Distributed Systems
IS - 12
M1 - 8732398
ER -