Profit-oriented task scheduling algorithm in Hadoop cluster
 Xuqing Chai^{1}, Yongliang Dong^{2} and Junfei Li^{3}
https://doi.org/10.1186/s13639-016-0026-x
© Chai et al. 2016
Received: 25 November 2015
Accepted: 24 February 2016
Published: 29 March 2016
Abstract
Nowadays, many enterprises provide cloud services based on their own Hadoop clusters. Because the resources of a Hadoop cluster are limited, the cluster must select specific tasks to which it allocates its limited resources in order to obtain the maximal profit. In this paper, we study the maximal profit problem for a given candidate task set. We describe the candidate task set with a valid sequence and propose a sequence-based scheduling strategy. In order to improve the efficiency of finding a valid sequence, we design several pruning strategies and give the corresponding scheduling algorithm. Finally, we propose a timeout handling algorithm for tasks that run past their deadlines. Experiments show that the total profit of the proposed algorithm is very close to the ideal maximum and clearly higher than that of related scheduling algorithms under different experimental settings.
1 Introduction
With the rapid development of computer networks and sensor networks, data volumes are increasing exponentially, especially on the Internet. In order to process large-scale data efficiently, a parallel and distributed cluster with good scalability, flexibility, and fault tolerance is needed. The MapReduce architecture [1], proposed by Google, applies a divide-and-conquer method to data-intensive tasks and is the de facto standard in the big data field. Research on MapReduce has attracted more and more researchers and engineers. Google uses a large-scale cluster running MapReduce and its related techniques, such as GFS [2] and Bigtable [3], to handle hundreds of petabytes of data every week. Based on the results of analyzing these data, it provides a series of services to people around the world, such as search, Google Earth, advertisements, and so on.
Hadoop [4, 5], contributed by Yahoo!, is the open-source implementation of MapReduce and its related techniques. Hadoop has been studied extensively in both academia and industry and has been deployed in many enterprises. Currently, many IT enterprises build Hadoop/MapReduce clusters and provide all kinds of cloud services to customers. For a modest fee, customers can use a powerful Hadoop/MapReduce cluster on demand. In this kind of service, the terms between enterprises and customers are usually described by a service level agreement (SLA) [6, 7]. SLAs are usually of two kinds: pricing for quantity and pricing for effectiveness. A pricing-for-quantity SLA charges customers in proportion to the scale of the hardware and the service time. A pricing-for-effectiveness SLA charges customers according to the service effectiveness. Taking the spam email detection service [8] as an example, the service must be finished within a certain time, so money is paid only if the service finishes within the required time.
In this paper, we study how to schedule customers’ tasks to maximize the total profit of a Hadoop cluster. We mainly focus on timed MapReduce tasks, which are priced for effectiveness in time, i.e., tasks must be finished within a given time. We abstract each task into four parts, i.e., user-defined Map/Reduce functions, time to complete, profit, and penalty, and we try to find a scheduling algorithm that maximizes the total profit of the Hadoop cluster.
The rest of the paper is organized as follows. In Section 1.1, we briefly describe the MapReduce programming environment and review related work about scheduling algorithms in MapReduce/Hadoop. In Section 1.2, we formalize the problem of maximal profit. In Section 1.3, we propose a sequence-based scheduling strategy and present a corresponding scheduling algorithm. Experiments and conclusions are given in Sections 1.4 and 2, respectively.
1.1 Background and related works
In this section, we give a short introduction to MapReduce and then review related works about task scheduling in MapReduce.
1.1.1 Background
MapReduce is a popular programming model for data-intensive tasks and has been widely used in many fields [9–14]. Hadoop is an open-source implementation of MapReduce, and a Hadoop cluster can be made up of thousands of commodity computers. The Hadoop cluster runs on top of the Hadoop distributed file system (HDFS). In the HDFS, data are partitioned into many small chunks, and each chunk has multiple backup copies. The multiple-backup-copy mechanism of HDFS makes the running MapReduce tasks fault-tolerant.
1.1.2 Related works
In MapReduce, there are several general-purpose task schedulers, such as the FIFO scheduler [15], the capacity-based scheduler [16], and the fairness-based scheduler [17]. Concerning specific applications, Sandholm and Lai [18] proposed a scheduling algorithm that allows users to adjust the required computing resources dynamically according to the importance of MapReduce tasks; Zaharia et al. [19] proposed a scheduling algorithm for heterogeneous cluster environments; and Kwon et al. [20] proposed the SkewTune algorithm for dealing with skew in the processing of MapReduce tasks.
In addition, there are several scheduling algorithms that concern MapReduce tasks that must be finished within a given time. Polo et al. [21] proposed a performance-driven task co-scheduling algorithm, which estimates the required finish time of each task and preferentially allocates resources to the tasks that otherwise could not be completed on time. Kc and Anyanwu [22] proposed a deadline constraint (DC) scheduler, which tries to allocate a fixed number of Map jobs to each task according to the size of the task and assumes that each task can utilize all job slots in the Reduce step. The workload complementary (WC) scheduling mechanism, proposed by Verma et al. [23, 24], tries to allocate a fixed number of both Map and Reduce jobs to each task according to the size of the task while minimizing the number of job slots used by each task.
1.2 Problem statement
Assume the Hadoop cluster has M Map task slots and R Reduce task slots in total. Each task j in the candidate task set J is described by four attributes:

- j.N _{ m }, the number of Map jobs in j;
- j.N _{ r }, the number of Reduce jobs in j (in order to get high efficiency, both j.N _{ m } and j.N _{ r } are integer multiples of M);
- j.deadline, the required finish time of j;
- j.profit, the profit of j if it finishes before its deadline; note that if j does not finish before its deadline, the penalty of j is j.profit ⋅ α.

For a set of accepted tasks A, the total profit of the cluster is

\( \mathrm{Profit}(A)={\sum}_{j\in A}\left[E(j)\cdot j.\mathrm{profit}-\left(1-E(j)\right)\cdot \alpha \cdot j.\mathrm{profit}\right] \) (1)

where E(⋅) indicates whether or not the given task is effective, i.e., E(j) = 1 if j finishes before j.deadline and E(j) = 0 otherwise.
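To make the objective concrete, the following minimal sketch evaluates the total profit of a set of accepted tasks; the task tuples and the value of α are hypothetical:

```python
# Sketch: evaluating the total-profit objective for a fixed schedule.
# Each task is (finish_time, deadline, profit); alpha is the penalty ratio.
def total_profit(tasks, alpha):
    total = 0.0
    for finish_time, deadline, profit in tasks:
        if finish_time <= deadline:   # E(j) = 1: the task is effective
            total += profit
        else:                         # E(j) = 0: pay the penalty profit * alpha
            total -= profit * alpha
    return total

# Hypothetical example: two tasks finish on time, one times out.
tasks = [(50, 60, 100.0), (90, 80, 40.0), (30, 45, 70.0)]
print(total_profit(tasks, alpha=0.5))  # 100 - 40*0.5 + 70 = 150.0
```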
1.3 The proposed scheduling algorithm
In this section, we first propose a sequence-based task scheduling strategy, then present a scheduling algorithm based on that strategy, and finally describe an approach for handling timeouts.
1.3.1 Sequencebased scheduling strategy
For each task j ∈ J, we can estimate its average processing time for Map jobs, j. T _{ m }, and its average processing time for Reduce jobs, j. T _{ r }. If all task slots are used to process task j, then it needs TC _{ m }(j) = ⌈j. N _{ m }/M⌉ × j. T _{ m } time to finish all Map jobs and needs TC _{ r }(j) = ⌈j. N _{ r }/M⌉ × j. T _{ r } time to finish all Reduce jobs.
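The completion-cost estimates TC _{ m }(j) and TC _{ r }(j) can be sketched as follows (function names and the example values are hypothetical; M is the number of task slots, as in the text):

```python
import math

# Sketch: per-task Map/Reduce completion costs when all M task slots
# process one task at a time.
def tc_map(n_map_jobs, t_map_avg, M):
    # ceil(N_m / M) "waves" of Map jobs, each wave taking T_m on average
    return math.ceil(n_map_jobs / M) * t_map_avg

def tc_reduce(n_reduce_jobs, t_reduce_avg, M):
    return math.ceil(n_reduce_jobs / M) * t_reduce_avg

# Hypothetical task: 160 Map jobs and 80 Reduce jobs on 80 slots.
print(tc_map(160, 12.0, 80))    # 2 waves * 12.0 = 24.0
print(tc_reduce(80, 30.0, 80))  # 1 wave  * 30.0 = 30.0
```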
Definition 1. Sequence
For a task set JS (the number of tasks is |JS|), a sequence S is a permutation of all tasks in JS, and it specifies the order of jobs according to their finished times. If the finished time of j in the Map step is COT _{ m }(j), then for a given sequence S = {j _{1}, j _{2}, …, j _{|JS|}}, let COT _{ m }(j _{ i }) < COT _{ m }(j _{ i + 1}) for ∀ j _{ i } ∈ S (0 < i < |JS|).
Based on a given sequence S, the proposed scheduling strategy allocates jobs as follows:

- Map: When an idle task slot requests a Map job, select a Map job of the first task in sequence S. When all Map jobs of the first task in S have been allocated, remove that task from S.
- Reduce: Sort the tasks in JS in increasing order of deadline to obtain a sorted queue \( {L}_d=\left\{{j}_1^{\prime },{j}_2^{\prime },\dots, {j}_{|JS|}^{\prime}\right\} \). When an idle task slot requests a Reduce job, search L _{ d } in order for a task whose Map jobs have all finished, and select a Reduce job of that task.

Under this strategy, the finished times of the Map and Reduce steps are computed as follows:

- Given a sequence S = {j _{1}, j _{2}, …, j _{|JS|}}, for ∀ j _{ i } ∈ S, COT _{ m }(j _{ i }) = COT _{ m }(j _{ i − 1}) + TC _{ m }(j _{ i }) = ∑_{ k ∈ [1,i]} TC _{ m }(j _{ k }).
- Given JS and L _{ d }, for the first task \( {j}_1^{\prime } \) in L _{ d }, the finished time of its Map step, \( CO{T}_m\left({j}_1^{\prime}\right) \), can be calculated with the above method; then \( CO{T}_r\left({j}_1^{\prime}\right)=CO{T}_m\left({j}_1^{\prime}\right)+T{C}_r\left({j}_1^{\prime}\right) \), and we tag the time slice \( \left[CO{T}_m\left({j}_1^{\prime}\right),CO{T}_r\left({j}_1^{\prime}\right)\right] \) as occupied. For the ith task \( {j}_i^{\prime } \) in L _{ d }, we first compute \( CO{T}_m\left({j}_i^{\prime}\right) \) and then, beginning at the moment \( CO{T}_m\left({j}_i^{\prime}\right) \), find a series of unoccupied time slices whose total length is \( T{C}_r\left({j}_i^{\prime}\right) \) and tag them as occupied. The finished time of the Reduce step of \( {j}_i^{\prime } \), i.e., \( CO{T}_r\left({j}_i^{\prime}\right) \), is the end of the latest of these time slices.
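The Reduce finished-time computation can be sketched as follows (a simplified illustration with hypothetical values; marking the newly claimed slices as occupied is left to the caller):

```python
# Sketch of the finished-time computation for a Reduce step: starting at
# COT_m(j), greedily claim unoccupied time until TC_r(j) has been accumulated.
# `occupied` is a sorted, non-overlapping list of (start, end) intervals.
def reduce_finish_time(cot_m, tc_r, occupied):
    t = cot_m
    remaining = tc_r
    for start, end in occupied:
        if end <= t:
            continue                      # interval lies entirely behind us
        free = max(0.0, start - t)        # free gap before this interval
        if free >= remaining:
            return t + remaining
        remaining -= free
        t = end                           # jump over the occupied interval
    return t + remaining                  # rest of the timeline is free

# Hypothetical: Map step done at t=10, Reduce needs 8 time units,
# and [12, 15] is already occupied by an earlier task's Reduce jobs.
print(reduce_finish_time(10.0, 8.0, [(12.0, 15.0)]))  # 10->12 (2) + 15->21 (6) => 21.0
```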
Based on the above scheduling strategy and the computation of finished time, we give the definition of a valid sequence.
Definition 2. Valid sequence
Given a task set JS and any sequence S, if for ∀ j _{ i } ∈ JS we have COT _{ r }(j _{ i }) ≤ j _{ i }.deadline under the above scheduling strategy, then S is a valid sequence.
Theorem 1
For a task set JS and a sequence S, the proposed scheduling strategy ensures that, for ∀ j _{ i } ∈ S, j _{ i } finishes its Map step in minimal time under the constraint of S.
Proof. Given the sequence S = {j _{1}, j _{2}, …, j _{|S|}} and its ith task j _{ i }, S ensures that the Map step of j _{ i } starts only after the Map steps of all j _{ k } (1 ≤ k < i) finish, i.e., the earliest finished time of the Map step of j _{ i } is ∑_{ k ∈ [1,i]} TC _{ m }(j _{ k }). At the same time, when jobs are allocated by the proposed scheduling strategy, the finished time of the Map step of j _{ i }, COT _{ m }(j _{ i }), also equals ∑_{ k ∈ [1,i]} TC _{ m }(j _{ k }). The same conclusion holds for every j _{ i } ∈ S.
Theorem 2
For a task set JS and a sequence S, if some task times out under the proposed scheduling strategy, then no scheduling strategy whatsoever can finish all tasks in JS on time; thus, S cannot be a valid sequence.
Proof. Suppose some task j times out. There are two situations:

- COT _{ m }(j) + TC _{ r }(j) > j.deadline. Even if the Reduce step of j ran immediately after its Map jobs finished, j could not finish on time; hence, whatever scheduling strategy is used, j cannot finish on time.
- COT _{ m }(j) + TC _{ r }(j) ≤ j.deadline and COT _{ r }(j) > j.deadline. The Reduce step of j finishes later than its deadline because some periods in the time slice [COT _{ m }(j), COT _{ r }(j)] are occupied by the Reduce jobs of other tasks whose deadlines are earlier than j.deadline. Among these tasks, select the one whose Map step finishes earliest and denote it as j′. For all tasks whose Reduce steps run in the time slice [COT _{ m }(j′), COT _{ r }(j)], check whether any of them has a deadline earlier than j.deadline; if so, repeat the above process until we find a final task j _{ f } such that every task whose Reduce step runs in the time slice [COT _{ m }(j _{ f }), COT _{ r }(j)] has a deadline no later than j.deadline. Obviously, there is no idle time in the time slice [COT _{ m }(j _{ f }), COT _{ r }(j)]. Therefore, if some other scheduling strategy made j finish on time, some other task in this time slice would have to time out.

In both of the above situations, it is impossible to finish all tasks in JS on time, so S cannot be a valid sequence.
Based on Theorems 1 and 2, we conclude that the proposed scheduling strategy is optimal for a fixed sequence S. That is, if a task times out under the proposed strategy, it would also time out under any other scheduling strategy.
1.3.2 Scheduling algorithm
Based on the proposed sequence-based scheduling strategy, we design a scheduling algorithm. First, when the candidate task set is static, we use a scoring strategy to assign priorities to all tasks, apply an efficient pruning strategy to find the set of acceptable tasks, and then find a valid sequence. Second, when the candidate task set is updated dynamically, we use an incremental method to maintain the set of acceptable tasks and update the valid sequence when necessary.
For a candidate task set J, we need to find the set of acceptable tasks \( A=\left\{{j}_1^{\prime },{j}_2^{\prime },\dots, {j}_{|A|}^{\prime}\right\} \), ascertain a valid sequence of A, and thereby maximize the total profit. However, there are 2^{|J|} different acceptable subsets of J, and for an acceptable set A, there are still |A|! different sequences.
To assign priorities, we score each task according to two factors:

- As the capacity of the Hadoop cluster is limited, in order to maximize the total profit, the tasks with a bigger profit ratio should be accepted first. For ∀ j ∈ J, the consumed system time can be quantified as STC(j) = TC _{ m }(j) ⋅ M/(M + R) + TC _{ r }(j) ⋅ R/(M + R), and the profit ratio of j is P _{ r }(j) = j.profit/STC(j), i.e., the profit per second when running j.
- In MapReduce, if some task j runs for too long, most task slots will be idle while the Map/Reduce jobs of j run. This wastes many resources and hinders the acceptance of other tasks.

Combining the two factors, the score of j is defined as

\( \mathrm{Score}(j)=j.\mathrm{profit}/\left(\mathrm{STC}(j)\cdot Ad(j)\right) \) (2)

where Ad(j) is the adjusting coefficient of j and STC(j) ⋅ Ad(j) is the adjusted time of j. By Eq. 2, a task with a higher score is given a higher priority.
Let the total consumed time of Map jobs for all j ∈ J be Total TC _{ m } = ∑_{ j ∈ J } TC _{ m }(j). For task j, compute the average consumed time of Map jobs over the other tasks, \( \overline{T}{\overline{C}}_m(j)=\left(\mathrm{Total}\kern0.5em T{C}_m-T{C}_m(j)\right)/\left(|J|-1\right) \). Given a penalty threshold β (β > 1), if \( T{C}_m(j)>\overline{T}{\overline{C}}_m(j)\cdot \beta \), then we consider the Map step of j too long; likewise, if \( T{C}_r(j)>\overline{T}{\overline{C}}_r(j)\cdot \beta \), then we consider the Reduce step of j too long. The adjusting coefficient Ad(j) for task j is computed as follows:
If \( T{C}_m(j)>\overline{T}{\overline{C}}_m(j)\cdot \beta \) and \( T{C}_r(j)>\overline{T}{\overline{C}}_r(j)\cdot \beta \),
then \( Ad(j)=\frac{T{C}_m(j)-\overline{T}{\overline{C}}_m(j)\cdot \beta }{T{C}_m(j)}\cdot \frac{M}{M+R}+\frac{T{C}_r(j)-\overline{T}{\overline{C}}_r(j)\cdot \beta }{T{C}_r(j)}\cdot \frac{R}{M+R}+1 \);
if \( T{C}_m(j)>\overline{T}{\overline{C}}_m(j)\cdot \beta \) and \( T{C}_r(j)\le \overline{T}{\overline{C}}_r(j)\cdot \beta \),
then \( Ad(j)=\frac{T{C}_m(j)-\overline{T}{\overline{C}}_m(j)\cdot \beta }{T{C}_m(j)}\cdot \frac{M}{M+R}+1 \);
if \( T{C}_m(j)\le \overline{T}{\overline{C}}_m(j)\cdot \beta \) and \( T{C}_r(j)>\overline{T}{\overline{C}}_r(j)\cdot \beta \),
then \( Ad(j)=\frac{T{C}_r(j)-\overline{T}{\overline{C}}_r(j)\cdot \beta }{T{C}_r(j)}\cdot \frac{R}{M+R}+1 \);
and if \( T{C}_m(j)\le \overline{T}{\overline{C}}_m(j)\cdot \beta \) and \( T{C}_r(j)\le \overline{T}{\overline{C}}_r(j)\cdot \beta \), then Ad(j) = 1.
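The scoring quantities can be sketched in code; here the final score is assumed to combine the profit ratio and the adjusting coefficient as j.profit/(STC(j) ⋅ Ad(j)), and the Reduce terms are weighted by R/(M + R). All concrete values are hypothetical:

```python
# Sketch: system time cost STC(j), profit ratio P_r(j), adjusting
# coefficient Ad(j), and the resulting score.  M and R are the numbers of
# Map and Reduce task slots.
def stc(tc_m, tc_r, M, R):
    return tc_m * M / (M + R) + tc_r * R / (M + R)

def profit_ratio(profit, tc_m, tc_r, M, R):
    return profit / stc(tc_m, tc_r, M, R)

def adjusting_coefficient(tc_m, tc_r, avg_tc_m, avg_tc_r, beta, M, R):
    ad = 1.0
    if tc_m > avg_tc_m * beta:   # Map step considered too long
        ad += (tc_m - avg_tc_m * beta) / tc_m * M / (M + R)
    if tc_r > avg_tc_r * beta:   # Reduce step considered too long
        ad += (tc_r - avg_tc_r * beta) / tc_r * R / (M + R)
    return ad

def score(profit, tc_m, tc_r, avg_tc_m, avg_tc_r, beta, M, R):
    # Assumed combination: profit per adjusted second of system time
    return profit / (stc(tc_m, tc_r, M, R) *
                     adjusting_coefficient(tc_m, tc_r, avg_tc_m, avg_tc_r, beta, M, R))

# Hypothetical task: a long Map step gets its score discounted.
print(stc(40.0, 10.0, 80, 80))                                     # 20 + 5 = 25.0
print(adjusting_coefficient(40.0, 10.0, 10.0, 20.0, 2.0, 80, 80))  # 1 + 0.25 = 1.25
print(score(500.0, 40.0, 10.0, 10.0, 20.0, 2.0, 80, 80))           # 500/(25*1.25) = 16.0
```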
Now, we analyze how to improve the efficiency of finding a valid sequence. Assume that the candidate set is sorted by Eq. 2 in decreasing order of score, i.e., ∀ j _{ i } ∈ J, Score(j _{ i }) ≥ Score(j _{ i + 1}). The brute-force searching method needs (|A| + 1)! operations to traverse all candidate sequences. In order to improve the searching speed, we give the following two pruning theorems.
Theorem 3
Given a task set A and one of its valid sequences S = {j _{1}, j _{2}, …, j _{ n }}, for a new task j _{new}, there are n + 1 locations where j _{new} can be inserted. If TC _{ m }(j _{new}) + COT _{ m }(j _{ i }) + TC _{ r }(j _{ i }) > j _{ i }.deadline, then j _{new} cannot be inserted into locations [1, i].
Proof. If j _{new} is inserted into any location in [1, i], then the Map step of j _{ i } is delayed by TC _{ m }(j _{new}), so the earliest finished time of j _{ i } becomes TC _{ m }(j _{new}) + COT _{ m }(j _{ i }) + TC _{ r }(j _{ i }) > j _{ i }.deadline, i.e., j _{ i } will time out.
Theorem 4
Given a task set A and one of its valid sequences S = {j _{1}, j _{2}, …, j _{ n }}, for a new task j _{new}, if TC _{ m }(j _{new}) + COT _{ m }(j _{ i }) + TC _{ r }(j _{new}) > j _{new}.deadline, then j _{new} cannot be inserted into locations [i + 1, n + 1].
Proof. Assume that j _{new} is inserted into some location in [i + 1, n + 1]. According to the proposed scheduling strategy, the earliest finished time of j _{new} is equal to or larger than TC _{ m }(j _{new}) + COT _{ m }(j _{ i }) + TC _{ r }(j _{new}) > j _{new}.deadline, so j _{new} must time out.
Based on Theorems 3 and 4, we propose an algorithm for rapidly finding the acceptable set and its corresponding valid sequence; the details are given in Algorithm 1. With the proposed algorithm, the maximum profit for the candidate task set can be obtained.
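The pruning bounds of Theorems 3 and 4 can be sketched as follows. This is a simplified illustration with hypothetical tasks: it applies only the bounds themselves and ignores the time-slice packing of overlapping Reduce steps, so the surviving positions would still need the full validity test:

```python
import itertools

# Sketch of the pruning bounds of Theorems 3 and 4.  Each task is a dict
# with tc_m, tc_r, and deadline; `seq` is an existing valid sequence.
# Positions are 0-based: position p inserts the new task before seq[p].
def candidate_positions(seq, new):
    cot_m = list(itertools.accumulate(t["tc_m"] for t in seq))
    lo, hi = 0, len(seq)
    for i, task in enumerate(seq):
        # Theorem 3: inserting at positions <= i delays task i's Map step
        # by tc_m(new), pushing task i past its deadline.
        if new["tc_m"] + cot_m[i] + task["tc_r"] > task["deadline"]:
            lo = max(lo, i + 1)
        # Theorem 4: inserting after task i makes new's earliest possible
        # finish time exceed its own deadline.
        if new["tc_m"] + cot_m[i] + new["tc_r"] > new["deadline"]:
            hi = min(hi, i)
    return range(lo, hi + 1)

# Hypothetical tasks: both bounds prune, leaving a single position.
seq = [{"tc_m": 10, "tc_r": 5, "deadline": 20},
       {"tc_m": 10, "tc_r": 5, "deadline": 50}]
new = {"tc_m": 8, "tc_r": 4, "deadline": 30}
print(list(candidate_positions(seq, new)))  # [1]
```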
1.3.3 Timeout handling approach
The above scheduling algorithm is designed for a homogeneous Hadoop cluster, where in most cases the estimated values of j.T _{ m } and j.T _{ r } are close to the real values. However, in some abnormal situations, such as network congestion or node crashes, some accepted tasks cannot finish on time. In these situations, we must adjust the running tasks in order to get the maximum profit.
According to Eq. 1, in order to get the maximum profit, we should drop the task with the lowest profit while ensuring the other tasks finish on time. Based on this idea, we propose a timeout handling algorithm; the details are given in Algorithm 2.
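Since Algorithm 2 itself is not reproduced in this text, the following hypothetical sketch illustrates only the core idea: repeatedly drop the lowest-profit task until the remaining tasks can all finish on time. Here feasibility is checked with a simplified earliest-deadline-first model rather than the paper's time-slice bookkeeping:

```python
# Hypothetical sketch of the timeout-handling idea: drop lowest-profit tasks
# until the rest are feasible.  Tasks are (name, required_time, deadline, profit).
def drop_for_feasibility(tasks):
    def feasible(ts):
        # Simplified model: tasks run one after another in deadline order.
        t = 0.0
        for _name, req, deadline, _profit in sorted(ts, key=lambda x: x[2]):
            t += req
            if t > deadline:
                return False
        return True

    tasks = list(tasks)
    dropped = []
    while tasks and not feasible(tasks):
        victim = min(tasks, key=lambda x: x[3])   # sacrifice the lowest profit
        tasks.remove(victim)
        dropped.append(victim[0])
    return tasks, dropped

# Hypothetical overload: "a" and "b" cannot both meet their deadlines.
tasks = [("a", 10, 12, 100.0), ("b", 10, 15, 20.0), ("c", 5, 30, 50.0)]
remaining, dropped = drop_for_feasibility(tasks)
print(dropped)  # ['b']: with b dropped, a (t=10<=12) and c (t=15<=30) fit
```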
1.4 Experiments
1.4.1 Experimental setting
In the experiments, the Hadoop cluster contains one master node and 40 slave nodes. Each node has an Intel Core i3 3.1 GHz CPU, 8 GB of memory, and 500 GB of storage and runs Red Hat Linux 6.1. Each slave node is configured with two Map task slots and two Reduce task slots.
The candidate task lists are characterized by three parameters:

- Average task size L, i.e., the average size (number of chunks) of all candidate tasks;
- Task number N, i.e., the number of candidate tasks;
- Average deadline D, i.e., the average deadline (time to finish) of all candidate tasks.
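A synthetic candidate list controlled by these three parameters could be generated as follows (a hypothetical sketch; the distributions and the profit model are assumptions, not taken from the paper):

```python
import random

# Hypothetical generator for a candidate task list controlled by the three
# experimental parameters: average size L (chunks), task number N, and
# average deadline D.  The distributions are illustrative assumptions.
def generate_tasks(L, N, D, seed=0):
    rng = random.Random(seed)
    tasks = []
    for i in range(N):
        size = max(1, int(rng.gauss(L, L / 4)))        # chunks around L
        deadline = max(1.0, rng.gauss(D, D / 4))       # deadline around D
        profit = size * rng.uniform(0.5, 1.5)          # profit grows with size
        tasks.append({"id": i, "size": size,
                      "deadline": deadline, "profit": profit})
    return tasks

tasks = generate_tasks(L=64, N=100, D=600.0)
print(len(tasks))  # 100
```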
1.4.2 Results
The baseline algorithms we use in the experiments are DC [22] and WC [24].
2 Conclusions
In this paper, we study the problem of maximal profit in a Hadoop cluster whose resources are not sufficient for the whole candidate task set. In order to maximize the total profit, we select high-profit-ratio tasks based on a valid sequence of the candidate task set. Furthermore, in order to improve the efficiency of finding a valid sequence, we design several pruning strategies and give the corresponding scheduling algorithm. We also propose a timeout handling algorithm. Experiments show that the total profit of the proposed algorithm is very close to the ideal maximum and clearly higher than that of related scheduling algorithms under different experimental settings.
Declarations
Acknowledgements
This work was supported by the following funds: Department of Science and Technology of Henan (9412012Y0004, 9412012Y0005) and Education Department of Henan (13A510520, 2013gh12, 14A520053, SKL2014795).
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
References
 1. J. Dean, S. Ghemawat, MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)
 2. S. Ghemawat, H. Gobioff, S.T. Leung, The Google file system, in ACM SIGOPS Operating Systems Review, vol. 37 (ACM, 2003), pp. 29–43
 3. F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach, M. Burrows, T. Chandra, A. Fikes, R.E. Gruber, Bigtable: a distributed storage system for structured data. ACM Trans. Comput. Syst. 26(2), 4 (2008)
 4. D. Borthakur, The Hadoop distributed file system: architecture and design. Hadoop Project Website 11(2007), 21 (2007)
 5. K. Shvachko, H. Kuang, S. Radia, R. Chansler, The Hadoop distributed file system, in 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST) (IEEE, 2010), pp. 1–10
 6. J.M. Peha, F. Tobagi et al., A cost-based scheduling algorithm to support integrated services, in Proceedings of IEEE INFOCOM ’91, Tenth Annual Joint Conference of the IEEE Computer and Communications Societies (IEEE, 1991), pp. 741–753
 7. Y. Chi, H.J. Moon, H. Hacigümüş, iCBS: incremental cost-based scheduling under piecewise linear SLAs. Proc. VLDB Endow. 4(9), 563–574 (2011)
 8. M.T.B. Aun, B.M. Goi, V.T.H. Kim, Cloud enabled spam filtering services: challenges and opportunities, in 2011 IEEE Conference on Sustainable Utilization and Development in Engineering and Technology (STUDENT) (IEEE, 2011), pp. 63–68
 9. A. McKenna, M. Hanna, E. Banks, A. Sivachenko, K. Cibulskis, A. Kernytsky, K. Garimella, D. Altshuler, S. Gabriel, M. Daly et al., The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20(9), 1297–1303 (2010)
10. R.L. Ferreira Cordeiro, C. Traina Junior, A.J. Machado Traina, J. López, U. Kang, C. Faloutsos, Clustering very large multi-dimensional datasets with MapReduce, in Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, 2011), pp. 690–698
11. K. Wiley, A. Connolly, J. Gardner, S. Krughoff, M. Balazinska, B. Howe, Y. Kwon, Y. Bu, Astronomy in the cloud: using MapReduce for image co-addition. Publ. Astron. Soc. Pac. 123(901), 366–380 (2011)
12. M.F. Husain, P. Doshi, L. Khan, B. Thuraisingham, Storage and retrieval of large RDF graph using Hadoop and MapReduce, in Cloud Computing (Springer, 2009), pp. 680–686
13. W. Dou, X. Zhang, J. Chen, KASR: a keyword-aware service recommendation method on MapReduce for big data applications. IEEE Trans. Parallel Distrib. Syst. 1, 1 (2014)
14. D. Dahiphale, R. Karve, A.V. Vasilakos, H. Liu, Z. Yu, A. Chhajer, J. Wang, C. Wang, An advanced MapReduce: cloud MapReduce, enhancements and applications. IEEE Trans. Netw. Serv. Manag. 11(1), 101–115 (2014)
15. R.B. Thirumala, Survey on improved scheduling in Hadoop MapReduce in cloud environments. Int. J. Comput. Appl. 34(9), 29–33 (2011)
16. M. Yong, N. Garegrat, S. Mohan, Towards a resource aware scheduler in Hadoop, in Proceedings of the 2009 IEEE International Conference on Web Services (Los Angeles, CA, USA, 2009), pp. 102–109
17. M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, I. Stoica, Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling, in Proceedings of the 5th European Conference on Computer Systems (ACM, 2010), pp. 265–278
18. T. Sandholm, K. Lai, Dynamic proportional share scheduling in Hadoop, in Job Scheduling Strategies for Parallel Processing (Springer, 2010), pp. 110–131
19. M. Zaharia, A. Konwinski, A.D. Joseph, R.H. Katz, I. Stoica, Improving MapReduce performance in heterogeneous environments, in OSDI, vol. 8 (2008), p. 7
20. Y. Kwon, M. Balazinska, B. Howe, J. Rolia, SkewTune: mitigating skew in MapReduce applications, in Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (ACM, 2012), pp. 25–36
21. J. Polo, D. Carrera, Y. Becerra, J. Torres, E. Ayguadé, M. Steinder, I. Whalley, Performance-driven task co-scheduling for MapReduce environments, in Network Operations and Management Symposium (NOMS) (IEEE, 2010), pp. 373–380
22. K. Kc, K. Anyanwu, Scheduling Hadoop jobs to meet deadlines, in 2010 IEEE Second International Conference on Cloud Computing Technology and Science (CloudCom) (IEEE, 2010), pp. 388–392
23. A. Verma, L. Cherkasova, R.H. Campbell, ARIA: automatic resource inference and allocation for MapReduce environments, in Proceedings of the 8th ACM International Conference on Autonomic Computing (ACM, 2011), pp. 235–244
24. A. Verma, L. Cherkasova, V.S. Kumar, R.H. Campbell, Deadline-based workload management for MapReduce environments: pieces of the performance puzzle, in Network Operations and Management Symposium (NOMS) (IEEE, 2012), pp. 900–905