Design Insights for MapReduce from Diverse Production Workloads
2012
Abstract: In this paper, we analyze seven MapReduce workload traces from production clusters at Facebook and at Cloudera customers in e-commerce, telecommunications, media, and retail. Cumulatively, these traces comprise over a year's worth of data logged from over 5000 machines, and contain over two million jobs that perform 1.6 exabytes of I/O. Key observations include: input data forms up to 77% of all bytes, 90% of jobs access KB to GB sized files that make up less than 16% of stored bytes, up to 60% of jobs re-access data that has been touched within the past 6 hours, peak-to-median job submission rates are 9:1 or greater, an average of 68% of all compute time is spent in map, task-seconds-per-byte is a key metric for balancing compute and data bandwidth, task durations range from seconds to hours, and five out of seven workloads contain map-only jobs. We have also deployed a public workload repository with workload replay tools so that researchers can systematically assess design priorities and compare performance across diverse MapReduce workloads.