Design Insights for MapReduce from Diverse Production Workloads

2012
Abstract: In this paper, we analyze seven MapReduce workload traces from production clusters at Facebook and at Cloudera customers in e-commerce, telecommunications, media, and retail. Cumulatively, these traces comprise over a year's worth of data logged from over 5000 machines, and contain over two million jobs that perform 1.6 exabytes of I/O. Key observations include: input data forms up to 77% of all bytes; 90% of jobs access KB to GB sized files that make up less than 16% of stored bytes; up to 60% of jobs re-access data that has been touched within the past 6 hours; peak-to-median job submission rates are 9:1 or greater; an average of 68% of all compute time is spent in map; task-seconds-per-byte is a key metric for balancing compute and data bandwidth; task durations range from seconds to hours; and five out of seven workloads contain map-only jobs. We have also deployed a public workload repository with workload replay tools so that researchers can systematically assess design priorities and compare performance across diverse MapReduce workloads.
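The two ratio metrics named in the abstract can be illustrated with a minimal sketch. The record layout, the sample values, and the function names below are hypothetical and are not taken from the paper or its workload repository. The sketch assumes task-seconds-per-byte means total task time divided by bytes moved, and that the peak-to-median rate is computed over hourly job counts.

```python
from statistics import median

# Hypothetical job records: (submit_hour, task_seconds, bytes_moved).
# These values are made up for illustration only.
jobs = [
    (0, 120.0, 4_000_000),
    (0, 30.0, 1_000_000),
    (1, 600.0, 2_000_000_000),
    (2, 45.0, 500_000),
    (2, 15.0, 250_000),
    (2, 90.0, 3_000_000),
]


def task_seconds_per_byte(task_seconds, bytes_moved):
    """Assumed definition: task time spent per byte of data moved."""
    return task_seconds / bytes_moved


def peak_to_median_rate(submit_hours):
    """Ratio of the busiest hour's job count to the median hourly count.

    Simplification: hours with zero submissions are not counted.
    """
    counts = {}
    for hour in submit_hours:
        counts[hour] = counts.get(hour, 0) + 1
    per_hour = list(counts.values())
    return max(per_hour) / median(per_hour)


# Aggregate task-seconds-per-byte over the whole sample trace.
total_seconds = sum(j[1] for j in jobs)
total_bytes = sum(j[2] for j in jobs)
aggregate_tspb = task_seconds_per_byte(total_seconds, total_bytes)

# Burstiness of job submissions in the sample trace.
burstiness = peak_to_median_rate(j[0] for j in jobs)

print(f"task-seconds-per-byte: {aggregate_tspb:.3e}")
print(f"peak-to-median submission rate: {burstiness:.1f}:1")
```

Under these assumptions, a high task-seconds-per-byte value would suggest a compute-heavy workload, and a low value a data-movement-heavy one.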