MapReduce in the Cloud: Data-Location-Aware VM Scheduling
Tung Nguyen Weisong Shi
This paper discusses the inefficiency of MapReduce in the cloud. We study the causes of this inefficiency and propose a solution. Inefficiency mainly occurs during data movement. Transferring large data to computing nodes is very time-consuming and also violates the rationale of Hadoop, which is to move computation to the data. To address this issue, we developed a distributed cache system and virtual machine scheduler. We show that our prototype can improve performance significantly when running different applications.