Exploiting Bloom Filters for Efficient Joins in MapReduce
Taewhi Lee, Kisung Kim, and Hyoung-Joo Kim
MapReduce is a programming model that is extensively used for large-scale data analysis. However, it is inefficient to perform join operations using MapReduce, because large intermediate results are produced, even in cases where only a small fraction of input data participate in the join. We alleviate this problem by exploiting Bloom filters within a single MapReduce job. We create Bloom filters for an input dataset, and filter out the redundant records in the other input dataset in the map phase.
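The filtering idea can be illustrated with a minimal single-process sketch in Python. It is not the implementation described in this paper and does not run on a MapReduce framework: the class `BloomFilter`, the function `filtered_join`, and the parameters `num_bits` and `num_hashes` are hypothetical names chosen for illustration, and the "map" and "reduce" phases are simulated with ordinary in-memory loops. The sketch builds a Bloom filter over the join keys of one input, drops records of the other input whose keys the filter rejects, and joins only the survivors.

```python
import hashlib


class BloomFilter:
    """A small Bloom filter; sizes are illustrative assumptions, not tuned values."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting SHA-256 with a seed byte.
        data = str(key).encode("utf-8")
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


def filtered_join(left, right, num_bits=1024, num_hashes=3):
    """Join two lists of (key, value) pairs, filtering `right` with a Bloom filter.

    Returns the joined tuples and the number of `right` records that passed
    the filter (a stand-in for the intermediate result size).
    """
    bloom = BloomFilter(num_bits, num_hashes)
    left_by_key = {}
    for key, value in left:
        bloom.add(key)
        left_by_key.setdefault(key, []).append(value)

    # Simulated map phase: discard records that cannot participate in the join.
    survivors = [(k, v) for k, v in right if bloom.might_contain(k)]

    # Simulated reduce phase: exact matching removes any false positives.
    joined = [
        (k, lv, rv) for k, rv in survivors for lv in left_by_key.get(k, [])
    ]
    return joined, len(survivors)


left = [(1, "a"), (2, "b")]
right = [(2, "x"), (3, "y"), (2, "z")]
joined, passed = filtered_join(left, right)
```

Because a Bloom filter can report false positives but never false negatives, the filter may let a few non-matching records through, yet it never discards a record that belongs in the join result. The exact match in the final step keeps the output correct either way.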