MapReduce is a programming paradigm for processing and generating large data sets using a parallel, distributed algorithm on a cluster.
A MapReduce job is divided into map and reduce tasks which run on the different machines that make up the cluster, always aiming to perform the processing on the same machine where the data lives.
Map tasks are usually concerned with loading, parsing, transforming and filtering data. Once this is done, each reduce task handles a subset of the map task output, grouping it and performing aggregations over it.
The input data for the MapReduce job comes from files stored in the Hadoop Distributed File System (HDFS). The content of these files is split using an input format, which defines how to separate a file into chunks called input splits. The default criterion is to split by new lines.
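To make the idea of input splits concrete, here is a minimal sketch (in Python, as a simulation rather than Hadoop's actual Java API) of a line-based record reader: it turns raw file contents into key/value records where, as in Hadoop's default text input format, the key is the byte offset of the line and the value is the line itself. The function name `line_records` is a hypothetical name chosen for this illustration.

```python
def line_records(data: bytes):
    """Split raw file contents into (byte offset, line) records,
    mimicking how a line-based input format turns an input split
    into key/value pairs (key = position, value = line of text)."""
    offset = 0
    for line in data.split(b"\n"):
        yield offset, line.decode("utf-8")
        offset += len(line) + 1  # +1 accounts for the newline delimiter

records = list(line_records(b"first line\nsecond line\nthird"))
# Each record carries its starting byte offset as the key.
```

In the real framework each map task would receive one such stream of records for its own input split.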
Each map task consists of the following phases:
Record reader Parses an input split into records and translates each one into a key/value pair for the mapper. The key is usually positional information (such as the byte offset within the file) and the value is the chunk of data that composes the record, for example a line of text.
Map Executes the user-provided map function over each key/value pair coming from the record reader, producing zero or more intermediate key/value pairs. Together with the reduce function, this is the main piece of logic the developer writes.
Combiner An optional, localized reducer that aggregates the intermediate pairs on the map machine before they are sent over the network, which can significantly reduce the amount of data shuffled.
Partitioner Splits the intermediate key/value pairs into one shard per reducer, typically by hashing the key modulo the number of reducers, and writes them to local disk so the reducers can fetch them.
And each reduce task consists of the following steps:
Shuffle and sort Takes the output files written by all the partitioners and downloads them to the machine where the reducer is running. These records are sorted by key so that equivalent keys are grouped together and their values can be iterated over easily in the reduce task. This step is handled by the framework automatically; the only thing the developer can customize is how keys are sorted, by providing a custom Comparator object.
Reduce Applies a reduce function once for each key grouping in the data from the previous phase. This function receives a key and an iterator over all of the values under that key. The reducer may aggregate, filter or combine the data in a number of ways; after it is done, it sends zero or more key/value pairs to the final step. The reduce function is the other important piece of logic after the map function, so it is also provided by the developer. It is not always needed: some use cases can be solved with the map phase alone.
Output format Translates the final key/value pairs from the reduce function and writes them out to a file on HDFS.
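The shuffle-and-sort step can be sketched in a few lines of Python (again a simulation, not the framework's API): sorting the intermediate pairs plays the role of the Comparator, and grouping by key reproduces the "key plus iterator of values" shape that each reduce call receives.

```python
from itertools import groupby

# Intermediate key/value pairs as several map tasks might have emitted them.
map_output = [("b", 1), ("a", 1), ("b", 1), ("c", 1), ("a", 1)]

# Shuffle and sort: bring equivalent keys together. The sort key plays the
# role of the custom Comparator; here we simply sort keys alphabetically.
shuffled = sorted(map_output, key=lambda kv: kv[0])

# The framework then hands each reducer a key plus an iterator over the
# values grouped under that key.
grouped = {key: [v for _, v in group]
           for key, group in groupby(shuffled, key=lambda kv: kv[0])}
```

Note that `groupby` only merges adjacent equal keys, which is exactly why the sort must happen first, just as in the real framework.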
Let's now look at a basic example: the word count problem, solved in MapReduce terms:
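Here is a minimal sketch of word count in Python, simulating the map, shuffle-and-sort and reduce phases in a single process (real Hadoop jobs are usually written in Java against the framework's API; the function names `map_fn`, `reduce_fn` and `word_count` are chosen for this illustration).

```python
from itertools import groupby

def map_fn(_offset, line):
    # Map: emit an intermediate (word, 1) pair for every word in the record.
    for word in line.lower().split():
        yield word, 1

def reduce_fn(word, counts):
    # Reduce: sum all the partial counts emitted for one word.
    return word, sum(counts)

def word_count(lines):
    # Map phase: run the map function over every input record.
    intermediate = [pair for offset, line in enumerate(lines)
                    for pair in map_fn(offset, line)]
    # Shuffle and sort: bring equivalent keys together.
    intermediate.sort(key=lambda kv: kv[0])
    # Reduce phase: one reduce call per distinct key, receiving the key
    # and an iterator over its values.
    return dict(reduce_fn(key, (v for _, v in group))
                for key, group in groupby(intermediate, key=lambda kv: kv[0]))

result = word_count(["the quick brown fox", "the lazy dog", "the fox"])
```

In a real cluster the intermediate pairs would be partitioned across many reducers and each phase would run on different machines, but the data flow is exactly this.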
These are just the very basic concepts of data processing using MapReduce. To learn more, some good resources are: