
MapReduce

MapReduce is a programming paradigm for processing and generating large data sets using a parallel, distributed algorithm on a cluster.

A MapReduce job is divided into map and reduce tasks, which run on the different machines that make up the cluster, always aiming to perform the processing on the same machine where the data lives.

A map task is typically concerned with loading, parsing, transforming and filtering data. Once this is done, a reduce task takes a subset of the map output and groups it to perform aggregations.

The input data for the MapReduce job comes from files stored in the Hadoop Distributed File System (HDFS). The content of these files is divided by an input format, which defines how to separate a file into chunks called input splits. The default criterion is to split on new lines.
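In Hadoop's Java API the input format is configured on the job. TextInputFormat is the default and turns each line of a split into one record, keyed by its byte offset in the file. A minimal sketch, assuming a hypothetical class name and HDFS path:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class InputSetup {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "input format example");
        // TextInputFormat is the default: every line of each input split becomes
        // one record, with the line's byte offset as key and its text as value.
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/data/input"));  // hypothetical HDFS path
    }
}
```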

Each map task consists of the following phases:

  • Record reader Translates an input split generated by the input format into records (key/value pairs).
  • Map Code provided by the user is executed on each input key/value pair and produces zero or more new key/value pairs, the intermediate pairs. It is very important to choose the key and the value wisely, as this has a big impact on what the job can accomplish: the key is what the data will be grouped on, and the value is the information to be analyzed by the reducer (a minimal mapper sketch follows this list).
  • Combiner This optional phase will perform an additional reduce step at the end of the map phase, and its purpose is to reduce the workload of the reducers. For example, as the count of an aggregation is the sum of the counts of each part, you can produce an intermediate count with a combiner and then sum those intermediate counts in the reducer to obtain the final result.
  • Partitioner This piece takes the key/value pairs from the mapper and groups them into shards, one per reducer. The default strategy is to assign each record to a reducer based on the hash of its key, which spreads keys roughly evenly. This behaviour can be customized, usually when the output needs to follow a particular sort order.
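To make the map phase concrete, here is a minimal word count mapper written against Hadoop's Java MapReduce API. The class name is illustrative, and the job that wires it up is shown at the end of the post.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // The record reader hands us one line at a time; the key is its byte offset.
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);  // intermediate pair: (word, 1)
            }
        }
    }
}
```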

And each reduce task consists of the following steps:

  • Shuffle and sort Takes the output files written by all the partitioners and downloads them to the machine where the reducer is running. These records are sorted by key so equivalent keys are grouped together and their values can be iterated over easily in the reduce task. This step is handled by the framework automatically and the only thing the developer can do is specify a custom Comparator object to decide how keys are sorted.

  • Reduce Applies a reduce function once for each key grouping within the data from the previous phase. This function receives a key and an iterator over all of the values under that key; it may aggregate, filter or combine the data in a number of ways, and when it is done it emits zero or more key/value pairs to the final step. The reduce function is the other important piece of logic, after the map function, that the developer provides. It is not always needed, so certain use cases will not make use of it (a minimal reducer sketch follows this list).

  • Output format Translates the final key/value pairs from the reduce function and writes them out to files on HDFS.
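And a matching reducer for the word count case, again a sketch against Hadoop's Java API with an illustrative class name; it simply sums the counts it receives for each word:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        // Shuffle and sort has already grouped every count emitted for this word.
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        total.set(sum);
        context.write(word, total);  // final pair: (word, total count)
    }
}
```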

Let's now look at a basic example: the word count problem, solved in MapReduce terms (a sketch of the full job follows the list):

  1. In the split phase, each line in the input file is taken as a record.
  2. In the map phase, each word in each record is assigned a count of 1, the word being the key and the count being the value.
  3. In the combine phase, the combiner aggregates the counts for the same word within the output of a single mapper.
  4. In the shuffle and sort phase, the framework moves all the key/value pairs from the various mappers to the reducers and sorts them by key, in our case by word.
  5. In the reduce phase, the same operation as in the combine phase is performed: all the counts for each specific word are added together. The difference is that this time it is applied over the whole data set, so we obtain the final result.
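Putting the pieces together, a driver along the following lines would run the word count job with the mapper and reducer sketched above. Class names and paths are illustrative; note that the reducer can be reused as the combiner because summing counts is associative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);

        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);  // combiner: partial sums per mapper
        job.setReducerClass(WordCountReducer.class);   // reducer: final sums per word

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory on HDFS

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged into a jar, it would be launched with something like hadoop jar wordcount.jar WordCount /input /output, and the final word counts end up in part files under the output directory.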

These are just the very basic concepts of data processing using MapReduce. To learn more, some good resources are:

  • The Apache Hadoop Website
  • Hadoop: The Definitive Guide
  • MapReduce Design Patterns
