1. MapReduce Framework (Basics)
MapReduce is a software framework designed by Google that enables parallel and distributed computing over large data sets across multiple machines. It provides two main functions:
- Map Function: Processes input data and generates intermediate (key, value) pairs
- Reduce Function: Processes grouped intermediate data to generate final output
The user defines `Map()` and `Reduce()` functions, and the framework handles data splitting, parallel execution, scheduling, and communication automatically.

2. Data Flow in MapReduce
| Step | Process |
| --- | --- |
| 1️⃣ | Input data is split and given to multiple Map functions |
| 2️⃣ | Map() generates intermediate (key, value) pairs |
| 3️⃣ | Intermediate data is sorted and grouped by key |
| 4️⃣ | Reduce() gets data as (key, [list of values]), processes it, and outputs results |

3. Example: Word Count (MapReduce)
Input:
- Line 1: “most people ignore most poetry”
- Line 2: “most poetry ignores most people”
Map Output:
- (most, 1), (people, 1), (ignore, 1), (most, 1), …
Grouped:
- (most, [1,1,1,1]), (people, [1,1]), (poetry, [1,1]), …
Reduce Output:
- (most, 4), (people, 2), (poetry, 2), …
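The word-count example above can be sketched in plain Python (the names `map_fn` and `reduce_fn` are illustrative, not part of any real framework API); the shuffle/group step is simulated here with a dictionary:

```python
from collections import defaultdict

def map_fn(line):
    # Map: emit (word, 1) for every word in the input line
    return [(word, 1) for word in line.split()]

def reduce_fn(key, values):
    # Reduce: sum the list of counts for one word
    return (key, sum(values))

lines = [
    "most people ignore most poetry",
    "most poetry ignores most people",
]

# Map phase: each line yields intermediate (key, value) pairs
intermediate = [pair for line in lines for pair in map_fn(line)]

# Shuffle/group phase: collect all values belonging to the same key
groups = defaultdict(list)
for key, value in intermediate:
    groups[key].append(value)

# Reduce phase: one reduce_fn call per key
result = dict(reduce_fn(k, v) for k, v in sorted(groups.items()))
print(result)
# → {'ignore': 1, 'ignores': 1, 'most': 4, 'people': 2, 'poetry': 2}
```

In a real cluster the map calls run on different workers and the grouping happens over the network; the single-process version only mirrors the data flow.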
4. Formal Definitions
Map Function:
(key1, val1) → List of (key2, val2)
Reduce Function:
(key2, List of val2) → List of val2
5. Actual Data & Control Flow (Steps)
| Step | Description |
| --- | --- |
| 1️⃣ Data partitioning | Input data stored in the Google File System (GFS) is partitioned into splits |
| 2️⃣ Computation partitioning | The user program is forked onto cluster nodes (one master, many workers) |
| 3️⃣ Determining the master and workers | The master assigns map and reduce tasks to idle workers |
| 4️⃣ Reading the input data | Each map worker reads its assigned data block |
| 5️⃣ Map function | Map() processes the block and outputs intermediate (key, value) pairs |
| 6️⃣ Combiner function | (Optional) A combiner locally reduces map output before it is sent |
| 7️⃣ Partitioning function | A partitioning function (e.g., hash(key) % R) assigns intermediate data to the R reduce tasks |
| 8️⃣ Synchronization | Reduce starts only after all map tasks complete |
| 9️⃣ Communication | Reduce workers pull their grouped partitions from the map workers |
| 🔟 Sorting and grouping | Data is sorted and grouped by key at each reduce worker |
| 1️⃣1️⃣ Reduce function | Reduce() processes the grouped data and stores the final output |
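The partitioning step (7️⃣) can be sketched as follows; the value of `R` and the use of `zlib.crc32` are illustrative choices (a real framework would use its own stable hash), but the idea is exactly `hash(key) % R`:

```python
import zlib

R = 3  # number of reduce tasks (illustrative)

def partition(key, num_reducers=R):
    # Default-style partitioner: stable hash of the key, modulo R.
    # zlib.crc32 gives the same result across runs, unlike Python's
    # built-in hash() for strings, which is salted per process.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

# Every occurrence of the same key lands in the same reduce partition.
pairs = [("most", 1), ("people", 1), ("most", 1), ("poetry", 1)]
partitions = {r: [] for r in range(R)}
for key, value in pairs:
    partitions[partition(key)].append((key, value))
```

Because the partition depends only on the key, all values for one key end up at one reduce worker, which is what makes the later grouping step correct.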


6. Compute-Data Affinity
MapReduce is designed to bring the computation to where the data is (instead of sending large data to computation), minimizing network load. This works efficiently with GFS where data is stored in 64MB blocks (same size as MapReduce splits).
7. Twister and Iterative MapReduce
Twister is an advanced version of MapReduce, also called MapReduce++, designed to handle iterative applications like machine learning (e.g., K-means, PageRank).
Why Twister:
- Traditional MapReduce:
  - Writes intermediate output to disk → slow
  - No built-in support for iterations
- Twister:
  - Keeps long-running map/reduce threads
  - Shares static data across iterations
  - Streams only delta data (changes) → avoids writing to disk repeatedly
  - More efficient for iterative programs
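The iterative pattern Twister targets can be illustrated with a simple in-memory loop (a sketch of the idea, not Twister's actual API): the static data stays loaded across iterations, and only the small delta (here, the centroids of a 1-D K-means) changes each round:

```python
# Sketch of the iterative-MapReduce pattern (not Twister's real API):
# static data is loaded once and reused; only the delta (centroids)
# is re-broadcast and recomputed each iteration.

points = [1.0, 2.0, 9.0, 10.0, 11.0]   # static data: loaded once
centroids = [0.0, 5.0]                  # delta data: changes per iteration

for _ in range(10):
    # Map: assign each point to the index of its nearest centroid
    assignments = [(min(range(len(centroids)),
                        key=lambda c: abs(p - centroids[c])), p)
                   for p in points]
    # Reduce: recompute each centroid as the mean of its assigned points
    new_centroids = []
    for c in range(len(centroids)):
        members = [p for cc, p in assignments if cc == c]
        new_centroids.append(sum(members) / len(members) if members else centroids[c])
    if new_centroids == centroids:      # converged: the delta stopped changing
        break
    centroids = new_centroids

print(centroids)
# → [1.5, 10.0]
```

In plain Hadoop-style MapReduce, each loop iteration would be a full job that re-reads `points` from disk; Twister's point is that only the centroid list needs to move between iterations.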
Comparison with MPI:
| Feature | MapReduce | MPI |
| --- | --- | --- |
| Communication | Full data flow | Only delta flow |
| Disk writes | Frequent | None (in-memory) |
| Performance | Slower | Faster |
| Fault tolerance | Strong | Weak |
| Iteration support | Not built-in | Yes |
Differences: Hadoop vs Twister vs MPI
| Paradigm | Intermediate Data | Supports Iteration | Disk Usage |
| --- | --- | --- | --- |
| Hadoop (MR) | Written to disk | No | High |
| Twister | In-memory streaming | Yes | Low |
| MPI | In-memory, delta | Yes | Very low |

