MapReduce, Twister, and Iterative MapReduce.

1. MapReduce Framework (Basics)

MapReduce is a software framework designed at Google for parallel, distributed processing of large data sets across clusters of machines. It is organized around two user-defined functions:

  • Map Function: Processes input data and generates intermediate (key, value) pairs
  • Reduce Function: Processes grouped intermediate data to generate final output

The user defines Map() and Reduce() functions, and the framework handles data splitting, parallel execution, scheduling, and communication automatically.


2. Data Flow in MapReduce

| Step | Process |
|------|---------|
| 1 | Input data is split and distributed to multiple Map functions |
| 2 | Map() generates intermediate (key, value) pairs |
| 3 | Intermediate data is sorted and grouped by key |
| 4 | Reduce() receives each group as (key, [list of values]), processes it, and outputs the result |

3. Example: Word Count (MapReduce)

Input:

  • Line 1: “most people ignore most poetry”
  • Line 2: “most poetry ignores most people”

Map Output:

  • (most, 1), (people, 1), (ignore, 1), (most, 1), …

Grouped:

  • (most, [1,1,1,1]), (people, [1,1]), (poetry, [1,1]), …

Reduce Output:

  • (most, 4), (people, 2), (poetry, 2), …
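The whole flow above (map, group by key, reduce) can be simulated sequentially in a few lines of plain Python. This is an illustrative sketch, not a framework API: a real MapReduce runtime executes the map and reduce calls in parallel across machines.

```python
from collections import defaultdict

def map_fn(line):
    # Map: emit (word, 1) for every word in the line
    return [(word, 1) for word in line.split()]

def reduce_fn(key, values):
    # Reduce: sum the counts for one word
    return (key, sum(values))

def word_count(lines):
    # Shuffle: group all intermediate pairs by key
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    # Reduce each (key, [list of values]) group
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))

result = word_count(["most people ignore most poetry",
                     "most poetry ignores most people"])
# (most, 4), (people, 2), (poetry, 2), (ignore, 1), (ignores, 1)
```

Note that "ignore" and "ignores" are counted as distinct keys, which is why they do not merge in the grouped output above.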

4. Formal Definitions

Map Function:

(key1, val1) → List of (key2, val2)

Reduce Function:

(key2, List of val2) → List of val2
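These signatures can be rendered as Python type hints. The aliases and the two word-count instances below are illustrative names, not part of any framework:

```python
from typing import Callable, List, Tuple, TypeVar

Key1, Val1 = TypeVar("Key1"), TypeVar("Val1")
Key2, Val2 = TypeVar("Key2"), TypeVar("Val2")

# Map: (key1, val1) -> list of (key2, val2)
MapFn = Callable[[Key1, Val1], List[Tuple[Key2, Val2]]]

# Reduce: (key2, list of val2) -> list of val2
ReduceFn = Callable[[Key2, List[Val2]], List[Val2]]

# Word count as instances of these signatures
# (key1 = line offset, val1 = line text; key2 = word, val2 = count):
word_count_map: MapFn = lambda _, line: [(w, 1) for w in line.split()]
word_count_reduce: ReduceFn = lambda word, counts: [sum(counts)]
```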

5. Actual Data & Control Flow (Steps)

| Step | Phase | Description |
|------|-------|-------------|
| 1 | Data partitioning | Input data stored in the Google File System (GFS) is partitioned into splits |
| 2 | Computation partitioning | The user program is forked onto cluster nodes (master and workers) |
| 3 | Determining the master and workers | The master assigns map and reduce tasks to workers |
| 4 | Reading the input data | Each map worker reads its assigned data block |
| 5 | Map function | Map() processes the block and emits intermediate (key, value) pairs |
| 6 | Combiner function | (Optional) A combiner locally reduces map output before it is sent |
| 7 | Partitioning function | A partitioning function (e.g., hash(key) mod R) assigns intermediate data to reduce tasks |
| 8 | Synchronization | Reduce starts only after all map tasks complete |
| 9 | Communication | Reduce workers pull their grouped data from the map workers |
| 10 | Sorting and grouping | Each reduce worker sorts and groups its data by key |
| 11 | Reduce function | Reduce() processes each group and stores the final output |
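The partitioning step is just a deterministic mapping from key to reduce task. A minimal sketch of the hash(key) mod R scheme mentioned above, using Python's built-in hash (real frameworks use a stable hash so partitions are reproducible across runs; R and the sample pairs here are illustrative):

```python
R = 4  # number of reduce tasks

def partition(key, num_reducers=R):
    # Every occurrence of the same key lands in the same partition,
    # so a single reduce worker sees all values for that key.
    return hash(key) % num_reducers

pairs = [("most", 1), ("poetry", 1), ("most", 1), ("people", 1)]
buckets = {r: [] for r in range(R)}
for key, value in pairs:
    buckets[partition(key)].append((key, value))
```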

6. Compute-Data Affinity

MapReduce is designed to bring the computation to where the data is (instead of sending large data to computation), minimizing network load. This works efficiently with GFS where data is stored in 64MB blocks (same size as MapReduce splits).


7. Twister and Iterative MapReduce

Twister is an extended MapReduce runtime, sometimes called MapReduce++, designed for iterative applications such as K-means clustering and PageRank.

Why Twister:

  • Traditional MapReduce:
    • Writes intermediate output to disk → slow
    • No built-in support for iterations
  • Twister:
    • Keeps long-running map/reduce tasks (no per-iteration restart cost)
    • Shares static data across iterations
    • Streams only delta data (the changes) between iterations → avoids writing to disk repeatedly
    • More efficient for iterative programs
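The pattern Twister targets can be seen in a toy iterative job: one K-means step per iteration, where the static data (the points) is loaded once and only the small delta (the centroids) flows between iterations. This is a plain-Python sketch of the pattern, not Twister's actual API; a Hadoop-style runtime would instead reread the points from disk on every pass.

```python
def kmeans_iterations(points, centroids, iterations=10):
    # Static data: `points` stays in memory across all iterations.
    for _ in range(iterations):
        # Map: assign each (1-D) point to its nearest centroid
        assignment = {}
        for x in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
            assignment.setdefault(nearest, []).append(x)
        # Reduce: recompute each centroid from its assigned points
        # (an empty cluster keeps its old centroid)
        centroids = [sum(v) / len(v) for v in
                     (assignment.get(i, [c]) for i, c in enumerate(centroids))]
    # Only the centroids (the delta) move between iterations
    return centroids

print(kmeans_iterations([1.0, 1.2, 9.0, 9.5], [0.0, 10.0]))
```

On this tiny input the centroids converge to roughly 1.1 and 9.25 after the first iteration.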

Comparison with MPI:

| Feature | MapReduce | MPI |
|---------|-----------|-----|
| Communication | Full data flow | Only delta flow |
| Disk writes | Frequent | None (in-memory) |
| Performance | Slower | Faster |
| Fault tolerance | Strong | Weak |
| Iteration support | Not built-in | Yes |

Differences: Hadoop vs Twister vs MPI

| Paradigm | Intermediate Data | Supports Iteration | Disk Usage |
|----------|-------------------|--------------------|------------|
| Hadoop (MR) | Written to disk | No | High |
| Twister | In-memory streaming | Yes | Low |
| MPI | In-memory, delta | Yes | Very low |
