MapReduce, Twister, and Iterative MapReduce.

1. MapReduce Framework (Basics)

MapReduce is a software framework designed at Google for parallel, distributed processing of large data sets across clusters of machines. It is organized around two user-defined functions:

  • Map Function: Processes input data and generates intermediate (key, value) pairs
  • Reduce Function: Processes grouped intermediate data to generate final output

The user defines Map() and Reduce() functions, and the framework handles data splitting, parallel execution, scheduling, and communication automatically.


2. Data Flow in MapReduce

| Step | Process |
|------|---------|
| 1 | Input data is split and distributed to multiple Map functions |
| 2 | Map() generates intermediate (key, value) pairs |
| 3 | Intermediate data is sorted and grouped by key |
| 4 | Reduce() receives each group as (key, [list of values]), processes it, and outputs the result |

3. Example: Word Count (MapReduce)

Input:

  • Line 1: “most people ignore most poetry”
  • Line 2: “most poetry ignores most people”

Map Output:

  • (most, 1), (people, 1), (ignore, 1), (most, 1), …

Grouped:

  • (most, [1,1,1,1]), (people, [1,1]), (poetry, [1,1]), …

Reduce Output:

  • (most, 4), (people, 2), (poetry, 2), …
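The whole flow above (map, group by key, reduce) can be simulated sequentially in a few lines of plain Python. This is an illustrative sketch, not a framework API: a real MapReduce runtime executes the map and reduce calls in parallel across machines.

```python
from collections import defaultdict

def map_fn(line):
    # Map: emit (word, 1) for every word in the line
    return [(word, 1) for word in line.split()]

def reduce_fn(key, values):
    # Reduce: sum the counts for one word
    return (key, sum(values))

def word_count(lines):
    # Shuffle: group all intermediate pairs by key
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    # Reduce each (key, [list of values]) group
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))

result = word_count(["most people ignore most poetry",
                     "most poetry ignores most people"])
# (most, 4), (people, 2), (poetry, 2), (ignore, 1), (ignores, 1)
```

Note that "ignore" and "ignores" are counted as distinct keys, which is why they do not merge in the grouped output above.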

4. Formal Definitions

Map Function:

(key1, val1) → List of (key2, val2)

Reduce Function:

(key2, List of val2) → List of val2
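These signatures can be rendered as Python type hints. The aliases and the two word-count instances below are illustrative names, not part of any framework:

```python
from typing import Callable, List, Tuple, TypeVar

Key1, Val1 = TypeVar("Key1"), TypeVar("Val1")
Key2, Val2 = TypeVar("Key2"), TypeVar("Val2")

# Map: (key1, val1) -> list of (key2, val2)
MapFn = Callable[[Key1, Val1], List[Tuple[Key2, Val2]]]

# Reduce: (key2, list of val2) -> list of val2
ReduceFn = Callable[[Key2, List[Val2]], List[Val2]]

# Word count as instances of these signatures
# (key1 = line offset, val1 = line text; key2 = word, val2 = count):
word_count_map: MapFn = lambda _, line: [(w, 1) for w in line.split()]
word_count_reduce: ReduceFn = lambda word, counts: [sum(counts)]
```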

5. Actual Data & Control Flow (Steps)

| Step | Phase | Description |
|------|-------|-------------|
| 1 | Data partitioning | Input data stored in the Google File System (GFS) is partitioned into splits |
| 2 | Computation partitioning | The user program is forked onto cluster nodes (master and workers) |
| 3 | Determining the master and workers | The master assigns map and reduce tasks to workers |
| 4 | Reading the input data | Each map worker reads its assigned data block |
| 5 | Map function | Map() processes the block and emits intermediate (key, value) pairs |
| 6 | Combiner function | (Optional) A combiner locally reduces map output before it is sent |
| 7 | Partitioning function | A partitioning function (e.g., hash(key) mod R) assigns intermediate data to reduce tasks |
| 8 | Synchronization | Reduce starts only after all map tasks complete |
| 9 | Communication | Reduce workers pull their grouped data from the map workers |
| 10 | Sorting and grouping | Each reduce worker sorts and groups its data by key |
| 11 | Reduce function | Reduce() processes each group and stores the final output |
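The partitioning step is just a deterministic mapping from key to reduce task. A minimal sketch of the hash(key) mod R scheme mentioned above, using Python's built-in hash (real frameworks use a stable hash so partitions are reproducible across runs; R and the sample pairs here are illustrative):

```python
R = 4  # number of reduce tasks

def partition(key, num_reducers=R):
    # Every occurrence of the same key lands in the same partition,
    # so a single reduce worker sees all values for that key.
    return hash(key) % num_reducers

pairs = [("most", 1), ("poetry", 1), ("most", 1), ("people", 1)]
buckets = {r: [] for r in range(R)}
for key, value in pairs:
    buckets[partition(key)].append((key, value))
```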

6. Compute-Data Affinity

MapReduce is designed to bring the computation to where the data is (instead of sending large data to computation), minimizing network load. This works efficiently with GFS where data is stored in 64MB blocks (same size as MapReduce splits).


7. Twister and Iterative MapReduce

Twister is an extended MapReduce runtime, sometimes called MapReduce++, designed for iterative applications such as K-means clustering and PageRank.

Why Twister:

  • Traditional MapReduce:
    • Writes intermediate output to disk → slow
    • No built-in support for iterations
  • Twister:
    • Keeps long-running map/reduce tasks (no per-iteration restart cost)
    • Shares static data across iterations
    • Streams only delta data (the changes) between iterations → avoids writing to disk repeatedly
    • More efficient for iterative programs
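The pattern Twister targets can be seen in a toy iterative job: one K-means step per iteration, where the static data (the points) is loaded once and only the small delta (the centroids) flows between iterations. This is a plain-Python sketch of the pattern, not Twister's actual API; a Hadoop-style runtime would instead reread the points from disk on every pass.

```python
def kmeans_iterations(points, centroids, iterations=10):
    # Static data: `points` stays in memory across all iterations.
    for _ in range(iterations):
        # Map: assign each (1-D) point to its nearest centroid
        assignment = {}
        for x in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
            assignment.setdefault(nearest, []).append(x)
        # Reduce: recompute each centroid from its assigned points
        # (an empty cluster keeps its old centroid)
        centroids = [sum(v) / len(v) for v in
                     (assignment.get(i, [c]) for i, c in enumerate(centroids))]
    # Only the centroids (the delta) move between iterations
    return centroids

print(kmeans_iterations([1.0, 1.2, 9.0, 9.5], [0.0, 10.0]))
```

On this tiny input the centroids converge to roughly 1.1 and 9.25 after the first iteration.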

Comparison with MPI:

| Feature | MapReduce | MPI |
|---------|-----------|-----|
| Communication | Full data flow | Only delta flow |
| Disk writes | Frequent | None (in-memory) |
| Performance | Slower | Faster |
| Fault tolerance | Strong | Weak |
| Iteration support | Not built-in | Yes |

Differences: Hadoop vs Twister vs MPI

| Paradigm | Intermediate Data | Supports Iteration | Disk Usage |
|----------|-------------------|--------------------|------------|
| Hadoop (MR) | Written to disk | No | High |
| Twister | In-memory streaming | Yes | Low |
| MPI | In-memory, delta | Yes | Very low |
