7.A) Explain MapReduce execution steps with a neat diagram. – 10 Marks
Answer:-
MapReduce Execution Steps:
Overall flow: Input Split → RecordReader → Map → Combine (optional) → Shuffle & Sort → Reduce → Output
1. Input Split
- Purpose: Divides the input data into smaller, manageable chunks to facilitate parallel processing.
- How it Works:
- Data is split into fixed-size blocks (64 MB by default in Hadoop 1.x, 128 MB in Hadoop 2.x and later).
- Each split is processed independently by a mapper.
- Example: A large file containing logs is divided into smaller chunks, such as lines 1-1000, 1001-2000, etc.
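Where needed, the split size can also be tuned from the job driver. A minimal Hadoop (Java) sketch, assuming a Job instance named job is already in scope (as in the driver sketch at the end of this answer):

```java
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Cap each input split at 128 MB; the effective split size also
// depends on the HDFS block size and the configured minimum split size.
FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
```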
2. RecordReader (RR)
- Purpose: Converts raw input data from splits into key-value pairs that can be processed by the mapper.
- How it Works:
- For text files, the RecordReader generates pairs like (byte_offset, line_content); for readability, the examples in this answer use line numbers as keys.
- It ensures that the input format is compatible with the MapReduce framework.
- Example: Line 5 of a file: Key = 5, Value = "This is an example line".
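In Hadoop's Java API, the default TextInputFormat supplies a LineRecordReader whose key is the line's byte offset (LongWritable) and whose value is the line text (Text). A minimal driver-side sketch, again assuming a Job instance named job:

```java
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Default format for text input: its LineRecordReader feeds the mapper
// (byte_offset, line_contents) key-value pairs.
job.setInputFormatClass(TextInputFormat.class);
```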
3. Mapping Phase (MAP)
- Purpose: Processes the input key-value pairs and generates intermediate key-value pairs.
- How it Works:
- The mapper applies the user-defined map() function to each input pair.
- The output is a set of intermediate key-value pairs.
- Example: Counting words in a text file (see the mapper sketch below):
Input: (line_number, "apple banana apple")
Output: ("apple", 1), ("banana", 1), ("apple", 1)
4. Combine Phase (Optional)
- Purpose: Locally aggregates intermediate results to reduce data transfer to reducers.
- How it Works:
- Combiner acts as a “mini-reducer” on the mapper node.
- It reduces intermediate data size by pre-aggregating values with the same key.
- Example:
Input:("apple", 1), ("apple", 1), ("banana", 1)
Output:("apple", 2), ("banana", 1)
5. Shuffle and Sort
- Purpose: Organizes and groups intermediate key-value pairs by key and prepares them for the reducer.
- How it Works:
- Shuffle: Data is transferred from mappers to reducers. Keys are grouped together from different mappers.
- Sort: Ensures that keys are in sorted order before being processed by the reducer.
- Example:
Input from mappers:
Mapper 1: ("apple", 2), ("banana", 1)
Mapper 2: ("apple", 1), ("orange", 1)
After shuffle and sort:
Reducer 1: ("apple", [2, 1])
Reducer 2: ("banana", [1]), ("orange", [1])
6. Reducing Phase (REDUCE)
- Purpose: Aggregates the grouped key-value pairs to produce the final output.
- How it Works:
- The reducer applies the user-defined reduce() function to each group of values.
- This produces a single value or result for each key.
- Example (see the reducer sketch below):
Input: ("apple", [2, 1])
Output: ("apple", 3)
7. Output
- Purpose: Writes the final key-value pairs to the Hadoop Distributed File System (HDFS).
- How it Works:
- Each reducer writes its output to a separate file in HDFS (named part-r-00000, part-r-00001, and so on).
- The final output is therefore split across files, one per reducer.
- Example:
Reducer 1 writes: ("apple", 3)
Reducer 2 writes: ("banana", 1), ("orange", 1)
Overall Example: Word Count
- Input: A text file with content:
"apple banana apple orange banana apple"
- Process:
- Split into chunks and converted into pairs:
(1, "apple banana apple"), (2, "orange banana apple")
- Mapper outputs:
("apple", 1), ("banana", 1), ("apple", 1), ...
- Shuffle and sort groups data:
("apple", [1, 1, 1])
,("banana", [1, 1])
,("orange", [1])
- Reducer outputs:
("apple", 3), ("banana", 2), ("orange", 1)
- Output: The final counts are written to HDFS, one part file per reducer.
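Tying the steps together, a minimal driver sketch that wires up the mapper, combiner, and reducer sketched above (class names are illustrative; input and output paths are taken from the command line):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class); // optional combine phase
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // one part-r-NNNNN file per reducer
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```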