7.A) Explain MapReduce execution steps with a neat diagram. – 10 Marks
Answer:-
MapReduce Execution Steps:
Overall flow: Input Split → RecordReader → Map → Combine (optional) → Shuffle & Sort → Reduce → Output
1. Input Split
- Purpose: Divides the input data into smaller, manageable chunks to facilitate parallel processing.
- How it Works:
- Data is split into fixed-size blocks (64 MB by default in Hadoop 1.x, 128 MB in Hadoop 2.x and later).
- Each split is processed independently by a mapper.
- Example: A large file containing logs is divided into smaller chunks, such as lines 1-1000, 1001-2000, etc.
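Where needed, the split size can also be tuned from the job driver. A minimal Hadoop (Java) sketch, assuming a Job instance named job is already in scope (as in the driver sketch at the end of this answer):

```java
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Cap each input split at 128 MB; the effective split size also
// depends on the HDFS block size and the configured minimum split size.
FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
```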
2. RecordReader (RR)
- Purpose: Converts raw input data from splits into key-value pairs that can be processed by the mapper.
- How it Works:
- For text files, the RecordReader generates pairs like (byte_offset, line_content); for readability, the examples in this answer use line numbers as keys.
- It ensures that the input format is compatible with the MapReduce framework.
- Example: Line 5 of a file: Key = 5, Value = "This is an example line".
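In Hadoop's Java API, the default TextInputFormat supplies a LineRecordReader whose key is the line's byte offset (LongWritable) and whose value is the line text (Text). A minimal driver-side sketch, again assuming a Job instance named job:

```java
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Default format for text input: its LineRecordReader feeds the mapper
// (byte_offset, line_contents) key-value pairs.
job.setInputFormatClass(TextInputFormat.class);
```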
3. Mapping Phase (MAP)
- Purpose: Processes the input key-value pairs and generates intermediate key-value pairs.
- How it Works:
- The mapper applies the user-defined map() function to each input pair.
- The output is a set of intermediate key-value pairs.
- Example: Counting words in a text file (see the mapper sketch below):
Input: (line_number, "apple banana apple")
Output: ("apple", 1), ("banana", 1), ("apple", 1)
4. Combine Phase (Optional)
- Purpose: Locally aggregates intermediate results to reduce data transfer to reducers.
- How it Works:
- Combiner acts as a “mini-reducer” on the mapper node.
- It reduces intermediate data size by pre-aggregating values with the same key.
- Example:
Input:("apple", 1), ("apple", 1), ("banana", 1)
Output:("apple", 2), ("banana", 1)
5. Shuffle and Sort
- Purpose: Organizes and groups intermediate key-value pairs by key and prepares them for the reducer.
- How it Works:
- Shuffle: Data is transferred from mappers to reducers. Keys are grouped together from different mappers.
- Sort: Ensures that keys are in sorted order before being processed by the reducer.
- Example:
Input from mappers:
Mapper 1: ("apple", 2), ("banana", 1)
Mapper 2: ("apple", 1), ("orange", 1)
After shuffle and sort:
Reducer 1: ("apple", [2, 1])
Reducer 2: ("banana", [1]), ("orange", [1])
6. Reducing Phase (REDUCE)
- Purpose: Aggregates the grouped key-value pairs to produce the final output.
- How it Works:
- The reducer applies the user-defined reduce() function to each group of values.
- This produces a single value or result for each key.
- Example (see the reducer sketch below):
Input: ("apple", [2, 1])
Output: ("apple", 3)
7. Output
- Purpose: Writes the final key-value pairs to the Hadoop Distributed File System (HDFS).
- How it Works:
- Each reducer writes its output to a separate file in HDFS (named part-r-00000, part-r-00001, and so on).
- The final output is therefore split across files, one per reducer.
- Example:
Reducer 1 writes: ("apple", 3)
Reducer 2 writes: ("banana", 1), ("orange", 1)
Overall Example: Word Count
- Input: A text file with content:
"apple banana apple orange banana apple"
- Process:
- Split into chunks and converted into pairs:
(1, "apple banana apple"), (2, "orange banana apple")
- Mapper outputs:
("apple", 1), ("banana", 1), ("apple", 1), ...
- Shuffle and sort groups data:
("apple", [1, 1, 1])
,("banana", [1, 1])
,("orange", [1])
- Reducer outputs:
("apple", 3), ("banana", 2), ("orange", 1)
- Output: The final counts are written to HDFS, one part file per reducer.
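Tying the steps together, a minimal driver sketch that wires up the mapper, combiner, and reducer sketched above (class names are illustrative; input and output paths are taken from the command line):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class); // optional combine phase
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // one part-r-NNNNN file per reducer
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```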