7.A) Explain MapReduce execution steps with a neat diagram. – 10 Marks

Answer:


MapReduce Execution Steps:
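
Execution flow (textual diagram of the steps below):

    Input Data → Input Split → RecordReader → Map → Combine (optional) → Shuffle & Sort → Reduce → Output (HDFS)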

1. Input Split

  • Purpose: Divides the input data into smaller, manageable chunks to facilitate parallel processing.
  • How it Works:
    • Data is split into fixed-size chunks, by default one per HDFS block (64 MB in Hadoop 1.x, 128 MB in Hadoop 2.x and later).
    • Each split is processed independently by a mapper.
  • Example: A large file containing logs is divided into smaller chunks, such as lines 1-1000, 1001-2000, etc.
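
A small Java sketch of how a job can bound its split sizes (SplitConfig is a hypothetical helper; in practice the split size simply defaults to the HDFS block size):

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitConfig {
        // Splits normally default to one per HDFS block; these calls
        // bound the computed split size for a specific job.
        static void boundSplits(Job job) {
            FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);   // 64 MB
            FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);  // 128 MB
        }
    }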

2. RecordReader (RR)

  • Purpose: Converts raw input data from splits into key-value pairs that can be processed by the mapper.
  • How it Works:
    • For text files, the default RecordReader (LineRecordReader) generates pairs like (byte_offset, line_content); examples often simplify the offset to a line number.
    • It ensures that the input format is compatible with the MapReduce framework.
  • Example: A line starting at byte offset 5 in the file:
    Key = 5, Value = "This is an example line."
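
A minimal Java sketch of wiring this up (InputFormatConfig is a hypothetical helper; TextInputFormat is Hadoop's default input format for text files):

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class InputFormatConfig {
        // TextInputFormat's LineRecordReader hands the mapper one
        // (LongWritable byteOffset, Text lineContents) pair per line.
        static void useTextInput(Job job) {
            job.setInputFormatClass(TextInputFormat.class);
        }
    }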

3. Mapping Phase (MAP)

  • Purpose: Processes the input key-value pairs and generates intermediate key-value pairs.
  • How it Works:
    • The mapper applies the user-defined map() function to each input pair.
    • The output is a set of intermediate key-value pairs.
  • Example: Counting words in a text file:
    Input: (line_number, "apple banana apple")
    Output: ("apple", 1), ("banana", 1), ("apple", 1)

4. Combine Phase (Optional)

  • Purpose: Locally aggregates intermediate results to reduce data transfer to reducers.
  • How it Works:
    • Combiner acts as a “mini-reducer” on the mapper node.
    • It reduces intermediate data size by pre-aggregating values with the same key.
  • Example:
    Input: ("apple", 1), ("apple", 1), ("banana", 1)
    Output: ("apple", 2), ("banana", 1)

5. Shuffle and Sort

  • Purpose: Organizes and groups intermediate key-value pairs by key and prepares them for the reducer.
  • How it Works:
    • Shuffle: Intermediate data is transferred from mappers to reducers so that all values for the same key, from every mapper, end up at a single reducer.
    • Sort: Ensures that keys are in sorted order before being processed by the reducer.
  • Example:
    Input from mappers:
    Mapper 1: ("apple", 2), ("banana", 1)
    Mapper 2: ("apple", 1), ("orange", 1)
    After shuffle and sort:
    Reducer 1: ("apple", [2, 1])
    Reducer 2: ("banana", [1]), ("orange", [1])

6. Reducing Phase (REDUCE)

  • Purpose: Aggregates the grouped key-value pairs to produce the final output.
  • How it Works:
    • The reducer applies the user-defined reduce() function to grouped values.
    • This produces a single value or result for each key.
  • Example:
    Input: ("apple", [2, 1])
    Output: ("apple", 3)

7. Output

  • Purpose: Writes the final key-value pairs to the Hadoop Distributed File System (HDFS).
  • How it Works:
    • Each reducer writes its output to a separate file in HDFS.
    • The final output is split across files, one per reducer.
  • Example:
    Reducer 1 writes: ("apple", 3)
    Reducer 2 writes: ("banana", 1), ("orange", 1)
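
In the Java API the output directory is set in the driver; each reducer's file is named part-r-00000, part-r-00001, and so on (OutputConfig is a hypothetical helper):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class OutputConfig {
        // Reducer i writes its final (key, value) pairs to <dir>/part-r-0000i in HDFS.
        static void setOutput(Job job, String dir) {
            FileOutputFormat.setOutputPath(job, new Path(dir));
        }
    }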

Overall Example: Word Count

  • Input: A text file with content:
    "apple banana apple orange banana apple"
  • Process:
    • Split into chunks and converted into pairs:
      (1, "apple banana apple"), (2, "orange banana apple")
    • Mapper outputs:
      ("apple", 1), ("banana", 1), ("apple", 1), ...
    • Shuffle and sort groups data:
      ("apple", [1, 1, 1]), ("banana", [1, 1]), ("orange", [1])
    • Reducer outputs:
      ("apple", 3), ("banana", 2), ("orange", 1)
  • Output: Written to HDFS as a file.
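
Tying the steps together, a minimal word-count driver might look like this (a sketch using the hypothetical class names above; input and output paths are taken from the command line):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCountMapper.class);
            job.setCombinerClass(WordCountReducer.class);  // optional local aggregation
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // input in HDFS
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output dir in HDFS
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }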
