The data flow of running a MapReduce job across TaskTrackers using the Hadoop library

Hadoop is an open-source framework that supports the MapReduce programming model for processing large-scale data across a distributed cluster.
The job execution involves three major components:

  • User Node
  • JobTracker (Master Node)
  • TaskTrackers (Worker Nodes)

Step-by-Step Explanation of the Data Flow:

1. Job Submission (User Node → JobTracker):

  • The user calls runJob(conf) from the user node.
  • The user node:
    • Requests a new Job ID from the JobTracker.
    • Computes input file splits based on the HDFS input data.
    • Uploads the following to the JobTracker’s file system:
      • The job’s JAR file.
      • The configuration (conf) file.
      • The computed input splits.
  • Then, the job is submitted to the JobTracker using submitJob(); a minimal driver sketch follows this list.
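
A minimal sketch of a driver that triggers this submission path, using the classic (MRv1) org.apache.hadoop.mapred API, is shown below. The class names WordCountDriver, WordCountMapper, and WordCountReducer are illustrative, not part of Hadoop itself, and the input/output paths are assumed to come from the command line:

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;

  public class WordCountDriver {
      public static void main(String[] args) throws Exception {
          // The "conf" object the text refers to: job name, classes, I/O types.
          JobConf conf = new JobConf(WordCountDriver.class);
          conf.setJobName("word-count");

          conf.setOutputKeyClass(Text.class);
          conf.setOutputValueClass(IntWritable.class);

          // Mapper and Reducer classes (sketched under Step 3 below).
          conf.setMapperClass(WordCountMapper.class);
          conf.setReducerClass(WordCountReducer.class);

          // HDFS input and output paths, taken from the command line.
          FileInputFormat.setInputPaths(conf, new Path(args[0]));
          FileOutputFormat.setOutputPath(conf, new Path(args[1]));

          // runJob() drives the submission described above: it obtains a job ID,
          // computes the input splits, uploads the JAR/conf/splits, calls
          // submitJob(), and then polls the job until it finishes.
          JobClient.runJob(conf);
      }
  }

Note that JobClient.runJob(conf) blocks and prints progress until the job completes, while submitJob() is the non-blocking call it uses internally.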

2. Task Assignment (JobTracker → TaskTrackers):

  • The JobTracker:
    • Creates Map Tasks for each input split.
    • Assigns each Map Task to a TaskTracker, preferably one on or near the node holding that split’s data (data-locality optimization).
    • Creates Reduce Tasks (the number is set by the user in the job configuration).
    • Assigns Reduce Tasks to any available TaskTrackers (no data locality is considered for reduce tasks); a short configuration sketch follows this list.
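
As a rough sketch (continuing the JobConf from the Step 1 example), the reduce-task count is set explicitly, whereas the map-task count follows from the number of input splits; the values below are arbitrary example figures:

  // User-chosen number of reduce tasks.
  conf.setNumReduceTasks(4);

  // The number of map tasks is not set directly; it follows from the input
  // splits computed at submission time. A split-size hint can influence it,
  // e.g. raising the minimum split size (classic MRv1 property name):
  conf.setLong("mapred.min.split.size", 256L * 1024 * 1024);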

3. Task Execution (At TaskTrackers):

  • Each TaskTracker:
    • Copies the job JAR file from the JobTracker’s file system.
    • Launches a separate child JVM (Java Virtual Machine) for the task.
    • Executes the assigned Map or Reduce task inside that JVM; a sketch of such task code follows this list.
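
Below is a minimal sketch of what the task code might look like with the classic (MRv1) API. WordCountMapper and WordCountReducer are the illustrative class names referenced in the Step 1 driver, not classes provided by Hadoop:

  import java.io.IOException;
  import java.util.Iterator;
  import java.util.StringTokenizer;

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reducer;
  import org.apache.hadoop.mapred.Reporter;

  // Map task body: runs inside the child JVM launched by a TaskTracker.
  public class WordCountMapper extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, IntWritable> output,
                      Reporter reporter) throws IOException {
          StringTokenizer tokens = new StringTokenizer(value.toString());
          while (tokens.hasMoreTokens()) {
              word.set(tokens.nextToken());
              output.collect(word, ONE);   // emit (word, 1) for each token
          }
      }
  }

  // Reduce task body: sums the counts emitted for each word.
  class WordCountReducer extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {
      public void reduce(Text key, Iterator<IntWritable> values,
                         OutputCollector<Text, IntWritable> output,
                         Reporter reporter) throws IOException {
          int sum = 0;
          while (values.hasNext()) {
              sum += values.next().get();
          }
          output.collect(key, new IntWritable(sum));
      }
  }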

4. Task Running Check (Heartbeat Monitoring):

  • TaskTrackers send periodic heartbeat signals to the JobTracker. Each heartbeat:
    • Confirms that the TaskTracker is alive.
    • Updates the JobTracker on the status of its tasks (finished, running, or failed).
    • Indicates whether the TaskTracker is ready to accept a new task; a purely illustrative sketch follows this list.
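
The sketch below only mirrors the heartbeat contract described above; the classes and method names are hypothetical and are not Hadoop’s real internal RPC interfaces:

  import java.util.Arrays;
  import java.util.Collections;
  import java.util.List;

  public class HeartbeatSketch {

      // Hypothetical status record, not a Hadoop class.
      static class TaskStatus {
          final String taskId;
          final String state;   // "running", "finished", or "failed"
          TaskStatus(String taskId, String state) { this.taskId = taskId; this.state = state; }
          public String toString() { return taskId + "=" + state; }
      }

      // Stand-in for the JobTracker side: note that the tracker is alive,
      // record its task statuses, and hand back new work if it has capacity.
      static List<String> jobTrackerHeartbeat(String trackerId,
                                              List<TaskStatus> tasks,
                                              boolean readyForNewTask) {
          System.out.println(trackerId + " is alive, tasks: " + tasks);
          return readyForNewTask ? Arrays.asList("map_0042")
                                 : Collections.<String>emptyList();
      }

      // Stand-in for the TaskTracker side: report status periodically.
      public static void main(String[] args) throws InterruptedException {
          for (int beat = 0; beat < 3; beat++) {            // a real tracker loops until shutdown
              List<TaskStatus> local = Arrays.asList(new TaskStatus("map_0001", "running"));
              List<String> newlyAssigned = jobTrackerHeartbeat("tracker-1", local, true);
              System.out.println("newly assigned: " + newlyAssigned);
              Thread.sleep(3000);                           // heartbeat interval (a few seconds)
          }
      }
  }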
