Hadoop is an open-source framework that supports the MapReduce programming model for processing large-scale data across a distributed cluster.
Job execution involves three major components:
- User Node
- JobTracker (Master Node)
- TaskTrackers (Worker Nodes)

Step-by-Step Explanation of the Data Flow:
1. Job Submission (User Node → JobTracker):
- The user calls `runJob(conf)` from the user node.
- The user node:
  - Requests a new Job ID from the JobTracker.
  - Computes input file splits based on the HDFS input data.
  - Uploads the following to the JobTracker's file system:
    - The job's `.JAR` file.
    - The job configuration (`conf`) file.
    - The computed input splits.
- Then, the job is submitted to the JobTracker using `submitJob()` (a minimal driver sketch follows this step).
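
For concreteness, a minimal driver using the classic `org.apache.hadoop.mapred` API is sketched below. The class name and HDFS paths are placeholders, and the identity map/reduce classes stand in for real user code (a word-count pair is sketched after step 3); `JobClient.runJob(conf)` is the call that triggers the submission steps listed above.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class PassThroughJob {
    public static void main(String[] args) throws Exception {
        // The JobConf is the `conf` object that is uploaded along with the job.
        JobConf conf = new JobConf(PassThroughJob.class);
        conf.setJobName("pass-through");

        // The default TextInputFormat yields (LongWritable offset, Text line) records.
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);

        // Identity map/reduce simply forwards records; real jobs plug in their own classes.
        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);

        // The user chooses the number of reduce tasks.
        conf.setNumReduceTasks(2);

        // Placeholder HDFS paths; input splits are computed from the input path.
        FileInputFormat.setInputPaths(conf, new Path("/user/demo/input"));
        FileOutputFormat.setOutputPath(conf, new Path("/user/demo/output"));

        // runJob() performs the submission steps above and waits for completion.
        JobClient.runJob(conf);
    }
}
```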
2. Task Assignment (JobTracker → TaskTrackers):
- The JobTracker:
  - Creates a Map Task for each input split.
  - Assigns Map Tasks to TaskTrackers, preferring those closest to the data (data-locality optimization; see the sketch after this step).
  - Creates Reduce Tasks (the number is set by the user).
  - Assigns Reduce Tasks to available TaskTrackers (data locality is not considered for reduce tasks).
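
The JobTracker's real scheduler is more involved, but the data-locality preference can be illustrated with a small hypothetical helper: given the hosts that store a split's HDFS blocks, pick a free TaskTracker on one of those hosts if possible, otherwise any free tracker. None of the names below (`chooseTracker`, `freeTrackers`, `splitHosts`) are Hadoop APIs; this is only an illustration of the idea.

```java
import java.util.List;
import java.util.Optional;

public class LocalityAwareAssignment {
    /**
     * Hypothetical helper: prefer a TaskTracker running on one of the hosts
     * that store the split's HDFS blocks; otherwise take any free tracker.
     */
    static Optional<String> chooseTracker(List<String> freeTrackers, List<String> splitHosts) {
        return freeTrackers.stream()
                .filter(splitHosts::contains)                  // data-local tracker, if any
                .findFirst()
                .or(() -> freeTrackers.stream().findFirst());  // otherwise any free tracker
    }

    public static void main(String[] args) {
        List<String> freeTrackers = List.of("node3", "node7");
        List<String> splitHosts = List.of("node7", "node9"); // hosts holding the split's replicas
        System.out.println(chooseTracker(freeTrackers, splitHosts)); // Optional[node7]
    }
}
```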
3. Task Execution (At TaskTrackers):
- Each TaskTracker:
  - Copies the job's `.JAR` file from the JobTracker's file system.
  - Launches a new JVM (Java Virtual Machine).
  - Executes the Map or Reduce task, depending on the assignment (see the map/reduce sketch after this step).
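
The work the child JVM actually executes is the user's map or reduce function. A minimal word-count pair written against the classic `org.apache.hadoop.mapred` interfaces is sketched below; the class names are illustrative, but `Mapper.map()` and `Reducer.reduce()` are the methods the task invokes on each record.

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCount {

    // Map task: emit (word, 1) for every token in the input line.
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                output.collect(word, ONE);
            }
        }
    }

    // Reduce task: sum the counts emitted for each word.
    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }
}
```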
4. Task Running Check (Heartbeat Monitoring):
- TaskTrackers send periodic heartbeat signals to the JobTracker. Each heartbeat:
  - Confirms that the TaskTracker is alive.
  - Updates the JobTracker on task status (finished, running, or failed).
  - Indicates whether the TaskTracker is ready for a new task assignment.
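
Hadoop's heartbeat is an internal RPC between the TaskTracker and the JobTracker, so the sketch below is only a rough illustration of the idea: a worker periodically reports that it is alive, the status of its tasks, and whether it has a free slot for a new task. None of the class or method names are Hadoop APIs, and the 3-second interval is merely illustrative.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative heartbeat loop; not the actual TaskTracker implementation.
public class HeartbeatSketch {
    enum TaskState { RUNNING, FINISHED, FAILED }

    // Task statuses tracked by this (hypothetical) worker.
    private final Map<String, TaskState> taskStatus = new ConcurrentHashMap<>();
    private final int maxSlots = 2;

    void sendHeartbeat() {
        boolean hasFreeSlot = taskStatus.values().stream()
                .filter(s -> s == TaskState.RUNNING)
                .count() < maxSlots;
        // In real Hadoop this would be an RPC to the JobTracker; here we just print it.
        System.out.println("heartbeat: alive=true, tasks=" + taskStatus
                + ", readyForNewTask=" + hasFreeSlot);
    }

    public static void main(String[] args) throws InterruptedException {
        HeartbeatSketch tracker = new HeartbeatSketch();
        tracker.taskStatus.put("attempt_0001_m_000001", TaskState.RUNNING);

        // Send a heartbeat every 3 seconds (an illustrative interval).
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(tracker::sendHeartbeat, 0, 3, TimeUnit.SECONDS);

        Thread.sleep(10_000); // let a few heartbeats go out, then stop
        scheduler.shutdownNow();
    }
}
```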