1. Hadoop
Hadoop is an open-source implementation of the MapReduce framework, developed under the Apache Software Foundation. It is written in Java and uses its own storage layer, HDFS (Hadoop Distributed File System), modeled on Google's GFS.
2. Core Components of Hadoop
Hadoop has two main layers:
| Layer | Description |
|---|---|
| 🔹 HDFS | Stores large files across multiple machines (like the Google File System) |
| 🔹 MapReduce Engine | Performs parallel processing and computation on the data stored in HDFS |
3. HDFS – Hadoop Distributed File System
Architecture
- Master-Slave Model
  - NameNode – master node (manages the file-system namespace and metadata)
  - DataNodes – worker nodes (store the actual data blocks)
Each file is split into fixed-size blocks (default = 64 MB in early Hadoop versions) and stored across multiple DataNodes.
The NameNode keeps track of which DataNodes hold each block.
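The splitting rule above is simple arithmetic: a file occupies as many blocks as it takes to cover its size. A minimal sketch (the function name is ours, not a real HDFS API):

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the classic HDFS default

def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of blocks a file occupies; the last block may be partially full."""
    return math.ceil(file_size / block_size)

# A 200 MB file needs ceil(200 / 64) = 4 blocks.
print(split_into_blocks(200 * 1024 * 1024))  # → 4
```

Note that the last block of a 200 MB file holds only 8 MB; unlike a regular file system, HDFS does not waste the remaining space on disk.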
4. Key Features of HDFS
Fault Tolerance (Reliable Storage)
- Block Replication: each block is stored as 3 copies by default.
  - Copy 1: on the node where the file is being written
  - Copy 2: on a node in a different (remote) rack
  - Copy 3: on a different node in that same remote rack
- Heartbeat: sent regularly by each DataNode so the NameNode knows it is alive.
- BlockReport: sent by each DataNode to report which blocks it stores.
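HDFS's documented default placement puts the first replica on the writer's node, the second on a node in a remote rack, and the third on a different node in that same remote rack. A toy sketch of that policy (hypothetical data structures, not the real `BlockPlacementPolicy` API):

```python
def place_replicas(writer_node, nodes_by_rack):
    """Pick 3 nodes for a block's replicas, rack-aware:
       replica 1 -> the writer's own node
       replica 2 -> a node in a different (remote) rack
       replica 3 -> a different node in that same remote rack"""
    writer_rack = next(r for r, ns in nodes_by_rack.items() if writer_node in ns)
    remote_rack = next(r for r in nodes_by_rack if r != writer_rack)
    remote_nodes = nodes_by_rack[remote_rack]
    return [writer_node, remote_nodes[0], remote_nodes[1]]

racks = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
print(place_replicas("dn1", racks))  # → ['dn1', 'dn3', 'dn4']
```

The design trade-off: two replicas in one remote rack keep write traffic across racks low, while the third copy in a different rack survives a whole-rack failure.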
High Throughput
- Designed for batch processing and streaming reads, not low-latency random access.
- Blocks are large (64MB+) for faster sequential access.
5. File Operations in HDFS
Reading a File
- The client sends an "open" request to the NameNode.
- The NameNode returns the list of blocks and the DataNodes that hold each one.
- The client reads each block in order from the nearest DataNode.
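The read steps above can be sketched as follows. The `namenode` and `datanodes` dictionaries are hypothetical in-memory stand-ins for the real services:

```python
def read_file(filename, namenode, datanodes):
    """Toy sketch of the HDFS read path.
    namenode:  filename -> ordered list of (block_id, datanode) pairs
    datanodes: datanode -> {block_id: bytes}"""
    data = b""
    for block_id, dn in namenode[filename]:  # steps 1-2: NameNode returns block locations
        data += datanodes[dn][block_id]      # step 3: read each block from its DataNode
    return data

namenode = {"/logs/a.txt": [("blk_1", "dn1"), ("blk_2", "dn3")]}
datanodes = {"dn1": {"blk_1": b"hello "}, "dn3": {"blk_2": b"world"}}
print(read_file("/logs/a.txt", namenode, datanodes))  # → b'hello world'
```

Note that the NameNode only hands out metadata; the actual bytes flow directly between the client and the DataNodes, which keeps the master from becoming a bandwidth bottleneck.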
Writing a File
- The client sends a "create" request to the NameNode.
- Data is buffered into a packet queue.
- A streamer thread sends each block to the first DataNode in the pipeline, which forwards it to the next DataNode for replication.
- The process repeats for every block.
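The forwarding chain can be sketched as a recursive hand-off down the pipeline (hypothetical in-memory stand-ins for real DataNodes):

```python
def write_block(block, pipeline, stores):
    """Toy sketch of HDFS pipeline replication: the client sends the block to
    the first DataNode, and each DataNode forwards it to the next in line."""
    if not pipeline:
        return
    head, rest = pipeline[0], pipeline[1:]
    stores.setdefault(head, []).append(block)  # this DataNode persists the block...
    write_block(block, rest, stores)           # ...then forwards it downstream

stores = {}
for blk in [b"block-0", b"block-1"]:           # the process repeats for every block
    write_block(blk, ["dn1", "dn3", "dn4"], stores)
print(stores)  # every node in the pipeline holds both blocks
```

The pipeline design means the client uploads each block only once; the DataNodes themselves carry the replication traffic.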
6. MapReduce in Hadoop
Architecture
- Also master-slave:
  - JobTracker (master) → manages the entire job
  - TaskTrackers (slaves) → run the actual Map and Reduce tasks on each node
Execution Slots
- Each TaskTracker has slots based on the number of CPU threads.
- Example: A system with 2 CPUs, each supporting 4 threads → 8 slots total
- Each slot runs one map or reduce task.
1-to-1 Mapping
- One map task processes one data block
- So, the number of map tasks = number of blocks
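The two counting rules above (slots per TaskTracker, and one map task per block) are simple arithmetic; a minimal sketch under the notes' simplified one-slot-per-thread model:

```python
import math

def total_slots(num_cpus: int, threads_per_cpu: int) -> int:
    """Slots on one TaskTracker: one per hardware thread (simplified model)."""
    return num_cpus * threads_per_cpu

def num_map_tasks(file_size: int, block_size: int = 64 * 1024 * 1024) -> int:
    """1-to-1 mapping: one map task per HDFS block of the input file."""
    return math.ceil(file_size / block_size)

print(total_slots(2, 4))                  # → 8 slots, each running one map or reduce task
print(num_map_tasks(1024 * 1024 * 1024))  # → 16 map tasks for a 1 GB input file
```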

Points to Remember
| Feature | HDFS | MapReduce |
|---|---|---|
| Stores Data | Yes | No |
| Handles Computation | No | Yes |
| Master Node | NameNode | JobTracker |
| Slave Nodes | DataNodes | TaskTrackers |