Hadoop Library from Apache

1. Hadoop

Hadoop is Apache's open-source implementation of the MapReduce framework. It is written in Java and uses its own storage layer, HDFS (the Hadoop Distributed File System), modeled on Google's proprietary GFS.


2. Core Components of Hadoop

Hadoop has two main layers:

| Layer | Description |
| --- | --- |
| HDFS | Stores large files across multiple machines (like the Google File System) |
| MapReduce Engine | Performs parallel processing and computation on the data stored in HDFS |

3. HDFS – Hadoop Distributed File System

Architecture

  • Master-Slave Model
    • NameNode – Master node (controls metadata and file system)
    • DataNodes – Worker nodes (store actual data blocks)

Each file is split into blocks (default size = 64MB) and stored across multiple DataNodes.
The NameNode keeps track of where each block is stored.
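The block-splitting rule above is just a ceiling division. A minimal sketch (function name and sizes are illustrative, not part of any Hadoop API):

```python
import math

def num_blocks(file_size_bytes, block_size_bytes=64 * 1024 * 1024):
    """How many HDFS blocks a file occupies, given the default 64 MB block size."""
    if file_size_bytes == 0:
        return 0
    return math.ceil(file_size_bytes / block_size_bytes)

# A 200 MB file needs 4 blocks: three full 64 MB blocks plus one 8 MB remainder.
print(num_blocks(200 * 1024 * 1024))  # 4
```

Note that the last block can be smaller than the block size; HDFS does not pad it.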


4. Key Features of HDFS

Fault Tolerance (Reliable Storage)

  • Block Replication: Each block is stored in 3 copies by default.
    • Copy 1: on original node
    • Copy 2: on another node in the same rack
    • Copy 3: on a node in a different rack
  • Heartbeat: Sent regularly by each DataNode to confirm it’s working.
  • BlockReport: Sent by DataNodes to report which blocks they store.
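The replica-placement rule above can be sketched as a small function. This follows the three-copy layout exactly as described in this section (all node and rack names are hypothetical):

```python
def place_replicas(writer_node, nodes_by_rack):
    """Pick 3 replica locations per the rack-aware policy described above."""
    # Copy 1: the node where the write originates.
    replicas = [writer_node]
    writer_rack = next(r for r, nodes in nodes_by_rack.items()
                       if writer_node in nodes)
    # Copy 2: another node in the same rack.
    same_rack = [n for n in nodes_by_rack[writer_rack] if n != writer_node]
    replicas.append(same_rack[0])
    # Copy 3: a node in a different rack, so a whole-rack failure
    # (e.g. a switch outage) cannot lose all copies.
    other_rack = next(r for r in nodes_by_rack if r != writer_rack)
    replicas.append(nodes_by_rack[other_rack][0])
    return replicas

cluster = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
print(place_replicas("dn1", cluster))  # ['dn1', 'dn2', 'dn3']
```

Real HDFS also weighs factors like node load and available space when choosing among candidates; the sketch picks the first eligible node.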

High Throughput

  • Designed for batch processing, not low-latency reads.
  • Blocks are large (64MB+) for faster sequential access.

5. File Operations in HDFS

Reading a File

  1. User sends “open” request to NameNode.
  2. NameNode returns locations of blocks and DataNodes.
  3. User reads each block sequentially from the nearest DataNode.
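The three read steps can be simulated in a few lines. This is a toy in-memory model, not the Hadoop client API; `read_file`, the metadata layout, and the DataNode names are all assumptions made for illustration:

```python
def read_file(path, namenode_metadata, read_block):
    """Simulate the HDFS read path: open -> get block locations -> read blocks."""
    block_locations = namenode_metadata[path]  # steps 1-2: NameNode lookup
    data = b""
    for block_id, datanodes in block_locations:
        nearest = datanodes[0]                 # step 3: nearest replica first
        data += read_block(nearest, block_id)
    return data

# Toy cluster: block contents keyed by (datanode, block_id).
storage = {("dn1", "blk_0"): b"hello ", ("dn2", "blk_1"): b"world"}
metadata = {"/logs/a.txt": [("blk_0", ["dn1", "dn3"]), ("blk_1", ["dn2"])]}
print(read_file("/logs/a.txt", metadata, lambda dn, blk: storage[(dn, blk)]))
# b'hello world'
```

The key point the sketch preserves: the NameNode serves only metadata; the actual bytes flow directly from DataNodes to the client.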

Writing a File

  1. User sends “create” request to NameNode.
  2. Data is written to a queue.
  3. A streamer sends each block to the first DataNode, which forwards the data to the next DataNode in the pipeline for replication.
  4. Process repeats for every block.
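The pipeline in step 3 can be sketched as a chain of forwards: each DataNode stores the block, then passes it downstream. A minimal simulation (in-memory dicts stand in for DataNodes; names are hypothetical):

```python
def write_block(block, pipeline, datanode_storage):
    """Simulate HDFS pipeline replication: the client streams a block to the
    first DataNode, which forwards it to the next, until all replicas exist."""
    if not pipeline:
        return
    first, rest = pipeline[0], pipeline[1:]
    datanode_storage[first].append(block)        # this DataNode stores a copy...
    write_block(block, rest, datanode_storage)   # ...then forwards downstream

storage = {"dn1": [], "dn2": [], "dn3": []}
write_block(b"block-0", ["dn1", "dn2", "dn3"], storage)
print(storage)  # each of the three DataNodes now holds a copy of block-0
```

Pipelining means the client only talks to the first DataNode; replication bandwidth is spread across the cluster instead of concentrated at the writer.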

6. MapReduce in Hadoop

Architecture

  • Also master-slave:
    • JobTracker (master) → Manages entire job
    • TaskTrackers (slaves) → Run actual Map and Reduce tasks on each node
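To make the Map and Reduce phases concrete, here is a tiny in-process word count. It mirrors the data flow (map, shuffle-by-key, reduce) but runs in one process; in Hadoop the JobTracker would schedule these calls across TaskTrackers. All function names here are illustrative, not Hadoop APIs:

```python
from collections import defaultdict

def map_fn(line):
    """Map: called once per input record, emits (word, 1) pairs."""
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    """Reduce: combines all values emitted for one key."""
    return word, sum(counts)

def run_job(lines):
    grouped = defaultdict(list)
    for line in lines:                          # map tasks
        for key, value in map_fn(line):
            grouped[key].append(value)          # shuffle: group values by key
    return dict(reduce_fn(k, v) for k, v in grouped.items())  # reduce tasks

print(run_job(["to be or not to be"]))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

The shuffle step (grouping every value by its key before reduction) is what Hadoop performs between the map and reduce phases across the network.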

Execution Slots

  • Each TaskTracker has slots based on the number of CPU threads.
  • Example: A system with 2 CPUs, each supporting 4 threads → 8 slots total
  • Each slot runs one map or reduce task.
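The slot arithmetic above is simply CPUs times threads per CPU (a sketch of the rule as stated; real TaskTracker slot counts are configurable):

```python
def total_slots(num_cpus, threads_per_cpu):
    """Slots available on one TaskTracker under the rule described above."""
    return num_cpus * threads_per_cpu

print(total_slots(2, 4))  # 8: up to 8 map/reduce tasks can run in parallel
```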

1-to-1 Mapping

  • One map task processes one data block
  • So, the number of map tasks = number of blocks

Points to Remember

| Feature | HDFS | MapReduce |
| --- | --- | --- |
| Stores data | Yes | No |
| Handles computation | No | Yes |
| Master node | NameNode | JobTracker |
| Slave nodes | DataNodes | TaskTrackers |
