Hadoop Library from Apache

1. Hadoop

Hadoop is Apache's open-source implementation of the MapReduce framework. It is written in Java and uses its own storage layer, HDFS (the Hadoop Distributed File System), modeled on Google's proprietary GFS.


2. Core Components of Hadoop

Hadoop has two main layers:

| Layer | Description |
| --- | --- |
| HDFS | Stores large files across multiple machines (like the Google File System) |
| MapReduce Engine | Performs parallel processing and computation on the data stored in HDFS |

3. HDFS – Hadoop Distributed File System

Architecture

  • Master-Slave Model
    • NameNode – Master node (controls metadata and file system)
    • DataNodes – Worker nodes (store actual data blocks)

Each file is split into blocks (default size = 64MB) and stored across multiple DataNodes.
The NameNode keeps track of where each block is stored.
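The block-splitting rule above is just a ceiling division. A minimal sketch (function name and sizes are illustrative, not part of any Hadoop API):

```python
import math

def num_blocks(file_size_bytes, block_size_bytes=64 * 1024 * 1024):
    """How many HDFS blocks a file occupies, given the default 64 MB block size."""
    if file_size_bytes == 0:
        return 0
    return math.ceil(file_size_bytes / block_size_bytes)

# A 200 MB file needs 4 blocks: three full 64 MB blocks plus one 8 MB remainder.
print(num_blocks(200 * 1024 * 1024))  # 4
```

Note that the last block can be smaller than the block size; HDFS does not pad it.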


4. Key Features of HDFS

Fault Tolerance (Reliable Storage)

  • Block Replication: Each block is stored in 3 copies by default.
    • Copy 1: on original node
    • Copy 2: on another node in the same rack
    • Copy 3: on a node in a different rack
  • Heartbeat: Sent regularly by each DataNode to confirm it’s working.
  • BlockReport: Sent by DataNodes to report which blocks they store.
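The replica-placement rule above can be sketched as a small function. This follows the three-copy layout exactly as described in this section (all node and rack names are hypothetical):

```python
def place_replicas(writer_node, nodes_by_rack):
    """Pick 3 replica locations per the rack-aware policy described above."""
    # Copy 1: the node where the write originates.
    replicas = [writer_node]
    writer_rack = next(r for r, nodes in nodes_by_rack.items()
                       if writer_node in nodes)
    # Copy 2: another node in the same rack.
    same_rack = [n for n in nodes_by_rack[writer_rack] if n != writer_node]
    replicas.append(same_rack[0])
    # Copy 3: a node in a different rack, so a whole-rack failure
    # (e.g. a switch outage) cannot lose all copies.
    other_rack = next(r for r in nodes_by_rack if r != writer_rack)
    replicas.append(nodes_by_rack[other_rack][0])
    return replicas

cluster = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
print(place_replicas("dn1", cluster))  # ['dn1', 'dn2', 'dn3']
```

Real HDFS also weighs factors like node load and available space when choosing among candidates; the sketch picks the first eligible node.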

High Throughput

  • Designed for batch processing, not low-latency reads.
  • Blocks are large (64MB+) for faster sequential access.

5. File Operations in HDFS

Reading a File

  1. User sends “open” request to NameNode.
  2. NameNode returns locations of blocks and DataNodes.
  3. User reads each block sequentially from the nearest DataNode.
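The three read steps can be simulated in a few lines. This is a toy in-memory model, not the Hadoop client API; `read_file`, the metadata layout, and the DataNode names are all assumptions made for illustration:

```python
def read_file(path, namenode_metadata, read_block):
    """Simulate the HDFS read path: open -> get block locations -> read blocks."""
    block_locations = namenode_metadata[path]  # steps 1-2: NameNode lookup
    data = b""
    for block_id, datanodes in block_locations:
        nearest = datanodes[0]                 # step 3: nearest replica first
        data += read_block(nearest, block_id)
    return data

# Toy cluster: block contents keyed by (datanode, block_id).
storage = {("dn1", "blk_0"): b"hello ", ("dn2", "blk_1"): b"world"}
metadata = {"/logs/a.txt": [("blk_0", ["dn1", "dn3"]), ("blk_1", ["dn2"])]}
print(read_file("/logs/a.txt", metadata, lambda dn, blk: storage[(dn, blk)]))
# b'hello world'
```

The key point the sketch preserves: the NameNode serves only metadata; the actual bytes flow directly from DataNodes to the client.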

Writing a File

  1. User sends “create” request to NameNode.
  2. Data is written to a queue.
  3. A streamer sends each block to the first DataNode, which forwards the data to the next DataNode in the pipeline for replication.
  4. Process repeats for every block.
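The pipeline in step 3 can be sketched as a chain of forwards: each DataNode stores the block, then passes it downstream. A minimal simulation (in-memory dicts stand in for DataNodes; names are hypothetical):

```python
def write_block(block, pipeline, datanode_storage):
    """Simulate HDFS pipeline replication: the client streams a block to the
    first DataNode, which forwards it to the next, until all replicas exist."""
    if not pipeline:
        return
    first, rest = pipeline[0], pipeline[1:]
    datanode_storage[first].append(block)        # this DataNode stores a copy...
    write_block(block, rest, datanode_storage)   # ...then forwards downstream

storage = {"dn1": [], "dn2": [], "dn3": []}
write_block(b"block-0", ["dn1", "dn2", "dn3"], storage)
print(storage)  # each of the three DataNodes now holds a copy of block-0
```

Pipelining means the client only talks to the first DataNode; replication bandwidth is spread across the cluster instead of concentrated at the writer.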

6. MapReduce in Hadoop

Architecture

  • Also master-slave:
    • JobTracker (master) → Manages entire job
    • TaskTrackers (slaves) → Run actual Map and Reduce tasks on each node
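To make the Map and Reduce phases concrete, here is a tiny in-process word count. It mirrors the data flow (map, shuffle-by-key, reduce) but runs in one process; in Hadoop the JobTracker would schedule these calls across TaskTrackers. All function names here are illustrative, not Hadoop APIs:

```python
from collections import defaultdict

def map_fn(line):
    """Map: called once per input record, emits (word, 1) pairs."""
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    """Reduce: combines all values emitted for one key."""
    return word, sum(counts)

def run_job(lines):
    grouped = defaultdict(list)
    for line in lines:                          # map tasks
        for key, value in map_fn(line):
            grouped[key].append(value)          # shuffle: group values by key
    return dict(reduce_fn(k, v) for k, v in grouped.items())  # reduce tasks

print(run_job(["to be or not to be"]))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

The shuffle step (grouping every value by its key before reduction) is what Hadoop performs between the map and reduce phases across the network.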

Execution Slots

  • Each TaskTracker has slots based on the number of CPU threads.
  • Example: A system with 2 CPUs, each supporting 4 threads → 8 slots total
  • Each slot runs one map or reduce task.
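The slot arithmetic above is simply CPUs times threads per CPU (a sketch of the rule as stated; real TaskTracker slot counts are configurable):

```python
def total_slots(num_cpus, threads_per_cpu):
    """Slots available on one TaskTracker under the rule described above."""
    return num_cpus * threads_per_cpu

print(total_slots(2, 4))  # 8: up to 8 map/reduce tasks can run in parallel
```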

1-to-1 Mapping

  • One map task processes one data block
  • So, the number of map tasks = number of blocks

Points to Remember

| Feature | HDFS | MapReduce |
| --- | --- | --- |
| Stores data | Yes | No |
| Handles computation | No | Yes |
| Master node | NameNode | JobTracker |
| Slave nodes | DataNodes | TaskTrackers |
