HBase is a distributed, scalable, NoSQL database built on top of Hadoop. It is designed to store huge amounts of structured or semi-structured data and provide fast, random read/write access. To achieve this, HBase relies on three main components in its architecture: HMaster, Region Server, and ZooKeeper.

HMaster

The HMaster acts as the main coordinator of the HBase cluster.
Think of it as the manager that oversees how data is distributed and how the cluster functions.

Key Roles of HMaster

Assigns regions (data chunks) to Region Servers
Manages table operations like create, delete, and modify
Monitors the health of Region Servers
Balances load across servers
Handles failover when a server crashes

Production clusters usually run one or more standby HMasters so that, if the active HMaster fails, a standby can take over and keep the cluster available.
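To make region assignment and failover more concrete, here is a minimal sketch in Python. It is a toy model, not HBase's actual balancer: the function names, the round-robin strategy, and the server names (rs1, rs2, rs3) are all assumptions chosen for illustration.

```python
def assign_regions(regions, servers):
    """Spread regions across servers round-robin (a toy stand-in for load balancing)."""
    assignment = {server: [] for server in servers}
    for i, region in enumerate(regions):
        assignment[servers[i % len(servers)]].append(region)
    return assignment


def handle_failure(assignment, dead_server):
    """Reassign the dead server's regions to the least-loaded surviving servers."""
    survivors = {s: list(r) for s, r in assignment.items() if s != dead_server}
    if not survivors:
        raise RuntimeError("no live Region Servers left")
    for region in assignment.get(dead_server, []):
        target = min(survivors, key=lambda s: len(survivors[s]))
        survivors[target].append(region)
    return survivors


regions = [f"region-{i}" for i in range(6)]
cluster = assign_regions(regions, ["rs1", "rs2", "rs3"])
after_crash = handle_failure(cluster, "rs2")  # rs2's regions move to rs1 and rs3
```

After the simulated crash, every region is still served by some server, which is the property the HMaster is responsible for restoring.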

Region Server

HBase tables are very large, so they are divided horizontally into smaller parts called Regions. A Region Server is responsible for managing these regions.

What Region Servers Do

Store and manage regions, each containing data for a specific row-key range
Handle read and write requests from clients
Store each column family of a region separately, which makes the column family the basic unit of physical storage in HBase
Run on top of HDFS DataNodes, making use of Hadoop’s storage

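A region splits automatically once it grows past a configured maximum size (hbase.hregion.max.filesize, which is 10 GB by default in recent releases; older releases used 256 MB), so new regions are created as the table grows.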
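The idea of row-key ranges and automatic splits can be sketched as follows. This is a simplified model under stated assumptions: the Table class and its methods are hypothetical, and a row-count threshold stands in for the real size-based limit.

```python
import bisect


class Table:
    """Toy table whose regions are contiguous, sorted row-key ranges."""

    def __init__(self, max_rows_per_region=4):
        self.max_rows = max_rows_per_region  # stand-in for a size limit in bytes
        self.start_keys = [""]               # region i starts at start_keys[i]
        self.regions = [[]]                  # sorted row keys held by each region

    def region_index(self, row_key):
        # The owning region is the last one whose start key is <= row_key.
        return bisect.bisect_right(self.start_keys, row_key) - 1

    def put(self, row_key):
        i = self.region_index(row_key)
        rows = self.regions[i]
        pos = bisect.bisect_left(rows, row_key)
        if pos == len(rows) or rows[pos] != row_key:
            rows.insert(pos, row_key)
        if len(rows) > self.max_rows:
            self._split(i)

    def _split(self, i):
        rows = self.regions[i]
        mid = len(rows) // 2
        # The middle key becomes the start key of the new daughter region.
        self.start_keys.insert(i + 1, rows[mid])
        self.regions[i:i + 1] = [rows[:mid], rows[mid:]]


table = Table(max_rows_per_region=4)
for key in ["row05", "row01", "row09", "row03", "row07"]:
    table.put(key)  # the fifth insert pushes the region over its limit
```

Once the fifth row arrives, the single region splits in two at the middle key, and later lookups are routed to whichever region's range contains the row key.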

ZooKeeper

ZooKeeper works like a traffic controller for the HBase cluster.

ZooKeeper Responsibilities

Tracks where the hbase:meta catalog table is hosted, which clients then use to find the Region Server for a given row key
Monitors server failures and helps in quick recovery
Maintains cluster configuration
Provides distributed synchronization

Without ZooKeeper, coordination between HMaster, Region Servers, and clients would not be possible.
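Here is a rough sketch of how a client might locate the server for a row key. It assumes the usual arrangement in which ZooKeeper records where the hbase:meta catalog table is hosted and the client caches what it reads from that table. The dictionaries, the Client class, and the server names are hypothetical stand-ins, not the real client API.

```python
import bisect

# Hypothetical cluster state, for illustration only.
ZOOKEEPER = {"/hbase/meta-region-server": "rs1"}

# Simplified hbase:meta content: (region start key, hosting Region Server).
META_ON = {"rs1": [("", "rs2"), ("m", "rs3")]}


class Client:
    def __init__(self):
        self.meta_cache = None  # cached copy of the region map
        self.zk_lookups = 0

    def _load_meta(self):
        # Step 1: ask ZooKeeper which server hosts hbase:meta.
        self.zk_lookups += 1
        meta_server = ZOOKEEPER["/hbase/meta-region-server"]
        # Step 2: read the region map from that server and cache it.
        self.meta_cache = sorted(META_ON[meta_server])

    def locate(self, row_key):
        if self.meta_cache is None:
            self._load_meta()
        starts = [start for start, _ in self.meta_cache]
        i = bisect.bisect_right(starts, row_key) - 1
        return self.meta_cache[i][1]


client = Client()
server_a = client.locate("apple")  # falls in the region starting at ""
server_z = client.locate("zebra")  # falls in the region starting at "m"
```

Because the region map is cached, the second lookup does not contact ZooKeeper again.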

How HBase Works

How Data is Written in HBase (Write Path)

Flow: Client → Region Server → WAL → MemStore → HFile

When you write data to HBase, here’s what actually happens:

1. The client sends a write request

Just like sending a message to a server saying, “Please save this data.”

2. Region Server writes to WAL (Write Ahead Log)

WAL is like a safety notebook.

Before HBase stores data in memory, it writes a copy to WAL so that nothing gets lost if the server crashes.
Think of WAL as saving a draft before writing the final version.

3. Data goes into MemStore (memory buffer)

This is a temporary holding area in RAM.

MemStore collects recent writes, making the system very fast because writing to memory is much quicker than writing to disk.

4. When MemStore becomes full, data is flushed to disk

Once the MemStore reaches a certain size, HBase saves its content permanently to disk as an HFile in HDFS.

This is like moving items from your desk (fast access) into a file cabinet (permanent storage).

5. Compaction happens in the background

Over time, many small HFiles get created.
HBase merges these smaller files into larger ones, which:

Reduces storage space
Speeds up read operations
Keeps data organized

This process is called compaction.
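The five steps above can be sketched as a small in-memory model. It is only an illustration: the RegionStore class, its thresholds (counted in rows rather than bytes), and the list-based "files" are assumptions, not how HBase stores data internally.

```python
class RegionStore:
    """Toy write path: WAL append, MemStore buffer, flush to HFiles, compaction."""

    def __init__(self, flush_threshold=3, compact_threshold=3):
        self.wal = []        # append-only log of edits not yet flushed
        self.memstore = {}   # in-memory buffer of recent writes
        self.hfiles = []     # immutable sorted files, oldest first
        self.flush_threshold = flush_threshold
        self.compact_threshold = compact_threshold

    def put(self, row_key, value):
        self.wal.append((row_key, value))  # log first, for crash recovery
        self.memstore[row_key] = value     # then buffer in memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Persist the MemStore as one sorted, immutable file.
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore = {}
        self.wal = []  # flushed edits no longer need to be replayed
        if len(self.hfiles) >= self.compact_threshold:
            self.compact()

    def compact(self):
        merged = {}
        for hfile in self.hfiles:  # oldest to newest, so newer values win
            merged.update(hfile)
        self.hfiles = [sorted(merged.items())]


store = RegionStore(flush_threshold=2, compact_threshold=2)
for row_key, value in [("r1", "a"), ("r2", "b"), ("r1", "c"), ("r3", "d"), ("r4", "e")]:
    store.put(row_key, value)
```

After these five writes, two flushes have happened and the resulting files have been compacted into one, keeping only the newest value for r1, while r4 is still waiting in the MemStore and the WAL.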

How Data is Read in HBase (Read Path)

Flow: Client → Region Server → BlockCache → MemStore → HFile

When a client wants to read data, HBase tries to return the answer as fast as possible.

1. Client contacts ZooKeeper

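ZooKeeper tells the client where the hbase:meta catalog table is hosted. The client looks up the Region Server for its row key in that table and caches the answer.
Later requests for the same region skip this lookup, which saves time.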

2. Region Server checks BlockCache (fastest place)

BlockCache is like the recently used memory (similar to how your phone keeps recently used apps active).

If the requested data is here → instant answer.

3. If not, it checks MemStore

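MemStore may still hold recent writes that have not been flushed to HDFS yet. In practice the Region Server merges these with whatever it finds in BlockCache and HFiles, so the newest version of a cell is always the one returned.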

4. Finally, it looks into HFile (stored in HDFS)

If the data is not found in cache or MemStore, the Region Server reads it from the actual HFiles stored in HDFS.

This is the slowest path because it involves disk I/O, but block indexes and Bloom filters inside the HFiles keep it efficient.

Why are reads fast?

Because most of the time:

Recently used data is in BlockCache
Recently written data is in MemStore

So HBase often returns results without touching the disk.
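A minimal sketch of the read path is shown below. It is a toy model with hypothetical names (RegionReader, disk_reads), and it makes one deliberate choice: it consults the MemStore before the cache so that an unflushed write is never hidden by an older cached value. A real Region Server merges these sources rather than checking them one by one.

```python
from collections import OrderedDict


class RegionReader:
    def __init__(self, memstore, hfiles, cache_size=2):
        self.memstore = memstore          # unflushed writes (newest data)
        self.hfiles = hfiles              # list of dicts, oldest first
        self.block_cache = OrderedDict()  # LRU cache of values read from HFiles
        self.cache_size = cache_size
        self.disk_reads = 0               # counts simulated HFile accesses

    def get(self, row_key):
        if row_key in self.memstore:
            return self.memstore[row_key]
        if row_key in self.block_cache:
            self.block_cache.move_to_end(row_key)  # mark as recently used
            return self.block_cache[row_key]
        for hfile in reversed(self.hfiles):        # newest file first
            self.disk_reads += 1
            if row_key in hfile:
                self._cache(row_key, hfile[row_key])
                return hfile[row_key]
        return None

    def _cache(self, row_key, value):
        self.block_cache[row_key] = value
        if len(self.block_cache) > self.cache_size:
            self.block_cache.popitem(last=False)   # evict least recently used


reader = RegionReader(
    memstore={"r9": "new"},
    hfiles=[{"r1": "old", "r2": "b"}, {"r1": "newer"}],
)
first = reader.get("r1")           # read from the newest HFile, then cached
reads_after_first = reader.disk_reads
second = reader.get("r1")          # served from the cache
reads_after_second = reader.disk_reads
recent = reader.get("r9")          # served from the MemStore
```

The second read of r1 does not increase the disk-read counter, which is the effect the cache is meant to have.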

Advantages of HBase

Handles massive datasets easily
Scales horizontally just by adding more machines
Cost-effective for storing terabytes to petabytes of data on commodity hardware
High availability through HDFS replication and automatic failover
Suitable for real-time read/write workloads

Disadvantages of HBase

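No native SQL support; queries go through the HBase API unless a SQL layer such as Apache Phoenix is added
No multi-row ACID transactions; atomicity is guaranteed only within a single row
Rows are sorted and indexed only by row key; there are no built-in secondary indexes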
Requires careful memory management in large clusters
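The row-key limitation in the list above is easier to see with an example. The sketch below uses a sorted Python list as a stand-in for a table; the composite "user#date" key layout and the function names are assumptions for illustration.

```python
import bisect

# Rows are kept sorted by row key, as in an HBase table.
rows = sorted([
    ("user1#2024-01-03", 20),
    ("user2#2024-01-01", 5),
    ("user1#2024-01-01", 10),
    ("user3#2024-01-02", 7),
])
keys = [key for key, _ in rows]


def scan(start, stop):
    """Return rows with start <= key < stop, like a range scan on row keys."""
    lo = bisect.bisect_left(keys, start)
    hi = bisect.bisect_left(keys, stop)
    return rows[lo:hi]


def find_by_value(value):
    """With no secondary index, filtering on a value means checking every row."""
    return [key for key, v in rows if v == value]


# '$' is the character immediately after '#', so this range covers the "user1#" prefix.
user1_rows = scan("user1#", "user1$")
matches = find_by_value(7)
```

A lookup by key prefix touches only the matching slice, while a lookup by value has to walk the whole table. This is why row-key design matters so much in HBase.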

HBase vs HDFS

Feature	HBase	HDFS
Access Pattern	Low-latency reads/writes	High-latency, batch processing
Data Access	Random read/write	Write once, read many
APIs	Shell, Java, REST, Thrift	Command-line shell, Java FileSystem API, WebHDFS (REST)
Use Case	Real-time data	Large file storage & batch jobs

Key Features of HBase Architecture

Distributed & Scalable

HBase can grow across hundreds or thousands of machines, allowing it to store enormous datasets.

Column-oriented Storage

Data is grouped and stored by column family, so a query that needs only some families can avoid reading the rest.

Tight Hadoop Integration

Built on HDFS and works seamlessly with MapReduce and other Hadoop tools.

Strong Consistency

Reads and writes to a single row are strongly consistent, because each region is served by exactly one Region Server at a time.

Built-in Caching

Frequently accessed data is cached in memory for faster performance.

Data Compression

Column families can be compressed, which reduces storage usage and can speed up retrieval by cutting disk I/O.

Flexible Schema

Columns can be added dynamically without redefining the entire table, which suits evolving data. Only the column families need to be declared when the table is created.
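As a rough illustration of this flexibility, the sketch below models a table as nested dictionaries. The family names (info, stats), the put helper, and the sample values are hypothetical; the point is only that column families are fixed while the columns inside them are not.

```python
from collections import defaultdict

COLUMN_FAMILIES = {"info", "stats"}  # declared up front (hypothetical names)

# row key -> column family -> column qualifier -> value
table = defaultdict(lambda: defaultdict(dict))


def put(row_key, family, qualifier, value):
    if family not in COLUMN_FAMILIES:
        raise KeyError(f"unknown column family: {family}")
    table[row_key][family][qualifier] = value


put("user1", "info", "name", "Asha")
put("user1", "info", "email", "asha@example.com")
put("user2", "info", "name", "Ben")
put("user2", "stats", "logins", 3)  # a column that user1 never defined
```

Each row carries only the columns that were actually written to it, so sparse and uneven rows cost nothing extra.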

Real-world Use Case

HBase is popular for online, real-time workloads that need random access to very large datasets. For example, banks use HBase for real-time ATM transaction updates, where fast and consistent data operations are crucial.
