Introduction
Apache HBase is an open-source, distributed, wide-column database built on top of HDFS. Inspired by Google's Bigtable paper, it provides real-time random read/write access to very large datasets, making it a core component of the Hadoop ecosystem for use cases requiring low-latency access to massive tables.
What Apache HBase Does
- Stores billions of rows with millions of columns in a sparse, distributed table
- Provides strongly consistent, low-latency reads and writes at the row level, with automatic region splitting and load balancing as tables grow
- Integrates natively with HDFS for fault-tolerant, replicated data storage
- Supports coprocessors (similar to stored procedures) for server-side computation
- Enables real-time analytics when combined with Apache Phoenix for SQL access
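The data model above can be seen quickly in the HBase shell; the table name, column family, and values below are purely illustrative:

```
create 'users', 'info'                   # table with one column family, 'info'
put 'users', 'row1', 'info:name', 'Ada'  # write a single cell
get 'users', 'row1'                      # point read by row key
scan 'users', {LIMIT => 10}              # range scan over sorted row keys
```

Columns (like `info:name`) are created on write, not declared up front, which is what makes sparse tables with millions of columns practical.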
Architecture Overview
HBase organizes data into Regions, each covering a contiguous range of row keys. A RegionServer process hosts multiple Regions and handles read/write requests for them. The HMaster coordinates Region assignment, schema changes, and load balancing across RegionServers, while ZooKeeper handles cluster coordination and RegionServer health monitoring. Data is stored in HFiles (an SSTable-like format) on HDFS, with a Write-Ahead Log (WAL) guaranteeing durability for writes that have reached the in-memory MemStore but have not yet been flushed to an HFile.
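Because regions cover sorted, contiguous row-key ranges, routing a request means finding the region whose start key is the greatest one less than or equal to the row key. A minimal sketch of that lookup, with hypothetical split points (this mirrors the idea, not the actual HBase client code):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class RegionLookup {
    // Hypothetical region start keys; the first region implicitly starts at "".
    static final List<String> startKeys = Arrays.asList("", "g", "n", "t");

    // Return the index of the region owning rowKey: the last region whose
    // start key sorts <= rowKey.
    static int regionFor(String rowKey) {
        int idx = Collections.binarySearch(startKeys, rowKey);
        // binarySearch returns (-(insertionPoint) - 1) on a miss; the owning
        // region sits just before the insertion point.
        return idx >= 0 ? idx : -idx - 2;
    }

    public static void main(String[] args) {
        System.out.println(regionFor("apple"));  // region 0: ["", "g")
        System.out.println(regionFor("monkey")); // region 1: ["g", "n")
        System.out.println(regionFor("zebra"));  // region 3: ["t", end)
    }
}
```

Clients cache this region map locally (fetched from the `hbase:meta` table) so most requests go straight to the right RegionServer without consulting the HMaster.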
Self-Hosting & Configuration
- Requires Java 8 or 11, Hadoop (HDFS), and ZooKeeper for distributed mode
- Use standalone mode with built-in ZooKeeper for local development and testing
- Configure hbase-site.xml for cluster settings: hbase.rootdir, hbase.zookeeper.quorum
- Tune hbase.regionserver.handler.count and hbase.hregion.memstore.flush.size for throughput
- Deploy via Apache Ambari, Cloudera, or containerized setups for production clusters
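A minimal hbase-site.xml for distributed mode might look like this; the hostnames and HDFS path are placeholders to adapt to your cluster:

```xml
<configuration>
  <!-- Run against a real HDFS/ZooKeeper cluster rather than standalone -->
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <!-- HDFS directory where HBase persists its data -->
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://namenode.example.com:8020/hbase</value>
  </property>
  <!-- ZooKeeper ensemble used for coordination -->
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>zk1.example.com,zk2.example.com,zk3.example.com</value>
  </property>
</configuration>
```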
Key Features
- Automatic region splitting divides a region once it grows beyond the configured size threshold, spreading data evenly across servers
- Block cache and Bloom filters accelerate point lookups across large datasets
- Coprocessors enable server-side filtering, aggregation, and secondary index maintenance
- Snapshots provide instant point-in-time backups without stopping reads or writes
- Replication supports cross-datacenter synchronization for disaster recovery
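To illustrate the Bloom-filter optimization from the list above: each HFile carries a Bloom filter so that a point Get can skip files that definitely do not contain the row key. A toy sketch of the idea, using two hash functions derived from `hashCode` (an illustrative simplification; HBase's real implementation differs):

```java
import java.util.BitSet;

public class BloomSketch {
    private final BitSet bits = new BitSet(1024);

    void add(String rowKey) {
        bits.set(Math.floorMod(rowKey.hashCode(), 1024));
        bits.set(Math.floorMod(rowKey.hashCode() * 31 + 7, 1024));
    }

    // false => the key is certainly absent, so the whole HFile can be skipped;
    // true  => the key *might* be present and the file must actually be read.
    boolean mightContain(String rowKey) {
        return bits.get(Math.floorMod(rowKey.hashCode(), 1024))
            && bits.get(Math.floorMod(rowKey.hashCode() * 31 + 7, 1024));
    }
}
```

Because false positives are possible but false negatives are not, the filter can only ever cause an unnecessary read, never a missed one, which is exactly the trade-off that makes it safe as a read-path shortcut.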
Comparison with Similar Tools
- Apache Cassandra — Peer-to-peer wide-column store with tunable consistency; HBase offers stronger consistency because each region is served by exactly one RegionServer at a time
- Google Cloud Bigtable — Managed service implementing the same Bigtable model; accessible through an HBase-compatible client API
- ScyllaDB — C++ rewrite of Cassandra for better single-node performance; lacks HBase's tight HDFS integration
- Apache Accumulo — Bigtable-inspired with cell-level security; less widely adopted than HBase
- DynamoDB — AWS managed key-value and document database; proprietary with no self-hosted option
FAQ
Q: Does HBase support SQL queries? A: HBase itself uses a Java API and shell commands. Apache Phoenix adds a SQL layer on top of HBase, enabling JDBC access and SQL queries with secondary indexes.
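A Phoenix session over HBase might look like the following; the schema and data are illustrative:

```sql
CREATE TABLE users (id BIGINT PRIMARY KEY, name VARCHAR, city VARCHAR);
CREATE INDEX idx_city ON users (city);  -- secondary index maintained by Phoenix
UPSERT INTO users VALUES (1, 'Ada', 'London');
SELECT name FROM users WHERE city = 'London';
```

Phoenix compiles these statements into native HBase scans and coprocessor calls, so queries run inside the cluster rather than pulling raw data to the client.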
Q: When should I choose HBase over Cassandra? A: Choose HBase when you need strong consistency per row, tight Hadoop ecosystem integration, or coprocessor-based server-side logic. Choose Cassandra for multi-datacenter deployments requiring tunable consistency and peer-to-peer architecture.
Q: How does HBase handle failures? A: If a RegionServer fails, ZooKeeper detects the loss and HMaster reassigns its Regions to surviving servers. The WAL on HDFS is replayed to recover any unflushed writes.
Q: Is HBase suitable for small datasets? A: HBase is designed for large-scale data. For datasets under a few hundred GB, a relational database or simpler key-value store will be easier to operate and likely perform better.