Introduction
Apache Hadoop is an open-source framework for distributed storage and processing of large data sets using clusters of commodity hardware. Originally inspired by Google's MapReduce and GFS papers, Hadoop became the foundation of the modern big data ecosystem and remains widely deployed in enterprise data pipelines.
What Apache Hadoop Does
- Stores petabytes of data across distributed clusters using HDFS (Hadoop Distributed File System)
- Processes large-scale data sets in parallel via the MapReduce programming model
- Manages cluster resources and job scheduling through YARN (Yet Another Resource Negotiator)
- Provides fault tolerance through automatic data replication across nodes
- Supports pluggable storage and compute engines including Spark, Tez, and Hive
Architecture Overview
Hadoop consists of three core modules: HDFS for distributed storage with configurable replication, YARN for resource management and job scheduling, and MapReduce for batch computation. HDFS splits files into blocks distributed across DataNodes, with a NameNode tracking metadata. YARN allocates CPU and memory to application containers managed by per-node NodeManagers and a central ResourceManager.
Self-Hosting & Configuration
- Requires Java 8 or 11 runtime on all cluster nodes
- Configure core-site.xml, hdfs-site.xml, yarn-site.xml, and mapred-site.xml for cluster setup
- Set HDFS replication factor (default 3) based on cluster size and fault tolerance needs
- Use Kerberos authentication for production security via hadoop.security.authentication
- Deploy via Apache Ambari, Cloudera Manager, or containerized setups with Docker/Kubernetes
Key Features
- Linear horizontal scalability from a single node to thousands of machines
- Data locality optimization moves computation to where data resides
- Built-in rack awareness for intelligent data placement and network optimization
- Ecosystem integration with Hive, Pig, HBase, Spark, and dozens of other tools
- Federation and high availability for NameNode eliminate single points of failure
Comparison with Similar Tools
- Apache Spark — in-memory processing engine that runs on top of or independently from Hadoop, faster for iterative workloads
- Apache Flink — native streaming engine with batch support, lower latency than MapReduce
- Presto/Trino — interactive SQL query engine for federated data sources, not a storage layer
- Databricks — managed Spark platform with proprietary optimizations and a commercial model
- Google BigQuery — fully managed serverless analytics warehouse, no cluster management
FAQ
Q: Is Hadoop still relevant with Spark and cloud data warehouses available? A: Yes. HDFS remains the storage backbone for many on-premise data lakes, and YARN orchestrates workloads beyond MapReduce including Spark and Flink jobs.
Q: What is the minimum cluster size for production? A: A minimal production cluster starts at 4-5 nodes (1 NameNode, 1 ResourceManager, 3+ DataNodes), though single-node pseudo-distributed mode works for development.
Q: How does Hadoop handle node failures? A: HDFS replicates each data block across multiple nodes. If a DataNode fails, the NameNode detects the loss via heartbeats and re-replicates under-replicated blocks automatically.
Q: Can Hadoop run in the cloud? A: Yes. AWS EMR, Google Dataproc, and Azure HDInsight provide managed Hadoop clusters, and HDFS can be replaced with cloud object storage like S3.