This repository provides a robust and easily deployable Hadoop High Availability (HA) cluster using Docker and Docker Compose. The architecture is designed to eliminate single points of failure within HDFS and YARN, ensuring continuous operation and data resilience.
The cluster is structured to provide high availability for both HDFS and YARN components. It consists of two master nodes and three worker nodes, all containerized using Docker.
- Master Nodes (`mymaster01`, `mymaster02`):
  - NameNode: Manages the HDFS namespace and regulates client access to files. In the HA setup, one NameNode is Active and the other is Standby.
  - ResourceManager: The primary YARN service, which allocates resources to applications.
  - ZooKeeper: A distributed coordination service used for NameNode failover and other distributed synchronization tasks.
  - ZKFC (ZooKeeper Failover Controller): Manages NameNode failover, working with ZooKeeper to elect the Active NameNode.
  - JournalNode: Stores edit logs from the Active NameNode so that the Standby NameNode can stay synchronized.
- Worker Nodes (`myworker01`, `myworker02`, `myworker03`):
  - DataNode: Stores the actual HDFS data blocks.
  - NodeManager: The per-machine YARN agent that manages user applications and resources on its node.
  - ZooKeeper: `myworker03` hosts a ZooKeeper instance, the third member of the ZooKeeper ensemble.
  - JournalNode: `myworker03` hosts a JournalNode instance, the third member of the shared edit log quorum.
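ZooKeeper and the JournalNodes both rely on majority agreement, which is why a third instance runs on a worker node. The following is a minimal sketch of that arithmetic; the function names are illustrative and not part of this repository.

```shell
#!/usr/bin/env bash
# Majority quorum for an ensemble of N members: floor(N/2) + 1.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# Number of members that can fail while a majority is still available.
tolerated_failures() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

# Three ZooKeeper / JournalNode hosts, as in this cluster:
# mymaster01, mymaster02, myworker03.
members=3
echo "quorum=$(quorum_size "$members") tolerated_failures=$(tolerated_failures "$members")"
# prints: quorum=2 tolerated_failures=1
```

With three members, the ensemble keeps working if any single host goes down. A fourth member would not raise the number of tolerated failures, which is why odd ensemble sizes are the usual choice.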
The cluster operates within a custom Docker network named `cluster-network`, allowing services to communicate using their hostnames. The `docker-compose.yml` defines the following services and their roles:
| Service Name | Hostname | Role |
|---|---|---|
| mymaster01 | mymaster01 | NameNode (Active/Standby), ResourceManager, ZKFC, JournalNode, ZooKeeper |
| mymaster02 | mymaster02 | NameNode (Active/Standby), ResourceManager, ZKFC, JournalNode, ZooKeeper |
| myworker01 | myworker01 | DataNode, NodeManager |
| myworker02 | myworker02 | DataNode, NodeManager |
| myworker03 | myworker03 | DataNode, NodeManager, JournalNode, ZooKeeper |
Internal Communication: Services within `cluster-network` resolve each other by hostname.
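Because the containers address each other by hostname, HA settings such as Hadoop's `ha.zookeeper.quorum` property are typically written as a comma-separated `host:port` list. The sketch below builds such a string for the three ZooKeeper hosts. The helper name is hypothetical, and port 2181 is ZooKeeper's default client port, assumed here rather than taken from this repository's configuration.

```shell
#!/usr/bin/env bash
# Build a comma-separated host:port list from a port and a list of hostnames.
build_quorum() {
  local port="$1"; shift
  local out="" host
  for host in "$@"; do
    # Prepend a comma only when something has already been appended.
    out="${out:+$out,}${host}:${port}"
  done
  echo "$out"
}

# ZooKeeper hosts from the service table; 2181 is an assumed (default) port.
build_quorum 2181 mymaster01 mymaster02 myworker03
# prints: mymaster01:2181,mymaster02:2181,myworker03:2181
```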
External Access: To reach the Hadoop UIs from your host machine, the following ports are mapped:
- Active NameNode UI: http://localhost:9870
- YARN ResourceManager UI: http://localhost:8088
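Once the cluster is running, a quick way to confirm that the mapped ports respond is to probe both UIs from the host. This is a rough sketch that assumes `curl` is installed; the `check_ui` helper is illustrative and not part of the repository.

```shell
#!/usr/bin/env bash
# Report whether an HTTP endpoint answers: "UP" on success, "DOWN" otherwise.
check_ui() {
  local name="$1" url="$2"
  # -f: treat HTTP errors (>= 400) as failure; --max-time: avoid hanging.
  if curl -fsS --max-time 5 -o /dev/null "$url" 2>/dev/null; then
    echo "$name UP $url"
  else
    echo "$name DOWN $url"
  fi
}

check_ui "NameNode" "http://localhost:9870"
check_ui "ResourceManager" "http://localhost:8088"
```

A `DOWN` result right after startup may only mean that the services are still initializing, so it is worth retrying after a short wait.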
- High Availability: Implements HDFS HA with two NameNodes (Active/Standby) and JournalNodes for shared edit logs, ensuring data accessibility and resilience against failures.
- Dockerized Deployment: Simplifies cluster setup and management through containerization, enabling rapid spin-up and teardown of the entire Hadoop ecosystem.
- Automatic Failover: Integrates Apache ZooKeeper and the ZooKeeper Failover Controller (ZKFC) for seamless and automatic NameNode failover, significantly minimizing downtime.
- Scalable DataNodes: The Docker Compose configuration facilitates easy scaling of DataNodes to accommodate diverse data storage and processing requirements.
- Pre-configured Environment: The custom Docker image includes all necessary Hadoop and Java installations, providing a ready-to-use environment with minimal manual setup.
- YARN High Availability: Configures YARN with multiple ResourceManager instances for resilient job scheduling and application management, enhancing overall cluster reliability.
Ensure your system meets the following requirements before proceeding with the cluster setup:
- Docker Engine: Version 18.06.0 or higher (see the official Docker installation guide).
- Docker Compose: Version 1.27.0 or higher (see the official Docker Compose installation guide).
- Minimum System Resources: At least 8 GB of RAM and 4 CPU cores are recommended for the default cluster size. Larger datasets or heavier workloads may require more.
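The version requirements above can be checked with a small comparison helper. This sketch assumes a `sort` that supports `-V` (version sort, as in GNU coreutils); `version_ge` is an illustrative name, and the version in the example call is hard-coded rather than read from a real installation.

```shell
#!/usr/bin/env bash
# Succeeds when version $1 is greater than or equal to version $2.
# Relies on `sort -V` placing the smaller version first.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Example with a hard-coded value; in practice the first argument could come
# from `docker version --format '{{.Server.Version}}'`.
if version_ge "24.0.7" "18.06.0"; then
  echo "Docker Engine version is sufficient"
fi
```

The same helper works for Docker Compose, for example with the output of `docker-compose version --short` compared against `1.27.0`.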
Follow these steps to set up and launch your Hadoop HA cluster:
- Clone the repository:

      git clone https://github.com/MaiSerry/hadoop-HACluster.git
      cd hadoop-HACluster

- Build the Docker image:

  The `Dockerfile` defines the base Hadoop image. Build it once; the image is tagged as `hadoop-ha:1.0`.

      docker build -t hadoop-ha:1.0 .

- Start the Hadoop HA Cluster:

  This command brings up all services (NameNodes, JournalNodes, ZooKeeper, DataNodes) in detached mode (`-d`).

      docker-compose up -d

- Initialize HDFS and format the NameNode:

  This step formats the HDFS filesystem. Run it only once, during the initial setup of the cluster; reformatting an existing cluster erases the HDFS metadata.

      docker exec -it mymaster01 hdfs namenode -format

- Initialize the ZKFC state in ZooKeeper:

  This creates the znode that the ZKFCs use for automatic NameNode failover. It also needs to be done only once.

      docker exec -it mymaster01 hdfs zkfc -formatZK

- Start the HDFS and YARN services:

  Launch the core Hadoop services across the cluster. This starts the NameNodes, DataNodes, ResourceManagers, and NodeManagers.

      docker exec -it mymaster01 start-dfs.sh
      docker exec -it mymaster01 start-yarn.sh
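After the services start, it can be useful to confirm that exactly one NameNode is Active and the other is Standby. The sketch below classifies the two states reported by `hdfs haadmin -getServiceState`. The `ha_pair_status` helper is illustrative, and the NameNode IDs `nn1` and `nn2` in the comments are assumptions; the real IDs are whatever `dfs.ha.namenodes.<nameservice>` lists in this repository's `hdfs-site.xml`.

```shell
#!/usr/bin/env bash
# Classify a pair of NameNode states as reported by
# `hdfs haadmin -getServiceState`: a healthy pair has exactly one
# "active" and one "standby" member.
ha_pair_status() {
  case "$1/$2" in
    active/standby|standby/active) echo "healthy" ;;
    active/active)                 echo "split-brain" ;;
    *)                             echo "degraded" ;;
  esac
}

# With the cluster running, the states could be collected like this
# ("nn1" and "nn2" are assumed NameNode IDs, see the note above):
#   s1=$(docker exec mymaster01 hdfs haadmin -getServiceState nn1)
#   s2=$(docker exec mymaster01 hdfs haadmin -getServiceState nn2)
#   ha_pair_status "$s1" "$s2"

ha_pair_status active standby
# prints: healthy
```

The ResourceManagers can be checked in a similar way with `yarn rmadmin -getServiceState`, again using the IDs defined in the cluster's own configuration.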
- Hadoop 3.4.2
- ZooKeeper 3.8.6
- Docker + Docker Compose
- Ubuntu 24.04
- Java 17 (OpenJDK)