Automation with Ansible — Setting up Hadoop Clusters

This is the third article in the Automation with Ansible series. For the second article, please refer to this link.

In this series, we look at different ways in which Ansible can be used to automate tasks in the IT industry.


What is Hadoop?

To store and process large volumes of data, we use distributed systems, which combine multiple computers so that together they can store or process a given volume of data. Several products serve this purpose, such as Hadoop, Ceph and GlusterFS, but we will focus on Hadoop.

Setting up the HDFS cluster using Ansible

Ansible is designed for configuration management, and it greatly simplifies setting up the HDFS cluster and starting its services. We will write one playbook for each step; the playbooks that configure and start the cluster are named as follows:

  1. setup_namenode.yml — Configure the hdfs-site.xml and core-site.xml files of the name node, and format the local directory of the name node for shared storage.
  2. setup_datanode.yml — Configure the hdfs-site.xml and core-site.xml files of the data nodes.
  3. start_services.yml — Start the Hadoop services on the name node and the data nodes.

Figure 1: Directory structure for all files

Installing Hadoop and supporting JDK
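
The playbook for this step is not reproduced here, so the following is only a minimal sketch of what it could look like. It assumes OpenJDK 8 from the RHEL 8 repositories and a Hadoop 3.x release tarball unpacked under /opt; the version, the URL variable and the install directory are illustrative choices, not the values used in the original setup.

```yaml
# Sketch only: version, URL and paths below are assumptions.
- name: Install the JDK and Hadoop on every node
  hosts: all
  become: true
  vars:
    hadoop_version: "3.3.6"          # assumed release
    hadoop_install_dir: /opt         # assumed install location
    hadoop_tarball_url: "https://archive.apache.org/dist/hadoop/common/hadoop-{{ hadoop_version }}/hadoop-{{ hadoop_version }}.tar.gz"
  tasks:
    - name: Install OpenJDK 8 from the distribution repositories
      ansible.builtin.dnf:
        name: java-1.8.0-openjdk-devel
        state: present

    - name: Download and unpack the Hadoop tarball (skipped once unpacked)
      ansible.builtin.unarchive:
        src: "{{ hadoop_tarball_url }}"
        dest: "{{ hadoop_install_dir }}"
        remote_src: true
        creates: "{{ hadoop_install_dir }}/hadoop-{{ hadoop_version }}"
```

The `creates` argument keeps the download from being repeated on later runs, so the play can be re-run safely.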

Setting up the Name Node
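
As described earlier, this playbook writes core-site.xml and hdfs-site.xml on the name node and then formats its storage directory. A minimal sketch follows, assuming Hadoop 3.x unpacked at /opt/hadoop-3.3.6, an inventory group called `namenode`, a storage directory /nn and port 9000; all of these names are hypothetical. The property names `fs.defaultFS` and `dfs.namenode.name.dir` apply to Hadoop 2 and later; Hadoop 1.x uses `fs.default.name` and `dfs.name.dir` instead.

```yaml
# Sketch only: group name, paths, port and JAVA_HOME are assumptions.
- name: Configure and format the name node
  hosts: namenode
  become: true
  vars:
    hadoop_home: /opt/hadoop-3.3.6
    java_home: /usr/lib/jvm/java-1.8.0-openjdk
    namenode_dir: /nn
    namenode_port: 9000
  tasks:
    - name: Write core-site.xml with the name node address
      ansible.builtin.copy:
        dest: "{{ hadoop_home }}/etc/hadoop/core-site.xml"
        mode: "0644"
        content: |
          <?xml version="1.0"?>
          <configuration>
            <property>
              <name>fs.defaultFS</name>
              <value>hdfs://{{ inventory_hostname }}:{{ namenode_port }}</value>
            </property>
          </configuration>

    - name: Write hdfs-site.xml with the metadata directory
      ansible.builtin.copy:
        dest: "{{ hadoop_home }}/etc/hadoop/hdfs-site.xml"
        mode: "0644"
        content: |
          <?xml version="1.0"?>
          <configuration>
            <property>
              <name>dfs.namenode.name.dir</name>
              <value>{{ namenode_dir }}</value>
            </property>
          </configuration>

    - name: Format the name node directory on the first run only
      ansible.builtin.command:
        cmd: "{{ hadoop_home }}/bin/hdfs namenode -format -nonInteractive"
        creates: "{{ namenode_dir }}/current/VERSION"
      environment:
        JAVA_HOME: "{{ java_home }}"
```

Formatting wipes the name node metadata, so the `creates` guard matters: once the VERSION file exists, Ansible skips the format task on every later run.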

Setting up the Data Node(s)
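
The data nodes need the same two files, except that core-site.xml must point at the name node and hdfs-site.xml names the local block storage directory. The sketch below assumes Hadoop 3.x at /opt/hadoop-3.3.6, inventory groups called `namenode` and `datanodes`, a storage directory /dn and port 9000; these names are hypothetical and should be adapted to the actual inventory.

```yaml
# Sketch only: group names, paths and port are assumptions.
- name: Configure the data nodes
  hosts: datanodes
  become: true
  vars:
    hadoop_home: /opt/hadoop-3.3.6
    datanode_dir: /dn
    namenode_port: 9000
  tasks:
    - name: Create the local directory that will hold HDFS blocks
      ansible.builtin.file:
        path: "{{ datanode_dir }}"
        state: directory
        mode: "0755"

    - name: Write core-site.xml pointing at the first host of the namenode group
      ansible.builtin.copy:
        dest: "{{ hadoop_home }}/etc/hadoop/core-site.xml"
        mode: "0644"
        content: |
          <?xml version="1.0"?>
          <configuration>
            <property>
              <name>fs.defaultFS</name>
              <value>hdfs://{{ groups['namenode'][0] }}:{{ namenode_port }}</value>
            </property>
          </configuration>

    - name: Write hdfs-site.xml with the block storage directory
      ansible.builtin.copy:
        dest: "{{ hadoop_home }}/etc/hadoop/hdfs-site.xml"
        mode: "0644"
        content: |
          <?xml version="1.0"?>
          <configuration>
            <property>
              <name>dfs.datanode.data.dir</name>
              <value>{{ datanode_dir }}</value>
            </property>
          </configuration>
```

Because the play targets a group rather than a single host, adding a data node only requires adding it to the inventory.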

Starting the Hadoop Services in all the nodes
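
One way to start the daemons is sketched below. It assumes Hadoop 3.x, where `hdfs --daemon start|status|stop` manages a single daemon (Hadoop 1.x and 2.x use `hadoop-daemon.sh start namenode` instead), and the same hypothetical group names and paths as placeholders. The status check is there so that a second run does not fail on a daemon that is already up.

```yaml
# Sketch only: group names, paths and JAVA_HOME are assumptions.
- name: Start the NameNode daemon
  hosts: namenode
  become: true
  vars:
    hadoop_home: /opt/hadoop-3.3.6
  environment:
    JAVA_HOME: /usr/lib/jvm/java-1.8.0-openjdk
  tasks:
    - name: Check whether the NameNode is already running
      ansible.builtin.command: "{{ hadoop_home }}/bin/hdfs --daemon status namenode"
      register: nn_status
      changed_when: false
      failed_when: false

    - name: Start the NameNode if it is not running
      ansible.builtin.command: "{{ hadoop_home }}/bin/hdfs --daemon start namenode"
      when: nn_status.rc != 0

- name: Start the DataNode daemons
  hosts: datanodes
  become: true
  vars:
    hadoop_home: /opt/hadoop-3.3.6
  environment:
    JAVA_HOME: /usr/lib/jvm/java-1.8.0-openjdk
  tasks:
    - name: Check whether the DataNode is already running
      ansible.builtin.command: "{{ hadoop_home }}/bin/hdfs --daemon status datanode"
      register: dn_status
      changed_when: false
      failed_when: false

    - name: Start the DataNode if it is not running
      ansible.builtin.command: "{{ hadoop_home }}/bin/hdfs --daemon start datanode"
      when: dn_status.rc != 0
```

The name node play comes first so that the data nodes have something to register with when they start.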

Aggregation of all playbooks

# ansible-playbook --syntax-check all_plays.yml
# ansible-playbook -vv all_plays.yml
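
The contents of all_plays.yml are not shown above. A minimal sketch, assuming it does nothing more than import the individual playbooks in dependency order, could look like this:

```yaml
# all_plays.yml (sketch): run the playbooks in dependency order.
# The playbook that installs Hadoop and the JDK would be imported first;
# its file name is not given in this article, so it is left out here.
- name: Configure the name node
  ansible.builtin.import_playbook: setup_namenode.yml

- name: Configure the data nodes
  ansible.builtin.import_playbook: setup_datanode.yml

- name: Start the Hadoop services
  ansible.builtin.import_playbook: start_services.yml
```

Since `import_playbook` is resolved when the file is parsed, the `--syntax-check` run above also checks every imported playbook.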

Checking the Name Node and Data Node(s)

When I tested this example, I used one name node and one data node in the inventory file. Once the playbooks have run successfully, we can check the status of the Hadoop cluster from the name node or a data node using the command:

# hadoop dfsadmin -report

Figure 2: Output of the cluster report if all the nodes are configured successfully
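
The inventory file used for this test is not reproduced here. A minimal sketch in Ansible's YAML inventory format, with one name node and one data node, might look as follows; the group names `namenode` and `datanodes` and the addresses (taken from the documentation range 192.0.2.0/24) are hypothetical.

```yaml
# inventory.yml (sketch): group names and addresses are assumptions.
all:
  children:
    namenode:
      hosts:
        192.0.2.10:
    datanodes:
      hosts:
        192.0.2.21:
        # Add one line per extra data node, for example:
        # 192.0.2.22:
```

It can be passed to the run with `ansible-playbook -i inventory.yml all_plays.yml`.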

Conclusion

In this article, we have seen how to set up an HDFS cluster on RHEL 8 VMs. By editing only the inventory, we can apply the same playbooks to any number of data nodes, which shows how easily Ansible's configuration management scales with the number of targets.

In the next article, we will look at idempotence in Ansible and how it affects the way we write playbooks.
