Automation with Ansible — Setting up Hadoop Clusters

Akshaya Balaji
7 min read · Dec 22, 2020

This is the third article in the Automation with Ansible series. For the second article, please refer to this link.

In this series, we will be looking at different ways in which Ansible can be used to implement automation in the IT industry.


In the previous article, we looked at some of the basic terms in Ansible. We also set up the Docker Community Edition on RHEL8 and launched a container with a web server configured in it. In this article, we are going to look at how we can set up a Hadoop Distributed File System (HDFS) cluster using RHEL8 virtual machines (VMs).

What is Hadoop?

To handle the storage and computation of large volumes of data, we use distributed systems, which integrate the work of multiple computer systems to store or process a given volume of data. There are many products in the industry for this purpose, such as Hadoop, Ceph, and GlusterFS, but here we will focus on Hadoop.

Hadoop is an Apache open-source framework written in Java. It is used for the distributed processing of large datasets across clusters of computers. The framework consists of two major layers:

(i) Processing/Computation layer (MapReduce)

(ii) Storage layer (Hadoop Distributed File System)

In this article, we will look at how to set up Hadoop's HDFS cluster, which belongs to the storage layer, using Ansible. In an HDFS cluster, one or more computer systems contribute storage to a central system. The central system is called the name node, while the computer systems contributing storage to it are called the data nodes. Any client can use the shared storage by connecting to the name node over the HDFS protocol.

The main steps to set up an HDFS cluster are

(i) Install Hadoop and the supporting JDK package on all the name nodes, data nodes, and clients.

(ii) Configure the name node through the files hdfs-site.xml and core-site.xml.

(iii) Format the local directory of the name node through which the storage contributed by the data nodes will be shared.

(iv) Configure the data nodes through the files hdfs-site.xml and core-site.xml.

(v) Start the Hadoop services on the name node and the data nodes.

Setting up the HDFS cluster using Ansible

Ansible is designed for configuration management, and it greatly simplifies the whole process of setting up the HDFS cluster and starting the services. Based on the steps defined above, we will be writing 4 playbooks, which are named as follows:

  1. install_hadoop.yml — To install Hadoop and the supporting JDK on the name node and data nodes
  2. setup_namenode.yml — Configure the hdfs-site.xml and core-site.xml files of the name node, and format the local directory of the name node for shared storage
  3. setup_datanode.yml — Configure the hdfs-site.xml and core-site.xml files of the data nodes.
  4. start_services.yml — Start the Hadoop service in the name node and data nodes.

We will also use an additional playbook, named all_plays.yml, to execute the 4 playbooks in the desired order. This approach is better than writing all the tasks (plays) in a single playbook, because we may not always need to run every play. For example, the installation of Hadoop and the JDK needs to happen only once, so we can simply modify all_plays.yml to pick only the steps we need, as shown in the sketch below.
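As a rough sketch, assuming each step is kept in its own file with the names listed above and pulled in with import_playbook, all_plays.yml could look like this:

- import_playbook: install_hadoop.yml
- import_playbook: setup_namenode.yml
- import_playbook: setup_datanode.yml
- import_playbook: start_services.yml

Commenting out one of the imports is enough to skip a step that has already been completed, such as the one-time installation of Hadoop and the JDK.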

A sample inventory is given below.
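For illustration, an inventory with one name node and one data node could look like the following; adjust the IP addresses and connection credentials (shown here only as placeholders) to match your own systems:

[hadoop_namenodes]
192.168.1.51  ansible_user=root  ansible_ssh_pass=redhat

[hadoop_datanodes]
192.168.1.52  ansible_user=root  ansible_ssh_pass=redhat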

In the inventory, we define two groups — hadoop_namenodes, which includes the target node(s) that must be configured as name nodes, and hadoop_datanodes, which includes the target node(s) that must be configured as data nodes.

The complete directory structure is as shown below.

Figure 1: Directory Structure for all files

The directory group_vars stores all the variables for each group under a file with the group’s name. The contents of these files are given below.
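The variable names themselves depend on how the playbooks and templates are written; as a sketch, with nn_dir being the only name referenced explicitly in this article and the rest being hypothetical, the two files could contain something like:

# group_vars/hadoop_namenodes
nn_dir: /nn
namenode_port: 9001

# group_vars/hadoop_datanodes
dn_dir: /dn
namenode_ip: 192.168.1.51
namenode_port: 9001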

Installing Hadoop and supporting JDK

install_hadoop.yml is run on all the target nodes. First, we check whether the desired versions of the JDK and Hadoop are installed, and store the outputs of these tasks in two variables (installed_jdk and installed_hadoop) using the register keyword. If these checks are not successful (i.e., if the packages are not installed), the tasks that install the JDK and Hadoop are executed. To show the variety of modules available to us, I have used the get_url module to download Hadoop directly onto the target nodes, while copying the JDK's rpm from the controller to the target node with the copy module. To find the same JDK, you can check the link here.

The installations are completed using the command module, since we are installing from rpm files rather than from a configured YUM repository. The command module also lets us pass additional options (needed in the case of Hadoop) during installation.
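A condensed sketch of install_hadoop.yml is given below. The rpm file names and the download URL are placeholders for whichever JDK and Hadoop versions you use, and the install tasks run only when the corresponding check task fails.

- hosts: all
  tasks:
    - name: Check whether the desired JDK is already installed
      command: java -version
      register: installed_jdk
      ignore_errors: yes

    - name: Check whether the desired Hadoop version is already installed
      command: hadoop version
      register: installed_hadoop
      ignore_errors: yes

    - name: Copy the JDK rpm from the controller to the target node
      copy:
        src: files/jdk-8u171-linux-x64.rpm        # placeholder rpm name
        dest: /root/jdk-8u171-linux-x64.rpm
      when: installed_jdk is failed

    - name: Install the JDK from the copied rpm
      command: rpm -ivh /root/jdk-8u171-linux-x64.rpm
      when: installed_jdk is failed

    - name: Download the Hadoop rpm directly onto the target node
      get_url:
        url: "{{ hadoop_rpm_url }}"               # placeholder URL variable
        dest: /root/hadoop-1.2.1-1.x86_64.rpm     # placeholder rpm name
      when: installed_hadoop is failed

    - name: Install Hadoop, passing the extra option it needs
      command: rpm -ivh /root/hadoop-1.2.1-1.x86_64.rpm --force
      when: installed_hadoop is failed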

Setting up the Name Node

setup_namenode.yml consists of all the tasks we need to run to set up the name node. Hence, the playbook is run only on the hosts under the hadoop_namenodes group. The tasks are as follows (a rough sketch of the playbook is given after the list):

(i) We create a new directory with a name provided by the user (through the group_vars/hadoop_namenodes file) stored in the variable nn_dir. The task’s output status is stored in the variable created_new_dir.

(ii) We stop the name node's Hadoop service if it is already running. This might seem odd, because the playbook is also meant for first-time setup, where there is no running service to stop. However, the failure of this task (i.e., when the name node's Hadoop service is not running) is not allowed to stop the execution of the rest of the playbook.

(iii) We configure the hdfs-site.xml and core-site.xml files using the template module. The template module and the copy module are very similar in operation, since both copy files from the controller to the target node(s). However, the template module additionally processes any special keywords or variables and fills in the right values in the files before copying them to the target node(s).

The template module can distinguish plain strings from special keywords (or variables) only when they are written in the file using the Jinja2 format. The files are saved as sample-nn-hdfs-site.xml.j2 and sample-nn-core-site.xml.j2 on the controller node and copied to the target node as hdfs-site.xml and core-site.xml respectively. You can see the contents of these files in the GitHub repository where all the files for this example are stored; the link is provided at the end of this article.

(iv) If a new directory was created in the first task, it must be formatted before the name node can use it to manage the shared storage contributed by the data nodes. Hence, the task that formats the name node's storage directory is executed only if the changed attribute of the variable created_new_dir is true.
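Putting these four tasks together, a sketch of setup_namenode.yml looks roughly like the following. The configuration path /etc/hadoop and the exact stop and format commands are assumptions based on a classic Hadoop 1.x rpm install; adjust them for your version.

- hosts: hadoop_namenodes
  tasks:
    - name: Create the name node's storage directory
      file:
        path: "{{ nn_dir }}"
        state: directory
      register: created_new_dir

    - name: Stop the name node service if it is already running
      command: hadoop-daemon.sh stop namenode
      ignore_errors: yes

    - name: Configure hdfs-site.xml from its Jinja2 template
      template:
        src: templates/sample-nn-hdfs-site.xml.j2
        dest: /etc/hadoop/hdfs-site.xml

    - name: Configure core-site.xml from its Jinja2 template
      template:
        src: templates/sample-nn-core-site.xml.j2
        dest: /etc/hadoop/core-site.xml

    - name: Format the storage directory only if it was newly created
      shell: echo Y | hadoop namenode -format
      when: created_new_dir.changed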

Setting up the Data Node(s)

The process of setting up the data nodes is similar to that of setting up the name node. The step of formatting the storage directory applies only to the name node, so we do not need to repeat it for the data nodes; a sketch of setup_datanode.yml is given below.
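As before, the /etc/hadoop path is an assumption, the dn_dir variable is expected to come from group_vars/hadoop_datanodes, and the data-node template names are assumed by analogy with the name-node ones.

- hosts: hadoop_datanodes
  tasks:
    - name: Create the data node's storage directory
      file:
        path: "{{ dn_dir }}"
        state: directory

    - name: Stop the data node service if it is already running
      command: hadoop-daemon.sh stop datanode
      ignore_errors: yes

    - name: Configure hdfs-site.xml from its Jinja2 template
      template:
        src: templates/sample-dn-hdfs-site.xml.j2
        dest: /etc/hadoop/hdfs-site.xml

    - name: Configure core-site.xml from its Jinja2 template
      template:
        src: templates/sample-dn-core-site.xml.j2
        dest: /etc/hadoop/core-site.xml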

Starting the Hadoop Services in all the nodes

The playbook start_services.yml only starts the services on the name node and the data node(s), by defining tasks for their respective host groups. In both cases, we also clear the OS cache, since Hadoop services require a lot of memory and clearing the cache helps free memory for Hadoop.
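A sketch of start_services.yml, assuming the classic hadoop-daemon.sh start scripts and dropping the kernel page cache before starting each service:

- hosts: hadoop_namenodes
  tasks:
    - name: Clear the OS cache to free memory for Hadoop
      shell: echo 3 > /proc/sys/vm/drop_caches

    - name: Start the name node service
      command: hadoop-daemon.sh start namenode

- hosts: hadoop_datanodes
  tasks:
    - name: Clear the OS cache to free memory for Hadoop
      shell: echo 3 > /proc/sys/vm/drop_caches

    - name: Start the data node service
      command: hadoop-daemon.sh start datanode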

Aggregation of all playbooks

All the playbooks must be executed in a certain order if we are setting up the Hadoop cluster for the first time, and this order is defined in the all_plays.yml file shown earlier. Now, we can simply execute this one playbook on the command line instead of running each playbook individually:

# ansible-playbook --syntax-check all_plays.yml
# ansible-playbook -vv all_plays.yml

Checking the Name Node and Data Node(s)

When I tested this example, I considered one name node and one data node in the inventory file. Once the playbooks run successfully, we can see the status of the Hadoop cluster from the name node or data node using the command

# hadoop dfsadmin -report

If all the nodes have connected successfully, we see the following output.

Figure 2: Output of the cluster report if all the nodes are configured successfully

Here, we can see that one data node with the IP address 192.168.1.52 is contributing storage to the name node.

Conclusion

In this article, we have seen how to set up an HDFS cluster on RHEL8 VMs. By simply editing the inventory, the same playbooks can be applied to any number of data nodes, which shows Ansible's versatility in configuration management as the number of target systems varies.

You can visit this GitHub repository to find all the files mentioned in this article.

In the next article, we will look at Ansible's idempotent nature and how it affects the way in which we write Ansible playbooks.

