Learn How to Set up Hadoop on 4 Amazon Instances



The first part of this tutorial demonstrated how to provision a cluster of four instances on Amazon EC2 and how to connect to them using SSH. This second part picks up from there and explains how to set up a Hadoop cluster with two data nodes, one primary name node, and one secondary name node. Log in to your Amazon management console and start your instances.
[screenshot: starting an instance]

To start your cluster, select all instances, click Actions, then Start, as shown below.
[screenshot: selecting the instances and choosing Actions, then Start]

Confirm that the instance state of each instance is shown as running.
[screenshot: instances in the running state]

To connect to a remote server over SSH we need to specify the path to our .pem key file, the user name, and the public DNS. To avoid typing all of this every time, we can add it to the SSH client configuration file, ~/.ssh/config (not the server-side sshd_config), and then type an easy-to-remember name that resolves to our instance. Open ~/.ssh/config (create it if it does not exist) in a text editor such as gedit and add the lines highlighted in green.
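An alias entry in the SSH client configuration file looks like the sketch below; the alias name, key path, and public DNS are placeholders for your own values:

```
Host namenode1
    HostName ec2-54-XX-XX-XX.compute-1.amazonaws.com
    User ubuntu
    IdentityFile ~/mykey.pem
```

With an entry like this in place, typing ssh namenode1 expands to the full connection command.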

[screenshot: editing the SSH config file]

Save the file. Changes to ~/.ssh/config take effect on the next connection, so no restart is needed. (If you instead edit the server-side configuration, /etc/ssh/sshd_config, restart the SSH daemon with sudo service ssh restart; if you get an error such as "unable to connect to upstart", use the command below.)

sudo systemctl restart ssh
In a Hadoop cluster the name node needs to communicate with the data nodes, and to enable this we need to set up passwordless SSH between them. First we upload the .pem key file from the local machine to the primary name node. To copy files to remote servers we use the scp utility. Run scp at the terminal to check that it is installed; you should get output similar to that shown below.

[screenshot: scp usage output]

To copy a file using scp, specify the path to the .pem key, the file to be copied, and the user and public DNS of the remote server.

[screenshot: copying the .pem key with scp]
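A typical invocation looks like the following sketch; the key file name and public DNS are placeholders for your own values:

```shell
# copy the key itself to the name node's home directory
scp -i ~/mykey.pem ~/mykey.pem ubuntu@ec2-54-XX-XX-XX.compute-1.amazonaws.com:~
```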
After the .pem key has been uploaded, SSH into the primary name node to create an authorization key and add it to authorized_keys on the name node and the data nodes.

Alternatively you can use the host alias we specified in the SSH config, like this: ssh namenode1

[screenshot: connecting to namenode1 over SSH]

Files are uploaded to the working directory on the remote server. Check the working directory with pwd and list its contents with ls to confirm the .pem key file was uploaded correctly.
[screenshot: confirming the .pem key upload with pwd and ls]

Move the private key to the .ssh directory and rename it to id_rsa.

[screenshot: moving the key to .ssh/id_rsa]
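A sketch of the move, assuming the key was uploaded as ~/mykey.pem (the file name is a placeholder); SSH also requires the key file to be readable only by its owner:

```shell
mv ~/mykey.pem ~/.ssh/id_rsa
chmod 600 ~/.ssh/id_rsa   # SSH rejects private keys with loose permissions
```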

To derive the public key from the private key and add it to the list of authorized keys, use the commands below.

[screenshot: generating the public key on the name node]
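Assuming the private key is now at ~/.ssh/id_rsa, a sketch of deriving the public key and authorizing it on the name node itself:

```shell
# derive the public key from the private key
ssh-keygen -y -f ~/.ssh/id_rsa > ~/.ssh/id_rsa.pub
# authorize it for logins to this machine
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
```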

The public key we created above has to be added to authorized_keys on each of the other nodes in the cluster. The commands below implement that.
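One way to distribute the key, assuming aliases for the other nodes were added to the SSH config earlier (the alias names below are placeholders):

```shell
# append the name node's public key to authorized_keys on every other node
for host in namenode2 datanode1 datanode2; do
    cat ~/.ssh/id_rsa.pub | ssh "$host" 'cat >> ~/.ssh/authorized_keys'
done
```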

After setting up SSH so that the nodes can communicate, we can begin installing Hadoop. Open four terminals and connect to each of the instances. Java is a prerequisite for Hadoop, so we begin by installing Java on each node. Use the commands below in each terminal to install Java and check that it has been installed properly.

[screenshots: updating the package repository, installing Java, and checking the Java version]
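On Ubuntu the installation can look like the sketch below; the OpenJDK package name depends on your Ubuntu release, so openjdk-7-jdk here is an assumption:

```shell
sudo apt-get update
sudo apt-get install -y openjdk-7-jdk   # or openjdk-8-jdk on newer releases
java -version                           # confirm the installation
```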
After Java has been properly set up, download Hadoop to the Downloads directory, extract it, and move it to its installation directory.

[screenshot: downloading Hadoop]


[screenshot: extracting Hadoop]

Move the extracted content to the installation folder using the command below.
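A sketch of the three steps; the Hadoop version, download mirror, and /usr/local/hadoop installation directory are assumptions, so substitute your chosen release and paths:

```shell
cd ~/Downloads
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
tar -xzf hadoop-2.7.3.tar.gz
sudo mv hadoop-2.7.3 /usr/local/hadoop
```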

Open .bashrc by running vi ~/.bashrc and add the lines below. Editing files with vim may be tricky at first, but there are tutorials online. To insert text press i or a. To move the cursor use the arrow keys. To exit vim without saving changes press Escape and type :q!. To save your changes and exit press Escape then ZZ (Shift+zz).

[screenshot: .bashrc additions]
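The additions typically look like the sketch below; the Java and Hadoop paths are assumptions that must match where you actually installed them:

```shell
# assumed install locations -- adjust to match your system
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
```

Adding $HADOOP_HOME/bin and $HADOOP_HOME/sbin to PATH is what makes commands like hadoop and start-dfs.sh available from any directory.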

Reload .bashrc by running source .bashrc

Type hadoop at the terminal to confirm hadoop commands are available. You should get output as shown below.

[screenshot: hadoop command usage output]

Some configuration applies only to the name node, and some applies to all nodes in the cluster. We will begin with the settings that apply to all nodes. Edit hadoop-env.sh on each node and add the path to the correct Java installation.


[screenshot: setting JAVA_HOME in hadoop-env.sh]
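In $HADOOP_HOME/etc/hadoop/hadoop-env.sh, JAVA_HOME is set explicitly; the path below is an assumption, and you can find yours with readlink -f $(which java):

```shell
# assumed Java location -- adjust to match your system
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
```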

Next we edit core-site.xml. Here we declare the default Hadoop file system, using the public DNS of the name node instead of localhost. You need to substitute the value with the public DNS that points to your instance.
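A sketch of the property; the DNS name and port 9000 are placeholders to replace with your own values:

```xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <!-- replace with the public DNS of your name node -->
    <value>hdfs://ec2-54-XX-XX-XX.compute-1.amazonaws.com:9000</value>
  </property>
</configuration>
```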


The other file to be edited is yarn-site.xml. Add the lines below to the file, replacing the value of yarn.resourcemanager.hostname with the public DNS that points to your instance.
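A sketch of the properties; the DNS name is a placeholder, and the shuffle service shown is the auxiliary service MapReduce jobs commonly require:

```xml
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <!-- replace with the public DNS of your name node -->
    <value>ec2-54-XX-XX-XX.compute-1.amazonaws.com</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
```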


The last file to be edited is mapred-site.xml. Only a template is provided, so we first need to copy it to create the file. The commands below copy the template and rename it.
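A sketch of the copy, assuming Hadoop's configuration directory is $HADOOP_HOME/etc/hadoop:

```shell
cd $HADOOP_HOME/etc/hadoop
cp mapred-site.xml.template mapred-site.xml
```

Inside the copied file, mapreduce.framework.name is typically set to yarn so that MapReduce jobs run on YARN.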



Once we are done with the common configuration on all nodes, we can move on and add configuration specific to the name node. First we need to add the address and host name of each of the nodes to /etc/hosts. The host name is obtained by typing hostname at the terminal; it is also the first part of the private DNS (the part composed of numbers). The public DNS and IP of each instance are shown in the EC2 console. Open the file in a text editor and add an entry for each node.
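Each /etc/hosts entry maps an address to one or more names. A sketch with placeholder addresses and host names to replace with your own:

```
# /etc/hosts additions -- addresses and host names are placeholders
54.XX.XX.1   ip-172-31-1-1    # primary name node
54.XX.XX.2   ip-172-31-1-2    # secondary name node
54.XX.XX.3   ip-172-31-1-3    # data node 1
54.XX.XX.4   ip-172-31-1-4    # data node 2
```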



Next we edit hdfs-site.xml to specify the replication factor and the directory where the name node will store its data. With two data nodes a replication factor of 2 is appropriate, since replication cannot usefully exceed the number of data nodes. Create a directory on the name node for storing data.

Add these properties to hdfs-site.xml on the name node.

[screenshot: creating the name node data directory]
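A sketch of the name node properties; the storage path is an assumption, and dfs.replication should not exceed your number of data nodes:

```xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <!-- assumed path; create it first and make sure it exists -->
    <value>file:///usr/local/hadoop/hadoop_data/hdfs/namenode</value>
  </property>
</configuration>
```

Create the directory before starting the cluster, e.g. sudo mkdir -p /usr/local/hadoop/hadoop_data/hdfs/namenode.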
We need to create a masters file in the Hadoop configuration directory and add the host name of the name node.

Edit the slaves file found in the Hadoop configuration directory and add the host names of the slave nodes.
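A sketch of creating both files, assuming the configuration directory is $HADOOP_HOME/etc/hadoop; the host names are placeholders for your own:

```shell
cd $HADOOP_HOME/etc/hadoop
# host names below are placeholders
echo "ip-172-31-1-1" > masters                       # name node
printf "ip-172-31-1-3\nip-172-31-1-4\n" > slaves     # data nodes
```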

At this point the name node configuration is complete. The last thing we need to do is change ownership of the $HADOOP_HOME directory to the ubuntu user.
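Assuming $HADOOP_HOME is set as in .bashrc, a sketch of the ownership change:

```shell
sudo chown -R ubuntu:ubuntu $HADOOP_HOME
```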

Before starting our cluster we need to create a directory on each of the data nodes that will be used to store blocks. Create the directory and add its path to hdfs-site.xml.
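A sketch for each data node; the path is an assumption and must match the value you set in hdfs-site.xml:

```shell
# on each data node
sudo mkdir -p /usr/local/hadoop/hadoop_data/hdfs/datanode
```

In hdfs-site.xml on the data nodes, the dfs.datanode.data.dir property should point at this path.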

Finally, change the ownership of $HADOOP_HOME on each of the data nodes to the ubuntu user.

From the name node, we format the HDFS file system and start the cluster.
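A sketch of the final steps, assuming $HADOOP_HOME/bin and $HADOOP_HOME/sbin are on the PATH as set in .bashrc:

```shell
hdfs namenode -format   # one-time format of the new file system
start-dfs.sh            # start HDFS daemons across the cluster
start-yarn.sh           # start YARN daemons
jps                     # list running Java processes to verify the daemons
```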

