A step-by-step guide to installing a Hadoop cluster on Amazon EC2


This is a step-by-step guide to installing a Hadoop cluster on Amazon EC2. I have my AWS EC2 instance ec2-54-169-106-215.ap-southeast-1.compute.amazonaws.com ready, on which I will install and configure Hadoop; Java 1.7 is already installed.

In case Java is not installed on your AWS EC2 instance, use the commands below:

Command: sudo yum install java-1.7.0-openjdk

Command: sudo yum install java-devel
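
To verify the installation, you can check the Java version (a quick sanity check):

Command: java -version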
I am installing hadoop-2.6.0 on the cluster. The command below downloads the hadoop-2.6.0 package from the Apache archive.

Command: wget https://archive.apache.org/dist/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz
Check that the package was downloaded.

Command: ls
Untar the file.

Command: tar -xvf hadoop-2.6.0.tar.gz
Set the hostname to ec2-user.

Command: sudo hostname ec2-user

Command: hostname
The command below will give you the IP address; for me it is 172.31.26.122.

Command: ifconfig
Edit /etc/hosts file.

Command: sudo vi /etc/hosts
Add the IP address and hostname as below, then save and close the file.
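
For reference, the entry pairs the IP address with the hostname from the previous steps (your IP address will differ):

172.31.26.122 ec2-user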
The ‘ssh-agent’ is a background program that handles passwords for SSH private keys.
The ‘ssh-add’ command prompts the user for a private key password and adds it to the list maintained by ssh-agent. Once you add a password to ssh-agent, you will not be asked to provide it again when using SSH or SCP to connect to hosts with your public key.

You will get a .pem file from your Amazon instance settings; copy it to the Amazon cluster. I have copied the hadoop.pem file to my Amazon EC2 cluster.

Protect key files to avoid any accidental or intentional corruption.

Command: chmod 644 .ssh/authorized_keys
Command: chmod 400 hadoop.pem

Start ssh-agent

Note: Make sure you use the backquote ( ` ), located under the tilde ( ~ ), rather than the single quote ( ' ).

Command: eval `ssh-agent -s`

Add the secure identity to the SSH agent key repository.

Command: ssh-add hadoop.pem

Command: ssh ec2-user@ec2-54-169-106-215.ap-southeast-1.compute.amazonaws.com

You should now be able to log in without a password.
Exit the login session.

Command: exit

Now we will add the Hadoop and Java environment variables to the .bashrc file.

Command: sudo vi .bashrc
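
For reference, the entries would look something like this (the Java path is an assumption for the OpenJDK 1.7 package on Amazon Linux; verify yours with readlink -f $(which java)):

export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk.x86_64
export HADOOP_HOME=/home/ec2-user/hadoop-2.6.0
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin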
Command: source .bashrc
Enter the hadoop command and check that you get a list of options.

Command: hadoop
Check the directories present in hadoop-2.6.0.

Command: cd hadoop-2.6.0
The share directory contains all the jar files.

Command: cd share/hadoop
The sbin directory contains all the script files to start or stop the Hadoop daemons/cluster.
The etc directory contains all the configuration files. We will edit a few of them. Go to the etc/hadoop/ directory.

Command: cd ..

Command: cd ..

Command: cd etc/hadoop

Command: ls
Set JAVA_HOME in the hadoop-env.sh file.

Command: sudo vi hadoop-env.sh
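
The relevant line would look something like this (again, the Java path is an assumption; point it at your own installation):

export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk.x86_64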
Edit the core-site.xml file. This file contains the configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce.

Command: vi core-site.xml
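
A minimal configuration might look like this (the fs.defaultFS value is a sketch that assumes the ec2-user hostname we mapped in /etc/hosts and the common NameNode port 9000):

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://ec2-user:9000</value>
  </property>
</configuration>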
Create the namenode and datanode directories.

Command: mkdir -p /home/ec2-user/hadoop-2.6.0/hdfs/namenode

Command: mkdir -p /home/ec2-user/hadoop-2.6.0/hdfs/datanode
Edit hdfs-site.xml. This file contains the configuration settings for the HDFS daemons: the NameNode, the Secondary NameNode, and the DataNodes.

Command: vi hdfs-site.xml
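
For a single-node cluster, a typical configuration sets the replication factor to 1 and points at the directories created above (a sketch assuming those paths):

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///home/ec2-user/hadoop-2.6.0/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///home/ec2-user/hadoop-2.6.0/hdfs/datanode</value>
  </property>
</configuration>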
Edit mapred-site.xml. This file contains the configuration settings for MapReduce daemons.

Command: cp mapred-site.xml.template mapred-site.xml

Command: vi mapred-site.xml
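
In Hadoop 2.x the MapReduce framework runs on YARN, so the file typically contains just this property (a minimal sketch):

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>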
Edit yarn-site.xml. This file contains the configuration settings for YARN.

Command: vi yarn-site.xml
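
A minimal configuration enables the MapReduce shuffle service on the NodeManager (a sketch of the standard setting):

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>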
Now we are ready to start the cluster. First, we will format the namenode.

Command: cd

Command: hadoop namenode -format
After the namenode is formatted successfully, we will start all the Hadoop daemons.

You are now all set to start the HDFS services, i.e. the NameNode, Secondary NameNode, and DataNode, on your Hadoop cluster.

Command: cd hadoop-2.6.0/sbin/

Command: ./start-dfs.sh

Start the YARN services, i.e. the ResourceManager and NodeManager.

Command: ./start-yarn.sh

Now run the jps command to check whether the daemons are running. You should see the NameNode, DataNode, and SecondaryNameNode from HDFS and the ResourceManager and NodeManager from YARN, along with Jps itself.

Command: jps
Now open a browser on your system and browse to the NameNode web UI:

ec2-54-169-106-215.ap-southeast-1.compute.amazonaws.com:50070

Congratulations, your cluster is up and running.
