Set up a Hadoop Stream Processing Stack in less than 10 minutes


Overview

Thanks to all the hard work by the Apache Software Foundation, big data streaming tools and development environments are getting very easy to set up and configure. As an example, we demonstrate an install script created by the developers at Apache Samza that downloads, installs, configures and runs Hadoop 2.4, YARN and ZooKeeper in less than 10 minutes.

That's the good news. What is less easy is configuring a development box for the script. VirtualBox is the most popular solution for setting up an alternative development environment. For example, the Hadoop code is stable on Java 7, and it is easier and more convenient to set up a virtual environment with the correct build artifacts, not to mention the issues with Hadoop ecosystem components on the excellent Windows operating system.

Problem

VirtualBox networking can run into issues with bridged networking on wireless connections, Windows, Mac and some Debian host machines.

In this article we carefully go through all the steps (explaining as we go) needed to set up a wired-up VirtualBox network for CentOS 6, and even the setup of Webmin, the open source alternative to cPanel, on CentOS. The focus is to coherently illustrate the basic concepts for developers unfamiliar with core Linux and VirtualBox networking.

Solution
In the solution outlined below the following definitions are used:

‘host’ refers to the real physical box
‘guest’ refers to the running image (vm) in VirtualBox

For this post a CentOS VirtualBox image is downloaded from virtualboximages. To install the image in VirtualBox, this link has the general overview; follow the steps and use the downloaded image as an existing hard disk image.

CentOS VM image credentials: username root, password adminuser

Once you are up and running, log in to the VM as root and issue the ifconfig command.

ifconfig

From the output we can identify the guest's IP as 10.0.2.15.
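If you would rather not read the address off the full output, the sketch below filters it. It assumes the CentOS 6 style of ifconfig output, where IPv4 addresses appear as "inet addr:..."; the function name guest_ipv4 is made up for this example.

```shell
# Reads ifconfig output on stdin and prints every IPv4 address except loopback.
# Assumes the CentOS 6 output format: "inet addr:10.0.2.15  Bcast:...".
guest_ipv4() {
  sed -n 's/.*inet addr:\([0-9.]*\).*/\1/p' | grep -v '^127\.'
}

# On the guest:
#   ifconfig | guest_ipv4
```

On the image used here this should print the NAT address identified above.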

Set up Host Only network
Create a host-only network in VirtualBox (host-only networking).

On the host machine, issue the ifconfig command; you should now see the host-only network vboxnet0. This is analogous to an Ethernet connection between the guest and the host.
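The same setup can be done from the host's command line with VBoxManage instead of the GUI. This is a sketch under assumptions: the VM name centos6-guest is hypothetical, attaching the guest's second adapter to vboxnet0 is one possible layout (adapter 1 stays on NAT), and the VM should be powered off when modifyvm is run. The run helper only prints the commands unless DRY_RUN=0 is set.

```shell
# Dry-run helper: prints each command unless DRY_RUN=0 is set in the environment.
run() {
  if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; else printf '%s\n' "$*"; fi
}

VM_NAME="centos6-guest"   # hypothetical VM name; use the name shown in VirtualBox

setup_host_only() {
  # Create a host-only interface on the host (the first one is named vboxnet0
  # on Linux and Mac hosts).
  run VBoxManage hostonlyif create
  # Attach the guest's second network adapter to that interface.
  run VBoxManage modifyvm "$VM_NAME" --nic2 hostonly --hostonlyadapter2 vboxnet0
}

setup_host_only
```

Run it once as is to review the commands, then again with DRY_RUN=0 to apply them.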

Set up Port Forwarding
Port forwarding is a common technique to 'glue' sockets between kernels. On Linux, ports below 1024 can only be opened by software running as the root user, so services are often run on a higher port with the low port forwarded to them. Another common example is to forward port 80 (web traffic) to the port where an application's TCP socket is listening. Here we use port forwarding to avoid port conflicts and to map services on the host to equivalent services on the guest (ssh).

For ssh terminal and sftp file browsing
We map port 2222 on the host to 22 on the guest. In the 'Settings' tab select Network (port forwarding works with network address translation (NAT)).
The host IP is 127.0.0.1.

Now you can ssh into the guest via a terminal.
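For reference, the mapping can also be created with VBoxManage. The sketch below assumes a hypothetical VM name (centos6-guest), the rule name guestssh, and that adapter 1 is the NAT adapter. A NAT rule has the form name,protocol,host-ip,host-port,guest-ip,guest-port, and an empty guest IP means the guest's NAT address.

```shell
# Dry-run helper: prints each command unless DRY_RUN=0 is set in the environment.
run() {
  if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; else printf '%s\n' "$*"; fi
}

VM_NAME="centos6-guest"   # hypothetical VM name

# Builds a NAT rule: name,tcp,host-ip,host-port,guest-ip,guest-port.
# The host side is bound to loopback; the guest IP is left empty.
natpf_rule() {
  printf '%s,tcp,127.0.0.1,%s,,%s\n' "$1" "$2" "$3"
}

forward_ssh() {
  # Host port 2222 -> guest port 22 on NAT adapter 1 (VM powered off).
  run VBoxManage modifyvm "$VM_NAME" --natpf1 "$(natpf_rule guestssh 2222 22)"
}

forward_ssh

# Then, from the host:
#   ssh -p 2222 root@127.0.0.1
#   sftp -P 2222 root@127.0.0.1
```

The same helper covers the other mappings in this post, for example natpf_rule webmin 8086 10000. For a VM that is already running, VBoxManage controlvm with natpf1 takes the same rule string.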

It is highly likely that you will get SSH host key warnings. These are easy to resolve: the warning will most probably tell you which command to issue; if not, issue the ssh-keygen command.

After resolution, log in to the guest with ssh.
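A sketch of the usual fix, assuming the 2222 to 22 mapping above: ssh stores keys for non-default ports in known_hosts under the name [host]:port, so that is the entry to remove. The helper name known_hosts_entry is made up for this example.

```shell
# Prints the known_hosts name for a host/port pair:
# the plain host for port 22, "[host]:port" for any other port.
known_hosts_entry() {
  if [ "$2" = "22" ]; then
    printf '%s\n' "$1"
  else
    printf '[%s]:%s\n' "$1" "$2"
  fi
}

# Remove the stale key on the host, then reconnect and accept the new one:
#   ssh-keygen -R "$(known_hosts_entry 127.0.0.1 2222)"
#   ssh -p 2222 root@127.0.0.1
```

If you connect as localhost rather than 127.0.0.1, remove that entry as well, since ssh records each name separately.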

For TCP/IP
Create port access for the web panel.

1) In the guest, boot up the Webmin panel.

2) Open TCP port 10000; on this guest the Webmin web panel is on port 10000.

Here we have mapped 8086 on the host to 10000 on the guest. We access it on the host's localhost, which is mapped to the guest's IP.

On this box you are root / adminuser.

Working through Linux Firewall issues
Obviously for Hadoop development we do not want to rely on the excellent Webmin panel, so let's go through the issues of running our own web service. I have created a simple service with a REST echo endpoint that displays some info about the system. See source code here. It uses an uber jar; see the build file. Build it, use an ssh file browser to load the jar onto the guest, and run it with the usual 'java -jar jserv.jar'.

When we try to access this service on its port 8080, the connection is rejected. By issuing the netstat command in the guest we can see why.

Two things to note: the host to bind to for our web service is '0.0.0.0', and there is no entry for 8080.
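The check can be scripted so that it gives a yes or no answer per port. This sketch assumes the column layout printed by netstat -tln (local address in column 4, state in column 6); the function name listening_on is made up for this example.

```shell
# Reads `netstat -tln` output on stdin.
# Succeeds (exit 0) if some socket is in LISTEN state on port $1, fails otherwise.
listening_on() {
  awk -v port="$1" '
    $6 == "LISTEN" {
      n = split($4, parts, ":")          # works for 0.0.0.0:22 and :::22
      if (parts[n] == port) found = 1
    }
    END { exit !found }
  '
}

# On the guest:
#   netstat -tln | listening_on 8080 && echo "8080 is listening" || echo "nothing on 8080"
```

A port that shows up here but is still unreachable from the host points at the firewall rather than at the service.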

Use iptables to open port 8080.
First edit iptables and add the entry
vi /etc/sysconfig/iptables
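As an alternative to editing by hand, the entry can be generated. The sketch below assumes the stock CentOS 6 rule style (-A INPUT -m state --state NEW -m tcp -p tcp --dport N -j ACCEPT) and places the new rule just before the first INPUT REJECT rule, since rules after a REJECT never match. The function name add_accept_rule is made up for this example, and it only prints the result; it never touches the file itself.

```shell
# Prints the contents of iptables file $1 with an ACCEPT rule for TCP port $2
# inserted immediately before the first "-A INPUT ... -j REJECT" line.
# If the file has no such line, the contents are printed unchanged.
add_accept_rule() {
  awk -v port="$2" '
    !added && /^-A INPUT .*-j REJECT/ {
      print "-A INPUT -m state --state NEW -m tcp -p tcp --dport " port " -j ACCEPT"
      added = 1
    }
    { print }
  ' "$1"
}

# On the guest, as root:
#   add_accept_rule /etc/sysconfig/iptables 8080 > /tmp/iptables.new
#   cp /etc/sysconfig/iptables /etc/sysconfig/iptables.bak
#   cp /tmp/iptables.new /etc/sysconfig/iptables
#   service iptables restart
```

Review /tmp/iptables.new before copying it into place; the backup copy makes it easy to roll back.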

Then, in the source code for the web service, bind to the correct guest host.

Then we map a port on the host (9000) to the web service port on the guest (8080).

Finally we run the service and can access its endpoint.

Set up a Hadoop Stream Processing Stack in less than 10 minutes
Now let's use the skills we just acquired to rapidly set up a development environment for Hadoop streaming applications.

On your Marks!
In the running guest

  1. Purge older Java.

  2. Download and install Oracle Java 7.

  3. Set JAVA_HOME.

Reboot.

Check the environment.
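One way to make JAVA_HOME survive the reboot is a small file under /etc/profile.d. This is a sketch: the function name write_java_profile is made up, and the JDK path /usr/java/latest is an assumption about where the Oracle installer put Java 7 on your guest, so check it first.

```shell
# Writes a profile snippet to file $1 that exports JAVA_HOME=$2
# and puts $JAVA_HOME/bin at the front of PATH.
write_java_profile() {
  cat > "$1" <<EOF
export JAVA_HOME=$2
export PATH=\$JAVA_HOME/bin:\$PATH
EOF
}

# On the guest, as root (the JDK path is an assumption, adjust to your install):
#   rpm -qa | grep -i -E 'jdk|java'      # list Java packages before purging
#   write_java_profile /etc/profile.d/java.sh /usr/java/latest
#
# After the reboot, check the environment:
#   echo "$JAVA_HOME"
#   java -version
```

Files in /etc/profile.d are sourced by login shells, so the variables are set for every user on the next login.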


Next, download the install script and copy it to the guest to install Hadoop, Kafka and ZooKeeper.

Run the script on the guest.
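A minimal sketch of the invocation, under a loudly stated assumption: that the script is the grid script from Apache Samza's hello-samza project (the one that accepts the start all and stop all arguments used in this post), copied to the current directory on the guest as grid. In that script the bootstrap argument downloads, installs and starts the components. If your script differs, check its usage message first.

```shell
# Dry-run helper: prints each command unless DRY_RUN=0 is set in the environment.
run() {
  if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; else printf '%s\n' "$*"; fi
}

bootstrap_stack() {
  # Assumption: "bootstrap" is the hello-samza grid argument that installs
  # and starts everything in one go.
  run sh grid bootstrap
}

bootstrap_stack
```

JAVA_HOME must already be set in the shell that runs the script, which is why the environment check comes first.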

We need to shut down the stack and open the ports for our new Hadoop stream processing stack.

sh grid stop all

Add YARN port 8088 to iptables just as before.

Port forwarding for Yarn
Reboot the stack.
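Put together, bringing the stack back up on the guest can be sketched as follows. It assumes the iptables file has already been edited for port 8088 and that the grid script sits in the current directory; reloading the firewall first makes sure the new rule is active before YARN starts.

```shell
# Dry-run helper: prints each command unless DRY_RUN=0 is set in the environment.
run() {
  if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; else printf '%s\n' "$*"; fi
}

start_stack() {
  # Reload the firewall so the rule for port 8088 takes effect.
  run service iptables restart
  # Start every component managed by the grid script.
  run sh grid start all
}

start_stack

# Afterwards, confirm on the guest that YARN is listening:
#   netstat -tln | grep ':8088 '
```

Run as root on the guest with DRY_RUN=0 to execute the commands instead of printing them.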

Access the YARN GUI.

Done

Summary

Problem
VirtualBox networking can be problematic on wireless networks, some Debian setups, Windows and Mac.

Solution
1) Minimal stable CentOS distribution.

2) Set up Host Only Networking.

3) Set up Port Forwarding.
