Set up a Hadoop Stream Processing Stack in less than 10 minutes



Thanks to all the hard work by the Apache Software Foundation, big data streaming tools and development environments are getting very easy to set up and configure. As an example, we demonstrate an install script created by the developers of Apache Samza that downloads, installs, configures and runs Hadoop 2.4, YARN and ZooKeeper in less than 10 minutes.

That's the good news. What is less easy is configuring a development box for the script. VirtualBox is the most popular solution for setting up an alternative development environment. For example, the Hadoop code is stable on Java 7, and it is easier and more convenient to set up a virtual environment with the correct build artifacts, not to mention the issues with Hadoop ecosystem components on the excellent Windows operating system.


VirtualBox networking can run into issues with bridged networking on wireless connections, Windows, Mac and some Debian host machines.

In this article we carefully go through all the steps (explaining as we go) needed to set up a wired-up VirtualBox network for CentOS 6, and even the setting up of Webmin, the open source alternative to the ‘cPanel’ web admin panel. The focus is to coherently illustrate the basic concepts for developers unfamiliar with core Linux and VirtualBox networking.

In the solution outlined below the following definitions are used:

‘host’ refers to the real physical box
‘guest’ refers to the running image (vm) in VirtualBox

For this post a CentOS VirtualBox image is downloaded from virtualboximages. To install the image in VirtualBox, this link has the general overview; follow the steps and use the downloaded image as an existing hard disk image.

CentOS VM image credentials: username root, password adminuser

Once you are up and running, log in to the VM as root and issue the ifconfig command.

ifconfig

From the output we can identify the guest’s ip as

Set up Host Only network
Create a host-only network in VirtualBox (host-only networking).

On the host machine issue the ifconfig command; you should now see the ‘Host only network’ vboxnet0. This is analogous to an Ethernet connection between the guest and the host.
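The same setup can also be done from the host's command line with VBoxManage instead of the GUI. The following is a minimal sketch rather than the exact steps used for this post: the VM name centos6, the adapter slot 2 and the helper name hostonly_commands are assumptions, and the commands are only printed so they can be reviewed first.

```shell
# Hypothetical VM name; replace with the name shown in the VirtualBox manager.
VM_NAME="centos6"
# Name of the host-only interface as seen in the host's ifconfig output.
HOSTONLY_IF="vboxnet0"

# Print the VBoxManage commands for review; remove "echo" to execute them.
hostonly_commands() {
  # Create a host-only interface on the host.
  echo VBoxManage hostonlyif create
  # Attach the VM's second network adapter to that interface.
  echo VBoxManage modifyvm "$1" --nic2 hostonly --hostonlyadapter2 "$2"
}

hostonly_commands "$VM_NAME" "$HOSTONLY_IF"
```

Note that modifyvm only works while the VM is powered off.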


Set up Port Forwarding
Port forwarding is a common technique to ‘glue’ sockets between kernels. On Linux systems it is often needed because ports numbered below 1024 can only be bound by software running as the root user. A common example is forwarding port 80 (HTTP) to the port where an application's TCP socket is listening. Here we use port forwarding to avoid port conflicts and to map ports on the host to the equivalent services on the guest (such as ssh).

For ssh terminal and sftp file browsing
We map port 2222 on the host to port 22 on the guest. In the VM's ‘Settings’ tab select ‘Network’; port forwarding works with network address translation (NAT).
For host on

Now you can ssh into the guest via a terminal
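For readers who prefer the command line, the mapping and the login can be sketched as follows. This is an illustration under assumptions, not the GUI steps used here: the VM name centos6, the rule name guestssh and the helper natpf_rule are all hypothetical, and the commands are only printed.

```shell
# Hypothetical VM name; replace with your own.
VM_NAME="centos6"

# Build a NAT rule in VirtualBox's format:
#   name,protocol,host ip,host port,guest ip,guest port
# The two IP fields are left empty so no specific address is required.
natpf_rule() {
  printf '%s,tcp,,%s,,%s\n' "$1" "$2" "$3"
}

# Print the commands for review; remove "echo" to execute them.
# Forward host port 2222 to the guest's ssh port 22 (VM powered off).
echo VBoxManage modifyvm "$VM_NAME" --natpf1 "$(natpf_rule guestssh 2222 22)"
# Log in to the guest through the forwarded port.
echo ssh -p 2222 root@127.0.0.1
```

The same port also works for sftp file browsers: point them at 127.0.0.1 and port 2222.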

It's highly likely you will get SSH host key warnings. These are easy to resolve: the warning will most probably tell you which command to issue; if not, use ssh-keygen to remove the stale host key.

After resolving this, log in to the guest with ssh.
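A minimal sketch of that ssh-keygen call, assuming the guest is reached through 127.0.0.1 on the forwarded port 2222 (the helper name known_hosts_name is hypothetical). The command is only printed so it can be checked before running.

```shell
# known_hosts stores entries for non-default ports as "[host]:port".
known_hosts_name() {
  printf '[%s]:%s\n' "$1" "$2"
}

# Print the command for review; remove "echo" to execute it.
# -R removes all keys belonging to the given host from the known_hosts file.
echo ssh-keygen -R "$(known_hosts_name 127.0.0.1 2222)"
```

The next ssh login will then ask to accept the guest's current host key.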

Create port access for web

1) In the guest, boot up the Webmin panel.

2) Open TCP port 10000; on this guest the Webmin web panel listens on port 10000.

Here we have mapped port 8086 on the host to port 10000 on the guest. We access the panel on the host's localhost, which is forwarded to the guest's IP.
On this box the login is root / adminuser.

Working through Linux Firewall issues
Obviously for Hadoop development we do not want to rely on the excellent Webmin panel, so let's go through the issues of running our own web service. I have created a simple service with a REST echo endpoint that displays some info about the system. See source code here. It uses an uber jar; see the build file. Build it, use an ssh file browser to load the jar to the guest, and run it with the usual ‘java -jar jserv.jar’.

When we try to access this service on its port 8080 it obviously gets rejected. By issuing the netstat command in the guest we can see why.

Two things to note: the host to bind to for our web service is ‘’, and there is no entry for 8080.
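One way to make this check repeatable is to filter the netstat output for a listening port. The following is a minimal sketch, assuming the usual net-tools layout of `netstat -tln` where the fourth column is the local address and the last column is the state; the helper name is_listening is hypothetical.

```shell
# Succeeds if the netstat output read from stdin contains a TCP socket
# in LISTEN state whose local address ends in ":<port>".
is_listening() {
  awk -v port="$1" '
    $1 ~ /^tcp/ && $NF == "LISTEN" && $4 ~ (":" port "$") { found = 1 }
    END { exit (found ? 0 : 1) }
  '
}

# Usage in the guest:
#   netstat -tln | is_listening 8080 && echo "8080 is listening"
```

The local address column also shows which interface the service is bound to, which matters for the binding step later on.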

Use iptables to open port 8080.
First edit the iptables configuration file and add the entry:
vi /etc/sysconfig/iptables
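On a stock CentOS 6 firewall, an entry of the following shape is commonly used. Treat it as a sketch under that assumption rather than the exact line from this setup; the helper name accept_rule is hypothetical. The line has to sit above the final REJECT rule of the INPUT chain, otherwise it is never reached.

```shell
# Print an iptables-save style rule that accepts new TCP connections on a port.
accept_rule() {
  printf '%s\n' "-A INPUT -m state --state NEW -m tcp -p tcp --dport $1 -j ACCEPT"
}

# Line to paste into /etc/sysconfig/iptables, above the REJECT rule.
accept_rule 8080

# Print the reload command for review; remove "echo" to execute it as root.
echo service iptables restart
```

After the restart, `iptables -L -n` in the guest should list the new port.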

Then, in the source code for the web service, bind to the correct guest host.

Then we map a port on the host (9000) to the web service port on the guest (8080).

Finally we run the service and can access its endpoint through port 9000 on the host.


Set up a Hadoop Stream Processing Stack in less than 10 minutes
Now let's use the skills we just acquired to rapidly set up a development environment for Hadoop streaming applications.

On your marks!
In the running guest:

  1. Purge older Java versions.

  1. Download and install Oracle Java 7.

  1. Set JAVA_HOME.


Check the environment.
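A minimal sketch of setting and checking the variable, assuming Java was installed from the Oracle JDK RPM, which normally provides the /usr/java/latest symlink. Adjust the path if your JDK lives elsewhere.

```shell
# Assumed install location (symlink normally created by the Oracle JDK RPM).
export JAVA_HOME=/usr/java/latest
# Put the JDK's binaries ahead of any other Java on the PATH.
export PATH="$JAVA_HOME/bin:$PATH"

# Check the environment.
echo "JAVA_HOME=$JAVA_HOME"
# Only query the version if a java binary is actually on the PATH.
if command -v java >/dev/null 2>&1; then
  java -version
fi
```

To keep the setting across logins, the two export lines can be added to root's ~/.bashrc in the guest.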

  1. Download the install script and copy it to the guest to install Hadoop, Kafka and ZooKeeper.

Run the script as shown below


We need to shut the stack down and open the ports for our new Hadoop stream processing stack.

sh grid stop all

Add YARN port 8088 to iptables just as before.

Port forwarding for YARN

Reboot the stack.

sh grid start all

Access the YARN GUI.


VirtualBox networking can be problematic on wireless networks, some Debian setups, Windows and Mac.

1) Minimal stable CentOS distribution.

2) Set up Host Only Networking.

3) Set up Port Forwarding.

