Thanks to all the hard work by the Apache Software Foundation, big data streaming tools and development environments are getting very easy to set up and configure. As an example, we demonstrate an install script created by the developers at Apache Samza that downloads, installs, configures and runs Hadoop 2.4, YARN and ZooKeeper in less than 10 minutes.
That's the good news. What is less easy is configuring a development box for the script. VirtualBox is the most popular solution for setting up an alternative development environment. For example, the Hadoop code is stable on Java 7, and it is easier and more convenient to set up a virtual environment with the correct build artifacts, not to mention the issues with Hadoop ecosystem components on the excellent Windows operating system.
VirtualBox networking can run into issues with bridged networking on wireless connections, Windows, Mac and some Debian host machines.
In this article we carefully go through all the steps (explaining as we go) needed to set up a wired-up VirtualBox network for CentOS 6, and even the setup of Webmin, the open source alternative to cPanel on CentOS. The focus is to coherently illustrate the basic concepts for developers unfamiliar with core Linux and VirtualBox networking.
In the solution outlined below the following definitions are used:
‘host’ refers to the real physical box
‘guest’ refers to the running image (VM) in VirtualBox
For this post a CentOS VirtualBox image is downloaded from virtualboximages. To install the image in VirtualBox, this link has the general overview; follow the steps and use the downloaded image as an existing hard disk image.
CentOS VM image login: username root, password adminuser
Once you are up and running, log in to the VM as root and issue the ifconfig command.
ifconfig
From the output we can identify the guest’s IP as 10.0.2.15.
Set up Host Only network
[Screenshots: ‘Create Host only network’, ‘Host-only networking’]
On the host machine issue the ifconfig command; you should now see the ‘Host only network’ vboxnet0. This is analogous to an Ethernet connection between the guest and the host.
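The host-only setup can also be scripted from the host's command line instead of the GUI. The following is a minimal sketch under stated assumptions: VirtualBox's VBoxManage tool is on the PATH, the VM name centos6 is a hypothetical placeholder, and the helper names run and setup_hostonly are made up for illustration. By default it only prints the commands so they can be reviewed first.

```shell
#!/bin/sh
# Hypothetical VM name; replace with the name shown in the VirtualBox manager.
VM_NAME="${VM_NAME:-centos6}"

# Print the command when DRY_RUN=1 (the default), execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

setup_hostonly() {
    # Creates the next free host-only interface; on a fresh install this is vboxnet0.
    run VBoxManage hostonlyif create
    # Attach the interface to the guest as a second adapter (the VM must be powered off).
    run VBoxManage modifyvm "$VM_NAME" --nic2 hostonly --hostonlyadapter2 vboxnet0
}
```

Calling setup_hostonly prints the two commands; calling it with DRY_RUN=0 set runs them. Keeping NAT on the first adapter and adding host-only as the second leaves the guest's internet access untouched.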
Set up Port Forwarding
Port forwarding is a common technique to ‘glue’ sockets between kernels. On Linux systems, ports numbered below 1024 can only be opened by software running as the root user, so port forwarding is often used to expose such a service on a higher port. Another common example is to forward port 80 (HTTP) to the port where an application's TCP socket is listening. Here we use port forwarding to avoid port conflicts and to map services on the host to equivalent services on the guest (ssh).
For ssh terminal and sftp file browsing
We map port 2222 on the host to 22 on the guest. In the ‘settings’ tab select network (port forwarding works with network address translation (NAT)).
For the host IP use 127.0.0.1.
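The same mapping can be expressed on the command line. This is a sketch, assuming VBoxManage is installed on the host; the function names natpf_rule and forward_port, the rule names and the VM name are hypothetical. VBoxManage expects a NAT rule in the form name,protocol,hostip,hostport,guestip,guestport, and the guest IP may be left empty.

```shell
#!/bin/sh
# natpf_rule NAME HOST_PORT GUEST_PORT
# Builds a TCP rule bound to 127.0.0.1 on the host; the guest IP field is left empty.
natpf_rule() {
    printf '%s,tcp,127.0.0.1,%s,,%s\n' "$1" "$2" "$3"
}

# forward_port VM NAME HOST_PORT GUEST_PORT
# Prints the VBoxManage command for review rather than running it.
forward_port() {
    echo "VBoxManage modifyvm \"$1\" --natpf1 \"$(natpf_rule "$2" "$3" "$4")\""
}
```

For example, forward_port centos6 guestssh 2222 22 prints the command for the ssh mapping described above. The modifyvm form applies to a powered-off VM.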
Now you can ssh into the guest via a terminal
ssh root@127.0.0.1 -p 2222
It is highly likely you will get an ssh host key warning, which is easy to resolve. The warning will most probably tell you to issue the following command; if it does not, issue the ssh-keygen command below anyway and then connect again.
ssh-keygen -f "/home/ubu/.ssh/known_hosts" -R [127.0.0.1]:2222
ssh root@127.0.0.1 -p 2222
Create port access for web
1) In the guest, boot up the Webmin panel.
2) Open TCP port 10000; on this guest the Webmin web panel is on port 10000.
Here we have mapped 8086 on the host to 10000 on the guest. We access it on the host's localhost, which is mapped to the guest's IP.
On this box the login is root / adminuser.
Working through Linux Firewall issues
Obviously for Hadoop development we do not want to rely on the excellent Webmin panel, so let’s go through the issues of running our own web service. I have created a simple service with a REST echo endpoint that displays some info about the system. See source code here. It uses an uber jar; see the build file. Build it, use an ssh file browser to load the jar onto the guest, and run it with the usual ‘java -jar jserv.jar’.
When we try to access this service on its port 8080 it gets rejected. By issuing the netstat command in the guest we can see why.
netstat -tulpn | less
Two things to note: the host our web service should bind to is ‘0.0.0.0’, and there is no entry for 8080.
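Scanning the netstat output by eye works, but the check can also be scripted. The sketch below is illustrative only: port_listening is a hypothetical helper that reads netstat output on standard input and succeeds if some TCP socket is in the LISTEN state on the given port. It assumes the usual net-tools column layout, with the local address in column 4 and the state in column 6.

```shell
#!/bin/sh
# port_listening PORT
# Reads `netstat -tln` style output on stdin; exit status 0 if PORT is listening.
port_listening() {
    awk -v port="$1" '
        $6 == "LISTEN" {
            # The port is the text after the last colon (handles 0.0.0.0:22 and :::22).
            n = split($4, parts, ":")
            if (parts[n] == port) found = 1
        }
        END { exit (found ? 0 : 1) }
    '
}
```

In the guest this would be used as: netstat -tln | port_listening 8080 && echo "8080 is listening".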
Use iptables to open port 8080.
First edit the iptables rules (on CentOS 6 the file is /etc/sysconfig/iptables) and add the entry
-A INPUT -m state --state NEW -m tcp -p tcp --dport 8080 -j ACCEPT
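As an alternative to editing the rules by hand, the same rule can be applied from a root shell in the guest. This is a sketch for CentOS 6, not the procedure followed in this post; open_port is a hypothetical helper and it prints the commands unless APPLY=1 is set. It uses -I rather than -A so the rule is inserted at the top of the INPUT chain, ahead of any catch-all REJECT rule.

```shell
#!/bin/sh
# open_port PORT
# Prints (or, with APPLY=1, runs as root) the commands to accept new TCP
# connections on PORT and to persist the running rules on CentOS 6.
open_port() {
    port="$1"
    for cmd in \
        "iptables -I INPUT -m state --state NEW -m tcp -p tcp --dport $port -j ACCEPT" \
        "service iptables save"
    do
        if [ "${APPLY:-0}" = "1" ]; then
            $cmd
        else
            echo "$cmd"
        fi
    done
}
```

Running open_port 8080 shows both commands; APPLY=1 with the same argument executes them. On CentOS 6, service iptables save writes the running rules back to /etc/sysconfig/iptables so they survive a reboot.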
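Then in the source code for the web service bind to the correct host on the guest
String host = "0.0.0.0"; // bind to all interfaces so forwarded traffic reaches the service
Server server = new Server(); // Jetty: org.eclipse.jetty.server.Server
ServerConnector serverConnector = new ServerConnector(server); // org.eclipse.jetty.server.ServerConnector
serverConnector.setHost(host);
serverConnector.setPort(8080); // the port opened in iptables
server.addConnector(serverConnector);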
Then we map a port on the host (9000) to the web service port on the guest (8080).
Finally we run the service and can access its endpoint from the host on port 9000.
Set up a Hadoop Stream Processing Stack in less than 10 minutes
Now let’s use the skills we just acquired to rapidly set up a development environment for Hadoop streaming applications.
On your Marks!
In the running guest
- Purge older Java
yum remove java-1.7.0-openjdk
- Download and install Oracle Java 7
rpm -Uvh jdk-7u79-linux-x64.rpm
- Set JAVA_HOME
- Download the install script and copy it to the guest to install Hadoop, Kafka and ZooKeeper.
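The ‘Set JAVA_HOME’ step above is listed without a command. One way to do it is sketched here, on the assumption that the Oracle RPM installed the JDK under /usr/java/jdk1.7.0_79; verify the path on your guest before relying on it. The function name set_java_home is hypothetical.

```shell
#!/bin/sh
# set_java_home [JDK_DIR]
# Exports JAVA_HOME and puts its bin directory first on the PATH.
# The default directory is an assumption about where the Oracle RPM installs.
set_java_home() {
    JAVA_HOME="${1:-/usr/java/jdk1.7.0_79}"
    PATH="$JAVA_HOME/bin:$PATH"
    export JAVA_HOME PATH
}
```

Call set_java_home in the shell that will run the grid script, or add the equivalent export lines to root's ~/.bash_profile so they persist across logins.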
Run the script as shown below
sh grid bootstrap
sh grid start all