Apache Storm is a top level hadoop project that has been developed to enable processing of very large stream data that arrives very fast in real time. Storm is a suitable platform for real time data processing because: it is very fast, scales well with data growth, tolerates node failure, it can be relied on to process all data and it is easy to use. Due to its flexibility Storm has been applied in detecting cyber security breaches, identifying fraudulent activities in financial sector, personalizing content and optimizing supply chains.
A Storm cluster consists of a master node that runs a Nimbus daemon and worker nodes that run supervisor daemons. The Nimbus daemon distributes code to worker nodes, assigns each worker node the data to process and monitors node failure. Any jvm or non-jvm language can be used to develop code that is submitted to Storm via Nimbus which runs as a thrift service. This reduces the learning curve because you use the language you are comfortable with. Storm relies on Apache Zookeeper to provide cluster management services.
Supervisor daemons perform work assigned by Nimbus by managing each worker process to complete assigned tasks. The logic to be processed by worker processes is specified in a topology. A topology is a visual representation of the logic to be performed. The topology consists of streams, bolts and spouts. In a topology the data sources are spouts. Data can be sourced from databases, distributed file systems and messaging frameworks. Spouts can be categorized as reliable or unreliable. With reliable spouts in the event of a failure the data units can be replayed and processed again. Unreliable spouts are not able to replay data units so if there is a failure there is no way of recovering.
In a topology bolts are used to process logic. Data processing tasks such as joining, sorting and filtering are specified in a bolt. Tuples are lists of elements that make up a stream or simply a unit of data. Streams are sequences of data units that are moved from spouts to bolts or from a bolt to another.
Setting up a storm cluster can be divided into:
- Setting up a Zookeeper cluster
- Installing a Storm client
- Installing Storm cluster
When setting up storm on a single machine the same Storm installation acts as the client and cluster. The recommended way of setting up Zookeeper in production is an ensemble of 3 Zookeeper servers with each server running on its own machine. In this tutorial we will just set up Zookeeper on a single ubuntu machine. Download Zookeeper from here http://zookeeper.apache.org/releases.html. Unpack the download, move it to its installation directory and assign ownership of directory to eduonix user.
sudo tar xzvf zookeeper-3.4.8.tar.gz
sudo mv zookeeper-3.4.8 /usr/local/zookeeper
sudo chown -R eduonix usr/local/zookeeper
Create a directory that will be used to store temporary zookeeper data using this command sudo mkdir /usr/local/zookeeper_data. Rename zoo_sample.cfg file found in /conf directory to zoo.cfg and edit its properties so that they appear as shown below.
cp /usr/local/zookeeper/conf/zoo_sample.cfg /usr/local/zookeeper/conf/zoo.cfg
sudo gedit /usr/local/zookeeper/conf/zoo.cfg
Save zoo.cfg and start zookeeper using zkServer.sh script found in bin directroy
Storm can run in local or remote mode. In local mode topologies are developed and ran in the local machine. In remote mode topologies are are developed then submitted to a cluster where they are ran. In this tutorial we will ran Storm in local mode so get installation package from https://github.com/apache/storm/releases. Unpack, move it to its installation directory and assign ownership of directory to eduonix user.
sudo tar xzvf apache-storm-1.0.1.tar.gz
sudo mv storm-1.0.2 /usr/local/storm
sudo chown -R eduonix /usr/local/storm
We need to add the bin directory to our path so open .bashrc in a text editor and include it by adding export PATH=/usr/local/storm/bin.
Save the file and reload it by running source ~/.bashrc
Configuration settings for Storm are stored in storm.yaml file that resides under conf directory. A template is provided so rename it and add the lines below to set correct options. These settings indicate zookeeper and nimbus are running on localhost. It also specifies the directory where storm data will be stored. Create this directory by running sudo mkdir /usr/local/storm_data
sudo cp /usr/local/storm/conf/storm.yaml.template /usr/local/storm/conf/storm.yaml
sudo gedit /usr/local/storm/conf/storm.yaml
The storm cluster is started by starting the nimbus and supervisor using their respective commands as shown below.
Topologies that have been packaged as jar files are uploaded to storm by specifying the path to jar file and arguments as shown below.
storm jar [path to jar file] [args]
When you have implemented your topology in a non-jvm language the storm shell command is available that lets you package code in a jar file and upload to nimbus.
This tutorial introduced to Storm which enables processing of vast amounts of streaming data in real time. The key strengths of Storm in data processing were highlighted. The core concepts of topology, streams, spouts and bolts were discussed. Installing and configuring zookeeper to provide cluster management services was demonstrated. Installing storm on a single machine that is adequate for developing and testing topologies was discussed. Finally submitting jvm and non-jvm topologies was discussed. These concepts make up part 1 of this tutorial. In part 2 we will discuss development of topologies using Python to demonstrate how actual data processing happens.