Learn How To Use Apache Oozie To Schedule Hadoop Jobs


Within the Hadoop ecosystem, Oozie provides services for scheduling jobs. With job scheduling you can organize multiple jobs into a single unit that is run sequentially. The supported job types are MapReduce, Pig, Hive, Sqoop, Java programs and shell scripts. To support scheduling, Oozie provides two types of jobs: workflow jobs and coordinator jobs. Workflow jobs are specified as directed acyclic graphs (DAGs) of actions that are executed in sequence. Coordinator jobs are started by time and data availability triggers, after which they run in a recurring manner. Packaging multiple coordinator and workflow jobs and managing their life cycle is done by Oozie Bundle.

In a workflow's directed acyclic graph, the rules for starting and ending the workflow are defined in control nodes. This sequential flow is referred to as job chronology. For flexibility in deciding which path to follow during execution, decision, fork and join nodes are available. Execution of tasks is triggered by action nodes. Oozie workflows are flexible enough to model various types of processing. Workflows that must wait for conditions such as a point in time or the availability of data can be specified, and triggers start them whenever the required conditions are met. Workflows that rely on the status output of other workflows can also be specified. In this way the output of one workflow becomes the input of the next. Such chained data processing is referred to as a data application pipeline.
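To make the node types above concrete, here is a minimal sketch of a workflow definition, written out by a small shell helper. The helper name write_demo_workflow, the workflow name demo-wf, the action name make-dir and the output path are all hypothetical, and the schema version 0.4 is an assumption about what your Oozie release accepts.

```shell
#!/bin/sh
# Sketch only: writes a small workflow.xml to the given directory.
# All names inside the XML are illustrative assumptions.
write_demo_workflow() {
    out_dir=$1
    mkdir -p "$out_dir" || return 1
    # The quoted 'EOF' keeps ${...} literal so Oozie, not the shell, expands it.
    cat > "$out_dir/workflow.xml" <<'EOF'
<workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.4">
    <!-- Control node: where execution begins -->
    <start to="make-dir"/>
    <!-- Action node: triggers a task, here creating an HDFS directory -->
    <action name="make-dir">
        <fs>
            <mkdir path="${nameNode}/user/${wf:user()}/demo-output"/>
        </fs>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <!-- Control node: stops the workflow with an error -->
    <kill name="fail">
        <message>Action failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>
    <!-- Control node: successful end -->
    <end name="end"/>
</workflow-app>
EOF
}

# Example call (hypothetical directory):
#   write_demo_workflow ~/oozie-demo
```

The start, kill and end elements are control nodes, while the single action element is what actually triggers work. A decision, fork or join node would sit between them in the same way.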

Oozie has a client and a server package, and each is installed separately. To install Oozie we first need to build it from source, which requires Maven. To install Maven, run the command below at an Ubuntu terminal.

sudo apt-get install maven

Once Maven is installed, download the source code from http://www.apache.org/dyn/closer.lua/oozie/ . Navigate to the directory where the download was saved and unpack it.

cd ~/Downloads
tar xzvf oozie-4.2.0.tar.gz

Enter the directory where the files were extracted and open pom.xml. We need to edit this file so that Oozie is compiled against the correct version of Hadoop. Change the hadoop.version property to the version you are running; this tutorial uses 2.7.1. Change the hadoop.majorversion property to 2. The version properties for HBase, Hive, Pig and Spark are also set in this file and should match the versions installed on your computer.

sudo gedit ~/Downloads/oozie-4.2.0/pom.xml
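If you prefer to script the edit instead of using an editor, a sketch along these lines may help. The helper name set_pom_property is hypothetical, and it assumes each property appears in pom.xml as a simple element such as <hadoop.version>...</hadoop.version>. It rewrites every occurrence of that element, including any inside build profiles, so review the result before building.

```shell
#!/bin/sh
# Sketch only: replace the value of a Maven property element in a pom file.
# Assumes the property is written as <name>value</name> on a single line.
set_pom_property() {
    pom_file=$1
    prop=$2
    value=$3
    # Fail early if the property element is not present at all.
    if ! grep -q "<${prop}>" "$pom_file"; then
        echo "property ${prop} not found in ${pom_file}" >&2
        return 1
    fi
    # Write to a temporary file first, then move it over the original.
    sed "s|<${prop}>[^<]*</${prop}>|<${prop}>${value}</${prop}>|" "$pom_file" \
        > "${pom_file}.tmp" && mv "${pom_file}.tmp" "$pom_file"
}

# Example calls (run with sufficient permissions for the file):
#   set_pom_property ~/Downloads/oozie-4.2.0/pom.xml hadoop.version 2.7.1
#   set_pom_property ~/Downloads/oozie-4.2.0/pom.xml hadoop.majorversion 2
```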

Enter the main directory where Oozie files were extracted and invoke Maven to package the source code.

cd ~/Downloads/oozie-4.2.0
mvn clean package assembly:single -P hadoop-2 -DskipTests

Alternatively, you can use the mkdistro.sh script in Oozie's bin directory to build. The command is shown below.

sudo ~/Downloads/oozie-4.2.0/bin/mkdistro.sh -DskipTests

After Oozie has been built successfully, the distribution is saved under the distro/target/ directory beneath the main Oozie directory. Create a directory where it will be installed and copy the binaries there.

sudo mkdir /usr/local/oozie
sudo cp -r ~/Downloads/oozie-4.2.0/distro/target/oozie-4.2.0-distro/oozie-4.2.0 /usr/local/oozie/

After the binaries have been placed in the installation directory we need to add the bin directory to the path. Open .bashrc and add the export lines shown below.

gedit ~/.bashrc
export OOZIE_VERSION=4.2.0
export OOZIE_HOME="/usr/local/oozie/oozie-4.2.0"
export PATH="$OOZIE_HOME/bin:$PATH"

In the Oozie installation directory create a libext directory that will hold the Hadoop jars. The command used is sudo mkdir /usr/local/oozie/oozie-4.2.0/libext. To enable the web-based console for Oozie we need the ExtJS library version 2.2, which is not bundled with Oozie and has to be downloaded separately. Download the ExtJS library and copy it to the libext directory. This library is not mandatory for Oozie to work; it is only required if you want the web interface. Then copy the Hadoop jar files from the hadooplibs/hadoop-distcp-2/target directory beneath the main Oozie source directory to libext. Once the Hadoop jars and the ExtJS library are in libext, prepare the war file for Oozie using the command below.

oozie-setup.sh prepare-war [-jars <PATHS>] [-extjs <PATH>] [-secure]
oozie-setup.sh sharelib create -fs <FS_URI> [-locallib <PATH>]
oozie-setup.sh sharelib upgrade -fs <FS_URI> [-locallib <PATH>]

In the construct above you specify the paths to the jars and to ExtJS, i.e. the libext directory. The sharelib subcommand is used to upload or upgrade a sharelib. The secure option sets Oozie up with HTTPS (SSL). Oozie requires a database, which is set up using the command below.

ooziedb.sh create -sqlfile oozie.sql -run

The default behavior is to use the embedded Derby database, which comes bundled with Oozie. Other databases that can be used are HSQL, MySQL, Oracle and PostgreSQL.
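The libext, war, sharelib and database steps described above can be strung together roughly as follows. This is a sketch under assumptions: the function names are hypothetical, LIBEXT_DIR stands for whichever libext directory you created, ext-2.2.zip is assumed to be the name of the ExtJS archive you downloaded, the HDFS URI is a placeholder, and a plain prepare-war call is assumed to pick up the jars already placed in libext.

```shell
#!/bin/sh
# Sketch only. All paths and the HDFS URI passed in are assumptions.

# Copy Hadoop jars, plus the optional ExtJS archive, into the libext directory.
populate_libext() {
    jars_dir=$1
    libext_dir=$2
    extjs_zip=${3:-}
    mkdir -p "$libext_dir" || return 1
    cp "$jars_dir"/*.jar "$libext_dir"/ || return 1
    if [ -n "$extjs_zip" ]; then
        cp "$extjs_zip" "$libext_dir"/ || return 1
    fi
}

# Build the war file, upload the sharelib to HDFS, then create the database.
# Each step runs only if the previous one succeeded.
configure_oozie() {
    fs_uri=$1
    oozie-setup.sh prepare-war &&
        oozie-setup.sh sharelib create -fs "$fs_uri" &&
        ooziedb.sh create -sqlfile oozie.sql -run
}

# Example calls (hypothetical locations):
#   populate_libext ~/hadoop-jars "$LIBEXT_DIR" ~/Downloads/ext-2.2.zip
#   configure_oozie hdfs://localhost:9000
```

Chaining the three commands with && means a failed war build or sharelib upload stops the sequence before the database is touched, which makes it easier to see which step needs attention.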

Under the conf directory there are three files: oozie-site.xml, oozie-log4j.properties and admin-users.txt. The oozie-site.xml file contains settings to configure the Oozie server. The admin-users.txt file contains the list of Oozie administrators. The oozie-log4j.properties file is used to specify where logging is done.

We need to specify a user for the Oozie service in the Hadoop core-site.xml file. The file is located under etc/hadoop/ beneath the Hadoop home directory. Open it and add the proxy user properties for Oozie beneath the comment shown below, changing the user name to the user you would like. After making the changes, restart Hadoop.

sudo gedit /usr/local/hadoop/hadoop-2.7.1/etc/hadoop/core-site.xml
<!-- OOZIE -->
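As a hedged illustration of what typically goes under that comment, the helper below prints the two Hadoop proxy user properties for a given user. The function name proxyuser_snippet and the user name hduser are hypothetical, and the wildcard values are an assumption chosen for a single-machine setup; a shared cluster would normally list specific hosts and groups instead.

```shell
#!/bin/sh
# Sketch only: print proxy user properties for the user that runs Oozie.
# The "*" values allow any host and any group, which suits a test machine.
proxyuser_snippet() {
    oozie_user=$1
    cat <<EOF
<!-- OOZIE -->
<property>
  <name>hadoop.proxyuser.${oozie_user}.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.${oozie_user}.groups</name>
  <value>*</value>
</property>
EOF
}

# Example call (hypothetical user name); paste the output inside the
# <configuration> element of core-site.xml:
#   proxyuser_snippet hduser
```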

To start Oozie as a background service, use the command below.

oozied.sh start

To start Oozie as a foreground process, use the command shown below.

oozied.sh run
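After starting the server it can be useful to confirm that it is accepting requests. The sketch below polls the server with the oozie admin command until it reports normal mode. The function name wait_for_oozie, the retry count and the delay are hypothetical choices, and the URL in the example assumes the default port 11000 on the local machine.

```shell
#!/bin/sh
# Sketch only: poll the Oozie server until it reports NORMAL system mode.
# Arguments: server URL, number of attempts (default 10), delay in seconds (default 3).
wait_for_oozie() {
    url=$1
    tries=${2:-10}
    delay=${3:-3}
    while [ "$tries" -gt 0 ]; do
        # A healthy server is expected to answer with a line containing NORMAL.
        if oozie admin -oozie "$url" -status 2>/dev/null | grep -q "NORMAL"; then
            return 0
        fi
        tries=$((tries - 1))
        sleep "$delay"
    done
    echo "Oozie at ${url} did not report NORMAL mode" >&2
    return 1
}

# Example call (assumed default URL):
#   wait_for_oozie http://localhost:11000/oozie
```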

The Oozie client comes bundled with the server, so a separate installation is only necessary on remote machines. Installation is straightforward: extract oozie-client.tar.gz and add its bin directory to the path.

This tutorial introduced Oozie, which is used to schedule Hadoop jobs, and explained its benefits. It covered installing Maven and using it to build Oozie, followed by installation and configuration of the server and client. In the second part of this tutorial we will demonstrate how to develop workflows and submit them to the server using the client tool.

