Learn how to manage data in a NoSQL database using HBase


NoSQL databases are designed for scalability: unstructured data is spread across multiple nodes, and when data volumes increase you add another node to accommodate the growth. The lack of rigid structure lets NoSQL databases relax the stringent consistency requirements enforced in relational databases in exchange for speed and agility. HBase, MongoDB and Cassandra are the three major options that provide NoSQL capabilities. They differ in the features they provide, so the decision on which to use should be informed by the workload that will be handled.

The main difference between HBase and Cassandra is the consistency model they implement. Cassandra implements eventual consistency, which keeps the database available for writes. This provides excellent write scaling but carries a penalty on reads, because a consistent read has to consult several copies of the data. HBase, on the other hand, provides a strong consistency model that excels at scaling reads but does not scale writes as well as Cassandra does.

HBase is natively supported on Hadoop and is the subject of this tutorial. The main characteristics that make HBase an excellent data management platform are fault tolerance, speed and usability. Fault tolerance is provided by automatic failover, automatically sharded and load-balanced tables, strong consistency in row-level operations, and replication. Speed is provided by near real-time lookups, in-memory caching and server-side processing. Usability is provided by a flexible data model that supports many uses, a simple Java API and the ability to export metrics.

HBase can run standalone on the local file system, but this setup does not guarantee durability: edits will be lost when the daemons are not cleanly started and stopped. Such a setup is not suitable for a production environment, but it provides a way of exploring how the database functions. Alternatively, HBase can be installed on a single-node or multi-node cluster and use HDFS. This requires a working Hadoop installation. If you have not yet installed Hadoop, please refer to the setting up Hadoop tutorial. Type hadoop version at the terminal to check that Hadoop is correctly installed; the command should print the installed version.
To begin installation, head over to the closest Apache mirror and download an HBase version that is compatible with your Hadoop version. By default the archive will be saved in the Downloads directory. Navigate to the directory where it was saved, extract it, move it to its installation directory and set the correct ownership on the directory.

cd ~/Downloads
sudo tar xzvf hbase-1.1.5-bin.tar.gz
sudo mkdir /usr/local/hbase
sudo mv hbase-1.1.5 /usr/local/hbase
sudo chown -R eduonix /usr/local/hbase

We need to add the HBase path to .bashrc, so run gedit ~/.bashrc at the terminal and add the lines below.

export HBASE_HOME=/usr/local/hbase/hbase-1.1.5
export PATH=$PATH:$HBASE_HOME/bin

Save the file and reload .bashrc by running source ~/.bashrc

Type hbase at the terminal. If you get a list of hbase commands, the HBase path has been correctly set.
We need to point HBase to the correct Java installation, so we add the Java path to hbase-env.sh. Navigate to the conf directory under the HBase installation directory, open the file and add the lines that follow the two commands below.

cd /usr/local/hbase/hbase-1.1.5/conf
sudo gedit hbase-env.sh

# Point HBase to the correct Java installation
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
# File that lists the hosts running region servers
export HBASE_REGIONSERVERS=/usr/local/hbase/hbase-1.1.5/conf/regionservers
# Let HBase start and stop its own ZooKeeper instance
export HBASE_MANAGES_ZK=true

We need to edit hbase-site.xml to add settings that are specific to HBase. Navigate to the conf directory under the HBase installation directory and open the file. When installing in distributed mode, set hbase.cluster.distributed to true; when running in standalone mode on a single node, set this value to false.

cd /usr/local/hbase/hbase-1.1.5/conf
sudo gedit hbase-site.xml
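
As a rough illustration of what the file might contain for a distributed setup on top of HDFS, the sketch below writes a sample hbase-site.xml to a scratch directory so it can be reviewed before anything is copied into the conf directory. The property names are standard HBase settings, but the values are assumptions and not part of this tutorial: the NameNode address hdfs://localhost:9000 must match your own Hadoop configuration, and the ZooKeeper data directory is a hypothetical path.

```shell
# Sketch only: writes a sample hbase-site.xml to a scratch directory.
# Assumed values (adjust to your cluster): hdfs://localhost:9000 as the
# NameNode address and /usr/local/hbase/zookeeper as the ZooKeeper data dir.
out_dir="${TMPDIR:-/tmp}/hbase-conf-sketch"
mkdir -p "$out_dir"

cat > "$out_dir/hbase-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <!-- Where HBase stores its data; assumed NameNode address -->
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:9000/hbase</value>
  </property>
  <!-- true for distributed mode, false for standalone mode -->
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <!-- Hypothetical directory for the ZooKeeper instance managed by HBase -->
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/usr/local/hbase/zookeeper</value>
  </property>
</configuration>
EOF

echo "wrote $out_dir/hbase-site.xml"
```

After checking the values, the property elements can be pasted into the hbase-site.xml that gedit opened in the conf directory.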

Start HBase by running start-hbase.sh from the bin directory under the HBase installation directory, then access it through a shell by typing the command hbase shell.
HBase shell commands can be broadly divided into five categories, and they operate in a similar way to commands in relational databases. Security commands are used to GRANT, REVOKE and show USER_PERMISSION. Cluster replication commands are used to manage replication between clusters; some of them are ADD_PEER, REMOVE_PEER, DISABLE_PEER and STOP_REPLICATION. Data manipulation commands include PUT, GET, COUNT, DELETE, DELETEALL and SCAN. Some table management commands are ALTER, CREATE, DESCRIBE, DROP and DROPALL. Some general commands are VERSION and STATUS. The command names are written in uppercase here for emphasis; the shell itself expects them in lowercase, for example create and scan. This is a very small list of the commands available for managing data in HBase. Complete documentation is available online for reference.
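
To give a feel for a few of these commands, here is a minimal sketch of a shell session saved as a script file. The table name employee and the column families personal and professional are hypothetical names chosen for illustration, and the sketch assumes HBase is already running. The script is only executed when the hbase launcher is found on the PATH; otherwise it is just written to disk for inspection.

```shell
# Sketch only: 'employee', 'personal' and 'professional' are hypothetical names.
script="${TMPDIR:-/tmp}/hbase-shell-tour.txt"

cat > "$script" <<'EOF'
# General commands
status
version

# Table management: create a table with two column families and inspect it
create 'employee', 'personal', 'professional'
list
describe 'employee'

# Data manipulation: write two cells, read them back, then remove them
put 'employee', 'row1', 'personal:name', 'Jane'
put 'employee', 'row1', 'professional:role', 'Engineer'
get 'employee', 'row1'
scan 'employee'
count 'employee'
delete 'employee', 'row1', 'professional:role'
deleteall 'employee', 'row1'

# A table has to be disabled before it can be dropped
disable 'employee'
drop 'employee'
exit
EOF

# Passing a file to "hbase shell" runs its commands in order; the final
# "exit" in the script closes the shell when the commands have finished.
if command -v hbase >/dev/null 2>&1; then
  hbase shell "$script"
else
  echo "hbase not found on PATH; commands saved to $script"
fi
```

The same commands can also be typed one at a time at the interactive prompt, which is usually the easier way to explore.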

The HBase data model is different from the model provided by relational databases. HBase is described in many ways, such as a key-value store, a column-oriented database and a versioned map of maps, all of which are correct. The easiest way of visualizing the HBase data model is as a table that has rows and columns. This is the only similarity shared by the HBase model and the relational model.

Data in HBase is organized into tables. Table names may use any characters that are legal in file paths. Tables are further organized into rows that store data. Each row is identified by a unique row key, which has no data type and is stored as a byte array. Column families are used to group the data in a row. Column families define the physical structure of the data, so they are defined up front and are difficult to modify. Every row in a table has the same column families. Data in a column family is addressed using a column qualifier. Column qualifiers do not need to be specified in advance and do not have to be consistent between rows. Like row keys, column qualifiers have no data type and are stored as byte arrays. A unique combination of row key, column family and column qualifier forms a cell, and the data contained in a cell is referred to as the cell value. Cell values also have no data type and are stored as byte arrays. Cell values are versioned using the timestamp of when the cell was written.
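
The pieces of the data model can be seen in a short shell script like the sketch below. The table users, the column family info, the row keys and the values are all hypothetical. The column family is created with VERSIONS set to 3 so that more than one version of a cell is kept, and explicit timestamps (1000 and 2000, arbitrary numbers) are passed to put only to make the two versions easy to tell apart; when the timestamp is left out, HBase uses the time of the write. As before, the script is run only if the hbase launcher is on the PATH and it assumes HBase is running.

```shell
# Sketch only: table, family, row keys and values are hypothetical.
script="${TMPDIR:-/tmp}/hbase-data-model.txt"

cat > "$script" <<'EOF'
# Column families are declared up front; keep up to 3 versions per cell
create 'users', {NAME => 'info', VERSIONS => 3}

# Same row key, family and qualifier written twice: one cell, two versions
put 'users', 'u001', 'info:email', 'old@example.com', 1000
put 'users', 'u001', 'info:email', 'new@example.com', 2000

# Qualifiers are not declared in advance and may differ from row to row
put 'users', 'u002', 'info:phone', '555-0100'

# Read back both versions of the cell, newest first
get 'users', 'u001', {COLUMN => 'info:email', VERSIONS => 3}

# Scan returns rows ordered by row key
scan 'users'

disable 'users'
drop 'users'
exit
EOF

if command -v hbase >/dev/null 2>&1; then
  hbase shell "$script"
else
  echo "hbase not found on PATH; commands saved to $script"
fi
```

Note that nothing in the script declares a type for the row keys, qualifiers or values: the shell simply stores the bytes of the strings it is given.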

Tables in HBase have several properties that need to be understood in order to come up with an effective data model. Indexing and sorting happen only on the row key. The concept of data types is absent and everything is stored as a byte array. Only row-level atomicity is enforced, so multi-row transactions are not supported.
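
Because row keys are plain byte arrays, they are ordered byte by byte rather than numerically, which matters when designing keys. The sketch below does not touch HBase at all: it uses the standard sort utility with LC_ALL=C, which compares raw bytes, to imitate that ordering on a few hypothetical row keys.

```shell
# Sketch only: the row keys are hypothetical. LC_ALL=C makes sort compare
# raw bytes, which imitates the byte-wise ordering applied to row keys.
unpadded=$(printf '%s\n' row1 row2 row9 row10 | LC_ALL=C sort)
padded=$(printf '%s\n' row01 row02 row09 row10 | LC_ALL=C sort)

echo "unpadded keys:"
printf '%s\n' "$unpadded"
echo "zero-padded keys:"
printf '%s\n' "$padded"
```

With the unpadded keys, row10 sorts directly after row1 and before row2. Zero-padding the numeric part keeps the keys in numeric order, which is one reason fixed-width row keys are a common choice.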

This tutorial has introduced NoSQL databases and identified the three most popular platforms. The difference between relational and non-relational databases was briefly explained. The use cases for NoSQL databases were briefly identified. Installation of HBase to run with Hadoop was demonstrated. Finally, the data model used in HBase was explained. These concepts are aimed at introducing the learner to the world of non-relational databases and should only be considered a starting point.

