In previous HBase tutorials we looked at how to install HBase and develop suitable data models. In this tutorial we build on those concepts to demonstrate how to perform create, read, update, and delete (CRUD) operations using the HBase shell. If you have not installed HBase, please refer to the HBase installation tutorial to learn how to install and configure it. For a review of data models, please refer to the tutorial on creating effective data models in HBase.
The shell can be used in interactive or non-interactive mode. To pass commands to HBase in non-interactive mode from an operating system shell, you echo the commands, pipe them to hbase shell using the | operator, and pass the non-interactive option -n. It is important to note that this way of running commands can be slow. For example, to create a table you use the command below.
echo "create 'courses', 'id'" | hbase shell -n
Another way of running the HBase shell in non-interactive mode is to create a text file with each command on its own line and then pass the path to that file to the shell. For example, if we create a text file and save it as /usr/local/createtables.txt, we run the commands in it as shown below.
hbase shell /usr/local/createtables.txt
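As a minimal sketch, such a file simply lists one shell command per line; the table and column family names here are only illustrative:

create 'courses', 'hadoop'
create 'students', 'info'
exit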
To access the HBase shell in interactive mode, start HBase with the start-hbase.sh script and then invoke the shell by typing hbase shell at a terminal.
Before we demonstrate the use of the HBase shell, there are several important points to be aware of when typing commands in the shell. These are highlighted below.
- Names identifying tables and columns need to be quoted
- There is no need to quote constants
- Command parameters are separated using commas
- To run a command after typing it in the shell, hit the enter key
- Double quoting is required when you need to use binary keys or values in the shell
- To separate keys and values you use the => characters
- To specify a key you use predefined constants like NAME, VERSIONS and COMPRESSION
We will begin with general commands. To check the version you are running, use the version command. To identify the user who is running HBase, use the whoami command. To check the status of the cluster, use the status command, which accepts the options summary, simple, or detailed; the default is summary. To request detailed cluster status you would use the command shown below. The status command is mainly useful when you are running an HBase cluster; on a single standalone instance it reveals little.
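The general commands above can be tried directly at the shell prompt; for example:

version
whoami
status 'detailed'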
In its simplest form the create command creates a table given a table name and one or more column families. To reduce the disk space used for storing data, it is advisable to use short column family names, because every value is stored in a fully qualified form that repeats the row key, column family, and column qualifier. Frequent renaming of columns and the use of many column families are both poor practice, so this is a design area that needs careful thought. A good compromise is to have a few column families, each of which can hold many columns. A column is named by specifying the column family followed by the column qualifier (family:qualifier).
A basic command that creates a table with two column families is shown below.
create 'courses', 'hadoop', 'programming'
Note that only column families are declared when a table is created; individual columns (qualifiers) are not declared up front. A column comes into existence the first time a value is written to it with the put command, as shown below.

put 'courses', 'row1', 'hadoop:spark', 'Spark basics'
To optimize data storage, HBase offers several options. Compression reduces the amount of data stored on disk and sent over the network, at the cost of extra CPU work. Three algorithms available for compression are Snappy, LZO and GZIP. GZIP compresses data better than the other two but its CPU requirements are higher. GZIP is therefore a better choice for infrequently queried data, while Snappy and LZO are more appropriate for data that is queried frequently. The previous query is enhanced as shown below to enable data compression.
create 'courses', {NAME => 'hadoop', COMPRESSION => 'SNAPPY'}, {NAME => 'programming'}
HBase allows you to keep multiple versions of a cell. This arises because data changes are not applied in place; instead, each change creates a new version. To control how this happens you specify a maximum number of versions or a time to live (TTL), in seconds. When either of these limits is exceeded, old versions are removed when data compaction is done. Examples are shown below.
create 'courses', {NAME => 'hadoop', VERSIONS => 4}
create 'courses', {NAME => 'hadoop', TTL => 604800}
To improve scalability, HBase splits tables into smaller units referred to as regions. A table starts as a single region, and as more rows are added the need to split it arises. When creating a table you can supply the points that will be used to decide how the data is split; this process is referred to as pre-splitting. Pre-splitting enables even load distribution across a cluster and is an excellent choice when you have prior knowledge of the key distribution. A bad choice of split points, however, will distribute load unevenly, resulting in poor cluster performance. There are no hard rules for choosing the number of regions. Best practice is to begin with a low multiple of the number of region servers and then leave HBase to do automated splitting. A RegionSplitter utility is available in HBase to assist in deciding split points. HexStringSplit and UniformSplit are two built-in splitting algorithms, but you can also use custom algorithms.
The utility can be used to pre-split a table by specifying the number of regions and the column families to be used, as shown below. The command below creates a table named split_table with 10 regions on the hadoop column family.
hbase org.apache.hadoop.hbase.util.RegionSplitter split_table HexStringSplit -c 10 -f hadoop
Regions can also be created at table creation time if you already know your split points, as shown below.
create 'split_table', 'hadoop', SPLITS => ['A', 'B', 'C']
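To confirm that a table was created with the intended settings, the standard describe command prints the table's column families and their attributes:

describe 'split_table'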
In this first part on using the HBase shell we confined ourselves to the create command. Options for creating tables that perform well were discussed, including data compression, table splitting and row versioning. In subsequent tutorials we will look at actual use of the shell, loading and manipulating data.