Learn How to Query, Summarize and Analyze Data using Apache Hive

Apache Hive is a project within the Hadoop ecosystem that provides data warehouse capabilities; it was not designed for processing OLTP workloads. It offers features for manipulating large distributed data sets using a SQL-like language called HiveQL, which makes it suitable for extract/transform/load (ETL), reporting and data analysis problems. HiveQL queries are translated into Java MapReduce code that runs on Hadoop, and they can be executed by MapReduce, Apache Tez or Apache Spark. Queries run on MapReduce take long because of its batch-oriented processing; Spark provides a way to run low-latency queries and gives better performance than MapReduce without requiring any changes to the queries. Hive is able to access data stored in HDFS, HBase and Amazon S3.

This tutorial requires a fair understanding of relational databases.
Hive requires a working installation of Hadoop, so verify Hadoop is installed before beginning the Hive installation. Move to the downloads folder, extract the Hive tar, move it to its installation folder and assign its ownership to user eduonix.

Image1
Image2
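The exact commands are in the screenshots above; a minimal sketch of this step, assuming the Apache Hive 2.x binary tarball was downloaded to ~/Downloads and /usr/local/hive is used as the installation folder, could look like this:

cd ~/Downloads
# extract the Hive tarball (the version shown is an assumption)
tar -xzvf apache-hive-2.0.0-bin.tar.gz
# move it to the installation folder and hand ownership to user eduonix
sudo mv apache-hive-2.0.0-bin /usr/local/hive
sudo chown -R eduonix:eduonix /usr/local/hive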
Open .bashrc in a text editor by running gedit ~/.bashrc and add the lines below.

Image3
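The exact lines are shown in the screenshot; they typically export HIVE_HOME and put the Hive binaries on the PATH, roughly as follows (the path assumes the installation folder used above):

# Hive environment variables
export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin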

Save the file and reload it by running source ~/.bashrc
Rename the hive-env.sh.template provided to hive-env.sh and open it in a text editor.

Add the line below to point Hive to the Hadoop installation

Image4
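Assuming Hadoop lives in /usr/local/hadoop, the line added to hive-env.sh would be along these lines:

# point Hive to the Hadoop installation (path is an assumption)
export HADOOP_HOME=/usr/local/hadoop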

Hive requires a database for its metastore. Derby comes preinstalled, but it is only useful in a test environment. MySQL is a better option and its use will be demonstrated in this tutorial. Installation of MySQL was covered in the Sqoop tutorial, so it will not be repeated here.

Copy the MySQL connector jar into $HIVE_HOME/lib/. Create the metastore database and load its schema using hive-schema-2.0.0.mysql.sql.

Create a MySQL account that Hive will use to access the metastore.

Image5
Image6
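A sketch of these MySQL steps, assuming a metastore database named metastore and a Hive account named hive (both names, the password and the schema script path are assumptions):

-- run inside the MySQL console (mysql -u root -p)
CREATE DATABASE metastore;
USE metastore;
-- load the Hive metastore schema shipped with the installation
SOURCE /usr/local/hive/scripts/metastore/upgrade/mysql/hive-schema-2.0.0.mysql.sql;
-- account Hive will use to reach the metastore
CREATE USER 'hive'@'localhost' IDENTIFIED BY 'hivepassword';
GRANT ALL PRIVILEGES ON metastore.* TO 'hive'@'localhost';
FLUSH PRIVILEGES;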
In /usr/local/hive/conf create a hive-site.xml and add the configuration settings below.
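A minimal set of settings, pointing the metastore at the MySQL database created above, would be along these lines (the connection URL, user name and password are assumptions matching the account above):

<configuration>
  <!-- JDBC connection to the MySQL metastore database -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost/metastore?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <!-- credentials of the MySQL account created for Hive -->
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hivepassword</value>
  </property>
</configuration>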

Invoke the hive console and create a table to test the metastore.

hive
create table eduonixhive(id int, name string);

In the Hadoop file system create a temporary directory /usr/local/hive/tmp that will be used to store the results of intermediate data processing, and a directory /usr/local/hive/warehouse that will be used to store Hive data. The permissions on these directories are then changed to read/write. Make sure Hadoop is running before executing these commands.

Image7
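One way of doing this, assuming the directory names given above:

# create the temporary and warehouse directories in HDFS
hdfs dfs -mkdir -p /usr/local/hive/tmp
hdfs dfs -mkdir -p /usr/local/hive/warehouse
# grant group read/write access on both directories
hdfs dfs -chmod g+w /usr/local/hive/tmp
hdfs dfs -chmod g+w /usr/local/hive/warehouse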

Run hive at the terminal to start Hive and open its shell.
Image8

We create a table in Hive to test that it is working properly.

create table eduonixcourses (id int, name string, instructor string);
Image9

We need to confirm that the metadata of the table we created above is in the metastore. Log in to the MySQL console, select the metastore database and show its tables.

Image10
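Roughly, the check looks like this (the account name is the one created earlier):

-- log in with: mysql -u hive -p, then run:
USE metastore;
SHOW TABLES;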

Query the metastore to check that the eduonixcourses table exists.
Image11
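The Hive metastore records tables in its TBLS table, so a query along these lines confirms the table was registered:

-- list the Hive tables registered in the metastore
SELECT TBL_NAME FROM TBLS;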

The eduonixcourses table has been registered in the metastore repository, so our Hive installation is working properly.

Hive organizes data into databases, tables, partitions and buckets (or clusters). The first three concepts are similar to those used in relational databases. Partitioning and bucketing of tables is not mandatory, but it provides a way of pruning data and thereby speeding up query processing. Both primitive and complex data types are available for specifying columns in Hive. The primitive types include tinyint, smallint, int, bigint, float, double, string and boolean; the complex types are structs, maps and arrays. Arithmetic and logical operators similar to those in relational databases are also available in Hive.
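As a sketch of how these concepts fit together, the hypothetical table below is partitioned by year, bucketed on its id column, and mixes primitive and complex column types:

CREATE TABLE course_enrolments (
  id INT,
  student STRING,
  scores ARRAY<DOUBLE>,                           -- complex type: array
  details STRUCT<email:STRING, country:STRING>    -- complex type: struct
)
PARTITIONED BY (enrolment_year INT)               -- partition column used to prune data at query time
CLUSTERED BY (id) INTO 4 BUCKETS;                 -- bucketing on id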

Some data operations provided by HiveQL are listed below

  • Creating, dropping and altering tables and partitions
  • Selecting specific columns with the SELECT clause
  • Joining two tables
  • Filtering data using the WHERE clause
  • Storing query results in the Hadoop file system

These are some of the commonly used features of HiveQL; a few of them are sketched below. For a complete picture of what the language offers, refer to the Hive data definition language manual.
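A sketch of a few of these operations, using the eduonixcourses table created earlier plus a hypothetical enrolments table:

-- filter rows and select specific columns
SELECT id, name FROM eduonixcourses WHERE instructor = 'Jane Doe';

-- join two tables (enrolments is a hypothetical second table)
SELECT c.name, e.student
FROM eduonixcourses c
JOIN enrolments e ON (c.id = e.course_id);

-- store query results in the Hadoop file system
INSERT OVERWRITE DIRECTORY '/tmp/course_report'
SELECT * FROM eduonixcourses;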

In the Sqoop tutorial we demonstrated how Sqoop is used as a tool for importing data from relational databases into Hadoop. Here we will demonstrate how data can be loaded into Hadoop from the local system and then loaded into Hive. Download the cars.csv data from http://perso.telecom-paristech.fr/~eagan/class/as2013/inf229/data/cars.csv. This data set records the fuel consumption and other characteristics of a range of car models.

First we need to do some preprocessing, then upload the data into HDFS. Navigate to the downloads directory and check whether the cars.csv data has any header lines. If it does, we need to remove them because they are not part of the data.

Image12
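One way of checking for and removing the header, assuming the file was saved in ~/Downloads:

cd ~/Downloads
# inspect the first lines for header rows
head -2 cars.csv
# drop the first line if it is a header (repeat if there is a second header row)
sed -i '1d' cars.csv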

Once the headers have been removed we can upload the data into HDFS.

Image13
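The upload can be done with hdfs dfs -put; the target directory below is an assumption:

# create a directory in HDFS and copy the file into it
hdfs dfs -mkdir -p /user/eduonix/cars
hdfs dfs -put ~/Downloads/cars.csv /user/eduonix/cars/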

Before we can load the data into Hive we need to create a table similar in structure to the data we are loading. Invoke hive and run the commands below.

Image14
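A sketch of a matching table definition; the column names reflect the typical layout of this data set and are assumptions, as is the comma delimiter (adjust FIELDS TERMINATED BY if the downloaded file uses a different separator such as ';'):

CREATE TABLE cars (
  car STRING,
  mpg DOUBLE,
  cylinders INT,
  displacement DOUBLE,
  horsepower DOUBLE,
  weight DOUBLE,
  acceleration DOUBLE,
  model INT,
  origin STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;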

With the cars.csv data in HDFS and the table structure in place, we can load the data into Hive.

Image15
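Loading from HDFS is then a single statement (the path matches the upload step above):

-- moves the file from the HDFS path into the Hive warehouse
LOAD DATA INPATH '/user/eduonix/cars/cars.csv' INTO TABLE cars;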

This tutorial has highlighted the role of Hive in the Hadoop ecosystem, the features it provides for managing and analyzing data, and the way data is organized in Hive. It has also explained how to install and configure Hive on top of Hadoop, and demonstrated how to load data into HDFS from the local system and then into Hive.
