Learn How To Write Advanced Queries To Manipulate Data Using Hive

In previous Hive tutorials we looked at installing and configuring Hive, data modeling, and the use of partitions to improve query response time. For a review of those concepts, please refer to the tutorials on setting up Hive, creating effective data models in Hive, and using partitions. A good understanding of those concepts is assumed in this tutorial. Here we will focus on writing queries that get answers from your data.

Data from the New York Stock Exchange for the years 2000 and 2001 will be used to demonstrate the concepts in this tutorial. It is available for download from https://s3.amazonaws.com/hw-sandbox/tutorial1/infochimps_dataset_4778_download_16677-csv.zip. Unzip the data, create a directory in HDFS, and move the files there.

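The commands below are one way to do this; the local file names and the HDFS directory /user/hive/nyse are assumptions, so adjust them to match your environment.

# Unzip the archive, then create an HDFS directory and copy the daily prices files into it.
unzip infochimps_dataset_4778_download_16677-csv.zip
hadoop fs -mkdir -p /user/hive/nyse
hadoop fs -put NYSE_daily_prices_*.csv /user/hive/nyse/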
We need to create a table in Hive to store our stock data. The command below does that.

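A representative CREATE TABLE statement is sketched below. The table name stocks, the column names and types, and the comma-delimited text format are assumptions based on the layout of the NYSE daily prices files; adjust them if your copy of the data differs.

-- The exchange column is named exch to avoid the EXCHANGE keyword in newer Hive releases.
-- If your CSV files include a header row, add
-- TBLPROPERTIES ('skip.header.line.count'='1') at the end of the statement (Hive 0.13+).
CREATE TABLE stocks (
  exch            STRING,
  symbol          STRING,
  ymd             STRING,
  price_open      FLOAT,
  price_high      FLOAT,
  price_low       FLOAT,
  price_close     FLOAT,
  volume          INT,
  price_adj_close FLOAT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';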
Once our table has been created we load data into it using the command below.

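Assuming the files were copied to /user/hive/nyse as in the earlier sketch:

-- LOAD DATA INPATH moves the files from the given HDFS directory into the table.
LOAD DATA INPATH '/user/hive/nyse' INTO TABLE stocks;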
Load dividends data into HDFS using the command below.

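Again, the directory name and file names below are assumptions:

hadoop fs -mkdir -p /user/hive/nyse_dividends
hadoop fs -put NYSE_dividends_*.csv /user/hive/nyse_dividends/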
We also need to create a table for the dividends data. The command below does that.

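A sketch of the dividends table, with column names assumed to match the dividends CSV layout:

CREATE TABLE dividends (
  exch     STRING,
  symbol   STRING,
  ymd      STRING,
  dividend FLOAT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';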
To load the dividends data into the table, the command below is used.

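Assuming the dividends files were placed in /user/hive/nyse_dividends:

LOAD DATA INPATH '/user/hive/nyse_dividends' INTO TABLE dividends;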
Hive provides a syntax for querying data that is very similar to SQL. The SELECT statement is the main query used to interact with stored data. At its most basic level, you specify the columns and the table from which you would like to select data. You can use a wildcard or list the columns to return, although the former technique is not advisable. Query examples are shown below.
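The queries below illustrate both styles against the stocks table sketched above:

-- Return every column of every row (works, but not advisable on large tables).
SELECT * FROM stocks;

-- Return only the columns of interest.
SELECT symbol, ymd, price_close FROM stocks;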

The queries shown above are not very meaningful because they return all the data without any way of limiting what is returned. The WHERE clause offers a way of limiting the data returned by specifying conditions that need to be met. Conditions can be specified using predicate operators, logical operators and functions, the standard operators available in programming languages and SQL. Functions for working with dates, strings, mathematical operations, aggregations and text analytics are available. These functions and operators are very extensive, so the reader is referred to the HiveQL documentation for an exhaustive discussion. Operators and functions can be combined to create multiple criteria as desired. For example, the AND operator can be used to specify two conditions that both need to be true for records to be returned, while the OR operator specifies two conditions where either needs to be true. For example, to return records where the closing price of a stock is greater than the opening price we would use the query shown below.

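Using the assumed column names, the query looks like this:

SELECT symbol, ymd, price_open, price_close
FROM stocks
WHERE price_close > price_open;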
To group results using a specific column and apply an aggregate function to each group you use the GROUP BY clause. You can group by a column that already exists in the data schema or use a function to compute a key that will be used to group the data. Aggregate functions that can be used include AVG, COUNT, MAX, MIN, SUM, STDDEV, VARIANCE, COLLECT_SET and user-defined aggregate functions. Null values present in the data are excluded when aggregate functions are applied; for example, when using the AVG function, null values are not considered in the calculation of the average. Examples of how aggregate functions are used are illustrated in the section below.

To get the average opening price of the stocks, you use the query below.
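One way to write this, using the assumed column names:

SELECT AVG(price_open) FROM stocks;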

To get the maximum closing price among records where the closing price was less than the opening price, the query below would be used.
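For example:

SELECT MAX(price_close)
FROM stocks
WHERE price_close < price_open;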

To count the number of records for each stock symbol, we use the COUNT aggregate function together with a GROUP BY clause, as shown in the query below.
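For example:

SELECT symbol, COUNT(*) AS num_records
FROM stocks
GROUP BY symbol;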

To filter the groups created by the GROUP BY clause by applying conditions to them, you use the HAVING clause.
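For example, to return only the symbols whose average closing price exceeds 50 (an arbitrary threshold):

SELECT symbol, AVG(price_close) AS avg_close
FROM stocks
GROUP BY symbol
HAVING AVG(price_close) > 50;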

To sort records using a specific column in ascending or descending order you use the ORDER BY clause. By default records are ordered in ascending order, so when you intend to sort in ascending order you do not need to specify any option. You can specify one or more columns to sort the data by.

An example of using ORDER BY to sort records in descending order is shown below.
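For example, to list records with the highest closing prices first:

SELECT symbol, ymd, price_close
FROM stocks
ORDER BY price_close DESC;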

Querying data from two tables is supported in Hive through joins, similarly to how it is done in relational databases. However, there are differences in the way joins are supported that you need to be aware of. In Hive the only types of joins supported are equality joins, outer joins and left semi joins. The restriction to equality joins arises from the difficulty of expressing a non-equality join as a MapReduce job. In this tutorial we loaded two data sets, on stocks and on dividends, and we would now be interested in querying the two tables together. As an example, we select the opening stock price and the dividend as shown below.
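A sketch of such a join, matching rows on symbol and trade date (column names as assumed earlier):

SELECT s.symbol, s.ymd, s.price_open, d.dividend
FROM stocks s
JOIN dividends d
  ON s.symbol = d.symbol AND s.ymd = d.ymd;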

When doing table joins, if one of the tables is small enough to fit in memory, a map join will give better performance compared to a regular join. Hive decides automatically whether to use a map join when hive.auto.convert.join is set to true, either in the hive-site.xml configuration file or from the Hive shell.
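For example, from the Hive shell:

SET hive.auto.convert.join=true;
-- With this set, the join above is converted to a map join automatically
-- when Hive judges one side small enough to fit in memory.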

This tutorial explored the most useful and commonly used Hive queries. We covered creating tables and loading data, querying all or specific columns, using a WHERE clause to filter the results returned, grouping and sorting results, and finally using joins to query two tables.
