In previous Hive tutorials we looked at Hive as the Hadoop project that offers data warehousing features. We demonstrated how to install and configure Hive and discussed best-practice guidelines for creating data models. If you are new to these concepts, please refer to the tutorials on setting up Hive and creating effective data models in Hive for a review. In this tutorial the main focus is on how to use partitioning and bucketing to speed up queries.
Partitioning is a data management technique that makes queries run faster by dividing a large table into smaller parts. This reduces the amount of data that has to be scanned, and therefore the time needed to run a query. Without a partitioning strategy, every data file residing in the table's data directory has to be scanned and then filtered, which can result in long-running queries, especially on large tables. With partitioning, only the data directories that are needed are scanned and the others are ignored. To use partitioning to your advantage, identify low-cardinality columns that are frequently used to filter data; these columns are used to split the data into partitions. Each value (or range of values) of a partition column forms a partition, which is stored in its own subdirectory within the table's data directory. Hive supports one or multiple partition columns.
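For instance, with the airline table partitioned by year that we create later in this tutorial, each partition becomes its own subdirectory under the table's directory. The paths below assume Hive's default warehouse location; the actual location depends on your configuration:

```
/user/hive/warehouse/partitionedairlinedata/year=1987/
/user/hive/warehouse/partitionedairlinedata/year=1988/
/user/hive/warehouse/partitionedairlinedata/year=2008/
```

A query filtering on `Year = 1987` only reads files under the first directory; the others are skipped entirely.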
Hive supports two partitioning approaches: static partitioning and dynamic partitioning. Static partitioning is preferable when you know the values of the partition columns before data is loaded into a Hive table. Dynamic partitioning is better when the partition column values are only known at load time. The choice between the two is not always clear-cut, but there are some key points to consider.
Dynamic partitioning is suitable in situations when:
- You are loading data from a Hive table that is not yet partitioned. When such a table is first loaded there is no need for partitioning because the table is likely small. Over time, however, data growth hurts query performance. Using dynamic partitioning when reloading the data resolves this.
- The values of the partition columns are not known. When you cannot enumerate the unique values in a column ahead of time, you cannot use static partitioning. In such situations Hive identifies the unique values and creates the partitions automatically.
- Due to data growth you decide to change the columns used to partition the data. This arises when the original partitioning scheme can no longer cope with data growth. It is resolved by creating a new table with the new partition columns, loading data into the new table using dynamic partitioning, and dropping the previous table.
Table partitioning is supported on both external and internal tables. In the next sections we will demonstrate how this is done.
To demonstrate how to partition data we will use airline delay data from the US Department of Transportation, available at http://stat-computing.org/dataexpo/2009/the-data.html. The data covers 1987 to 2008. In this tutorial we will use the data for 1987, but you can download and use the data for all the years.
We need to remove the header row from the data files and move them to an HDFS directory.
hadoop fs -mkdir /usr/local/airline1987
hadoop fs -put /home/eduonix/Downloads/1987.csv /usr/local/airline1987
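One simple way to strip the header row is `tail -n +2`, which prints a file from its second line onward. The sketch below demonstrates this on a small sample file; for the tutorial you would run the same `tail` command against /home/eduonix/Downloads/1987.csv before the `hadoop fs -put` step:

```shell
# Create a small sample CSV with a header row (stand-in for 1987.csv).
printf 'Year,Month,DayofMonth\n1987,10,14\n1987,10,15\n' > sample.csv

# 'tail -n +2' outputs everything from line 2 onward, dropping the header.
tail -n +2 sample.csv > sample_noheader.csv

cat sample_noheader.csv
```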
Static partitioning is the default way of creating partitions. We will demonstrate it first, then cover dynamic partitioning in the next section.
With our data in CSV format we can either load it directly into a partitioned table or create a non-partitioned staging table and query it to populate the partitioned table. For simplicity we will load data from the CSV file directly. Our interest is in partitioning the data by year, so the 1987 data forms one partition. In the CREATE TABLE command, the PARTITIONED BY clause specifies the columns that will be used to partition the data. Start the Hive shell and use the command below to create a partitioned table.
CREATE TABLE partitionedairlinedata
(DayofMonth INT ,
DayOfWeek INT ,
DepTime INT ,
CRSDepTime INT ,
ArrTime INT ,
CRSArrTime INT ,
UniqueCarrier STRING ,
FlightNum INT ,
TailNum STRING ,
ActualElapsedTime INT ,
CRSElapsedTime INT ,
AirTime STRING ,
ArrDelay INT ,
DepDelay INT ,
Origin STRING ,
Dest STRING ,
Distance INT ,
TaxiIn STRING ,
TaxiOut STRING ,
Cancelled INT ,
CancellationCode STRING ,
Diverted INT ,
CarrierDelay STRING ,
WeatherDelay STRING ,
NASDelay STRING ,
SecurityDelay STRING )
PARTITIONED BY (Year INT )
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
Note that the columns used to partition the data are excluded from the column list of the table definition; Hive adds them automatically and they remain queryable like any other column. To create an external table instead, add the EXTERNAL keyword between CREATE and TABLE.
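As a sketch, an external variant of the same table might look like the following; the table name and LOCATION path here are illustrative assumptions, not part of the tutorial's dataset:

```sql
CREATE EXTERNAL TABLE externalairlinedata
(DayofMonth INT ,
DayOfWeek INT
-- remaining columns as in partitionedairlinedata
)
PARTITIONED BY (Year INT )
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/usr/local/airlinedata';
```

With an external table, dropping the table removes only its metadata; the files under LOCATION are left in place.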
When using static partitioning each partition has to be loaded independently, so we need a separate statement for each load. The airline data runs from 1987 to 2008, and we chose year as our partitioning column, so each year's data file must be loaded separately. To load the 1987 data we use the command below; the process is then repeated for every other year.
LOAD DATA INPATH '/usr/local/airline1987/1987.csv'
INTO TABLE partitionedairlinedata
PARTITION (Year = 1987);
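After the load you can verify that the partition was created and see partition pruning at work. The statements below are a sketch of that check; only the year=1987 subdirectory is scanned for the filtered query:

```sql
-- List the partitions that currently exist for the table.
SHOW PARTITIONS partitionedairlinedata;

-- Filtering on the partition column prunes all other partitions.
SELECT COUNT(*) FROM partitionedairlinedata WHERE Year = 1987;
```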
Several settings need to be changed before loading data with dynamic partitioning. To change settings permanently, edit the hive-site.xml file; to change them for a particular session, use the Hive shell. SET hive.exec.dynamic.partition=true; enables dynamic partitioning. SET hive.exec.dynamic.partition.mode=nonstrict; lifts the restriction that at least one partition column must be static. SET hive.exec.max.dynamic.partitions=1000; limits the total number of dynamic partitions in a table, and SET hive.exec.max.dynamic.partitions.pernode=256; limits the number of partitions each node can create.
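For a single session, these settings can be applied together in the Hive shell before running the load:

```sql
-- Session-level settings for dynamic partitioning.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions=1000;
SET hive.exec.max.dynamic.partitions.pernode=256;
```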
To load data using dynamic partitioning you create a partitioned table as demonstrated earlier, but LOAD DATA cannot assign rows to partitions on its own; it only moves files. Instead, load the full dataset into a non-partitioned staging table and then populate the partitioned table with an INSERT ... SELECT, where the partition column comes last in the SELECT list. This is a one-time operation, as opposed to the multiple loads needed with static partitioning, and the data being loaded contains the partition column even though it is omitted from the partitioned table's column list. Assuming a staging table named stagedairlinedata whose columns mirror the partitioned table's columns followed by Year as the final column, the load looks like this:
LOAD DATA INPATH '/usr/local/airline1987/1987.csv'
OVERWRITE INTO TABLE stagedairlinedata;
INSERT OVERWRITE TABLE partitionedairlinedata PARTITION (Year)
SELECT * FROM stagedairlinedata;
This tutorial introduced partitioning as a technique for making queries run faster and reviewed how partitions improve query speed. The two partitioning approaches supported by Hive, static and dynamic, were discussed in detail and their use was demonstrated.