Apache Pig is a high-level scripting platform and part of the Apache Hadoop ecosystem. Pig scripts are mainly used for data analysis and manipulation on top of Hadoop. MapReduce is the programming model that Hadoop uses for parallel processing, and Pig uses MapReduce internally to process data in a distributed environment. In effect, Pig provides an abstraction over the MapReduce model that makes programming easier for developers. Pig's language resembles SQL, so developers can write SQL-like statements for data processing without using MapReduce directly.
Introduction: The power of Apache Pig lies in its ability to express data analysis tasks as data flows, in which data moves from one processing step to the next. Another important feature of Pig is its User Defined Functions (UDFs), which let Pig scripts call code written in popular high-level languages such as Java, Python, and Ruby. Conversely, Pig scripts can be executed from other languages. So you can use Pig to express complex business logic that must run in parallel on a distributed computing system, and then invoke it from different applications as a component.
Conceptual thinking – How Pig works
A good way to understand the workflow of a Pig script is to compare it with the ETL process. In an ETL (Extract – Transform – Load) process, the data is first extracted from the sources, then processed according to the business logic, and finally stored in a database. A Pig script follows the same mechanism. The steps are as follows:
- First, Pig extracts the data from sources (stream, flat file, dynamic data etc.) using a UDF – This is the input.
- Second, Pig performs its operations (such as select, iterate, and other transforms) on the data – This is the initial processing.
- Third, Pig passes the data to other complex systems for more processing (using a UDF) – This is further processing.
- Finally, Pig stores the result in the Hadoop Distributed File System (HDFS) – This is the storage.
Internally, every Pig task is a series of MapReduce jobs that run on a Hadoop cluster. These jobs are optimized by the Pig interpreter to improve performance.
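The steps above can be sketched as a short Pig Latin script. The example below is a minimal sketch rather than anything taken from a real project: the file names (etl_example.pig, employees, adults_by_age) and the sample rows are hypothetical, step three (handing data to an external system through a UDF) is left out, and the shell wrapper only runs the script if a pig command is found on the PATH.

```shell
#!/bin/sh
# Sketch of the extract -> transform -> store flow in Pig Latin.
# All file names and sample rows here are hypothetical.
set -eu

WORKDIR=$(mktemp -d)
cd "$WORKDIR"

# Input: name and age separated by a tab (PigStorage splits on tabs by default).
printf 'Alice\t34\nBob\t27\nCarol\t34\n' > employees

cat > etl_example.pig <<'EOF'
-- Step 1, input: LOAD reads the data through a load function.
E = LOAD 'employees' USING PigStorage() AS (name:chararray, age:int);
-- Step 2, initial processing: keep ages of 30 or more, then count per age.
A = FILTER E BY age >= 30;
G = GROUP A BY age;
C = FOREACH G GENERATE group AS age, COUNT(A) AS total;
-- Step 4, storage: in mapreduce mode this path would be on HDFS.
STORE C INTO 'adults_by_age' USING PigStorage(',');
EOF

if command -v pig >/dev/null 2>&1; then
    pig -x local etl_example.pig
    cat adults_by_age/part-*
else
    echo "pig not found; script written to $WORKDIR/etl_example.pig"
fi
```

Each relation (E, A, G, C) is one stage of the data flow. Pig decides how to turn the stages into MapReduce jobs, so the script never mentions mappers or reducers.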
Apache Pig components
The main components of Apache Pig are its infrastructure layer and the language layer.
Infrastructure layer: This layer contains the compiler that generates a sequence of MapReduce jobs from Pig scripts. It runs on a distributed parallel computing framework.
Language layer: The language layer contains a textual language known as ‘Pig Latin’. Its syntax resembles SQL. It has the following features:
- Simple programming: It provides a simple way to write scripts that run data analytics tasks in parallel. Complex tasks, including multi-step data transformations, are expressed as data flow sequences, which makes Pig Latin scripts easy to write, understand, and maintain.
- Better optimization: Because of the way tasks are encoded, the system can optimize their execution automatically.
- Extensible: The language can be extended with custom functions.
Pig – Operators: Pig has many operators to perform its tasks, such as ‘LOAD’ and ‘FOREACH’.
Pig – User Defined Functions: Pig supports User Defined Functions to perform complex tasks. These functions can be written in Java, among other languages.
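As an illustration of a UDF in a language other than Java, the sketch below writes a small Python function and a Pig script that registers it through Pig's Jython support. The names udfs.py, myfuncs, to_upper, and udf_example.pig are hypothetical, and actually running the script assumes a Pig installation with Jython available on its classpath, so the wrapper only writes the files and prints the command.

```shell
#!/bin/sh
# Sketch: a Python (Jython) UDF and a Pig script that calls it.
# File, namespace, and function names are hypothetical.
set -eu

WORKDIR=$(mktemp -d)
cd "$WORKDIR"

printf 'alice\t34\nbob\t27\n' > employees

# The outputSchema decorator is expected to be supplied by Pig when the
# file is registered with USING jython; it declares the return type.
cat > udfs.py <<'EOF'
@outputSchema("upper_name:chararray")
def to_upper(name):
    # Null fields arrive as None, so pass them through unchanged.
    if name is None:
        return None
    return name.upper()
EOF

cat > udf_example.pig <<'EOF'
REGISTER 'udfs.py' USING jython AS myfuncs;
E = LOAD 'employees' USING PigStorage() AS (name:chararray, age:int);
U = FOREACH E GENERATE myfuncs.to_upper(name);
DUMP U;
EOF

echo "Run with: cd $WORKDIR && pig -x local udf_example.pig"
```

The REGISTER statement binds the functions in the file to a namespace, and the script then calls them like built-in functions inside FOREACH ... GENERATE.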
How to install and execute Pig?
In this section, we will discuss the installation and execution of Pig scripts, one step at a time.
- Prerequisite: All UNIX and Windows users should have Hadoop and Java installed on their system. HADOOP_HOME and JAVA_HOME should be set properly.
- Pig Download: First, download a stable version of Pig. Then unpack the distribution and note the location of the pig script and the Pig properties file. After this, add the ‘bin’ directory to your path as shown below.
$ export PATH=/<path-to-pig>/pig-n.n.n/bin:$PATH
Now test the Pig installation with the following command, which prints Pig's help text. If the help is displayed, your Pig installation is successful.
$ pig -help
- Run/Execute Pig commands: Pig Latin statements and Pig scripts can be run in either ‘Local’ or ‘MapReduce’ mode. Local mode needs only a single machine, while mapreduce mode needs a Hadoop cluster and an HDFS installation. In either mode, Pig can be started in two ways: with the ‘pig’ command (the ‘bin/pig’ Perl script), or with the ‘java’ command (‘java -cp pig.jar’). Choose the mode based on the infrastructure available, such as a local installation or a clustered environment.
Local Mode: To run Pig in local mode, install all required files on your local file system and then run Pig from the local host.
Listing 1: Showing Pig running in local mode
# Run Pig in local mode
$ pig -x local
Mapreduce Mode: To run Pig in mapreduce mode, you need a Hadoop cluster and an HDFS installation. Mapreduce mode is the default, so the ‘-x’ flag is optional.
Listing 2: Showing Pig running in mapreduce mode
# Run Pig in mapreduce mode (this is the default mode)
$ pig -x mapreduce
In general, Pig can be run in interactive or batch mode. In interactive mode, the ‘Grunt’ shell is used to enter individual Pig Latin statements. In batch mode, Pig Latin statements are put in a script file with a .pig extension and run from the command line. This is similar to running individual SQL statements versus SQL scripts.
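The prerequisites and the mode choices from this section can be collected into a small shell helper. This is a minimal sketch under stated assumptions: the function names pig_preflight and pig_command are hypothetical, and the helper only inspects the environment and prints command lines, so it works even before Pig is installed.

```shell
#!/bin/sh
# Report missing prerequisites, one line per problem.
# Returns the number of problems found (0 means ready).
pig_preflight() {
    problems=0
    if [ -z "${JAVA_HOME:-}" ]; then
        echo "JAVA_HOME is not set"
        problems=$((problems + 1))
    fi
    if [ -z "${HADOOP_HOME:-}" ]; then
        echo "HADOOP_HOME is not set"
        problems=$((problems + 1))
    fi
    if ! command -v pig >/dev/null 2>&1; then
        echo "pig is not on the PATH"
        problems=$((problems + 1))
    fi
    return "$problems"
}

# Print the command line for a mode ('local' or 'mapreduce') and an
# optional script file. Without a script, Pig starts the Grunt shell.
pig_command() {
    mode=$1
    script=${2:-}
    if [ -n "$script" ]; then
        echo "pig -x $mode $script"
    else
        echo "pig -x $mode"
    fi
}

pig_preflight || echo "fix the items above before running Pig"
pig_command local                   # interactive mode, local execution
pig_command mapreduce example.pig   # batch mode on a Hadoop cluster
```

The two choices are independent: the ‘-x’ flag selects where the work runs, and the presence of a script file selects batch rather than interactive mode.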
How to run Pig Latin statements and Pig script?
In this section, we will work through examples of running Pig Latin statements and Pig scripts. In the following example, employee records are loaded from storage, and the names are then extracted and dumped as output.
Listing 3: Showing Pig statements
grunt> E = LOAD 'employees' USING PigStorage() AS (name:chararray, age:int);
grunt> N = FOREACH E GENERATE name;
grunt> DUMP N;
Following is the output:
Now, the same task can be done by using a script file (example.pig). The code snippet is shown below.
Listing 4: Showing Pig script
/* example.pig */
E = LOAD 'employees' USING PigStorage() AS (name:chararray, age:int);
N = FOREACH E GENERATE name;
DUMP N;
/* End of script */
Now, run the script file as shown below:
$ pig -x local example.pig
Following is the output:
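The exact output depends on the contents of the ‘employees’ input. To try the script end to end, a minimal sketch is shown below, assuming a tab-separated input file with hypothetical rows and a script that ends with a DUMP statement. The wrapper works in a temporary directory and runs Pig only if the pig command is available.

```shell
#!/bin/sh
# Sketch: create sample input and run example.pig in local mode.
# The sample rows are hypothetical.
set -eu

WORKDIR=$(mktemp -d)
cd "$WORKDIR"

# One employee per line: name and age separated by a tab.
printf 'Alice\t34\nBob\t27\n' > employees

cat > example.pig <<'EOF'
/* example.pig */
E = LOAD 'employees' USING PigStorage() AS (name:chararray, age:int);
N = FOREACH E GENERATE name;
DUMP N;
EOF

if command -v pig >/dev/null 2>&1; then
    # DUMP prints each tuple in parentheses, for example (Alice).
    pig -x local example.pig
else
    echo "pig not found; run 'pig -x local example.pig' in $WORKDIR later"
fi
```

In local mode, the relative path 'employees' is resolved against the local file system, which is why the wrapper changes into the working directory first.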
Conclusion: In this article, we have seen that Pig is a powerful scripting language built on the Hadoop ecosystem and the MapReduce programming model. It can be used to process large volumes of data in a distributed environment. Pig statements and scripts are similar to SQL statements, so developers can use them without focusing much on the underlying mechanism. We hope that Apache Pig will continue to evolve and support more efficient computing.