Learn the Ropes – Pig

Pig is a high-level platform for creating MapReduce programs that run on Hadoop, originally developed at Yahoo in 2006. It is a powerful tool for querying data in a Hadoop cluster, and it makes writing MapReduce jobs considerably easier.

Pig is handy for expressing data flows. It has few reserved keywords, its scripts read like an SQL logical query plan, and it can be easily extended using user-defined functions (Java, Python, Ruby, etc.). Pig processing can be divided into three logical levels –

  1. Data Loading – Load data from local files, HDFS, HBase, Hive, etc.
  2. Data Manipulation – Use Pig operators (FILTER, GROUP, ORDER) or built-in functions (AVG, MIN, MAX) or even UDFs to refine data.
  3. Data Persisting – Store the processed data back in HDFS, Hive, etc.
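The three levels above can be sketched as a minimal Pig script (the file name, field names, and price threshold here are illustrative, not from the examples below):

data = LOAD 'input.csv' USING PigStorage(',') AS (product_name:chararray, selling_price:double); -- 1. load
cheap = FILTER data BY selling_price < 250;                                                      -- 2. manipulate
STORE cheap INTO 'cheap_products';                                                               -- 3. persist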

Let us now get our hands dirty with some code.

Programming in Pig

Pig has two execution modes or exectypes:

  • Local Mode – To run Pig in local mode, you need access to a single machine; all files are installed and run using your local host and file system.
$ pig -x local ...
  • MapReduce Mode – To run Pig in MapReduce mode, you need access to a Hadoop cluster and an HDFS installation. MapReduce mode is the default mode.
$ pig ...
or
$ pig -x mapreduce ...

You can run Pig in batch mode using Pig scripts, or interactively through the Grunt shell. To run a batch script in MapReduce mode:

$ pig id.pig
or
$ pig -x mapreduce id.pig

To run it in local mode, just use

$ pig -x local id.pig

For the rest of this session we will be using Pig in local interactive mode.

Let's look at the Hello World of the Big Data space – the word count program in Pig. The language is called 'Pig Latin'. (Fun fact: Pig is a lazily evaluated language – it keeps track of all the commands you give it without actually executing them until it has to STORE or DUMP data. Only then does it compile and execute, which lets it make a few optimizations in the query execution.)

This is just to get a feel for programming in Pig; we will go over each function in detail in the following posts. Assume a file containing a set of words – 'input.txt'.

A = load 'input.txt';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = group B by word;
D = foreach C generate group, COUNT(B);
dump D;

Compared to a MapReduce job in Java, this takes considerably less code and time to write, with syntax and logic that feel far more familiar and natural. This is one of the major advantages of Pig – abstracting away the details of data manipulation.

Let's now dive into the details of Pig Latin.

Pig Latin Data Types

The most common data types that can be handled in Pig are as follows. The simple data types are

  • int
  • long
  • float
  • double
  • chararray [string]
  • bytearray

The complex data types are

  • Tuple: An ordered set of fields, e.g., ('console', 'mouse')
  • Bag: A collection of tuples, e.g., {('laptop', 'keyboard'), ('chikki', 'chips')}
  • Map: A set of key-value pairs, e.g., [tech#laptop, food#chikki]
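As a rough sketch, these complex types can also appear in a LOAD schema (the file name and field names below are made up for illustration):

orders = LOAD 'orders.txt' AS (id:int, product:tuple(name:chararray, category:chararray), buyers:bag{t:(buyer:chararray)}, attributes:map[]);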

Now let's get our hands dirty with some code. You can use the data shown below. Save it in CSV format to a file called 'input.txt' in your local directory, and fire up Pig in local mode (pig -x local).

The columns are PRODUCT NAME, SELLING PRICE, COST PRICE, and STOCK DATE:

Age of Empires,200,120,20/02/2017
Call of Duty,350,200,3/3/2017
Prince of Persia,300,210,4/1/2017
Need For Speed,400,220,12/3/2017

Data Loading

The first step in any data flow is to specify your input. In Pig Latin this is done with the LOAD statement.

LOAD loads input in various formats, including tab-separated and comma-separated text files.

Data = LOAD 'input.txt';
Data = LOAD 'input.txt' USING PigStorage(',');

A schema can also be specified:

Data = LOAD 'input.csv' using PigStorage(',') as (product_name:chararray, selling_price:double, cost_price: double, stock_date: chararray);

To persist data, Pig writes it back out (to HDFS in MapReduce mode) as a tab-delimited file, using PigStorage by default.

STORE Data INTO '/path/to/hdfs';
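PigStorage also accepts a delimiter when storing, so you can change the output format – for example, to write comma-separated output instead of the default tabs:

STORE Data INTO '/path/to/hdfs' USING PigStorage(',');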

Data Manipulation

Pig has a variety of built-in relational operators for transforming data. Let's look at them one by one.

FOREACH

FOREACH takes a set of expressions and applies them to every record in the data, generating new records from them. For example, the following code loads entire records but then keeps only the product_name field:

data = load 'input.txt' using PigStorage(',') as (product_name:chararray, selling_price:double, cost_price: double, stock_date: chararray);
selective = foreach data generate product_name;

FOREACH supports a wide range of expressions, such as arithmetic – here, the difference between two fields:

prices = LOAD 'input.csv' using PigStorage(',') as (product_name:chararray, selling_price:double, cost_price: double, stock_date: chararray);
difference = foreach prices generate selling_price - cost_price;

Fields can be referenced by name or by position. Positional references are preceded by a $ (dollar sign) and start from 0.

difference = foreach prices generate $1 - $2;

DISTINCT

DISTINCT removes duplicate records. Note that it works on entire records, not on individual fields.

data = LOAD 'input.csv' using PigStorage(',') as (product_name:chararray, selling_price:double, cost_price: double, stock_date: chararray);
without_duplicates = distinct data;

LIMIT

LIMIT is useful when you want to see only a certain number of results.

data = LOAD 'input.csv' using PigStorage(',') as (product_name:chararray, selling_price:double, cost_price: double, stock_date: chararray);
top_100 = limit data 100;

FILTER

The FILTER statement allows you to select which records are retained in your data pipeline. For example, keep only products with a selling price above 250:

data = load 'input.txt' using PigStorage(',') as (product_name:chararray, selling_price:double, cost_price: double, stock_date: chararray);
filtered = filter data by (selling_price > 250);

GROUP

The GROUP statement collects together records that have the same value for a key. In the example below, we group the data by product_name and then take the sum of the selling prices of products with the same name.

data = load 'input.txt' using PigStorage(',') as (product_name:chararray, selling_price:double, cost_price: double, stock_date: chararray);
product_sales = group data by product_name;
product_revenue = foreach product_sales generate group, SUM(data.selling_price);
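Note that GROUP produces records with two fields: the group key (named group) and a bag containing all the records that share that key (named after the grouped alias). You can inspect the resulting schema with DESCRIBE – the output should look roughly like this:

describe product_sales;
-- product_sales: {group: chararray, data: {(product_name: chararray, selling_price: double, cost_price: double, stock_date: chararray)}}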

ORDER

The ORDER statement sorts your data in either ascending or descending order, as specified.

data = load 'input.txt' using PigStorage(',') as (product_name:chararray, selling_price:double, cost_price: double, stock_date: chararray);
sort_by_sp  = order data by selling_price DESC;

JOINS

JOIN selects records from one input and puts them together with matching records from another input. This is done by indicating a key for each input; records whose keys are equal are joined.

For this we need one more data file – ‘sales.txt’.

The columns are PRODUCT NAME, SALE_DATE, and QUANTITY:

Age of Empires,20-02-2017,2
Call of Duty,03-03-2017,1
Prince of Persia,04-01-2017,1
Need For Speed,12-03-2017,5
data = load 'input.txt' using PigStorage(',') as (product_name:chararray, selling_price:double, cost_price: double, stock_date: chararray);
sale = load 'sales.txt' using PigStorage(',') as (product_name:chararray, sale_date:chararray, quantity: int);
jnd = join data by (product_name), sale by (product_name);

This joins records that share the same product_name. The output is as follows – note that each joined record contains the fields from both inputs:

(Age of Empires,200,120,20/02/2017,Age of Empires,20-02-2017,2)
(Call of Duty,350,200,3/3/2017,Call of Duty,03-03-2017,1)
(Prince of Persia,300,210,4/1/2017,Prince of Persia,04-01-2017,1)
(Need For Speed,400,220,12/3/2017,Need For Speed,12-03-2017,5)

It can also join on multiple keys.

join_data = JOIN data by (product_name, stock_date), sale by (product_name, sale_date);

Phew – lengthy post! Let's take up eval functions (the mathematical functions in Pig) and UDFs in the next post.

That’s all Folks

Credits: Pig Official Docs



About Karun Thankachan

Working as Assoc Software Development Engineer at Dell International Services Ltd. Area of focus is Data Science and Engineering (Hadoop and Python)
This entry was posted in Big Data Analytics, Data Science, Distributed Computing, Hadoop, and MapReduce.
