Road to Machine Learning

I wanted to try more machine learning in Python. Among the blogs I read, I came across an interesting one: the 100 days of algorithms challenge.

It seemed like a good way to learn from those more experienced in the field.

Thought I would share it:

100 days of Algorithms

That’s all folks..

Posted in Machine Learning

Learn the Ropes – Pig

Pig is a high-level platform for creating MapReduce programs used with Hadoop, originally developed by Yahoo in 2006. It is a powerful tool for querying data in a Hadoop cluster: you describe a data flow in a few lines and Pig turns it into MapReduce jobs for you.

Pig is handy for manipulating data flows. It has few reserved keywords, its logical plan is comparable to that of an SQL query, and it can be easily extended with user-defined functions (UDFs) written in Java, Python, Ruby, etc. Pig processing can be divided into three logical levels:

  1. Data Loading – Load data from a file, HDFS, HBase, Hive, etc.
  2. Data Manipulation – Use Pig operators (FILTER, GROUP, ORDER), built-in functions (AVG, MIN, MAX) or even UDFs to refine the data.
  3. Data Persisting – Store the processed data back in HDFS, Hive, etc.

Let us now get our hands dirty with some code.

Programming in Pig

Pig has two execution modes or exectypes:

  • Local Mode – To run Pig in local mode, you need access to a single machine; all files are installed and run using your local host and file system.
$ pig -x local ...
  • Mapreduce Mode – To run Pig in mapreduce mode, you need access to a Hadoop cluster and an HDFS installation. Mapreduce mode is the default mode.
$ pig ...
$ pig -x mapreduce ...

You can run Pig in batch mode using Pig scripts, or in interactive mode. To run a batch script in mapreduce mode:

$ pig id.pig
$ pig -x mapreduce id.pig

To run in local mode, just use

$ pig -x local id.pig

For the rest of the session we will be using Pig in local interactive mode.

Let’s look at the Hello World of the Big Data space: the word count program in Pig. The language is called ‘Pig Latin’. (Fun fact: Pig is lazy, which suits its name. It just records the statements you give it in a plan and runs nothing until it reaches a DUMP or STORE. Only then does it compile and execute the plan, which gives it room to optimize the query.)

This is just to get a feel for programming in Pig; we will go over each function in detail in the following posts. Assume a file containing a set of words, ‘input.txt’:

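A = load 'input.txt';                                             -- each line is one record
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;  -- one record per word
C = group B by word;                                              -- collect identical words together
D = foreach C generate group, COUNT(B);                           -- (word, number of occurrences)
dump D;                                                           -- triggers the actual execution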
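To make the steps of the Pig script concrete, here is a rough Python sketch of the same word count. It is only an illustration of the logic, not how Pig executes it: the function name `word_count` and the sample lines are made up, and it assumes words are separated by whitespace only (Pig's TOKENIZE also splits on a few other characters).

```python
from collections import Counter


def word_count(lines):
    """Mirror the Pig script: tokenize each line, group by word, count."""
    # Step B: flatten(TOKENIZE(...)) turns each line into one record per word.
    words = [word for line in lines for word in line.split()]
    # Steps C and D: group identical words and count the size of each group.
    return dict(Counter(words))


# Hypothetical stand-in for the contents of input.txt.
sample_lines = ["pig hadoop pig", "hadoop hdfs"]
counts = word_count(sample_lines)
print(counts)
```

Unlike Pig, this runs eagerly and on a single machine; the point is only to show what each relation (B, C, D) holds.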

Compared to a MapReduce job written in Java, this takes considerably less time to write, and the syntax and logic feel more familiar and natural. This is one of the major advantages of Pig: it abstracts away the mechanics of data manipulation.

Let’s now dive into the details of Pig Latin.

Pig Latin Data Types

The most common data types that can be handled in Pig are as follows. The simple data types are

  • int
  • long
  • float
  • double
  • chararray [string]
  • bytearray

The complex data types are

  • Tuple: An ordered set of fields, e.g., ('console', 'mouse')
  • Bag: A collection of tuples, e.g., {('laptop', 'keyboard'), ('chikki', 'chips')}
  • Map: A set of key-value pairs, e.g., [tech#laptop, food#chikki]

Now let’s get our hands dirty with some code. You can use the data shown below. Save it to a file called ‘input.txt’ in your local directory in CSV format, then fire up Pig in local mode (pig -x local).

Age of Empires,200,120,20/02/2017
Call of Duty,350,200,3/3/2017
Prince of Persia,300,210,4/1/2017
Need For Speed,400,220,12/3/2017

The first step to any data flow is to specify your input. In Pig Latin this is done with the LOAD statement.

Data Loading

LOAD reads input in various formats, including tab-separated and comma-separated text files. Without a USING clause it expects tab-separated fields.

Data = LOAD 'input.txt';
Data = LOAD 'input.txt' USING PigStorage(',');

A schema can be specified with AS:

Data = LOAD 'input.txt' USING PigStorage(',') AS (product_name:chararray, selling_price:double, cost_price:double, stock_date:chararray);

STORE writes a relation back out. By default it uses PigStorage and produces tab-delimited files, on HDFS when running in mapreduce mode.

STORE Data INTO '/path/to/hdfs';

Data Manipulation

Pig has a variety of built-in operators. Let’s look at them one by one.


FOREACH takes a set of expressions and applies them to every record in the data. From these expressions it generates new records and stores them in a new relation. For example, the following code loads entire records, then drops every field except product_name:

data = load 'input.txt' using PigStorage(',') as (product_name:chararray, selling_price:double, cost_price:double, stock_date:chararray);
selective = foreach data generate product_name;

FOREACH supports a range of expressions, such as the difference between two fields:

data = load 'input.txt' using PigStorage(',') as (product_name:chararray, selling_price:double, cost_price:double, stock_date:chararray);
difference = foreach data generate selling_price - cost_price;

Fields can be referenced by name or by position. Positional references are preceded by a $ (dollar sign) and start from 0.

difference = foreach data generate $1 - $2;


DISTINCT removes duplicate records.

data = load 'input.txt' using PigStorage(',') as (product_name:chararray, selling_price:double, cost_price:double, stock_date:chararray);
without_duplicates = distinct data;


LIMIT is for when you only want to see a certain number of results. Unless the data has been ordered first, which records you get back is not guaranteed.

data = load 'input.txt' using PigStorage(',') as (product_name:chararray, selling_price:double, cost_price:double, stock_date:chararray);
top_100 = limit data 100;


The FILTER statement lets you select which records are retained in your data pipeline.

data = load 'input.txt' using PigStorage(',') as (product_name:chararray, selling_price:double, cost_price: double, stock_date: chararray);
filtered = filter data by (selling_price > 250);


The GROUP statement collects together records that have the same value for a key. In the example below we group all the data by product_name and then take the sum of the selling price of the products having the same name.

data = load 'input.txt' using PigStorage(',') as (product_name:chararray, selling_price:double, cost_price:double, stock_date:chararray);
product_sales = group data by product_name;
product_revenue = foreach product_sales generate group, SUM(data.selling_price);
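It helps to picture what GROUP actually produces: one record per key, holding a bag of all the original records for that key, over which SUM then runs. A minimal Python sketch of that two-step shape follows; the records and the helper names `group_by_product` and `revenue_by_product` are hypothetical, chosen only to illustrate.

```python
from collections import defaultdict

# Hypothetical (product_name, selling_price) records, with one repeated product.
records = [
    ("Age of Empires", 200.0),
    ("Call of Duty", 350.0),
    ("Age of Empires", 200.0),
]


def group_by_product(rows):
    """GROUP step: one entry per key, holding a 'bag' (list) of its records."""
    bags = defaultdict(list)
    for row in rows:
        bags[row[0]].append(row)
    return dict(bags)


def revenue_by_product(rows):
    """FOREACH ... SUM step: sum selling_price over each bag."""
    return {
        name: sum(price for _, price in bag)
        for name, bag in group_by_product(rows).items()
    }


product_sales = group_by_product(records)
product_revenue = revenue_by_product(records)
print(product_revenue)
```

Here `product_sales` plays the role of the grouped relation and `product_revenue` the final (group, sum) pairs.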


The ORDER statement sorts your data in ascending (the default) or descending order, as specified.

data = load 'input.txt' using PigStorage(',') as (product_name:chararray, selling_price:double, cost_price:double, stock_date:chararray);
sort_by_sp = order data by selling_price DESC;


Join selects records from one input to put together with records from another input. It is done by indicating keys for each input.

For this we need one more data file, ‘sales.txt’.

Age of Empires,20-02-2017,2
Call of Duty,03-03-2017,1
Prince of Persia,04-01-2017,1
Need For Speed,12-03-2017,5

data = load 'input.txt' using PigStorage(',') as (product_name:chararray, selling_price:double, cost_price:double, stock_date:chararray);
sale = load 'sales.txt' using PigStorage(',') as (product_name:chararray, sale_date:chararray, quantity:int);
Jnd = join data by (product_name), sale by (product_name);

This joins the records that have the same product_name in both inputs. The output is as follows (the second copy of product_name, which Pig also emits, is left out for readability):

Age of Empires 200 120 20/02/2017 20-02-2017 2
Call of Duty 350 200 3/3/2017 03-03-2017 1
Prince of Persia 300 210 4/1/2017 04-01-2017 1
Need For Speed 400 220 12/3/2017 12-03-2017 5

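It can also join on multiple keys. Note that in the sample files stock_date and sale_date are written in different formats, so they would need to be normalized before this join matches anything.

join_data = JOIN data by (product_name, stock_date), sale by (product_name, sale_date);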
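For intuition, an inner join like the ones above can be sketched in Python as a simple hash join. This is an illustration under assumptions, not Pig's implementation: `inner_join` is a made-up helper, the key functions are hypothetical, and only two sample rows per file are used.

```python
from collections import defaultdict


def inner_join(left, right, left_key, right_key):
    """Inner-join two lists of tuples; the key arguments extract the join key."""
    # Index the right-hand rows by their key so each lookup is cheap.
    index = defaultdict(list)
    for row in right:
        index[right_key(row)].append(row)
    # Keep every field from both sides, including both copies of the key.
    return [l + r for l in left for r in index.get(left_key(l), [])]


data = [
    ("Age of Empires", 200.0, 120.0, "20/02/2017"),
    ("Call of Duty", 350.0, 200.0, "3/3/2017"),
]
sale = [
    ("Age of Empires", "20-02-2017", 2),
    ("Call of Duty", "03-03-2017", 1),
]

# Single key: product_name on both sides.
by_name = inner_join(data, sale, lambda r: r[0], lambda r: r[0])

# Two keys: (product_name, date). The raw date strings differ in format,
# so no pair of rows has an equal key.
by_name_and_date = inner_join(
    data, sale, lambda r: (r[0], r[3]), lambda r: (r[0], r[1])
)
print(by_name)
print(by_name_and_date)
```

The second call shows why key formats matter: a multi-key join only matches rows whose keys are equal field by field.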

Phew, a lengthy post. Let’s take up Eval Functions (mathematical functions in Pig) and UDFs in the next post.

That’s all Folks

Credits : Pig Official Docs

Posted in Big data analytics, Data Science, Distributed Computing, Hadoop, MapReduce

Learning the Ropes – Machine Learning

Some good books I found for machine learning and am going through myself.

To get the basics of Data Mining down:

A Programmer’s Guide to Data Mining

A fresh approach to the Bayesian model and how powerful it can actually be.

Probabilistic Programming and Bayesian Methods for Hackers

From the Stanford Course for dealing with massive datasets –

Mining Massive Datasets

To get your hands dirty with the latest revolution in ML –

Deep Learning (MIT Press)

Posted in Machine Learning, Python, Statistical Modelling, Statistics

Learning the Ropes – Machine Learning

Came across a blog which takes a fresh approach to Machine Learning.

Just google ‘Machine Learning is Fun’ and the first link will take you there.

That’s all folks ..

Posted in Uncategorized

Deep Learning

Came across this cool GitHub repo with some interesting material on Deep Learning. Thought I’d share.

Deep Learning RoadMap

Deep Learning

That’s all folks ..


Posted in Deep Learning, Machine Learning

Learning the Ropes – Hadoop

Hadoop is a technology based on two ideas:

1. Hadoop Distributed File System (HDFS)
2. MapReduce Algorithm

HDFS (written in Java) provides scalable and reliable data storage. It was designed to span large clusters of commodity servers, meaning expensive production-class servers are not required.

An HDFS cluster consists of a ‘NameNode’, which manages the cluster metadata (permissions, access times, which data is stored in which block, etc.), and ‘DataNodes’, which store the data. For example, when a file is stored, its content is split into large blocks (64-128 MB), and each block is replicated and stored on multiple DataNodes. The file’s metadata, including the mapping of its data to blocks, is called the ‘Namespace’ and is kept on the NameNode.
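The splitting and replication described above can be sketched in a few lines of Python. This is a toy model under stated assumptions: the function names are made up, the replication factor of 3 is an assumed setting, and the round-robin placement is a deliberate simplification, not the placement policy HDFS really uses.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size):
    """Return the sizes of the blocks a file of file_size bytes is cut into."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks, datanodes, replication=3):
    """Assign each block to `replication` distinct DataNodes (round-robin toy policy)."""
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    return {
        block_id: [datanodes[(block_id + i) % len(datanodes)] for i in range(replication)]
        for block_id in range(num_blocks)
    }


# A hypothetical 300 MB file with 128 MB blocks gives 128 MB + 128 MB + 44 MB.
blocks = split_into_blocks(300 * MB, 128 * MB)
# The block-to-DataNode mapping is the kind of metadata the NameNode keeps.
namespace = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
print(blocks)
print(namespace)
```

The dictionary returned by `place_replicas` stands in for the NameNode's view; the DataNodes themselves would hold the actual bytes.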

This replication lets different parts of the same process read the same data at the same time. It also provides fault tolerance. How? DataNodes send the NameNode a heartbeat message every few seconds. If the NameNode stops receiving heartbeats from a DataNode (for about ten minutes with the default settings), it assumes that DataNode is down. It then re-creates the blocks that were on this DataNode on other available DataNodes, copying them from the surviving replicas.
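The failure-detection idea can be sketched as follows. Everything here is illustrative: the function names, the timestamps, the 630-second timeout and the replication factor of 3 are assumptions for the example, and real HDFS bookkeeping is considerably more involved.

```python
def find_dead_nodes(last_heartbeat, now, timeout):
    """Return the DataNodes whose last heartbeat is older than `timeout` seconds."""
    return sorted(node for node, seen in last_heartbeat.items() if now - seen > timeout)


def under_replicated(placement, dead, replication):
    """Return blocks with fewer than `replication` live replicas, with their survivors."""
    result = {}
    for block, nodes in placement.items():
        live = [n for n in nodes if n not in dead]
        if len(live) < replication:
            result[block] = live
    return result


# Hypothetical state: time of the last heartbeat seen from each DataNode (seconds).
last_heartbeat = {"dn1": 1000.0, "dn2": 1000.0, "dn3": 370.0, "dn4": 999.0}
placement = {"blk_0": ["dn1", "dn2", "dn3"], "blk_1": ["dn1", "dn2", "dn4"]}

dead = find_dead_nodes(last_heartbeat, now=1001.0, timeout=630.0)
todo = under_replicated(placement, dead, replication=3)
print(dead)
print(todo)
```

Each entry in `todo` is a block that needs a new copy, together with the surviving DataNodes it could be copied from.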

[Figure: HDFS NameNode and DataNodes]

Credits : Hortonworks

This ensures HDFS is reliable and fault-tolerant.

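Note : Despite its name, the Secondary NameNode is not a standby that takes over when the NameNode fails. It periodically merges the NameNode’s edit log into the file system image (checkpointing), which keeps restarts fast and leaves a recent copy of the metadata to recover from.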

YARN (Yet Another Resource Negotiator) is another element that became essential in later versions (Hadoop 2.0). It is a resource management framework that allocates cluster resources such as CPU and memory to applications and schedules their tasks, so that several workloads can process the data in HDFS concurrently. HDFS and YARN work together to distribute storage and computation across many servers, so that as data grows, storage and processing capacity can be scaled out economically.

[Figure]

Credits : Hortonworks

MapReduce is a programming model that helps process large data sets in a distributed and parallel manner. It borrows from functional programming and consists mainly of two sub-tasks: Map and Reduce. Each of these tasks operates by taking the code to the place where the data is stored. Multiple mapper tasks work together, after which multiple reducer tasks can work together. Each task works independently of the others, executing code on the specific portion of data it is assigned.

To understand the stages in the MapReduce workflow better, let us take an example. Consider the data below, where the aim is to count the number of times each city appears.

[Figure: Hadoop MR 1]

A MapReduce workflow goes through five stages to accomplish this task.

Splitting : The input data is split (as per the logic programmed) and each chunk of data is assigned to a map task. In the example the data is split at the end of each line, and each line shown below is given to a map task.

[Figure: Hadoop MR 2]

Mapping : This stage takes data as input and returns output as tuples or key-value pairs (the pair is determined by the logic programmed). In this example a map task splits a line on spaces. It then emits a (city, 1) pair for each city it finds.

[Figure: Hadoop MR 3]

Sort and Shuffle : The data is then sent to a sort and shuffle phase. This is done to simplify the logic needed in the reduce. In our example, (Bangalore,1) is spread across four mappers. If these pairs are brought to the same location, a reduce task can count how many times Bangalore occurred far more easily than when the data is distributed across systems. At the end of sort and shuffle the result is as shown below; each of these groups is then assigned to a reduce task.

[Figure: Hadoop MR 4]

Reduce : A reduce task computes the desired output from the data it has. In this example the first reducer has the data (Bangalore,1) (Bangalore,1) (Bangalore,1) (Bangalore,1), in which it must find the number of times a city occurs. Hence the result: (Bangalore,4). Similarly, each of our reducers returns its result as follows.

[Figure: Hadoop MR 5]

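Combiner : In case we only have one reducer, its output is already the final result. Otherwise we need to combine the results from all the reducers and format them into the output we require. (Note that in Hadoop itself the term Combiner refers to something else: an optional local reduce applied to each mapper’s output before the shuffle.) In the example we want to know how many times each city occurred, so the result would be

[Figure: Hadoop MR 6]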

To summarize, the MapReduce workflow is as follows

[Figure: Hadoop MR 7]
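The stages above can be walked through in a small single-machine Python sketch. The input is made up for illustration (only the four occurrences of Bangalore come from the example above; Mumbai and Delhi are hypothetical), and the function names are my own. Real Hadoop runs these stages as distributed tasks rather than plain function calls.

```python
from collections import defaultdict


def split_input(text):
    """Splitting: one chunk per line."""
    return text.strip().split("\n")


def map_line(line):
    """Mapping: emit a (city, 1) pair for every word in the line."""
    return [(city, 1) for city in line.split()]


def shuffle(pairs):
    """Sort and shuffle: bring all values for the same key together."""
    grouped = defaultdict(list)
    for key, value in sorted(pairs):
        grouped[key].append(value)
    return dict(grouped)


def reduce_counts(key, values):
    """Reduce: sum the ones collected for a single city."""
    return key, sum(values)


text = "Bangalore Mumbai\nDelhi Bangalore\nBangalore Mumbai\nBangalore Delhi"
mapped = [pair for line in split_input(text) for pair in map_line(line)]
shuffled = shuffle(mapped)
# Final step: collect every reducer's (city, count) pair into one result.
result = dict(reduce_counts(city, ones) for city, ones in shuffled.items())
print(result)
```

Because each `map_line` call looks at one line only and each `reduce_counts` call at one city only, both steps could be spread over many machines without changing the result.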

That’s all folks!..

Credits : Hortonworks HDFS Docs



Posted in Big data analytics, Distributed Computing, Hadoop, MapReduce

The Journey to Hadoop

Intel co-founder Gordon Moore noticed in 1965 that the number of transistors per square inch on integrated circuits had doubled every year since their invention. This later became known as “Moore’s Law”. (REF)

A common corollary is that the clock frequency of CPUs rises along with it. This held steady for over 40 years, but lately there has been a bit of stagnation. The main reasons are

  • We are reaching the physical limits of how small chips can be made. Intel has suggested silicon transistors can only keep shrinking for another five years (REF).
  • Even if we were capable of processing at higher speeds, another limiting factor would kick in: memory bandwidth (the rate at which data can be loaded onto the processor).

Both of these indicate that unless we get creative about the way we compute, we will hit a wall in terms of the amount of work we can do in a given time frame. Let’s take a look at the various points in history where we hit these walls in computational power, and how a new school of thought overcame each one.

Evolution of Computing

From 1964 to 1971 computers went through a significant change in terms of speed, courtesy of integrated circuits. This not only increased the speed of computers but also made them smaller, more powerful, less expensive and more accessible (REFERENCE). People wanted computers to be able to handle more computations.

At the time, software was written for serial computation, i.e. a set of instructions is executed one after another to complete a task. To handle more compute-intensive tasks, an idea came up: split out the tasks that are independent of one another and have them run in ‘parallel’. Consider the example below.

Take two sets of a hundred elements, A and B, from which a set C is to be created such that

C[i] = A[i] + B[i] where i = 0, 1, 2, …, 99

In the traditional approach each of the instructions

C[1] = A[1] + B[1]   (TASK 1)

C[2] = A[2] + B[2]   (TASK 2)

will be executed one after another. A system designed to handle parallel operations, on the other hand, recognizes that TASK 1 and TASK 2 are independent of one another and can therefore be executed together.
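Here is a small Python sketch of the same element-wise addition, done once serially and once by handing the independent tasks to a pool of workers. The sample values in A and B and the helper `add_at` are made up. One caveat: in CPython, threads mostly illustrate the independence of the tasks rather than deliver a real speed-up for arithmetic like this.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sample data: two sets of a hundred elements.
A = list(range(100))
B = list(range(100, 200))


def add_at(i):
    """One task: it touches only index i, so it depends on no other task."""
    return A[i] + B[i]


# Serial: the tasks run one after another.
C_serial = [add_at(i) for i in range(100)]

# Parallel: the tasks are handed to four workers; map() keeps the input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    C_parallel = list(pool.map(add_at, range(100)))

print(C_serial == C_parallel)
```

The two results are identical precisely because no task reads anything another task writes, which is what makes the work safe to split.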

Through the 70s parallel computing rose rapidly, and it held up well until the early 80s (there was a shift in the type of parallelism, from vector to thread parallelism; more on this later).

However, by the mid-80s traditional parallel computing had become expensive, requiring specialized hardware. This meant we needed to get creative and try a different approach.

If you were handed a considerable task and a time frame in which you know it is not possible to complete it, how would you tackle it? By asking for help? From peers? Friends? The same can be applied to computational systems: instead of loading the entire task onto a single system, split it into smaller tasks and let them run on multiple systems, all connected across a network. (This is quite a simplified way of looking at it; I will cover the details, like how to split, how to aggregate and how to synchronize, in later posts.)

This thought gave way to MPP, Massively Parallel Processing, in which systems connected over a network were used to process compute-intensive tasks. These systems were still specialized to parallelize computations and were quite expensive. (MPP also dealt with many of the issues of splitting and coordinating tasks over a network; more on this in a later blog post.)

In the mid-90s, due to a combination of cost and increasing computation requirements, the idea of splitting tasks over different systems was taken even further, leading to the Cluster or Grid architecture. This architecture was built by hooking up a number of COTS (commercial off-the-shelf) systems, not specialized for computation, tightly over a network and sharing the load across them. Google took this to the extreme, building a huge cluster and proving the architecture for compute-intensive tasks.

The cluster and grid architecture meant that tasks needed to be distributed across servers, and it had to deal with many issues such as concurrency of components and failure of components. Multiple solutions came up, and the approach gained more ground, especially now that the eventual end of Moore’s Law was understood. Among them, in the 2000s Google came out with two papers (on the Google File System and MapReduce) that inspired what we today call Hadoop.

Hadoop is basically a combination of two separate concepts:

HDFS – Hadoop Distributed File System, how to store and manage data in a distributed manner

MapReduce – How to process data in a distributed manner

The next post – ‘What is Hadoop’ will delve a bit more into the above components of Hadoop.

That’s all folks!..

Credits : Cluster Computing and MapReduce – Google

Posted in Big data analytics, Data Science, Distributed Computing, Hadoop, Parallel Computing