The Journey to Hadoop

In 1965, Intel co-founder Gordon Moore noticed that the number of transistors per square inch on integrated circuits had doubled every year since their invention. This observation later became known as “Moore’s Law” (REF).

A common corollary is that the clock frequency of CPU chips also doubles. This had been holding steady for over 40 years, but lately there has been a bit of stagnation. The main reasons are:

  • We are reaching the physical limits of how small chips can be made. Intel has suggested that silicon transistors can only keep shrinking for another five years (REF).
  • Even if we could process at higher speeds, another limiting factor would kick in: memory bandwidth (the rate at which data can be loaded onto the processor).

Both of these indicate that unless we get creative about the way we compute, we will hit a wall in terms of the amount of work we can do in a given time frame. Let’s take a look at the various points in history where we hit such walls in computational power, and how a new school of thought overcame each one.

Evolution of Computing

From 1964 to 1971 computers went through a significant change in terms of speed, courtesy of integrated circuits. This not only increased the speed of computers but also made them smaller, more powerful, less expensive and more accessible (REFERENCE). People wanted computers to be able to handle more computations.

At the time, software was written for serial computation, i.e. a set of instructions is executed one after another to complete a task. To handle more compute-intensive tasks, an idea came up: split a job into tasks that are independent of one another and run them in ‘parallel’. Consider the example below.

Take two sets of a hundred elements each, A and B, from which a set C is to be created such that

C[i] = A[i] + B[i], where i = 1, 2, …, 100

In the traditional approach, each of the instructions

C[1] = A[1] + B[1]  (TASK 1)

C[2] = A[2] + B[2]  (TASK 2)

will be executed one after another. A system designed to handle parallel operations, on the other hand, recognizes that TASK 1 and TASK 2 are independent of one another and can therefore be executed together.
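To make the contrast concrete, here is a minimal Python sketch of the same element-wise addition done first serially and then by a pool of workers. The values in A and B, the helper name `add_pair` and the choice of four workers are my own assumptions for illustration, and Python lists are indexed from 0 rather than 1.

```python
from concurrent.futures import ThreadPoolExecutor

# Two sets of a hundred elements each (arbitrary values, for illustration only).
A = list(range(100))
B = list(range(100, 200))

def add_pair(i):
    """One independent task: C[i] = A[i] + B[i]."""
    return A[i] + B[i]

# Serial approach: the tasks are executed one after another.
C_serial = [add_pair(i) for i in range(len(A))]

# Parallel approach: the same independent tasks are handed to a pool of
# workers. pool.map returns the results in the original order.
with ThreadPoolExecutor(max_workers=4) as pool:
    C_parallel = list(pool.map(add_pair, range(len(A))))

print(C_serial == C_parallel)
```

Since no task depends on the result of another, the order in which the workers pick them up does not matter. Note that in standard CPython, threads will not actually speed up CPU-bound work like this; the sketch only shows how the job is structured.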

Through the 1970s, parallel computing rose rapidly, and it held up well until the early 1980s (there was a shift in the type of parallelism, from vector to thread parallelism; more on this later).

However, in the mid-1980s traditional parallel computing became expensive, as it required specialized hardware. This meant we needed to get creative and try a different approach.

If you were given a considerable task and a time frame in which you know it is not possible to complete it, how would you tackle it? By asking for help, from peers or friends? The same can be applied to computational systems: instead of loading the entire task onto a single system, split it into smaller tasks and let them run on multiple systems, all connected across a network. (This is quite a simplified way of looking at it; I will cover the details, such as how to split, how to aggregate and how to synchronize, in later posts.)
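The split-and-aggregate idea above can be sketched in a few lines of Python. This is only a toy, under stated assumptions: the worker threads stand in for separate systems on a network, the job (a sum of squares) is made up, and the names `split`, `work` and `run_distributed` are hypothetical helpers, not part of any framework.

```python
from concurrent.futures import ThreadPoolExecutor

def split(data, n_parts):
    """Split the data into at most n_parts chunks of roughly equal size."""
    if not data:
        return []
    size = -(-len(data) // n_parts)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

def work(chunk):
    """The smaller task each 'system' performs: here, a sum of squares."""
    return sum(x * x for x in chunk)

def run_distributed(data, n_workers=4):
    """Split the job, run the pieces on separate workers, then aggregate."""
    chunks = split(data, n_workers)
    # Each worker thread is a stand-in for one machine on the network.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(work, chunks))
    return sum(partials)  # aggregate the partial results

data = list(range(1, 1001))
print(run_distributed(data) == sum(x * x for x in data))
```

The sketch leaves out everything that makes the real problem hard: moving data over a network, keeping the workers synchronized, and coping with a worker that fails halfway through.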

This thought gave way to MPP (Massively Parallel Processing), in which systems connected over a network were used to process compute-intensive tasks. These systems were still specialized to parallelize computations and were quite expensive. (MPP also dealt with many of the issues of splitting and coordinating tasks over a network; more on this in a later blog post.)

In the mid-1990s, a combination of cost and increasing computation requirements took the idea of splitting tasks over different systems even further, leading to the Cluster or Grid architecture. This architecture was built by hooking up a number of COTS (commercial off-the-shelf) systems, not specialized for computation, tightly over a network and sharing the load across them. Google took this to the extreme, building a huge cluster and proving the architecture for compute-intensive tasks.

The cluster and grid architecture meant that tasks needed to be distributed across servers, which brought many issues to deal with, such as concurrency of components and failure of components. Multiple solutions came up, and the approach gained more ground, especially now that the eventual end of Moore’s Law was understood. Among these, in the 2000s Google came out with two papers that inspired what we today call Hadoop.

Hadoop is basically a combination of two separate concepts:

HDFS (Hadoop Distributed File System): how to store and manage data in a distributed manner

MapReduce: how to process data in a distributed manner
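To give a feel for the MapReduce idea before the next post, here is a minimal word-count sketch in plain Python. It does not use Hadoop or any Hadoop API; the function names `map_phase`, `shuffle` and `reduce_phase` and the two input lines are assumptions made purely for illustration, and everything runs on a single machine.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all the values for one key into a single result."""
    return key, sum(values)

lines = ["hadoop stores data", "hadoop processes data"]

# Each line could be mapped on a different machine, as no line depends on another.
mapped = [pair for line in lines for pair in map_phase(line)]

# Each key could likewise be reduced on a different machine.
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

The point of the sketch is the shape of the computation: map tasks are independent of each other, and so are reduce tasks, which is what lets a cluster spread them across many systems.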

The next post, ‘What is Hadoop’, will delve a bit more into the above components of Hadoop.

That’s all folks!..

Credits : Cluster Computing and MapReduce – Google


About Karun Thankachan

Working as Assoc Software Development Engineer at Dell International Services Ltd. Area of focus is Data Science and Engineering (Hadoop and Python)