Hadoop is a technology based on two core ideas:
1. Hadoop Distributed File System (HDFS)
2. MapReduce Algorithm
HDFS (written in Java) provides scalable and reliable data storage. It was designed to span large clusters of commodity servers (meaning expensive production-class servers are not required).
An HDFS cluster consists of a ‘NameNode’, which manages the cluster metadata (permissions, access times, which data is stored in which block, and so on), and ‘DataNodes’, which store the actual data. When a file is stored, its content is split into large ‘blocks’ (64–128 MB), and each block is replicated and stored on multiple DataNodes. The file’s metadata, including the mapping of files to blocks — called the ‘Namespace’ — is kept on the NameNode.
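As a rough sketch of the arithmetic involved (the block size and replication factor below are common defaults, but the real values come from the cluster’s configuration, so treat them as assumptions):

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB block size (assumed default)
REPLICATION = 3                  # replication factor (assumed default)

def split_into_blocks(file_size_bytes):
    """Return the number of HDFS blocks a file of the given size occupies."""
    return max(1, math.ceil(file_size_bytes / BLOCK_SIZE))

# A 300 MB file occupies 3 blocks; with replication, 9 block copies
# in total are spread across the DataNodes.
blocks = split_into_blocks(300 * 1024 * 1024)
print(blocks, blocks * REPLICATION)  # 3 9
```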
This replication lets different parts of the same job access the same data in parallel. It also provides fault tolerance. How? DataNodes send the NameNode a heartbeat message every few seconds. If the NameNode stops receiving heartbeats from a DataNode for long enough, it assumes that DataNode is down. It then re-replicates all the data that was on that DataNode across the other available DataNodes.
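A toy sketch of this heartbeat bookkeeping (the timeout value and node names are illustrative, not Hadoop’s actual defaults):

```python
import time

# Seconds without a heartbeat before a node is presumed dead (illustrative).
HEARTBEAT_TIMEOUT = 10.0

class NameNode:
    def __init__(self):
        self.last_heartbeat = {}  # DataNode id -> timestamp of last heartbeat

    def receive_heartbeat(self, datanode_id, now=None):
        self.last_heartbeat[datanode_id] = now if now is not None else time.time()

    def dead_datanodes(self, now=None):
        """DataNodes whose last heartbeat is older than the timeout; their
        blocks would then be re-replicated onto the surviving nodes."""
        now = now if now is not None else time.time()
        return [node for node, ts in self.last_heartbeat.items()
                if now - ts > HEARTBEAT_TIMEOUT]

nn = NameNode()
nn.receive_heartbeat("dn1", now=0.0)
nn.receive_heartbeat("dn2", now=0.0)
nn.receive_heartbeat("dn1", now=8.0)   # dn2 goes silent after t=0
print(nn.dead_datanodes(now=12.0))     # ['dn2']
```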
Credits : Hortonworks
This ensures HDFS is reliable and fault-tolerant.
Note : Despite its name, the Secondary NameNode is not a standby for NameNode failure. It periodically merges the NameNode’s edit log into the filesystem image (checkpointing) so that the NameNode can restart faster.
YARN (Yet Another Resource Negotiator) is another element that became essential in later versions (Hadoop 2.0). It is a resource management framework that coordinates concurrent access to the data in HDFS. HDFS and YARN work together to distribute storage and computation across many servers, so that as data volumes grow, capacity can be scaled linearly and economically by simply adding servers.
MapReduce is a programming model for processing large data sets in a distributed and parallel manner. It borrows ideas from functional programming, consisting mainly of two sub-tasks — Map and Reduce. Rather than moving data to the computation, each task takes the code to the place where the data is stored. Multiple mapper tasks run in parallel, followed by multiple reducer tasks, and each works independently, executing code on the specific portion of data assigned to it.
To better understand the stages in the MapReduce workflow, let us take an example. Consider the data below; the aim is to count the number of times each city appears.
A MapReduce workflow goes through five stages to accomplish this task:
Splitting : The input data is split (as per the programmed logic) and each chunk of data is assigned to a map task. In the example, the data is split at line boundaries and each line is given to a map task.
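A minimal sketch of this split step (the sample data is assumed, since the original figure is not reproduced here):

```python
# Sample input (assumed): each line lists cities, Bangalore appearing once per line.
data = "Bangalore Mumbai\nDelhi Bangalore\nBangalore Chennai\nMumbai Bangalore"

# Split the input at line boundaries; each line becomes one map task's input.
splits = data.split("\n")
print(splits)
# ['Bangalore Mumbai', 'Delhi Bangalore', 'Bangalore Chennai', 'Mumbai Bangalore']
```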
Mapping : This stage takes data as input and returns output as tuples, or key-value pairs (the pair is determined by the programmed logic). In this example, a map task splits its line on spaces and emits a (city, 1) pair for each city it finds.
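The map logic for this example can be sketched as a single function (the function name is ours, not a Hadoop API):

```python
def map_task(line):
    """Split one line on spaces and emit a (city, 1) pair per city."""
    return [(city, 1) for city in line.split(" ")]

print(map_task("Bangalore Mumbai"))  # [('Bangalore', 1), ('Mumbai', 1)]
```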
Sort and Shuffle : The mapper output is then sent through a sort and shuffle phase. This is done to simplify the work the reduce tasks must do. In our example, (Bangalore,1) is spread across four mappers. If these pairs are brought to the same location, a single reduce task can count the occurrences of Bangalore far more easily than if they remained scattered across systems. At the end of sort and shuffle, all pairs sharing a key are grouped together, and each group is assigned to a reduce task.
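On a single machine, the grouping performed by sort and shuffle can be sketched like this (the mapper outputs are assumed, matching the sample data above):

```python
from collections import defaultdict

# Output of four map tasks (assumed sample data).
mapper_outputs = [
    [("Bangalore", 1), ("Mumbai", 1)],
    [("Delhi", 1), ("Bangalore", 1)],
    [("Bangalore", 1), ("Chennai", 1)],
    [("Mumbai", 1), ("Bangalore", 1)],
]

# Group values by key, so all pairs for one key land at one reducer.
groups = defaultdict(list)
for output in mapper_outputs:
    for key, value in output:
        groups[key].append(value)

print(dict(groups))
# {'Bangalore': [1, 1, 1, 1], 'Mumbai': [1, 1], 'Delhi': [1], 'Chennai': [1]}
```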
Reduce : A reduce task computes the desired output from the data it receives. In this example, the first reducer would receive the data — (Bangalore,1) (Bangalore,1) (Bangalore,1) (Bangalore,1) — from which it must find the number of times the city occurs, hence the result (Bangalore,4). Similarly, each of the reducers returns its result.
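The reduce logic is just a sum over the grouped values (again a sketch, not a Hadoop API):

```python
def reduce_task(key, values):
    """Sum the counts gathered for one key."""
    return (key, sum(values))

print(reduce_task("Bangalore", [1, 1, 1, 1]))  # ('Bangalore', 4)
```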
Combiner : Although listed last here, in Hadoop the combiner is actually an optional map-side step: it pre-aggregates each mapper’s own output (for example, merging two (Bangalore,1) pairs from the same mapper into (Bangalore,2)) before the shuffle, which cuts down network traffic; its logic is often the same as the reducer’s. If there is only one reducer, its output is already the final result; with multiple reducers, their outputs are simply concatenated and formatted into the output we require. In the example, we want to know how many times each city occurred, so the final result would be the per-city counts.
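A sketch of the map-side pre-aggregation a combiner performs (function name and input are ours, for illustration):

```python
from collections import Counter

def combine(pairs):
    """Merge duplicate keys from ONE mapper's output before the shuffle,
    so fewer pairs travel over the network."""
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return list(counts.items())

# One mapper emitted Bangalore twice; the combiner merges them locally.
print(combine([("Bangalore", 1), ("Mumbai", 1), ("Bangalore", 1)]))
# [('Bangalore', 2), ('Mumbai', 1)]
```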
To summarize, the MapReduce workflow is as follows:
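The whole workflow can be sketched end-to-end on a single machine (the sample data is assumed; in Hadoop these stages run distributed across many nodes):

```python
from collections import defaultdict

def mapreduce_wordcount(text):
    """Single-machine sketch of the full workflow: split, map, shuffle, reduce."""
    # 1. Splitting: one line per map task
    splits = text.split("\n")
    # 2. Mapping: emit (city, 1) for every city on each line
    mapped = [(city, 1) for line in splits for city in line.split(" ")]
    # 3. Sort and shuffle: group values by key
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    # 4. Reduce: sum each key's values
    return {key: sum(values) for key, values in groups.items()}

data = "Bangalore Mumbai\nDelhi Bangalore\nBangalore Chennai\nMumbai Bangalore"
print(mapreduce_wordcount(data))
# {'Bangalore': 4, 'Mumbai': 2, 'Delhi': 1, 'Chennai': 1}
```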
That’s all folks!..
Credits : Hortonworks HDFS Docs