Hadoop – An Introduction

Hadoop is the most well-known technology used for exploiting the big data, to enable businesses to gain insight from massive amounts of structured and unstructured data quickly.

In a nutshell Hadoop is one way of storing an enormous amount of data in an enormous cluster of machines and running distributed analysis on each cluster or we can say Hadoop is a software framework that supports data intensive distributed applications. It’s both distributed computing and parallel computing. 

Cluster – Connecting two or more computers together in such a way that they behave like a single computer. Clustering is used for parallel processing, load balancing and fault tolerance.
Distributed Computing – Multiple independent computations on multiple independent machines, where each client works on a separate chunk of information and returns the completed package to a centralized resource that’s responsible for managing the overall workload.
Parallel Computing – multiple CPUs or computers working simultaneously at the same problem requires constant sharing of small snippet of data between the CPUs.
In Parallel Computing all processors have access to a shared memory. In distributed computing, each processor has its own private memory.

In a traditional non distributed architecture, you’ll have data stored in one server and any client program will access this central data server to retrieve the data. In Hadoop distributed architecture, both data and processing are distributed across multiple servers. In simple terms, instead of running a query on a single server, the query is split across multiple servers i.e. when you run a query against a large data set, every server in this distributed architecture will be executing the query on its local machine against the local data set, and the results are consolidated. This means that the results of a query on a larger dataset are returned faster.

Hadoop employs Scale-Out (or horizontally) architecture  means adding new components (or building blocks) to a system i.e. adding a new Hadoop DataNode to a Hadoop cluster, as opposed to Scale-Up or vertically) used by traditional RDBMS which means adding more resources to an existing component of a system such as adding more CPU, adding more storage, etc.

Hadoop is batch processing centric ideal for the discovery, exploration and analysis of large amounts of multi-structured data that doesn’t fit nicely into table, and not suitable for real-time operations.

Apache DefinitionThe Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.

Hadoop is almost completely modular, which means that you can swap out almost any of its components for a different software tool.

Hadoop Architecture

Machine Roles“Hadoop employs master/slave architecture for both distributed storage and distributed computation”. In the distributed storage, the NameNode is the master and the DataNodes are the slaves. In the distributed computation, the Job Tracker is the master and the Task Trackers are the slaves.

Working of Hadoop Cluster – 

3 major categories of machine roles in a Hadoop Deployment are,

    1. Client Machine
    2. Master Node
    3. Slave Node

Machine Roles

Client Machine – has hadoop installed will all the cluster settings, but is neither a Master nor Slave. The role of client machine is to load data into cluster, submit Map Reduce jobs describing how that data should be processed and then retrieve or view all the results of the job when it’s finished.

Master Node – encapsulates 2 key functional pieces that make up hadoop

    1. Storing lots of data (HDFS)
    2. Running parallel computations on all that data (MapReduce).

The Name Node is responsible for coordinating the data storage function (HDFS), while the Job Tracker oversees and coordinates the parallel processing of data using the MapReduce.

Slave Nodes – make up the vast majority of machines and do all the dirty work of storing the data and running the computations. Each slave runs both a Data Node and Task Tracker daemon that communicates and receives instructions from their master node.

Task Tracker daemon is slave to the Job Tracker, Data node daemon slave to Name Node.

Hadoop Ecosystem –

Hadoop is built on two main parts

    1. Distributed Storage – A special file system called Hadoop Distributed File System (HDFS)
    2. Distributed Processing – MapReduce framework


The other components of Hadoop Ecosystem are –

    1. Data access and query (Apache Hive)
    2. Scripting platform (Apache Pig)
    3. Column oriented database scaling to billions of rows (Apache HBase)
    4. Metadata services (Apache HCatalog)
    5. Workflow scheduler (Apache Oozie)
    6. Cluster coordination (Apache Zookeeper)
    7. Import data from relational databases (Apache Sqoop)
    8. Collection and import of log and event data (Apache Flume)
    9. Library of machine learning and data mining algorithms (Mahout)
Please refer the article Hadoop Ecosystem for more details.