Hadoop Ecosystem

The components of Hadoop Ecosystem are –

Hadoop Stack

Hadoop Distributed File System (HDFS) – 

Filesystems that manage the storage across a network of machines are called distributed filesystems.

Hadoop distributed file system (HDFS) is a java based file system that provides scalable and reliable data storage and provides high performance access to data across hadoop cluster.

HDFS is designed for storing very large files with write-once-ready-many-times patterns, running on clusters of commodity hardware. HDFS is not a good fit for low-latency data access, when there are lots of small files and for modifications at arbitrary offsets in the file.

Hadoop employs master slave architecture, with each cluster consisting of a single NameNode that is the master and the multiple DataNodes are the slaves.

The NameNode manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories in the tree. The NameNode also knows the DataNodes on which all the blocks for a given file are located. DataNodes are the workhorses of the filesystem. They store and retrieve blocks when they are told to (by clients or the NameNode), and they report back to the NameNode periodically with lists of blocks that they are storing.

Files in HDFS are broken into block-sized chunks, default size being 64MB, which are stored as independent units, and each block of file is independently replicated at multiple data node.


MapReduce –

MapReduce is a simple programming model for processing massive amounts of structured and unstructured data in parallel across a cluster of thousands of machines, in a reliable and fault-tolerant manner. It is the tool that actually gets the data processed.

MapReduce works breaking the processing into two phases –

    1. Map – The purpose of the map phase is to organize the data in preparation for the processing done in the reduce phase.
    2. Reduce – Each reduce function processes the intermediate values for a particular key generated by the map function and generates the output.

Job Tracker acts as Master and Task Tracker acts as Slave.

A MapReduce job splits a large data set into independent chunks and organizes them into key, value pairs for parallel processing. This parallel processing improves the speed and reliability of the cluster, returning solutions more quickly and with more reliability.

The Map function divides the input into ranges by the InputFormat and creates a map task for each range in the input. The JobTracker distributes those tasks to the worker nodes. The output of each map task is partitioned into a group of key-value pairs for each Reduce.

The Reduce function then collects the various results and combines them to answer the larger problem the master node was trying to solve. Each reduce pulls the relevant partition from the machines where the maps executed, then writes its output back into HDFS. Thus, the reduce is able to collect the data from all of the maps for the keys it is responsible for and combine them to solve the problem.


 Hive –

Apache Hive is an open-source data warehouse system for querying and analysis of large, stored in Hadoop files data sets. Hive has three main functions: data summary, queries and analyzers.

It provides an SQL-like language called HiveQL while maintaining full support for map/reduce. In short, a Hive query is converted to MapReduce tasks. The main building blocks of Hive are – Metastore stores the system catalog and metadata about tables, columns, partitions, etc. Driver manages the lifecycle of a HiveQL statement as it moves through Hive Query Compiler compiles HiveQL into a directed acyclic graph for MapReduce tasks Execution Engine executes the tasks produced by the compiler in proper dependency order Hive Server provides a Thrift interface and a JDBC / ODBC server.

In addition HiveQL supported in queries embedding individual MapReduce scripts. Hive also allows serialization / deserialization of data.

Hive supports text files (including flat files called), SequenceFiles (flat- Files consisting of binary key / value pairs) and RCFiles (Record Columnar files which store the columns of a table in the manner of a column-based database).


Apache Pig is an open-source technology that offers a high-level mechanism for the parallel programming of MapReduce jobs to be executed on Hadoop clusters.

Pig enables developers to create query execution routines for analyzing large, distributed data sets without having to do low-level work in MapReduce, much like the way the Apache Hive data warehouse software provides a SQL-like interface for Hadoop that doesn’t require direct MapReduce programming.

The key parts of Pig are a compiler and a scripting language known as Pig Latin. Pig Latin is a data-flow language geared toward parallel processing. Through the use of user-defined functions (UDFs), Pig Latin applications can be extended to include custom processing tasks written in Java as well as languages such as JavaScript and Python.

Pig is intended to handle all kinds of data, including structured and unstructured information and relational and nested data.

HBase –

Apache HBase is a column-oriented key/value data store built to run on top of the Hadoop Distributed File System (HDFS). Hadoop is a framework for handling large datasets in a distributed computing environment.

HBase is designed to support high table-update rates and to scale out horizontally in distributed compute clusters. Its focus on scale enables it to support very large database tables — for example, ones containing billions of rows and millions of columns. Currently, one of the most prominent uses of HBase is as a structured data handler for Facebook’s basic messaging infrastructure.

HBase is known for providing strong data consistency on reads and writes, which distinguishes it from other NoSQL databases. Much like Hadoop, an important aspect of the HBase architecture is the use of master nodes to manage region servers that distribute and process parts of data tables.

ZooKeeper –

Apache ZooKeeper is an open source file application program interface (API) that allows distributed processes in large systems to synchronize with each other so that all clients making requests receive consistent data.

The Zookeeper service is provided by a cluster of servers to avoid a single point of failure. Zookeeper uses a distributed consensus protocol to determine which node in the ZooKeeper service is the leader at any given time.

The leader assigns a timestamp to each update to keep order. Once a majority of nodes have acknowledged receipt of a time-stamped update, the leader can declare a quorum, which means that any data contained in the update can be coordinated with elements of the data store. The use of a quorum ensures that the service always returns consistent answers.

Sqoop –

Sqoop allows easy import and export of data from structured data stores such as relational databases, enterprise data warehouses, and NoSQL systems. The dataset being transferred is sliced up into different partitions and a map-only job is launched with individual mappers responsible for transferring a slice of this dataset.

Flume –

Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data into the Hadoop Distributed File System (HDFS). It has a simple and flexible architecture based on streaming data flows; and is robust and fault tolerant with tunable reliability mechanisms for fail-over and recovery.

In Flume, the entities you work with are called sources, decorators, and sinks. A source can be any data source, and Flume has many predefined source adapters. A sink is the target of a specific operation (and in Flume, among other paradigms that use this term, the sink of one operation can be the source for the next downstream operation). A decorator is an operation on the stream that can transform the stream in some manner, which could be to compress or decompress data, modify data by adding or removing pieces of information, and more.

Oozie –

Oozie is a workflow processing system that lets users define a series of jobs written in multiple languages – such as Map Reduce, Pig and Hive – then intelligently link them to one another. Oozie allows users to specify, for example, that a particular query is only to be initiated after specified previous jobs on which it relies for data are completed.

Mahout –

Mahout is an open source machine learning library from Apache. It’s highly scalable. Mahout aims to be the machine learning tool of choice when the collection of data to be processed is very large, perhaps far too large for a single machine. At the moment, it primarily implements recommender engines (collaborative filtering), clustering, and classification. Recommender engines try to infer tastes and preferences and identify unknown items that are of interest. Clustering attempts to group a large number of things together that share some similarity. It’s a way to discover hierarchy and order in a large or hard-to-understand data set. Classification decides how much a thing is or isn’t part of some type or category, or how much it does or doesn’t have some attribute.

Hadoop – An Introduction

Hadoop is the most well-known technology used for exploiting the big data, to enable businesses to gain insight from massive amounts of structured and unstructured data quickly.

In a nutshell Hadoop is one way of storing an enormous amount of data in an enormous cluster of machines and running distributed analysis on each cluster or we can say Hadoop is a software framework that supports data intensive distributed applications. It’s both distributed computing and parallel computing. 

Cluster – Connecting two or more computers together in such a way that they behave like a single computer. Clustering is used for parallel processing, load balancing and fault tolerance.
Distributed Computing – Multiple independent computations on multiple independent machines, where each client works on a separate chunk of information and returns the completed package to a centralized resource that’s responsible for managing the overall workload.
Parallel Computing – multiple CPUs or computers working simultaneously at the same problem requires constant sharing of small snippet of data between the CPUs.
In Parallel Computing all processors have access to a shared memory. In distributed computing, each processor has its own private memory.

In a traditional non distributed architecture, you’ll have data stored in one server and any client program will access this central data server to retrieve the data. In Hadoop distributed architecture, both data and processing are distributed across multiple servers. In simple terms, instead of running a query on a single server, the query is split across multiple servers i.e. when you run a query against a large data set, every server in this distributed architecture will be executing the query on its local machine against the local data set, and the results are consolidated. This means that the results of a query on a larger dataset are returned faster.

Hadoop employs Scale-Out (or horizontally) architecture  means adding new components (or building blocks) to a system i.e. adding a new Hadoop DataNode to a Hadoop cluster, as opposed to Scale-Up or vertically) used by traditional RDBMS which means adding more resources to an existing component of a system such as adding more CPU, adding more storage, etc.

Hadoop is batch processing centric ideal for the discovery, exploration and analysis of large amounts of multi-structured data that doesn’t fit nicely into table, and not suitable for real-time operations.

Apache DefinitionThe Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.

Hadoop is almost completely modular, which means that you can swap out almost any of its components for a different software tool.

Hadoop Architecture

Machine Roles“Hadoop employs master/slave architecture for both distributed storage and distributed computation”. In the distributed storage, the NameNode is the master and the DataNodes are the slaves. In the distributed computation, the Job Tracker is the master and the Task Trackers are the slaves.

Working of Hadoop Cluster – 

3 major categories of machine roles in a Hadoop Deployment are,

    1. Client Machine
    2. Master Node
    3. Slave Node

Machine Roles

Client Machine – has hadoop installed will all the cluster settings, but is neither a Master nor Slave. The role of client machine is to load data into cluster, submit Map Reduce jobs describing how that data should be processed and then retrieve or view all the results of the job when it’s finished.

Master Node – encapsulates 2 key functional pieces that make up hadoop

    1. Storing lots of data (HDFS)
    2. Running parallel computations on all that data (MapReduce).

The Name Node is responsible for coordinating the data storage function (HDFS), while the Job Tracker oversees and coordinates the parallel processing of data using the MapReduce.

Slave Nodes – make up the vast majority of machines and do all the dirty work of storing the data and running the computations. Each slave runs both a Data Node and Task Tracker daemon that communicates and receives instructions from their master node.

Task Tracker daemon is slave to the Job Tracker, Data node daemon slave to Name Node.

Hadoop Ecosystem –

Hadoop is built on two main parts

    1. Distributed Storage – A special file system called Hadoop Distributed File System (HDFS)
    2. Distributed Processing – MapReduce framework


The other components of Hadoop Ecosystem are –

    1. Data access and query (Apache Hive)
    2. Scripting platform (Apache Pig)
    3. Column oriented database scaling to billions of rows (Apache HBase)
    4. Metadata services (Apache HCatalog)
    5. Workflow scheduler (Apache Oozie)
    6. Cluster coordination (Apache Zookeeper)
    7. Import data from relational databases (Apache Sqoop)
    8. Collection and import of log and event data (Apache Flume)
    9. Library of machine learning and data mining algorithms (Mahout)
Please refer the article Hadoop Ecosystem for more details.

Big Data is it really about ‘Volume’?

Whenever we think of Big Data we often tend to relate it with the enormous quantity of data without realizing that the problem is not just about huge volume of data, it’s also about the increasing velocity at which it is being created, as well as by its variety (think of “unstructured” data such as text documents, videos and pictures, for instance).

So, What is Big Data?

The term big data is believed to have originated with Web search companies who had to query very large distributed aggregations of loosely-structured data. Big data is a concept, or a popular term, used to describe a massive volume of structured, semi-structured, unstructured and multi-structured data that is so large that it’s difficult to process the data within a tolerable elapsed time, using traditional database and software techniques.

    • Structured Data – Data that resides in fixed fields within a record or file. Relational databases and spreadsheets are examples of structured data.
    • Semi-Structured Data – Semi-structured data is data that is neither raw data, nor typed data in a conventional database system. It is type structured data, but does not conform to the formal structure of data models associated with relational databases or other forms of data tables, but nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data. XML, JSON are examples of semi-structured data.
    • Un-Structured Data – Data that does not reside in fixed locations. The term generally refers to free-form text, which is ubiquitous. Examples are word processing documents, PDF files, e-mail messages, blogs and Web pages, Twitter tweets, and other social media posts.
    • Multi-structured data – Refers to a variety of data formats and types and can be derived from interactions between people and machines, such as web applications or social networks. A great example is web log data, which includes a combination of text and visual images along with structured data like form or transactional information.

An example of big data might be petabytes (1,024 terabytes) or exabytes (1,024 petabytes) of data consisting of billions to trillions of records of millions of people—all from different sources (e.g. Web, sales, customer contact center, social media, mobile data and so on). The data is typically loosely structured data that is often incomplete and inaccessible. According to IBM, 80% of data captured today is unstructured, from sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few. All of this unstructured data is Big Data. The challenges include capture, curation, storage, search, sharing, transfer, analysis and visualization. While the term may seem to reference the volume of data, that isn’t always the case. The term big data is defined with respect to the three V’s: volume (amount of data), velocity (speed of data in and out) and variety (range of data types and sources) and veracity is added by some organizations, and when used by vendors, may refer to the technology (which includes tools and processes) that an organization requires handling the large amounts of data and storage facilities. BigData

    • Volume – Many factors contribute to the increase in data volume. Transaction-based data stored through the years. Unstructured data streaming in from social media. Increasing amounts of sensor and machine-to-machine data being collected. In the past, excessive data volume was a storage issue. But with decreasing storage costs, other issues emerge, including how to determine relevance within large data volumes and how to use analytics to create value from relevant data.
    • Velocity – It’s not just the velocity of the incoming data that’s the issue: it’s streaming of fast-moving data into bulk storage for later batch processing. The importance lies in the speed of the feedback loop, taking data from input through to decision. RFID tags, sensors and smart metering are driving the need to deal with torrents of data in near-real time. Reacting quickly enough to deal with data velocity is a challenge for most organizations.
    • Variety – Data today comes in all types of formats. Structured, numeric data in traditional databases. Information created from line-of-business applications. Unstructured text documents, email, video, audio, stock ticker data and financial transactions. Managing, merging and governing different varieties of data is something many organizations still grapple with.
    • Veracity – Big Data Veracity refers to the biases, noise and abnormality in data i.e. uncertain and imprecise data.

Why Big Data?

In business, we often run into very similar problems where we need to make decisions based on incomplete information in a rapidly changing context.  The hopeful vision is that organizations will be able to take data from any source, harness relevant data and analyse it to find answers that smarter business decision-making. Big data can also be used to uncover hidden patterns, unknown correlations and other useful information by combining big data and high-powered analytics. Such information can provide competitive advantages over rival organizations and result in business benefits, such as more effective marketing and increased revenue. It’s also possible to use clickstream analysis and data mining to detect fraudulent behavior and to determine root causes of failures, issues and defects in near-real time, potentially saving billions of dollars annually. Big Data can be used to develop the next generation of products and services. For instance, manufacturers are using data obtained from sensors embedded in products to create innovative after-sales service offerings such as proactive maintenance to avoid failures in new products. To get maximum value out of big data, it needs to be associated with traditional enterprise data, automatically or via purpose-built applications, reports, queries, and other approaches. For instance a retailer wants to link its website visitor behaviour logs (Big Data) with purchase information (found in Relational Database) based on text and image message volume trends (Unstructured big data).

Big Data Helps Serve Customers Better – Recent studies suggest that today’s customers get a major chunk of information about brands from social media websites. In order to survive in the connected economy, businesses need to actively Engage with their customers for a better brand awareness and recall; Analyze customers’ needs based on their preferences and behaviour; and Act on this to create goods and services that are aligned to the customer’s needs – when they want it

    • Engage Social Engagement with customers includes listening and monitoring what the customer says and keeping them engaged with specific content that provides the customers a ring-side view of the brands and products and in the process elicits responses about their needs and preferences.
    • Analyze – Understand your customers and the impact to your business/brand. Monitor context of the conversation against the matrices customer sentiment, Customer Psychograph and Demography, and so on.
    • Act When the social data is correlated with enterprise data, it provides a much potent view of the social impact. This makes it easier for organizations to improve their campaign efficacy or customer engagement. Armed with the knowledge of customer preference along with their behaviour in the store, the business leaders can more effectively tune their messaging, products and services for better impact.

Emerging Technologies for Big Data –

Hadoop – The first thing that comes to mind when we speak about Big Data is Hadoop. Hadoop is not Big Data it’s one of the most popular implementation of MapReduce for processing and handling Big Data. It is flexible enough to be able to work with multiple data sources, either aggregating multiple sources of data in order to do large scale processing, or even reading data from a database in order to run processor-intensive machine learning jobs. It has several different applications, but one of the top use cases is for large volumes of constantly changing data, such as location-based data from weather or traffic sensors, web-based or social media data, or machine-to-machine transactional data.

Hive – Hive is a data warehouse tool that integrates with HDFS and MapReduce by providing a SQL like query language called Hive Query Language (HQL) and schema layer on top of HDFS files. Hive translates HQL queries into a series of MapReduce jobs that emulate the queries behavior and runs the jobs in cluster. Hive is designed to enable easy data summarization, ad-hoc querying and analysis of large volume of data.

PIG – It’s an abstraction for Java MapReduce jobs similar to Hive. The language for this platform is called Pig Latin. Pig enables data workers to write complex data transformations without knowing Java using Pig’s simple SQL-like scripting language is called Pig Latin. Pig works with data from many sources, including structured and unstructured data, and stores the results into the Hadoop Data File System

Schema-less databases, or NoSQL databases – There are several database types that fit into this category, such as key-value stores and document stores, which focus on the storage and retrieval of large volumes of unstructured, semi-structured, or even structured data. MongoDb, Cassandra, HBase are few to name.