Big Data HDFS Way

Storage was not the biggest concern for Big Data; the real concern was the rate at which data could be accessed from the system, which is usually how the system's performance is measured and which is determined by the number of I/O channels. As the number of queries to access data increases, conventional file system I/O becomes inadequate for retrieving large amounts of data simultaneously. Further, the model of one large single store starts becoming a bottleneck.

One Machine with Four I/O Ports

Consider a machine with 4 I/O channels, each channel having a speed of 100 MB/s.

Time taken to read 1 TB of data is :

(1024 * 1024 MB) / (100 MB/s * 4 channels * 60 s/min) = 43.69 mins, roughly 45 mins

Solution – To overcome these problems, a distributed file system was conceived. It tackles them as follows –

1. When dealing with large files, I/O becomes a big bottleneck. So, we divide the files into small blocks and store the blocks on multiple machines.

2. With block storage, data access becomes distributed, which enables us to combine the power of multiple machines into a single virtual system, so that we are not limited to the capacity of a single unit.

3. When we need to read a file, the client sends requests to multiple machines; each machine sends back its blocks of the file, which are then combined to produce the whole file.

4. As the data blocks are stored on multiple machines, keeping the same block on several machines helps remove the single point of failure. Meaning, if one machine goes down, the client can request the block from another machine.

5. Since it employs a scale-out, i.e. distributed, architecture, it is easier to keep pace with data growth as well as increasing access demands by adding more nodes to the cluster. In this way, performance and I/O bandwidth scale linearly as more capacity is added to the storage system.
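The block-splitting idea in the steps above can be sketched as follows. The block size, node names, and round-robin placement are illustrative simplifications, not HDFS's actual placement policy:

```python
# Illustrative sketch: split a file into fixed-size blocks and spread
# them round-robin across machines. Real HDFS placement is more
# sophisticated (rack awareness, replication).

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, HDFS's classic default

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (block_id, length) pairs covering the whole file."""
    blocks, offset = [], 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((len(blocks), length))
        offset += length
    return blocks

def place_blocks(blocks, machines):
    """Round-robin assignment of block ids to machines."""
    return {bid: machines[bid % len(machines)] for bid, _ in blocks}

blocks = split_into_blocks(200 * 1024 * 1024)            # a 200 MB file
placement = place_blocks(blocks, ["node1", "node2", "node3"])
print(len(blocks))  # 4 blocks: 64 + 64 + 64 + 8 MB
```

Reading the file back is then a matter of fetching each block from its machine and concatenating them in block-id order.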

Ten Machines with Four I/O Ports



Consider 10 machines, each with 4 I/O channels, each channel having a speed of 100 MB/s.

Time taken to read 1 TB of data is : (1024 * 1024 MB) / (100 MB/s * 4 channels * 10 machines * 60 s/min) = 4.369 mins, approximately 4.5 mins
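As a sanity check, the two back-of-envelope calculations above can be expressed as a small helper. The function name and parameters are ours, chosen for illustration:

```python
# Hypothetical helper for the sequential-read arithmetic above.
# Assumptions: 1 TB = 1024 * 1024 MB; each machine has 4 channels at 100 MB/s.

def read_time_minutes(data_mb, machines, channels=4, mb_per_sec=100):
    """Minutes to read data_mb megabytes spread evenly across the machines."""
    total_bandwidth = machines * channels * mb_per_sec  # aggregate MB/s
    return data_mb / total_bandwidth / 60

one_tb = 1024 * 1024  # MB
print(round(read_time_minutes(one_tb, machines=1), 2))   # 43.69
print(round(read_time_minutes(one_tb, machines=10), 2))  # 4.37
```

Adding machines multiplies aggregate bandwidth, so read time drops linearly with cluster size.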



HDFS Architecture

HDFS employs a master-slave architecture and consists of:

1. A single NameNode, which acts as the master, and

2. Multiple DataNodes, which act as slaves.


HDFS High Level Architecture

In the above diagram, there is one Name Node and multiple Data Nodes (servers) with data blocks.

When you dump a file (or data) into HDFS, it is stored in blocks on the various nodes of the Hadoop cluster. HDFS creates several replicas of each data block and distributes them across the cluster in a way that is reliable and allows fast retrieval.

Hadoop will internally make sure that any node failure will never result in data loss.

There will be only one machine that manages the file system meta-data.

There will be multiple data nodes (these are the cheap commodity servers that store the data blocks).

When we execute a query from a client, it first reaches out to the Name Node to get the file system metadata, and then it reaches out to the Data Nodes to get the actual data blocks.
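This read path can be modeled with a toy sketch. The metadata tables, paths, and node names below are invented for illustration and do not reflect the real HDFS wire protocol:

```python
# Illustrative sketch of the HDFS read path: the client first asks the
# NameNode for block locations, then pulls each block from a DataNode
# and reassembles the file. All names and data here are made up.

namenode_metadata = {  # filename -> ordered list of (block_id, datanode)
    "/logs/app.log": [("blk_1", "datanode1"), ("blk_2", "datanode2")],
}
datanode_storage = {
    "datanode1": {"blk_1": b"first half,"},
    "datanode2": {"blk_2": b" second half"},
}

def read_file(path):
    locations = namenode_metadata[path]           # 1. metadata lookup
    parts = [datanode_storage[dn][blk]            # 2. fetch each block
             for blk, dn in locations]
    return b"".join(parts)                        # 3. reassemble in order

print(read_file("/logs/app.log"))  # b'first half, second half'
```

Note that the NameNode never handles the block contents itself; it only hands out locations.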

Name Node –

HDFS works by breaking large files into smaller pieces called blocks. The blocks are stored on data nodes, and it is the responsibility of the NameNode to know what blocks on which data nodes make up the complete file.

The complete collection of all the files in the cluster is referred to as the filesystem namespace. The namenode manages the filesystem namespace: it maintains the filesystem tree and the metadata for all the files and directories in the tree, so it contains the information about all the files and directories in the cluster and their hierarchy. Along with the filesystem information, it also knows the datanodes on which all the blocks of a file are kept.

It is the NameNode’s job to oversee the health of Data Nodes and to coordinate access to data. The Name Node is the central controller of HDFS.

A client accesses the filesystem on behalf of the user by communicating with the namenode and datanodes. The client presents a filesystem interface similar to a Portable Operating System Interface (POSIX), so the user code does not need to know about the namenode and datanode to function.


The above diagram shows how the Name Node stores information on disk. The two different files are –

1. fsimage – It's a snapshot of the file system taken when the Name Node is started.

2. Edit Logs – It’s the sequence of changes made to the file system after the Name Node is started.

Edit logs are applied to the fsimage only when the Name Node restarts, and Name Node restarts are very rare in production clusters. This means that on a cluster where the Name Node runs for a long period of time, the edit logs can grow very large, leading to the issues below –

1. Name Node restarts take a long time, because a large number of changes have to be merged.

2. In the case of a crash, a huge amount of metadata may be lost, since the fsimage is very old. This is also why the Name Node is known as a single point of failure in Hadoop.

Secondary NameNode –

The Secondary NameNode helps overcome the issues mentioned above.


The Secondary NameNode also keeps a namespace image and edit logs, like the NameNode. At a regular interval (one hour by default), it copies the namespace image and edit log from the NameNode, merges the edit log into the namespace image, and copies the merged image back, so that the NameNode has a fresh copy of the namespace image. If at any instant the NameNode goes down or becomes corrupt, some other machine can be started with the namespace image and edit log held by the Secondary NameNode, preventing a total failure.

The Secondary NameNode needs almost the same amount of memory and CPU as the NameNode, so it is also kept on a separate machine, like the NameNode.
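The checkpoint behavior described above can be sketched with a toy model in which the namespace image is a dictionary and the edit log is a list of operations. This illustrates the merge idea only, not Hadoop's actual on-disk formats:

```python
# Toy model of the Secondary NameNode checkpoint: replay the edit log
# onto the fsimage so the log can be truncated. Purely illustrative.

fsimage = {"/data": "dir"}                     # snapshot at last checkpoint
edit_log = [("create", "/data/a.txt"), ("create", "/data/b.txt"),
            ("delete", "/data/a.txt")]

def checkpoint(image, log):
    merged = dict(image)
    for op, path in log:                       # replay every logged change
        if op == "create":
            merged[path] = "file"
        elif op == "delete":
            merged.pop(path, None)
    return merged, []                          # fresh image, empty edit log

fsimage, edit_log = checkpoint(fsimage, edit_log)
print(sorted(fsimage))  # ['/data', '/data/b.txt']
```

After the merge, a restart only needs to load the fresh image, which is why checkpointing keeps restart times short.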

Data Nodes –

These are the workers that do the real work, where by real work we mean the storage of the actual data. They store and retrieve blocks when they are told to (by clients or the NameNode).

Data nodes are not smart, but they are resilient. Within the HDFS cluster, data blocks are replicated across multiple data nodes, and access is managed by the NameNode. The replication mechanism is designed around the way the nodes of the cluster are organized into racks; in fact, the NameNode uses a “rack ID” to keep track of the data nodes in the cluster.

Data nodes also send “heartbeat” messages to maintain connectivity between the NameNode and the data nodes. When a heartbeat is no longer present, the NameNode unmaps the data node from the cluster and keeps operating as though nothing happened. When the heartbeat returns, the node is added back to the cluster transparently with respect to the user or application.
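The heartbeat bookkeeping can be sketched as follows; the 30-second timeout and node names are illustrative, not HDFS's actual configuration:

```python
# Sketch of heartbeat tracking: the NameNode records the last heartbeat
# time per DataNode and treats nodes silent beyond a timeout as dead.
# The timeout value here is made up for illustration.

TIMEOUT = 30  # seconds of silence before a node is considered dead

last_heartbeat = {"datanode1": 100.0, "datanode2": 100.0}

def record_heartbeat(node, now):
    last_heartbeat[node] = now

def live_nodes(now):
    return [n for n, t in last_heartbeat.items() if now - t <= TIMEOUT]

record_heartbeat("datanode1", 125.0)           # datanode2 stays silent
print(live_nodes(135.0))  # ['datanode1']
```

A node that resumes heartbeating simply updates its timestamp and reappears in the live list, mirroring the transparent rejoin described above.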

Data integrity is a key feature. HDFS supports a number of capabilities designed to provide data integrity. As you might expect, when files are broken into blocks and then distributed across different servers in the cluster, any variation in the operation of any element could affect data integrity. HDFS uses transaction logs and checksum validation to ensure integrity across the cluster.

Transaction logs keep track of every operation and are effective in auditing or rebuilding of the file system should something untoward occur.

Checksum validations are used to guarantee the contents of files in HDFS. When a client requests a file, it can verify the contents by examining its checksum. If the checksum matches, the file operation can continue. If not, an error is reported. Checksum files are hidden to help avoid tampering.
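A minimal sketch of checksum validation on read, using CRC32 as the checksum function; the function names are ours, chosen for illustration:

```python
# Illustrative per-block checksum validation: compute a CRC32 at write
# time, verify it at read time, and fail loudly on a mismatch.

import zlib

def store_block(data):
    """Return the block plus the checksum computed at write time."""
    return data, zlib.crc32(data)

def read_block(data, expected_checksum):
    """Verify contents on read; raise if the block was corrupted."""
    if zlib.crc32(data) != expected_checksum:
        raise IOError("checksum mismatch: block is corrupt")
    return data

block, checksum = store_block(b"some block contents")
assert read_block(block, checksum) == b"some block contents"
```

If a block fails validation, a real client would fetch a replica of the same block from another data node instead.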

Hadoop Distributed File System (HDFS) Concepts and Design Fundamentals

Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop applications. HDFS is a filesystem developed specifically for storing very large files with streaming data access patterns, running on clusters of commodity hardware and is highly fault-tolerant. HDFS accepts data in any format regardless of schema, optimizes for high bandwidth streaming, and scales to proven deployments of 100 PB and beyond.

HDFS is designed more for batch processing rather than interactive use by users. The emphasis is on high throughput of data access rather than low latency of data access.

Because the data is written once and then read many times thereafter, rather than the constant read-writes of other file systems, HDFS is an excellent choice for supporting big data analysis.


Hadoop Distributed File System (HDFS) Concepts – 

1. Cluster : A Hadoop cluster is made up of many machines in a network; each machine is termed a node, and these nodes talk to each other over the network.

2. Name Node : The Name Node holds all the file system metadata for the cluster, oversees the health of Data Nodes, and coordinates access to data. The Name Node is the central controller of HDFS.

3. Secondary Name Node : Since the Name Node is a single point of failure, the Secondary Name Node periodically reads the file system state from the Name Node's memory and writes it to disk. It is not a substitute for the Name Node, so if the Name Node fails, the entire Hadoop system goes down.

4. Data Node : These are the workers that does the real work of storing data as and when told by the Name Node.

5. Block Size: This is the minimum unit of storage in which data is kept contiguously in the filesystem. The default size of a single block in HDFS is 64 MB.

Fundamentals Behind HDFS Design –

1. Very Large files: By very large files we mean files whose size is in the range of gigabytes, terabytes, petabytes, or more.

2. Streaming data access: HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern. This makes HDFS an excellent choice for supporting big data analysis.

3. Commodity Hardware: Hadoop doesn’t require expensive, highly reliable hardware. It’s designed to run on clusters of commodity hardware (commonly available hardware that can be obtained from multiple vendors). This leads to an overall cost reduction to a great extent.

4. Data Replication and Fault Tolerance: HDFS works on the assumption that hardware is bound to fail at some point of time or the other. This disrupts the smooth and quick processing of large volumes of data. To overcome this obstacle, in HDFS, the files are divided into large blocks of data and each block is stored on three nodes: two on the same rack and one on a different rack for fault tolerance.

5. High Throughput : In HDFS, the task is divided and shared among different systems. So, all the systems will be executing the tasks assigned to them independently and in parallel achieving high throughput.

6. Moving Computation is Better than Moving Data: HDFS works on the principle that a computation done by an application near the data it operates on is much more efficient than one done far away, particularly for large data sets. The major advantages are reduced network congestion and increased overall throughput of the system.

7. Scale Out Architecture : HDFS employs a scale-out architecture, as compared to the scale-up architecture employed by RDBMSs. To scale horizontally (or scale out) means to add more nodes to a system, such as adding a new computer to a distributed software application, and to distribute jobs across those machines over the network.
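The three-replica placement rule from point 4 above (two replicas on one rack, one on a different rack) can be sketched as follows; the rack layout and selection logic are deliberately simplified:

```python
# Sketch of the placement rule described above: three replicas, two on
# the local rack and the third on some other rack. The rack layout is
# made up for illustration; real HDFS placement is more involved.

racks = {"rack1": ["node1", "node2"], "rack2": ["node3", "node4"]}

def place_replicas(local_rack):
    """Two replicas on the local rack, one on another rack."""
    other_rack = next(r for r in racks if r != local_rack)
    return racks[local_rack][:2] + racks[other_rack][:1]

print(place_replicas("rack1"))  # ['node1', 'node2', 'node3']
```

Spreading one replica to another rack means the block survives not just a node failure but the loss of an entire rack.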

Hadoop Ecosystem

The components of Hadoop Ecosystem are –

Hadoop Stack

Hadoop Distributed File System (HDFS) – 

Filesystems that manage the storage across a network of machines are called distributed filesystems.

Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable and reliable data storage and high-performance access to data across a Hadoop cluster.

HDFS is designed for storing very large files with write-once-read-many-times access patterns, running on clusters of commodity hardware. HDFS is not a good fit for low-latency data access, for lots of small files, or for modifications at arbitrary offsets in a file.

Hadoop employs a master-slave architecture, with each cluster consisting of a single NameNode that is the master and multiple DataNodes that are the slaves.

The NameNode manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories in the tree. The NameNode also knows the DataNodes on which all the blocks for a given file are located. DataNodes are the workhorses of the filesystem. They store and retrieve blocks when they are told to (by clients or the NameNode), and they report back to the NameNode periodically with lists of blocks that they are storing.

Files in HDFS are broken into block-sized chunks, the default size being 64 MB, which are stored as independent units, and each block of a file is independently replicated on multiple data nodes.


MapReduce –

MapReduce is a simple programming model for processing massive amounts of structured and unstructured data in parallel across a cluster of thousands of machines, in a reliable and fault-tolerant manner. It is the tool that actually gets the data processed.

MapReduce works by breaking the processing into two phases –

    1. Map – The purpose of the map phase is to organize the data in preparation for the processing done in the reduce phase.
    2. Reduce – Each reduce function processes the intermediate values for a particular key generated by the map function and generates the output.
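The two phases can be illustrated with the classic word-count example, written here as a single-process Python sketch of the programming model rather than Hadoop's actual Java API:

```python
# Word count in the two MapReduce phases: map emits (word, 1) pairs,
# reduce groups by key and sums the counts. Single-process sketch only.

from collections import defaultdict

def map_phase(lines):
    """Emit a (word, 1) key/value pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield word, 1

def reduce_phase(pairs):
    """Group intermediate values by key and sum them per key."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return {word: sum(counts) for word, counts in grouped.items()}

result = reduce_phase(map_phase(["big data", "big files"]))
print(result)  # {'big': 2, 'data': 1, 'files': 1}
```

In a real cluster the map calls run on many machines in parallel, and the grouping step (the "shuffle") moves each key's values to the reducer responsible for it.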

Job Tracker acts as Master and Task Tracker acts as Slave.

A MapReduce job splits a large data set into independent chunks and organizes them into key/value pairs for parallel processing. This parallel processing improves the speed and reliability of the cluster, returning solutions more quickly.

In the map phase, the InputFormat divides the input into ranges, and a map task is created for each range. The JobTracker distributes those tasks to the worker nodes. The output of each map task is partitioned into a group of key/value pairs for each reduce.

The Reduce function then collects the various results and combines them to answer the larger problem the master node was trying to solve. Each reduce pulls the relevant partition from the machines where the maps executed, then writes its output back into HDFS. Thus, the reduce is able to collect the data from all of the maps for the keys it is responsible for and combine them to solve the problem.


 Hive –

Apache Hive is an open-source data warehouse system for querying and analyzing large data sets stored in Hadoop files. Hive has three main functions: data summarization, querying and analysis.

It provides an SQL-like language called HiveQL while maintaining full support for MapReduce; in short, a Hive query is converted into MapReduce tasks. The main building blocks of Hive are the Metastore, which stores the system catalog and metadata about tables, columns, partitions, etc.; the Driver, which manages the lifecycle of a HiveQL statement as it moves through Hive; the Query Compiler, which compiles HiveQL into a directed acyclic graph of MapReduce tasks; the Execution Engine, which executes the tasks produced by the compiler in proper dependency order; and the Hive Server, which provides a Thrift interface and a JDBC/ODBC server.

In addition, HiveQL supports embedding custom MapReduce scripts directly in queries. Hive also supports serialization/deserialization of data.

Hive supports text files (also called flat files), SequenceFiles (flat files consisting of binary key/value pairs) and RCFiles (Record Columnar Files, which store the columns of a table in the manner of a column-oriented database).


Pig –

Apache Pig is an open-source technology that offers a high-level mechanism for the parallel programming of MapReduce jobs to be executed on Hadoop clusters.

Pig enables developers to create query execution routines for analyzing large, distributed data sets without having to do low-level work in MapReduce, much like the way the Apache Hive data warehouse software provides a SQL-like interface for Hadoop that doesn’t require direct MapReduce programming.

The key parts of Pig are a compiler and a scripting language known as Pig Latin. Pig Latin is a data-flow language geared toward parallel processing. Through the use of user-defined functions (UDFs), Pig Latin applications can be extended to include custom processing tasks written in Java as well as languages such as JavaScript and Python.

Pig is intended to handle all kinds of data, including structured and unstructured information and relational and nested data.

HBase –

Apache HBase is a column-oriented key/value data store built to run on top of the Hadoop Distributed File System (HDFS). Hadoop is a framework for handling large datasets in a distributed computing environment.

HBase is designed to support high table-update rates and to scale out horizontally in distributed compute clusters. Its focus on scale enables it to support very large database tables — for example, ones containing billions of rows and millions of columns. Currently, one of the most prominent uses of HBase is as a structured data handler for Facebook’s basic messaging infrastructure.

HBase is known for providing strong data consistency on reads and writes, which distinguishes it from other NoSQL databases. Much like Hadoop, an important aspect of the HBase architecture is the use of master nodes to manage region servers that distribute and process parts of data tables.

ZooKeeper –

Apache ZooKeeper is an open-source coordination service, with a file-system-like application program interface (API), that allows distributed processes in large systems to synchronize with each other so that all clients making requests receive consistent data.

The Zookeeper service is provided by a cluster of servers to avoid a single point of failure. Zookeeper uses a distributed consensus protocol to determine which node in the ZooKeeper service is the leader at any given time.

The leader assigns a timestamp to each update to keep order. Once a majority of nodes have acknowledged receipt of a time-stamped update, the leader can declare a quorum, which means that any data contained in the update can be coordinated with elements of the data store. The use of a quorum ensures that the service always returns consistent answers.
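The majority-quorum rule is simple to state in code. This is a sketch of the counting rule only, not ZooKeeper's actual Zab protocol:

```python
# Majority-quorum check: an update is committed only once more than
# half of the ensemble has acknowledged it.

def has_quorum(acks, ensemble_size):
    """True when a strict majority of servers acknowledged the update."""
    return acks > ensemble_size // 2

assert has_quorum(3, 5)       # 3 of 5 servers -> majority
assert not has_quorum(2, 5)   # 2 of 5 -> no quorum yet
print(has_quorum(2, 3))  # True
```

Requiring a strict majority is what guarantees that any two quorums overlap, so a newly elected leader always sees every committed update.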

Sqoop –

Sqoop allows easy import and export of data from structured data stores such as relational databases, enterprise data warehouses, and NoSQL systems. The dataset being transferred is sliced up into different partitions and a map-only job is launched with individual mappers responsible for transferring a slice of this dataset.
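The slicing idea can be sketched as follows: a row count is divided into near-equal ranges, one per mapper. The function name and logic are illustrative, not Sqoop's actual split logic:

```python
# Illustrative dataset slicing: divide total_rows into near-equal
# (start, end) ranges, one per mapper, covering every row exactly once.

def slice_ranges(total_rows, num_mappers):
    """Return one half-open (start, end) row range per mapper."""
    base, extra = divmod(total_rows, num_mappers)
    ranges, start = [], 0
    for i in range(num_mappers):
        end = start + base + (1 if i < extra else 0)  # spread the remainder
        ranges.append((start, end))
        start = end
    return ranges

print(slice_ranges(10, 4))  # [(0, 3), (3, 6), (6, 8), (8, 10)]
```

Each mapper would then issue a query bounded by its own range and transfer that slice independently of the others.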

Flume –

Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data into the Hadoop Distributed File System (HDFS). It has a simple and flexible architecture based on streaming data flows; and is robust and fault tolerant with tunable reliability mechanisms for fail-over and recovery.

In Flume, the entities you work with are called sources, decorators, and sinks. A source can be any data source, and Flume has many predefined source adapters. A sink is the target of a specific operation (and in Flume, among other paradigms that use this term, the sink of one operation can be the source for the next downstream operation). A decorator is an operation on the stream that can transform the stream in some manner, which could be to compress or decompress data, modify data by adding or removing pieces of information, and more.

Oozie –

Oozie is a workflow processing system that lets users define a series of jobs written in multiple languages – such as MapReduce, Pig and Hive – and then intelligently link them to one another. Oozie allows users to specify, for example, that a particular query is only to be initiated after specified previous jobs on which it relies for data are completed.

Mahout –

Mahout is an open source machine learning library from Apache. It’s highly scalable. Mahout aims to be the machine learning tool of choice when the collection of data to be processed is very large, perhaps far too large for a single machine. At the moment, it primarily implements recommender engines (collaborative filtering), clustering, and classification. Recommender engines try to infer tastes and preferences and identify unknown items that are of interest. Clustering attempts to group a large number of things together that share some similarity. It’s a way to discover hierarchy and order in a large or hard-to-understand data set. Classification decides how much a thing is or isn’t part of some type or category, or how much it does or doesn’t have some attribute.