HDFS Blocks – Size Does Matter


A block is the smallest unit of data that can be stored or retrieved from the disk.

In general a Filesystem also consists of blocks which is made out of these blocks on the disk. Normally disk blocks are of 512 bytes and those of filesystem are of a few kilobytes.  In case of HDFS we also have the blocks concept. But here one block size is of 64 MB by default and which can be increased in an integral multiple of 64 i.e. 128 MB, 256 MB, 512 MB or even more in GB’s. It all depend on the requirement and use-cases.

So Why are these blocks size so large for HDFS?

The main reason for having the HDFS blocks in large size is to reduce the cost of seek time. In general, the seek time is 10 ms and disk transfer rate is 100 MB/s. To make the seek time 1% of the disk transfer rate, the block size should be 100 MB. The default size HDFS block is 64 MB. 

Advantages Of HDFS Blocks

The benefits with HDFS block are: 

    • The blocks are of fixed size, so it is very easy to calculate the number of blocks that can be stored on a disk.
    • HDFS block concept simplifies the storage of the datanodes. The datanodes doesn’t need to concern about the blocks metadata data like file permissions etc. The namenode maintains the metadata of all the blocks.
    • If the size of the file is less than the HDFS block size, then the file does not occupy the complete block storage.
    • As the file is chunked into blocks, it is easy to store a file that is larger than the disk size as the data blocks are distributed and stored on multiple nodes in a hadoop cluster.

Blocks are easy to replicate between the datanodes and thus provide fault tolerance and high availability. Hadoop framework replicates each block across multiple nodes (default replication factor is 3). In case of any node failure or block corruption, the same block can be read from another node.

Big Data HDFS Way


Storage was not the biggest concern for the Big Data, the real concern was the rate at which data is getting accessed from the system which is usually used to measure performance of the system and it is determined by the number of I/O ports. As the number of queries to access data increase, the current file system I/O becomes inadequate to retrieve large amounts of data simultaneously. Further, the model of one large single storage starts becoming a bottleneck.

One Machine with four io port

Consider a machine with 4 I/O channels, each channel having a speed of 100 MB/s.

Time taken to read 1 TB of data is :

(1024 * 1024)/(100 * 4 * 60 ) = 43.69 mins, approximately 45 mins

Solution – To overcome the problems, a distributed file system was conceived that provided solution to the above problems. The solution tackled the problem as

1. When dealing with large files, I/O becomes a big bottleneck. So, we divide the files into small blocks and store in multiple machines.

2. With the advent of block storage, the data access becomes distributed and enables us to combine the power of multiple machines into a single virtual machine, so that you are not limited to the capacity of a single unit.

3. When we need to read the file, the client sends a request to multiple machines, each machine sends a block of file which is then combined together to produce the whole file.

4. As the data blocks are stored on multiple machines, it helps in removing single point of failure by having the same block on multiple machines. Meaning, if one machine goes, the client can request the block from another machine.

5. Since it employs scale-out i.e distributed architecture  its easier to keep pace with data growth problems as well as increasing access demands by adding more nodes to the cluster. In this way, performance and I/O bandwidth all scale linearly as more capacity is added to the storage system.

ten machine with four io port

 

 

Consider 10 machine with 4 I/O channels, each channel having a speed of 100 MB/s.

Time taken to read 1 TB of data is : (1024 * 1024)/(100 * 4 * 60 * 10 ) = 4.369 mins, approximately 4.5 mins

 

Handling Failures and Re-Replicating Missing Replicas


Data Nodes send heartbeats to the Name Node every 3 seconds via a TCP handshake, using the same port number defined for the Name Node daemon, usually TCP 9000.  Every tenth heartbeat is a Block Report, where the Data Node tells the Name Node about all the blocks it has.  The block reports allow the Name Node build its metadata and insure (3) copies of the block exist on different nodes, in different racks.

NameNode detects the absence of a Heartbeat message from a DataNode and marks such DataNodes as dead and does not forward any new IO requests to them presuming it to be dead and any data it had to be gone as well.  Based on the block reports it had been receiving from the dead node, the Name Node knows which copies of blocks died along with the node and can make the decision to re-replicate those blocks to other Data Nodes.  It will also consult the Rack Awareness data in order to maintain the two copies in one rack, one copy in another rack replica rule when deciding which Data Node should receive a new copy of the blocks.

Re-replication may be needed for various reasons: a DataNode becoming unavailable, replica getting corrupted, a disk failure on a DataNode, or replication factor of a file getting increased.

replicating missing replicas

A block of data fetched from a DataNode may be corrupted because of faults in a storage device, network faults or buggy software. HDFS can detect this using the checksum checking on the contents of the file. When a client creates an HDFS file, it computes a checksum of each block and stores these in a separate hidden file in the same namespace. When a block is fetched, the client verifies that the data it receives matches the checksum already stored. If it is doesn’t match, the data is assumed to be corrupted. The client can opt to retrieve that block from another replica.

HDFS Data Replication Stratergy


Replication is nothing but keeping same blocks of data on different nodes.

HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations. The block size and replication factor are configurable per file.

hdfs data replicationThe NameNode periodically receives the following from the DataNodes:

1. a Heartbeat implying that the DataNode is functioning properly and

2. a Block report containing the list of all blocks on a DataNode

HDFS uses a rack-aware replica placement policy to improve data reliability, availability and network bandwidth utilization.

HDFS’s default policy is to put one replica in the local rack, another on a node in a remote rack and the last one on a different node in the same remote rack. So instead of three racks, it now involves only two racks cutting the inter-rack write traffic. As the chance of rack failure is far less than the node failure, this does not impact data reliability and availability guarantees.

The typical placement policy can be stated as “One third of replicas are on one node, one third of replicas are on one rack, and the other third are evenly distributed across the remaining racks”. This policy improves write performance without compromising data reliability or read performance.

To minimize global bandwidth consumption and read latency, HDFS tries to satisfy a read request from a replica that is closest to the reader (in the order of same node, same rack, local data center and remote replica).

HDFS Architecture


HDFS employs master-slave architecture and consists of :

1. Single NameNode which is master and

2. Multiple DataNode which acts as slaves.

hdfs architecture

HDFS High Level Architecture

In the above diagram, there is one Name Node and multiple Data Nodes (servers) with data blocks.

When you dump a file (or data) into the HDFS, it stores them in blocks on the various nodes in Hadoop Clusters. HDFS creates several replications of the data blocks and distribute them accordingly in the cluster is a way that will be reliable and retrieved faster.

Hadoop will internally make sure that any node failure will never result in data loss.

There will be only one machine that manages the file system meta-data.

There will be multiple data nodes (These are the real cheap commodity servers that will store data blocks).

When we execute a query from a client, it will reach out to Name Node to get the file system meta-data information, and then it will reach out to the Data Node to get the real data blocks.

Name Node –

HDFS works by breaking large files into smaller pieces called blocks. The blocks are stored on data nodes, and it is the responsibility of the NameNode to know what blocks on which data nodes make up the complete file.

The complete collection of all the files in the cluster is referred to as the file system namespace. The namenode manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories in the tree. So it contains the information of all the files, directories and their hierarchy in the cluster. Along with the filesystem information it also knows about the datanode on which  all the blocks of a file is kept.

It is the NameNode’s job to oversee the health of Data Nodes and to coordinate access to data. The Name Node is the central controller of HDFS.

A client accesses the filesystem on behalf of the user by communicating with the namenode and datanodes. The client presents a filesystem interface similar to a Portable Operating System Interface (POSIX), so the user code does not need to know about the namenode and datanode to function.

NameNode

The above diagram shows how Name Node stores information in disks. Two different files are –

1. fs image – It’s the snapshot the file system when Name Node is started.

2. Edit Logs – It’s the sequence of changes made to the file system after the Name Node is started.

Only in the restart of Name Node, edit logs are applied to fs image to get the latest snapshot of the file system, but Name Node restart are very rare in production clusters which means edit logs can grow very large for the cluster where Name Node runs for a long period of time leading to below mentioned issues –

1. Name Node restart takes long time because of lot of changes has to be merged.

2. In the case of crash we will lose huge amount of metadata since fs image is very old. This is also the reason that’s why Hadoop is known as a Single Point of failure.

Secondary namenode –

helps to overcome the above mentioned issue.

Secondary NameNode

Well the Secondary namenode also contains a namespace image and edit logs like namenode. Now after every certain interval of time(which is one hour by default)  it copies the namespace image from namenode and merge this namespace image with the edit log and copy it back to the namenode so that namenode will have the fresh copy of namespace image. Now lets suppose at any instance of time the namenode goes down and becomes corrupt then we can restart  some other machine with the namespace image and the edit log that’s what we have with the secondary namenode and hence can be prevented from a total failure.

Secondary Name node takes almost the same amount of memory and CPU for its working as the Namenode. So it is also kept in a separate machine like that of a namenode.

Data Nodes –

These are the workers that does the real work. And here by real work we mean that the storage of actual data is done by the data node. They store and retrieve blocks when they are told to (by clients or the namenode).

Data nodes are not smart, but they are resilient. Within the HDFS cluster, data blocks are replicated across multiple data nodes and access is managed by the NameNode. The replication mechanism is designed for optimal efficiency when all the nodes of the cluster are collected into a rack. In fact, the NameNode uses a “rack ID” to keep track of the data nodes in the cluster.

Data nodes also provide “heartbeat” messages to detect and ensure connectivity between the NameNode and the data nodes. When a heartbeat is no longer present, the NameNode unmaps the data node from the cluster and keeps on operating as though nothing happened. When the heartbeat returns, it is added to the cluster transparently with respect to the user or application.

Data integrity is a key feature. HDFS supports a number of capabilities designed to provide data integrity. As you might expect, when files are broken into blocks and then distributed across different servers in the cluster, any variation in the operation of any element could affect data integrity. HDFS uses transaction logs and checksum validation to ensure integrity across the cluster.

Transaction logs keep track of every operation and are effective in auditing or rebuilding of the file system should something untoward occur.

Checksum validations are used to guarantee the contents of files in HDFS. When a client requests a file, it can verify the contents by examining its checksum. If the checksum matches, the file operation can continue. If not, an error is reported. Checksum files are hidden to help avoid tampering.

Hadoop Distributed File System (HDFS) Concepts and Design Fundamentals


Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop applications. HDFS is a filesystem developed specifically for storing very large files with streaming data access patterns, running on clusters of commodity hardware and is highly fault-tolerant. HDFS accepts data in any format regardless of schema, optimizes for high bandwidth streaming, and scales to proven deployments of 100 PB and beyond.

HDFS is designed more for batch processing rather than interactive use by users. The emphasis is on high throughput of data access rather than low latency of data access.

Because the data is written once and then read many times thereafter, rather than the constant read-writes of other file systems, HDFS is an excellent choice for supporting big data analysis.

hdfs

Hadoop Distributed File System (HDFS) Concepts – 

1. Cluster : A hadoop cluster is made by having many machines in a network, each machine is termed as a node, and these nodes talks to each other over the network.

computer cluster2. Name Node : Name Node holds all the file system metadata for the cluster and oversees the health of Data Nodes and coordinates access to data.  The Name Node is the central controller of HDFS.

3. Secondary Name Node :  Since Name Node is the single point of failure, secondary NameNode constantly reads the data from the RAM of the NameNode and writes it into the hard disk or the file system. It is not a substitute to the NameNode, so if the NameNode fails, the entire Hadoop system goes down.

4. Data Node : These are the workers that does the real work of storing data as and when told by the Name Node.

5. Block Size: This is the minimum amount of size of one block in a filesystem, in which data can be kept contiguously. The default size of a single block in HDFS is 64 Mb.

Fundamentals Behind HDFS Design –

1. Very Large files: Now when we say very large files we mean here that the size of the file will be in a range of gigabyte, terabyte, petabyte or may be more.

2. Streaming data access: HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern. This makes HDFS is an excellent choice for supporting big data analysis.

3. Commodity Hardware: Hadoop doesn’t require expensive, highly reliable hardware. It’s designed to run on clusters of commodity hardware (commonly available hardware that can be obtained from multiple vendors). This leads to an overall cost reduction to a great extent.

4. Data Replication and Fault Tolerance: HDFS works on the assumption that hardware is bound to fail at some point of time or the other. This disrupts the smooth and quick processing of large volumes of data. To overcome this obstacle, in HDFS, the files are divided into large blocks of data and each block is stored on three nodes: two on the same rack and one on a different rack for fault tolerance.

5. High Throughput : In HDFS, the task is divided and shared among different systems. So, all the systems will be executing the tasks assigned to them independently and in parallel achieving high throughput.

6. Moving Computation is Better than Moving Data: HDFS works on the principle that if a computation is done by an application near the data it operates on, it is much more efficient than done far of, particularly when there are large data sets. The major advantage is reduction in the network congestion and increased overall throughput of the system.

7. Scale Out Architecture : HDFS employs scale out architecture as compared to scale up architecture employed by RDBMS. To scale horizontally (or scale out) means to add more nodes to a system, such as adding a new computer to a distributed software application. Scaling out, also known as distributed programming, does something similar by distributing jobs across machines over the network.

Hadoop Ecosystem


The components of Hadoop Ecosystem are –

Hadoop Stack

Hadoop Distributed File System (HDFS) – 

Filesystems that manage the storage across a network of machines are called distributed filesystems.

Hadoop distributed file system (HDFS) is a java based file system that provides scalable and reliable data storage and provides high performance access to data across hadoop cluster.

HDFS is designed for storing very large files with write-once-ready-many-times patterns, running on clusters of commodity hardware. HDFS is not a good fit for low-latency data access, when there are lots of small files and for modifications at arbitrary offsets in the file.

Hadoop employs master slave architecture, with each cluster consisting of a single NameNode that is the master and the multiple DataNodes are the slaves.

The NameNode manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories in the tree. The NameNode also knows the DataNodes on which all the blocks for a given file are located. DataNodes are the workhorses of the filesystem. They store and retrieve blocks when they are told to (by clients or the NameNode), and they report back to the NameNode periodically with lists of blocks that they are storing.

Files in HDFS are broken into block-sized chunks, default size being 64MB, which are stored as independent units, and each block of file is independently replicated at multiple data node.

HDFS

MapReduce –

MapReduce is a simple programming model for processing massive amounts of structured and unstructured data in parallel across a cluster of thousands of machines, in a reliable and fault-tolerant manner. It is the tool that actually gets the data processed.

MapReduce works breaking the processing into two phases –

    1. Map – The purpose of the map phase is to organize the data in preparation for the processing done in the reduce phase.
    2. Reduce – Each reduce function processes the intermediate values for a particular key generated by the map function and generates the output.

Job Tracker acts as Master and Task Tracker acts as Slave.

A MapReduce job splits a large data set into independent chunks and organizes them into key, value pairs for parallel processing. This parallel processing improves the speed and reliability of the cluster, returning solutions more quickly and with more reliability.

The Map function divides the input into ranges by the InputFormat and creates a map task for each range in the input. The JobTracker distributes those tasks to the worker nodes. The output of each map task is partitioned into a group of key-value pairs for each Reduce.

The Reduce function then collects the various results and combines them to answer the larger problem the master node was trying to solve. Each reduce pulls the relevant partition from the machines where the maps executed, then writes its output back into HDFS. Thus, the reduce is able to collect the data from all of the maps for the keys it is responsible for and combine them to solve the problem.

Mapreduce

 Hive –

Apache Hive is an open-source data warehouse system for querying and analysis of large, stored in Hadoop files data sets. Hive has three main functions: data summary, queries and analyzers.

It provides an SQL-like language called HiveQL while maintaining full support for map/reduce. In short, a Hive query is converted to MapReduce tasks. The main building blocks of Hive are – Metastore stores the system catalog and metadata about tables, columns, partitions, etc. Driver manages the lifecycle of a HiveQL statement as it moves through Hive Query Compiler compiles HiveQL into a directed acyclic graph for MapReduce tasks Execution Engine executes the tasks produced by the compiler in proper dependency order Hive Server provides a Thrift interface and a JDBC / ODBC server.

In addition HiveQL supported in queries embedding individual MapReduce scripts. Hive also allows serialization / deserialization of data.

Hive supports text files (including flat files called), SequenceFiles (flat- Files consisting of binary key / value pairs) and RCFiles (Record Columnar files which store the columns of a table in the manner of a column-based database).

PIG –

Apache Pig is an open-source technology that offers a high-level mechanism for the parallel programming of MapReduce jobs to be executed on Hadoop clusters.

Pig enables developers to create query execution routines for analyzing large, distributed data sets without having to do low-level work in MapReduce, much like the way the Apache Hive data warehouse software provides a SQL-like interface for Hadoop that doesn’t require direct MapReduce programming.

The key parts of Pig are a compiler and a scripting language known as Pig Latin. Pig Latin is a data-flow language geared toward parallel processing. Through the use of user-defined functions (UDFs), Pig Latin applications can be extended to include custom processing tasks written in Java as well as languages such as JavaScript and Python.

Pig is intended to handle all kinds of data, including structured and unstructured information and relational and nested data.

HBase –

Apache HBase is a column-oriented key/value data store built to run on top of the Hadoop Distributed File System (HDFS). Hadoop is a framework for handling large datasets in a distributed computing environment.

HBase is designed to support high table-update rates and to scale out horizontally in distributed compute clusters. Its focus on scale enables it to support very large database tables — for example, ones containing billions of rows and millions of columns. Currently, one of the most prominent uses of HBase is as a structured data handler for Facebook’s basic messaging infrastructure.

HBase is known for providing strong data consistency on reads and writes, which distinguishes it from other NoSQL databases. Much like Hadoop, an important aspect of the HBase architecture is the use of master nodes to manage region servers that distribute and process parts of data tables.

ZooKeeper –

Apache ZooKeeper is an open source file application program interface (API) that allows distributed processes in large systems to synchronize with each other so that all clients making requests receive consistent data.

The Zookeeper service is provided by a cluster of servers to avoid a single point of failure. Zookeeper uses a distributed consensus protocol to determine which node in the ZooKeeper service is the leader at any given time.

The leader assigns a timestamp to each update to keep order. Once a majority of nodes have acknowledged receipt of a time-stamped update, the leader can declare a quorum, which means that any data contained in the update can be coordinated with elements of the data store. The use of a quorum ensures that the service always returns consistent answers.

Sqoop –

Sqoop allows easy import and export of data from structured data stores such as relational databases, enterprise data warehouses, and NoSQL systems. The dataset being transferred is sliced up into different partitions and a map-only job is launched with individual mappers responsible for transferring a slice of this dataset.

Flume –

Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data into the Hadoop Distributed File System (HDFS). It has a simple and flexible architecture based on streaming data flows; and is robust and fault tolerant with tunable reliability mechanisms for fail-over and recovery.

In Flume, the entities you work with are called sources, decorators, and sinks. A source can be any data source, and Flume has many predefined source adapters. A sink is the target of a specific operation (and in Flume, among other paradigms that use this term, the sink of one operation can be the source for the next downstream operation). A decorator is an operation on the stream that can transform the stream in some manner, which could be to compress or decompress data, modify data by adding or removing pieces of information, and more.

Oozie –

Oozie is a workflow processing system that lets users define a series of jobs written in multiple languages – such as Map Reduce, Pig and Hive – then intelligently link them to one another. Oozie allows users to specify, for example, that a particular query is only to be initiated after specified previous jobs on which it relies for data are completed.

Mahout –

Mahout is an open source machine learning library from Apache. It’s highly scalable. Mahout aims to be the machine learning tool of choice when the collection of data to be processed is very large, perhaps far too large for a single machine. At the moment, it primarily implements recommender engines (collaborative filtering), clustering, and classification. Recommender engines try to infer tastes and preferences and identify unknown items that are of interest. Clustering attempts to group a large number of things together that share some similarity. It’s a way to discover hierarchy and order in a large or hard-to-understand data set. Classification decides how much a thing is or isn’t part of some type or category, or how much it does or doesn’t have some attribute.

Hadoop – An Introduction


Hadoop is the most well-known technology used for exploiting the big data, to enable businesses to gain insight from massive amounts of structured and unstructured data quickly.

In a nutshell Hadoop is one way of storing an enormous amount of data in an enormous cluster of machines and running distributed analysis on each cluster or we can say Hadoop is a software framework that supports data intensive distributed applications. It’s both distributed computing and parallel computing. 

Cluster – Connecting two or more computers together in such a way that they behave like a single computer. Clustering is used for parallel processing, load balancing and fault tolerance.
Distributed Computing – Multiple independent computations on multiple independent machines, where each client works on a separate chunk of information and returns the completed package to a centralized resource that’s responsible for managing the overall workload.
Parallel Computing – multiple CPUs or computers working simultaneously at the same problem requires constant sharing of small snippet of data between the CPUs.
In Parallel Computing all processors have access to a shared memory. In distributed computing, each processor has its own private memory.

In a traditional non distributed architecture, you’ll have data stored in one server and any client program will access this central data server to retrieve the data. In Hadoop distributed architecture, both data and processing are distributed across multiple servers. In simple terms, instead of running a query on a single server, the query is split across multiple servers i.e. when you run a query against a large data set, every server in this distributed architecture will be executing the query on its local machine against the local data set, and the results are consolidated. This means that the results of a query on a larger dataset are returned faster.

Hadoop employs Scale-Out (or horizontally) architecture  means adding new components (or building blocks) to a system i.e. adding a new Hadoop DataNode to a Hadoop cluster, as opposed to Scale-Up or vertically) used by traditional RDBMS which means adding more resources to an existing component of a system such as adding more CPU, adding more storage, etc.

Hadoop is batch processing centric ideal for the discovery, exploration and analysis of large amounts of multi-structured data that doesn’t fit nicely into table, and not suitable for real-time operations.

Apache DefinitionThe Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.

Hadoop is almost completely modular, which means that you can swap out almost any of its components for a different software tool.

Hadoop Architecture

Machine Roles“Hadoop employs master/slave architecture for both distributed storage and distributed computation”. In the distributed storage, the NameNode is the master and the DataNodes are the slaves. In the distributed computation, the Job Tracker is the master and the Task Trackers are the slaves.

Working of Hadoop Cluster – 

3 major categories of machine roles in a Hadoop Deployment are,

    1. Client Machine
    2. Master Node
    3. Slave Node

Machine Roles

Client Machine – has hadoop installed will all the cluster settings, but is neither a Master nor Slave. The role of client machine is to load data into cluster, submit Map Reduce jobs describing how that data should be processed and then retrieve or view all the results of the job when it’s finished.

Master Node – encapsulates 2 key functional pieces that make up hadoop

    1. Storing lots of data (HDFS)
    2. Running parallel computations on all that data (MapReduce).

The Name Node is responsible for coordinating the data storage function (HDFS), while the Job Tracker oversees and coordinates the parallel processing of data using the MapReduce.

Slave Nodes – make up the vast majority of machines and do all the dirty work of storing the data and running the computations. Each slave runs both a Data Node and Task Tracker daemon that communicates and receives instructions from their master node.

Task Tracker daemon is slave to the Job Tracker, Data node daemon slave to Name Node.

Hadoop Ecosystem –

Hadoop is built on two main parts

    1. Distributed Storage – A special file system called Hadoop Distributed File System (HDFS)
    2. Distributed Processing – MapReduce framework

ecosystem

The other components of Hadoop Ecosystem are –

    1. Data access and query (Apache Hive)
    2. Scripting platform (Apache Pig)
    3. Column oriented database scaling to billions of rows (Apache HBase)
    4. Metadata services (Apache HCatalog)
    5. Workflow scheduler (Apache Oozie)
    6. Cluster coordination (Apache Zookeeper)
    7. Import data from relational databases (Apache Sqoop)
    8. Collection and import of log and event data (Apache Flume)
    9. Library of machine learning and data mining algorithms (Mahout)
Please refer the article Hadoop Ecosystem for more details.