HDFS Blocks – Size Does Matter

A block is the smallest unit of data that can be stored or retrieved from the disk.

In general a Filesystem also consists of blocks which is made out of these blocks on the disk. Normally disk blocks are of 512 bytes and those of filesystem are of a few kilobytes.  In case of HDFS we also have the blocks concept. But here one block size is of 64 MB by default and which can be increased in an integral multiple of 64 i.e. 128 MB, 256 MB, 512 MB or even more in GB’s. It all depend on the requirement and use-cases.

So Why are these blocks size so large for HDFS?

The main reason for having the HDFS blocks in large size is to reduce the cost of seek time. In general, the seek time is 10 ms and disk transfer rate is 100 MB/s. To make the seek time 1% of the disk transfer rate, the block size should be 100 MB. The default size HDFS block is 64 MB. 

Advantages Of HDFS Blocks

The benefits with HDFS block are: 

    • The blocks are of fixed size, so it is very easy to calculate the number of blocks that can be stored on a disk.
    • HDFS block concept simplifies the storage of the datanodes. The datanodes doesn’t need to concern about the blocks metadata data like file permissions etc. The namenode maintains the metadata of all the blocks.
    • If the size of the file is less than the HDFS block size, then the file does not occupy the complete block storage.
    • As the file is chunked into blocks, it is easy to store a file that is larger than the disk size as the data blocks are distributed and stored on multiple nodes in a hadoop cluster.

Blocks are easy to replicate between the datanodes and thus provide fault tolerance and high availability. Hadoop framework replicates each block across multiple nodes (default replication factor is 3). In case of any node failure or block corruption, the same block can be read from another node.


Big Data HDFS Way

Storage was not the biggest concern for the Big Data, the real concern was the rate at which data is getting accessed from the system which is usually used to measure performance of the system and it is determined by the number of I/O ports. As the number of queries to access data increase, the current file system I/O becomes inadequate to retrieve large amounts of data simultaneously. Further, the model of one large single storage starts becoming a bottleneck.

One Machine with four io port

Consider a machine with 4 I/O channels, each channel having a speed of 100 MB/s.

Time taken to read 1 TB of data is :

(1024 * 1024)/(100 * 4 * 60 ) = 43.69 mins, approximately 45 mins

Solution – To overcome the problems, a distributed file system was conceived that provided solution to the above problems. The solution tackled the problem as

1. When dealing with large files, I/O becomes a big bottleneck. So, we divide the files into small blocks and store in multiple machines.

2. With the advent of block storage, the data access becomes distributed and enables us to combine the power of multiple machines into a single virtual machine, so that you are not limited to the capacity of a single unit.

3. When we need to read the file, the client sends a request to multiple machines, each machine sends a block of file which is then combined together to produce the whole file.

4. As the data blocks are stored on multiple machines, it helps in removing single point of failure by having the same block on multiple machines. Meaning, if one machine goes, the client can request the block from another machine.

5. Since it employs scale-out i.e distributed architecture  its easier to keep pace with data growth problems as well as increasing access demands by adding more nodes to the cluster. In this way, performance and I/O bandwidth all scale linearly as more capacity is added to the storage system.

ten machine with four io port



Consider 10 machine with 4 I/O channels, each channel having a speed of 100 MB/s.

Time taken to read 1 TB of data is : (1024 * 1024)/(100 * 4 * 60 * 10 ) = 4.369 mins, approximately 4.5 mins


Handling Failures and Re-Replicating Missing Replicas

Data Nodes send heartbeats to the Name Node every 3 seconds via a TCP handshake, using the same port number defined for the Name Node daemon, usually TCP 9000.  Every tenth heartbeat is a Block Report, where the Data Node tells the Name Node about all the blocks it has.  The block reports allow the Name Node build its metadata and insure (3) copies of the block exist on different nodes, in different racks.

NameNode detects the absence of a Heartbeat message from a DataNode and marks such DataNodes as dead and does not forward any new IO requests to them presuming it to be dead and any data it had to be gone as well.  Based on the block reports it had been receiving from the dead node, the Name Node knows which copies of blocks died along with the node and can make the decision to re-replicate those blocks to other Data Nodes.  It will also consult the Rack Awareness data in order to maintain the two copies in one rack, one copy in another rack replica rule when deciding which Data Node should receive a new copy of the blocks.

Re-replication may be needed for various reasons: a DataNode becoming unavailable, replica getting corrupted, a disk failure on a DataNode, or replication factor of a file getting increased.

replicating missing replicas

A block of data fetched from a DataNode may be corrupted because of faults in a storage device, network faults or buggy software. HDFS can detect this using the checksum checking on the contents of the file. When a client creates an HDFS file, it computes a checksum of each block and stores these in a separate hidden file in the same namespace. When a block is fetched, the client verifies that the data it receives matches the checksum already stored. If it is doesn’t match, the data is assumed to be corrupted. The client can opt to retrieve that block from another replica.

HDFS Data Replication Stratergy

Replication is nothing but keeping same blocks of data on different nodes.

HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations. The block size and replication factor are configurable per file.

hdfs data replicationThe NameNode periodically receives the following from the DataNodes:

1. a Heartbeat implying that the DataNode is functioning properly and

2. a Block report containing the list of all blocks on a DataNode

HDFS uses a rack-aware replica placement policy to improve data reliability, availability and network bandwidth utilization.

HDFS’s default policy is to put one replica in the local rack, another on a node in a remote rack and the last one on a different node in the same remote rack. So instead of three racks, it now involves only two racks cutting the inter-rack write traffic. As the chance of rack failure is far less than the node failure, this does not impact data reliability and availability guarantees.

The typical placement policy can be stated as “One third of replicas are on one node, one third of replicas are on one rack, and the other third are evenly distributed across the remaining racks”. This policy improves write performance without compromising data reliability or read performance.

To minimize global bandwidth consumption and read latency, HDFS tries to satisfy a read request from a replica that is closest to the reader (in the order of same node, same rack, local data center and remote replica).

HDFS Architecture

HDFS employs master-slave architecture and consists of :

1. Single NameNode which is master and

2. Multiple DataNode which acts as slaves.

hdfs architecture

HDFS High Level Architecture

In the above diagram, there is one Name Node and multiple Data Nodes (servers) with data blocks.

When you dump a file (or data) into the HDFS, it stores them in blocks on the various nodes in Hadoop Clusters. HDFS creates several replications of the data blocks and distribute them accordingly in the cluster is a way that will be reliable and retrieved faster.

Hadoop will internally make sure that any node failure will never result in data loss.

There will be only one machine that manages the file system meta-data.

There will be multiple data nodes (These are the real cheap commodity servers that will store data blocks).

When we execute a query from a client, it will reach out to Name Node to get the file system meta-data information, and then it will reach out to the Data Node to get the real data blocks.

Name Node –

HDFS works by breaking large files into smaller pieces called blocks. The blocks are stored on data nodes, and it is the responsibility of the NameNode to know what blocks on which data nodes make up the complete file.

The complete collection of all the files in the cluster is referred to as the file system namespace. The namenode manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories in the tree. So it contains the information of all the files, directories and their hierarchy in the cluster. Along with the filesystem information it also knows about the datanode on which  all the blocks of a file is kept.

It is the NameNode’s job to oversee the health of Data Nodes and to coordinate access to data. The Name Node is the central controller of HDFS.

A client accesses the filesystem on behalf of the user by communicating with the namenode and datanodes. The client presents a filesystem interface similar to a Portable Operating System Interface (POSIX), so the user code does not need to know about the namenode and datanode to function.


The above diagram shows how Name Node stores information in disks. Two different files are –

1. fs image – It’s the snapshot the file system when Name Node is started.

2. Edit Logs – It’s the sequence of changes made to the file system after the Name Node is started.

Only in the restart of Name Node, edit logs are applied to fs image to get the latest snapshot of the file system, but Name Node restart are very rare in production clusters which means edit logs can grow very large for the cluster where Name Node runs for a long period of time leading to below mentioned issues –

1. Name Node restart takes long time because of lot of changes has to be merged.

2. In the case of crash we will lose huge amount of metadata since fs image is very old. This is also the reason that’s why Hadoop is known as a Single Point of failure.

Secondary namenode –

helps to overcome the above mentioned issue.

Secondary NameNode

Well the Secondary namenode also contains a namespace image and edit logs like namenode. Now after every certain interval of time(which is one hour by default)  it copies the namespace image from namenode and merge this namespace image with the edit log and copy it back to the namenode so that namenode will have the fresh copy of namespace image. Now lets suppose at any instance of time the namenode goes down and becomes corrupt then we can restart  some other machine with the namespace image and the edit log that’s what we have with the secondary namenode and hence can be prevented from a total failure.

Secondary Name node takes almost the same amount of memory and CPU for its working as the Namenode. So it is also kept in a separate machine like that of a namenode.

Data Nodes –

These are the workers that does the real work. And here by real work we mean that the storage of actual data is done by the data node. They store and retrieve blocks when they are told to (by clients or the namenode).

Data nodes are not smart, but they are resilient. Within the HDFS cluster, data blocks are replicated across multiple data nodes and access is managed by the NameNode. The replication mechanism is designed for optimal efficiency when all the nodes of the cluster are collected into a rack. In fact, the NameNode uses a “rack ID” to keep track of the data nodes in the cluster.

Data nodes also provide “heartbeat” messages to detect and ensure connectivity between the NameNode and the data nodes. When a heartbeat is no longer present, the NameNode unmaps the data node from the cluster and keeps on operating as though nothing happened. When the heartbeat returns, it is added to the cluster transparently with respect to the user or application.

Data integrity is a key feature. HDFS supports a number of capabilities designed to provide data integrity. As you might expect, when files are broken into blocks and then distributed across different servers in the cluster, any variation in the operation of any element could affect data integrity. HDFS uses transaction logs and checksum validation to ensure integrity across the cluster.

Transaction logs keep track of every operation and are effective in auditing or rebuilding of the file system should something untoward occur.

Checksum validations are used to guarantee the contents of files in HDFS. When a client requests a file, it can verify the contents by examining its checksum. If the checksum matches, the file operation can continue. If not, an error is reported. Checksum files are hidden to help avoid tampering.

Hadoop Distributed File System (HDFS) Concepts and Design Fundamentals

Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop applications. HDFS is a filesystem developed specifically for storing very large files with streaming data access patterns, running on clusters of commodity hardware and is highly fault-tolerant. HDFS accepts data in any format regardless of schema, optimizes for high bandwidth streaming, and scales to proven deployments of 100 PB and beyond.

HDFS is designed more for batch processing rather than interactive use by users. The emphasis is on high throughput of data access rather than low latency of data access.

Because the data is written once and then read many times thereafter, rather than the constant read-writes of other file systems, HDFS is an excellent choice for supporting big data analysis.


Hadoop Distributed File System (HDFS) Concepts – 

1. Cluster : A hadoop cluster is made by having many machines in a network, each machine is termed as a node, and these nodes talks to each other over the network.

computer cluster2. Name Node : Name Node holds all the file system metadata for the cluster and oversees the health of Data Nodes and coordinates access to data.  The Name Node is the central controller of HDFS.

3. Secondary Name Node :  Since Name Node is the single point of failure, secondary NameNode constantly reads the data from the RAM of the NameNode and writes it into the hard disk or the file system. It is not a substitute to the NameNode, so if the NameNode fails, the entire Hadoop system goes down.

4. Data Node : These are the workers that does the real work of storing data as and when told by the Name Node.

5. Block Size: This is the minimum amount of size of one block in a filesystem, in which data can be kept contiguously. The default size of a single block in HDFS is 64 Mb.

Fundamentals Behind HDFS Design –

1. Very Large files: Now when we say very large files we mean here that the size of the file will be in a range of gigabyte, terabyte, petabyte or may be more.

2. Streaming data access: HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern. This makes HDFS is an excellent choice for supporting big data analysis.

3. Commodity Hardware: Hadoop doesn’t require expensive, highly reliable hardware. It’s designed to run on clusters of commodity hardware (commonly available hardware that can be obtained from multiple vendors). This leads to an overall cost reduction to a great extent.

4. Data Replication and Fault Tolerance: HDFS works on the assumption that hardware is bound to fail at some point of time or the other. This disrupts the smooth and quick processing of large volumes of data. To overcome this obstacle, in HDFS, the files are divided into large blocks of data and each block is stored on three nodes: two on the same rack and one on a different rack for fault tolerance.

5. High Throughput : In HDFS, the task is divided and shared among different systems. So, all the systems will be executing the tasks assigned to them independently and in parallel achieving high throughput.

6. Moving Computation is Better than Moving Data: HDFS works on the principle that if a computation is done by an application near the data it operates on, it is much more efficient than done far of, particularly when there are large data sets. The major advantage is reduction in the network congestion and increased overall throughput of the system.

7. Scale Out Architecture : HDFS employs scale out architecture as compared to scale up architecture employed by RDBMS. To scale horizontally (or scale out) means to add more nodes to a system, such as adding a new computer to a distributed software application. Scaling out, also known as distributed programming, does something similar by distributing jobs across machines over the network.