HDFS Blocks – Size Does Matter

A block is the smallest unit of data that can be stored or retrieved from the disk.

In general, a filesystem is also organized into blocks, which are built on top of these disk blocks. Disk blocks are normally 512 bytes, while filesystem blocks are a few kilobytes. HDFS has a block concept as well, but here a block is 64 MB by default, and the size can be increased, commonly to 128 MB, 256 MB, 512 MB, or even several gigabytes. It all depends on the requirements and use cases.

So why are blocks so large in HDFS?

The main reason for making HDFS blocks so large is to reduce the cost of seeks. Typically, the seek time is about 10 ms and the disk transfer rate is about 100 MB/s. To keep the seek time at around 1% of the transfer time, the block size should be roughly 100 MB. The default HDFS block size of 64 MB is in this range.
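The arithmetic above can be sketched in a few lines of Python. The 10 ms seek time and 100 MB/s transfer rate are the illustrative figures from the text, not measurements:

```python
SEEK_TIME_S = 0.010        # typical disk seek time from the text: 10 ms
TRANSFER_RATE_MBPS = 100   # typical disk transfer rate from the text: 100 MB/s

def seek_overhead(block_size_mb):
    """Fraction of total read time spent seeking when reading one block."""
    transfer_time_s = block_size_mb / TRANSFER_RATE_MBPS
    return SEEK_TIME_S / (SEEK_TIME_S + transfer_time_s)

# Larger blocks amortize the fixed seek cost over more data:
for size in (4, 64, 100, 128):
    print(f"{size:>4} MB block: seek is {seek_overhead(size):.1%} of read time")
```

For a 4 MB block the seek cost is a hefty 20% of the read time, while at 100 MB it drops to about 1%, which is exactly the trade-off driving the large default.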

Advantages Of HDFS Blocks

The benefits of HDFS blocks are:

    • The blocks are of a fixed size, so it is easy to calculate how many blocks fit on a disk.
    • The block concept simplifies storage management on the datanodes. A datanode does not need to concern itself with block metadata such as file permissions; the namenode maintains the metadata for all blocks.
    • If a file is smaller than the HDFS block size, it does not occupy a full block's worth of storage.
    • Because a file is chunked into blocks, a file larger than any single disk can still be stored: its blocks are distributed across multiple nodes in a Hadoop cluster.
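The first and third bullets above boil down to simple arithmetic; here is a small sketch using the default 64 MB block size from the text:

```python
import math

BLOCK_SIZE_MB = 64  # default HDFS block size, per the text

def blocks_needed(file_size_mb):
    """Fixed-size blocks make block counts a simple calculation (first bullet)."""
    return math.ceil(file_size_mb / BLOCK_SIZE_MB)

def storage_used_mb(file_size_mb):
    """A file smaller than a block occupies only its own size (third bullet):
    the partial last block is not padded out to the full 64 MB."""
    return file_size_mb

# A 200 MB file spans 4 blocks (3 full + 1 partial) but stores only 200 MB:
print(blocks_needed(200), storage_used_mb(200))  # → 4 200
```

Those four blocks can then land on four different datanodes, which is why a single file can exceed the capacity of any one disk (fourth bullet).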

Blocks are also easy to replicate between datanodes, which provides fault tolerance and high availability. The Hadoop framework replicates each block across multiple nodes (the default replication factor is 3). If a node fails or a block is corrupted, the same block can be read from another node.
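The fault-tolerance argument can be illustrated with a toy model (the block and node names below are hypothetical, and this is a simplification of what the namenode actually tracks):

```python
# Each block is replicated on 3 datanodes (the default replication factor).
block_locations = {"blk_001": ["node1", "node2", "node3"]}
live_nodes = {"node2", "node3"}  # suppose node1 has failed

def readable(block_id):
    """A block stays readable as long as at least one replica is on a live node."""
    return any(node in live_nodes for node in block_locations[block_id])

print(readable("blk_001"))  # → True: node1 is down, but node2/node3 still hold it
```

With three replicas, the block is lost only if all three hosting nodes fail before the cluster re-replicates it, which is why replication gives both fault tolerance and high availability.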