HDFS Data Replication Stratergy

Replication is nothing but keeping same blocks of data on different nodes.

HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations. The block size and replication factor are configurable per file.

hdfs data replicationThe NameNode periodically receives the following from the DataNodes:

1. a Heartbeat implying that the DataNode is functioning properly and

2. a Block report containing the list of all blocks on a DataNode

HDFS uses a rack-aware replica placement policy to improve data reliability, availability and network bandwidth utilization.

HDFS’s default policy is to put one replica in the local rack, another on a node in a remote rack and the last one on a different node in the same remote rack. So instead of three racks, it now involves only two racks cutting the inter-rack write traffic. As the chance of rack failure is far less than the node failure, this does not impact data reliability and availability guarantees.

The typical placement policy can be stated as “One third of replicas are on one node, one third of replicas are on one rack, and the other third are evenly distributed across the remaining racks”. This policy improves write performance without compromising data reliability or read performance.

To minimize global bandwidth consumption and read latency, HDFS tries to satisfy a read request from a replica that is closest to the reader (in the order of same node, same rack, local data center and remote replica).