Big Data: The HDFS Way


Storage capacity was not the biggest concern for Big Data. The real concern was the rate at which data can be accessed, which is the usual measure of a system's performance and is limited by the number of I/O ports. As the number of queries increases, the I/O of a single file system becomes inadequate for retrieving large amounts of data simultaneously. Further, the model of one large, single storage unit becomes a bottleneck.

One machine with four I/O ports

Consider a machine with 4 I/O channels, each channel having a speed of 100 MB/s.

Time taken to read 1 TB (1024 * 1024 MB) of data, assuming all four channels are read in parallel:

(1024 * 1024) / (100 * 4 * 60) = 43.69 mins, or roughly 45 mins

Solution – To overcome these problems, a distributed file system was conceived. It tackles them as follows:

1. When dealing with large files, I/O becomes a big bottleneck, so each file is divided into smaller blocks that are stored across multiple machines.

2. With block storage, data access becomes distributed. This lets us combine the power of multiple machines into what behaves like a single virtual machine, so we are not limited to the capacity of a single unit.

3. When we need to read a file, the client sends requests to multiple machines. Each machine sends back the blocks it holds, and these are combined to produce the whole file.

4. Because the same block is stored on more than one machine, this helps remove the single point of failure: if one machine goes down, the client can request the block from another machine.

5. Since it employs a scale-out (i.e. distributed) architecture, it is easier to keep pace with data growth and increasing access demands by adding more nodes to the cluster. Ideally, performance and I/O bandwidth scale roughly linearly as more capacity is added to the storage system.
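The points above can be illustrated with a small in-memory sketch. This is not how HDFS itself is implemented; it is a toy model under stated assumptions. The names split_into_blocks, place_blocks and read_file, the 4-byte block size, the replication factor of 2 and the round-robin placement are all hypothetical choices made for illustration only.

```python
BLOCK_SIZE = 4    # bytes per block; deliberately tiny (hypothetical value)
REPLICATION = 2   # copies kept of each block (hypothetical value)


def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Cut the file contents into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(blocks, machines, replication=REPLICATION):
    """Round-robin placement: block i is copied to machines i, i+1, ... (mod n)."""
    storage = {machine: {} for machine in machines}
    for idx, block in enumerate(blocks):
        for r in range(replication):
            machine = machines[(idx + r) % len(machines)]
            storage[machine][idx] = block
    return storage


def read_file(storage, num_blocks, failed=frozenset()):
    """Fetch every block from any live machine that holds it, then join them."""
    parts = []
    for idx in range(num_blocks):
        for machine, held in storage.items():
            if machine not in failed and idx in held:
                parts.append(held[idx])
                break
        else:
            # No live machine holds this block.
            raise IOError(f"block {idx} is unavailable")
    return b"".join(parts)


data = b"hello distributed file system"
machines = ["m1", "m2", "m3"]

blocks = split_into_blocks(data)
storage = place_blocks(blocks, machines)

restored = read_file(storage, len(blocks))
restored_after_failure = read_file(storage, len(blocks), failed={"m2"})
```

With two copies of every block on three machines, the file can still be reassembled when any one machine (here m2) is marked as failed. Losing two machines at once would make some blocks unavailable in this toy setup.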

Ten machines with four I/O ports each

Consider 10 machines, each with 4 I/O channels, each channel having a speed of 100 MB/s.

Time taken to read 1 TB of data, assuming the blocks are spread evenly and all machines are read in parallel: (1024 * 1024) / (100 * 4 * 60 * 10) = 4.369 mins, or roughly 4.5 mins
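The two calculations can be reproduced with a short script. It is an idealized sketch: it assumes the data is spread evenly, every channel runs at full speed in parallel, and network and coordination overhead are ignored. The function name read_time_minutes and its parameters are hypothetical, introduced only for this example.

```python
ONE_TB_IN_MB = 1024 * 1024  # 1 TB expressed in MB, as in the text


def read_time_minutes(data_mb, machines, channels_per_machine, channel_mb_per_s):
    """Idealized read time: total data divided by aggregate bandwidth."""
    total_mb_per_s = machines * channels_per_machine * channel_mb_per_s
    seconds = data_mb / total_mb_per_s
    return seconds / 60


single = read_time_minutes(ONE_TB_IN_MB, machines=1,
                           channels_per_machine=4, channel_mb_per_s=100)
cluster = read_time_minutes(ONE_TB_IN_MB, machines=10,
                            channels_per_machine=4, channel_mb_per_s=100)

print(f"1 machine:   {single:.2f} mins")   # 43.69 mins
print(f"10 machines: {cluster:.2f} mins")  # 4.37 mins
```

Under these assumptions the read time falls in direct proportion to the number of machines. A real cluster would see a smaller speed-up because of network and coordination costs.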

 
