Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop applications. It is a filesystem designed specifically for storing very large files with streaming data access patterns, and it runs on clusters of commodity hardware while remaining highly fault-tolerant. HDFS accepts data in any format regardless of schema, is optimized for high-bandwidth streaming, and scales to proven deployments of 100 PB and beyond.
HDFS is designed for batch processing rather than interactive use. The emphasis is on high throughput of data access rather than low latency.
Because data is written once and then read many times, rather than being constantly read and rewritten as in other file systems, HDFS is an excellent choice for supporting big data analysis.
Hadoop Distributed File System (HDFS) Concepts –
1. Cluster : A Hadoop cluster is made up of many machines in a network. Each machine is termed a node, and these nodes talk to each other over the network.
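3. Secondary Name Node : In a classic setup without high availability, the Name Node is a single point of failure. The Secondary NameNode periodically merges the NameNode's edit log into its file system image (a checkpoint) and writes the result to disk, so that the edit log stays small and a recent copy of the metadata exists. It is not a substitute for the NameNode: if the NameNode fails, the entire Hadoop system goes down until the NameNode is restored.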
4. Data Node : These are the workers that do the real work of storing data as and when told by the Name Node.
5. Block Size: This is the smallest unit in which a filesystem keeps data contiguously. The default size of a single block in HDFS is 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x and later.
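To make the block size concrete, here is a minimal sketch in plain Python (it does not use the Hadoop API) of how a file can be cut into fixed-size blocks. The function name split_into_blocks is hypothetical, and the 64 MB default is simply the figure mentioned above; a real cluster takes its block size from configuration.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size=64 * MB):
    """Return (offset, length) pairs that cover a file of file_size bytes.

    Illustrative only: this models the arithmetic of block splitting,
    not how HDFS actually allocates blocks.
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    blocks = []
    offset = 0
    while offset < file_size:
        # The last block holds only the bytes that remain.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


blocks = split_into_blocks(200 * MB)
print(len(blocks))          # 4
print(blocks[-1][1] // MB)  # 8
```

A 200 MB file therefore occupies three full 64 MB blocks plus one 8 MB block. The final block only takes up as much space as the data it holds.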
Fundamentals Behind HDFS Design –
1. Very Large files: By very large files we mean files whose size is in the range of gigabytes, terabytes, petabytes, or more.
2. Streaming data access: HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern. This makes HDFS an excellent choice for supporting big data analysis.
3. Commodity Hardware: Hadoop doesn’t require expensive, highly reliable hardware. It’s designed to run on clusters of commodity hardware (commonly available hardware that can be obtained from multiple vendors). This greatly reduces the overall cost.
4. Data Replication and Fault Tolerance: HDFS works on the assumption that hardware is bound to fail at some point. Such failures disrupt the smooth and quick processing of large volumes of data. To overcome this obstacle, HDFS divides files into large blocks and, by default, stores each block on three nodes: two on the same rack and one on a different rack for fault tolerance.
5. High Throughput : In HDFS, work is divided and shared among different systems, which execute their assigned tasks independently and in parallel, achieving high throughput.
6. Moving Computation is Better than Moving Data: HDFS works on the principle that a computation is much more efficient when it runs near the data it operates on than when it runs far away, particularly when the data sets are large. The major advantages are reduced network congestion and increased overall throughput of the system.
7. Scale Out Architecture : HDFS employs a scale-out architecture, as opposed to the scale-up architecture typically employed by an RDBMS. To scale horizontally (or scale out) means to add more nodes to a system, such as adding a new computer to a distributed software application. Work is then distributed across the machines over the network, so capacity grows by adding machines rather than by upgrading a single one.
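The replication rule in point 4 (three copies of each block, two on one rack and one on a different rack) can be illustrated with a small simulation. This is only a sketch under stated assumptions: place_replicas, survives_rack_failure, and the rack and node names are hypothetical, and the racks are chosen at random here, whereas a real NameNode also considers factors such as where the writer is located.

```python
import random


def place_replicas(topology, rng=None):
    """Pick three (rack, node) pairs: two on one rack, one on another.

    topology maps a rack name to the list of node names on that rack.
    Simplified illustration, not the actual HDFS placement policy code.
    """
    rng = rng or random.Random()
    # Racks with at least two nodes can hold two replicas on distinct nodes.
    pair_racks = [rack for rack, nodes in topology.items() if len(nodes) >= 2]
    if not pair_racks:
        raise ValueError("need a rack with at least two nodes")
    pair_rack = rng.choice(pair_racks)
    other_racks = [rack for rack, nodes in topology.items()
                   if rack != pair_rack and nodes]
    if not other_racks:
        raise ValueError("need a second, non-empty rack")
    single_rack = rng.choice(other_racks)
    first, second = rng.sample(topology[pair_rack], 2)
    third = rng.choice(topology[single_rack])
    return [(pair_rack, first), (pair_rack, second), (single_rack, third)]


def survives_rack_failure(placement, failed_rack):
    """Return True if at least one replica lives outside the failed rack."""
    return any(rack != failed_rack for rack, _ in placement)


# Hypothetical two-rack cluster.
topology = {"rack1": ["node1", "node2", "node3"], "rack2": ["node4", "node5"]}
placement = place_replicas(topology, random.Random(0))
print(placement)
print(all(survives_rack_failure(placement, rack) for rack in topology))  # True
```

Because the replicas always span two racks, losing any single rack still leaves at least one copy of the block readable, which is the point of spreading replicas across racks.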