Big Data HDFS Way

Storage was not the biggest concern for the Big Data, the real concern was the rate at which data is getting accessed from the system which is usually used to measure performance of the system and it is determined by the number of I/O ports. As the number of queries to access data increase, the current file system I/O becomes inadequate to retrieve large amounts of data simultaneously. Further, the model of one large single storage starts becoming a bottleneck.

One Machine with four io port

Consider a machine with 4 I/O channels, each channel having a speed of 100 MB/s.

Time taken to read 1 TB of data is :

(1024 * 1024)/(100 * 4 * 60 ) = 43.69 mins, approximately 45 mins

Solution – To overcome the problems, a distributed file system was conceived that provided solution to the above problems. The solution tackled the problem as

1. When dealing with large files, I/O becomes a big bottleneck. So, we divide the files into small blocks and store in multiple machines.

2. With the advent of block storage, the data access becomes distributed and enables us to combine the power of multiple machines into a single virtual machine, so that you are not limited to the capacity of a single unit.

3. When we need to read the file, the client sends a request to multiple machines, each machine sends a block of file which is then combined together to produce the whole file.

4. As the data blocks are stored on multiple machines, it helps in removing single point of failure by having the same block on multiple machines. Meaning, if one machine goes, the client can request the block from another machine.

5. Since it employs scale-out i.e distributed architecture  its easier to keep pace with data growth problems as well as increasing access demands by adding more nodes to the cluster. In this way, performance and I/O bandwidth all scale linearly as more capacity is added to the storage system.

ten machine with four io port



Consider 10 machine with 4 I/O channels, each channel having a speed of 100 MB/s.

Time taken to read 1 TB of data is : (1024 * 1024)/(100 * 4 * 60 * 10 ) = 4.369 mins, approximately 4.5 mins


Hadoop – An Introduction

Hadoop is the most well-known technology used for exploiting the big data, to enable businesses to gain insight from massive amounts of structured and unstructured data quickly.

In a nutshell Hadoop is one way of storing an enormous amount of data in an enormous cluster of machines and running distributed analysis on each cluster or we can say Hadoop is a software framework that supports data intensive distributed applications. It’s both distributed computing and parallel computing. 

Cluster – Connecting two or more computers together in such a way that they behave like a single computer. Clustering is used for parallel processing, load balancing and fault tolerance.
Distributed Computing – Multiple independent computations on multiple independent machines, where each client works on a separate chunk of information and returns the completed package to a centralized resource that’s responsible for managing the overall workload.
Parallel Computing – multiple CPUs or computers working simultaneously at the same problem requires constant sharing of small snippet of data between the CPUs.
In Parallel Computing all processors have access to a shared memory. In distributed computing, each processor has its own private memory.

In a traditional non distributed architecture, you’ll have data stored in one server and any client program will access this central data server to retrieve the data. In Hadoop distributed architecture, both data and processing are distributed across multiple servers. In simple terms, instead of running a query on a single server, the query is split across multiple servers i.e. when you run a query against a large data set, every server in this distributed architecture will be executing the query on its local machine against the local data set, and the results are consolidated. This means that the results of a query on a larger dataset are returned faster.

Hadoop employs Scale-Out (or horizontally) architecture  means adding new components (or building blocks) to a system i.e. adding a new Hadoop DataNode to a Hadoop cluster, as opposed to Scale-Up or vertically) used by traditional RDBMS which means adding more resources to an existing component of a system such as adding more CPU, adding more storage, etc.

Hadoop is batch processing centric ideal for the discovery, exploration and analysis of large amounts of multi-structured data that doesn’t fit nicely into table, and not suitable for real-time operations.

Apache DefinitionThe Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.

Hadoop is almost completely modular, which means that you can swap out almost any of its components for a different software tool.

Hadoop Architecture

Machine Roles“Hadoop employs master/slave architecture for both distributed storage and distributed computation”. In the distributed storage, the NameNode is the master and the DataNodes are the slaves. In the distributed computation, the Job Tracker is the master and the Task Trackers are the slaves.

Working of Hadoop Cluster – 

3 major categories of machine roles in a Hadoop Deployment are,

    1. Client Machine
    2. Master Node
    3. Slave Node

Machine Roles

Client Machine – has hadoop installed will all the cluster settings, but is neither a Master nor Slave. The role of client machine is to load data into cluster, submit Map Reduce jobs describing how that data should be processed and then retrieve or view all the results of the job when it’s finished.

Master Node – encapsulates 2 key functional pieces that make up hadoop

    1. Storing lots of data (HDFS)
    2. Running parallel computations on all that data (MapReduce).

The Name Node is responsible for coordinating the data storage function (HDFS), while the Job Tracker oversees and coordinates the parallel processing of data using the MapReduce.

Slave Nodes – make up the vast majority of machines and do all the dirty work of storing the data and running the computations. Each slave runs both a Data Node and Task Tracker daemon that communicates and receives instructions from their master node.

Task Tracker daemon is slave to the Job Tracker, Data node daemon slave to Name Node.

Hadoop Ecosystem –

Hadoop is built on two main parts

    1. Distributed Storage – A special file system called Hadoop Distributed File System (HDFS)
    2. Distributed Processing – MapReduce framework


The other components of Hadoop Ecosystem are –

    1. Data access and query (Apache Hive)
    2. Scripting platform (Apache Pig)
    3. Column oriented database scaling to billions of rows (Apache HBase)
    4. Metadata services (Apache HCatalog)
    5. Workflow scheduler (Apache Oozie)
    6. Cluster coordination (Apache Zookeeper)
    7. Import data from relational databases (Apache Sqoop)
    8. Collection and import of log and event data (Apache Flume)
    9. Library of machine learning and data mining algorithms (Mahout)
Please refer the article Hadoop Ecosystem for more details.

Big Data is it really about ‘Volume’?

Whenever we think of Big Data we often tend to relate it with the enormous quantity of data without realizing that the problem is not just about huge volume of data, it’s also about the increasing velocity at which it is being created, as well as by its variety (think of “unstructured” data such as text documents, videos and pictures, for instance).

So, What is Big Data?

The term big data is believed to have originated with Web search companies who had to query very large distributed aggregations of loosely-structured data. Big data is a concept, or a popular term, used to describe a massive volume of structured, semi-structured, unstructured and multi-structured data that is so large that it’s difficult to process the data within a tolerable elapsed time, using traditional database and software techniques.

    • Structured Data – Data that resides in fixed fields within a record or file. Relational databases and spreadsheets are examples of structured data.
    • Semi-Structured Data – Semi-structured data is data that is neither raw data, nor typed data in a conventional database system. It is type structured data, but does not conform to the formal structure of data models associated with relational databases or other forms of data tables, but nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data. XML, JSON are examples of semi-structured data.
    • Un-Structured Data – Data that does not reside in fixed locations. The term generally refers to free-form text, which is ubiquitous. Examples are word processing documents, PDF files, e-mail messages, blogs and Web pages, Twitter tweets, and other social media posts.
    • Multi-structured data – Refers to a variety of data formats and types and can be derived from interactions between people and machines, such as web applications or social networks. A great example is web log data, which includes a combination of text and visual images along with structured data like form or transactional information.

An example of big data might be petabytes (1,024 terabytes) or exabytes (1,024 petabytes) of data consisting of billions to trillions of records of millions of people—all from different sources (e.g. Web, sales, customer contact center, social media, mobile data and so on). The data is typically loosely structured data that is often incomplete and inaccessible. According to IBM, 80% of data captured today is unstructured, from sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few. All of this unstructured data is Big Data. The challenges include capture, curation, storage, search, sharing, transfer, analysis and visualization. While the term may seem to reference the volume of data, that isn’t always the case. The term big data is defined with respect to the three V’s: volume (amount of data), velocity (speed of data in and out) and variety (range of data types and sources) and veracity is added by some organizations, and when used by vendors, may refer to the technology (which includes tools and processes) that an organization requires handling the large amounts of data and storage facilities. BigData

    • Volume – Many factors contribute to the increase in data volume. Transaction-based data stored through the years. Unstructured data streaming in from social media. Increasing amounts of sensor and machine-to-machine data being collected. In the past, excessive data volume was a storage issue. But with decreasing storage costs, other issues emerge, including how to determine relevance within large data volumes and how to use analytics to create value from relevant data.
    • Velocity – It’s not just the velocity of the incoming data that’s the issue: it’s streaming of fast-moving data into bulk storage for later batch processing. The importance lies in the speed of the feedback loop, taking data from input through to decision. RFID tags, sensors and smart metering are driving the need to deal with torrents of data in near-real time. Reacting quickly enough to deal with data velocity is a challenge for most organizations.
    • Variety – Data today comes in all types of formats. Structured, numeric data in traditional databases. Information created from line-of-business applications. Unstructured text documents, email, video, audio, stock ticker data and financial transactions. Managing, merging and governing different varieties of data is something many organizations still grapple with.
    • Veracity – Big Data Veracity refers to the biases, noise and abnormality in data i.e. uncertain and imprecise data.

Why Big Data?

In business, we often run into very similar problems where we need to make decisions based on incomplete information in a rapidly changing context.  The hopeful vision is that organizations will be able to take data from any source, harness relevant data and analyse it to find answers that smarter business decision-making. Big data can also be used to uncover hidden patterns, unknown correlations and other useful information by combining big data and high-powered analytics. Such information can provide competitive advantages over rival organizations and result in business benefits, such as more effective marketing and increased revenue. It’s also possible to use clickstream analysis and data mining to detect fraudulent behavior and to determine root causes of failures, issues and defects in near-real time, potentially saving billions of dollars annually. Big Data can be used to develop the next generation of products and services. For instance, manufacturers are using data obtained from sensors embedded in products to create innovative after-sales service offerings such as proactive maintenance to avoid failures in new products. To get maximum value out of big data, it needs to be associated with traditional enterprise data, automatically or via purpose-built applications, reports, queries, and other approaches. For instance a retailer wants to link its website visitor behaviour logs (Big Data) with purchase information (found in Relational Database) based on text and image message volume trends (Unstructured big data).

Big Data Helps Serve Customers Better – Recent studies suggest that today’s customers get a major chunk of information about brands from social media websites. In order to survive in the connected economy, businesses need to actively Engage with their customers for a better brand awareness and recall; Analyze customers’ needs based on their preferences and behaviour; and Act on this to create goods and services that are aligned to the customer’s needs – when they want it

    • Engage Social Engagement with customers includes listening and monitoring what the customer says and keeping them engaged with specific content that provides the customers a ring-side view of the brands and products and in the process elicits responses about their needs and preferences.
    • Analyze – Understand your customers and the impact to your business/brand. Monitor context of the conversation against the matrices customer sentiment, Customer Psychograph and Demography, and so on.
    • Act When the social data is correlated with enterprise data, it provides a much potent view of the social impact. This makes it easier for organizations to improve their campaign efficacy or customer engagement. Armed with the knowledge of customer preference along with their behaviour in the store, the business leaders can more effectively tune their messaging, products and services for better impact.

Emerging Technologies for Big Data –

Hadoop – The first thing that comes to mind when we speak about Big Data is Hadoop. Hadoop is not Big Data it’s one of the most popular implementation of MapReduce for processing and handling Big Data. It is flexible enough to be able to work with multiple data sources, either aggregating multiple sources of data in order to do large scale processing, or even reading data from a database in order to run processor-intensive machine learning jobs. It has several different applications, but one of the top use cases is for large volumes of constantly changing data, such as location-based data from weather or traffic sensors, web-based or social media data, or machine-to-machine transactional data.

Hive – Hive is a data warehouse tool that integrates with HDFS and MapReduce by providing a SQL like query language called Hive Query Language (HQL) and schema layer on top of HDFS files. Hive translates HQL queries into a series of MapReduce jobs that emulate the queries behavior and runs the jobs in cluster. Hive is designed to enable easy data summarization, ad-hoc querying and analysis of large volume of data.

PIG – It’s an abstraction for Java MapReduce jobs similar to Hive. The language for this platform is called Pig Latin. Pig enables data workers to write complex data transformations without knowing Java using Pig’s simple SQL-like scripting language is called Pig Latin. Pig works with data from many sources, including structured and unstructured data, and stores the results into the Hadoop Data File System

Schema-less databases, or NoSQL databases – There are several database types that fit into this category, such as key-value stores and document stores, which focus on the storage and retrieval of large volumes of unstructured, semi-structured, or even structured data. MongoDb, Cassandra, HBase are few to name.