Big Data – Is It Really About ‘Volume’?

Whenever we think of Big Data, we tend to relate it to the enormous quantity of data, without realizing that the problem is not just the huge volume of data: it is also the increasing velocity at which data is created, as well as its variety (think of “unstructured” data such as text documents, videos and pictures, for instance).

So, What is Big Data?

The term big data is believed to have originated with Web search companies that had to query very large, distributed aggregations of loosely structured data. Big data is a concept, or a popular term, used to describe a volume of structured, semi-structured, unstructured and multi-structured data so large that it is difficult to process within a tolerable elapsed time using traditional database and software techniques.

    • Structured Data – Data that resides in fixed fields within a record or file. Relational databases and spreadsheets are examples of structured data.
    • Semi-Structured Data – Data that does not conform to the formal structure of data models associated with relational databases or other forms of data tables, but nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data. XML and JSON are examples of semi-structured data.
    • Un-Structured Data – Data that does not reside in fixed locations. The term generally refers to free-form text, which is ubiquitous. Examples are word-processing documents, PDF files, e-mail messages, blogs, Web pages, tweets and other social media posts.
    • Multi-structured data – Refers to a variety of data formats and types and can be derived from interactions between people and machines, such as web applications or social networks. A great example is web log data, which includes a combination of text and visual images along with structured data like form or transactional information.
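The distinction between these categories can be made concrete with a short sketch. All sample records below are invented for illustration:

```python
import csv
import io
import json

# Structured: fixed fields in a CSV (relational-style) record.
structured = io.StringIO("id,name,age\n1,Alice,34\n2,Bob,29\n")
rows = list(csv.DictReader(structured))

# Semi-structured: JSON carries tags/markers but no rigid schema --
# the second record has a nested field the first record lacks.
semi = json.loads('[{"id": 1, "name": "Alice"},'
                  ' {"id": 2, "name": "Bob", "address": {"city": "Pune"}}]')

# Unstructured: free-form text; any structure must be inferred.
unstructured = "Great phone, but the battery barely lasts a day."

print(rows[0]["name"])             # Alice
print(semi[1]["address"]["city"])  # Pune
```

Notice that the CSV reader rejects nothing but expects every row to have the same fields, while each JSON document is free to carry its own shape.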

An example of big data might be petabytes (1,024 terabytes) or exabytes (1,024 petabytes) of data consisting of billions to trillions of records of millions of people, all from different sources (e.g. Web, sales, customer contact center, social media, mobile data and so on). The data is typically loosely structured, and often incomplete and inaccessible. According to IBM, 80% of data captured today is unstructured: data from sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few. All of this unstructured data is Big Data. The challenges include capture, curation, storage, search, sharing, transfer, analysis and visualization. While the term may seem to reference the volume of data, that isn’t always the case. Big data is defined with respect to the three V’s: volume (amount of data), velocity (speed of data in and out) and variety (range of data types and sources); some organizations add a fourth, veracity. When used by vendors, the term may also refer to the technology (which includes tools and processes) that an organization requires to handle the large amounts of data and storage facilities.
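The unit arithmetic above (binary units, where each step is a factor of 1,024) can be sketched in a few lines:

```python
# Binary units: each step up is a factor of 1,024 (2**10).
KB, MB, GB, TB = (1024 ** n for n in range(1, 5))
PB = 1024 * TB   # 1 petabyte = 1,024 terabytes
EB = 1024 * PB   # 1 exabyte  = 1,024 petabytes

# One petabyte is roughly a billion megabytes:
print(PB // MB)  # 1073741824
```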

    • Volume – Many factors contribute to the increase in data volume: transaction-based data stored through the years, unstructured data streaming in from social media, and increasing amounts of sensor and machine-to-machine data being collected. In the past, excessive data volume was a storage issue; with decreasing storage costs, other issues emerge, including how to determine relevance within large data volumes and how to use analytics to create value from the relevant data.
    • Velocity – It’s not just the velocity of the incoming data that’s the issue: fast-moving data can be streamed into bulk storage for later batch processing. The importance lies in the speed of the feedback loop, taking data from input through to decision. RFID tags, sensors and smart metering are driving the need to deal with torrents of data in near-real time. Reacting quickly enough to deal with data velocity is a challenge for most organizations.
    • Variety – Data today comes in all types of formats. Structured, numeric data in traditional databases. Information created from line-of-business applications. Unstructured text documents, email, video, audio, stock ticker data and financial transactions. Managing, merging and governing different varieties of data is something many organizations still grapple with.
    • Veracity – Refers to the biases, noise and abnormality in data, i.e. uncertain and imprecise data.

Why Big Data?

In business, we often run into very similar problems, where we need to make decisions based on incomplete information in a rapidly changing context. The hopeful vision is that organizations will be able to take data from any source, harness the relevant data and analyse it to find answers that enable smarter business decision-making. By combining big data with high-powered analytics, it can also be used to uncover hidden patterns, unknown correlations and other useful information. Such information can provide competitive advantages over rival organizations and result in business benefits, such as more effective marketing and increased revenue. It is also possible to use clickstream analysis and data mining to detect fraudulent behaviour, and to determine the root causes of failures, issues and defects in near-real time, potentially saving billions of dollars annually. Big Data can also be used to develop the next generation of products and services. For instance, manufacturers are using data obtained from sensors embedded in products to create innovative after-sales service offerings, such as proactive maintenance to avoid failures in new products. To get maximum value out of big data, it needs to be associated with traditional enterprise data, automatically or via purpose-built applications, reports, queries and other approaches. For instance, a retailer may want to link its website visitor-behaviour logs and social media message trends (big data) with purchase information held in a relational database.
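The retailer example amounts to a join between clickstream logs and purchase records on a shared customer key. A minimal sketch, with invented sample data and field names:

```python
# Hypothetical data: clickstream logs (the big data side) and
# purchase records (the relational side), joined on customer_id.
clicks = [
    {"customer_id": 1, "page": "/phones"},
    {"customer_id": 1, "page": "/phones/x100"},
    {"customer_id": 2, "page": "/laptops"},
]
purchases = [
    {"customer_id": 1, "item": "x100", "amount": 499.0},
]

bought = {p["customer_id"] for p in purchases}
# Pages browsed by customers who went on to buy something:
converting_pages = [c["page"] for c in clicks if c["customer_id"] in bought]
print(converting_pages)  # ['/phones', '/phones/x100']
```

At real scale the same join would run as a distributed job rather than in memory, but the logic is identical.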

Big Data Helps Serve Customers Better – Recent studies suggest that today’s customers get a major chunk of information about brands from social media websites. In order to survive in the connected economy, businesses need to actively Engage with their customers for better brand awareness and recall; Analyze customers’ needs based on their preferences and behaviour; and Act on this to create goods and services that are aligned to customers’ needs, when they want them.

    • Engage – Social engagement with customers includes listening to and monitoring what the customer says, and keeping them engaged with specific content that gives customers a ring-side view of the brands and products, in the process eliciting responses about their needs and preferences.
    • Analyze – Understand your customers and the impact on your business/brand. Monitor the context of the conversation against metrics such as customer sentiment, psychographics and demographics.
    • Act – When the social data is correlated with enterprise data, it provides a much more potent view of the social impact. This makes it easier for organizations to improve their campaign efficacy or customer engagement. Armed with the knowledge of customer preferences along with their behaviour in the store, business leaders can more effectively tune their messaging, products and services for better impact.

Emerging Technologies for Big Data

Hadoop – The first thing that comes to mind when we speak about Big Data is Hadoop. Hadoop is not Big Data; it is one of the most popular implementations of MapReduce for processing and handling Big Data. It is flexible enough to work with multiple data sources, either aggregating multiple sources of data in order to do large-scale processing, or even reading data from a database in order to run processor-intensive machine learning jobs. It has several different applications, but one of the top use cases is for large volumes of constantly changing data, such as location-based data from weather or traffic sensors, web-based or social media data, or machine-to-machine transactional data.
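The MapReduce model Hadoop implements can be sketched in a single process: a map phase emits key–value pairs, a shuffle/sort groups them by key, and a reduce phase combines each group. A minimal word-count sketch (not Hadoop's API, just the underlying idea):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(record):
    # Map: emit (key, value) pairs -- here, (word, 1) for each word.
    for word in record.lower().split():
        yield (word, 1)

def reduce_phase(key, values):
    # Reduce: combine all values that share a key.
    return (key, sum(values))

records = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = [kv for r in records for kv in map_phase(r)]
pairs.sort(key=itemgetter(0))  # shuffle/sort: group identical keys together
counts = dict(reduce_phase(k, (v for _, v in g))
              for k, g in groupby(pairs, key=itemgetter(0)))
print(counts["the"])  # 3
print(counts["fox"])  # 2
```

On a real cluster, the map and reduce phases run in parallel across many machines, and the shuffle moves data over the network rather than sorting an in-memory list.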

Hive – Hive is a data warehouse tool that integrates with HDFS and MapReduce by providing a SQL-like query language called Hive Query Language (HQL) and a schema layer on top of HDFS files. Hive translates HQL queries into a series of MapReduce jobs that emulate the query’s behaviour and runs the jobs on the cluster. Hive is designed to enable easy data summarization, ad-hoc querying and analysis of large volumes of data.
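For a feel of what Hive does, consider a simple aggregation. The HQL and table below are invented for illustration; the Python snippet shows the result the compiled MapReduce job would compute:

```python
from collections import Counter

# Hive would compile an HQL query such as
#   SELECT page, COUNT(*) FROM visits GROUP BY page;
# into one or more MapReduce jobs. The aggregation those jobs
# perform is equivalent to this grouped count:
visits = ["/home", "/pricing", "/home", "/docs", "/home"]
page_counts = Counter(visits)
print(page_counts["/home"])  # 3
```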

Pig – An abstraction over Java MapReduce jobs, similar to Hive. Pig enables data workers to write complex data transformations without knowing Java, using a simple SQL-like scripting language called Pig Latin. Pig works with data from many sources, including structured and unstructured data, and stores the results in the Hadoop Distributed File System (HDFS).
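A Pig Latin script describes a pipeline of transformations over relations. The script and data below are hypothetical; the Python sketch shows the effect of the pipeline rather than Pig itself:

```python
from collections import defaultdict

# A Pig Latin script along these lines:
#   logs   = LOAD 'visits.log' AS (user, page);
#   byuser = GROUP logs BY user;
#   counts = FOREACH byuser GENERATE group, COUNT(logs);
# groups records by user and counts each group. Its effect:
logs = [("alice", "/home"), ("bob", "/docs"), ("alice", "/pricing")]
grouped = defaultdict(list)
for user, page in logs:
    grouped[user].append(page)
counts = {user: len(pages) for user, pages in grouped.items()}
print(counts["alice"])  # 2
```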

Schema-less databases, or NoSQL databases – There are several database types that fit into this category, such as key-value stores and document stores, which focus on the storage and retrieval of large volumes of unstructured, semi-structured, or even structured data. MongoDB, Cassandra and HBase, to name a few.