What is Big Data Hadoop?
Until recently, the importance of data was not widely recognized. A tremendous amount of data was discarded by stores, hospitals, transport providers, stock exchanges, NGOs, government bodies, research organizations, social media sites, and companies, simply because no one knew how that data could lead to groundbreaking discoveries. The term 'Big Data' was first coined in 2005 by Roger Magoulas of O'Reilly Media.
The term Big Data means exactly what it says: a colossal amount of data that is impossible to handle, or to draw meaningful interpretations from, using traditional business intelligence tools. Hadoop evolved to address this. It was created in 2006 by Doug Cutting and Mike Cafarella, with heavy backing from Yahoo, building on Google's MapReduce and Google File System papers.
Benefits of Big Data:
Big Data refers to large datasets that have proved to be of immense value for in-depth analysis. A few examples illustrate its uses:
- Based on past customer purchases, retail stores like Walmart can build recommendation systems and offer discounts on items customers are likely to purchase.
- The banking sector, with the help of past trends, is able to identify probable defaulters before issuing loans.
- The healthcare industry and insurance companies rely on Big Data to improve disease diagnosis and to catch errors introduced by faulty testing practices.
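The retail example above can be sketched as a simple item co-occurrence recommender. The basket data and the `recommend` helper below are hypothetical, shown only to make the idea concrete:

```python
from collections import Counter
from itertools import combinations

# Hypothetical past baskets (one list of items per customer visit)
baskets = [
    ["bread", "milk", "eggs"],
    ["bread", "milk"],
    ["milk", "eggs", "butter"],
    ["bread", "butter"],
]

# Count how often each pair of items was bought together
co_counts = Counter()
for basket in baskets:
    for a, b in combinations(sorted(set(basket)), 2):
        co_counts[(a, b)] += 1

def recommend(item, top_n=2):
    """Return the items most frequently bought together with `item`."""
    scores = Counter()
    for (a, b), n in co_counts.items():
        if a == item:
            scores[b] += n
        elif b == item:
            scores[a] += n
    return [other for other, _ in scores.most_common(top_n)]

print(recommend("bread"))  # milk co-occurs most often with bread
```

A production system would use far larger purchase histories and more sophisticated models, but the underlying signal (items that co-occur in past purchases) is the same.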
Challenges of Big Data:
Up to 2003, when most data was simply not saved, the entire body of stored data amounted to about 5 billion gigabytes. By 2013, that much data was being collected every ten minutes, and there is no stopping the massive amount of data produced every minute today. Naturally, containing such an immense quantity of data is challenging. The biggest challenges associated with Big Data are the following:
- Capturing the data
- Storing the data
- Analyzing the data
Other challenges include sharing, transferring, and presenting the data.
What does Big Data include?
Based on its features, Big Data is broadly characterized by six V's: Volume, Velocity, Variety, Veracity (authenticity), Value, and Variability. Big Data can be classified into three types:
- Structured Data – Relational Data
- Semi-Structured Data – XML Data
- Unstructured Data – Video, audio, text, logs
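The distinction between the three types can be illustrated with the same piece of information in each form. The snippets below are hypothetical examples, not data from any real system:

```python
import xml.etree.ElementTree as ET

# Structured: a fixed-schema row, as it would sit in a relational table
structured_row = {"id": 101, "name": "Alice", "balance": 250.0}

# Semi-structured: XML carries a flexible, self-describing schema in its tags
xml_doc = "<customer><id>101</id><name>Alice</name></customer>"
root = ET.fromstring(xml_doc)
name_from_xml = root.find("name").text

# Unstructured: free text (here, a log line) with no declared schema at all
log_line = "2023-04-01 12:00:01 INFO customer Alice logged in"

print(structured_row["name"], name_from_xml)
```

Structured data fits neatly into RDBMS tables; semi-structured data needs a parser but still exposes fields; unstructured data must be mined for meaning.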
How does Hadoop help in managing Big Data?
The biggest challenge associated with Big Data is its non-uniformity. Data collected across various platforms can be structured, semi-structured, or unstructured (text, video, audio, pictures, Facebook posts, and so on), which a traditional RDBMS (Relational Database Management System) cannot handle. This led to the emergence of newer data-processing frameworks like Hadoop.
Hadoop is a framework that allows enormous amounts of unconventional data to be stored across a distributed cluster and processed in parallel. At its core, Hadoop has two components: Hadoop storage (HDFS), where unstructured data can be dumped across the cluster, and Hadoop YARN, which handles resource management.
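The "store distributed, process in parallel" idea is easiest to see in Hadoop's MapReduce model. The sketch below is only a local Python simulation of the map, shuffle, and reduce phases on a word count, not Hadoop's actual API:

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as Hadoop does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine each key's values into a final count."""
    return {word: sum(counts) for word, counts in groups.items()}

# Two "blocks" of input, as if stored on different machines in the cluster
chunks = ["big data big ideas", "data drives ideas"]

mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(mapped))
print(counts)  # {'big': 2, 'data': 2, 'ideas': 2, 'drives': 1}
```

In real Hadoop, each `map_phase` call would run on the machine that already holds that block of data, so computation moves to the data rather than the other way around.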
HDFS
Hadoop Distributed File System (HDFS) is considered one of the most reliable and fault-tolerant storage solutions. Though at large HDFS appears to be a single unit where one can dump all kinds of data, it inherently manages that data in a distributed fashion. It follows a master-slave architecture in which the NameNode acts as the master daemon, storing information about the DataNodes where the actual data resides.
NameNodes:
- Stores metadata about the data on the DataNodes – file sizes, permissions, directory hierarchy, block locations, etc.
- Receives heartbeats and block reports from the DataNodes.
DataNodes:
- Stores the actual data.
- Send a heartbeat to the NameNode every 3 seconds by default.
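The heartbeat mechanism can be sketched with a toy NameNode that marks a DataNode dead once no heartbeat arrives within a timeout. The classes below are illustrative Python, not Hadoop code; the ~10.5-minute timeout mirrors the HDFS default, which is derived from the 3-second heartbeat interval:

```python
class NameNode:
    """Toy NameNode: tracks the last heartbeat time of each DataNode."""

    def __init__(self, timeout=630.0):  # ~10.5 minutes, roughly the HDFS default
        self.timeout = timeout
        self.last_heartbeat = {}

    def heartbeat(self, datanode_id, now):
        # DataNodes report in periodically (every 3 seconds in HDFS)
        self.last_heartbeat[datanode_id] = now

    def live_datanodes(self, now):
        """DataNodes whose most recent heartbeat is within the timeout window."""
        return [dn for dn, t in self.last_heartbeat.items()
                if now - t <= self.timeout]

nn = NameNode()
nn.heartbeat("dn1", now=0.0)
nn.heartbeat("dn2", now=0.0)
nn.heartbeat("dn1", now=600.0)       # dn1 keeps reporting; dn2 goes silent
print(nn.live_datanodes(now=700.0))  # ['dn1'] (dn2 missed its window)
```

When the real NameNode declares a DataNode dead this way, it re-replicates that node's blocks onto healthy DataNodes, which is what makes HDFS fault tolerant.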