Introduction to Big Data and Hadoop

Tarique Akhtar
Dec 28, 2021 · 5 min read
Photo by Júnior Ferreira on Unsplash

In this article we will talk about Big Data and why there is so much buzz around it. Big data is simply data that cannot be stored and processed on a single system. Say you have a one-gigabyte file today; that one gigabyte is probably not big data for you. But if later you have a petabyte of data, you can no longer process it on a single machine.

So the definition of big data: data that keeps growing, arrives in different formats, and cannot be stored and processed on a single machine.

Nowadays, ordinary people contribute to big data. You may ask, how? For example:

  1. You upload videos to YouTube.
  2. You post comments on Facebook.
  3. You do anything on Twitter.
  4. You purchase something on Amazon, where each and every click is recorded.
  5. You watch movies on Netflix and provide signals such as likes and dislikes.

Based on the actions you take on these different platforms, companies build recommendation systems and promote new products.

The next question is how big data is processed and how that differs from our local system. The answer: through a big data cluster.

Big Data Clusters:

  • Multiple machines connected to each other over a network so that they act as a single machine form a big data cluster. The image below shows an example of a cluster.
Big data cluster (Image by author)
  • These machines are commodity hardware: cheap machines consisting of just a CPU, RAM, and disk.
  • These commodity machines are stacked together on racks and installed in a physical location called a data center. The image below shows what a data center looks like.
Photo by imgix on Unsplash

Such a cluster is horizontally scalable, which means that as more data keeps coming in, you can keep adding machines to the cluster to process it. This is an on-premises cluster.

Nowadays we also have cloud clusters, where you don't have to buy all this commodity hardware yourself. Tech giants like Amazon, Microsoft, and Google provide the cloud infrastructure.

Big Data Pipeline:

The diagram below highlights the different stages of a big data pipeline.

Big data pipeline (Image from author)

Big data ingestion: The data can come from various sources, such as a database, a server, a social networking site, or an FTP server. There can be multiple sources from which you want to ingest your data; this stage is called big data ingestion.

Data Cleaning and processing: Once you have the data, it may not be clean, so you need to clean, validate, and then process it.

Data Analysis: Once the data is clean, you might perform some statistical analysis. This analysis helps you understand patterns in the data and make forecasts for the future.

Data Visualization: You also need to present your analysis using visualizations and dashboards to highlight business insights.

What is Hadoop?

Hadoop is a big data ecosystem. I have divided its definition into a few points so that it is easy to remember and you can form your own definition when somebody asks you what exactly Hadoop is.

  1. Open-source component
  2. Part of the Apache Software Foundation
  3. Java-based programming framework
  4. Data storage
  5. Processing of large datasets
  6. Distributed environment
  7. Commodity hardware

HDFS (Hadoop Distributed File System):

When data becomes too large to fit on a single machine, it becomes necessary to break it up and distribute it across multiple machines. This kind of big data gave rise to distributed file systems.

Hadoop comes with HDFS, the Hadoop Distributed File System. It is the primary data storage in Hadoop. It is distributed, which means that when Hadoop takes a file, it breaks the file into smaller blocks of 128 MB (the default block size) and distributes those blocks across the machines of the cluster. Moreover, it replicates those blocks to provide fault tolerance in case of failures.

HDFS Block example (Image by author)

Let’s take an example. Suppose we have a 1 GB file, i.e. 1024 MB, and the HDFS block size is 128 MB. When we send this 1 GB file to HDFS, it gets divided into 8 blocks of 128 MB each, as shown in the image above. Moreover, if the replication factor is 3, each of these 8 blocks is replicated three times, giving 24 block copies in total.
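Both numbers in this example are configurable. Here is a minimal sketch, assuming a working Hadoop client; bigfile.dat and /data are placeholder names:

    # Copy a 1 GB file into HDFS, explicitly setting the block size (128 MB) and the
    # replication factor (3) for this upload; these are also common cluster-wide
    # defaults (dfs.blocksize and dfs.replication).
    hdfs dfs -D dfs.blocksize=134217728 -D dfs.replication=3 -put bigfile.dat /data/bigfile.dat

    # Lower the replication factor of the stored file to 2, so its 8 blocks now
    # account for 8 x 2 = 16 copies instead of 24.
    hdfs dfs -setrep -w 2 /data/bigfile.dat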

All of these operations happen behind the scenes. You don't have to worry about the splitting and replication; it is taken care of by the Hadoop Distributed File System. We just need to use Hadoop commands and everything is done in the background.
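If you are curious about what happened behind the scenes, the fsck utility that ships with HDFS reports how a file was split and where its replicas live (the path below is the placeholder file from the sketch above):

    # Show the file's blocks, their sizes and the data nodes holding each replica
    hdfs fsck /data/bigfile.dat -files -blocks -locations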

Master-Slave Architecture:

HDFS uses a master-slave architecture. It consists of two types of nodes.

HDFS Architecture (Image by Author)

Name Node: The first one is the name node. It stores the metadata about each file, i.e. which blocks make up the file and on which machines of the cluster those blocks reside. The secondary name node is not a hot standby of the name node; it periodically merges the name node's edit log into its file-system image (checkpointing). In a high-availability setup, a separate standby name node takes over the name node's responsibilities if the active one goes down.

Data Node: The second type of node is the data node, which actually stores the data blocks. So we have one name node acting as the master and several data nodes acting as workers (slaves).

When reading data, the client asks the name node for the locations of a file's blocks and then reads those blocks directly from the data nodes, reassembling them into the full file.
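From the client's point of view, a read is a single command; the name node supplies the block locations and the data nodes stream the actual bytes. A minimal sketch, reusing the placeholder path from the earlier example:

    # Stream the file to the terminal; blocks are fetched from the data nodes
    # and reassembled in order
    hdfs dfs -cat /data/bigfile.dat | head

    # Or copy the whole file back to the local file system
    hdfs dfs -get /data/bigfile.dat ./bigfile.dat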

Hadoop Commands:

To perform any operation on the Hadoop file system, we need to use Hadoop commands, which are very similar to Linux commands.

Hadoop commands(Image by author)
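A few everyday examples are shown here; the directory and file names are just placeholders, and the image above lists more:

    hdfs dfs -ls /data                               # list a directory (like ls)
    hdfs dfs -mkdir -p /data/raw                     # create directories (like mkdir -p)
    hdfs dfs -put local.csv /data/raw/               # copy a local file into HDFS
    hdfs dfs -cat /data/raw/local.csv                # print a file (like cat)
    hdfs dfs -cp /data/raw/local.csv /data/copy.csv  # copy within HDFS (like cp)
    hdfs dfs -rm -r /data/raw                        # delete a directory recursively (like rm -r)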

You can download all the commands from my GitHub.

Thank you for reading.

