Hadoop is a software platform, a package built on a set of common Java libraries.
What are Java libraries?
These are compiled code modules that automatically take care of data-distribution work such as scheduling.
Hadoop ships with a number of these common libraries. Since Hadoop is written in Java, the Java libraries, packaged code, and utilities form the first part of Hadoop's components.
Hadoop YARN: Framework for job scheduling and cluster resource management
This component handles job scheduling, cluster resource management, and so on.
That is, it manages the resources of the cluster: how many cores there are, how much data, the hard disk size, and the RAM size of each computer.
Thus, Hadoop YARN is one of the important components.
Hadoop Distributed File System: Distributed data storage
This handles distributed data storage, which matters because big data is not manageable on a single machine.
Thus, storing data efficiently, without losing any of it, is very important.
Hadoop MapReduce: Parallel processing and distributed computing
This is the set of libraries that enables distributed computing.
In MapReduce, the map function is applied to the data chunks stored in HDFS, the output of the map function is passed to the reduce function, and the reduce function produces the desired output.
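To make this concrete, below is a minimal sketch of the classic word-count job using Hadoop's Java MapReduce API; the input and output paths are supplied on the command line, and the class names are illustrative.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: for each word in the input split, emit (word, 1).
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum all the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // pre-aggregate map output
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The combiner here is an optimization: it pre-aggregates map output on each node before the data crosses the network to the reducers.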
HDFS
HDFS is the abbreviation for Hadoop Distributed File System.
HDFS is designed to store very large files across machines in a large network.
Any file that is kept on HDFS is divided into small pieces and distributed.
Each file is divided into a sequence of equal-size blocks, 64 MB by default in older Hadoop versions and 128 MB in newer ones; only the last block of a file may be smaller.
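As a quick check, the sketch below reads the block size of a file through Hadoop's FileSystem API; the path /user/demo/input.txt is hypothetical, so replace it with a file that exists on your cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeCheck {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Hypothetical path: substitute a real file on your cluster.
    FileStatus status = fs.getFileStatus(new Path("/user/demo/input.txt"));
    System.out.println("Block size: "
        + status.getBlockSize() / (1024 * 1024) + " MB");
  }
}
```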
HDFS Replication
Blocks are replicated on different machines in the cluster for fault tolerance.
Replica placement is not random; it is optimized to improve the following (see the sketch after this list):
Reliability
Availability
Network bandwidth utilization
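The replication factor is set per file and can be inspected or changed through the FileSystem API. A minimal sketch, assuming a hypothetical file path:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/user/demo/input.txt"); // hypothetical path

    // Read the current replication factor (the cluster default is usually 3).
    short current = fs.getFileStatus(file).getReplication();
    System.out.println("Current replication: " + current);

    // Raise the replication factor for this one file; HDFS creates
    // the extra replicas in the background.
    fs.setReplication(file, (short) 4);
  }
}
```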
Name node and Data node
HDFS stores data in small pieces called blocks and manages these blocks using a master-slave architecture.
Name node:
The Master node.
The name node stores metadata such as namespace information and block information.
It keeps track of which blocks are on which slave nodes and where the replicas of each data block live on the data nodes.
Data node:
The Slave nodes.
The blocks of data are stored on data nodes.
Data nodes are not smart; their main role is simply to store the data.
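A client can see this master/slave split in action: it asks the name node where a file's blocks live, then reads the blocks directly from the data nodes. A minimal sketch using the FileSystem API (the file path is hypothetical):

```java
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Hypothetical path: substitute a real file on your cluster.
    FileStatus status = fs.getFileStatus(new Path("/user/demo/input.txt"));

    // Ask the name node which data nodes hold each block of the file.
    BlockLocation[] blocks =
        fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.println("offset=" + block.getOffset()
          + " length=" + block.getLength()
          + " hosts=" + Arrays.toString(block.getHosts()));
    }
  }
}
```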
Resource Manager and Node Manager
Resource Manager
There is only one Resource Manager per cluster.
Resource Manager knows where the slaves are located (Rack Awareness).
Keeps track of how many resources each slave has.
It runs several services; the most important is the Resource Scheduler, which decides how to assign the resources.
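The scheduler is pluggable and is selected in the cluster configuration. A minimal sketch that reads which scheduler class is configured, assuming yarn-site.xml is on the classpath:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SchedulerCheck {
  public static void main(String[] args) {
    // YarnConfiguration loads yarn-default.xml and yarn-site.xml
    // from the classpath.
    Configuration conf = new YarnConfiguration();
    System.out.println("Configured scheduler: "
        + conf.get("yarn.resourcemanager.scheduler.class"));
  }
}
```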
Node Manager
There are many node managers in a cluster.
Node Manager is a slave to Resource Manager.
Each Node Manager tracks the available data-processing resources on its own slave node.
The processing resource capacity is the amount of memory and the number of cores.
At run-time, the Resource Scheduler will decide how to use this capacity.
The Node Manager sends regular reports (heartbeats) to the Resource Manager.
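The capacity each Node Manager advertises is set in the cluster configuration. A minimal sketch that reads these settings; 8192 MB and 8 vcores are the stock defaults when yarn-site.xml does not override them:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class NodeCapacity {
  public static void main(String[] args) {
    Configuration conf = new YarnConfiguration();
    // These properties declare how much capacity a Node Manager offers
    // to the Resource Scheduler.
    int memoryMb = conf.getInt("yarn.nodemanager.resource.memory-mb", 8192);
    int vcores = conf.getInt("yarn.nodemanager.resource.cpu-vcores", 8);
    System.out.println("Node capacity: " + memoryMb + " MB, "
        + vcores + " vcores");
  }
}
```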