Hadoop is a software platform, a package built on a set of common Java libraries.
What are Java libraries?
These are compiled code modules that automatically take care of data-distribution work such as scheduling.
Hadoop ships with a number of these common libraries. Since Hadoop is written in Java, the Java libraries, packaged code, and utilities form the first part of Hadoop's components.
Hadoop YARN: Framework for job scheduling and cluster resource management
This component handles job scheduling, cluster resource management, and so on.
That is, it manages the resources of the cluster: how many cores there are, how much data, the hard disk size, and the RAM size of each computer.
Thus, Hadoop YARN is one of the important components.
Hadoop Distributed File System: Distributed data storage
This handles distributed data storage, which matters because big data is not manageable on a single machine.
Thus, storing data efficiently, without losing any of it, is very important.
Hadoop MapReduce: Parallel processing and distributed computing
This is the set of libraries that enables distributed computing.
In MapReduce, the map function is applied to the data chunks stored in HDFS, the output of the map function is passed to the reduce function, and the reduce function produces the desired output.
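To make this concrete, below is a minimal sketch of the classic word-count job using Hadoop's Java MapReduce API; the input and output paths are supplied on the command line, and the class names are illustrative.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: for each word in the input split, emit (word, 1).
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum all the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // pre-aggregate map output
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The combiner here is an optimization: it pre-aggregates map output on each node before the data crosses the network to the reducers.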
HDFS
HDFS is the abbreviation for Hadoop Distributed File System.
HDFS is designed to store very large files across machines in a large network.
Any file that is kept on HDFS is divided into small pieces and distributed.
Each file is divided into a sequence of equal-size blocks, 64 MB by default in older Hadoop versions and 128 MB in newer ones; only the last block of a file may be smaller.
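As a quick check, the sketch below reads the block size of a file through Hadoop's FileSystem API; the path /user/demo/input.txt is hypothetical, so replace it with a file that exists on your cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeCheck {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Hypothetical path: substitute a real file on your cluster.
    FileStatus status = fs.getFileStatus(new Path("/user/demo/input.txt"));
    System.out.println("Block size: "
        + status.getBlockSize() / (1024 * 1024) + " MB");
  }
}
```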
HDFS Replication
Blocks are replicated on different machines in the cluster for fault tolerance.
Replica placement is not random; it is optimized to improve the following (see the sketch after this list):
Reliability
Availability
Network bandwidth utilization
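The replication factor is set per file and can be inspected or changed through the FileSystem API. A minimal sketch, assuming a hypothetical file path:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/user/demo/input.txt"); // hypothetical path

    // Read the current replication factor (the cluster default is usually 3).
    short current = fs.getFileStatus(file).getReplication();
    System.out.println("Current replication: " + current);

    // Raise the replication factor for this one file; HDFS creates
    // the extra replicas in the background.
    fs.setReplication(file, (short) 4);
  }
}
```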
Name node and Data node
HDFS stores data in small pieces called blocks and manages these blocks using a master-slave architecture.
Name node:
The Master node.
The name node stores metadata such as namespace information and block information.
It keeps track of which blocks are on which slave nodes and where the replicas of each data block live on the data nodes.
Data node:
The Slave nodes.
The blocks of data are stored on data nodes.
Data nodes are not smart; their main role is simply to store the data.
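A client can see this master/slave split in action: it asks the name node where a file's blocks live, then reads the blocks directly from the data nodes. A minimal sketch using the FileSystem API (the file path is hypothetical):

```java
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Hypothetical path: substitute a real file on your cluster.
    FileStatus status = fs.getFileStatus(new Path("/user/demo/input.txt"));

    // Ask the name node which data nodes hold each block of the file.
    BlockLocation[] blocks =
        fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.println("offset=" + block.getOffset()
          + " length=" + block.getLength()
          + " hosts=" + Arrays.toString(block.getHosts()));
    }
  }
}
```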
Resource Manager and Node Manager
Resource Manager
There is only one Resource Manager per cluster.
Resource Manager knows where the slaves are located (Rack Awareness).
Keeps track of how many resources each slave has.
It runs several services; the most important is the Resource Scheduler, which decides how to assign the resources.
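The scheduler is pluggable and is selected in the cluster configuration. A minimal sketch that reads which scheduler class is configured, assuming yarn-site.xml is on the classpath:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SchedulerCheck {
  public static void main(String[] args) {
    // YarnConfiguration loads yarn-default.xml and yarn-site.xml
    // from the classpath.
    Configuration conf = new YarnConfiguration();
    System.out.println("Configured scheduler: "
        + conf.get("yarn.resourcemanager.scheduler.class"));
  }
}
```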
Node Manager
There are many node managers in a cluster.
Node Manager is a slave to Resource Manager.
Each Node Manager tracks the available data-processing resources on its own slave node.
The processing resource capacity is the amount of memory and the number of cores.
At run-time, the Resource Scheduler will decide how to use this capacity.
The Node Manager sends regular reports (heartbeats) to the Resource Manager.
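The capacity each Node Manager advertises is set in the cluster configuration. A minimal sketch that reads these settings; 8192 MB and 8 vcores are the stock defaults when yarn-site.xml does not override them:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class NodeCapacity {
  public static void main(String[] args) {
    Configuration conf = new YarnConfiguration();
    // These properties declare how much capacity a Node Manager offers
    // to the Resource Scheduler.
    int memoryMb = conf.getInt("yarn.nodemanager.resource.memory-mb", 8192);
    int vcores = conf.getInt("yarn.nodemanager.resource.cpu-vcores", 8);
    System.out.println("Node capacity: " + memoryMb + " MB, "
        + vcores + " vcores");
  }
}
```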