301.1.6-Basics of MapReduce

MapReduce is not new

If we want to achieve something, we need not do it in one go. We can divide the whole problem into pieces: start with the raw data, run the map step on each piece to get an intermediate output, and then run the reduce step to combine those outputs.

Thus, the idea itself is old. What is new is that network computing lets us apply it as distributed computing across many machines.
To handle big data, we need to write MapReduce programs; simple single-machine programs are not enough.

MapReduce programs

The conventional program to count the number of records in a data file:

count=count+1

The MapReduce program to count the number of records in a big data file:

count=count+1          (map: count the records in each block)
cum_sum=cum_sum+sum    (reduce: add up the counts from the map step)
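
The two lines above can be sketched in Python as follows. This is only a minimal, single-machine illustration: the names `map_count`, `reduce_sum` and the sample `blocks` are assumptions made for this example, and the "machines" are simulated by a plain loop.

```python
def map_count(block):
    """Map step: count the records in one block (runs once per block)."""
    count = 0
    for _record in block:
        count = count + 1
    return count


def reduce_sum(partial_counts):
    """Reduce step: add up the per-block counts produced by the map step."""
    cum_sum = 0
    for partial in partial_counts:
        cum_sum = cum_sum + partial
    return cum_sum


# Hypothetical data: one file that has already been split into three blocks.
blocks = [
    ["r1", "r2", "r3"],
    ["r4", "r5"],
    ["r6", "r7", "r8", "r9"],
]

# In a real cluster each map call would run on the machine holding that block.
partial_counts = [map_count(block) for block in blocks]
total = reduce_sum(partial_counts)

print(partial_counts, total)  # [3, 2, 4] 9
```

Only the small per-block counts travel to the reduce step, not the records themselves, which is why the approach suits data that does not fit on one machine.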

More than just a MapReduce program

Who will set up the network of machines and store the data locally?
Who will divide the work, send the map program to the local machines, and call the reduce program on the map outputs?
What if one machine in the cluster is very slow?
What if there is a hardware failure in one of the machines?
- Writing the map and reduce functions is only part of the work. A working system needs much more than MapReduce code.

Additional scripts for work distribution

We first need to set up a cluster of machines, then divide the whole data set into blocks and store them on the local machines.
We also need to assign a master node that takes charge of all the metadata, i.e., which block of data is on which machine.
We need to write a script that takes care of work scheduling, distribution of tasks and job orchestration.
We also need to assign worker slots to execute the map and reduce functions.
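
To get a feel for what such scripts have to do, here is a rough Python sketch of the steps above. Everything in it is an assumption made for illustration: the helper names, the worker names, the block size and the round-robin placement rule are not taken from any real framework.

```python
def split_into_blocks(records, block_size):
    """Divide the whole data set into fixed-size blocks."""
    return [records[i:i + block_size] for i in range(0, len(records), block_size)]


def assign_blocks(num_blocks, machines):
    """Master-node metadata: which block is stored on which machine (round-robin)."""
    return {block_id: machines[block_id % len(machines)] for block_id in range(num_blocks)}


def schedule_map_tasks(metadata):
    """Group map tasks by machine so each task runs where its block is stored."""
    tasks = {}
    for block_id, machine in metadata.items():
        tasks.setdefault(machine, []).append(block_id)
    return tasks


# Hypothetical data set and cluster.
records = [f"record-{i}" for i in range(10)]
machines = ["worker-a", "worker-b", "worker-c"]

data_blocks = split_into_blocks(records, block_size=4)
metadata = assign_blocks(len(data_blocks), machines)
map_tasks = schedule_map_tasks(metadata)

print(metadata)   # {0: 'worker-a', 1: 'worker-b', 2: 'worker-c'}
print(map_tasks)  # {'worker-a': [0], 'worker-b': [1], 'worker-c': [2]}
```

A real cluster would also have to move the blocks over the network and track free worker slots; the sketch only shows the bookkeeping the master node is responsible for.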

Additional scripts for efficiency

We need to write scripts for load balancing (What if one machine in the cluster is very slow?).
We also need to write scripts for data backup, replication and fault tolerance (What if the intermediate data is only partially read?).
Finally, we write the MapReduce code that solves our problem.
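
As a small illustration of the replication and fault tolerance idea, the sketch below stores every block on two machines and reruns a task on the other copy when a machine fails. The function names, the replication factor of 2 and the placement rule are all assumptions chosen for this example, not the behaviour of any particular tool.

```python
def place_replicas(num_blocks, machines, replication=2):
    """Store each block on `replication` different machines (simple rotation)."""
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            machines[(block_id + k) % len(machines)] for k in range(replication)
        ]
    return placement


def pick_machine(block_id, placement, failed):
    """Return the first machine that holds the block and has not failed."""
    for machine in placement[block_id]:
        if machine not in failed:
            return machine
    raise RuntimeError(f"all replicas of block {block_id} are lost")


# Hypothetical cluster with three workers and three blocks.
machines = ["worker-a", "worker-b", "worker-c"]
placement = place_replicas(3, machines)

# Suppose worker-b has a hardware failure: its tasks move to a replica.
failed = {"worker-b"}
chosen = {block_id: pick_machine(block_id, placement, failed) for block_id in placement}

print(placement)  # {0: ['worker-a', 'worker-b'], 1: ['worker-b', 'worker-c'], 2: ['worker-c', 'worker-a']}
print(chosen)     # {0: 'worker-a', 1: 'worker-c', 2: 'worker-c'}
```

The same lookup could be used for a very slow machine: treat it as failed for scheduling purposes and send its task to the machine holding the other copy.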

Implementation of MapReduce is difficult

Analysis of big data can give us valuable insights.
But the datasets are huge, complex and difficult to process.
The solution is distributed computing with MapReduce.
But the data storage, parallel processing, job orchestration and network setup that it requires are complicated.
What is the solution?
Is there a ready-made tool or platform that can take care of all these tasks?
- Hadoop

22nd March 2018