301.1.6-Basics of MapReduce

MapReduce is not new

  • If we want to achieve a big task, we need not do it in one go; we can divide the whole problem into pieces. We start with the raw input data, process each piece to get intermediate output, and then reduce the intermediate outputs into the final result.

  • Thus, the idea itself is not new; with modern networking and network computing, we can now realize it as MapReduce-style distributed computing.
  • To handle big data, we need to write MapReduce programs; simple sequential programs won't scale.
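The divide-process-combine idea above can be sketched on a single machine. This is only an illustration of the pattern (the data, chunk size, and summing task are made up for the example), not a real distributed framework:

```python
# Minimal sketch of the map -> intermediate -> reduce idea on one machine.
from functools import reduce

raw_data = list(range(1, 11))  # the raw input data

# Divide the whole problem into pieces (chunks of the input).
chunks = [raw_data[i:i + 3] for i in range(0, len(raw_data), 3)]

# Map: process each piece independently to get an intermediate output.
intermediate = [sum(chunk) for chunk in chunks]

# Reduce: combine the intermediate outputs into the final result.
total = reduce(lambda a, b: a + b, intermediate)

print(total)  # 55, the same as sum(raw_data)
```

In a real cluster, each chunk would sit on a different machine and the map step would run on all of them in parallel; the reduce step then only sees the small intermediate results.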

MapReduce programs

  • The conventional program to count the number of records in a data file reads the whole file sequentially on one machine.
  • The MapReduce program to count the number of records in a big data file counts each block locally (map) and then adds up the partial counts (reduce).
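The two counting approaches above can be contrasted in a short sketch. The function names, the sample records, and the way the data is split into blocks are illustrative assumptions, not part of any real framework:

```python
def conventional_count(lines):
    """Conventional program: read every record sequentially on one machine."""
    count = 0
    for _ in lines:
        count += 1
    return count

def map_count(block):
    """Map: each machine counts only the records in its local block."""
    return len(block)

def reduce_count(partial_counts):
    """Reduce: add up the per-block counts into the final answer."""
    return sum(partial_counts)

# Ten sample records, stored as three blocks on three (imaginary) machines.
records = ["record-%d" % i for i in range(10)]
blocks = [records[0:4], records[4:8], records[8:10]]

sequential = conventional_count(records)
distributed = reduce_count(map_count(b) for b in blocks)
print(sequential, distributed)  # both are 10
```

The answers agree, but only the second version can run the per-block counts on different machines in parallel.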

More than just a MapReduce program

  • Writing the MapReduce program to count the number of records in a big data file is only part of the work:
  • Who will set up the network of machines and store the data locally?
  • Who will divide and send the map program to the local machines and call the reduce program on top of the map output?
  • What if one machine in the cluster is very slow?
  • What if there is a hardware failure on one of the machines?
    • It is not just MapReduce; it is not that simple. There is much more to it than the MapReduce code itself.

Additional scripts for work distribution

  1. We need to first set up a cluster of machines, then divide the whole dataset into blocks and store them on the local machines.
  2. We also need to assign a master node that takes charge of all the metadata, i.e., which block of data resides on which machine.
  3. We need to write a script that takes care of work scheduling, distribution of tasks, and job orchestration.
  4. We also need to assign worker slots to execute the map and reduce functions.
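The bookkeeping described in steps 2-4 can be sketched as plain data structures plus a toy scheduler. Everything here (`BLOCK_LOCATIONS`, `WORKER_SLOTS`, `schedule_map_tasks`, the machine names) is hypothetical and simplified; real systems such as Hadoop handle this internally:

```python
# 2. Master-node metadata: which block of data resides on which machine
#    (each block is stored on more than one machine, i.e., replicated).
BLOCK_LOCATIONS = {
    "block-0": ["machine-1", "machine-3"],
    "block-1": ["machine-2", "machine-1"],
    "block-2": ["machine-3", "machine-2"],
}

# 4. Worker slots available on each machine to execute map/reduce functions.
WORKER_SLOTS = {"machine-1": 2, "machine-2": 1, "machine-3": 2}

def schedule_map_tasks(block_locations, worker_slots):
    """3. Toy scheduler: assign each map task to a machine that holds the
    block locally and still has a free worker slot (data locality)."""
    free = dict(worker_slots)
    assignments = {}
    for block, machines in block_locations.items():
        for machine in machines:
            if free.get(machine, 0) > 0:
                assignments[block] = machine
                free[machine] -= 1
                break
    return assignments

print(schedule_map_tasks(BLOCK_LOCATIONS, WORKER_SLOTS))
```

The point of the sketch is that even this trivial version needs metadata, slot accounting, and a placement policy; a production scheduler must also handle failures, stragglers, and rebalancing.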

Additional scripts for efficiency

  • We need to write scripts for load balancing (what if one machine in the cluster is very slow?).
  • We also need to write scripts for data backup, replication, and fault tolerance (what if the intermediate data is only partially read?).
  • Finally, we write the MapReduce code that solves our problem.
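The fault-tolerance idea above can be sketched as "re-run the map task on another machine that holds a replica of the block." The function names, machine names, and the string stand-in for real map output are all illustrative assumptions:

```python
def run_map_on(machine, block, failed_machines):
    """Run the map task on one machine; raise if that machine is down."""
    if machine in failed_machines:
        raise RuntimeError("hardware failure on " + machine)
    return "counted:" + block  # stand-in for the real map output

def run_with_failover(block, replica_machines, failed_machines):
    """Try each machine holding a replica of the block until one succeeds."""
    for machine in replica_machines:
        try:
            return run_map_on(machine, block, failed_machines)
        except RuntimeError:
            continue  # this machine is down; fall back to the next replica
    raise RuntimeError("all replicas failed for " + block)

# machine-1 is down, but the block is also replicated on machine-3.
result = run_with_failover("block-0", ["machine-1", "machine-3"], {"machine-1"})
print(result)  # counted:block-0
```

This is why replication matters: without a second copy of the block, the failure of one machine would stall the entire job.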

Implementation of MapReduce is difficult

  • Analysis of big data can give us awesome insights.
  • But datasets are huge, complex, and difficult to process.
  • The solution is distributed computing, i.e., MapReduce.
  • But it looks like the data storage, parallel processing, job orchestration, and network setup are complicated.
  • What is the solution?
  • Is there a ready-made tool or platform that can take care of all these tasks?
    • Hadoop


15th May 2017
