
301.1.6-Basics of MapReduce

MapReduce is not new

  • If we want to achieve something big, we need not do it in one go; we can divide the whole problem into pieces. The initial input is the raw data, the map step produces intermediate output, and the reduce step combines that output into the final result.

  • Thus, the idea was there earlier; now, with modern network programming and network computing, we can achieve MapReduce-style distributed computing.
  • To handle big data, we need to write MapReduce programs; simple sequential programs won't scale.
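The divide-then-combine idea above can be sketched with plain Python built-ins, with no distributed framework at all (the record values here are made up for illustration):

```python
# A minimal sketch of the map/reduce idea using plain Python built-ins.
# Map transforms each piece into an intermediate value; reduce combines
# the intermediate values into one final answer.
from functools import reduce

records = ["alice,23", "bob,31", "carol,27", "dave,45"]

# Map: turn each raw record into an intermediate value (a count of 1).
intermediate = list(map(lambda record: 1, records))

# Reduce: combine the intermediate values into the final result.
total = reduce(lambda a, b: a + b, intermediate, 0)

print(total)  # 4
```

Distributed MapReduce keeps exactly this structure, but runs the map calls on many machines in parallel.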

MapReduce programs

  • The conventional program to count the number of records in a data file:
  • The MapReduce program to count the number of records in a big data file:
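The two programs referred to above are not reproduced in this page; a minimal Python sketch of each might look like this (the sample lines and the three-way block split are assumptions for illustration):

```python
# Conventional record count: one machine reads the whole file sequentially.
def count_records(lines):
    count = 0
    for _ in lines:
        count += 1
    return count

# MapReduce-style record count: each mapper counts its own block of lines,
# and the reducer sums the per-block counts.
def map_count(block):
    return len(block)           # runs independently on each block

def reduce_counts(partial_counts):
    return sum(partial_counts)  # combines the mappers' outputs

lines = ["r1", "r2", "r3", "r4", "r5", "r6"]

# Split the data into blocks, as a cluster would store it on local machines.
blocks = [lines[0:2], lines[2:4], lines[4:6]]

conventional = count_records(lines)
distributed = reduce_counts(map_count(b) for b in blocks)
print(conventional, distributed)  # 6 6
```

Both versions give the same answer; the difference is that each `map_count` call can run on a different machine, close to where its block is stored.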

More than just a MapReduce program

  • Even after writing the MapReduce program to count the records in a big data file, several questions remain:
  • Who will set up the network of machines and store the data locally?
  • Who will divide and send the map program to the local machines and invoke the reduce program on top of the map outputs?
  • What if one machine in the cluster is very slow?
  • What if there is a hardware failure on one of the machines?
    • It is not just MapReduce, and it is not that simple. It is much more than MapReduce.

Additional scripts for work distribution

  1. We first need to set up a cluster of machines, then divide the whole data set into blocks and store them on the local machines.
  2. We also need to assign a master node that takes charge of all the metadata: which block of data is on which machine.
  3. We need to write a script that takes care of work scheduling, distribution of tasks, and job orchestration.
  4. We also need to assign worker slots to execute the map and reduce functions.
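As a toy illustration of the master node's bookkeeping in steps 1–2, blocks can be assigned to machines round-robin; the names here are purely illustrative assumptions, not Hadoop APIs:

```python
# Toy sketch of a master node's metadata: which block lives on which
# machine, assigned round-robin. All names are illustrative.
def assign_blocks(blocks, machines):
    metadata = {}
    for i, block in enumerate(blocks):
        metadata[block] = machines[i % len(machines)]
    return metadata

blocks = ["block-0", "block-1", "block-2", "block-3"]
machines = ["worker-a", "worker-b"]

print(assign_blocks(blocks, machines))
# {'block-0': 'worker-a', 'block-1': 'worker-b', 'block-2': 'worker-a', 'block-3': 'worker-b'}
```

A real system such as HDFS also replicates each block to several machines, but the core idea is the same: the master only holds the map of blocks to locations, not the data itself.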

Additional scripts for efficiency

  • We need to write scripts for load balancing (What if one machine in the cluster is very slow?).
  • We also need to write scripts for data backup, replication, and fault tolerance (What if the intermediate data is only partially read?).
  • Finally, we write the MapReduce code that solves our problem.
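One simple fault-tolerance idea, sketched below with hypothetical names (this is not how Hadoop is implemented, just the principle): if a worker fails while running a task, the scheduler reruns the task on another machine.

```python
# Toy sketch of fault tolerance: if a worker fails while running a map
# task, the scheduler simply retries the task on the next machine.
def run_with_retry(task, workers):
    for worker in workers:
        try:
            return worker(task)
        except RuntimeError:  # simulate a hardware failure
            continue          # try the next machine
    raise RuntimeError("all workers failed")

def flaky_worker(task):
    raise RuntimeError("disk failure")

def healthy_worker(task):
    return f"done: {task}"

print(run_with_retry("map-task-7", [flaky_worker, healthy_worker]))
# done: map-task-7
```

Because map tasks are side-effect-free functions of their input block, rerunning one on a different machine is always safe; this is why frameworks can hide failures from the programmer.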

Implementation of MapReduce is difficult

  • Analysis of big data can give us awesome insights.
  • But the datasets are huge, complex, and difficult to process.
  • The solution is distributed computing, i.e., MapReduce.
  • But all of this data storage, parallel processing, job orchestration, and network setup looks complicated.
  • What is the solution?
  • Is there a ready-made tool or platform that can take care of all these tasks?
    • Hadoop


22nd March 2018