301.1.4-Handling Big Data

Bigdata Tool

A supercomputer is one solution.
Put multiple CPUs in one machine (100?) and it will give the result quickly.
Consider a normal laptop: handling big data on it is very difficult. The data set itself may be 1 PB or even 16 PB, while a normal system might have just 1 TB of hard disk space. Acquiring or storing the data is already difficult, let alone analyzing it.
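As a rough back-of-the-envelope check of these sizes, the sketch below counts how many 1 TB disks would be needed just to store such a data set. It assumes decimal units (1 TB = 10^12 bytes, 1 PB = 10^15 bytes), and the function name disks_needed is made up for illustration.

```python
TB = 10**12  # bytes in one terabyte (decimal units, an assumption)
PB = 10**15  # bytes in one petabyte (decimal units, an assumption)


def disks_needed(dataset_bytes, disk_bytes=TB):
    """Return how many disks of disk_bytes are needed to hold dataset_bytes."""
    # Ceiling division using integers only, so huge sizes stay exact.
    return -(-dataset_bytes // disk_bytes)


if __name__ == "__main__":
    for size_pb in (1, 16):
        print(size_pb, "PB needs", disks_needed(size_pb * PB), "disks of 1 TB")
```

Under these assumptions, a 1 PB data set is a thousand times larger than the laptop's entire disk, before any space is left for the analysis itself.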
We could use a supercomputer instead: rather than one CPU it has many CPUs, and rather than one hard disk it has a huge amount of storage, so it can handle big data.
The problem is that building a supercomputer costs so much that only institutes like NASA or ISRO, or really big companies, can afford one.
The cost of buying a supercomputer can sometimes be higher than the value of whatever results you are going to get out of the big data.
A large dataset should not mean that we have to invest heavily in a single computer.
A supercomputer is a solution, but not a cost-effective one. For individuals, it is almost impossible to buy a supercomputer just to perform these operations.

Until about 1985, there was no practical, affordable way to connect many computers together.
Most computers were centralized, standalone systems.
Multiprocessor systems or supercomputers were the only options for big data problems.
From the mid-1980s onward, powerful microprocessors and high-speed computer networks became available.

Computer networks (LANs and WANs) led to distributed systems.
A distributed system is a collection of independent computers that appears to its users as a single coherent system.
We can use low-priced connected computers to process our big data.

A cluster is simply a few machines connected to each other through a network.
A collection of independent computers that are joined together using a LAN is called a computer cluster.
We can use distributed computing, or cluster computing, to handle big data, since it is really difficult for a single machine to handle it.

We start with the overall task, divide the data into smaller pieces, and place them on the different machines.
Each of these smaller, low-end machines can handle a small data set, so if we have a huge data set, we divide it into smaller pieces and distribute them across all the machines.
Then we connect the machines using a LAN or WAN, and the whole cluster of computers looks like one really big supercomputer; we can make it work like one.
In short: put a piece of the data on each machine, divide the overall problem into smaller pieces, and run them locally on each machine.
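The divide, distribute and run-locally idea above can be sketched on a single computer. This is only an illustration under stated assumptions: worker threads stand in for the machines of a cluster, the "overall problem" is simply summing a list of numbers, and the names split_into_chunks, local_sum and cluster_sum are made up for this example. A real cluster would also need to move data over the network and cope with machine failures, which the sketch ignores.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, n_chunks):
    """Divide data into n_chunks contiguous pieces of nearly equal size."""
    size, extra = divmod(len(data), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional item each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def local_sum(chunk):
    """Work done locally by one 'machine': sum only its own piece."""
    return sum(chunk)


def cluster_sum(data, n_machines=4):
    """Split the data, process each piece in its own worker, combine results."""
    chunks = split_into_chunks(data, n_machines)
    # Each worker thread plays the role of one low-end machine (an assumption).
    with ThreadPoolExecutor(max_workers=n_machines) as pool:
        partial_results = list(pool.map(local_sum, chunks))
    # Combining the small partial results is cheap compared with the full data.
    return sum(partial_results)


if __name__ == "__main__":
    numbers = list(range(1, 101))
    print(cluster_sum(numbers, n_machines=4))
```

The point of the design is that each worker only ever sees its own piece, and only the small partial results have to be brought together at the end.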

22nd March 2018