Can you imagine running a query on a 20,980,000 GB file?
What if we get a new dataset like this every day?
What if we need to execute complex queries on this dataset every day?
Does anybody really deal with this type of data set?
Is it possible to store and analyze this data?
Yes. Back in 2008, Google was already processing more than 20 PB of data every day, and the volume it collects has only grown since then.
Running SQL queries on a single dataset of 20 PB is extremely difficult, as the rough calculation below shows.
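To get a feel for the scale, here is a back-of-the-envelope sketch of how long it would take just to read 20 PB sequentially. The 100 MB/s per-disk read speed is an assumed, typical figure for a spinning disk, used purely for illustration:

```python
# Back-of-the-envelope: how long would it take just to READ 20 PB
# sequentially? (Assumed sustained throughput: 100 MB/s per disk,
# a typical figure for a spinning disk -- purely illustrative.)

DATA_BYTES = 20 * 10**15          # 20 PB, decimal
DISK_READ_BPS = 100 * 10**6       # 100 MB/s, assumed per-disk speed

seconds = DATA_BYTES / DISK_READ_BPS
print(f"One disk    : ~{seconds / (3600 * 24 * 365):.1f} years")   # ~6.3 years

# Split the same scan across 1,000 disks working in parallel -- the core
# idea behind distributed processing -- and it becomes a matter of days.
print(f"1,000 disks : ~{seconds / 1000 / (3600 * 24):.1f} days")   # ~2.3 days
```

Even before considering CPU time or query complexity, a single machine cannot scan data of this size in any reasonable time, which is why such workloads have to be spread across many machines.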
Are there really big datasets?
Google processes 20 PB a day (2008).
The Wayback Machine has 3 PB + 100 TB/month (3/2009).
Facebook has 2.5 PB of user data + 15 TB/day (4/2009).
eBay has 6.5 PB of user data + 50 TB/day (5/2009).
CERN’s Large Hadron Collider (LHC) generates 15 PB a year.
In fact, in a minute…
Email users send more than 204 million messages;
The mobile Web gains 217 new users;
Google receives over 2 million search queries;
YouTube users upload 48 hours of new video;
Facebook users share 684,000 bits of content;
Twitter users send more than 100,000 tweets;
Consumers spend $272,000 on Web shopping;
Apple receives around 47,000 application downloads;
Brands receive more than 34,000 Facebook ‘likes’;
Tumblr blog owners publish 27,000 new posts;
Instagram users share 3,600 new photos;
Flickr users add 3,125 new photos;
Foursquare users perform 2,000 check-ins;
WordPress users publish close to 350 new blog posts.
Data is being generated at an enormous rate in many places; everything listed above happens within a single minute.
Conventional data handling tools and their limitations
Excel: Have you ever tried a pivot table on a 500 MB file?
Excel is a good tool for ad-hoc analysis, but if you try to open a file of 500 MB or even 1 GB it starts to hang the system; on a typical machine Excel cannot handle more than about 1 GB of data (see the rough calculation after these examples).
SAS/R: Have you ever tried a frequency table on a 2 GB file?
SAS and R are analytical tools, but they tend to give up when the dataset grows beyond about 2 GB.
Access: Have you ever tried running a query on a 10 GB file?
Access can handle data and queries up to about 10 GB, but beyond that it is not really going to help you.
SQL: Have you ever tried running a query on a 50 GB file?
A SQL database, even on very powerful hardware, can handle up to around 50 GB of data, but beyond that it starts to struggle.
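As a concrete illustration of the Excel limit mentioned above, here is a rough sketch comparing a 1 GB CSV file against Excel's hard cap of 1,048,576 rows per worksheet (the limit in modern versions). The 100-byte average row size is an assumed, purely illustrative figure:

```python
# Why a ~1 GB file overwhelms Excel: a worksheet is capped at 1,048,576
# rows, and even modest row sizes blow far past that.
# (Assumed average row size: 100 bytes -- purely illustrative.)

EXCEL_ROW_LIMIT = 1_048_576       # hard per-worksheet row limit
FILE_SIZE_BYTES = 1 * 10**9       # a 1 GB file
AVG_ROW_BYTES = 100               # assumed average bytes per row

estimated_rows = FILE_SIZE_BYTES // AVG_ROW_BYTES
print(f"Estimated rows in file : {estimated_rows:,}")        # 10,000,000
print(f"Excel row limit        : {EXCEL_ROW_LIMIT:,}")
print(f"Over the limit by      : ~{estimated_rows / EXCEL_ROW_LIMIT:.0f}x")
```

And that is only 1 GB; the per-minute volumes listed earlier are orders of magnitude larger.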
Thus, conventional tools such as Excel, SAS, R, Access, and SQL simply cannot keep up with data arriving at this volume and speed, minute after minute, day after day.