- In this blog series we are going to learn about Hive. We will cover:
- What exactly is Hive?
- How is MapReduce different from or similar to Hive?
- Hive introduction
- Hive versus RDBMS
- Hive architecture
- Hive basic vocabulary
- Hive query language
- Hive joins
- Hive operations, etc.
MapReduce vs Hive
- Using traditional data management systems, it is difficult to process Big Data.
- Big Data, by definition, is a data set or type of data that is very difficult to handle with the conventional tools used in day-to-day work, such as SAS, R, Excel, and SQL.
- Therefore, the Apache Software Foundation introduced a framework called Hadoop to solve Big Data management and processing challenges.
- Hadoop is an open-source framework to store and process Big Data in a distributed environment.
- It contains two main modules: MapReduce and the Hadoop Distributed File System (HDFS).
- MapReduce: It is a parallel programming model for processing large amounts of structured, semi-structured, and unstructured data on large clusters of commodity hardware.
- In MapReduce, we first divide the overall objective into smaller tasks and then write a MapReduce program in which the map step carries out some computation and the reducer takes the map output as its input and produces the required final output.
- As we discussed in earlier sessions, HDFS is the distributed file storage. In MapReduce we saw programs such as word count, line count, and finding the average. While writing those programs, we observed that HDFS and MapReduce together speed up data processing and computation.
- HDFS: The Hadoop Distributed File System is the part of the Hadoop framework used to store the datasets. It provides a fault-tolerant file system that runs on commodity hardware.
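The map and reduce steps described above can be sketched in a few lines of plain Python. This is only an in-memory illustration of the idea, not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for this example, and a real cluster would run the map and reduce steps in parallel on many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group the emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all the values collected for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce one after the other on a list of lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data needs big tools", "hive runs on hadoop"]
    print(word_count(sample))
```

Even for this small word count, we have to think about three separate steps, which is the effort the rest of this post is about.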
MapReduce needs Java
- While HDFS and MapReduce programming solve our Big Data handling problem, it is not easy to write MapReduce code.
- One has to understand the logic of the overall algorithm and then convert it to the MapReduce format.
- MapReduce programs are typically written in Java.
- We need to be proficient in Java to write MapReduce code for intermediate and complex problems.
- Not everyone has a Java background.
MapReduce Made Easy
- The traditional approach is to write a Java MapReduce program to process structured, semi-structured, and unstructured data.
- This way of executing MapReduce is not easy.
- Hive was created to make the MapReduce process easy. It provides its own query language, the Hive query language.
- The Hive query language is similar to SQL.
- Let's say we want to find the average of some data. Writing the map and reduce functions for this in Java is difficult, while writing an SQL query is much easier. Even if one does not know SQL, it is easy to learn.
- So writing queries in Hive is much easier than writing Java code. Hive is like SQL on top of Hadoop.
- When we write a query in Hive, it parses the query, creates the map and reduce functions, runs them on Hadoop against the data stored in HDFS, and fetches the results back.
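To get a feel for the average example, here is a rough Python sketch of the map and reduce logic that has to be written by hand. It is not the code Hive generates; it only shows the kind of work a single query replaces. The table name `employees`, the columns `dept` and `salary`, and the function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical HiveQL that asks for the same result
# (the table and column names are made up for this sketch):
#   SELECT dept, AVG(salary) FROM employees GROUP BY dept;


def map_record(record):
    """Map: turn one (dept, salary) record into a key and a (sum, count) pair."""
    dept, salary = record
    return dept, (salary, 1)


def reduce_group(partials):
    """Reduce: add up the partial sums and counts for one key, then divide."""
    total = sum(partial_sum for partial_sum, _ in partials)
    count = sum(partial_count for _, partial_count in partials)
    return total / count


def average_by_key(records):
    """Group the mapped pairs by key and reduce each group to its average."""
    grouped = defaultdict(list)
    for record in records:
        key, partial = map_record(record)
        grouped[key].append(partial)
    return {key: reduce_group(partials) for key, partials in grouped.items()}


if __name__ == "__main__":
    rows = [("sales", 100), ("sales", 300), ("hr", 50)]
    print(average_by_key(rows))
```

Notice that the mapper has to carry both a sum and a count, because an average cannot be rebuilt from other averages alone. With a query language, we do not have to think about such details.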
Line Count - Code
- Let’s take the example of a line count program.
- In Hive we just have to write “SELECT COUNT(*) FROM table” instead of writing lengthy MapReduce code.
- For this line count program, we would need to write many lines of code in Java, whereas in Hive a single-line query is enough.
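The contrast can be tried out locally with a small Python sketch. Note the assumptions: SQLite (through Python's built-in `sqlite3` module) is used here only as a stand-in for Hive so that the query can run without a cluster, and the table name `lines`, the column `content`, and both function names are made up for this example.

```python
import sqlite3


def line_count_mapreduce(lines):
    """MapReduce style: the map step emits 1 for each line, the reduce step sums the ones."""
    mapped = [1 for _ in lines]   # map
    return sum(mapped)            # reduce


def line_count_sql(lines):
    """Query style: load the lines into a table and ask for COUNT(*)."""
    conn = sqlite3.connect(":memory:")  # in-memory SQLite database, standing in for Hive
    try:
        conn.execute("CREATE TABLE lines (content TEXT)")
        conn.executemany(
            "INSERT INTO lines (content) VALUES (?)",
            [(line,) for line in lines],
        )
        (count,) = conn.execute("SELECT COUNT(*) FROM lines").fetchone()
        return count
    finally:
        conn.close()


if __name__ == "__main__":
    sample = ["first line", "second line", "third line"]
    print(line_count_mapreduce(sample), line_count_sql(sample))
```

Both functions return the same number. In the second one, the only line that expresses the actual logic is the `SELECT COUNT(*)` query.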