In this session we will learn the basics of Pig: what Pig is, the Pig architecture, Pig Latin scripts, basic Pig operations, loading data into Pig, group by, filtering, sorting, functions, joins, storing data, and exporting data out of Pig. In short, this tutorial covers Pig and its ecosystem in detail. It teaches you how to use Pig within the Hadoop ecosystem; once you know the basics, you can move on to more advanced operations.

## Contents
- What is Pig
- Pig Architecture
- Pig Latin
- Pig Latin basic Operations
- Loading the data
- Group by
- Storing data
## What is Apache Pig
As we already know, MapReduce has some drawbacks. First, you need to be proficient in Java to write MapReduce code efficiently. Second, every problem has to be recast in the MapReduce framework. It is not like a normal program that you write in the traditional way: each program must be expressed as a map step that runs locally, whose output is then fed to a reduce step. This makes MapReduce programs hard to write.

Data scientists and analysts are not expected to know Java as well as a professional Java developer, so we need a tool within the Hadoop ecosystem that runs MapReduce jobs without us actually writing the Java code. Hive does this by converting queries into MapReduce code, and there is a similar tool called Pig. Pig offers a high-level scripting language, Pig Latin; code written in it is converted internally into MapReduce tasks, which makes Pig very useful for a big data analyst. This is often described as "MapReduce made easy".
## MapReduce Made Easy
- The traditional approach is to write a Java MapReduce program to process structured, semi-structured, and unstructured data.
- Writing and running MapReduce this way is not easy.
- Pig is a high-level scripting language that avoids the complexity of Java MapReduce coding.
- Programmers write scripts in the Pig Latin language.
- All these scripts are internally converted to Map and Reduce tasks.
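
To make the map-then-reduce structure concrete, here is a minimal plain-Python sketch of a word count recast as a map step, a grouping step, and a reduce step. It is not Hadoop code and not what Pig actually generates; the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are made up for illustration, and the input lines are sample data.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key, values):
    """Collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # Map every line independently, group by key, then reduce each group.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["pig latin is simple", "pig runs on hadoop"]))
```

Even for a task this small, the logic has to be split into separate map and reduce functions. That restructuring is the part Pig takes over when it converts a script into Map and Reduce tasks.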
## What is Pig
In short, Pig is a simple scripting tool and a powerful alternative to writing MapReduce directly. Apache Pig is an abstraction over MapReduce. It works very well for certain classes of problems, such as web log analysis and text mining. Pig can handle data sets that are slightly unstructured or semi-structured, unlike Hive, which runs into trouble when the data is not in a properly structured format.
- In simple terms, Pig is a simple scripting tool and a powerful alternative to MapReduce.
- Apache Pig is an abstraction over MapReduce.
- Hive is good but has a number of limitations: user-defined functions, built-in functions, and ad hoc analysis are not very flexible in Hive.
Typical applications of Pig include:
- Web log processing.
- Data processing for web search platforms.
- Ad hoc queries across large data sets.
In this session we will discuss Pig in detail, along with Pig Latin and how to write Pig Latin scripts. Hive and Pig each have their own advantages and disadvantages: Hive is better for some classes of problems and Pig for others, so the two are not really in competition. Both tools can produce the desired results, but they take different approaches, and it is up to the data scientist or analyst to decide which one suits the goal. To interact with Pig we need to learn a new language, Pig Latin. It has a simple syntax and a limited number of commands and operators, so it does not take long to learn.
We now turn to Pig Latin, the language we need in order to interact with Pig. Pig provides Pig Latin as a high-level language for writing data analysis programs. As mentioned above, it has very few keywords and operators and is simple to learn. There are many built-in operators for joins, filtering, and ordering; we just need to call the right operator for the task.

Pig Latin also provides nested data types such as tuples, bags, and maps, which are missing from MapReduce. What exactly is a nested data type? A bag consists of tuples, and a map consists of key-value pairs, so these types can be nested inside one another. Their use will become clearer once we start writing Pig Latin scripts.

Pig also lets us write user-defined functions. We can write our own functions for reading, writing, or processing data, or for creating reports, and use them in Pig, where they are converted internally into MapReduce code. This is a powerful feature of Pig that lets us cover our own business needs.
- To write data analysis programs, Pig provides a high-level language known as Pig Latin
- Pig Latin syntax is very simple and intuitive.
- Built-in operators for joins, filters, and ordering.
- Provides nested data types like tuples, bags, and maps that are missing from MapReduce.
- User-defined functions: Pig Latin makes it easy to develop your own functions for reading, writing, and processing data.
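
The nested data types can be pictured with ordinary Python structures. The sketch below is only an analogy, not Pig Latin and not Pig's API: a Python tuple stands in for a Pig tuple, a list of tuples for a bag, and a dict for a map. The `group_by` helper is a hypothetical name, and the web log records are invented sample data. It mimics how grouping yields one (key, bag) pair per group, which is where nesting shows up in practice.

```python
from collections import defaultdict

# A Pig tuple is an ordered set of fields; a Python tuple stands in for it.
hit_1 = ("10.0.0.1", "/home", 200)
hit_2 = ("10.0.0.2", "/home", 500)
hit_3 = ("10.0.0.1", "/cart", 200)

# A Pig bag is a collection of tuples; a Python list stands in for it.
hits = [hit_1, hit_2, hit_3]

# A Pig map is a set of key-value pairs; a Python dict stands in for it.
request_info = {"method": "GET", "agent": "curl"}


def group_by(bag, field_index):
    """Return (group key, bag of matching tuples) pairs, sorted by key."""
    groups = defaultdict(list)
    for record in bag:
        groups[record[field_index]].append(record)
    return sorted(groups.items())


if __name__ == "__main__":
    # Each result is itself a tuple whose second field is a nested bag.
    for url, records in group_by(hits, 1):
        print(url, len(records), records)
```

Here each grouped result is a tuple that contains a bag, which in turn contains tuples. This is the "types inside types" idea that the nested data types describe.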