301.4.2-Pig Architecture, Data Types and Relation

Pig Architecture

Let us look at the Pig architecture. As the diagram shows, Pig sits on top of Hadoop. Pig itself consists of a parser, an optimizer, a compiler and an execution engine. Pig also provides the Grunt shell, a command-line interface in which Pig Latin scripts can be written and run. A Pig Latin script is first passed to the parser, which checks it for syntax errors and builds a logical plan. The optimizer then optimizes that plan, and the compiler converts it into a series of MapReduce jobs. Finally, the execution engine submits those jobs to Hadoop, where the analysis and computation run on data stored in HDFS, and the results are fetched back and shown as output in Pig.
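The order of the stages can be sketched as a toy Python pipeline. This is only an illustration of the hand-off between components: every function name below is hypothetical and none of this is Pig's actual code. The optimizer is a placeholder, and bundling the whole plan into a single job is a simplification.

```python
# Toy model of the path a Pig Latin script takes through Pig's components.
# All function names are hypothetical; they are not part of Pig.

def parse(script):
    # Parser: checks syntax and builds a logical plan (here, a list of statements).
    statements = [s.strip() for s in script.split(";") if s.strip()]
    if not statements:
        raise SyntaxError("empty script")
    return statements

def optimize(logical_plan):
    # Optimizer: a real optimizer rewrites the plan; this placeholder returns it unchanged.
    return list(logical_plan)

def compile_plan(optimized_plan):
    # Compiler: turns the plan into MapReduce jobs.
    # Simplification: the whole plan is bundled into one job.
    return [{"job": 1, "statements": optimized_plan}]

def execute(jobs):
    # Execution engine: would submit the jobs to Hadoop; here it only reports them.
    return [f"submitted job {j['job']} with {len(j['statements'])} statements" for j in jobs]

script = "A = LOAD 'items'; B = FILTER A BY $1 > 100; DUMP B;"
for line in execute(compile_plan(optimize(parse(script)))):
    print(line)
```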

Pig Data Types

Now let us look at the Pig data types. The simple data types in Pig Latin are int, long, float, double, chararray, bytearray, boolean, datetime and biginteger. Apart from these, Pig has a few concepts of its own. The first is the atom: any single value in a Pig Latin script is an atom, irrespective of whether its type is int, float or anything else. It is comparable to the value held by a variable in other programming languages.

The next is the tuple. A tuple is like a row in a table: it consists of an ordered set of fields. An example is (Mobile, 200), where the first field is the item name and the second field is the item cost; in short, the price of a mobile is 200. A tuple is written inside parentheses (), so anything written in this format is a tuple.

Then there is the bag. A bag is an unordered set of tuples, written inside curly braces {}. It is similar to a table, but the tuples in a bag need not all contain the same number of fields, and the fields in the same position (column) need not have the same type. For example, the first row can have 20 columns, the second 25 and the third just 4. An example of a bag is {(Mobile, 200), (PC, 600)}, which means the mobile price is 200 and the PC price is 600. Why do we need to learn about tuples and bags? Pig Latin uses these data types for its data operations, and they make it easier to analyze many kinds of datasets.

The last data type is the map, which is a set of key-value pairs. The key must be of type chararray, that is, a character string, and must be unique; the value can be of any type. A map is written inside square brackets []. An example is ['Age'#30], where Age is the key and 30 is the value. A map with two keys is ['Item'#'Mobile', 'quantity'#200], where Item is the key for the value Mobile and quantity is the key for the value 200.

Simple Data Types:

int, long, float, double, chararray, bytearray, boolean, datetime and biginteger

Atom:

Any single value in Pig Latin, irrespective of its data type, is known as an atom.
Like the value of a variable in other languages

Tuple:

A record that is formed by an ordered set of fields.
Similar to a row in a table
Represented by ()
Eg: (Mobile, 200)
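For readers who know Python, a tuple here behaves much like a Python tuple. The snippet below is only an analogy, not Pig Latin; the variable names are chosen for illustration.

```python
# The Pig tuple (Mobile, 200) modelled as a Python tuple: ordered fields, read by position.
item = ("Mobile", 200)

# Pig Latin refers to fields by position as $0, $1, ...; Python uses index 0, 1, ...
name, cost = item
print(name, cost)
```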

Bag:

A bag is an unordered set of tuples.
Represented by “{}”.
It is similar to a table but it is not necessary that every tuple contain the same number of fields or that the fields in the same position (column) have the same type.
Eg: {(Mobile, 200), (PC, 600)}
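A rough Python analogy for a bag is a list of tuples whose lengths may differ. This is a sketch, not Pig Latin: the third tuple is made-up data added to show an extra field, and a Python list keeps its order whereas a bag does not.

```python
# A bag modelled as a Python list of tuples.
# Caveat: a Python list is ordered; a Pig bag is not, so do not rely on position.
bag = [
    ("Mobile", 200),
    ("PC", 600),
    ("Tablet", 350, "refurbished"),  # hypothetical row with one extra field
]

field_counts = [len(t) for t in bag]   # tuples in a bag may differ in length
total_cost = sum(t[1] for t in bag)    # the second field is the cost in every tuple here
print(field_counts, total_cost)
```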

Map:

A set of key-value pairs
The key needs to be of type chararray and should be unique.
The value can be of any type.
Represented by ‘[]’
Eg:
- Map with one key: ['Age'#30]
- Map with two keys: ['Item'#'Mobile', 'quantity'#200]
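The two maps above correspond closely to Python dictionaries. The snippet is an analogy only; in Pig Latin itself a value is looked up with the # operator, as in item#'quantity'.

```python
# The two example maps modelled as Python dicts: unique string keys, values of any type.
age = {"Age": 30}
item = {"Item": "Mobile", "quantity": 200}

# Lookup by key (in Pig Latin this would be written item#'quantity').
print(age["Age"], item["Item"], item["quantity"])
```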

Relation

Now we will look at the relation, which is similar to a bag, or to a table in some cases. A relation is a bag, a bag contains tuples, a tuple contains fields, and a field is a simple piece of data. As the diagram shows, the relation is the outermost structure of the Pig Latin data model. A field in a tuple can itself hold a bag, so a relation can contain nested (inner) bags. As soon as we load a dataset, the data is converted into tuples, bags and relations, so Pig handles data slightly differently from a conventional table.

A relation is a bag (similar to a table in some cases)
Bag contains tuples
Tuples contain fields
A field is a simple piece of data
Relation is the outermost structure of the Pig Latin data model
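The nesting can be walked level by level in a short Python sketch. It is an analogy rather than Pig Latin, and it reuses the example values from the bag above.

```python
# relation -> tuples -> fields, using Python containers as stand-ins.
relation = [("Mobile", 200), ("PC", 600)]   # the relation is the outermost bag

atoms = []
for tup in relation:          # a bag contains tuples
    for field in tup:         # a tuple contains fields
        atoms.append(field)   # each field here holds a single simple value

print(atoms)
```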

22nd March 2018