Before writing Pig Latin scripts, a few points are worth keeping in mind. First, Pig is case sensitive in some places and not in others. Keywords in Pig Latin, for example “store”, “load”, and “import”, are not case sensitive, but relation names, field names, and function names are. If we define a relation with a particular mix of upper- and lower-case letters, we must refer to it later with exactly the same casing. Second, there are two commenting styles: SQL-style single-line comments (starting with “--”) and Java-style multi-line comments (enclosed in “/* */”).
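As a small illustration of these rules, here is a minimal sketch. The relation names A and B are made up for the example, and the HDFS path is the one used later in this walkthrough.

```pig
-- SQL-style single-line comment
/* Java-style
   multi-line comment */

-- Keywords are not case sensitive: 'load' and 'LOAD' mean the same thing.
A = load '/Retail_invoice_hdfs' using PigStorage('\t');
B = LOAD '/Retail_invoice_hdfs' USING PigStorage('\t');

-- Relation names are case sensitive: 'a' is not the relation 'A' defined above,
-- so the next statement would fail if it were uncommented.
-- DUMP a;

-- Function names are case sensitive too: PigStorage is recognised,
-- while pigstorage is not.
```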
Let's start working with Pig. As discussed earlier, Pig is built on top of Hadoop, so we need to start Hadoop before we start Pig. This is necessary because a Pig Latin script is converted internally into MapReduce code, and MapReduce runs on top of Hadoop. To start the Hadoop services, run the command “start-all.sh”. Once Hadoop is up and running, we can start Pig with the command “pig”, which opens the **grunt** shell.
Let's work on the dataset Online_Retail_Invoice.txt. First we have to push the data into HDFS, and then load it from HDFS into Pig, because Pig is part of the Hadoop ecosystem and works with HDFS.
hadoop fs -copyFromLocal /home/hduser/datasets/Online_Retail_Sales_Data/Online_Retail_Invoice.txt /Retail_invoice_hdfs
Retail_invoice_pig = LOAD '/Retail_invoice_hdfs' USING PigStorage('\t');
Let's understand the code in detail. “Retail_invoice_pig” is the relation name; it is like a dataset name, data file name, or table name. LOAD is the keyword used to bring HDFS data into Pig, and “/Retail_invoice_hdfs” is the location of the HDFS file that we want to load. USING is also a keyword; it names the load function, and PigStorage('\t') says that the fields in the data are separated by tabs. Tuples are created by splitting each line on the delimiter given to this storage function. There are several load functions to choose from, such as BinStorage, JsonLoader, PigStorage, and TextLoader; in this example we are using PigStorage with a tab delimiter. Once we run the above command, it may show a “deprecated” warning message, which is okay.
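To make the role of the load function more concrete, the following sketch shows a few variations. The relation names and the comma-separated path “/Retail_invoice_csv_hdfs” are assumptions made up for illustration; only “/Retail_invoice_hdfs” comes from this walkthrough.

```pig
-- Tab is the default delimiter of PigStorage, so these two statements
-- load the data in the same way.
tab_explicit = LOAD '/Retail_invoice_hdfs' USING PigStorage('\t');
tab_default  = LOAD '/Retail_invoice_hdfs' USING PigStorage();

-- Hypothetical comma-separated copy of the data (this path is an assumption).
csv_data = LOAD '/Retail_invoice_csv_hdfs' USING PigStorage(',');

-- TextLoader does no splitting: each line becomes a tuple with a single
-- chararray field.
raw_lines = LOAD '/Retail_invoice_hdfs' USING TextLoader();
```

The delimiter passed to PigStorage should match the one actually used in the file; otherwise each line ends up as one long field instead of separate columns.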
Now the dataset is inside Pig, and we can use the DUMP statement, “DUMP Retail_invoice_pig;”, which works like a print statement and prints the data. Inside Pig the name is “Retail_invoice_pig”, and we don't call it a dataset or a data table; it is called a relation. So let's run the DUMP command. Once we run it, it starts calling some Java libraries, and finally we can see it printing the data from the retail invoice file. Because this is a huge dataset, it takes time to print all the rows. DUMP prints the whole relation, so it consumes a considerable amount of time and is not recommended when the dataset is large.
Let's have a look at the data in Pig.
Instead of DUMP, we can use DESCRIBE, another keyword in Pig, to inspect the structure of a relation without printing its rows. The command is “DESCRIBE Retail_invoice_pig;”. As soon as we run it, we get a message saying “Schema for Retail_invoice_pig is unknown”. For DESCRIBE to be useful, we have to define a schema when loading the data.
#### Loading the Data with Schema
Retail_invoice_pig1 = LOAD '/Retail_invoice_hdfs' USING PigStorage('\t') as (uniq_idi:chararray, InvoiceNo:chararray, StockCode:chararray, Description:chararray,Quantity:INT);
Let's try to understand the command. “Retail_invoice_pig1” is the new relation, like a new table inside Pig. The LOAD statement gives the location of the data to be loaded. USING is a keyword that introduces the load function, which takes care of the delimiter used in the dataset. The next part of the command, after AS, describes the schema. The first field is the unique id (uniq_idi), of type chararray. The second is **InvoiceNo**, of type chararray. The third is **StockCode**, of type chararray. The fourth is Description, of type chararray. The last one is Quantity, of type int. So we are loading the same data again into “Retail_invoice_pig1”, but this time along with a schema. With the schema in place, we can run “DESCRIBE Retail_invoice_pig1;”.
Running “DESCRIBE Retail_invoice_pig1;” gives us a short description of the relation, consisting of the column names and their data types.
Running “DUMP Retail_invoice_pig1;” starts printing the data; again, DUMP is not a recommended command if the dataset contains a large number of rows. Instead, we can restrict the relation to 10 rows with LIMIT and dump only those.
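One practical consequence of having a schema is that fields can be referenced by name. The sketch below assumes the two LOAD statements from this walkthrough; the relation names invoice_qty and invoice_qty_pos are made up for illustration.

```pig
-- Relation loaded with a schema, as in the walkthrough.
Retail_invoice_pig1 = LOAD '/Retail_invoice_hdfs' USING PigStorage('\t')
    AS (uniq_idi:chararray, InvoiceNo:chararray, StockCode:chararray,
        Description:chararray, Quantity:int);

-- With a schema, fields can be selected by name.
invoice_qty = FOREACH Retail_invoice_pig1 GENERATE InvoiceNo, Quantity;
DESCRIBE invoice_qty;

-- Relation loaded without a schema, as in the first LOAD.
Retail_invoice_pig = LOAD '/Retail_invoice_hdfs' USING PigStorage('\t');

-- Without a schema, fields can only be selected by position:
-- $0 is the first field, so $1 and $4 are the second and fifth.
invoice_qty_pos = FOREACH Retail_invoice_pig GENERATE $1, $4;
```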
head_Retail_invoice_pig1 = LIMIT Retail_invoice_pig1 10;
DUMP head_Retail_invoice_pig1;