301.4.6-Filter & Sorting

Filter and Sorting

Filter

• The basic syntax of filtering is use the filter operation and then relation name which is needed to be filtered, followed by the condition.

Filtering – Rows

• Filter on Numerical variable.
• For now we will see how to filter the numerical variable.
Retail_Customer_F1 = FILTER Retail_Customer_pig BY UnitPrice>2000;
• In the code, Retail_Customer_F1 is the new relation name, FILTER is the keyword, Retail_Customer_pig is the existing relation name which is already inside the pig, UnitPrice is the column name which is inside the relation Retail_Customer_pig and the numeric value greater than 2000 means that we are filtering all the unit price which is having the value greater than 2000 and will be stored inside the new relation which is Retail_Customer_F1.
• Once the filtered data is inside the Retail_Cutomer_F1, we can print them by using the DUMP command.
Dump Retail_Customer_F1;

• So after printing we can see that filtered data is having only those rows where the unit prices is greater than 2000.
• This is the example of the numeric filter.
• Filter on text variable.
• Filtering on the text variable is almost similar with slight variations.
• Lets see how to do the filtering on the text variable .
Retail_Customer_F2 = FILTER Retail_Customer_pig BY (Country MATCHES '.*USA.*');
DUMP Retail_Customer_F2;

Selecting Few Columns

• For filtering based on the column we can use FOREACH and GENERATE operators for the same .
• Use FOREACH and GENERATE operators.
Retail_Customer_Few_Col = FOREACH Retail_Customer_pig GENERATE InvoiceDate, UnitPrice, CustomerID;
DUMP Retail_Customer_Few_Col;
• The command can be interpreted as, for each relation which is already existing in the pig that is Retail_Customer_pig generate the column of InvoiceDate, UnitPrice,CustomerID and this will be stored in the new relation which is named as Retail_Customer_Few_Col.
• In short for each row or tuples or each bag inside retail customer pig relation generate the following 3 columns.
• To print it we can again use the dump command and we can see that it has printed 3 columns and all the rows.

• We can use this filtering technique specially when we have huge column in the data set and we are only interested in the few columns.

Sorting Or Ordering

In this section we will look into how to do the sorting in the pig. For sorting in the pig we will be using order command. The usage of the command is pretty simple here, the flow of the command as follows.

• Descending:
Retail_Customer_sort = ORDER Retail_Customer_pig BY UnitPrice DESC;
• First take the relation that has been sorted in this example; Retail_Customer_pig is the relation that needed to be sorted then using ORDER which is an key word we can sort the column UnitPrice in descending order and it is stored in the new relation called Retail_Customer_sort.
• Once we had created the sorted relation and we want to just look at the first 20 observation then we can use the head command,
head= LIMIT Retail_Customer_sort 20;

and then print head relation by dump command.

DUMP head;

• By printing we can see that the output result are in descending order of the UnitPrice.
• Ascending – Default is Ascending
• Lets now try to print in the ascending order, the command of printing in the ascending order is similar in fact the default ordering is the ascending order.
        Retail_Customer_sort1 = ORDER Retail_Customer_pig BY UnitPrice ASC;
• And now we can create another relation called head just to print starting 20 observation
        head= LIMIT Retail_Customer_sort1 20;
• To print the result just use dump command and it will print data which were inside the head relation.
        DUMP head;

• By seeing the result we can know that unit price has printed in the ascending order.
• Default is Ascending
        Retail_Customer_sort1 = ORDER Retail_Customer_pig BY UnitPrice;
DUMP head;