In the previous section we studied descriptive statistics. In this section we will study percentiles and quartiles.
What is a percentile? Let us see with an example. A student sat an exam along with 999 other students and secured 68% marks. What can we say from this? Is he clever or not? It is hard to tell. If the highest score is 70%, then 68% is a very good result. If the highest score is 95% and the lowest is 66%, then 68% is not that impressive. So the marks alone do not tell us how well he did; we need to know where he stands relative to the others. Say that with 68% he stood at the 90th position. Then 89 students scored more than him and 910 scored less. Fewer than 10% of the students did better and about 91% did worse, so he stands at roughly the 90th percentile and we can say he is a good student.
So instead of just the marks (68%), his standing, i.e., roughly the 90th percentile, gives a better idea of his performance. This is why exams often report the percentile instead of the marks obtained. A percentile makes a raw score easier to interpret.
So what is the Pth percentile?
It is the value such that P% of the observations are below it and (100-P)% of the observations are above it. For the 90th percentile, 90% of the observations are below it and only 10% are above it.
Similarly, suppose a student's marks are 40% but his percentile is 80. Then 80% of the students are below him and only 20% are above him, so he has done well even though the marks look low.
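The definition can also be read the other way round: given a value, we can ask what percentage of the observations fall below it. The following is a minimal sketch, assuming a small made-up vector of marks; `percentile_rank` is a hypothetical helper name used only for illustration, not a built-in R function.

```r
# Hypothetical helper: percentage of observations strictly below `value`.
percentile_rank <- function(x, value) {
  100 * mean(x < value)
}

# Made-up marks for ten students (illustration only).
marks <- c(35, 42, 50, 55, 61, 64, 68, 70, 72, 90)

# Six of the ten marks are below 68, so about 60% of the students scored lower.
percentile_rank(marks, 68)
```

With real exam data the same call would be applied to the full vector of marks.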
Percentiles also help a lot in finding outliers, if there are any.
Let us see an example.
Suppose the highest income value in a dataset is 4,00,000 but the 95th percentile is only 20,000. This means 95% of the people have an income of 20,000 or less and only 5% have more than 20,000. The highest value, however, is 20 times the 95th percentile. So the values near 4,00,000 are clearly outliers, i.e., values that are very different from the rest of the values in the dataset.
So percentiles give us a good idea about outliers. Percentiles divide the data into 100 groups, whereas quartiles divide the data into four groups.
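The income example can be sketched in R as follows. The `incomes` vector is made up for illustration (95 ordinary values plus five larger ones, one of them extreme), and comparing the maximum with the 95th percentile is only an informal check, not a formal outlier test.

```r
# Made-up incomes: 95 ordinary values plus 5 larger ones, one of them extreme.
incomes <- c(seq(5000, 19100, by = 150), 21000, 22000, 25000, 30000, 400000)

# 95th percentile and maximum of the made-up data.
p95 <- unname(quantile(incomes, 0.95))
max_income <- max(incomes)

# A maximum many times larger than the 95th percentile points to outliers.
max_income / p95

# Values above the 95th percentile, for closer inspection.
top_values <- incomes[incomes > p95]
top_values
```

Printing `top_values` makes it easy to see which of the high values is far away from the others.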
So quartiles in terms of percentile can be given as:
- 25th percentile is First Quartile or Lower Quartile (LQ).
- 50th percentile is Second Quartile or Median.
- 75th percentile is Third Quartile or Upper Quartile (UQ).
So what is a quartile?
The 25th percentile has 25% of the observations below it and 75% above it; it is called the 1st quartile. The 50th percentile has 50% below it and 50% above it; it is called the 2nd quartile or the median. The 3rd quartile is the 75th percentile, which has 75% below it and 25% above it.
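As a small illustration of this correspondence, the sketch below computes the three quartiles of a made-up sample with R's `quantile()` function, using the probabilities 0.25, 0.5 and 0.75. By default `quantile()` interpolates between neighbouring observations, so in a small sample the shares below each quartile are only approximately 25%, 50% and 75%.

```r
# Made-up sample (illustration only).
x <- c(2, 4, 4, 5, 7, 8, 9, 11, 12, 15, 21)

# Quartiles are the 25th, 50th and 75th percentiles.
quartiles <- quantile(x, c(0.25, 0.5, 0.75))
quartiles

# The second quartile is the median.
median(x)

# Share of observations at or below each quartile.
shares <- sapply(quartiles, function(q) mean(x <= q))
shares
```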
Calculate Percentile and Quartile in R
Let us consider the Income dataset.
In this, let us see the summary of the variable “capital.gain”.
>summary(Income$capital.gain)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##       0       0       0    1078       0  100000
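We get the minimum as 0, the 1st quartile as 0, the median as 0, the mean as 1078, the 3rd quartile as 0 and the maximum as 1,00,000. Note that summary() prints rounded values here: the exact maximum is 99999, as the quantile() output further down shows. To get percentiles other than the quartiles, we use the function quantile().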
>quantile(Income$capital.gain, c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1))
Here c(0, 0.1, 0.2, …, 0.9, 1) is the vector of probabilities we are asking for: 0 gives the minimum value, 0.1 the 10th percentile, 0.2 the 20th percentile and so on, up to 1, which gives the 100th percentile, i.e., the maximum value.
##    0%   10%   20%   30%   40%   50%   60%   70%   80%   90%  100%
##     0     0     0     0     0     0     0     0     0     0 99999
From the above result we can see that the value is 0 up to the 90th percentile, and the non-zero values appear only beyond that. This is the same variable we used earlier for the calculation of mean and median.
>mean(Income$capital.gain)
## [1] 1077.649
>median(Income$capital.gain)
## [1] 0
The large gap between the mean and the median had already told us that this variable has many outliers, as some records are very different from the rest of the records. Similarly, we can look at the distribution of “capital.loss”.
>quantile(Income$capital.loss, c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1))
##   0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 100%
##    0    0    0    0    0    0    0    0    0    0 4356
This variable gives a similar kind of result. Now let us see the variable “hours.per.week”, which gives the number of hours people work per week.
>quantile(Income$hours.per.week, c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1))
##   0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 100%
##    1   24   35   40   40   40   40   40   48   55   99
From the result, we can see that 10% of the people work 24 hours or less and the remaining 90% work 24 hours or more. The 30th to 70th percentiles are all 40, so a large share of the people work exactly 40 hours per week. 80% of the people work 48 hours or less and the rest work 48 hours or more. The remaining percentiles can be read in the same way.
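Two small conveniences may help when repeating this kind of analysis. The sketch below uses a made-up `hours` vector (not the Income dataset) and shows that `seq(0, 1, by = 0.1)` builds the same probabilities as typing them out, and that `mean()` on a logical comparison gives the share of observations at or below a value directly.

```r
# Made-up weekly working hours (illustration only; not the Income dataset).
hours <- c(1, 20, 24, 35, 40, 40, 40, 40, 40, 45, 48, 50, 55, 60, 99)

# seq(0, 1, by = 0.1) builds the probabilities 0, 0.1, ..., 1.
probs <- seq(0, 1, by = 0.1)
deciles <- quantile(hours, probs)
deciles

# Share of people working 40 hours or less, and exactly 40 hours.
share_up_to_40 <- mean(hours <= 40)
share_exactly_40 <- mean(hours == 40)
c(up_to_40 = share_up_to_40, exactly_40 = share_exactly_40)
```

The same pattern applies to a data frame column, for example by replacing `hours` with the column of interest.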
LAB
Consider the “bank_market” data. Get the summary of the variable “balance” and check if there are any outliers in it.
>bank_market<-read.csv("C:\\Amrita\\Datavedi\\Bank Marketing\\bank_market.csv")
Also calculate the percentile and see the distribution of the data.
>summary(bank_market$balance)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   -8019      72     448    1362    1428  102100
Here we get the minimum value, 1st quartile, median, mean, 3rd quartile and maximum value. The minimum and maximum are very far from the quartiles, which suggests that there are outliers. Let's look at the distribution in more detail.
>quantile(bank_market$balance, c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1))
##     0%    10%    20%    30%    40%    50%    60%    70%    80%    90%   100%
##  -8019      0     22    131    272    448    701   1126   1859   3574 102127
The minimum value is -8019 and the maximum goes up to 102127. 90% of the people have a balance less than or equal to 3574, and the other percentiles are read in the same way. From the above result we can say that outliers are present in the last 10%, i.e., beyond the 90th percentile, and also at the lower end, as the minimum of -8019 is far below the 10th percentile of 0. Now consider another variable, “age”.
>summary(bank_market$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   18.00   33.00   39.00   40.94   48.00   95.00
Let us see the percentile.
>quantile(bank_market$age, c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1))
##   0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 100%
##   18   29   32   34   36   39   42   46   51   56   95
From the above results, 20% of the people are 32 years old or younger and 80% of the people are 51 years old or younger.
Therefore, from this result, we can say that the ages are spread fairly evenly across the percentiles.
So there are no extreme outliers in this variable of the kind we saw in the previous one. This section is very important because outliers in the data can distort our calculations and analysis. So it is necessary to check the distribution of the data.
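The comparison between “balance” and “age” can be made a little more concrete. The sketch below copies the decile values from the quantile() outputs in this lab and looks at the gaps between successive deciles and at the ratio of the maximum to the 90th percentile. This ratio is only an informal rule of thumb assumed here for illustration, not a formal outlier test.

```r
# Decile values copied from the quantile() outputs in this lab.
balance_deciles <- c(-8019, 0, 22, 131, 272, 448, 701, 1126, 1859, 3574, 102127)
age_deciles     <- c(18, 29, 32, 34, 36, 39, 42, 46, 51, 56, 95)

# Gap between successive deciles; a gap far larger than the others
# hints at extreme values in that part of the distribution.
diff(balance_deciles)
diff(age_deciles)

# Ratio of the maximum to the 90th percentile (informal check only).
balance_ratio <- balance_deciles[11] / balance_deciles[10]
age_ratio     <- age_deciles[11] / age_deciles[10]
round(c(balance = balance_ratio, age = age_ratio), 1)
```

For “balance” the maximum is more than 28 times the 90th percentile, while for “age” it is less than twice the 90th percentile. The last gap for “age” is still larger than the earlier ones, but nowhere near as extreme as for “balance”.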
In the next section, we will study box plots and outlier detection.