Link to the previous post: https://statinfer.com/104-3-3-dispersion-measures-in-python/

In the previous post, we went through dispersion measures and implemented them using Python.

This post is an extension of the previous posts, and we will continue with the data we imported in the earlier sessions.

Percentiles and quartiles are very useful when we need to identify outliers in our data. They also help us understand the basic distribution of the data.

Percentiles

A student took an exam that 1000 students attended in total.
- He scored 68% marks. How well or badly did he perform in the exam?
- What is his overall rank?
- What would his rank be if there were only 100 students overall?
For example, suppose that with 68 marks he stood in 90th position: 910 students scored less than 68, and only 89 students scored more than him.
He is at the 91st percentile.
Instead of stating 68 marks, saying that he did better than 91% of the students gives a much clearer idea of his performance.
Percentiles make the data easy to read.
pth percentile: p percent of the observations are below it, and (100 – p) percent are above it.
Suppose the marks are 40 but the percentile is 80. What does this mean?
An 80th percentile in the CAT exam means:
- 20% of the candidates are above and 80% are below
Percentiles also help us get an idea of outliers.
For example, suppose the highest income value is 400,000 but the 95th percentile is only 20,000. That means 95% of the values are less than 20,000, so the values near 400,000 are clearly outliers.
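
As a rough illustration of the definition above, the sketch below computes a percentile as the share of observations strictly below a given value. The helper name percentile_rank and the marks list are made up for this example; the list simply mirrors the split described above (910 students below 68, the student himself, and 89 above).

```python
def percentile_rank(scores, value):
    """Return the percentage of observations strictly below `value`."""
    below = sum(1 for s in scores if s < value)
    return 100.0 * below / len(scores)


# Hypothetical marks: 910 students below 68, our student at 68, 89 students above
marks = [50] * 910 + [68] + [80] * 89

student_percentile = percentile_rank(marks, 68)
print(student_percentile)  # 91.0
```

Other definitions of percentile rank exist (for example, counting ties as half), so treat this as one simple convention rather than the only one.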

Quartiles

Percentiles divide the whole population into 100 groups, whereas quartiles divide the population into 4 groups.
p = 25: First Quartile or Lower Quartile (LQ)
p = 50: Second Quartile or Median
p = 75: Third Quartile or Upper Quartile (UQ)
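
To make the three quartiles concrete, here is a minimal sketch on a small made-up series (the values are an assumption chosen for illustration, not taken from the datasets used in this post). It uses the pandas quantile() method, which interpolates linearly between observations by default.

```python
import pandas as pd

# Hypothetical toy data: nine evenly spaced values
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

# p = 25, 50 and 75 expressed as fractions
quartiles = values.quantile([0.25, 0.5, 0.75])
print(quartiles)
```

With nine evenly spaced values, the quartiles fall exactly on the 3rd, 5th and 7th observations.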

Percentiles & Quartiles in Python

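By default, describe() reports the three quartiles (25%, 50% and 75%) along with the minimum and maximum. To get any other percentiles, use quantile().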

In [22]:

Income_Data['capital-gain'].quantile([0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])

Out[22]:

0.0        0.0
0.1        0.0
0.2        0.0
0.3        0.0
0.4        0.0
0.5        0.0
0.6        0.0
0.7        0.0
0.8        0.0
0.9        0.0
1.0    99999.0
Name: capital-gain, dtype: float64

In [23]:

Income_Data['capital-loss'].quantile([0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])

Out[23]:

0.0       0.0
0.1       0.0
0.2       0.0
0.3       0.0
0.4       0.0
0.5       0.0
0.6       0.0
0.7       0.0
0.8       0.0
0.9       0.0
1.0    4356.0
Name: capital-loss, dtype: float64

In [24]:

Income_Data['hours-per-week'].quantile([0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])

Out[24]:

0.0     1.0
0.1    24.0
0.2    35.0
0.3    40.0
0.4    40.0
0.5    40.0
0.6    40.0
0.7    40.0
0.8    48.0
0.9    55.0
1.0    99.0
Name: hours-per-week, dtype: float64

Looks like some people are working more than 90 hours per week: the maximum is 99, while the 90th percentile is only 55.
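
In the capital-gain and capital-loss outputs above, every decile up to the 90th is 0, so all the variation sits in the top 10% of the data. One way to look more closely is to ask for finer percentiles in the upper tail. The sketch below uses a made-up, zero-heavy series as a stand-in (it is not the Income_Data column), so the numbers are only illustrative.

```python
import pandas as pd

# Hypothetical stand-in for a zero-heavy column: 95 zeros and 5 large values
zero_heavy = pd.Series([0] * 95 + [1000, 2000, 3000, 4000, 5000])

# Zoom in on the upper tail instead of using evenly spaced deciles
tail = zero_heavy.quantile([0.9, 0.95, 0.99, 1])
print(tail)
```

The same list of fractions could be passed to quantile() on a real column such as Income_Data['capital-gain'].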

Practice : Percentiles & Quartiles in Python

Dataset: “./Bank Marketing/bank_market.csv”
Get the summary of the balance variable
Do you suspect any outliers in balance?
Get relevant percentiles and see their distribution.
Are there really some outliers present?
Get the summary of the age variable
Do you suspect any outliers in age?
Get relevant percentiles and see their distribution.
Are there really some outliers present?

In [25]:

import pandas as pd  # already available if you are continuing from the previous posts
bank = pd.read_csv("datasets\\Bank Marketing\\bank_market.csv", encoding="ISO-8859-1")
bank.shape

Out[25]:

(45211, 18)

In [26]:

#Get the summary of the balance variable
#we can find the summary of the balance variable by using .describe()
summary_bala=bank["balance"].describe()
summary_bala

Out[26]:

count     45211.000000
mean       1362.272058
std        3044.765829
min       -8019.000000
25%          72.000000
50%         448.000000
75%        1428.000000
max      102127.000000
Name: balance, dtype: float64

Yes, outliers are likely: the mean (1362) is very different from the median (448), and the maximum (102127) is far above the 75th percentile (1428).
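
The reasoning behind comparing the mean and the median can be sketched on toy data: a single extreme value pulls the mean strongly, while the median barely moves. The two series below are made up for illustration and are not taken from the bank dataset.

```python
import pandas as pd

# Hypothetical balances without and with one extreme value
regular = pd.Series([100, 200, 300, 400, 500])
with_outlier = pd.Series([100, 200, 300, 400, 50000])

# The mean reacts to the extreme value, the median does not
print(regular.mean(), regular.median())            # 300.0 300.0
print(with_outlier.mean(), with_outlier.median())  # 10200.0 300.0
```

A large gap between mean and median is therefore a hint that extreme values may be present, not proof, which is why the percentiles are checked next.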

In [27]:

#Get relevant percentiles and see their distribution.
bank['balance'].quantile([0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])

Out[27]:

0.0     -8019.0
0.1         0.0
0.2        22.0
0.3       131.0
0.4       272.0
0.5       448.0
0.6       701.0
0.7      1126.0
0.8      1859.0
0.9      3574.0
1.0    102127.0
Name: balance, dtype: float64
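
As an alternative to calling quantile() separately, describe() accepts a percentiles argument, so extra percentiles can be shown alongside the usual summary. The sketch below uses a made-up series of the numbers 1 to 100 rather than the bank data, purely to show the shape of the output.

```python
import pandas as pd

# Hypothetical data: the integers 1 to 100
toy = pd.Series(range(1, 101))

# Ask describe() for the 5th and 95th percentiles; the median (50%) is always included
summary = toy.describe(percentiles=[0.05, 0.95])
print(summary)
```

On the real data, the same call could be made on bank['balance'] with whichever percentiles are of interest.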

In [28]:

#Get the summary of the age variable
summary_age=bank['age'].describe()
summary_age

Out[28]:

count    45211.000000
mean        40.936210
std         10.618762
min         18.000000
25%         33.000000
50%         39.000000
75%         48.000000
max         95.000000
Name: age, dtype: float64

Looks like there are no outliers: the mean (40.9) and the median (39) are close to each other.

In [29]:

#Get relevant percentiles and see their distribution
bank['age'].quantile([0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])

Out[29]:

0.0    18.0
0.1    29.0
0.2    32.0
0.3    34.0
0.4    36.0
0.5    39.0
0.6    42.0
0.7    46.0
0.8    51.0
0.9    56.0
1.0    95.0
Name: age, dtype: float64

The next post is on box plots and outlier detection using Python.
Link to the next post : https://statinfer.com/104-3-5-box-plots-and-outlier-dectection-using-python/

23rd January 2018

104.3.4 Percentiles & Quartiles in Python

Implementing the concept of percentile and quartiles.


Statinfer
