Link to the previous post: https://statinfer.com/104-3-3-dispersion-measures-in-python/
In the previous post, we went through dispersion measures and implemented them in Python.
This post extends the earlier ones, so we will continue with the data we imported in the previous sessions.
Percentiles and quartiles are very useful when we need to identify outliers in our data. They also help us understand the basic distribution of the data.
Percentiles
- A student took an exam that had 1000 students in total.
- He scored 68% marks. How well or badly did he perform in the exam?
- What is his overall rank?
- What would his rank be if there were only 100 students overall?
- For example, with 68 marks he stood in 90th position: 910 students scored less than 68 and only 89 students scored more than him.
- That puts him at the 91st percentile.
- Instead of quoting 68 marks, saying "91st percentile" gives a much better idea of his performance.
- Percentiles make the data easy to read
- pth percentile: p percent of the observations lie below it and (100 - p) percent lie above it.
- If the marks are 40 but the percentile is 80, what does this mean?
- An 80th percentile in the CAT exam means
- 20% of the candidates scored above and 80% scored below
- Percentiles help us get an idea of outliers.
- For example, suppose the highest income value is 400,000 but the 95th percentile is only 20,000. That means 95% of the values are less than 20,000, so the values near 400,000 are very likely outliers.
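The exam example above can be checked with a few lines of plain Python. This is only a sketch: the helper `percentile_rank` and the `marks` list are made up for illustration, and the list is built to match the counts in the example (910 students below 68, one student at 68, 89 above).

```python
def percentile_rank(score, all_scores):
    """Return the percentage of observations strictly below `score`."""
    below = sum(1 for s in all_scores if s < score)
    return 100.0 * below / len(all_scores)


# Hypothetical marks: 910 students below 68, our student at 68, 89 above.
marks = [50] * 910 + [68] + [80] * 89

student_percentile = percentile_rank(68, marks)
students_above = sum(1 for m in marks if m > 68)

print("percentile:", student_percentile)  # 91.0
print("rank:", students_above + 1)        # 90
```

The rank is simply the number of students who scored more, plus one, which is why 89 students above him gives 90th position.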
Quartiles
- Percentiles divide the whole population into 100 groups, whereas quartiles divide the population into 4 groups
- p = 25: First quartile or lower quartile (LQ)
- p = 50: Second quartile or median
- p = 75: Third quartile or upper quartile (UQ)
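As a quick illustration of the three cut points, here is a minimal sketch using Python's standard-library `statistics` module (Python 3.8 or later). The `data` list is invented for the example, and the `"inclusive"` method is an assumption chosen so that the minimum and maximum are treated as the 0th and 100th percentiles.

```python
import statistics

# Illustrative values only, already sorted for readability.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 asks for the three cut points that split the data into four groups.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

print("first quartile (p = 25):", q1)   # 3.0
print("second quartile (p = 50):", q2)  # 5.0
print("third quartile (p = 75):", q3)   # 7.0
print("median:", statistics.median(data))
```

The second quartile and the median are the same number, which is why the two names are used interchangeably.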
Percentiles & Quartiles in Python
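- By default, the summary from .describe() reports the three quartiles (25%, 50% and 75%) along with the minimum and maximum; for other percentiles we can use .quantile()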
In [22]:
Income_Data['capital-gain'].quantile([0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])
Out[22]:
In [23]:
Income_Data['capital-loss'].quantile([0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])
Out[23]:
In [24]:
Income_Data['hours-per-week'].quantile([0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])
Out[24]:
It looks like some people are working 90 hours per week.
Practice: Percentiles & Quartiles in Python
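A grid of deciles can hide what happens between the 90th percentile and the maximum. The sketch below zooms in on the upper tail with the same `quantile` call. The `hours` Series is a small made-up sample, not the `hours-per-week` column from `Income_Data`.

```python
import pandas as pd

# Hypothetical weekly working hours, invented for illustration.
hours = pd.Series([20, 35, 40, 40, 40, 40, 45, 50, 60, 90])

# Compare the median and the 90th percentile with the maximum (p = 1.0).
upper_tail = hours.quantile([0.5, 0.9, 1.0])
print(upper_tail)

# A maximum far above the 90th percentile suggests a few extreme values.
gap = upper_tail.iloc[2] - upper_tail.iloc[1]
print("max minus 90th percentile:", gap)
```

On real data, adding points such as 0.95 and 0.99 to the list gives a finer view of where the extreme values start.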
- Dataset: “./Bank Marketing/bank_market.csv”
- Get the summary of the balance variable
- Do you suspect any outliers in balance ?
- Get relevant percentiles and see their distribution.
- Are there really some outliers present?
- Get the summary of the age variable
- Do you suspect any outliers in age?
- Get relevant percentiles and see their distribution.
- Are there really some outliers present?
In [25]:
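import pandas as pd  # already imported if you are continuing from the previous sessions
# Load the bank marketing data and check its dimensions (rows, columns)
bank = pd.read_csv("datasets\\Bank Marketing\\bank_market.csv", encoding="ISO-8859-1")
bank.shape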
Out[25]:
In [26]:
# Get the summary of the balance variable
# We can find the summary of the balance variable by using .describe()
summary_bala = bank["balance"].describe()
summary_bala
Out[26]:
Yes, outliers are likely, because the mean and the median are very different.
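To see why a gap between the mean and the median hints at outliers, here is a small sketch on invented numbers (the `balances` values are made up and are not taken from bank_market.csv). A single extreme value pulls the mean far above the median, while the median itself barely moves.

```python
import pandas as pd

# Hypothetical balances: nine ordinary values and one extreme value.
balances = pd.Series([100, 200, 300, 400, 500, 600, 700, 800, 900, 50000])

summary = balances.describe()
mean_balance = summary["mean"]   # pulled up by the single extreme value
median_balance = summary["50%"]  # the 50% row of describe() is the median

print(summary)
print("mean minus median:", mean_balance - median_balance)
```

Here the mean is even larger than the 75% value, which would be a strong signal to look at the upper percentiles next.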
In [27]:
# Get relevant percentiles and see their distribution
bank['balance'].quantile([0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])
Out[27]:
In [28]:
# Get the summary of the age variable
summary_age = bank['age'].describe()
summary_age
Out[28]:
It looks like there are no outliers in age.
In [29]:
# Get relevant percentiles and see their distribution
bank['age'].quantile([0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])
Out[29]: