
104.3.2 Descriptive Statistics : Mean and Median

Basics of descriptive statistics.

Link to the previous post :  https://statinfer.com/104-3-1-data-sampling-in-python/

Descriptive statistics

  • Basic descriptive statistics give us an idea of the variables and their distributions.
  • They let the analyst describe many pieces of data with a few indices.
  • They can also help us spot outliers in the dataset, which is important to do before cleaning the data.
  • Central tendencies:
    • Mean
    • Median
  • Dispersion:
    • Range
    • Variance
    • Standard deviation

Central Tendencies

  • Mean
    • The arithmetic mean
    • Sum of values / Count of values
    • Gives a quick idea of the average value of a variable
  • Median
    • The mean is not a good measure in the presence of outliers
    • For example, consider the data vector below
      • 1.5, 1.7, 1.9, 0.8, 0.8, 1.2, 1.9, 1.4, 9, 0.7, 1.1
    • 10 of the 11 values (about 90%) are less than 2, but the mean of the vector is 2
    • There is one unusual value in the vector: 9
    • Such a value is known as an outlier.
    • The mean is strongly affected by outliers, so it is not the true middle value when they are present.
    • In such cases we use the median, the true middle value
    • To find it, sort the data in ascending or descending order and take the middle value
Numbers    Sorted Numbers
1.5        0.7
1.7        0.8
1.9        0.8
0.8        1.1
0.8        1.2
1.2        1.4
1.9        1.5
1.4        1.7
9          1.9
0.7        1.9
1.1        9
  • Mean of the data is 2
  • Median of the data is 1.4, the 6th of the 11 sorted values
  • Even if the outlier were 90 instead of 9, the median would stay the same
  • The median is a positional measure, so it is hardly affected by outliers
  • When there are no outliers and the data is roughly symmetric, the mean and median will be nearly equal
  • A large gap between the mean and the median hints at the presence of outliers in the data
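
The numbers above can be checked with a few lines of Python. This is a minimal sketch that uses the standard-library statistics module rather than pandas; the variable names values and extreme are chosen here for illustration only.

```python
from statistics import mean, median

# The data vector from the example above.
values = [1.5, 1.7, 1.9, 0.8, 0.8, 1.2, 1.9, 1.4, 9, 0.7, 1.1]

# The single large value (9) pulls the mean up to 2,
# while the median stays at the middle of the sorted values.
print(round(mean(values), 2))   # 2.0
print(median(values))           # 1.4

# Replace the outlier 9 with 90: the mean jumps, the median does not move.
extreme = [90 if v == 9 else v for v in values]
print(round(mean(extreme), 2))  # 9.36
print(median(extreme))          # 1.4
```

The mean is rounded before printing only to hide floating-point noise; the median needs no rounding because it is one of the original values.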

Mean and Median in Python

In [5]:
# Income_Data is assumed to be a pandas DataFrame loaded earlier in this series
gain_mean = Income_Data["capital-gain"].mean()
gain_mean
Out[5]:
1077.6488437087312
In [6]:
gain_median = Income_Data["capital-gain"].median()
gain_median
Out[6]:
0.0

The mean is far away from the median, which suggests there are outliers. To confirm, we need to look at percentiles and a box plot.
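
As a rough illustration of the percentile check, the sketch below uses pandas Series.quantile on a small made-up series. The name toy_gain and its values are hypothetical stand-ins chosen for this example, not the actual capital-gain column.

```python
import pandas as pd

# Hypothetical toy data: most values are 0, a few are very large.
toy_gain = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 5000, 20000])

print(toy_gain.mean())    # pulled up by the two large values
print(toy_gain.median())  # middle of the sorted values

# Percentiles show where the bulk of the data sits
# and how far the upper tail extends.
percentiles = toy_gain.quantile([0.25, 0.50, 0.75, 0.90, 1.00])
print(percentiles)
```

If the lower percentiles sit close to the median while the top ones are far larger, the gap between mean and median is coming from a few extreme values.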

Practice : Mean and Median in Python

  • Dataset: “./Online Retail Sales Data/Online Retail.csv”
  • What is the mean of “UnitPrice”?
  • What is the median of “UnitPrice”?
  • Is the mean equal to the median? Do you suspect the presence of outliers in the data?
  • What is the mean of “Quantity”?
  • What is the median of “Quantity”?
  • Is the mean equal to the median? Do you suspect the presence of outliers in the data?
In [7]:
import pandas as pd

# Adjust the path to wherever the dataset is stored on your machine
Retail = pd.read_csv("datasets\\Online Retail Sales Data\\Online Retail.csv", encoding="ISO-8859-1")
Retail.shape
Out[7]:
(541909, 8)
In [8]:
UnitPrice_mean=Retail["UnitPrice"].mean()
UnitPrice_mean
Out[8]:
4.611113626083471
In [9]:
UnitPrice_median=Retail["UnitPrice"].median()
UnitPrice_median
Out[9]:
2.08
In [10]:
Quantity_mean = Retail["Quantity"].mean()
Quantity_mean
Out[10]:
9.55224954743324
In [11]:
Quantity_median = Retail["Quantity"].median()
Quantity_median
Out[11]:
3.0

Yes, the mean is well above the median for both “UnitPrice” and “Quantity”, so it looks like outliers are present in these variables.
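
The same comparison can be done for several columns at once. The sketch below is one possible way to do it with DataFrame.agg, run on a tiny made-up table; toy_retail and its values are hypothetical and only borrow the column names from the exercise.

```python
import pandas as pd

# Hypothetical miniature of a retail table, not the real dataset.
toy_retail = pd.DataFrame({
    "UnitPrice": [1.25, 2.10, 2.55, 3.75, 40.00],
    "Quantity": [1, 2, 3, 6, 100],
})

# One row per statistic, one column per variable.
summary = toy_retail[["UnitPrice", "Quantity"]].agg(["mean", "median"])
print(summary)

# A large positive gap means a few big values are pulling the mean up.
gap = summary.loc["mean"] - summary.loc["median"]
print(gap)
```

Swapping toy_retail for the Retail DataFrame loaded above should give the same kind of summary for the real data.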

The next post is on dispersion measures in Python.

Link to the next post : https://statinfer.com/104-3-3-dispersion-measures-in-python/
