Link to the previous post : https://statinfer.com/104-3-1-data-sampling-in-python/
Descriptive statistics
- Basic descriptive statistics give us an idea of the variables and their distributions.
- They let the analyst describe many pieces of data with a few indices.
- They can also help us spot outliers in the dataset, which is important to do before cleaning the data.
- Central tendencies:
  - Mean
  - Median
- Dispersion:
  - Range
  - Variance
  - Standard deviation
Central Tendencies
- Mean
  - The arithmetic mean
  - Sum of values / Count of values
  - Gives a quick idea of the average of a variable
- Median
  - The mean is not a good measure in the presence of outliers
  - For example, consider the data vector below:
    - 1.5, 1.7, 1.9, 0.8, 0.8, 1.2, 1.9, 1.4, 9, 0.7, 1.1
  - 10 of the 11 values (about 90%) are less than 2, but the mean of the vector is 2
  - There is one unusual value in the vector: 9
  - Such a value is known as an outlier
  - The mean is not the true middle value in the presence of outliers; it is strongly affected by them
  - In such cases we use the median, the true middle value
  - To find it, sort the data in either ascending or descending order and take the middle value
Numbers | Sorted Numbers |
---|---|
1.5 | 0.7 |
1.7 | 0.8 |
1.9 | 0.8 |
0.8 | 1.1 |
0.8 | 1.2 |
1.2 | 1.4 |
1.9 | 1.5 |
1.4 | 1.7 |
9 | 1.9 |
0.7 | 1.9 |
1.1 | 9 |
- Mean of the data is 2
- Median of the data is 1.4
- Even if the outlier were 90 instead of 9, the median would stay the same
- The median is a positional measure, so it is barely affected by outliers
- When there are no outliers and the data is roughly symmetric, the mean and median will be nearly equal
- A large gap between the mean and the median hints at outliers in the data
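To make the comparison concrete, here is a minimal sketch that recomputes both measures for the data vector above, then swaps the outlier 9 for 90. It uses Python's standard `statistics` module rather than pandas so that it runs on its own; the variable names are illustrative only.

```python
import statistics

# Data vector from the example above
values = [1.5, 1.7, 1.9, 0.8, 0.8, 1.2, 1.9, 1.4, 9, 0.7, 1.1]

mean_value = statistics.mean(values)      # sum of values / count of values
median_value = statistics.median(values)  # middle value of the sorted data

# Replace the outlier 9 with 90 and recompute both measures
values_90 = [90 if v == 9 else v for v in values]
mean_90 = statistics.mean(values_90)
median_90 = statistics.median(values_90)

print(f"outlier = 9 : mean = {mean_value:.2f}, median = {median_value}")
print(f"outlier = 90: mean = {mean_90:.2f}, median = {median_90}")
```

The mean moves from 2 to about 9.36 when the outlier grows, while the median stays at 1.4 in both cases.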
Mean and Median in Python
In [5]:
# Income_Data is assumed to be a pandas DataFrame loaded earlier in this series
gain_mean = Income_Data["capital-gain"].mean()
gain_mean
Out[5]:
In [6]:
# Median of the same column, for comparison with the mean
gain_median = Income_Data["capital-gain"].median()
gain_median
Out[6]:
The mean is far from the median, which suggests there are outliers. We need to look at percentiles and a box plot to confirm.
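As a rough illustration of what looking at percentiles involves, the sketch below computes quartiles and flags values outside the usual box-plot whiskers. `Income_Data` is not reproduced here, so it reuses the small data vector from the median example. The 1.5 × IQR rule is the common box-plot convention and is an assumption of this sketch, not something this post prescribes.

```python
import statistics

# Toy data: the example vector from the median discussion, not Income_Data
values = [1.5, 1.7, 1.9, 0.8, 0.8, 1.2, 1.9, 1.4, 9, 0.7, 1.1]

# Quartiles (25th, 50th and 75th percentiles)
q1, q2, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1  # interquartile range

# Box-plot whisker limits under the common 1.5 * IQR convention (assumption)
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [v for v in values if v < lower_fence or v > upper_fence]
print("quartiles:", q1, q2, q3)
print("flagged as outliers:", outliers)
```

On a pandas column, the equivalent checks would typically go through `Series.quantile()` and a box plot.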
Practice: Mean and Median in Python
- Dataset: “./Online Retail Sales Data/Online Retail.csv”
- What is the mean of “UnitPrice”?
- What is the median of “UnitPrice”?
- Is the mean equal to the median? Do you suspect the presence of outliers in the data?
- What is the mean of “Quantity”?
- What is the median of “Quantity”?
- Is the mean equal to the median? Do you suspect the presence of outliers in the data?
In [7]:
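import pandas as pd

# Load the retail data; the file is not UTF-8, so the encoding is given explicitly
Retail = pd.read_csv("datasets\\Online Retail Sales Data\\Online Retail.csv", encoding="ISO-8859-1")
Retail.shape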
Out[7]:
In [8]:
UnitPrice_mean = Retail["UnitPrice"].mean()
UnitPrice_mean
Out[8]:
In [9]:
UnitPrice_median = Retail["UnitPrice"].median()
UnitPrice_median
Out[9]:
In [10]:
# Use a Quantity-specific name so the UnitPrice result is not overwritten
Quantity_mean = Retail["Quantity"].mean()
Quantity_mean
Out[10]:
In [11]:
# Use a Quantity-specific name so the UnitPrice result is not overwritten
Quantity_median = Retail["Quantity"].median()
Quantity_median
Out[11]:
Yes, it looks like there are outliers present in this variable.
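The four separate calls in this exercise can also be collapsed into a single summary table. The sketch below uses pandas' `DataFrame.agg` on a tiny made-up frame. The values are invented for illustration and are not taken from Online Retail.csv; only the column names match the exercise.

```python
import pandas as pd

# Toy stand-in for the retail data; values are made up for illustration
# and are NOT taken from Online Retail.csv
toy = pd.DataFrame({
    "UnitPrice": [2.5, 3.4, 1.2, 4.0, 2.1, 250.0],
    "Quantity": [6, 8, 2, 12, 4, 1000],
})

# One call returns the mean and median of both columns side by side
summary = toy[["UnitPrice", "Quantity"]].agg(["mean", "median"])
print(summary)

# Gap between mean and median per column; a large gap hints at outliers
gap = summary.loc["mean"] - summary.loc["median"]
print(gap)
```

With the real data, replacing `toy` with `Retail` should give the same layout, with one row per statistic and one column per variable.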
The next post is on dispersion measures in python.
Link to the next post : https://statinfer.com/104-3-3-dispersion-measures-in-python/