Link to the previous post: https://statinfer.com/104-3-1-data-sampling-in-python/

**Descriptive statistics**

- Basic descriptive statistics give us an idea of the variables and their distributions.
- They let the analyst describe many pieces of data with a few indices.
- They can also help us spot outliers in the dataset, which is important to do before cleaning the data.
- Central tendencies:
  - Mean
  - Median
- Dispersion:
  - Range
  - Variance
  - Standard deviation

### Central Tendencies

- Mean
  - The arithmetic mean
  - Sum of values / count of values
  - Gives a quick idea of the average of a variable
- Median
  - The mean is not a good measure in the presence of outliers
  - For example, consider the data vector below:
    - 1.5, 1.7, 1.9, 0.8, 0.8, 1.2, 1.9, 1.4, 9, 0.7, 1.1
  - About 90% of these values (10 of 11) are less than 2, but the mean of the vector is 2
  - There is one unusual value in the vector: 9
  - Such a value is known as an outlier
  - The mean is not the true middle value in the presence of outliers; it is strongly affected by them
  - In such cases we use the median, the true middle value
  - To find it, sort the data in ascending or descending order and take the middle value

Numbers | Sorted Numbers |
---|---|
1.5 | 0.7 |
1.7 | 0.8 |
1.9 | 0.8 |
0.8 | 1.1 |
0.8 | 1.2 |
1.2 | 1.4 |
1.9 | 1.5 |
1.4 | 1.7 |
9 | 1.9 |
0.7 | 1.9 |
1.1 | 9 |
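The sort-and-pick-the-middle procedure can be sketched in plain Python as follows. The helper name `median_by_sorting` is hypothetical (it is not part of the original post), and the even-count branch uses the usual convention of averaging the two middle values, which the example above does not need because it has 11 values.

```python
def median_by_sorting(values):
    """Return the middle value of the sorted data."""
    ordered = sorted(values)   # ascending order, as in the table
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: there is a single middle value
        return ordered[mid]
    # Even count: average the two middle values (standard convention)
    return (ordered[mid - 1] + ordered[mid]) / 2


data = [1.5, 1.7, 1.9, 0.8, 0.8, 1.2, 1.9, 1.4, 9, 0.7, 1.1]
print(sorted(data))
print(median_by_sorting(data))
```

With 11 values, the function picks the 6th value of the sorted column.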

- The mean of the data is 2
- The median of the data is 1.4
- Even if the outlier were 90 instead of 9, the median would stay the same
- The median is a positional measure, so it is barely affected by outliers
- When there are no outliers, the mean and median tend to be close to each other
- When the mean differs noticeably from the median, it hints at the presence of outliers in the data
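A minimal sketch of these points, using Python's standard `statistics` module on the same data vector. The variable names `data_90` and `no_outlier` are introduced here only for illustration.

```python
from statistics import mean, median

data = [1.5, 1.7, 1.9, 0.8, 0.8, 1.2, 1.9, 1.4, 9, 0.7, 1.1]

# Same data, but with the outlier 9 replaced by 90
data_90 = [90 if x == 9 else x for x in data]

# Same data with the outlier dropped altogether
no_outlier = [x for x in data if x != 9]

for label, values in [
    ("original (outlier 9)", data),
    ("outlier changed to 90", data_90),
    ("outlier removed", no_outlier),
]:
    # The mean moves with the outlier; the median stays put
    print(f"{label}: mean={mean(values):.2f}, median={median(values):.2f}")
```

The mean jumps when the outlier grows, while the median stays at 1.4. Once the outlier is removed, the mean and median come out almost identical.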

### Mean and Median in Python

```
# Income_Data is assumed to be already loaded as a pandas DataFrame
gain_mean = Income_Data["capital-gain"].mean()  # mean of the capital-gain column
gain_mean
```

```
# Median of the same column, for comparison with the mean
gain_median = Income_Data["capital-gain"].median()
gain_median
```

### Practice: Mean and Median in Python

- Dataset: “./Online Retail Sales Data/Online Retail.csv”
- What is the mean of “UnitPrice”?
- What is the median of “UnitPrice”?
- Is the mean equal to the median? Do you suspect the presence of outliers in the data?
- What is the mean of “Quantity”?
- What is the median of “Quantity”?
- Is the mean equal to the median? Do you suspect the presence of outliers in the data?

```
import pandas as pd

# Load the retail data; forward slashes in the path work on Windows, macOS and Linux
Retail = pd.read_csv("datasets/Online Retail Sales Data/Online Retail.csv", encoding="ISO-8859-1")
Retail.shape
```

```
# Mean of UnitPrice
UnitPrice_mean = Retail["UnitPrice"].mean()
UnitPrice_mean
```

```
# Median of UnitPrice
UnitPrice_median = Retail["UnitPrice"].median()
UnitPrice_median
```

```
# Mean of Quantity
Quantity_mean = Retail["Quantity"].mean()
Quantity_mean
```

```
# Median of Quantity
Quantity_median = Retail["Quantity"].median()
Quantity_median
```

Yes, it looks like outliers are present in this variable.
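To compare the mean and median of several columns in one step, pandas' `DataFrame.agg` can be used. The sketch below runs on a small made-up DataFrame (`toy`) rather than the real retail file, so the numbers are purely illustrative and say nothing about the actual dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the retail file; all values are made up
toy = pd.DataFrame({
    "UnitPrice": [2.5, 3.0, 2.0, 4.0, 3.5],
    "Quantity": [6, 8, 6, 12, 1000],  # one deliberately extreme value
})

# One row per statistic, one column per variable
summary = toy[["UnitPrice", "Quantity"]].agg(["mean", "median"])
print(summary)

# Absolute gap between mean and median for each column;
# a large gap hints at outliers or a skewed distribution
gap = (summary.loc["mean"] - summary.loc["median"]).abs()
print(gap)
```

On the real data, the same two calls could be applied to `Retail` in place of `toy`.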

The next post is on dispersion measures in Python.

Link to the next post: https://statinfer.com/104-3-3-dispersion-measures-in-python/