Link to the previous post : https://statinfer.com/104-3-5-box-plots-and-outlier-dectection-using-python/
In the last post we drew a box plot and saw how it helps in detecting outliers.
In this post we will cover the three most common plots used for preliminary analysis.
- Scatter Plot
- Bar Chart
- Trend Chart
Scatter Plot
A scatter plot draws one point per observation, with one variable on each axis, and so gives an indication of the relation between the two chosen variables.
Example:
import pandas as pd
cars = pd.read_csv("datasets\\Cars Data\\Cars.csv", encoding="ISO-8859-1")
cars.shape
(428, 15)
cars.columns.values
array(['Make', 'Model', 'Type', 'Origin', 'DriveTrain', 'MSRP', 'Invoice', 'EngineSize', 'Cylinders', 'Horsepower', 'MPG_City', 'MPG_Highway', 'Weight', 'Wheelbase', 'Length'], dtype=object)
cars['Horsepower'].describe()
count    428.000000
mean     215.885514
std       71.836032
min       73.000000
25%      165.000000
50%      210.000000
75%      255.000000
max      500.000000
Name: Horsepower, dtype: float64
cars['MPG_City'].describe()
count    428.000000
mean      20.060748
std        5.238218
min       10.000000
25%       17.000000
50%       19.000000
75%       21.250000
max       60.000000
Name: MPG_City, dtype: float64
import matplotlib.pyplot as plt
plt.scatter(cars.Horsepower,cars.MPG_City)
<matplotlib.collections.PathCollection at 0xd272e10>
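A bare scatter plot is easier to read with axis labels, and a correlation coefficient puts a rough number on what the eye sees. The following is a minimal sketch of that idea, not the code used in this post: the `demo` data frame and its values are made up for illustration, and only the column names are borrowed from the cars data.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for a few rows of the cars data (values are made up)
demo = pd.DataFrame({
    "Horsepower": [100, 150, 200, 250, 300, 400],
    "MPG_City": [32, 26, 21, 18, 16, 12],
})

fig, ax = plt.subplots()
ax.scatter(demo["Horsepower"], demo["MPG_City"])
ax.set_xlabel("Horsepower")
ax.set_ylabel("MPG_City")
ax.set_title("Horsepower vs. city mileage (illustrative data)")

# Pearson correlation: values near -1 or +1 suggest a strong linear relation,
# values near 0 suggest little or no linear relation
corr = demo["Horsepower"].corr(demo["MPG_City"])
print(round(corr, 2))
```

With the real data, the same two calls would be `cars["Horsepower"]` and `cars["MPG_City"]` in place of the `demo` columns.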
Practice : Creating Scatter Plots
- Dataset: “./Sporting_goods_sales/Sporting_goods_sales.csv”
- Draw a scatter plot between Average_Income and Sales. Is there any relation between the two variables?
- Draw a scatter plot between Under35_Population_pect and Sales. Is there any relation between the two?
sports_data=pd.read_csv("datasets\\Sporting_goods_sales\\Sporting_goods_sales.csv",encoding = "ISO-8859-1")
sports_data.head(5)
| | Sr_no | Avg_family_size | Average_Income | M_F_Gender_Ratio | Un_emp_rate | Under35_Population_pect | Number_schools | Sales |
---|---|---|---|---|---|---|---|---|
0 | 1 | 3 | 9305.306044 | 46.654268 | 2.587691 | 51.426218 | 395.379432 | 140870.7288 |
1 | 2 | 2 | 8907.622334 | 64.505029 | 2.731910 | 28.485052 | 316.503520 | 100305.7146 |
2 | 3 | 2 | 9846.602630 | 63.595331 | 4.269577 | 49.452727 | 359.077144 | 135474.6688 |
3 | 4 | 2 | 8871.731173 | 50.451251 | 3.124004 | 44.678507 | 346.833014 | 126349.5082 |
4 | 5 | 4 | 9891.047985 | 51.353801 | 2.004201 | 37.664024 | 329.034161 | 117434.7267 |
# Draw a scatter plot between Average_Income and Sales. Is there any relation between the two variables?
import matplotlib.pyplot as plt
plt.scatter(sports_data.Average_Income ,sports_data.Sales)
<matplotlib.collections.PathCollection at 0xd2dbba8>
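# Draw a scatter plot between Under35_Population_pect and Sales. Is there any relation between the two?
import matplotlib.pyplot as plt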
plt.scatter(sports_data.Under35_Population_pect ,sports_data.Sales)
<matplotlib.collections.PathCollection at 0xd4f9eb8>
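To compare the two practice plots directly, it can help to draw them side by side with a shared y-axis. This is only a sketch under a clear limitation: it uses just the five rows printed by `head(5)` above, so it shows the layout and says nothing reliable about the actual relation in the full dataset. The name `sample` is introduced here for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Only the five rows printed by head(5); the real file has more rows
sample = pd.DataFrame({
    "Average_Income": [9305.306044, 8907.622334, 9846.602630,
                       8871.731173, 9891.047985],
    "Under35_Population_pect": [51.426218, 28.485052, 49.452727,
                                44.678507, 37.664024],
    "Sales": [140870.7288, 100305.7146, 135474.6688,
              126349.5082, 117434.7267],
})

# One panel per explanatory variable, both against Sales
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for ax, col in zip(axes, ["Average_Income", "Under35_Population_pect"]):
    ax.scatter(sample[col], sample["Sales"])
    ax.set_xlabel(col)
axes[0].set_ylabel("Sales")
```

Replacing `sample` with `sports_data` would draw the same pair of panels for the whole file.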
Bar Chart
Bar charts are used to summarize categorical variables: each bar shows how many observations fall into one category.
freq=cars.Cylinders.value_counts()
freq.values
array([190, 136, 87, 7, 3, 2, 1], dtype=int64)
freq.index
Float64Index([6.0, 4.0, 8.0, 5.0, 12.0, 10.0, 3.0], dtype='float64')
import matplotlib.pyplot as plt
plt.bar(freq.index,freq.values)
<Container object of 4 artists>
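Because `freq.index` holds numbers, `plt.bar` places the bars at their numeric positions, which leaves gaps (for example between 8 and 10) and orders nothing by category. One possible variation, sketched below, sorts the counts by cylinder number and passes string labels so the bars are evenly spaced. The counts are copied from the `value_counts()` output above; the axis label text is my own choice.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Counts copied from the value_counts() output shown above
freq = pd.Series(
    [190, 136, 87, 7, 3, 2, 1],
    index=[6.0, 4.0, 8.0, 5.0, 12.0, 10.0, 3.0],
)

# Sort by cylinder count, then use string labels so matplotlib treats
# the x-axis as categorical and spaces the bars evenly
freq = freq.sort_index()
labels = [str(int(c)) for c in freq.index]

fig, ax = plt.subplots()
bars = ax.bar(labels, freq.values)
ax.set_xlabel("Cylinders")
ax.set_ylabel("Number of cars")
```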
Practice : Bar Chart
- Dataset: “./Sporting_goods_sales/Sporting_goods_sales.csv”
- Create a bar chart summarizing the information on family size.
freq=sports_data.Avg_family_size.value_counts()
freq.values
array([61, 57, 18, 14], dtype=int64)
freq.index
Int64Index([3, 2, 4, 1], dtype='int64')
import matplotlib.pyplot as plt
plt.bar(freq.index,freq.values)
<Container object of 4 artists>
Trend Chart
- A trend chart plots a variable in time order and is used for time series datasets.
AirPassengers=pd.read_csv("datasets\\Air Travel Data\\Air_travel.csv", encoding = "ISO-8859-1")
AirPassengers.head()
| | DATE | AIR |
---|---|---|
0 | JAN49 | 112 |
1 | FEB49 | 118 |
2 | MAR49 | 132 |
3 | APR49 | 129 |
4 | MAY49 | 121 |
AirPassengers.columns.values
array(['DATE', 'AIR'], dtype=object)
import matplotlib.pyplot as plt
plt.plot(AirPassengers.AIR)
[<matplotlib.lines.Line2D at 0xd55ff98>]
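The plot above uses the row number on the x-axis, because only the AIR column is passed to `plt.plot`. A possible refinement, sketched here, is to parse the DATE strings into real dates and plot against those. Two things are assumptions on my part: that every DATE value follows the same three-letter-month plus two-digit-year pattern as the rows shown, and that the two-digit years belong to the 1900s. The second matters because parsing with `%y` would typically read 49 as 2049. The column name `parsed_date` is introduced here for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# The five rows printed by head(); the full file is assumed to follow
# the same MONYY pattern
air = pd.DataFrame({
    "DATE": ["JAN49", "FEB49", "MAR49", "APR49", "MAY49"],
    "AIR": [112, 118, 132, 129, 121],
})

# Assumption: two-digit years are in the 1900s, so build a four-digit
# year explicitly instead of relying on "%y"
month = air["DATE"].str[:3].str.title()   # "JAN" -> "Jan"
year = "19" + air["DATE"].str[3:]         # "49"  -> "1949"
air["parsed_date"] = pd.to_datetime(year + month, format="%Y%b")

fig, ax = plt.subplots()
ax.plot(air["parsed_date"], air["AIR"])
ax.set_xlabel("Date")
ax.set_ylabel("AIR")
fig.autofmt_xdate()  # tilt the date labels so they do not overlap
```

The same three parsing lines should work unchanged on the `AirPassengers` data frame if the whole DATE column matches the pattern.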