Link to the previous post: https://statinfer.com/204-1-9-issue-of-multicollinearity-in-python/
In this practice post we will build a multiple regression model and try to improve it by addressing multicollinearity among its predictors.
Practice: Multiple Regression
- Dataset: Webpage_Product_Sales/Webpage_Product_Sales.csv
- Build a model to predict Sales using the rest of the variables.
- Drop the less impactful variables based on p-values.
- Is there any multicollinearity?
- How many variables are there in the final model?
- What is the R-squared of the final model?
- Can you improve the model using the same data and variables?
In [57]:
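import pandas as pd
# Forward slashes in the path work on Windows, macOS and Linux alike
Webpage_Product_Sales=pd.read_csv("datasets/Webpage_Product_Sales/Webpage_Product_Sales.csv")
# (number of rows, number of columns)
Webpage_Product_Sales.shape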
Out[57]:
In [58]:
Webpage_Product_Sales.columns
Out[58]:
In [59]:
import statsmodels.formula.api as sm
model1 = sm.ols(formula='Sales ~ Web_UI_Score+Server_Down_time_Sec+Holiday+Special_Discount+Clicks_From_Serach_Engine+Online_Ad_Paid_ref_links+Social_Network_Ref_links+Month+Weekday+DayofMonth', data=Webpage_Product_Sales)
fitted1 = model1.fit()
fitted1.summary()
Out[59]:
In [60]:
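#VIF for every predictor in model1
#vif_cal is not defined in this post; it is assumed to carry over from the previous post
vif_cal(Webpage_Product_Sales,"Sales")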
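The vif_cal helper called above is not defined in this post. As a rough stand-in, here is a minimal sketch of what such a helper could look like, assuming every column is numeric and non-constant. Each predictor is regressed on all the other predictors, and its VIF is 1 / (1 - R-squared) of that regression. The body below is an assumption for illustration, not the original implementation from the previous post, and it returns a pandas Series instead of printing.

```python
import numpy as np
import pandas as pd

def vif_cal(input_data, dependent_col):
    """Illustrative stand-in: VIF of each predictor, given the target column name."""
    # Keep only the predictors; assumes they are all numeric and non-constant
    x_vars = input_data.drop(columns=[dependent_col]).astype(float)
    vifs = {}
    for col in x_vars.columns:
        y = x_vars[col].to_numpy()
        others = x_vars.drop(columns=[col]).to_numpy()
        # Regress this predictor on the remaining ones, with an intercept
        X = np.column_stack([np.ones(len(y)), others])
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ coef
        ss_res = float(resid @ resid)
        ss_tot = float(((y - y.mean()) ** 2).sum())
        r2 = 1.0 - ss_res / ss_tot
        # A perfect fit means the predictor is an exact linear combination of the others
        vifs[col] = 1.0 / (1.0 - r2) if r2 < 1.0 else float("inf")
    return pd.Series(vifs, name="VIF").round(2)
```

A common rule of thumb treats a VIF above 5 (some authors use 10) as a sign of multicollinearity, so the predictor with the largest value is the usual first candidate to drop.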
In [61]:
##Dropped Clicks_From_Serach_Engine based on VIF
import statsmodels.formula.api as sm
model2 = sm.ols(formula='Sales ~ Web_UI_Score+Server_Down_time_Sec+Holiday+Special_Discount+Online_Ad_Paid_ref_links+Social_Network_Ref_links+Month+Weekday+DayofMonth', data=Webpage_Product_Sales)
fitted2 = model2.fit()
fitted2.summary()
Out[61]:
In [62]:
#VIF for the updated model
vif_cal(Webpage_Product_Sales.drop(["Clicks_From_Serach_Engine"],axis=1),"Sales")
In [63]:
##Drop the less impactful variables based on p-values.
##Dropped Web_UI_Score because of its high p-value in the model2 summary
import statsmodels.formula.api as sm
model3 = sm.ols(formula='Sales ~ Server_Down_time_Sec+Holiday+Special_Discount+Online_Ad_Paid_ref_links+Social_Network_Ref_links+Month+Weekday+DayofMonth', data=Webpage_Product_Sales)
fitted3 = model3.fit()
fitted3.summary()
Out[63]:
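Dropping by p-value, as done with Web_UI_Score above, can be made a little less manual by reading the p-values directly off the fitted result instead of scanning the summary table. A small sketch, assuming a 0.05 cutoff (this post does not state one) and a hypothetical helper name weak_predictors; pvalues is the standard attribute on a fitted statsmodels result.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def weak_predictors(fitted, alpha=0.05):
    """Hypothetical helper: model terms whose p-value exceeds alpha, weakest first."""
    # The intercept is not a candidate for removal, so leave it out
    pvals = fitted.pvalues.drop("Intercept", errors="ignore")
    weak = pvals[pvals > alpha].sort_values(ascending=False)
    return weak.index.tolist()
```

For example, weak_predictors(fitted2) would list the candidates to consider after the VIF step. It is usually safer to drop one variable at a time and refit, because p-values shift whenever the set of predictors changes.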
In [65]:
#How many variables are there in the final model?
8
Out[65]:
In [69]:
#What is the R-squared of the final model?
fitted3.rsquared
Out[69]:
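Because model1, model2 and model3 use different numbers of predictors, plain R-squared is not an entirely fair comparison between them: in ordinary least squares it never decreases when a predictor is added. A minimal sketch of both R-squared and adjusted R-squared in NumPy, with hypothetical function names; statsmodels also reports the adjusted value as rsquared_adj on the fitted result.

```python
import numpy as np

def r_squared(y, y_hat):
    """R-squared = 1 - SS_res / SS_tot."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y - y_hat) ** 2)       # unexplained variation
    ss_tot = np.sum((y - y.mean()) ** 2)    # total variation around the mean
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(r2, n_obs, n_predictors):
    """Penalise R-squared for the number of predictors (intercept not counted)."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)
```

With the final model this would be called as adjusted_r_squared(fitted3.rsquared, len(Webpage_Product_Sales), 8), which should agree with fitted3.rsquared_adj.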