Statinfer

204.1.10 Practice : Multiple Regression with Multicollinearity

Dealing with multicollinearity.
Link to the previous post : https://statinfer.com/204-1-9-issue-of-multicollinearity-in-python/
In this practice post we build a multiple regression model and then try to improve it by removing the multicollinearity among its predictors.

Practice : Multiple Regression

  • Dataset: Webpage_Product_Sales/Webpage_Product_Sales.csv
  • Build a model to predict Sales using the rest of the variables.
  • Drop the least impactful variables based on p-values.
  • Is there any multicollinearity?
  • How many variables are there in the final model?
  • What is the R-squared of the final model?
  • Can you improve the model using the same data and variables?
In [57]:
import pandas as pd 
Webpage_Product_Sales=pd.read_csv("datasets\\Webpage_Product_Sales\\Webpage_Product_Sales.csv")
Webpage_Product_Sales.shape
Out[57]:
(675, 12)
In [58]:
Webpage_Product_Sales.columns
Out[58]:
Index(['ID', 'DayofMonth', 'Weekday', 'Month', 'Social_Network_Ref_links',
       'Online_Ad_Paid_ref_links', 'Clicks_From_Serach_Engine',
       'Special_Discount', 'Holiday', 'Server_Down_time_Sec', 'Web_UI_Score',
       'Sales'],
      dtype='object')
In [59]:
import statsmodels.formula.api as sm
model1 = sm.ols(formula='Sales ~ Web_UI_Score+Server_Down_time_Sec+Holiday+Special_Discount+Clicks_From_Serach_Engine+Online_Ad_Paid_ref_links+Social_Network_Ref_links+Month+Weekday+DayofMonth', data=Webpage_Product_Sales)
fitted1 = model1.fit()
fitted1.summary()
Out[59]:
OLS Regression Results
Dep. Variable:      Sales              R-squared:            0.818
Model:              OLS                Adj. R-squared:       0.815
Method:             Least Squares      F-statistic:          298.4
Date:               Wed, 27 Jul 2016   Prob (F-statistic):   5.54e-238
Time:               12:45:36           Log-Likelihood:       -6456.7
No. Observations:   675                AIC:                  1.294e+04
Df Residuals:       664                BIC:                  1.299e+04
Df Model:           10
Covariance Type:    nonrobust

                            coef        std err   t        P>|t|   [95.0% Conf. Int.]
Intercept                   6545.8922   1286.240  5.089    0.000   4020.304   9071.481
Web_UI_Score                -6.2582     11.545    -0.542   0.588   -28.928    16.412
Server_Down_time_Sec        -134.0441   14.009    -9.569   0.000   -161.551   -106.537
Holiday                     1.877e+04   683.077   27.477   0.000   1.74e+04   2.01e+04
Special_Discount            4718.3978   402.019   11.737   0.000   3929.016   5507.780
Clicks_From_Serach_Engine   -0.1258     0.944     -0.133   0.894   -1.980     1.728
Online_Ad_Paid_ref_links    6.1557      1.002     6.142    0.000   4.188      8.124
Social_Network_Ref_links    6.6841      0.411     16.261   0.000   5.877      7.491
Month                       481.0294    41.508    11.589   0.000   399.527    562.532
Weekday                     1355.2153   67.224    20.160   0.000   1223.218   1487.213
DayofMonth                  47.0579     15.198    3.096    0.002   17.216     76.900

Omnibus:            40.759             Durbin-Watson:        1.356
Prob(Omnibus):      0.000              Jarque-Bera (JB):     102.136
Skew:               0.297              Prob(JB):             6.63e-23
Kurtosis:           4.811              Cond. No.             2.57e+04
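The p-values in the summary can also be read programmatically instead of by eye. The snippet below is a minimal sketch on synthetic data; the frame `demo`, its columns `x1`, `x2`, `y`, and the 0.05 cutoff are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (hypothetical): y depends on x1 only, x2 is pure noise
rng = np.random.default_rng(0)
n = 300
demo = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
demo["y"] = 3.0 * demo["x1"] + rng.normal(size=n)

fitted_demo = smf.ols("y ~ x1 + x2", data=demo).fit()

# pvalues is a pandas Series indexed by term name; leave the intercept out
pvals = fitted_demo.pvalues.drop("Intercept")

# Candidates for removal: predictors whose p-value exceeds the chosen cutoff
weak = pvals[pvals > 0.05].sort_values(ascending=False)
print(weak)
```

Applied to the model above, the same filter on `fitted1.pvalues` would flag Clicks_From_Serach_Engine (0.894) and Web_UI_Score (0.588), the two predictors with p-values above 0.05 in the summary.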
In [60]:
# VIF of each predictor; vif_cal is a helper function that is not defined in this post
vif_cal(Webpage_Product_Sales,"Sales")
ID  VIF =  1.18
DayofMonth  VIF =  1.01
Weekday  VIF =  1.0
Month  VIF =  1.19
Social_Network_Ref_links  VIF =  1.02
Online_Ad_Paid_ref_links  VIF =  12.13
Clicks_From_Serach_Engine  VIF =  12.08
Special_Discount  VIF =  1.37
Holiday  VIF =  1.38
Server_Down_time_Sec  VIF =  1.02
Web_UI_Score  VIF =  1.02
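The `vif_cal` helper is called above but never defined in this post. The sketch below is a hypothetical stand-in that matches the call signature and the printed format seen here; the actual implementation may differ. It regresses each predictor on all the others with ordinary least squares and reports VIF = 1 / (1 - R²). The demo frame and its columns `a`, `b`, `c`, `target` are made up for illustration.

```python
import numpy as np
import pandas as pd

def vif_cal(input_data, dependent_col):
    """Print and return the VIF of every column except dependent_col.

    Hypothetical stand-in for the helper used in this post.
    """
    x_vars = input_data.drop([dependent_col], axis=1)
    vifs = {}
    for name in x_vars.columns:
        y = x_vars[name].to_numpy(dtype=float)
        others = x_vars.drop([name], axis=1).to_numpy(dtype=float)
        # Regress this predictor on the remaining ones, with an intercept
        X = np.column_stack([np.ones(len(y)), others])
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ coef
        r2 = 1.0 - float(resid @ resid) / float(((y - y.mean()) ** 2).sum())
        # Note: a perfectly collinear column (R^2 == 1) would divide by zero here
        vifs[name] = round(1.0 / (1.0 - r2), 2)
        print(name, " VIF = ", vifs[name])
    return vifs

# Small synthetic check: b is almost a copy of a, c is unrelated
rng = np.random.default_rng(0)
a = rng.normal(size=500)
demo = pd.DataFrame({
    "a": a,
    "b": a + 0.1 * rng.normal(size=500),
    "c": rng.normal(size=500),
    "target": rng.normal(size=500),
})
demo_vifs = vif_cal(demo, "target")
```

Because the calls in this post pass the whole data frame, the ID column appears in the VIF listing even though it is not part of the regression formula.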
In [61]:
# Drop Clicks_From_Serach_Engine: its VIF (12.08) is far above the rest,
# and its p-value (0.894) is the largest in model1

import statsmodels.formula.api as sm
model2 = sm.ols(formula='Sales ~ Web_UI_Score+Server_Down_time_Sec+Holiday+Special_Discount+Online_Ad_Paid_ref_links+Social_Network_Ref_links+Month+Weekday+DayofMonth', data=Webpage_Product_Sales)
fitted2 = model2.fit()
fitted2.summary()
Out[61]:
OLS Regression Results
Dep. Variable:      Sales              R-squared:            0.818
Model:              OLS                Adj. R-squared:       0.815
Method:             Least Squares      F-statistic:          332.0
Date:               Wed, 27 Jul 2016   Prob (F-statistic):   2.98e-239
Time:               12:48:18           Log-Likelihood:       -6456.7
No. Observations:   675                AIC:                  1.293e+04
Df Residuals:       665                BIC:                  1.298e+04
Df Model:           9
Covariance Type:    nonrobust

                            coef        std err   t        P>|t|   [95.0% Conf. Int.]
Intercept                   6598.7469   1222.658  5.397    0.000   4198.012   8999.482
Web_UI_Score                -6.3332     11.523    -0.550   0.583   -28.959    16.293
Server_Down_time_Sec        -133.9518   13.981    -9.581   0.000   -161.405   -106.499
Holiday                     1.877e+04   681.292   27.557   0.000   1.74e+04   2.01e+04
Special_Discount            4713.9295   400.323   11.775   0.000   3927.881   5499.978
Online_Ad_Paid_ref_links    6.0279      0.291     20.740   0.000   5.457      6.599
Social_Network_Ref_links    6.6872      0.410     16.307   0.000   5.882      7.492
Month                       480.6876    41.398    11.611   0.000   399.401    561.974
Weekday                     1355.2536   67.174    20.175   0.000   1223.355   1487.152
DayofMonth                  47.0168     15.184    3.097    0.002   17.203     76.831

Omnibus:            40.826             Durbin-Watson:        1.356
Prob(Omnibus):      0.000              Jarque-Bera (JB):     102.313
Skew:               0.298              Prob(JB):             6.07e-23
Kurtosis:           4.812              Cond. No.             1.94e+04
In [62]:
#VIF for the updated model
vif_cal(Webpage_Product_Sales.drop(["Clicks_From_Serach_Engine"],axis=1),"Sales")
ID  VIF =  1.18
DayofMonth  VIF =  1.01
Weekday  VIF =  1.0
Month  VIF =  1.19
Social_Network_Ref_links  VIF =  1.01
Online_Ad_Paid_ref_links  VIF =  1.02
Special_Discount  VIF =  1.36
Holiday  VIF =  1.38
Server_Down_time_Sec  VIF =  1.02
Web_UI_Score  VIF =  1.02
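A quick sanity check ties the VIF figures to the regression summaries. A coefficient's standard error scales with the square root of its VIF, so removing the collinear partner should shrink the standard error of Online_Ad_Paid_ref_links by roughly sqrt(12.13 / 1.02). The numbers below are copied from the outputs in this post. The comparison is only approximate: the residual variance changes slightly between the two models, and the VIFs were computed with the ID column included.

```python
import math

# Figures copied from the outputs in this post
se_before = 1.002    # std err of Online_Ad_Paid_ref_links in model1
se_after = 0.291     # std err of the same coefficient in model2
vif_before = 12.13   # VIF with Clicks_From_Serach_Engine present
vif_after = 1.02     # VIF after dropping Clicks_From_Serach_Engine

se_ratio = se_before / se_after                 # observed shrinkage
vif_ratio = math.sqrt(vif_before / vif_after)   # shrinkage implied by the VIFs
print(round(se_ratio, 2), round(vif_ratio, 2))
```

Both ratios come out near 3.4, which is why the coefficient estimate barely moves (6.1557 to 6.0279) while its t-statistic jumps from 6.142 to 20.740.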
In [63]:
# Drop the least impactful variables based on p-values.
# Web_UI_Score is dropped: its p-value (0.583) is the only one above 0.05 in model2

import statsmodels.formula.api as sm
model3 = sm.ols(formula='Sales ~ Server_Down_time_Sec+Holiday+Special_Discount+Online_Ad_Paid_ref_links+Social_Network_Ref_links+Month+Weekday+DayofMonth', data=Webpage_Product_Sales)
fitted3 = model3.fit()
fitted3.summary()
Out[63]:
OLS Regression Results
Dep. Variable:      Sales              R-squared:            0.818
Model:              OLS                Adj. R-squared:       0.816
Method:             Least Squares      F-statistic:          373.9
Date:               Wed, 27 Jul 2016   Prob (F-statistic):   1.74e-240
Time:               12:49:15           Log-Likelihood:       -6456.9
No. Observations:   675                AIC:                  1.293e+04
Df Residuals:       666                BIC:                  1.297e+04
Df Model:           8
Covariance Type:    nonrobust

                            coef        std err   t        P>|t|   [95.0% Conf. Int.]
Intercept                   6101.1539   821.286   7.429    0.000   4488.532   7713.776
Server_Down_time_Sec        -134.0717   13.972    -9.596   0.000   -161.507   -106.637
Holiday                     1.874e+04   678.528   27.623   0.000   1.74e+04   2.01e+04
Special_Discount            4726.1858   399.491   11.831   0.000   3941.771   5510.600
Online_Ad_Paid_ref_links    6.0357      0.290     20.802   0.000   5.466      6.605
Social_Network_Ref_links    6.6738      0.409     16.312   0.000   5.870      7.477
Month                       479.5231    41.322    11.605   0.000   398.386    560.660
Weekday                     1354.4252   67.122    20.179   0.000   1222.629   1486.221
DayofMonth                  46.9564     15.175    3.094    0.002   17.159     76.754

Omnibus:            41.049             Durbin-Watson:        1.352
Prob(Omnibus):      0.000              Jarque-Bera (JB):     103.243
Skew:               0.298              Prob(JB):             3.81e-23
Kurtosis:           4.821              Cond. No.             1.31e+04
In [65]:
# How many variables are there in the final model?
len(fitted3.params) - 1  # number of fitted terms, excluding the intercept
Out[65]:
8
In [69]:
#What is the R-squared of the final model?
fitted3.rsquared
Out[69]:
0.8178742020411971
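The adjusted R-squared and F-statistic reported for the final model follow directly from this R-squared, the number of observations, and the number of predictors. The short check below uses only values printed in this post and the standard formulas for a model with an intercept.

```python
# Values taken from the final model's output in this post
r2 = 0.8178742020411971   # fitted3.rsquared
n = 675                   # No. Observations
p = 8                     # Df Model (predictors, excluding the intercept)

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Overall F-statistic: explained variance per predictor over residual variance
f_stat = (r2 / p) / ((1 - r2) / (n - p - 1))

print(round(adj_r2, 3), round(f_stat, 1))
```

After rounding, both values agree with the summary of the final model (Adj. R-squared 0.816, F-statistic 373.9).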


The next post covers interaction terms.
Link to the next post : https://statinfer.com/204-1-11-interaction-terms/
