
203.1.12 Linear Regression with Multicollinearity in R and Conclusion

Finishing what we started

Practice: Multiple Regression

In the previous section, we studied the issue of multicollinearity in R.

  1. Import Dataset: Webpage_Product_Sales/Webpage_Product_Sales.csv
  2. Build a model to predict sales using the rest of the variables.
  3. Drop the less impactful variables based on p-values.
  4. Is there any multicollinearity?
  5. How many variables are there in the final model?
  6. What is the R-squared of the final model?
  7. Can you improve the model using the same data and variables?

Solution

  1. Import Dataset: Webpage_Product_Sales/Webpage_Product_Sales.csv
Webpage_Product_Sales = read.csv("R dataset\\Webpage_Product_Sales\\Webpage_Product_Sales.csv")
  2. Build a model to predict sales using the rest of the variables
web_sales_model1<-lm(Sales~Web_UI_Score+Server_Down_time_Sec+Holiday+Special_Discount+Clicks_From_Serach_Engine+Online_Ad_Paid_ref_links+Social_Network_Ref_links+Month+Weekday+DayofMonth, data=Webpage_Product_Sales)
summary(web_sales_model1)
## 
## Call:
## lm(formula = Sales ~ Web_UI_Score + Server_Down_time_Sec + Holiday + 
##     Special_Discount + Clicks_From_Serach_Engine + Online_Ad_Paid_ref_links + 
##     Social_Network_Ref_links + Month + Weekday + DayofMonth, 
##     data = Webpage_Product_Sales)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14391.9  -2186.2   -191.6   2243.1  15462.1 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                6545.8922  1286.2404   5.089 4.69e-07 ***
## Web_UI_Score                 -6.2582    11.5453  -0.542  0.58796    
## Server_Down_time_Sec       -134.0441    14.0087  -9.569  < 2e-16 ***
## Holiday                   18768.5954   683.0769  27.477  < 2e-16 ***
## Special_Discount           4718.3978   402.0193  11.737  < 2e-16 ***
## Clicks_From_Serach_Engine    -0.1258     0.9443  -0.133  0.89403    
## Online_Ad_Paid_ref_links      6.1557     1.0022   6.142 1.40e-09 ***
## Social_Network_Ref_links      6.6841     0.4111  16.261  < 2e-16 ***
## Month                       481.0294    41.5079  11.589  < 2e-16 ***
## Weekday                    1355.2153    67.2243  20.160  < 2e-16 ***
## DayofMonth                   47.0579    15.1982   3.096  0.00204 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3480 on 664 degrees of freedom
## Multiple R-squared:  0.818,  Adjusted R-squared:  0.8152 
## F-statistic: 298.4 on 10 and 664 DF,  p-value: < 2.2e-16
  3. Drop the less impactful variables based on p-values.

From the p-values in the output, we can see that Clicks_From_Serach_Engine (p = 0.894) and Web_UI_Score (p = 0.588) are not significant, so we drop these two variables.

web_sales_model2<-lm(Sales~Server_Down_time_Sec+Holiday+Special_Discount+Online_Ad_Paid_ref_links+Social_Network_Ref_links+Month+Weekday+DayofMonth,data=Webpage_Product_Sales)
summary(web_sales_model2)
## 
## Call:
## lm(formula = Sales ~ Server_Down_time_Sec + Holiday + Special_Discount + 
##     Online_Ad_Paid_ref_links + Social_Network_Ref_links + Month + 
##     Weekday + DayofMonth, data = Webpage_Product_Sales)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14305.2  -2154.8   -185.7   2252.3  15383.2 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               6101.1539   821.2864   7.429 3.37e-13 ***
## Server_Down_time_Sec      -134.0717    13.9722  -9.596  < 2e-16 ***
## Holiday                  18742.7123   678.5281  27.623  < 2e-16 ***
## Special_Discount          4726.1858   399.4915  11.831  < 2e-16 ***
## Online_Ad_Paid_ref_links     6.0357     0.2901  20.802  < 2e-16 ***
## Social_Network_Ref_links     6.6738     0.4091  16.312  < 2e-16 ***
## Month                      479.5231    41.3221  11.605  < 2e-16 ***
## Weekday                   1354.4252    67.1219  20.179  < 2e-16 ***
## DayofMonth                  46.9564    15.1755   3.094  0.00206 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3476 on 666 degrees of freedom
## Multiple R-squared:  0.8179, Adjusted R-squared:  0.8157 
## F-statistic: 373.9 on 8 and 666 DF,  p-value: < 2.2e-16
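
Reading p-values off the printed summary works for a handful of variables, but they can also be pulled out of the model object. The sketch below is a minimal illustration on simulated data, not the Webpage_Product_Sales file: the data frame `demo`, its columns, and the 0.05 cutoff are all assumptions made for the example.

```r
# Simulated data (hypothetical): y depends on x1 and x2, but not on noise.
set.seed(42)
n <- 200
demo <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
demo$y <- 3 + 2 * demo$x1 - 1.5 * demo$x2 + rnorm(n)

fit <- lm(y ~ x1 + x2 + noise, data = demo)

# summary()$coefficients is a matrix; p-values are in the "Pr(>|t|)" column.
coef_table <- summary(fit)$coefficients
p_values <- coef_table[-1, "Pr(>|t|)"]   # drop the intercept row

# Candidates to drop at the 5% level (a convention, not a hard rule).
drop_candidates <- names(p_values)[p_values > 0.05]
print(drop_candidates)

# Refit without the candidates; update() keeps the rest of the formula.
if (length(drop_candidates) > 0) {
  new_formula <- as.formula(
    paste(". ~ . -", paste(drop_candidates, collapse = " - "))
  )
  fit_reduced <- update(fit, new_formula)
  print(formula(fit_reduced))
}
```

Dropping variables one at a time and refitting after each removal is often safer than dropping several at once, because p-values can change when a correlated variable leaves the model.
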
  4. Is there any multicollinearity?
library(car)
vif(web_sales_model2)
##     Server_Down_time_Sec                  Holiday         Special_Discount 
##                 1.018345                 1.366781                 1.353936 
## Online_Ad_Paid_ref_links Social_Network_Ref_links                    Month 
##                 1.018222                 1.004572                 1.011388 
##                  Weekday               DayofMonth 
##                 1.004399                 1.003881

No. All VIF values are close to 1, well below the commonly used cutoffs of 5 or 10, so there is no sign of multicollinearity.
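
To see what a VIF measures, it can be computed by hand: regress each predictor on the remaining predictors and take 1 / (1 - R²). The sketch below does this in base R on simulated data; the helper `manual_vif` and the variables `x1`, `x2`, `x3` are hypothetical names for illustration, with `x3` deliberately built as a near copy of `x1`.

```r
# Simulated predictors (hypothetical): x3 is almost a copy of x1.
set.seed(1)
n <- 300
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- x1 + rnorm(n, sd = 0.1)
demo <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
# predictor j on all the other predictors.
manual_vif <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

vifs <- manual_vif(demo, c("x1", "x2", "x3"))
print(round(vifs, 2))
```

Here `x1` and `x3` get large VIFs while `x2` stays near 1. For a model with only numeric predictors, these hand-computed values should agree with what `vif()` from the car package reports.
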

  5. How many variables are there in the final model?

Eight

  6. What is the R-squared of the final model?

0.8179
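
As a quick check on how R-squared and adjusted R-squared relate, the adjusted value can be recomputed from the numbers in the web_sales_model2 summary. This is a small sketch using the standard formula; the observation count of 675 is inferred from the 666 residual degrees of freedom plus 8 predictors and the intercept.

```r
r2 <- 0.8179   # Multiple R-squared reported for web_sales_model2
n  <- 675      # observations: 666 residual df + 8 predictors + 1 intercept
p  <- 8        # number of predictors in the model

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 4))
```

The result rounds to 0.8157, which matches the Adjusted R-squared line in the summary output.
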

  7. Can you improve the model using the same data and variables?

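Not by adding or dropping variables: every remaining variable is significant and there is no multicollinearity. Interaction terms between the existing variables, however, can still improve the fit, as the next section shows.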

Interaction Terms

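Adding interaction terms might improve the prediction accuracy of the model. Choosing which interaction terms to add requires prior knowledge of the dataset and variables. The model below adds a Holiday:Weekday interaction; with it, adjusted R-squared rises from 0.8157 to 0.8632.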

web_sales_model3<-lm(Sales~Server_Down_time_Sec+Holiday+Special_Discount+Online_Ad_Paid_ref_links+Social_Network_Ref_links+Month+Weekday+DayofMonth+Holiday*Weekday,data=Webpage_Product_Sales)

summary(web_sales_model3)
## 
## Call:
## lm(formula = Sales ~ Server_Down_time_Sec + Holiday + Special_Discount + 
##     Online_Ad_Paid_ref_links + Social_Network_Ref_links + Month + 
##     Weekday + DayofMonth + Holiday * Weekday, data = Webpage_Product_Sales)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7486.3 -2073.0  -270.4  2104.2  9146.2 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              6753.6923   708.7910   9.528  < 2e-16 ***
## Server_Down_time_Sec     -140.4922    12.0438 -11.665  < 2e-16 ***
## Holiday                  2201.8694  1232.3364   1.787 0.074434 .  
## Special_Discount         4749.0044   344.1454  13.799  < 2e-16 ***
## Online_Ad_Paid_ref_links    5.9515     0.2500  23.805  < 2e-16 ***
## Social_Network_Ref_links    7.0657     0.3534  19.994  < 2e-16 ***
## Month                     480.3156    35.5970  13.493  < 2e-16 ***
## Weekday                  1164.8864    59.1435  19.696  < 2e-16 ***
## DayofMonth                 47.0967    13.0729   3.603 0.000339 ***
## Holiday:Weekday          4294.6865   281.6829  15.247  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2994 on 665 degrees of freedom
## Multiple R-squared:  0.865,  Adjusted R-squared:  0.8632 
## F-statistic: 473.6 on 9 and 665 DF,  p-value: < 2.2e-16
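
In an R formula, `a*b` is shorthand for `a + b + a:b`, which is why the call above still fits fine even though Holiday and Weekday are also listed separately. The sketch below illustrates this on simulated data and compares the models with and without the interaction; the data frame `demo` and the variables `a` and `b` are hypothetical stand-ins, not columns from the sales dataset.

```r
# Simulated data (hypothetical): the effect of b is stronger when a == 1.
set.seed(7)
n <- 400
demo <- data.frame(a = rbinom(n, 1, 0.3),
                   b = sample(1:7, n, replace = TRUE))
demo$y <- 10 + 2 * demo$a + demo$b + 3 * demo$a * demo$b + rnorm(n)

fit_main     <- lm(y ~ a + b, data = demo)         # main effects only
fit_star     <- lm(y ~ a * b, data = demo)         # shorthand form
fit_explicit <- lm(y ~ a + b + a:b, data = demo)   # expanded form

# The two interaction models have identical coefficients.
print(all.equal(coef(fit_star), coef(fit_explicit)))

# F-test for the nested models: does the interaction add explanatory power?
print(anova(fit_main, fit_star))

# Compare adjusted R-squared, since the larger model has one more term.
print(c(main        = summary(fit_main)$adj.r.squared,
        interaction = summary(fit_star)$adj.r.squared))
```

Comparing adjusted R-squared (rather than plain R-squared) and running the nested F-test are two simple ways to check that an interaction term earns its place in the model.
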

Conclusion

In this chapter, we discussed what simple and multiple regression are, how to build simple and multiple linear regression models, which metrics matter most in regression output, what multicollinearity is and how to detect and eliminate it, what R-squared and adjusted R-squared are and how they differ, and how to assess the individual impact of each variable.

This is a basic regression class. Once you have a good grasp of regression, you can explore more advanced topics such as adding polynomial and interaction terms to your regression line; sometimes they work like a charm. Adjusted R-squared is a good measure of training (in-time sample) error, but we can't be sure of the final model's performance based on it alone. We may have to perform cross-validation to get an idea of the testing error.
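
As a rough preview of the idea, k-fold cross-validation can be sketched in a few lines of base R: split the rows into k folds, fit on k - 1 of them, and measure the error on the fold left out. The example below uses simulated data; the data frame `demo`, the choice of k = 5, and RMSE as the error metric are assumptions for illustration.

```r
# Simulated data (hypothetical), standing in for a real modelling dataset.
set.seed(123)
n <- 300
demo <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
demo$y <- 5 + 2 * demo$x1 - demo$x2 + rnorm(n)

k <- 5
# Randomly assign each row to one of k folds of near-equal size.
fold_id <- sample(rep(1:k, length.out = n))

# For each fold: train on the other folds, test on the held-out fold.
fold_rmse <- sapply(1:k, function(i) {
  train <- demo[fold_id != i, ]
  test  <- demo[fold_id == i, ]
  fit   <- lm(y ~ x1 + x2, data = train)
  pred  <- predict(fit, newdata = test)
  sqrt(mean((test$y - pred)^2))
})

cv_rmse    <- mean(fold_rmse)
train_rmse <- sqrt(mean(residuals(lm(y ~ x1 + x2, data = demo))^2))
print(c(cv_rmse = cv_rmse, train_rmse = train_rmse))
```

A cross-validated error that is much larger than the training error is a sign that the model is overfitting the training data.
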

We will talk about cross-validation in more detail in future lectures. Outliers can influence the regression line, so we need to clean the data before building the model. At the end of the day, regression is a set of mathematical formulas: if the inputs are wrong, the results will be wrong too. Data cleaning is therefore very important before getting into regression.
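
One way to see how a single outlier can pull the regression line is to plant one in simulated data and flag it with Cook's distance. The sketch below uses base R's `cooks.distance()`; the data frame `demo`, the planted point, and the 4/n cutoff (a common rule of thumb, not a strict rule) are assumptions for the example.

```r
# Simulated data (hypothetical) with a true slope of 3.
set.seed(99)
n <- 100
demo <- data.frame(x = rnorm(n))
demo$y <- 2 + 3 * demo$x + rnorm(n)

# Plant one extreme point far from the trend.
demo$x[n] <- 6
demo$y[n] <- -20

fit <- lm(y ~ x, data = demo)

# Cook's distance measures how much each row influences the fitted values.
cd <- cooks.distance(fit)
flagged <- which(cd > 4 / n)
print(flagged)

# Refit without the flagged rows and compare the coefficients.
keep <- !(seq_len(n) %in% flagged)
fit_clean <- lm(y ~ x, data = demo[keep, ])
print(rbind(with_outlier    = coef(fit),
            without_flagged = coef(fit_clean)))
```

Flagged rows are candidates for inspection, not automatic deletion: whether a point is a data error or a genuine extreme value is a judgment that depends on the dataset.
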

In the next section, we will study logistic regression and why we need it.

