Practice: Multiple Regression
In the previous section, we studied the issue of multicollinearity in R.
- Import Dataset: Webpage_Product_Sales/Webpage_Product_Sales.csv
- Build a model to predict sales using the rest of the variables
- Drop the less impactful variables based on their p-values.
- Is there any multicollinearity?
- How many variables are there in the final model?
- What is the R-squared of the final model?
- Can you improve the model using the same data and variables?
Solution
- Import Dataset: Webpage_Product_Sales/Webpage_Product_Sales.csv
Webpage_Product_Sales <- read.csv("R dataset\\Webpage_Product_Sales\\Webpage_Product_Sales.csv")
- Build a model to predict sales using the rest of the variables
web_sales_model1 <- lm(Sales ~ Web_UI_Score + Server_Down_time_Sec + Holiday + Special_Discount + Clicks_From_Serach_Engine + Online_Ad_Paid_ref_links + Social_Network_Ref_links + Month + Weekday + DayofMonth, data = Webpage_Product_Sales)
summary(web_sales_model1)
##
## Call:
## lm(formula = Sales ~ Web_UI_Score + Server_Down_time_Sec + Holiday +
## Special_Discount + Clicks_From_Serach_Engine + Online_Ad_Paid_ref_links +
## Social_Network_Ref_links + Month + Weekday + DayofMonth,
## data = Webpage_Product_Sales)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14391.9 -2186.2 -191.6 2243.1 15462.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6545.8922 1286.2404 5.089 4.69e-07 ***
## Web_UI_Score -6.2582 11.5453 -0.542 0.58796
## Server_Down_time_Sec -134.0441 14.0087 -9.569 < 2e-16 ***
## Holiday 18768.5954 683.0769 27.477 < 2e-16 ***
## Special_Discount 4718.3978 402.0193 11.737 < 2e-16 ***
## Clicks_From_Serach_Engine -0.1258 0.9443 -0.133 0.89403
## Online_Ad_Paid_ref_links 6.1557 1.0022 6.142 1.40e-09 ***
## Social_Network_Ref_links 6.6841 0.4111 16.261 < 2e-16 ***
## Month 481.0294 41.5079 11.589 < 2e-16 ***
## Weekday 1355.2153 67.2243 20.160 < 2e-16 ***
## DayofMonth 47.0579 15.1982 3.096 0.00204 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3480 on 664 degrees of freedom
## Multiple R-squared: 0.818, Adjusted R-squared: 0.8152
## F-statistic: 298.4 on 10 and 664 DF, p-value: < 2.2e-16
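Rather than reading the p-values off the printed summary, they can also be pulled out of the fitted model programmatically. A minimal sketch (the 0.05 cutoff is our choice of significance level):

```r
# The coefficient table of summary.lm(); column 4 holds Pr(>|t|)
coefs <- summary(web_sales_model1)$coefficients

# Terms with p-value above 0.05 (here: Web_UI_Score and
# Clicks_From_Serach_Engine, matching the summary above)
rownames(coefs)[coefs[, 4] > 0.05]
```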
- Drop the less impactful variables based on their p-values.
From the p-values in the output we can see that Clicks_From_Serach_Engine and Web_UI_Score are insignificant, so we drop these two variables.
web_sales_model2 <- lm(Sales ~ Server_Down_time_Sec + Holiday + Special_Discount + Online_Ad_Paid_ref_links + Social_Network_Ref_links + Month + Weekday + DayofMonth, data = Webpage_Product_Sales)
summary(web_sales_model2)
##
## Call:
## lm(formula = Sales ~ Server_Down_time_Sec + Holiday + Special_Discount +
## Online_Ad_Paid_ref_links + Social_Network_Ref_links + Month +
## Weekday + DayofMonth, data = Webpage_Product_Sales)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14305.2 -2154.8 -185.7 2252.3 15383.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6101.1539 821.2864 7.429 3.37e-13 ***
## Server_Down_time_Sec -134.0717 13.9722 -9.596 < 2e-16 ***
## Holiday 18742.7123 678.5281 27.623 < 2e-16 ***
## Special_Discount 4726.1858 399.4915 11.831 < 2e-16 ***
## Online_Ad_Paid_ref_links 6.0357 0.2901 20.802 < 2e-16 ***
## Social_Network_Ref_links 6.6738 0.4091 16.312 < 2e-16 ***
## Month 479.5231 41.3221 11.605 < 2e-16 ***
## Weekday 1354.4252 67.1219 20.179 < 2e-16 ***
## DayofMonth 46.9564 15.1755 3.094 0.00206 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3476 on 666 degrees of freedom
## Multiple R-squared: 0.8179, Adjusted R-squared: 0.8157
## F-statistic: 373.9 on 8 and 666 DF, p-value: < 2.2e-16
- Is there any multicollinearity?
library(car)
vif(web_sales_model2)
## Server_Down_time_Sec Holiday Special_Discount
## 1.018345 1.366781 1.353936
## Online_Ad_Paid_ref_links Social_Network_Ref_links Month
## 1.018222 1.004572 1.011388
## Weekday DayofMonth
## 1.004399 1.003881
No. All the VIF values above are close to 1, so there is no multicollinearity among the predictors.
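A common rule of thumb treats a VIF above 5 (some use 10) as a sign of problematic multicollinearity; that check can be automated instead of scanning the printed values:

```r
# Any predictor with VIF above the rule-of-thumb threshold of 5?
vif_values <- vif(web_sales_model2)
any(vif_values > 5)  # FALSE: every VIF in this model is close to 1
```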
- How many variables are there in the final model?
Eight
- What is the R-squared of the final model?
0.8179
- Can you improve the model using the same data and variables?
Yes. Keeping the same variables, an interaction term (Holiday × Weekday) improves the fit, as shown below.
Interaction Terms
Adding interaction terms might improve the prediction accuracy of the model. Choosing useful interaction terms requires prior knowledge of the dataset and its variables.
web_sales_model3 <- lm(Sales ~ Server_Down_time_Sec + Holiday + Special_Discount + Online_Ad_Paid_ref_links + Social_Network_Ref_links + Month + Weekday + DayofMonth + Holiday * Weekday, data = Webpage_Product_Sales)
summary(web_sales_model3)
##
## Call:
## lm(formula = Sales ~ Server_Down_time_Sec + Holiday + Special_Discount +
## Online_Ad_Paid_ref_links + Social_Network_Ref_links + Month +
## Weekday + DayofMonth + Holiday * Weekday, data = Webpage_Product_Sales)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7486.3 -2073.0 -270.4 2104.2 9146.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6753.6923 708.7910 9.528 < 2e-16 ***
## Server_Down_time_Sec -140.4922 12.0438 -11.665 < 2e-16 ***
## Holiday 2201.8694 1232.3364 1.787 0.074434 .
## Special_Discount 4749.0044 344.1454 13.799 < 2e-16 ***
## Online_Ad_Paid_ref_links 5.9515 0.2500 23.805 < 2e-16 ***
## Social_Network_Ref_links 7.0657 0.3534 19.994 < 2e-16 ***
## Month 480.3156 35.5970 13.493 < 2e-16 ***
## Weekday 1164.8864 59.1435 19.696 < 2e-16 ***
## DayofMonth 47.0967 13.0729 3.603 0.000339 ***
## Holiday:Weekday 4294.6865 281.6829 15.247 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2994 on 665 degrees of freedom
## Multiple R-squared: 0.865, Adjusted R-squared: 0.8632
## F-statistic: 473.6 on 9 and 665 DF, p-value: < 2.2e-16
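Beyond comparing adjusted R-squared values by eye, the two models are nested (model 3 is model 2 plus the interaction term), so a partial F-test can confirm that the interaction genuinely improves the fit:

```r
# Partial F-test: does adding Holiday:Weekday significantly
# reduce the residual sum of squares relative to model 2?
anova(web_sales_model2, web_sales_model3)
```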
Conclusion
In this chapter, we discussed what simple and multiple regression are, how to build simple and multiple linear regression models, the most important metrics to consider in the output of a regression, what multicollinearity is and how to detect and eliminate it, what R-squared and adjusted R-squared are and how they differ, and how to see the individual impact of each variable.
This is a basic regression class; once you have a good grasp of regression, you can explore it further through advanced topics such as adding polynomial and interaction terms to your regression model, which sometimes work like a charm. Adjusted R-squared is a good measure of training (in-sample) error, but we cannot be sure about the final model's performance based on it alone. We may have to perform cross-validation to get an idea of the testing error.
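As a preview of that idea, a minimal k-fold cross-validation can be sketched in base R without extra packages (the seed, the choice of k = 5, and RMSE as the error metric are all our assumptions):

```r
set.seed(42)  # fold assignment is random, so fix a seed for reproducibility
k <- 5
folds <- sample(rep(1:k, length.out = nrow(Webpage_Product_Sales)))

# For each fold: fit on the other folds, measure RMSE on the held-out fold
cv_rmse <- sapply(1:k, function(i) {
  train <- Webpage_Product_Sales[folds != i, ]
  test  <- Webpage_Product_Sales[folds == i, ]
  fit <- lm(Sales ~ Server_Down_time_Sec + Holiday + Special_Discount +
              Online_Ad_Paid_ref_links + Social_Network_Ref_links +
              Month + Weekday + DayofMonth + Holiday * Weekday,
            data = train)
  sqrt(mean((test$Sales - predict(fit, newdata = test))^2))
})

mean(cv_rmse)  # average out-of-sample RMSE across the folds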
We will talk about cross-validation in more detail in future lectures. Outliers can influence the regression line, so we need to take care of data sanitization before fitting it; these are, after all, mathematical formulas, and if the inputs are wrong, the results will be wrong. Data cleaning is therefore very important before getting into regression.
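One standard way to flag influential observations on a fitted lm object is Cook's distance; a quick sketch (the 4/n cutoff is a common rule of thumb, not a hard rule):

```r
# Cook's distance for each observation in the final model
cd <- cooks.distance(web_sales_model3)

# Row indices of observations with unusually high influence
which(cd > 4 / nrow(Webpage_Product_Sales))
```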
In the next section, we will study logistic regression and why we need it.