In the previous section, we studied R-squared in R.
The regression model we have studied so far is called a simple regression model, because it has only one predictor variable. In the previous example, we predicted the number of passengers using a single variable, the promotional budget. The equation was $Y = \beta_0 + \beta_1 X$, where Y = number of passengers, X = promotional budget, $\beta_0$ is the intercept, and $\beta_1$ is the slope.
In real life this is rarely the scenario. A single variable does not determine the target on its own; several variables together influence it, which is why we need to build multiple regression models. The theory remains the same. The only difference is that instead of one variable X we deal with multiple variables X_1, X_2, and so on. Earlier we were fitting a regression line in a two-dimensional plane. With two predictors, the fitted model is instead an inclined plane in three-dimensional space, and with more predictors it is a hyperplane. The interpretation of R-squared remains the same as in simple regression.
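To make the change from one predictor to several concrete, the model can be written out as below. This is the standard textbook form, not an equation taken from the original post; the coefficient symbols $\beta_0, \dots, \beta_p$, the predictor count $p$, and the error term $\varepsilon$ are notation assumed here for illustration.

```latex
% Simple regression: one predictor
Y = \beta_0 + \beta_1 X + \varepsilon

% Multiple regression: p predictors
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \varepsilon
```

With $p = 2$ the fitted surface is a plane in three dimensions. Each $\beta_j$ is the expected change in $Y$ for a one-unit change in $X_j$ while the other predictors are held fixed.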
Multiple Regression in R
Build a multiple regression model to predict the number of passengers. Here we will predict the number of passengers using several variables: promotional budget, inter-metro flight ratio, and service quality score. We still use the same function, lm, to fit the model. The only change is that instead of one variable we now use three predictor variables from the dataset air.
## 
## Call:
## lm(formula = air$Passengers ~ air$Promotion_Budget + air$Inter_metro_flight_ratio + 
##     air$Service_Quality_Score)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4792.4 -1980.1    15.3  2317.9  4717.5 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   1.921e+04  3.543e+03   5.424 6.68e-07 ***
## air$Promotion_Budget          5.550e-02  3.586e-03  15.476  < 2e-16 ***
## air$Inter_metro_flight_ratio -2.003e+03  2.129e+03  -0.941     0.35    
## air$Service_Quality_Score    -2.802e+03  5.304e+02  -5.283 1.17e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2533 on 76 degrees of freedom
## Multiple R-squared:  0.9514, Adjusted R-squared:  0.9494 
## F-statistic: 495.6 on 3 and 76 DF,  p-value: < 2.2e-16
The R-squared value is around 95%, so adding these two variables gives a slight increase in R-squared. Because the model uses multiple predictor variables for prediction instead of a single independent variable, it is called a multiple regression model.
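The comparison between a one-predictor and a three-predictor model can be sketched as follows. The air dataset itself is not reproduced here, so this example uses a synthetic stand-in: the data frame air_sim, its value ranges, and the coefficients used to generate it are all made up for illustration, and the numbers it prints will not match the output above. Only the column names and the lm call mirror the post.

```r
# Synthetic stand-in for the air passengers data (NOT the real dataset).
set.seed(42)
n <- 80
air_sim <- data.frame(
  Promotion_Budget         = runif(n, 300000, 800000),
  Inter_metro_flight_ratio = runif(n, 0.3, 0.9),
  Service_Quality_Score    = runif(n, 1, 5)
)

# Assumed data-generating process: passengers depend on budget and service
# score; the flight ratio has no effect by construction.
air_sim$Passengers <- 20000 +
  0.05 * air_sim$Promotion_Budget -
  2500 * air_sim$Service_Quality_Score +
  rnorm(n, sd = 2500)

# Simple regression: one predictor.
simple_model <- lm(Passengers ~ Promotion_Budget, data = air_sim)

# Multiple regression: same function, three predictors.
multi_model <- lm(
  Passengers ~ Promotion_Budget + Inter_metro_flight_ratio + Service_Quality_Score,
  data = air_sim
)

# Extract R-squared from each model summary and compare.
r2_simple <- summary(simple_model)$r.squared
r2_multi  <- summary(multi_model)$r.squared
cat(sprintf("Simple model R-squared:   %.4f\n", r2_simple))
cat(sprintf("Multiple model R-squared: %.4f\n", r2_multi))
```

The sketch passes the data frame through the data argument rather than writing air$... for every term as the post's call does. Both forms fit the same model; the data argument simply keeps the coefficient names shorter.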
Individual Impact of variables
There is no rule that adding more predictor variables always produces a better model. In a given dataset we often do not know in advance which variables affect the target and which do not. Suppose we build a regression model for Y using 10 predictor variables: how do we know which of those 10 have the most impact, and which can be dropped because their effect on Y is minimal or zero? If a variable does not affect the target, there is no real value in keeping it in the model.
How do we see the individual impact of a variable? It is determined by the p-value of the t-test on that variable's coefficient.
P-value
If the p-value is less than 5%, the variable has a significant impact and should be kept in the model. If the p-value is more than 5%, the variable has no statistically significant impact, and we can drop it with almost no change in the R-squared value. For example, if we are predicting the number of laptop sales, there is no point in considering the number of street dogs in the area. The number of street dogs does not help in predicting laptop sales, so we can simply drop that variable.
To summarize: if we drop a low-impact variable, the model is barely affected and R-squared hardly changes. If we drop a high-impact variable, the model is significantly affected and R-squared drops significantly.
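The 5% rule above can be applied programmatically by reading the p-values out of the model summary. The following is a minimal sketch on synthetic data: the variables x_strong, x_noise, and y are hypothetical and are generated so that only x_strong actually drives y. The coefficient matrix returned by summary() and its "Pr(>|t|)" column are standard for lm fits.

```r
# Synthetic data (hypothetical): y depends on x_strong only.
set.seed(1)
n <- 80
x_strong <- rnorm(n)
x_noise  <- rnorm(n)            # unrelated to y by construction
y <- 3 * x_strong + rnorm(n)

model <- lm(y ~ x_strong + x_noise)

# summary()$coefficients is a matrix with one row per term and the columns
# Estimate, Std. Error, t value, Pr(>|t|).
coef_table <- summary(model)$coefficients

# Keep the p-values of the predictors, dropping the intercept row.
p_values <- coef_table[-1, "Pr(>|t|)"]

# Flag each predictor against the 5% threshold.
significant <- p_values < 0.05
print(data.frame(p_value = p_values, significant = significant))
```

Note that a pure-noise variable will still fall below 5% by chance roughly one time in twenty, so the threshold is a guide and not a guarantee.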
The model summary reports a p-value (probability value) for each variable. Let's check the p-values of the variables used in multi_model on the Air Passengers dataset. Running the summary of the multi_model we already created shows the p-value of every variable used.
The p-value for promotional budget is less than 5%, so this variable is highly impactful. The p-value for inter-metro flight ratio is greater than 5%, so we conclude that this variable has no significant impact; keeping or dropping it will hardly change the output. The p-value for service quality score is less than 5%, so this variable is impactful.
Since dropping the inter-metro flight ratio should have almost no impact, we can rebuild the model without it. Let's call the new model multi_model1; it has only two variables, promotional budget and service quality score. In the summary of multi_model1, the R-squared value is almost the same as in the previous model that included inter-metro flight ratio. This shows that removing a variable whose p-value is greater than 5% does not affect the model much. In this way we can measure the impact of individual variables and decide, based on their p-values, whether they should be used in building the model.
We already know that promotional budget is a highly impactful variable. What if we drop it? Will this affect the R-squared value? Let's find out by dropping promotional budget this time. The R-squared drops from 95% to 79%, which confirms that the variable is highly impactful. Dropping it substantially reduces the model's overall predictive power.
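The two drop-a-variable experiments described above can be sketched as follows. The real air dataset is not available here, so the data frame air_sim is a synthetic stand-in with made-up ranges and coefficients, and its R-squared values will differ from the 95% and 79% quoted in the post. The names multi_model and multi_model1 follow the post; multi_model2 is a hypothetical name for the model without promotional budget.

```r
# Synthetic stand-in for the air passengers data (NOT the real dataset).
set.seed(42)
n <- 80
air_sim <- data.frame(
  Promotion_Budget         = runif(n, 300000, 800000),
  Inter_metro_flight_ratio = runif(n, 0.3, 0.9),
  Service_Quality_Score    = runif(n, 1, 5)
)
# Assumed process: the flight ratio has no effect on passengers.
air_sim$Passengers <- 20000 +
  0.05 * air_sim$Promotion_Budget -
  2500 * air_sim$Service_Quality_Score +
  rnorm(n, sd = 2500)

# Full model with all three predictors.
multi_model <- lm(
  Passengers ~ Promotion_Budget + Inter_metro_flight_ratio + Service_Quality_Score,
  data = air_sim
)

# Drop the low-impact variable (inter-metro flight ratio).
multi_model1 <- lm(Passengers ~ Promotion_Budget + Service_Quality_Score,
                   data = air_sim)

# Drop the high-impact variable (promotional budget) instead.
multi_model2 <- lm(Passengers ~ Inter_metro_flight_ratio + Service_Quality_Score,
                   data = air_sim)

r2_full      <- summary(multi_model)$r.squared
r2_no_ratio  <- summary(multi_model1)$r.squared
r2_no_budget <- summary(multi_model2)$r.squared

cat(sprintf("All three predictors:     %.4f\n", r2_full))
cat(sprintf("Without flight ratio:     %.4f\n", r2_no_ratio))
cat(sprintf("Without promotion budget: %.4f\n", r2_no_budget))
```

On this synthetic data the first two values should be nearly identical, while the third should be far lower, which is the same pattern the post reports for the real data.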
So, if we have 20 predictor variables and want to know which of them affect the target and which do not, we can decide by looking at their p-values.
The next post is about Adjusted R-squared in R.