203.1.9 Adjusted R-squared in R

Adjusted R-Squared

In the previous section, we studied Multiple Regression in R.

In the regression output summary, you may have already noticed the Adjusted R-squared field. It is good to have many variables in the data, but it is not advisable to keep adding variables to the model in the hope that the R-squared value will keep increasing. In fact, R-squared never decreases when variables are added: it either increases slightly or stays the same. For example, if a model built with 10 variables has an R-squared of 80% and 10 more variables are added, R-squared will not fall below 80%; it will either stay at 80% or increase slightly. Even if some of those extra 10 variables are junk, R-squared still won't go below 80%.
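The behaviour described above can be checked with a small simulation. This is only a sketch on made-up data (the names `dat`, `junk` and `r2` are illustrative and not part of the post's dataset): the response depends on `x1` alone, and ten unrelated random columns are added one at a time.

```r
# Simulated data (hypothetical, not the post's dataset): y depends only on x1.
set.seed(1)
n <- 30
x1 <- rnorm(n)
y <- 2 * x1 + rnorm(n)

# Ten junk predictors that are unrelated to y by construction.
junk <- as.data.frame(matrix(rnorm(n * 10), nrow = n))
names(junk) <- paste0("junk", 1:10)
dat <- cbind(data.frame(y = y, x1 = x1), junk)

# Fit nested models: x1 alone, then x1 plus the first j junk columns.
r2 <- sapply(0:10, function(j) {
  predictors <- c("x1", names(junk)[seq_len(j)])
  fit <- lm(reformulate(predictors, response = "y"), data = dat)
  summary(fit)$r.squared
})

# Print the sequence and check that it never decreases
# (a tiny tolerance guards against floating-point noise).
print(round(r2, 4))
print(all(diff(r2) >= -1e-12))
```

Because each model contains all the predictors of the one before it, the R-squared sequence can only stay flat or rise, even though the added columns are pure noise.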

This is a problem with R-squared. If we keep adding junk variables, at some point R-squared can reach 100%, which is misleading because the model has not actually become better. The original R-squared formula therefore needs an adjustment. Adjusted R-squared is derived from R-squared and takes into account whether a newly added variable has a meaningful impact. If the new variable's impact is not significant, Adjusted R-squared will not grow, and if the new variable is junk, Adjusted R-squared may decrease. In other words, Adjusted R-squared imposes a penalty for each added predictor and increases only when the new predictor improves the fit enough to outweigh that penalty. If we keep adding many variables that have no impact, the value of Adjusted R-squared may go down.

$$\bar{R}^2 = R^2 - \frac{k-1}{n-k}\left(1 - R^2\right)$$

where n is the number of observations and k is the number of parameters (the estimated coefficients, counting the intercept).
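As a rough check of the formula, the sketch below applies it to the values printed in the m1 summary later in this post (R-squared 0.419 on 12 observations). The function names `adj_r2` and `adj_r2_alt` are illustrative, and the code assumes that k counts the intercept, which is the reading that reproduces the reported value.

```r
# Adjusted R-squared from R-squared, n observations and k parameters.
# Assumption: k counts every estimated coefficient, including the intercept.
adj_r2 <- function(r2, n, k) {
  r2 - (k - 1) / (n - k) * (1 - r2)
}

# Algebraically equivalent form often seen in textbooks.
adj_r2_alt <- function(r2, n, k) {
  1 - (1 - r2) * (n - 1) / (n - k)
}

# m1 has an intercept plus x1 and x2, so k = 3.
adj_r2(0.419, n = 12, k = 3)      # about 0.2899
adj_r2_alt(0.419, n = 12, k = 3)  # same value
```

Both forms give the Adjusted R-squared that `summary(m1)` reports, up to rounding.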

R-Squared vs Adjusted R-Squared

To understand the concept of Adjusted R-squared, we will use an example. Build a model to predict Y using x1 and x2, and note the R-squared and Adjusted R-squared values. Then build a second model using x1 through x6 and a third using x1 through x8, noting both values each time. First load the dataset into R, then build the three models, named m1, m2 and m3.

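# Load the sample data
adj_sample <- read.csv("R dataset\\Adjusted RSquare\\Adj_Sample.csv")

# Attach the data frame so its columns can be used by name in the formulas
attach(adj_sample)
m1 <- lm(Y ~ x1 + x2)
m2 <- lm(Y ~ x1 + x2 + x3 + x4 + x5 + x6)
m3 <- lm(Y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8)

# Print the regression summaries, including R-squared and Adjusted R-squared
summary(m1)
summary(m2)
summary(m3)

detach(adj_sample)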
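Besides reading the two values off the printed summaries, they can be pulled out of the summary objects directly. The sketch below is self-contained, so it uses R's built-in `mtcars` data as a stand-in (an assumption, not the post's Adj_Sample.csv), and the model names `small`, `medium` and `large` are illustrative.

```r
# Stand-in data: the built-in mtcars dataset (an assumption, not Adj_Sample.csv).
fits <- list(
  small  = lm(mpg ~ wt + hp, data = mtcars),
  medium = lm(mpg ~ wt + hp + qsec + drat, data = mtcars),
  large  = lm(mpg ~ wt + hp + qsec + drat + gear + carb, data = mtcars)
)

# Pull both statistics out of each summary object.
fit_stats <- t(sapply(fits, function(fit) {
  s <- summary(fit)
  c(r_squared = s$r.squared, adj_r_squared = s$adj.r.squared)
}))

# One row per model, one column per statistic.
print(round(fit_stats, 4))
```

The same pattern should work for m1, m2 and m3 by replacing the list contents with those models.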

Output m1

## 
## Call:
## lm(formula = Y ~ x1 + x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0050 -0.2381  0.1893  0.4254  1.2321 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.015677   1.169583  -0.868    0.408
## x1           0.279345   0.282874   0.988    0.349
## x2           0.001882   0.001328   1.417    0.190
## 
## Residual standard error: 0.9043 on 9 degrees of freedom
## Multiple R-squared:  0.419,  Adjusted R-squared:  0.2899 
## F-statistic: 3.245 on 2 and 9 DF,  p-value: 0.08685

In the summary of m1, R-squared is 41.9% and Adjusted R-squared is 29.0%. Looking at the p-values of the predictors, neither of the two variables is significant at the usual 5% level: x1 (p = 0.349) has little impact and x2 (p = 0.190) is only slightly impactful.

Output m2

## 
## Call:
## lm(formula = Y ~ x1 + x2 + x3 + x4 + x5 + x6)
## 
## Residuals:
##        1        2        3        4        5        6        7        8 
##  0.25902  0.06800  0.45286  0.62004 -1.13449 -0.53961 -0.41898  0.52544 
##        9       10       11       12 
## -0.36028 -0.04814  0.83404 -0.25789 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -5.375099   4.686803  -1.147   0.3033  
## x1          -0.669681   0.536981  -1.247   0.2676  
## x2           0.002969   0.001518   1.956   0.1079  
## x3           0.506261   0.248695   2.036   0.0974 .
## x4           0.037611   0.083834   0.449   0.6725  
## x5           0.043624   0.168830   0.258   0.8064  
## x6           0.051554   0.087708   0.588   0.5822  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8468 on 5 degrees of freedom
## Multiple R-squared:  0.7169, Adjusted R-squared:  0.3773 
## F-statistic: 2.111 on 6 and 5 DF,  p-value: 0.2149

In the summary of m2, R-squared jumped from 41.9% to 71.7%, whereas Adjusted R-squared rose only from 29.0% to 37.7%, so the gap between the two widened from about 13 to about 34 percentage points. This is because m2 contains many non-impactful variables: x4, x5 and x6 all have large p-values, and only x3 (p = 0.097) shows a modest effect. The penalty for these extra variables holds Adjusted R-squared back.

Output m3

## 
## Call:
## lm(formula = Y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8)
## 
## Residuals:
##       1       2       3       4       5       6       7       8       9 
##  0.4989  0.4490 -0.1764  0.3267 -0.8213 -0.6679 -0.2299  0.2323 -0.2973 
##      10      11      12 
##  0.3333  0.6184 -0.2658 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.0439629 19.9031715   0.856    0.455
## x1          -0.0955943  0.7614799  -0.126    0.908
## x2           0.0007376  0.0025362   0.291    0.790
## x3           0.5157015  0.3062833   1.684    0.191
## x4           0.0578632  0.1033356   0.560    0.615
## x5           0.0858136  0.1914803   0.448    0.684
## x6          -0.1746565  0.2197152  -0.795    0.485
## x7          -0.0323678  0.1530067  -0.212    0.846
## x8          -0.2321183  0.2065655  -1.124    0.343
## 
## Residual standard error: 0.9071 on 3 degrees of freedom
## Multiple R-squared:  0.8051, Adjusted R-squared:  0.2855 
## F-statistic: 1.549 on 8 and 3 DF,  p-value: 0.3927

In the summary of m3, R-squared increased from 71.7% to 80.5%, whereas Adjusted R-squared dropped from 37.7% to 28.6%. This happens when the added predictors are barely related to the target variable: none of the eight variables in m3 has a significant p-value.
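The drop can be reproduced by hand from the Adjusted R-squared formula. The arithmetic below takes n = 12 observations and assumes k counts the intercept (so k = 7 for m2 and k = 9 for m3); small differences from the printed values come from rounding R-squared to four decimals.

$$
\bar{R}^2_{m2} = 0.7169 - \frac{6}{5}\,(1 - 0.7169) \approx 0.377,
\qquad
\bar{R}^2_{m3} = 0.8051 - \frac{8}{3}\,(1 - 0.8051) \approx 0.285
$$

The penalty multiplier grows from 1.2 to about 2.67 when x7 and x8 are added, which outweighs the gain in R-squared.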

R-Squared vs Adjusted R-Squared

If the values of R-squared and Adjusted R-squared are close, it suggests that there are no junk variables in the model: all the predictors we are using affect the target variable in a meaningful way. If the difference between R-squared and Adjusted R-squared is large, we can conclude that some variables in the data are not useful for this particular model.
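To make the comparison concrete, the sketch below tabulates the values printed in the three summaries above and computes the gap for each model. The data frame name `fit_summary` and its column names are illustrative.

```r
# R-squared and Adjusted R-squared as printed in the three summaries above
fit_summary <- data.frame(
  model = c("m1", "m2", "m3"),
  r_squared = c(0.4190, 0.7169, 0.8051),
  adj_r_squared = c(0.2899, 0.3773, 0.2855)
)

# A widening gap suggests that the extra predictors add little.
fit_summary$gap <- fit_summary$r_squared - fit_summary$adj_r_squared
print(fit_summary)
```

The gap grows with every batch of added variables, which is the warning sign described above.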

The next post is a practice session on Multiple Regression Issues.

20th June 2017
