In the previous section, we studied Multiple Regression Issues.
In the example above, we computed the correlation between Sem1_Math and Sem2_Math and found it to be about 0.99 (99%). This interdependency between predictor variables is called multicollinearity. Multicollinearity is a problem because it inflates the variance of the estimated coefficients, so the coefficients we get in its presence are unreliable. Analysing the impact of an individual variable in the presence of multicollinearity leads to wrong conclusions. We therefore have to detect multicollinearity and then remove it, or at least reduce it, before we can trust the coefficients.
Multiple regression is wonderful. It allows you to consider the effect of multiple variables simultaneously. Multiple regression is extremely unpleasant because it allows you to consider the effect of multiple variables simultaneously. The relationships between the explanatory variables are the key to understanding multiple regression.
Multicollinearity (or intercorrelation) exists when at least some of the predictor variables are correlated among themselves. It need not show up as pairwise correlation alone: sometimes several variables together explain almost all of the variation in another predictor variable. To check for this, regress each predictor on all the others:
X1 vs. X2, X3, X4: find R-squared
X2 vs. X1, X3, X4: find R-squared
X3 vs. X1, X2, X4: find R-squared
X4 vs. X1, X2, X3: find R-squared
Build a regression of X1 on X2, X3 and X4 and find its R-squared value; call it R1. If R1 is high, say around 98%, then X2, X3 and X4 together explain 98% of the variation in X1. X1 is then almost entirely determined by X2, X3 and X4, so it is redundant in their presence. This is a clear indication of multicollinearity. If instead R1 is low, say around 20%, then X1 carries independent information that X2, X3 and X4 together do not provide.
Similarly, we can build the intermediate models X2 vs. X1, X3, X4; X3 vs. X1, X2, X4; and X4 vs. X1, X2, X3. With R1, R2, R3 and R4, the respective R-squared values of these four models, we can easily detect multicollinearity.
Instead of inspecting these R-squared values directly, we can use a measure called the variance inflation factor (VIF). The VIF of the k-th predictor is derived from the R-squared value of its intermediate model:
$$\mathrm{VIF}_k = \frac{1}{1 - R_k^2}$$
where $R_k^2$ is the R-squared value of the k-th intermediate model (R1, R2, R3 or R4 above).
The VIF option in SAS calculates the VIF value for each predictor variable automatically, and there is a function in R that does the same. VIF is calculated separately for each variable: a model with 10 independent variables has 10 VIF values. If the R-squared value for a particular variable is more than 80%, we can say that multicollinearity exists. Equivalently, a VIF above 5 (since 1/(1 − 0.8) = 5) means that the variable is largely explained by the remaining variables, and it can be dropped.
Multicollinearity often comes in pairs. Suppose X2 is redundant in the presence of X1, and X1 is redundant in the presence of X2. The VIF of X1 is computed using X2, X3 and X4, so it will be high; the VIF of X2 is computed using X1, X3 and X4, so it will be high as well. This does not mean we should drop both X1 and X2: if we drop both, we may lose their information altogether. Only one of them should be dropped, which is enough to remove the multicollinearity.
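To make the intermediate models concrete, here is a minimal sketch in base R that regresses each predictor on the others and turns each R-squared value into a VIF. The data is simulated purely for illustration (it is not the exam data), and the helper name `aux_r_squared` is made up for this example.

```r
# Simulated predictors (illustrative only, not the exam data).
set.seed(1)
n  <- 500
X2 <- rnorm(n)
X3 <- rnorm(n)
X4 <- rnorm(n)
# X1 is almost a linear combination of X2, X3 and X4.
X1 <- X2 + X3 + X4 + rnorm(n, sd = 0.2)
predictors <- data.frame(X1, X2, X3, X4)

# Hypothetical helper: regress one predictor on all the others
# and return the R-squared of that intermediate model.
aux_r_squared <- function(data, target) {
  others    <- setdiff(names(data), target)
  aux_model <- lm(reformulate(others, response = target), data = data)
  summary(aux_model)$r.squared
}

# One R-squared and one VIF per predictor.
r_squared  <- sapply(names(predictors), function(v) aux_r_squared(predictors, v))
vif_values <- 1 / (1 - r_squared)

# Pairwise correlations, for comparison with the joint R-squared values.
print(round(cor(predictors), 2))
print(round(rbind(R_squared = r_squared, VIF = vif_values), 2))
```

In this simulated setup no single pair of predictors is very strongly correlated, yet every VIF comes out well above 5, because the dependency only appears when the other three variables are used together.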
Let’s see how to find the VIF values on the final exam data. To calculate VIF values we need the car package (Companion to Applied Regression). First install the car package, then build a model named exam and compute its VIF values with the vif() function.
## Sem1_Science Sem2_Science    Sem1_Math    Sem2_Math 
##     7.399845     5.396257    68.788480    68.013104 
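The code behind this output is not shown here. A call of roughly the following shape would produce output in this format. This is only a sketch: the data frame `exam_data`, the response column `Final_exam_marks` and all the simulated values are assumptions made for illustration, so the numbers it prints will not match the output above. `vif()` is the function provided by the car package.

```r
# install.packages("car")   # run once if the package is not installed

# Simulated stand-in for the final exam data (hypothetical values;
# only the predictor names are taken from the output above).
set.seed(42)
n <- 200
Sem1_Science <- rnorm(n, mean = 55, sd = 12)
Sem2_Science <- 0.8 * Sem1_Science + rnorm(n, mean = 10, sd = 5)
Sem1_Math    <- rnorm(n, mean = 60, sd = 10)
Sem2_Math    <- Sem1_Math + rnorm(n, sd = 1.5)   # nearly a copy of Sem1_Math
Final_exam_marks <- 0.3 * Sem2_Science + 0.5 * Sem2_Math + rnorm(n, sd = 5)

exam_data <- data.frame(Final_exam_marks, Sem1_Science, Sem2_Science,
                        Sem1_Math, Sem2_Math)

# Fit the multiple regression model on all four predictors.
exam <- lm(Final_exam_marks ~ Sem1_Science + Sem2_Science + Sem1_Math + Sem2_Math,
           data = exam_data)

# One VIF value per predictor, computed by car::vif().
if (requireNamespace("car", quietly = TRUE)) {
  print(car::vif(exam))
} else {
  message("Package 'car' is not installed; skipping vif().")
}
```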
All the VIF values are higher than 5, but those of Sem1_Math and Sem2_Math are extremely high, and we already know that these two variables are very highly correlated. Even though both have high VIF values, we cannot drop both, because Sem1_Math is redundant only in the presence of Sem2_Math, and Sem2_Math is redundant only in the presence of Sem1_Math. We drop the one with the higher VIF, which is Sem1_Math. Once Sem1_Math is dropped, the VIF of Sem2_Math corrects itself and the multicollinearity between the two math scores is gone.
## Sem1_Science Sem2_Science    Sem2_Math 
##     7.219277     5.383840     4.813302 
After dropping Sem1_Math, the VIF of Sem2_Math is 4.81, which is less than 5. Looking at the VIF values again, we can see that there is still multicollinearity between Sem1_Science and Sem2_Science. We go one step further and drop Sem1_Science, which has the higher VIF of the two. This reduces the VIF values further: both remaining variables, Sem2_Science and Sem2_Math, now have VIF values below 5, so we don’t need to drop any other variables.
As an exercise, identify and eliminate the multicollinearity in the air passengers model in the same way. Variables should be dropped one at a time, with the VIF values recomputed after each drop; we can’t just drop all the high-VIF variables at once.
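The one-at-a-time procedure can be sketched as a small loop in base R. The data below is simulated for illustration only (it is not the air passengers data), the helper names `vif_all` and `drop_by_vif` are made up for this example, and the threshold of 5 is the rule of thumb used in this post.

```r
# Simulated predictors (illustrative only; not the air passengers data).
set.seed(7)
n  <- 300
X1 <- rnorm(n)
X2 <- X1 + rnorm(n, sd = 0.1)   # near-duplicate of X1
X3 <- rnorm(n)
X4 <- rnorm(n)
predictors <- data.frame(X1, X2, X3, X4)

# Hypothetical helper: VIF of every column, i.e. 1 / (1 - R^2) of that
# column regressed on all the remaining columns.
vif_all <- function(data) {
  sapply(names(data), function(v) {
    fit <- lm(reformulate(setdiff(names(data), v), response = v), data = data)
    1 / (1 - summary(fit)$r.squared)
  })
}

# Hypothetical helper: drop the single highest-VIF variable, recompute,
# and repeat until every VIF is at or below the threshold.
drop_by_vif <- function(data, threshold = 5) {
  while (ncol(data) > 1) {
    v <- vif_all(data)
    if (max(v) <= threshold) break
    worst <- names(v)[which.max(v)]
    message("Dropping ", worst, " (VIF = ", round(max(v), 2), ")")
    data <- data[, names(data) != worst, drop = FALSE]
  }
  data
}

kept <- drop_by_vif(predictors)
print(names(kept))
print(round(vif_all(kept), 2))
```

Because the VIF values are recomputed after every drop, only one of the two near-duplicate variables is removed; the other one survives once its partner is gone.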
The next post is about Linear Regression with Multicollinearity in R and Conclusion.