
# 204.1.9 Issue of Multicollinearity in Python

In the previous post of this series we looked at the issues with multiple regression models. In this part we will understand what multicollinearity is and why it is bad for a model.
We will also go through the measures of multicollinearity and how to deal with it.

## Multicollinearity

• Multiple regression is wonderful – it allows you to consider the effect of multiple variables simultaneously.
• Multiple regression is extremely unpleasant – because it allows you to consider the effect of multiple variables simultaneously.
• The relationships between the explanatory variables are the key to understanding multiple regression.
• Multicollinearity (or inter-correlation) exists when at least some of the predictor variables are correlated among themselves.
• The parameter estimates will have inflated variance in the presence of multicollinearity.
• Sometimes the signs of the parameter estimates even flip.
• If the relation between the independent variables grows really strong, the variance of the parameter estimates tends to infinity – can you prove it?
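The variance-inflation claim in the bullets above can be checked numerically. The variance of the OLS estimates is proportional to the diagonal of (XᵀX)⁻¹, and those diagonal entries blow up when two predictors are nearly identical. A minimal sketch with made-up data (not the course's final_exam set):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200
x1 = rng.normal(size=n)

def coef_variance_diag(x2):
    """Diagonal of (X'X)^-1, which is proportional to Var(beta_hat)."""
    X = np.column_stack([np.ones(n), x1, x2])
    return np.diag(np.linalg.inv(X.T @ X))

# Independent second predictor: small, stable variances
indep = coef_variance_diag(rng.normal(size=n))

# Nearly collinear second predictor (x2 = x1 + tiny noise): variances explode
collin = coef_variance_diag(x1 + 0.01 * rng.normal(size=n))

print("independent predictors:", indep)
print("collinear predictors  :", collin)
```

The slope-variance entries in the collinear case come out orders of magnitude larger than in the independent case, even though the data size and noise are identical.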

### Multicollinearity Detection

Y = β0 + β1X1 + β2X2 + β3X3 + β4X4

• Build a model X1 vs X2, X3, X4 and find its R², say R1
• Build a model X2 vs X1, X3, X4 and find its R², say R2
• Build a model X3 vs X1, X2, X4 and find its R², say R3
• Build a model X4 vs X1, X2, X3 and find its R², say R4
• For example, if R3 is 95% then we don't really need X3 in the model, since it can be explained as a linear combination of the other three.
• For each variable we find its individual R².
• VIF = 1/(1 − R²) is called the Variance Inflation Factor.
• The VIF option in SAS automatically calculates VIF values for each of the predictor variables.
| R²  | 40%  | 50%  | 60%  | 70%  | 75%  | 80%  | 90%   |
|-----|------|------|------|------|------|------|-------|
| VIF | 1.67 | 2.00 | 2.50 | 3.33 | 4.00 | 5.00 | 10.00 |
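The table follows directly from VIF = 1/(1 − R²); a two-line check reproduces it:

```python
# VIF = 1 / (1 - R^2): reproduce the R^2 -> VIF table
for r2 in [0.40, 0.50, 0.60, 0.70, 0.75, 0.80, 0.90]:
    print(f"R^2 = {r2:.0%}  ->  VIF = {1 / (1 - r2):.2f}")
```

Note how the mapping is nonlinear: going from R² = 80% to 90% doubles the VIF from 5 to 10.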

### Practice : Multicollinearity

• Identify the multicollinearity in the Final Exam Score model.
• Drop variables one by one to reduce the multicollinearity.
In :
from sklearn.linear_model import LinearRegression

predictors = ["Sem1_Science", "Sem2_Science", "Sem1_Math", "Sem2_Math"]
lr1 = LinearRegression()
lr1.fit(final_exam[predictors], final_exam[["Final_exam_marks"]])
predictions1 = lr1.predict(final_exam[predictors])

In :
import statsmodels.formula.api as sm
model1 = sm.ols(formula='Final_exam_marks ~ Sem1_Science+Sem2_Science+Sem1_Math+Sem2_Math', data=final_exam)
fitted1 = model1.fit()
fitted1.summary()

Out:
Dep. Variable:    Final_exam_marks    R-squared:          0.990
Model:            OLS                 Adj. R-squared:     0.987
Method:           Least Squares       F-statistic:        452.3
Date:             Wed, 27 Jul 2016    Prob (F-statistic): 1.50e-18
Time:             11:52:41            Log-Likelihood:     -38.099
No. Observations: 24                  AIC:                86.20
Df Residuals:     19                  BIC:                92.09
Df Model:         4                   Covariance Type:    nonrobust

                 coef   std err       t   P>|t|   [95.0% Conf. Int.]
Intercept     -1.6226     1.999  -0.812   0.427     -5.806    2.561
Sem1_Science   0.1738     0.063   2.767   0.012      0.042    0.305
Sem2_Science   0.2785     0.052   5.379   0.000      0.170    0.387
Sem1_Math      0.7890     0.197   4.002   0.001      0.376    1.202
Sem2_Math     -0.2063     0.191  -1.078   0.294     -0.607    0.194

Omnibus:        6.343   Durbin-Watson:    1.863
Prob(Omnibus):  0.042   Jarque-Bera (JB): 4.332
Skew:           0.973   Prob(JB):         0.115
Kurtosis:       3.737   Cond. No.:        1200
In :
fitted1.summary2()

Out:
Model:              OLS                 Adj. R-squared:     0.987
Dependent Variable: Final_exam_marks    AIC:                86.198
Date:               2016-07-27 11:53    BIC:                92.0883
No. Observations:   24                  Log-Likelihood:     -38.099
Df Model:           4                   F-statistic:        452.3
Df Residuals:       19                  Prob (F-statistic): 1.5e-18
R-squared:          0.990               Scale:              1.7694

                Coef.   Std.Err.        t    P>|t|    [0.025   0.975]
Intercept     -1.6226     1.9987  -0.8118   0.4269   -5.8060   2.5607
Sem1_Science   0.1738     0.0628   2.7668   0.0123    0.0423   0.3052
Sem2_Science   0.2785     0.0518   5.3795   0.0000    0.1702   0.3869
Sem1_Math      0.7890     0.1971   4.0023   0.0008    0.3764   1.2016
Sem2_Math     -0.2063     0.1914  -1.0782   0.2944   -0.6069   0.1942

Omnibus:        6.343   Durbin-Watson:     1.863
Prob(Omnibus):  0.042   Jarque-Bera (JB):  4.332
Skew:           0.973   Prob(JB):          0.115
Kurtosis:       3.737   Condition No.:     1200
scikit-learn has no function for calculating VIF values. statsmodels does ship one (`variance_inflation_factor` in `statsmodels.stats.outliers_influence`), but writing our own function makes the calculation transparent, so we will calculate the VIF value for each variable ourselves.

#Code for VIF Calculation

#Writing a function to calculate the VIF values
def vif_cal(input_data, dependent_col):
    # Everything except the dependent column is a predictor
    x_vars = input_data.drop([dependent_col], axis=1)
    xvar_names = x_vars.columns
    for i in range(0, xvar_names.shape[0]):
        # Regress each predictor on all of the remaining predictors
        y = x_vars[xvar_names[i]]
        x = x_vars[xvar_names.drop(xvar_names[i])]
        rsq = sm.ols(formula="y~x", data=x_vars).fit().rsquared
        vif = round(1 / (1 - rsq), 2)
        print(xvar_names[i], " VIF = ", vif)

In :
#Calculating VIF values using that function
vif_cal(input_data=final_exam, dependent_col="Final_exam_marks")

Sem1_Science  VIF =  7.4
Sem2_Science  VIF =  5.4
Sem1_Math  VIF =  68.79
Sem2_Math  VIF =  68.01

In :
import statsmodels.formula.api as sm
model2 = sm.ols(formula='Final_exam_marks ~ Sem1_Science+Sem2_Science+Sem2_Math', data=final_exam)
fitted2 = model2.fit()
fitted2.summary()

Out:
Dep. Variable:    Final_exam_marks    R-squared:          0.981
Model:            OLS                 Adj. R-squared:     0.978
Method:           Least Squares       F-statistic:        341.4
Date:             Wed, 27 Jul 2016    Prob (F-statistic): 2.44e-17
Time:             12:03:55            Log-Likelihood:     -45.436
No. Observations: 24                  AIC:                98.87
Df Residuals:     20                  BIC:                103.6
Df Model:         3                   Covariance Type:    nonrobust

                 coef   std err       t   P>|t|   [95.0% Conf. Int.]
Intercept     -2.3986     2.632  -0.911   0.373     -7.889    3.092
Sem1_Science   0.2130     0.082   2.595   0.017      0.042    0.384
Sem2_Science   0.2686     0.068   3.925   0.001      0.126    0.411
Sem2_Math      0.5320     0.067   7.897   0.000      0.391    0.673

Omnibus:        5.869   Durbin-Watson:    2.424
Prob(Omnibus):  0.053   Jarque-Bera (JB): 3.793
Skew:           0.864   Prob(JB):         0.15
Kurtosis:       3.898   Cond. No.:        1030
In :
vif_cal(input_data=final_exam.drop(["Sem1_Math"], axis=1), dependent_col="Final_exam_marks")

Sem1_Science  VIF =  7.22
Sem2_Science  VIF =  5.38
Sem2_Math  VIF =  4.81

In :
vif_cal(input_data=final_exam.drop(["Sem1_Math","Sem1_Science"], axis=1), dependent_col="Final_exam_marks")

Sem2_Science  VIF =  3.4
Sem2_Math  VIF =  3.4

24th January 2018

