Statinfer

204.1.9 Issue of Multicollinearity in Python

In the previous post of this series we looked at the issues with multiple regression models. In this part we will understand what multicollinearity is and why it is bad for the model.
We will also go through the measures of multicollinearity and how to deal with it.

Multicollinearity

  • Multiple regression is wonderful – it allows you to consider the effect of multiple variables simultaneously.
  • Multiple regression is extremely unpleasant – because it allows you to consider the effect of multiple variables simultaneously.
  • The relationships between the explanatory variables are the key to understanding multiple regression.
  • Multicollinearity (or intercorrelation) exists when at least some of the predictor variables are correlated among themselves.
  • The parameter estimates will have inflated variance in the presence of multicollinearity.
  • Sometimes the signs of the parameter estimates tend to change.
  • If the relation between the independent variables grows really strong, then the variance of the parameter estimates tends to infinity – can you prove it?
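The last bullet can be checked numerically. From the closed form Var(β̂) = σ²(XᵀX)⁻¹, for two standardized predictors with correlation ρ the variance multiplier of each slope works out to 1/(1 − ρ²), which blows up as ρ → 1. A minimal sketch (not part of the Final Exam example, just an illustration):

```python
import numpy as np

# Variance of OLS estimates: Var(beta_hat) = sigma^2 * (X'X)^{-1}.
# As the correlation rho between two standardized predictors grows,
# the diagonal of (X'X)^{-1} (the variance multiplier) explodes.
for rho in [0.0, 0.9, 0.99, 0.999]:
    # X'X for two standardized predictors with correlation rho
    xtx = np.array([[1.0, rho], [rho, 1.0]])
    var_beta = np.linalg.inv(xtx)[0, 0]  # variance multiplier of beta_1
    print(f"rho = {rho}:  variance multiplier = {var_beta:.2f}")
```

Note that the multiplier at ρ = 0.9 is already more than 5× the uncorrelated case, and it keeps growing without bound.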

Multicollinearity Detection

Y = β0 + β1X1 + β2X2 + β3X3 + β4X4
  • Build a model X1 vs X2 X3 X4, find R^2, say R1
  • Build a model X2 vs X1 X3 X4, find R^2, say R2
  • Build a model X3 vs X1 X2 X4, find R^2, say R3
  • Build a model X4 vs X1 X2 X3, find R^2, say R4
  • For example, if R3 is 95% then we don’t really need X3 in the model, since it can be explained as a linear combination of the other three.
  • For each variable we find its individual R^2.
  • 1/(1 − R^2) is called the VIF (Variance Inflation Factor).
  • The VIF option in SAS automatically calculates VIF values for each of the predictor variables.
  R^2:  40%    50%    60%    70%    75%    80%    90%
  VIF:  1.67   2.00   2.50   3.33   4.00   5.00   10.00
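The VIF column above is just 1/(1 − R^2) applied to each R^2 value; recomputing the table:

```python
# Reproduce the R^2 -> VIF table: VIF = 1 / (1 - R^2)
for r2 in [0.40, 0.50, 0.60, 0.70, 0.75, 0.80, 0.90]:
    vif = 1 / (1 - r2)
    print(f"R^2 = {r2:.0%}  ->  VIF = {vif:.2f}")
```

A common rule of thumb is to worry when VIF exceeds 5 (R^2 of 80%) and to act when it exceeds 10 (R^2 of 90%).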

Practice: Multicollinearity

  • Identify the Multicollinearity in the Final Exam Score model.
  • Drop the variables one by one to reduce the multicollinearity.
In [42]:
from sklearn.linear_model import LinearRegression
lr1 = LinearRegression()
lr1.fit(final_exam[["Sem1_Science", "Sem2_Science", "Sem1_Math", "Sem2_Math"]], final_exam[["Final_exam_marks"]])
predictions1 = lr1.predict(final_exam[["Sem1_Science", "Sem2_Science", "Sem1_Math", "Sem2_Math"]])
In [44]:
import statsmodels.formula.api as sm
model1 = sm.ols(formula='Final_exam_marks ~ Sem1_Science+Sem2_Science+Sem1_Math+Sem2_Math', data=final_exam)
fitted1 = model1.fit()
fitted1.summary()
Out[44]:
OLS Regression Results
Dep. Variable: Final_exam_marks R-squared: 0.990
Model: OLS Adj. R-squared: 0.987
Method: Least Squares F-statistic: 452.3
Date: Wed, 27 Jul 2016 Prob (F-statistic): 1.50e-18
Time: 11:52:41 Log-Likelihood: -38.099
No. Observations: 24 AIC: 86.20
Df Residuals: 19 BIC: 92.09
Df Model: 4
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept -1.6226 1.999 -0.812 0.427 -5.806 2.561
Sem1_Science 0.1738 0.063 2.767 0.012 0.042 0.305
Sem2_Science 0.2785 0.052 5.379 0.000 0.170 0.387
Sem1_Math 0.7890 0.197 4.002 0.001 0.376 1.202
Sem2_Math -0.2063 0.191 -1.078 0.294 -0.607 0.194
Omnibus: 6.343 Durbin-Watson: 1.863
Prob(Omnibus): 0.042 Jarque-Bera (JB): 4.332
Skew: 0.973 Prob(JB): 0.115
Kurtosis: 3.737 Cond. No. 1.20e+03
In [45]:
fitted1.summary2()
Out[45]:
Model: OLS Adj. R-squared: 0.987
Dependent Variable: Final_exam_marks AIC: 86.1980
Date: 2016-07-27 11:53 BIC: 92.0883
No. Observations: 24 Log-Likelihood: -38.099
Df Model: 4 F-statistic: 452.3
Df Residuals: 19 Prob (F-statistic): 1.50e-18
R-squared: 0.990 Scale: 1.7694
Coef. Std.Err. t P>|t| [0.025 0.975]
Intercept -1.6226 1.9987 -0.8118 0.4269 -5.8060 2.5607
Sem1_Science 0.1738 0.0628 2.7668 0.0123 0.0423 0.3052
Sem2_Science 0.2785 0.0518 5.3795 0.0000 0.1702 0.3869
Sem1_Math 0.7890 0.1971 4.0023 0.0008 0.3764 1.2016
Sem2_Math -0.2063 0.1914 -1.0782 0.2944 -0.6069 0.1942
Omnibus: 6.343 Durbin-Watson: 1.863
Prob(Omnibus): 0.042 Jarque-Bera (JB): 4.332
Skew: 0.973 Prob(JB): 0.115
Kurtosis: 3.737 Condition No.: 1200
In [48]:
The model summary above does not report VIF values directly.
We can write our own small function to calculate the VIF for each variable, which also makes the calculation transparent.
(statsmodels does provide a built-in helper, `variance_inflation_factor` in `statsmodels.stats.outliers_influence`.)

#Code for VIF Calculation

#Writing a function to calculate the VIF values

def vif_cal(input_data, dependent_col):
    # Regress each predictor on all the other predictors;
    # VIF = 1 / (1 - R^2) of that regression.
    x_vars = input_data.drop([dependent_col], axis=1)
    xvar_names = x_vars.columns
    for i in range(0, xvar_names.shape[0]):
        y = x_vars[xvar_names[i]]
        x = x_vars[xvar_names.drop(xvar_names[i])]
        rsq = sm.ols(formula="y~x", data=x_vars).fit().rsquared
        vif = round(1/(1 - rsq), 2)
        print(xvar_names[i], " VIF = ", vif)
In [49]:
#Calculating VIF values using that function
vif_cal(input_data=final_exam, dependent_col="Final_exam_marks")
Sem1_Science  VIF =  7.4
Sem2_Science  VIF =  5.4
Sem1_Math  VIF =  68.79
Sem2_Math  VIF =  68.01
In [51]:
import statsmodels.formula.api as sm
model2 = sm.ols(formula='Final_exam_marks ~ Sem1_Science+Sem2_Science+Sem2_Math', data=final_exam)
fitted2 = model2.fit()
fitted2.summary()
Out[51]:
OLS Regression Results
Dep. Variable: Final_exam_marks R-squared: 0.981
Model: OLS Adj. R-squared: 0.978
Method: Least Squares F-statistic: 341.4
Date: Wed, 27 Jul 2016 Prob (F-statistic): 2.44e-17
Time: 12:03:55 Log-Likelihood: -45.436
No. Observations: 24 AIC: 98.87
Df Residuals: 20 BIC: 103.6
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept -2.3986 2.632 -0.911 0.373 -7.889 3.092
Sem1_Science 0.2130 0.082 2.595 0.017 0.042 0.384
Sem2_Science 0.2686 0.068 3.925 0.001 0.126 0.411
Sem2_Math 0.5320 0.067 7.897 0.000 0.391 0.673
Omnibus: 5.869 Durbin-Watson: 2.424
Prob(Omnibus): 0.053 Jarque-Bera (JB): 3.793
Skew: 0.864 Prob(JB): 0.150
Kurtosis: 3.898 Cond. No. 1.03e+03
In [52]:
vif_cal(input_data=final_exam.drop(["Sem1_Math"], axis=1), dependent_col="Final_exam_marks")
Sem1_Science  VIF =  7.22
Sem2_Science  VIF =  5.38
Sem2_Math  VIF =  4.81
In [53]:
vif_cal(input_data=final_exam.drop(["Sem1_Math","Sem1_Science"], axis=1), dependent_col="Final_exam_marks")
Sem2_Science  VIF =  3.4
Sem2_Math  VIF =  3.4
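The "drop one variable at a time" recipe used above can be automated: repeatedly remove the predictor with the highest VIF until all remaining VIFs fall below a chosen threshold (5 and 10 are common cut-offs). The sketch below reads the VIFs off the diagonal of the inverse correlation matrix, which is equivalent to the regression definition for standardized predictors; since `final_exam` is not reproduced here, it runs on a synthetic frame with illustrative columns `x1`–`x3`:

```python
import numpy as np
import pandas as pd

def drop_high_vif(df, threshold=5.0):
    """Iteratively drop the predictor with the highest VIF until all
    VIFs fall below `threshold`. For standardized predictors the VIFs
    are the diagonal of the inverse correlation matrix."""
    cols = list(df.columns)
    while len(cols) > 1:
        vifs = np.diag(np.linalg.inv(df[cols].corr().values))
        worst = int(np.argmax(vifs))
        if vifs[worst] < threshold:
            break
        cols.pop(worst)
    return cols

# Synthetic example: x2 ~ x1, so one of that pair should be dropped
# while the independent x3 survives.
rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
df = pd.DataFrame({"x1": x1,
                   "x2": x1 + rng.normal(scale=0.05, size=200),
                   "x3": rng.normal(size=200)})
print(drop_high_vif(df))
```

This mirrors what we did manually: Sem1_Math and Sem2_Math had the highest VIFs, so dropping one of them brought every remaining VIF into an acceptable range.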
