In the previous post of this series we looked into the issues with multiple regression models. In this part we will understand what multicollinearity is and why it is bad for the model. We will also go through the measures of multicollinearity and how to deal with it.
Multicollinearity
- Multiple regression is wonderful – it allows you to consider the effect of multiple variables simultaneously.
- Multiple regression is extremely unpleasant – because it allows you to consider the effect of multiple variables simultaneously.
- The relationships between the explanatory variables are the key to understanding multiple regression.
- Multicollinearity (or intercorrelation) exists when at least some of the predictor variables are correlated among themselves.
- In the presence of multicollinearity, the parameter estimates have inflated variance.
- Sometimes the signs of the parameter estimates also flip.
- If the relationship between the independent variables grows really strong, the variance of the parameter estimates tends to infinity – can you prove it? The small simulation below illustrates both effects.
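Here is a minimal sketch of the variance-inflation and sign-flip effects, using simulated data (not the exam dataset used later in this post): two predictors that are near copies of each other are fed into repeated regressions, and the individual coefficients swing wildly even though the data-generating process never changes.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
coefs = []
for _ in range(200):
    # x2 is almost an exact copy of x1, so the two predictors are highly collinear
    x1 = rng.normal(size=100)
    x2 = x1 + rng.normal(scale=0.01, size=100)
    y = 2 * x1 + 3 * x2 + rng.normal(size=100)
    coefs.append(LinearRegression().fit(np.column_stack([x1, x2]), y).coef_)

coefs = np.array(coefs)
# Huge spread in the individual estimates and frequent sign flips,
# even though the sum of the two coefficients stays close to the true total of 5
print("std of beta1, beta2:", coefs.std(axis=0))
print("fraction of samples with a flipped sign:", (coefs < 0).any(axis=1).mean())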
Multicollinearity Detection
Y=β0+β1X1+β2X2+β3X3+β4X4
- Build a model X1 ~ X2 + X3 + X4 and find its R², say R1
- Build a model X2 ~ X1 + X3 + X4 and find its R², say R2
- Build a model X3 ~ X1 + X2 + X4 and find its R², say R3
- Build a model X4 ~ X1 + X2 + X3 and find its R², say R4
- For example, if R3 is 95%, then we don't really need X3 in the model, since it can be explained as a linear combination of the other three.
- For each variable we find its individual R².
- 1/(1−R²) is called the VIF (Variance Inflation Factor).
- The VIF option in SAS automatically calculates VIF values for each of the predictor variables.
| R² | 40% | 50% | 60% | 70% | 75% | 80% | 90% |
|---|---|---|---|---|---|---|---|
| VIF | 1.67 | 2.00 | 2.50 | 3.33 | 4.00 | 5.00 | 10.00 |
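The table values follow directly from the VIF formula; a two-line check:

# Quick check of the VIF = 1 / (1 - R^2) values shown in the table above
for r2 in [0.40, 0.50, 0.60, 0.70, 0.75, 0.80, 0.90]:
    print(f"R^2 = {r2:.0%}  ->  VIF = {1 / (1 - r2):.2f}")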
Practice: Multicollinearity
- Identify the multicollinearity in the Final Exam Score model.
- Drop the variables one by one to reduce the multicollinearity.
In [42]:
from sklearn.linear_model import LinearRegression
lr1 = LinearRegression()
lr1.fit(final_exam[["Sem1_Science", "Sem2_Science", "Sem1_Math", "Sem2_Math"]], final_exam[["Final_exam_marks"]])
predictions1 = lr1.predict(final_exam[["Sem1_Science", "Sem2_Science", "Sem1_Math", "Sem2_Math"]])
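As a quick sanity check on the four-predictor fit (not part of the original cells), the R² and coefficients are available directly from the fitted scikit-learn estimator:

# R^2 and coefficients of the four-predictor model
print(lr1.score(final_exam[["Sem1_Science", "Sem2_Science", "Sem1_Math", "Sem2_Math"]],
                final_exam[["Final_exam_marks"]]))
print(lr1.coef_)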
In [44]:
import statsmodels.formula.api as sm
model1 = sm.ols(formula='Final_exam_marks ~ Sem1_Science+Sem2_Science+Sem1_Math+Sem2_Math', data=final_exam)
fitted1 = model1.fit()
fitted1.summary()
Out[44]:
In [45]:
fitted1.summary2()
Out[45]:
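The individual pieces of the summary can also be pulled out programmatically from the fitted results object, which is handy when comparing models later:

# R^2, adjusted R^2, coefficients and p-values of the full model
print(fitted1.rsquared, fitted1.rsquared_adj)
print(fitted1.params)
print(fitted1.pvalues)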
scikit-learn has no function for calculating VIF values. statsmodels does provide one (variance_inflation_factor, used as a cross-check further below), but to see exactly how the calculation works we will write our own function that computes the VIF for each predictor variable.
In [48]:
#Code for VIF Calculation
#Writing a function to calculate the VIF values
def vif_cal(input_data, dependent_col):
    # Regress each predictor on all of the other predictors,
    # then convert that R-squared into a VIF value
    x_vars = input_data.drop([dependent_col], axis=1)
    xvar_names = x_vars.columns
    for i in range(0, xvar_names.shape[0]):
        y = x_vars[xvar_names[i]]
        x = x_vars[xvar_names.drop(xvar_names[i])]
        # patsy resolves the local y and x from the calling scope
        rsq = sm.ols(formula="y~x", data=x_vars).fit().rsquared
        vif = round(1 / (1 - rsq), 2)
        print(xvar_names[i], " VIF = ", vif)
In [49]:
#Calculating VIF values using that function
vif_cal(input_data=final_exam, dependent_col="Final_exam_marks")
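As a cross-check (not part of the original walkthrough), statsmodels ships its own VIF helper, variance_inflation_factor; it expects a design matrix that includes an intercept column, so the row for the constant itself can be ignored:

# Cross-checking our vif_cal output with statsmodels' built-in helper
import statsmodels.api as smapi
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = smapi.add_constant(final_exam.drop(["Final_exam_marks"], axis=1))
for i, col in enumerate(X.columns):
    if col != "const":
        print(col, " VIF = ", round(variance_inflation_factor(X.values, i), 2))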
In [51]:
import statsmodels.formula.api as sm
model2 = sm.ols(formula='Final_exam_marks ~ Sem1_Science+Sem2_Science+Sem2_Math', data=final_exam)
fitted2 = model2.fit()
fitted2.summary()
Out[51]:
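A quick way to check how much explanatory power is lost by dropping Sem1_Math is to compare the R² of the two fitted models; if the predictors really are collinear, the drop should be negligible:

# Dropping a collinear predictor should barely change the overall fit
print("4-variable model R^2:", round(fitted1.rsquared, 4))
print("3-variable model R^2:", round(fitted2.rsquared, 4))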
In [52]:
vif_cal(input_data=final_exam.drop(["Sem1_Math"], axis=1), dependent_col="Final_exam_marks")
In [53]:
vif_cal(input_data=final_exam.drop(["Sem1_Math","Sem1_Science"], axis=1), dependent_col="Final_exam_marks")