204.2.5 Multicollinearity and Individual Impact Of Variables in Logistic Regression

Detecting and fixing multicollinearity, and finding the impact of each individual variable.

Link to the previous post : https://statinfer.com/204-2-4-goodness-of-fit-for-logistic-regression/

The previous post was about goodness of fit. We covered the confusion matrix there and will cover the remaining measures in later posts.

But first, let’s deal with a common modeling issue, multicollinearity:


  • The relation between X and Y is non-linear, so we used logistic regression.
  • Multicollinearity is an issue among the predictor variables; it does not involve the target.
  • Multicollinearity needs to be fixed in logistic regression as well.
  • Otherwise the individual coefficients of the predictors will be affected by the inter-dependency.
  • The process of identification is the same as in linear regression: compute the VIF of each predictor.
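
As a rough illustration of what VIF measures, the sketch below uses made-up data (x1, x2 and x3 are not Fiber bits columns) and plain Python. It relies on the fact that with only two predictors, the R-squared from regressing one on the other equals their squared Pearson correlation, so VIF = 1 / (1 - r^2). The helper names are hypothetical.

```python
import math

def pearson_r(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def vif_two_predictors(a, b):
    """VIF = 1 / (1 - R^2); with only two predictors, R^2 equals r^2."""
    r = pearson_r(a, b)
    return 1.0 / (1.0 - r ** 2)

# Made-up data: x2 is almost exactly 2 * x1, x3 is only weakly related to x1
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]
x3 = [5, 1, 4, 8, 2, 7, 3, 6]

vif_high = vif_two_predictors(x1, x2)  # near-duplicate predictors
vif_low = vif_two_predictors(x1, x3)   # weakly related predictors
print("x1 vs x2 VIF =", round(vif_high, 2))
print("x1 vs x3 VIF =", round(vif_low, 2))
```

The near-duplicate pair produces a very large VIF, while the weakly related pair stays close to 1. With more than two predictors, R^2 has to come from a regression of each predictor on all the others.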

Practice : Multicollinearity

  • Is there any multicollinearity in fiber bits model?
  • Identify and remove multicollinearity from the model
In [27]:
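from statsmodels.api import OLS, add_constant

def vif_cal(input_data, dependent_col):
    x_vars = input_data.drop([dependent_col], axis=1)
    for name in x_vars.columns:
        # Regress each predictor on all the others; VIF = 1 / (1 - R^2)
        others = add_constant(x_vars.drop([name], axis=1))
        rsq = OLS(x_vars[name], others).fit().rsquared
        vif = round(1 / (1 - rsq), 2)
        print(name, " VIF = ", vif)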
In [28]:
#Calculating VIF values using that function
vif_cal(input_data=Fiber, dependent_col="active_cust")
income  VIF =  1.02
months_on_network  VIF =  1.03
Num_complaints  VIF =  1.01
number_plan_changes  VIF =  1.59
relocated  VIF =  1.56
monthly_bill  VIF =  1.02
technical_issues_per_month  VIF =  1.06
Speed_test_result  VIF =  1.0

All VIF values are close to 1, well below the commonly used rule-of-thumb cutoff of 5, so there is no sign of multicollinearity in this model and no variable needs to be removed.

Individual Impact of Variables

  • Out of these predictor variables, which are the important ones?
  • If we have to choose the top 5 variables, which are they?
  • While selecting the model, we may want to drop a few of the less impactful variables.
  • How do we rank the predictor variables in order of their importance?
  • We can simply look at the z value of each variable and compare the absolute values.
  • Or calculate the Wald chi-square, which for a single coefficient is the square of its z value.
  • The Wald chi-square gives the same ranking as the absolute z value.
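
To make the link between the z value and the Wald chi-square concrete, here is a minimal sketch in plain Python. The coefficient and standard error are made-up numbers, and the function name is hypothetical. It assumes the usual definitions: z = coef / std err, Wald chi-square = z^2, and a chi-square p-value with 1 degree of freedom, which equals the two-sided normal p-value of z.

```python
import math

def wald_from_estimate(coef, std_err):
    """Return (z, Wald chi-square, p-value) for one coefficient."""
    z = coef / std_err
    wald = z ** 2
    # P(chi-square with 1 df > z^2) equals P(|Z| > |z|) for a standard normal Z
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, wald, p_value

# Made-up estimate, not taken from the Fiber bits model
z, wald, p_value = wald_from_estimate(coef=0.5, std_err=0.2)
print("z =", round(z, 3), " Wald chi-square =", round(wald, 3), " p =", round(p_value, 4))
```

Because the Wald chi-square is a monotone function of |z|, ranking by either quantity gives the same order.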

Practice : Individual Impact of Variables

  • Identify the top impacting and least impacting variables in the Fiber bits model.
  • Find the variable importance and order the variables by their impact.
In [29]:
Logit Regression Results

Dep. Variable:   active_cust         No. Observations:   100000
Model:           Logit               Df Residuals:       99992
Method:          MLE                 Df Model:           7
Date:            Sun, 16 Oct 2016    Pseudo R-squ.:      0.2403
Time:            14:35:51            Log-Likelihood:     -51717.
converged:       True                LL-Null:            -68074.
                                     LLR p-value:        0.000

                            coef       std err   z         P>|z|   [95.0% Conf. Int.]
income                      1.71e-05   4.17e-06  4.097     0.000   8.92e-06   2.53e-05
months_on_network           0.0150     0.000     31.172    0.000   0.014      0.016
Num_complaints              -1.7669    0.027     -65.284   0.000   -1.820     -1.714
number_plan_changes         -0.1784    0.007     -23.909   0.000   -0.193     -0.164
relocated                   -3.0826    0.040     -76.259   0.000   -3.162     -3.003
monthly_bill                -0.0024    0.000     -16.014   0.000   -0.003     -0.002
technical_issues_per_month  -0.4636    0.007     -64.010   0.000   -0.478     -0.449
Speed_test_result           0.1094     0.001     75.073    0.000   0.107      0.112

By absolute z value, the top impacting variables are relocated (|z| = 76.259) and Speed_test_result (|z| = 75.073).

The least impacting variables are monthly_bill (|z| = 16.014) and income (|z| = 4.097).
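
The full ranking can be reproduced from the z column of the summary. The sketch below copies those z values by hand into a dictionary (the variable names `z_values` and `ranking` are just for illustration), squares them to get the Wald chi-square, and sorts in descending order.

```python
# z values copied from the Logit summary above
z_values = {
    "income": 4.097,
    "months_on_network": 31.172,
    "Num_complaints": -65.284,
    "number_plan_changes": -23.909,
    "relocated": -76.259,
    "monthly_bill": -16.014,
    "technical_issues_per_month": -64.010,
    "Speed_test_result": 75.073,
}

# Wald chi-square for a single coefficient is the square of its z value
wald = {name: z ** 2 for name, z in z_values.items()}

# Most impactful variable first
ranking = sorted(wald.items(), key=lambda item: item[1], reverse=True)

for rank, (name, chi_sq) in enumerate(ranking, start=1):
    print(rank, name, round(chi_sq, 1))
```

In a live session the same numbers could be read from the fitted model instead of being typed in, but that depends on how the model object was created in the earlier posts.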

The next post is about model selection in logistic regression.

Link to the next post : https://statinfer.com/204-2-6-model-selection-logistic-regression/

