Multicollinearity
In the previous section, we studied goodness of fit for logistic regression.
- When the relationship between X and Y is non-linear, we use logistic regression
- Multicollinearity is an issue related to the predictor variables, so it needs to be checked and fixed in logistic regression as well
- Otherwise, the individual coefficients of the predictors will be affected by the interdependency among them
- The process of identification is the same as in linear regression: we check the VIF values
Multicollinearity in R
library(car)
vif(Fiberbits_model_1)
## income months_on_network
## 4.590705 4.641040
## Num_complaints number_plan_changes
## 1.018607 1.126892
## relocated monthly_bill
## 1.145847 1.017565
## technical_issues_per_month Speed_test_result
## 1.020648 1.206999
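All the VIF values above are below 5, so by a common rule of thumb (a VIF above 5 signals problematic collinearity), multicollinearity is not a serious concern in this model. A quick check of that threshold, assuming Fiberbits_model_1 from the earlier sections:
vif_values <- vif(Fiberbits_model_1)
# Flag any predictor crossing the threshold; here the result is empty
vif_values[vif_values > 5]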
Individual Impact of Variables
- Out of these predictor variables, which are the important ones?
- If we had to choose the top five variables, what would they be?
- While selecting the model, we may want to drop a few of the less impactful variables
- How do we rank the predictor variables in order of their importance?
- We can simply look at the z-value of each variable and compare their absolute values
- Or calculate the Wald chi-square, which is approximately the square of the z-score
- The Wald chi-square value helps in ranking the variables
Code-Individual Impact of Variables
library(caret)
varImp(Fiberbits_model_1, scale = FALSE)
## Overall
## income 20.81981
## months_on_network 28.65421
## Num_complaints 22.81102
## number_plan_changes 24.93955
## relocated 79.92677
## monthly_bill 13.99490
## technical_issues_per_month 54.58123
## Speed_test_result 93.43471
For a glm, varImp() reports the absolute value of the z-score for each variable.
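Since varImp() returns |z|, the Wald chi-square ranking described above can be obtained by squaring these values. A minimal sketch, again assuming Fiberbits_model_1 is the fitted model from the previous sections:
# z-values come from the coefficient table of the model summary
z_values <- summary(Fiberbits_model_1)$coefficients[, "z value"]
# Wald chi-square is approximately the square of the z-score
wald_chisq <- z_values^2
# Drop the intercept and rank the predictors by importance
sort(wald_chisq[-1], decreasing = TRUE)
Because squaring preserves the ordering of the absolute values, this ranking matches the varImp() output above.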
Model Selection – AIC and BIC
- AIC and BIC values play a role similar to adjusted R-squared in linear regression
- The AIC of a stand-alone model has no real use, but when we are choosing between models, AIC really helps
- Given a collection of models for the data, AIC estimates the quality of each model relative to each of the other models
- If we are choosing between two models, the model with the lower AIC is preferred
- AIC is an estimate of the information lost when a given model is used to represent the process that generated the data
- AIC = -2ln(L) + 2k
- where L is the maximum value of the likelihood function for the model
- and k is the number of parameters estimated by the model (the coefficients, including the intercept)
- BIC is an alternative to AIC with a slightly different penalty: BIC = -2ln(L) + k ln(n), where n is the number of observations. We will follow either AIC or BIC throughout our analysis
Code-AIC and BIC
library(stats)
AIC(Fiberbits_model_1)
## [1] 98377.36
BIC(Fiberbits_model_1)
## [1] 98462.97
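To see the formulas in action, both values can be reproduced from the model's log-likelihood. A minimal check, assuming Fiberbits_model_1 from earlier:
ll <- logLik(Fiberbits_model_1)
k  <- attr(ll, "df")                # number of estimated parameters
-2 * as.numeric(ll) + 2 * k         # matches AIC(Fiberbits_model_1)
n  <- nobs(Fiberbits_model_1)
-2 * as.numeric(ll) + k * log(n)    # matches BIC(Fiberbits_model_1)
When comparing models, the one with the lower value wins. For example, a hypothetical reduced model that drops monthly_bill (the weakest predictor in the importance ranking above) can be compared directly:
Fiberbits_model_reduced <- update(Fiberbits_model_1, . ~ . - monthly_bill)
AIC(Fiberbits_model_1, Fiberbits_model_reduced)
BIC(Fiberbits_model_1, Fiberbits_model_reduced)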
The next post is about Model Selection in logistic regression.