Multicollinearity
In the previous section, we studied goodness of fit for logistic regression.
- When the relationship between X and Y is non-linear, we use logistic regression
- Multicollinearity is an issue related to the predictor variables, so it needs to be checked and fixed in logistic regression as well
- Otherwise, the individual coefficients of the predictors will be affected by the interdependency among them
- The process of identification is the same as in linear regression: we check the VIF values
Multicollinearity in R
library(car)
vif(Fiberbits_model_1)
## income months_on_network
## 4.590705 4.641040
## Num_complaints number_plan_changes
## 1.018607 1.126892
## relocated monthly_bill
## 1.145847 1.017565
## technical_issues_per_month Speed_test_result
## 1.020648 1.206999
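All the VIF values above are below 5, so by a common rule of thumb (a VIF above 5 signals problematic collinearity), multicollinearity is not a serious concern in this model. A quick check of that threshold, assuming Fiberbits_model_1 from the earlier sections:
vif_values <- vif(Fiberbits_model_1)
# Flag any predictor crossing the threshold; here the result is empty
vif_values[vif_values > 5]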
Individual Impact of Variables
- Out of these predictor variables, which are the important ones?
- If we had to choose the top five variables, what would they be?
- While selecting the model, we may want to drop a few of the less impactful variables
- How do we rank the predictor variables in order of their importance?
- We can simply look at the z-value of each variable and compare their absolute values
- Or calculate the Wald chi-square, which is approximately the square of the z-score
- The Wald chi-square value helps in ranking the variables
Code-Individual Impact of Variables
library(caret)
varImp(Fiberbits_model_1, scale = FALSE)
## Overall
## income 20.81981
## months_on_network 28.65421
## Num_complaints 22.81102
## number_plan_changes 24.93955
## relocated 79.92677
## monthly_bill 13.99490
## technical_issues_per_month 54.58123
## Speed_test_result 93.43471
For a glm, varImp() reports the absolute value of the z-score for each variable.
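Since varImp() returns |z|, the Wald chi-square ranking described above can be obtained by squaring these values. A minimal sketch, again assuming Fiberbits_model_1 is the fitted model from the previous sections:
# z-values come from the coefficient table of the model summary
z_values <- summary(Fiberbits_model_1)$coefficients[, "z value"]
# Wald chi-square is approximately the square of the z-score
wald_chisq <- z_values^2
# Drop the intercept and rank the predictors by importance
sort(wald_chisq[-1], decreasing = TRUE)
Because squaring preserves the ordering of the absolute values, this ranking matches the varImp() output above.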
Model Selection – AIC and BIC
- AIC and BIC values play a role similar to adjusted R-squared in linear regression
- The AIC of a stand-alone model has no real use, but when we are choosing between models, AIC really helps
- Given a collection of models for the data, AIC estimates the quality of each model relative to each of the other models
- If we are choosing between two models, the model with the lower AIC is preferred
- AIC is an estimate of the information lost when a given model is used to represent the process that generated the data
- AIC = -2ln(L) + 2k
- where L is the maximum value of the likelihood function for the model
- and k is the number of parameters estimated by the model (the coefficients, including the intercept)
- BIC is an alternative to AIC with a slightly different penalty: BIC = -2ln(L) + k ln(n), where n is the number of observations. We will follow either AIC or BIC throughout our analysis
Code-AIC and BIC
library(stats)
AIC(Fiberbits_model_1)
## [1] 98377.36
BIC(Fiberbits_model_1)
## [1] 98462.97
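To see the formulas in action, both values can be reproduced from the model's log-likelihood. A minimal check, assuming Fiberbits_model_1 from earlier:
ll <- logLik(Fiberbits_model_1)
k  <- attr(ll, "df")                # number of estimated parameters
-2 * as.numeric(ll) + 2 * k         # matches AIC(Fiberbits_model_1)
n  <- nobs(Fiberbits_model_1)
-2 * as.numeric(ll) + k * log(n)    # matches BIC(Fiberbits_model_1)
When comparing models, the one with the lower value wins. For example, a hypothetical reduced model that drops monthly_bill (the weakest predictor in the importance ranking above) can be compared directly:
Fiberbits_model_reduced <- update(Fiberbits_model_1, . ~ . - monthly_bill)
AIC(Fiberbits_model_1, Fiberbits_model_reduced)
BIC(Fiberbits_model_1, Fiberbits_model_reduced)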
The next post is about Model Selection in logistic regression.