# 203.2.6 Model Selection : Logistic Regression

### LAB-Logistic Regression Model Selection

In the previous section, we studied multicollinearity and the individual impact of variables in logistic regression.

1. What are the top two impacting variables in the Fiberbits model?
1. What are the least impacting variables in the Fiberbits model?
1. Can we drop any of these variables?
1. Can we derive any new variables to increase the accuracy of the model?
1. What is the final model? What is the best accuracy that you can expect on this data?

### Solution

1. What are the top two impacting variables in the Fiberbits model?

Speed_test_result and relocated are the two most impactful variables.

1. What are the least impacting variables in the Fiberbits model?

monthly_bill and income are the least impactful variables.

1. Can we drop any of these variables?

monthly_bill and income are candidates for dropping because they have the least impact compared with the other predictors. Before taking the final decision, we need to compare the accuracy and AIC of the model with and without each of them.
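The accuracy-and-AIC check described above is repeated for every candidate model, so it can be wrapped in a small helper. The sketch below is illustrative only: `model_fit_summary` is a hypothetical name that does not appear in the original lab, and the `toy` data frame is simulated because the Fiberbits data is not bundled here.

```r
# Hypothetical helper (not part of the original lab): accuracy and AIC of a
# fitted binomial glm at a given probability threshold.
model_fit_summary <- function(model, threshold = 0.5) {
  predicted <- ifelse(predict(model, type = "response") > threshold, 1, 0)
  actual <- model$y
  # Fixing the factor levels keeps the table 2x2 even if one class is never predicted
  conf_matrix <- table(
    predicted = factor(predicted, levels = c(0, 1)),
    actual    = factor(actual, levels = c(0, 1))
  )
  list(
    conf_matrix = conf_matrix,
    accuracy    = sum(diag(conf_matrix)) / sum(conf_matrix),
    aic         = AIC(model)
  )
}

# Simulated stand-in data (an assumption, used only to make the sketch runnable)
set.seed(1)
n <- 500
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- rbinom(n, 1, plogis(0.5 + 1.5 * toy$x1))

toy_model <- glm(y ~ ., family = binomial(), data = toy)
fit <- model_fit_summary(toy_model)
fit$accuracy
fit$aic
```

With the lab's data, the same call would presumably be `model_fit_summary(Fiberbits_model_1)`.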

### AIC & Accuracy of Model1

threshold=0.5

predicted_values<-ifelse(predict(Fiberbits_model_1,type="response")>threshold,1,0)
actual_values<-Fiberbits_model_1$y
conf_matrix<-table(predicted_values,actual_values)
conf_matrix
##                 actual_values
## predicted_values     0     1
##                0 29492 10847
##                1 12649 47012
accuracy1<-(conf_matrix[1,1]+conf_matrix[2,2])/(sum(conf_matrix))
accuracy1
## [1] 0.76504
AIC(Fiberbits_model_1)
## [1] 98377.36

### AIC & Accuracy of Model1 without monthly_bill

Fiberbits_model_11<-glm(active_cust~.-monthly_bill,family=binomial(),data=Fiberbits)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
threshold=0.5

predicted_values<-ifelse(predict(Fiberbits_model_11,type="response")>threshold,1,0)
actual_values<-Fiberbits_model_11$y

conf_matrix<-table(predicted_values,actual_values)
accuracy11<-(conf_matrix[1,1]+conf_matrix[2,2])/(sum(conf_matrix))
accuracy11
## [1] 0.76337
AIC(Fiberbits_model_11)
## [1] 98580.54
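
Refitting the model once per dropped variable, as done above, can also be automated. The sketch below uses base R's `drop1()`, which reports the AIC of the model refitted without each predictor in turn. The data is simulated and the column names (`strong`, `weak`, `noise`) are made up for illustration; they are not Fiberbits variables.

```r
# Simulated stand-in data: one strong predictor, one weak one, one pure noise
set.seed(42)
n <- 1000
toy <- data.frame(
  strong = rnorm(n),
  weak   = rnorm(n),
  noise  = rnorm(n)
)
toy$y <- rbinom(n, 1, plogis(-0.2 + 2 * toy$strong + 0.3 * toy$weak))

full_model <- glm(y ~ ., family = binomial(), data = toy)

# One row per predictor: deviance and AIC of the model refitted without it,
# plus a likelihood-ratio test against the full model
single_drops <- drop1(full_model, test = "Chisq")
single_drops
```

The `<none>` row is the full model; a predictor whose row shows a much higher AIC than `<none>` is one the model loses information without.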

### AIC & Accuracy of Model1 without income

Fiberbits_model_2<-glm(active_cust~.-income,family=binomial(),data=Fiberbits)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
threshold=0.5

predicted_values<-ifelse(predict(Fiberbits_model_2,type="response")>threshold,1,0)
actual_values<-Fiberbits_model_2$y
conf_matrix<-table(predicted_values,actual_values)
accuracy2<-(conf_matrix[1,1]+conf_matrix[2,2])/(sum(conf_matrix))
accuracy2
## [1] 0.76695
AIC(Fiberbits_model_2)
## [1] 99076.27

### Deciding which Variable to Drop

| Model | All Variables | Without monthly_bill | Without income |
|---|---|---|---|
| AIC | 98377.36 | 98580.54 | 99076.27 |
| Accuracy | 0.76504 | 0.76337 | 0.76695 |

Dropping income has not reduced the accuracy (0.76695 versus 0.76504 with all variables). AIC, which estimates the loss of information, rises from 98377.36 to 99076.27, a change of less than 1%.

### Output of Model2

summary(Fiberbits_model_2)
##
## Call:
## glm(formula = active_cust ~ . - income, family = binomial(),
##     data = Fiberbits)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -8.4904  -0.8901   0.4175   0.7675   3.1083
##
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)
## (Intercept)                -1.309e+01  2.100e-01  -62.34   <2e-16 ***
## months_on_network           1.004e-02  4.644e-04   21.62   <2e-16 ***
## Num_complaints             -7.071e-01  2.990e-02  -23.65   <2e-16 ***
## number_plan_changes        -2.016e-01  7.571e-03  -26.63   <2e-16 ***
## relocated                  -3.133e+00  3.933e-02  -79.66   <2e-16 ***
## monthly_bill               -2.253e-03  1.566e-04  -14.39   <2e-16 ***
## technical_issues_per_month -3.970e-01  7.159e-03  -55.45   <2e-16 ***
## Speed_test_result           2.198e-01  2.334e-03   94.16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 136149  on 99999  degrees of freedom
## Residual deviance:  99060  on 99992  degrees of freedom
## AIC: 99076
##
## Number of Fisher Scoring iterations: 7

1. Can we derive any new variables to increase the accuracy of the model?

Fiberbits_model_3<-glm(active_cust~ income
                       +months_on_network
                       +Num_complaints
                       +number_plan_changes
                       +relocated
                       +monthly_bill
                       +technical_issues_per_month
                       +technical_issues_per_month*number_plan_changes
                       +Speed_test_result+I(Speed_test_result^2),
                       family=binomial(),data=Fiberbits)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(Fiberbits_model_3)
##
## Call:
## glm(formula = active_cust ~ income + months_on_network + Num_complaints +
##     number_plan_changes + relocated + monthly_bill + technical_issues_per_month +
##     technical_issues_per_month * number_plan_changes + Speed_test_result +
##     I(Speed_test_result^2), family = binomial(), data = Fiberbits)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -4.6112  -0.8478   0.3780   0.7401   2.9909
##
## Coefficients:
##                                                  Estimate Std. Error
## (Intercept)                                    -2.501e+01  3.647e-01
## income                                          1.831e-03  8.386e-05
## months_on_network                               2.905e-02  1.011e-03
## Num_complaints                                 -6.972e-01  3.030e-02
## number_plan_changes                            -4.404e-01  2.199e-02
## relocated                                      -3.253e+00  3.997e-02
## monthly_bill                                   -2.295e-03  1.588e-04
## technical_issues_per_month                     -4.670e-01  9.694e-03
## Speed_test_result                               3.910e-01  4.260e-03
## I(Speed_test_result^2)                         -9.438e-04  1.272e-05
## number_plan_changes:technical_issues_per_month  7.481e-02  6.164e-03
##                                                z value Pr(>|z|)
## (Intercept)                                     -68.56   <2e-16 ***
## income                                           21.83   <2e-16 ***
## months_on_network                                28.73   <2e-16 ***
## Num_complaints                                  -23.00   <2e-16 ***
## number_plan_changes                             -20.03   <2e-16 ***
## relocated                                       -81.39   <2e-16 ***
## monthly_bill                                    -14.46   <2e-16 ***
## technical_issues_per_month                      -48.17   <2e-16 ***
## Speed_test_result                                91.79   <2e-16 ***
## I(Speed_test_result^2)                          -74.20   <2e-16 ***
## number_plan_changes:technical_issues_per_month   12.14   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 136149  on 99999  degrees of freedom
## Residual deviance:  97105  on 99989  degrees of freedom
## AIC: 97127
##
## Number of Fisher Scoring iterations: 7

### AIC & Accuracy of Model 3

threshold=0.5

predicted_values<-ifelse(predict(Fiberbits_model_3,type="response")>threshold,1,0)
actual_values<-Fiberbits_model_3$y

conf_matrix<-table(predicted_values,actual_values)
accuracy3<-(conf_matrix[1,1]+conf_matrix[2,2])/(sum(conf_matrix))
accuracy3
## [1] 0.76061
AIC(Fiberbits_model_3)
## [1] 97127.17

1. What is the final model? What is the best accuracy that you can expect on this data?

AIC(Fiberbits_model_1,Fiberbits_model_2,Fiberbits_model_3)
##                   df      AIC
## Fiberbits_model_1  9 98377.36
## Fiberbits_model_2  8 99076.27
## Fiberbits_model_3 11 97127.17
accuracy1
## [1] 0.76504
accuracy2
## [1] 0.76695
accuracy3
## [1] 0.76061
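
As a worked check on the AIC values above: for a logistic regression fitted to individual 0/1 outcomes, the saturated model has log-likelihood zero, so the residual deviance $D$ equals $-2\ln\hat{L}$. The AIC is then the residual deviance plus twice the number of estimated coefficients $k$ (the `df` column in the comparison):

$$
\mathrm{AIC} = 2k - 2\ln\hat{L} = D + 2k
$$

Plugging in the residual deviances reported by `summary()`, which are rounded to whole numbers in the printed output:

$$
\begin{aligned}
\text{Model 2:}\quad & 99060 + 2 \times 8 = 99076 \\
\text{Model 3:}\quad & 97105 + 2 \times 11 = 97127
\end{aligned}
$$

Both agree with the reported AIC values up to rounding. The $2k$ term is the penalty for extra coefficients, so Model 3 has the lowest AIC despite using three more coefficients than Model 2.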

### Conclusion: Logistic Regression

• Logistic regression is a foundational classification algorithm.
• A good understanding of logistic regression and goodness-of-fit measures helps in understanding more complex machine learning algorithms such as neural networks and SVMs.
• One has to be careful while selecting the model, because all the goodness-of-fit measures here are calculated on the training data.
• We may have to use cross-validation to get an idea of the test error.
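
The cross-validation mentioned in the last point can be sketched in base R as follows. This is a minimal illustration, not the lab's own method: `cv_accuracy` is a hypothetical helper name, the choice of 5 folds is an assumption, and the data is simulated so the example runs on its own.

```r
# Hypothetical helper: k-fold cross-validated accuracy for a binomial glm
cv_accuracy <- function(formula, data, k = 5, threshold = 0.5) {
  response <- all.vars(formula)[1]
  # Randomly assign each row to one of k folds of near-equal size
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))
  fold_acc <- numeric(k)
  for (i in seq_len(k)) {
    train <- data[folds != i, , drop = FALSE]
    test  <- data[folds == i, , drop = FALSE]
    # Fit on k-1 folds, then score the held-out fold
    model <- glm(formula, family = binomial(), data = train)
    prob  <- predict(model, newdata = test, type = "response")
    pred  <- ifelse(prob > threshold, 1, 0)
    fold_acc[i] <- mean(pred == test[[response]])
  }
  mean(fold_acc)
}

# Simulated stand-in data (an assumption, not the Fiberbits data)
set.seed(7)
n <- 600
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- rbinom(n, 1, plogis(1.2 * toy$x1 - 0.8 * toy$x2))

cv_acc <- cv_accuracy(y ~ x1 + x2, toy, k = 5)
cv_acc
```

On the lab's data, a call such as `cv_accuracy(active_cust ~ ., Fiberbits)` would give a held-out estimate to set against the training accuracies reported above.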

In the next section, we will study Decision Trees in R: Segmentation.

20th June 2017

