In the previous section, we studied Model Selection and Cross Validation.

```
# Load the Fiberbits data and fit a logistic regression on all predictors
Fiberbits <- read.csv("C:\\Amrita\\Datavedi\\Fiberbits\\Fiberbits.csv")
Fiberbits_model_1 <- glm(active_cust ~ ., family = binomial, data = Fiberbits)
```

`## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred`

`summary(Fiberbits_model_1)`

```
##
## Call:
## glm(formula = active_cust ~ ., family = binomial, data = Fiberbits)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -8.4904 -0.8752 0.4055 0.7619 2.9465
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.761e+01 3.008e-01 -58.54 <2e-16 ***
## income 1.710e-03 8.213e-05 20.82 <2e-16 ***
## months_on_network 2.880e-02 1.005e-03 28.65 <2e-16 ***
## Num_complaints -6.865e-01 3.010e-02 -22.81 <2e-16 ***
## number_plan_changes -1.896e-01 7.603e-03 -24.94 <2e-16 ***
## relocated -3.163e+00 3.957e-02 -79.93 <2e-16 ***
## monthly_bill -2.198e-03 1.571e-04 -13.99 <2e-16 ***
## technical_issues_per_month -3.904e-01 7.152e-03 -54.58 <2e-16 ***
## Speed_test_result 2.222e-01 2.378e-03 93.44 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 136149 on 99999 degrees of freedom
## Residual deviance: 98359 on 99991 degrees of freedom
## AIC: 98377
##
## Number of Fisher Scoring iterations: 8
```

```
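# Classify as 1 when the predicted probability exceeds the threshold
threshold <- 0.5
predicted_values <- ifelse(predict(Fiberbits_model_1, type = "response") > threshold, 1, 0)
actual_values <- Fiberbits_model_1$y
# Rows are predicted classes, columns are actual classes
conf_matrix <- table(predicted_values, actual_values)
conf_matrix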
```

```
## actual_values
## predicted_values 0 1
## 0 29492 10847
## 1 12649 47012
```

`library(caret)`

`## Warning: package 'caret' was built under R version 3.1.3`

`## Loading required package: lattice`

`## Loading required package: ggplot2`

`## Warning: package 'ggplot2' was built under R version 3.1.3`

`sensitivity(conf_matrix)`

`## [1] 0.699841`

`specificity(conf_matrix)`

`## [1] 0.812527`
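It helps to check which class these two numbers refer to. When caret's `sensitivity()` and `specificity()` are given a table, they treat the first row level (here `0`) as the positive class by default. The sketch below recomputes both rates by hand from the counts printed above, so it runs without the dataset; the names `sens_class0` and `spec_class0` are illustrative and not part of the original code.

```r
# Confusion matrix at threshold 0.5, counts copied from the output above.
# matrix() fills column by column: first the actual-0 column, then actual-1.
conf_matrix <- as.table(matrix(
  c(29492, 12649, 10847, 47012), nrow = 2,
  dimnames = list(predicted_values = c("0", "1"),
                  actual_values    = c("0", "1"))
))

# Class 0 treated as "positive":
# sensitivity = correctly predicted 0s / all actual 0s
sens_class0 <- conf_matrix["0", "0"] / sum(conf_matrix[, "0"])
# specificity = correctly predicted 1s / all actual 1s
spec_class0 <- conf_matrix["1", "1"] / sum(conf_matrix[, "1"])

round(sens_class0, 6)  # 0.699841
round(spec_class0, 6)  # 0.812527
```

If class `1` should be the positive class instead, `sensitivity(conf_matrix, positive = "1")` can be used, and the two values swap roles.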

```
# Repeat the classification with a stricter threshold
threshold <- 0.8
predicted_values <- ifelse(predict(Fiberbits_model_1, type = "response") > threshold, 1, 0)
actual_values <- Fiberbits_model_1$y
conf_matrix <- table(predicted_values, actual_values)
conf_matrix
```

```
## actual_values
## predicted_values 0 1
## 0 37767 30521
## 1 4374 27338
```

`sensitivity(conf_matrix)`

`## [1] 0.8962056`

`specificity(conf_matrix)`

`## [1] 0.4724935`

- Changing the threshold changes which customers are classified as good or bad, so the sensitivity and specificity change as well.
- Which of the two should we maximize? What is the ideal threshold?
- Ideally we want to maximize both sensitivity and specificity, but this is not always possible. There is always a tradeoff.
- Sometimes we want to be 100% sure about the predicted negatives; sometimes we want to be 100% sure about the predicted positives.
- Sometimes we simply don’t want to compromise on sensitivity; sometimes we don’t want to compromise on specificity.
- The threshold is set based on the business problem.
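The tradeoff is easier to see when several thresholds are tried at once. Below is a minimal sketch on simulated data (not the Fiberbits data); the helper `sens_spec_at` and the simulated `prob` and `actual` vectors are assumptions made for illustration. Class 0 is treated as the positive class so that the direction of change matches the caret output above.

```r
# Hypothetical helper: sensitivity and specificity at one threshold,
# with class 0 treated as the positive class.
sens_spec_at <- function(threshold, prob, actual) {
  predicted <- ifelse(prob > threshold, 1, 0)
  c(threshold   = threshold,
    sensitivity = sum(predicted == 0 & actual == 0) / sum(actual == 0),
    specificity = sum(predicted == 1 & actual == 1) / sum(actual == 1))
}

# Simulated stand-in data: true labels and noisy predicted probabilities
set.seed(42)
actual <- rbinom(1000, 1, 0.6)
prob   <- plogis(rnorm(1000, mean = ifelse(actual == 1, 1, -1)))

# One row per threshold
thresholds   <- seq(0.1, 0.9, by = 0.1)
sweep_result <- t(sapply(thresholds, sens_spec_at, prob = prob, actual = actual))
round(sweep_result, 3)
```

With the fitted model from this post, `prob` would be `predict(Fiberbits_model_1, type = "response")` and `actual` would be `Fiberbits_model_1$y`. As the threshold rises, one rate goes up while the other goes down.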

- Predicting bad customers (defaulters) before issuing a loan

- The profit on one good customer’s loan is not equal to the loss on one bad customer’s loan.
- The loss on one bad loan might eat up the profit from 100 good customers.
- In this case, one bad customer is not equivalent to one good customer.
- If p is the probability of default, we would like to set the threshold so that we don’t miss any of the bad customers.
- We set the threshold so that sensitivity is high.
- We can compromise on specificity here. If we wrongly reject a good customer, our loss is much smaller than the loss from giving a loan to a bad customer.
- We don’t really worry about the good customers here; they are not harmful, so we can accept lower specificity.
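One way to turn this reasoning into a threshold is to attach a cost to each kind of mistake and pick the threshold with the lowest total cost. The sketch below uses simulated data and assumed costs (100 for an approved bad loan, 1 for a rejected good customer, echoing the 100-to-1 illustration above); all names and numbers are hypothetical.

```r
# Simulated stand-in data: about 10% of applicants default
set.seed(1)
is_default <- rbinom(2000, 1, 0.1)
p_default  <- plogis(rnorm(2000, mean = ifelse(is_default == 1, 0.5, -2.5)))

# Assumed costs, for illustration only
cost_bad_loan    <- 100  # loss when a defaulter is approved
cost_lost_profit <- 1    # profit missed when a good customer is rejected

# Total cost of rejecting everyone whose p_default exceeds the threshold
total_cost <- function(threshold) {
  flagged           <- p_default > threshold
  missed_defaulters <- sum(!flagged & is_default == 1)
  rejected_good     <- sum(flagged & is_default == 0)
  cost_bad_loan * missed_defaulters + cost_lost_profit * rejected_good
}

thresholds     <- seq(0.01, 0.99, by = 0.01)
costs          <- sapply(thresholds, total_cost)
best_threshold <- thresholds[which.min(costs)]
best_threshold
```

Because a missed defaulter is so much more expensive than a rejected good customer, the lowest-cost threshold lands well below 0.5: more applicants are flagged, sensitivity for defaulters rises, and specificity falls.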

- Testing whether a medicine is good or poisonous

- Here we must avoid the case where a medicine is actually poisonous but the model predicts it as good.
- We can’t take any chances here.
- The specificity needs to be near 100%.
- The sensitivity can be compromised here. Failing to use a good medicine is far less harmful than using a poisonous one.

- There are some cases where sensitivity is important and needs to be close to 1.
- There are other business cases where specificity is important and needs to be close to 1.
- We need to understand the business problem and decide the relative importance of sensitivity and specificity.

- If we consider all possible threshold values and the corresponding sensitivity and specificity, how accurate is the model overall?
- The ROC (receiver operating characteristic) curve is drawn with the false positive rate on the X-axis and the true positive rate on the Y-axis.
- The ROC curve tells us how many mistakes we make in order to identify all the positives.
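As a preview, the points of an ROC curve can be computed by hand by sweeping the threshold from 0 to 1. This is a minimal base-R sketch on simulated data, with class 1 taken as the positive class; the `score` and `actual` vectors are assumptions for illustration, not the Fiberbits model.

```r
# Simulated stand-in data: true labels and predicted probabilities
set.seed(7)
actual <- rbinom(500, 1, 0.5)
score  <- plogis(rnorm(500, mean = ifelse(actual == 1, 1, -1)))

thresholds <- seq(0, 1, by = 0.01)

# True positive rate: share of actual 1s predicted as 1
tpr <- sapply(thresholds, function(t) sum(score > t & actual == 1) / sum(actual == 1))
# False positive rate: share of actual 0s predicted as 1
fpr <- sapply(thresholds, function(t) sum(score > t & actual == 0) / sum(actual == 0))

# Each threshold gives one (FPR, TPR) point on the curve
plot(fpr, tpr, type = "l",
     xlab = "False positive rate", ylab = "True positive rate",
     main = "ROC curve (simulated data)")
abline(0, 1, lty = 2)  # reference line for a random classifier
```

A threshold of 0 classifies everything as positive (top-right corner), and a threshold of 1 classifies nothing as positive (bottom-left corner).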

The next post is about ROC and AUC.