203.4.2 Calculating Sensitivity and Specificity in R

Calculating Sensitivity and Specificity

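In the previous section, we studied Model Selection and Cross Validation. In this section, we build a logistic regression model, create its confusion matrix, and calculate sensitivity and specificity.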

Building Logistic Regression Model

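# Load the Fiberbits customer data (adjust the path to your local copy)
Fiberbits <- read.csv("C:\\Amrita\\Datavedi\\Fiberbits\\Fiberbits.csv")
# Fit a logistic regression of active_cust on all other columns
Fiberbits_model_1 <- glm(active_cust ~ ., family = binomial, data = Fiberbits)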

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

summary(Fiberbits_model_1)

## 
## Call:
## glm(formula = active_cust ~ ., family = binomial, data = Fiberbits)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -8.4904  -0.8752   0.4055   0.7619   2.9465  
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                -1.761e+01  3.008e-01  -58.54   <2e-16 ***
## income                      1.710e-03  8.213e-05   20.82   <2e-16 ***
## months_on_network           2.880e-02  1.005e-03   28.65   <2e-16 ***
## Num_complaints             -6.865e-01  3.010e-02  -22.81   <2e-16 ***
## number_plan_changes        -1.896e-01  7.603e-03  -24.94   <2e-16 ***
## relocated                  -3.163e+00  3.957e-02  -79.93   <2e-16 ***
## monthly_bill               -2.198e-03  1.571e-04  -13.99   <2e-16 ***
## technical_issues_per_month -3.904e-01  7.152e-03  -54.58   <2e-16 ***
## Speed_test_result           2.222e-01  2.378e-03   93.44   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 136149  on 99999  degrees of freedom
## Residual deviance:  98359  on 99991  degrees of freedom
## AIC: 98377
## 
## Number of Fisher Scoring iterations: 8

Confusion Matrix

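# Classify as 1 when the fitted probability exceeds the threshold
threshold <- 0.5
predicted_values <- ifelse(predict(Fiberbits_model_1, type = "response") > threshold, 1, 0)
actual_values <- Fiberbits_model_1$y
# Cross-tabulate predictions (rows) against actual values (columns)
conf_matrix <- table(predicted_values, actual_values)
conf_matrix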

##                 actual_values
## predicted_values     0     1
##                0 29492 10847
##                1 12649 47012

Code-Sensitivity and Specificity

library(caret)

## Warning: package 'caret' was built under R version 3.1.3

## Loading required package: lattice

## Loading required package: ggplot2

## Warning: package 'ggplot2' was built under R version 3.1.3

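# caret treats the first level of the table (here 0) as the positive class by default
sensitivity(conf_matrix)

## [1] 0.699841

# and the remaining level (here 1) as the negative class
specificity(conf_matrix)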

## [1] 0.812527
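
To see where these two numbers come from, the same rates can be recomputed by hand from the confusion matrix counts. The following is a minimal sketch in base R: the matrix is typed in from the output above rather than rebuilt from the data, and the names `rate_class0` and `rate_class1` are illustrative, not part of caret.

```r
# Confusion matrix at threshold 0.5, copied from the output above
# (rows = predicted, columns = actual; matrix() fills column by column)
conf_matrix <- matrix(c(29492, 12649, 10847, 47012), nrow = 2,
                      dimnames = list(predicted_values = c("0", "1"),
                                      actual_values = c("0", "1")))

# Share of each actual class that was predicted correctly
rate_class0 <- conf_matrix["0", "0"] / sum(conf_matrix[, "0"])  # 29492 / 42141
rate_class1 <- conf_matrix["1", "1"] / sum(conf_matrix[, "1"])  # 47012 / 57859

round(rate_class0, 6)  # 0.699841, the value reported as sensitivity
round(rate_class1, 6)  # 0.812527, the value reported as specificity
```

The match shows that the reported sensitivity is the hit rate on class 0. If class 1 should count as the positive class, the two values swap roles.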

Changing Threshold

# Raise the threshold: only fitted probabilities above 0.8 are classified as 1
threshold <- 0.8
predicted_values <- ifelse(predict(Fiberbits_model_1, type = "response") > threshold, 1, 0)
actual_values <- Fiberbits_model_1$y
conf_matrix <- table(predicted_values, actual_values)
conf_matrix

##                 actual_values
## predicted_values     0     1
##                0 37767 30521
##                1  4374 27338

Changed Sensitivity and Specificity

sensitivity(conf_matrix)

## [1] 0.8962056

specificity(conf_matrix)

## [1] 0.4724935

Sensitivity and Specificity

Changing the threshold changes which customers are classified as good or bad, so the sensitivity and specificity change as well.
Which of the two should we maximize? What is the ideal threshold?
Ideally we want to maximize both sensitivity and specificity, but this is not always possible. There is always a tradeoff.
Sometimes we want to be 100% sure about the predicted negatives, and sometimes about the predicted positives.
Sometimes we simply don’t want to compromise on sensitivity, and sometimes we don’t want to compromise on specificity.
The threshold is set based on the business problem.
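
The tradeoff is easiest to see by sweeping the threshold over a range of values. The sketch below uses simulated probabilities, not the Fiberbits data, and the helper `rates_at` is a hypothetical name introduced here. It treats 1 as the positive class and computes the rates in base R, so it does not depend on caret's default choice of positive class.

```r
# Hypothetical helper: sensitivity and specificity at one threshold,
# treating 1 as the positive class
rates_at <- function(prob, actual, threshold) {
  predicted <- ifelse(prob > threshold, 1, 0)
  c(threshold   = threshold,
    sensitivity = sum(predicted == 1 & actual == 1) / sum(actual == 1),
    specificity = sum(predicted == 0 & actual == 0) / sum(actual == 0))
}

# Simulated stand-in data (an assumption for illustration only)
set.seed(42)
actual <- rbinom(1000, 1, 0.5)
prob <- plogis(rnorm(1000, mean = ifelse(actual == 1, 1, -1)))

# One row per threshold: sensitivity falls and specificity rises as it grows
thresholds <- seq(0.1, 0.9, by = 0.1)
tradeoff <- t(sapply(thresholds, function(th) rates_at(prob, actual, th)))
tradeoff
```

With the fitted model, `prob` would be `predict(Fiberbits_model_1, type = "response")` and `actual` would be `Fiberbits_model_1$y`.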

When Sensitivity is a High Priority

Predicting bad customers or defaulters before issuing a loan

The profit on a good customer’s loan is not equal to the loss on a bad customer’s loan.
The loss on one bad loan might eat up the profit from 100 good customers.
In this case, one bad customer is not equal to one good customer.
If p is the probability of default, we would like to set the threshold so that we don’t miss any of the bad customers.
We set the threshold so that sensitivity is high.
We can compromise on specificity here. If we wrongly reject a good customer, our loss is very small compared to giving a loan to a bad customer.
We don’t really worry about the good customers here; wrongly rejecting them does little harm, so we can accept a lower specificity.
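
One way to put this into practice is to fix a target sensitivity and take the largest threshold that still meets it, which keeps as much specificity as the target allows. This is only a sketch under stated assumptions: the data are simulated, default (1) is the positive class, and the 0.95 target and the names `is_default`, `sensitivity_at`, and `specificity_at` are hypothetical choices for illustration.

```r
# Simulated probabilities of default (illustration only, not real loan data)
set.seed(1)
is_default <- rbinom(2000, 1, 0.2)
p <- plogis(rnorm(2000, mean = ifelse(is_default == 1, 1, -1.5)))

# A customer is flagged as a defaulter when p exceeds the threshold
sensitivity_at <- function(th) sum(p > th & is_default == 1) / sum(is_default == 1)
specificity_at <- function(th) sum(p <= th & is_default == 0) / sum(is_default == 0)

# Largest candidate threshold that still catches at least 95% of defaulters
target <- 0.95
candidates <- seq(0.01, 0.99, by = 0.01)
sens <- sapply(candidates, sensitivity_at)
chosen <- max(candidates[sens >= target])

chosen
sensitivity_at(chosen)
specificity_at(chosen)  # the specificity we give up to reach the target
```

A stricter target pushes the chosen threshold lower and rejects more good customers, which is the compromise described above.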

When Specificity is a High Priority

Testing whether a medicine is good or poisonous

In this case, we really have to avoid cases where the medicine is actually poisonous but the model predicts it as good.
We can’t take any chances here.
The specificity needs to be near 100%.
The sensitivity can be compromised here. Failing to use a good medicine is far less harmful than using a poisonous one.

Sensitivity vs Specificity – Importance

There are some cases where sensitivity is important and needs to be near 1.
There are business cases where specificity is important and needs to be near 1.
We need to understand the business problem and decide the relative importance of sensitivity and specificity.

ROC Curve

If we consider all possible threshold values and the corresponding specificity and sensitivity rates, what is the overall accuracy of the model?
The ROC (Receiver Operating Characteristic) curve is drawn by plotting the false positive rate on the X-axis and the true positive rate on the Y-axis.
The ROC curve tells us how many mistakes we make in order to identify all the positives.
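
The construction described above can be sketched in base R by computing the two rates at each threshold and plotting one against the other. The data here are simulated (an assumption for illustration), with 1 as the positive class. With the fitted model, `prob` and `actual` would come from `predict(Fiberbits_model_1, type = "response")` and `Fiberbits_model_1$y`.

```r
# Simulated stand-in for fitted probabilities and actual classes
set.seed(7)
actual <- rbinom(500, 1, 0.5)
prob <- plogis(rnorm(500, mean = ifelse(actual == 1, 1, -1)))

# True positive rate and false positive rate at each threshold
thresholds <- seq(0, 1, by = 0.01)
tpr <- sapply(thresholds, function(th) sum(prob > th & actual == 1) / sum(actual == 1))
fpr <- sapply(thresholds, function(th) sum(prob > th & actual == 0) / sum(actual == 0))

# ROC curve: false positive rate on the X-axis, true positive rate on the Y-axis
plot(fpr, tpr, type = "l",
     xlab = "False positive rate (1 - specificity)",
     ylab = "True positive rate (sensitivity)",
     main = "ROC curve (simulated data)")
abline(0, 1, lty = 2)  # diagonal: a model that guesses at random
```

Each point on the curve corresponds to one threshold, so moving along it is the same as trading sensitivity against specificity.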

The next post is about ROC and AUC.

21st June 2017
