In the previous section, we studied the random forest algorithm.
Let's now put the concept into practice using R.
LAB: Random Forest
- Dataset: /Car Accidents IOT/Train.csv and /Car Accidents IOT/Test.csv
- Build a model to predict the fatality of an accident.
- Build a decision tree model on the training data.
- On the test data, calculate the classification error and accuracy.
- Build a random forest model on the training data.
- On the test data, calculate the classification error and accuracy.
- How much does the random forest model improve on the single decision tree?
Solution
#Data Import
train<- read.csv("C:/Amrita/Datavedi/Car Accidents IOT/Train.csv")
test<- read.csv("C:/Amrita/Datavedi/Car Accidents IOT/Test.csv")
dim(train)
## [1] 15109 23
head(train)
## Fatal S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11
## 1 1 36.2247 10.77330 0.243897 596 100.6710 0 0 1 28 0.016064 313
## 2 1 35.7343 17.45510 0.243897 600 100.0000 0 0 1 14 0.015812 319
## 3 1 31.6561 7.61366 0.308763 604 99.3377 0 0 1 4 0.015560 323
## 4 1 33.8320 13.11190 0.293195 616 97.4026 0 0 1 8 0.016001 320
## 5 1 42.5138 13.99850 0.259465 632 94.9367 0 0 1 8 0.016064 322
## 6 1 36.1261 14.85930 0.278925 600 100.0000 0 0 1 4 0.015749 314
## S12 S13 S14 S15 S16 S17 S18 S19 S20 S21 S22
## 1 1 1 57 0 0.280 240 5.99375 0 0.0 4 14.9382
## 2 1 1 57 0 0.175 240 5.99375 0 0.0 4 14.8827
## 3 1 1 58 0 0.280 240 5.99375 0 0.0 4 14.6005
## 4 1 1 58 0 0.385 240 4.50625 0 13.0 4 14.6782
## 5 1 1 57 0 0.070 240 5.99375 0 19.5 4 15.3461
## 6 1 1 58 0 0.175 1008 4.50625 0 23.9 4 15.0559
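Before fitting any model, it is worth checking how the two classes of the target variable Fatal are distributed, since a strong imbalance would affect how we read the accuracy figures below. A minimal check using base R:
#Class balance of the target variable
table(train$Fatal)
prop.table(table(train$Fatal))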
### Decision Tree
library(rpart)
library(caret)    # needed later for confusionMatrix()
#Fit a classification tree; minsplit and cp control the size of the tree
crash_model_ds <- rpart(Fatal ~ ., method="class", control=rpart.control(minsplit=30, cp=0.03), data=train)
#Training accuracy
predicted_y<-predict(crash_model_ds, type="class")
table(predicted_y)
## predicted_y
## 0 1
## 5745 9364
confusionMatrix(predicted_y, as.factor(train$Fatal))   # confusionMatrix() is from caret and expects factors
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 4735 1010
## 1 1581 7783
##
## Accuracy : 0.8285
## 95% CI : (0.8224, 0.8345)
## No Information Rate : 0.582
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.643
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.7497
## Specificity : 0.8851
## Pos Pred Value : 0.8242
## Neg Pred Value : 0.8312
## Prevalence : 0.4180
## Detection Rate : 0.3134
## Detection Prevalence : 0.3802
## Balanced Accuracy : 0.8174
##
## 'Positive' Class : 0
##
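Before scoring the test data, the structure of the fitted tree can be inspected. A quick optional sketch using the standard rpart utilities:
#Inspect the fitted tree: complexity parameter table and a basic plot
printcp(crash_model_ds)
plot(crash_model_ds, uniform=TRUE, margin=0.1)
text(crash_model_ds, use.n=TRUE, cex=0.8)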
#Accuracy on test data
predicted_test_ds<-predict(crash_model_ds, test, type="class")
confusionMatrix(predicted_test_ds, as.factor(test$Fatal))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2897 561
## 1 995 4612
##
## Accuracy : 0.8284
## 95% CI : (0.8204, 0.8361)
## No Information Rate : 0.5707
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6448
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.7443
## Specificity : 0.8916
## Pos Pred Value : 0.8378
## Neg Pred Value : 0.8225
## Prevalence : 0.4293
## Detection Rate : 0.3196
## Detection Prevalence : 0.3815
## Balanced Accuracy : 0.8179
##
## 'Positive' Class : 0
##
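The lab also asks for the classification error explicitly. It can be computed directly from the predictions; the variable name error_ds below is just illustrative:
#Classification error of the single tree on the test data
error_ds <- mean(as.character(predicted_test_ds) != as.character(test$Fatal))
error_ds         # misclassification rate
1 - error_ds     # accuracy, should match the confusionMatrix output above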
### Random Forest
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
#Random forest with 200 trees; mtry is set to roughly one third of the columns
rf_model <- randomForest(as.factor(train$Fatal) ~ ., ntree=200, mtry=ncol(train)/3, data=train)
#Training (out-of-bag) accuracy
predicted_y <- predict(rf_model)    # without newdata, predict() returns the out-of-bag predictions
table(predicted_y)
## predicted_y
## 0 1
## 5921 9188
confusionMatrix(predicted_y, as.factor(train$Fatal))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 5600 321
## 1 716 8472
##
## Accuracy : 0.9314
## 95% CI : (0.9272, 0.9353)
## No Information Rate : 0.582
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8577
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8866
## Specificity : 0.9635
## Pos Pred Value : 0.9458
## Neg Pred Value : 0.9221
## Prevalence : 0.4180
## Detection Rate : 0.3706
## Detection Prevalence : 0.3919
## Balanced Accuracy : 0.9251
##
## 'Positive' Class : 0
##
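To see how the out-of-bag error settles down as trees are added, and which sensor variables drive the predictions, the standard randomForest diagnostics can be plotted:
#OOB error rate versus number of trees, and variable importance
plot(rf_model)
legend("topright", colnames(rf_model$err.rate), col=1:3, lty=1:3)
varImpPlot(rf_model)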
#Accuracy on test data
predicted_test_rf <- predict(rf_model, test, type="response")   # "response" returns the predicted class
confusionMatrix(predicted_test_rf, as.factor(test$Fatal))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 3479 192
## 1 413 4981
##
## Accuracy : 0.9333
## 95% CI : (0.9279, 0.9383)
## No Information Rate : 0.5707
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8628
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8939
## Specificity : 0.9629
## Pos Pred Value : 0.9477
## Neg Pred Value : 0.9234
## Prevalence : 0.4293
## Detection Rate : 0.3838
## Detection Prevalence : 0.4050
## Balanced Accuracy : 0.9284
##
## 'Positive' Class : 0
##
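To quantify the improvement asked for in the lab, the two test-set error rates can be compared directly (the variable names below are illustrative):
#Test classification error: single tree vs random forest
error_tree <- mean(as.character(predicted_test_ds) != as.character(test$Fatal))
error_rf <- mean(as.character(predicted_test_rf) != as.character(test$Fatal))
c(tree_error=error_tree, rf_error=error_rf, improvement=error_tree - error_rf)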
On the test data, the single decision tree achieves an accuracy of about 82.8%, while the random forest reaches about 93.3%, an improvement of roughly 10.5 percentage points; equivalently, the classification error drops from about 17.2% to about 6.7%.
The next post is on Boosting.