LAB: The Problem of Overfitting
In the previous section, we studied how to validate the tree. This lab shows what happens when a tree is grown too far.
- Import Dataset: “Buyers Profiles/Train_data.csv”
- Import both test and training data
- Build a decision tree model on training data
- Find the accuracy on training data
- Find the predictions for test data
- What is the model prediction accuracy on test data?
Solution
- Import Dataset: “Buyers Profiles/Train_data.csv”
- Import both test and training data
Train <- read.csv("C:\\Amrita\\Datavedi\\Buyers Profiles\\Train_data.csv")
Test <- read.csv("C:\\Amrita\\Datavedi\\Buyers Profiles\\Test_data.csv")
- Build a decision tree model on training data
library(rpart)
# minsplit=2 lets rpart try to split any node that has at least 2 rows
buyers_model <- rpart(Bought ~ Age + Gender, method="class", data=Train, control=rpart.control(minsplit=2))
buyers_model
## n= 14
##
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
##
##  1) root 14 7 No (0.5000000 0.5000000)
##    2) Gender=Female 7 1 No (0.8571429 0.1428571)
##      4) Age>=20 4 0 No (1.0000000 0.0000000) *
##      5) Age< 20 3 1 No (0.6666667 0.3333333)
##       10) Age< 11.5 2 0 No (1.0000000 0.0000000) *
##       11) Age>=11.5 1 0 Yes (0.0000000 1.0000000) *
##    3) Gender=Male 7 1 Yes (0.1428571 0.8571429)
##      6) Age>=47 3 1 Yes (0.3333333 0.6666667)
##       12) Age< 52 1 0 No (1.0000000 0.0000000) *
##       13) Age>=52 2 0 Yes (0.0000000 1.0000000) *
##      7) Age< 47 4 0 Yes (0.0000000 1.0000000) *
- Find the accuracy on training data
predicted_values<-predict(buyers_model,type="class")
actual_values<-Train$Bought
conf_matrix<-table(predicted_values,actual_values)
accuracy<-(conf_matrix[1,1]+conf_matrix[2,2])/(sum(conf_matrix))
accuracy
## [1] 1
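The accuracy formula above adds the diagonal of the confusion matrix and divides by the total count. As a minimal sketch of the same idea, the snippet below uses two small hypothetical label vectors (they are illustrative only, not the Buyers Profiles data) and shows that `sum(diag(...))` and `mean(predicted == actual)` give the same number.

```r
# Hypothetical labels for illustration only; not the Buyers Profiles data.
actual    <- factor(c("No", "No", "Yes", "Yes", "Yes", "No"), levels = c("No", "Yes"))
predicted <- factor(c("No", "Yes", "Yes", "Yes", "No", "No"), levels = c("No", "Yes"))

# Rows are predictions, columns are actual classes, as in the lab code.
conf_matrix <- table(predicted, actual)

# Correct predictions sit on the diagonal of the confusion matrix.
accuracy_diag <- sum(diag(conf_matrix)) / sum(conf_matrix)

# Equivalent shortcut: the share of positions where prediction equals truth.
accuracy_mean <- mean(predicted == actual)

print(conf_matrix)
print(accuracy_diag)  # 4 of the 6 hypothetical rows match
```

Either form can replace the explicit `conf_matrix[1,1]+conf_matrix[2,2]` sum; the diagonal version also extends to more than two classes.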
- Find the predictions for test data
predicted_values<-predict(buyers_model,type="class",newdata=Test)
predicted_values
##   1   2   3   4   5   6
##  No  No  No Yes  No Yes
## Levels: No Yes
- What is the model prediction accuracy on test data?
actual_values<-Test$Bought
conf_matrix<-table(predicted_values,actual_values)
accuracy<-(conf_matrix[1,1]+conf_matrix[2,2])/(sum(conf_matrix))
accuracy
## [1] 0.3333333
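To make the two accuracy figures concrete, the short calculation below converts them back into row counts. It assumes only what the outputs above show: 14 training rows (`n= 14`), 6 test predictions, and the two printed accuracies. The variable names are illustrative.

```r
# Figures taken from the outputs above: 14 training rows, 6 test rows.
train_accuracy <- 1
test_accuracy  <- 0.3333333

# Number of correctly classified rows implied by each accuracy.
train_correct <- round(train_accuracy * 14)  # 14 of 14
test_correct  <- round(test_accuracy * 6)    # 2 of 6

# Gap between training and test accuracy; a large gap suggests overfitting.
accuracy_gap <- train_accuracy - test_accuracy

print(c(train_correct = train_correct, test_correct = test_correct))
print(accuracy_gap)
```

The tree classifies every training row correctly but only 2 of the 6 test rows.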
The Final Tree with Rules
- Gender=Female & Age>=20: No
- Gender=Female & Age< 20 & Age< 11.5: No
- Gender=Female & Age< 20 & Age>=11.5: Yes
- Gender=Male & Age>=47 & Age< 52: No
- Gender=Male & Age>=47 & Age>=52: Yes
- Gender=Male & Age< 47: Yes
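The six rules can also be written out by hand as a plain R function, which makes it easier to see how narrow some of them are. This is only a sketch: the function name `predict_by_rules` is hypothetical, and it assumes the gender values are the strings "Female" and "Male" as printed in the tree.

```r
# Hand-coded version of the six rules printed by the tree.
predict_by_rules <- function(age, gender) {
  if (gender == "Female") {
    if (age >= 20) return("No")
    if (age < 11.5) return("No")
    return("Yes")               # Female, 11.5 <= age < 20
  }
  # Remaining case: gender == "Male"
  if (age < 47) return("Yes")
  if (age < 52) return("No")    # Male, 47 <= age < 52
  "Yes"                         # Male, age >= 52
}

# Hypothetical buyers, not rows from the dataset.
print(predict_by_rules(15, "Female"))  # "Yes"
print(predict_by_rules(50, "Male"))    # "No"
```

Rules such as "Male, 47 <= age < 52" rest on a single training row, which is a hint that the tree is memorising the training data.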
The Problem of Overfitting
- If we keep growing the tree, each row of the training data may end up as its own final rule
- The model looks very good on the training data but fails to validate on the test data: here the accuracy is 1 on the training set and only 0.33 on the test set
- Growing the tree beyond a certain level of complexity leads to overfitting
- A really big tree is very likely to suffer from overfitting.
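The effect of tree size can be checked by fitting a fully grown tree and a depth-limited tree on the same training rows and comparing their accuracy on held-out rows. The sketch below is an illustration under stated assumptions: it uses the `kyphosis` dataset that ships with rpart as a stand-in (not the Buyers Profiles data), an arbitrary seed and 60/21 split, and a hypothetical helper `tree_accuracy`.

```r
library(rpart)  # also provides the kyphosis example dataset

# Share of rows where the predicted class matches the actual class.
tree_accuracy <- function(model, data) {
  predicted <- predict(model, newdata = data, type = "class")
  mean(predicted == data$Kyphosis)
}

set.seed(42)  # arbitrary seed so the split is reproducible
train_rows <- sample(nrow(kyphosis), 60)
train <- kyphosis[train_rows, ]
test  <- kyphosis[-train_rows, ]

# Fully grown tree: nodes with 2+ rows may split, no complexity penalty.
deep_tree <- rpart(Kyphosis ~ Age + Number + Start, data = train, method = "class",
                   control = rpart.control(minsplit = 2, cp = 0))

# Same settings, but growth is stopped after two levels of splits.
shallow_tree <- rpart(Kyphosis ~ Age + Number + Start, data = train, method = "class",
                      control = rpart.control(minsplit = 2, cp = 0, maxdepth = 2))

results <- data.frame(
  tree  = c("deep", "shallow"),
  train = c(tree_accuracy(deep_tree, train), tree_accuracy(shallow_tree, train)),
  test  = c(tree_accuracy(deep_tree, test),  tree_accuracy(shallow_tree, test))
)
print(results)
```

The deep tree is at least as accurate as the shallow one on the training rows, because it only refines the same splits. Whether it also does worse on the test rows depends on the data and the split, so the printed table should be read as one example rather than a general result.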
The next post is about pruning a decision tree in R.