Statinfer

203.4.5 Type of Datasets, Type of Errors and Problem of Overfitting

Things to know before proceeding further.

The Problem of Overfitting

In the previous section, we studied what makes a best model.

  • In search of the best model for the given data, we add many predictors, polynomial terms, interaction terms, variable transformations, derived variables, indicator/dummy variables, etc.
  • Most of the time we succeed in reducing the error. Which error is this? It is the error on the training data.
  • So, by complicating the model, we fit the best model for the training data.
  • Sometimes the error on the training data can be reduced to nearly zero.
  • But the same model that is best on the training data can fail miserably on test data.
  • Imagine building multiple models with small changes in the training data. The resulting set of models will have huge variance in their parameter estimates.
  • The model is made so complicated that it is very sensitive to minimal changes in the data.
  • Complicating the model inflates the variance of the parameter estimates.
  • The model tries to fit irrelevant characteristics of the data.
  • Overfitting, in summary:
  • The model is very good on training data but not so good on test data.
  • We fit the model to the noise in the data.
  • Low training error, high testing error.
  • The model is overcomplicated, with too many predictors.
  • The model needs to be simplified.
  • A model with a lot of variance.
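The pattern in the list above can be reproduced on simulated data. The following is a minimal sketch and not part of the original lab: the sine-shaped data, the noise level, the seed, and the polynomial degrees are all assumptions chosen only for illustration. Because the models are nested, the training error cannot increase as the degree grows; the test error typically gets worse once the model starts fitting noise.

```r
set.seed(42)  # assumed seed, only for reproducibility

# Simulated data (an assumption for illustration): y = sin(x) + noise
make_data <- function(n) {
  x <- runif(n, 0, 2 * pi)
  data.frame(x = x, y = sin(x) + rnorm(n, sd = 0.3))
}
train <- make_data(40)
test  <- make_data(1000)

# Mean squared error of a fitted model on a given data set
mse <- function(fit, data) mean((data$y - predict(fit, newdata = data))^2)

# Fit polynomials of increasing complexity on the same training data
degrees <- c(1, 3, 12)
results <- do.call(rbind, lapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d), data = train)
  data.frame(degree = d, train_mse = mse(fit, train), test_mse = mse(fit, test))
}))
print(results)
```

Comparing the train_mse and test_mse columns row by row shows the gap between training and test error for each level of complexity.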

LAB: Model with huge Variance

  • Data: Fiberbits/Fiberbits.csv
  • Take the initial 90% of the data and consider it as training data. Keep the final 10% of the records for validation.
  • Build the best model (about 5% error) on the training data.
  • Use the validation data to verify the error rate. Is the error rate the same on the training data and the validation data?

Solution

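# Assumes Fiberbits (100,000 rows) has already been loaded from Fiberbits/Fiberbits.csv
fiber_bits_train <- Fiberbits[1:90000, ]            # first 90% of the rows for training
fiber_bits_validation <- Fiberbits[90001:100000, ]  # last 10% of the rows for validation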

Model on training data

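library(rpart)  # recursive partitioning trees
# A tiny cp and a small minsplit let the tree grow very deep
Fiber_bits_tree3 <- rpart(active_cust ~ ., method = "class", control = rpart.control(minsplit = 5, cp = 0.000001), data = fiber_bits_train)
Fbits_pred3 <- predict(Fiber_bits_tree3, type = "class")  # predictions on the training data itself
conf_matrix3 <- table(Fbits_pred3, fiber_bits_train$active_cust)
accuracy3 <- (conf_matrix3[1,1] + conf_matrix3[2,2]) / sum(conf_matrix3)  # correct predictions / all records
accuracy3
## [1] 0.9524889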

Validation Accuracy

# Score the held-out validation records with the tree built on the training data
fiber_bits_validation$pred <- predict(Fiber_bits_tree3, fiber_bits_validation, type = "class")

conf_matrix_val <- table(fiber_bits_validation$pred, fiber_bits_validation$active_cust)
accuracy_val <- (conf_matrix_val[1,1] + conf_matrix_val[2,2]) / sum(conf_matrix_val)
accuracy_val
## [1] 0.7116

The error rate on the validation data (about 29%) is much higher than the error rate on the training data (about 5%). The model is overfitted.
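The comparison can be made explicit by converting the two reported accuracies into error rates. This is a small worked sketch using only the numbers printed in the lab output; the variable names are introduced here for illustration.

```r
# Accuracies reported in the lab output
train_accuracy      <- 0.9524889
validation_accuracy <- 0.7116

# Error rate is 1 - accuracy
train_error      <- 1 - train_accuracy
validation_error <- 1 - validation_accuracy

# Gap between validation and training error; a large gap signals overfitting
error_gap <- validation_error - train_error

round(c(train = train_error, validation = validation_error, gap = error_gap), 4)
```

The validation error is roughly 24 percentage points above the training error, which is the signature of a high-variance model.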

The Problem of Underfitting

  • Simple models are better. That is true, but is it always true? Not necessarily.
  • We might have given up too early. Did we really capture all the information?
  • Did we do enough research and feature engineering to fit the best model? Is it the best model that can be fit on this data?
  • By being overcautious about variance in the parameters, we might miss some patterns in the data.
  • The model needs to be complicated enough to capture all the information present.
  • If the training error itself is high, how can we be sure about the model's performance on unknown data?
  • Most accuracy and error statistics give us a clear idea of the training error. This is one advantage of underfitting: we can identify it confidently.
  • Underfitting, in summary:
  • A model that is too simple.
  • A model with scope for improvement.
  • A model with a lot of bias.
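Underfitting can also be seen on simulated data. The sketch below is not part of the original lab: the quadratic relationship, the noise level, and the seed are assumptions made for illustration. A straight line fitted to clearly curved data has a high error even on the data it was trained on, which is why underfitting is easy to spot from the training error alone.

```r
set.seed(7)  # assumed seed, only for reproducibility

# Simulated data (an assumption): a clearly curved relationship
x <- runif(200, -3, 3)
y <- x^2 + rnorm(200, sd = 0.5)
sim <- data.frame(x = x, y = y)

# Too-simple model: a straight line cannot follow the curve
fit_simple <- lm(y ~ x, data = sim)
# More adequate model: includes the squared term
fit_better <- lm(y ~ x + I(x^2), data = sim)

# Training error (mean squared residual) of each model
train_mse <- function(fit) mean(residuals(fit)^2)
c(simple = train_mse(fit_simple), better = train_mse(fit_better))
```

The simple model misses a pattern that is present in the data, so its training error stays far above that of the model with the squared term.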

LAB: Model with huge Bias

  • Let's simplify the model.
  • Take the high-variance model and prune it.
  • Make it as simple as possible.
  • Find the training error and the validation error.

Solution

  • Simple Model
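library(rpart.plot)  # provides prp() for plotting the tree
# A large cp stops the tree from growing, which gives a very simple model
Fiber_bits_tree4 <- rpart(active_cust ~ ., method = "class", control = rpart.control(minsplit = 30, cp = 0.25), data = fiber_bits_train)
prp(Fiber_bits_tree4)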

Fbits_pred4<-predict(Fiber_bits_tree4, type="class")
conf_matrix4<-table(Fbits_pred4,fiber_bits_train$active_cust)
conf_matrix4
##            
## Fbits_pred4     0     1
##           0 11209   921
##           1 25004 52866
accuracy4<-(conf_matrix4[1,1]+conf_matrix4[2,2])/(sum(conf_matrix4))
accuracy4
## [1] 0.7119444
  • Validation accuracy
fiber_bits_validation$pred1 <- predict(Fiber_bits_tree4, fiber_bits_validation,type="class")

conf_matrix_val1<-table(fiber_bits_validation$pred1,fiber_bits_validation$active_cust)
accuracy_val1<-(conf_matrix_val1[1,1]+conf_matrix_val1[2,2])/(sum(conf_matrix_val1))
accuracy_val1
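## [1] 0.4224

Both the training error (about 29%) and the validation error (about 58%) are high. The model is underfitted.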
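The lab statement asks to prune the high-variance tree, while the solution refits a new tree with a large cp. As a hedged alternative, rpart's prune() can cut back a tree that has already been fitted. The sketch below uses the kyphosis data that ships with rpart as a stand-in for Fiberbits, and the cp threshold of 0.1 and the helper count_leaves are assumptions for illustration only.

```r
library(rpart)

# Built-in kyphosis data (ships with rpart), used here only as a stand-in
deep_tree <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                   method = "class",
                   control = rpart.control(minsplit = 5, cp = 0.000001))

# Cut back every split whose complexity parameter is below the threshold
pruned_tree <- prune(deep_tree, cp = 0.1)

# Hypothetical helper: number of terminal nodes in a fitted rpart tree
count_leaves <- function(tree) sum(tree$frame$var == "<leaf>")
c(deep = count_leaves(deep_tree), pruned = count_leaves(pruned_tree))
```

On the lab data the analogous call would be prune(Fiber_bits_tree3, cp = ...), with the threshold chosen after inspecting printcp(Fiber_bits_tree3); the right value is not given in the source.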

The next post is about Model Bias Variance Tradeoff.
