203.4.5 Type of Datasets, Type of Errors and Problem of Overfitting

Things to know before proceeding further.

The Problem of Over Fitting

In previous section, we studied about What is a Best Model?

  • In search of the best model on the given data we add many predictors, polynomial terms, Interaction terms, variable transformations, derived variables, indicator/dummy variables etc.,
  • Most of the times we succeed in reducing the error. What error is this?
  • So by complicating the model we fit the best model for the training data.
  • Sometimes the error on the training data can reduce to near zero
  • But the same best model on training data fails miserably on test data.
  • Imagine building multiple models with small changes in training data. The resultant set of models will have huge variance in their parameter estimates.
  • The model is made really complicated, that it is very sensitive to minimal changes
  • By complicating the model the variance of the parameters estimates inflates
  • Model tries to fit the irrelevant characteristics in the data
  • Over fitting
  • The model is super good on training data but not so good on test data
  • We fit the model for the noise in the data
  • Less training error, high testing error
  • The model is over complicated with too many predictors
  • Model need to be simplified
  • A model with lot of variance

LAB: Model with huge Variance

  • Data: Fiberbits/Fiberbits.csv
  • Take initial 90% of the data. Consider it as training data. Keep the final 10% of the records for validation.
  • Build the best model(5% error) model on training data.
  • Use the validation data to verify the error rate. Is the error rate on the training data and validation data same?



Model on training data

Fiber_bits_tree3<-rpart(active_cust~., method="class", control=rpart.control(minsplit=5, cp=0.000001), data=fiber_bits_train)
Fbits_pred3<-predict(Fiber_bits_tree3, type="class")
## [1] 0.9524889

Validation Accuracy

fiber_bits_validation$pred <- predict(Fiber_bits_tree3, fiber_bits_validation,type="class")

## [1] 0.7116

Error rate on validation data is more than the training data error.

The Problem of Under-fitting

  • Simple models are better. Its true but is that always true? May not be always true.
  • We might have given it up too early. Did we really capture all the information?
  • Did we do enough research and future reengineering to fit the best model? Is it the best model that can be fit on this data?
  • By being over cautious about variance in the parameters, we might miss out on some patterns in the data.
  • Model need to be complicated enough to capture all the information present.
  • If the training error itself is high, how can we be so sure about the model performance on unknown data?
  • Most of the accuracy and error measuring statistics give us a clear idea on training error, this is one advantage of under fitting, we can identify it confidently.
  • Under fitting
  • A model that is too simple
  • A mode with a scope for improvement
  • A model with lot of bias

LAB: Model with huge Bias

  • Lets simplify the model.
  • Take the high variance model and prune it.
  • Make it as simple as possible.
  • Find the training error and validation error.


  • Simple Model
Fiber_bits_tree4<-rpart(active_cust~., method="class", control=rpart.control(minsplit=30, cp=0.25), data=fiber_bits_train)

Fbits_pred4<-predict(Fiber_bits_tree4, type="class")
## Fbits_pred4     0     1
##           0 11209   921
##           1 25004 52866
## [1] 0.7119444
  • Validation accuracy
fiber_bits_validation$pred1 <- predict(Fiber_bits_tree4, fiber_bits_validation,type="class")

## [1] 0.4224

The next post is about Model Bias Variance Tradeoff.

0 responses on "203.4.5 Type of Datasets, Type of Errors and Problem of Overfitting"

Leave a Message

Blog Posts

Hurry up!!!

"use coupon code for FLAT 30% discount"  datascientistoffer        ___________________________________      Subscribe to our youtube channel. Get access to video tutorials.                

Contact Us

Statinfer Software Solutions#647 2nd floor 1st Main, Indira Nagar 1st Stage, 100 feet road,Indranagar Bangalore,Karnataka, Pin code:-560038 Landmarks: Opp. Namma Metro Pillar 48.

Connect with us

linkin fn twitter g

How to become a Data Scientist.?