204.4.6 Type of Datasets, Type of Errors and Problem of Overfitting

Link to the previous post : https://statinfer.com/204-4-5-what-is-a-best-model/

Different Types of Datasets and Errors

The accuracy of our best model is 95%. Is a model with a 5% error rate really good?
The error on the training data is known as training error.
A low error rate on training data may not always mean the model is good.
What really matters is how the model is going to perform on unknown data or test data.
We need a way to estimate the error rate on test data.
We may have to keep aside a part of the data and use it for validation.
There are two types of datasets and, correspondingly, two types of errors: training error and test error.
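To make the difference between the two errors concrete, here is a minimal sketch in plain Python. The data, the `error_rate` helper and the `MemorizingModel` class are all made up for illustration and are not from this series. The toy model simply memorizes its training rows, so its training error is zero even though it does poorly on rows it has never seen.

```python
from collections import Counter


def error_rate(y_true, y_pred):
    """Fraction of predictions that differ from the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)


class MemorizingModel:
    """Hypothetical toy model: remembers every training row and
    guesses the most common training label for anything unseen."""

    def fit(self, X, y):
        self.lookup = dict(zip(X, y))
        self.default = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.lookup.get(x, self.default) for x in X]


# Made-up data: the true rule is "label is 1 when x >= 5".
X_train = [1, 2, 3, 4, 6, 7]
y_train = [0, 0, 0, 0, 1, 1]
X_test = [0, 5, 8, 9]
y_test = [0, 1, 1, 1]

model = MemorizingModel().fit(X_train, y_train)
training_error = error_rate(y_train, model.predict(X_train))
test_error = error_rate(y_test, model.predict(X_test))

print(f"training error: {training_error:.2f}")  # 0.00
print(f"test error:     {test_error:.2f}")      # 0.75
```

The training error alone would suggest a perfect model. Only the error on the unseen rows shows that it has not learned the underlying rule.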

There are two types of datasets.
Training set: The input data used to build the model.
Test set: The unknown dataset. This dataset gives the accuracy of the final model.
We may not have both of these datasets for every machine learning problem. In some cases, we can take 90% of the available data as training data and treat the remaining 10% as validation data.
Validation set: This dataset is kept aside for model validation and selection. It is a temporary substitute for the test dataset, not a third type of data.
We create the validation data in the hope that its error rate will give us a basic idea of the test error.
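The 90%/10% split described above could be sketched as follows. This is only one possible way to do it, using the Python standard library. The function name `train_validation_split`, its parameters and the seed value are assumptions made for this example, and the rows are placeholder integers rather than real data.

```python
import random


def train_validation_split(rows, validation_fraction=0.1, seed=0):
    """Shuffle a copy of rows and hold out a fraction of them for validation."""
    if not 0 < validation_fraction < 1:
        raise ValueError("validation_fraction must be between 0 and 1")
    shuffled = list(rows)                  # copy, so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_validation = round(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


# Placeholder data: 100 rows identified by the integers 0..99.
rows = list(range(100))
train_rows, validation_rows = train_validation_split(
    rows, validation_fraction=0.1, seed=42
)

print(len(train_rows), len(validation_rows))  # 90 10
```

The model would then be built on `train_rows` only, and its error rate on `validation_rows` used as a rough estimate of the test error. Shuffling before splitting avoids holding out a block of rows that happens to be ordered in some way.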

“A good model will have both training and test error very near to each other and close to zero”

Yes, this post is quite short, but we need to know the types of errors and the types of datasets in order to validate a model.

The next post is about the problem of overfitting.

Link to the next post : https://statinfer.com/204-4-7-problem-of-overfitting/

21st June 2017