Link to the previous post: https://statinfer.com/204-4-11-k-fold-cross-validation/
This is the last post in our Model Selection and Cross Validation series.
Bootstrap Methods
- Bootstrapping is a powerful tool for estimating the accuracy of a model and its test error.
- It can estimate the likely future performance of a given modeling procedure on new data that has not been observed yet.
- The Algorithm
- We have training data of size N.
- Draw a random sample of size N with replacement. This gives a new dataset, which may contain repeated observations, while some observations may not appear even once.
- Create B such new datasets. These are called bootstrap datasets.
- Build a model on each of these B datasets. We can test the models on the original training dataset.
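The resampling step can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not code from the original lab; the row ids, the seed, N = 1000 and B = 5 are all assumptions chosen for the example.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) observations from data, with replacement."""
    return rng.choices(data, k=len(data))

rng = random.Random(0)        # fixed seed, used only for reproducibility
data = list(range(1000))      # hypothetical training data: N = 1000 row ids

sample = bootstrap_sample(data, rng)
n_unique = len(set(sample))         # rows that appear at least once
n_missing = len(data) - n_unique    # rows that never appear
print(len(sample), n_unique, n_missing)

# B bootstrap datasets are just B independent draws of the same kind
B = 5
datasets = [bootstrap_sample(data, rng) for _ in range(B)]
```

On average, roughly 63% of the original rows show up in a bootstrap sample of the same size; the rest are left out and their places are taken by repeats.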
Bootstrap Example
- We have training data of size 500.
- Bootstrap Data-1:
- Create a dataset of size 500. To create it, draw a random point, note it down, then put it back. Then draw another point. Repeat this process 500 times. This gives a dataset of size 500. Call it Bootstrap Data-1.
- Multiple bootstrap datasets
- Repeat the procedure used for Bootstrap Data-1 multiple times, say 200 times. Then we have 200 bootstrap datasets.
- We can build models on these 200 bootstrap datasets, and the average error gives a good idea of the overall error. We can even use the original training data as the test data for each of the models.
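The example above can be walked through end to end with a toy model. The sketch below is only an illustration under stated assumptions: the data is synthetic, and the model is a hypothetical one-split rule ("predict 1 when x > threshold") standing in for whatever model is actually being validated.

```python
import random

rng = random.Random(42)  # fixed seed, used only for reproducibility

# Hypothetical training data: 500 rows of (x, y), where y is a noisy label
train = []
for _ in range(500):
    x = rng.random()
    y = 1 if x + rng.gauss(0, 0.15) > 0.5 else 0
    train.append((x, y))

THRESHOLDS = [i / 20 for i in range(21)]  # candidate split points 0.00 .. 1.00

def error_rate(threshold, rows):
    """Share of rows misclassified by the rule 'predict 1 when x > threshold'."""
    wrong = sum(1 for x, y in rows if (x > threshold) != (y == 1))
    return wrong / len(rows)

def fit_stump(rows):
    """Pick the candidate threshold with the lowest error on rows."""
    return min(THRESHOLDS, key=lambda t: error_rate(t, rows))

B = 200
errors = []
for _ in range(B):
    boot = rng.choices(train, k=len(train))      # 500 draws with replacement
    threshold = fit_stump(boot)                  # model built on the bootstrap data
    errors.append(error_rate(threshold, train))  # tested on the original data

mean_error = sum(errors) / B
print(f"average error over {B} bootstrap models: {mean_error:.3f}")
```

Because each bootstrap dataset shares most of its rows with the original data, testing on the original data tends to give a somewhat optimistic error estimate. Testing each model only on the rows left out of its bootstrap sample is a common alternative.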
LAB: Bootstrap Cross Validation
- Draw a bootstrap sample with sufficient sample size
- Build a tree model and get an estimate of the true accuracy of the model
Solution
In [42]:
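# Defining the tree parameters
from sklearn import tree
tree_BS = tree.DecisionTreeClassifier(criterion='gini',
                                      splitter='best',
                                      max_depth=30,
                                      min_samples_split=30,
                                      min_samples_leaf=30,
                                      max_leaf_nodes=60)
In [43]:
# Defining the bootstrap variable for 10 random samples
# The old sklearn.cross_validation module has been removed from scikit-learn;
# model_selection.ShuffleSplit takes n_splits instead of n and n_iter.
# Note: ShuffleSplit draws random train/test splits without replacement,
# so it only approximates a true bootstrap.
from sklearn.model_selection import ShuffleSplit, cross_val_score
bootstrap = ShuffleSplit(n_splits=10,
                         random_state=0)
In [44]:
# Checking the accuracy of the model on each of the 10 random splits
# (X and y are the predictors and target prepared earlier from Fiber_df)
BS_score = cross_val_score(tree_BS, X, y, cv=bootstrap)
BS_score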
Out[44]:
In [45]:
# Expected accuracy according to bootstrap validation
BS_score.mean()
Out[45]:
With 10 bootstrap samples, we can expect an accuracy of 77.98%.
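The lab relies on ShuffleSplit, which splits the data at random without replacement. For comparison, here is a minimal sketch of a with-replacement bootstrap in which each model is tested on the rows left out of its sample (the out-of-bag rows). It assumes synthetic data from make_classification as a stand-in for the Fiber data, so its accuracy is not comparable to the figure above; the choice of 10 rounds and the seeds are likewise assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

# Synthetic stand-in data (assumption): the lab's Fiber data is not reproduced here
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Same tree parameters as in the lab
model = DecisionTreeClassifier(criterion='gini', max_depth=30,
                               min_samples_split=30, min_samples_leaf=30,
                               max_leaf_nodes=60, random_state=0)

indices = np.arange(len(X))
scores = []
for b in range(10):
    # Row indices drawn with replacement: one bootstrap dataset
    boot_idx = resample(indices, replace=True,
                        n_samples=len(indices), random_state=b)
    # Rows that were never drawn (out-of-bag) serve as the test set
    oob_idx = np.setdiff1d(indices, boot_idx)
    model.fit(X[boot_idx], y[boot_idx])
    scores.append(model.score(X[oob_idx], y[oob_idx]))

scores = np.array(scores)
print(scores.mean())  # average out-of-bag accuracy over the 10 rounds
```

Testing on out-of-bag rows keeps the training and test rows of each round disjoint, at the cost of a different test set in every round.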
Conclusion
- We studied
- Validating a model, types of data and types of errors
- The problem of overfitting and the problem of underfitting
- Bias-variance tradeoff
- Cross validation and bootstrapping
- Training error is what we see, and it is not the true performance metric.
- Test error plays a vital role in model selection.
- R-square, Adj-R-square, accuracy, ROC, AUC, AIC and BIC can be used to get an idea of the training error.
- Cross validation and bootstrapping techniques give us an idea of the test error.
- Choose the model based on a combination of AIC, cross validation and bootstrapping results.
- Bootstrapping is widely used in ensemble models and random forests.