Link to the previous post: https://statinfer.com/204-4-11-k-fold-cross-validation/

This is the last post in our Model Selection and Cross Validation series.

Bootstrap Methods

Bootstrapping is a powerful tool for estimating the accuracy of a model and its test error.
It can estimate the likely future performance of a given modeling procedure on new data that has not yet been seen.
The Algorithm
- We have training data of size N.
- Draw a random sample of size N with replacement. This gives a new dataset, which may contain repeated observations, while some observations may not appear at all.
- Create B such new datasets. These are called bootstrap datasets.
- Build a model on each of these B datasets. We can test the models on the original training dataset.
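
The sampling steps above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the lab's code; the function name `bootstrap_indices` and the small values N = 10 and B = 3 are assumptions chosen so the output is easy to read.

```python
import random

def bootstrap_indices(n, rng):
    """Draw n row indices uniformly at random, with replacement."""
    return [rng.randrange(n) for _ in range(n)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
N = 10                  # size of the training data (illustrative value)
B = 3                   # number of bootstrap datasets (illustrative value)

# Each bootstrap dataset is represented by the row indices it contains
bootstrap_sets = [bootstrap_indices(N, rng) for _ in range(B)]

for b, indices in enumerate(bootstrap_sets, start=1):
    never_drawn = sorted(set(range(N)) - set(indices))
    print(f"Bootstrap dataset {b}: rows {sorted(indices)}")
    print(f"  rows never drawn: {never_drawn}")
```

Because sampling is with replacement, each bootstrap dataset typically repeats some rows and omits others. For large N, roughly 63% of the original rows (1 - 1/e) appear in any one bootstrap sample on average.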

Bootstrap Example

We have training data of size 500.
Bootstrap Data-1:
- Create a dataset of size 500. To create it, draw a random point, note it down, then put it back. Draw another sample point. Repeat this process 500 times. This makes a dataset of size 500. Call it Bootstrap Data-1.
Multiple bootstrap datasets
- Repeat the procedure used for Bootstrap Data-1 multiple times, say 200 times. Then we have 200 bootstrap datasets.
We can build models on these 200 bootstrap datasets, and the average error gives a good idea of the overall error. We can even use the original training data as the test data for each of the models.
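
As a rough sketch of this example, the code below draws 200 bootstrap datasets from 500 training points, fits a model on each, and averages the error measured on the original training data. Everything here is an assumption made for illustration: the data is synthetic, and the "model" is a toy one-feature threshold rule (`fit_threshold`), not the decision tree used in the lab.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

def make_point():
    """Hypothetical data point: label is 1 when x > 0.5, with 10% label noise."""
    x = rng.random()
    label = int(x > 0.5)
    if rng.random() < 0.1:
        label = 1 - label
    return (x, label)

train = [make_point() for _ in range(500)]  # training data of size 500

def error_rate(threshold, data):
    """Fraction of points misclassified by the rule 'predict 1 if x > threshold'."""
    wrong = sum(int(x > threshold) != label for x, label in data)
    return wrong / len(data)

def fit_threshold(data):
    """Toy model: choose the candidate threshold with the lowest error on data."""
    candidates = [i / 20 for i in range(1, 20)]
    return min(candidates, key=lambda t: error_rate(t, data))

errors = []
for _ in range(200):  # 200 bootstrap datasets
    # 500 draws with replacement from the training data
    boot = [rng.choice(train) for _ in range(len(train))]
    threshold = fit_threshold(boot)
    # Test each bootstrap model on the original training data
    errors.append(error_rate(threshold, train))

avg_error = sum(errors) / len(errors)
print(f"Average error over {len(errors)} bootstrap models: {avg_error:.3f}")
```

Note that each bootstrap dataset shares many rows with the original training data, so an error measured this way tends to be somewhat optimistic compared with the error on genuinely new data.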

LAB: Bootstrap Cross Validation

Draw a bootstrap sample with a sufficient sample size.
Build a tree model and get an estimate of the true accuracy of the model.

Solution

In [42]:

# Defining the tree parameters
from sklearn import tree  # needed for DecisionTreeClassifier; not shown in the original cell

tree_BS = tree.DecisionTreeClassifier(criterion='gini',
                                      splitter='best',
                                      max_depth=30,
                                      min_samples_split=30,
                                      min_samples_leaf=30,
                                      max_leaf_nodes=60)

In [43]:

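# Defining 10 random train/test splits (ShuffleSplit samples without replacement,
# so this only approximates a true bootstrap)
# sklearn.cross_validation is the pre-0.20 scikit-learn API, since replaced by sklearn.model_selection
from sklearn import cross_validation
bootstrap = cross_validation.ShuffleSplit(n=len(Fiber_df),
                                          n_iter=10,
                                          random_state=0)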

In [44]:

# Checking the accuracy of the model on each of the 10 splits
BS_score = cross_validation.cross_val_score(tree_BS, X, y, cv=bootstrap)
BS_score

Out[44]:

array([ 0.8658,  0.8699,  0.8658,  0.8655,  0.8694,  0.8741,  0.8689,
        0.8689,  0.8639,  0.8672])

In [45]:

# Expected accuracy according to bootstrap validation (mean over the 10 splits)
BS_score.mean()

Out[45]:

0.86793999999999993

With 10 bootstrap samples we can expect an accuracy of about 86.79%.
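
The lab code relies on the `sklearn.cross_validation` module, which was removed in scikit-learn 0.20. The sketch below shows how the same idea might look with `sklearn.model_selection`, followed by a literal bootstrap that resamples rows with replacement using `sklearn.utils.resample`. It is an illustration under stated assumptions, not the lab's solution: the data comes from `make_classification` as a synthetic stand-in for `Fiber_df`, and the names `X_demo`, `y_demo`, `tree_demo` and `boot_scores` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

# Synthetic stand-in for the lab's X and y (assumption: Fiber_df is not available here)
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=0)

# Same tree parameters as in the lab
tree_demo = DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=30,
                                   min_samples_split=30, min_samples_leaf=30,
                                   max_leaf_nodes=60, random_state=0)

# Newer-API counterpart of cross_validation.ShuffleSplit(n=..., n_iter=10):
# 10 random train/test splits, each drawn without replacement
splitter = ShuffleSplit(n_splits=10, test_size=0.1, random_state=0)
shuffle_scores = cross_val_score(tree_demo, X_demo, y_demo, cv=splitter)
print("ShuffleSplit mean accuracy:", shuffle_scores.mean())

# Literal bootstrap: resample rows with replacement, fit, score on the original data
boot_scores = []
for b in range(10):
    X_boot, y_boot = resample(X_demo, y_demo, replace=True,
                              n_samples=len(X_demo), random_state=b)
    tree_demo.fit(X_boot, y_boot)
    boot_scores.append(tree_demo.score(X_demo, y_demo))

print("Bootstrap mean accuracy:", sum(boot_scores) / len(boot_scores))
```

The two numbers answer slightly different questions. The ShuffleSplit scores come from held-out rows, while the bootstrap scores are measured on data that overlaps with each model's training sample, so the bootstrap figure is likely to be the more optimistic of the two.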

Conclusion

We studied:
- Validating a model, types of data, and types of errors
- The problem of overfitting and the problem of underfitting
- Bias-variance tradeoff
- Cross validation and bootstrapping
Training error is what we see, and it is not the true performance metric.
Test error plays a vital role in model selection.
R-square, Adj-R-square, Accuracy, ROC, AUC, AIC and BIC can be used to get an idea of the training error.
Cross validation and bootstrapping techniques give us an idea of the test error.
Choose the model based on a combination of AIC, cross validation and bootstrapping results.
Bootstrap is widely used in ensemble models and random forests.

21st June 2017

204.4.12 Bootstrap Cross Validation

Understanding and implementing Bootstrap Cross Validation
