
204.4.12 Bootstrap Cross Validation

Understanding and implementing Bootstrap Cross Validation

Link to the previous post: https://statinfer.com/204-4-11-k-fold-cross-validation/

 

This will be the last post in our Model Selection and Cross Validation series.

Bootstrap Methods

  • Bootstrapping is a powerful tool for estimating the accuracy of a model and its test error.
  • It can estimate the likely future performance of a given modeling procedure on new data not yet realized.
  • The Algorithm (see the code sketch after this list):
    • We have training data of size N.
    • Draw a random sample of size N with replacement. This gives a new dataset; it may contain repeated observations, and some observations may not appear even once.
    • Create B such new datasets. These are called bootstrap datasets.
    • Build the model on each of these B datasets; we can test the models on the original training dataset.
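
The steps above translate almost directly into code. Below is a minimal sketch (not part of the original lab) that draws B bootstrap datasets with numpy, fits a scikit-learn-style model on each, and scores every fit on the original training data. Here model, X, and y are placeholders for your estimator and training arrays, assumed to be numpy arrays.

import numpy as np

def bootstrap_scores(model, X, y, B=200, random_state=0):
    # Fit `model` on B bootstrap resamples of (X, y) and
    # score each fit on the original training data.
    rng = np.random.RandomState(random_state)
    n = len(X)
    scores = []
    for _ in range(B):
        # Draw n row indices with replacement: some rows repeat,
        # some never appear in this bootstrap dataset.
        idx = rng.randint(0, n, size=n)
        model.fit(X[idx], y[idx])
        # Test on the original training data.
        scores.append(model.score(X, y))
    return np.array(scores)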

Bootstrap Example

 

  • We have training data of size 500.
  • Bootstrap Data-1:
    • Create a dataset of size 500. To create this dataset, draw a random point, note it down, then put it back. Draw another sample point and repeat this process 500 times. This makes a dataset of size 500; call it Bootstrap Data-1.
  • Multiple bootstrap datasets:
    • Repeat the procedure in step 2 multiple times, say 200 times. We then have 200 bootstrap datasets.
  • We can build models on these 200 bootstrap datasets, and the average error gives a good idea of the overall error. We can even use the original training data as the test data for each of the models, as shown in the snippet below.
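
To make the 500-observation example concrete, the hypothetical snippet below reuses the bootstrap_scores helper sketched earlier on simulated data (make_classification is a stand-in for a real dataset, not the data used in the lab) with 200 bootstrap datasets:

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data of size 500.
X, y = make_classification(n_samples=500, random_state=0)

# 200 bootstrap datasets, each of size 500 drawn with replacement;
# every model is tested on the original 500 training rows.
scores = bootstrap_scores(DecisionTreeClassifier(max_depth=5, random_state=0),
                          X, y, B=200)
print("Average bootstrap accuracy:", scores.mean())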

LAB: Bootstrap Cross Validation

  • Draw a bootstrap sample with a sufficient sample size
  • Build a tree model and get an estimate of the true accuracy of the model

Solution

In [42]:
# Defining the tree parameters
from sklearn import tree

tree_BS = tree.DecisionTreeClassifier(criterion='gini',
                                      splitter='best',
                                      max_depth=30,
                                      min_samples_split=30,
                                      min_samples_leaf=30,
                                      max_leaf_nodes=60)
In [43]:
# Defining the bootstrap variable for 10 random samples
# (cross_validation is the pre-0.18 scikit-learn module)
from sklearn import cross_validation

bootstrap = cross_validation.ShuffleSplit(n=len(Fiber_df),
                                          n_iter=10,
                                          random_state=0)
In [44]:
# Checking the accuracy of the bootstrap models
BS_score = cross_validation.cross_val_score(tree_BS, X, y, cv=bootstrap)
BS_score
Out[44]:
array([ 0.8658,  0.8699,  0.8658,  0.8655,  0.8694,  0.8741,  0.8689,
        0.8689,  0.8639,  0.8672])
In [45]:
# Expected accuracy according to bootstrap validation
BS_score.mean()
Out[45]:
0.86793999999999993

With 10 bootstrap samples we can expect an accuracy of about 86.79%.
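
A note for readers on newer scikit-learn versions: the cross_validation module used in this lab was removed in scikit-learn 0.20. A minimal equivalent with the current model_selection API (same tree_BS, X, and y as above) might look like the sketch below. In both versions, ShuffleSplit samples without replacement, so this is really repeated random train/test splitting rather than a true with-replacement bootstrap.

from sklearn.model_selection import ShuffleSplit, cross_val_score

# 10 random train/test splits (the modern counterpart of n_iter=10)
bootstrap = ShuffleSplit(n_splits=10, random_state=0)

BS_score = cross_val_score(tree_BS, X, y, cv=bootstrap)
print(BS_score.mean())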

Conclusion

  • We studied:
    • Validating a model, types of data, and types of errors
    • The problem of overfitting and the problem of underfitting
    • The Bias-Variance Tradeoff
    • Cross validation and bootstrapping
  • Training error is what we see, and it is not the true performance metric.
  • Test error plays a vital role in model selection.
  • R-square, Adj-R-square, Accuracy, ROC, AUC, AIC, and BIC can be used to get an idea of the training error.
  • Cross validation and bootstrapping techniques give us an idea of the test error.
  • Choose the model based on a combination of AIC, cross validation, and bootstrapping results.
  • Bootstrapping is widely used in ensemble models and random forests.
