# 204.4.12 Bootstrap Cross Validation

##### Understanding and implementing Bootstrap Cross Validation

Link to the previous post : https://statinfer.com/204-4-11-k-fold-cross-validation/

This is the last post in our Model Selection and Cross Validation series.

### Bootstrap Methods

• Bootstrapping is a powerful tool for estimating a model's accuracy and its test error.
• It can estimate the likely future performance of a given modeling procedure on new data that has not yet been observed.
• The algorithm:
• We have training data of size N.
• Draw a random sample of size N with replacement. This gives a new dataset in which some observations may be repeated and others may not appear at all.
• Create B such datasets. These are called bootstrap datasets.
• Build a model on each of the B datasets. Each model can then be tested on the original training dataset.
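The sampling step of the algorithm can be sketched in a few lines of plain Python. This is a minimal illustration, not the lab's code: the helper name `bootstrap_sample`, the tiny dataset of 10 row ids, and B = 5 are all assumptions chosen so the output is easy to read.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement (hypothetical helper)."""
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(n)]

rng = random.Random(0)          # fixed seed so the run is repeatable
training = list(range(10))      # stand-in for N = 10 training rows
B = 5                           # number of bootstrap datasets (illustrative)

boot_sets = [bootstrap_sample(training, rng) for _ in range(B)]

for i, sample in enumerate(boot_sets, start=1):
    # Rows that were never drawn into this bootstrap dataset
    missing = sorted(set(training) - set(sample))
    print(f"bootstrap {i}: {sample}  never drawn: {missing}")
```

Each printed dataset has the same size as the original, typically with some row ids repeated and some absent, which is exactly the behaviour described in the algorithm.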

### Bootstrap Example

• We have training data of size 500.
• Bootstrap Data-1:
• Create a dataset of size 500. To create it, draw a random point, note it down, then put it back. Draw another point the same way, and repeat the process 500 times. This gives a dataset of size 500; call it Bootstrap Data-1.
• Multiple bootstrap datasets:
• Repeat the procedure above multiple times, say 200 times. We then have 200 bootstrap datasets.
• We can build a model on each of these 200 bootstrap datasets; the average error across them gives a good idea of the overall error. We can even use the original training data as the test data for each model.
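The example can be traced end to end with a toy model. The sketch below assumes a synthetic one-feature dataset of 500 rows and a very simple threshold classifier (a decision stump searched over a coarse grid); the data, the `fit_stump` and `error_rate` helpers, and the 10% label noise are all made up for illustration and are not part of the original lab.

```python
import random

rng = random.Random(42)

# Synthetic stand-in for a training set of size 500: one feature x in [0, 1],
# label 1 when x > 0.5, with roughly 10% of labels flipped as noise.
N = 500
data = []
for _ in range(N):
    x = rng.random()
    y = int(x > 0.5)
    if rng.random() < 0.10:
        y = 1 - y
    data.append((x, y))

def error_rate(threshold, rows):
    """Share of rows misclassified by the rule 'predict 1 if x > threshold'."""
    wrong = sum(1 for x, y in rows if int(x > threshold) != y)
    return wrong / len(rows)

def fit_stump(rows):
    """Pick the threshold from a coarse grid that minimises error on rows."""
    grid = [i / 20 for i in range(21)]
    return min(grid, key=lambda t: error_rate(t, rows))

B = 200
errors = []
for _ in range(B):
    # One bootstrap dataset: 500 draws with replacement from the original data.
    boot = [data[rng.randrange(N)] for _ in range(N)]
    threshold = fit_stump(boot)
    # Score the model on the original training data, as the text suggests.
    errors.append(error_rate(threshold, data))

mean_error = sum(errors) / B
print(f"average error over {B} bootstrap models: {mean_error:.3f}")
```

Because each bootstrap dataset shares most of its rows with the original data (on average about 63% of the distinct rows), testing on the original training set tends to be somewhat optimistic. Scoring only on the rows that were left out of each bootstrap dataset is a common alternative.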

## LAB: Bootstrap Cross Validation

• Draw a bootstrap sample with a sufficient sample size
• Build a tree model and get an estimate of the true accuracy of the model

Solution

In [42]:
# Defining the tree parameters
tree_BS = tree.DecisionTreeClassifier(criterion='gini',
                                      splitter='best',
                                      max_depth=30,
                                      min_samples_split=30,
                                      min_samples_leaf=30,
                                      max_leaf_nodes=60)

In [43]:
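# Defining the bootstrap variable for 10 random samples.
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection
# replaces it, n_iter became n_splits, and n is now inferred from the data.
# ShuffleSplit draws random train/test splits without replacement, so it is
# a random-split stand-in rather than a true bootstrap.
from sklearn import model_selection
bootstrap = model_selection.ShuffleSplit(n_splits=10,
                                         random_state=0)

In [44]:
# Checking the accuracy of the models on the 10 random splits
BS_score = model_selection.cross_val_score(tree_BS, X, y, cv=bootstrap)
BS_score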

Out[44]:
array([ 0.8658,  0.8699,  0.8658,  0.8655,  0.8694,  0.8741,  0.8689,
0.8689,  0.8639,  0.8672])
In [45]:
# Expected accuracy according to bootstrap validation
BS_score.mean()

Out[45]:
0.86793999999999993

With 10 bootstrap samples, we can expect an accuracy of about 86.79%.
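For comparison, here is a rough sketch of a bootstrap that samples with replacement, as described in the algorithm section, and scores each tree on the rows left out of its sample (the out-of-bag rows). It is an assumption-laden illustration, not the lab's method: the lab's `Fiber_df` data is not available here, so `make_classification` generates synthetic data, and the names `X_demo`, `y_demo` and `oob_scores` are hypothetical. The tree parameters mirror the ones defined in the lab.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

# Synthetic stand-in data; the lab's Fiber_df is not reproduced here.
X_demo, y_demo = make_classification(n_samples=1000, n_features=8, random_state=0)
indices = np.arange(len(X_demo))

oob_scores = []
for b in range(10):
    # Bootstrap sample: row indices drawn with replacement, same size as the data.
    boot_idx = resample(indices, replace=True, n_samples=len(indices), random_state=b)
    # Rows never drawn form the out-of-bag test set for this round.
    oob_idx = np.setdiff1d(indices, boot_idx)

    model = DecisionTreeClassifier(criterion='gini', max_depth=30,
                                   min_samples_split=30, min_samples_leaf=30,
                                   max_leaf_nodes=60, random_state=0)
    model.fit(X_demo[boot_idx], y_demo[boot_idx])
    oob_scores.append(model.score(X_demo[oob_idx], y_demo[oob_idx]))

oob_scores = np.array(oob_scores)
print(oob_scores.round(4))
print("mean out-of-bag accuracy:", oob_scores.mean().round(4))
```

Testing on out-of-bag rows keeps the test rows disjoint from the rows each tree was trained on, so the averaged score is less optimistic than testing on data the model has already seen.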

## Conclusion

• We studied:
• Validating a model, types of data and types of errors
• The problems of overfitting and underfitting
• Cross validation and bootstrapping
• Training error is what we observe, and it is not the true performance metric.
• Test error plays a vital role in model selection.
• R-square, Adj-R-square, Accuracy, ROC, AUC, AIC and BIC can be used to get an idea of training error.
• Cross validation and bootstrapping techniques give us an idea of test error.
• Choose the model based on a combination of AIC, cross validation and bootstrapping results.
• Bootstrapping is widely used in ensemble models and random forests.
