Link to the previous post: https://statinfer.com/204-4-11-k-fold-cross-validation/
This is the last post in our Model Selection and Cross Validation series.
- Bootstrapping is a powerful tool for estimating the accuracy of a model and its test error.
- It can estimate the likely future performance of a given modeling procedure on new data that has not yet been observed.
- The Algorithm
- We have training data of size N.
- Draw a random sample of size N with replacement. This gives a new dataset in which some observations may be repeated and others may not appear even once.
- Create B such new datasets. These are called bootstrap datasets.
- Build the model on each of these B datasets. Each model can then be tested on the original training dataset.
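The steps above can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions: the training data is made up, and `fit_threshold` is a hypothetical toy model (a single cut-off on one variable) standing in for whatever model you actually want to validate.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) observations from data with replacement."""
    return rng.choices(data, k=len(data))

def fit_threshold(rows):
    """Toy model (hypothetical): cut-off halfway between the two class means."""
    xs0 = [x for x, label in rows if label == 0]
    xs1 = [x for x, label in rows if label == 1]
    return (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2

def accuracy(threshold, rows):
    """Share of rows where 'x > threshold' agrees with the label."""
    hits = sum((x > threshold) == bool(label) for x, label in rows)
    return hits / len(rows)

rng = random.Random(0)

# Made-up training data: N = 200 points from two overlapping classes.
train = [(rng.gauss(0.0, 1.0), 0) for _ in range(100)]
train += [(rng.gauss(2.0, 1.0), 1) for _ in range(100)]

B = 50  # number of bootstrap datasets
scores = []
for _ in range(B):
    boot = bootstrap_sample(train, rng)    # new dataset of size N, with repeats
    model = fit_threshold(boot)            # build the model on the bootstrap dataset
    scores.append(accuracy(model, train))  # test it on the original training data

mean_accuracy = sum(scores) / B
print(round(mean_accuracy, 3))
```

Because every bootstrap dataset is drawn from the same training data it is later tested on, the averaged score tends to be somewhat optimistic. Treat it as a rough guide, not as an exact estimate of the test error.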
- We have training data of size 500.
- Bootstrap Data-1:
- Create a dataset of size 500. To create it, draw a random point, note it down, and put it back. Then draw another point. Repeat this process 500 times. This gives a dataset of size 500. Call it Bootstrap Data-1.
- Multiple bootstrap datasets
- Repeat the procedure in step 2 multiple times, say 200 times. Then we have 200 bootstrap datasets.
- We can build a model on each of these 200 bootstrap datasets, and the average error gives a good idea of the overall error. We can even use the original training data as the test data for each of the models.
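To get a feel for what these 200 datasets look like, the short simulation below repeats the draw-and-replace procedure with N = 500 and B = 200 and measures how many distinct original observations end up in each bootstrap dataset. The row indices are stand-ins for real observations, so this is an illustration of the sampling step only. In theory the share of distinct observations is 1 - (1 - 1/N)^N, which is roughly 63% for N = 500.

```python
import random

N = 500   # size of the training data and of each bootstrap dataset
B = 200   # number of bootstrap datasets

rng = random.Random(42)
original = list(range(N))  # stand-in row indices for the training data

unique_shares = []
for _ in range(B):
    # Draw one point, note it down, put it back; repeat N times.
    boot = [rng.choice(original) for _ in range(N)]
    unique_shares.append(len(set(boot)) / N)

avg_unique_share = sum(unique_shares) / B
expected_share = 1 - (1 - 1 / N) ** N   # theoretical share of distinct rows
print(round(avg_unique_share, 3), round(expected_share, 3))
```

The remaining third or so of the observations never appear in a given bootstrap dataset, which is why they are sometimes held back as a test set for the model built on that dataset.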
LAB: Bootstrap Cross Validation
- Draw a bootstrap sample with a sufficient sample size
- Build a tree model and get an estimate of the true accuracy of the model
# Imports (sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it)
from sklearn import tree
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Defining the tree parameters
tree_BS = tree.DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=30, min_samples_split=30, min_samples_leaf=30, max_leaf_nodes=60)

# Defining the bootstrap variable for 10 random samples
# (note: ShuffleSplit draws its random train/test splits without replacement)
bootstrap = ShuffleSplit(n_splits=10, random_state=0)

# Checking the accuracy of the 10 models
BS_score = cross_val_score(tree_BS, X, y, cv=bootstrap)
BS_score
array([ 0.8658, 0.8699, 0.8658, 0.8655, 0.8694, 0.8741, 0.8689, 0.8689, 0.8639, 0.8672])
# Expected accuracy according to bootstrap validation
BS_score.mean()
Averaging the 10 scores above, we can expect an accuracy of about 86.8%.
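One caveat about the lab: `ShuffleSplit` makes random train/test splits without replacement, so it is not a bootstrap in the strict sense described earlier. The sketch below shows one way a with-replacement version might look, scoring each tree on the rows that were never drawn (the "out-of-bag" rows). It is an illustration under assumptions: the data comes from `make_classification` as a synthetic stand-in for the lab's `X` and `y`, and the names `X_demo`, `y_demo` and `oob_scores` are made up for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the lab's X and y (an assumption, not the lab data).
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=0)

rng = np.random.default_rng(0)
n = len(X_demo)
oob_scores = []
for _ in range(10):
    boot_idx = rng.integers(0, n, size=n)            # draw n rows with replacement
    oob_idx = np.setdiff1d(np.arange(n), boot_idx)   # rows that were never drawn
    model = DecisionTreeClassifier(criterion='gini', max_depth=30,
                                   min_samples_split=30, min_samples_leaf=30,
                                   max_leaf_nodes=60, random_state=0)
    model.fit(X_demo[boot_idx], y_demo[boot_idx])    # train on the bootstrap dataset
    oob_scores.append(model.score(X_demo[oob_idx], y_demo[oob_idx]))  # test on unseen rows

oob_scores = np.array(oob_scores)
print(oob_scores.mean())
```

Scoring on the out-of-bag rows keeps the test rows separate from the rows each tree was trained on. The numbers it prints apply to the synthetic data only and say nothing about the lab dataset.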
- We studied
- Validating a model, Types of data & Types of errors
- The problem of overfitting & the problem of underfitting
- Bias-Variance Tradeoff
- Cross validation & bootstrapping
- Training error is what we see, but it is not the true performance metric
- Test error plays a vital role in model selection
- R-square, Adj-R-square, Accuracy, ROC, AUC, AIC and BIC can be used to get an idea of the training error
- Cross validation and bootstrapping techniques give us an idea of the test error.
- Choose the model based on a combination of AIC, cross validation and bootstrapping results.
- Bootstrap is widely used in ensemble models & random forests.