In the previous section, we studied the types of ensemble models.
Let’s move on to the first type of ensemble methodology, the Bagging Algorithm.
We will cover the concept behind Bagging and implement it using R.
The Bagging Algorithm
- Start with the training dataset D
- Draw k bootstrap samples from dataset D
- For each bootstrap sample i
- Build a classifier model \(M_i\)
- We will have a total of k classifiers \(M_1, M_2, \dots, M_k\)
- Take a majority vote across the k classifiers for the final classification output, or average their predictions for the regression output (a small illustrative sketch follows this list)
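As a reference, here is a minimal sketch of these steps in R for a regression problem. The dataset (mtcars), target (mpg), seed, and the value of k are placeholders chosen only for illustration, not part of the lab that follows:

# Illustrative sketch of the bagging steps (placeholder data and names)
set.seed(100)
D <- mtcars                       # any training dataset; target here is mpg
k <- 25                           # number of bootstrap samples / models
models <- vector("list", k)
for (i in 1:k) {
  # draw a bootstrap sample (with replacement) and build model M_i on it
  boot_idx <- sample(1:nrow(D), size = nrow(D), replace = TRUE)
  models[[i]] <- lm(mpg ~ ., data = D[boot_idx, ])
}
# for regression, the final output is the average of the k predictions
preds <- sapply(models, predict, newdata = D)
final_pred <- rowMeans(preds)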
Why Bagging Works
- In each bootstrap sample, records are selected one at a time with replacement: each selected record is returned to the population, so it has a chance to be selected again.
- If the samples are (approximately) independent, the variance of the consolidated prediction is reduced. This helps offset the errors that any single model would make on its own.
- In a given bootstrap sample, some observations may be selected multiple times and some observations may not be selected at all.
- It can be shown that a bootstrap sample contains, on average, only about 63% of the unique observations in the original data; the remaining ~37% are left out (a quick simulation follows this list).
- So the data used to build each model is not exactly the same; this makes the learning models more nearly independent and their errors less correlated.
- When the predictions are combined, the errors of the individual models tend to cancel out, giving a better ensemble model with higher accuracy.
- Bagging is especially useful when there is a lot of variance in our data, i.e., when a single model's predictions change substantially from one training sample to another.
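A quick simulation (with an arbitrary population size and seed, chosen only for illustration) shows the ~63% figure mentioned above:

# fraction of unique observations appearing in one bootstrap sample
set.seed(200)
n <- 100000
boot_idx <- sample(1:n, size = n, replace = TRUE)
length(unique(boot_idx)) / n   # approximately 0.632, i.e. 1 - 1/e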
LAB: Bagging Models
- Import the Boston house price data. It is part of the MASS package.
- Get some basic meta details of the data.
- Use 90% of the data for training and keep the remaining 10% as holdout data.
- Build a single linear regression model on the training data.
- On the holdout data, calculate the error (sum of squared deviations) for the regression model.
- Build the regression model using the bagging technique. Build at least 25 models.
- On the holdout data, calculate the error (sum of squared deviations) for the consolidated bagged regression model.
- What is the improvement of the bagged model when compared with the single model?
Solution
#Importing Boston house pricing data.
library(MASS)
data(Boston)
head(Boston)
## crim zn indus chas nox rm age dis rad tax ptratio black
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12
## lstat medv
## 1 4.98 24.0
## 2 9.14 21.6
## 3 4.03 34.7
## 4 2.94 33.4
## 5 5.33 36.2
## 6 5.21 28.7
dim(Boston)
## [1] 506 14
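For additional meta details of the data, base R functions such as str() and summary() can also be used (output omitted here):

str(Boston)      # variable types and a preview of each column
summary(Boston)  # basic summary statistics per variable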
##Training and holdout sample
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(500)
sampleseed <- createDataPartition(Boston$medv, p=0.9, list=FALSE)
train_boston<-Boston[sampleseed,]
test_boston<-Boston[-sampleseed,]
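As an optional sanity check on the 90/10 split (not part of the original output), the sizes of the two sets can be verified:

nrow(train_boston)  # roughly 90% of the 506 rows
nrow(test_boston)   # the remaining ~10% holdout rows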
###Regression Model
reg_model<- lm(medv ~ ., data=train_boston)
summary(reg_model)
##
## Call:
## lm(formula = medv ~ ., data = train_boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.4763 -2.7684 -0.4912 1.9030 26.4569
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.637e+01 5.534e+00 6.572 1.40e-10 ***
## crim -1.042e-01 3.513e-02 -2.965 0.003195 **
## zn 4.482e-02 1.459e-02 3.073 0.002248 **
## indus 1.986e-02 6.566e-02 0.302 0.762462
## chas 2.733e+00 8.765e-01 3.118 0.001939 **
## nox -1.844e+01 4.018e+00 -4.590 5.79e-06 ***
## rm 3.845e+00 4.670e-01 8.234 2.04e-15 ***
## age 8.782e-04 1.434e-02 0.061 0.951211
## dis -1.488e+00 2.096e-01 -7.101 4.94e-12 ***
## rad 2.770e-01 6.993e-02 3.960 8.71e-05 ***
## tax -1.062e-02 3.944e-03 -2.693 0.007348 **
## ptratio -9.799e-01 1.385e-01 -7.073 5.92e-12 ***
## black 9.620e-03 2.827e-03 3.403 0.000726 ***
## lstat -5.051e-01 5.706e-02 -8.852 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.787 on 444 degrees of freedom
## Multiple R-squared: 0.7309, Adjusted R-squared: 0.723
## F-statistic: 92.75 on 13 and 444 DF, p-value: < 2.2e-16
###Accuracy testing on holdout data
pred_reg<-predict(reg_model, newdata=test_boston[,-14])
reg_err<-sum((test_boston$medv-pred_reg)^2)
reg_err
## [1] 918.5927
###Bagging Ensemble Model
library(ipred)
bagg_model<- bagging(medv ~ ., data=train_boston , nbagg=30)
###Accuracy testing on holdout data
pred_bagg<-predict(bagg_model, newdata=test_boston[,-14])
bgg_err<-sum((test_boston$medv-pred_bagg)^2)
bgg_err
## [1] 390.9028
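As a side note, ipred's bagging() can also report an out-of-bag error estimate directly via the coob argument; as far as I recall, the fitted object then stores this estimate (the out-of-bag RMSE for regression) in its err component. This is an optional extra, not required by the lab:

# out-of-bag error estimate, computed from the observations left out of each bootstrap sample
bagg_model_oob <- bagging(medv ~ ., data=train_boston, nbagg=30, coob=TRUE)
bagg_model_oob$err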
###Overall Improvement
reg_err
## [1] 918.5927
bgg_err
## [1] 390.9028
(reg_err-bgg_err)/reg_err
## [1] 0.5744547
We can see that bagging reduced the holdout error by about 57% compared with the single regression model.
- The next post is about the Random Forest.