In the previous section, we studied the types of ensemble models.
Let’s move on to the first type of ensemble methodology, the Bagging Algorithm.
We will cover the concept behind Bagging and implement it using R.
The Bagging Algorithm
- Start with the training dataset D
- Draw k bootstrap samples from dataset D
- For each bootstrap sample i
- Build a classifier model \(M_i\)
- We will have a total of k classifiers \(M_1, M_2, \dots, M_k\)
- Take a majority vote across the k classifiers for the final classification output, or average their predictions for the regression output (a small illustrative sketch follows this list)
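As a reference, here is a minimal sketch of these steps in R for a regression problem. The dataset (mtcars), target (mpg), seed, and the value of k are placeholders chosen only for illustration, not part of the lab that follows:

# Illustrative sketch of the bagging steps (placeholder data and names)
set.seed(100)
D <- mtcars                       # any training dataset; target here is mpg
k <- 25                           # number of bootstrap samples / models
models <- vector("list", k)
for (i in 1:k) {
  # draw a bootstrap sample (with replacement) and build model M_i on it
  boot_idx <- sample(1:nrow(D), size = nrow(D), replace = TRUE)
  models[[i]] <- lm(mpg ~ ., data = D[boot_idx, ])
}
# for regression, the final output is the average of the k predictions
preds <- sapply(models, predict, newdata = D)
final_pred <- rowMeans(preds)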
Why Bagging Works
- In each bootstrap sample, records are selected one at a time with replacement: each selected record is returned to the population, so it has a chance to be selected again.
- If the samples are (approximately) independent, the variance of the consolidated prediction is reduced. This helps offset the errors that any single model would make on its own.
- In a given bootstrap sample, some observations may be selected multiple times and some observations may not be selected at all.
- It can be shown that a bootstrap sample contains, on average, only about 63% of the unique observations in the original data; the remaining ~37% are left out (a quick simulation follows this list).
- So the data used to build each model is not exactly the same; this makes the learning models more nearly independent and their errors less correlated.
- When the predictions are combined, the errors of the individual models tend to cancel out, giving a better ensemble model with higher accuracy.
- Bagging is especially useful when there is a lot of variance in our data, i.e., when a single model's predictions change substantially from one training sample to another.
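A quick simulation (with an arbitrary population size and seed, chosen only for illustration) shows the ~63% figure mentioned above:

# fraction of unique observations appearing in one bootstrap sample
set.seed(200)
n <- 100000
boot_idx <- sample(1:n, size = n, replace = TRUE)
length(unique(boot_idx)) / n   # approximately 0.632, i.e. 1 - 1/e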
LAB: Bagging Models
- Import the Boston house price data. It is part of the MASS package.
- Get some basic meta details of the data.
- Use 90% of the data for training and keep the remaining 10% as holdout data.
- Build a single linear regression model on the training data.
- On the holdout data, calculate the error (sum of squared deviations) for the regression model.
- Build the regression model using the bagging technique. Build at least 25 models.
- On the holdout data, calculate the error (sum of squared deviations) for the consolidated bagged regression model.
- What is the improvement of the bagged model when compared with the single model?
Solution
#Importing Boston house pricing data.
library(MASS)
data(Boston)
head(Boston)
## crim zn indus chas nox rm age dis rad tax ptratio black
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12
## lstat medv
## 1 4.98 24.0
## 2 9.14 21.6
## 3 4.03 34.7
## 4 2.94 33.4
## 5 5.33 36.2
## 6 5.21 28.7
dim(Boston)
## [1] 506 14
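For additional meta details of the data, base R functions such as str() and summary() can also be used (output omitted here):

str(Boston)      # variable types and a preview of each column
summary(Boston)  # basic summary statistics per variable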
##Training and holdout sample
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(500)
sampleseed <- createDataPartition(Boston$medv, p=0.9, list=FALSE)
train_boston<-Boston[sampleseed,]
test_boston<-Boston[-sampleseed,]
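As an optional sanity check on the 90/10 split (not part of the original output), the sizes of the two sets can be verified:

nrow(train_boston)  # roughly 90% of the 506 rows
nrow(test_boston)   # the remaining ~10% holdout rows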
###Regression Model
reg_model<- lm(medv ~ ., data=train_boston)
summary(reg_model)
##
## Call:
## lm(formula = medv ~ ., data = train_boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.4763 -2.7684 -0.4912 1.9030 26.4569
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.637e+01 5.534e+00 6.572 1.40e-10 ***
## crim -1.042e-01 3.513e-02 -2.965 0.003195 **
## zn 4.482e-02 1.459e-02 3.073 0.002248 **
## indus 1.986e-02 6.566e-02 0.302 0.762462
## chas 2.733e+00 8.765e-01 3.118 0.001939 **
## nox -1.844e+01 4.018e+00 -4.590 5.79e-06 ***
## rm 3.845e+00 4.670e-01 8.234 2.04e-15 ***
## age 8.782e-04 1.434e-02 0.061 0.951211
## dis -1.488e+00 2.096e-01 -7.101 4.94e-12 ***
## rad 2.770e-01 6.993e-02 3.960 8.71e-05 ***
## tax -1.062e-02 3.944e-03 -2.693 0.007348 **
## ptratio -9.799e-01 1.385e-01 -7.073 5.92e-12 ***
## black 9.620e-03 2.827e-03 3.403 0.000726 ***
## lstat -5.051e-01 5.706e-02 -8.852 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.787 on 444 degrees of freedom
## Multiple R-squared: 0.7309, Adjusted R-squared: 0.723
## F-statistic: 92.75 on 13 and 444 DF, p-value: < 2.2e-16
###Accuracy testing on holdout data
pred_reg<-predict(reg_model, newdata=test_boston[,-14])
reg_err<-sum((test_boston$medv-pred_reg)^2)
reg_err
## [1] 918.5927
###Bagging Ensemble Model
library(ipred)
bagg_model<- bagging(medv ~ ., data=train_boston , nbagg=30)
###Accuracy testing on holdout data
pred_bagg<-predict(bagg_model, newdata=test_boston[,-14])
bgg_err<-sum((test_boston$medv-pred_bagg)^2)
bgg_err
## [1] 390.9028
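As a side note, ipred's bagging() can also report an out-of-bag error estimate directly via the coob argument; as far as I recall, the fitted object then stores this estimate (the out-of-bag RMSE for regression) in its err component. This is an optional extra, not required by the lab:

# out-of-bag error estimate, computed from the observations left out of each bootstrap sample
bagg_model_oob <- bagging(medv ~ ., data=train_boston, nbagg=30, coob=TRUE)
bagg_model_oob$err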
###Overall Improvement
reg_err
## [1] 918.5927
bgg_err
## [1] 390.9028
(reg_err-bgg_err)/reg_err
## [1] 0.5744547
We can see that bagging reduced the holdout error by about 57% compared with the single regression model.
- The next post is about the Random Forest.