Statinfer

203.7.4 The Bagging Algorithm

In the previous section, we studied the types of ensemble models.

Let’s move on to the first type of ensemble method, the bagging algorithm.

We will cover the concept behind Bagging and implement it using R.

The Bagging Algorithm

  • Start with the training dataset D.
  • Draw k bootstrap samples from D, each drawn with replacement and typically of the same size as D.
  • For each bootstrap sample i:
    • Build a model \(M_i\).
  • This gives a total of k models \(M_1, M_2, \ldots, M_k\).
  • For classification, take a majority vote over the outputs of the k models; for regression, take their average.
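
The steps above can be sketched in base R. This is only a minimal illustration, not the method used later in the lab: the dataset (mtcars), the base learner (lm), the formula and k = 25 are all assumptions chosen to keep the example self-contained.

```r
# Hand-rolled bagging for a regression target (illustrative sketch only)
set.seed(1)
D <- mtcars   # stand-in training dataset D (assumption)
k <- 25       # number of bootstrap samples (assumption)

# Fit one model per bootstrap sample
models <- lapply(seq_len(k), function(i) {
  idx <- sample(nrow(D), replace = TRUE)   # draw n row indices with replacement
  lm(mpg ~ wt + hp, data = D[idx, ])       # model M_i built on bootstrap sample i
})

# Predict with every model: one row per observation, one column per model
preds <- sapply(models, predict, newdata = D)

# Regression output: average the k predictions for each observation
bagged_pred <- rowMeans(preds)
head(bagged_pred)
```

For a classification problem, the last step would be a majority vote over the k predicted classes instead of an average.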

Why Bagging Works

  • Bootstrap sampling selects records one at a time and returns each selected record to the population, giving it a chance to be selected again.
  • The variance of the consolidated prediction is reduced when the samples are independent. In this way the ensemble reduces the variance-driven errors that a single model makes.
  • In a given bootstrap sample, some observations may be selected multiple times and some may not be selected at all.
  • For a bootstrap sample of the same size n as the dataset, only about 63% of the distinct observations appear in it on average, and the remaining 37% or so are left out. The expected fraction is \(1 - (1 - 1/n)^n\), which approaches \(1 - 1/e \approx 0.632\) as n grows.
  • So the data used in each model is not exactly the same. This makes the models less dependent on one another and helps the predictors have less correlated errors. Because the bootstrap samples overlap, the models are not fully independent.
  • Finally, the errors of the individual models partly cancel out, which usually gives an ensemble model with higher accuracy than a single model.
  • Bagging is most useful when the base model has high variance, that is, when small changes in the training data change its predictions a lot.
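
The 63% figure can be checked with a short simulation. The sketch below is an illustration under assumed values: the population size n = 1000 and the 500 repetitions are arbitrary choices, not part of the lab.

```r
# Fraction of distinct observations that appear in a bootstrap sample
set.seed(42)
n    <- 1000   # hypothetical dataset size
reps <- 500    # number of bootstrap samples to simulate

frac_unique <- replicate(reps, {
  idx <- sample(n, size = n, replace = TRUE)   # one bootstrap sample of row indices
  length(unique(idx)) / n                      # share of distinct rows it contains
})

mean(frac_unique)      # simulated average, close to 0.632
1 - (1 - 1 / n)^n      # expected fraction for this n
1 - exp(-1)            # limit as n grows, about 0.632
```

The rows left out of a given bootstrap sample are the roughly 37% that the corresponding model never sees during training.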

LAB: Bagging Models

  • Import the Boston house price data. It is part of the MASS package.
  • Get some basic meta details of the data.
  • Use 90% of the data for training and keep the remaining 10% as holdout data.
  • Build a single linear regression model on the training data.
  • On the holdout data, calculate the error (sum of squared deviations) for the regression model.
  • Build a regression model using the bagging technique. Build at least 25 models.
  • On the holdout data, calculate the error (sum of squared deviations) for the consolidated bagged regression model.
  • What is the improvement of the bagged model compared with the single model?

Solution

# Import the Boston house price data (shipped with the MASS package)
library(MASS)
data(Boston)
head(Boston)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12
##   lstat medv
## 1  4.98 24.0
## 2  9.14 21.6
## 3  4.03 34.7
## 4  2.94 33.4
## 5  5.33 36.2
## 6  5.21 28.7
dim(Boston)
## [1] 506  14
##Training and holdout sample
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(500)
sampleseed <- createDataPartition(Boston$medv, p=0.9, list=FALSE)

train_boston<-Boston[sampleseed,]
test_boston<-Boston[-sampleseed,]

###Regression Model
reg_model<- lm(medv ~ ., data=train_boston)
summary(reg_model)
## 
## Call:
## lm(formula = medv ~ ., data = train_boston)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.4763  -2.7684  -0.4912   1.9030  26.4569 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.637e+01  5.534e+00   6.572 1.40e-10 ***
## crim        -1.042e-01  3.513e-02  -2.965 0.003195 ** 
## zn           4.482e-02  1.459e-02   3.073 0.002248 ** 
## indus        1.986e-02  6.566e-02   0.302 0.762462    
## chas         2.733e+00  8.765e-01   3.118 0.001939 ** 
## nox         -1.844e+01  4.018e+00  -4.590 5.79e-06 ***
## rm           3.845e+00  4.670e-01   8.234 2.04e-15 ***
## age          8.782e-04  1.434e-02   0.061 0.951211    
## dis         -1.488e+00  2.096e-01  -7.101 4.94e-12 ***
## rad          2.770e-01  6.993e-02   3.960 8.71e-05 ***
## tax         -1.062e-02  3.944e-03  -2.693 0.007348 ** 
## ptratio     -9.799e-01  1.385e-01  -7.073 5.92e-12 ***
## black        9.620e-03  2.827e-03   3.403 0.000726 ***
## lstat       -5.051e-01  5.706e-02  -8.852  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.787 on 444 degrees of freedom
## Multiple R-squared:  0.7309, Adjusted R-squared:  0.723 
## F-statistic: 92.75 on 13 and 444 DF,  p-value: < 2.2e-16
###Accuracy testing on holdout data
pred_reg<-predict(reg_model, newdata=test_boston[,-14])
reg_err<-sum((test_boston$medv-pred_reg)^2)
reg_err
## [1] 918.5927
###Bagging Ensemble Model
library(ipred)
bagg_model<- bagging(medv ~ ., data=train_boston , nbagg=30)

###Accuracy testing on holdout data
pred_bagg<-predict(bagg_model, newdata=test_boston[,-14])
bgg_err<-sum((test_boston$medv-pred_bagg)^2)
bgg_err
## [1] 390.9028
###Overall Improvement
reg_err
## [1] 918.5927
bgg_err
## [1] 390.9028
(reg_err-bgg_err)/reg_err
## [1] 0.5744547
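The sum of squared errors on the holdout data drops from about 918.6 for the single linear regression to about 390.9 for the bagged model, a reduction of roughly 57%. Note that bagging() from ipred bags regression trees, so part of this gap comes from the change of base learner and not from bagging alone.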
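
Both error figures are sums of squared deviations, so their size depends on how many rows were held out. As a rough sketch, they can be rescaled to a root mean squared error (RMSE), which is in the same units as medv. The holdout size used here is inferred from the lm summary (444 residual degrees of freedom plus 14 estimated coefficients gives 458 training rows out of 506); treat it as an assumption and use nrow(test_boston) in a live session.

```r
# Values copied from the lab output
reg_err <- 918.5927   # sum of squared errors, single linear regression
bgg_err <- 390.9028   # sum of squared errors, bagged model

# Assumed holdout size: 506 total rows minus 458 training rows
n_holdout <- 506 - (444 + 14)

# Root mean squared error per holdout observation
rmse_reg  <- sqrt(reg_err / n_holdout)
rmse_bagg <- sqrt(bgg_err / n_holdout)
round(c(regression = rmse_reg, bagging = rmse_bagg), 2)
```

Under that assumption the two RMSE values come out at roughly 4.37 and 2.85. With a holdout of this size, the comparison is sensitive to the particular split, so a different seed may give a noticeably different gap.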
