In the previous section, we studied the types of ensemble models.
Let's move on to the first ensemble methodology, the bagging algorithm. Bagging (bootstrap aggregating) fits the same base learner to many bootstrap resamples of the training data and averages their predictions, which reduces the variance of the final model.
We will cover the concept behind bagging and implement it in R.
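To make the idea concrete before relying on a package, the core bagging loop can be sketched by hand. This is a minimal illustration, assuming regression trees from the rpart package as base learners; the ipred package used later in this section automates exactly this procedure.

```r
# Hand-rolled bagging sketch (illustrative only; ipred::bagging does this for us).
library(MASS)    # Boston data
library(rpart)   # base regression trees (an assumed choice of base learner)
data(Boston)

set.seed(1)
n_trees <- 25
models <- vector("list", n_trees)
for (i in seq_len(n_trees)) {
  # Draw a bootstrap sample: same size as the data, sampled with replacement
  boot_idx <- sample(nrow(Boston), replace = TRUE)
  models[[i]] <- rpart(medv ~ ., data = Boston[boot_idx, ])
}

# The ensemble prediction is the average of the individual trees' predictions
preds <- sapply(models, predict, newdata = Boston)
bagged_pred <- rowMeans(preds)
head(bagged_pred)
```

Each tree sees a slightly different resample of the data, so their individual errors partially cancel when averaged.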
# Importing the Boston house price data
library(MASS)
data(Boston)
head(Boston)
## crim zn indus chas nox rm age dis rad tax ptratio black
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12
## lstat medv
## 1 4.98 24.0
## 2 9.14 21.6
## 3 4.03 34.7
## 4 2.94 33.4
## 5 5.33 36.2
## 6 5.21 28.7
dim(Boston)
## [1] 506 14
# Training and holdout sample
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(500)
sampleseed <- createDataPartition(Boston$medv, p = 0.9, list = FALSE)
train_boston <- Boston[sampleseed, ]
test_boston <- Boston[-sampleseed, ]
### Regression model
reg_model <- lm(medv ~ ., data = train_boston)
summary(reg_model)
##
## Call:
## lm(formula = medv ~ ., data = train_boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.4763 -2.7684 -0.4912 1.9030 26.4569
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.637e+01 5.534e+00 6.572 1.40e-10 ***
## crim -1.042e-01 3.513e-02 -2.965 0.003195 **
## zn 4.482e-02 1.459e-02 3.073 0.002248 **
## indus 1.986e-02 6.566e-02 0.302 0.762462
## chas 2.733e+00 8.765e-01 3.118 0.001939 **
## nox -1.844e+01 4.018e+00 -4.590 5.79e-06 ***
## rm 3.845e+00 4.670e-01 8.234 2.04e-15 ***
## age 8.782e-04 1.434e-02 0.061 0.951211
## dis -1.488e+00 2.096e-01 -7.101 4.94e-12 ***
## rad 2.770e-01 6.993e-02 3.960 8.71e-05 ***
## tax -1.062e-02 3.944e-03 -2.693 0.007348 **
## ptratio -9.799e-01 1.385e-01 -7.073 5.92e-12 ***
## black 9.620e-03 2.827e-03 3.403 0.000726 ***
## lstat -5.051e-01 5.706e-02 -8.852 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.787 on 444 degrees of freedom
## Multiple R-squared: 0.7309, Adjusted R-squared: 0.723
## F-statistic: 92.75 on 13 and 444 DF, p-value: < 2.2e-16
### Accuracy testing on holdout data
pred_reg <- predict(reg_model, newdata = test_boston[, -14])
reg_err <- sum((test_boston$medv - pred_reg)^2)
reg_err
## [1] 918.5927
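The raw sum of squared errors grows with the size of the holdout set, so the root mean squared error is often easier to interpret: it is on the same scale as medv (median home value, in thousands of dollars). A small helper, which could then be applied as `rmse(test_boston$medv, pred_reg)`:

```r
# RMSE helper: average error on the same scale as the response
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Tiny worked example on three hypothetical medv values
rmse(c(24.0, 21.6, 34.7), c(25.0, 20.6, 34.7))  # sqrt(2/3), about 0.816
```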
### Bagging ensemble model
library(ipred)
bagg_model <- bagging(medv ~ ., data = train_boston, nbagg = 30)
### Accuracy testing on holdout data
pred_bagg <- predict(bagg_model, newdata = test_boston[, -14])
bgg_err <- sum((test_boston$medv - pred_bagg)^2)
bgg_err
## [1] 390.9028
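The setting nbagg = 30 is one reasonable choice, not an optimum. A quick sweep over several ensemble sizes shows how the holdout error behaves as trees are added; the exact values depend on the random seed, so treat this as a sketch rather than a benchmark.

```r
# Holdout SSE as a function of ensemble size (values vary with the seed)
library(MASS)
library(caret)
library(ipred)
data(Boston)

set.seed(500)
idx <- createDataPartition(Boston$medv, p = 0.9, list = FALSE)
tr <- Boston[idx, ]
te <- Boston[-idx, ]

sizes <- c(5, 10, 30, 50)
sse <- sapply(sizes, function(b) {
  m <- bagging(medv ~ ., data = tr, nbagg = b)
  p <- predict(m, newdata = te[, -14])
  sum((te$medv - p)^2)
})
data.frame(nbagg = sizes, holdout_sse = sse)
```

Error typically drops quickly at small ensemble sizes and then flattens out, which is why modest values such as 25 to 50 bags are common defaults.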
### Overall improvement
reg_err
## [1] 918.5927
bgg_err
## [1] 390.9028
(reg_err-bgg_err)/reg_err
## [1] 0.5744547
The bagged ensemble cuts the holdout sum of squared errors from 918.6 to 390.9, a reduction of roughly 57% relative to the single linear regression model.