Link to the previous post : https://statinfer.com/204-7-3-types-of-ensemble-models/
Let's move forward to the first type of ensemble method: the Bagging algorithm.
We will cover the concept behind bagging and implement it in Python.
The Bagging Algorithm
- Start with the training dataset D.
- Draw k bootstrap sample sets from dataset D.
- For each bootstrap sample i, build a classifier model Mi.
- We will have a total of k classifiers: M1, M2, ..., Mk.
- Take a majority vote over the k classifiers for the final classification output; take the average of their predictions for a regression output (see the sketch below).
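To make the procedure concrete, here is a minimal from-scratch sketch of bagging for regression (an illustration of the idea, not the library implementation; it assumes numpy arrays and any scikit-learn style estimator with fit/predict):

import numpy as np
from sklearn.base import clone

def bag_fit_predict(base_model, X_train, y_train, X_test, k=25, seed=42):
    """Fit k models on bootstrap samples and average their predictions."""
    rng = np.random.RandomState(seed)
    n = len(X_train)
    all_predictions = []
    for i in range(k):
        idx = rng.randint(0, n, size=n)    # bootstrap sample: n draws with replacement
        model_i = clone(base_model)        # fresh, unfitted copy of the base learner
        model_i.fit(X_train[idx], y_train[idx])
        all_predictions.append(model_i.predict(X_test))
    # Average the k predictions; for classification we would take a majority vote instead
    return np.mean(all_predictions, axis=0)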
Why Bagging Works
- We select records one at a time, returning each selected record to the population so that it has a chance to be selected again.
- If the individual samples are independent, the variance of the consolidated prediction is reduced; averaging k independent predictors divides the variance by a factor of k. That way we can reduce the unavoidable errors made by a single model.
- In a given bootstrap sample, some observations may be selected multiple times and some observations might not be selected at all.
- There is a proven result that a bootstrap sample contains, on average, only about 63% of the distinct records in the overall population; the remaining 37% are left out (the quick simulation after this list verifies it).
- So the data used in each of these models is not exactly the same. This makes our learning models close to independent, which helps our predictors have uncorrelated errors.
- Finally, the errors from the individual models cancel out and give us a better ensemble model with higher accuracy.
- Bagging is really useful when there is a lot of variance in our data.
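The 63% figure comes from the fact that the chance of any single record appearing in a bootstrap sample of size n is 1 - (1 - 1/n)^n, which approaches 1 - 1/e ≈ 0.632 for large n. A quick simulation (a minimal sketch using numpy, with an arbitrary population size) confirms it:

import numpy as np

rng = np.random.RandomState(0)
n = 10000                                  # population size
fractions = []
for _ in range(100):                       # draw 100 bootstrap samples
    sample = rng.randint(0, n, size=n)     # n draws with replacement
    fractions.append(len(np.unique(sample)) / float(n))
print(np.mean(fractions))                  # roughly 0.632, i.e. ~63% distinct records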
And now, let's put everything into practice.
Practice : Bagging Models
- Import the Boston house price data.
- Get some basic meta details of the data.
- Use 90% of the data for training and keep the remaining 10% as holdout data.
- Build a single linear regression model on the training data.
- On the holdout data, calculate the error (mean squared error) for the regression model.
- Build the regression model using the bagging technique. Build at least 25 models.
- On the holdout data, calculate the error (mean squared error) for the consolidated bagged regression model.
- What is the improvement of the bagged model when compared with the single model?
In [1]:
#Importing Boston house price data
import pandas as pd
import sklearn as sk
import numpy as np
import scipy as sp
house=pd.read_csv("datasets/Housing/Boston.csv")
In [2]:
house.head(5)
Out[2]:
In [3]:
###Columns of the dataset###
house.info()
In [4]:
###Splitting the dataset into training and testing datasets
from sklearn.model_selection import train_test_split
house_train,house_test=train_test_split(house,train_size=0.9)
In [5]:
###Building a linear regression model with medv as the target variable on the training dataset###
from sklearn.linear_model import LinearRegression
features = ['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis', 'rad', 'tax', 'ptratio', 'black', 'lstat']
lr = LinearRegression()
lr.fit(house_train[features], house_train['medv'])
Out[5]:
In [6]:
###Predicting on the test dataset
predict_test = lr.predict(house_test[features])
In [7]:
from sklearn.metrics import mean_squared_error
###error in linear regression model ###
mean_squared_error(house_test['medv'],predict_test, sample_weight=None, multioutput='uniform_average')
Out[7]:
In [8]:
#Build the regression model using the bagging technique. Build at least 25 models.
from sklearn.ensemble import BaggingRegressor
Bag = BaggingRegressor(base_estimator=LinearRegression(), n_estimators=25, max_samples=1.0, max_features=1.0, bootstrap=True, bootstrap_features=False, oob_score=False, warm_start=False, n_jobs=1, random_state=None, verbose=0)
X = house_train[features]
y = house_train['medv']
Bag.fit(X, y)
bagpredict_test = Bag.predict(house_test[features])
z = house_test['medv']
In [9]:
###Error (MSE) of the bagging model###
mean_squared_error(z, bagpredict_test)
Out[9]:
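To answer the last practice question, we can compute the relative improvement of the bagged model over the single model directly, reusing the predictions from the cells above (a small addition to the original notebook):
In [10]:
###Improvement of the bagged model over the single model###
lr_mse = mean_squared_error(house_test['medv'], predict_test)
bag_mse = mean_squared_error(z, bagpredict_test)
print((lr_mse - bag_mse) / lr_mse)   # fraction by which bagging reduced the MSE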
We can see that the error of the bagged model is lower than that of the single linear regression model.
The next post is about the Random Forest.
Link to the next post : https://statinfer.com/204-7-5-the-random-forest/