When does an ensemble not work?
In the previous section, we studied boosting in practice. Ensembles, however, do not always help:
- The models have to be independent; we cannot build the same model multiple times and expect the error to reduce.
- We may have to bring in this independence by choosing subsets of the data, or subsets of the features, while building the individual models.
- An ensemble may backfire if we use dependent models that are individually less accurate; the final ensemble might turn out to be an even worse model.
- Yes, there is a small disclaimer in the “Wisdom of the Crowd” theory: we need good, independent individuals. If we collate dependent individuals with poor knowledge, we might end up with an even worse ensemble.
- For example, suppose we build three models: model-1 and model-2 are bad, and model-3 is good. Most of the time the voting ensemble will return the combined output of model-1 and model-2, since they outvote model-3 (see the sketch below).
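Below is a minimal sketch of this failure mode. The data is synthetic and the specific models are illustrative assumptions (two nearly identical weak decision stumps playing the role of model-1 and model-2, and a logistic regression as model-3): because the two dependent weak models always agree, the hard-voting ensemble effectively inherits their predictions and typically scores below model-3 alone.

```python
# Minimal sketch: two dependent weak models can outvote one good model.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=2000, n_features=20, n_informative=15,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# model-1 and model-2: deliberately weak, nearly identical stumps (dependent, low accuracy)
model_1 = DecisionTreeClassifier(max_depth=1, random_state=1)
model_2 = DecisionTreeClassifier(max_depth=1, random_state=2)
# model-3: a reasonably good model
model_3 = LogisticRegression(max_iter=1000)

ensemble = VotingClassifier(
    estimators=[("m1", model_1), ("m2", model_2), ("m3", model_3)],
    voting="hard")

for name, model in [("model-1", model_1), ("model-2", model_2),
                    ("model-3", model_3), ("ensemble", ensemble)]:
    model.fit(X_train, y_train)
    print(name, round(accuracy_score(y_test, model.predict(X_test)), 3))

# Expected pattern: model-3 alone beats the ensemble, because whenever the two
# dependent stumps agree on a wrong answer, they outvote the good model.
```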
Conclusion
- Ensemble methods are among the most widely used methods these days. With modern computing power, it is not really a huge task to build multiple models.
- Both bagging and boosting do a good job of reducing error: bagging primarily reduces variance, while boosting primarily reduces bias.
- Random forests are relatively fast; since we are building many small trees, they do not put a lot of pressure on the computing machine.
- A random forest can also give us variable importance. We need to be careful with categorical features, though: random forests tend to give higher importance to variables with a higher number of levels (see the first sketch after this list).
- In boosted algorithms we may have to restrict the number of iterations to avoid overfitting (see the second sketch after this list).
- Ensemble models are often the final effort of a data scientist when building the most suitable predictive model for the data.
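The first sketch below illustrates the variable-importance caveat. The data and the two noise columns are assumptions made purely for illustration: both carry no signal and differ only in the number of levels, yet the high-cardinality column typically receives a noticeably higher importance score.

```python
# Minimal sketch: random forest importance is biased toward many-level variables.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=2000, n_features=4, n_informative=3,
                           n_redundant=0, random_state=0)

# Two pure-noise, integer-encoded categorical columns: 2 levels vs 100 levels.
noise_2_levels = rng.integers(0, 2, size=(X.shape[0], 1))
noise_100_levels = rng.integers(0, 100, size=(X.shape[0], 1))
X_aug = np.hstack([X, noise_2_levels, noise_100_levels])

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_aug, y)
names = [f"x{i}" for i in range(4)] + ["noise (2 levels)", "noise (100 levels)"]
for name, imp in zip(names, rf.feature_importances_):
    print(f"{name:>18}: {imp:.3f}")

# Expected pattern: the 100-level noise column scores well above the 2-level one,
# even though neither carries any signal -- more levels mean more split points.
```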
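The second sketch shows one common way to restrict boosting iterations. The data and parameter values are assumptions for illustration; the idea is to request many rounds but stop early once an internal validation split stops improving, rather than running every requested iteration.

```python
# Minimal sketch: cap boosting iterations via early stopping to avoid overfitting.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=3000, n_features=20, n_informative=10,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Ask for up to 1000 boosting rounds, but stop when 10 consecutive rounds bring
# no improvement on a 20% internal validation split.
gbm = GradientBoostingClassifier(n_estimators=1000, learning_rate=0.1,
                                 validation_fraction=0.2, n_iter_no_change=10,
                                 random_state=0)
gbm.fit(X_train, y_train)

print("boosting rounds actually used:", gbm.n_estimators_)
print("test accuracy:", round(accuracy_score(y_test, gbm.predict(X_test)), 3))
```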