Link to the previous post : https://statinfer.com/204-3-5-information-gain-in-decision-tree-split/
In this post we will walk through the decision tree algorithm step by step and see how the split criterion and the stopping criterion are decided.
The Decision tree Algorithm
- The major step is to identify the best split variable and the best split criterion.
- Once we have the split, we go to the segment level and drill down further.
Until stopped:
- Select a leaf node
- Find the best splitting attribute
- Split the node using the attribute
- Go to each child node and repeat steps 2 & 3
Stopping criteria:
- Each leaf-node contains examples of one type
- Algorithm ran out of attributes
- No further significant information gain
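The loop and stopping criteria above can be sketched in Python. This is a minimal illustration, not the exact algorithm from the post: the function names (`entropy`, `information_gain`, `grow_tree`) and the `min_gain` threshold are my own choices.

```python
import math
from collections import Counter

def entropy(labels):
    """Impurity of a set of class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Reduction in entropy from splitting the rows on one attribute."""
    parent = entropy(labels)
    children = {}
    for row, label in zip(rows, labels):
        children.setdefault(row[attribute], []).append(label)
    weighted = sum(len(child) / len(labels) * entropy(child)
                   for child in children.values())
    return parent - weighted

def grow_tree(rows, labels, attributes, min_gain=1e-6):
    """Recursively split; stop on a pure node, no remaining
    attributes, or no significant information gain."""
    if len(set(labels)) == 1 or not attributes:        # stopping criteria 1 & 2
        return Counter(labels).most_common(1)[0][0]    # leaf: majority class
    gains = {a: information_gain(rows, labels, a) for a in attributes}
    best = max(gains, key=gains.get)                   # best splitting attribute
    if gains[best] < min_gain:                         # stopping criterion 3
        return Counter(labels).most_common(1)[0][0]
    tree = {best: {}}
    remaining = [a for a in attributes if a != best]
    for value in set(row[best] for row in rows):       # one child per value
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows, sub_labels = zip(*subset)
        tree[best][value] = grow_tree(list(sub_rows), list(sub_labels),
                                      remaining, min_gain)
    return tree
```

Each recursive call repeats the "find best attribute, split, descend" steps until one of the three stopping criteria fires.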
The Decision tree Algorithm – Demo
Entropy([4+,10-]) Overall = 86.3% (impurity)
- Entropy([7+,1-]) Male = 54.3%
- Entropy([3+,3-]) Female = 100%
- Information Gain for Gender = 86.3 - ((8/14)*54.3 + (6/14)*100) = 12.4
Entropy([4+,10-]) Overall = 86.3% (impurity)
- Entropy([0+,9-]) Married = 0%
- Entropy([4+,1-]) Unmarried = 72.1%
- Information Gain for Marital Status = 86.3 - ((9/14)*0 + (5/14)*72.1) = 60.5
- The information gain for Marital Status is higher, so it is chosen as the first variable for segmentation
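The demo numbers above can be reproduced directly from the class counts. The snippet below reports entropy as a percentage, matching the post's convention; the `entropy(pos, neg)` helper is my own naming.

```python
import math

def entropy(pos, neg):
    """Entropy (%) of a node with the given positive/negative class counts."""
    total = pos + neg
    e = 0.0
    for c in (pos, neg):
        if c:
            p = c / total
            e -= p * math.log2(p)
    return 100 * e

overall = entropy(4, 10)                                     # ≈ 86.3%
# Gender split: 8 males, 6 females
gain_gender = overall - (8/14) * entropy(7, 1) - (6/14) * entropy(3, 3)
# Marital status split: 9 married, 5 unmarried
gain_marital = overall - (9/14) * entropy(0, 9) - (5/14) * entropy(4, 1)
print(round(gain_gender, 1))     # 12.4
print(round(gain_marital, 1))    # 60.5
```

Marital Status wins by a wide margin (60.5 vs 12.4), which is why it is the first split.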
- The "Married" segment is pure (entropy 0%), so it becomes a leaf node. We now take the "Unmarried" segment and repeat the same process of looking for the best splitting variable within this sub-segment.
Many Splits for a Single Variable
- Sometimes a variable takes many distinct values, which leads to multiple split options for that single variable.
- Each split option gives a different information gain value for the same variable.
What is the information gain for income?
- There are multiple options for calculating information gain.
- For income, we consider all possible split scenarios and calculate the information gain for each.
- The best split is the one with the highest information gain: within income, out of all the options, the split with the best information gain is chosen.
- So, node partitioning for multi-valued attributes needs to be built into the decision tree algorithm.
- We need to find the best splitting attribute along with the best split rule.
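For a numeric variable like income, a common way to enumerate the split options is to try a threshold midway between each pair of adjacent distinct values and keep the one with the highest information gain. The sketch below uses hypothetical income values, not the post's dataset.

```python
import math

def entropy(labels):
    """Entropy (bits) of a list of class labels."""
    total = len(labels)
    e = 0.0
    for v in set(labels):
        p = labels.count(v) / total
        e -= p * math.log2(p)
    return e

def best_numeric_split(values, labels):
    """Try every threshold midway between adjacent distinct values and
    return (threshold, information_gain) of the best binary split."""
    parent = entropy(labels)
    pairs = sorted(zip(values, labels))
    distinct = sorted(set(values))
    best = (None, 0.0)
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2                     # candidate threshold
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        gain = parent - (len(left) / len(labels)) * entropy(left) \
                      - (len(right) / len(labels)) * entropy(right)
        if gain > best[1]:
            best = (t, gain)
    return best

# Hypothetical income values (thousands) and a binary class label
income = [20, 25, 30, 45, 50, 60, 80, 90]
bought = [0, 0, 0, 1, 1, 1, 1, 1]
threshold, gain = best_numeric_split(income, bought)
print(threshold)    # 37.5
```

So the split rule ("income <= 37.5") is found together with the splitting attribute, exactly as the bullet above requires.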
The next post is about building a decision tree in Python.
Link to the next post : https://statinfer.com/204-3-7-building-a-decision-tree-in-python/