Link to the previous post : https://statinfer.com/204-3-5-information-gain-in-decision-tree-split/
In this post we will walk through the decision tree algorithm step by step and see how the split criterion and the stopping criterion are decided.
The Decision tree Algorithm
- The major step is to identify the best split variable and the best split criterion.
- Once we have the split, we go to the segment level and drill down further.
Until stopped:
- Select a leaf node
- Find the best splitting attribute
- Split the node using the attribute
- Go to each child node and repeat steps 2 & 3
Stopping criteria:
- Each leaf-node contains examples of one type
- Algorithm ran out of attributes
- No further significant information gain
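The loop and stopping criteria above can be sketched in Python. This is a minimal illustration, not the exact algorithm from the post: the function names (`entropy`, `information_gain`, `grow_tree`) and the `min_gain` threshold are my own choices.

```python
import math
from collections import Counter

def entropy(labels):
    """Impurity of a set of class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Reduction in entropy from splitting the rows on one attribute."""
    parent = entropy(labels)
    children = {}
    for row, label in zip(rows, labels):
        children.setdefault(row[attribute], []).append(label)
    weighted = sum(len(child) / len(labels) * entropy(child)
                   for child in children.values())
    return parent - weighted

def grow_tree(rows, labels, attributes, min_gain=1e-6):
    """Recursively split; stop on a pure node, no remaining
    attributes, or no significant information gain."""
    if len(set(labels)) == 1 or not attributes:        # stopping criteria 1 & 2
        return Counter(labels).most_common(1)[0][0]    # leaf: majority class
    gains = {a: information_gain(rows, labels, a) for a in attributes}
    best = max(gains, key=gains.get)                   # best splitting attribute
    if gains[best] < min_gain:                         # stopping criterion 3
        return Counter(labels).most_common(1)[0][0]
    tree = {best: {}}
    remaining = [a for a in attributes if a != best]
    for value in set(row[best] for row in rows):       # one child per value
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows, sub_labels = zip(*subset)
        tree[best][value] = grow_tree(list(sub_rows), list(sub_labels),
                                      remaining, min_gain)
    return tree
```

Each recursive call repeats the "find best attribute, split, descend" steps until one of the three stopping criteria fires.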
The Decision tree Algorithm – Demo
Entropy([4+,10-]) Overall = 86.3% (impurity)
- Entropy([7+,1-]) Male = 54.3%
- Entropy([3+,3-]) Female = 100%
- Information Gain for Gender = 86.3 - ((8/14)*54.3 + (6/14)*100) = 12.4
Entropy([4+,10-]) Overall = 86.3% (impurity)
- Entropy([0+,9-]) Married = 0%
- Entropy([4+,1-]) Unmarried = 72.1%
- Information Gain for Marital Status = 86.3 - ((9/14)*0 + (5/14)*72.1) = 60.5
- The information gain for Marital Status is higher, so it is chosen as the first variable for segmentation
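The demo numbers above can be reproduced directly from the class counts. The snippet below reports entropy as a percentage, matching the post's convention; the `entropy(pos, neg)` helper is my own naming.

```python
import math

def entropy(pos, neg):
    """Entropy (%) of a node with the given positive/negative class counts."""
    total = pos + neg
    e = 0.0
    for c in (pos, neg):
        if c:
            p = c / total
            e -= p * math.log2(p)
    return 100 * e

overall = entropy(4, 10)                                     # ≈ 86.3%
# Gender split: 8 males, 6 females
gain_gender = overall - (8/14) * entropy(7, 1) - (6/14) * entropy(3, 3)
# Marital status split: 9 married, 5 unmarried
gain_marital = overall - (9/14) * entropy(0, 9) - (5/14) * entropy(4, 1)
print(round(gain_gender, 1))     # 12.4
print(round(gain_marital, 1))    # 60.5
```

Marital Status wins by a wide margin (60.5 vs 12.4), which is why it is the first split.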
- The "Married" segment is pure (entropy 0%), so it becomes a leaf node. We now take the "Unmarried" segment and repeat the same process of looking for the best splitting variable within this sub-segment.
Many Splits for a Single Variable
- Sometimes a variable takes many distinct values, which leads to multiple split options for that single variable.
- Each split option gives a different information gain value for the same variable.
What is the information gain for income?
- There are multiple options for calculating information gain.
- For income, we consider all possible split scenarios and calculate the information gain for each.
- The best split is the one with the highest information gain: within income, out of all the options, the split with the best information gain is chosen.
- So, node partitioning for multi-valued attributes needs to be built into the decision tree algorithm.
- We need to find the best splitting attribute along with the best split rule.
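For a numeric variable like income, a common way to enumerate the split options is to try a threshold midway between each pair of adjacent distinct values and keep the one with the highest information gain. The sketch below uses hypothetical income values, not the post's dataset.

```python
import math

def entropy(labels):
    """Entropy (bits) of a list of class labels."""
    total = len(labels)
    e = 0.0
    for v in set(labels):
        p = labels.count(v) / total
        e -= p * math.log2(p)
    return e

def best_numeric_split(values, labels):
    """Try every threshold midway between adjacent distinct values and
    return (threshold, information_gain) of the best binary split."""
    parent = entropy(labels)
    pairs = sorted(zip(values, labels))
    distinct = sorted(set(values))
    best = (None, 0.0)
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2                     # candidate threshold
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        gain = parent - (len(left) / len(labels)) * entropy(left) \
                      - (len(right) / len(labels)) * entropy(right)
        if gain > best[1]:
            best = (t, gain)
    return best

# Hypothetical income values (thousands) and a binary class label
income = [20, 25, 30, 45, 50, 60, 80, 90]
bought = [0, 0, 0, 1, 1, 1, 1, 1]
threshold, gain = best_numeric_split(income, bought)
print(threshold)    # 37.5
```

So the split rule ("income <= 37.5") is found together with the splitting attribute, exactly as the bullet above requires.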
The next post is about building a decision tree in Python.
Link to the next post : https://statinfer.com/204-3-7-building-a-decision-tree-in-python/