LAB: Tree Building & Model Selection
In the previous section, we studied how to prune a decision tree in R.
- Import the Fiberbits data. This is internet service provider data. The goal is to predict customer attrition (the active_cust column) from the other variables
- Build a decision tree model for the Fiberbits data
- Prune the tree if required
- Find out the final accuracy
- Are there any 100% active or 100% inactive customer segments?
Solution
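library(rpart)  # provides rpart(), printcp(), plotcp() and prune()
Fiberbits <- read.csv("C:\\Amrita\\Datavedi\\Fiberbits\\Fiberbits.csv")
# Classification tree on all predictors; cp = 0.001 is below rpart's default of 0.01, so the tree grows large before pruning
Fiber_bits_tree <- rpart(active_cust ~ ., method = "class", control = rpart.control(minsplit = 30, cp = 0.001), data = Fiberbits)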
Fiber_bits_tree
## n= 100000
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 100000 42141 1 (0.42141000 0.57859000)
##   2) relocated>=0.5 12348 954 0 (0.92274052 0.07725948)
##     4) technical_issues_per_month>=1.5 11294 526 0 (0.95342660 0.04657340) *
##     5) technical_issues_per_month< 1.5 1054 428 0 (0.59392789 0.40607211)
##       10) number_plan_changes>=4.5 495 45 0 (0.90909091 0.09090909) *
##       11) number_plan_changes< 4.5 559 176 1 (0.31484794 0.68515206)
##         22) Speed_test_result< 79.5 45 0 0 (1.00000000 0.00000000) *
##         23) Speed_test_result>=79.5 514 131 1 (0.25486381 0.74513619) *
##   3) relocated< 0.5 87652 30747 1 (0.35078492 0.64921508)
##     6) Speed_test_result< 78.5 27517 10303 0 (0.62557692 0.37442308)
##       12) technical_issues_per_month>=3.5 22187 5791 0 (0.73899130 0.26100870)
##         24) number_plan_changes< 0.5 9735 1132 0 (0.88371854 0.11628146)
##           48) Speed_test_result< 77.5 7750 541 0 (0.93019355 0.06980645) *
##           49) Speed_test_result>=77.5 1985 591 0 (0.70226700 0.29773300)
##             98) income>=2008.5 1211 133 0 (0.89017341 0.10982659)
##               196) income< 2526 1163 85 0 (0.92691316 0.07308684) *
##               197) income>=2526 48 0 1 (0.00000000 1.00000000) *
##             99) income< 2008.5 774 316 1 (0.40826873 0.59173127)
##               198) income< 1785.5 270 97 0 (0.64074074 0.35925926) *
##               199) income>=1785.5 504 143 1 (0.28373016 0.71626984) *
##         25) number_plan_changes>=0.5 12452 4659 0 (0.62584324 0.37415676)
##           50) number_plan_changes>=1.5 7867 1358 0 (0.82738020 0.17261980) *
##           51) number_plan_changes< 1.5 4585 1284 1 (0.28004362 0.71995638) *
##       13) technical_issues_per_month< 3.5 5330 818 1 (0.15347092 0.84652908)
##         26) income>=1945.5 1849 619 1 (0.33477555 0.66522445)
##           52) monthly_bill>=148 167 29 0 (0.82634731 0.17365269) *
##           53) monthly_bill< 148 1682 481 1 (0.28596908 0.71403092)
##             106) income< 2362 1407 472 1 (0.33546553 0.66453447)
##               212) technical_issues_per_month>=1.5 176 25 0 (0.85795455 0.14204545) *
##               213) technical_issues_per_month< 1.5 1231 321 1 (0.26076361 0.73923639)
##                 426) income>=2180.5 126 21 0 (0.83333333 0.16666667) *
##                 427) income< 2180.5 1105 216 1 (0.19547511 0.80452489) *
##             107) income>=2362 275 9 1 (0.03272727 0.96727273) *
##         27) income< 1945.5 3481 199 1 (0.05716748 0.94283252) *
##     7) Speed_test_result>=78.5 60135 13533 1 (0.22504365 0.77495635)
##       14) Speed_test_result< 82.5 25734 9271 1 (0.36026269 0.63973731)
##         28) Speed_test_result>=80.5 11671 5312 0 (0.54485477 0.45514523)
##           56) income>=1722.5 6306 1299 0 (0.79400571 0.20599429)
##             112) income< 1992.5 5828 888 0 (0.84763212 0.15236788)
##               224) number_plan_changes>=1.5 3053 189 0 (0.93809368 0.06190632) *
##               225) number_plan_changes< 1.5 2775 699 0 (0.74810811 0.25189189)
##                 450) number_plan_changes< 0.5 2284 358 0 (0.84325744 0.15674256)
##                   900) technical_issues_per_month>=3.5 1511 57 0 (0.96227664 0.03772336) *
##                   901) technical_issues_per_month< 3.5 773 301 0 (0.61060802 0.38939198)
##                     1802) monthly_bill>=148 364 41 0 (0.88736264 0.11263736) *
##                     1803) monthly_bill< 148 409 149 1 (0.36430318 0.63569682) *
##                 451) number_plan_changes>=0.5 491 150 1 (0.30549898 0.69450102) *
##             113) income>=1992.5 478 67 1 (0.14016736 0.85983264) *
##           57) income< 1722.5 5365 1352 1 (0.25200373 0.74799627)
##             114) number_plan_changes>=1.5 1586 680 1 (0.42875158 0.57124842)
##               228) Speed_test_result< 81.5 894 370 0 (0.58612975 0.41387025)
##                 456) technical_issues_per_month>=3.5 301 54 0 (0.82059801 0.17940199) *
##                 457) technical_issues_per_month< 3.5 593 277 1 (0.46711636 0.53288364)
##                   914) income< 1604.5 261 92 0 (0.64750958 0.35249042) *
##                   915) income>=1604.5 332 108 1 (0.32530120 0.67469880) *
##               229) Speed_test_result>=81.5 692 156 1 (0.22543353 0.77456647) *
##             115) number_plan_changes< 1.5 3779 672 1 (0.17782482 0.82217518) *
##         29) Speed_test_result< 80.5 14063 2912 1 (0.20706819 0.79293181)
##           58) income< 1960.5 11360 2725 1 (0.23987676 0.76012324)
##             116) Num_complaints>=4.5 292 87 0 (0.70205479 0.29794521)
##               232) technical_issues_per_month>=3.5 197 0 0 (1.00000000 0.00000000) *
##               233) technical_issues_per_month< 3.5 95 8 1 (0.08421053 0.91578947) *
##             117) Num_complaints< 4.5 11068 2520 1 (0.22768341 0.77231659)
##               234) number_plan_changes>=1.5 4003 1180 1 (0.29477892 0.70522108)
##                 468) income>=1809.5 1229 582 1 (0.47355574 0.52644426)
##                   936) Speed_test_result>=79.5 477 132 0 (0.72327044 0.27672956) *
##                   937) Speed_test_result< 79.5 752 237 1 (0.31515957 0.68484043) *
##                 469) income< 1809.5 2774 598 1 (0.21557318 0.78442682) *
##               235) number_plan_changes< 1.5 7065 1340 1 (0.18966737 0.81033263) *
##           59) income>=1960.5 2703 187 1 (0.06918239 0.93081761) *
##       15) Speed_test_result>=82.5 34401 4262 1 (0.12389175 0.87610825) *
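The printout answers the last question in the task list. Three terminal nodes have a loss of zero: node 22 (45 customers) and node 232 (197 customers) contain only active_cust = 0, and node 197 (48 customers) contains only active_cust = 1. Instead of scanning the printout by eye, such nodes can be read from the frame of the fitted object. The following is a minimal sketch, not part of the original lab: it uses the built-in iris data as a stand-in because Fiberbits.csv is not bundled here, and the helper name pure_leaves is hypothetical.

```r
library(rpart)

# Hypothetical helper: terminal nodes of a classification tree that
# misclassify no training rows (loss of zero).
pure_leaves <- function(tree) {
  fr <- tree$frame
  is_leaf <- fr$var == "<leaf>"
  # For method = "class", frame$dev is the misclassified count ("loss" in the printout)
  pure <- fr[is_leaf & fr$dev == 0, c("n", "dev", "yval")]
  # Map the fitted class index back to its label
  pure$class <- attr(tree, "ylevels")[pure$yval]
  pure
}

# Stand-in data (assumption): iris ships with R
iris_tree <- rpart(Species ~ ., data = iris, method = "class")
pure_iris <- pure_leaves(iris_tree)
print(pure_iris)
```

The row names of the result are the node numbers used in the printed tree, so calling pure_leaves(Fiber_bits_tree) in the lab session should list the same zero-loss nodes noted above.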
Plotting the Tree
library(rpart.plot)  # provides prp()
prp(Fiber_bits_tree, box.col = c("Grey", "Orange")[Fiber_bits_tree$frame$yval], varlen = 0, faclen = 0, type = 1, extra = 4, under = TRUE)
Code-Choosing Cp and Cross Validation Error
printcp(Fiber_bits_tree)
##
## Classification tree:
## rpart(formula = active_cust ~ ., data = Fiberbits, method = "class",
## control = rpart.control(minsplit = 30, cp = 0.001))
##
## Variables actually used in tree construction:
## [1] income monthly_bill
## [3] Num_complaints number_plan_changes
## [5] relocated Speed_test_result
## [7] technical_issues_per_month
##
## Root node error: 42141/100000 = 0.42141
##
## n= 100000
##
##           CP nsplit rel error  xerror      xstd
## 1  0.2477397      0   1.00000 1.00000 0.0037054
## 2  0.1639971      1   0.75226 0.75226 0.0034917
## 3  0.0876581      2   0.58826 0.58826 0.0032402
## 4  0.0293301      3   0.50061 0.50061 0.0030616
## 5  0.0239316      6   0.41261 0.41295 0.0028450
## 6  0.0081631      8   0.36475 0.37498 0.0027372
## 7  0.0024560      9   0.35659 0.35811 0.0026862
## 8  0.0022662     11   0.35168 0.35362 0.0026723
## 9  0.0018272     13   0.34714 0.34520 0.0026457
## 10 0.0016848     15   0.34349 0.34228 0.0026364
## 11 0.0014001     18   0.33832 0.33825 0.0026234
## 12 0.0013763     24   0.32859 0.33495 0.0026127
## 13 0.0013170     26   0.32583 0.33115 0.0026003
## 14 0.0012933     28   0.32320 0.32859 0.0025918
## 15 0.0011390     33   0.31563 0.32465 0.0025787
## 16 0.0010678     34   0.31449 0.32088 0.0025661
## 17 0.0010000     35   0.31342 0.31926 0.0025606
Plot-Choosing Cp and Cross Validation Error
plotcp(Fiber_bits_tree)
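The pruning step that follows takes its cp value from this table. One common heuristic for reading the table, not necessarily the one used in this lab, is the one-standard-error rule: choose the smallest tree whose xerror is within one xstd of the minimum xerror. The sketch below hard-codes the printed table so that it runs without the data; in the lab session the same columns should be available as Fiber_bits_tree$cptable. The variable names are illustrative.

```r
# cp table copied from the printcp() output above
cp_table <- data.frame(
  CP     = c(0.2477397, 0.1639971, 0.0876581, 0.0293301, 0.0239316, 0.0081631,
             0.0024560, 0.0022662, 0.0018272, 0.0016848, 0.0014001, 0.0013763,
             0.0013170, 0.0012933, 0.0011390, 0.0010678, 0.0010000),
  nsplit = c(0, 1, 2, 3, 6, 8, 9, 11, 13, 15, 18, 24, 26, 28, 33, 34, 35),
  xerror = c(1.00000, 0.75226, 0.58826, 0.50061, 0.41295, 0.37498, 0.35811,
             0.35362, 0.34520, 0.34228, 0.33825, 0.33495, 0.33115, 0.32859,
             0.32465, 0.32088, 0.31926),
  xstd   = c(0.0037054, 0.0034917, 0.0032402, 0.0030616, 0.0028450, 0.0027372,
             0.0026862, 0.0026723, 0.0026457, 0.0026364, 0.0026234, 0.0026127,
             0.0026003, 0.0025918, 0.0025787, 0.0025661, 0.0025606)
)

# Row with the lowest cross-validated error
best <- which.min(cp_table$xerror)

# One-standard-error threshold around that minimum
threshold <- cp_table$xerror[best] + cp_table$xstd[best]

# Rows are ordered by tree size, so the first row under the threshold is the smallest tree
one_se <- cp_table[which(cp_table$xerror <= threshold)[1], ]
print(one_se)
```

On this table the rule selects the row with 34 splits, because xerror is still falling at the end of the table. The much smaller trees kept in the pruning steps below therefore appear to favour readability over the last few points of cross-validated error.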
Pruning
Fiber_bits_tree_1<-prune(Fiber_bits_tree, cp=0.0081631)
Fiber_bits_tree_1
## n= 100000
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 100000 42141 1 (0.42141000 0.57859000)
##   2) relocated>=0.5 12348 954 0 (0.92274052 0.07725948) *
##   3) relocated< 0.5 87652 30747 1 (0.35078492 0.64921508)
##     6) Speed_test_result< 78.5 27517 10303 0 (0.62557692 0.37442308)
##       12) technical_issues_per_month>=3.5 22187 5791 0 (0.73899130 0.26100870)
##         24) number_plan_changes< 0.5 9735 1132 0 (0.88371854 0.11628146) *
##         25) number_plan_changes>=0.5 12452 4659 0 (0.62584324 0.37415676)
##           50) number_plan_changes>=1.5 7867 1358 0 (0.82738020 0.17261980) *
##           51) number_plan_changes< 1.5 4585 1284 1 (0.28004362 0.71995638) *
##       13) technical_issues_per_month< 3.5 5330 818 1 (0.15347092 0.84652908) *
##     7) Speed_test_result>=78.5 60135 13533 1 (0.22504365 0.77495635)
##       14) Speed_test_result< 82.5 25734 9271 1 (0.36026269 0.63973731)
##         28) Speed_test_result>=80.5 11671 5312 0 (0.54485477 0.45514523)
##           56) income>=1722.5 6306 1299 0 (0.79400571 0.20599429) *
##           57) income< 1722.5 5365 1352 1 (0.25200373 0.74799627) *
##         29) Speed_test_result< 80.5 14063 2912 1 (0.20706819 0.79293181) *
##       15) Speed_test_result>=82.5 34401 4262 1 (0.12389175 0.87610825) *
Plot after Pruning
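# Colour the boxes by the pruned tree's own node classes, not the original tree's
prp(Fiber_bits_tree_1, box.col = c("Grey", "Orange")[Fiber_bits_tree_1$frame$yval], varlen = 0, faclen = 0, type = 1, extra = 4, under = TRUE)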
Choosing Cp and Cross Validation Error with New Model
printcp(Fiber_bits_tree_1)
##
## Classification tree:
## rpart(formula = active_cust ~ ., data = Fiberbits, method = "class",
## control = rpart.control(minsplit = 30, cp = 0.001))
##
## Variables actually used in tree construction:
## [1] income number_plan_changes
## [3] relocated Speed_test_result
## [5] technical_issues_per_month
##
## Root node error: 42141/100000 = 0.42141
##
## n= 100000
##
##          CP nsplit rel error  xerror      xstd
## 1 0.2477397      0   1.00000 1.00000 0.0037054
## 2 0.1639971      1   0.75226 0.75226 0.0034917
## 3 0.0876581      2   0.58826 0.58826 0.0032402
## 4 0.0293301      3   0.50061 0.50061 0.0030616
## 5 0.0239316      6   0.41261 0.41295 0.0028450
## 6 0.0081631      8   0.36475 0.37498 0.0027372
plotcp(Fiber_bits_tree_1)
Pruning further
Fiber_bits_tree_2<-prune(Fiber_bits_tree, cp=0.0239316)
Fiber_bits_tree_2
## n= 100000
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 100000 42141 1 (0.42141000 0.57859000)
##   2) relocated>=0.5 12348 954 0 (0.92274052 0.07725948) *
##   3) relocated< 0.5 87652 30747 1 (0.35078492 0.64921508)
##     6) Speed_test_result< 78.5 27517 10303 0 (0.62557692 0.37442308)
##       12) technical_issues_per_month>=3.5 22187 5791 0 (0.73899130 0.26100870) *
##       13) technical_issues_per_month< 3.5 5330 818 1 (0.15347092 0.84652908) *
##     7) Speed_test_result>=78.5 60135 13533 1 (0.22504365 0.77495635)
##       14) Speed_test_result< 82.5 25734 9271 1 (0.36026269 0.63973731)
##         28) Speed_test_result>=80.5 11671 5312 0 (0.54485477 0.45514523)
##           56) income>=1722.5 6306 1299 0 (0.79400571 0.20599429) *
##           57) income< 1722.5 5365 1352 1 (0.25200373 0.74799627) *
##         29) Speed_test_result< 80.5 14063 2912 1 (0.20706819 0.79293181) *
##       15) Speed_test_result>=82.5 34401 4262 1 (0.12389175 0.87610825) *
Tree- After Pruning further
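# Colour the boxes by the pruned tree's own node classes, not the original tree's
prp(Fiber_bits_tree_2, box.col = c("Grey", "Orange")[Fiber_bits_tree_2$frame$yval], varlen = 0, faclen = 0, type = 1, extra = 4, under = TRUE)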
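The task list asks for the final accuracy, which the printouts give only indirectly: the number of misclassified training rows is the sum of the loss column over the terminal nodes. The sketch below is a minimal calculation from the numbers printed for Fiber_bits_tree_2. The variable names are illustrative, and the result is training accuracy, not accuracy on held-out data.

```r
# Loss (misclassified rows) at each terminal node of Fiber_bits_tree_2,
# copied from the printout above: nodes 2, 12, 13, 56, 57, 29, 15
leaf_loss <- c(954, 5791, 818, 1299, 1352, 2912, 4262)
n_total   <- 100000  # rows in the data (n= 100000)
root_loss <- 42141   # loss at the root node

train_error    <- sum(leaf_loss) / n_total
train_accuracy <- 1 - train_error

# Errors relative to the root node; should agree with "rel error" in the cp table
rel_error <- sum(leaf_loss) / root_loss

cat("Misclassified rows:", sum(leaf_loss), "\n")
cat("Training accuracy:", train_accuracy, "\n")
cat("Relative error:", round(rel_error, 5), "\n")
```

The seven leaves misclassify 17388 of 100000 rows, a training accuracy of about 82.6%, and the relative error of 0.41261 matches the nsplit = 6 row of the cp table. The same sum over the nine leaves of Fiber_bits_tree_1 gives 15371 errors, or about 84.6%. In the lab session, comparing predict(Fiber_bits_tree_2, type = "class") with Fiberbits$active_cust should give the same figure.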
Conclusion
- Decision trees are powerful and very simple to represent and understand.
- One needs to be careful with the size of the tree. Decision trees are more prone to overfitting than many other algorithms.
- They can be applied to any type of data, especially data with categorical predictors.
- One can use decision trees to perform a basic customer segmentation and then build a different predictive model on each segment.
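The segmentation idea in the last point can be sketched with the where component of an rpart fit, which records the terminal node that each training row falls into. The example below is an illustration only: it uses the kyphosis data that ships with rpart as a stand-in for Fiberbits, and the per-segment summary stands in for whatever model would be built on each segment.

```r
library(rpart)

# Stand-in data (assumption): kyphosis is bundled with rpart
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

# fit$where holds, for each training row, the row index in fit$frame of its leaf;
# the frame's row names are the node numbers shown by print(fit)
leaf_node <- as.integer(rownames(fit$frame))[fit$where]

# One data frame per leaf, i.e. per segment
segments <- split(kyphosis, leaf_node)

# Size and share of "present" in each segment; a separate model could be fitted on each
segment_summary <- sapply(segments, function(d) {
  c(n = nrow(d), share_present = mean(d$Kyphosis == "present"))
})
print(round(segment_summary, 3))
```

Each column of the summary is one segment, labelled by its node number, so the segments can be matched against the printed tree.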
In the next section, we will study Model Selection and Cross Validation.