203.3.11 Practice : Tree Building & Model Selection

The conclusion of the Decision Tree series.

LAB: Tree Building & Model Selection

In the previous section, we studied pruning a decision tree in R. This lab applies those steps to a complete example.

  • Import the Fiberbits data. This is data from an internet service provider; the goal is to predict customer attrition (active_cust) from the other, independent variables.
  • Build a decision tree model for the Fiberbits data.
  • Prune the tree if required.
  • Find the final accuracy.
  • Are there any customer segments that are 100% active or 100% inactive?

Solution

library(rpart)  # provides rpart(), rpart.control(), printcp(), plotcp(), prune()
Fiberbits <- read.csv("C:\\Amrita\\Datavedi\\Fiberbits\\Fiberbits.csv")
Fiber_bits_tree <- rpart(active_cust ~ ., method = "class", control = rpart.control(minsplit = 30, cp = 0.001), data = Fiberbits)
Fiber_bits_tree
## n= 100000 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##    1) root 100000 42141 1 (0.42141000 0.57859000)  
##      2) relocated>=0.5 12348   954 0 (0.92274052 0.07725948)  
##        4) technical_issues_per_month>=1.5 11294   526 0 (0.95342660 0.04657340) *
##        5) technical_issues_per_month< 1.5 1054   428 0 (0.59392789 0.40607211)  
##         10) number_plan_changes>=4.5 495    45 0 (0.90909091 0.09090909) *
##         11) number_plan_changes< 4.5 559   176 1 (0.31484794 0.68515206)  
##           22) Speed_test_result< 79.5 45     0 0 (1.00000000 0.00000000) *
##           23) Speed_test_result>=79.5 514   131 1 (0.25486381 0.74513619) *
##      3) relocated< 0.5 87652 30747 1 (0.35078492 0.64921508)  
##        6) Speed_test_result< 78.5 27517 10303 0 (0.62557692 0.37442308)  
##         12) technical_issues_per_month>=3.5 22187  5791 0 (0.73899130 0.26100870)  
##           24) number_plan_changes< 0.5 9735  1132 0 (0.88371854 0.11628146)  
##             48) Speed_test_result< 77.5 7750   541 0 (0.93019355 0.06980645) *
##             49) Speed_test_result>=77.5 1985   591 0 (0.70226700 0.29773300)  
##               98) income>=2008.5 1211   133 0 (0.89017341 0.10982659)  
##                196) income< 2526 1163    85 0 (0.92691316 0.07308684) *
##                197) income>=2526 48     0 1 (0.00000000 1.00000000) *
##               99) income< 2008.5 774   316 1 (0.40826873 0.59173127)  
##                198) income< 1785.5 270    97 0 (0.64074074 0.35925926) *
##                199) income>=1785.5 504   143 1 (0.28373016 0.71626984) *
##           25) number_plan_changes>=0.5 12452  4659 0 (0.62584324 0.37415676)  
##             50) number_plan_changes>=1.5 7867  1358 0 (0.82738020 0.17261980) *
##             51) number_plan_changes< 1.5 4585  1284 1 (0.28004362 0.71995638) *
##         13) technical_issues_per_month< 3.5 5330   818 1 (0.15347092 0.84652908)  
##           26) income>=1945.5 1849   619 1 (0.33477555 0.66522445)  
##             52) monthly_bill>=148 167    29 0 (0.82634731 0.17365269) *
##             53) monthly_bill< 148 1682   481 1 (0.28596908 0.71403092)  
##              106) income< 2362 1407   472 1 (0.33546553 0.66453447)  
##                212) technical_issues_per_month>=1.5 176    25 0 (0.85795455 0.14204545) *
##                213) technical_issues_per_month< 1.5 1231   321 1 (0.26076361 0.73923639)  
##                  426) income>=2180.5 126    21 0 (0.83333333 0.16666667) *
##                  427) income< 2180.5 1105   216 1 (0.19547511 0.80452489) *
##              107) income>=2362 275     9 1 (0.03272727 0.96727273) *
##           27) income< 1945.5 3481   199 1 (0.05716748 0.94283252) *
##        7) Speed_test_result>=78.5 60135 13533 1 (0.22504365 0.77495635)  
##         14) Speed_test_result< 82.5 25734  9271 1 (0.36026269 0.63973731)  
##           28) Speed_test_result>=80.5 11671  5312 0 (0.54485477 0.45514523)  
##             56) income>=1722.5 6306  1299 0 (0.79400571 0.20599429)  
##              112) income< 1992.5 5828   888 0 (0.84763212 0.15236788)  
##                224) number_plan_changes>=1.5 3053   189 0 (0.93809368 0.06190632) *
##                225) number_plan_changes< 1.5 2775   699 0 (0.74810811 0.25189189)  
##                  450) number_plan_changes< 0.5 2284   358 0 (0.84325744 0.15674256)  
##                    900) technical_issues_per_month>=3.5 1511    57 0 (0.96227664 0.03772336) *
##                    901) technical_issues_per_month< 3.5 773   301 0 (0.61060802 0.38939198)  
##                     1802) monthly_bill>=148 364    41 0 (0.88736264 0.11263736) *
##                     1803) monthly_bill< 148 409   149 1 (0.36430318 0.63569682) *
##                  451) number_plan_changes>=0.5 491   150 1 (0.30549898 0.69450102) *
##              113) income>=1992.5 478    67 1 (0.14016736 0.85983264) *
##             57) income< 1722.5 5365  1352 1 (0.25200373 0.74799627)  
##              114) number_plan_changes>=1.5 1586   680 1 (0.42875158 0.57124842)  
##                228) Speed_test_result< 81.5 894   370 0 (0.58612975 0.41387025)  
##                  456) technical_issues_per_month>=3.5 301    54 0 (0.82059801 0.17940199) *
##                  457) technical_issues_per_month< 3.5 593   277 1 (0.46711636 0.53288364)  
##                    914) income< 1604.5 261    92 0 (0.64750958 0.35249042) *
##                    915) income>=1604.5 332   108 1 (0.32530120 0.67469880) *
##                229) Speed_test_result>=81.5 692   156 1 (0.22543353 0.77456647) *
##              115) number_plan_changes< 1.5 3779   672 1 (0.17782482 0.82217518) *
##           29) Speed_test_result< 80.5 14063  2912 1 (0.20706819 0.79293181)  
##             58) income< 1960.5 11360  2725 1 (0.23987676 0.76012324)  
##              116) Num_complaints>=4.5 292    87 0 (0.70205479 0.29794521)  
##                232) technical_issues_per_month>=3.5 197     0 0 (1.00000000 0.00000000) *
##                233) technical_issues_per_month< 3.5 95     8 1 (0.08421053 0.91578947) *
##              117) Num_complaints< 4.5 11068  2520 1 (0.22768341 0.77231659)  
##                234) number_plan_changes>=1.5 4003  1180 1 (0.29477892 0.70522108)  
##                  468) income>=1809.5 1229   582 1 (0.47355574 0.52644426)  
##                    936) Speed_test_result>=79.5 477   132 0 (0.72327044 0.27672956) *
##                    937) Speed_test_result< 79.5 752   237 1 (0.31515957 0.68484043) *
##                  469) income< 1809.5 2774   598 1 (0.21557318 0.78442682) *
##                235) number_plan_changes< 1.5 7065  1340 1 (0.18966737 0.81033263) *
##             59) income>=1960.5 2703   187 1 (0.06918239 0.93081761) *
##         15) Speed_test_result>=82.5 34401  4262 1 (0.12389175 0.87610825) *
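The last lab question asks about 100% active or inactive segments. In the printed tree above, three terminal nodes have a loss of 0: node 22 (45 customers, class 0), node 197 (48 customers, class 1) and node 232 (197 customers, class 0). Assuming active_cust = 1 means an active customer, nodes 22 and 232 are fully inactive segments and node 197 is a fully active one. The sketch below shows one way to pull such nodes out of a fitted tree programmatically. The helper name pure_leaves is hypothetical, and the kyphosis data that ships with rpart is used only as a stand-in so the example runs without the Fiberbits CSV.

```r
library(rpart)

# Hypothetical helper: return the terminal nodes of a classification tree
# in which no observation is misclassified (loss == 0), i.e. pure segments.
pure_leaves <- function(tree) {
  frame <- tree$frame
  is_leaf <- frame$var == "<leaf>"            # terminal nodes are labelled "<leaf>"
  pure <- frame[is_leaf & frame$dev == 0,     # for method = "class", dev is the loss
                c("n", "dev", "yval")]
  pure$node <- as.integer(row.names(pure))    # row names hold the node numbers
  pure
}

# Stand-in data (assumption): kyphosis ships with the rpart package.
demo_tree <- rpart(Kyphosis ~ ., data = kyphosis, method = "class")
pure_demo <- pure_leaves(demo_tree)
print(pure_demo)
```

With the lab's model, pure_leaves(Fiber_bits_tree) should list the same three nodes read off the output above, and path.rpart(Fiber_bits_tree, nodes = c(22, 197, 232)) prints the sequence of splits that defines each segment.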

Plotting the Tree

library(rpart.plot)  # provides prp()
prp(Fiber_bits_tree, box.col=c("Grey", "Orange")[Fiber_bits_tree$frame$yval], varlen=0, faclen=0, type=1, extra=4, under=TRUE)

Code: Choosing Cp and Cross Validation Error

printcp(Fiber_bits_tree)
## 
## Classification tree:
## rpart(formula = active_cust ~ ., data = Fiberbits, method = "class", 
##     control = rpart.control(minsplit = 30, cp = 0.001))
## 
## Variables actually used in tree construction:
## [1] income                     monthly_bill              
## [3] Num_complaints             number_plan_changes       
## [5] relocated                  Speed_test_result         
## [7] technical_issues_per_month
## 
## Root node error: 42141/100000 = 0.42141
## 
## n= 100000 
## 
##           CP nsplit rel error  xerror      xstd
## 1  0.2477397      0   1.00000 1.00000 0.0037054
## 2  0.1639971      1   0.75226 0.75226 0.0034917
## 3  0.0876581      2   0.58826 0.58826 0.0032402
## 4  0.0293301      3   0.50061 0.50061 0.0030616
## 5  0.0239316      6   0.41261 0.41295 0.0028450
## 6  0.0081631      8   0.36475 0.37498 0.0027372
## 7  0.0024560      9   0.35659 0.35811 0.0026862
## 8  0.0022662     11   0.35168 0.35362 0.0026723
## 9  0.0018272     13   0.34714 0.34520 0.0026457
## 10 0.0016848     15   0.34349 0.34228 0.0026364
## 11 0.0014001     18   0.33832 0.33825 0.0026234
## 12 0.0013763     24   0.32859 0.33495 0.0026127
## 13 0.0013170     26   0.32583 0.33115 0.0026003
## 14 0.0012933     28   0.32320 0.32859 0.0025918
## 15 0.0011390     33   0.31563 0.32465 0.0025787
## 16 0.0010678     34   0.31449 0.32088 0.0025661
## 17 0.0010000     35   0.31342 0.31926 0.0025606
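The table above can also be read programmatically. The sketch below transcribes the printed CP table and applies two common conventions for choosing cp: the row with the lowest cross-validated error (xerror), and the "one standard error" rule, which takes the smallest tree whose xerror is within one xstd of that minimum. These are general conventions, not necessarily how the cp values used later in this lab were chosen. The xerror column comes from random cross-validation folds, so a rerun may give slightly different numbers.

```r
# CP table transcribed from the printcp() output above.
cp_table <- data.frame(
  CP     = c(0.2477397, 0.1639971, 0.0876581, 0.0293301, 0.0239316, 0.0081631,
             0.0024560, 0.0022662, 0.0018272, 0.0016848, 0.0014001, 0.0013763,
             0.0013170, 0.0012933, 0.0011390, 0.0010678, 0.0010000),
  nsplit = c(0, 1, 2, 3, 6, 8, 9, 11, 13, 15, 18, 24, 26, 28, 33, 34, 35),
  xerror = c(1.00000, 0.75226, 0.58826, 0.50061, 0.41295, 0.37498, 0.35811,
             0.35362, 0.34520, 0.34228, 0.33825, 0.33495, 0.33115, 0.32859,
             0.32465, 0.32088, 0.31926),
  xstd   = c(0.0037054, 0.0034917, 0.0032402, 0.0030616, 0.0028450, 0.0027372,
             0.0026862, 0.0026723, 0.0026457, 0.0026364, 0.0026234, 0.0026127,
             0.0026003, 0.0025918, 0.0025787, 0.0025661, 0.0025606)
)

# Convention 1: the row with the lowest cross-validated error.
best_row <- which.min(cp_table$xerror)

# Convention 2 (one-standard-error rule): the first, and therefore smallest,
# tree whose xerror is no more than one xstd above the minimum.
threshold  <- cp_table$xerror[best_row] + cp_table$xstd[best_row]
one_se_row <- which(cp_table$xerror <= threshold)[1]

print(cp_table[c(best_row, one_se_row), ])
```

On a fitted model the same logic can be applied to Fiber_bits_tree$cptable instead of a transcribed table. In this run xerror is still falling at the last row, so both conventions point at the large trees; pruning to a smaller cp row trades some cross-validated accuracy for a tree that is easier to read.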

Plot: Choosing Cp and Cross Validation Error

plotcp(Fiber_bits_tree) 

Pruning

Fiber_bits_tree_1<-prune(Fiber_bits_tree, cp=0.0081631)
Fiber_bits_tree_1
## n= 100000 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 100000 42141 1 (0.42141000 0.57859000)  
##    2) relocated>=0.5 12348   954 0 (0.92274052 0.07725948) *
##    3) relocated< 0.5 87652 30747 1 (0.35078492 0.64921508)  
##      6) Speed_test_result< 78.5 27517 10303 0 (0.62557692 0.37442308)  
##       12) technical_issues_per_month>=3.5 22187  5791 0 (0.73899130 0.26100870)  
##         24) number_plan_changes< 0.5 9735  1132 0 (0.88371854 0.11628146) *
##         25) number_plan_changes>=0.5 12452  4659 0 (0.62584324 0.37415676)  
##           50) number_plan_changes>=1.5 7867  1358 0 (0.82738020 0.17261980) *
##           51) number_plan_changes< 1.5 4585  1284 1 (0.28004362 0.71995638) *
##       13) technical_issues_per_month< 3.5 5330   818 1 (0.15347092 0.84652908) *
##      7) Speed_test_result>=78.5 60135 13533 1 (0.22504365 0.77495635)  
##       14) Speed_test_result< 82.5 25734  9271 1 (0.36026269 0.63973731)  
##         28) Speed_test_result>=80.5 11671  5312 0 (0.54485477 0.45514523)  
##           56) income>=1722.5 6306  1299 0 (0.79400571 0.20599429) *
##           57) income< 1722.5 5365  1352 1 (0.25200373 0.74799627) *
##         29) Speed_test_result< 80.5 14063  2912 1 (0.20706819 0.79293181) *
##       15) Speed_test_result>=82.5 34401  4262 1 (0.12389175 0.87610825) *
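The lab asks for the final accuracy, which can be worked out from the printed tree. Each terminal node reports its size (n) and the number of misclassified customers (loss), so the training accuracy is one minus total loss over total n. The sketch below transcribes the nine terminal nodes of the pruned tree above; it gives 15371 misclassified out of 100000, a training accuracy of about 84.6%. This is accuracy on the data the tree was built from, not on unseen data.

```r
# Terminal nodes of the pruned tree (cp = 0.0081631), transcribed from the
# output above: node number, observations in the node, misclassified count.
pruned_leaves <- data.frame(
  node = c(2, 24, 50, 51, 13, 56, 57, 29, 15),
  n    = c(12348, 9735, 7867, 4585, 5330, 6306, 5365, 14063, 34401),
  loss = c(954, 1132, 1358, 1284, 818, 1299, 1352, 2912, 4262)
)

total_n        <- sum(pruned_leaves$n)     # should match the root node size
total_loss     <- sum(pruned_leaves$loss)  # misclassified training rows
train_accuracy <- 1 - total_loss / total_n

cat("misclassified:", total_loss, "of", total_n, "\n")
cat("training accuracy:", train_accuracy, "\n")
```

With the fitted model in memory, the same figure can be checked from a confusion matrix, for example table(actual = Fiberbits$active_cust, predicted = predict(Fiber_bits_tree_1, type = "class")).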

Plot after Pruning

prp(Fiber_bits_tree_1, box.col=c("Grey", "Orange")[Fiber_bits_tree_1$frame$yval], varlen=0, faclen=0, type=1, extra=4, under=TRUE)

Choosing Cp and Cross Validation Error with New Model

printcp(Fiber_bits_tree_1) 
## 
## Classification tree:
## rpart(formula = active_cust ~ ., data = Fiberbits, method = "class", 
##     control = rpart.control(minsplit = 30, cp = 0.001))
## 
## Variables actually used in tree construction:
## [1] income                     number_plan_changes       
## [3] relocated                  Speed_test_result         
## [5] technical_issues_per_month
## 
## Root node error: 42141/100000 = 0.42141
## 
## n= 100000 
## 
##          CP nsplit rel error  xerror      xstd
## 1 0.2477397      0   1.00000 1.00000 0.0037054
## 2 0.1639971      1   0.75226 0.75226 0.0034917
## 3 0.0876581      2   0.58826 0.58826 0.0032402
## 4 0.0293301      3   0.50061 0.50061 0.0030616
## 5 0.0239316      6   0.41261 0.41295 0.0028450
## 6 0.0081631      8   0.36475 0.37498 0.0027372

plotcp(Fiber_bits_tree_1)

Pruning further

Fiber_bits_tree_2<-prune(Fiber_bits_tree, cp=0.0239316)
Fiber_bits_tree_2
## n= 100000 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 100000 42141 1 (0.42141000 0.57859000)  
##    2) relocated>=0.5 12348   954 0 (0.92274052 0.07725948) *
##    3) relocated< 0.5 87652 30747 1 (0.35078492 0.64921508)  
##      6) Speed_test_result< 78.5 27517 10303 0 (0.62557692 0.37442308)  
##       12) technical_issues_per_month>=3.5 22187  5791 0 (0.73899130 0.26100870) *
##       13) technical_issues_per_month< 3.5 5330   818 1 (0.15347092 0.84652908) *
##      7) Speed_test_result>=78.5 60135 13533 1 (0.22504365 0.77495635)  
##       14) Speed_test_result< 82.5 25734  9271 1 (0.36026269 0.63973731)  
##         28) Speed_test_result>=80.5 11671  5312 0 (0.54485477 0.45514523)  
##           56) income>=1722.5 6306  1299 0 (0.79400571 0.20599429) *
##           57) income< 1722.5 5365  1352 1 (0.25200373 0.74799627) *
##         29) Speed_test_result< 80.5 14063  2912 1 (0.20706819 0.79293181) *
##       15) Speed_test_result>=82.5 34401  4262 1 (0.12389175 0.87610825) *
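To compare the three trees for model selection, the CP table figures can be turned into accuracies. In printcp() output, rel error and xerror are expressed relative to the root node error, so multiplying by 42141/100000 gives the training and cross-validated misclassification rates. The sketch below uses the values printed earlier for 35, 8 and 6 splits; the data frame and its column names are illustrative only.

```r
# Root node error as reported by printcp(): 42141 / 100000.
root_error <- 42141 / 100000

# rel error and xerror for the three trees, taken from the CP tables above.
trees <- data.frame(
  model     = c("full (cp = 0.001)",
                "pruned (cp = 0.0081631)",
                "pruned further (cp = 0.0239316)"),
  nsplit    = c(35, 8, 6),
  rel_error = c(0.31342, 0.36475, 0.41261),
  xerror    = c(0.31926, 0.37498, 0.41295)
)

# Both error columns are relative to the root error, so rescale them.
trees$train_accuracy <- 1 - trees$rel_error * root_error
trees$cv_accuracy    <- 1 - trees$xerror * root_error

print(trees[, c("model", "nsplit", "train_accuracy", "cv_accuracy")])
```

Under these figures each pruning step lowers both the training and the cross-validated accuracy, roughly 86.5%, 84.2% and 82.6% cross-validated for 35, 8 and 6 splits. The smaller trees are therefore chosen here for simplicity and interpretability rather than for accuracy.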

Tree After Pruning Further

prp(Fiber_bits_tree_2, box.col=c("Grey", "Orange")[Fiber_bits_tree_2$frame$yval], varlen=0, faclen=0, type=1, extra=4, under=TRUE)

Conclusion

  • Decision trees are powerful, and they are simple to represent and understand.
  • One needs to be careful with the size of the tree: a tree that is allowed to grow deep overfits easily, which is why pruning matters.
  • They can be applied to most types of data and handle categorical predictors particularly well.
  • One can use a decision tree to perform a basic customer segmentation and then build a separate predictive model on each segment.
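All accuracy figures in this lab are computed on the same data the tree was trained on, or estimated by rpart's internal cross-validation. A simple additional check for overfitting is to hold out part of the data and score the tree on it. The sketch below is a minimal illustration under stated assumptions: it uses the kyphosis data that ships with rpart as a stand-in for Fiberbits, a 70/30 split, and an arbitrary seed; none of these choices come from the lab itself.

```r
library(rpart)

set.seed(42)  # arbitrary seed, only to make the split reproducible

# Stand-in data (assumption): kyphosis ships with rpart.
n         <- nrow(kyphosis)
train_idx <- sample(n, size = floor(0.7 * n))   # 70% of rows for training
train     <- kyphosis[train_idx, ]
test      <- kyphosis[-train_idx, ]             # remaining rows held out

# Fit on the training rows only, then predict classes for the held-out rows.
fit  <- rpart(Kyphosis ~ ., data = train, method = "class")
pred <- predict(fit, newdata = test, type = "class")

# Confusion matrix and accuracy on data the tree has not seen.
conf_mat    <- table(actual = test$Kyphosis, predicted = pred)
holdout_acc <- sum(diag(conf_mat)) / sum(conf_mat)

print(conf_mat)
cat("hold-out accuracy:", holdout_acc, "\n")
```

The same pattern would apply to the lab data by replacing kyphosis with Fiberbits and Kyphosis with active_cust. A hold-out accuracy well below the training accuracy is a sign that the tree is too large.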

 

In the next section, we will study Model Selection and Cross Validation.
