
204.3.11 Practice : Tree Building & Model Selection

The concluding post of the Decision Tree series.

Link to the previous post: https://statinfer.com/204-3-10-pruning-a-decision-tree-in-python/

 

This is the last post in this series. We will once more build a model, validate its accuracy on test data, and prune the tree if needed.

Tree Building & Model Selection

  • Import the fiber bits data. This is internet service provider data. The idea is to predict customer attrition based on some independent factors.
  • Build a decision tree model for fiber bits data.
  • Prune the tree if required.
  • Find out the final accuracy.
  • Is there any 100% active/inactive customer segment?

Solution

import pandas as pd
import numpy as np

Fiber_df = pd.read_csv("datasets\\Fiberbits\\Fiberbits.csv", header=0)

Fiber_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 9 columns):
active_cust                   100000 non-null int64
income                        100000 non-null int64
months_on_network             100000 non-null int64
Num_complaints                100000 non-null int64
number_plan_changes           100000 non-null int64
relocated                     100000 non-null int64
monthly_bill                  100000 non-null int64
technical_issues_per_month    100000 non-null int64
Speed_test_result             100000 non-null int64
dtypes: int64(9)
memory usage: 6.9 MB
from sklearn.model_selection import train_test_split
from sklearn import tree
#Defining features and labels
features = list(Fiber_df.drop(['active_cust'], axis=1).columns) #this gives a list of column names except 'active_cust'
features
['income',
 'months_on_network',
 'Num_complaints',
 'number_plan_changes',
 'relocated',
 'monthly_bill',
 'technical_issues_per_month',
 'Speed_test_result']
X = np.array(Fiber_df[features])
y = np.array(Fiber_df['active_cust'])

We will split the data into train and test sets to validate the model accuracy.

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)
clf1 = tree.DecisionTreeClassifier()
clf1.fit(X_train,y_train)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
#Accuracy of the model clf1
clf1.score(X_test,y_test)
0.84550000000000003
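Before tweaking parameters, it helps to check whether the unrestricted tree is overfitting by comparing its accuracy on the training data against the test data. A minimal sketch of that check, using synthetic data rather than the Fiberbits file (the variable names below are illustrative, not from the original post):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Fiberbits data: 4 toy features, noisy binary target
rng = np.random.RandomState(42)
X_demo = rng.rand(2000, 4)
y_demo = (X_demo[:, 0] + 0.3 * rng.randn(2000) > 0.5).astype(int)

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, train_size=0.8,
                                      random_state=42)

# An unrestricted tree memorizes the training set
full_tree = DecisionTreeClassifier(random_state=42).fit(Xtr, ytr)

train_acc = full_tree.score(Xtr, ytr)  # close to 1.0: the tree fit the noise
test_acc = full_tree.score(Xte, yte)   # noticeably lower on unseen data
print(train_acc, test_acc)
```

A large gap between the two scores is the classic sign that the tree should be pruned, which is what the parameter tweaks below attempt.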

We can tweak the parameters and build a new model.

#Let's build a model by changing the parameters.
clf2 = tree.DecisionTreeClassifier(criterion='gini', 
                                   splitter='best', 
                                   max_depth=10, 
                                   min_samples_split=5, 
                                   min_samples_leaf=5, 
                                   min_weight_fraction_leaf=0.5, 
                                   max_leaf_nodes=10)
clf2.fit(X_train,y_train)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=10,
            max_features=None, max_leaf_nodes=10, min_samples_leaf=5,
            min_samples_split=5, min_weight_fraction_leaf=0.5,
            presort=False, random_state=None, splitter='best')
clf2.score(X_test,y_test)
0.84824999999999995
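Rather than trying one hand-picked parameter combination, model selection can be done systematically by sweeping a pruning parameter such as max_depth and keeping the value with the best test accuracy. A hedged sketch of that idea on synthetic data (not the Fiberbits file):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with an interaction signal plus 10% label noise
rng = np.random.RandomState(0)
X_demo = rng.rand(3000, 4)
y_demo = ((X_demo[:, 0] > 0.5) ^ (X_demo[:, 1] > 0.5)).astype(int)
y_demo[rng.rand(3000) < 0.1] ^= 1

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, train_size=0.8,
                                      random_state=0)

# Fit one tree per candidate depth and score each on the held-out test set
scores = {}
for depth in [2, 4, 6, 8, 10, None]:  # None = fully grown tree
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(Xtr, ytr)
    scores[depth] = clf.score(Xte, yte)

best_depth = max(scores, key=scores.get)
print(best_depth, scores[best_depth])
```

The same loop extends naturally to min_samples_leaf or max_leaf_nodes; with noisy labels, the fully grown tree typically scores worse than a moderately pruned one.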

Conclusion

  • Decision trees are powerful and very simple to represent and understand.
  • One needs to be careful with the size of the tree. Decision trees are more prone to overfitting than many other algorithms.
  • They can be applied to any type of data, and work especially well with categorical predictors.
  • One can use decision trees to perform a basic customer segmentation and build a different predictive model on each segment.
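The practice question about a 100% active/inactive customer segment can be answered this way: each leaf of a fitted tree is a segment, and a leaf whose class distribution is entirely one-sided is a fully active (or fully inactive) segment. A minimal sketch on synthetic data (not the Fiberbits file; names like X_demo are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with a cleanly separable rule, so pure leaves exist
rng = np.random.RandomState(1)
X_demo = rng.rand(1000, 3)
y_demo = (X_demo[:, 0] > 0.6).astype(int)

clf = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X_demo, y_demo)

# Leaves are nodes with no left child (sklearn marks them with -1)
leaf_ids = np.where(clf.tree_.children_left == -1)[0]

pure_leaves = []
for leaf in leaf_ids:
    counts = clf.tree_.value[leaf][0]          # per-class totals for this leaf
    purity = counts.max() / counts.sum()       # works for counts or fractions
    if purity == 1.0:
        n = int(clf.tree_.n_node_samples[leaf])
        pure_leaves.append((leaf, n, int(counts.argmax())))
print(pure_leaves)  # (leaf id, segment size, class) for each 100% segment
```

On the real Fiberbits data, the split rules along the path from the root to such a leaf (readable via tree.export_text or tree.plot_tree) describe the customer segment in business terms.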
