Link to the previous post : https://statinfer.com/204-3-10-pruning-a-decision-tree-in-python/
This is the last post in this series. We will again build a model, validate its accuracy on test data, and prune the tree if needed.
Tree Building & Model Selection
- Import the Fiberbits data. This is internet service provider data; the idea is to predict customer attrition based on several independent factors.
- Build a decision tree model for fiber bits data.
- Prune the tree if required.
- Find out the final accuracy.
- Is there any 100% active/inactive customer segment?
import pandas as pd
import numpy as np

Fiber_df = pd.read_csv("datasets\\Fiberbits\\Fiberbits.csv", header=0)
Fiber_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 9 columns):
active_cust                   100000 non-null int64
income                        100000 non-null int64
months_on_network             100000 non-null int64
Num_complaints                100000 non-null int64
number_plan_changes           100000 non-null int64
relocated                     100000 non-null int64
monthly_bill                  100000 non-null int64
technical_issues_per_month    100000 non-null int64
Speed_test_result             100000 non-null int64
dtypes: int64(9)
memory usage: 6.9 MB
from sklearn import tree
from sklearn.model_selection import train_test_split

# Defining features and labels
# This gives a list of all column names except 'active_cust'
features = list(Fiber_df.drop(['active_cust'], axis=1).columns)
features
['income', 'months_on_network', 'Num_complaints', 'number_plan_changes', 'relocated', 'monthly_bill', 'technical_issues_per_month', 'Speed_test_result']
X = np.array(Fiber_df[features])
y = np.array(Fiber_df['active_cust'])
We will split the data into train and test sets to validate the model's accuracy.
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)
clf1 = tree.DecisionTreeClassifier()
clf1.fit(X_train, y_train)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
# Accuracy of the model clf1 on test data
clf1.score(X_test, y_test)
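A fully grown tree tends to memorise the training data, so test accuracy alone does not tell us whether pruning is needed. A quick check is to compare training and test accuracy; a large gap suggests overfitting. A minimal sketch, using the objects defined above:

# Compare training and test accuracy; a large gap indicates overfitting
train_accuracy = clf1.score(X_train, y_train)
test_accuracy = clf1.score(X_test, y_test)
print("Train accuracy:", train_accuracy)
print("Test accuracy:", test_accuracy)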
We can tweak the parameters to limit the tree's growth and build a new, pruned model.
# Let's build a model by changing the parameters.
clf2 = tree.DecisionTreeClassifier(criterion='gini',
                                   splitter='best',
                                   max_depth=10,
                                   min_samples_split=5,
                                   min_samples_leaf=5,
                                   min_weight_fraction_leaf=0.5,
                                   max_leaf_nodes=10)
clf2.fit(X_train, y_train)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=10,
            max_features=None, max_leaf_nodes=10, min_samples_leaf=5,
            min_samples_split=5, min_weight_fraction_leaf=0.5,
            presort=False, random_state=None, splitter='best')
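To choose between the two trees, we score the pruned model on the same test data. Rather than tuning the parameters by hand, we can also search over a small grid with cross-validation and keep the best tree. A sketch of both steps follows; the grid values are illustrative choices of ours, not part of the original recipe:

from sklearn.model_selection import GridSearchCV

# Accuracy of the pruned model clf2 on the same test data as clf1
print("clf2 test accuracy:", clf2.score(X_test, y_test))

# Cross-validated search over a small, illustrative parameter grid
param_grid = {'max_depth': [5, 10, 20, None],
              'min_samples_leaf': [1, 5, 50]}
grid = GridSearchCV(tree.DecisionTreeClassifier(), param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best parameters:", grid.best_params_)
print("Test accuracy of the best tree:", grid.score(X_test, y_test))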
- Decision trees are powerful and very simple to represent and understand.
- One needs to be careful with the size of the tree: decision trees are more prone to overfitting than most other algorithms.
- Decision trees can be applied to any type of data, and work especially well with categorical predictors.
- One can use decision trees to perform basic customer segmentation and build a separate predictive model on each segment; the sketch below shows one way to find such segments.
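To answer the question posed at the top (is there any 100% active/inactive customer segment?), we can map every customer to their leaf node with apply() and check whether any leaf is entirely active or entirely inactive. A minimal sketch, using the fully grown tree clf1 on the training data; note that very small pure leaves in an unpruned tree may just reflect overfitting, so the segment size matters:

# Assign each training record to a leaf (segment) of the fitted tree
leaf_ids = clf1.apply(X_train)

# Share of active customers and size of each leaf segment
segments = pd.DataFrame({'leaf': leaf_ids, 'active_cust': y_train})
summary = segments.groupby('leaf')['active_cust'].agg(['mean', 'size'])

# Leaves where every customer is inactive (mean 0.0) or active (mean 1.0)
pure_leaves = summary[summary['mean'].isin([0.0, 1.0])]
print(pure_leaves.sort_values('size', ascending=False))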