Link to the previous post : https://statinfer.com/204-3-10-pruning-a-decision-tree-in-python/
This is the last post in this series. We will again build a model, validate its accuracy on test data, and prune the tree if needed.
Tree Building & Model Selection
- Import the Fiberbits data. This is internet service provider data; the idea is to predict customer attrition based on several independent factors.
- Build a decision tree model for fiber bits data.
- Prune the tree if required.
- Find out the final accuracy.
- Is there any 100% active/inactive customer segment?
import pandas as pd
import numpy as np

Fiber_df = pd.read_csv("datasets\\Fiberbits\\Fiberbits.csv", header=0)
Fiber_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 9 columns):
active_cust                   100000 non-null int64
income                        100000 non-null int64
months_on_network             100000 non-null int64
Num_complaints                100000 non-null int64
number_plan_changes           100000 non-null int64
relocated                     100000 non-null int64
monthly_bill                  100000 non-null int64
technical_issues_per_month    100000 non-null int64
Speed_test_result             100000 non-null int64
dtypes: int64(9)
memory usage: 6.9 MB
from sklearn import tree
from sklearn.model_selection import train_test_split

# Defining features and labels
# This gives a list of all column names except 'active_cust'
features = list(Fiber_df.drop(['active_cust'], axis=1).columns)
features
['income', 'months_on_network', 'Num_complaints', 'number_plan_changes', 'relocated', 'monthly_bill', 'technical_issues_per_month', 'Speed_test_result']
X = np.array(Fiber_df[features])
y = np.array(Fiber_df['active_cust'])
We will split the data into train and test sets to validate the model's accuracy.
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)
clf1 = tree.DecisionTreeClassifier()
clf1.fit(X_train, y_train)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
# Accuracy of the model clf1 on test data
clf1.score(X_test, y_test)
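A fully grown tree tends to memorise the training data, so test accuracy alone does not tell us whether pruning is needed. A quick check is to compare training and test accuracy; a large gap suggests overfitting. A minimal sketch, using the objects defined above:

# Compare training and test accuracy; a large gap indicates overfitting
train_accuracy = clf1.score(X_train, y_train)
test_accuracy = clf1.score(X_test, y_test)
print("Train accuracy:", train_accuracy)
print("Test accuracy:", test_accuracy)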
We can tweak the parameters to limit the tree's growth and build a new, pruned model.
# Let's build a model by changing the parameters.
clf2 = tree.DecisionTreeClassifier(criterion='gini',
                                   splitter='best',
                                   max_depth=10,
                                   min_samples_split=5,
                                   min_samples_leaf=5,
                                   min_weight_fraction_leaf=0.5,
                                   max_leaf_nodes=10)
clf2.fit(X_train, y_train)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=10,
            max_features=None, max_leaf_nodes=10, min_samples_leaf=5,
            min_samples_split=5, min_weight_fraction_leaf=0.5,
            presort=False, random_state=None, splitter='best')
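To choose between the two trees, we score the pruned model on the same test data. Rather than tuning the parameters by hand, we can also search over a small grid with cross-validation and keep the best tree. A sketch of both steps follows; the grid values are illustrative choices of ours, not part of the original recipe:

from sklearn.model_selection import GridSearchCV

# Accuracy of the pruned model clf2 on the same test data as clf1
print("clf2 test accuracy:", clf2.score(X_test, y_test))

# Cross-validated search over a small, illustrative parameter grid
param_grid = {'max_depth': [5, 10, 20, None],
              'min_samples_leaf': [1, 5, 50]}
grid = GridSearchCV(tree.DecisionTreeClassifier(), param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best parameters:", grid.best_params_)
print("Test accuracy of the best tree:", grid.score(X_test, y_test))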
- Decision trees are powerful and very simple to represent and understand.
- One needs to be careful with the size of the tree: decision trees are more prone to overfitting than most other algorithms.
- Decision trees can be applied to any type of data, and work especially well with categorical predictors.
- One can use decision trees to perform basic customer segmentation and build a separate predictive model on each segment; the sketch below shows one way to find such segments.
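To answer the question posed at the top (is there any 100% active/inactive customer segment?), we can map every customer to their leaf node with apply() and check whether any leaf is entirely active or entirely inactive. A minimal sketch, using the fully grown tree clf1 on the training data; note that very small pure leaves in an unpruned tree may just reflect overfitting, so the segment size matters:

# Assign each training record to a leaf (segment) of the fitted tree
leaf_ids = clf1.apply(X_train)

# Share of active customers and size of each leaf segment
segments = pd.DataFrame({'leaf': leaf_ids, 'active_cust': y_train})
summary = segments.groupby('leaf')['active_cust'].agg(['mean', 'size'])

# Leaves where every customer is inactive (mean 0.0) or active (mean 1.0)
pure_leaves = summary[summary['mean'].isin([0.0, 1.0])]
print(pure_leaves.sort_values('size', ascending=False))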