Link to the previous post : https://statinfer.com/204-3-10-pruning-a-decision-tree-in-python/
This is the last post in this series. We will again build a model, validate its accuracy on test data, and prune the tree if needed.
Tree Building & Model Selection
- Import fiber bits data. This is internet service provider data. The idea is to predict the customer attrition based on some independent factors.
- Build a decision tree model for fiber bits data.
- Prune the tree if required.
- Find out the final accuracy.
- Is there any 100% active/inactive customer segment?
Solution
import pandas as pd
import numpy as np
Fiber_df = pd.read_csv("datasets\\Fiberbits\\Fiberbits.csv", header=0)
Fiber_df.info()
from sklearn.model_selection import train_test_split
from sklearn import tree
#Defining features and labels
features = list(Fiber_df.drop(['active_cust'], axis=1).columns) #this gives a list of column names except 'active_cust'
features
X = np.array(Fiber_df[features])
y = np.array(Fiber_df['active_cust'])
We will split the data into train and test sets to validate the model's accuracy.
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)
clf1 = tree.DecisionTreeClassifier()
clf1.fit(X_train,y_train)
#Accuracy of the model clf1
clf1.score(X_test,y_test)
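An unpruned tree will usually fit the training data almost perfectly, so comparing its train and test accuracy is a quick check for overfitting. A minimal sketch of that check, using synthetic stand-in data (generated with `make_classification`, since the Fiberbits file is not assumed to be available here):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn import tree

# Synthetic stand-in for the Fiberbits data (illustration only)
X_demo, y_demo = make_classification(n_samples=2000, n_features=8, random_state=42)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, train_size=0.8, random_state=42)

clf = tree.DecisionTreeClassifier(random_state=42)
clf.fit(Xtr, ytr)

train_acc = clf.score(Xtr, ytr)  # an unpruned tree memorizes the training data
test_acc = clf.score(Xte, yte)   # a noticeably lower score here signals overfitting
```

A large gap between `train_acc` and `test_acc` is the cue to prune, which the next model does by constraining the tree's growth.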
We can tweak the parameters and build a new model.
#Let's build a model by changing the parameters.
clf2 = tree.DecisionTreeClassifier(criterion='gini',
                                   splitter='best',
                                   max_depth=10,
                                   min_samples_split=5,
                                   min_samples_leaf=5,
                                   min_weight_fraction_leaf=0.01, #each leaf must hold at least 1% of the sample weight; 0.5 would force a trivial single-split tree
                                   max_leaf_nodes=10)
clf2.fit(X_train,y_train)
clf2.score(X_test,y_test)
Conclusion
- Decision trees are powerful and very simple to represent and understand.
- One needs to be careful with the size of the tree. Decision trees are more prone to overfitting than many other algorithms.
- They can be applied to any type of data, and work especially well with categorical predictors.
- One can use decision trees to perform basic customer segmentation and build separate predictive models on the segments.
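The exercise also asked whether any customer segment is 100% active or inactive; in tree terms, that is a leaf containing only one class. The fitted tree's internal structure (`tree_.children_left`, `tree_.value` in scikit-learn) can be inspected for such pure leaves. A minimal sketch on synthetic stand-in data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn import tree

# Synthetic stand-in data (illustration only)
X_demo, y_demo = make_classification(n_samples=2000, n_features=8, random_state=42)
clf = tree.DecisionTreeClassifier(max_depth=5, min_samples_leaf=50,
                                  random_state=42).fit(X_demo, y_demo)

t = clf.tree_
is_leaf = t.children_left == -1       # leaf nodes have no children
leaf_counts = t.value[is_leaf, 0, :]  # per-leaf class distribution
pure = leaf_counts.min(axis=1) == 0   # leaves where one class is entirely absent
print(pure.sum(), "pure leaf segments out of", is_leaf.sum())
```

On the real Fiberbits tree, any `pure` leaf corresponds to a rule (a path from the root) that defines a fully active or fully inactive customer segment.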