
204.3.11 Practice : Tree Building & Model Selection

Link to the previous post : https://statinfer.com/204-3-10-pruning-a-decision-tree-in-python/


This is the last post in this series. We will again build a model, validate its accuracy on test data, and prune the tree if needed.

Tree Building & Model Selection

  • Import the fiber bits data. This is internet service provider data; the idea is to predict customer attrition from a set of independent factors.
  • Build a decision tree model for fiber bits data.
  • Prune the tree if required.
  • Find out the final accuracy.
  • Is there any 100% active/inactive customer segment?


import pandas as pd
import numpy as np

Fiber_df = pd.read_csv("datasets\\Fiberbits\\Fiberbits.csv", header=0)
Fiber_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 9 columns):
active_cust                   100000 non-null int64
income                        100000 non-null int64
months_on_network             100000 non-null int64
Num_complaints                100000 non-null int64
number_plan_changes           100000 non-null int64
relocated                     100000 non-null int64
monthly_bill                  100000 non-null int64
technical_issues_per_month    100000 non-null int64
Speed_test_result             100000 non-null int64
dtypes: int64(9)
memory usage: 6.9 MB
from sklearn.model_selection import train_test_split  #sklearn.cross_validation was removed in sklearn 0.20
from sklearn import tree
#Defining features and labels
features = list(Fiber_df.drop(['active_cust'], axis=1).columns) #list of column names except 'active_cust'
X = np.array(Fiber_df[features])
y = np.array(Fiber_df['active_cust'])

We will split the training data into train and test to validate the model accuracy

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)
clf1 = tree.DecisionTreeClassifier()
clf1 = clf1.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

#Accuracy of the model clf1 on test data
clf1.score(X_test, y_test)

We can tweak the parameters and build a new model.

#Let's make a model by changing the parameters.
clf2 = tree.DecisionTreeClassifier(criterion='gini', max_depth=10,
                                   max_leaf_nodes=10, min_samples_leaf=5,
                                   min_samples_split=5,
                                   min_weight_fraction_leaf=0.5)
clf2 = clf2.fit(X_train, y_train)
#Accuracy of the model clf2 on test data
clf2.score(X_test, y_test)
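The build-and-validate loop above can be sketched end to end. The snippet below uses a small synthetic stand-in for the Fiberbits data (the dataset, sizes, and parameter values here are illustrative, not from the original data), and shows how a fully grown tree overfits while a size-limited tree generalises better:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 8 numeric features, binary target driven by feature 0 plus noise
rng = np.random.RandomState(0)
X = rng.rand(2000, 8)
y = (X[:, 0] + 0.3 * rng.randn(2000) > 0.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0)

# Fully grown tree: memorises the training data
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Size-limited ("pruned") tree: trades training accuracy for generalisation
pruned = DecisionTreeClassifier(max_depth=5, max_leaf_nodes=10,
                                random_state=0).fit(X_train, y_train)

print("full   train/test:", full.score(X_train, y_train),
      full.score(X_test, y_test))
print("pruned train/test:", pruned.score(X_train, y_train),
      pruned.score(X_test, y_test))
```

The gap between train and test accuracy for the unrestricted tree is the overfitting that pruning is meant to remove; the model you keep is the one with the best test accuracy, not the best train accuracy.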


  • Decision trees are powerful and very simple to represent and understand.
  • One needs to be careful with the size of the tree: decision trees are more prone to overfitting than many other algorithms.
  • They can be applied to any type of data, and work especially well with categorical predictors.
  • One can use decision trees to perform basic customer segmentation and build separate predictive models on the segments.
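One of the practice questions asks whether any customer segment is 100% active or inactive; in a fitted tree this corresponds to a completely pure leaf. A minimal sketch of how to check for pure leaves via the tree's internals (on a toy dataset, not Fiberbits):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data: one feature cleanly separates the two classes
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# children_left == -1 marks leaf nodes; tree_.value holds per-node class
# distributions, so a leaf with a zero entry belongs 100% to one class
is_leaf = clf.tree_.children_left == -1
leaf_dist = clf.tree_.value[is_leaf].reshape(is_leaf.sum(), -1)
pure = leaf_dist.min(axis=1) == 0
print(pure.sum(), "of", is_leaf.sum(), "leaves are 100% pure segments")
```

On real data such as Fiberbits, each pure leaf's decision path (the chain of splits from the root) describes a concrete customer segment that is entirely active or entirely inactive.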