
204.3.11 Practice : Tree Building & Model Selection

Link to the previous post : https://statinfer.com/204-3-10-pruning-a-decision-tree-in-python/


This is the last post in this series. We will again build a model, validate its accuracy on test data, and prune the tree if needed.

Tree Building & Model Selection

  • Import the fiber bits data. This is internet service provider data; the idea is to predict customer attrition from a set of independent factors.
  • Build a decision tree model for fiber bits data.
  • Prune the tree if required.
  • Find out the final accuracy.
  • Is there any 100% active/inactive customer segment?


import pandas as pd
import numpy as np

Fiber_df = pd.read_csv("datasets\\Fiberbits\\Fiberbits.csv", header=0)
Fiber_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 9 columns):
active_cust                   100000 non-null int64
income                        100000 non-null int64
months_on_network             100000 non-null int64
Num_complaints                100000 non-null int64
number_plan_changes           100000 non-null int64
relocated                     100000 non-null int64
monthly_bill                  100000 non-null int64
technical_issues_per_month    100000 non-null int64
Speed_test_result             100000 non-null int64
dtypes: int64(9)
memory usage: 6.9 MB
from sklearn.model_selection import train_test_split  #sklearn.cross_validation was removed in sklearn 0.20
from sklearn import tree
#Defining features and labels
features = list(Fiber_df.drop(['active_cust'], axis=1).columns) #list of column names except 'active_cust'
X = np.array(Fiber_df[features])
y = np.array(Fiber_df['active_cust'])

We will split the training data into train and test to validate the model accuracy

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)
clf1 = tree.DecisionTreeClassifier()
clf1 = clf1.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

#Accuracy of the model clf1 on test data
clf1.score(X_test, y_test)

We can tweak the parameters and build a new model.

#Let's make a model by changing the parameters.
clf2 = tree.DecisionTreeClassifier(criterion='gini', max_depth=10,
                                   max_leaf_nodes=10, min_samples_leaf=5,
                                   min_samples_split=5,
                                   min_weight_fraction_leaf=0.5)
clf2 = clf2.fit(X_train, y_train)
#Accuracy of the model clf2 on test data
clf2.score(X_test, y_test)
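The build-and-validate loop above can be sketched end to end. The snippet below uses a small synthetic stand-in for the Fiberbits data (the dataset, sizes, and parameter values here are illustrative, not from the original data), and shows how a fully grown tree overfits while a size-limited tree generalises better:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 8 numeric features, binary target driven by feature 0 plus noise
rng = np.random.RandomState(0)
X = rng.rand(2000, 8)
y = (X[:, 0] + 0.3 * rng.randn(2000) > 0.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0)

# Fully grown tree: memorises the training data
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Size-limited ("pruned") tree: trades training accuracy for generalisation
pruned = DecisionTreeClassifier(max_depth=5, max_leaf_nodes=10,
                                random_state=0).fit(X_train, y_train)

print("full   train/test:", full.score(X_train, y_train),
      full.score(X_test, y_test))
print("pruned train/test:", pruned.score(X_train, y_train),
      pruned.score(X_test, y_test))
```

The gap between train and test accuracy for the unrestricted tree is the overfitting that pruning is meant to remove; the model you keep is the one with the best test accuracy, not the best train accuracy.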


  • Decision trees are powerful and very simple to represent and understand.
  • One needs to be careful with the size of the tree: decision trees are more prone to overfitting than many other algorithms.
  • They can be applied to any type of data, and work especially well with categorical predictors.
  • One can use decision trees to perform basic customer segmentation and build separate predictive models on the segments.
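One of the practice questions asks whether any customer segment is 100% active or inactive; in a fitted tree this corresponds to a completely pure leaf. A minimal sketch of how to check for pure leaves via the tree's internals (on a toy dataset, not Fiberbits):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data: one feature cleanly separates the two classes
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# children_left == -1 marks leaf nodes; tree_.value holds per-node class
# distributions, so a leaf with a zero entry belongs 100% to one class
is_leaf = clf.tree_.children_left == -1
leaf_dist = clf.tree_.value[is_leaf].reshape(is_leaf.sum(), -1)
pure = leaf_dist.min(axis=1) == 0
print(pure.sum(), "of", is_leaf.sum(), "leaves are 100% pure segments")
```

On real data such as Fiberbits, each pure leaf's decision path (the chain of splits from the root) describes a concrete customer segment that is entirely active or entirely inactive.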