Link to the previous post : https://statinfer.com/204-4-4-roc-and-auc/

What is a best model? How to build?

A model with maximum accuracy / least error.
A model that uses maximum information available in the given data.
A model that has minimum squared error.
A model that captures all the hidden patterns in the data.
A model that produces the best prediction results.

Model Selection

How to build/choose the best model?
Error on the training data is not a good measure of performance on future data.
How to select the best model out of the set of available models?
Are there any methods/metrics to choose the best model?
What is training error? What is testing error? What is hold out sample error?
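The difference between training error and hold out sample error can be sketched in a few lines. This is only an illustration under assumptions: it uses synthetic data from make_classification instead of the Fiberbits data, and the names X_demo, y_demo, demo_tree, the 80/20 split and max_depth=5 are arbitrary choices that do not come from this post.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn import tree

# Synthetic stand-in data (an assumption, not the Fiberbits data)
X_demo, y_demo = make_classification(n_samples=2000, n_features=8,
                                     n_informative=4, flip_y=0.1,
                                     random_state=0)

# Keep 20% of the rows aside as a hold out sample the model never sees
X_train, X_hold, y_train, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

demo_tree = tree.DecisionTreeClassifier(max_depth=5, random_state=0)
demo_tree.fit(X_train, y_train)

# score() returns accuracy, so error = 1 - accuracy
train_error = 1 - demo_tree.score(X_train, y_train)
holdout_error = 1 - demo_tree.score(X_hold, y_hold)

print("Training error:", train_error)
print("Hold out error:", holdout_error)
```

Training error is measured on the rows used to fit the model, while hold out error is measured on rows that were kept aside, which makes it a closer stand-in for performance on future data.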

Practice : The Most Accurate Model

Data: Fiberbits/Fiberbits.csv
Build a decision tree to predict active_cust
What is the accuracy of your model?
Grow the tree as much as you can and achieve 95% accuracy.

Solution

In [13]:

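#Preparing the X and y to train the model
#Fiber_df is assumed to be the Fiberbits data already loaded as a pandas DataFrame
import numpy as np

#All columns except the target are used as features
features = list(Fiber_df.drop(columns=['active_cust']).columns)

X = np.array(Fiber_df[features])
y = np.array(Fiber_df['active_cust'])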

In [14]:

#Let's make a model by choosing some initial parameters.
#min_samples_split must be at least 2 in scikit-learn; with min_samples_leaf=30 the tree is unchanged
from sklearn import tree

tree_config = tree.DecisionTreeClassifier(criterion='gini',
                                   splitter='best',
                                   max_depth=10,
                                   min_samples_split=2,
                                   min_samples_leaf=30,
                                   max_leaf_nodes=10)

In [15]:

#Training the model and finding its accuracy on the training data
tree_config.fit(X,y)
tree_config.score(X,y)

Out[15]:

0.84972999999999999

The first decision tree we have built gives an accuracy of 84.97% on the training data. To reach 95% accuracy, we will grow the tree by removing the limits on depth, leaf size and number of leaf nodes.

In [16]:

tree_config_new = tree.DecisionTreeClassifier(criterion='gini', 
                                              splitter='best', 
                                              max_depth=None, 
                                              min_samples_split=2, 
                                              min_samples_leaf=1, 
                                              max_leaf_nodes=None)

In [17]:

#Training the fully grown tree and finding its accuracy on the training data
tree_config_new.fit(X,y)
tree_config_new.score(X,y)

Out[17]:

0.99668999999999996

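This seems to be all about accuracy: the higher the accuracy, the better the model. But high accuracy comes at a price too. The 99.67% above is measured on the same data the tree was trained on, so it says little about future data. We will get to see this in the next posts.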
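One visible part of that price is the size of the tree. The sketch below compares a constrained tree and a fully grown tree on synthetic data, so the numbers will not match the Fiberbits outputs above. The data, the names X_demo, y_demo, small_tree and full_tree, and random_state=0 are assumptions for illustration only; the tree parameters mirror the two configurations used in the solution.

```python
from sklearn.datasets import make_classification
from sklearn import tree

# Synthetic stand-in data (an assumption, not the Fiberbits data)
X_demo, y_demo = make_classification(n_samples=2000, n_features=8,
                                     n_informative=4, flip_y=0.1,
                                     random_state=0)

# Constrained tree, similar to the first model in the solution
small_tree = tree.DecisionTreeClassifier(max_depth=10,
                                         min_samples_leaf=30,
                                         max_leaf_nodes=10,
                                         random_state=0)

# Fully grown tree, similar to the second model in the solution
full_tree = tree.DecisionTreeClassifier(random_state=0)

for name, model in [("constrained", small_tree), ("fully grown", full_tree)]:
    model.fit(X_demo, y_demo)
    print(name,
          "| training accuracy:", model.score(X_demo, y_demo),
          "| depth:", model.get_depth(),
          "| leaves:", model.get_n_leaves())
```

The fully grown tree gets its extra training accuracy by adding many more leaves, which means it is fitting ever smaller groups of training rows.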

The next post is about types of datasets, types of errors and the problem of overfitting.

Link to the next post : https://statinfer.com/204-4-6-type-of-datasets-type-of-errors-and-problem-of-overfitting/

21st June 2017

204.4.5 What is a Best Model?

What quantifies a model to be the best?
