Link to the previous post : https://statinfer.com/204-4-4-roc-and-auc/
What is the best model? How do we build it?
- A model with maximum accuracy / least error.
- A model that uses the maximum information available in the given data.
- A model that has minimum squared error.
- A model that captures all the hidden patterns in the data.
- A model that produces the best prediction results.
Model Selection
- How do we build/choose the best model?
- Error on the training data is not a good measure of performance on future data.
- How do we select the best model out of the set of available models?
- Are there any methods/metrics to choose the best model?
- What is training error? What is testing error? What is hold out sample error?
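The difference between training error and hold-out error can be sketched as follows. This is a minimal illustration, not the post's own solution: it uses synthetic data from `make_classification` as a stand-in for a real dataset, and the names `X_demo`, `y_demo`, `demo_tree`, `train_error` and `holdout_error` are hypothetical. The 70/30 split is also an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the Fiberbits data)
X_demo, y_demo = make_classification(n_samples=2000, n_features=8,
                                     n_informative=4, flip_y=0.1,
                                     random_state=0)

# Keep 30% of the rows aside as a hold-out sample the model never sees
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0)

# Fully grown tree, fitted on the training rows only
demo_tree = DecisionTreeClassifier(random_state=0)
demo_tree.fit(X_train, y_train)

# Error = 1 - accuracy, measured on each sample separately
train_error = 1 - demo_tree.score(X_train, y_train)
holdout_error = 1 - demo_tree.score(X_holdout, y_holdout)
print("Training error:", round(train_error, 4))
print("Hold-out error:", round(holdout_error, 4))
```

On data like this, the training error is typically far lower than the hold-out error, which is why the training error alone is a poor guide to performance on future data.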
Practice : The Most Accurate Model
- Data: Fiberbits/Fiberbits.csv
- Build a decision tree to predict active_cust
- What is the accuracy of your model?
- Grow the tree as much as you can and achieve 95% accuracy.
Solution
In [13]:
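#Preparing the X and y to train the model
#Assumes Fiber_df was already loaded from Fiberbits/Fiberbits.csv in an earlier cell
import numpy as np
features = list(Fiber_df.drop(columns=['active_cust']).columns)
X = np.array(Fiber_df[features])
y = np.array(Fiber_df['active_cust'])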
In [14]:
#Let's make a model by choosing some initial parameters.
#Note: scikit-learn requires an integer min_samples_split to be at least 2
from sklearn import tree
tree_config = tree.DecisionTreeClassifier(
    criterion='gini',
    splitter='best',
    max_depth=10,
    min_samples_split=2,
    min_samples_leaf=30,
    max_leaf_nodes=10)
In [15]:
#Training the model and finding the accuracy of the model
tree_config.fit(X,y)
tree_config.score(X,y)
Out[15]:
The first decision tree gives an accuracy of 84.97% on the training data. Next, we grow the tree to reach 95% accuracy.
In [16]:
tree_config_new = tree.DecisionTreeClassifier(
    criterion='gini',
    splitter='best',
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=1,
    max_leaf_nodes=None)
In [17]:
#Training the model and accuracy
tree_config_new.fit(X,y)
tree_config_new.score(X,y)
Out[17]:
So far this seems to be purely a matter of accuracy: the higher the accuracy, the better the model appears. But high accuracy comes at a price too, which we will look at in the next posts.
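One thing that can be checked right away is how much the tree itself grows when the restrictions are removed. The sketch below is only an illustration under stated assumptions: it uses synthetic data from `make_classification` rather than Fiberbits.csv, and `X_demo`, `y_demo`, `small_tree` and `full_tree` are hypothetical names. The two configurations mirror the restricted and unrestricted settings used above.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the Fiberbits data)
X_demo, y_demo = make_classification(n_samples=2000, n_features=8,
                                     n_informative=4, flip_y=0.1,
                                     random_state=0)

# Restricted tree: capped depth, leaf size and number of leaves
small_tree = DecisionTreeClassifier(max_depth=10, min_samples_leaf=30,
                                    max_leaf_nodes=10, random_state=0)
small_tree.fit(X_demo, y_demo)

# Unrestricted tree: grows until the training rows are separated
full_tree = DecisionTreeClassifier(random_state=0)
full_tree.fit(X_demo, y_demo)

# Compare training accuracy against the size of each tree
for name, model in [("restricted", small_tree), ("fully grown", full_tree)]:
    print(name,
          "| training accuracy:", round(model.score(X_demo, y_demo), 4),
          "| depth:", model.get_depth(),
          "| leaves:", model.get_n_leaves())
```

The fully grown tree reaches a higher training accuracy, but only by using many more leaves than the restricted one.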
The next post is about types of datasets, types of errors and the problem of overfitting.
Link to the next post : https://statinfer.com/204-4-6-type-of-datasets-type-of-errors-and-problem-of-overfitting/