Link to the previous post : https://statinfer.com/204-7-5-the-random-forest/
Let's put the Random Forest concept into practice using Python.
Practice : Random Forest
- Dataset: /Car Accidents IOT/Train.csv
- Goal: predict whether an accident is fatal.
- Build a decision tree model on the training data.
- On the test data, calculate the classification error and accuracy.
- Build a random forest model on the training data.
- On the test data, calculate the classification error and accuracy.
- What is the improvement of the Random Forest model when compared with the single tree?
In [10]:
#Importing dataset
import pandas as pd
car_train=pd.read_csv("datasets\\Car Accidents IOT\\train.csv")
car_test=pd.read_csv("datasets\\Car Accidents IOT\\test.csv")
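The paths above use Windows-style backslashes, so they resolve only on Windows. As a minimal sketch, assuming the same `datasets` folder layout relative to the working directory, `pathlib` can build the same paths in a portable way. The names `data_dir`, `train_path` and `test_path` are illustrative and not part of the original notebook.

```python
from pathlib import Path

# Assumed layout: datasets/Car Accidents IOT/{train,test}.csv
data_dir = Path("datasets") / "Car Accidents IOT"
train_path = data_dir / "train.csv"
test_path = data_dir / "test.csv"

# Path objects use the right separator for the current operating system
print(train_path)
print(test_path)
```

Both paths can be passed straight to `pd.read_csv`. File names are case-sensitive on Linux and macOS, so the name in the path must match the file on disk exactly (the task list writes `Train.csv`, the code `train.csv`).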
In [11]:
from sklearn import tree
# Predictors: the columns at positions 1 to 21; target: 'Fatal'
var=list(car_train.columns[1:22])
c=car_train[var]
d=car_train['Fatal']
### Building a decision tree on the training data ###
clf = tree.DecisionTreeClassifier()
clf.fit(c,d)
In [12]:
#####predicting on test data ####
tree_predict=clf.predict(car_test[var])
In [13]:
from sklearn.metrics import confusion_matrix  # for the confusion matrix
# Rows are the actual classes, columns are the predicted classes
cm1 = confusion_matrix(car_test['Fatal'],tree_predict)
print(cm1)
In [14]:
##### Calculate accuracy from the confusion matrix
total1=cm1.sum()
# Correct predictions sit on the diagonal of the matrix
accuracy_tree=(cm1[0,0]+cm1[1,1])/total1
accuracy_tree
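The practice steps ask for the classification error as well as the accuracy. As a minimal sketch, assuming a square confusion matrix with actual classes in rows and predicted classes in columns (the scikit-learn convention), accuracy is the diagonal sum divided by the total, and the classification error is one minus the accuracy. The helper name `accuracy_and_error` and the example counts are hypothetical, not results from the car accidents data.

```python
def accuracy_and_error(cm):
    """Return (accuracy, classification_error) for a square confusion matrix."""
    total = sum(sum(row) for row in cm)
    # Correctly classified cases lie on the diagonal
    correct = sum(cm[i][i] for i in range(len(cm)))
    accuracy = correct / total
    return accuracy, 1 - accuracy

# Made-up counts, for illustration only
example_cm = [[50, 10],
              [5, 35]]
acc, err = accuracy_and_error(example_cm)
print(acc, err)
```

The helper accepts nested lists as well as the NumPy array returned by `confusion_matrix`, so it could be called on `cm1` directly.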
In [16]:
### accuracy_score() gives the same result as the confusion matrix calculation
from sklearn.metrics import accuracy_score
accuracy_score(car_test['Fatal'],tree_predict)
In [17]:
#### Building a random forest classifier on the training data ####
from sklearn.ensemble import RandomForestClassifier
# max_features='sqrt' replaces 'auto', which newer scikit-learn releases no longer accept;
# the remaining arguments of the original call were all default values and are omitted
forest=RandomForestClassifier(n_estimators=10, max_features='sqrt')
forest.fit(c,d)
In [18]:
###predicting on test data with RF model
forestpredict_test=forest.predict(car_test[var])
e=car_test['Fatal']
In [19]:
### Check the accuracy on test data
from sklearn.metrics import confusion_matrix  # for the confusion matrix
cm2 = confusion_matrix(e,forestpredict_test)
print(cm2)
total2=cm2.sum()
##### Calculate accuracy from the confusion matrix
accuracy_forest=(cm2[0,0]+cm2[1,1])/total2
accuracy_forest
- The random forest gives a higher accuracy on the test data than the single decision tree.
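The size of the improvement can be made explicit by subtracting the two accuracies. The sketch below is not the car accidents analysis. It assumes synthetic stand-in data from `make_classification`, and names such as `acc_tree`, `acc_forest` and `improvement` are illustrative, so its numbers say nothing about the original dataset. A forest is also not guaranteed to beat a single tree on every dataset or random seed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (an assumption, not the real dataset)
X, y = make_classification(n_samples=1000, n_features=21,
                           n_informative=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Fit a single tree and a 10-tree forest on the same training split
tree_model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest_model = RandomForestClassifier(n_estimators=10,
                                      random_state=0).fit(X_train, y_train)

# Evaluate both models on the same held-out test split
acc_tree = accuracy_score(y_test, tree_model.predict(X_test))
acc_forest = accuracy_score(y_test, forest_model.predict(X_test))

# Improvement in accuracy, which equals the drop in classification error
improvement = acc_forest - acc_tree
print(f"tree:   accuracy {acc_tree:.3f}, error {1 - acc_tree:.3f}")
print(f"forest: accuracy {acc_forest:.3f}, error {1 - acc_forest:.3f}")
print(f"improvement in accuracy: {improvement:+.3f}")
```

With the notebook's own variables, the same comparison is `accuracy_forest - accuracy_tree`.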
The next post is about boosting.
Link to the next post : https://statinfer.com/204-7-7-boosting/