Link to the previous post: https://statinfer.com/204-5-2-decision-boundary-logistic-regression/
Linear decision boundaries are not always the way to go; our data can have a polynomial boundary too. In this post, we will see what happens when we try to use a linear function to classify slightly more complex data.
Non-Linear Decision Boundaries
- Dataset: “Emp_Productivity/Emp_Productivity.csv”
- Draw a scatter plot with Age on the X-axis and Experience on the Y-axis. Try to distinguish the two classes with colors or shapes (visualizing the classes).
- Build a logistic regression model to predict Productivity using Age and Experience.
- Finally, draw the decision boundary for this logistic regression model.
- Create the confusion matrix.
- Calculate the accuracy and error rates.
This time we are considering the entire dataset, not just a subset.
In [12]:
import pandas as pd
Emp_Productivity_raw = pd.read_csv("datasets\\Emp_Productivity\\Emp_Productivity.csv")
In [13]:
#plotting the overall data
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(Emp_Productivity_raw.Age[Emp_Productivity_raw.Productivity==0],Emp_Productivity_raw.Experience[Emp_Productivity_raw.Productivity==0], s=10, c='b', marker="o", label='Productivity 0')
ax.scatter(Emp_Productivity_raw.Age[Emp_Productivity_raw.Productivity==1],Emp_Productivity_raw.Experience[Emp_Productivity_raw.Productivity==1], s=10, c='r', marker="+", label='Productivity 1')
plt.legend(loc='upper left');
plt.show()
In [14]:
###Logistic Regression model1
import statsmodels.formula.api as sm
model = sm.logit(formula='Productivity ~ Age+Experience', data=Emp_Productivity_raw)
fitted = model.fit()
fitted.summary()
Out[14]:
In [15]:
#coefficients of the fitted model
coef=fitted.params
coef
Out[15]:
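A quick note on where the formulas in the next cell come from: fitted.params holds the estimated coefficients for Intercept, Age and Experience. The decision boundary is the set of points where the predicted probability equals 0.5, which for logistic regression is exactly where the linear predictor is zero:

Intercept + b_Age*Age + b_Experience*Experience = 0

Solving for Experience gives a straight line with slope -b_Age/b_Experience and intercept -Intercept/b_Experience, and that is what we compute below.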
In [16]:
# getting slope and intercept of the decision boundary line (see the note above)
slope=coef['Age']/(-coef['Experience'])
intercept=coef['Intercept']/(-coef['Experience'])
print('Slope :', slope)
print('Intercept :', intercept)
In [17]:
#Finally draw the decision boundary for this logistic regression model
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(Emp_Productivity_raw.Age[Emp_Productivity_raw.Productivity==0],Emp_Productivity_raw.Experience[Emp_Productivity_raw.Productivity==0], s=10, c='b', marker="o", label='Productivity 0')
ax.scatter(Emp_Productivity_raw.Age[Emp_Productivity_raw.Productivity==1],Emp_Productivity_raw.Experience[Emp_Productivity_raw.Productivity==1], s=10, c='r', marker="+", label='Productivity 1')
plt.legend(loc='upper left');
x_min, x_max = ax.get_xlim()
ax.plot([x_min, x_max], [x_min*slope+intercept, x_max*slope+intercept])
plt.show()
- We can see above that the linear decision boundary does a poor job of distinguishing the two classes.
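As a quick cross-check, the same boundary can be recovered without the slope/intercept algebra by plotting the 0.5 probability contour of the fitted model over a grid of Age and Experience values. This is a minimal sketch (the grid resolution of 200 is an arbitrary choice):

#draw the boundary as the p=0.5 contour of the fitted model
import numpy as np
age_grid, exp_grid = np.meshgrid(
    np.linspace(Emp_Productivity_raw.Age.min(), Emp_Productivity_raw.Age.max(), 200),
    np.linspace(Emp_Productivity_raw.Experience.min(), Emp_Productivity_raw.Experience.max(), 200))
grid = pd.DataFrame({'Age': age_grid.ravel(), 'Experience': exp_grid.ravel()})
#predicted probabilities on the grid, reshaped back to the grid shape
probs = np.asarray(fitted.predict(grid)).reshape(age_grid.shape)
plt.contour(age_grid, exp_grid, probs, levels=[0.5])
plt.show()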
Accuracy and Error
In [18]:
#Create the confusion matrix
#predicting values
predicted_values=fitted.predict(Emp_Productivity_raw[["Age","Experience"]])
predicted_values[1:10]
#Let's convert the predicted probabilities into classes using a threshold
threshold=0.5
threshold
import numpy as np
predicted_class=np.zeros(predicted_values.shape)
predicted_class[predicted_values>threshold]=1
#Predicted classes
predicted_class[1:10]
from sklearn.metrics import confusion_matrix as cm
ConfusionMatrix = cm(Emp_Productivity_raw[['Productivity']],predicted_class)
ConfusionMatrix
Out[18]:
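Note that in sklearn's convention the rows of the confusion matrix correspond to the actual classes and the columns to the predicted classes, so the diagonal entries count the correctly classified observations.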
In [19]:
#Accuracy and Error
accuracy=(ConfusionMatrix[0,0]+ConfusionMatrix[1,1])/sum(sum(ConfusionMatrix))
print('Accuracy : ', accuracy)
error=1-accuracy
print('Error: ',error)
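As a sanity check, sklearn's accuracy_score should return the same value as the manual calculation from the confusion matrix:

from sklearn.metrics import accuracy_score
print('Accuracy (sklearn): ', accuracy_score(Emp_Productivity_raw['Productivity'], predicted_class))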
- We can see that the accuracy of this model is quite poor; this is because the classes do not have a linear boundary.
In the next post, we will see the issues with non-linear decision boundaries.
Link to the next post : https://statinfer.com/204-5-4-issue-with-non-linear-decision-boundary/