204.5.3 Practice: Non-Linear Decision Boundary

Let's see what happens when our data cannot be classified using a linear boundary.

Link to the previous post: https://statinfer.com/204-5-2-decision-boundary-logistic-regression/

A linear decision boundary is not always the way to go, as our data can have a polynomial boundary too. In this post, we will see what happens if we try to use a linear function to classify slightly more complex data.

Non-Linear Decision Boundaries

  • Dataset: “Emp_Productivity/Emp_Productivity.csv”
  • Draw a scatter plot that shows Age on the X-axis and Experience on the Y-axis. Try to distinguish the two classes with colors or shapes (visualizing the classes).
  • Build a logistic regression model to predict Productivity using Age and Experience.
  • Finally, draw the decision boundary for this logistic regression model.
  • Create the confusion matrix.
  • Calculate the accuracy and error rates.

This time we are considering the entire data, not just a subset.

In [12]:
import pandas as pd

Emp_Productivity_raw = pd.read_csv("datasets\\Emp_Productivity\\Emp_Productivity.csv")
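Before plotting, it is worth a quick look at the data. A minimal sanity check, assuming the file loads with the Age, Experience and Productivity columns used below:

#quick look at the data: size, first rows and class balance
print(Emp_Productivity_raw.shape)
print(Emp_Productivity_raw.head())
print(Emp_Productivity_raw.Productivity.value_counts())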
In [13]:
#plotting the overall data
import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(111)

ax.scatter(Emp_Productivity_raw.Age[Emp_Productivity_raw.Productivity==0],Emp_Productivity_raw.Experience[Emp_Productivity_raw.Productivity==0], s=10, c='b', marker="o", label='Productivity 0')
ax.scatter(Emp_Productivity_raw.Age[Emp_Productivity_raw.Productivity==1],Emp_Productivity_raw.Experience[Emp_Productivity_raw.Productivity==1], s=10, c='r', marker="+", label='Productivity 1')
plt.legend(loc='upper left');
plt.show()
In [14]:
###Logistic Regression model 1
import statsmodels.formula.api as sm
model = sm.logit(formula='Productivity ~ Age+Experience', data=Emp_Productivity_raw)
fitted = model.fit()
fitted.summary()
Optimization terminated successfully.
         Current function value: 0.632202
         Iterations 5
Out[14]:
Logit Regression Results
==============================================================
Dep. Variable:   Productivity       No. Observations:     119
Model:           Logit              Df Residuals:         116
Method:          MLE                Df Model:               2
Date:            Tue, 15 Nov 2016   Pseudo R-squ.:    0.03361
Time:            16:08:50           Log-Likelihood:   -75.232
converged:       True               LL-Null:          -77.848
                                    LLR p-value:      0.07307
--------------------------------------------------------------
              coef    std err        z      P>|z|   [95.0% Conf. Int.]
Intercept   0.4478      0.699    0.641      0.522    -0.921     1.817
Age        -0.0176      0.038   -0.459      0.646    -0.092     0.057
Experience -0.0632      0.091   -0.698      0.485    -0.241     0.114
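Note that neither Age nor Experience is statistically significant (p-values of 0.646 and 0.485) and the pseudo R-squared is tiny (0.03361), an early sign that a linear model fits this data poorly.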
In [15]:
#coefficients of the fitted model
coef = fitted.params
coef
Out[15]:
Intercept     0.4478
Age          -0.0176
Experience   -0.0632
dtype: float64
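The decision boundary is where the predicted probability equals 0.5, i.e. where the linear part of the model is zero: Intercept + b_Age*Age + b_Experience*Experience = 0. Solving for Experience gives a line with slope -b_Age/b_Experience and intercept -Intercept/b_Experience, which is what the next cell computes.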
In [16]:
# getting slope and intercept of the decision boundary line
slope = coef['Age']/(-coef['Experience'])
intercept = coef['Intercept']/(-coef['Experience'])
print('Slope :', slope)
print('Intercept :', intercept)
Slope : -0.2785
Intercept : 7.0854
In [17]:
#Finally draw the decision boundary for this logistic regression model
fig = plt.figure()
ax = fig.add_subplot(111)

ax.scatter(Emp_Productivity_raw.Age[Emp_Productivity_raw.Productivity==0],Emp_Productivity_raw.Experience[Emp_Productivity_raw.Productivity==0], s=10, c='b', marker="o", label='Productivity 0')
ax.scatter(Emp_Productivity_raw.Age[Emp_Productivity_raw.Productivity==1],Emp_Productivity_raw.Experience[Emp_Productivity_raw.Productivity==1], s=10, c='r', marker="+", label='Productivity 1')
plt.legend(loc='upper left');

x_min, x_max = ax.get_xlim()
ax.plot([0, x_max], [intercept, x_max*slope+intercept])
plt.show()
  • We can see above that the linear decision boundary does a very poor job of distinguishing the two classes.

Accuracy and Error

In [18]:
#Create the confusion matrix
#predicting probabilities
predicted_values = fitted.predict(Emp_Productivity_raw[["Age","Experience"]])
predicted_values[1:10]

#Let's convert the probabilities to classes using a threshold
threshold = 0.5

import numpy as np
predicted_class = np.zeros(predicted_values.shape)
predicted_class[predicted_values > threshold] = 1

#Predicted classes
predicted_class[1:10]

from sklearn.metrics import confusion_matrix as cm
ConfusionMatrix = cm(Emp_Productivity_raw[['Productivity']],predicted_class)
ConfusionMatrix
Out[18]:
array([[69,  7],
       [43,  0]])
In [19]:
#Accuracy and Error
accuracy=(ConfusionMatrix[0,0]+ConfusionMatrix[1,1])/sum(sum(ConfusionMatrix))
print('Accuracy : ', accuracy)

error=1-accuracy
print('Error: ',error)
Accuracy :  0.579831932773
Error:  0.420168067227
  • We achieve a very poor accuracy on this model (about 58%, with 112 of the 119 observations predicted as class 0) because the classes do not have a linear boundary.
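As a cross-check, the same accuracy can be computed directly with scikit-learn's accuracy_score; a minimal sketch, assuming the predicted_class array from above:

from sklearn.metrics import accuracy_score

#should match the accuracy computed from the confusion matrix
accuracy_score(Emp_Productivity_raw['Productivity'], predicted_class)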

In the next post, we will see the issues with non-linear decision boundaries.

Link to the next post : https://statinfer.com/204-5-4-issue-with-non-linear-decision-boundary/
