Link to the previous post: https://statinfer.com/204-5-2-decision-boundary-logistic-regression/
Linear decision boundaries are not always the way to go; our data can have a polynomial boundary too. In this post, we will see what happens when we try to use a linear function to classify slightly more complex data.
Non-Linear Decision Boundaries
- Dataset: “Emp_Productivity/Emp_Productivity.csv”
- Draw a scatter plot with Age on the X-axis and Experience on the Y-axis. Try to distinguish the two classes with colors or shapes (visualizing the classes).
- Build a logistic regression model to predict Productivity using Age and Experience.
- Finally, draw the decision boundary for this logistic regression model.
- Create the confusion matrix.
- Calculate the accuracy and error rates.
This time we are considering the entire dataset, not just a subset.
In [12]:
import pandas as pd
Emp_Productivity_raw = pd.read_csv("datasets\\Emp_Productivity\\Emp_Productivity.csv")
In [13]:
#plotting the overall data
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(Emp_Productivity_raw.Age[Emp_Productivity_raw.Productivity==0],Emp_Productivity_raw.Experience[Emp_Productivity_raw.Productivity==0], s=10, c='b', marker="o", label='Productivity 0')
ax.scatter(Emp_Productivity_raw.Age[Emp_Productivity_raw.Productivity==1],Emp_Productivity_raw.Experience[Emp_Productivity_raw.Productivity==1], s=10, c='r', marker="+", label='Productivity 1')
plt.legend(loc='upper left');
plt.show()
In [14]:
###Logistic Regression model1
import statsmodels.formula.api as sm
model = sm.logit(formula='Productivity ~ Age+Experience', data=Emp_Productivity_raw)
fitted = model.fit()
fitted.summary()
Out[14]:
In [15]:
#coefficients of the fitted model
coef=fitted.params
coef
Out[15]:
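A quick note on where the formulas in the next cell come from: fitted.params holds the estimated coefficients for Intercept, Age and Experience. The decision boundary is the set of points where the predicted probability equals 0.5, which for logistic regression is exactly where the linear predictor is zero:

Intercept + b_Age*Age + b_Experience*Experience = 0

Solving for Experience gives a straight line with slope -b_Age/b_Experience and intercept -Intercept/b_Experience, and that is what we compute below.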
In [16]:
# getting slope and intercept of the decision boundary line (see the note above)
slope=coef['Age']/(-coef['Experience'])
intercept=coef['Intercept']/(-coef['Experience'])
print('Slope :', slope)
print('Intercept :', intercept)
In [17]:
#Finally draw the decision boundary for this logistic regression model
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(Emp_Productivity_raw.Age[Emp_Productivity_raw.Productivity==0],Emp_Productivity_raw.Experience[Emp_Productivity_raw.Productivity==0], s=10, c='b', marker="o", label='Productivity 0')
ax.scatter(Emp_Productivity_raw.Age[Emp_Productivity_raw.Productivity==1],Emp_Productivity_raw.Experience[Emp_Productivity_raw.Productivity==1], s=10, c='r', marker="+", label='Productivity 1')
plt.legend(loc='upper left');
x_min, x_max = ax.get_xlim()
ax.plot([x_min, x_max], [x_min*slope+intercept, x_max*slope+intercept])
plt.show()
- We can see above that the linear decision boundary does a poor job of distinguishing the two classes.
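As a quick cross-check, the same boundary can be recovered without the slope/intercept algebra by plotting the 0.5 probability contour of the fitted model over a grid of Age and Experience values. This is a minimal sketch (the grid resolution of 200 is an arbitrary choice):

#draw the boundary as the p=0.5 contour of the fitted model
import numpy as np
age_grid, exp_grid = np.meshgrid(
    np.linspace(Emp_Productivity_raw.Age.min(), Emp_Productivity_raw.Age.max(), 200),
    np.linspace(Emp_Productivity_raw.Experience.min(), Emp_Productivity_raw.Experience.max(), 200))
grid = pd.DataFrame({'Age': age_grid.ravel(), 'Experience': exp_grid.ravel()})
#predicted probabilities on the grid, reshaped back to the grid shape
probs = np.asarray(fitted.predict(grid)).reshape(age_grid.shape)
plt.contour(age_grid, exp_grid, probs, levels=[0.5])
plt.show()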
Accuracy and Error
In [18]:
#Create the confusion matrix
#predicting values
predicted_values=fitted.predict(Emp_Productivity_raw[["Age","Experience"]])
predicted_values[1:10]
#Let's convert the predicted probabilities into classes using a threshold
threshold=0.5
threshold
import numpy as np
predicted_class=np.zeros(predicted_values.shape)
predicted_class[predicted_values>threshold]=1
#Predicted classes
predicted_class[1:10]
from sklearn.metrics import confusion_matrix as cm
ConfusionMatrix = cm(Emp_Productivity_raw[['Productivity']],predicted_class)
ConfusionMatrix
Out[18]:
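Note that in sklearn's convention the rows of the confusion matrix correspond to the actual classes and the columns to the predicted classes, so the diagonal entries count the correctly classified observations.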
In [19]:
#Accuracy and Error
accuracy=(ConfusionMatrix[0,0]+ConfusionMatrix[1,1])/sum(sum(ConfusionMatrix))
print('Accuracy : ', accuracy)
error=1-accuracy
print('Error: ',error)
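As a sanity check, sklearn's accuracy_score should return the same value as the manual calculation from the confusion matrix:

from sklearn.metrics import accuracy_score
print('Accuracy (sklearn): ', accuracy_score(Emp_Productivity_raw['Productivity'], predicted_class))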
- We can see that the accuracy of this model is quite poor; this is because the classes do not have a linear boundary.
In the next post, we will see the issues with non-linear decision boundaries.
Link to the next post : https://statinfer.com/204-5-4-issue-with-non-linear-decision-boundary/