Link to the previous post : https://statinfer.com/204-5-4-issue-with-non-linear-decision-boundary/
In this post, we will learn how to implement the concept of intermediate outputs using python. We will cover many things in this session.
Practice : Intermediate output
- Dataset: Emp_Productivity/ Emp_Productivity_All_Sites.csv
- Filter the data and take first 74 observations from above dataset . Filter condition is Sample_Set<3
- Build a logistic regression model to predict Productivity using age and experience.
- Calculate the prediction probabilities for all the inputs. Store the probabilities in inter1 variable.
- Filter the data and take observations from row 34 onwards. Filter condition is Sample_Set<1
- Build a logistic regression model to predict Productivity using age and experience.
- Calculate the prediction probabilities for all the inputs. Store the probabilities in inter2 variable.
- Build a consolidated model to predict productivity using inter-1 and inter-2 variables.
- Create the confusion matrix and find the accuracy and error rates for the consolidated model.
Our sampled data Emp_Productivity1 has first 74 observations. Lets build the model on this sample data(sample-1)
In [20]:
#Filter the data and take a subset from whole dataset . Filter condition is Sample_Set<3
Emp_Productivity1=Emp_Productivity_raw[Emp_Productivity_raw.Sample_Set<3]
Emp_Productivity1.shape
Out[20]:
In [21]:
#Building a Logistic regression model1 to predict Productivity using age and experience
import statsmodels.formula.api as sm
model1 = sm.logit(formula='Productivity ~ Age+Experience', data=Emp_Productivity1)
fitted1 = model1.fit()
fitted1.summary()
Out[21]:
In [22]:
#Drawing the decision boundary for this logistic regression model
import matplotlib.pyplot as plt
fig = plt.figure()
ax1 = fig.add_subplot(111)
ax1.scatter(Emp_Productivity1.Age[Emp_Productivity1.Productivity==0],Emp_Productivity1.Experience[Emp_Productivity1.Productivity==0], s=10, c='b', marker="o", label='Productivity 0')
ax1.scatter(Emp_Productivity1.Age[Emp_Productivity1.Productivity==1],Emp_Productivity1.Experience[Emp_Productivity1.Productivity==1], s=10, c='r', marker="+", label='Productivity 1')
plt.xlim(min(Emp_Productivity1.Age), max(Emp_Productivity1.Age))
plt.ylim(min(Emp_Productivity1.Experience), max(Emp_Productivity1.Experience))
plt.legend(loc='upper left');
x_min, x_max = ax1.get_xlim()
ax1.plot([0, x_max], [intercept1, x_max*slope1+intercept1])
plt.show()
In [23]:
# Calculating and Storing prediction probabilities in inter1 variable for data Emp_Productivity1
Emp_Productivity_raw['inter1'] = fitted1.predict(Emp_Productivity_raw[["Age"]+["Experience"]])
For Sample_Set < 1 :
In [24]:
# Filter the data and take observations from row 34 onwards. Filter condition is Sample_Set<1
Emp_Productivity2=Emp_Productivity_raw[Emp_Productivity_raw.Sample_Set>1]
Emp_Productivity2.shape
Out[24]:
In [25]:
Emp_Productivity2.head()
Out[25]:
In [26]:
####The clasification graph
#Draw a scatter plot that shows Age on X axis and Experience on Y-axis. Try to distinguish the two classes with colors or shapes.
fig = plt.figure()
ax2 = fig.add_subplot(111)
ax2.scatter(Emp_Productivity2.Age[Emp_Productivity2.Productivity==0],Emp_Productivity2.Experience[Emp_Productivity2.Productivity==0], s=10, c='b', marker="o", label='Productivity 0')
ax2.scatter(Emp_Productivity2.Age[Emp_Productivity2.Productivity==1],Emp_Productivity2.Experience[Emp_Productivity2.Productivity==1], s=10, c='r', marker="+", label='Productivity 1')
plt.xlim(min(Emp_Productivity2.Age), max(Emp_Productivity2.Age))
plt.ylim(min(Emp_Productivity2.Experience), max(Emp_Productivity2.Experience))
plt.legend(loc='upper left');
plt.show()
In [27]:
#Build a logistic regression model to predict Productivity using age and experience of data Emp_Productivity2
###Logistic Regerssion model1
import statsmodels.formula.api as sm
model2 = sm.logit(formula='Productivity ~ Age+Experience', data=Emp_Productivity2)
fitted2 = model2.fit(method="bfgs")
fitted2.summary()
Out[27]:
In [28]:
#coefficients
coef=fitted2.normalized_cov_params
print(coef)
#getting slope and intercept of the line
slope2=fitted2.params[1]/(-fitted2.params[2])
intercept2=fitted2.params[0]/(-fitted2.params[2])
In [29]:
#Finally draw the decision boundary for this logistic regression model
import matplotlib.pyplot as plt
fig = plt.figure()
ax2 = fig.add_subplot(111)
ax2.scatter(Emp_Productivity2.Age[Emp_Productivity2.Productivity==0],Emp_Productivity2.Experience[Emp_Productivity2.Productivity==0], s=10, c='b', marker="o", label='Productivity 0')
ax2.scatter(Emp_Productivity2.Age[Emp_Productivity2.Productivity==1],Emp_Productivity2.Experience[Emp_Productivity2.Productivity==1], s=10, c='r', marker="+", label='Productivity 1')
plt.xlim(min(Emp_Productivity2.Age), max(Emp_Productivity2.Age))
plt.ylim(min(Emp_Productivity2.Experience), max(Emp_Productivity2.Experience))
plt.legend(loc='upper left');
x_min, x_max = ax2.get_xlim()
y_min,y_max=ax2.get_ylim()
ax2.plot([x_min, x_max], [x_min*slope2+intercept2, x_max*slope2+intercept2])
plt.show()
In [30]:
#Calculate the prediction probabilities for all the inputs. Store the probabilities in inter2 variable for data Emp_Productivity2
Emp_Productivity_raw['inter2']=fitted2.predict(Emp_Productivity_raw[["Age"]+["Experience"]])
In [31]:
###Confusion matrix, Accuracy and error of the model2
#Predciting Values
predicted_values=fitted2.predict(Emp_Productivity2[["Age"]+["Experience"]])
predicted_values[1:10]
#Lets convert them to classes using a threshold
threshold=0.5
threshold
import numpy as np
predicted_class=np.zeros(predicted_values.shape)
predicted_class[predicted_values>threshold]=1
#Predcited Classes
predicted_class[1:10]
from sklearn.metrics import confusion_matrix as cm
ConfusionMatrix = cm(Emp_Productivity2[['Productivity']],predicted_class)
print('Confusion Matrix : ',ConfusionMatrix)
accuracy=(ConfusionMatrix[0,0]+ConfusionMatrix[1,1])/sum(sum(ConfusionMatrix))
print('Accuracy : ', accuracy)
error=1-accuracy
print('Error : ', error)
Now that both models have been created lets try to combine them.
In [32]:
#plotting the new columns
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(Emp_Productivity_raw.inter1[Emp_Productivity_raw.Productivity==0],Emp_Productivity_raw.inter2[Emp_Productivity_raw.Productivity==0], s=50, c='b', marker="o", label='Productivity 0')
ax.scatter(Emp_Productivity_raw.inter1[Emp_Productivity_raw.Productivity==1],Emp_Productivity_raw.inter2[Emp_Productivity_raw.Productivity==1], s=50, c='r', marker="+", label='Productivity 1')
plt.xlim(min(Emp_Productivity_raw.inter1), max(Emp_Productivity_raw.inter1)+0.2)
plt.ylim(min(Emp_Productivity_raw.inter2), max(Emp_Productivity_raw.inter2)+0.2)
plt.legend(loc='lower left');
plt.show()
In [33]:
###Logistic Regerssion model with Intermediate outputs as input
import statsmodels.formula.api as sm
model_combined = sm.logit(formula='Productivity ~ inter1+inter2', data=Emp_Productivity_raw)
fitted_combined = model_combined.fit(method="bfgs")
fitted_combined.summary()
Out[33]:
In [34]:
#coefficients
coef=fitted_combined.normalized_cov_params
print(coef)
# getting slope and intercept of the line
slope_combined=fitted_combined.params[1]/(-fitted_combined.params[2])
intercept_combined=fitted_combined.params[0]/(-fitted_combined.params[2])
In [35]:
#Finally draw the decision boundary for this logistic regression model
import matplotlib.pyplot as plt
fig = plt.figure()
ax2 = fig.add_subplot(111)
ax2.scatter(Emp_Productivity_raw.inter1[Emp_Productivity_raw.Productivity==0],Emp_Productivity_raw.inter2[Emp_Productivity_raw.Productivity==0], s=10, c='b', marker="o", label='Productivity 0')
ax2.scatter(Emp_Productivity_raw.inter1[Emp_Productivity_raw.Productivity==1],Emp_Productivity_raw.inter2[Emp_Productivity_raw.Productivity==1], s=10, c='r', marker="+", label='Productivity 1')
plt.xlim(min(Emp_Productivity_raw.inter1), max(Emp_Productivity_raw.inter1)+0.2)
plt.ylim(min(Emp_Productivity_raw.inter2), max(Emp_Productivity_raw.inter2)+0.2)
plt.legend(loc='lower left');
x_min, x_max = ax2.get_xlim()
y_min,y_max=ax2.get_ylim()
ax2.plot([x_min, x_max], [x_min*slope_combined+intercept_combined, x_max*slope_combined+intercept_combined])
plt.show()
In [36]:
#### Confusion Matrix, Accuracy and Error of the Intermediate
#Predciting Values
predicted_values=fitted_combined.predict(Emp_Productivity_raw[["inter1"]+["inter2"]])
predicted_values[1:10]
#Lets convert them to classes using a threshold
threshold=0.5
threshold
import numpy as np
predicted_class=np.zeros(predicted_values.shape)
predicted_class[predicted_values>threshold]=1
#Predcited Classes
predicted_class[1:10]
from sklearn.metrics import confusion_matrix as cm
ConfusionMatrix = cm(Emp_Productivity_raw[['Productivity']],predicted_class)
print('Confusion Matrix : ',ConfusionMatrix)
accuracy=(ConfusionMatrix[0,0]+ConfusionMatrix[1,1])/sum(sum(ConfusionMatrix))
print('Accuracy : ', accuracy)
error=1-accuracy
print('Error : ', error)
We got an accuracy of 94.95% with an Intermediate model.
Next post is about neural network intuition.
Link to the next post : https://statinfer.com/204-5-6-neural-network-intuition/