204.5.5 Practice : Implementing Intermediate outputs in Python

Putting intermediate outputs into practice.

Link to the previous post : https://statinfer.com/204-5-4-issue-with-non-linear-decision-boundary/

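In this post, we will implement the concept of intermediate outputs in Python. Two logistic regression models are fitted on two subsets of the data, and their predicted probabilities are then used as the inputs of a third, consolidated model.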

Practice : Intermediate output

  • Dataset: Emp_Productivity/ Emp_Productivity_All_Sites.csv
  • Filter the data and take the first 74 observations from the above dataset. Filter condition is Sample_Set<3
  • Build a logistic regression model to predict Productivity using age and experience.
  • Calculate the prediction probabilities for all the inputs. Store the probabilities in inter1 variable.
  • Filter the data and take observations from row 34 onwards. Filter condition is Sample_Set>1
  • Build a logistic regression model to predict Productivity using age and experience.
  • Calculate the prediction probabilities for all the inputs. Store the probabilities in inter2 variable.
  • Build a consolidated model to predict productivity using inter1 and inter2 variables.
  • Create the confusion matrix and find the accuracy and error rates for the consolidated model.

Our first sample, Emp_Productivity1, contains the first 74 observations (Sample_Set<3). Let's build the model on this sample (sample-1).

In [20]:
#Filter the data and take a subset from whole dataset . Filter condition is Sample_Set<3
Emp_Productivity1=Emp_Productivity_raw[Emp_Productivity_raw.Sample_Set<3]
Emp_Productivity1.shape
Out[20]:
(74, 4)
In [21]:
#Building a Logistic regression model1 to predict Productivity using age and experience
import statsmodels.formula.api as sm
model1 = sm.logit(formula='Productivity ~ Age+Experience', data=Emp_Productivity1)
fitted1 = model1.fit()
fitted1.summary()
Optimization terminated successfully.
         Current function value: 0.315987
         Iterations 7
Out[21]:
Logit Regression Results
Dep. Variable:  Productivity        No. Observations:  74
Model:          Logit               Df Residuals:      71
Method:         MLE                 Df Model:          2
Date:           Tue, 15 Nov 2016    Pseudo R-squ.:     0.5402
Time:           16:09:16            Log-Likelihood:    -23.383
converged:      True                LL-Null:           -50.860
                                    LLR p-value:       1.167e-12

               coef  std err       z  P>|z|  [95.0% Conf. Int.]
Intercept   -8.9361    2.061  -4.335  0.000  -12.976   -4.896
Age          0.2763    0.105   2.620  0.009    0.070    0.483
Experience   0.5923    0.298   1.988  0.047    0.008    1.176
In [22]:
#Drawing the decision boundary for this logistic regression model
import matplotlib.pyplot as plt

fig = plt.figure()
ax1 = fig.add_subplot(111)

ax1.scatter(Emp_Productivity1.Age[Emp_Productivity1.Productivity==0],Emp_Productivity1.Experience[Emp_Productivity1.Productivity==0], s=10, c='b', marker="o", label='Productivity 0')
ax1.scatter(Emp_Productivity1.Age[Emp_Productivity1.Productivity==1],Emp_Productivity1.Experience[Emp_Productivity1.Productivity==1], s=10, c='r', marker="+", label='Productivity 1')

plt.xlim(min(Emp_Productivity1.Age), max(Emp_Productivity1.Age))
plt.ylim(min(Emp_Productivity1.Experience), max(Emp_Productivity1.Experience))
plt.legend(loc='upper left');

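#getting slope and intercept of the decision boundary from the model1 coefficients
slope1=fitted1.params['Age']/(-fitted1.params['Experience'])
intercept1=fitted1.params['Intercept']/(-fitted1.params['Experience'])

x_min, x_max = ax1.get_xlim()
ax1.plot([x_min, x_max], [x_min*slope1+intercept1, x_max*slope1+intercept1])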
plt.show()
In [23]:
# Calculating prediction probabilities from model1 for all the observations and storing them in inter1 variable
Emp_Productivity_raw['inter1'] = fitted1.predict(Emp_Productivity_raw[["Age","Experience"]])

For Sample_Set > 1:

In [24]:
# Filter the data and take observations from row 34 onwards. Filter condition is Sample_Set>1
Emp_Productivity2=Emp_Productivity_raw[Emp_Productivity_raw.Sample_Set>1]
Emp_Productivity2.shape
Out[24]:
(86, 5)
In [25]:
Emp_Productivity2.head()
Out[25]:
     Age  Experience  Productivity  Sample_Set    inter1
33  33.9         6.2             1           2  0.983732
34  29.3         5.5             1           2  0.918087
35  27.8         3.4             1           2  0.680985
36  30.7         8.6             1           2  0.990432
37  28.4         8.2             1           2  0.977408
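
The inter1 values above can be reproduced by hand from the model1 coefficients in Out[21]. The following is a minimal sketch in plain Python, assuming the rounded coefficients printed in the summary. The helper name logistic_probability is ours and is not part of statsmodels. Small differences in the last digits are expected because the summary rounds the coefficients to four decimals.

```python
import math

# Rounded coefficients of model1, copied from the summary in Out[21]
INTERCEPT, B_AGE, B_EXPERIENCE = -8.9361, 0.2763, 0.5923

def logistic_probability(age, experience):
    """Hypothetical helper: P(Productivity = 1) under model1."""
    z = INTERCEPT + B_AGE * age + B_EXPERIENCE * experience
    return 1.0 / (1.0 + math.exp(-z))

# (Age, Experience, inter1) for rows 33 to 37, copied from Out[25]
rows = [
    (33.9, 6.2, 0.983732),
    (29.3, 5.5, 0.918087),
    (27.8, 3.4, 0.680985),
    (30.7, 8.6, 0.990432),
    (28.4, 8.2, 0.977408),
]

for age, experience, reported in rows:
    computed = logistic_probability(age, experience)
    print(f"Age={age} Experience={experience} "
          f"computed={computed:.4f} reported={reported:.4f}")
```

This is the same calculation that fitted1.predict performs for every row: a linear combination of the inputs passed through the logistic function.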
In [26]:
####The classification graph
#Draw a scatter plot that shows Age on X axis and Experience on Y-axis. Try to distinguish the two classes with colors or shapes.
fig = plt.figure()
ax2 = fig.add_subplot(111)

ax2.scatter(Emp_Productivity2.Age[Emp_Productivity2.Productivity==0],Emp_Productivity2.Experience[Emp_Productivity2.Productivity==0], s=10, c='b', marker="o", label='Productivity 0')
ax2.scatter(Emp_Productivity2.Age[Emp_Productivity2.Productivity==1],Emp_Productivity2.Experience[Emp_Productivity2.Productivity==1], s=10, c='r', marker="+", label='Productivity 1')
plt.xlim(min(Emp_Productivity2.Age), max(Emp_Productivity2.Age))
plt.ylim(min(Emp_Productivity2.Experience), max(Emp_Productivity2.Experience))
plt.legend(loc='upper left');
plt.show()
In [27]:
#Build a logistic regression model to predict Productivity using age and experience of data Emp_Productivity2
###Logistic Regression model2
import statsmodels.formula.api as sm
model2 = sm.logit(formula='Productivity ~ Age+Experience', data=Emp_Productivity2)
fitted2 = model2.fit(method="bfgs")
fitted2.summary()
Optimization terminated successfully.
         Current function value: 0.198139
         Iterations: 25
         Function evaluations: 28
         Gradient evaluations: 28
Out[27]:
Logit Regression Results
Dep. Variable:  Productivity        No. Observations:  86
Model:          Logit               Df Residuals:      83
Method:         MLE                 Df Model:          2
Date:           Tue, 15 Nov 2016    Pseudo R-squ.:     0.7137
Time:           16:09:35            Log-Likelihood:    -17.040
converged:      True                LL-Null:           -59.518
                                    LLR p-value:       3.566e-19

               coef  std err       z  P>|z|  [95.0% Conf. Int.]
Intercept   16.3184    3.966   4.114  0.000    8.545   24.092
Age         -0.3994    0.135  -2.949  0.003   -0.665   -0.134
Experience  -0.2440    0.189  -1.288  0.198   -0.615    0.127
In [28]:
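#covariance matrix of the coefficient estimates
cov_params=fitted2.normalized_cov_params
print(cov_params)
#getting slope and intercept of the decision boundary line
slope2=fitted2.params['Age']/(-fitted2.params['Experience'])
intercept2=fitted2.params['Intercept']/(-fitted2.params['Experience'])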
            Intercept       Age  Experience
Intercept   15.730481 -0.497397    0.183390
Age         -0.497397  0.018339   -0.014873
Experience   0.183390 -0.014873    0.035860
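
The slope and intercept used for the decision boundary come from setting the linear predictor to zero, which is where the predicted probability equals 0.5. The sketch below works this through in plain Python, assuming the rounded model2 coefficients from Out[27]. The variable and function names are ours, and the age of 30 is an arbitrary value chosen only for illustration.

```python
import math

# Rounded coefficients of model2, copied from the summary in Out[27]
b0, b_age, b_exp = 16.3184, -0.3994, -0.2440

# On the decision boundary the linear predictor is zero:
#   b0 + b_age*Age + b_exp*Experience = 0
#   => Experience = -(b_age/b_exp)*Age - (b0/b_exp)
slope2 = b_age / (-b_exp)
intercept2 = b0 / (-b_exp)

def probability(age, experience):
    """P(Productivity = 1) under model2 for one employee."""
    z = b0 + b_age * age + b_exp * experience
    return 1.0 / (1.0 + math.exp(-z))

age = 30.0  # arbitrary age, used only to pick a point on the line
experience_on_line = slope2 * age + intercept2

print("slope2 =", round(slope2, 4))
print("intercept2 =", round(intercept2, 4))
# A point on the line should score a probability of about 0.5
print("probability on the line =", probability(age, experience_on_line))
```

Because both model2 coefficients are negative, points below this line get a probability above 0.5 and points above it get a probability below 0.5.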
In [29]:
#Finally draw the decision boundary for this logistic regression model
import matplotlib.pyplot as plt

fig = plt.figure()
ax2 = fig.add_subplot(111)

ax2.scatter(Emp_Productivity2.Age[Emp_Productivity2.Productivity==0],Emp_Productivity2.Experience[Emp_Productivity2.Productivity==0], s=10, c='b', marker="o", label='Productivity 0')
ax2.scatter(Emp_Productivity2.Age[Emp_Productivity2.Productivity==1],Emp_Productivity2.Experience[Emp_Productivity2.Productivity==1], s=10, c='r', marker="+", label='Productivity 1')
plt.xlim(min(Emp_Productivity2.Age), max(Emp_Productivity2.Age))
plt.ylim(min(Emp_Productivity2.Experience), max(Emp_Productivity2.Experience))
plt.legend(loc='upper left');

x_min, x_max = ax2.get_xlim()
y_min,y_max=ax2.get_ylim()
ax2.plot([x_min, x_max], [x_min*slope2+intercept2, x_max*slope2+intercept2])
plt.show()
In [30]:
#Calculate the prediction probabilities from model2 for all the observations. Store the probabilities in inter2 variable
Emp_Productivity_raw['inter2']=fitted2.predict(Emp_Productivity_raw[["Age","Experience"]])
In [31]:
###Confusion matrix, accuracy and error of model2

#Predicting values
predicted_values=fitted2.predict(Emp_Productivity2[["Age","Experience"]])
predicted_values[1:10]

#Let's convert them to classes using a threshold
threshold=0.5

import numpy as np
predicted_class=np.zeros(predicted_values.shape)
predicted_class[predicted_values>threshold]=1

#Predicted classes
predicted_class[1:10]

#Rows of the confusion matrix are actual classes, columns are predicted classes
from sklearn.metrics import confusion_matrix as cm
ConfusionMatrix = cm(Emp_Productivity2['Productivity'],predicted_class)
print('Confusion Matrix : ',ConfusionMatrix)
accuracy=(ConfusionMatrix[0,0]+ConfusionMatrix[1,1])/ConfusionMatrix.sum()
print('Accuracy : ', accuracy)

error=1-accuracy
print('Error : ', error)
Confusion Matrix :  [[43  2]
 [ 2 39]]
Accuracy :  0.953488372093
Error :  0.046511627907
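
The thresholding and accuracy arithmetic can also be written out without sklearn. The following is a small sketch in plain Python. The helper names and the toy labels and probabilities are made up for illustration, while the 2x2 matrix for model2 is the one reported above.

```python
def to_classes(probabilities, threshold=0.5):
    """Convert probabilities to 0/1 classes using a threshold."""
    return [1 if p > threshold else 0 for p in probabilities]

def confusion_matrix_2x2(actual, predicted):
    """Rows are actual classes, columns are predicted classes."""
    matrix = [[0, 0], [0, 0]]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix

def accuracy_of(matrix):
    """Share of observations on the diagonal of the matrix."""
    total = sum(sum(row) for row in matrix)
    return (matrix[0][0] + matrix[1][1]) / total

# Made-up labels and probabilities, only to exercise the helpers
toy_actual = [0, 0, 1, 1, 1]
toy_probs = [0.10, 0.70, 0.80, 0.40, 0.95]
toy_matrix = confusion_matrix_2x2(toy_actual, to_classes(toy_probs))
print(toy_matrix)  # [[1, 1], [1, 2]]

# The confusion matrix reported above for model2
model2_matrix = [[43, 2], [2, 39]]
accuracy = accuracy_of(model2_matrix)
print("Accuracy :", accuracy)
print("Error :", 1 - accuracy)
```

For model2 this gives (43 + 39) / 86, which matches the accuracy printed by the notebook cell.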

Now that both models have been created, let's try to combine them.

In [32]:
#plotting the new columns
import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(111)

ax.scatter(Emp_Productivity_raw.inter1[Emp_Productivity_raw.Productivity==0],Emp_Productivity_raw.inter2[Emp_Productivity_raw.Productivity==0], s=50, c='b', marker="o", label='Productivity 0')
ax.scatter(Emp_Productivity_raw.inter1[Emp_Productivity_raw.Productivity==1],Emp_Productivity_raw.inter2[Emp_Productivity_raw.Productivity==1], s=50, c='r', marker="+", label='Productivity 1')

plt.xlim(min(Emp_Productivity_raw.inter1), max(Emp_Productivity_raw.inter1)+0.2)
plt.ylim(min(Emp_Productivity_raw.inter2), max(Emp_Productivity_raw.inter2)+0.2)


plt.legend(loc='lower left');
plt.show()
In [33]:
###Logistic Regression model with intermediate outputs as input
import statsmodels.formula.api as sm

model_combined = sm.logit(formula='Productivity ~ inter1+inter2', data=Emp_Productivity_raw)
fitted_combined = model_combined.fit(method="bfgs")
fitted_combined.summary()
Optimization terminated successfully.
         Current function value: 0.208985
         Iterations: 26
         Function evaluations: 27
         Gradient evaluations: 27
Out[33]:
Logit Regression Results
Dep. Variable:  Productivity        No. Observations:  119
Model:          Logit               Df Residuals:      116
Method:         MLE                 Df Model:          2
Date:           Tue, 15 Nov 2016    Pseudo R-squ.:     0.6805
Time:           16:09:54            Log-Likelihood:    -24.869
converged:      True                LL-Null:           -77.848
                                    LLR p-value:       9.805e-24

               coef  std err       z  P>|z|  [95.0% Conf. Int.]
Intercept  -12.2134    1.907  -6.405  0.000  -15.951   -8.476
inter1       8.0193    1.409   5.693  0.000    5.258   10.780
inter2       8.5983    1.509   5.697  0.000    5.640   11.556
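
One way to read the consolidated model is as a two-stage scoring function: Age and Experience go through model1 and model2, and the two resulting probabilities go through the combined model. The sketch below chains the three fitted equations in plain Python, assuming the rounded coefficients from Out[21], Out[27] and Out[33]. The function names are ours, the first employee is row 33 from Out[25], and the other two employees are hypothetical values chosen only to illustrate the behaviour.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Rounded coefficients copied from Out[21], Out[27] and Out[33]
MODEL1 = (-8.9361, 0.2763, 0.5923)      # Intercept, Age, Experience
MODEL2 = (16.3184, -0.3994, -0.2440)    # Intercept, Age, Experience
COMBINED = (-12.2134, 8.0193, 8.5983)   # Intercept, inter1, inter2

def score(age, experience):
    """Return (inter1, inter2, final probability) for one employee."""
    inter1 = sigmoid(MODEL1[0] + MODEL1[1] * age + MODEL1[2] * experience)
    inter2 = sigmoid(MODEL2[0] + MODEL2[1] * age + MODEL2[2] * experience)
    final = sigmoid(COMBINED[0] + COMBINED[1] * inter1 + COMBINED[2] * inter2)
    return inter1, inter2, final

employees = [
    (33.9, 6.2),    # row 33 of Out[25], actual Productivity = 1
    (20.0, 2.0),    # hypothetical: low age and experience
    (50.0, 15.0),   # hypothetical: high age and experience
]

results = [score(age, experience) for age, experience in employees]
for (age, experience), (inter1, inter2, final) in zip(employees, results):
    predicted = 1 if final > 0.5 else 0
    print(f"Age={age} Experience={experience} inter1={inter1:.3f} "
          f"inter2={inter2:.3f} final={final:.3f} class={predicted}")
```

With an intercept near -12.2 and coefficients near 8 on each intermediate output, the final probability exceeds 0.5 only when inter1 and inter2 are both high. In this sketch only the first employee is classified as productive, while the two hypothetical employees each score low on one of the two intermediate outputs and are classified as 0.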
In [34]:
#covariance matrix of the coefficient estimates
cov_params=fitted_combined.normalized_cov_params
print(cov_params)
# getting slope and intercept of the decision boundary line
slope_combined=fitted_combined.params['inter1']/(-fitted_combined.params['inter2'])
intercept_combined=fitted_combined.params['Intercept']/(-fitted_combined.params['inter2'])
           Intercept    inter1    inter2
Intercept   3.635572 -2.326774 -2.637054
inter1     -2.326774  1.984538  1.413297
inter2     -2.637054  1.413297  2.277541
In [35]:
#Finally draw the decision boundary for this logistic regression model
import matplotlib.pyplot as plt

fig = plt.figure()
ax2 = fig.add_subplot(111)

ax2.scatter(Emp_Productivity_raw.inter1[Emp_Productivity_raw.Productivity==0],Emp_Productivity_raw.inter2[Emp_Productivity_raw.Productivity==0], s=10, c='b', marker="o", label='Productivity 0')
ax2.scatter(Emp_Productivity_raw.inter1[Emp_Productivity_raw.Productivity==1],Emp_Productivity_raw.inter2[Emp_Productivity_raw.Productivity==1], s=10, c='r', marker="+", label='Productivity 1')

plt.xlim(min(Emp_Productivity_raw.inter1), max(Emp_Productivity_raw.inter1)+0.2)
plt.ylim(min(Emp_Productivity_raw.inter2), max(Emp_Productivity_raw.inter2)+0.2)

plt.legend(loc='lower left');

x_min, x_max = ax2.get_xlim()
y_min,y_max=ax2.get_ylim()
ax2.plot([x_min, x_max], [x_min*slope_combined+intercept_combined, x_max*slope_combined+intercept_combined])
plt.show()
In [36]:
#### Confusion matrix, accuracy and error of the consolidated model
#Predicting values
predicted_values=fitted_combined.predict(Emp_Productivity_raw[["inter1","inter2"]])
predicted_values[1:10]

#Let's convert them to classes using a threshold
threshold=0.5

import numpy as np
predicted_class=np.zeros(predicted_values.shape)
predicted_class[predicted_values>threshold]=1

#Predicted classes
predicted_class[1:10]

#Rows of the confusion matrix are actual classes, columns are predicted classes
from sklearn.metrics import confusion_matrix as cm
ConfusionMatrix = cm(Emp_Productivity_raw['Productivity'],predicted_class)
print('Confusion Matrix : ',ConfusionMatrix)

accuracy=(ConfusionMatrix[0,0]+ConfusionMatrix[1,1])/ConfusionMatrix.sum()
print('Accuracy : ', accuracy)

error=1-accuracy
print('Error : ', error)
Confusion Matrix :  [[74  2]
 [ 4 39]]
Accuracy :  0.949579831933
Error :  0.0504201680672

The consolidated model built on the intermediate outputs reaches an accuracy of 94.96% (error rate 5.04%) on all 119 observations.

Next post is about neural network intuition.

Link to the next post : https://statinfer.com/204-5-6-neural-network-intuition/
