Link to the previous post : https://statinfer.com/204-5-4-issue-with-non-linear-decision-boundary/

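In this post, we will implement the concept of intermediate outputs in Python. Two logistic regression models are built on different subsets of the data, and their predicted probabilities are then used as the inputs of a third, consolidated model.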

Practice : Intermediate output

Dataset: Emp_Productivity/ Emp_Productivity_All_Sites.csv
Filter the data with the condition Sample_Set<3 to take the first 74 observations from the above dataset.
Build a logistic regression model to predict Productivity using Age and Experience.
Calculate the prediction probabilities for all the inputs. Store the probabilities in the inter1 variable.
Filter the data with the condition Sample_Set>1 to take the observations from row 34 onwards.
Build a second logistic regression model to predict Productivity using Age and Experience.
Calculate the prediction probabilities for all the inputs. Store the probabilities in the inter2 variable.
Build a consolidated model to predict Productivity using the inter1 and inter2 variables.
Create the confusion matrix and find the accuracy and error rate for the consolidated model.

Our sample Emp_Productivity1 holds the first 74 observations. Let's build the first model on this sample (sample 1).

In [20]:

#Filter the data and take a subset from whole dataset . Filter condition is Sample_Set<3
Emp_Productivity1=Emp_Productivity_raw[Emp_Productivity_raw.Sample_Set<3]
Emp_Productivity1.shape

Out[20]:

(74, 4)

In [21]:

#Building a Logistic regression model1 to predict Productivity using age and experience
import statsmodels.formula.api as sm
model1 = sm.logit(formula='Productivity ~ Age+Experience', data=Emp_Productivity1)
fitted1 = model1.fit()
fitted1.summary()

Optimization terminated successfully.
         Current function value: 0.315987
         Iterations 7

Out[21]:

Logit Regression Results
Dep. Variable:	Productivity	No. Observations:	74
Model:	Logit	Df Residuals:	71
Method:	MLE	Df Model:	2
Date:	Tue, 15 Nov 2016	Pseudo R-squ.:	0.5402
Time:	16:09:16	Log-Likelihood:	-23.383
converged:	True	LL-Null:	-50.860
		LLR p-value:	1.167e-12

	coef	std err	z	P>|z|	[95.0% Conf. Int.]
Intercept	-8.9361	2.061	-4.335	0.000	-12.976 -4.896
Age	0.2763	0.105	2.620	0.009	0.070 0.483
Experience	0.5923	0.298	1.988	0.047	0.008 1.176

In [22]:

#Drawing the decision boundary for this logistic regression model
import matplotlib.pyplot as plt

fig = plt.figure()
ax1 = fig.add_subplot(111)

ax1.scatter(Emp_Productivity1.Age[Emp_Productivity1.Productivity==0],Emp_Productivity1.Experience[Emp_Productivity1.Productivity==0], s=10, c='b', marker="o", label='Productivity 0')
ax1.scatter(Emp_Productivity1.Age[Emp_Productivity1.Productivity==1],Emp_Productivity1.Experience[Emp_Productivity1.Productivity==1], s=10, c='r', marker="+", label='Productivity 1')

plt.xlim(min(Emp_Productivity1.Age), max(Emp_Productivity1.Age))
plt.ylim(min(Emp_Productivity1.Experience), max(Emp_Productivity1.Experience))
plt.legend(loc='upper left');

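#Getting slope and intercept of the boundary line: Experience = slope1*Age + intercept1
slope1=fitted1.params['Age']/(-fitted1.params['Experience'])
intercept1=fitted1.params['Intercept']/(-fitted1.params['Experience'])

x_min, x_max = ax1.get_xlim()
ax1.plot([x_min, x_max], [x_min*slope1+intercept1, x_max*slope1+intercept1])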
plt.show()

In [23]:

# Calculating prediction probabilities from model1 for every row and storing them in the inter1 column
Emp_Productivity_raw['inter1'] = fitted1.predict(Emp_Productivity_raw[["Age","Experience"]])

For Sample_Set > 1 :

In [24]:

# Filter the data and take observations from row 34 onwards. Filter condition is Sample_Set>1
Emp_Productivity2=Emp_Productivity_raw[Emp_Productivity_raw.Sample_Set>1]
Emp_Productivity2.shape

Out[24]:

(86, 5)

In [25]:

Emp_Productivity2.head()

Out[25]:

	Age	Experience	Productivity	Sample_Set	inter1
33	33.9	6.2	1	2	0.983732
34	29.3	5.5	1	2	0.918087
35	27.8	3.4	1	2	0.680985
36	30.7	8.6	1	2	0.990432
37	28.4	8.2	1	2	0.977408
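
The inter1 values in this table can be checked by hand. The following is a minimal sketch, assuming the rounded coefficients printed in the model1 summary, so the results agree with the table only to about three decimals. The helper name predict_prob is hypothetical and is not part of statsmodels.

```python
import math

# Rounded coefficients copied from the model1 summary table (assumption: rounding
# means the probabilities match the notebook output only approximately)
INTERCEPT1, AGE1, EXPERIENCE1 = -8.9361, 0.2763, 0.5923

def predict_prob(age, experience):
    """Logistic model: p = 1 / (1 + exp(-(b0 + b1*Age + b2*Experience)))."""
    z = INTERCEPT1 + AGE1 * age + EXPERIENCE1 * experience
    return 1.0 / (1.0 + math.exp(-z))

# First three rows of Emp_Productivity2.head(): (Age, Experience, reported inter1)
rows = [(33.9, 6.2, 0.983732), (29.3, 5.5, 0.918087), (27.8, 3.4, 0.680985)]
for age, experience, reported in rows:
    print(age, experience, round(predict_prob(age, experience), 4), reported)
```

This is the same calculation that fitted1.predict performs for each row, only with the coefficients typed in by hand.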

In [26]:

####The classification graph
#Draw a scatter plot that shows Age on X axis and Experience on Y-axis. Try to distinguish the two classes with colors or shapes.
fig = plt.figure()
ax2 = fig.add_subplot(111)

ax2.scatter(Emp_Productivity2.Age[Emp_Productivity2.Productivity==0],Emp_Productivity2.Experience[Emp_Productivity2.Productivity==0], s=10, c='b', marker="o", label='Productivity 0')
ax2.scatter(Emp_Productivity2.Age[Emp_Productivity2.Productivity==1],Emp_Productivity2.Experience[Emp_Productivity2.Productivity==1], s=10, c='r', marker="+", label='Productivity 1')
plt.xlim(min(Emp_Productivity2.Age), max(Emp_Productivity2.Age))
plt.ylim(min(Emp_Productivity2.Experience), max(Emp_Productivity2.Experience))
plt.legend(loc='upper left');
plt.show()

In [27]:

#Build a logistic regression model to predict Productivity using age and experience of data Emp_Productivity2
###Logistic Regression model2
import statsmodels.formula.api as sm
model2 = sm.logit(formula='Productivity ~ Age+Experience', data=Emp_Productivity2)
fitted2 = model2.fit(method="bfgs")
fitted2.summary()

Optimization terminated successfully.
         Current function value: 0.198139
         Iterations: 25
         Function evaluations: 28
         Gradient evaluations: 28

Out[27]:

Logit Regression Results
Dep. Variable:	Productivity	No. Observations:	86
Model:	Logit	Df Residuals:	83
Method:	MLE	Df Model:	2
Date:	Tue, 15 Nov 2016	Pseudo R-squ.:	0.7137
Time:	16:09:35	Log-Likelihood:	-17.040
converged:	True	LL-Null:	-59.518
		LLR p-value:	3.566e-19

	coef	std err	z	P>|z|	[95.0% Conf. Int.]
Intercept	16.3184	3.966	4.114	0.000	8.545 24.092
Age	-0.3994	0.135	-2.949	0.003	-0.665 -0.134
Experience	-0.2440	0.189	-1.288	0.198	-0.615 0.127

In [28]:

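#Normalized covariance matrix of the coefficient estimates (not the coefficients themselves)
cov_params=fitted2.normalized_cov_params
print(cov_params)
#Getting slope and intercept of the boundary line: Experience = slope2*Age + intercept2
slope2=fitted2.params['Age']/(-fitted2.params['Experience'])
intercept2=fitted2.params['Intercept']/(-fitted2.params['Experience'])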

            Intercept       Age  Experience
Intercept   15.730481 -0.497397    0.183390
Age         -0.497397  0.018339   -0.014873
Experience   0.183390 -0.014873    0.035860
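
The slope and intercept come from setting the linear predictor to zero. The predicted probability is exactly 0.5 where Intercept + b_Age*Age + b_Exp*Experience = 0, which rearranges to Experience = -(b_Age/b_Exp)*Age - Intercept/b_Exp. The sketch below assumes the rounded model2 coefficients from the summary table. The variable and function names are illustrative only.

```python
import math

# Rounded coefficients copied from the model2 summary table (assumption)
b0, b_age, b_exp = 16.3184, -0.3994, -0.2440

# Boundary: b0 + b_age*Age + b_exp*Experience = 0, solved for Experience
slope = b_age / (-b_exp)
intercept = b0 / (-b_exp)

def prob(age, experience):
    """Predicted probability of Productivity = 1 under the logistic model."""
    z = b0 + b_age * age + b_exp * experience
    return 1.0 / (1.0 + math.exp(-z))

# Any point on the boundary line should get a predicted probability of 0.5
age = 35.0
experience_on_line = slope * age + intercept
print(slope, intercept, prob(age, experience_on_line))
```

Points on one side of this line get a probability above 0.5 and points on the other side get a probability below it, which is why the line is drawn as the decision boundary.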

In [29]:

#Finally draw the decision boundary for this logistic regression model
import matplotlib.pyplot as plt

fig = plt.figure()
ax2 = fig.add_subplot(111)

ax2.scatter(Emp_Productivity2.Age[Emp_Productivity2.Productivity==0],Emp_Productivity2.Experience[Emp_Productivity2.Productivity==0], s=10, c='b', marker="o", label='Productivity 0')
ax2.scatter(Emp_Productivity2.Age[Emp_Productivity2.Productivity==1],Emp_Productivity2.Experience[Emp_Productivity2.Productivity==1], s=10, c='r', marker="+", label='Productivity 1')
plt.xlim(min(Emp_Productivity2.Age), max(Emp_Productivity2.Age))
plt.ylim(min(Emp_Productivity2.Experience), max(Emp_Productivity2.Experience))
plt.legend(loc='upper left');

x_min, x_max = ax2.get_xlim()
y_min,y_max=ax2.get_ylim()
ax2.plot([x_min, x_max], [x_min*slope2+intercept2, x_max*slope2+intercept2])
plt.show()

In [30]:

#Calculate the prediction probabilities for all the inputs. Store the probabilities in inter2 variable for data Emp_Productivity2
Emp_Productivity_raw['inter2']=fitted2.predict(Emp_Productivity_raw[["Age","Experience"]])

In [31]:

###Confusion matrix, Accuracy and error of the model2

#Predicting values
predicted_values=fitted2.predict(Emp_Productivity2[["Age","Experience"]])
predicted_values[1:10]

#Let's convert them to classes using a threshold
threshold=0.5

import numpy as np
predicted_class=np.zeros(predicted_values.shape)
predicted_class[predicted_values>threshold]=1

#Predicted classes
predicted_class[1:10]

from sklearn.metrics import confusion_matrix as cm
#Rows are actual classes, columns are predicted classes
ConfusionMatrix = cm(Emp_Productivity2['Productivity'],predicted_class)
print('Confusion Matrix : ',ConfusionMatrix)
accuracy=(ConfusionMatrix[0,0]+ConfusionMatrix[1,1])/ConfusionMatrix.sum()
print('Accuracy : ', accuracy)

error=1-accuracy
print('Error : ', error)

Confusion Matrix :  [[43  2]
 [ 2 39]]
Accuracy :  0.953488372093
Error :  0.046511627907
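
The thresholding and accuracy arithmetic can be reproduced without any library. This is a small sketch, assuming the scikit-learn layout in which rows are actual classes and columns are predicted classes. The helper names and the toy probabilities are made up for illustration. Only the matrix [[43, 2], [2, 39]] comes from the output above.

```python
def confusion_matrix_2x2(actual, predicted):
    """Rows are actual classes (0, 1); columns are predicted classes (0, 1)."""
    matrix = [[0, 0], [0, 0]]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix

def accuracy(matrix):
    """Share of observations on the diagonal (correctly classified)."""
    correct = matrix[0][0] + matrix[1][1]
    total = sum(sum(row) for row in matrix)
    return correct / total

# Toy probabilities and labels (made up for illustration, not from the dataset)
probs = [0.9, 0.2, 0.7, 0.4, 0.6]
actual = [1, 0, 1, 1, 0]
predicted = [1 if p > 0.5 else 0 for p in probs]
toy = confusion_matrix_2x2(actual, predicted)
print(toy, accuracy(toy))

# The matrix reported for model2: 82 of 86 observations are on the diagonal
reported = [[43, 2], [2, 39]]
print(accuracy(reported), 1 - accuracy(reported))
```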

Now that both models have been created, let's try to combine them.

In [32]:

#plotting the new columns
import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(111)

ax.scatter(Emp_Productivity_raw.inter1[Emp_Productivity_raw.Productivity==0],Emp_Productivity_raw.inter2[Emp_Productivity_raw.Productivity==0], s=50, c='b', marker="o", label='Productivity 0')
ax.scatter(Emp_Productivity_raw.inter1[Emp_Productivity_raw.Productivity==1],Emp_Productivity_raw.inter2[Emp_Productivity_raw.Productivity==1], s=50, c='r', marker="+", label='Productivity 1')

plt.xlim(min(Emp_Productivity_raw.inter1), max(Emp_Productivity_raw.inter1)+0.2)
plt.ylim(min(Emp_Productivity_raw.inter2), max(Emp_Productivity_raw.inter2)+0.2)


plt.legend(loc='lower left');
plt.show()

In [33]:

###Logistic Regression model with intermediate outputs as inputs
import statsmodels.formula.api as sm

model_combined = sm.logit(formula='Productivity ~ inter1+inter2', data=Emp_Productivity_raw)
fitted_combined = model_combined.fit(method="bfgs")
fitted_combined.summary()

Optimization terminated successfully.
         Current function value: 0.208985
         Iterations: 26
         Function evaluations: 27
         Gradient evaluations: 27

Out[33]:

Logit Regression Results
Dep. Variable:	Productivity	No. Observations:	119
Model:	Logit	Df Residuals:	116
Method:	MLE	Df Model:	2
Date:	Tue, 15 Nov 2016	Pseudo R-squ.:	0.6805
Time:	16:09:54	Log-Likelihood:	-24.869
converged:	True	LL-Null:	-77.848
		LLR p-value:	9.805e-24

	coef	std err	z	P>|z|	[95.0% Conf. Int.]
Intercept	-12.2134	1.907	-6.405	0.000	-15.951 -8.476
inter1	8.0193	1.409	5.693	0.000	5.258 10.780
inter2	8.5983	1.509	5.697	0.000	5.640 11.556

In [34]:

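#Normalized covariance matrix of the coefficient estimates (not the coefficients themselves)
cov_params=fitted_combined.normalized_cov_params
print(cov_params)
#Getting slope and intercept of the boundary line: inter2 = slope_combined*inter1 + intercept_combined
slope_combined=fitted_combined.params['inter1']/(-fitted_combined.params['inter2'])
intercept_combined=fitted_combined.params['Intercept']/(-fitted_combined.params['inter2'])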

           Intercept    inter1    inter2
Intercept   3.635572 -2.326774 -2.637054
inter1     -2.326774  1.984538  1.413297
inter2     -2.637054  1.413297  2.277541

In [35]:

#Finally draw the decision boundary for this logistic regression model
import matplotlib.pyplot as plt

fig = plt.figure()
ax2 = fig.add_subplot(111)

ax2.scatter(Emp_Productivity_raw.inter1[Emp_Productivity_raw.Productivity==0],Emp_Productivity_raw.inter2[Emp_Productivity_raw.Productivity==0], s=10, c='b', marker="o", label='Productivity 0')
ax2.scatter(Emp_Productivity_raw.inter1[Emp_Productivity_raw.Productivity==1],Emp_Productivity_raw.inter2[Emp_Productivity_raw.Productivity==1], s=10, c='r', marker="+", label='Productivity 1')

plt.xlim(min(Emp_Productivity_raw.inter1), max(Emp_Productivity_raw.inter1)+0.2)
plt.ylim(min(Emp_Productivity_raw.inter2), max(Emp_Productivity_raw.inter2)+0.2)

plt.legend(loc='lower left');

x_min, x_max = ax2.get_xlim()
y_min,y_max=ax2.get_ylim()
ax2.plot([x_min, x_max], [x_min*slope_combined+intercept_combined, x_max*slope_combined+intercept_combined])
plt.show()

In [36]:

#### Confusion matrix, accuracy and error of the consolidated model
#Predicting values
predicted_values=fitted_combined.predict(Emp_Productivity_raw[["inter1","inter2"]])
predicted_values[1:10]

#Let's convert them to classes using a threshold
threshold=0.5

import numpy as np
predicted_class=np.zeros(predicted_values.shape)
predicted_class[predicted_values>threshold]=1

#Predicted classes
predicted_class[1:10]

from sklearn.metrics import confusion_matrix as cm
#Rows are actual classes, columns are predicted classes
ConfusionMatrix = cm(Emp_Productivity_raw['Productivity'],predicted_class)
print('Confusion Matrix : ',ConfusionMatrix)

accuracy=(ConfusionMatrix[0,0]+ConfusionMatrix[1,1])/ConfusionMatrix.sum()
print('Accuracy : ', accuracy)

error=1-accuracy
print('Error : ', error)

Confusion Matrix :  [[74  2]
 [ 4 39]]
Accuracy :  0.949579831933
Error :  0.0504201680672

The consolidated model built on the intermediate outputs reaches an accuracy of about 94.96% (113 of the 119 observations are classified correctly).
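
To see how the three models fit together, the whole chain can be written as one function. This is a rough sketch, assuming the rounded coefficients from the three summary tables above, so its probabilities will differ slightly from the notebook's. The function names are hypothetical, and the second and third test points are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def consolidated_prob(age, experience):
    # Intermediate outputs from the two first-stage models (rounded coefficients)
    inter1 = sigmoid(-8.9361 + 0.2763 * age + 0.5923 * experience)
    inter2 = sigmoid(16.3184 - 0.3994 * age - 0.2440 * experience)
    # The consolidated model takes the two probabilities as its inputs
    return sigmoid(-12.2134 + 8.0193 * inter1 + 8.5983 * inter2)

# (Age, Experience) pairs: the first is the row with index 33 shown earlier,
# the other two are made-up points at the low and high ends
for age, experience in [(33.9, 6.2), (20.0, 1.0), (50.0, 15.0)]:
    p = consolidated_prob(age, experience)
    print(age, experience, round(p, 3), int(p > 0.5))
```

With these coefficients the consolidated probability exceeds 0.5 only when both inter1 and inter2 are high, so the low and high points fall in class 0 and the middle point falls in class 1. That is how two straight-line boundaries combine into a non-linear one.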

The next post is about neural network intuition.

Link to the next post : https://statinfer.com/204-5-6-neural-network-intuition/

21st June 2017

204.5.5 Practice : Implementing Intermediate outputs in Python

Putting intermediate outputs into practice.

Statinfer
