204.2.3 Multiple Logistic Regression

What happens if we have more than one input variable?

Link to the previous post: https://statinfer.com/204-2-2-logistic-function-to-regression/

In the last post, 204.2.2, we built a logistic regression model with a single input variable. In this post we will use multiple input variables.

Multiple Logistic Regression

  • The dependent variable is binary
  • Instead of a single independent/predictor variable, we have multiple predictors
  • For example, buying vs. non-buying depends on customer attributes such as age, gender, place, and income
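With multiple predictors, the logistic function is applied to a linear combination of all the inputs instead of a single term. A minimal sketch of the idea (the two attributes and all coefficients below are made-up illustrative values, not estimates from any fitted model):

```python
import numpy as np

def predict_probability(x, coefs, intercept):
    # Linear combination of all predictors, squashed through the
    # logistic (sigmoid) function to give a probability in (0, 1)
    linear = intercept + np.dot(coefs, x)
    return 1.0 / (1.0 + np.exp(-linear))

# Hypothetical customer described by two attributes (age, income in
# thousands); coefficients are illustrative only
p = predict_probability(np.array([35.0, 60.0]),
                        coefs=np.array([0.02, 0.01]),
                        intercept=-2.0)
print(p)
```

However many predictors there are, the output is still a single probability between 0 and 1, which is what makes the model suitable for a binary outcome like buying vs. non-buying.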

Practice : Multiple Logistic Regression

  • Dataset: Fiberbits/Fiberbits.csv
    • The Active_cust variable indicates whether the customer is active or has already left the network.
  • Build a model to predict the chance of attrition for a given customer using all the features.
  • How good is your model?
  • What are the most impacting variables?
In [15]:
#Dataset: Fiberbits/Fiberbits.csv
import pandas as pd

Fiber=pd.read_csv("datasets\\Fiberbits\\Fiberbits.csv")
list(Fiber.columns.values)
Out[15]:
['active_cust',
 'income',
 'months_on_network',
 'Num_complaints',
 'number_plan_changes',
 'relocated',
 'monthly_bill',
 'technical_issues_per_month',
 'Speed_test_result']
In [16]:
#Build a model to predict the chance of attrition for a given customer using all the features.
from sklearn.linear_model import LogisticRegression
logistic = LogisticRegression()
###fitting logistic regression for active customer on the rest of the variables#######
logistic.fit(Fiber[['income', 'months_on_network', 'Num_complaints',
                    'number_plan_changes', 'relocated', 'monthly_bill',
                    'technical_issues_per_month', 'Speed_test_result']],
             Fiber['active_cust'])

Out[16]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
In [17]:
predict1 = logistic.predict(Fiber[['income', 'months_on_network', 'Num_complaints',
                                   'number_plan_changes', 'relocated', 'monthly_bill',
                                   'technical_issues_per_month', 'Speed_test_result']])
predict1
Out[17]:
array([1, 1, 1, ..., 1, 1, 1], dtype=int64)
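`predict` returns only the hard 0/1 labels shown above; `predict_proba` exposes the underlying probabilities, which is often what "chance of attrition" really asks for. A small sketch on synthetic data (a stand-in for the Fiberbits features, which are not reproduced here):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in: 200 customers, 2 features, binary target
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

model = LogisticRegression().fit(X, y)
hard = model.predict(X)        # 0/1 labels, like predict1 above
soft = model.predict_proba(X)  # column 0: P(class 0), column 1: P(class 1)
```

Each row of `predict_proba` sums to 1, and `predict` simply picks the class whose probability is higher.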
In [18]:
#How good is your model?
### calculate confusion matrix
import numpy as np
from sklearn.metrics import confusion_matrix ###for using confusion matrix###
cm1 = confusion_matrix(Fiber[['active_cust']],predict1)
print(cm1)
[[29210 12931]
 [10183 47676]]
In [19]:
total1=sum(sum(cm1))
total1
Out[19]:
100000
In [20]:
#####from confusion matrix calculate accuracy
accuracy1=(cm1[0,0]+cm1[1,1])/total1
accuracy1
Out[20]:
0.76885999999999999
In [21]:
#What are the most impacting variables?
#### From summary of the model

import statsmodels.api as sm

logit1 = sm.Logit(Fiber['active_cust'],
                  Fiber[['income', 'months_on_network', 'Num_complaints',
                         'number_plan_changes', 'relocated', 'monthly_bill',
                         'technical_issues_per_month', 'Speed_test_result']])
logit1
Out[21]:
<statsmodels.discrete.discrete_model.Logit at 0x203bc8be550>
In [22]:
result1=logit1.fit()
Optimization terminated successfully.
         Current function value: 0.517172
         Iterations 7
In [23]:
result1.summary()
Out[23]:
Logit Regression Results
Dep. Variable: active_cust No. Observations: 100000
Model: Logit Df Residuals: 99992
Method: MLE Df Model: 7
Date: Sun, 16 Oct 2016 Pseudo R-squ.: 0.2403
Time: 14:35:49 Log-Likelihood: -51717.
converged: True LL-Null: -68074.
LLR p-value: 0.000
coef std err z P>|z| [95.0% Conf. Int.]
income 1.71e-05 4.17e-06 4.097 0.000 8.92e-06 2.53e-05
months_on_network 0.0150 0.000 31.172 0.000 0.014 0.016
Num_complaints -1.7669 0.027 -65.284 0.000 -1.820 -1.714
number_plan_changes -0.1784 0.007 -23.909 0.000 -0.193 -0.164
relocated -3.0826 0.040 -76.259 0.000 -3.162 -3.003
monthly_bill -0.0024 0.000 -16.014 0.000 -0.003 -0.002
technical_issues_per_month -0.4636 0.007 -64.010 0.000 -0.478 -0.449
Speed_test_result 0.1094 0.001 75.073 0.000 0.107 0.112
In [24]:
result1.summary2()
Out[24]:
Model: Logit Pseudo R-squared: 0.240
Dependent Variable: active_cust AIC: 103450.4420
Date: 2016-10-16 14:35 BIC: 103526.5454
No. Observations: 100000 Log-Likelihood: -51717.
Df Model: 7 LL-Null: -68074.
Df Residuals: 99992 LLR p-value: 0.0000
Converged: 1.0000 Scale: 1.0000
No. Iterations: 7.0000
Coef. Std.Err. z P>|z| [0.025 0.975]
income 0.0000 0.0000 4.0973 0.0000 0.0000 0.0000
months_on_network 0.0150 0.0005 31.1715 0.0000 0.0141 0.0159
Num_complaints -1.7669 0.0271 -65.2837 0.0000 -1.8199 -1.7138
number_plan_changes -0.1784 0.0075 -23.9093 0.0000 -0.1930 -0.1638
relocated -3.0826 0.0404 -76.2589 0.0000 -3.1618 -3.0034
monthly_bill -0.0024 0.0002 -16.0138 0.0000 -0.0027 -0.0021
technical_issues_per_month -0.4636 0.0072 -64.0101 0.0000 -0.4778 -0.4494
Speed_test_result 0.1094 0.0015 75.0729 0.0000 0.1065 0.1122

Since p < 0.05 for every variable, all of them have a statistically significant impact on customer attrition.
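One way to judge which variables impact attrition most is through odds ratios: exponentiating a coefficient gives the multiplicative change in the odds of the customer being active per unit increase in that predictor. Using a few coefficients copied from the summary above:

```python
import numpy as np

# Coefficients taken from the summary() output above
coefs = {
    'relocated': -3.0826,
    'Num_complaints': -1.7669,
    'technical_issues_per_month': -0.4636,
    'Speed_test_result': 0.1094,
}

# exp(coef) = multiplicative change in the odds of being an active
# customer for a one-unit increase in the predictor
odds_ratios = {name: float(np.exp(b)) for name, b in coefs.items()}
print(odds_ratios)
```

Relocation multiplies the odds of staying active by roughly 0.05, so relocated customers are far more likely to churn, while a higher speed test result increases the odds of staying (odds ratio above 1).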

The next post is about goodness of fit for logistic regression.

Link to the next post: https://statinfer.com/204-2-4-goodness-of-fit-for-logistic-regression/
