204.2.3 Multiple Logistic Regression

What happens if we have more than one input variable?

Link to the previous post: https://statinfer.com/204-2-2-logistic-function-to-regression/

In the last post, 204.2.2, we built a logistic regression model with a single input variable. In this post we will use multiple input variables.

Multiple Logistic Regression

  • The dependent variable is binary
  • Instead of a single independent/predictor variable, we have multiple predictors
  • For example, buying vs. non-buying depends on customer attributes such as age, gender, place, and income
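With multiple predictors, the logistic function is applied to a linear combination of all the inputs instead of a single term. A minimal sketch of the idea (the two attributes and all coefficients below are made-up illustrative values, not estimates from any fitted model):

```python
import numpy as np

def predict_probability(x, coefs, intercept):
    # Linear combination of all predictors, squashed through the
    # logistic (sigmoid) function to give a probability in (0, 1)
    linear = intercept + np.dot(coefs, x)
    return 1.0 / (1.0 + np.exp(-linear))

# Hypothetical customer described by two attributes (age, income in
# thousands); coefficients are illustrative only
p = predict_probability(np.array([35.0, 60.0]),
                        coefs=np.array([0.02, 0.01]),
                        intercept=-2.0)
print(p)
```

However many predictors there are, the output is still a single probability between 0 and 1, which is what makes the model suitable for a binary outcome like buying vs. non-buying.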

Practice : Multiple Logistic Regression

  • Dataset: Fiberbits/Fiberbits.csv
    • The Active_cust variable indicates whether the customer is active or has already left the network.
  • Build a model to predict the chance of attrition for a given customer using all the features.
  • How good is your model?
  • What are the most impacting variables?
In [15]:
#Dataset: Fiberbits/Fiberbits.csv
import pandas as pd

Fiber=pd.read_csv("datasets\\Fiberbits\\Fiberbits.csv")
list(Fiber.columns.values)
Out[15]:
['active_cust',
 'income',
 'months_on_network',
 'Num_complaints',
 'number_plan_changes',
 'relocated',
 'monthly_bill',
 'technical_issues_per_month',
 'Speed_test_result']
In [16]:
#Build a model to predict the chance of attrition for a given customer using all the features.
from sklearn.linear_model import LogisticRegression
logistic = LogisticRegression()
###fitting logistic regression for active customer on the rest of the variables#######
logistic.fit(Fiber[['income', 'months_on_network', 'Num_complaints',
                    'number_plan_changes', 'relocated', 'monthly_bill',
                    'technical_issues_per_month', 'Speed_test_result']],
             Fiber['active_cust'])

Out[16]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
In [17]:
predict1 = logistic.predict(Fiber[['income', 'months_on_network', 'Num_complaints',
                                   'number_plan_changes', 'relocated', 'monthly_bill',
                                   'technical_issues_per_month', 'Speed_test_result']])
predict1
Out[17]:
array([1, 1, 1, ..., 1, 1, 1], dtype=int64)
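`predict` returns only the hard 0/1 labels shown above; `predict_proba` exposes the underlying probabilities, which is often what "chance of attrition" really asks for. A small sketch on synthetic data (a stand-in for the Fiberbits features, which are not reproduced here):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in: 200 customers, 2 features, binary target
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

model = LogisticRegression().fit(X, y)
hard = model.predict(X)        # 0/1 labels, like predict1 above
soft = model.predict_proba(X)  # column 0: P(class 0), column 1: P(class 1)
```

Each row of `predict_proba` sums to 1, and `predict` simply picks the class whose probability is higher.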
In [18]:
#How good is your model?
### calculate confusion matrix
import numpy as np
from sklearn.metrics import confusion_matrix ###for using confusion matrix###
cm1 = confusion_matrix(Fiber[['active_cust']],predict1)
print(cm1)
[[29210 12931]
 [10183 47676]]
In [19]:
total1=sum(sum(cm1))
total1
Out[19]:
100000
In [20]:
#####from confusion matrix calculate accuracy
accuracy1=(cm1[0,0]+cm1[1,1])/total1
accuracy1
Out[20]:
0.76885999999999999
In [21]:
#What are the most impacting variables?
#### From summary of the model

import statsmodels.api as sm

logit1 = sm.Logit(Fiber['active_cust'],
                  Fiber[['income', 'months_on_network', 'Num_complaints',
                         'number_plan_changes', 'relocated', 'monthly_bill',
                         'technical_issues_per_month', 'Speed_test_result']])
logit1
Out[21]:
<statsmodels.discrete.discrete_model.Logit at 0x203bc8be550>
In [22]:
result1=logit1.fit()
Optimization terminated successfully.
         Current function value: 0.517172
         Iterations 7
In [23]:
result1.summary()
Out[23]:
Logit Regression Results
Dep. Variable: active_cust No. Observations: 100000
Model: Logit Df Residuals: 99992
Method: MLE Df Model: 7
Date: Sun, 16 Oct 2016 Pseudo R-squ.: 0.2403
Time: 14:35:49 Log-Likelihood: -51717.
converged: True LL-Null: -68074.
LLR p-value: 0.000
coef std err z P>|z| [95.0% Conf. Int.]
income 1.71e-05 4.17e-06 4.097 0.000 8.92e-06 2.53e-05
months_on_network 0.0150 0.000 31.172 0.000 0.014 0.016
Num_complaints -1.7669 0.027 -65.284 0.000 -1.820 -1.714
number_plan_changes -0.1784 0.007 -23.909 0.000 -0.193 -0.164
relocated -3.0826 0.040 -76.259 0.000 -3.162 -3.003
monthly_bill -0.0024 0.000 -16.014 0.000 -0.003 -0.002
technical_issues_per_month -0.4636 0.007 -64.010 0.000 -0.478 -0.449
Speed_test_result 0.1094 0.001 75.073 0.000 0.107 0.112
In [24]:
result1.summary2()
Out[24]:
Model: Logit Pseudo R-squared: 0.240
Dependent Variable: active_cust AIC: 103450.4420
Date: 2016-10-16 14:35 BIC: 103526.5454
No. Observations: 100000 Log-Likelihood: -51717.
Df Model: 7 LL-Null: -68074.
Df Residuals: 99992 LLR p-value: 0.0000
Converged: 1.0000 Scale: 1.0000
No. Iterations: 7.0000
Coef. Std.Err. z P>|z| [0.025 0.975]
income 0.0000 0.0000 4.0973 0.0000 0.0000 0.0000
months_on_network 0.0150 0.0005 31.1715 0.0000 0.0141 0.0159
Num_complaints -1.7669 0.0271 -65.2837 0.0000 -1.8199 -1.7138
number_plan_changes -0.1784 0.0075 -23.9093 0.0000 -0.1930 -0.1638
relocated -3.0826 0.0404 -76.2589 0.0000 -3.1618 -3.0034
monthly_bill -0.0024 0.0002 -16.0138 0.0000 -0.0027 -0.0021
technical_issues_per_month -0.4636 0.0072 -64.0101 0.0000 -0.4778 -0.4494
Speed_test_result 0.1094 0.0015 75.0729 0.0000 0.1065 0.1122

Since p < 0.05 for every variable, all of them have a statistically significant impact on customer attrition.
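One way to judge which variables impact attrition most is through odds ratios: exponentiating a coefficient gives the multiplicative change in the odds of the customer being active per unit increase in that predictor. Using a few coefficients copied from the summary above:

```python
import numpy as np

# Coefficients taken from the summary() output above
coefs = {
    'relocated': -3.0826,
    'Num_complaints': -1.7669,
    'technical_issues_per_month': -0.4636,
    'Speed_test_result': 0.1094,
}

# exp(coef) = multiplicative change in the odds of being an active
# customer for a one-unit increase in the predictor
odds_ratios = {name: float(np.exp(b)) for name, b in coefs.items()}
print(odds_ratios)
```

Relocation multiplies the odds of staying active by roughly 0.05, so relocated customers are far more likely to churn, while a higher speed test result increases the odds of staying (odds ratio above 1).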

The next post is about goodness of fit for logistic regression.

Link to the next post: https://statinfer.com/204-2-4-goodness-of-fit-for-logistic-regression/
