Link to the previous post : https://statinfer.com/204-2-5-multicollinearity-and-individual-impact-of-variables-in-logistic-regression/

In the previous post we left out part of the discussion on goodness of fit. We cover it here and see whether we can improve our model based on AIC and BIC.

We will also cover the various methods used for model selection in a separate series dedicated to the topic.

How to improve the model

By adding more independent variables?
By deriving new variables from the available set?
By transforming variables?
By collecting more data?
How do we choose the best model from a list of fitted models with different parameters?

AIC and BIC

AIC and BIC values play a role similar to adjusted R-squared in linear regression: they reward a better fit but penalize extra parameters.
On its own, a model's AIC has no real use, but it is very helpful when we are choosing between models.
Given a collection of models for the data, AIC estimates the quality of each model relative to each of the other models.
If we are choosing between two models, the model with the lower AIC is preferred.
AIC is an estimate of the information lost when a given model is used to represent the process that generates the data.
AIC = -2ln(L) + 2k
- L is the maximum value of the likelihood function for the model.
- k is the number of estimated parameters in the model (every fitted coefficient counts, including the intercept if there is one).
BIC is an alternative to AIC with a slightly different penalty term: BIC = -2ln(L) + k ln(n), where n is the number of observations. We will follow either AIC or BIC throughout our analysis.
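
To make the two formulas concrete, here is a minimal sketch that computes AIC and BIC by hand. The function names are illustrative and not part of any library. The log-likelihood is approximated as the "Current function value" (the average negative log-likelihood, 0.517172) times the number of observations, both taken from the output of the first model in the practice section below, and k = 8 is the number of fitted coefficients in that model.

```python
import math

def aic(log_likelihood, k):
    """AIC = -2 ln(L) + 2k, where k is the number of estimated parameters."""
    return -2.0 * log_likelihood + 2.0 * k

def bic(log_likelihood, k, n):
    """BIC = -2 ln(L) + k ln(n), where n is the number of observations."""
    return -2.0 * log_likelihood + k * math.log(n)

# Values read off the m1 output: 100000 observations, 8 coefficients,
# average negative log-likelihood 0.517172 (rounded, so this is approximate).
n_obs = 100000
k_params = 8
log_lik = -0.517172 * n_obs

m1_aic = aic(log_lik, k_params)
m1_bic = bic(log_lik, k_params, n_obs)
print(round(m1_aic, 1), round(m1_bic, 1))
```

The results, about 103450.4 and 103526.5, agree with the AIC and BIC reported in the model summary to within rounding.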

Practice : Logistic Regression Model Selection

Find the AIC and BIC values for the first fiber bits model (m1).
What are the top-2 impacting variables in the fiber bits model?
What are the least impacting variables in the fiber bits model?
Can we drop any of these variables and build a new model (m2)?
Can we add any new interaction and polynomial variables to increase the accuracy of the model (m3)?
We have three models. What is the best accuracy that you can expect on this data?

In [30]:

# Find AIC and BIC values for the first fiber bits model (m1)
import statsmodels.api as sm  # already imported if you are continuing from the previous post

predictors = ['income', 'months_on_network', 'Num_complaints', 'number_plan_changes',
              'relocated', 'monthly_bill', 'technical_issues_per_month', 'Speed_test_result']
m1 = sm.Logit(Fiber['active_cust'], Fiber[predictors])
m1_fit = m1.fit()   # fit once and reuse the result
m1_fit.summary2()   # summary2() reports AIC and BIC

Optimization terminated successfully.
         Current function value: 0.517172
         Iterations 7

Out[30]:

Model:	Logit	Pseudo R-squared:	0.240
Dependent Variable:	active_cust	AIC:	103450.4420
Date:	2016-10-16 14:35	BIC:	103526.5454
No. Observations:	100000	Log-Likelihood:	-51717.
Df Model:	7	LL-Null:	-68074.
Df Residuals:	99992	LLR p-value:	0.0000
Converged:	1.0000	Scale:	1.0000
No. Iterations:	7.0000

	Coef.	Std.Err.	z	P>|z|	[0.025	0.975]
income	0.0000	0.0000	4.0973	0.0000	0.0000	0.0000
months_on_network	0.0150	0.0005	31.1715	0.0000	0.0141	0.0159
Num_complaints	-1.7669	0.0271	-65.2837	0.0000	-1.8199	-1.7138
number_plan_changes	-0.1784	0.0075	-23.9093	0.0000	-0.1930	-0.1638
relocated	-3.0826	0.0404	-76.2589	0.0000	-3.1618	-3.0034
monthly_bill	-0.0024	0.0002	-16.0138	0.0000	-0.0027	-0.0021
technical_issues_per_month	-0.4636	0.0072	-64.0101	0.0000	-0.4778	-0.4494
Speed_test_result	0.1094	0.0015	75.0729	0.0000	0.1065	0.1122

What are the top-2 impacting variables in fiber bits model?
What are the least impacting variables in fiber bits model?

In [31]:

m1.fit().summary()

Optimization terminated successfully.
         Current function value: 0.517172
         Iterations 7

Out[31]:

Logit Regression Results
Dep. Variable:	active_cust	No. Observations:	100000
Model:	Logit	Df Residuals:	99992
Method:	MLE	Df Model:	7
Date:	Sun, 16 Oct 2016	Pseudo R-squ.:	0.2403
Time:	14:35:52	Log-Likelihood:	-51717.
converged:	True	LL-Null:	-68074.
		LLR p-value:	0.000

	coef	std err	z	P>|z|	[95.0% Conf. Int.]
income	1.71e-05	4.17e-06	4.097	0.000	8.92e-06 2.53e-05
months_on_network	0.0150	0.000	31.172	0.000	0.014 0.016
Num_complaints	-1.7669	0.027	-65.284	0.000	-1.820 -1.714
number_plan_changes	-0.1784	0.007	-23.909	0.000	-0.193 -0.164
relocated	-3.0826	0.040	-76.259	0.000	-3.162 -3.003
monthly_bill	-0.0024	0.000	-16.014	0.000	-0.003 -0.002
technical_issues_per_month	-0.4636	0.007	-64.010	0.000	-0.478 -0.449
Speed_test_result	0.1094	0.001	75.073	0.000	0.107 0.112
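
One way to answer the two questions above is to rank the variables by the absolute value of their z statistic. This is a sketch under that assumption: the ranking rule is a common convention, not something the output itself prescribes. The z values are copied from the table above, and the variable names in the code are illustrative.

```python
# z statistics copied from the m1 summary table
z_values = {
    'income': 4.097,
    'months_on_network': 31.172,
    'Num_complaints': -65.284,
    'number_plan_changes': -23.909,
    'relocated': -76.259,
    'monthly_bill': -16.014,
    'technical_issues_per_month': -64.010,
    'Speed_test_result': 75.073,
}

# Sort by |z|, largest first: a larger |z| means the coefficient is
# further from zero relative to its standard error.
ranked = sorted(z_values, key=lambda name: abs(z_values[name]), reverse=True)

top_two = ranked[:2]
bottom_two = ranked[-2:]
print("Top 2:", top_two)
print("Least impacting:", bottom_two)
```

With these values, relocated and Speed_test_result come out on top, while monthly_bill and income come last.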

Can we drop any of these variables and build a new model (m2)?

In [32]:

# income and monthly_bill are dropped because they are the least impacting variables
predictors_m2 = ['months_on_network', 'Num_complaints', 'number_plan_changes',
                 'relocated', 'technical_issues_per_month', 'Speed_test_result']
m2 = sm.Logit(Fiber['active_cust'], Fiber[predictors_m2])
m2_fit = m2.fit()   # fit once and reuse the result
m2_fit.summary2()   # summary2() reports AIC and BIC

Optimization terminated successfully.
         Current function value: 0.518605
         Iterations 7

Out[32]:

Model:	Logit	Pseudo R-squared:	0.238
Dependent Variable:	active_cust	AIC:	103732.9794
Date:	2016-10-16 14:35	BIC:	103790.0570
No. Observations:	100000	Log-Likelihood:	-51860.
Df Model:	5	LL-Null:	-68074.
Df Residuals:	99994	LLR p-value:	0.0000
Converged:	1.0000	Scale:	1.0000
No. Iterations:	7.0000

	Coef.	Std.Err.	z	P>|z|	[0.025	0.975]
months_on_network	0.0146	0.0005	30.8698	0.0000	0.0137	0.0155
Num_complaints	-1.7621	0.0270	-65.2891	0.0000	-1.8150	-1.7092
number_plan_changes	-0.1765	0.0074	-23.7127	0.0000	-0.1910	-0.1619
relocated	-3.0800	0.0404	-76.1640	0.0000	-3.1592	-3.0007
technical_issues_per_month	-0.4762	0.0072	-66.1848	0.0000	-0.4903	-0.4621
Speed_test_result	0.1074	0.0014	74.5451	0.0000	0.1046	0.1102
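
The two summaries can now be compared directly. The sketch below simply puts the reported AIC and BIC values side by side and picks the model with the lower value of each; the dictionary and variable names are illustrative.

```python
# AIC and BIC as reported by summary2() for each model
reported = {
    'm1': {'AIC': 103450.4420, 'BIC': 103526.5454},
    'm2': {'AIC': 103732.9794, 'BIC': 103790.0570},
}

# For both criteria, lower is better
best_by_aic = min(reported, key=lambda m: reported[m]['AIC'])
best_by_bic = min(reported, key=lambda m: reported[m]['BIC'])

# How much worse m2 is than m1 on each criterion
aic_gap = reported['m2']['AIC'] - reported['m1']['AIC']
bic_gap = reported['m2']['BIC'] - reported['m1']['BIC']

print("Preferred by AIC:", best_by_aic, "gap:", round(aic_gap, 1))
print("Preferred by BIC:", best_by_bic, "gap:", round(bic_gap, 1))
```

Both criteria favour m1: dropping income and monthly_bill raises AIC by roughly 283 and BIC by roughly 264, so on these measures the smaller model is not an improvement.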

Conclusion: Logistic Regression

Logistic regression is a foundational classification algorithm.
A good understanding of logistic regression and goodness of fit measures will really help in understanding more complex machine learning algorithms like neural networks and SVMs.
One has to be careful while selecting a model, because all the goodness of fit measures are calculated on the training data. We may have to do cross-validation to get an idea of the test error.
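
As a rough illustration of the cross-validation idea, here is a minimal k-fold sketch. It does not use the Fiber data: the data set is synthetic, the Newton-Raphson fitting routine is a simplified stand-in for a library fit, and every function name is hypothetical. It assumes NumPy is installed.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logit(X, y, n_iter=25):
    """Fit logistic regression coefficients with Newton-Raphson steps."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        gradient = X.T @ (y - p)
        hessian = X.T @ (X * (p * (1.0 - p))[:, None])
        beta = beta + np.linalg.solve(hessian, gradient)
    return beta

def cv_accuracy(X, y, n_folds=5, seed=0):
    """Average held-out accuracy over n_folds train/test splits."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    scores = []
    for i, test_idx in enumerate(folds):
        # Train on every fold except the i-th, then score on the i-th
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        beta = fit_logit(X[train_idx], y[train_idx])
        predicted = sigmoid(X[test_idx] @ beta) >= 0.5
        scores.append(np.mean(predicted == y[test_idx]))
    return float(np.mean(scores))

# Synthetic data (an assumption for illustration): a constant column plus two predictors
rng = np.random.default_rng(42)
n = 2000
features = rng.normal(size=(n, 2))
X = np.column_stack([np.ones(n), features])
true_beta = np.array([0.5, 2.0, -1.0])
y = (rng.random(n) < sigmoid(X @ true_beta)).astype(float)

cv_score = cv_accuracy(X, y)
print("5-fold cross-validated accuracy:", round(cv_score, 3))
```

Because each fold is scored on rows the model never saw during fitting, the averaged accuracy is a more honest estimate of test error than any measure computed on the training data alone.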

23rd January 2018

204.2.6 Model Selection : Logistic Regression

