Link to the previous post : https://statinfer.com/204-2-4-goodness-of-fit-for-logistic-regression/

The previous post was about goodness of fit. We covered the confusion matrix there and will cover the remaining measures in later posts.

But first, let's deal with a common issue in modeling:

Multicollinearity

Since the relation between X and Y is non-linear, we used logistic regression.
Multicollinearity is an issue among the predictor variables themselves, not between the predictors and the target.
It needs to be fixed in logistic regression as well.
Otherwise the individual coefficients of the predictors will be affected by the inter-dependency.
The process of identifying it is the same as in linear regression.
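
To make the idea concrete: a common measure is the variance inflation factor (VIF), obtained by regressing each predictor on all the other predictors and computing 1/(1 - R-squared). The sketch below is only an illustration on synthetic data, not the Fiber bits data. The function name `vif_values` and the columns x1, x2, x3 are made up for the example, and it assumes NumPy is available.

```python
import numpy as np

def vif_values(X):
    """Return the VIF of each column of X (rows are observations)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        y = X[:, j]
        # Regress column j on an intercept plus all the other columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r_squared = 1.0 - resid.var() / y.var()
        vifs.append(1.0 / (1.0 - r_squared))
    return vifs

# Synthetic predictors (assumed for illustration): x2 is nearly a copy of x1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + 0.1 * rng.normal(size=500)
x3 = rng.normal(size=500)

vifs = vif_values(np.column_stack([x1, x2, x3]))
for name, v in zip(["x1", "x2", "x3"], vifs):
    print(name, "VIF =", round(v, 2))
```

In this toy setup x1 and x2 should show very large VIF values because each one nearly determines the other, while the independent x3 should stay close to 1.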

Practice : Multicollinearity

Is there any multicollinearity in the fiber bits model?
Identify and remove multicollinearity from the model.

In [27]:

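# Assumes the statsmodels formula API is available under the name sm, as the call to sm.ols requires
import statsmodels.formula.api as sm

def vif_cal(input_data, dependent_col):
    # Work only with the predictors; the target plays no role in VIF
    x_vars = input_data.drop([dependent_col], axis=1)
    xvar_names = x_vars.columns
    for i in range(0, xvar_names.shape[0]):
        # Regress predictor i on all the remaining predictors
        y = x_vars[xvar_names[i]]
        x = x_vars[xvar_names.drop(xvar_names[i])]
        rsq = sm.ols(formula="y~x", data=x_vars).fit().rsquared
        # VIF = 1 / (1 - R-squared)
        vif = round(1 / (1 - rsq), 2)
        print(xvar_names[i], " VIF = ", vif)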

In [28]:

#Calculating VIF values using that function
vif_cal(input_data=Fiber, dependent_col="active_cust")

income  VIF =  1.02
months_on_network  VIF =  1.03
Num_complaints  VIF =  1.01
number_plan_changes  VIF =  1.59
relocated  VIF =  1.56
monthly_bill  VIF =  1.02
technical_issues_per_month  VIF =  1.06
Speed_test_result  VIF =  1.0
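
All of these VIF values are below 2, far under the rule-of-thumb cut-offs of 5 or 10 that are commonly used to flag multicollinearity, so no variable needs to be dropped from the fiber bits model on this account. As a small sketch, the same check can be written down using the values printed above. The threshold of 5 is an assumed convention, not something fixed by the method.

```python
# VIF values copied from the output above.
vif_output = {
    "income": 1.02,
    "months_on_network": 1.03,
    "Num_complaints": 1.01,
    "number_plan_changes": 1.59,
    "relocated": 1.56,
    "monthly_bill": 1.02,
    "technical_issues_per_month": 1.06,
    "Speed_test_result": 1.0,
}

VIF_THRESHOLD = 5  # assumed rule-of-thumb cut-off

# Collect any predictor whose VIF exceeds the threshold.
flagged = [name for name, vif in vif_output.items() if vif > VIF_THRESHOLD]
print("Variables with high VIF:", flagged)
```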

Individual Impact of Variables

Out of these predictor variables, which are the important ones?
If we have to choose the top 5 variables, which are they?
While selecting the model, we may want to drop a few less impactful variables.
How do we rank the predictor variables in order of their importance?
We can simply look at the z value of each variable and compare the absolute values.
Or we can calculate the Wald chi-square, which for a single coefficient is the square of the z value.
The Wald chi-square value helps in ranking the variables.

Practice : Individual Impact of Variables

Identify the top impacting and least impacting variables in the fiber bits model.
Find the variable importance and order the variables based on their impact.

In [29]:

result1.summary()

Out[29]:

Logit Regression Results
Dep. Variable:	active_cust	No. Observations:	100000
Model:	Logit	Df Residuals:	99992
Method:	MLE	Df Model:	7
Date:	Sun, 16 Oct 2016	Pseudo R-squ.:	0.2403
Time:	14:35:51	Log-Likelihood:	-51717.
converged:	True	LL-Null:	-68074.
		LLR p-value:	0.000

	coef	std err	z	P>|z|	[95.0% Conf. Int.]
income	1.71e-05	4.17e-06	4.097	0.000	8.92e-06 2.53e-05
months_on_network	0.0150	0.000	31.172	0.000	0.014 0.016
Num_complaints	-1.7669	0.027	-65.284	0.000	-1.820 -1.714
number_plan_changes	-0.1784	0.007	-23.909	0.000	-0.193 -0.164
relocated	-3.0826	0.040	-76.259	0.000	-3.162 -3.003
monthly_bill	-0.0024	0.000	-16.014	0.000	-0.003 -0.002
technical_issues_per_month	-0.4636	0.007	-64.010	0.000	-0.478 -0.449
Speed_test_result	0.1094	0.001	75.073	0.000	0.107 0.112

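The top impacting variables are relocated and Speed_test_result, which have the largest absolute z values (76.259 and 75.073).

The least impacting variables are income and monthly_bill, which have the smallest absolute z values (4.097 and 16.014).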
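
The full ranking can be reproduced from the z column of the summary above. The following sketch copies those z values by hand (in a live session they should also be available from the fitted statsmodels result as `result1.tvalues`), sorts them by absolute value, and reports the Wald chi-square as the square of z.

```python
# z values copied from the Logit summary above.
z_values = {
    "income": 4.097,
    "months_on_network": 31.172,
    "Num_complaints": -65.284,
    "number_plan_changes": -23.909,
    "relocated": -76.259,
    "monthly_bill": -16.014,
    "technical_issues_per_month": -64.010,
    "Speed_test_result": 75.073,
}

# Rank by absolute z value, largest impact first.
ranked = sorted(z_values.items(), key=lambda kv: abs(kv[1]), reverse=True)

for rank, (name, z) in enumerate(ranked, start=1):
    # Wald chi-square for a single coefficient is the square of z.
    print(rank, name, "|z| =", abs(z), " Wald chi-square =", round(z ** 2, 1))
```

Sorting on the absolute value matters because the sign only tells the direction of the effect, not its strength.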

The next post is about model selection in logistic regression.

Link to the next post : https://statinfer.com/204-2-6-model-selection-logistic-regression/

21st June 2017

204.2.5 Multicollinearity and Individual Impact Of Variables in Logistic Regression

Solving the issue of multicollinearity and finding the impact of individual variables.
