
204.1.7 Adjusted R-squared in Python

Adjusted R-squared: a better measure of fit than R-squared when a regression has multiple predictors.

Link to the previous post: https://statinfer.com/204-1-6-multiple-regression-in-python/

Adjusted R-Squared

  • Is it good to have as many independent variables as possible? No.
  • R-squared can be deceptive: it never decreases when a new X variable is added to the model, even if that variable adds nothing useful.
  • We need a better measure, or an adjustment to the original R-squared formula.
  • Adjusted R-squared
    • Its value depends on the number of explanatory variables
    • Imposes a penalty for adding additional explanatory variables
    • It is usually written as [latex]\bar{R}^2[/latex]
    • It can be very different from [latex]R^2[/latex] when there are many predictors and n is small

[latex]\bar{R}^2 = R^2 - \frac{k-1}{n-k}(1-R^2)[/latex]

where n = number of observations and k = number of parameters in the model (including the intercept).

R-squared increases whenever we add an independent variable, but Adjusted R-squared increases only if a significant variable is added. Look at the example below: as we keep adding new variables, R-squared keeps increasing while Adjusted R-squared may not.
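As a quick numerical check (an added sketch, not part of the original notebook), the formula can be coded directly; plugging in the values from the first practice model below (n = 12 observations, k = 4 parameters, R-squared = 0.684) reproduces the Adjusted R-squared that statsmodels reports.

def adjusted_r_squared(r2, n, k):
    # k counts all parameters, including the intercept
    return r2 - (k - 1) / (n - k) * (1 - r2)

print(adjusted_r_squared(0.684, n=12, k=4))   # approx. 0.566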

Practice : Adjusted R-Square

  • Dataset: “Adjusted Rsquare/ Adj_Sample.csv”
  • Build a model to predict y using x1,x2 and x3. Note down R-Square and Adj R-Square values
  • Build a model to predict y using x1,x2,x3,x4,x5 and x6. Note down R-Square and Adj R-Square values
  • Build a model to predict y using x1,x2,x3,x4,x5,x6,x7 and x8. Note down R-Square and Adj R-Square values
In [26]:
adj_sample=pd.read_csv("datasets\\Adjusted RSquare\\Adj_Sample.csv")
adj_sample.shape
Out[26]:
(12, 9)
In [27]:
adj_sample.columns.values
Out[27]:
array(['Y', 'x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8'], dtype=object)
In [28]:
#Build a model to predict y using x1,x2 and x3. Note down R-Square and Adj R-Square values 
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(adj_sample[["x1","x2","x3"]], adj_sample[["Y"]])
predictions = lr.predict(adj_sample[["x1","x2","x3"]])
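As an aside (an addition to the original notebook), sklearn's LinearRegression also reports R-squared directly through its score method; on the training data this should match the R-squared shown in the statsmodels summary below.

print(lr.score(adj_sample[["x1","x2","x3"]], adj_sample[["Y"]]))   # R-squared, approx. 0.684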
In [29]:
import statsmodels.formula.api as sm
model = sm.ols(formula='Y ~ x1+x2+x3', data=adj_sample)
fitted1 = model.fit()
fitted1.summary()
C:\Anaconda3\lib\site-packages\scipy\stats\stats.py:1557: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=12
  "anyway, n=%i" % int(n))
Out[29]:
OLS Regression Results
Dep. Variable: Y R-squared: 0.684
Model: OLS Adj. R-squared: 0.566
Method: Least Squares F-statistic: 5.785
Date: Wed, 27 Jul 2016 Prob (F-statistic): 0.0211
Time: 11:48:28 Log-Likelihood: -10.430
No. Observations: 12 AIC: 28.86
Df Residuals: 8 BIC: 30.80
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept -2.8798 1.163 -2.477 0.038 -5.561 -0.199
x1 -0.4894 0.370 -1.324 0.222 -1.342 0.363
x2 0.0029 0.001 2.586 0.032 0.000 0.005
x3 0.4572 0.176 2.595 0.032 0.051 0.864
Omnibus: 1.113 Durbin-Watson: 1.978
Prob(Omnibus): 0.573 Jarque-Bera (JB): 0.763
Skew: -0.562 Prob(JB): 0.683
Kurtosis: 2.489 Cond. No. 6.00e+03
In [30]:
#Build a model to predict y using x1,x2,x3,x4,x5 and x6. Note down R-Square and Adj R-Square values 

from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(adj_sample[["x1","x2","x3","x4","x5","x6"]], adj_sample[["Y"]])
predictions = lr.predict(adj_sample[["x1","x2","x3","x4","x5","x6"]])
In [31]:
import statsmodels.formula.api as sm
model = sm.ols(formula='Y ~ x1+x2+x3+x4+x5+x6', data=adj_sample)
fitted2 = model.fit()
fitted2.summary()
C:\Anaconda3\lib\site-packages\scipy\stats\stats.py:1557: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=12
  "anyway, n=%i" % int(n))
Out[31]:
OLS Regression Results
Dep. Variable: Y R-squared: 0.717
Model: OLS Adj. R-squared: 0.377
Method: Least Squares F-statistic: 2.111
Date: Wed, 27 Jul 2016 Prob (F-statistic): 0.215
Time: 11:48:28 Log-Likelihood: -9.7790
No. Observations: 12 AIC: 33.56
Df Residuals: 5 BIC: 36.95
Df Model: 6
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept -5.3751 4.687 -1.147 0.303 -17.423 6.673
x1 -0.6697 0.537 -1.247 0.268 -2.050 0.711
x2 0.0030 0.002 1.956 0.108 -0.001 0.007
x3 0.5063 0.249 2.036 0.097 -0.133 1.146
x4 0.0376 0.084 0.449 0.672 -0.178 0.253
x5 0.0436 0.169 0.258 0.806 -0.390 0.478
x6 0.0516 0.088 0.588 0.582 -0.174 0.277
Omnibus: 0.426 Durbin-Watson: 2.065
Prob(Omnibus): 0.808 Jarque-Bera (JB): 0.434
Skew: -0.347 Prob(JB): 0.805
Kurtosis: 2.378 Cond. No. 1.98e+04
In [32]:
#Build a model to predict y using x1,x2,x3,x4,x5,x6,x7 and x8. Note down R-Square and Adj R-Square values
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(adj_sample[["x1","x2","x3","x4","x5","x6","x7","x8"]], adj_sample[["Y"]])
predictions = lr.predict(adj_sample[["x1","x2","x3","x4","x5","x6","x7","x8"]])
In [33]:
import statsmodels.formula.api as sm
model = sm.ols(formula='Y ~ x1+x2+x3+x4+x5+x6+x7+x8', data=adj_sample)
fitted3 = model.fit()
fitted3.summary()
C:\Anaconda3\lib\site-packages\scipy\stats\stats.py:1557: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=12
  "anyway, n=%i" % int(n))
Out[33]:
OLS Regression Results
Dep. Variable: Y R-squared: 0.805
Model: OLS Adj. R-squared: 0.285
Method: Least Squares F-statistic: 1.549
Date: Wed, 27 Jul 2016 Prob (F-statistic): 0.393
Time: 11:48:28 Log-Likelihood: -7.5390
No. Observations: 12 AIC: 33.08
Df Residuals: 3 BIC: 37.44
Df Model: 8
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 17.0440 19.903 0.856 0.455 -46.297 80.385
x1 -0.0956 0.761 -0.126 0.908 -2.519 2.328
x2 0.0007 0.003 0.291 0.790 -0.007 0.009
x3 0.5157 0.306 1.684 0.191 -0.459 1.490
x4 0.0579 0.103 0.560 0.615 -0.271 0.387
x5 0.0858 0.191 0.448 0.684 -0.524 0.695
x6 -0.1747 0.220 -0.795 0.485 -0.874 0.525
x7 -0.0324 0.153 -0.212 0.846 -0.519 0.455
x8 -0.2321 0.207 -1.124 0.343 -0.890 0.425
Omnibus: 1.329 Durbin-Watson: 1.594
Prob(Omnibus): 0.514 Jarque-Bera (JB): 0.875
Skew: -0.339 Prob(JB): 0.646
Kurtosis: 1.863 Cond. No. 7.85e+04
Model    R-Squared    Adj. R-Squared
Model1   0.684        0.566
Model2   0.717        0.377
Model3   0.805        0.285
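The same comparison can be pulled straight from the fitted statsmodels results (a small added sketch; fitted1, fitted2 and fitted3 are the result objects created above):

for name, fitted in [("Model1", fitted1), ("Model2", fitted2), ("Model3", fitted3)]:
    print(name, round(fitted.rsquared, 3), round(fitted.rsquared_adj, 3))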

R-Squared vs Adjusted R-Squared

We have built three models on the Adj_Sample data: model1, model2 and model3, with different numbers of variables.

1) What does it indicate if R-squared is very far from Adjusted R-squared? It indicates too many variables, or too many insignificant variables, in the model. We may have to look at the variable impact test and drop a few independent variables from the model.

2) How do you use Adjusted R-squared? Build a model and check whether R-squared is close to Adjusted R-squared. If not, use variable selection techniques to bring R-squared near Adjusted R-squared. A difference of about 2% between R-squared and Adjusted R-squared is acceptable.

3) Is the number of independent variables the only thing that brings Adjusted R-squared down? No. If we look at the formula carefully, Adjusted R-squared is influenced by both k (the number of parameters) and n (the number of observations). If k is high and n is low, Adjusted R-squared will be much lower than R-squared.

Finally, either reduce the number of variables or increase the number of observations to bring Adjusted R-squared close to R-squared.
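One simple way to act on point (2) above, given here as an illustrative sketch rather than a prescribed method, is backward elimination: drop the least significant predictor, refit, and stop once the remaining predictors are significant (the 0.05 threshold below is an assumption, not from the original post).

import statsmodels.formula.api as sm

# Backward elimination sketch on the same Adj_Sample data
predictors = ["x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"]
while True:
    fitted = sm.ols(formula="Y ~ " + "+".join(predictors), data=adj_sample).fit()
    pvalues = fitted.pvalues.drop("Intercept")      # p-values of the predictors only
    if pvalues.max() < 0.05 or len(predictors) == 1:
        break
    predictors.remove(pvalues.idxmax())             # drop the least significant predictor
print(predictors)
print(fitted.rsquared, fitted.rsquared_adj)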

The next post is a practice session on multiple regression issues.