Link to the previous post: https://statinfer.com/204-1-6-multiple-regression-in-python/
Adjusted R-Squared
- Is it good to have as many independent variables as possible? No.
- R-squared is deceptive: it never decreases when a new X variable is added to the model, even if that variable is insignificant.
- We need a better measure, or an adjustment to the original R-squared formula.
- Adjusted R-squared
- Its value depends on the number of explanatory variables
- It imposes a penalty for adding additional explanatory variables
- It can be very different from R-squared when there are too many predictors and n is small
- It is usually written as

[latex]\bar{R}^2 = R^2 - \frac{k-1}{n-k}(1-R^2)[/latex]

where n is the number of observations and k is the number of parameters in the model (including the intercept).
The R-squared value increases whenever we add an independent variable, but Adjusted R-squared increases only if a significant variable is added. Look at the example below: as we keep adding new variables, R-squared increases, while Adjusted R-squared may not.
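As a quick sanity check on the formula, here is a minimal sketch in Python. The helper name adj_r_squared is ours, and it uses the equivalent form in which k counts only the explanatory variables (the -1 in the denominator accounts for the intercept). The values plugged in are the Model1 and Model3 results from the practice exercise below, whose R-Square and Adj R-Square numbers imply n = 12 observations.

#A minimal sketch of the Adjusted R-squared formula
#Here k counts the explanatory variables; the intercept is absorbed by the -1 in the denominator
def adj_r_squared(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(adj_r_squared(0.684, 12, 3))   #0.566 (Model1 below)
print(adj_r_squared(0.805, 12, 8))   #0.285 (Model3 below)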
Practice : Adjusted R-Square
- Dataset: “Adjusted Rsquare/ Adj_Sample.csv”
- Build a model to predict y using x1,x2 and x3. Note down R-Square and Adj R-Square values
- Build a model to predict y using x1,x2,x3,x4,x5 and x6. Note down R-Square and Adj R-Square values
- Build a model to predict y using x1,x2,x3,x4,x5,x6,x7 and x8. Note down R-Square and Adj R-Square values
import pandas as pd

adj_sample=pd.read_csv("datasets\\Adjusted RSquare\\Adj_Sample.csv")

#Check the size of the data and the column names
adj_sample.shape
adj_sample.columns.values
#Build a model to predict Y using x1, x2 and x3. Note down R-Square and Adj R-Square values
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(adj_sample[["x1", "x2", "x3"]], adj_sample[["Y"]])
predictions = lr.predict(adj_sample[["x1", "x2", "x3"]])

#statsmodels reports both R-Square and Adj R-Square in its summary
import statsmodels.formula.api as smf
model = smf.ols(formula='Y ~ x1+x2+x3', data=adj_sample)
fitted1 = model.fit()
fitted1.summary()
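Rather than reading the two numbers off the summary table each time, statsmodels also exposes them directly on the fitted results as rsquared and rsquared_adj; the same attributes are available on fitted2 and fitted3 below.

#R-Square and Adj R-Square straight from the fitted results
print(fitted1.rsquared)
print(fitted1.rsquared_adj)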
#Build a model to predict Y using x1, x2, x3, x4, x5 and x6. Note down R-Square and Adj R-Square values
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(adj_sample[["x1", "x2", "x3", "x4", "x5", "x6"]], adj_sample[["Y"]])
predictions = lr.predict(adj_sample[["x1", "x2", "x3", "x4", "x5", "x6"]])

import statsmodels.formula.api as smf
model = smf.ols(formula='Y ~ x1+x2+x3+x4+x5+x6', data=adj_sample)
fitted2 = model.fit()
fitted2.summary()
#Build a model to predict Y using x1 to x8. Note down R-Square and Adj R-Square values
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(adj_sample[["x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"]], adj_sample[["Y"]])
predictions = lr.predict(adj_sample[["x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"]])

import statsmodels.formula.api as smf
model = smf.ols(formula='Y ~ x1+x2+x3+x4+x5+x6+x7+x8', data=adj_sample)
fitted3 = model.fit()
fitted3.summary()
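The three summaries produce the numbers tabulated below. A small loop over the fitted results collects them side by side:

#Compare R-Square and Adj R-Square across the three models
for name, fit in [("Model1", fitted1), ("Model2", fitted2), ("Model3", fitted3)]:
    print(name, round(fit.rsquared, 3), round(fit.rsquared_adj, 3))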
| Model | R-Square | Adj R-Square |
|---|---|---|
| Model1 | 0.684 | 0.566 |
| Model2 | 0.717 | 0.377 |
| Model3 | 0.805 | 0.285 |
R-Squared vs Adjusted R-Squared
We have built three models on the Adj_Sample data: model1, model2 and model3, each with a different number of variables.
1) What does it indicate if R-squared is very far from Adjusted R-squared? It is an indication of too many variables, or too many insignificant variables, in the model. We may have to test each variable's impact and drop a few independent variables from the model.
2) How do you use Adjusted R-squared? Build the model, then check whether R-squared is near Adjusted R-squared. If not, use variable selection techniques to bring R-squared close to Adjusted R-squared. A difference of about 2% between R-squared and Adjusted R-squared is acceptable.
3) Is the number of independent variables the only thing that pulls Adjusted R-squared down? No. If we look at the formula carefully, Adjusted R-squared is influenced by both k (the number of variables) and n (the number of observations). If k is high and n is low, Adjusted R-squared will be much less than R-squared.
Finally, either reduce the number of variables or increase the number of observations to bring Adjusted R-squared close to R-squared.
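A quick numeric sketch makes point 3 concrete. Holding R-squared fixed at an assumed 0.8 and keeping n at 12 (the small sample size implied by the Adj_Sample results above), Adjusted R-squared collapses, and can even go negative, as k grows:

#Effect of a high k (variables) against a low n (observations)
#R-square is held fixed at an illustrative 0.8
n, r2 = 12, 0.8
for k in [2, 4, 6, 8, 10]:
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    print(k, round(adj_r2, 3))
#k=10 leaves only one degree of freedom and Adjusted R-squared drops to -1.2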