Link to the previous post: https://statinfer.com/204-1-3-practice-regression-line-fitting/
In this post we will understand the mathematics behind a good regression line.
How good is my regression line?
- Take an (x, y) point from the data.
- Suppose we substitute x into the regression line and get a prediction, ypred.
- If the regression line is a good fit, we expect ypred = y, that is, (y - ypred) = 0.
- Repeating this for every x in the data gives one error value (y - ypred) per point.
- Some of these errors are positive and some are negative, so they would cancel out if we simply added them. Instead, we square each error and add up the squares.
$$SSE = \sum (y - \hat{y})^2$$
- For a good model, we need SSE to be zero or close to zero.
- SSE on its own does not tell us much. For example, SSE = 100 is very small when y varies in the 1000s, but the same value is very large when y varies in decimals.
- So we have to take the variation in y into account when judging the accuracy of the regression line.
- Error sum of squares (SSE, the sum of squares of the errors)
$$SSE = \sum (y - \hat{y})^2$$
- Total variation in y (SST, the total sum of squares)
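$$
\begin{aligned}
SST &= \sum (y - \bar{y})^2 \\
    &= \sum (y - \hat{y} + \hat{y} - \bar{y})^2 \\
    &= \sum (y - \hat{y})^2 + \sum (\hat{y} - \bar{y})^2 + 2 \sum (y - \hat{y})(\hat{y} - \bar{y}) \\
    &= \sum (y - \hat{y})^2 + \sum (\hat{y} - \bar{y})^2 \\
    &= SSE + \sum (\hat{y} - \bar{y})^2 \\
    &= SSE + SSR
\end{aligned}
$$
The cross term $2 \sum (y - \hat{y})(\hat{y} - \bar{y})$ is zero for a least-squares line fitted with an intercept, which is why it drops out in the fourth line.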
- So the total variation in y splits into two parts:
- Variation that can't be explained by x (the error, SSE)
- Variation that can be explained by x through the regression (SSR)
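To make the split concrete, here is a minimal sketch that fits a least-squares line and computes the three sums of squares. The data values are made up for illustration, and `np.polyfit` is just one convenient way to get the line; any least-squares fit with an intercept should behave the same way.

```python
import numpy as np

# Hypothetical data, made up for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])

# Least-squares line with an intercept: y_pred = slope * x + intercept.
slope, intercept = np.polyfit(x, y, deg=1)
y_pred = slope * x + intercept
y_mean = y.mean()

sse = np.sum((y - y_pred) ** 2)       # unexplained (error) part
ssr = np.sum((y_pred - y_mean) ** 2)  # part explained by the regression
sst = np.sum((y - y_mean) ** 2)       # total variation in y

# Cross term from expanding the square. It is zero (up to rounding)
# for a least-squares fit that includes an intercept.
cross = 2 * np.sum((y - y_pred) * (y_pred - y_mean))

print(f"SSE = {sse:.4f}, SSR = {ssr:.4f}, SST = {sst:.4f}")
print(f"SSE + SSR = {sse + ssr:.4f}, cross term = {cross:.2e}")
```

The printed SSE + SSR should match SST, and the cross term should be zero apart from floating-point rounding.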
Explained and Unexplained Variation
- The total variation in y is divided into two parts:
- Variation that can be explained by x through the regression
- Variation that can't be explained by x
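$$
\begin{aligned}
SST &= SSE + SSR \\
\text{Total Sum of Squares} &= \text{Sum of Squares Error} + \text{Sum of Squares Regression} \\
SST &= \sum (y - \bar{y})^2 \\
SSE &= \sum (y - \hat{y})^2 \\
SSR &= \sum (\hat{y} - \bar{y})^2
\end{aligned}
$$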
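The earlier point that SSE alone does not make sense can also be checked numerically. The sketch below, again on made-up data, refits the line after multiplying y by 1000. The helper `variation_parts` is a hypothetical name introduced only for this example. SSE changes by a factor of 1000², while the explained and unexplained shares of SST stay the same.

```python
import numpy as np

def variation_parts(x, y):
    """Fit a least-squares line with an intercept and return (SSE, SSR, SST)."""
    slope, intercept = np.polyfit(x, y, deg=1)
    y_pred = slope * x + intercept
    sse = np.sum((y - y_pred) ** 2)
    ssr = np.sum((y_pred - y.mean()) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return sse, ssr, sst

# Hypothetical data, made up for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])

sse, ssr, sst = variation_parts(x, y)
# Same data with y expressed in units 1000 times smaller.
sse_k, ssr_k, sst_k = variation_parts(x, y * 1000)

print("SSE (original y):    ", sse)
print("SSE (y * 1000):      ", sse_k)
print("Unexplained share:   ", sse / sst, "vs", sse_k / sst_k)
print("Explained share:     ", ssr / sst, "vs", ssr_k / sst_k)
```

Comparing SSE with SST in this way is what makes the error comparable across datasets whose y values sit on very different scales.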
In the next session we will work out R-squared, which is a statistical measure of how close the data points are to the fitted regression line.
The next post is about R-squared in Python.
Link to the next post: https://statinfer.com/204-1-5-r-squared-in-python/