Bad Statistical Terminology 3: R Squared

by Richard, in statistics, linear-regression, python

One of the sections in Cosma Shalizi’s book Advanced Data Analysis from an Elementary Point of View is entitled “R Squared: Distraction or Nuisance?”. In my experience, statistical practitioners either use $R^2$ all the time, never use it, or report $R^2$ values but never speak about them.

I am not claiming that $R^2$ itself is a bad name. Far from it! In the case of linear regression, $R^2$ is the square of the correlation coefficient between observed and predicted values, so it’s a very sensible name.

The major problem with $R^2$ is that it is a single number which measures too many things at once. It is simultaneously a measure of how much better the fitted model is than a zero-slope model, a measure of how far the data lie from the regression surface, a measure of how much the variance of the fitted values shrinks compared to the variance of the observed values, and a measure of the correlation between observed and predicted values.

This is really too much of a burden for a single number to bear!

Unfortunately, when $R^2$ is generalized to contexts other than linear regression, people may take any one of the above properties as the basis for the thing they call $R^2$. In those cases, $R^2$ is a really bad name, because it is anyone’s guess which of the original properties the new definition is meant to capture.

This is the cause of endless confusion, as I now hope to show.

Consider a regression problem, in which some predictive model is being used to predict a continuous response variable $y$. Here, “regression” does not necessarily mean linear regression.

In the case of a predictive model, various notions of $R^2$ can be defined based on the observed values $y=(y_i)$ and predicted values $\hat{y} = (\hat{y}_i)$. A simple one is

\[R^2 = \mathrm{corr}(y, \hat{y})^2\]

This is certainly sometimes useful. However, unless something has gone very wrong, the correlation in question will always be positive, so it’s not clear why you are squaring it.
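As a minimal sketch (the numbers below are invented purely for illustration, not taken from any real model), this version of $R^2$ is just the squared Pearson correlation between observed and predicted values:

```python
import numpy as np

# Toy observed values and some model's predictions (invented for illustration).
y = np.array([3.1, 4.0, 5.2, 6.1, 7.3, 8.0])
y_hat = np.array([3.4, 3.9, 5.0, 6.4, 7.0, 8.3])

# R^2 as the squared correlation between observed and predicted values.
r = np.corrcoef(y, y_hat)[0, 1]
print(r, r ** 2)  # for any sensible model r is already positive
```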

Another notion which I have seen used is the ratio of variances

\[R^2 = \mathrm{var}(\hat{y})/\mathrm{var}(y)\]

This doesn’t necessarily have to be less than $1$ but, unless something has gone very wrong, in practice it always is. It can be viewed as a measure of how much “regression” there is in your regression.
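Using the same toy arrays (again just a sketch with invented numbers), the variance-ratio version is a one-liner:

```python
import numpy as np

# Same invented toy data as in the previous snippet.
y = np.array([3.1, 4.0, 5.2, 6.1, 7.3, 8.0])
y_hat = np.array([3.4, 3.9, 5.0, 6.4, 7.0, 8.3])

# R^2 as the ratio of variances: how much "shrinkage" there is in the
# fitted values relative to the observed values.
print(np.var(y_hat) / np.var(y))  # in practice almost always below 1
```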

(Aside: I recently sat through a presentation by an auto-ml salesman, who explained that one of the interesting properties of his model was that it always seemed to over-predict small values and under-predict large values of the response variable. But that’s what every model does…)

This $R^2$ is also often called “variance explained”, a term which I may save for a future post in this series!

The next kind of $R^2$ is based on the formula

\[R^2 = 1 - \frac{\sum_i(y_i - \hat{y}_i)^2}{\sum_i (y_i - \overline{y})^2 }\]

which measures how much better the fitted model is than the model which just predicts the mean value $\overline{y}$ for every observation. This formula is used verbatim to define r2_score in scikit-learn.

Unfortunately, this formula makes very little sense unless it is applied to the training data: the mean of the observed values $\overline{y}$ is not available at prediction time, so on held-out data it compares your model with a baseline model which you could not have built in the first place.

This $R^2$ can also be negative, as explained in scikit-learn’s help pages, which don’t really address why anybody would be interested in this quantity, or why a number which can be negative is called the square of something.
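To make both points concrete, here is a short sketch (still using invented numbers) showing that scikit-learn’s r2_score agrees with the formula above, and that a sufficiently bad set of predictions pushes it below zero:

```python
import numpy as np
from sklearn.metrics import r2_score

y = np.array([3.1, 4.0, 5.2, 6.1, 7.3, 8.0])
y_hat = np.array([3.4, 3.9, 5.0, 6.4, 7.0, 8.3])

# r2_score implements 1 - sum((y - y_hat)^2) / sum((y - mean(y))^2).
by_hand = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
print(r2_score(y, y_hat), by_hand)  # the two numbers agree

# Predictions worse than always predicting the mean give a negative score.
y_bad = y[::-1]  # the observed values in reverse order, used as "predictions"
print(r2_score(y, y_bad))  # negative
```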

But there is a special circle of hell reserved for the next $R^2$, which is the $R^2$ for linear regression constrained to go through the origin (regression without intercept). This is defined by

\[R^2 = 1 - \frac{ \sum_i (y_i - \hat{y}_i)^2 }{ \sum_i y_i^2 }\]

The motivation for this definition is that you are comparing your fitted model with a null model, but the null model is $\hat{y} = 0$ instead of $\hat{y} = \overline{y}$.

It is very bad to call this quantity $R^2$ because it measures something completely different from the $R^2$ for a regression with an intercept. It often has a much larger value, because it is much easier to beat the $\hat{y} = 0$ model than the $\hat{y} = \overline{y}$ model.
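To see how different the two quantities are, consider a sketch (with the same invented numbers as before) in which the “fitted model” simply predicts the mean of the observations. By the usual definition this model scores exactly zero, yet by the through-the-origin definition it looks excellent:

```python
import numpy as np

y = np.array([3.1, 4.0, 5.2, 6.1, 7.3, 8.0])
y_hat = np.full_like(y, y.mean())  # a "model" that always predicts the mean

rss = np.sum((y - y_hat) ** 2)

# Usual R^2: baseline is the mean-only model, so this is exactly 0 here.
r2_with_intercept = 1 - rss / np.sum((y - y.mean()) ** 2)

# Through-the-origin R^2: baseline is y_hat = 0, a far lower bar.
r2_through_origin = 1 - rss / np.sum(y ** 2)

print(r2_with_intercept, r2_through_origin)  # roughly 0.0 versus about 0.91
```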

But the fact that it has the same name often leads people to think it means the same thing, and therefore that they are achieving great results with their intercept-free regression.

In extreme cases, the higher value of $R^2$ leads people to prefer the regression with zero intercept to the regression with an intercept. I have seen this mistake made several times.

Statistical software which calls this quantity $R^2$ should really come with a health warning!

In conclusion, $R^2$ is bad terminology because the same name is used for too many different things. We would probably be better off if people reserved $R^2$ for linear regression and invented other terms for its various generalizations to other contexts.
