What's the difference between correlation and simple linear regression?

  • In particular, I am referring to the Pearson product-moment correlation coefficient.

    Note that one perspective on the relationship between regression and correlation can be found in my answer here: What is the difference between doing linear regression on y with x versus x with y?

  • What's the difference between the correlation between $X$ and $Y$ and a linear regression predicting $Y$ from $X$?

    First, some similarities:

    • The standardised regression coefficient is the same as Pearson's correlation coefficient.
    • The square of Pearson's correlation coefficient is the same as the $R^2$ in simple linear regression.
    • Neither simple linear regression nor correlation answers questions of causality directly. This point is important, because I've met people who think that simple regression can magically allow an inference that $X$ causes $Y$.
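    The first two similarities can be checked numerically. The following is only a minimal sketch in plain Python (standard library only); the data and the helper names (`pearson_r`, `ols`, `standardise`, `r_squared`) are made up for illustration and are not from the original answer.

```python
import math

def mean(values):
    return sum(values) / len(values)

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

def ols(x, y):
    """Least-squares intercept a and slope b for predicting y from x."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    return my - b * mx, b

def standardise(values):
    """Rescale to mean 0 and sample standard deviation 1."""
    m = mean(values)
    sd = math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))
    return [(v - m) / sd for v in values]

def r_squared(x, y):
    """Coefficient of determination of the regression of y on x."""
    a, b = ols(x, y)
    my = mean(y)
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Made-up illustrative data.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 3.2, 4.8, 5.1, 6.9]

r = pearson_r(x, y)
_, beta_std = ols(standardise(x), standardise(y))  # slope after standardising both variables
r2 = r_squared(x, y)

print(f"Pearson r          = {r:.6f}")
print(f"standardised slope = {beta_std:.6f}")
print(f"r squared          = {r * r:.6f}")
print(f"regression R^2     = {r2:.6f}")
```

    Up to floating-point rounding, the first two printed values should agree with each other, as should the last two.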

    Second, some differences:

    • The regression equation (i.e., $a + bX$) can be used to make predictions of $Y$ based on values of $X$.
    • While correlation typically refers to a linear relationship, it can refer to other forms of dependence, such as polynomial or truly nonlinear relationships.
    • While correlation typically refers to Pearson's correlation coefficient, there are other types of correlation, such as Spearman's.
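    To make these differences concrete, here is a hedged sketch in plain Python. It uses a fitted line to predict $Y$ at a new value of $X$, and then compares Pearson's and Spearman's coefficients on a monotone but curved relationship. The data, the new value `x_new`, and the helper names are assumptions for illustration; the simple `ranks` helper assumes there are no tied values.

```python
import math

def mean(values):
    return sum(values) / len(values)

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

def ols(x, y):
    """Least-squares intercept a and slope b for predicting y from x."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    return my - b * mx, b

def ranks(values):
    """Ranks 1..n. Assumes no ties (ties would need average ranks)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson's r computed on the ranks."""
    return pearson_r(ranks(x), ranks(y))

# Made-up illustrative data.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 3.2, 4.8, 5.1, 6.9]

# Regression gives an equation that can be used for prediction.
a, b = ols(x, y)
x_new = 7.0  # hypothetical new observation
y_hat = a + b * x_new
print(f"predicted Y at X = {x_new}: {y_hat:.3f}")

# A monotone but nonlinear relationship: Y = X^3.
y_curved = [xi ** 3 for xi in x]
r_curved = pearson_r(x, y_curved)
rho_curved = spearman_rho(x, y_curved)
print(f"Pearson r    = {r_curved:.4f}")
print(f"Spearman rho = {rho_curved:.4f}")
```

    A correlation coefficient alone gives no such prediction. On the curved data, Spearman's coefficient reaches 1 because the ordering is perfectly preserved, while Pearson's stays below 1 because the relationship is not a straight line.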

    Hi Jeromy, thank you for your explanation, but I still have a question: what if I do not need to make predictions and just want to know how closely two variables are related, and in which direction and with what strength? Is there still a difference between these two techniques?

    @yue86231 Then it sounds like a measure of correlation would be more appropriate.

    (+1) To the similarities it might be useful to add that standard tests of the hypothesis "correlation=0" or, equivalently, "slope=0" (for the regression in either order), such as carried out by `lm` and `cor.test` in `R`, will yield identical p-values.
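    The reason the p-values coincide is that the test statistics themselves are identical, each referred to a $t$ distribution with $n - 2$ degrees of freedom. The sketch below, in plain Python with made-up data and hypothetical helper names, computes the $t$ statistic for the slope in both regression orders and for the correlation. It stops at the statistic because the standard library has no $t$ distribution for turning it into a p-value.

```python
import math

def centred_sums(x, y):
    """Return (sxx, syy, sxy): centred sums of squares and cross-products."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxx, syy, sxy

def slope_t(x, y):
    """t statistic for H0: slope = 0 in the regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx, _, sxy = centred_sums(x, y)
    b = sxy / sxx
    a = my - b * mx
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    se_b = math.sqrt(ss_res / (n - 2) / sxx)  # standard error of the slope
    return b / se_b

def correlation_t(x, y):
    """t statistic for H0: correlation = 0 (Pearson)."""
    n = len(x)
    sxx, syy, sxy = centred_sums(x, y)
    r = sxy / math.sqrt(sxx * syy)
    return r * math.sqrt((n - 2) / (1 - r * r))

# Made-up illustrative data.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 3.2, 4.8, 5.1, 6.9]

t_y_on_x = slope_t(x, y)
t_x_on_y = slope_t(y, x)
t_corr = correlation_t(x, y)

print(f"t for slope, Y on X : {t_y_on_x:.6f}")
print(f"t for slope, X on Y : {t_x_on_y:.6f}")
print(f"t for correlation   : {t_corr:.6f}")
```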

    I agree that the suggestion from @whuber should be added, but at a very basic level I think it is worth pointing out that the *sign* of the regression slope and the sign of the correlation coefficient are the same. This is probably one of the first things most people learn about the relationship between correlation and a "line of best fit" (even if they don't call it "regression" yet), but I think it's worth noting. As for the differences, it might also be worth mentioning that you get the same answer whether you correlate X with Y or Y with X, whereas the regression of Y on X differs from that of X on Y.
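    Both points are easy to verify on a small example. This is only a sketch in plain Python, with made-up, negatively related data and hypothetical variable names. It also shows a standard identity: the two slopes multiply to $r^2$, so they are reciprocals of each other only when the fit is perfect.

```python
import math

def centred_sums(x, y):
    """Return (sxx, syy, sxy): centred sums of squares and cross-products."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxx, syy, sxy

# Made-up illustrative data with a negative relationship.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [9.8, 8.1, 7.9, 5.2, 4.9, 2.7]

sxx, syy, sxy = centred_sums(x, y)
syy_swapped, sxx_swapped, syx = centred_sums(y, x)  # roles of x and y exchanged

r_xy = sxy / math.sqrt(sxx * syy)                  # correlation of X with Y
r_yx = syx / math.sqrt(syy_swapped * sxx_swapped)  # correlation of Y with X

b_y_on_x = sxy / sxx  # slope when predicting Y from X
b_x_on_y = sxy / syy  # slope when predicting X from Y

print(f"r(X, Y) = {r_xy:.4f}, r(Y, X) = {r_yx:.4f}")
print(f"slope of Y on X = {b_y_on_x:.4f}")
print(f"slope of X on Y = {b_x_on_y:.4f}")
print(f"1 / slope of X on Y = {1 / b_x_on_y:.4f}")
print(f"product of slopes = {b_y_on_x * b_x_on_y:.4f}, r^2 = {r_xy ** 2:.4f}")
```

    Here the correlation and both slopes are negative, the correlation is unchanged when the variables are swapped, and the line for X on Y is not simply the line for Y on X rearranged.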

Licensed under CC-BY-SA with attribution


Content dated before 6/26/2020 9:53 AM