Pearson's or Spearman's correlation with non-normal data
I get this question frequently enough in my statistics consulting work that I thought I'd post it here. I have an answer, which is posted below, but I am keen to hear what others have to say.
Question: If you have two variables that are not normally distributed, should you use Spearman's rho for the correlation?
Why not calculate and report **both** (Pearson's r *and* Spearman's ρ)? Their difference (or lack thereof) will provide additional information.
A related question compares the distributional assumptions made when testing the significance of a simple regression coefficient beta with those made when testing the Pearson correlation coefficient (numerically equal to the beta): http://stats.stackexchange.com/q/181043/3277
Pearson's correlation is a measure of the linear relationship between two continuous random variables. It does not assume normality although it does assume finite variances and finite covariance. When the variables are bivariate normal, Pearson's correlation provides a complete description of the association.
Spearman's correlation applies to ranks and so provides a measure of a monotonic relationship between two continuous random variables. It is also useful with ordinal data and is robust to outliers (unlike Pearson's correlation).
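To make the linear-versus-monotonic distinction concrete, here is a minimal sketch in plain Python. The helper names (`pearson`, `ranks`, `spearman`) and the toy data are my own illustrative choices, not anything from a particular library. The first data set is perfectly monotonic but strongly nonlinear; the second is perfectly linear except for one gross outlier.

```python
import math

def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))

# Case 1: monotonic but nonlinear (y = exp(x)).
x = [i / 2 for i in range(1, 21)]
y = [math.exp(v) for v in x]
r_p, r_s = pearson(x, y), spearman(x, y)

# Case 2: exact line y = x, except the last point is a gross outlier.
x2 = list(range(1, 21))
y2 = list(range(1, 20)) + [-100]
r_p_out, r_s_out = pearson(x2, y2), spearman(x2, y2)

print(f"monotonic: Pearson={r_p:.3f}, Spearman={r_s:.3f}")
print(f"outlier:   Pearson={r_p_out:.3f}, Spearman={r_s_out:.3f}")
```

In the first case Spearman's coefficient is 1 because the ordering is preserved exactly, while Pearson's falls well below 1. In the second case the single outlier is enough to turn Pearson's coefficient negative, whereas Spearman's stays clearly positive.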
The distribution of either correlation coefficient will depend on the underlying distribution, although both are asymptotically normal because of the central limit theorem.
Pearson's $\rho$ does not assume normality, but is only an exhaustive measure of association if the joint distribution is multivariate normal. Given the confusion this distinction elicits, you might want to add it to your answer.
Is there a source that can be quoted to support the above statement (Pearson's r does not assume normality)? We're having the same argument in our department at the moment.
@RobHyndman In the field of financial time series (for example when trying to learn about correlations between stock returns), would you recommend Pearson correlation or rank based correlations? Wikipedia is pretty strongly against Pearson but their source is dubious.
*"When the variables are bivariate normal, Pearson's correlation provides a complete description of the association."* And when the variables are NOT bivariate normal, how useful is Pearson's correlation?
Here: http://www.statisticssolutions.com/correlation-pearson-kendall-spearman/ they say that "For the Pearson r correlation, both variables should be normally distributed. Other assumptions include linearity and homoscedasticity"
This answer seems rather indirect. "When the variables are bivariate normal ..." And when not? This kind of explanation is why I never get statistics. "Rob, how do you like my new dress?" "The dark color emphasizes your light skin." "Sure, Rob, but do you *like* how it emphasizes my skin?" "Light skin is considered beautiful in many cultures." "I know, Rob, but do *you* like it?" "I think the dress is beautiful." "I think so, too, Rob, but is it beautiful *on me*?" "You always look beautiful to me, honey." *sigh*
Although the asymptotic distributions of the correlations are normal, the variances of those normal distributions depend on the unknown population parameters. In the sense of inference, we do require bivariate normality for Pearson's correlation.
No, we don't. It's quite possible to do inference for Pearson's correlation without assuming bivariate normality, in at least four different ways: (i) use asymptotic results -- already mentioned above; (ii) make some other parametric distributional assumption and derive or simulate the null distribution of the test statistic; (iii) use a permutation test; (iv) use a bootstrap test. There are probably other approaches.
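Options (iii) and (iv) can be sketched in a few lines of plain Python. This is only an illustration under my own assumptions: the function names, the exponential toy data, the seed, and the choice of a percentile bootstrap interval (rather than a formal bootstrap test) are all hypothetical, not a recommended procedure.

```python
import math
import random

def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def permutation_pvalue(x, y, n_perm=2000, rng=None):
    """Two-sided permutation p-value for H0: x and y are independent."""
    rng = rng or random.Random()
    r_obs = abs(pearson(x, y))
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # breaks any pairing between x and y
        if abs(pearson(x, y_perm)) >= r_obs:
            hits += 1
    # Add-one correction so the p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)

def bootstrap_ci(x, y, n_boot=2000, alpha=0.05, rng=None):
    """Percentile bootstrap interval for r, resampling (x, y) pairs."""
    rng = rng or random.Random()
    n = len(x)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(pearson([x[i] for i in idx], [y[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Skewed, clearly non-normal toy data: y is x plus exponential noise.
rng = random.Random(42)
x = [rng.expovariate(1.0) for _ in range(50)]
y = [a + rng.expovariate(1.0) for a in x]

r = pearson(x, y)
p = permutation_pvalue(x, y, rng=rng)
lo, hi = bootstrap_ci(x, y, rng=rng)
print(f"r = {r:.3f}, permutation p = {p:.4f}, 95% CI = ({lo:.3f}, {hi:.3f})")
```

Neither procedure relies on bivariate normality. The permutation test only uses exchangeability under the null of independence, and the bootstrap resamples the observed pairs, although with heavy-tailed data the bootstrap interval can still behave poorly.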
These answers all show what is wrong with today's statistics education. The CLT does NOT guarantee your data will converge to normal. In fact, in almost all cases it will NOT. Every answer here is circular because it assumes normality is something real-world data tends towards, which it does NOT. Most real-world data will be fat-tailed, meaning its moments are extremely ill-defined, or don't exist, period. Convergence is either slow or non-existent. Pearson's correlation is used out of convenience, not because it is a robust measure, which it is NOT.