Correlations between continuous and categorical (nominal) variables

  • I would like to find the correlation between a continuous variable (the dependent variable) and a categorical variable (nominal: gender, the independent variable). The continuous data are not normally distributed. Previously, I computed it using Spearman's $\rho$. However, I have been told that this is not right.

    While searching on the internet, I found that a boxplot can give an idea of how strongly the two are associated; however, I am looking for a quantified value such as Pearson's product-moment coefficient or Spearman's $\rho$. Can you please help me with how to do this, or tell me which method would be appropriate?

    Would the point-biserial correlation coefficient be the right option?

    Normally, one cannot advise on the basis of the format of the data alone! What do the data represent, and what do you want to achieve with your analysis?

    Thanks kjetil, I would like to compare the association between gender and other continuous variables. I simply want to know which continuous variables are moderately/strongly correlated with gender and which are not.

    Yes, my question is similar to that. However, I got feedback in which a reviewer indicated that Spearman's $\rho$ is not appropriate. My sample size is 31. According to the answer (the link provided), non-normality wouldn't be an issue and any correlation method (Spearman/Pearson/point-biserial) can be used for a large dataset. Would that be true for a small dataset too? By the way, gender is not an artificially created dichotomous nominal scale. The above link should use biserial correlation coefficient.

    Correlation between nominal and interval or ordinal variable http://stats.stackexchange.com/q/73065/3277

  • The reviewer should have told you why Spearman's $\rho$ is not appropriate. Here is one version of that argument: Let the data be $(Z_i, I_i)$ where $Z$ is the measured variable and $I$ is the gender indicator, say 0 (man) and 1 (woman). Spearman's $\rho$ is then calculated from the ranks of $Z$ and $I$ respectively. Since the indicator $I$ has only two possible values, there will be a lot of ties, so the usual formula is not appropriate. If you replace ranks with mean ranks (midranks), the indicator gets only two different rank values, one for men and another for women. Then $\rho$ becomes basically a rescaled version of the difference in mean ranks between the two groups. It would be simpler (and more interpretable) to compare the means directly! Another approach is the following.

    Let $X_1, \dots, X_n$ be the observations of the continuous variable among men, and $Y_1, \dots, Y_m$ the same among women. Now, if the distributions of $X$ and of $Y$ are the same, then $P(X>Y)$ will be 0.5 (let's assume the distribution is absolutely continuous, so there are no ties). In the general case, define $$ \theta = P(X>Y) $$ where $X$ is a random draw among men and $Y$ a random draw among women. Can we estimate $\theta$ from our sample? Form all pairs $(X_i, Y_j)$ (assume no ties) and count for how many we have "man is larger" ($X_i > Y_j$), call this $M$, and for how many "woman is larger" ($X_i < Y_j$), call this $W$. Then one sample estimate of $\theta$ is $$ \hat{\theta} = \frac{M}{M+W} $$ That is one reasonable measure of correlation! (If there are only a few ties, just ignore them.) But I am not sure what that is called, if it has a name. This one may be close: https://en.wikipedia.org/wiki/Goodman_and_Kruskal%27s_gamma
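
    The pair-counting estimate described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `prob_superiority` and the data values are made up here, not taken from the question. When there are no ties, $M + W = nm$, so $M$ is the same count that the Mann-Whitney $U$ statistic is built on.

```python
from itertools import product


def prob_superiority(x, y):
    """Estimate theta = P(X > Y) as M / (M + W), ignoring tied pairs.

    `prob_superiority` is a hypothetical helper name used for illustration.
    """
    # M: number of pairs where the x-observation is larger ("man is larger")
    m = sum(1 for xi, yj in product(x, y) if xi > yj)
    # W: number of pairs where the y-observation is larger ("woman is larger")
    w = sum(1 for xi, yj in product(x, y) if xi < yj)
    if m + w == 0:
        raise ValueError("all pairs are tied; the estimate is undefined")
    return m / (m + w)


# Made-up measurements, for illustration only
men = [5.1, 6.3, 7.0, 8.2]
women = [4.0, 5.5, 6.1]

theta_hat = prob_superiority(men, women)
print(theta_hat)  # 10 of the 12 pairs have "man is larger"
```

    A value near 0.5 suggests little association between group and measurement; values near 0 or 1 suggest a strong one.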

    Spearman's rank correlation is just Pearson's correlation applied to the ranks of the numeric variable and the values of the original binary variable (ranking has no effect here). So Spearman's $\rho$ is the rank analogue of the point-biserial correlation. I don't see any problem in using Spearman's $\rho$ descriptively in this situation.
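
    The claim in this comment can be checked numerically. The sketch below uses only the standard library; the helper names `midranks` and `pearson` and the data are assumptions made for illustration. It relies on the fact that the midranks of a 0/1 variable are an increasing linear function of its values, which leaves Pearson's correlation unchanged.

```python
from math import sqrt


def midranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next sorted value is tied with this one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1 .. end+1
        for k in order[pos:end + 1]:
            ranks[k] = avg
        pos = end + 1
    return ranks


def pearson(a, b):
    """Plain Pearson product-moment correlation."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / sqrt(var_a * var_b)


# Made-up data: z is the measurement, i is the indicator (0 = man, 1 = woman)
z = [5.1, 6.3, 7.0, 8.2, 4.0, 5.5, 6.1]
i = [0, 0, 0, 0, 1, 1, 1]

spearman = pearson(midranks(z), midranks(i))   # Spearman's rho with midranks
rank_point_biserial = pearson(midranks(z), i)  # rank only the numeric variable
point_biserial = pearson(z, i)                 # ordinary point-biserial

print(spearman, rank_point_biserial, point_biserial)
```

    The first two numbers agree up to floating-point error; the third generally differs because it uses the raw values rather than their ranks.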

    Michael Mayer: Yes, it might work, maybe, but is there any point in it? It doesn't give any information that is not already contained in some difference of means, and the latter is more directly interpretable.

    Is a difference in ranks much simpler to interpret than Spearman's $\rho$? Even if so, would you call Spearman's $\rho$ wrong? Sad that we don't see the reviewer's reasoning.

    It is too much to say that using $\rho$ is simply wrong, I agree. And we should really see the reviewer's reasoning here! But something like what I suggested above seems better to me. What do you think?

    What you suggest is nice. It seems to be related to the test statistic of Wilcoxon's two-sample test, which is itself similar to Kendall's rank correlation between the numeric outcome and the binary group variable.

    @kjetilbhalvorsen: I recently came across a similar question and appreciate your input. But your proposed metric is not symmetric as a correlation: its value depends on which group is in the numerator. Is there a way to justify that? Thanks!

    @tao.hong In which sense do you think it is asymmetric? If you switch the labels (men/women), then $\theta$ and $\hat{\theta}$ both switch in the same way, to $1-\theta$ and $1-\hat{\theta}$.

    @kjetilbhalvorsen, OK. I was assuming that $\theta$ always represents $P(X>Y)$. Your explanation makes sense to me now.

    This answer is an excellent piece of work! It would be great if you could extend it with appropriate sources, because there are so many confusing threads about this topic.
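
    The label switch can be illustrated with a short sketch (helper name `pair_counts` and data are made up for illustration). It also shows the rescaled quantity $2\hat{\theta} - 1 = (M - W)/(M + W)$, which lies in $[-1, 1]$ and simply changes sign when the labels are swapped; this has the same "concordant minus discordant" form as the Goodman and Kruskal gamma linked in the answer.

```python
def pair_counts(x, y):
    """Count pairs with x larger (M) and pairs with y larger (W); ties are skipped."""
    m = sum(1 for xi in x for yj in y if xi > yj)
    w = sum(1 for xi in x for yj in y if xi < yj)
    return m, w


# Made-up measurements, for illustration only
men = [5.1, 6.3, 7.0, 8.2]
women = [4.0, 5.5, 6.1]

# Original labelling: X = men, Y = women
m, w = pair_counts(men, women)
theta_mw = m / (m + w)

# Switched labelling: X = women, Y = men
m_sw, w_sw = pair_counts(women, men)
theta_wm = m_sw / (m_sw + w_sw)

# Rescaled to [-1, 1]: the magnitude stays the same, only the sign flips
signed_mw = 2 * theta_mw - 1
signed_wm = 2 * theta_wm - 1

print(theta_mw, theta_wm)    # the two estimates sum to 1
print(signed_mw, signed_wm)  # equal magnitude, opposite sign
```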

Licensed under CC-BY-SA with attribution


Content dated before 6/26/2020 9:53 AM