### Correlations with unordered categorical variables

I have a dataframe with many observations and many variables. Some of them are categorical (unordered) and the others are numerical.

I'm looking for associations between these variables. I've been able to compute correlation for numerical variables (Spearman's correlation) but :

- I don't know how to measure correlation between unordered categorical variables.
- I don't know how to measure correlation between unordered categorical variables and numerical variables.

Does anyone know how this could be done? If so, are there R functions implementing these methods?

gung - Reinstate Monica (accepted answer)

6 years ago

It depends on what sense of a correlation you want. When you run the prototypical Pearson's product-moment correlation, you get a measure of the strength of association and you get a test of the significance of that association. More typically, however, the significance test and the measure of effect size differ.

**Significance tests:**

- Continuous vs. nominal: run an ANOVA. In R, you can use `aov` (see `?aov`).
- Nominal vs. nominal: run a chi-squared test. In R, you use `chisq.test` (see `?chisq.test`).
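As a minimal sketch of both tests, assuming the built-in `mtcars` data with `cyl` and `am` treated as nominal factors (my choice of example data, not from the original answer):

```r
# mpg is continuous; cyl and am are treated as nominal factors.
data(mtcars)
mtcars$cyl <- factor(mtcars$cyl)
mtcars$am  <- factor(mtcars$am)

# Continuous vs. nominal: one-way ANOVA
summary(aov(mpg ~ cyl, data = mtcars))

# Nominal vs. nominal: chi-squared test of independence on the contingency table
chisq.test(table(mtcars$cyl, mtcars$am))
```

Note that with small expected cell counts, `chisq.test` will warn that the approximation may be inaccurate.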

**Effect size** (strength of association):

- Continuous vs. nominal: calculate the intraclass correlation. In R, you can use `ICC` in the psych package (see `?ICC`); there is also an ICC package.
- Nominal vs. nominal: calculate Cramér's V. In R, you can use `assocstats` in the vcd package (see `?assocstats`).
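If you'd rather avoid the vcd dependency, Cramér's V can be computed by hand from the chi-squared statistic; this base-R sketch (the helper name `cramers_v` is mine, not from any package) should agree with what `assocstats` reports:

```r
# Cramér's V: sqrt( chi^2 / (n * (min(rows, cols) - 1)) )
cramers_v <- function(x, y) {
  tab  <- table(x, y)
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  n    <- sum(tab)
  k    <- min(nrow(tab), ncol(tab))
  unname(sqrt(chi2 / (n * (k - 1))))
}

# Example with the built-in mtcars data
cramers_v(mtcars$cyl, mtcars$am)
```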

A very thorough explanation of the continuous vs. nominal case can be found here: Correlation between a nominal (IV) and a continuous (DV) variable.
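One correlation-like effect size discussed in that thread is the correlation ratio (eta), the square root of the between-group sum of squares over the total sum of squares. A base-R sketch (the helper `eta` is hypothetical, not a package function):

```r
# Correlation ratio (eta) for a nominal IV g and a continuous DV y:
# sqrt(SS_between / SS_total), a correlation-like measure in [0, 1].
eta <- function(y, g) {
  g <- factor(g)
  ss_total   <- sum((y - mean(y))^2)
  ss_between <- sum(tapply(y, g, function(v) length(v) * (mean(v) - mean(y))^2))
  sqrt(ss_between / ss_total)
}

# Example with the built-in mtcars data
eta(mtcars$mpg, mtcars$cyl)
```

For a one-way design this is simply the square root of the ANOVA's R-squared.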

In the binary vs. interval case, there's the point-biserial correlation.
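The point-biserial correlation is numerically identical to Pearson's r with the binary variable coded 0/1, so base R's `cor` and `cor.test` suffice (again using `mtcars` as illustrative data):

```r
# am is already coded 0 = automatic, 1 = manual; mpg is continuous.
cor(mtcars$am, mtcars$mpg)       # point-biserial r
cor.test(mtcars$am, mtcars$mpg)  # same r, plus a significance test
```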

What would be a better alternative to the chi-squared test for large samples?

@WaldirLeoncio, "better" in what sense? What is wrong with the chi-squared if you want a test of independence? What constitutes a "large sample" for you?

Well, from what I've read and experienced, when the sample size is in the tens of thousands, for instance, even small deviations from the expected frequencies (say, something a visual analysis would consider irrelevant) often result in very small p-values, on the order of $10^{-16}$.

@WaldirLeoncio, yes but if the null is true, $p$ will be $<.05$ only $5\%$ of the time. That is the way it is supposed to work. If you want to know the magnitude of the effect as well as a test of the null, you may want to calculate Cramer's V along with the chi-squared test.

As @gung pointed out, Correlation between a nominal (IV) and a continuous (DV) variable is an excellent link for how correlation for mixed variables can be done. `Hmisc::rcorr` does this beautifully, and we can check it (for a mixed-variables dataframe) as follows: `as.data.frame(rcorr(as.matrix(data_frame), type = "pearson")$P)` and `as.data.frame(rcorr(as.matrix(data_frame), type = "pearson")$r)`

@gung, my teacher told me to use `L, C, Lambda` for nominal vs. nominal, but you said to use `chisq.test`?

@kittygirl, I don't know what `L, C, Lambda` are (for nominal vs nominal, or anything else). I do say to use a chi-squared test to test for an association between two nominal variables, as you say & can see above.

@gung, have a look at https://www.andrews.edu/~calkins/math/edrm611/edrm13.htm

For an R implementation that calculates the strength of association for nominal vs. nominal with a bias-corrected Cramér's V, for numeric vs. numeric with Spearman (the default) or Pearson correlation, and for nominal vs. numeric with ANOVA, see https://stackoverflow.com/a/56485520/590437

License under CC-BY-SA with attribution

Content dated before 6/26/2020 9:53 AM

ttnphns 4 years ago

http://stats.stackexchange.com/q/119835/3277; http://stats.stackexchange.com/q/73065/3277; http://stats.stackexchange.com/q/103253/3277.