Correlation between a nominal (IV) and a continuous (DV) variable

  • I have a nominal variable (different topics of conversation, coded as topic0=0 etc) and a number of scale variables (DV) such as the length of a conversation.

    How can I derive correlations between the nominal and scale variables?

    The most natural measure of association/correlation between a nominal variable (taken as the IV) and a scale variable (taken as the DV) is eta.

    If I understand correctly, you want to say something about the relation between topic of conversation (as the IV?) and conversation duration (as the DV), e.g. the hypothesis ''topic 1 means significantly shorter conversations than topic 2''. If that example is what you meant, you would use an ANOVA (or, with more than one DV, a MANOVA or several separate ANOVAs). Is this what you mean? The sentence containing your question is quite ambiguous.
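    To make that concrete, here is a minimal R sketch on made-up data (the topic labels and durations below are invented for illustration): a one-way ANOVA of duration on topic, with eta squared and eta computed by hand from the sums of squares.

    ```r
    # Hypothetical data: conversation duration (minutes) by topic.
    set.seed(1)
    df <- data.frame(
      topic    = factor(rep(c("topic0", "topic1", "topic2"), each = 5)),
      duration = c(rnorm(5, mean = 10), rnorm(5, mean = 6), rnorm(5, mean = 8))
    )

    fit <- aov(duration ~ topic, data = df)
    summary(fit)  # F test: do mean durations differ across topics?

    # eta squared = SS_between / SS_total; eta is its square root
    ss     <- summary(fit)[[1]][["Sum Sq"]]
    eta_sq <- ss[1] / sum(ss)
    eta    <- sqrt(eta_sq)
    c(eta_sq = eta_sq, eta = eta)
    ```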

  • The title of this question suggests a fundamental misunderstanding. The most basic idea of correlation is "as one variable increases, does the other variable increase (positive correlation), decrease (negative correlation), or stay the same (no correlation)" with a scale such that perfect positive correlation is +1, no correlation is 0, and perfect negative correlation is -1. The meaning of "perfect" depends on which measure of correlation is used: for Pearson correlation it means the points on a scatter plot lie right on a straight line (sloped upwards for +1 and downwards for -1), for Spearman correlation that the ranks exactly agree (or exactly disagree, so first is paired with last, for -1), and for Kendall's tau that all pairs of observations have concordant ranks (or discordant for -1). An intuition for how this works in practice can be gleaned from the Pearson correlations for the following scatter plots (image credit):

    Pearson correlation for various scatter plots
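    All three measures mentioned above are available in R through `cor()` and its `method` argument; a quick sketch on invented data with a noisy increasing trend:

    ```r
    set.seed(42)
    x <- 1:20
    y <- x + rnorm(20, sd = 3)       # noisy "as x increases, y tends to increase"

    cor(x, y, method = "pearson")    # closeness to a straight line
    cor(x, y, method = "spearman")   # agreement of ranks
    cor(x, y, method = "kendall")    # concordant vs discordant pairs
    ```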

    Further insight comes from considering Anscombe's Quartet where all four data sets have Pearson correlation +0.816, even though they follow the pattern "as $x$ increases, $y$ tends to increase" in very different ways (image credit):

    Scatter plots for Anscombe's Quartet
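    Anscombe's Quartet ships with R as the built-in `anscombe` data frame, so the claim about the four near-identical Pearson correlations is easy to check:

    ```r
    # The built-in anscombe data frame holds the quartet as columns x1..x4, y1..y4.
    data(anscombe)
    cors <- sapply(1:4, function(i) {
      cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
    })
    round(cors, 3)  # all four round to 0.816
    ```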

    If your independent variable is nominal then it doesn't make sense to talk about what happens "as $x$ increases". In your case, "Topic of conversation" doesn't have a numerical value that can go up and down. So you can't correlate "Topic of conversation" with "Duration of conversation". But as @ttnphns wrote in the comments, there are measures of strength of association you can use that are somewhat analogous. Here is some fake data and accompanying R code:

    data.df <- data.frame(
        topic = c(rep(c("Gossip", "Sports", "Weather"), each = 4)),
        duration  = c(6:9, 2:5, 4:7)
    )
    print(data.df)
    boxplot(duration ~ topic, data = data.df, ylab = "Duration of conversation")
    

    Which gives:

    > print(data.df)
         topic duration
    1   Gossip        6
    2   Gossip        7
    3   Gossip        8
    4   Gossip        9
    5   Sports        2
    6   Sports        3
    7   Sports        4
    8   Sports        5
    9  Weather        4
    10 Weather        5
    11 Weather        6
    12 Weather        7
    

    Box plots for fake data

    By using "Gossip" as the reference level for "Topic", and defining binary dummy variables for "Sports" and "Weather", we can perform a multiple regression.

    > model.lm <- lm(duration ~ topic, data = data.df)
    > summary(model.lm)
    
    Call:
    lm(formula = duration ~ topic, data = data.df)
    
    Residuals:
       Min     1Q Median     3Q    Max 
     -1.50  -0.75   0.00   0.75   1.50 
    
    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
    (Intercept)    7.5000     0.6455  11.619 1.01e-06 ***
    topicSports   -4.0000     0.9129  -4.382  0.00177 ** 
    topicWeather  -2.0000     0.9129  -2.191  0.05617 .  
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
    
    Residual standard error: 1.291 on 9 degrees of freedom
    Multiple R-squared: 0.6809,     Adjusted R-squared: 0.6099 
    F-statistic:   9.6 on 2 and 9 DF,  p-value: 0.005861 
    

    We can interpret the estimated intercept as giving the mean duration of Gossip conversations as 7.5 minutes, and the estimated coefficients for the dummy variables as showing Sports conversations were on average 4 minutes shorter than Gossip ones, while Weather conversations were 2 minutes shorter than Gossip. Part of the output is the coefficient of determination $R^2 = 0.6809$. One interpretation of this is that our model explains 68% of variance in conversation duration. Another interpretation of $R^2$ is that by square-rooting, we can find the multiple correlation coefficient $R$.

    > rsq <- summary(model.lm)$r.squared
    > rsq
    [1] 0.6808511
    > sqrt(rsq)
    [1] 0.825137
    

    Note that 0.825 isn't the correlation between Duration and Topic - we can't correlate those two variables because Topic is nominal. What it actually represents is the correlation between the observed durations, and the ones predicted (fitted) by our model. Both of these variables are numerical so we are able to correlate them. In fact the fitted values are just the mean durations for each group:

    > print(model.lm$fitted)
      1   2   3   4   5   6   7   8   9  10  11  12 
    7.5 7.5 7.5 7.5 3.5 3.5 3.5 3.5 5.5 5.5 5.5 5.5 
    

    Just to check, the Pearson correlation between observed and fitted values is:

    > cor(data.df$duration, model.lm$fitted)
    [1] 0.825137
    

    We can visualise this on a scatter plot:

    plot(x = model.lm$fitted, y = data.df$duration,
         xlab = "Fitted duration", ylab = "Observed duration")
    abline(lm(data.df$duration ~ model.lm$fitted), col="red")
    

    Visualise multiple correlation coefficient between observed and fitted values

    The strength of this relationship is visually very similar to those of the Anscombe's Quartet plots, which is unsurprising as they all had Pearson correlations about 0.82.

    You might be surprised that with a categorical independent variable, I chose to do a (multiple) regression rather than a one-way ANOVA. But in fact this turns out to be an equivalent approach.

    library(heplots) # provides etasq()
    model.aov <- aov(duration ~ topic, data = data.df)
    summary(model.aov)
    

    This gives a summary with identical F statistic and p-value:

                Df Sum Sq Mean Sq F value  Pr(>F)   
    topic        2     32  16.000     9.6 0.00586 **
    Residuals    9     15   1.667                   
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
    

    Again, the ANOVA model fits the group means, just as the regression did:

    > print(model.aov$fitted)
      1   2   3   4   5   6   7   8   9  10  11  12 
    7.5 7.5 7.5 7.5 3.5 3.5 3.5 3.5 5.5 5.5 5.5 5.5 
    

    This means that the correlation between fitted and observed values of the dependent variable is the same as it was for the multiple regression model. The "proportion of variance explained" measure $R^2$ for multiple regression has an ANOVA equivalent, $\eta^2$ (eta squared). We can see that they match.

    > etasq(model.aov, partial = FALSE)
                  eta^2
    topic     0.6808511
    Residuals        NA
    

    In this sense, the closest analogue to a "correlation" between a nominal explanatory variable and continuous response would be $\eta$, the square-root of $\eta^2$, which is the equivalent of the multiple correlation coefficient $R$ for regression. This explains the comment that "The most natural measure of association / correlation between a nominal (taken as IV) and a scale (taken as DV) variables is eta". If you are more interested in the proportion of variance explained, then you can stick with eta squared (or its regression equivalent $R^2$). For ANOVA, one often comes across the partial eta squared. As this ANOVA was one-way (there was only one categorical predictor), the partial eta squared is the same as eta squared, but things change in models with more predictors.

    > etasq(model.aov, partial = TRUE)
              Partial eta^2
    topic         0.6808511
    Residuals            NA
    
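    As a cross-check that doesn't need the heplots package, eta squared (and hence eta) can be recovered by hand from the ANOVA sums of squares; this sketch rebuilds the same fake data:

    ```r
    data.df <- data.frame(
      topic    = rep(c("Gossip", "Sports", "Weather"), each = 4),
      duration = c(6:9, 2:5, 4:7)
    )
    model.aov <- aov(duration ~ topic, data = data.df)

    # eta squared = SS_topic / (SS_topic + SS_residual) = 32 / (32 + 15)
    ss     <- summary(model.aov)[[1]][["Sum Sq"]]
    eta_sq <- ss[1] / sum(ss)
    eta_sq        # 0.6808511, matching R^2 and etasq()
    sqrt(eta_sq)  # 0.825137, matching the multiple correlation R
    ```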

    However it's quite possible that neither "correlation" nor "proportion of variance explained" is the measure of effect size you wish to use. For instance, your focus may lie more on how means differ between groups. This question and answer contain more information on eta squared, partial eta squared, and various alternatives.

    @Zhubarb the hard part was getting $R \approx 0.82$ for the bogus data...

    +1 for a very nicely explained answer! Here you argue that the sign of $\eta$ or $R$ is always positive, because of course any decently fit model will result in fitted values positively (rather than negatively) correlated with the DV. Maybe I could add that in *some* cases the sign can be meaningfully assigned to $\eta$, e.g. if IV is ordered (I believe this is then called "ordinal" instead of "nominal"), or at least partially ordered. Imagine that topics in the OP range from arts to math; then we could use the sign of correlation between nerdiness and DV and assign it to $\eta$.

    @amoeba Herein I think lies a subtle point. Suppose we run a simple linear regression and obtain PMCC $r=-0.9$ - then as x increases, y tends to decrease (this is the sort of directional effect you are talking about). Nevertheless the multiple correlation coefficient for such a regression is still $R=0.9$ (as fitted value of y increases, observed value tends to increase). Now $\eta$ is more like $R$ than $r$...
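    The distinction between $r$ and $R$ in this comment can be checked with a short sketch on invented data: with a strongly decreasing trend, `cor(x, y)` is negative, while the correlation between observed and fitted values equals its absolute value.

    ```r
    set.seed(2)
    x <- 1:30
    y <- 100 - 3 * x + rnorm(30, sd = 5)  # decreasing trend

    fit <- lm(y ~ x)
    cor(x, y)            # negative: as x increases, y tends to decrease
    cor(y, fitted(fit))  # positive: equals |cor(x, y)| = sqrt(R^2)
    ```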

    That is correct, but I guess what I am saying is that sometimes it can make sense to consider "signed $\eta$" that is more like $r$ than like $R$.

    @amoeba, you could just multiply $\sqrt{\eta^2}$ by $-1$, but it really is creating a new measure that you will have to explain every time, & I don't see how it has really done anything meaningful for you.

    @gung, perhaps this is not the best place to discuss this at length, as mine was only a marginal remark. Still, what I had in mind only makes sense for ordinal (or partially ordinal) IVs. I would argue that the direction of the effect can be meaningfully defined there, and so "signed $\eta$" could in principle be used and convey some useful information. I appreciate your point that this is unconventional.

    Great answer, Silverfish. Quick question: I ran the same analysis on my data set (which has 9 measures and 1 categorical variable - I was hoping to find how much each one individually affects the category), and all of them got very small R-squared values. What's the interpretation? Does it mean that none of them individually explains the categorical variable? Thanks

    @Silverfish What if the categorical variable is the dependent variable, with more than 2 levels, and the continuous variable is the independent variable? I tried ANOVA in this case and got warnings and some errors.

    @Silverfish - is there a name for this method, or is this something you put together yourself?

    @DJBunk Having a bit of a mind-blank here - let me know which bit you mean by "This method" and I'll see if I can answer!

    @Silverfish - In particular, the method for finding a continuous variable to plot on the x-axis (fitted duration in the example above). I am mostly looking for a term that might appear in textbooks or Google. Thanks!

    @DJBunk Try "observed vs fitted values" as a search term - I suspect most of your hits will be for its use as a regression diagnostic, but it works for e.g. ANOVA too.

License under CC-BY-SA with attribution


Content dated before 6/26/2020 9:53 AM