ANOVA assumption: normality / normal distribution of residuals
The Wikipedia page on ANOVA lists three assumptions, namely:
- Independence of cases – this is an assumption of the model that simplifies the statistical analysis.
- Normality – the distributions of the residuals are normal.
- Equality (or "homogeneity") of variances, called homoscedasticity...
The point of interest here is the second assumption. Several sources state the assumption differently: some require normality of the raw data, others normality of the residuals.
Several questions pop up:
- are normality and normal distribution of residuals the same thing? (Based on the Wikipedia entry, I would claim normality is a property, and does not pertain to the residuals directly, though it can be a property of the residuals.)
- if not, which assumption should hold? One? Both?
- if the assumption of normally distributed residuals is the right one, are we making a grave mistake by checking only the histogram of the raw values for normality?
You can pretty much ignore anything else those sources say if they claim the raw data need to be normally distributed. And who said "we" were only checking the raw values with histograms, anyway? Are you in one of those Six Sigma classes?
@Andy W: I've just added a link to what appears to be the relevant section of the Wikipedia article on ANOVA.
@DWin: http://blog.markanthonylawson.com/?p=296 (sorry, *completely* off-topic but couldn't resist)
@onestop thank you. I only requested the link because I am lazy and did not want to look up ANOVA on Wikipedia myself, not because it is essential for the question.
Related question here: what-if-residuals-are-normally-distributed-but-y-is-not.
Let's assume this is a fixed effects model. (The advice doesn't really change for random-effects models, it just gets a little more complicated.)
No, normality and normal distribution of the residuals are not the same. Suppose you measured yield from a crop with and without a fertilizer application. In plots without fertilizer the yield ranged from 70 to 130. In plots with fertilizer the yield ranged from 470 to 530. The distribution of results is strongly non-normal: it is clustered at two locations determined by the fertilizer application. Suppose further the average yields are 100 and 500, respectively. Then all residuals range from -30 to +30. They might (or might not) be normally distributed, but obviously this is a completely different distribution.
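To make the distinction concrete, here is a small simulation sketch (in Python; the group sizes and the within-group noise level are invented to match the numbers above):

```python
import random
import statistics

random.seed(1)

# Two groups of plots: without fertilizer (mean yield 100) and with (mean 500).
# Within-group noise is Normal, so yields cluster around each group mean.
without = [random.gauss(100, 10) for _ in range(200)]
with_f = [random.gauss(500, 10) for _ in range(200)]

# The raw data are strongly bimodal: nothing falls between the two clusters.
raw = without + with_f
assert not any(200 < y < 400 for y in raw)

# Residuals subtract each group's own mean; they pool into a single
# mean-zero distribution, which here is perfectly Normal.
m0, m1 = statistics.fmean(without), statistics.fmean(with_f)
residuals = [y - m0 for y in without] + [y - m1 for y in with_f]
assert abs(statistics.fmean(residuals)) < 1e-9
```

Plotting a histogram of `raw` versus one of `residuals` makes the point visually: the first shows two widely separated humps, the second a single bell around zero.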
The distribution of the residuals matters, because they reflect the random part of the model. Note also that the p-values are computed from F (or t) statistics and those depend on residuals, not on the original values.
If there are significant and important effects in the data (as in this example), then you might be making a "grave" mistake. You could, by luck, make the correct determination: that is, by looking at the raw data you will see a mixture of distributions, and this can look normal (or not). The point is that what you're looking at is not relevant.
ANOVA residuals don't have to be anywhere close to normal in order to fit the model. However, near-normality of the residuals is essential for p-values computed from the F-distribution to be meaningful.
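For instance, the one-way ANOVA F statistic is built entirely from the fitted group means and the residuals, not from where the raw values happen to sit. A minimal hand computation (in Python; the two tiny groups are invented for illustration):

```python
import statistics

# Hypothetical data: two small groups for a one-way ANOVA.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]]

k = len(groups)                      # number of groups
n = sum(len(g) for g in groups)      # total observations
grand = statistics.fmean([x for g in groups for x in g])

# Between-group sum of squares: driven by the fitted group means.
ssb = sum(len(g) * (statistics.fmean(g) - grand) ** 2 for g in groups)

# Within-group sum of squares: the sum of squared residuals.
ssw = sum((x - statistics.fmean(g)) ** 2 for g in groups for x in g)

# F compares explained variation to residual variation per degree of freedom.
F = (ssb / (k - 1)) / (ssw / (n - k))
print(F)  # 1.5 for these numbers
```

Referring this F to the F-distribution with (k - 1, n - k) degrees of freedom to get a p-value is where the near-normality of the residuals comes in.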
I think there is an important point to add: in an ANOVA, normality within each group (not overall) is equivalent to normality of the residuals.
@Aniko Could you please elaborate on what you mean by "equivalent" in your comment? It is almost tautological that normality within a group is the same as normality of that group's residuals, but it is false that normality separately within each group implies (or is implied by) normality of the residuals.
I really meant the tautological sense: if the groups are normal then the residuals are normal. The reverse is only true if homoscedasticity is added (as in ANOVA). I don't mean to advocate checking the groups instead of the residuals, but I think this is the underlying reason for the varying phrasing of the assumptions.
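The role of homoscedasticity in this reverse direction can be checked numerically (a Python sketch with invented groups): with equal within-group variances the pooled, centered residuals are simply Normal, but with unequal variances they form a scale mixture of Normals, which shows up as excess kurtosis.

```python
import random
import statistics

random.seed(42)
n = 100_000  # observations per group

def kurtosis(xs):
    """Sample kurtosis; Normal data gives a value near 3."""
    m = statistics.fmean(xs)
    s2 = statistics.fmean([(x - m) ** 2 for x in xs])
    return statistics.fmean([(x - m) ** 4 for x in xs]) / s2 ** 2

# Homoscedastic case: both groups Normal with the same sd.
# The pooled residuals are just N(0, 1): kurtosis near 3.
equal = [random.gauss(0, 1) for _ in range(2 * n)]

# Heteroscedastic case: groups Normal with sd 1 and sd 3.
# The pooled residuals are a 50/50 scale mixture of Normals,
# with theoretical kurtosis 123/25 = 4.92 rather than 3.
mixed = [random.gauss(0, 1) for _ in range(n)] + \
        [random.gauss(0, 3) for _ in range(n)]

print(kurtosis(equal), kurtosis(mixed))
```

So each group being Normal does not by itself make the pooled residuals Normal; the equal-variance assumption is doing real work.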
I've noticed that people doing an ANOVA usually seem interested in computing p-values, and hence the normality of residuals is important for them. Are there any common reasons to fit an ANOVA model if we're not interested in computing p-values from the F-distribution? Apologies if this question is too broad for a comment.
@user1205901 That is a very good point. Two common uses of ANOVA that do not rely on the F test are (1) it's a convenient way to obtain effect estimates and (2) it's part and parcel of a components of variance calculation.
@whuber Why is normality of the residuals different from normality within the group? The group is centered over the mean estimate, the residuals over 0, both with the same variance. Can one be normal and the other not?
@Cindy Consider the (very common) situation where the two groups have different means.
I don't get it. One is normal with N(mu1,s), the other with N(mu2,s), the residuals with N(0,s). Doesn't that say that either condition of normality will do?
@Cindy I don't know what you mean by the "either condition." What is clear in such a case is that the *conditional* responses are Normal but the *marginal* response (which is a mixture of Normals with different means) is not Normal.
@whuber You said: "...but it is false that normality separately within each group implies (or is implied by) normality of the residuals." Within each group the data is not a mixture of normal distributions, but a single normal distribution, right? Unless I misunderstand some of the terminology.
@Cindy I think you misunderstand what I wrote. The residuals indeed can be Normal without the residuals in either group being Normal. To see how, you can simulate this situation in the computer: begin by generating the residuals from a mean-zero Normal distribution. Randomly assign them to groups "A" and "B," but make the probability of assignment to "A" depend on how extreme the residuals are. Both groups of residuals will have zero expectations but they originate with non-Normal distributions.
Here is `R` code for such a simulation:

```r
n <- 10000
res <- rnorm(2 * n)
res <- res[order(abs(res))]          # sort residuals by extremeness
p <- dnorm(res)
p <- p / sum(p)                      # assignment probabilities favor central values
i <- sample.int(2 * n, n, prob = p)  # group "A" gets the less extreme draws more often
DF <- data.frame(Group = rep(c("A", "B"), each = n),
                 Residual = c(res[i], res[-i]))
table(DF$Group)
library(ggplot2)
ggplot(DF, aes(Residual, fill = Group)) +
  geom_density(size = 1.25, alpha = 1/2)
```
@whuber, is there any way to cite your explanation? Perhaps you wrote or published it somewhere? I really wish we could cite responses here. Some, like many of yours, are of tremendous value.
@streamline Any good account of ANOVA will include equivalent information. But there is an easy way to cite posts here on CV: click on the "cite" link beneath the post and copy the text that pops up.
@whuber, Thank you, Dr. Huber, for taking the time to answer even the silliest of questions! I know this kind of comment is discouraged, but I think we too often forget to say thank you.
@Streamline You are welcome. I accept your compliments on behalf of the several hundred regular users of this site who routinely answer questions. I'm pretty sure they view few of the questions as "silly." I, for one, read as many questions as possible and find some of them quite illuminating, because they reveal unexpected and often interesting ways in which people interpret explanations, descriptions, and definitions. Reflecting on such questions helps me understand the concept better and, I hope, will enable me to explain or teach it better.