T-test for non-normal data when N > 50?
Long ago I learnt that a normal distribution was necessary to use a two-sample t-test. Today a colleague told me that she learnt that for N > 50 a normal distribution is not necessary. Is that true?
If true, is that because of the central limit theorem?
Related question with a *very* good answer by Glen_b http://stats.stackexchange.com/questions/121852/how-to-choose-between-t-test-or-non-parametric-test-e-g-wilcoxon-in-small-sampl
Normality assumption of a t-test
Consider a large population from which you could take many different samples of a particular size. (In a particular study, you generally collect just one of these samples.)
The t-test assumes that the means of the different samples are normally distributed; it does not assume that the population is normally distributed.
By the central limit theorem, means of samples from a population with finite variance approach a normal distribution regardless of the distribution of the population. Rules of thumb say that the sample means are basically normally distributed as long as the sample size is at least 20 or 30. For a t-test to be valid on a sample of smaller size, the population distribution would have to be approximately normal.
The t-test is invalid for small samples from non-normal distributions, but it is valid for large samples from non-normal distributions.
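You can check this claim empirically. The sketch below (my own illustration, not from the original answer; the sample sizes and replication count are arbitrary choices) estimates the type I error rate of a two-sample t-test when both samples come from a skewed chi-square population, so the null hypothesis of equal means is true and the rejection rate should be near the nominal 0.05:

```r
# Sketch: empirical type-I error of the two-sample t-test under a
# skewed population (chi-square, df = 1). The null is true, so a
# valid test should reject about 5% of the time.
set.seed(1)
reject_rate <- function(n, reps = 5000) {
  p <- replicate(reps, t.test(rchisq(n, df = 1), rchisq(n, df = 1))$p.value)
  mean(p < 0.05)
}
reject_rate(5)    # small samples: may drift from the nominal 0.05
reject_rate(100)  # large samples: should be close to 0.05
```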
Small samples from non-normal distributions
As Michael notes below, the sample size needed for the distribution of means to approximate normality depends on the degree of non-normality of the population. For approximately normal populations, you won't need as large a sample as you would for a very non-normal one.
Here are some simulations you can run in R to get a feel for this. First, here are a couple of population distributions.
curve(dnorm, xlim=c(-4,4))            # Normal
curve(dchisq(x, df=1), xlim=c(0,30))  # Chi-square with 1 degree of freedom
Next are some simulations of samples from the population distributions. In each of these lines, "10" is the sample size, "100" is the number of samples and the function after that specifies the population distribution. They produce histograms of the sample means.
hist(colMeans(sapply(rep(10,100),rnorm)),xlab='Sample mean',main='')
hist(colMeans(sapply(rep(10,100),rchisq,df=1)),xlab='Sample mean',main='')
For a t-test to be valid, these histograms of sample means should look approximately normal. The normal quantile-quantile plots below (using qqp from the car package) make departures from normality easier to judge than histograms:
require(car)
qqp(colMeans(sapply(rep(10,100),rnorm)),xlab='Sample mean',main='')
qqp(colMeans(sapply(rep(10,100),rchisq,df=1)),xlab='Sample mean',main='')
Utility of a t-test
I have to note that all of the knowledge I just imparted is somewhat obsolete; now that we have computers, we can do better than t-tests. As Frank notes, you probably want to use Wilcoxon tests anywhere you were taught to run a t-test.
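For concreteness, here is what that substitution looks like in R (the data below are made up for illustration; a shift of 0.5 between two chi-square samples):

```r
# Sketch: a Wilcoxon rank-sum test as a drop-in alternative to t.test
# on skewed data. The data and the 0.5 shift are illustrative only.
set.seed(2)
x <- rchisq(30, df = 1)
y <- rchisq(30, df = 1) + 0.5
wilcox.test(x, y)  # rank-based, no normality assumption
t.test(x, y)       # classic t-test, for comparison
```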
Good explanation (+1). I would add, however, that the sample size needed for the distribution of means to approximate normality depends on the degree of non-normality of the population. For large samples there is no reason to prefer a t-test over a permutation test that makes no assumptions about the distributions.
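A permutation test like the one mentioned above can be sketched in a few lines of R (my own illustration; the data, sample sizes, and number of permutations are arbitrary). The idea is to compare the observed difference in means against its distribution under random relabelling of the pooled data:

```r
# Sketch of a two-sample permutation test on the difference in means.
set.seed(3)
x <- rchisq(40, df = 1)
y <- rchisq(40, df = 1)
obs <- mean(x) - mean(y)          # observed difference in means
pooled <- c(x, y)
perm <- replicate(10000, {
  idx <- sample(length(pooled), length(x))  # random relabelling
  mean(pooled[idx]) - mean(pooled[-idx])
})
mean(abs(perm) >= abs(obs))       # two-sided permutation p-value
```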
+1, although as far as I know the t-test is fairly resistant to moderate deviations from normality. Also, an interesting related discussion: http://stats.stackexchange.com/questions/2492/normality-testing-essentially-useless
(+1 to the comments) Good points! My answer was very black-and-white because I don't really know the theory about the greyness in between. I added some simulations that may help address this though.
Good answer, although there is one small detail you missed: the distribution of the data must have finite variance. The t-test is hopeless for comparing the difference in location of two Cauchy distributions (or Student's t with 2 degrees of freedom), not because it is "non-robust", but because for these distributions there is additional relevant information in the sample beyond the means and standard deviations, which the t-test throws away.
But is there really data in reality that has an underlying infinite variance? I would suspect that this happens only when taking ratios where the denominator can be arbitrarily close to zero. Otherwise I disagree that there is no reason to use the t-test any more; it is so closely tied to the effect size that it remains useful for this purpose. More critically, the Wilcoxon test compares medians rather than means: in some cases this does not matter, in others it might be exactly what you want, and in still others it might not be what you want at all.
In addition to this, the t-test naturally yields confidence intervals for the parameter being investigated. (Still an upvote because of the first two paragraphs, which address the question directly; I just disagree strongly with the third.)
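In R, that confidence interval comes for free with the test output (illustrative data; the true mean difference here is 1 by construction):

```r
# Sketch: t.test() reports a confidence interval for the difference
# in means alongside the p-value. Data are simulated for illustration.
set.seed(4)
x <- rnorm(50, mean = 1)
y <- rnorm(50)
t.test(x, y)$conf.int  # 95% CI for the mean difference by default
```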
"But is there really data in reality that has has an underlying infinite variance?" Certainly there is! Stock returns?
@Erik: data "in reality" with an infinite variance occurs often! Besides, you don't need the variance to be infinite; it is enough that it is "effectively infinite". See my answer to http://stats.stackexchange.com/questions/94402/what-is-the-difference-between-finite-and-infinite-variance/100161#100161
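The Cauchy case mentioned earlier in this thread is easy to see by simulation (sample sizes here are arbitrary illustrative choices): the sample mean of Cauchy draws does not settle down as n grows, which is exactly the failure of the CLT that the finite-variance condition rules out.

```r
# Sketch: sample means of Cauchy data do not converge as n grows,
# unlike means of finite-variance data.
set.seed(5)
sapply(c(10, 1000, 100000), function(n) mean(rcauchy(n)))  # erratic
sapply(c(10, 1000, 100000), function(n) mean(rnorm(n)))    # settles near 0
```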
The t-test DOES require normality of the population. That assumption is needed for the t statistic to have a Student's t distribution. If the population is not normal, you can't express the t statistic as a standard normal variable divided by the square root of a chi-squared variable divided by its degrees of freedom. Maybe what you are trying to say is that under some conditions, like not too much skewness or a large sample, the test can still be approximately valid even when the population is not normal.
I agree with @toneloy -- for the t-statistic to have a t-distribution does rely on normality. A t-statistic is not just a numerator; it is a ratio of two random variables, and the behavior of the denominator (and the dependence between the two) is relevant in determining how the statistic is distributed. I discuss this in the first part of my answer here.