Is there a minimum sample size required for the t-test to be valid?
I'm currently working on a quasi-experimental research paper. I only have a sample size of 15, due to the low population within the chosen area and the fact that only 15 fit my criteria. Is 15 a large enough sample to compute a t-test and an F-test? If so, where can I get an article or book to support using this small a sample size?
This paper was already defended last Monday, and one of the panel members asked for a supporting reference because my sample size is too low. He said it should have been at least 40 respondents.
A sample size can be substantially smaller than 15 if the assumptions hold. Was the validity of the t-distribution the only reason he suggested a larger sample?
Just to clarify, what kind of t-test are you performing: one-sample, paired, or two-sample?
Historically, the very first demonstration of the t-test (in "Student"'s 1908 paper) was in an application to samples of size *four*. Indeed, obtaining improved results for *small* samples is the test's claim to fame: once the sample size reaches 40 or so, the t-test is not substantially different from the z-tests researchers had been applying throughout the 19th century. You may share a modern version of this paper with the panel member: http://www.york.ac.uk/depts/maths/histstat/student.pdf. Point out the investigation in Section VI, pp. 14-18.
@Glen_b Yes, he said it's not valid if I have only 15 respondents, and I have to repeat the procedure for getting my data with 40 subjects if I don't show him an article to support my small sample size. Ahh! I'm going nuts. I saw the articles linked here (and I'm thankful), but our supporting references should be from within the past 10 years only.
But you should ponder the fact that samples as small as 4 worked because Student had high-quality data: chemical lab data from experiments, not quasi-experiments. Your main problem is not sample size but representativeness: how do you know that your data are representative of anything?
I am not clear about some of your statements: have you applied a two-sample t-test to two samples of 15 observations each? What about the variances (assumed equal or not)? Or have you compared the mean of one sample of 15 observations against a given reference value (one-sample t-test)?
For a paired t-test the smallest possible sample size would be 2 pairs (4 observations total) -- as long as the assumptions were reasonable, and all the other usual caveats held up. For an independent-samples test assuming equal variances, 3 observations total is just possible (2 in one sample, 1 in the other -- though that case is not always implemented in software packages; some will want two variance estimates). For a Welch-type test, 2 observations per sample is the minimum possible. In each case, though, power would be the big issue -- you'd only have a good chance of detecting very large effects.
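As an illustration of these minimal sizes, here is a quick sketch in Python with scipy (the data are made up purely to show that the tests can be computed at these sizes; the 2-versus-1 equal-variance case is left out since, as noted above, not all packages implement it):

```python
from scipy import stats

# Paired t-test with just 2 pairs (4 observations, 1 degree of freedom).
before = [10.0, 12.0]
after = [11.5, 14.0]
paired = stats.ttest_rel(before, after)

# Welch-type test with the minimum of 2 observations per sample.
a = [1.0, 2.0]
b = [4.0, 7.0]
welch = stats.ttest_ind(a, b, equal_var=False)

print(paired.statistic, paired.pvalue)
print(welch.statistic, welch.pvalue)
```

Both calls return finite statistics and p-values, so the tests "work" at these sizes; whether they are worth doing is the power question above.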
There is no minimum sample size for the t-test to be valid other than that the sample be large enough to calculate the test statistic. Validity requires that the assumptions for the test statistic hold approximately. In the one-sample case, the assumptions are that the data are iid normal (or approximately normal) with mean 0 under the null hypothesis and a variance that is unknown but estimated from the sample. In the two-sample case, they are that both samples are independent of each other and that each sample consists of iid normal variables, with the two samples having the same mean and a common unknown variance under the null hypothesis. A pooled estimate of variance is used for the statistic.
In the one-sample case the distribution under the null hypothesis is a central t with n-1 degrees of freedom. In the two-sample case, with sample sizes n and m not necessarily equal, the null distribution of the test statistic is t with n+m-2 degrees of freedom. The increased variability due to a small sample is accounted for in the distribution, which has heavier tails when the degrees of freedom are low, corresponding to a small sample size. So critical values can be found that give the test statistic a specified significance level for any sample size (well, at least of size 2 or larger).
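The heavier tails show up directly in the critical values. A sketch in Python with scipy (df = 14 corresponds to a one-sample test with n = 15):

```python
from scipy import stats

# Two-sided 5% critical values: t with few degrees of freedom vs. the normal.
for df in (2, 5, 14, 30):
    print(df, round(stats.t.ppf(0.975, df), 3))
print("z", round(stats.norm.ppf(0.975), 3))
```

The df = 14 critical value is about 2.145 versus 1.960 for z, and at df = 2 it is above 4.3: the wider critical region is how the t distribution pays for the poorly estimated variance.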
The problem with a low sample size is the power of the test. The reviewer may have felt that 15 per group was not a large enough sample size to have high power of detecting a meaningful difference, say $\delta$, between the two means (or, for a one-sample problem, a mean greater than $\delta$ in absolute value). Justifying 40 would require specifying a certain power at a particular $\delta$ that is achieved with n equal to 40 but not with fewer.
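That power calculation can be sketched via the noncentral t distribution. The function below (Python with scipy; the name is my own) performs the same computation as R's power.t.test for a two-sided one-sample test:

```python
import numpy as np
from scipy import stats

def one_sample_power(delta, n, alpha=0.05, sd=1.0):
    # Power of the two-sided one-sample t-test: under the alternative, the
    # test statistic has a noncentral t distribution with noncentrality
    # parameter (delta/sd)*sqrt(n).
    df = n - 1
    nc = (delta / sd) * np.sqrt(n)
    tcrit = stats.t.ppf(1 - alpha / 2, df)
    return (1 - stats.nct.cdf(tcrit, df, nc)) + stats.nct.cdf(-tcrit, df, nc)

# Power at a "medium" standardized effect of 0.5 for n = 15 vs. n = 40.
for n in (15, 40):
    print(n, round(one_sample_power(0.5, n), 2))
```

At delta = 0.5 standard deviations, n = 15 gives power well under 0.5 while n = 40 gives roughly 0.87, which may be the kind of calculation behind the reviewer's figure.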
I should add that for the t test to be performed the sample must be large enough to estimate the variance or variances.
But an important note is that the test *is* valid, even if the data are not approximately normal, if the sample size is large enough. The justification is a bit roundabout (Slutsky's theorem plus the t distribution approaching the normal), and the justification for using it over a z-test is merely that it is more conservative in smaller samples. But it is an important note that if we suspect non-normality, large samples can save us!
@CliffAB By "valid" I assume you mean "has approximately the right significance level, in the limit as $n\to\infty$". But generally people care about more than the type I error rate (especially when it might only be reasonably close at samples larger than whatever sample size is to hand). Asymptotic relative efficiency may be very poor indeed, so power against small effects in large samples may be very bad compared to alternative choices, even as the type I error rate becomes what it should be.
With all deference to him, he doesn't know what he's talking about. The t-test was designed for working with small samples. There isn't really a minimum (maybe you could say a minimum of 3 for a one-sample t-test, IDK), but you do have a concern regarding adequate power with small samples. You may be interested in reading about the ideas behind compromise power analysis when the possible sample size is highly restricted, as in your case.
As for a reference that proves you can use the t-test with small samples, I don't know of one, and I doubt that one exists. Why would anyone try to prove that? The idea is just silly.
I read an article on the internet about this: "A t-test is necessary for small samples because their distributions are not normal. If the sample is large (n>=30) then statistical theory says that the sample mean is normally distributed and a z-test for a single mean can be used. This is a result of a famous statistical theorem, the Central Limit Theorem." I think this may help me, but the site is a wiki and I doubt he would approve of it. Thanks for helping out.
+1 (to you and Michael). Of interest, you don't even need two observations to make inferences if willing to make a set of assumptions!
The reason for the t-test in small samples is that even when the samples are normal, if the standard deviation is unknown, the common thing to do is to normalize by dividing by a sample estimate of the standard deviation. In large samples that estimate will be close enough to the population standard deviation that the test statistic will be approximately standard normal, but in small samples it will have heavier tails than the normal.
The t distribution with n-1 degrees of freedom is the exact distribution for any sample size n under the null hypothesis, and in small samples it needs to be used in place of the normal, which does not approximate it well. The real issue with sample size, as both gung and I stated, is power. If you want to argue with the referee that 15 is enough, you need to identify how large a difference would count as meaningful (the delta I mentioned) and then show that, for that delta, the power is adequate, say 0.80 or higher.
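The heavier-tails point can be seen in a small Monte Carlo sketch (Python with numpy/scipy; the sample size and seed are arbitrary choices): with normal data and n = 5, using the z critical value rejects far too often, while the t critical value holds the nominal 5% level.

```python
import numpy as np
from scipy import stats

# Simulated type I error for normal data, n = 5, nominal two-sided 5% level.
rng = np.random.default_rng(0)
n, reps = 5, 200_000
x = rng.standard_normal((reps, n))
t_stats = x.mean(axis=1) / (x.std(axis=1, ddof=1) / np.sqrt(n))

z_rate = (np.abs(t_stats) > stats.norm.ppf(0.975)).mean()   # normal critical value
t_rate = (np.abs(t_stats) > stats.t.ppf(0.975, n - 1)).mean()
print(z_rate, t_rate)  # z_rate well above 0.05; t_rate close to 0.05
```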
@AndyW Yes, I have seen a paper called "Estimating a mean with a sample of size 1". If you assume a known variance $\sigma^2$ and that the samples are normal, you can take $X/\sigma$ and base the inference on the $N(0,1)$ distribution. But that would not involve a t distribution.
@CzarinaFrancoise About n>=30, see http://stats.stackexchange.com/questions/2541/is-there-a-reference-that-suggest-using-30-as-a-large-enough-sample-size?rq=1
@gung Student's original (1908!) paper proves you can use the t-test with small samples. (For more about this, please refer to my extended comment to the original question.)
@whuber, not that I've read it, but I thought Student's paper was that you *must* use the t-test (as opposed to the z-test) with small samples, rather than proving that the t-test is still valid w/ small samples.
@gung: Look at the link I provided. It's a very readable paper and section VI is non-mathematical (and very modern in spirit: Student conducts a Monte-Carlo simulation to evaluate his t-test!).
My only objection to this answer is that it doesn't make explicit that in small samples, normality is a requirement, while in large samples it is not. If the response were known to be skewed (from previous studies or even just from examination of the collected data), I would say that the objection to the t-test is well founded, although a request for a citation that the t-test can be used in small samples is basically saying "I wasn't paying attention in stats 101".
As mentioned in existing answers, the main issue with a small sample size is low statistical power. There are various rules of thumb regarding what is acceptable statistical power. Some people say 80% statistical power is reasonable, but ultimately, more is better. There is also generally a trade-off between the cost of getting more participants and the benefit of getting more statistical power.
You can assess the statistical power of a t-test using a simple function in R.
The following code provides the statistical power for a sample size of 15, a one-sample t-test, standard $\alpha=.05$, and three effect sizes of .2, .5, and .8, which have sometimes been referred to as small, medium, and large effects, respectively.
p.2 <- power.t.test(n=15, delta=.2, sd=1, sig.level=.05, type='one.sample')
p.5 <- power.t.test(n=15, delta=.5, sd=1, sig.level=.05, type='one.sample')
p.8 <- power.t.test(n=15, delta=.8, sd=1, sig.level=.05, type='one.sample')
round(rbind(p.2=p.2$power, p.5=p.5$power, p.8=p.8$power), 2)

    [,1]
p.2 0.11
p.5 0.44
p.8 0.82
Thus, we can see that if the population effect size was "small" or "medium", you would have low statistical power (i.e., 11% and 44% respectively). However, if the effect size is large in the population, you would have what some would describe as "reasonable" power (i.e., 82%).
The Quick-r website provides further information on power analysis using R.
The two-sample t-test is valid if the two samples are independent simple random samples from Normal distributions with the same variance and each of the sample sizes is at least two (so that the population variance can be estimated). Considerations of power are irrelevant to the question of the validity of the test. Depending upon the size of the effect that one wishes to detect, a small sample size may be imprudent, but a small sample size does not invalidate the test. Note also that for any sample size, the sampling distribution of the mean is Normal if the parent distribution is Normal. Of course, larger sample sizes are always better because they provide more precise estimates of parameters. The Central Limit Theorem tells us that sample means are more Normally distributed than individual values, but, as pointed out by Casella and Berger, it is of limited usefulness since the rate of approach to Normality must be checked for any particular case. Relying on rules of thumb is unwise. See the results reported in Rand Wilcox's books.
Consider the following from pp. 254-256 of Sauro, J., & Lewis, J. R. (2016). Quantifying the User Experience: Practical Statistics for User Research, 2nd Ed. Cambridge, MA: Morgan-Kaufmann (you can look inside at https://www.amazon.com/Quantifying-User-Experience-Second-Statistics/dp/0128023082/).
DO YOU NEED TO TEST AT LEAST 30 USERS?
ON ONE HAND
Probably most of us who have taken an introductory statistics class (or know someone who took such a class) have heard the rule of thumb that to estimate or compare means, your sample size should be at least 30. According to the central limit theorem, as the sample size increases, the distribution of the mean becomes more and more normal, regardless of the normality of the underlying distribution. Some simulation studies have shown that for a wide variety of distributions (but not all—see Bradley, 1978), the distribution of the mean becomes near normal when n = 30.
Another consideration is that it is slightly simpler to use z-scores rather than t-scores because z-scores do not require the use of degrees of freedom. As shown in Table 9.1 and Fig. 9.2, by the time you have about 30 degrees of freedom the value of t gets pretty close to the value of z. Consequently, there can be a feeling that you don’t have to deal with small samples that require small-sample statistics (Cohen, 1990). ...
ON THE OTHER HAND
When the cost of a sample is expensive, as it typically is in many types of user research (e.g., moderated usability testing), it is important to estimate the needed sample size as accurately as possible, with the understanding that it is an estimate. The likelihood that 30 is exactly the right sample for a given set of circumstances is very low. As shown in our chapters on sample size estimation, a more appropriate approach is to take the formulas for computing the significance levels of a statistical test and, using algebra to solve for n, convert them to sample size estimation formulas. Those formulas then provide specific guidance on what you have to know or estimate for a given situation to estimate the required sample size.
The idea that even with the t-distribution (as opposed to the z-distribution) you need to have a sample size of at least 30 is inconsistent with the history of the development of the distribution. In 1899, William S. Gossett, a recent graduate of New College in Oxford with degrees in chemistry and mathematics, became one of the first scientists to join the Guinness brewery. “Compared with the giants of his day, he published very little, but his contribution is of critical importance. … The nature of the process of brewing, with its variability in temperature and ingredients, means that it is not possible to take large samples over a long run” (Cowles, 1989, p. 108–109).
This meant that Gossett could not use z-scores in his work—they just don’t work well with small samples. After analyzing the deficiencies of the z-distribution for statistical tests with small samples, he worked out the necessary adjustments as a function of degrees of freedom to produce his t tables, published under the pseudonym “Student” due to the policies of Guinness prohibiting publication by employees (Salsburg, 2001). In the work that led to the publication of the tables, Gossett performed an early version of Monte Carlo simulations (Stigler, 1999). He prepared 3000 cards labeled with physical measurements taken on criminals, shuffled them, then dealt them out into 750 groups of size 4—a sample size much smaller than 30.
This controversy is similar to the “five is enough” versus “eight is not enough” argument covered in Chapter 6, but applied to summative rather than formative research. For any research, the number of users to test depends on the purpose of the test and the type of data you plan to collect. The “magic number” 30 has some empirical rationale, but in our opinion, it’s very weak. As you can see from the numerous examples in this book that have sample sizes not equal to 30 (sometimes less, sometimes more), we do not hold this rule of thumb in very high regard. As described in our sample size chapter for summative research, the appropriate sample size for a study depends on the type of distribution, the expected variability of the data, the desired levels of confidence and power, and the minimum size of the effect that you need to be able to reliably detect.
As illustrated in Fig. 9.2, when using the t-distribution with very small samples (e.g., with degrees of freedom less than 5), the very large values of t compensate for small sample sizes with regard to the control of Type I errors (claiming a difference is significant when it really is not). With sample sizes these small, your confidence intervals will be much wider than what you would get with larger samples. But once you’re dealing with more than 5 degrees of freedom, there is very little absolute difference between the value of z and the value of t. From the perspective of the approach of t to z, there is very little gain past 10 degrees of freedom.
It isn’t much more complicated to use the t-distribution than the z-distribution (you just need to be sure to use the right value for the degrees of freedom), and the reason for the development of the t-distribution was to enable the analysis of small samples. This is just one of the less obvious ways in which usability practitioners benefit from the science and practice of beer brewing. Historians of statistics widely regard Gossett’s publication of Student’s t-test as a landmark event (Box, 1984; Cowles, 1989; Stigler, 1999). In a letter to Ronald A. Fisher (one of the fathers of modern statistics) containing an early copy of the t tables, Gossett wrote, “You are probably the only man who will ever use them” (Box, 1978). Gossett got a lot of things right, but he certainly got that wrong.
Box, G. E. P. (1984). The importance of practice in the development of statistics. Technometrics, 26(1), 1-8.
Box, J. F. (1978). Fisher, the life of a scientist. New York, NY: John Wiley.
Bradley, J. V. (1978). Robustness? British Journal of Mathematical and Statistical Psychology, 31, 144-152.
Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45(12), 1304-1312.
Cowles, M. (1989). Statistics in psychology: An historical perspective. Hillsdale, NJ: Lawrence Erlbaum.
Salsburg, D. (2001). The lady tasting tea: How statistics revolutionized science in the twentieth century. New York, NY: W. H. Freeman.
Stigler, S. M. (1999). Statistics on the table: The history of statistical concepts and methods. Cambridge, MA: Harvard University Press.
While it is true that the t-distribution takes the small sample size into account, I would guess that your referee was thinking about the difficulty of establishing that the population is normally distributed when the only information you have is a relatively small sample. This may not be a huge issue with a sample of size 15, since a sample that size is hopefully large enough to show some signs of being vaguely normally distributed. If so, then hopefully the population is somewhere near normal too, and, combined with the Central Limit Theorem, that ought to give you sample means that are well enough behaved.
But I'm dubious about recommendations to use t-tests for tiny samples (such as size four) unless the normality of the population can be established by external information or mechanical understanding. There surely cannot be anywhere near enough information in a sample of size four to give any clue as to the shape of the population distribution.
Czarina may find it interesting to compare the results of her parametric t-test with the results obtained by a bootstrap t-test. The following code for Stata 13.1 mimics a fictitious example concerning a two-sample t-test with unequal variances (parametric t-test: p-value = 0.1493; bootstrap t-test: p-value = 0.1543).
set obs 15
g A = 2*runiform()
g B = 2.5*runiform()
ttest A == B, unpaired unequal
scalar t = r(t)
sum A, meanonly
replace A = A - r(mean) + 1.110498   // 1.110498 = combined mean of A and B
sum B, meanonly
replace B = B - r(mean) + 1.110498
bootstrap r(t), reps(10000) nodots ///
    saving(C:\Users\user\Desktop\Czarina.dta, every(1) double replace) : ///
    ttest A == B, unpaired unequal
use "C:\Users\user\Desktop\Czarina.dta", clear
count if _bs_1 <= -1.4857   // -1.4857 = t-value from the parametric t-test
count if _bs_1 >= 1.4857
display (811+732)/10000      // bootstrap p-value, to compare with the parametric p-value
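For readers without Stata, here is a rough Python analogue of the same idea: shift both samples to the combined mean so the null hypothesis holds, then resample to build the bootstrap distribution of the t statistic (the data and seed are fictitious, as in the Stata example):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
A = 2.0 * rng.random(15)     # fictitious samples, mimicking the Stata example
B = 2.5 * rng.random(15)
t_obs = stats.ttest_ind(A, B, equal_var=False).statistic

# Center both samples at the combined mean so the null hypothesis holds.
grand = np.concatenate([A, B]).mean()
A0 = A - A.mean() + grand
B0 = B - B.mean() + grand

reps = 10_000
boot_t = np.empty(reps)
for i in range(reps):
    a = rng.choice(A0, size=A0.size, replace=True)
    b = rng.choice(B0, size=B0.size, replace=True)
    boot_t[i] = stats.ttest_ind(a, b, equal_var=False).statistic

p_boot = np.mean(np.abs(boot_t) >= abs(t_obs))
print(p_boot)  # bootstrap p-value, to compare with the parametric one
```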
There are two different ways to justify the use of the t-test.
- Your data are normally distributed and you have at least two observations per group
- You have large sample sizes in each group
If either of these cases hold, then the t-test is considered a valid test. So if you are willing to make the assumption that your data is normally distributed (which many researchers who collect small samples are), then you have nothing to worry about.
However, someone might reasonably object that you are relying on this assumption to get your results, especially if your data is known to be skewed. Then the question of sample size required for valid inference is a very reasonable one.
As for how large a sample size is required, unfortunately there's no solid answer to that; the more skewed your data, the bigger the sample size required for the approximation to be reasonable. 15-20 per group is usually considered reasonably large, but as with most rules of thumb, there are counterexamples: for example, in lottery ticket returns (where 1 in, say, 10,000,000 observations is an EXTREME outlier), you would literally need somewhere around 100,000,000 observations before these tests would be appropriate.
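A small simulation sketch (Python with numpy/scipy; the exponential distribution and rep count are arbitrary choices) illustrates how skewness inflates the t-test's type I error at small n but washes out at large n:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
reps = 10_000

def rejection_rate(n):
    # Exponential(1) data have mean 1.0, so testing mu = 1.0 means every
    # rejection is a type I error (nominal two-sided 5% level).
    x = rng.exponential(1.0, size=(reps, n))
    t = (x.mean(axis=1) - 1.0) / (x.std(axis=1, ddof=1) / np.sqrt(n))
    return (np.abs(t) > stats.t.ppf(0.975, n - 1)).mean()

r10, r500 = rejection_rate(10), rejection_rate(500)
print(r10, r500)
```

With n = 10 the error rate comes out well above the nominal 5%, while by n = 500 it is close to nominal; a distribution far more skewed than the exponential would push the required n far higher still.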
As far as assumptions go in the two-sample case: both samples are independent of each other, and each sample consists of iid normal variables, with the two samples having the same mean and a common unknown variance under the null hypothesis.
There is also the Welch t-test, which uses the Satterthwaite approximation for the standard error. This is a two-sample t-test assuming unequal variances.
I concur regarding the usefulness of a bootstrapped t-test. I would also recommend, as a comparison, a look at the Bayesian method offered by Kruschke at http://www.indiana.edu/~kruschke/BEST/BEST.pdf. In general, questions of "How many subjects?" can't be answered unless you have in hand an idea of what a significant effect size would be in terms of the problem being solved. That is, and for instance, if the test were a hypothetical study regarding the efficacy of a new drug, the effect size might be the minimum size needed to justify the new drug compared to the old for the U.S. Food and Drug Administration.
What's odd in this and many other discussions is the wholesale willingness to posit that some data just have some theoretical distribution, like being Gaussian. First, we don't need to posit, we can check, even with small samples. Second, why posit any specific theoretical distribution at all? Why not just take the data as an empirical distribution unto itself?
Sure, in the case of small sample sizes, positing that the data come from some distribution is highly useful for analysis. But, to paraphrase Bradley Efron, in doing so you've just made up an infinite amount of data. Sometimes that can be okay if your problem is appropriate. Sometimes it isn't.
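The "we can check" point can be made concrete with, for example, a Shapiro-Wilk test (a Python sketch with scipy; note that such tests have little power at very small n, which tempers the claim):

```python
import numpy as np
from scipy import stats

# Checking normality directly, even on a small sample; the exponential
# draw is just an example of clearly non-normal data.
rng = np.random.default_rng(7)
sample = rng.exponential(1.0, size=15)

stat, p = stats.shapiro(sample)
print(stat, p)  # W statistic and p-value; at n = 15 the test has limited power
```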