### Rules of thumb for minimum sample size for multiple regression

Within the context of a research proposal in the social sciences, I was asked the following question:

I have always gone by 100 + m (where m is the number of predictors) when determining minimum sample size for multiple regression. Is this appropriate?

I get similar questions a lot, often with different rules of thumb, and I've read many such rules in various textbooks. I sometimes wonder whether a rule's popularity, in terms of citations, reflects how low it sets the standard. However, I'm also aware of the value of good heuristics in simplifying decision making.

### Questions:

- What is the utility of simple rules of thumb for minimum sample sizes within the context of applied researchers designing research studies?
- Would you suggest an alternative rule of thumb for minimum sample size for multiple regression?
- Alternatively, what other strategies would you suggest for determining minimum sample size for multiple regression? In particular, it would be good if weight were given to the degree to which any strategy can readily be applied by a non-statistician.

I'm not a fan of simple formulas for generating minimum sample sizes. At the very least, any formula should consider effect size and the questions of interest. And the difference between samples just on either side of a cut-off is minimal.

### Sample size as optimisation problem

- Bigger samples are better.
- Sample size is often determined by pragmatic considerations.
- Sample size should be seen as one consideration in an optimisation problem where the cost in time, money, effort, and so on of obtaining additional participants is weighed against the benefits of having additional participants.

### A Rough Rule of Thumb

In terms of very rough rules of thumb within the typical context of observational psychological studies involving things like ability tests, attitude scales, personality measures, and so forth, I sometimes think of:

- n=100 as adequate
- n=200 as good
- n=400+ as great

These rules of thumb are grounded in the 95% confidence intervals associated with correlations at these respective levels and the degree of precision that I'd like to theoretically understand the relations of interest. However, it is only a heuristic.
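
To make that grounding concrete, here is a quick sketch of how the 95% confidence interval for a correlation narrows across these tiers, using the Fisher z-transform (a Python sketch; the example $r = .30$ and the helper name are my own choices):

```python
# Approximate 95% CI for a sample correlation r via the Fisher z-transform.
# Illustrative sketch only; r = .30 is an arbitrary example value.
import math
from statistics import NormalDist

def r_ci(r, n, conf=0.95):
    """CI for a correlation r at sample size n, on the Fisher z scale."""
    z = math.atanh(r)                  # Fisher z-transform of r
    se = 1 / math.sqrt(n - 3)          # standard error on the z scale
    q = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return math.tanh(z - q * se), math.tanh(z + q * se)

for n in (100, 200, 400):
    lo, hi = r_ci(0.30, n)
    print(n, round(hi - lo, 3))        # width shrinks roughly as 1/sqrt(n)
```

The widths roughly halve going from $n=100$ to $n=400$, which is the precision gain these tiers are meant to capture.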

### G*Power 3

- I typically use G*Power 3 to calculate power under various assumptions (see my earlier post).
- See the tutorial on the G*Power 3 site specific to multiple regression.
- The Power Primer (Cohen, 1992) is also a useful tool for applied researchers.
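
If you prefer a scriptable alternative to the G*Power GUI, the power calculation for the overall $R^2$ F-test can be sketched by hand. This is a sketch, not G*Power itself: it assumes the noncentrality convention $\lambda = f^2 N$ that G*Power 3 uses for this test, with Cohen's $f^2 = R^2/(1-R^2)$; the helper names are mine, and SciPy is assumed to be available.

```python
# Power of the overall R^2 F-test in multiple regression, and the smallest
# n reaching a target power (a sketch mirroring G*Power 3's fixed-model test).
from scipy.stats import f as f_dist, ncf

def power_f2(u, v, f2, alpha=0.05):
    """Power of the overall F-test: u = number of predictors, v = error df."""
    fcrit = f_dist.ppf(1 - alpha, u, v)
    lam = f2 * (u + v + 1)             # noncentrality with N = u + v + 1
    return ncf.sf(fcrit, u, v, lam)    # P(F > fcrit) under the alternative

def min_n(m, f2, power=0.80, alpha=0.05):
    """Smallest total n achieving the target power for m predictors."""
    v = m + 2
    while power_f2(m, v, f2, alpha) < power:
        v += 1
    return v + m + 1

print(min_n(5, 0.15))   # Cohen's "medium" f2 with 5 predictors
```

For 5 predictors, a medium effect, $\alpha = .05$, and power $.80$, this lands near the familiar $N \approx 90$ figure from Cohen's tables.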

### Multiple Regression tests multiple hypotheses

- Any power analysis question requires consideration of effect sizes.
- Power analysis for multiple regression is made more complicated by the fact that there are multiple effects: the overall $R^2$ and one for each individual coefficient. Furthermore, most studies include more than one multiple regression. For me, this is further reason to rely on general heuristics and to think about the minimal effect size that you want to detect.

In relation to multiple regression, I'll often think more in terms of the degree of precision in estimating the underlying correlation matrix.

### Accuracy in Parameter Estimation

I also like Ken Kelley and colleagues' discussion of Accuracy in Parameter Estimation.

- See Ken Kelley's website for publications
- As mentioned by @Dmitrij, Kelley and Maxwell (2003) (pdf) is a useful article.
- Ken Kelley developed the `MBESS` package in R to perform analyses relating sample size to precision in parameter estimation.

I prefer not to think of this as a power issue, but rather to ask "how large should $n$ be so that the apparent $R^2$ can be trusted?" One way to approach that is to consider the ratio or difference between $R^2$ and $R_{adj}^{2}$, the latter being the adjusted $R^2$ given by $1 - (1 - R^{2})\frac{n-1}{n-p-1}$, which forms a less biased estimate of the "true" $R^2$.
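
A minimal sketch of that shrinkage (the $p = 10$, $R^2 = .3$ scenario is my own illustration, written in Python rather than R):

```python
# Shrinkage from R^2 to adjusted R^2 as a function of n and p,
# using the formula R2_adj = 1 - (1 - R2)(n-1)/(n-p-1).
def adj_r2(r2, n, p):
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

for n in (30, 100, 500):
    print(n, round(adj_r2(0.3, n, 10), 3))
# At n = 30 the adjusted value even goes negative; by n = 500
# the adjustment is nearly negligible.
```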

Some R code can be used to solve for the multiple of $p$ that $n-1$ should be, such that $R_{adj}^{2}$ is only a factor $k$ smaller than $R^2$, or is smaller by only $k$:

```r
require(Hmisc)

dop <- function(k, type) {
  z <- list()
  R2 <- seq(.01, .99, by=.01)
  for(a in k)
    z[[as.character(a)]] <-
      list(R2 = R2,
           pfact = if(type == 'relative') ((1/R2) - a) / (1 - a)
                   else (1 - R2 + a) / a)
  labcurve(z, pl=TRUE, ylim=c(0,100), adj=0, offset=3,
           xlab=expression(R^2), ylab=expression(paste('Multiple of ', p)))
}

par(mfrow=c(1,2))
dop(c(.9, .95, .975), 'relative')
dop(c(.075, .05, .04, .025, .02, .01), 'absolute')
```

Legend: Degradation in $R^{2}$ that achieves a relative drop from $R^{2}$ to $R^{2}_{adj}$ by the indicated relative factor (left panel, 3 factors) or absolute difference (right panel, 6 decrements).

If anyone has seen this already in print please let me know.

@FrankHarrell: look here; the author seems to be using the plots (260-263) in much the same way as the ones in your post above.

Thanks for the reference. @gung, that's a good question. One (weak) answer is that for some types of models we don't have an $R^{2}_{adj}$, and we have no adjusted index at all if any variable selection has been done. But the main idea is that if $R^2$ is unbiased, other indexes of predictive discrimination, such as rank-correlation measures, are likely to be nearly unbiased as well, owing to the adequacy of the sample size and minimal overfitting.

(+1) for what is, in my opinion, a crucial question.

In macro-econometrics you usually have much smaller sample sizes than in micro, financial, or sociological studies. A researcher feels reasonably comfortable when one can at least provide feasible estimates. My personal minimal rule of thumb is $4\cdot m$ ($4$ degrees of freedom per estimated parameter). In other applied fields you are usually luckier with data (if it is not too expensive, just collect more data points), and you may ask what the optimal sample size is, not just the minimum. The latter issue arises because a larger amount of low-quality (noisy) data is not better than a smaller sample of high-quality data.

Most sample-size recommendations are linked to the power of the tests for the hypotheses you intend to test after fitting the multiple regression model.

There is a nice calculator that could be useful for multiple regression models, with the formulas explained behind the scenes. I think such an a priori calculator could easily be applied by a non-statistician.

The K. Kelley and S. E. Maxwell article may be useful for answering the other questions, but I need more time to study the problem first.

Your rule of thumb is not particularly good if $m$ is very large. Take $m=500$: your rule says it is OK to fit $500$ variables with only $600$ observations. I hardly think so!

For multiple regression, you have some theory to suggest a minimum sample size. If you are going to be using ordinary least squares, then one of the assumptions you require is that the "true residuals" be independent. Now when you fit a least squares model to $m$ variables, you are imposing $m+1$ linear constraints on your empirical residuals (given by the least squares or "normal" equations). This implies that the empirical residuals are not independent - once we know $n-m-1$ of them, the remaining $m+1$ can be deduced, where $n$ is the sample size. So we have a violation of this assumption. Now the order of the dependence is $O\left(\frac{m+1}{n}\right)$. Hence if you choose $n=k(m+1)$ for some number $k$, then the order is given by $O\left(\frac{1}{k}\right)$. So by choosing $k$, you are choosing how much dependence you are willing to tolerate. I choose $k$ in much the same way you do for applying the "central limit theorem" - $10-20$ is good, and we have the "stats counting" rule $30\equiv\infty$ (i.e. the statistician's counting system is $1,2,\dots,26,27,28,29,\infty$).
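
The $n = k(m+1)$ rule above is trivial to compute; a sketch (in Python, with a helper name of my own), where the order of residual dependence is $(m+1)/n = 1/k$:

```python
# The n = k(m+1) rule: choosing k fixes the order of dependence among
# the empirical residuals at O(1/k).
def min_n_dependence(m, k=15):
    """Minimum n for m predictors, tolerating dependence of order 1/k."""
    return k * (m + 1)

print(min_n_dependence(5))         # k = 15 -> n = 90
print(min_n_dependence(5, k=30))   # "30 is infinity" -> n = 180
```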

You say 10 to 20 is good, but would this also depend on the size of the error variance (possibly relative to other things)? For example, suppose there was just one predictor variable. If it was known that the error variance was really tiny, then it seems that 3 or 4 data points might be enough to reliably estimate the slope and intercept. On the other hand, if it was known that the error variance was huge, then even 50 data points might be inadequate. Am I misunderstanding something?

Could you please provide any reference for your suggested equation `n=k(m+1)`?

In Psychology:

Green (1991) indicates that $N > 50 + 8m$ (where m is the number of independent variables) is needed for testing multiple correlation and $N > 104 + m$ for testing individual predictors.

Other rules that can be used are...

Harris (1985) says that the number of participants should exceed the number of predictors by at least $50$.

Van Voorhis & Morgan (2007) (pdf) suggest that, when using 6 or more predictors, the absolute minimum number of participants should be 10, though it is better to go for 30 participants per variable.
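
For comparison, the rules quoted above can be put side by side (a Python sketch; the function names are just my labels for the sources):

```python
# Three psychology rules of thumb for minimum N in multiple regression,
# as functions of the number of predictors m.
def green_overall(m):     # Green (1991): testing the multiple correlation
    return 50 + 8 * m

def green_individual(m):  # Green (1991): testing individual predictors
    return 104 + m

def harris(m):            # Harris (1985): predictors plus at least 50
    return m + 50

for m in (3, 6, 10):
    print(m, green_overall(m), green_individual(m), harris(m))
```

Note how the rules diverge: for small $m$, Green's individual-predictor rule is the most demanding, while Harris's is the most lenient.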

Your first 'rule' doesn't have m in it.

His first rule of thumb is written as `N = 50 + 8m`, though it has been questioned whether the constant term of 50 is really needed.

I have added a new and more complex rule of thumb that takes effect size into account. This was also presented by Green (1991).

What are the full citations for the references Green (1991) and Harris (1985)?

I agree that power calculators are useful, especially to see the effect of different factors on the power. In that sense, calculators that include more input information are much better. For linear regression, I like the regression calculator here which includes factors such as error in Xs, correlation between Xs, and more.

I have found this rather recent paper (2015) assessing that just **2** observations per variable are enough, as long as our interest is in the accuracy of the estimated regression coefficients and standard errors (and in the empirical coverage of the resulting confidence intervals) and we use the **adjusted** $R^2$ (pdf).

Of course, as the paper also acknowledges, (relative) unbiasedness does not necessarily imply having enough statistical power. However, power and sample-size calculations are typically made by specifying the expected effects; in the case of multiple regression, this means a hypothesis about the values of the regression coefficients, or about the correlation matrix between the regressors and the outcome, must be made. In practice, it depends on the strength of the correlation of the regressors with the outcome and among themselves (obviously, the stronger the correlation with the outcome the better, while things get worse with multicollinearity). For example, in the extreme case of two perfectly collinear variables, you cannot perform the regression regardless of the number of observations, even with only 2 covariates.
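
The perfect-collinearity point is easy to demonstrate: the design matrix is rank-deficient no matter how large $n$ is, so the coefficients are not identified (a sketch with simulated data; all numbers are arbitrary):

```python
# Perfect collinearity defeats regression regardless of sample size:
# the design matrix loses a rank, so OLS cannot separate x1 from x2.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000                        # a large sample does not help
x1 = rng.standard_normal(n)
x2 = 2 * x1                       # perfectly collinear with x1
X = np.column_stack([np.ones(n), x1, x2])
print(np.linalg.matrix_rank(X))   # 2, not 3: one column is redundant
```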

License under CC-BY-SA with attribution

Content dated before 6/26/2020 9:53 AM

gung - Reinstate Monica 7 years ago

+1. I suspect I'm missing something rather fundamental & obvious, but why should we use the ability of $\hat R^2$ to estimate $R^2$ as the criterion? We already have access to $R^2_{adj}$, even if $N$ is low. Is there a way to explain why this is the right way to think about the minimally adequate $N$ outside of the fact that it makes $\hat R^2$ a better estimate of $R^2$?