What is the difference between N and N-1 in calculating population variance?
I don't get why there is N-1 when calculating the population variance. When do we use N, and when do we use N-1?
It says that when the population is very big there is no difference between N and N-1, but it does not explain why N-1 is there in the first place.
Edit: Please don't confuse this with n-1, which is used in estimating.
Edit2: I'm not talking about population estimation.
You can find an answer here: http://stats.stackexchange.com/questions/16008/what-does-unbiasedness-mean/16009#16009. Basically, you should use N-1 when you *estimate* a variance, and N when you *compute* it exactly.
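To make that estimate-vs-compute distinction concrete, here is a small sketch of my own in Python (not part of the linked answer); the standard library's `statistics.pvariance` divides by N, while `statistics.variance` divides by n-1:

```python
import statistics

# A complete (finite) population of 5 values: divide by N to COMPUTE
# the population variance exactly.
population = [2, 4, 4, 4, 6]
mu = statistics.mean(population)            # true mean = 4
pop_var = statistics.pvariance(population)  # divides by N = 5

# Same thing by hand: sum of squared deviations / N.
assert pop_var == sum((x - mu) ** 2 for x in population) / len(population)
print(pop_var)   # 1.6

# A sample drawn from some larger population: divide by n-1 to
# ESTIMATE the (unknown) population variance.
sample = [2, 4, 6]
samp_var = statistics.variance(sample)      # divides by n-1 = 2
print(samp_var)  # 4.0
```

If you truly have the whole population, `pvariance` (divisor N) is the exact answer; `variance` (divisor n-1) only comes into play when the data are a sample.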
If you want your estimator to be unbiased, then you should use n-1. Note that when n is large, this hardly matters.
None of the answers below are written in terms of finite population inference. The word *finite* is absolutely crucial here; that's what Kish's book is about (and whoever was saying "The book is wrong" simply doesn't know enough about finite population surveys and samples). The divisor $N-1$ instead of $N$ just makes computations nicer and obviates the need to haul around factors like $1-1/N$. A full answer to this question would have to introduce the sampling inference in which the sample indicators are random, and the values of the observed characteristics $y$ are FIXED. Non-random. Set in stone.
You could get a better feel for this question by playing with Octave or MATLAB. Example:

```
x = rand(10,1);
var1 = sum((x - mean(x)).^2) / (length(x));
var2 = sum((x - mean(x)).^2) / (length(x)-1);
```

You will see a significant difference between `var1` and `var2`, since your sample size is very small. Repeat it with a much larger sample:

```
x = rand(1e6,1);
var1 = sum((x - mean(x)).^2) / (length(x));
var2 = sum((x - mean(x)).^2) / (length(x)-1);
```

and you will see that `var1` $\approx$ `var2`.
This doesn't really add to the other answers. That different divisors give different answers, or even that the difference diminishes with N, is not at issue. The question is when and why to use either divisor.
Watch this video, it precisely answers your question. https://www.youtube.com/watch?v=xslIhnquFoE
@SahilChaudhary, your video talks about n and n-1. My question has nothing to do with n and n-1. My question is about N and N-1. You can see that n and N are different, right? I have commented on my question!
The book is "Survey Sampling" by Leslie Kish, from John Wiley & Sons: https://archive.org/details/in.ernet.dli.2015.214343
I would just like to mention to the other readers that this issue is called "Bessel's correction". You can check it out on Wikipedia: https://en.wikipedia.org/wiki/Bessel%27s_correction#Proof_of_correctness_%E2%80%93_Alternate_1
$N$ is the population size and $n$ is the sample size. The question asks why the population variance is the mean squared deviation from the mean rather than $(N-1)/N = 1-(1/N)$ times it. For that matter, why stop there? Why not multiply the mean squared deviation by $1-2/N$, or $1-17/N$, or $\exp(-1/N)$, for instance?
There actually is a good reason not to. Any of the figures I just mentioned would serve just fine as a way to quantify a "typical spread" within the population. However, without prior knowledge of the population size, it would be impossible to use a random sample to find an unbiased estimator of such a figure. We know that the sample variance, which multiplies the mean squared deviation from the sample mean by $n/(n-1)$, is an unbiased estimator of the usual population variance when sampling with replacement. (There is no problem with making this correction, because we know $n$!) The sample variance would therefore be a biased estimator of any multiple of the population variance where that multiple, such as $1-1/N$, is not exactly known beforehand.
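That unbiasedness claim is easy to check numerically. Below is a small simulation of my own (a sketch, not part of the original answer): drawing many samples with replacement from a fixed finite population, the n-1 divisor reproduces the population variance on average, while the n divisor systematically comes out too low.

```python
import random
import statistics

random.seed(1)

# A fixed finite population with known variance.
population = [1, 3, 5, 7, 9, 11]
N = len(population)
mu = statistics.mean(population)
pop_var = sum((x - mu) ** 2 for x in population) / N  # divisor N: 70/6

n = 4              # sample size
trials = 200_000
sum_nm1 = 0.0      # running total of the n-1 estimator
sum_n = 0.0        # running total of the n-divisor version

for _ in range(trials):
    # Sampling WITH replacement, as in the answer above.
    sample = [random.choice(population) for _ in range(n)]
    xbar = sum(sample) / n
    ss = sum((x - xbar) ** 2 for x in sample)
    sum_nm1 += ss / (n - 1)
    sum_n += ss / n

print(pop_var)           # 11.666...
print(sum_nm1 / trials)  # close to pop_var: unbiased
print(sum_n / trials)    # close to (n-1)/n * pop_var: biased low
```

With divisor n, the average estimate hovers near (n-1)/n times the true variance, which is exactly the bias the n-1 correction removes.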
This problem of some unknown amount of bias would propagate to all statistical tests that use the sample variance, including t-tests and F-tests. In effect, dividing by anything other than $N$ in the population variance formula would require us to change all statistical tabulations of t-statistics and F-statistics (and many other tables as well), but the adjustment would depend on the population size. Nobody wants to have to make tables for every possible $N$! Especially when it's not necessary.
As a practical matter, when $N$ is small enough that using $N-1$ instead of $N$ in formulas makes a difference, you usually do know the population size (or can guess it accurately) and you would likely resort to much more substantial small-population corrections when working with random samples (without replacement) from the population. In all other cases, who cares? The difference doesn't matter. For these reasons, guided by pedagogical considerations (namely, of focusing on details that matter and glossing over details that don't), some excellent introductory statistics texts don't even bother to teach the difference: they simply provide a single variance formula (divide by $N$ or $n$ as the case may be).
Instead of going into maths I'll try to put it in plain words. If you have the whole population at your disposal, then its variance (population variance) is computed with the denominator N. Likewise, if you have only a sample and want to compute this sample's variance, you use the denominator N (n of the sample, in this case). In both cases, note, you don't estimate anything: the mean that you measured is the true mean and the variance you computed from that mean is the true variance.
Now, suppose you have only a sample and want to infer about the unknown mean and variance in the population. In other words, you want estimates. You take your sample mean as the estimate of the population mean (because your sample is representative), OK. To obtain an estimate of the population variance, you have to pretend that this mean is really the population mean, and therefore that it no longer depends on your sample once you have computed it. To "show" that you now take it as fixed, you reserve one (any) observation from your sample to "support" the mean's value: whatever your sample might have been, one reserved observation could always bring the mean to the value that you've got and which you believe is insensitive to sampling contingencies. One reserved observation is the "-1", and so you have N-1 in computing the variance estimate.
Imagine that you somehow know the true population mean, but want to estimate the variance from the sample. Then you will substitute that true mean into the formula for variance and apply the denominator N: no "-1" is needed here, since you know the true mean; you didn't estimate it from this same sample.
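This known-true-mean case can also be checked by simulation. Here is a hedged sketch of my own (not part of the answer above), assuming a normal population with known mean and variance: plugging the true mean into the formula and dividing by the plain sample size is already unbiased, with no "-1" needed.

```python
import random

random.seed(2)

mu, sigma = 10.0, 3.0    # true mean and standard deviation (assumed known)
n, trials = 5, 100_000

total = 0.0
for _ in range(trials):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    # Deviations from the KNOWN true mean: divide by n, no correction.
    total += sum((x - mu) ** 2 for x in sample) / n

print(total / trials)    # close to sigma**2 = 9.0
```

Because no parameter was estimated from the sample, no degree of freedom is spent, and the average of this estimator matches the true variance.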
But my question has nothing to do with estimation. It is about computing population variance; with N and N-1. I'm not talking about n and n-1.
@ilhan, in my reply, I used `N` for both N and n. `N` is the size of the totality at hand, either population or sample. To compute _population_ variance, you _must_ have the population at your disposal. If you have only a sample, you can either compute this sample's variance or compute an _estimate_ of the population variance. There is no other way round.
I have complete information about my population; all the values are known. I'm not interested in estimation.
This is not convincing at all. A population is a population, and a sample is a sample, that's all. You cannot substitute one for the other, even for simplicity.
@ilhan - I couldn't comment directly on your comment to ttnphns's post, but here is an explanation of what you see in the book and how you should read it. The symbol 'S', when used to denote variance, always refers to the sample variance. The Greek letter sigma is used to refer to the population variance. That is why you see the book mention S = N * sigma / (N - 1).
Generally, when one has only a fraction of the population, i.e. a sample, one should divide by n-1. There is a good reason to do so: we know that the sample variance, which multiplies the mean squared deviation from the sample mean by n/(n−1), is an unbiased estimator of the population variance.
You can find a proof that the estimator of the sample variance is unbiased here: https://economictheoryblog.com/2012/06/28/latexlatexs2/
Further, if one were to apply the estimator of the population variance, that is, the version of the variance estimator that divides by n, to a sample instead of the population, the obtained estimate would be biased.
There has, in the past, been an argument that you should use N for a non-inferential variance, but I wouldn't recommend that anymore. You should always use N-1. As sample size decreases, N-1 is a pretty good correction for the fact that the sample variance gets lower (you're just more likely to sample near the peak of the distribution---see figure). If the sample size is really big, then it doesn't matter any meaningful amount.
An alternative explanation is that the population is a theoretical construct that's impossible to achieve. Therefore, always use N-1 because whatever you're doing you're, at best, estimating the population variance.
Also, you're going to be seeing N-1 for variance estimates from here on in. You'll likely not ever encounter this issue... except on a test when your teacher might ask you to make a distinction between an inferential and non-inferential variance measure. In that case don't use whuber's answer or mine, refer to ttnphns's answer.
Note, in this figure the variance should be close to 1. Look how much it varies with sample size when you use N to estimate the variance. (This is the "bias" referred to elsewhere.)
Please, tell me why N "not recommended anymore" with true population at hand? Population is not always a theoretical construct. Sometimes your sample is a bona fide population for you.
@John, can you please remove everything related to "estimation", "estimating", and "sample"? The question is about the population itself. No estimation, no sampling, no samples. And please use "n" when referring to sample size; "N" is used for population size. Correct me if I'm wrong.
ilhan, N can be used for your sample, or it can be used for the population size, if one exists. In most cases the distinction between big N and small n is dependent upon the topic. For example, n might be the number of cases in each condition in an experiment while N might be the number for the experiment. They're both samples. There is no global rule.
ttnphns, it depends on what you mean by population. I would argue that if your whole population is so small that N-1 matters, then it's questionable whether calculating a mean squared deviation is remotely useful at all. Show all the values, their shape and range. Furthermore, the whole old argument that you actually have N degrees of freedom if you're not making an inference is questionable. You lost one when you calculated the mean, which you needed in order to calculate the variance.
@John, if you calculate the mean within a population, you just _state_ the fact about the parameter, so you spend no degrees of freedom. If you calculate it in a sample and want to _infer_ about the population, then you do spend one. Also, I can have a population with N=1. With denominator N-1, it appears that such a parameter as variance does not _exist_ for it. That is nonsense.
I'm not saying there isn't an argument. I'm just saying it's questionable. Even the student's textbook basically says, forget I said anything about using N. It's probably the best advice at this point. And you're right, if the population N=1 there is NO variance parameter that exists for it because it has no variability. Your example proves against your point, not for it.
_Zero_ variability and _inapplicability_ of concept of variability are different things, John. In fact, with N=1 population does have variability, it is zero.
@ilhan Please consider updating your question (as you did) and pointing to the updated version rather than leaving such non-constructive comments. Everything is debatable, especially when the question itself lacks some context. Here it seems that the problem stems from defining what a population really is.
The population variance is the sum of the squared deviations of all of the values in the population divided by the number of values in the population. When we are estimating the variance of a population from a sample, though, we encounter the problem that the deviations of the sample values from the mean of the sample are, on average, a little less than the deviations of those sample values from the (unknown) true population mean. That results in a variance calculated from the sample being a little less than the true population variance. Using an n-1 divisor instead of n corrects for that underestimation.
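That "a little less" is easy to see directly. The following is my own sketch (assuming a standard normal population for illustration): for each simulated sample, it computes the mean squared deviation about the sample mean and about the true mean, and compares their averages.

```python
import random

random.seed(3)

mu, sigma = 0.0, 1.0     # true mean and sd of the assumed population
n, trials = 4, 200_000

about_sample_mean = 0.0  # running total of mean squared deviation about x-bar
about_true_mean = 0.0    # running total of mean squared deviation about mu
for _ in range(trials):
    s = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(s) / n
    about_sample_mean += sum((x - xbar) ** 2 for x in s) / n
    about_true_mean += sum((x - mu) ** 2 for x in s) / n

print(about_true_mean / trials)    # close to sigma**2 = 1.0
print(about_sample_mean / trials)  # close to (n-1)/n = 0.75: too small
```

The deviations about the sample mean average out to (n-1)/n of the true variance, which is precisely the shortfall the n-1 divisor corrects.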
@Bunnenburg, if you got an answer to your question, please make clear to me what you got. It is a big confusion to me as well.
To compensate for that _little less_ variance we get, why can't one use n-2, n-3, etc.? Why n-1 in particular? Why not a constant?