What is the difference between a population and a sample?
What is the difference between a population and a sample? What common variables and statistics are used for each one, and how do those relate to each other?
Obligatory reading: Krieger, N. (2012). Who and what is a “population”? Historical debates, current controversies, and implications for understanding “population health” and rectifying health inequities. *The Milbank Quarterly*, 90(4):634–681.
The population is the set of entities under study. For example, the mean height of men. This is a hypothetical population because it includes all men that have lived, are alive and will live in the future. I like this example because it drives home the point that we, as analysts, choose the population that we wish to study. Typically it is impossible to survey/measure the entire population because not all members are observable (e.g. men who will exist in the future). If it is possible to enumerate the entire population it is often costly to do so and would take a great deal of time. In the example above we have a population "men" and a parameter of interest, their height.
Instead, we could take a subset of this population called a sample and use this sample to draw inferences about the population under study, given some conditions. Thus we could measure the mean height of men in a sample of the population which we call a statistic and use this to draw inferences about the parameter of interest in the population. It is an inference because there will be some uncertainty and inaccuracy involved in drawing conclusions about the population based upon a sample. This should be obvious - we have fewer members in our sample than our population therefore we have lost some information.
There are many ways to select a sample and the study of this is called sampling theory. A commonly used method is called Simple Random Sampling (SRS). In SRS each member of the population has an equal probability of being included in the sample, hence the term "random". There are many other sampling methods e.g. stratified sampling, cluster sampling, etc which all have their advantages and disadvantages.
It is important to remember that the sample we draw from the population is only one from a large number of potential samples. If ten researchers were all studying the same population, drawing their own samples then they may obtain different answers. Returning to our earlier example, each of the ten researchers may come up with a different mean height of men i.e. the statistic in question (mean height) varies of sample to sample -- it has a distribution called a sampling distribution. We can use this distribution to understand the uncertainty in our estimate of the population parameter.
The sampling distribution of the sample mean is known to be a normal distribution with a standard deviation equal to the sample standard deviation divided by the sample size. Because this could easily be confused with the standard deviation of the sample it more common to call the standard deviation of the sampling distribution the standard error.
Isn't it a little pointless use "all men ever" as a population? I mean, there's not even a consensus as to how old *homo sapiens* is, or whether *homo neanderthalensis* were a separate species, let alone whether males of the stone-tool using *homo habilis* count as "men". Presumably the same problems will face us in the future, too.