Detecting outliers using standard deviations
Following my question here, I am wondering whether there are strong views for or against using the standard deviation to detect outliers (e.g. any data point more than 2 standard deviations from the mean is an outlier).
I know this depends on the context of the study: for instance, a data point of 48 kg would certainly be an outlier in a study of babies' weights but not in a study of adults' weights.
Outliers can result from a number of factors, such as data entry mistakes. In my case, these processes are robust.
I guess the question I am asking is: Is using standard deviation a sound method for detecting outliers?
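For concreteness, here is a minimal sketch of the rule described in the question; the function name, threshold parameter, and toy weight data are illustrative, not from the thread:

```python
import statistics

def two_sd_outliers(data, k=2.0):
    """Flag points more than k standard deviations from the mean.

    A minimal sketch of the rule under discussion; k = 2 is the
    threshold proposed in the question.
    """
    mean = statistics.fmean(data)
    sd = statistics.stdev(data)
    return [x for x in data if abs(x - mean) > k * sd]

# Hypothetical baby weights in kg, with one impossible entry
weights = [3.1, 3.4, 2.9, 3.6, 3.2, 48.0]
print(two_sd_outliers(weights))  # -> [48.0]
```

Note that the mean and SD used for the cutoff are themselves computed from data that include the suspect point, which is part of what the answers below take issue with.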
You say, "In my case these processes are robust". Meaning what? That you're sure you don't have data entry mistakes?
There are so many good answers here that I am unsure which to accept! Any guidance on this would be helpful.
In general, select the one that you feel answers your question most directly and clearly, and if it's too hard to tell, I'd go with the one with the highest votes. Even if it's a bit painful to decide which one, it's important to reward someone who took the time to answer.
P.S. Could you please clarify with a note what you mean by "these processes are robust"? It's not critical to the answers, which focus on normality etc., but I think it has some bearing.
Outliers are not model-free. A point that looks unusual under one model may be perfectly ordinary under another. The first question should be "why are you trying to detect outliers?" (rather than doing something else, such as using methods robust to them), and the second, "what makes an observation an outlier in your particular application?"
To add to Glen_b's comment, which I agree with: the issue becomes more complex with multivariate data. My article "The influence function and its application to data validation" discusses this along with real-world applications. It appeared in the American Journal of Mathematical and Management Sciences in 1982.
Some outliers are clearly impossible. You mention 48 kg for baby weight. This is clearly an error. That's not a statistical issue, it's a substantive one. There are no 48 kg human babies. Any statistical method will identify such a point.
Personally, rather than rely on any test (even appropriate ones, as recommended by @Michael), I would graph the data. Showing that certain data values are unlikely under some hypothesized distribution does not mean those values are wrong, so values shouldn't be automatically deleted just because they are extreme.
In addition, the rule you propose (flagging anything more than 2 SD from the mean) is an old one from the days before computers made things easy. Under a normal distribution, about 4.55% of values fall more than 2 SD from the mean, so if N is 100,000 you should expect roughly 4,500 flagged values even if the distribution is perfectly normal.
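That arithmetic can be checked directly from the normal tail probability, P(|Z| > 2) = erfc(2/√2); the N = 100,000 figure is the one used above:

```python
import math

def normal_tail_two_sided(k):
    """P(|Z| > k) for a standard normal Z, via the complementary error function."""
    return math.erfc(k / math.sqrt(2))

n = 100_000
expected = n * normal_tail_two_sided(2)
print(round(expected))  # -> 4550 points beyond 2 SD under perfect normality
```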
But what if the distribution is wrong? Suppose that, in the population, the variable in question is not normally distributed but has heavier tails than that?
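To make the heavy-tail concern concrete, one can compare the normal against a Laplace (double-exponential) distribution, chosen here purely as an illustrative heavy-tailed example, not something the thread specifies. Its tail probability beyond 2 of its own SDs is available in closed form:

```python
import math

# Fraction of the population lying beyond 2 SD of its own mean:
# standard normal: P(|Z| > 2) = erfc(2 / sqrt(2))
p_normal = math.erfc(2 / math.sqrt(2))

# Laplace with scale b has SD = sqrt(2) * b, so
# P(|X| > 2 * SD) = exp(-2 * sqrt(2) * b / b) = exp(-2 * sqrt(2))
p_laplace = math.exp(-2 * math.sqrt(2))

print(f"{p_normal:.4f} {p_laplace:.4f}")  # -> 0.0455 0.0591
```

So under this heavier-tailed model, noticeably more perfectly legitimate observations land beyond the 2-SD cutoff than the normal model predicts.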
I don't know. But one could look up the record. According to answers.com (from a quick Google search) it was 23.12 pounds, born to two parents with gigantism. If I were doing the research, I'd check further.
What if one cannot visually inspect the data (e.g. as part of an automated process)?
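One option for a fully automated pipeline, sketched here as an assumption rather than something the thread prescribes, is a median/MAD rule: unlike the mean and SD, the median and median absolute deviation are not dragged around by the very points being tested. The threshold k = 3 and the normal-consistency constant 1.4826 are conventional choices, not from the thread:

```python
import statistics

def mad_outliers(data, k=3.0, c=1.4826):
    """Flag points whose distance from the median exceeds k robust SDs.

    c = 1.4826 rescales the MAD so it estimates the SD under normality;
    k = 3 is a common (but arbitrary) cutoff.
    """
    med = statistics.median(data)
    mad = statistics.median(abs(x - med) for x in data)
    return [x for x in data if abs(x - med) > k * c * mad]

# Same hypothetical baby weights as in the question's example
weights = [3.1, 3.4, 2.9, 3.6, 3.2, 48.0]
print(mad_outliers(weights))  # -> [48.0]
```

This still only flags points for review; as the answers above stress, whether a flagged point is an error remains a substantive question, not a statistical one.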