How should outliers be dealt with in linear regression analysis?
Oftentimes a statistical analyst is handed a dataset and asked to fit a model using a technique such as linear regression. Very frequently the dataset is accompanied by a disclaimer similar to "Oh yeah, we messed up collecting some of these data points -- do what you can".
This situation leads to regression fits that are heavily impacted by the presence of outliers that may be erroneous data. Given the following:
It is dangerous from both a scientific and moral standpoint to throw out data for no reason other than it "makes the fit look bad".
In real life, the people who collected the data are frequently not available to answer questions such as "when generating this data set, which of the points did you mess up, exactly?"
What statistical tests or rules of thumb can be used as a basis for excluding outliers in linear regression analysis?
Are there any special considerations for multilinear regression?
Rather than exclude outliers, you can use a robust method of regression. In R, for example, the rlm() function from the MASS package can be used instead of the lm() function. The method of estimation can be tuned to be more or less robust to outliers.
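For readers outside R, the idea behind this kind of M-estimation can be sketched in plain Python: iteratively reweighted least squares with Huber weights. This is my own illustration of the general technique for a single predictor, not MASS's actual implementation (the function names are made up):

```python
import statistics

def huber_weight(r, k=1.345):
    """Huber weight: 1 for scaled residuals inside +/-k, downweighted outside."""
    return 1.0 if abs(r) <= k else k / abs(r)

def robust_line_fit(x, y, k=1.345, iters=50):
    """Iteratively reweighted least squares for y ~ a + b*x with Huber weights."""
    n = len(x)
    w = [1.0] * n
    a = b = 0.0
    for _ in range(iters):
        # weighted least-squares step
        sw = sum(w)
        xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
        ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
        b = sxy / sxx
        a = ybar - b * xbar
        resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
        # robust scale estimate: MAD / 0.6745 (approx. consistent under normal noise)
        med = statistics.median(resid)
        s = statistics.median(abs(r - med) for r in resid) / 0.6745
        if s == 0:
            break
        w = [huber_weight(r / s, k) for r in resid]
    return a, b
```

Points with large scaled residuals get weight k/|r| < 1, so a single wild point pulls the line far less than under ordinary least squares, without being discarded outright.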
When using the rlm() function, I see that the coefficients and their t-tests are produced. But how can I get the F-test and R-squared values from it? I suppose I cannot simply carry over the F-test and R-squared values from the plain lm() summary results, if I am correct.
For a robust regression, the assumptions behind an F-test are no longer satisfied, and R² can be defined in several ways that are no longer equivalent. See http://stats.idre.ucla.edu/stata/faq/how-can-i-get-an-r2-with-robust-regression-rreg/ for some discussion of this for Stata.
But I found a command called f.robftest in the sfsmisc package which produces an F-test result. Can I use this result as the F-test statistic for rlm? Also, I seem to get an R-squared by simply plugging the values into the usual formula, i.e. 1 - sum(residuals(rlm(y~x))^2)/sum((y-mean(y))^2). For t-tests to check the significance of the coefficients, I take the t-values from summary(rlm(y~x)) and compare them with the critical t-values at the 95% confidence level or so. Can I use these methods?
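That R one-liner is just 1 - SSR/SST computed from the robust fit's residuals. A minimal Python equivalent, keeping in mind (per the previous comment) that this is only one of several non-equivalent "R-squared" definitions for a robust fit:

```python
def pseudo_r_squared(y, residuals):
    """1 - SSR/SST from a fit's residuals -- the same arithmetic as the
    R one-liner above; only one possible pseudo-R^2 for robust regression."""
    ybar = sum(y) / len(y)
    ssr = sum(r * r for r in residuals)
    sst = sum((yi - ybar) ** 2 for yi in y)
    return 1 - ssr / sst
```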
Sometimes outliers are bad data, and should be excluded, such as typos. Sometimes they are Wayne Gretzky or Michael Jordan, and should be kept.
Outlier detection methods include:
- Univariate -> boxplot. Outside of 1.5 times the inter-quartile range is an outlier.
- Bivariate -> scatterplot with confidence ellipse. Outside of, say, the 95% confidence ellipse is an outlier.
- Multivariate -> Mahalanobis D² distance.
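The univariate boxplot rule is simple enough to sketch. Here is a pure-Python version using the standard library (the 1.5 factor is the usual boxplot convention, not a theoretically derived threshold):

```python
import statistics

def iqr_outliers(data, factor=1.5):
    """Flag points outside [Q1 - factor*IQR, Q3 + factor*IQR], the boxplot rule."""
    q1, _, q3 = statistics.quantiles(data, n=4)  # quartiles
    iqr = q3 - q1
    lo, hi = q1 - factor * iqr, q3 + factor * iqr
    return [v for v in data if v < lo or v > hi]
```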
Mark those observations as outliers.
Run a logistic regression (on Y=IsOutlier) to see if there are any systematic patterns.
Remove the ones that you can demonstrate are not representative of any sub-population.
And if you still have outliers, consider using a different model than a linear one. For example, with a model that has power-law-like behaviour, Michael Jordan is no longer an outlier (in terms of the model's ability to accommodate him).
Agree with most of what is said here, but I'd like to add the additional caution that "*outside of 1.5 times inter-quartile range is an outlier*" is a *convention*, not a rule with any theoretical foundation. It should not be used as a justification for excluding data points.
I do think there is something to be said for just excluding the outliers. A regression line is supposed to summarise the data, and because of leverage you can have a situation where 1% of your data points shifts the slope by 50%.
It's only dangerous from a moral and scientific point of view if you don't tell anybody that you excluded the outliers. As long as you point them out you can say:
"This regression line fits pretty well for most of the data. 1% of the time a value will come along that doesn't fit this trend, but hey, it's a crazy world, no system is perfect"
Do consider other models, though. The world is full of removed "outliers" that were real data, resulting in failures to predict something really important. Many natural processes have power-law-like behaviour with rare extreme events. Linear models may seem to fit such data (albeit not too well), but using one and deleting the "outliers" means missing those extreme events, which are usually important to know about!
Taking your question literally, I would argue that there are no statistical tests or rules of thumb that can be used as a basis for excluding outliers in linear regression analysis (as opposed to determining whether or not a given observation is an outlier). This must come from subject-area knowledge.
I think the best way to start is to ask whether the outliers even make sense, especially given the other variables you've collected. For example, is it really reasonable that you have a 600-pound woman in your study, which recruited from various sports injury clinics? Or, isn't it strange that a person lists 55 years of professional experience when they're only 60 years old? And so forth. Hopefully, you then have a reasonable basis for either throwing them out or getting the data compilers to double-check the records for you.
I would also suggest robust regression methods and the transparent reporting of dropped observations, as suggested by Rob and Chris respectively.
Hope this helps, Brenden
I've published a method for identifying outliers in nonlinear regression, and it can be also used when fitting a linear model.
HJ Motulsky and RE Brown. Detecting outliers when fitting data with nonlinear regression – a new method based on robust nonlinear regression and the false discovery rate. BMC Bioinformatics 2006, 7:123
There are two statistical distance measures that are specifically catered to detecting outliers and then considering whether such outliers should be removed from your linear regression.
The first one is Cook's distance. You can find a pretty good explanation of it at Wikipedia: http://en.wikipedia.org/wiki/Cook%27s_distance.
The higher the Cook's distance, the more influential the observation (i.e., the greater its impact on the regression coefficients). A typical cut-off point for considering removal of an observation is a Cook's distance of 4/n (where n is the sample size).
The second is DFFITS, which is also well covered by Wikipedia: http://en.wikipedia.org/wiki/DFFITS. A typical cut-off point for considering removal of an observation is a DFFITS value of 2 times sqrt(k/n), where k is the number of variables and n is the sample size.
Both measures usually give you similar results leading to similar observation selection.
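For a simple one-predictor OLS fit, both measures can be computed directly from the residuals and leverages. A Python sketch (the function name is mine; the formulas are the standard ones for Cook's D and DFFITS):

```python
import math

def cooks_and_dffits(x, y):
    """Cook's D and DFFITS for each point of a simple y ~ a + b*x OLS fit."""
    n, p = len(x), 2                                   # p = fitted parameters
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a = ybar - b * xbar
    e = [yi - (a + b * xi) for xi, yi in zip(x, y)]    # residuals
    h = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]   # leverages
    s2 = sum(ei ** 2 for ei in e) / (n - p)            # residual variance
    cooks, dffits = [], []
    for ei, hi in zip(e, h):
        cooks.append(ei ** 2 / (p * s2) * hi / (1 - hi) ** 2)
        # deleted-fit variance, for the externally studentized residual
        s2_i = ((n - p) * s2 - ei ** 2 / (1 - hi)) / (n - p - 1)
        t_i = ei / math.sqrt(s2_i * (1 - hi))
        dffits.append(t_i * math.sqrt(hi / (1 - hi)))
    return cooks, dffits
```

Comparing each point's values against the 4/n and 2*sqrt(k/n) cut-offs then flags the candidates for closer inspection.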
Garbage in, garbage out....
Implicit in getting the full benefit of linear regression is the assumption that the noise follows a normal distribution. Ideally you have mostly data and a little noise, not mostly noise and a little data. You can check the residuals for normality after the linear fit, for example with a Q-Q plot. You can also filter the input data before fitting to catch obvious, glaring errors.
Here are some types of noise in garbage input data that do not typically fit a normal distribution:
- Digits missing or added with hand-entered data (off by a factor of 10 or more)
- Wrong or incorrectly converted units (grams vs kilos vs pounds; meters, feet, miles, km), possibly from merging multiple data sets (Note: the Mars Climate Orbiter was lost in exactly this way, so even NASA rocket scientists can make this mistake)
- Use of codes like 0, -1, -99999 or 99999 to mean something non-numeric like "not applicable" or "column unavailable" and just dumping this into a linear model along with valid data
Writing a spec for what counts as "valid data" for each column can help you tag invalid data. For instance, a person's height in cm should fall in a range, say 100-300 cm. If you find 1.8 for height, that's a typo, and while you could assume it was 1.8 m and alter it to 180, I'd say it is usually safer to throw it out -- and best to document as much of the filtering as possible.
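A sketch of such a spec check in Python (the column names and plausible ranges here are made-up examples, to be replaced with subject-area knowledge for the actual data):

```python
# Hypothetical per-column spec: plausible (min, max) for each field.
SPEC = {"height_cm": (100, 300), "weight_kg": (30, 250)}

def invalid_rows(rows, spec=SPEC):
    """Return (row_index, column, value) for every value outside its plausible
    range, including missing values -- candidates for review, not silent fixes."""
    bad = []
    for i, row in enumerate(rows):
        for col, (lo, hi) in spec.items():
            v = row.get(col)
            if v is None or not (lo <= v <= hi):
                bad.append((i, col, v))
    return bad
```

Logging what this flags, rather than silently dropping or "correcting" it, is the documentation trail mentioned above.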
Statistical tests to be used as a basis for exclusion:
- standardised residuals
- leverage statistics
- Cook's distance, which is a combination of the two above.
From experience, exclusion should be limited to instances of incorrect data entry. Reweighting outliers in the linear regression model is a very good compromise method; Rob's answer shows how to do this in R. A great example is here: http://www.ats.ucla.edu/stat/r/dae/rreg.htm
If exclusion is necessary, one rule of thumb relates to the DFBETA statistic (which measures the change in a coefficient estimate when the observation is deleted): if the absolute value of the DFBETA statistic exceeds 2/sqrt(n), that substantiates removal of the outlier.
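A brute-force way to get a DFBETA for the slope is simply to refit with each point deleted. A Python sketch for the one-predictor case (note: I standardize by the deleted-fit standard error of the slope, which is one of a couple of common standardization variants):

```python
import math

def slope_dfbetas(x, y):
    """Standardized change in the fitted slope when each point is deleted:
    (b_full - b_without_i) / SE(b_without_i), via leave-one-out refits."""
    def fit(xs, ys):
        n = len(xs)
        xbar = sum(xs) / n
        ybar = sum(ys) / n
        sxx = sum((xi - xbar) ** 2 for xi in xs)
        b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(xs, ys)) / sxx
        a = ybar - b * xbar
        s2 = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(xs, ys)) / (n - 2)
        return b, math.sqrt(s2 / sxx)        # slope and its standard error
    b_full, _ = fit(x, y)
    out = []
    for i in range(len(x)):
        b_i, se_i = fit(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        out.append((b_full - b_i) / se_i)
    return out
```

Values whose absolute size exceeds the 2/sqrt(n) rule of thumb are the candidates for exclusion.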