How to interpret the output of the summary method for an lm object in R?
I am using sample algae data to understand data mining a bit more. I have used the following commands:
    library(DMwR)  # the algae data, manyNAs() and knnImputation() come from the DMwR package
    data(algae)
    algae <- algae[-manyNAs(algae), ]            # drop rows with too many missing values
    clean.algae <- knnImputation(algae, k = 10)  # impute remaining NAs from the 10 nearest neighbours
    lm.a1 <- lm(a1 ~ ., data = clean.algae[, 1:12])
    summary(lm.a1)
Subsequently I received the results below. However, I cannot find any good documentation that explains what most of this means, especially Std. Error, t value and Pr(>|t|).
Can someone please be kind enough to shed some light? Most importantly, which variables should I look at to ascertain whether a model is giving me good prediction data?
    Call:
    lm(formula = a1 ~ ., data = clean.algae[, 1:12])

    Residuals:
        Min      1Q  Median      3Q     Max 
    -37.679 -11.893  -2.567   7.410  62.190 

    Coefficients:
                  Estimate Std. Error t value Pr(>|t|)   
    (Intercept)  42.942055  24.010879   1.788  0.07537 . 
    seasonspring  3.726978   4.137741   0.901  0.36892   
    seasonsummer  0.747597   4.020711   0.186  0.85270   
    seasonwinter  3.692955   3.865391   0.955  0.34065   
    sizemedium    3.263728   3.802051   0.858  0.39179   
    sizesmall     9.682140   4.179971   2.316  0.02166 * 
    speedlow      3.922084   4.706315   0.833  0.40573   
    speedmedium   0.246764   3.241874   0.076  0.93941   
    mxPH         -3.589118   2.703528  -1.328  0.18598   
    mnO2          1.052636   0.705018   1.493  0.13715   
    Cl           -0.040172   0.033661  -1.193  0.23426   
    NO3          -1.511235   0.551339  -2.741  0.00674 **
    NH4           0.001634   0.001003   1.628  0.10516   
    oPO4         -0.005435   0.039884  -0.136  0.89177   
    PO4          -0.052241   0.030755  -1.699  0.09109 . 
    Chla         -0.088022   0.079998  -1.100  0.27265   
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 17.65 on 182 degrees of freedom
    Multiple R-squared: 0.3731,  Adjusted R-squared: 0.3215 
    F-statistic: 7.223 on 15 and 182 DF,  p-value: 2.444e-12
An annotated regression output can be found at: http://www.ats.ucla.edu/stat/stata/output/reg_output.htm The layout of the output might look a little different (it uses Stata rather than R), but the content is more or less the same. Hope this helps.
It sounds like you need a decent basic statistics text that covers at least basic location tests, simple regression and multiple regression.
Std. Error, t value and Pr.
Std. Error is the standard deviation of the sampling distribution of the estimate of the coefficient under the standard regression assumptions. Such standard deviations are called standard errors of the corresponding quantity (the coefficient estimate in this case).
For multiple regression, it's a little more complicated, but if you don't know what these things are it's probably best to understand them in the context of simple regression first.
t value is the value of the t-statistic for testing whether the corresponding regression coefficient is different from 0. It is computed as the coefficient estimate divided by its standard error; the formula is also given at the first link above.
Pr(>|t|) is the p-value for the hypothesis test for which the t value is the test statistic. It tells you the probability of a test statistic at least as unusual as the one you obtained, if the null hypothesis were true. In this case, the null hypothesis is that the true coefficient is zero; if that probability is low, it suggests it would be rare to get a result as unusual as this if the coefficient were really zero.
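To make the connection concrete, here is a minimal sketch (assuming the lm.a1 object fitted in the question is available) that recomputes the t value and Pr(>|t|) columns directly from the estimates and standard errors:

    ## recompute t values and two-sided p-values by hand from the coefficients table
    cf <- coef(summary(lm.a1))                     # columns: Estimate, Std. Error, t value, Pr(>|t|)
    t.val <- cf[, "Estimate"] / cf[, "Std. Error"] # t = estimate / standard error
    p.val <- 2 * pt(abs(t.val), df = df.residual(lm.a1), lower.tail = FALSE)
    cbind(t.val, p.val)                            # matches the t value and Pr(>|t|) columns above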
Most importantly, which variables should I look at to ascertain whether a model is giving me good prediction data?
What do you mean by 'good prediction data'? Can you make it clearer what you're asking?
Residual standard error, which is usually called $s$, represents the standard deviation of the residuals. It's a measure of how close the fit is to the points.
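For instance, a quick sketch (again assuming the lm.a1 fit above) recovers the 17.65 reported in the output; in recent versions of R, sigma() extracts the same quantity directly:

    ## residual standard error = sqrt(residual sum of squares / residual degrees of freedom)
    rss <- sum(residuals(lm.a1)^2)
    sqrt(rss / df.residual(lm.a1))   # 17.65, as reported in the summary
    sigma(lm.a1)                     # the same value, extracted directly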
Multiple R-squared, also called the coefficient of determination, is the proportion of the variance in the data that's explained by the model. The more variables you add, even if they don't help, the larger this will be. The Adjusted R-squared reduces that to account for the number of variables in the model.
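If it helps to demystify those two numbers, here is a minimal sketch (assuming the clean.algae data and lm.a1 fit above) that recomputes both by hand:

    ## recompute Multiple and Adjusted R-squared from the sums of squares
    y   <- clean.algae$a1
    tss <- sum((y - mean(y))^2)                 # total sum of squares
    rss <- sum(residuals(lm.a1)^2)              # residual sum of squares
    r2  <- 1 - rss / tss                        # 0.3731, the Multiple R-squared
    n   <- length(y)
    p   <- length(coef(lm.a1)) - 1              # number of predictors, excluding the intercept
    1 - (1 - r2) * (n - 1) / (n - p - 1)        # 0.3215, the Adjusted R-squared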
The $F$ statistic on the last line tells you whether the regression as a whole performs better than you'd expect by chance. Any set of random predictors will show some relationship with the response, so the test asks whether your model fits better than it would if all your predictors had no real relationship with the response (beyond what that randomness would explain). The p-value in the last row is the p-value for that test; it essentially compares the full model you fitted with an intercept-only model.
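As a rough illustration of that comparison (a sketch, assuming the lm.a1 fit above), anova() carries out exactly this full-versus-intercept-only test:

    ## the overall F test is a comparison against an intercept-only model
    null.model <- update(lm.a1, . ~ 1)   # intercept-only model
    anova(null.model, lm.a1)             # F = 7.223 on 15 and 182 df, p-value = 2.444e-12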
Where do the data come from? Is this in some package?