### How to interpret the output of the summary method for an lm object in R?

• I am using sample algae data to understand data mining a bit more. I have used the following commands:

library(DMwR)   # provides the algae data, manyNAs() and knnImputation()

data(algae)
algae <- algae[-manyNAs(algae), ]            # drop rows with too many NAs
clean.algae <- knnImputation(algae, k = 10)  # fill remaining NAs via k-nearest neighbours
lm.a1 <- lm(a1 ~ ., data = clean.algae[, 1:12])
summary(lm.a1)


Subsequently I received the results below. However, I cannot find any good documentation explaining what most of this means, especially Std. Error, t value and Pr(>|t|).

Can someone please shed some light on this? Most importantly, which variables should I look at to determine whether a model is giving me good predictions?

Call:
lm(formula = a1 ~ ., data = clean.algae[, 1:12])

Residuals:
Min      1Q  Median      3Q     Max
-37.679 -11.893  -2.567   7.410  62.190

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  42.942055  24.010879   1.788  0.07537 .
seasonspring  3.726978   4.137741   0.901  0.36892
seasonsummer  0.747597   4.020711   0.186  0.85270
seasonwinter  3.692955   3.865391   0.955  0.34065
sizemedium    3.263728   3.802051   0.858  0.39179
sizesmall     9.682140   4.179971   2.316  0.02166 *
speedlow      3.922084   4.706315   0.833  0.40573
speedmedium   0.246764   3.241874   0.076  0.93941
mxPH         -3.589118   2.703528  -1.328  0.18598
mnO2          1.052636   0.705018   1.493  0.13715
Cl           -0.040172   0.033661  -1.193  0.23426
NO3          -1.511235   0.551339  -2.741  0.00674 **
NH4           0.001634   0.001003   1.628  0.10516
oPO4         -0.005435   0.039884  -0.136  0.89177
PO4          -0.052241   0.030755  -1.699  0.09109 .
Chla         -0.088022   0.079998  -1.100  0.27265
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 17.65 on 182 degrees of freedom
Multiple R-squared:  0.3731,    Adjusted R-squared:  0.3215
F-statistic: 7.223 on 15 and 182 DF,  p-value: 2.444e-12


An annotated regression output can be found at: http://www.ats.ucla.edu/stat/stata/output/reg_output.htm The layout of the output might look a little different (it's from Stata rather than R), but the content is more or less the same. Hope this helps.

You'll also want to read this: interpretation-of-rs-lm-output. After you've read those, see if you still have any questions, and if you do, edit your question to clarify what you still need to know.

• It sounds like you need a decent basic statistics text that covers at least basic location tests, simple regression and multiple regression.

Std. Error, t value and Pr(>|t|)

1. Std. Error is the standard deviation of the sampling distribution of the estimate of the coefficient under the standard regression assumptions. Such standard deviations are called standard errors of the corresponding quantity (the coefficient estimate in this case).

In the case of simple regression, it's usually denoted $s_{\hat \beta}$, as here. Also see this

For multiple regression, it's a little more complicated, but if you don't know what these things are it's probably best to understand them in the context of simple regression first.
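You can verify how the Std. Error column is computed directly in R. A minimal sketch using the built-in mtcars data (rather than the algae data, which needs an extra package): under the standard regression assumptions, the standard errors are the square roots of the diagonal of $\hat\sigma^2 (X'X)^{-1}$.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

X      <- model.matrix(fit)                       # design matrix (with intercept column)
sigma2 <- sum(resid(fit)^2) / df.residual(fit)    # estimate of the error variance
se     <- sqrt(diag(sigma2 * solve(t(X) %*% X)))  # standard errors of the coefficients

# These match the Std. Error column of summary(fit)
cbind(se, summary(fit)$coefficients[, "Std. Error"])
```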

2. t value is the value of the t-statistic for testing whether the corresponding regression coefficient is different from 0.

The formula for computing it is given at the first link above.
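That formula is easy to check yourself: the t value column is just the estimate divided by its standard error. A quick sketch, again using the built-in mtcars data:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)   # any lm fit will do

est  <- coef(fit)
se   <- summary(fit)$coefficients[, "Std. Error"]
tval <- est / se                          # t value = estimate / standard error

# These match the t value column of summary(fit)
cbind(tval, summary(fit)$coefficients[, "t value"])
```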

3. Pr(>|t|) is the p-value for the hypothesis test for which the t value is the test statistic. It tells you the probability of obtaining a test statistic at least as unusual as the one you got, if the null hypothesis were true. In this case, the null hypothesis is that the true coefficient is zero; if that probability is low, it suggests it would be rare to get a result this unusual if the coefficient were really zero.
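Since the t values follow a t distribution on the residual degrees of freedom under the null hypothesis, you can reproduce the Pr(>|t|) column with `pt()`. A sketch with the built-in mtcars data:

```r
fit  <- lm(mpg ~ wt + hp, data = mtcars)
tval <- summary(fit)$coefficients[, "t value"]

# two-sided p-value from the t distribution on the residual degrees of freedom
pval <- 2 * pt(-abs(tval), df = df.residual(fit))

# These match the Pr(>|t|) column of summary(fit)
cbind(pval, summary(fit)$coefficients[, "Pr(>|t|)"])
```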

Most importantly, which variables should I look at to determine whether a model is giving me good predictions?

What do you mean by 'good prediction data'? Can you make it clearer what you're asking?

The Residual standard error, usually denoted $s$, is an estimate of the standard deviation of the error term, computed from the residuals. It's a measure of how close the fit is to the points.
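Concretely, it's the square root of the residual sum of squares divided by the residual degrees of freedom. A sketch with the built-in mtcars data:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

# residual standard error: sqrt(RSS / residual df)
s <- sqrt(sum(resid(fit)^2) / df.residual(fit))

# matches the "Residual standard error" line of summary(fit)
c(s, summary(fit)$sigma)
```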

The Multiple R-squared, also called the coefficient of determination, is the proportion of the variance in the data that's explained by the model. The more variables you add - even if they don't help - the larger this will be. The Adjusted R-squared reduces that to account for the number of variables in the model.
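Both quantities follow directly from the residual and total sums of squares, which you can verify by hand. A sketch with the built-in mtcars data:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

rss <- sum(resid(fit)^2)                       # residual sum of squares
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)  # total sum of squares
r2  <- 1 - rss / tss                           # Multiple R-squared

n <- nrow(mtcars)
p <- length(coef(fit)) - 1                     # number of predictors
adj.r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1) # Adjusted R-squared

# These match the R-squared values reported by summary(fit)
c(r2, summary(fit)$r.squared, adj.r2, summary(fit)$adj.r.squared)
```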

The $F$-statistic on the last line tells you whether the regression as a whole is performing 'better than random': any set of random predictors will show some relationship with the response, so the test asks whether your model fits better than you'd expect if none of your predictors had any real relationship with the response (beyond what would be explained by that randomness). This is a test of whether the model outperforms 'noise' as a predictor. The p-value in the last row is the p-value for that test; essentially, it compares the full model you fitted against an intercept-only model.
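That comparison against an intercept-only model can be carried out explicitly with `anova()`. A sketch with the built-in mtcars data; the F value and p-value it reports match the last line of `summary(fit)`:

```r
fit  <- lm(mpg ~ wt + hp, data = mtcars)
null <- lm(mpg ~ 1, data = mtcars)   # intercept-only model

# F test of the full model against the intercept-only model
anova(null, fit)
```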

Where do the data come from? Is this in some package?