### How does R handle missing values in lm?

I'd like to regress a vector B against each of the columns in a matrix A. This is trivial if there are no missing data, but if matrix A contains missing values, then my regression against A is constrained to include only rows where all values are present (the default *na.omit* behavior). This produces incorrect results for columns with no missing data. I can regress the column matrix B against individual columns of the matrix A, but I have thousands of regressions to do, and this is prohibitively slow and inelegant. The *na.exclude* function seems to be designed for this case, but I can't make it work. What am I doing wrong here? Using R 2.13 on OSX, if it matters.

```r
A = matrix(1:20, nrow=10, ncol=2)
B = matrix(1:10, nrow=10, ncol=1)
dim(lm(A~B)$residuals)
# [1] 10  2   (the expected 10 residual values)

# Missing value in first column; now we have 9 residuals
A[1,1] = NA
dim(lm(A~B)$residuals)
# [1] 9  2    (the expected 9 residuals, given na.omit() is the default)

# Call lm with na.exclude; still have 9 residuals
dim(lm(A~B, na.action=na.exclude)$residuals)
# [1] 9  2    (was hoping to get a 10x2 matrix with a missing value here)

A.ex = na.exclude(A)
dim(lm(A.ex~B)$residuals)
# Throws an error because dim(A.ex) == 9 2
# Error in model.frame.default(formula = A.ex ~ B, drop.unused.levels = TRUE) :
#   variable lengths differ (found for 'B')
```
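For reference, a minimal sketch of the slow one-at-a-time workaround mentioned above (my illustration, not from the question: it simply loops `lm()` over the columns of `A`, so each fit only drops the rows that are NA in that particular column):

```r
## Fit each column of A against B separately; each fit drops only
## the rows that are NA in that one column.
fits <- lapply(seq_len(ncol(A)), function(j) lm(A[, j] ~ B))
sapply(fits, function(f) length(residuals(f)))   # 9 for column 1, 10 for column 2
```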

Sorry, meant to say "I can regress the column matrix B against the columns in A individually", meaning one-at-a-time calls to lm. Edited to reflect this.

One-at-a-time calls to `lm` are not a great way to go about regression: by definition, regression finds the partial effect of each predictor on a response/outcome given the state of the other variables, which single-predictor fits cannot do.

caracal (correct answer), 9 years ago

Edit: I misunderstood your question. There are two aspects:


a) `na.omit` and `na.exclude` both do casewise deletion with respect to both predictors and criteria. They differ only in that extractor functions like `residuals()` or `fitted()` will pad their output with `NA`s for the omitted cases under `na.exclude`, giving an output of the same length as the input variables.

```r
> N  <- 20                                   # generate some data
> y1 <- rnorm(N, 175, 7)                     # criterion 1
> y2 <- rnorm(N, 30, 8)                      # criterion 2
> x  <- 0.5*y1 - 0.3*y2 + rnorm(N, 0, 3)     # predictor
> y1[c(1, 3, 5)]  <- NA                      # some NA values
> y2[c(7, 9, 11)] <- NA                      # some other NA values
> Y  <- cbind(y1, y2)                        # matrix for multivariate regression
> fitO <- lm(Y ~ x, na.action=na.omit)       # fit with na.omit
> dim(residuals(fitO))                       # use extractor function
[1] 14  2
> fitE <- lm(Y ~ x, na.action=na.exclude)    # fit with na.exclude
> dim(residuals(fitE))                       # use extractor function -> = N
[1] 20  2
> dim(fitE$residuals)                        # access residuals directly
[1] 14  2
```
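As a quick follow-up check (my addition, using the `fitE` object from above): the padding lands at the same six positions in both columns, because the deletion was casewise over the whole model frame.

```r
## which cases are padded with NA in column y1?
## expect the union of the NA rows in y1 and y2, i.e. 1 3 5 7 9 11
which(is.na(residuals(fitE)[, "y1"]))
```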

b) The real issue is not the difference between `na.omit` and `na.exclude`; you don't seem to want casewise deletion that takes criterion variables into account, which both do.

```r
> X <- model.matrix(fitE)    # design matrix
> dim(X)                     # casewise deletion -> only 14 complete cases
[1] 14  2
```
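To see exactly which cases the casewise deletion removed (my addition, not part of the original answer), you can inspect the fit's `na.action` component:

```r
## indices of the jointly omitted cases: all rows with an NA in y1 OR y2
na.action(fitE)
## expect: 1 3 5 7 9 11
```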

The regression results depend on the matrices $X^{+} = (X' X)^{-1} X'$ (the pseudoinverse of the design matrix $X$, giving coefficients $\hat{\beta} = X^{+} Y$) and the hat matrix $H = X X^{+}$ (giving fitted values $\hat{Y} = H Y$). If you don't want casewise deletion, you need a different design matrix $X$ for each column of $Y$, so there's no way around fitting separate regressions for each criterion. You can try to avoid the overhead of `lm()` by doing something along the lines of the following:

```r
> Xf <- model.matrix(~ x)    # full design matrix (all cases)
> # function: manually calculate coefficients and fitted values for a single criterion y
> getFit <- function(y) {
+     idx   <- !is.na(y)       # throw away NAs
+     Xsvd  <- svd(Xf[idx, ])  # SVD decomposition of X
+     # get X+, but note: there might be better ways
+     Xplus <- tcrossprod(Xsvd$v %*% diag(Xsvd$d^(-2)) %*% t(Xsvd$v), Xf[idx, ])
+     list(coefs=(Xplus %*% y[idx]), yhat=(Xf[idx, ] %*% Xplus %*% y[idx]))
+ }
> res <- apply(Y, 2, getFit)  # get fits for each column of Y
> res$y1$coefs
                   [,1]
(Intercept) 113.9398761
x             0.7601234
> res$y2$coefs
                 [,1]
(Intercept) 91.580505
x           -0.805897
> coefficients(lm(y1 ~ x))   # compare with separate results from lm()
(Intercept)           x
113.9398761   0.7601234
> coefficients(lm(y2 ~ x))
(Intercept)          x
 91.580505  -0.805897
```
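A small usage note (my addition): since `res` is a named list with one entry per column of `Y`, the slopes can be collected into a single vector, for example:

```r
## pull out the second coefficient (the slope on x) for every criterion
sapply(res, function(f) f$coefs[2, 1])
```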

Note that there might be numerically better ways to calculate $X^{+}$ and $H$; you could use a $QR$ decomposition instead. The SVD approach is explained here on SE. I have not timed the above approach with big matrices $Y$ against actually using `lm()`.
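As one possible concretization of that $QR$ remark (my sketch, assuming the `Xf` and `Y` objects from the code above): base R's `qr()`, `qr.coef()`, and `qr.fitted()` can replace the manual pseudoinverse entirely.

```r
## QR-based variant of getFit(): numerically stabler than forming X+ explicitly
getFitQR <- function(y) {
    idx <- !is.na(y)          # drop cases where this criterion is NA
    QR  <- qr(Xf[idx, ])      # QR decomposition of the reduced design matrix
    list(coefs = qr.coef(QR, y[idx]),     # solves the least-squares problem
         yhat  = qr.fitted(QR, y[idx]))   # fitted values without forming H
}
resQR <- apply(Y, 2, getFitQR)            # one fit per column of Y, as before
```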

That makes sense given my understanding of how `na.exclude` should work. However, if you call `X.both = cbind(X1, X2)` and then `dim(lm(X.both ~ Y, na.action=na.exclude)$residuals)`, you still get 94 residuals, instead of 97 and 97.

That's an improvement, but if you look at `residuals(lm(X.both ~ Y, na.action=na.exclude))`, you see that each column has six missing values, even though the missing values in column 1 of `X.both` are from different samples than those in column 2. So `na.exclude` is preserving the shape of the residuals matrix, but under the hood R is apparently only regressing with values present in all rows of `X.both`. There may be a good statistical reason for this, but for my application it's a problem.
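A hedged reconstruction of that behavior (my assumptions: $N = 100$ with three disjoint NA rows per column, matching the 94/97 counts in the comments; the names `X1`, `X2`, `Y` follow the comment, not the answer's code):

```r
set.seed(1)
N  <- 100
Y  <- rnorm(N)                        # predictor, as in the comment's call
X1 <- rnorm(N); X1[c(1, 2, 3)] <- NA  # three NAs in column 1
X2 <- rnorm(N); X2[c(4, 5, 6)] <- NA  # three NAs in column 2, different rows
X.both <- cbind(X1, X2)
fit <- lm(X.both ~ Y, na.action=na.exclude)
dim(residuals(fit))             ## 100 x 2: na.exclude preserves the shape
colSums(is.na(residuals(fit)))  ## 6 and 6: the union of NA rows, not 3 and 3
```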

@David I had misunderstood your question. I think I now see your point, and have edited my answer to address it.


chl, 9 years ago

What do you mean by "I can calculate each row individually"?