How does R handle missing values in lm?

  • I'd like to regress a vector B against each of the columns in a matrix A. This is trivial if there are no missing data, but if matrix A contains missing values, then my regression against A is constrained to include only rows where all values are present (the default na.omit behavior). This produces incorrect results for columns with no missing data. I can regress the column matrix B against individual columns of the matrix A, but I have thousands of regressions to do, and this is prohibitively slow and inelegant. The na.exclude function seems to be designed for this case, but I can't make it work. What am I doing wrong here? Using R 2.13 on OSX, if it matters.

    A = matrix(1:20, nrow=10, ncol=2)
    B = matrix(1:10, nrow=10, ncol=1)
    dim(lm(A~B)$residuals)
    # [1] 10 2 (the expected 10 residual values)
    
    # Missing value in first column; now we have 9 residuals
    A[1,1] = NA  
    dim(lm(A~B)$residuals)
    #[1]  9 2 (the expected 9 residuals, given na.omit() is the default)
    
    # Call lm with na.exclude; still have 9 residuals
    dim(lm(A~B, na.action=na.exclude)$residuals)
    #[1]  9 2 (was hoping to get a 10x2 matrix with a missing value here)
    
    A.ex = na.exclude(A)
    dim(lm(A.ex~B)$residuals)
    # Throws an error because dim(A.ex)==9,2
    #Error in model.frame.default(formula = A.ex ~ B, drop.unused.levels = TRUE) : 
    #  variable lengths differ (found for 'B')
    

    What do you mean by "I can calculate each row individually"?

    Sorry, meant to say "I can regress the column matrix B against the columns in A individually", meaning one-at-a-time calls to lm. Edited to reflect this.

    One-at-a-time calls to lm/regression is not a great way to go about doing regression (going by the definition of regression, which is to find the partial effect of each predictor on a response/outcome given the state of other variables)

  • caracal

    caracal Correct answer

    9 years ago

    Edit: I misunderstood your question. There are two aspects:

    a) na.omit and na.exclude both do casewise deletion with respect to both predictors and criterions. They only differ in that extractor functions like residuals() or fitted() will pad their output with NAs for the omitted cases with na.exclude, thus having an output of the same length as the input variables.

    > N    <- 20                               # generate some data
    > y1   <- rnorm(N, 175, 7)                 # criterion 1
    > y2   <- rnorm(N,  30, 8)                 # criterion 2
    > x    <- 0.5*y1 - 0.3*y2 + rnorm(N, 0, 3) # predictor
    > y1[c(1, 3,  5)] <- NA                    # some NA values
    > y2[c(7, 9, 11)] <- NA                    # some other NA values
    > Y    <- cbind(y1, y2)                    # matrix for multivariate regression
    > fitO <- lm(Y ~ x, na.action=na.omit)     # fit with na.omit
    > dim(residuals(fitO))                     # use extractor function
    [1] 14  2
    
    > fitE <- lm(Y ~ x, na.action=na.exclude)  # fit with na.exclude
    > dim(residuals(fitE))                     # use extractor function -> = N
    [1] 20  2
    
    > dim(fitE$residuals)                      # access residuals directly
    [1] 14  2
    

    b) The real issue is not with this difference between na.omit and na.exclude, you don't seem to want casewise deletion that takes criterion variables into account, which both do.

    > X <- model.matrix(fitE)                  # design matrix
    > dim(X)                                   # casewise deletion -> only 14 complete cases
    [1] 14  2
    

    The regression results depend on the matrices $X^{+} = (X' X)^{-1} X'$ (pseudoinverse of design matrix $X$, coefficients $\hat{\beta} = X^{+} Y$) and the hat matrix $H = X X^{+}$, fitted values $\hat{Y} = H Y$). If you don't want casewise deletion, you need a different design matrix $X$ for each column of $Y$, so there's no way around fitting separate regressions for each criterion. You can try to avoid the overhead of lm() by doing something along the lines of the following:

    > Xf <- model.matrix(~ x)                    # full design matrix (all cases)
    # function: manually calculate coefficients and fitted values for single criterion y
    > getFit <- function(y) {
    +     idx   <- !is.na(y)                     # throw away NAs
    +     Xsvd  <- svd(Xf[idx , ])               # SVD decomposition of X
    +     # get X+ but note: there might be better ways
    +     Xplus <- tcrossprod(Xsvd$v %*% diag(Xsvd$d^(-2)) %*% t(Xsvd$v), Xf[idx, ])
    +     list(coefs=(Xplus %*% y[idx]), yhat=(Xf[idx, ] %*% Xplus %*% y[idx]))
    + }
    
    > res <- apply(Y, 2, getFit)    # get fits for each column of Y
    > res$y1$coefs
                       [,1]
    (Intercept) 113.9398761
    x             0.7601234
    
    > res$y2$coefs
                     [,1]
    (Intercept) 91.580505
    x           -0.805897
    
    > coefficients(lm(y1 ~ x))      # compare with separate results from lm()
    (Intercept)           x 
    113.9398761   0.7601234 
    
    > coefficients(lm(y2 ~ x))
    (Intercept)           x 
      91.580505   -0.805897
    

    Note that there might be numerically better ways to caculate $X^{+}$ and $H$, you could check a $QR$-decomposition instead. The SVD-approach is explained here on SE. I have not timed the above approach with big matrices $Y$ against actually using lm().

    That makes sense given my understanding of how na.exclude should work. However, if you call >X.both = cbind(X1, X2) and then >dim(lm(X.both~Y, na.action=na.exclude)$residuals) you still get 94 residuals, instead of 97 and 97.

    That's an improvement, but if you look at residuals(lm(X.both ~ Y, na.action=na.exclude)), you see that each column has six missing values, even though the missing values in column 1 of X.both are from different samples than those in column 2. So na.exclude is preserving the shape of the residuals matrix, but under the hood R is apparently only regressing with values present in all rows of X.both. There may be a good statistical reason for this, but for my application it's a problem.

    @David I had misunderstood your question. I think I now see your point, and have edited my answer to address it.

License under CC-BY-SA with attribution


Content dated before 6/26/2020 9:53 AM