### What is rank deficiency, and how to deal with it?

Fitting a logistic regression using lme4 ends with

`Error in mer_finalize(ans) : Downdated X'X is not positive definite.`

From what I can tell, a likely cause of this error is rank deficiency. What is rank deficiency, and how should I address it?

It normally means that one or more of your variables are not linearly independent, in that the problematic variable can be expressed as a linear combination of the other variables. The R package `caret` has a function called `findLinearCombos` that will tell you which variables are problematic.
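`findLinearCombos` is an R/caret function, but the underlying check is language-independent. Here is a rough sketch of the idea in Python/NumPy (the function name `find_dependent_columns` is my own invention, not a library API): walk the columns of the design matrix and flag any column that fails to increase the matrix rank, since such a column must be a linear combination of earlier ones.

```python
import numpy as np

def find_dependent_columns(X, tol=1e-8):
    """Return indices of columns that are (numerically) linear
    combinations of earlier columns."""
    dependent = []
    kept = []  # indices of columns confirmed independent so far
    for j in range(X.shape[1]):
        candidate = X[:, kept + [j]]
        # The rank increases only if column j adds new information.
        if np.linalg.matrix_rank(candidate, tol=tol) > len(kept):
            kept.append(j)
        else:
            dependent.append(j)
    return dependent

# Column 2 equals column 0 plus column 1, so it should be flagged.
X = np.array([[1., 0., 1.],
              [0., 1., 1.],
              [1., 1., 2.],
              [2., 1., 3.]])
print(find_dependent_columns(X))  # [2]
```

A fitting routine given this `X` would hit exactly the kind of singular cross-product matrix the error message complains about.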

I agree with richiemorrisroe. Since the error says X'X is not positive definite, the implication is that the cross-product matrix X'X is singular, which happens exactly when the design matrix X does not have full column rank. Hence at least one of the covariates can be written as an exact linear combination of the other covariates.

Rank deficiency in this context means there is insufficient information contained in your data to estimate the model you desire. It can arise from many causes. I'll talk here about modeling in a fairly general context, rather than explicitly logistic regression, but everything still applies to that specific context.

The deficiency may stem from simply too little data. In general, you cannot uniquely estimate n parameters from fewer than n data points. That does not mean that all you need are n points: if there is any noise in the process, n points would give you rather poor results. You need more data to help the algorithm choose a solution that represents all of the data, in a minimum-error sense. This is why we use least squares tools. How much data do you need? I was always asked that question in a past life, and the answer was "more than you have" or "as much as you can get". :)

Sometimes you may have more data than you need, but some (too many) points are replicates. Replication is GOOD in the sense that it helps to reduce the noise, but it does not help to increase numerical rank. Suppose you have only two data points. You cannot estimate a unique quadratic model through them. A million replicates of each point will still not allow you to fit more than a straight line through what are effectively still only two points. Essentially, replication does not add information content; all it does is decrease noise at locations where you already have information.
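The replication point is easy to verify numerically. A sketch in Python/NumPy (used here purely for illustration, even though the question is about R): two distinct x-values give a quadratic (Vandermonde) design of rank 2, and replicating each point a thousand times leaves that rank unchanged.

```python
import numpy as np

# Two distinct x-values; a quadratic has three parameters.
x = np.array([1.0, 2.0])
X = np.vander(x, 3)                  # columns: x^2, x, 1
print(np.linalg.matrix_rank(X))      # 2: cannot identify a quadratic

# Replicate each point a thousand times: the rank is unchanged.
x_rep = np.repeat(x, 1000)
X_rep = np.vander(x_rep, 3)
print(np.linalg.matrix_rank(X_rep))  # still 2
```

Two thousand rows, but still only two distinct locations, so still only two identifiable parameters.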

Sometimes you have information in the wrong places. For example, you cannot fit a two-dimensional quadratic model if all you have are points that lie on a straight line in two dimensions. That is, suppose you have points scattered only along the line x = y in the plane, and you wish to fit a model for the surface z(x,y). Even zillions of points (none of them replicates) will not give you sufficient information to intelligently estimate more than a constant model. Amazingly, this is a common problem that I've seen in sampled data. The user wonders why they cannot build a good model; the problem is built into the very data they have sampled.
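The same idea, sketched in NumPy: when every sample lies on the line x = y, the six-column design for a full two-dimensional quadratic has rank 3, so half of the parameters are unidentifiable no matter how many points you collect.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-5, 5, 10_000)
y = x.copy()  # every single point lies on the line x = y

# Full two-dimensional quadratic surface: 6 parameters.
X = np.column_stack([np.ones_like(x), x, y, x**2, x*y, y**2])
print(np.linalg.matrix_rank(X))  # 3, not 6
```

The y column duplicates the x column, and x*y and y**2 duplicate x**2, so three of the six columns add no information at all.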

Sometimes it is simply choice of model. This can be viewed as "not enough data", but from the other side. You wish to estimate a complicated model, but have provided insufficient data to do so.

In all of the above instances the answer is to get more data, sampled intelligently from places that will provide information about the process that you currently lack. Design of experiments is a good place to start.

However, even good data is sometimes inadequate, at least numerically so. (Why do bad things happen to good data?) The problem here may be model related. It may lie in nothing more than a poor choice of units. It may stem from the computer programming done to solve the problem. (Ugh! Where to start?)

First, let's talk about units and scaling. Suppose I try to solve a problem where one variable is MANY orders of magnitude larger than another. For example, suppose I have a problem that involves my height and my shoe size. I'll measure my height in nanometers, so my height would be roughly 1.78 billion (1.78e9) nanometers. Of course, I'll choose to measure my shoe size in kilo-parsecs, so 9.14e-21 kilo-parsecs. When you do regression modeling, linear regression is all about linear algebra, which involves linear combinations of variables. The problem is that these numbers differ by a vast number of orders of magnitude (and are not even in the same units). The mathematics will fail when a computer program tries to add and subtract numbers that vary by so many orders of magnitude (for a double-precision number, the absolute limit is roughly 16 powers of 10).
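That 16-digit limit can be demonstrated in one line (Python here, with the deliberately silly height/shoe-size numbers from above): adding two doubles roughly 30 orders of magnitude apart silently discards the smaller one.

```python
# Height in nanometers vs. shoe size in kilo-parsecs:
# about 30 orders of magnitude apart, roughly twice what a
# double can span in a single sum (~16 significant digits).
height_nm = 1.78e9
shoe_kpc = 9.14e-21

total = height_nm + shoe_kpc
print(total == height_nm)  # True: the small term vanished entirely
```

Any linear algebra built on such sums is working with a contribution that has been rounded completely away.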

The trick is usually to use common units, but on some problems even that is an issue when variables vary by too many orders of magnitude. More important is to scale your numbers to be similar in magnitude.

Next, you may see problems with big numbers and small variation in those numbers. Thus, suppose you try to build a moderately high order polynomial model with data where your inputs all lie in the interval [1,2]. Squaring, cubing, etc., numbers that are on the order of 1 or 2 will cause no problems when working in double precision arithmetic. Alternatively, add 1e12 to every number. In theory, the mathematics will allow this. All it does is shift any polynomial model we build on the x-axis. It would have exactly the same shape, but be translated by 1e12 to the right. In practice, the linear algebra will fail miserably due to rank deficiency problems. You have done nothing but translate the data, but suddenly you start to see singular matrices popping up.
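A quick NumPy sketch of that failure: the same cubic design that has full rank for inputs in [1, 2] becomes numerically rank deficient once the inputs are translated by 1e12, even though mathematically nothing but a shift has happened.

```python
import numpy as np

x = np.linspace(1.0, 2.0, 50)
X = np.vander(x, 4)                    # cubic design: x^3, x^2, x, 1
print(np.linalg.matrix_rank(X))        # 4: well behaved

x_shift = x + 1e12                     # same shape, translated right
X_shift = np.vander(x_shift, 4)
print(np.linalg.matrix_rank(X_shift))  # drops below 4: numerically singular
```

In exact arithmetic both matrices have rank 4; in double precision the shifted columns are indistinguishable from multiples of one another, and the singular matrices start popping up.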

Usually the comment made will be a suggestion to "center and scale your data". Effectively this says to shift and scale the data so that it has a mean near zero and a standard deviation that is roughly 1. That will greatly improve the conditioning of most polynomial models, reducing the rank deficiency issues.
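Centering and scaling in action, again sketched in NumPy: standardizing inputs that are large but vary little takes the condition number of a cubic design from astronomical down to something modest.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 100) + 1e6   # big numbers, tiny variation
X = np.vander(x, 4)
print(np.linalg.cond(X))               # astronomically ill-conditioned

z = (x - x.mean()) / x.std()           # center and scale
Z = np.vander(z, 4)
print(np.linalg.cond(Z))               # small, well-conditioned
```

Same data, same model family, but the standardized version is one a least squares solver can actually handle.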

Other reasons for rank deficiency exist. In some cases it is built directly into the model. For example, suppose I provide the derivative of a function; can I uniquely infer the function itself? Of course not: integration involves a constant of integration, an unknown parameter that is generally inferred from knowledge of the value of the function at some point. In fact, this sometimes arises in estimation problems too, where the singularity of a system derives from the fundamental nature of the system under study.
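That structural deficiency is easy to see numerically (a small NumPy sketch): finite-difference "derivative data" is identical for two functions that differ by a constant, so no amount of such data can recover the constant.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 11)
f1 = x**2          # a candidate function
f2 = x**2 + 42.0   # the same function plus a constant

# The finite-difference derivative data is the same for both:
d1 = np.diff(f1) / np.diff(x)
d2 = np.diff(f2) / np.diff(x)
print(np.allclose(d1, d2))  # True: the constant is invisible
```

A least squares problem posed purely in terms of derivative data therefore has a flat direction (the constant), i.e. a rank-deficient system, by construction.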

I surely left out a few of the many reasons for rank deficiency in a linear system, and I've prattled along for too long now. Hopefully I managed to explain those I covered in simple terms, and a way to alleviate the problem.

License under CC-BY-SA with attribution

Content dated before 6/26/2020 9:53 AM

tristan 8 years ago

I would advise you to check that none of your variables are constants (i.e. zero variance). If you're fine on that score, check whether you have any complex-valued variables or infinite values.