How to deal with perfect separation in logistic regression?
If you have a variable that perfectly separates the zeroes and ones in the target variable, R will issue the following warning, which signals perfect or quasi-perfect separation:
Warning message: glm.fit: fitted probabilities numerically 0 or 1 occurred
We still get the model, but the coefficient estimates are inflated.
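To make the symptom concrete, here is a minimal sketch using base R only. The toy data and the names `toy`, `msgs` and `fit` are made up for illustration; the exact size of the estimate will depend on your data.

```r
# Hypothetical toy data: x > 0 perfectly predicts y == 1
toy <- data.frame(x = c(-3, -2, -1, 1, 2, 3))
toy$y <- as.integer(toy$x > 0)

# Fit an ordinary logistic regression and collect any warnings it raises
msgs <- character(0)
fit <- withCallingHandlers(
  glm(y ~ x, data = toy, family = binomial),
  warning = function(w) {
    msgs <<- c(msgs, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

print(msgs)                    # warnings raised by glm.fit
print(coef(fit))               # slope for x is very large
print(summary(fit)$coefficients)  # standard errors are huge as well
```

The slope keeps growing with every iteration because the likelihood has no finite maximum under separation, so the reported value mostly reflects where the algorithm stopped.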
How do you deal with this in practice?
One solution is to use a form of penalized regression. In fact, this is the original reason some of the penalized regression forms were developed (although they turned out to have other interesting properties).
Install and load package glmnet in R and you're mostly ready to go. One of the less user-friendly aspects of glmnet is that you can only feed it matrices, not formulas as we're used to. However, you can look at model.matrix and the like to construct this matrix from a data.frame and a formula...
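A minimal sketch of what that might look like, assuming glmnet is installed. The simulated data, the variable names (`dat`, `x1`, `x2`, `X`, `cvfit`) and the choice of a ridge penalty are illustrative assumptions on my part, not a recommendation for your data.

```r
library(glmnet)

# Hypothetical data in which x1 perfectly separates the outcome
set.seed(42)
n <- 100
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- as.integer(dat$x1 > 0)

# model.matrix() turns a formula plus data.frame into a numeric matrix;
# drop the intercept column because glmnet adds its own intercept
X <- model.matrix(y ~ x1 + x2, data = dat)[, -1]

# alpha = 0 gives a ridge penalty, alpha = 1 the lasso;
# cv.glmnet() picks the penalty strength lambda by cross-validation
cvfit <- cv.glmnet(X, dat$y, family = "binomial", alpha = 0)

# Penalized coefficients at the cross-validated lambda: finite, shrunken
print(coef(cvfit, s = "lambda.min"))
```

Note that `model.matrix` also expands factors into dummy columns, which is usually what you want before handing the matrix to glmnet.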
Now, when you expect that this perfect separation is not just a byproduct of your sample but could be true in the population, you specifically don't want to penalize it away: simply use the separating variable as the sole predictor of your outcome, without fitting a model of any kind.
"Now, when you expect..." Question regarding this. I have a case/control study looking at the relationship with the microbiome. We also have a treatment that is almost only found among cases. However, we think the treatment might also affect the microbiome. Is this an example of your caveat? Hypothetically we could find a bunch more cases not using the treatment if we tried, but we have what we have.