How should I transform non-negative data including zeros?

  • If I have highly skewed positive data I often take logs. But what should I do with highly skewed non-negative data that include zeros? I have seen two transformations used:

    • $\log(x+1)$ which has the neat feature that 0 maps to 0.
    • $\log(x+c)$, where $c$ is either estimated or set to some very small positive value.

    Are there any other approaches? Are there any good reasons to prefer one approach over the others?
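    To make the two options concrete, here is a minimal sketch in Python (standard library only). The data and the value of `c` are made up for illustration; nothing here says how `c` should be chosen.

```python
import math

def log_plus_one(x):
    """log(x + 1); log1p is accurate for small x and maps 0 to 0."""
    return math.log1p(x)

def log_plus_c(x, c):
    """log(x + c) for a strictly positive shift c."""
    if c <= 0:
        raise ValueError("c must be positive")
    return math.log(x + c)

# Hypothetical non-negative, skewed data that include a zero.
data = [0.0, 0.5, 3.0, 120.0]
c = 1e-3  # arbitrary small shift, chosen only for this example

shifted_one = [log_plus_one(x) for x in data]
shifted_c = [log_plus_c(x, c) for x in data]
```

    With a very small `c`, the zero is sent to `log(c)`, a large negative number, so the zeros end up far from the smallest positive observations on the transformed scale. With `log(x+1)` the zero stays at 0.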

    I've summarized some of the answers plus some other material at http://robjhyndman.com/researchtips/transformations/

    Excellent way to transform and promote stat.stackoverflow!

    Yes, I agree @robingirard (I just arrived here now because of Rob's blog post)!

    Also see http://stats.stackexchange.com/questions/39042/how-should-i-handle-a-left-censored-predictor-variable-in-multiple-regression for an application to left-censored data (which can be characterized, up to a shift of location, exactly as in the present question).

    It seems strange to ask about how to transform without having stated the purpose of transforming in the first place. What is the situation? Why is it necessary to transform? If we don't know what you're trying to achieve, how can one reasonably suggest *anything*? (Clearly one cannot hope to transform to normality, because the existence of a (non-zero) probability of exact zeros implies a spike in the distribution at zero, which spike no transformation will remove -- it can only move it around.)

    I've been playing around with $\log(x+d)-c$, where $c=\lfloor\log(\text{smallest non-zero value})\rfloor$ and $d=\exp(c)$. This also has the feature that 0 maps to 0. However, I've had trouble interpreting the coefficients from the resulting model. In particular, a log-linear model on a continuous predictor X has the property that a 1-unit change in X produces the same percent change in Y, whatever the starting value of X. Because of the small adjustment, this no longer seems to hold. For that reason, I'm starting to suspect that the alternative transformations below are superior.
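    A rough sketch of this shifted log, again in plain Python with made-up data. The function names are hypothetical. The second helper illustrates the interpretation problem: a fixed step on the transformed scale multiplies Y + d by a constant, not Y itself, so the implied ratio for Y depends on where Y starts.

```python
import math

def shifted_log(values):
    """Apply log(x + d) - c with c = floor(log(min positive value)), d = exp(c)."""
    positives = [v for v in values if v > 0]
    c = math.floor(math.log(min(positives)))
    d = math.exp(c)
    return [math.log(v + d) - c for v in values], c, d

def ratio_after_step(y, step, c, d):
    """Ratio new_y / y after moving `step` units on the transformed scale."""
    t = math.log(y + d) - c
    new_y = math.exp(t + step + c) - d  # back-transform
    return new_y / y

# Hypothetical data; the smallest non-zero value is 0.05.
values = [0.0, 0.05, 2.0, 40.0]
transformed, c, d = shifted_log(values)

# The same step on the transformed scale, starting from a small and a large Y.
ratio_small = ratio_after_step(0.05, 0.5, c, d)
ratio_large = ratio_after_step(40.0, 0.5, c, d)
```

    For large Y the ratio is close to `exp(0.5)`, as in an ordinary log model, but for Y near `d` it is noticeably larger, which is the loss of the constant-percent-change reading described above.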

  • Correct answer

    10 years ago

    It seems to me that the most appropriate choice of transformation is contingent on the model and the context.

    The '0' point can arise for several different reasons, each of which may need to be treated differently:

    • Truncation (as in Robin's example): Use appropriate models (e.g., mixtures, survival models, etc.)
    • Missing data: Impute data or drop observations, if appropriate.
    • Natural zero point (e.g., income levels; an unemployed person has zero income): Transform as needed
    • Sensitivity of measuring instrument: Perhaps, add a small amount to data?

    I am not really offering an answer as I suspect there is no universal, 'correct' transformation when you have zeros.

    Every answer to my question has provided useful information and I've up-voted them all. But I can only select one answer and Srikant's provides the best overview IMO.

    Also note that there are zero-inflated models (extra zeroes and you care about some zeroes: a mixture model), and hurdle models (zeroes and you care about non-zeroes: a two-stage model with an initial censored model).
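    As a very rough illustration of the two-stage idea (not a fitted zero-inflated or hurdle model), one can describe the zeros and the positive values separately: first the share of non-zero observations, then a summary of the positives on the log scale. The function name and data below are hypothetical.

```python
import math

def two_part_summary(values):
    """Return (share of positive values, mean of log over the positives)."""
    positives = [v for v in values if v > 0]
    p_positive = len(positives) / len(values)
    logs = [math.log(v) for v in positives]
    mean_log = sum(logs) / len(logs)
    return p_positive, mean_log

# Made-up sample: two zeros and three positive values.
sample = [0.0, 0.0, 1.0, math.e, math.e ** 2]
p_positive, mean_log = two_part_summary(sample)
```

    A real hurdle or zero-inflated model would replace each part with a regression (for example, a binary model for zero versus non-zero and a separate model for the positive part), but the split is the same: no transformation of the zeros is needed because they are modelled on their own.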

Licensed under CC-BY-SA with attribution


Content dated before 6/26/2020 9:53 AM