When (and why) should you take the log of a distribution (of numbers)?
Say I have some historical data e.g., past stock prices, airline ticket price fluctuations, past financial data of the company...
Now someone (or some formula) comes along and says "let's take/use the log of the distribution" and here's where I go WHY?
- WHY should one take the log of the distribution in the first place?
- WHAT does the log of the distribution 'give/simplify' that the original distribution couldn't/didn't?
- Is the log transformation 'lossless'? I.e., when transforming to log-space and analyzing the data, do the same conclusions hold for the original distribution? How come?
- And lastly WHEN to take the log of the distribution? Under what conditions does one decide to do this?
I've really wanted to understand log-based distributions (for example lognormal) but I never understood the when/why aspects - i.e., the log of the distribution is a normal distribution, so what? What does that even tell me and why bother? Hence the question!
UPDATE: As per @whuber's comment I looked at the posts and for some reason I do understand the use of log transforms and their application in linear regression, since you can draw a relation between the independent variable and the log of the dependent variable. However, my question is generic in the sense of analyzing the distribution itself - there is no relation per se that I can conclude to help understand the reason of taking logs to analyze a distribution. I hope I'm making sense :-/
In regression analysis you do have constraints on the type/fit/distribution of the data and you can transform it and define a relation between the independent and (not transformed) dependent variable. But when/why would one do that for a distribution in isolation where constraints of type/fit/distribution are not necessarily applicable in a framework (like regression). I hope the clarification makes things more clear than confusing :)
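To make the question concrete, here is a minimal sketch (my own illustration in Python/NumPy, not something from this thread) of the situation where logging is usually proposed: data generated multiplicatively, like prices, are heavily right-skewed, but become roughly symmetric on the log scale.

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated "prices": multiplicative variation gives lognormal-like,
# right-skewed data
x = rng.lognormal(mean=0.0, sigma=0.8, size=100_000)

def skewness(a):
    """Sample skewness: third central moment over cubed std. dev."""
    a = np.asarray(a, dtype=float)
    m, s = a.mean(), a.std()
    return ((a - m) ** 3).mean() / s ** 3

print(skewness(x))          # strongly right-skewed (well above 0)
print(skewness(np.log(x)))  # roughly symmetric (near 0)
```

The point is not that logging is always right, but that for this multiplicative kind of data the log scale is where the familiar symmetric-distribution tools apply.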
This question deserves a clear answer as to "WHY and WHEN"
Because this covers almost the same ground as previous questions here and here, please read those threads and update your question to focus on any aspects of this issue that haven't already been addressed. Note, too, #4 (and part of #3) are elementary questions about logarithms whose answers are readily found in many places.
The clarification helps. You might want to ponder the fact, though, that regression with only a constant term (and no other independent variables) amounts to assessing the variation of the data around their mean. Therefore, if you really understand the effects of taking logs of dependent variables in regression, you *already* understand the (simpler) situation you are asking about here. In short, once you have answers to all four questions for regression, you don't need to ask them again about "the distribution in isolation."
@whuber: I see...so I do understand the reasons for taking logs in regression, but only because I had been taught so - I understand it from the need to do so perspective i.e., to make sure the data fits within the assumptions of linear regression. That's my only understanding. Maybe what I'm missing is "real understanding" of the effect of taking logs and hence the confusion...any help? ;)
Ah, but you know much more than that, because after using logs in regression, you know that the results are interpreted differently and you know to take care in back-transforming fitted values and confidence intervals. I'm suggesting that you might *not* be confused and that you probably already know many of the answers to these four questions, even though you weren't initially aware of it :-).
Readers here may also want to look at these closely related threads: interpretation-of-log-transformed-predictor, & How to interpret logarithmically transformed coefficients in linear regression.
So is it right to say that if a relationship is non-linear but becomes linear after taking logs, we work with the logged version because a linear model is easier to fit and interpret?
If you assume a model form that is non-linear but can be transformed to a linear model, such as $\log Y = \beta_0 + \beta_1 t$, then one is justified in taking logarithms of $Y$ to meet the specified model form. In general, whether or not you have causal series, the only time you are justified in taking the log of $Y$ is when it can be shown that the variance of $Y$ is proportional to the square of its expected value, $\operatorname{Var}(Y) \propto [E(Y)]^2$ (equivalently, the standard deviation is proportional to the mean). I don't remember the original source for the following, but it nicely summarizes the role of power transformations. It is important to note that the distributional assumptions are always about the error process, not the observed $Y$; thus it is a definite "no-no" to analyze the original series for an appropriate transformation unless the series is described by a simple constant.
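A quick numerical illustration of that condition (a hypothetical simulation of mine, not part of the answer): when the standard deviation of $Y$ grows in proportion to its level, the raw spread differs wildly across levels, while the spread of $\log Y$ is essentially constant.

```python
import numpy as np

rng = np.random.default_rng(1)
levels = [10.0, 100.0, 1000.0]
# Multiplicative noise: each group's s.d. is proportional to its mean,
# i.e. Var(Y) is proportional to E[Y]^2 (constant coefficient of variation)
groups = [m * rng.lognormal(0.0, 0.3, 50_000) for m in levels]

raw_sd = [g.std() for g in groups]          # grows roughly 10x per level
log_sd = [np.log(g).std() for g in groups]  # roughly constant (about 0.3)
print(raw_sd)
print(log_sd)
```

When this pattern holds in the errors, the log transform stabilizes the variance; when it does not, logging is the wrong remedy.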
Unwarranted or incorrect transformations, including differences, should be studiously avoided, as they are often an ill-conceived attempt to deal with unidentified anomalies, level shifts, time trends, changes in parameters, or changes in error variance. A classic example is discussed starting at slide 60 here http://www.autobox.com/cms/index.php/afs-university/intro-to-forecasting/doc_download/53-capabilities-presentation , where three untreated pulse anomalies led early researchers to an unwarranted log transformation. Unfortunately some of our current researchers are still making the same mistake.
The optimal power transformation is found via the Box-Cox test, where the estimated power $\lambda$ indicates the transform:
- -1.0 is a reciprocal
- -0.5 is a reciprocal square root
- 0.0 is a log transformation
- 0.5 is a square root transform and
- 1.0 is no transform.
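For concreteness, $\lambda$ can be estimated by maximum likelihood; here is a hedged sketch using `scipy.stats.boxcox` (my choice of tool, not the answer's) on data whose log really is normal, so the estimate should land near 0:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Data whose log is exactly normal, so lambda = 0 is the "true" answer
y = rng.lognormal(mean=0.0, sigma=0.5, size=20_000)

# With lmbda=None, boxcox returns the transformed data and the
# maximum-likelihood estimate of lambda
y_transformed, lam = stats.boxcox(y)
print(lam)  # should be close to 0, pointing to a log transform
```

Note that Box-Cox requires strictly positive data, which prices and most financial levels satisfy.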
Note that when you have no predictor/causal/supporting input series, the model is $Y_t = u + a_t$, and no distributional requirements are placed on $Y$ itself BUT they are placed on $a_t$, the error process. In this case the distributional requirements on $a_t$ pass directly to $Y_t$. When you have supporting series, such as in a regression or in an autoregressive moving-average model with exogenous inputs (ARMAX), the distributional assumptions are all about $a_t$ and have nothing whatsoever to do with the distribution of $Y_t$. Thus in an ARIMA or ARMAX model one would never presume a transformation of $Y$ before finding the optimal Box-Cox transformation, which then suggests the remedy (transformation) for $Y$. In earlier times some analysts would transform both $Y$ and $X$ presumptively, just to be able to read the percent change in $Y$ per percent change in $X$ off the regression coefficient between $\log Y$ and $\log X$. In summary, transformations are like drugs: some are good for you and some are bad. They should be used only when necessary, and then with caution.
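As a footnote on interpretation (a generic illustration of my own, not part of the answer above): if the mean-only model $Y_t = u + a_t$ holds on the log scale, back-transforming by simple exponentiation recovers the median of $Y$, not its mean; the usual lognormal correction adds half the error variance before exponentiating.

```python
import numpy as np

rng = np.random.default_rng(3)
# log Y ~ N(1, 0.5^2): the mean-only model holds on the log scale
y = np.exp(1.0 + 0.5 * rng.standard_normal(200_000))

mu_hat = np.log(y).mean()   # estimate of u on the log scale
s2_hat = np.log(y).var()    # estimate of the error variance

median_est = np.exp(mu_hat)               # exp of log-mean: median of Y
mean_est = np.exp(mu_hat + 0.5 * s2_hat)  # lognormal correction: mean of Y
print(median_est, mean_est, y.mean())
```

This is the "take care in back-transforming" point raised in the comments above: conclusions drawn in log-space transfer to the original scale, but the quantities they describe (median vs. mean) change.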
I agree that whoever left the downvote(s) should leave a remark as to why. To Irishstat: it would be much easier to read your post if you took advantage of the formatting options for answers, especially those for marking up equations in LaTeX. See the markdown editing help section; that link is available whenever you type a response, in the top right corner of the posting box (the orange circle with the question mark).
The cited table is found in _Introduction to Linear Regression Analysis_ By Douglas C. Montgomery, Elizabeth A. Peck, G. Geoffrey Vining.
@user1717828 thank you .. I have always been a fan of Montgomery as he has a long background involving time series
Is it not always true that the second moment and the variance are related to one another? We have the classic identity: the variance equals the second moment minus the square of the first moment, $\operatorname{Var}(Y) = E[Y^2] - (E[Y])^2$.
As you say, the variance is a function of the second moment; where did I imply otherwise? Additionally, the variance can change (deterministically) at different points in time, SEE https://pdfs.semanticscholar.org/09c4/ba8dd3cc88289caf18d71e8985bdd11ad21c.pdf , which is not remedied by a power transform.