How to interpret a QQ plot
I am working with a small dataset (21 observations) and have the following normal QQ plot in R:
Seeing that the plot does not support normality, what could I infer about the underlying distribution? It seems to me that a distribution more skewed to the right would be a better fit, is that right? Also, what other conclusions can we draw from the data?
You're correct that it indicates right skewness. I'll try to locate some of the posts on interpreting QQ plots.
You don't have to conclude; you just need to decide what to try next. Here I would consider square rooting or logging the data.
Tukey's Three-Point Method works very well for using Q-Q plots to help you identify ways to re-express a variable in a way that makes it approximately normal. For instance, picking the penultimate points in the tails and the middle point in this graphic (which I estimate to be $(-1.5,2)$, $(1.5,220)$, and $(0,70)$), you will easily find that the square root comes close to linearizing them. Thus you can infer that the underlying distribution is approximately square root normal.
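The three-point check in the comment above can be sketched in R as follows. The coordinates are the commenter's rough readings from the plot, and `slope_ratio` is a hypothetical helper name introduced here for illustration. It compares the slope between the upper two points with the slope between the lower two after re-expressing the data values; a ratio near 1 means the three points are close to collinear.

```r
# Three points read off the plot (approximate values from the comment above):
# theoretical normal quantile (x) and data value (y).
x <- c(-1.5, 0, 1.5)
y <- c(2, 70, 220)

# Hypothetical helper: ratio of the upper slope to the lower slope after
# applying the re-expression f to the data values.
slope_ratio <- function(f) {
  fy <- f(y)
  upper <- (fy[3] - fy[2]) / (x[3] - x[2])
  lower <- (fy[2] - fy[1]) / (x[2] - x[1])
  upper / lower
}

ratios <- c(identity = slope_ratio(identity),
            sqrt     = slope_ratio(sqrt),
            log      = slope_ratio(log))
print(round(ratios, 2))
```

Of these three re-expressions, the square root gives the ratio closest to 1: the raw values bend upward (ratio well above 1) and the log over-corrects (ratio well below 1).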
@Glen_b The answer to my question has some information: http://stats.stackexchange.com/questions/71065/what-distribution-to-use-for-this-qq-plot and the link in the answer has another good source: stats.stackexchange.com/questions/52212/qq-plot-does-not-match-histogram
What of this? Does the QQ plot show normally distributed data?
This question is one of our best, and I'm awarding a bounty to both answers as soon as the system lets me.
You might wanna check this link for some other insight https://seankross.com/2016/02/29/A-Q-Q-Plot-Dissection-Kit.html
If the values lie along a line the distribution has the same shape (up to location and scale) as the theoretical distribution we have supposed.
Local behaviour: With sorted sample values on the y-axis and (approximate) expected quantiles on the x-axis, we can see how the values in some section of the plot differ locally from an overall linear trend by checking whether they are more or less concentrated than the theoretical distribution would suppose in that section of the plot:
As we see, points that are less concentrated than supposed increase more rapidly, and points that are more concentrated than supposed increase less rapidly, than an overall linear relation would suggest. In the extreme cases these correspond to a gap in the density of the sample (which shows as a near-vertical jump) or a spike of constant values (values aligned horizontally). This allows us to spot a heavy tail or a light tail, and hence skewness greater or smaller than that of the theoretical distribution, and so on.
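The two extreme cases can be reproduced with a small simulation. This is only a sketch under assumed settings: the sample size, the seed, and the cut-offs used to create the gap and the spike are arbitrary choices made here, not taken from the answer.

```r
set.seed(1)   # arbitrary seed, for reproducibility only
n <- 200
z <- rnorm(n)

# A gap: shift everything above 0.5 up by 1.5, so no values fall in (0.5, 2).
gap <- ifelse(z > 0.5, z + 1.5, z)

# A spike: replace every value between -0.5 and 0.5 with exactly 0.
spike <- ifelse(abs(z) < 0.5, 0, z)

op <- par(mfrow = c(1, 2))
qqnorm(gap,   main = "Gap in the density");   qqline(gap)
qqnorm(spike, main = "Spike of tied values"); qqline(spike)
par(op)
```

The first panel should show a near-vertical jump where the gap sits, and the second a horizontal run of points at 0.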
Here's what QQ-plots look like (for particular choices of distribution) on average:
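The original code for these plots is not available (see the comments below), but something similar can be sketched by plotting exact quantiles of a few distributions against normal quantiles, which removes sampling noise entirely. The distributions, parameters, and panel titles here are illustrative assumptions, not the answer's own choices.

```r
n <- 200
p <- ppoints(n)   # plotting positions in (0, 1)
x <- qnorm(p)     # theoretical standard normal quantiles

# "Ideal" samples: exact quantiles of each distribution (no randomness).
# These particular distributions are assumptions made for illustration.
ideal <- list(
  "Normal"      = qnorm(p),
  "Right skew"  = qgamma(p, shape = 2),
  "Left skew"   = -qgamma(p, shape = 2),
  "Heavy tails" = qt(p, df = 3),
  "Light tails" = qunif(p)
)

op <- par(mfrow = c(2, 3))
for (name in names(ideal)) {
  plot(x, ideal[[name]], main = name,
       xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")
  qqline(ideal[[name]])   # reference line through the quartiles
}
par(op)
```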
But randomness tends to obscure things, especially with small samples:
Note that at $n=21$ the results may be much more variable than shown there - I generated several such sets of six plots and chose a 'nice' set where you could kind of see the shape in all six plots at the same time. Sometimes straight relationships look curved, curved relationships look straight, heavy-tails just look skew, and so on - with such small samples, often the situation may be much less clear:
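To get a feel for this variability, one option is to draw several samples of size 21 from a distribution that really is normal and look at their QQ plots side by side. A minimal sketch, with an arbitrary seed and an arbitrary choice of six panels:

```r
set.seed(21)   # arbitrary seed; change it to see other draws
n <- 21
samples <- replicate(6, rnorm(n), simplify = FALSE)

op <- par(mfrow = c(2, 3))
for (i in seq_along(samples)) {
  qqnorm(samples[[i]], main = paste("Normal sample", i))
  qqline(samples[[i]])
}
par(op)
```

Re-running this with different seeds gives a sense of how much apparent curvature or wiggle can arise from noise alone at this sample size.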
It's possible to discern more features than those (such as discreteness, for one example), but with $n=21$, even such basic features may be hard to spot; we shouldn't try to 'over-interpret' every little wiggle. As sample sizes become larger, generally speaking the plots 'stabilize' and the features become more clearly interpretable rather than representing noise. [With some very heavy-tailed distributions, the rare large outlier might prevent the picture stabilizing nicely even at quite large sample sizes.]
You may also find the suggestion here useful when trying to decide how much you should worry about a particular amount of curvature or wiggliness.
A more suitable guide for interpretation in general would also include displays at smaller and larger sample sizes.
This is a very practical guide, thank you very much for gathering all that information.
Because they're data from a variety of different distributions which have different means and standard deviations (the x-axis is based on quantiles of standard normals, of course). The axis values are of no consequence, though, since normality doesn't depend on the mean or standard deviation. You could remove them without changing the point here at all.
I understand that it is the shape and type of deviation from linearity that matters here, but it still looks odd that both axes are labeled " ... quantiles " and one axis goes as 0.2 0.4 0.6 and the other goes as -2 -1 0 1 2. Again, it looks ok that some data points are within the middle 40% of a theoretical distribution, but how can they be distributed between 3% of their own distribution, as the y-axis on your lower-right-most plot suggests?
Ah I see your confusion now. The two axes do not display quantiles of the *same* quantity, but sample quantiles plotted against quantiles of a standard normal (population mean=0, population sd=1). We do not for a moment expect the y-axis variable to be *standard* normal. If the original data are normal with (population) mean $\mu$, and sd $\sigma$, then the plot should tend to look like (some noise about) a straight line, with intercept $\mu$ and slope $\sigma$. The values on the axes are not expected to be similar unless it happens that $\mu\approx 0,\sigma\approx 1$.
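The intercept-and-slope relationship can be checked numerically. In this sketch the values of `mu` and `sigma`, the sample size, and the seed are arbitrary assumptions; a least-squares line is fitted to the QQ points only as a convenient way to read off an approximate intercept and slope.

```r
set.seed(42)   # arbitrary seed
mu <- 50       # assumed population mean
sigma <- 10    # assumed population sd
y <- rnorm(500, mean = mu, sd = sigma)

# qqnorm() returns the plotted coordinates: x = theoretical standard normal
# quantiles, y = the data values themselves.
qq <- qqnorm(y, plot.it = FALSE)

# Fit a straight line to the QQ points.
fit <- lm(qq$y ~ qq$x)
print(coef(fit))   # intercept should be near mu, slope near sigma
```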
@Macond The y-axis shows the raw values of the data, *not* their quantiles. I agree that standardizing the y-axis would make things *much* clearer, and I have no idea why R doesn't do this by default. Could someone shed some light on this?
@Glen_b Could you post the R code used to generate those plots? It would be very useful for those of us trying to dig deeper into how qq plots work. Perhaps it should be at the end to avoid interrupting the flow of your excellent answer.
@GordonGustafson It's more than eight months since I posted that answer; I sometimes can grab the code for these things for a few days, but rarely longer; I generally see the ideas as more important than the specific code for any illustration. I should be able to generate something quite close to them. Were you more interested in the first set of plots (the clean-looking "ideal" ones) or the next two sets (nice-looking real data and not-so-nice-looking real data)?
@Glen_b Code for the ideal plots would be quite sufficient. The reason I ask is because I struggled with the same thing Macond brought up: the values on the y-axis appear to be gibberish. Now that I've grasped that those 'Sample Quantiles' correspond to the raw data rather than any sort of quantile I can see why it isn't *necessary* to analyze them, but they would be useful for explaining the heavy vs. light tails plots if the standard deviation were listed (the light tails plot probably spans around 1.5 standard deviations vs around 10 for the heavy tails plot).
@Glen_b I suppose I'm asking for the code to better explain that concept, specifically to find the standard deviations for the light/heavy-tailed plots, so perhaps the code itself isn't necessary.
@GordonGustafson in respect of your first comment to Macond there's a very good reason why you don't standardize the data -- because a QQ plot is a display *of the data*! It's designed to show information in the data you supply to the function (it would make as much sense to standardize the data you supply to a boxplot or a histogram). If you transform it, it's no longer a display of the data (though the shape in the plot may be similar, you no longer show the location or scale on the plot). I'm not sure what it is you think would be clearer in a standardized plot - can you clarify?
@Glen_b If both axes were standardized (or neither axis was standardized!) it would be easy to compare the expected distribution on the x-axis with the actual distribution on the y-axis. Since they're in different 'units', it feels like I can't think about any of the specific axis values in a meaningful way (without converting one of them using the standard deviation). Obviously your plots should continue to use this standard (no pun intended...) convention, but everything made sense to me once I estimated the heavy-tailed standard deviation and imagined axes with the same units.
@Glen_b It's conceivable that I'm the only one who grasped some of the concepts through that line of reasoning, but I think the fact that the axes use different units deserves a mention for newbies like me. :)
@GordonGustafson thank you for your clarification, axes are the data and theoretical dist. values and points are the quantiles, indeed.
Thanks for the amazing answer! I'm hopelessly confused by the plots - it seems to me that the light tailed and heavy tailed plots are not quite right? `qqplot(rnorm(1000), runif(1000))` plots something similar to the light tailed plot you have, but surely uniform is heavy tailed compared to a Gaussian distribution? Thanks!