### Examples for teaching: Correlation does not mean causation

• There is an old saying: "Correlation does not mean causation". When I teach, I tend to use the following standard examples to illustrate this point:

1. number of storks and birth rate in Denmark;
2. number of priests in America and alcoholism;
3. at the start of the 20th century it was noted that there was a strong correlation between 'Number of radios' and 'Number of people in Insane Asylums';
4. and my favorite: pirates cause global warming.

However, I do not have any references for these examples and whilst amusing, they are obviously false.

Does anyone have any other good examples?

Flip through Freakonomics for some great examples. Their bibliography is chock full of references.

That pirates / global warming chart is clearly cooked up by conspiracy theorists - anyone can see they have deliberately plotted even spacing for unequal time periods to avoid showing the recent sharp increase in temperature as pirates are almost entirely wiped out. We all know that as temperatures rise it makes the rum evaporate and pirates cannot survive those conditions. ;-)

and also: http://xkcd.com/111/ (xkcd: Firefox and Witchcraft - The Connection?)

WTF is up with the x-axis on that pirate graph?

Economists are trying their best to analyze causality. If you can identify causality for an important question, you can publish your paper in top journals.

Or pretty much anything you put into Google Correlate, come to that.

I saw a website a while back that lists correlated data. I Googled and found this, which might help.

That is a beautiful graph about the pirates. It's almost a pivot table or something. I should make some of the financial reports look like that.

• It might be useful to explain that "causes" is an asymmetric relation (X causes Y is different from Y causes X), whereas "is correlated with" is a symmetric relation.

For instance, homeless population and crime rate might be correlated, in that both tend to be high or low in the same locations. It is equally valid to say that homeless population is correlated with crime rate, or that crime rate is correlated with homeless population. To say that crime causes homelessness, or that homeless populations cause crime, is to make two different statements. And correlation does not imply that either is true. For instance, the underlying cause could be a third variable such as drug abuse or unemployment.

The mathematics of statistics is not good at identifying underlying causes, which requires some other form of judgement.
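
A small simulation can make both points concrete. This is only a sketch under made-up assumptions: `third` stands in for a hypothetical common cause (say, unemployment), `x` and `y` are invented indices that each depend on it but not on each other, and the helper names `pearson` and `residuals` are illustrative, not from any library.

```python
import math
import random


def pearson(a, b):
    """Sample Pearson correlation of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def residuals(target, predictor):
    """What is left of `target` after fitting a least-squares line on `predictor`."""
    n = len(target)
    mean_t, mean_p = sum(target) / n, sum(predictor) / n
    slope = (sum((p - mean_p) * (t - mean_t) for p, t in zip(predictor, target))
             / sum((p - mean_p) ** 2 for p in predictor))
    return [(t - mean_t) - slope * (p - mean_p) for p, t in zip(predictor, target)]


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 5000
third = [rng.gauss(0, 1) for _ in range(n)]      # hypothetical common cause
x = [z + 0.5 * rng.gauss(0, 1) for z in third]   # invented "homelessness" index
y = [z + 0.5 * rng.gauss(0, 1) for z in third]   # invented "crime" index

r_xy = pearson(x, y)   # correlation is symmetric ...
r_yx = pearson(y, x)   # ... so swapping the arguments changes nothing
# Correlation that remains once the common cause is regressed out of both.
r_given_third = pearson(residuals(x, third), residuals(y, third))

print(f"corr(x, y) = {r_xy:.3f}, corr(y, x) = {r_yx:.3f}")
print(f"corr after removing the third variable = {r_given_third:.3f}")
```

By construction neither `x` nor `y` causes the other, yet they correlate strongly, and the correlation should all but vanish once the third variable is accounted for. In real data the third variable is usually unobserved, which is where the judgement comes in.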

Judgement is a good word, since all we can ever observe is correlation. All that experiments and/or clever statistics can do is allow us to exclude some alternative explanations for what could have caused an effect.

Very good comment about the symmetric/asymmetric relations. One also might claim that global warming causes piracy to increase.

• My favorites:

1) The more firemen are sent to a fire, the more damage is done.

2) Children who get tutored get worse grades than children who do not get tutored

and (this is my top one)

3) In the early elementary school years, astrological sign is correlated with IQ, but this correlation weakens with age and disappears by adulthood.

(@xmjx Supplied the first example last year.) I love the astrology example.

Can you explain the example with astrological sign please?

Never mind, I got it. That has to do with age difference between ones born at the beginning of the year and ones born at the end. Nice.

• I've always liked this one:

Nice, but I can't see anyone trying to draw a conclusion of causality there. Or are Mexican lemon-truck drivers notoriously dangerous once they get over the border?

Obviously an unforeseen side-effect of the profusion of lemon laws in the US. For example see: http://en.wikipedia.org/wiki/Lemon_law

A colleague of mine looked at the data for this in the post-2000 period, and found that the relationship held fairly well 'out-of-sample', which is even more disturbing...

A simple rationalization would be that both are decreasing with time. Does the post-2000 data support that? PS, Box Hunter and Hunter (see below) explain the storks example the same way: both increased with time over the period in question.
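
The "both change with time" explanation is easy to check on synthetic data. A rough sketch, with entirely invented series (the slopes, noise levels and names `series_a`, `series_b` are assumptions, not the lemon or fatality figures): two series share nothing but a time trend, and the correlation is compared before and after taking year-to-year changes.

```python
import math
import random


def pearson(a, b):
    """Sample Pearson correlation of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def first_differences(series):
    """Change from each observation to the next; removes a linear trend."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


rng = random.Random(1)  # fixed seed so the run is repeatable
n = 1000
# Two unrelated series: each is a downward trend plus its own independent noise.
series_a = [100 - 0.05 * t + rng.gauss(0, 2) for t in range(n)]
series_b = [50 - 0.02 * t + rng.gauss(0, 2) for t in range(n)]

r_levels = pearson(series_a, series_b)
r_changes = pearson(first_differences(series_a), first_differences(series_b))

print(f"correlation of levels:  {r_levels:.3f}")
print(f"correlation of changes: {r_changes:.3f}")
```

The levels should correlate strongly simply because both drift the same way, while the period-to-period changes should be close to uncorrelated. Differencing is just one simple way to strip a trend; it is used here for illustration only.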

Unforeseen, @Thylacoleo? Why would anyone be surprised that fewer lemons on the road results in fewer fatalities? ;)

I don't understand some of the comments here. It looks to me like the number of lemons imported increased between 1996 and 2000?

1. Sometimes correlation is enough. For example, in car insurance, male drivers are correlated with more accidents, so insurance companies charge them more. There is no way you could actually test this for causation. You cannot change the genders of the drivers experimentally. Google has made hundreds of billions of dollars not caring about causation.

2. To find causation, you generally need experimental data, not observational data. Though, in economics, they often use observed "shocks" to the system to test for causation, like if a CEO dies suddenly and the stock price goes up, you can assume causation.

3. Correlation is a necessary but not sufficient condition for causation. To show causation requires a counter-factual.

I like the first example you give. That will certainly get the students talking ;)

There's an interesting discussion by Steve Steinberg on his blog here: http://blog.steinberg.org/?p=11 about some of the implications of 1 and where it might lead in terms of Weak AI.

Could someone expand on the last sentence a little?

Just a quick clarification: correlation is not necessary for causation, depending on what is meant by correlation. If the correlation is linear correlation (which quite a few people with a little statistics will assume by default when the term is used) but the causation is nonlinear, the two can come apart. For example, suppose $X$ in $(-1,1)$ directly causes $Y$ (which takes values in $(0,1]$) through $Y = \sqrt{1-X^2}$. If the $X$'s are symmetrically distributed about zero, $X$ and $Y$ will be uncorrelated even though perfectly dependent.
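
The $Y = \sqrt{1-X^2}$ example can be checked numerically. A minimal sketch, assuming an evenly spaced symmetric grid for $X$ as one convenient symmetric distribution (the `pearson` helper is hand-rolled for illustration):

```python
import math


def pearson(a, b):
    """Sample Pearson correlation of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


n = 2000
# Midpoints of n equal cells: a grid symmetric about 0, strictly inside (-1, 1).
xs = [-1 + (2 * i + 1) / n for i in range(n)]
ys = [math.sqrt(1 - x * x) for x in xs]  # Y is completely determined by X

r_linear = pearson(xs, ys)                       # linear correlation of X and Y
r_squared_x = pearson([x * x for x in xs], ys)   # correlation of X^2 and Y

print(f"corr(X, Y)   = {r_linear:.2e}")
print(f"corr(X^2, Y) = {r_squared_x:.3f}")
```

The linear correlation between $X$ and $Y$ comes out as zero up to rounding error, because positive and negative $X$ values cancel, while $X^2$ and $Y$ are strongly negatively correlated. The dependence is there; the linear measure just cannot see it.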

• I have a few examples I like to use.

1. When investigating the cause of crime in New York City in the 80s, when they were trying to clean up the city, an academic found a strong correlation between the amount of serious crime committed and the amount of ice cream sold by street vendors! (Which is the cause and which is the effect?) Obviously, there was an unobserved variable causing both. Summers are when crime is the greatest and when the most ice cream is sold.

2. The size of your palm is negatively correlated with how long you will live (really!). In fact, women tend to have smaller palms and live longer.

3. [My favorite] I heard of a study a few years ago that found the amount of soda a person drinks is positively correlated with the likelihood of obesity. (I said to myself - that makes sense since it must be due to people drinking the sugary soda and getting all those empty calories.) A few days later more details came out. Almost all the correlation was due to an increased consumption of diet soft drinks. (That blew my theory!) So, which way is the causation? Do the diet soft drinks cause one to gain weight, or does a gain in weight cause an increased consumption of diet soft drinks? (Before you conclude it is the latter, see the study where a controlled experiment with rats showed the group that was fed a yogurt with artificial sweetener gained more weight than the group that was fed the normal yogurt.) Two references: Drink More Diet Soda, Gain More Weight?; Diet sodas linked to obesity. I think they are still trying to sort this one out.

The last one is slightly more complicated than you present it, but I agree many of the observational associations found between soda/diet soda and obesity should be looked at with a critical eye. Theoretically some have posited that the fake sugar/fat substitutes have other physiological effects beyond the simple calorie intake. See for example this experiment on rats and synthetic fats (taken from the Freakonomics blog).

• The number of Nobel prizes won by a country (adjusting for population) correlates well with per capita chocolate consumption. (New England Journal of Medicine)

+1 I was very disappointed with NEJM when they published this

Seems to also correlate quite well with proximity to Sweden..

Chocolate consumption (per capita) also significantly correlates with the per capita number of serial murderers. http://replicatedtypo.com/chocolate-consumption-traffic-accidents-and-serial-killers/5718.html

Harvey, I think the Nobel prize winners vs. serial murderers makes good sense: both of these are outliers, and observing both of them means the distribution of whatever psychosocial stuff underlies both of them has heavier tails relative to other countries.

I asked three Nobel prize winners I (vaguely) know, and all three said they have eaten way more chocolate than most of their colleagues. Of course, these answers came after they read the NEJM paper!

May I make a complete fool of myself? Isn't the correlation with per capita chocolate consumption an effect of a higher average income, where higher average income means better education, and countries with better education have more Nobel prize winners? (Sorry, I just found this SE and am really a noob on the subject.)

@Jonathan. That is one theory. You'd need to draw the graph of per capita income values vs. Nobel prizes to see if that explanation makes sense.

@MattBagg This was published as _"Occasional Notes"_ and obviously not to be taken seriously.

@Pascal It's apparently not obvious to the authors of 56 papers (per Google Scholar on Aug 29 2014) that cite the paper. As Doi et al put it in PLOS One, "Despite warnings on the danger of over-interpreting correlations, this initial observation was widely broadcasted by popular media, was taken as fact by several scientific publications, and was accepted and extended in a letter published in Nature." (quoted from http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0092612#pone-0092612-g003 )

@MattBagg Aw, that PLOS paper is a sad reminder of how many of us have no sense of humor. The Nature letter is a survey of chocolate consumption among Nobel Prize winners, so of course they cite it. And while interestingly there is a correlation, nobody sane will conclude that this is the one causation. Conversely there still seems to be something there – now whether that is because of some chocolate ingredient, if people on their way to a Nobel worry less and eat more chocolate, if it's their social context or if it's schools sponsored by chocolate manufacturers is up for grabs. ;)

• Although it's more of an illustration of the problem of multiple comparisons, it is also a good example of misattributed causation:

Rugby (the religion of Wales) and its influence on the Catholic church: should Pope Benedict XVI be worried?

"every time Wales win the rugby grand slam, a Pope dies, except for 1978 when Wales were really good, and two Popes died."

• There are two aspects to this post hoc ergo propter hoc problem that I like to cover: (i) reverse causality and (ii) endogeneity.

An example of "possible" reverse causality: Social drinking and earnings - drinkers earn more money according to Bethany L. Peters & Edward Stringham (2006. "No Booze? You May Lose: Why Drinkers Earn More Money Than Nondrinkers," Journal of Labor Research, Transaction Publishers, vol. 27(3), pages 411-421, June). Or do people who earn more money drink more either because they have a greater disposable income or due to stress? This is a great paper to discuss for all sorts of reasons including measurement error, response bias, causality, etc.

An example of "possible" endogeneity: The Mincer Equation explains log earnings by education, experience and experience squared. There is a long literature on this topic. Labour economists want to estimate the causal effect of education on earnings, but perhaps education is endogenous because "ability" could increase the amount of education an individual has (by lowering the cost of obtaining it) and could lead to an increase in earnings, irrespective of the level of education. A potential solution to this could be an instrumental variable. Angrist and Pischke's book, Mostly Harmless Econometrics, covers this and related topics in great detail and clarity.
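
For reference, the Mincer equation is commonly written along the following lines. The symbols are my own labels, not taken from any particular paper: $w_i$ is earnings, $S_i$ years of education, $E_i$ years of experience and $\varepsilon_i$ the error term.

$$\ln w_i = \beta_0 + \beta_1 S_i + \beta_2 E_i + \beta_3 E_i^2 + \varepsilon_i$$

In this notation the endogeneity worry is that unobserved ability sits inside $\varepsilon_i$ and is also related to $S_i$, so that $\operatorname{Cov}(S_i, \varepsilon_i) \neq 0$ and the ordinary least squares estimate of $\beta_1$ mixes the effect of education with the effect of ability.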

Other silly examples that I have no support for include:

- Number of televisions per capita and the mortality rate. So let's send TVs to developing countries. Obviously both are endogenous to something like GDP.
- Number of shark attacks and ice cream sales. Both are endogenous to the temperature, perhaps?

I also like to tell the terrible joke about the lunatic and the spider. A lunatic is wandering the corridors of an asylum with a spider he's carrying in the palm of his hand. He sees the doctor and says, "Look Doc, I can talk to spiders. Watch this. Spider, go left!" The spider duly moves to the left. He continues, "Spider, go right." The spider shuffles to the right of his palm. The doctor replies, "Interesting, maybe we should talk about this in the next group session." The lunatic retorts, "That's nothing Doc. Watch this." He pulls off each of the spider's legs one by one and then shouts, "Spider, go left!" The spider lies motionless on his palm and the lunatic turns to the doctor and concludes, "If you pull off a spider's legs he'll go deaf."

• The best one I've been taught is that the number of drownings and sales of ice cream may be highly correlated, but that doesn't imply one causes the other. Drownings and sales of ice cream are obviously both higher in the summer months when the weather is good. A third variable, good weather, drives both.

• I work with students in teaching correlation vs causation in my Algebra One classes. We examine a lot of possible examples. I found the article Bundled-Up Babies and Dangerous Ice Cream: Correlation Puzzlers from the February 2013 Mathematics Teacher to be useful. I like the idea of talking about "lurking variables". Also this cartoon is a cute conversation starter:

We identify the independent and dependent variable in the cartoon and talk about whether this is an example of causation, if not why not.