How to determine which variable goes on the X & Y axes in a scatterplot?
I am trying to do a scatterplot to see the relationship between literacy and baby mortality. How do I know if literacy is my X axis and baby mortality is my Y axis, or the reverse? How do I determine what goes in the X axis and the Y axis?
Provided you clearly label the axes, you can do it any way you want! (But there are a few conventions--and they even differ among different disciplines.)
Obvious enough, but a detail crucial for some fields: in several Earth and environmental sciences, and in archaeology too, it is common to use depth below or height above land or sea surface as the vertical variable for scatter and other plots. That just seems a natural way to show the data given how the data are produced, using atmospheric balloons, bores, cores, digging or excavations.
If you have a variable you see as "explanatory" and the other one as the thing being explained, then one (very common) convention is to put the explanatory variable on the x-axis and the thing being explained by it on the y-axis.
So, for example, you may be viewing the relationship between literacy and mortality as potentially causal (and thus, clearly explanatory) in that greater literacy might lead to lower mortality.
In that case it would be common to put mortality on the y-axis and literacy on the x-axis.
But it's also possible to conceive of them the other way around (high infant mortality might well affect literacy rates), or with neither being explanatory of the other.
In some cases, if one variable is 'fixed' and the other is 'random', the more common convention is that the random one goes on the y-axis of the plot.
In some areas the conventions may tend to be flipped around; this is simply the most widespread.
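As a rough sketch of the explanatory-on-x convention described above, something like the following could be used with matplotlib. The function name `scatter_with_roles` and the numbers are made up purely for illustration (they are not real country data), and treating literacy as the explanatory variable is just one of the possible readings discussed above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt


def scatter_with_roles(explanatory, response, explanatory_label, response_label):
    """Scatter plot with the explanatory variable on x and the response on y.

    Hypothetical helper for illustration; it only fixes which variable
    goes on which axis and makes sure both axes are labelled.
    """
    fig, ax = plt.subplots()
    ax.scatter(explanatory, response)
    ax.set_xlabel(explanatory_label)
    ax.set_ylabel(response_label)
    return fig, ax


# Placeholder values for illustration only; NOT real data.
literacy_pct = [35, 50, 62, 78, 90, 97]
infant_mortality_per_1000 = [95, 70, 55, 30, 14, 6]

fig, ax = scatter_with_roles(
    literacy_pct,
    infant_mortality_per_1000,
    "Literacy rate (%)",
    "Infant mortality (per 1,000 live births)",
)
```

Swapping the first two arguments (and the two labels) gives the reverse orientation, which is equally valid if you view mortality as the explanatory variable. To see the figure, save it with `fig.savefig(...)` or drop the `matplotlib.use("Agg")` line and call `plt.show()`.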
The rules of thumb I teach students: if one variable was under experimental control (a good example of Glen_b's "fixed"), put it on the x-axis. If both variables are just observed, but you suspect a casual relationship between them, put "the cause" on the x-axis. If you would like to make predictions of one variable based on the other, put the one you're predicting on the y-axis and the one you're predicting from on the x-axis. Regardless of what you do, label axes clearly.
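These rules of thumb can be written down as a small decision procedure. The sketch below is only one way to encode them: the function name `assign_axes`, its keyword arguments, and the order in which the rules are checked (experimental control first, then suspected cause, then prediction target) are assumptions made for the example, not part of any established convention.

```python
def assign_axes(a, b, *, controlled=None, suspected_cause=None, predict=None):
    """Return (x_variable, y_variable) for two variable names a and b.

    Hypothetical helper. Rules, checked in this (assumed) order:
      1. a variable under experimental control goes on x;
      2. otherwise a suspected cause goes on x;
      3. otherwise the variable being predicted goes on y;
      4. otherwise keep the order given (any orientation is fine
         as long as the axes are clearly labelled).
    """
    def other(name):
        # Return the variable that is not `name`, rejecting unknown names.
        if name not in (a, b):
            raise ValueError(f"{name!r} is not one of {a!r}, {b!r}")
        return b if name == a else a

    if controlled is not None:
        return controlled, other(controlled)
    if suspected_cause is not None:
        return suspected_cause, other(suspected_cause)
    if predict is not None:
        return other(predict), predict
    return a, b


# Suspecting that literacy drives mortality puts literacy on x.
x_var, y_var = assign_axes("literacy", "mortality", suspected_cause="literacy")
print(f"x-axis: {x_var}, y-axis: {y_var}")
```

Calling `assign_axes("literacy", "mortality", predict="literacy")` instead flips the plot, since the variable being predicted goes on the y-axis.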
And there's something I use myself, but have never been able to quite put a handle on it, so haven't taught it to my students. We often have two related variables, for instance people's handspan and height, which both depend upon another bunch of variables (age, genetics, nutrition) rather than one being directly responsible for the other. I bet if we did a straw poll, the majority of analysts would put "height" on the x-axis and "handspan" on the y-axis. It seems common to put the "most fundamental" variable on the x-axis in these cases, but I'd be hard-pressed to define a firm rule for it.
@Beth, if these answers helped you, consider upvoting them by clicking the upwards-facing normal distribution to their left. If one or both resolved your issue, please consider accepting one by clicking the check mark below the vote total.
@Silverfish Better late than never, but "casual" is a typo for "causal" in your first comment. I will add a metacomment that I've seen this typo hundreds of times: some instances are possibly caused by some kind of auto-correct and others by the writer being too casual about checking what they wrote. In your case I blame the former.
Any x-y scatter plot is relevant only to the end user (pretty much what whuber said). In general, the x-axis shows the explanatory variable (the cause) and the y-axis shows the response (the effect). In your case, I would suggest that literacy is a variable that affects baby mortality, so I would put literacy on the X axis and mortality on the Y axis.
The independent variable (the thing you are changing) goes on the x-axis.
The dependent variable (the thing you are measuring) goes on the y-axis.