"Best" series of colors to use for differentiating series in publication-quality plots
Has any study been done on what are the best set of colors to use for showing multiple series on the same plot? I've just been using the defaults in matplotlib, and they look a little childish since they're all bright, primary colors.
This does not answer your question, but I think it is important to mention. Whenever possible, any color scheme chosen should be supplemented with differing symbols or line styles such that when the plot is printed in black and white it is still easy to understand. Far too often authors rely solely on color, making the figures useless to the colorblind and those who prefer to read the black and white printed version of your paper. Plots should always, if possible, work in black and white, and work "better" in color.
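One way to follow this advice in matplotlib is to give every series a dash pattern and a marker along with its colour. Below is a minimal sketch; the function name `series_styles` and the particular colours, dashes and markers are my own placeholder choices, not anything prescribed above.

```python
from itertools import cycle, islice

# Illustrative choices only; any clearly different dashes and markers would do.
COLORS = ["#377eb8", "#e41a1c", "#4daf4a", "#984ea3"]
LINESTYLES = ["-", "--", "-.", ":"]
MARKERS = ["o", "s", "^", "D"]


def series_styles(n):
    """Return n style dicts, each pairing a colour with a dash pattern and a marker."""
    pool = zip(cycle(COLORS), cycle(LINESTYLES), cycle(MARKERS))
    return [
        {"color": c, "linestyle": ls, "marker": m}
        for c, ls, m in islice(pool, n)
    ]


styles = series_styles(3)
for style in styles:
    print(style)
```

Each dict can be passed straight to a plot call, e.g. `ax.plot(x, y, **style)`. With these four-element lists the styles repeat after four series, so the series stay distinguishable in black and white only up to that point.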
+1 to MHH. A legendary piece of television commentary on snooker makes the same point indirectly: "Steve is going for the pink ball - and for those of you who are watching in black and white, the pink is next to the green." Explanation for younger readers: That comes from a time when some people could afford colour television but others had to go for the cheaper black and white television.
"Best" for what purpose? This is not a trivial or flippant question. To impress readers of an internet forum, I use graphical symbols that work without color and then decorate them with rainbow colors (which may be meaningful but are there primarily to attract attention and give a sense of "quality"). For plots that are intended to transmit data, another color scheme might be chosen, whereas for plots that are created in an exploratory fashion to reveal possibly unexpected patterns (in a visual *gestalt*) the scheme should depend on purpose: differentiation, aggregation, selection, other?
@whuber: You make a point. I should have specified that I meant for publication in scientific literature, and in general I meant to ask for answers to each of the categories of aggregation, selection, differentiation, etc. Indeed, aggregation and differentiation are often not separate goals: in the figures from one of my papers (http://dx.doi.org/10.1063/1.4864755), I needed both (and I don't think I did a very good job of it). (Sorry to those of you not on academic campuses; I'll try to put a general public link up soon)
A common reference for choosing a color palette is the work of Cynthia Brewer on ColorBrewer. The colors were chosen based on perceptual patterns in choropleth maps, but most of the same advice applies to using color in any type of plot to distinguish data patterns. If color is solely to distinguish between the different lines, then a qualitative palette is in order.
Often color is not needed in line plots with only a few lines, and different point symbols and/or dash patterns are effective enough. A more common problem with line plots is that if the lines frequently overlap it will be difficult to distinguish different patterns no matter what symbols or color you use. Stephen Kosslyn recommends, as a general rule of thumb, having no more than 4 lines in a plot. If you have more, consider splitting the lines into a series of small multiple plots. Here is an example showing the recommendation
No color needed and the labels are more than sufficient.
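A rough matplotlib sketch of this kind of plot, with four black lines told apart by dash pattern and labelled directly at their right-hand ends. The data and the series names A to D are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; remove to view interactively
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 50)
# Made-up series purely for illustration.
series = {
    "A": 0.5 * x,
    "B": np.sqrt(x) + 4.0,
    "C": 8.0 - 0.6 * x,
    "D": np.full_like(x, 3.5),
}
dashes = ["-", "--", "-.", ":"]

fig, ax = plt.subplots()
labels = []
for (name, y), ls in zip(series.items(), dashes):
    ax.plot(x, y, color="black", linestyle=ls)
    # Label each line at its right-hand end instead of using a legend.
    labels.append(
        ax.annotate(name, xy=(x[-1], y[-1]), xytext=(5, 0),
                    textcoords="offset points", va="center")
    )

# Drop the top and right frame lines to reduce clutter.
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
```

Calling `fig.savefig(...)` or `plt.show()` afterwards renders the figure. If two series end at nearly the same value the labels will collide and need to be moved by hand.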
Thanks for the recommendation of ColorBrewer! That's the kind of thing I was looking for.
The greyscale image doesn't work if there are two series with the same value either in the middle somewhere (the two series can't be traced back past that point), or at the end (the labels will not be able to distinguish which series is which). It's great when it works though...
Agree @naught101 for the middle (if the ends coincide, just put the label somewhere before the end). It is one reason to actually not use linear interpolation between points, but use some type of spline. In that case the spline will curve in different directions. That happens quite a lot in dense parallel coordinate plots. (Jittering can also help with data with many ties, like low integer count data.)
Even then, it could still happen, if the slopes are the same. Less likely, sure. Also, I think splines can sometimes give a false impression that they are based on more data than they actually are, and worse, can give a really wrong impression of the trend at the end of the time-series. But yeah, everything like this is probably going to have to be assessed on a case-by-case basis for appropriateness anyway.
The advice of no more than 4 lines in a plot is well intended but a counsel of perfection. It's a good maxim for presentation graphics whenever it's important that individual lines be identifiable. It's not when the goal is to see the family properties of tens or hundreds or even more lines and knowing identifiers is dispensable: superimposed lines can often serve well to show overall patterns and also what is happening in the tails. Also, it's increasingly common to have interactive graphics in which identifiers or other data pop up on a mouse click.
Thank you @NickCox. Agree the four-line advice is quite restrictive. To give Kosslyn some more credit than my note makes here, he did say it was more about the shape of the curves than individual strokes. So you can cluster lines, like here, and the graph overall is still simple to understand. (Also that was his advice for presentations, for which he thought you should only go with pretty insipid charts to keep things simple.) The blog link is working for me.
I posted a fairly detailed review of Kosslyn's graphics book at https://www.amazon.com/gp/customer-reviews/RVIIR7L4RMN25
Much outstandingly good advice in other answers, but here are some extra points from my own low-level advice to students. This is all just advice, naturally, to be thought about given the key questions: What is my graph intended to do? What makes sense with these data? Who is the readership? What am I expecting colour(s) to do within the graph? Does the graph work well, regardless of someone else's dogmas?
Furthermore, the importance of colour varies enormously from one graph to another. For a choropleth or patch map, in which the idea is indeed that different areas are coloured or at least shaded differently, the success of a graph is bound up with the success of its colour scheme. For other kinds of graphs, colours may be dispensable or even a nuisance.
Are your colours all needed? For example, if different variables or groups are clearly distinguished by text labels in different regions of a graph, then separate colours too would often be overkill. Beware fruit salad or technicolor dreamcoat effects. For a pie chart with text labelling on or by the slices, colour conveys no extra information, for example. (If your pie chart depends on a key or legend, you are likely to be trying the wrong kind of graph.)
Never rely on a contrast between red and green, as so many people struggle to distinguish these colours.
Rainbow sequences (ROYGBIV or red-orange-yellow-green-blue-indigo-violet) may appeal on physical grounds, but they don't work well in practice. For example, yellow is usually a weak colour while orange and green are usually stronger, so the impression is not even of a monotonic sequence.
Avoid any colour scheme which has the consequence of large patches of strong colour.
A sequence from dark red to dark blue works well when an ordered sequence is needed. If white is (as usual) the background colour anywhere, don't use it, but skip from pale red to pale blue. [added 1 March 2018] Perhaps too obvious to underline: red has connotations of negative and/or danger for many, which can be helpful, and blue can then mean positive. Too obvious to underline, but I do it anyway: Red and blue do have political connotations in many countries.
Blue and orange go well together (a grateful nod to Hastie, Tibshirani and Friedman here: http://web.stanford.edu/~hastie/local.ftp/Springer/OLD/ESLII_print4.pdf) [added 1 March 2018] Many introductory books on visualization now recommend orange, blue and grey as a basic palette: orange and/or blue for what you care about and grey for backdrop.
Grayscale from pale gray to dark gray can work well and is a good idea when colour reproduction is out of the question. (It is a lousy printer that can't make a fair bash at grayscale.) (Grey if you like; preferences change across oceans, it seems; just as with colour and color.)
[added 5 Aug 2016] A fairly general principle is that often two colours work much better than many. If two groups are both of interest, then choose equally strong colours (e.g. red or orange and blue). If one group is of particular interest among several, make it blue or orange, and let the others be grey. Using seven colours for seven groups in principle carries the information, but it's hard even to focus on one colour at a time when there is competition from several others. Small multiples can be better for several groups than a multicolour plot.
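The "one group in colour, the rest in grey" idea from the last point can be sketched as a small helper. The function name `highlight_palette`, the group names and the two hex codes are all hypothetical choices for illustration.

```python
def highlight_palette(groups, focus, focus_color="#E66100", rest_color="#B0B0B0"):
    """Map each group to a colour: one strong colour for the focus, grey for the rest."""
    if focus not in groups:
        raise ValueError(f"{focus!r} is not one of the groups")
    return {g: (focus_color if g == focus else rest_color) for g in groups}


# Hypothetical group names, purely for illustration.
groups = ["control", "dose 1", "dose 2", "dose 3"]
palette = highlight_palette(groups, focus="dose 2")
for group, color in palette.items():
    print(group, color)
```

In matplotlib the result could be used as `ax.plot(x, y, color=palette[group])`, drawing the focus group last (or with a higher `zorder`) so the grey lines do not cover it.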
Point 2 is VERY important. One of my statistics teachers was color blind and happily used "light-yellow" and "lightgreen yellow" in a chart. The colors were virtually identical to us, but for him they were easily distinguishable.
Thanks for the tips, especially #2. I looked back at my plots and realized that red and green are the first two colors matplotlib always chooses. That's not so great.
There's actually been a good deal of research on this in recent years.
A big point is "semantic resonance." This basically means "colors that correspond to what they represent," e.g. a time series for money should be colored green, at least for an audience in the USA. This apparently improves comprehension. One very interesting paper on the subject is by Lin, et al (2013): http://vis.stanford.edu/papers/semantically-resonant-colors
There's also the very nice iWantHue color generator, at http://tools.medialab.sciences-po.fr/iwanthue/, with lots of info in the other tabs.
Lin, Sharon, Julie Fortuna, Chinmay Kulkarni, Maureen Stone, and Jeffrey Heer. (2013). Selecting Semantically-Resonant Colors for Data Visualization. Computer Graphics Forum (Proc. EuroVis), 2013
+1 ... however, some things - such as your example of *money* - aren't universal. Money may be green(-ish) in the US. It's not green everywhere and the association with color can vary from country to country (e.g. someone in Germany might be more likely to associate blue with money, though nowadays it tends to come in a wide variety of colours).
Paul Tol provides a colour scheme optimised for colour differences (i.e., categorical or qualitative data) and colour-blind vision on his website, and in detail in a "technote" (PDF file) linked to there. He states:
To make graphics with your scientific results as clear as possible, it is handy to have a palette of colours that are:
- distinct for all people, including colour-blind readers;
- distinct from black and white;
- distinct on screen and paper; and
- still match well together.
I took the colour scheme from his "Palette 1" of the 9 most distinct colours, and placed it in my matplotlibrc file:
axes.color_cycle : 332288, 88CCEE, 44AA99, 117733, 999933, DDCC77, CC6677, 882255, AA4499
Then, borrowing from Joe Kington's answer, the default lines are plotted by:

    import matplotlib.pyplot as plt
    import numpy as np

    x = np.linspace(0, 20, 100)

    # Two stacked panels: parabolas on top, cosines below.
    fig, axes = plt.subplots(nrows=2)
    for i in range(1, 10):
        axes[0].plot(x, i * (x - 10)**2)
    for i in range(1, 10):
        axes[1].plot(x, i * np.cos(x))

    plt.show()
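A note for newer matplotlib versions: as far as I'm aware, the `axes.color_cycle` setting was replaced by `axes.prop_cycle` from matplotlib 2.0 onwards. A sketch of setting the same nine colours from code rather than from the rc file (the variable name `tol_colors` is mine):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; remove to view interactively
import matplotlib.pyplot as plt
import numpy as np
from cycler import cycler

# The nine hex codes quoted above, with the leading '#' needed in Python code.
tol_colors = ["#332288", "#88CCEE", "#44AA99", "#117733", "#999933",
              "#DDCC77", "#CC6677", "#882255", "#AA4499"]

# Must be set before the axes are created for it to take effect.
plt.rcParams["axes.prop_cycle"] = cycler(color=tol_colors)

x = np.linspace(0, 20, 100)
fig, ax = plt.subplots()
lines = [ax.plot(x, i * np.cos(x))[0] for i in range(1, 10)]
```

Each of the nine lines then picks up the next colour of the palette in order.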
For diverging colour maps (e.g., to represent scalar values), the best reference I have seen is the paper by Kenneth Moreland available here "Diverging Color Maps for Scientific Visualization". He developed the cool-warm scheme to replace the rainbow scheme, and "presents an algorithm that allows users to easily generate their own customized color maps".
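For matplotlib users: the built-in colormap named `coolwarm` is, to my knowledge, derived from Moreland's work. A minimal sketch with made-up data, using symmetric colour limits so that the neutral middle of the map sits at zero:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; remove to view interactively
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=0.5, scale=1.0, size=(20, 20))  # made-up scalar field

# Symmetric limits keep the neutral midpoint of the colormap at zero.
limit = float(np.abs(data).max())

fig, ax = plt.subplots()
image = ax.imshow(data, cmap="coolwarm", vmin=-limit, vmax=limit)
fig.colorbar(image, ax=ax, label="deviation from zero")
```

Without the symmetric `vmin`/`vmax`, the midpoint colour would land at the middle of the data range rather than at the reference value, which defeats the point of a diverging map.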
Another useful source for information on the use of colour in scientific visualisations comes from Robert Simmon, the man who created the "Blue Marble" image for NASA. See his series of posts at the Earth Observatory web site.
+1 to the only (!) answer out of nine that actually *shows colours* in response to the question about "best colours".
On colorbrewer2.org you can find qualitative, sequential and diverging colour schemes. Qualitative maximizes the difference between successive colours, and that's what I am using in gnuplot. The beauty of the site is that you can easily copy the hexadecimal codes of the colours so they are a breeze to import. As an example, I'm using the following 8-colour set:
#e41a1c #377eb8 #4daf4a #984ea3 #ff7f00 #ffff33 #a65628 #f781bf
It is rather pleasant and produces clear results.
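To get a rough idea of how these eight colours would separate on a black-and-white printout, one can compare their relative luminance. This is only a sketch: `relative_luminance` is a name I made up, the formula is the usual sRGB relative-luminance definition, and a real printer or journal may convert to grey differently.

```python
def relative_luminance(hex_color):
    """Relative luminance (0 = black, 1 = white) of an sRGB hex colour."""
    h = hex_color.lstrip("#")
    channels = [int(h[i:i + 2], 16) / 255 for i in (0, 2, 4)]
    # Undo the sRGB transfer curve, then weight by the Rec. 709 coefficients.
    r, g, b = [
        c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4
        for c in channels
    ]
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


set1 = ["#e41a1c", "#377eb8", "#4daf4a", "#984ea3",
        "#ff7f00", "#ffff33", "#a65628", "#f781bf"]

# Print the colours from darkest to lightest.
for color in sorted(set1, key=relative_luminance):
    print(f"{color}  {relative_luminance(color):.2f}")
```

Colours whose values come out close together will look alike in greyscale, and the yellow comes out much lighter than the rest, so it may be hard to see on a white background.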
As a side note, sequential is used when you need a smooth gradient and diverging when you need to highlight differences from a central value (e.g. mountain elevation and sea depth). You can read more about these color schemes here.
There are plenty of websites dedicated to choosing color palettes. I don't know that there is a particular set of colors that is objectively the best; you will have to choose based on your audience and the tone of your work.
Check out http://www.colourlovers.com/palettes or http://design-seeds.com/index.php/search to get started. Some of them have colors that are too close to show different groups, but others will give you complementary colors across a wider range.
You can also check out the non-default predefined colorsets in Matplotlib.
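A quick way to look at a few of those predefined sets from Python. The three names below are qualitative colormaps that ship with matplotlib; the choice of which ones to print is mine.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; remove to view interactively
import matplotlib.pyplot as plt
from matplotlib.colors import to_hex

# Print the hex codes of a few built-in qualitative colormaps.
for name in ["Dark2", "Set1", "tab10"]:
    cmap = plt.get_cmap(name)
    print(name, [to_hex(c) for c in cmap.colors])

# Keep one of them as a plain list of hex strings for later use.
dark2 = [to_hex(c) for c in plt.get_cmap("Dark2").colors]
```

A list like `dark2` can then be passed as `color=dark2[i]` in individual plot calls.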
I like the Dark2 palette from colorbrewer for scatter plots. We used this in the ggobi book, www.ggobi.org/book. But otherwise the color palettes are meant for geographic areas rather than data plots. Good color choice is still an issue for point-based plots.
The R packages:
- colorspace allows selection of colors around the wheel: you can spend hours/days fine tuning.
- dichromat helps check for colorblindness.
- ggplot2 generally has good defaults, although not necessarily color-blind proof.
The diverging red to blue scheme looks good on your computer but does not project well.
This is my favourite scheme. It has 20 (!!!!) distinct colours, all of which are easily distinguishable. It probably fails for colour blind people, though.
#e6194b #3cb44b #ffe119 #0082c8 #f58231 #911eb4 #46f0f0 #f032e6 #d2f53c #fabebe #008080 #e6beff #aa6e28 #fffac8 #800000 #aaffc3 #808000 #ffd8b1 #000080 #808080 #ffffff #000000
Another possibility would be to find a set of colors that are a) equidistant in LAB, b) take color blindness into consideration, and c) can fit into the gamut of the sRGB colorspace as well as the gamuts of the most common CMYK spaces.
I think the last requirement is a necessity for any method of picking colors- it doesn't do any good if the colors look good on the screen but are muddled when printed in a CMYK process. And since the OP specified "publication quality", I'm assuming that the graphs will indeed be printed in CMYK.
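For requirement (a), a first step is to measure how far apart candidate colours are in LAB. The sketch below converts sRGB hex codes to CIELAB (assuming a D65 white point) and uses the plain Euclidean distance, i.e. the simple CIE76 difference rather than one of the more refined formulas. The function names and the example palette are my own, and nothing here checks colour blindness or CMYK gamuts.

```python
from itertools import combinations
from math import dist

D65_WHITE = (0.95047, 1.0, 1.08883)  # assumed reference white


def hex_to_lab(hex_color):
    """Convert an sRGB hex colour to CIELAB (L*, a*, b*)."""
    h = hex_color.lstrip("#")
    rgb = [int(h[i:i + 2], 16) / 255 for i in (0, 2, 4)]
    # Undo the sRGB transfer curve to get linear RGB.
    r, g, b = [c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4
               for c in rgb]
    xyz = (0.4124564 * r + 0.3575761 * g + 0.1804375 * b,
           0.2126729 * r + 0.7151522 * g + 0.0721750 * b,
           0.0193339 * r + 0.1191920 * g + 0.9503041 * b)

    def f(t):
        d = 6 / 29
        return t ** (1 / 3) if t > d ** 3 else t / (3 * d * d) + 4 / 29

    fx, fy, fz = (f(v / n) for v, n in zip(xyz, D65_WHITE))
    return (116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz))


def closest_pair(palette):
    """Return the two colours with the smallest Euclidean distance in LAB."""
    return min(combinations(palette, 2),
               key=lambda pair: dist(hex_to_lab(pair[0]), hex_to_lab(pair[1])))


example = ["#e6194b", "#3cb44b", "#ffe119", "#0082c8", "#f58231"]
a, b = closest_pair(example)
print(a, b, round(dist(hex_to_lab(a), hex_to_lab(b)), 1))
```

The pair returned by `closest_pair` is the weakest link of a palette; a set that is close to equidistant would have all pairwise distances of similar size.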
For colorblind viewers, CARTOColors has a qualitative colorblind-friendly scheme called Safe that is based on Paul Tol's popular colour schemes. This palette consists of 12 easily distinguishable colours.
Another great qualitative colorblind friendly palette is the Okabe and Ito scheme proposed in their article “Color Universal Design (CUD): How to make figures and presentations that are friendly to colorblind people.”
### Example for R users

    # Install pacman if needed, then load the plotting packages.
    if (!require("pacman")) install.packages("pacman")
    pacman::p_load(ggplot2, rcartocolor, patchwork)

    theme_set(theme_classic(base_size = 14) +
                theme(panel.background = element_rect(fill = "#ecf0f1")))

    # Example data: 8 categories with 5 points each.
    set.seed(123)
    df <- data.frame(x = rep(1:5, 8),
                     value = sample(1:100, 40),
                     variable = rep(paste0("category", 1:8), each = 5))

    safe_pal <- carto_pal(12, "Safe")
    palette_OkabeIto_black <- c("#E69F00", "#56B4E9", "#009E73", "#F0E442",
                                "#0072B2", "#D55E00", "#CC79A7", "#000000")

    # plot
    p1 <- ggplot(data = df, aes(x = x, y = value)) +
      geom_line(aes(colour = variable), size = 1) +
      scale_color_manual(values = palette_OkabeIto_black)

    p2 <- ggplot(data = df, aes(x = x, y = value)) +
      geom_col(aes(fill = variable)) +
      scale_fill_manual(values = safe_pal)

    # Stack the two plots with patchwork.
    p1 / p2