In preparing last week’s One Chart at a Time video on Sankey diagrams, presented by Amber Thomas at The Pudding, I started thinking more about this particular chart type. I think the Sankey Diagram—named after Matthew Henry Phineas Riall Sankey—is one of the more underused data visualizations. That’s because how and when to use it can be confusing.
First, when to use it. The Sankey diagram is especially useful for comparing categories to one another and showing how they flow into other states or categories. In her video, Amber showed examples from The Pudding’s lovely interactive story, The Gyllenhaal Experiment, in which twenty-two thousand users try to spell the complex names of certain celebrities. The resulting Sankey diagrams, like this one for Zooey Deschanel, show the paths of the spelling attempts.
One of my favorite Sankey diagrams, which also appears in my new book, Better Data Visualizations, is this one from Tim Bennett, posted on Reddit. This Sankey diagram shows how fifty-two students tried to spell the word camouflage. The first blue segment shows that all fifty-two students started with the letter “C,” fifty then went to “Cam,” followed by thirty-seven to “Camof,” and so on. Only ten students spelled the word correctly, shown in the orange segment near the top of the graph.
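To make the structure of such a chart concrete, here is a minimal sketch of how spelling attempts could be turned into the source, target, and value triples that a Sankey diagram draws. The attempts listed are made up for illustration and are not the data behind the Reddit chart, and the function name prefix_links is my own. The sketch also adds one node per letter, whereas the chart above groups several letters into a single step.

```python
from collections import Counter


def prefix_links(attempts):
    """Count attempts moving from each prefix to the next-longer prefix."""
    links = Counter()
    for word in attempts:
        for i in range(1, len(word)):
            links[(word[:i], word[:i + 1])] += 1
    return links


# Hypothetical attempts for illustration only; not the data behind the chart.
attempts = ["camouflage", "camouflage", "camoflage", "camoflauge", "comoflage"]
links = prefix_links(attempts)

# Each (source, target, value) triple is one ribbon in the diagram.
for (source, target), value in sorted(links.items()):
    print(f"{source} -> {target}: {value}")
```

Every attempt is counted on both sides of each ribbon it passes through, which is what makes the width of a ribbon meaningful: the same students are on the left and on the right.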
Unfortunately, such data—in which one category breaks down into another category and another and another—is often presented through pie charts. This visualization from the Bureau of Economic Analysis is probably the unfortunate outlier, but you get the idea—the slice of one pie chart expands to another and then to another.
Sankey Diagram Challenges
There are two primary challenges with creating Sankey diagrams. First, too many categories can make them difficult to read, as I think we can all agree is the case in this one from FiveThirtyEight.
Second, they are also not suitable for every type of comparison we would like to make. In particular, the Sankey is probably not best suited to facilitate comparisons between different metrics.
To demonstrate, let’s look at this set of small multiple parallel coordinate plots from a 2014 Bloomberg News story. For each of seven cities, they show the racial composition of police departments relative to the populations they serve. I call these parallel coordinate plots instead of slope charts because slope charts show changes over time.
This visualization is especially great for teaching because the data are manageable (only twenty-eight data points) and there are a variety of visualization options. That doesn’t mean the Bloomberg graph is perfect—for example, I might add some more space between the panels and add the word “people” next to each of the labels for racial groups—but I think it’s a great example of how to use multiple parallel coordinate plots together.
If we plotted these data (let’s just focus on Philadelphia here) using a two-group (also called a two-node) Sankey diagram, it would appear that the values are somehow changing from the values on the left to the values on the right (the curves also reinforce this idea of change). In other words, it looks as if the share of Black people in Philadelphia starts at 43 percent and then gets smaller and smaller until it reaches 33 percent. But the population share doesn’t convert to the police force share; they are completely different metrics.
A standard stacked area chart gives this same wrong impression, but without the curvature in the segments, which, at least in this case, are purely ornamental and do not impart any data or statistical meaning.
I think the curves imply change or transition from one value to the next because the widths vary, which implies different values between the two ends of the graph. In the parallel coordinates plot, by comparison, the lines are straight (and typically shorter), so we don’t instinctively assume the lines represent values that are changing over time.
It’s not that Sankey diagrams should only be used to show changes over time (see the graphs above), but it is worth noting that, for example, the data input tool in Flourish, a data visualization platform, prompts users to show changes between two years. Sankeys are best used to compare data within the same metric.
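One way to see why a two-node Sankey misleads for the Philadelphia comparison is to ask what flows its ribbons would represent. The sketch below is an illustration under stated assumptions: it takes the 43 and 33 percent figures quoted above, lumps every other group into a hypothetical “Other” category (100 minus the quoted share), and invents two different flow tables. The function name totals and both tables are mine, not part of the Bloomberg data.

```python
def totals(links):
    """Sum link values by source node (left side) and target node (right side)."""
    left, right = {}, {}
    for (source, target), value in links.items():
        left[source] = left.get(source, 0) + value
        right[target] = right.get(target, 0) + value
    return left, right


# Two invented flow tables (hypothetical, for illustration only).
# Left side: share of city population. Right side: share of police force.
flows_a = {("Black", "Black"): 33, ("Black", "Other"): 10, ("Other", "Other"): 57}
flows_b = {("Black", "Other"): 43, ("Other", "Black"): 33, ("Other", "Other"): 24}

# Both tables reproduce the same end totals even though the flows differ.
print(totals(flows_a))
print(totals(flows_b))
print(totals(flows_a) == totals(flows_b))
```

Two very different sets of ribbons produce identical bars on each side, so the end totals alone say nothing about the ribbons. With residents on one side and officers on the other, there are no real flows to recover in the first place, which is why a chart that draws them suggests a conversion that does not exist.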
Alternative Graph Possibilities
While I like Bloomberg’s parallel coordinates approach, we could also try the more standard paired bar chart or even a stacked bar chart. Both are completely reasonable alternatives, but if you think about including seven of them, the visual would get busier than it already is.
To demonstrate yet more options, we could use a dot plot. But here the relationship for white city residents, where the share of the police force is greater than the share of the city population, isn’t as clear as in some of the previous graphs.
Wrap Up
I’m a fan of Sankey diagrams. They’re a useful way to show breakdowns between groups. But creators need to be careful not to plot too many series or the wrong kinds of data. They give a false impression when we are trying to compare two different metrics. Even if those metrics are the same measure, say percentages, the format of the Sankey imparts a suggestion of change rather than comparison.
There are alternatives to the Sankey, including bar charts, parallel coordinate plots, and dot plots, all of which can also visualize these kinds of data. The bars, lines, and dots are better at facilitating these comparisons without implying a transformation from one to the other. None of these views are right or wrong, but they may serve different audiences differently, highlight different patterns, and answer different questions.
The author wishes to thank RJ Andrews, Alice Feng, and Cole Nussbaumer Knaflic for their comments and suggestions.
UPDATE: I originally wrote this post in response to a recent blog post by Stephanie Evergreen [there was a follow-up by Ken Flerlage a couple of days ago], who created what she called a “Proportion Plot.” I originally read that post as an alternative to a two-node Sankey diagram and believe—as I’ve written above—that it’s not the way I would visualize this kind of data. I believe it implies changes over time and a conversion from one side to the other. I was trying to avoid being confrontational here by not referencing her original post and critiquing her directly, but in hindsight, that was misguided, as the chart I created is too similar to Stephanie’s. I apologize. You can read her original post here and download the Excel file template.
Wow that pie chart combo! Unfortunately from experience I’d expect most senior people to think it was the best chart here.
Jon, that doesn’t look like an apology, looks more like a passive aggressive dig.
The chart you created with Bloomberg data (and use as this post’s featured image) appears to be built directly from Stephanie’s template. Is that not the case?
Are you trying to say that you are using her template to critique her work, but not wanting to make a fuss so just not crediting her work? Seems pretty far fetched.