The Sankey Diagram

In preparing last week’s One Chart at a Time video on Sankey diagrams, presented by Amber Thomas at The Pudding, I started thinking more about this particular chart type. I think the Sankey Diagram—named after Matthew Henry Phineas Riall Sankey—is one of the more underused data visualizations. That’s because how and when to use it can be confusing.

First, when to use it. The Sankey diagram is especially useful for comparing categories to one another and showing how they flow into other states or categories. In her video, Amber showed examples from The Pudding’s lovely interactive story, The Gyllenhaal Experiment, in which twenty-two thousand users try to spell complex names of certain celebrities. The resulting Sankey diagrams, like this one for Zooey Deschanel, show the paths of the spelling attempts.

Sankey Diagram showing how 22,000 people tried to spell Zooey Deschanel's last name
Source: The Pudding

One of my favorite Sankey diagrams, which also appears in my new book, Better Data Visualizations, is this one from Tim Bennett posted on Reddit. It shows how fifty-two students tried to spell the word camouflage. The first blue segment shows that all fifty-two students started with the letter “C,” fifty then went to “Cam,” followed by thirty-seven to “Camof,” and so on. Only ten students spelled the word correctly, shown in the orange segment near the top of the graph.

Sankey Diagram from Reddit that shows how 52 students tried to spell the word camouflage.
Source: Tim Bennett
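
For readers who want to try this with their own data, here is a minimal Python sketch of how a list of spelling attempts could be turned into the flows a Sankey diagram draws. It makes two assumptions that are mine, not Tim Bennett's: the attempts below are invented for illustration (they are not the Reddit data), and it splits each word one letter at a time rather than into the larger chunks shown in the graph. The function name prefix_links is hypothetical.

```python
from collections import Counter


def prefix_links(attempts):
    """Count how many attempts move from each prefix to the next longer one."""
    links = Counter()
    for word in attempts:
        word = word.lower()
        # Each step from a prefix to the prefix one letter longer is one flow.
        for i in range(1, len(word)):
            links[(word[:i], word[:i + 1])] += 1
    return links


# Hypothetical attempts, made up for illustration only.
attempts = ["camouflage", "camoflage", "camoflage", "camaflage"]
links = prefix_links(attempts)

# Each (source, target) pair with its count is one band in the diagram.
for (source, target), count in sorted(links.items()):
    print(f"{source} -> {target}: {count}")
```

Every pair of prefixes becomes one band, and the count sets its width. Wherever the attempts diverge (here, after “camo”), the band splits.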

Unfortunately, such data—in which one category breaks down into another category and another and another—is often presented through pie charts. This visualization from the Bureau of Economic Analysis is probably the unfortunate outlier, but you get the idea—the slice of one pie chart expands to another and then to another. 

Three, 3D exploding pie charts
Source: BEA

Sankey Diagram Challenges

There are two primary challenges with creating Sankey diagrams. First, too many categories can make them difficult to read, as I think we can all agree is the case in this one from FiveThirtyEight.

Dense Sankey diagram from 538 showing voters in the Democratic primary.
Source: FiveThirtyEight

Second, they are also not suitable for every type of comparison we would like to make. In particular, the Sankey is probably not best suited to facilitate comparisons between different metrics.

To demonstrate, let’s look at this set of small multiple parallel coordinate plots from a 2014 Bloomberg News story. For each of seven cities, they show the racial composition of police departments relative to the populations they serve. I call these parallel coordinate plots instead of slope charts because slope charts show changes over time.

This visualization is especially great for teaching because the data are manageable (only twenty-eight data points) and there are a variety of visualization options. That doesn’t mean the Bloomberg graph is perfect—for example, I might add some more space between the panels and add the word “people” next to each of the labels for racial groups—but I think it’s a great example of how to use multiple parallel coordinate plots together.

Slope chart from Bloomberg showing relationship between a city's civilian population and police force for 7 cities.
Source: Bloomberg

If we plotted these data (let’s just focus on Philadelphia here) using a two-group (also called a two-node) Sankey diagram, it would appear that the values are somehow changing from the values on the left to the values on the right (the curves also reinforce this idea of change). In other words, it looks as though the share of Black people in Philadelphia starts at 43 percent and then gets smaller and smaller until it reaches 33 percent. But the first share doesn’t convert to the second; they are completely different metrics.

Two-node Sankey alternative to the Bloomberg slope chart.

A standard stacked area chart gives this same wrong impression, but without the curved segments, which, at least in this case, are purely ornamental and do not impart any data or statistical meaning.

Area chart alternative to the Bloomberg slope chart.

I think the curves imply change or transition from one value to the next because the widths vary, which implies different values between the two ends of the graph. In the slope chart, by comparison, the lines are straight (and typically shorter), so we don’t instinctually assume the lines represent values that are changing over time.

It’s not that Sankey diagrams should only be used to show changes over time (see the graphs above), but it is worth noting that the data input screen in Flourish, for example, prompts users to show changes between two years. Sankeys are best used to compare data within the same metric.

Image from the Flourish tool to create Sankey Diagrams.
Source: Flourish
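
To make the “same metric” point a bit more concrete, here is a rough Python sketch of the kind of link table many Sankey tools expect: one row per flow, with a source, a target, and a value. The renter and owner numbers are invented purely for illustration, and node_totals is a hypothetical helper, not part of any tool. The idea is that in true flow data, every unit that leaves the left side arrives somewhere on the right, so the two totals match.

```python
from collections import defaultdict

# Made-up flows between two years, all measured in the same unit (people).
links = [
    ("Renter 2019", "Renter 2020", 60),
    ("Renter 2019", "Owner 2020", 10),
    ("Owner 2019", "Owner 2020", 25),
    ("Owner 2019", "Renter 2020", 5),
]


def node_totals(links):
    """Sum the value leaving each source node and arriving at each target node."""
    out_totals = defaultdict(int)
    in_totals = defaultdict(int)
    for source, target, value in links:
        out_totals[source] += value
        in_totals[target] += value
    return dict(out_totals), dict(in_totals)


out_totals, in_totals = node_totals(links)
print(out_totals)
print(in_totals)
# In flow data, what leaves the left side equals what arrives on the right.
print(sum(out_totals.values()) == sum(in_totals.values()))
```

Two separate metrics, such as a city's population shares and its police force shares, have no rows like these to fill in. Nothing moves from one side to the other, which is a hint that a Sankey may not be the right choice.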

Alternative Graph Possibilities

While I like Bloomberg’s parallel coordinates approach, we could also try the more standard paired bar chart or even a stacked bar chart. Both are completely reasonable alternatives, but if you think about including seven of them, the visual would get busier than it already is.

Paired bar chart alternative to the original Bloomberg slope chart.
Stacked bar chart alternative to the original Bloomberg slope chart.

To demonstrate yet more options, we could use a dot plot. But here the relationship for white city residents (where the share of the police force is greater than the share of the city population) is not as clear as in some of the previous graphs.

Dot plot alternative to the original Bloomberg slope chart.

Wrap Up

I’m a fan of Sankey diagrams. They’re a useful way to show breakdowns between groups. But creators need to be careful not to plot too many series or the wrong kinds of data. They give a false impression when we are trying to compare two different metrics. Even if those metrics are the same measure, say percentages, the format of the Sankey imparts a suggestion of change rather than comparison.

There are alternatives to the Sankey, including bar charts, parallel coordinate plots, and dot plots, all of which can also visualize these kinds of data. The bars, lines, and dots are better at facilitating these comparisons without implying a transformation from one to the other. None of these views are right or wrong, but they may serve different audiences differently, highlight different patterns, and answer different questions.

The author wishes to thank RJ Andrews, Alice Feng, and Cole Nussbaumer Knaflic for their comments and suggestions.

UPDATE: I originally wrote this post in response to a recent blog post by Stephanie Evergreen [there was a follow-up by Ken Flerlage a couple of days ago], who created what she called a “Proportion Plot.” I originally read that post as an alternative to a two-node Sankey diagram and believe—as I’ve written above—that it’s not the way I would visualize this kind of data. I believe it implies changes over time and a conversion from one side to the other. I was trying to avoid being confrontational here by not referencing her original post and critiquing her directly, but in hindsight, that was misguided, as the chart I created is too similar to Stephanie’s. I apologize. You can read her original post here and download the Excel file template.