A question that often comes up in data visualization is what to do about outliers. How should we handle values that are much larger (or smaller) than the rest of the values in our data? In this post, I’m going to share a technique I’ve seen used recently that I think is a good way to emphasize outliers and make them clear for your reader.
When I’m asked this question in workshops and classes, my usual response is to try two graphs, a sort of “zoom in, zoom out” approach. As an example, take this bar chart of the populations of 10 countries in Asia (data from the World Bank). There are 1.4 billion people in India, about six times as many as in Pakistan, the next country in the list. India is a clear outlier and, in this simple example, it makes more detailed comparisons across the other nine countries more difficult.
Comparisons across the other values are harder because the outlier stretches the horizontal axis, which squeezes every other bar into a small part of the plot. Two graphs are one possible solution: the one above that shows all 10 countries and another that shows all but India. You could set them side by side, or inset one inside the other.
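To see how much room the “zoom in” graph buys you, here is a minimal sketch that compares each bar’s length as a share of the longest bar, with and without the outlier. The function name `bar_fractions` is hypothetical, and the population numbers are rough placeholders (only India’s 1.4 billion and the roughly six-to-one ratio with Pakistan come from the text above); they are not the World Bank figures.

```python
def bar_fractions(values, exclude=()):
    """Return each bar's length as a fraction of the longest bar shown."""
    shown = {name: v for name, v in values.items() if name not in exclude}
    longest = max(shown.values())
    return {name: v / longest for name, v in shown.items()}


# Placeholder populations in millions, NOT the actual World Bank data.
populations = {"India": 1400, "Pakistan": 233, "Country C": 170, "Country D": 120}

zoom_out = bar_fractions(populations)                     # all countries
zoom_in = bar_fractions(populations, exclude={"India"})   # outlier dropped

for name in zoom_in:
    print(f"{name}: {zoom_out[name]:.0%} of the axis with India, "
          f"{zoom_in[name]:.0%} without")
```

With India in the chart, every other bar is confined to a small slice of the axis; without it, the same bars spread across the full width, which is what makes the second graph easier to read.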
Other standard approaches, always worth considering, are adding text or using different colors to make outliers visible to the reader. The top graph here from CNN does a nice job of adding a label to the July 4th spike in US wildfires, and the graph below from Chris Ingraham uses a purple color for Minnesota to help it stand out from the rest of the country.
I came across this bar chart of incarceration rates on the Prison Policy website. The incarceration rates for the United States as a whole and the state of Iowa are clear outliers relative to the other 11 countries included in the graph. Instead of using two graphs, they extended the graph outside the frame.
Now, usually, I prefer not to include this kind of unnecessary frame around my graphs (Excel, for example, includes a border around the graph, which I usually delete). But here, the frame becomes a useful part of the graph because it can be “broken” to let the outlier values extend beyond it.
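If you want to try this yourself, the main decision is where the frame should end. Here is a minimal sketch of one way to choose it: set the axis limit from the ordinary values only, so that the outlier bars are the ones that run past the frame. The function name `frame_limit`, the use of Tukey’s 1.5 × IQR rule to flag large outliers, and the 10 percent padding are all my assumptions, not how the Prison Policy graph was built, and the rates are invented placeholders.

```python
from statistics import quantiles


def frame_limit(values, k=1.5, pad=1.1):
    """Pick an axis limit from the non-outlier values only.

    Returns (limit, outliers). Bars for the outliers are longer than
    `limit`, so they are the ones that would break the frame.
    Only unusually large values are flagged, not unusually small ones.
    """
    # Assumption: Tukey's rule (Q3 + k * IQR) decides what counts as an outlier.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    fence = q3 + k * (q3 - q1)
    inside = [v for v in values if v <= fence]
    outliers = [v for v in values if v > fence]
    # Leave a little padding past the longest ordinary bar.
    return max(inside) * pad, outliers


# Placeholder rates, invented for illustration; not the Prison Policy data.
rates = [90, 100, 110, 120, 130, 700]
limit, outliers = frame_limit(rates)
print(f"Draw the frame out to about {limit:.0f}; "
      f"these bars run past it: {outliers}")
```

In a plotting tool, you would then set the axis maximum to that limit and let the outlier bars draw past the plot area rather than being cut off. In matplotlib, for example, drawn elements accept `clip_on=False` for this purpose, though you will likely need to tweak the margins so the long bars still fit on the page.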
Speaking of breaking, one approach I really recommend you avoid is the “breaking the bar” technique. With this technique, you add a symbol to the bar to denote that it is “broken” and actually extends further than is shown in the graph. Here’s an example of applying this technique to the incarceration rates graph just shown.
This approach distorts the data, and both where you cut the bars and how far you extend the horizontal axis are arbitrary decisions. Such decisions should really be avoided in data visualization.
Joshua Stevens, who leads the data visualization and cartography efforts at NASA, also used the “break-the-frame” approach in this tongue-in-cheek tweet:
And Auke Hoekstra, Program Director at Neon Research, a multidisciplinary research program focusing on climate issues and based in the Netherlands, also used the “break-the-frame” approach in his graph that shows how the International Energy Agency has consistently underestimated the growth in gigawatt (GW) production from solar panels. There’s no outline or frame in this graph, but the original has a gray background, and his drawn line extending outside that space has the same effect.
I’m adding this “break-the-frame” approach to my data visualization toolbox as an effective way to show (and emphasize) outliers in my data. In cases where I still want to make it easier for my reader to see variation in other values, I might use a second graph, but I think this technique is a great way to emphasize these large values. It’s certainly better than the “break-the-bar” approach, which distorts and misrepresents the data.