A couple of days ago, Alex Velez from the Storytelling with Data team published an interesting blog post on starting bar charts at zero. In “my bars don’t start at zero,” Alex showed a specific case where the rule we have all come to embrace (that bar charts should start at zero and nothing else) may not apply.
Cole Nussbaumer Knaflic sent me a quick note to get my thoughts, so this post is a slightly edited version of that note (plus links). I haven’t thought all of these issues all the way through, but perhaps these thoughts will be useful to you. I have also posted this response in the Storytelling with Data Community, if you want to head over there to engage in that conversation.
I suggest you read Alex’s post before you read my reactions below because I don’t talk in detail about each of the points in that post. But just in case you don’t want to go over there, here’s the intro paragraph that explains the data Alex was working with:
The client conducted a controlled study to better understand how baristas preferred alternative milk in milk-based coffee drinks. Historically, this company had been hesitant to include non-dairy alternatives on their drinks menu. In response to several customer requests, they are now open to adding a single non-dairy alternative but they want it to be as comparable to traditional whole milk as possible, and well appreciated by their baristas. The study compared almond, soy, oat, and pistachio milks against standard whole milk (the control). Baristas ranked their preferences along a 9-point hedonic scale from dislike extremely to like extremely.
Here are my thoughts on each graph option in Alex’s post:
1. In some ways, there’s a basic flaw in the analysis, but it’s really rooted in how we all talk about this issue. We all say, “bar charts should start at zero,” when in fact we should be saying, “don’t truncate your bar charts.” There is a subtle difference here, and it’s why I think the first option (starting the chart at one rather than zero) is ultimately correct.
In this case, the value of zero is not in the “support” of these data, where the support is the set of values the variable can actually take. Here, zero is not an option in the data and thus it’s not meaningful. But, more importantly, the 1–9 scale is just a 0–8 scale shifted up one value. The scale in the original survey could just as easily have been 0–8 (as Alex notes in the post) instead of 1–9. This differs from other metrics like, say, the unemployment rate or GDP: neither is likely to be zero, but zero is conceptually possible for both. (As another possible save against the likely criticism here, I might just add a note to the chart for those dataviz nerds who are going to complain, explaining why the axis starts at one and not zero. Not sure, but worth considering.)
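To make the scale-shift point concrete, here is a quick sketch with made-up ratings (illustrative only, not the data from Alex’s post): recoding the same 1–9 responses on a 0–8 scale moves every mean down by exactly one point but leaves every between-group difference untouched, which is why the location of zero is arbitrary here.

```python
# Hypothetical barista ratings on the 1-9 hedonic scale (illustrative only,
# not the data from Alex's post).
whole_milk = [7, 8, 9, 8, 8]   # mean 8.0
oat_milk = [6, 7, 8, 7, 7]     # mean 7.0

def mean(xs):
    return sum(xs) / len(xs)

# Recode the very same responses on a 0-8 scale by subtracting 1 from each.
whole_milk_0 = [x - 1 for x in whole_milk]
oat_milk_0 = [x - 1 for x in oat_milk]

# Every mean shifts down by exactly one point...
assert mean(whole_milk) - mean(whole_milk_0) == 1.0
# ...but the gap between the groups is unchanged.
assert mean(whole_milk) - mean(oat_milk) == mean(whole_milk_0) - mean(oat_milk_0)
```

The same logic holds for any constant shift, which is exactly why zero has no privileged meaning on this kind of scale.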
It’s also worth noting the words used in the academic literature on this topic. The title of Jessica Witt’s paper is “Graph construction: an empirical investigation on setting the range of the y-axis” [she was also on my podcast last year], and Brenda Yang et al.’s paper is “Truncating Bar Graphs Persistently Misleads Viewers.” Neither defines the problem as specifically “starting with zero,” though that’s what we all say. So, it’s really a reframing of the problem that’s needed, I think.
2. The second option is misleading, as the post itself somewhat alludes to. Removing the vertical axis doesn’t solve the problem and is instead a way to lie with the data. So, I think it’s much more than just “a bit of a cheat,” as written in the post, and more dangerous. (I know Alex didn’t mean it this way, but I think it’s important to recognize that this approach is a way for others to lie with their data, and it is how people have created truncated bar graphs.)
3. I kind of like Option #3, though I might add the data values to the axis labels, just so the reader can see them. So, maybe something like “Dislike Extremely (1)”; “Neutral (5)”; and “Like Extremely (9)”.
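A minimal matplotlib sketch of that idea, using made-up group means (not Alex’s chart): the bars grow from one, the true floor of the scale, and the tick labels tie the scale’s wording to its numeric values.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

# Hypothetical mean ratings per milk (illustrative only).
milks = ["Whole", "Oat", "Soy", "Almond", "Pistachio"]
means = [7.4, 6.9, 6.1, 5.8, 4.3]

fig, ax = plt.subplots()
# Bars start at 1, the true floor of the scale, so nothing is truncated.
ax.barh(milks, [m - 1 for m in means], left=1)
ax.set_xlim(1, 9)

# Tie the scale's wording to its numeric values in the tick labels.
ax.set_xticks([1, 5, 9])
ax.set_xticklabels(["Dislike Extremely (1)", "Neutral (5)", "Like Extremely (9)"])
fig.savefig("hedonic_bars.png")
```

Note the `left=1` on `barh`: the bar lengths encode the distance above the scale’s floor, which is what keeps the axis-at-one choice honest.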
4. Option #4 would be one of my primary alternatives to the bar chart, but it also raises another important question: is adding some encoding for statistical significance or standard error important here? This is likely not a problem in many surveys, but depending on the data and who is looking, adding confidence bounds (or using color) may be important for highlighting the results that are statistically different from one another. If so, I like a dot plot to do that work for me, not only because Correll and Gleicher (2014) found error bars on bar charts to be problematic, but also because a dot plot with error bars looks cleaner.
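A dot plot with error bars can be sketched like this (hypothetical per-barista ratings, illustrative only; the ±1.96 standard errors are an approximate 95% confidence interval, not a claim about Alex’s data):

```python
import math
import statistics
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

# Hypothetical per-barista ratings on the 1-9 scale (illustrative only).
ratings = {
    "Whole": [8, 7, 9, 8, 8, 7, 8],
    "Oat":   [7, 8, 7, 6, 8, 7, 7],
    "Soy":   [6, 5, 7, 6, 6, 5, 7],
}

names = list(ratings)
means = [statistics.mean(v) for v in ratings.values()]
# Standard error of the mean; +/- 1.96 SEM approximates a 95% CI.
cis = [1.96 * statistics.stdev(v) / math.sqrt(len(v)) for v in ratings.values()]

fig, ax = plt.subplots()
ys = list(range(len(names)))
ax.errorbar(means, ys, xerr=cis, fmt="o", capsize=4)
ax.set_yticks(ys)
ax.set_yticklabels(names)
ax.set_xlim(1, 9)
ax.set_xlabel("Mean rating on the 9-point hedonic scale (95% CI)")
fig.savefig("hedonic_dotplot.png")
```

A dot with an interval reads as “an estimate with uncertainty,” whereas a bar with whiskers invites the within-the-bar bias Correll and Gleicher describe.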
I hope this little commentary on Alex’s post is useful to you. I appreciate that Alex brought this up and I, for one, am going to try to be more careful in my language from now on, shifting my focus from starting bar charts at zero to not truncating bar charts.