I recently published a paper on the share of people who receive benefits through the Social Security Disability Insurance (DI) program for reasons of mental disorders, such as mood disorders and schizophrenia. This group of DI beneficiaries now constitutes the largest group of recipients in the program, and they are an especially large group in the New England states.
In that paper, I showed the relationship between the DI recipiency rate—defined as the number of people who receive DI for reasons of mental disorders divided by the population age 18 to 65—and a variety of economic, demographic, health, and policy variables. For each variable, I visualize the relationship in a scatterplot and in each, I show the values for all states, highlight those in New England, the average across all states, and the “best fit” line. I also added some very simple annotations along the x- and y-axes to help the reader better and more quickly understand how the plots work.
I created these graphs in Excel (I also created some interactive versions using HighCharts, which you can see at the landing page for the original paper) and plotted everything with data. Especially because I was repeating this chart with different data series, I wanted to make it as easy as possible to update with new data and labels; therefore, I didn’t draw any lines or text boxes. The amount of manual work to update the chart with new data is relatively minor.
In this post, I’m going to walk you through the steps of how I made these graphs, and in doing so, I have three goals in mind:
- First, to remind you that annotation is really really important in data visualization and to encourage you to annotate your visualizations when you can. You can start by creating active titles–for example, instead of “Figure 1. The DI Recipiency Rate and Median Household Income” you can say something like “Figure 1. The DI Recipiency Rate and Median Household Income are Negatively Correlated.” Annotation can help educate the reader about both how to read the graph and the content you are trying to deliver. As Amanda Cox from the New York Times once famously said, “The annotation layer is the most important thing we do…otherwise it’s a case of here it is, you go figure it out.”
- Second, I encourage you to use data to add annotations and marks to your graphs (especially in a tool like Excel). The more you can do that, the easier it will be for you to update your data and move your graphs to other programs such as PowerPoint. When you draw a vertical line on your chart, it rarely goes exactly where you want it, it doesn’t move when you update the data, and it can be hard to move the line and the graph to PowerPoint. Adding annotations and marks with lines and text boxes make it harder to update your charts and to move your visualizations between tools, such to PowerPoint
- Finally, you don’t need a complex tool, method, programming language, or Adobe Illustrator to add annotation to your graphs. Everything I’m doing here is in Excel using chart elements, so even if you want to ignore my second goal and do everything with text boxes, you can still use something like Excel to do so.
Setting up the Data
Let’s start with the data. I have the median household incomes (column C) and the recipiency rate (column D) for All States in the US, the District of Columbia, and the US average. In column A, I’ve manually created a dummy variable (equal to 1) for the US Average and the 6 New England states.
In columns E and F, I create the x- and y-series for the Non-New England states. I don’t cut and paste here, but use this simple formula: =IF(A3=””,C3,NA()). This IF statement says that if the cell in column A is missing—which it is for all non-New England states—put the value in the cell; if it is not missing (i.e., equal to 1) put NA() in the cell. That NA() will come in handy when we make the graph because Excel will ignore the resulting “#N/A” when we plot it.
In columns G and H, I repeat the exercise, and create the series for the New England states. Here, the IF formula changes only slightly to =IF(A3=1,C3,NA()).
Over to the side, I have the values for the US Average split off from the rest. I use a simple VLOOKUP formula here [=VLOOKUP(J2,B3:D54,2,0)] in cells K2 and L2 to obtain the values for the US average.
I now plot these as four different scatterplots—columns C and D; columns E and F; columns G and H, and points K2 and L2. For each series, I do some slightly different formatting. The All States series is used to insert the “best fit” line using Excel’s “trendline” option (it’s in the Add Chart Element menu in Excel 2016). I insert the trendline and then use the “Forecast” options to extend the line all the way to the left and right edges of the plot space. I then select this All States series and turn the Markers off so you don’t see them on the graph.
For the Non-New England series in columns E and F, I change the marker color to a light gray. For the New England series in columns G and H, I change the marker color to a light blue. (You could, of course, do all of this as one series, but you would then need to select the 6 New England states one at a time and change the colors that way.) I then add data labels to the New England points and change the text color to the same blue.
For the US Average series, I select and change the marker color to black and add the data label. I then add a horizontal and vertical line to extend from the point across the graph using Excel’s error bars options. For both lines, I use the “Custom” option in the Error Amount menu to extend the lines across the plot space. For the horizontal line, I set the “Percentage” option to 100%. The same approach doesn’t work for the vertical line, so I simply set the “Fixed value” option to 4 (because the vertical axis tops out at 4).
To insert the text on the New England states, I could have created four different scatterplot series and named each, but instead I label the all data points in that series and manually retype the labels (Excel 2016 on the PC would allow me to choose custom labels). This ends up not being that big of a deal because the y-axis labels don’t change for this project.
Add the Annotation
I now add text in four places that denote which direction the x- and y-axes go. I really like this kind of annotation because it demonstrates a careful thinking about the reader who may not be familiar with this chart type and it explicitly helps guide the reader, like this one from the Washington Post.
I don’t use text boxes to add this annotation; I use data. I create four additional scatterplot series and label them in column J so it’s easier to adjust later. The labels along the horizontal axis (“RightXLabel” and “LeftXLabel”) have the same y-values (here, 0.15) and different x-values that I can adjust (cells K3 and K4) to get the arrow right where I want it. Analogously, the labels for the vertical axis have the same x values (31,000) and I can adjust the y-values (cells L5 and L6) up and down along that axis.
When I duplicate this worksheet and drop in new data, I can adjust these values to get the annotations to sit where I want. This method also has the advantage of having the labels line up exactly, which would be really hard to do with text boxes and, especially, when using new or different data for other charts.
I end up with 5 data series overall–three series that you actually see on the graph (“Non-New England”, “New England”, and “US Average”), one series that is inserted to create the trendline (“All Series”), and another for the annotation labels (“Scatters”).
So that’s it. A bunch of scatterplot series for the actual data and then again to add the annotations. Here’s the Excel file you can use to play around with.
I urge you to try to use this kind of approach in your own work. Use data to encode text and annotation. It will take you a bit longer up front, but it will make things easier to replicate and update down the road.
If you want to learn more about creating advanced data visualizations in Excel, please check out my step-by-step guide in Excel 2010. If you want to read the full paper about people with mental disorders on DI in New England states, please check it out at the Urban Institute.