Claus O. Wilke is a computational and evolutionary biologist and chair of the Department of Integrative Biology at the University of Texas at Austin, where he is the Dwight W. and Blanche Faye Reeder Centennial Fellow in Systematic and Evolutionary Biology. Wilke obtained a Ph.D. in theoretical physics at the Ruhr University Bochum in 1999, and subsequently worked as a postdoctoral research fellow at the California Institute of Technology. He moved to UT Austin as an Assistant Professor in 2006, where he is now professor, department chair, and director of the Wilke Lab. Wilke studies the evolution of molecules and viruses using theoretical and computational methods. He is also the author of the cowplot and ggridges R plotting packages.
In this week’s episode of the show, Claus and I talk about his new book, Fundamentals of Data Visualization: A Primer on Making Informative and Compelling Figures. We also talk about his favorite data visualization tools, standard vs. non-standard graph types, and domain-specific graphs (i.e., biology). Enjoy the show!
Jon Schwabish: Hi everyone. Welcome back to the PolicyViz podcast. I am your host, Jon Schwabish. On this week’s episode, I get to sit down and chat with Claus Wilke, whose new book Fundamentals of Data Visualization has just been published by O’Reilly. Uh, it’s a, it’s a great review of the fundamentals of data vis, um, mostly built in R, um, it has some great images in it, really well put together. Um, I’m really enjoying, uh, making my way through it. Of course, I’ve been, uh, looking at the online version for a while. So one of the new things that, uh, I think we’re all seeing more and more is people making their books more open source and putting out review copies before they actually, uh, get onto the bookshelf. So it’s been fun seeing Claus, uh, develop this book over time. It’s also fun to chat with him about his background in biology and the data visualizations that he’s been working with and making and reading, uh, in that, uh, area. It’s always interesting for me to see how people approach data visualization and working with data and communicating data in different kinds of fields. So I hope you’ll enjoy this week’s episode. Here’s my interview with Claus. So Claus, welcome to the show. Thanks for coming on. I appreciate it.
Claus Wilke: Yeah. Thanks for having me.
JS: Congratulations on the new book. You must be relieved.
CW: I am. I am very relieved. It’s out.
JS: Taking, are you taking a long break now?
CW: Uh, no. I mean, it’s just so, so much work got like put on the back burner while I was writing the book that now I just have to do all that work.
JS: No rest, right? No rest.
JS: Do you want to talk a little bit about yourself and your background and, and give folks a sense of how you got interested in, in data visualization? Ultimately, how you came to, to write this book?
CW: Yeah, sure. So I’m a professor of integrative biology at the University of Texas at Austin. I consider myself a computational biologist. I do a lot of data science, broadly speaking. I’m also quite involved lately in the R community. Originally I’m actually a theoretical physicist. So I did my PhD in theoretical physics and then I transitioned over to biology. And so I have some amount of physics background. I have a good understanding of biology, and I also have a fair amount of computing and data science background, and I’ve really always been interested in visualization, just like something I like. I like to make things look nice and to look at visualizations and so, so something I’ve, I’ve cared for a lot over the years and uh, if you talk to my students, they’ll, they’ll tell you how picky I am when they show the figures. I’m like, ah, the font is too small and here there’s two lines that are not quite in alignment. So somehow I also just pick up on subtle visual cues that other people might not be so sensitive to.
JS: Mm-mm. Are you teaching data visualization classes in addition to the biology classes?
CW: I’m not, I might at some point in the future, but I’m currently not. The main class that I’m teaching is a data science for biologists class. It’s kind of, um, meant as the next class that they take after they’ve taken biostats and it just moves them a bit more towards doing practical data science. And as part of that they learn how to make graphs and to interpret them. And then we also do some machine learning and some bioinformatics and things like that. But there isn’t a dedicated visualization class that I’m teaching.
JS: Uh, we’ll talk about the book but, but it’s uh, it’s just interesting to me to think about how you teach biologists these other skill sets. So do you think having a separate data vis class is necessary for those students, or is having it combined with these other data science skills, the way you’re teaching it now, the right way to do it? Or would you think having them separated out is a better way to do it?
CW: So I think what I’m teaching now is an essential class that really almost everybody should take. Everybody should have some basic knowledge in data science, data wrangling. I teach them R, tidyverse and some Python and we just like learn how to take a large data set and get patterns out and so on. And that I think everybody should be familiar with. And that’s actually taught to undergrads. And the idea is to get them to learn this relatively quickly in the curriculum. I think a data vis class would be more an advanced class, maybe primarily for graduate students or really dedicated undergrads. So I think it would be much more of a specialized class versus general data science. I feel certainly in the natural sciences, every student in the natural sciences should have some basic data science skills.
JS: Hmm-hmm. So you mentioned that you’re teaching R and Python.
JS: And I think the book, you did the graphs in, in R as well, right?
CW: That’s correct. Yeah.
JS: And so is R your go to, like what are your favorite data vis tools?
CW: So yeah, for data visualization I exclusively use R. Actually, the older I get, the more I use R and the less I use Python. So it’s kind of an interesting transition that in the past I did a lot of Python and now I am in my own work almost exclusively R. They just have slightly different application areas. I feel Python is a good general purpose programming language. If you just want to write a game or you want to build an interactive webpage where people like enter information and get stuff back or so, Python is great for that. But for pure data science work, I just feel that R in many ways is more convenient.
JS: Mm-mm. Do you find that your students struggle with learning R, or is it, I mean they’re undergraduate, so are they coming to it? This is the first language that they’re learning, so it’s just the uphill climb. They’re not trying to think around other languages that they may have already learned.
CW: It’s very mixed. So actually the background in the class is incredibly mixed, like some people have done tons of Python and they have never touched R. Others actually know already some R. We’ve, our biostatistics class also uses R, so they have used it a little bit. I personally feel that in particular with the tidyverse we can do a lot of interesting stuff without really having to think about programming. Like we get halfway through the class before we ever do a loop or ever do an if statement because we’ve, we just do things like, I don’t know how familiar you are with the tidyverse, but like in particular dplyr, you use like a filter to pick rows and select to pick columns and then you group and you summarize. And so you can do a lot without ever really thinking about what I call the logistics of the data. Now if you have a for loop with an index variable and do like first number in your vector, the second number in the vector, then you think about the logistics, right?
CW: Because if you write a statement, give me all the numbers that are bigger than 10, then you only think about the logic. And so I found that in teaching, starting with the tidyverse, I can really hone in on the logic without getting too bogged down by the logistics and that actually almost works better if people have absolutely no background whatsoever. Like if they come already with some preconceived notion of what programming is and how you do data analysis, they find it very difficult to switch that off and to program without using loops, for example. But if they’ve never programmed before, they don’t miss the loops.
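A minimal sketch of the "logistics versus logic" distinction described here, written in plain Python rather than R and dplyr. The data, column names, and the threshold of 10 are hypothetical and chosen only for illustration; this is not code from the book or the class.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical numbers, for illustration only.
values = [4, 12, 7, 25, 10, 31]

# "Logistics" style: an index variable walks the list position by position.
big_loop = []
for i in range(len(values)):
    if values[i] > 10:
        big_loop.append(values[i])

# "Logic" style: state the condition ("bigger than 10") and nothing else.
big_logic = [v for v in values if v > 10]

# Hypothetical table of measurements (one dict per row).
rows = [
    {"species": "A", "site": "north", "mass": 12.0},
    {"species": "A", "site": "south", "mass": 8.0},
    {"species": "B", "site": "north", "mass": 20.0},
    {"species": "B", "site": "south", "mass": 30.0},
    {"species": "A", "site": "north", "mass": 14.0},
]

# Filter: pick the rows that satisfy a condition.
heavy = [r for r in rows if r["mass"] > 10]

# Select: pick only the columns of interest.
picked = [(r["species"], r["mass"]) for r in heavy]

# Group: collect the masses for each species.
groups = defaultdict(list)
for species, mass in picked:
    groups[species].append(mass)

# Summarize: one number per group.
mean_mass = {species: mean(masses) for species, masses in groups.items()}

print(big_logic)
print(mean_mass)
```

Both versions of the filter produce the same list; the second one simply never mentions positions in the list, which is the point being made about teaching the logic first.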
JS: Right. You’re getting a first shot at their experience with programming, right?
JS: Can you talk a little bit about the book itself? Now the print version is just coming out, so people get their hands on that. But you’ve had the, the digital version has been online for awhile. Collecting comments, I would guess you’ve collected a lot of comments. Um, but do you want to walk listeners through the, the goal of the book and, and what you hope they’ll get out of it?
CW: Yeah. So the, the book really had its origin in me giving the same type of advice over and over to my mostly graduate students in the lab. So I would just find that they would show me their figures and I had the same comments over and over. Maybe like one biggest one, that was the first chapter that I wrote, was like everybody makes the axis labels too small. Universal truth of data vis: the axis labels are too small, and actually almost every visualization software that [indiscernible] [00:08:44] makes the labels too small. Yeah. So that’s just like, I, I feel like I’m repeating myself and repeating myself and repeating myself. And so then at some point, actually I thought about writing such a book for a long time and I didn’t really feel that I had the technology in place to make it sufficiently convenient that I was willing to do it. And uh, R has developed to the point, I mean, the entire book is written in R Markdown, right. All the figures are automatically generated. I can just press a button and the entire book gets rendered just as a [indiscernible] [00:09:18] up on the webpage. And that technology and, and that convenience has really been only around for a couple of years. Like if I had tried to write this book 10 years ago, I would have written it in LaTeX and I would have had to keep track of every figure individually and it just, that always seemed too much of a headache. I wasn’t willing to invest that amount of effort. So on some level I wrote the book now because I have tools to write it now.
CW: Yeah. And then when you commit to a book then you also have to write all the chapters that maybe you didn’t want to write but they needed to be there. Um, I mean the, the book is kind of three parts. The first part just goes through just all sorts of standard ways of visualizing data. How do you visualize amounts? How do you visualize associations between variables? How do you visualize proportions, things like that, just the standard things, bar plots, scatter plots, line plots and so on. And then the, the second part is about figure design. And that just goes through various things that one should think about. Like one big one is for example, color choices, right? How do you pick colors that work? And also that work for color blind people, and axis labels. Not making the figures too busy, but also not making them not busy enough. And then the last part, um, is kind of the, the various other topics that I’ve felt should be in a book but that didn’t have a clear coherent maybe heading. And that includes things like how do we combine figures into a larger document? Like how do we tell a story with a figure? I just touch on this very briefly, but then also things like how do you save the figure on your computer? What’s the right format? A lot of these things that we kind of expect everybody to know, you know, what’s the difference between a PDF and a PNG? When do you pick which one? And nobody ever really spells that out, right?
CW: And like you have people that read carefully through the specs of these file formats and understand for themselves when to use which. Most other people just don’t. And they just, like, I think, mostly accidentally pick some. Sometimes it works out and often it doesn’t. It’s my experience.
JS: No, I’ve, I’ve been seeing a lot of people, uh, publishing, you know, reports or documents where the, you know, the text is nice and crisp, and then all of a sudden there’s like a blurry graph in the middle and then the text picks up right after it and it’s, it’s jarring to see this, you know, nice crisp text and all of a sudden this bar chart that is clearly pixelated, uh, because, you know, however they, however they exported it from whatever tool they were using, they weren’t using the right image format. So, um, yeah, I think that’s a, that is a big issue for people.
CW: Yeah. And like JPEG artefacts.
JS: Yeah. Yeah.
CW: All these little artefacts because people don’t understand JPEGs. Or on the flip side, they understand that PDF is, is essentially a, um, resolution independent format and tends to give you the best results. And so they paste all the PDFs into a Microsoft Word document and then the document becomes completely unresponsive. Right? I mean you could print it, but like online or on the screen trying to edit it doesn’t work. So yeah, there’s, there’s all these little tricks that everybody should know but it’s just too many of them and most people don’t. And so hopefully my book can fill some of those.
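A rough sketch of why a raster export can look blurry next to crisp text, in Python. A PNG or JPEG stores a fixed grid of pixels, while a figure saved as PDF is typically stored as resolution-independent shapes. The 6 by 4 inch figure size and the 72 and 300 dpi values below are assumed rule-of-thumb numbers for illustration, not figures from the interview, and the function names are hypothetical.

```python
def pixels_needed(width_in, height_in, dpi):
    """Pixel dimensions a raster image needs to fill a given size at a given dpi."""
    return round(width_in * dpi), round(height_in * dpi)


def effective_dpi(pixel_width, printed_width_in):
    """Resolution a raster image actually has once placed at a given width."""
    return pixel_width / printed_width_in


# Hypothetical 6 x 4 inch figure.
screen_export = pixels_needed(6, 4, 72)    # a typical low-resolution export
print_export = pixels_needed(6, 4, 300)    # a common target for print

# The low-resolution export placed at full size in a document still only
# carries 72 pixels per inch, which is where the pixelated look comes from.
resolution_on_page = effective_dpi(screen_export[0], 6)

print(screen_export, print_export, resolution_on_page)
```

A vector format sidesteps this arithmetic entirely, which matches the description of PDF above as essentially resolution independent.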
JS: Yeah, that’s great. I mean the other thing about the book that I think is one of the things that’s not really out there is you explore not just a standard line bar pie, you know, area charts. There’s, there’s other chart types out there and you spend some more time talking about those other graphs. So, uh, I’m curious about your take on the standard charts, which I mean however you define standard charts, but you know, if you think about like, you know, lines and bars and, and pies versus non-standard chart types, which you know, might just include slope charts and dot plots as a, as a starter. You know, they’re, they’re not the sort of core things but, but they are sometimes actually better ways to show data. And I’m curious how you balance the two when you’re thinking about teaching people data vis or when you’re thinking about what you see online in terms of people choosing different chart types.
CW: I think we, we should be adventurous, but we should also be critical, right? So if I can, if I can show some data set in, in a way that maybe is not so standard, but really brings out a key aspect of the data, then we should totally go for that. Uh, at the same time, you can also like if you go to this Xenographics webpage, I mean, some of the ideas that some people try out really maybe didn’t quite work, but it’s okay. I mean, we tried it out and then either it works or it doesn’t. There’s a couple of things that I care a lot about. One is if you write a report with say five or six figures, I think it’s actually really important that every figure looks different visually. Like if you have 20 pie charts, then after a while they all blend together. And the audience really, I mean, oh, another pie chart. You know, and you might be talking about a totally different topic now, but it’s just all pie charts and um, and I should maybe not talk about pie charts because people have strong opinions on that. So let’s talk about scatter plots, right.
CW: Everybody is on board. Scatter plots are a good idea when you have association data, right? But if I show you 10 scatter plots in a row, after a while you just, your mind shuts off and you can only like they all look the same, right? And so I think it’s actually really important to, to have a repertoire of different possibilities of showing data so that we can just keep it changing so that the audience sees, okay, the scatter plot was this part of the story and now we have, I don’t know, a density plot and that’s now we’re talking about something else, right? So you can really use, just like we can use color and we can use fonts and so on to, to make clear that we’re now talking about something else. We can also switch up the type of visualization that we use to structure a document. I care about that a lot because I, I’ve definitely, I’ve sat in like PhD committees where like every graph is a line that goes down and you know, 20 minutes.
CW: I’m like I can’t keep this apart, you know, and they are all different things, but the graphs look all the same.
JS: Look all the same. Yeah. Well what about, I mean I think report construction is an interesting topic in itself. I mean there are lots of journals out there at least in social science where they require you to put the figures at the very end of the paper as an appendix, which always bothers me because I want them to be integrated with the rest of the, the report. I want it to be an argument. And the visuals are supposed to support the argument and by putting them at the back you sort of relegate them to this secondary status.
CW: Yeah. But do they print it like that or is it only when you submit and–?
JS: Uh, some of them put them in the back. Some of them move it, move it later on. Uh, I guess it’s just a, yeah, so it differs by journal.
CW: Yeah, I know people care about that a lot. I’m, I’m, I guess I’m okay if the figures are all at the end, I can find them there. What I care more about is that the caption needs to be with the figure. The worst form of this is you have like a page with all the figure captions and then afterwards you just have all the figures separately and then you have no idea which caption goes with which.
JS: Oh, right, right. Like figure one here, you know, then, then the title or the caption, but then when you get down to the actual figure at the end of the paper, it doesn’t have that same text.
JS: Yeah. It’s certainly easier to lay things out that way, but I think it’s a real disservice to the, to the reader.
CW: Yeah. I mean, so this is a completely different topic, right?
CW: But I find scientific journals in their submission instructions, they mix guidelines for how, how the reviewer would want to look at it with how final production happens, right? They make you submit the figures separately as separate image files. They’re really thinking about the final production of the paper rather than the initial review stage. And I think we should just be allowed to just submit a PDF with all the figures embedded, however we want, wherever they, they fit correctly into the document flow. And once everybody agrees that this is an appropriate publication and we want to publish it, then we can worry about production. I wish this was more separated, but certainly many scientific journals kind of muddle the two together and that’s where problems come.
JS: Yeah. No, I’m guessing you’re like me when you’re reviewing a paper for a journal, you are commenting on the, on the visuals as well.
CW: Oh yeah.
JS: I mean I was arguing with someone about this a couple of days ago. Now we’re totally off topic of, of your book.
CW: That’s fine. Yeah.
JS: But um, this, this is, this is really interesting. So is there a responsibility from the, from the journals to encourage their reviewers to review the, the visuals as well?
CW: So responsibility is, is a strong word. So I feel like ultimately the responsibility is always with the author is my own opinion. Like review is the purpose of review is to improve the product. It’s not necessarily to verify that the product is correct. Because certainly in biology, well nobody’s going to spend $1 million and three years of trying to actually validate that everything is correct. Right?
JS: Right, right.
CW: We have to on some level accept it as it is, but then we can look at it, we kind of do a basic sanity check. If something sounds totally outrageous, then we point that out. And most of the time though, certainly when I review, I try to help the authors improve the paper and if, if they have visualizations that clearly are not going to work for the audience, then it’s actually in the author’s interest to point that out. Um, in the end, I, I strongly believe that people have the right to embarrass themselves. Right? So if I say this is really a bad visualization, and they say, no, we like it, we want to have it that way. Okay, well it’s your choice. But at least I said it.
JS: Right. Right. You feel like you’ve done your, your job as a reviewer.
CW: Yeah, exactly. Unless it’s clearly wrong. Right? I mean–
CW: Things are just objectively wrong. Well, objectively wrong.
CW: And you point it out, but, but if it’s more, well, you really should consider using larger labels in all your figures because nobody can read this. If the author insists that they want to have figures nobody can read, I mean, in the end, it’s their choice.
JS: Yeah. Have you ever written your review back to the editor saying, look, you know, my comments to the authors were, were such and such, but really the graphs are so bad that you, that this would really need to be resubmitted as an entirely new thing, or the authors really need to rethink the way that they’re presenting this because the, the graphs are just so horrendous?
CW: I, I have done that. Yeah.
CW: Yeah. But then again, in the end, if the authors insist, I would say, “Okay, it’s your choice.”
JS: It is amazing to me that, well maybe it’s not amazing to me still, but I want to say it’s amazing to me. There’s not a lot of thought given to the reader, even of academic journals. So even if you’re thinking about someone, you know, researchers thinking about communicating to other researchers, they still don’t think about the audience and the text is very dense and the graphs are really hard to read. And then you see these in publication and well, I feel like there should be someone at some point who should be, you know, there’s the editor, there’s the peer reviewers, there’s um, the, you know, the desk editor, then the editor that might be actually laying things out that someone’s got to say, “Look, this is really hard to understand.”
CW: Yeah, I mean, I, I think it’s the reviewers really that should do that. And I actually also, when I work as editor, most of the reviews that I get are, uh, along those lines: well, this really, I couldn’t quite understand this, I encourage the authors to improve. I personally feel it’s a wasted opportunity, right? If you write a dense paper that nobody can understand, then in the end nobody’s going to hear your message. Right? And so when I tell an author to maybe reconsider how they’re presenting their work, I’m really trying to help them to get the message across better.
CW: But some authors listen and others don’t.
JS: Now I’m curious about your experience in the biology literature because I, you know, I spend obviously my time in the social science literature and I’ve dabbled in just perusing some biology literature and the graphs are like, I mean, I’m sure for biologists, they’re second nature, but for me, you know, there’s, you know, dendrograms and, and you know, things showing gene breakdowns and you know, they, they look completely foreign to me. So I’m curious about your experience in the biology literature and, and the type of graphs that folks use to present their research.
CW: Yeah, so graphs in biology can be wild. So the, the one thing that I think biology does well, but it might still be unusual when you come into it is biologists are very good at drawing, drawing diagrams or schematics. So you have a gene and you have a regulator and you have a promoter and so on or like a pathway, those kinds of diagrams that, uh, they just use the same visual language over and over and so every biologist looks at it and says, oh, this is the gene and there’s an enzyme and oh there’s a connection here. And if you have never seen those, it would just be boxes and arrows and you wouldn’t know what is going on. So that actually, I, I think biologists do very well and other fields maybe could do more of that also. Then this, this other part, and that’s mostly in computational biology and like high-throughput systems biology and so on, they use incredibly dense and complicated visualizations where honestly I, I feel like the typical Nature paper these days, you open it up and there are like beautiful colors and it could work as modern art on your wall. And I am convinced no reader actually understands what’s going on. And, and the problem is not only are the visualizations incredibly complex, they also tend to be of incredibly derived quantities. You know, you, you do some complicated measurement of millions of values and then you calculate some sort of summary statistic of subsets of data, et cetera, and then you take those and you pool them again and average and then you integrate or whatever. And in the end you still have a million numbers, but they were like sent through a pipeline of 10 computing steps and really nobody can understand what it is. So I’m, I’m very critical of that because I feel there’s a lot of like it looks cool and people kind of think they should like it because clearly it was a lot of work. But the insight, it’s not clear that they actually convey that much insight.
JS: Yeah. The beauty and the complexity part might be valuable in a different context. But in the, in the journal article world, you want to make that argument.
CW: In the end, there should be an insight, right?
CW: After, after we have spent, I don’t know, $500,000 on experiments and graduate student time working for three years and making millions of measurements, it would be good if there was a clear insight at the end and not just, oh, here is stuff and–
JS: So now we can, we can come full circle and come back to the book then.
JS: So that’s the primary goal of the book is to help people use data vis to provide insights to their readers or their users.
CW: Uh, if, if that worked out, that would be great. Yeah, we have–
CW: We’ve, I mean, I, I kind of touched on all of these things. Some of them maybe are only a short section. So one thing I really care about in, in writing reports say is I feel you always should go from data that is the closest to raw data. And then as you go along, you kind of, you can work with more and more processed data until you have some very derived quantity at the end, right? So like you measure some quantitative variables and you start with a scatter plot and then you can turn that into maybe even do a regression, you have a correlation. And then if you have a lot of correlations, you can visualize them as a heat map and then maybe you can summarize heat maps into a pie chart where like some grouping is this way, some grouping is that way. So that would be a sequence where you start out at something that’s very close to some number that you can imagine that was measured. At the end you have some highly derived quantities.
CW: And it’s really important to have this sequence. If you, if you go back once, you immediately lose everybody. And if you just start at the end and never show the more, so the less derived parts of the analysis then also everybody’s [indiscernible] [00:26:22].
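The raw-to-derived sequence described here can be sketched in a few lines of Python. Everything in this example is an assumption for illustration: the three variables and their values are made up, and the Pearson correlation is just one possible choice of derived quantity. Each step notes which kind of figure it would feed.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Step 0: hypothetical raw measurements of three variables.
raw = {
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.1, 3.9, 6.2, 8.1, 9.8],
    "z": [5.0, 4.2, 2.9, 2.1, 1.0],
}

# Step 1: closest to the raw data, the pairs a scatter plot would show.
scatter_points = list(zip(raw["x"], raw["y"]))

# Step 2: one derived number per pair of variables, as in a heat map.
corr = {
    (a, b): pearson(raw[a], raw[b])
    for a in raw
    for b in raw
    if a < b
}

# Step 3: a highly derived summary, as in a pie chart of groupings.
summary = {
    "positive": sum(1 for r in corr.values() if r > 0),
    "negative": sum(1 for r in corr.values() if r < 0),
}

print(scatter_points[:2])
print(corr)
print(summary)
```

Read top to bottom, each step is further from anything that was directly measured, which is the ordering argued for above.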
CW: That’s somewhere in the book, it’s only a few paragraphs, but it’s in there.
JS: Right. Well, I’m sure people will check it out. So there’s the, the online version and then there’s the print version that’s just, that’s just coming out. So, um, so good luck with it. Uh, I’m sure you’re at least relieved that it’s done and, uh, and congrats again on getting it out there.
CW: Thanks for having me.
JS: Yeah, thanks for coming on the show. This is a lot of fun. We, we, we veered off a little bit, but this is a lot of fun. All right. Thanks Claus. I appreciate it.
Thanks everyone for tuning into this week’s episode. I hope you enjoyed it and I hope you will check out Claus’s book, Fundamentals of Data Visualization. Also, if you’re interested in supporting this show, please consider leaving a review on your favorite podcast provider or put a couple bucks a month towards the show on the Patreon account, uh, where you can, uh, help support the show, help me, uh, cover costs of editing and transcription services and all the things that, uh, I need to bring the show your way. So I hope you enjoyed this week’s episode. Until next time, this has been the PolicyViz podcast. Thanks so much for listening.