Max Kuhn is a software engineer at RStudio. He is currently working on improving R’s modeling capabilities and maintains about 30 packages, including caret. He was previously a Senior Director of Nonclinical Statistics at Pfizer Global R&D in Connecticut and applied models in the pharmaceutical and diagnostic industries for over 18 years. Max has a Ph.D. in Biostatistics. He and Kjell Johnson wrote the book Applied Predictive Modeling, which won the Ziegel Award from the American Statistical Association, recognizing the best book reviewed in Technometrics in 2015. Their second book, Feature Engineering and Selection, was published in 2019, and his book with Julia Silge, Tidy Modeling with R, was published in 2022.
R for Data Science: Import, Tidy, Transform, Visualize, and Model Data by Garrett Grolemund and Hadley Wickham
New Ways to Support the Show!
With more than 200 guests and eight seasons of episodes, the PolicyViz Podcast is one of the longest-running data visualization podcasts around. You can support the show by downloading and listening, following the work of my guests, and sharing the show with your networks. I’m grateful to everyone who listens and supports the show, and now I’m offering new exciting ways for you to support the show financially. You can check out the special paid version of my newsletter, receive text messages with special data visualization tips, or go to the simplified Patreon platform. Whichever you choose, you’ll be sure to get great content to your inbox or phone every week!
Welcome back to the PolicyViz podcast. I am your host, Jon Schwabish. On this week’s episode of the show, I welcome Max Kuhn. Max is the co-author, with Julia Silge, of the new book Tidy Modeling with R. Now, if you’re a listener of the show, you may have listened to that episode with Julia just a few weeks ago. Julia really didn’t want to talk about her new book, which is amazing, she wanted to just move on to the next thing that she’s working on at RStudio, which is great. But I want to talk more about Tidy Models, so I reached out to Max to see if he would like to talk about the book. So we do talk about the book, and we talk about Tidy Models, and we talk about a lot of the other fascinating things that he’s working on at RStudio. We also talked about his background in the pharmaceutical area, and how he moved into R, and then into RStudio. There’s a lot to learn here. There’s a lot of great links in the show notes that I’ve included if you want to explore some of these different packages, and explore Max’s GitHub page, where he’s got a lot of information, a lot of code, a lot of resources for you. This is a really interesting episode, a really interesting discussion, and if you are an R programmer, this is the episode for you.
Now, before I turn you over to that interview, I’m excited to let you know that I have a sponsor for the podcast. That’s right, PartnerHero is sponsoring this and several more podcasts coming up, so let me tell you a little bit about them, before I let you go on to the next thing, because I think if you are a freelancer, if you’re a small business owner, if you’re working in the DataViz field, PartnerHero might have the right solution for you. PartnerHero is a customer service outsourcing firm, they have flexible terms, they’ll help you scale quickly, they have quality assurance programs baked right into the tool, and they’ve offices all around the world, which, of course, is really important. Now, I’ve used lots of different outsourcing tools and platforms to do some of my work, maybe I need someone to help me scrape something off the web, maybe I need someone to help me write something or clean something up. And they’re all fine, they have some pros and cons, some of the user interfaces aren’t great, this and that. The thing that I like most about PartnerHero is that they are really emphasizing values, and they are values aligned and trying to change the outsourcing industry, because there’s so much work out there where people are being exploited, people are being taken advantage of, and PartnerHero is really focusing on that. So if you are a small business owner, or if you’re a freelancer, generally, if you’re just ready to bring in outside customer support help for your startup, that feels like it’s part of your existing team, I recommend you check out PartnerHero. So head on over to partnerhero.com/policyviz to book a free consultation with their solutions team, mention you heard about PartnerHero from PolicyViz, and they’ll waive the setup fee. So that’s partnerhero.com/policyviz to check out PartnerHero’s outsourcing firm.
So with no further ado, let’s check out the interview with Max Kuhn from RStudio, I hope you’ll enjoy the conversation.
Jon Schwabish: Hey Max, good afternoon, how are you?
Max Kuhn: I’m well. How are you doing?
JS: I’m good, you’ve got a friend behind you.
MK: Yeah, that is my new puppy, Kaladan. He’s six months old. I’ve had him for like two weeks, and he’s like the Platonic ideal of a good boy.
JS: He’s literally Max’s Best Friend.
MK: Yeah, so far, yeah.
JS: He’s just chewing on a, what is it, one of those rubber things with a peanut butter in it?
JS: Yep. Good. Enjoying it. Well, thanks for coming on the show. You have this new book out that I’m excited about, which I am just checking out now, and there’s a lot in there, so I want to get to that. But I wanted to start with your background, because a lot of folks who come on the show working in data and data visualization, it’s not like a direct line to where they are now, and you have an interesting background. So I was curious, you can just talk about yourself, so folks know a little bit more about you, where you started, and how you ended up at RStudio.
MK: Yeah, so I’m a biostatistician, so I have a PhD in that. I worked for about six years in Baltimore for a company doing molecular diagnostics. There was a lot of traditional, nonclinical statistical work, like designing and analyzing experiments, and then doing some algorithm work for the instruments. And then I went into drug discovery for about 12 years, again doing the early research, sciency stuff, and did a lot of modeling and experimentation, things like that. I really enjoyed it. So yeah, when I left there, it was a little bit of push, a little bit of pull. Honestly, I was not thrilled with working for a huge corporation with goals that were maybe not in alignment with mine. The scientists I worked with and the work I did, I loved, and I really, really enjoyed all that. But I think I’d reached the point of, like, yeah, I need a new job. And so there was a little bit of, yeah, I should look for something else, but a big part of it was when J. J. Allaire and Hadley Wickham were like, “Hey, we’re going to be doing some more stuff with R, would you like to work on modeling software?” And I was like, yeah. So, in the first week or two of working in discovery, I had had some ideas for an R package. This was like 2005, and there weren’t many – R was a very heterogeneous environment, and it still is for modeling. And so I wrote this package called caret, which was mostly used internally for a year or two for computational biology and chemistry. It was something that would unify a lot of the disparate interfaces in R, and it also had a lot of functions that didn’t really exist otherwise, like if you wanted to calculate sensitivity and specificity, just small support functions.
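For readers unfamiliar with the two metrics Max mentions, here is a minimal Python sketch of what such a support function computes. It is not caret's implementation; the function name and the example labels are made up for illustration. Sensitivity is the share of actual positives that are predicted positive, and specificity is the share of actual negatives that are predicted negative.

```python
def sensitivity_specificity(truth, predicted, positive=1):
    """Return (sensitivity, specificity) for two equal-length label sequences."""
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")

    tp = fn = tn = fp = 0
    for t, p in zip(truth, predicted):
        if t == positive:
            if p == positive:
                tp += 1  # actual positive, predicted positive
            else:
                fn += 1  # actual positive, predicted negative
        else:
            if p == positive:
                fp += 1  # actual negative, predicted positive
            else:
                tn += 1  # actual negative, predicted negative

    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")

    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical labels: 4 actual positives followed by 6 actual negatives.
truth = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
sens, spec = sensitivity_specificity(truth, predicted)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

In this toy example three of the four positives and four of the six negatives are classified correctly.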
MK: And with that, and I was working on a book on modeling, and that was going pretty well, it was like a little pet project that kind of blew up on me. And I was like, holy smokes, now people are using it, and then, the company was like, yeah, no, we allocated some time for you to do that, but, you know, you’re kind of on your own now for doing that. So with caret, I love it, I’m happy so many people have used it, and I’ve gotten a lot out of it. But it maybe wasn’t designed as time went by very well. It was like something I was doing in my spare time. So it didn’t really have much potential for – I mean, it’s got a lot of stuff in it, but not – it wasn’t very extensible. And so, when they say like, hey, what would you do if you could spend all day writing modeling software. And I was like, well, I’d kind of start over, there’s a lot we learned about interfaces to models since then, especially with Hadley and all this stuff he’s done with the Tidyverse and ggplot, things like that. So yeah, so there was a very enticing offer to be paid full time to write data analysis software and start over, and work with a lot of really good people who know a lot of really good stuff, and let things evolve, yeah.
JS: So did you go to RStudio in 2005?
MK: I think it was 2016 – no, 2005 is when I started in drug discovery.
JS: Okay. So you’ve been at RStudio now for about seven-eight years, something like that?
MK: It was like the end of 2016.
JS: ‘16, okay, yeah. I mean, that’s really interesting. So where does Tidy Models come from in your origin story as it were?
MK: I thought about different components of what I would want to do next. It happened, really weirdly, that I was in New York the same day that they were in New York. And they knew I was in Connecticut, so they were like, hey, could you come to New York, and I’m like, what’s New York. And so we actually sat down at a whiteboard and outlined some stuff. I had originally been thinking about what became the recipes package, which I’m really particularly proud of. recipes is sort of like a combination of dplyr and R’s formula method. And what that means, if you’re not familiar with those, is that you can very quickly have a very expressive, sequential way of pre-processing your data prior to modeling, or doing feature engineering or feature extraction, and it allows you to do things that you couldn’t necessarily do with R’s traditional modeling tools. And that’s very, very much influenced by the [inaudible 00:08:34] functions in dplyr, things like that. So we kind of thought about that, and that was sort of the first little bit of it. I’d read R for Data Science, and there were some things that Hadley was doing in there that didn’t really persist beyond that. But the way that he stored resampling information, stuff like that, eventually became rsample, sort of a beefed-up version of that. And so these little pieces sort of came together: what do we do about having a better interface to models? And there were things I definitely learned with caret, like, yeah, I’m not going to do that again. It became a little bit more complicated way of doing things behind the scenes. But in the end, I think it’s a lot simpler for – well, I think, at least, it’s a lot simpler for people to use. So it was just sort of like, yeah, you know, and I remember the first time, like, right when I started was like the first RStudio conference, and somebody asked Hadley on stage what’s Max going to be working on. And he said, modeling. 
And they’re like, oh, what kind. He’s like, all of it. And I’m really sitting [inaudible 00:09:35] just mad because we kind of start at the same place, we’re sitting against each other, and I was like, shit. Pardon me. Clean this podcast. But it’s very open ended, and thankfully, it still is. And they trusted me enough to say, like, all right, well, let’s, you know, give him a head when he needs it. Like, we usually talk a lot about interfaces and naming things, I’m like, what’s a good way of doing things.
JS: Yeah. I want to ask, what drew you to R in the first place as opposed to, I don’t know any of the biostat packages, I’m sure there’s a ton of them, but what drew you to R?
MK: Well, I’m going to date myself, I was in graduate school in the 90s. So basically, your two choices there for statistical analysis were SAS and S-PLUS. There was no R, and SAS was what people were being taught. It became very clear to me that you’re very limited in what you can do. And the reason I went into non-clinical statistics was the nature of that kind of job is you have like hundreds of customers doing many different things that don’t have any predefined analysis. So somebody comes to you with some new laboratory tests that they’re working on, that produces some really funky type of output, and it’s your job to translate that into some numbers. And I always gravitated to problems like that, so I felt having something very, like, it’s going to sound bad, I don’t mean it to be bad, but like a superficial sort of programming language like SAS, it’s not very expressive. And so, then in graduate school, I saw S-PLUS, I was like, all right, this is nice, right. And so, eventually, S-PLUS sort of petered out when R came online. And so, you know, R, for me – I mean, it sounds like a silly thing to say now, because we’re so used to having actual programming languages to work with, but in the mid-90s, that was basic. And so, you are having something where you had scoping and functions and data types, and that was like a – it was like, yeah, I can do anything I want to here. So, I mean, I think in the language or some data science, I think most of it’s based on where you started, and it’s like, I’ve never used Python or anything like that. But R, I really do believe that R really [inaudible 00:11:50] but like, the S language in R is really based for people whose fundamental thinking process is about interactive data analysis. Right? So it’s built literally from the ground up with that in mind, which is not a knock against any other programming language. But if you’re asking me, like, what I feel makes it click with me. 
It’s like, oh, there are some things that could look really kludgy to outside people, but they’re really nice, in terms of the context of what you’re doing. So contextually, you can call like a DSL, domain specific language, something like that, but I like it for what it does. I wouldn’t use it to do my taxes…
JS: No, right.
MK: But for data analysis, it’s fantastic.
JS: Yeah, for the tool, yeah. And I guess, I don’t know enough about the history of R, to be perfectly honest, but the movement from base R into using a GUI like RStudio, I don’t even know what my question is here, I guess. But maybe the question is, how did that change your workflow or the way, or maybe it didn’t, or how you think about using the tool when it has more of this space that’s a little bit, well, I don’t know this is true, right, just more user friendly.
MK: So I think there’s like two aspects of that that are worth talking about, and first is syntax, and people are still arguing about this. Being like an S person, back in the day, when I first started using it, I remember sitting in my office, and this was in Richmond, Virginia, in the basement of the medical building and saying, like, where’s the damn inverse function, I just need the [inaudible 00:13:27] matrix. Right?
MK: It’s the solve function without adding the extra argument. And I’m like, what the hell. Or like the sort function: I want a sorted data frame, and no, you can’t really do that. You have to use subscripting with the order function. And so, it’s really efficient and does some good things, and there’s a lot of good things.
MK: But it’s not really written in terms of like, I can figure that out, I have a PhD in statistics and done all this stuff. I’ve worked with like a lot of [inaudible 00:13:55] scientists, and I’ve worked with a lot of people who don’t have any training in data analysis or statistics, computer science, and I think that things like the Tidyverse are born out of that, like, there’s some just low hanging fruit about let’s name things better, let’s give them arguments that, you know, and then you consider like the pipe, the magrittr pipe or now the base R pipe, and start designing code for that. People are still arguing about this, but I feel like for the average user, it’s far and away better to be using, assuming you’re inside the scope of what, let’s say, the Tidyverse or associated things do, it’s a much nicer place to live. So on the syntax side, I think that’s a good answer. I think the Tidyverse gets a lot of press, but I think the thing for me as a developer – and this translates to the average user too – is the toolsets are so good, they’re so good, like, the RStudio editor, like the IDE, the tooling [inaudible 00:14:54] everything is – it just can’t be better. And so, the thing I’ve learned over the years is it’s not that you need really, really good tools to be great at something, but man, does it make your life so much better, like, I can’t imagine to impact management without the tools that the people inside of RStudio build. Think about users like in terms of importing data, the tools are largely, I think a lot of the tools I’m talking about are really born out of RStudio. A good example is Jenny Bryan, who has spent a lot of time working on spreadsheets. She’s this brilliant person, and I think she recognizes, yeah, all that fancy statistics is great, but if people can’t get their data in, like, why are you doing it.
JS: Yeah, right.
MK: I remember talking to her about spreadsheets, and she and I had this sort of shared miserable experience of having people. So in biology, most genes can – there’s different ways to reference genes. But for humans, there’s this thing called the HUGO ID, which is usually a couple of letters, like, Interleukin-18 is IL-18. And I had this experience where people, like, external collaborators would give data in an Excel file. And then, we tried to read it in, you know, we would eventually convert to CSV for the systems we had. And then, eventually it goes back into Excel or have these back and forth. And with a HUGO ID like SEP 12, Excel would think, oh, that’s September 12.
JS: Oh, yeah, sure.
MK: And so, it converts that cell to a date, and then, when you save it as a CSV file, it converts that as an integer from some reference date. And so, you’re looking at this spreadsheet that’s got maybe 20,000 things and you’re like, why is this long number there, and [inaudible 00:16:32] figuring this out, you’re like, holy shit, that was a…
JS: That was a date, right.
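The "long number" Max describes is a date serial number: a count of days from a reference date. A minimal Python sketch of that arithmetic follows, assuming Excel's default 1900 date system, for which a day zero of 30 December 1899 reproduces the serial numbers of dates from March 1900 onward. The helper names and the year 2022 are hypothetical; the transcript does not say which year the spreadsheet attached to the value.

```python
from datetime import date, timedelta

# Day zero that reproduces the 1900 date system's serial numbers
# for dates from March 1900 onward (an assumption stated in the text above).
EXCEL_EPOCH = date(1899, 12, 30)


def to_excel_serial(d):
    """Number of days between the reference date and d."""
    return (d - EXCEL_EPOCH).days


def from_excel_serial(n):
    """Inverse of to_excel_serial: recover the calendar date."""
    return EXCEL_EPOCH + timedelta(days=n)


# A gene label read as "September 12" becomes a date (hypothetical year) ...
as_date = date(2022, 9, 12)
# ... and the date is then stored and exported as an integer.
serial = to_excel_serial(as_date)
print(serial, from_excel_serial(serial))
```

Going from the integer back to the date is easy; going from the date back to the original gene label is the part that takes detective work, since the conversion silently discarded the text.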
MK: And [inaudible 00:16:38] don’t do that. I mean, it sounds silly, but that’s what I mean is like, you end up fighting the process so much if you don’t have good tooling. In R Markdown, Shiny, in [inaudible 00:16:50] it’s so fundamentally, like, I’m kind of jealous of people now, as they don’t have to, they don’t have to live with the pain that we had before, you know, anything but GitHub, I was using CVS and really old version control, and wow.
MK: But I wasn’t like oh man, but like…
JS: I know, but some of those, even some of those ones – the version control software that you’d have to purchase for just like, they were just not very good, and just hard to use, yeah.
MK: And then, the notion of needing it for data analysis. A scientist or statistician, now we’re meeting with some other statisticians, and we were talking to them about using version control, like, oh, we don’t need that. And the person I was with, his name was Jim Rogers, said – and he said it not in a condescending way, but he basically said, you do, you just don’t know it yet. You just haven’t lost something that you wish you had, and when you do, remember, I’m like three doors down, and I’ll talk to you, I’ll evangelize the gospel of version control.
JS: Right. Okay, so you have – so let’s pivot a bit. So you have a new book out with Julia Silge who was on the show a few months ago. So I sort of know that it’s like a review of the codes and text behind Tidy Models – it’s not really a review, it’s more of like a step by step. But it also is kind of a primer on regressions and modeling. And so, I’m curious, like, let’s start with – so it seems like there’s these two pieces. And so, did you set out to have these two pieces, like, not just going to give you a step by step in this particular coding language, which is kind of what like Hadley’s R for Data Science book does, but yours is more of like, I think you could sort of think of it as like an Intro Stats textbook with doing it in R, like, was that the idea?
MK: Really, we thought it out like that. I think the problem we have, assuming it’s a problem, is that when we want to teach any of this, whether it’s in a workshop, or in a book or whatever, it’s very hard to be like, and here’s how you do resampling. And people are like, what’s resampling. Right? You have to front load all this information about what’s a training set, what’s a test set. And we’re usually writing for people who are not experts at this. Again, we’re not writing for ourselves, we’re writing for people who work at a bank, and their boss is like, hey, I read this thing, go do a linear regression – like, what’s linear regression? And so, people just don’t know. And so, a lot of times, we want to talk about the syntax of what we’re doing, even if it’s something that’s not particularly fancy, we do have to sort of talk about the nomenclature, and the nomenclature leads a little bit into like, well, why would I do that, like, why would I save some data as a test set. And so, it’s really – I don’t think we’d ever be successful in teaching anything if we were like, and here’s how you do a random forest without, I mean, you want a random forest – to some extent, tell me what a random forest is. So I don’t think there’s any other way to do it, I mean, unless you’re looking at like a pure statistics or machine learning type of book where there’s no syntax or application. I think you kind of have to do them hand in hand, or otherwise it doesn’t really benefit anybody. And the problem with me sometimes is like just stopping myself from going on, and, you know, what’s the minimum amount that makes sense. And there’s plenty of things I would like for them to know, but hey, this is a book on modeling software. So let’s not worry about stuff like, oh, there’s things.
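To make the nomenclature concrete, here is a minimal Python sketch of saving some data as a test set. It uses only the standard library and is not how the Tidy Models packages implement splitting; the function name, the 25% test fraction, and the seed are assumptions chosen for illustration.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle the rows once, then hold out a fraction as the test set."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(rows)          # copy so the caller's data is not reordered
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    test = shuffled[:n_test]       # set aside; used only for the final check
    train = shuffled[n_test:]      # everything else is used to build models
    return train, test


rows = list(range(100))            # stand-in for 100 observations
train, test = train_test_split(rows)
print(len(train), len(test))
```

The training set is what the model is fit and tuned on; the test set is held back so that the final performance estimate comes from data the model has never seen.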
JS: Well, also the thing that’s interesting is I think about like your book, compared with like Hadley’s book – Hadley’s book starts with data visualization. Right? And I don’t know why he did that. My guess is that because with data visualization, when you start with ggplot, you get something – you can see your success right away. You don’t have to, you know, cleaning is kind of boring. It’s not like – and you get the success right away. And so, I mean, how did you and Julia, sort of, think about – did you think about that you’re like, just someone who’s going to read this book, is someone who’s going to be interested in modeling and learning this, and we just sort of think about that much is that much of a setup?
MK: Yeah, I don’t know, I feel like if they were reading that book, they have a reason, right? They’re like, all right, I’ve heard of this, or maybe I haven’t heard of it, but I haven’t used it, but, like, I know of its existence, what does it do. We don’t really – a lot of times in training materials in some books, we do this, like, what’s called the whole game, where we give them a little introductory chapter that’s like, not very in-depth, it gives them a roadmap of like, hey, here’s an analysis and what’s happened. And then, we’ll go through this in detail later. We didn’t really do that with this book. We kind of approached it from the standpoint of the first parts of being like about our philosophy, like, why would I care about this, like, sell me on why this is important, or I should spend time with this. And then, we went a little bit into, like, well, why aren’t you using Base-R. And then, from there, it kind of proceeds in terms of like how you would do your analysis. So we talk about, like, a little bit about exploratory data analysis, we talk about data splitting, because this is the first thing you do. And then, talk about, like, your first model, and then, eventually, worm your way through to measuring performance and tuning models. And then, there’s a bunch of just assorted interesting stuff at the end. So, if anything, it emulates sort of the process that you would analyze your data [inaudible 00:22:22] your data. So yeah, and it’s mostly, like a book where we want to give them like an early win or something like that is more a book where we want to sell people on using it, like, using that method, in general. Like, for Hadley’s book, it might be somebody who’s coming out of Excel, and they’re like, you know, to give them something like that, like you said, is like, you get them kind of hooked in. Shiny does the same thing. 
And with us, we sort of had the premise that, all right, well, you’re picking up this book for a reason, you’re going to do some modeling, maybe you’ve heard the Tidyverse and Tidy Models, so now you’re in, what do you do.
JS: Yeah. So the case you gave is the bank analyst is asked by his boss to do something. I’m curious what you think about – this is going totally different direction, but I’m curious what you think about making modeling too easy for people, right that, you know.
MK: I think about it so much. I’ve thought about this so much, recently, it’s been a point of discussion. I mean, so this is like heresy to say, and people are going to be like, all right, delete now, but, like, I’m really, really not a believer in any automated machine learning. I feel like that is like a recipe for disaster. It’s not like Skynet taking over, but where things go wrong is, and that’s been my mantra, my entire career is like what’s the worst that can happen. And so, most of the machine learning that I’ve done is like assistive, like, I don’t need to tell a chemist, it’s obvious this compound is the one you should synthesize to cure cancer. Right? But giving them, I hate to use the word insights, but giving them prototypes, like, hey, based on the data we have, here’s like a de novo structure we think you might look at, and with the understanding that it might prompt them to say, like, oh, I never thought about designing it that way, or, they might take a structure that they have and say, well, how would this work if we made it, we haven’t made it yet, but I’ll give you the formula, you tell me how active it’s going to be. And so, that’s sort of where I see the bulk of the utility in machine learning and modeling and things like, if we’re talking about predictive models as opposed to like making inferences and draw conclusions. So that’s sort of where I live. AutoML is something, depending on how you define that, that I think is really interesting and helpful, because I think as long as it instructs you, will you learn something out of it. So to me it’s really scary to be like, oh, I just gave my CSV file, and I have this published model without knowing anything…
MK: So I feel like we don’t want to do that. In some sense, we have honestly put constraints on that in Tidy Models. We’ve enabled you to do a whole lot of stuff. In fact, a couple of the other podcasts have talked about some of these tools that we’ve built up. They’re like, oh no, but just saying that, like, I had a lot of trepidation in making them. But on the other hand, when I was doing modeling for living, I would have wanted those tools. So we certainly don’t have anything where you just blindly get results without any oversight. And also, we put in a lot of guardrails to really prevent people from – there’s a lot of pitfalls in machine learning, it’s really – and I’ve done this, it’s really easy to get a model you think has really high accuracy. And then, six months later, you get new samples, you’re like, why am I missing them all. Just because you made a methodology error and didn’t realize it. And so, we know where the most of those things are, and we’ve designed the software and the syntax to really, you’d have to go out of your way to do it poorly and not realize it.
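Max does not name a specific methodology error here. One common example of the kind he describes is information leakage during pre-processing, where statistics such as a mean and standard deviation are computed from all of the data, including the samples later used to judge the model. The Python sketch below, with hypothetical function names and made-up numbers, shows the safer pattern: estimate the pre-processing parameters from the training set only, then apply them unchanged to the test set.

```python
from statistics import mean, stdev


def fit_scaler(train_values):
    """Estimate centering and scaling parameters from training data only."""
    return mean(train_values), stdev(train_values)


def apply_scaler(values, center, scale):
    """Apply previously estimated parameters to any data set."""
    return [(v - center) / scale for v in values]


train = [2.0, 4.0, 6.0, 8.0]   # made-up training values
test = [10.0, 12.0]            # made-up test values, never used for estimation

center, scale = fit_scaler(train)
train_scaled = apply_scaler(train, center, scale)
test_scaled = apply_scaler(test, center, scale)
print(center, round(scale, 3), [round(v, 3) for v in test_scaled])
```

Separating the estimation step from the application step is the design point: the test set cannot influence anything the model learned, so its performance estimate stays honest.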
MK: So those are all things that the kind of moment where people ask me that question. Our friend, David Robinson was talking about this in a blog post about Tidy Models, and his point was, yeah, don’t let people fool you with that, you should be concentrating on your problem, you should not be concentrating on is my software doing the wrong thing, am I unknowingly, like, making some big error, you should use tools that you can feel safe with and give you results. And he definitely felt like that was something that Tidy Models does. And I agree with him. You don’t want to get bogged down in, am I resampling the right way, or am I doing it properly, did I use the training set in the right way or the test set. And again, like, when I was talking about tooling earlier that it goes into that, like, if your job is modeling, and you have good tools, I think we’re making those. Then you get to be so much, much more productive, because you’re not bogged down in the errors or, like, oh, but did I do that right. And so, yeah, I feel like, that’s probably the best answer is I’ll admit, like, I was talking to somebody who I knew pretty well, and I was at their institution, and we were talking, and he was using caret at the time, and he said, I tried a dozen models in caret, and I don’t really know what a lot of them do, or how they work; but I get one that seems like it’s doing really well, and then I go read the paper. You’re like, all right, so what exactly is it doing, how can I figure that out. And to me, that made me feel the first part of the conversation was like, okay, and then in the second part, I was like, that’s…
JS: Yeah, that’s right, yeah.
MK: So yeah, we want to facilitate – we especially want to facilitate people doing things that they couldn’t do before, whether there wasn’t an R package to it, or, and this happens, to some extent, there’s something that you want to use in R, but you end up slamming your keyboard on the table, because it’s so awful to use. And so, we want to smooth all that out, we want to just make those things either rewrite them or make them easier to use and more consistent and work well, and allow you to do things that you really would have trouble doing before, it adds a lot of power to what you’re doing. But again, you can do it in a way that you hopefully can feel safe in your application listings.
JS: Yeah, it’s like building some roadblock, but not a wall. Right? It’s like a speed bump, right? You want people to slow down, but you still want them to be able to get to the end of the street.
MK: Yeah, maybe the analogy I’d use is going off road, like, honestly, I can go straight to my grocery store from my house, go through a lot of people’s lawns and get there much faster. But the idea is like, you know, but I have to do a lot of work to get there. Right? Here’s another example from caret, like, early on, somebody, and God bless them for doing this, but they emailed me and said, oh, I wrote this blog post about caret, how it helped me win this Kaggle competition, I just want to say thanks. I was like, that’s really nice. I got around to reading it, it was like, well, I mean, they weren’t wrong, but they basically used caret to try to map their training set to the test set and make them as similar as possible, to maximize their accuracy. I’m like, okay, yeah, I mean, you did it, we just want to make sure – we don’t put roadblocks over anywhere, but to do that, you would have to really like – you’d have to do some interesting things with Tidy Models to be able to finagle that. It would be so unnatural for you to do, like, why am I doing this.
JS: Yeah, I mean, it’s interesting, because there are lots of design tools now like Canva and Figma, and these other tools that are sort of democratizing design. And I’m sure there’s lots of graphic designers out there who hate those tools because you’re putting these fairly powerful design tools in the hands of people who don’t know design. And I guess, there’s an equivalence with statistical packages where you’re giving power to people who, like you said, they’re running regressions, but they don’t really know what they’re doing, or they don’t know how to interpret it, or they’re mucking with the data in ways that maybe they shouldn’t.
MK: And that is the modern history of statistics. So, I’m serious, like, when I was in my 20s, I was reading papers about the democratization of statistics, and people having Minitab or SPSS or whatever, and I think, as a profession, statisticians have had to come to terms with the idea that you can’t be the data cop. Right? It’s not that it’s an, you know, I think there’s a history in statistics of being sort of like, the analogy I use is like a PhD in statistics is like some wizard that lives in a cave, and people are scared, they don’t know how to kill the dragon, so they go and genuflect in front of the wizard. The wizard is like, oh, I’ll teach you the incantations to do this complicated thing, and solve your problems. The wizard is not out there helping them grow wheat, right?
JS: Yeah, right.
MK: And I feel like statistics over the years has suffered because of this sort of inherent, I mean, it’s a generalization, but I think [inaudible 00:31:11] a lot of the statisticians I’ve worked with is, like, yeah, so I’ll come and bless what you’re doing. And if you didn’t do it right, I won’t scold you, but I’ll look [inaudible 00:31:17].
MK: And so, I feel like data science, as much as it was talked negatively about, the whole Six Sigma thing, my experience with it, in the corporation I worked with, was actually very positive, because there was a statistician in almost every project. And so, we got in the mud a little bit and, like, oh these are the problems that they’re facing, not, should I use a T-test or Wilcoxon test. And so, I feel like the environment we’ve had to accept the idea that we are not, in fact, we’ve been marginalized to some degree, because there’s so many tools, and honestly, I think people on the computer science side, in some ways, are a lot better at promoting and talking about things, and there’s so many more of them that I feel like, I’ve been in places where it’s like, well, yeah, you’re doing a bunch of statistics, why aren’t you in the statistics group, like, we’re not, because that’s how we got 20 people instead of five people, because we’re obviously…
MK: So it’s really been a detriment, the cycle repeats itself every 10 years, it happened in the 80s, with Taguchi methods, in spectrometry, like partial least squares; and it’s happening in machine learning with boosting where we’re getting a little bit, I feel like, as a population, we get a little bit complacent, we see things from outside our community being published that have really good ideas that maybe aren’t really statistically all that rigorous. And then, it’s more like, once you have ideas, we can refine them. And I feel like we’re on the road to being more integrated, and more hands on, and more just generally proactive than we were before, because we don’t have much choice. In a way, that’s the thing.
JS: Yeah, the last thing I wanted to ask you, especially for folks who are listening to this, who are maybe less interested in the modeling side, and more interested in the DataViz side, can you talk a little bit about the link between Tidy Models and, like, ggplot, as that would be the way most people would go?
MK: So I say this quote all the time, if you’ve heard me speak before, you’ve most likely heard it, but, Professor [inaudible 00:33:25] said like the only way to be comfortable with your data is to never look at it. And so, before doing any modeling we promote in our workshops and things like that, hey, let’s take 10 minutes and whether it’s base R or just summary statistics or ggplot, look at your data, what do you notice about this. That really informs the models, so there’s usually this big feedback loop of like, you build a model, it works okay, you can figure out where it doesn’t work, and then, you have to do a bunch of like exploratory analysis if you [inaudible 00:33:55] why those samples are doing poorly. Also, recipes, in particular, have some nice tools where you can do, especially for like high dimensional data, you can do very helpful and informative feature reductions, like, there’s always principal component analysis, but there’s a whole host of things like that, some of which are nonlinear, some of which are supervised, that will really help you understand your data a lot better. Back when I was doing computational biology, the minimum number of outcomes I had in my experiments was about 7,000, ranging up to about like a million, right?
MK: And so, now you have all these dimensions, and it’s a very dense dataset, and how do I know if – but you’ve got like 30 samples – how do I know if any one of those samples is problematic? And so, using these particular techniques, it gets you very far very quickly to figure out like, oh yeah, this one’s all goofed up, because they ran it a week or two after the others or whatever the example might be. So if we do anything analytical, like, in terms of analysis, it’s kind of a recipe for disaster, if you’re not looking at your data. ggplot facilitates looking at your data, and then, what you learn can then facilitate going backwards, like, you might, in some data analysis, figure out that these model terms are really interacting with each other. And that’s maybe the ggplot that you show to your boss’s boss, that says, like, oh look, we can exploit this interaction and make more money or make a better widget or whatever it is that you’re doing. You really can’t divorce data analysis and visualization, they really are tied together.
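A minimal sketch of the kind of check Max describes, written in plain Python rather than with the recipes package he mentions. Everything here is an assumption made for illustration: the simulated data (30 samples, 200 features, with one sample shifted to mimic a run done at a different time), the helper names `pc1_scores` and `robust_z`, and the choice to score samples on the first principal component and rank them with a median-based z-score. It is one plausible way to spot an odd sample, not the method used in the book or on the show.

```python
import math
import random
import statistics


def pc1_scores(rows, iterations=200, seed=0):
    """Return each sample's score on the first principal component."""
    n, p = len(rows), len(rows[0])
    # Center every feature (column) so PCA describes variation around the mean.
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # With far more features than samples, the n x n Gram matrix is cheaper
    # than the p x p covariance matrix and has the same nonzero eigenvalues.
    gram = [[sum(a * b for a, b in zip(ri, rj)) for rj in centered]
            for ri in centered]
    # Power iteration for the leading eigenvector. The start vector is random
    # because a constant vector lies in the null space of a centered Gram matrix.
    rng = random.Random(seed)
    u = [rng.gauss(0, 1) for _ in range(n)]
    for _ in range(iterations):
        v = [sum(g * x for g, x in zip(row, u)) for row in gram]
        norm = math.sqrt(sum(x * x for x in v))
        u = [x / norm for x in v]
    eigenvalue = sum(ui * sum(g * x for g, x in zip(row, u))
                     for ui, row in zip(u, gram))
    # PC1 scores equal sqrt(eigenvalue) times the Gram eigenvector.
    return [math.sqrt(eigenvalue) * ui for ui in u]


def robust_z(values):
    """Median/MAD-based z-scores, which one extreme value cannot distort much."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    return [0.6745 * (v - med) / mad for v in values]


# Simulated experiment (hypothetical): 30 samples, 200 measurements each.
rng = random.Random(42)
n_samples, n_features, odd_one = 30, 200, 7
data = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
# Shift one sample across all features to mimic a batch effect.
data[odd_one] = [x + 3.0 for x in data[odd_one]]

scores = pc1_scores(data)
z = robust_z(scores)
most_extreme = max(range(n_samples), key=lambda i: abs(z[i]))
print("sample that stands out most on PC1:", most_extreme)
```

In practice the scores would usually be plotted (for example, the first component against the second) rather than ranked, which is the "look at your data" step the conversation keeps returning to.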
JS: Yeah. Okay, so terrific. Thanks so much. So Tidy Models, people can buy the actual physical book, but there’s also an open source version that they can check out. I’ll put the link to that. And there’s code snippets in there too, so they could – they shouldn’t, but they could just copy and paste and get things to run.
MK: Absolutely. And the whole book is there, like, all the source files, so that if you wanted to compile the book and print it on your printer, like, I mean, have fun with that. But yeah, it’s all out there for you, yeah. And there’s also tidymodels.org, a nice website we put together that has a lot of tutorials, and a lot of really good resources. So if you like long form, check out the book, if you want more short form, almost like blog-post-length information, that’s a better…
JS: That’s where to go, all right. I’ll put links to that, and everything we talked about. We got a whole – I’ll have like the history of statistical packages in the show notes for today. Terrific. Max, thanks so much for coming on the show. This is great.
MK: Thank you. Thank you for inviting me.
And thanks to everyone for tuning into this week’s episode of the show. I hope you liked that discussion with Max Kuhn. I hope you’ll check out all the links I put on the show notes page, all the links to his books, his GitHub page, his RStudio page, all of the great stuff that you could learn about the Tidyverse, Tidy Models and all the other stuff that he is working on. So until next time, this has been the PolicyViz podcast. Thanks so much for listening.
A whole team helps bring you the PolicyViz podcast. Intro and outro music is provided by the NRIs, a band based here in Northern Virginia. Audio editing is provided by Ken Skaggs. Design and promotion is created with assistance from Sharon Sotsky Ramirez. And each episode is transcribed by Jenny Transcription Services. If you’d like to help support the podcast, please share and review it on iTunes, Stitcher, Spotify, YouTube, or wherever you get your podcasts. The PolicyViz podcast is ad free and supported by listeners. But if you would like to help support the show financially, please visit our Winno app, PayPal page or Patreon page, all linked and available at policyviz.com.