In case you don’t know, for the last 12 weeks or so I’ve been hosting live video chats with experts in the fields of data, presentation skills, and data visualization. This Data@Urban Digital Discussion series gives (I hope!) people an opportunity to chat with experts in these various fields and to learn more about how to be better data communicators.
In this week’s episode of the podcast, I’m publishing the discussion with two of my colleagues at the Urban Institute. Graham MacDonald is the chief data scientist and Claire Bowen is the lead data scientist for privacy and data security at the Urban Institute. We talk about their current work and how they’re helping researchers understand data security and privacy in the age of big data.
I hope you enjoy the show!
Support the Show
This show is completely listener-supported. There are no ads on the show notes page or in the audio. If you would like to financially support the show, please check out my Patreon page, where, for just a few bucks a month, you can get a sneak peek at guests, grab stickers, or even a podcast mug. Your support helps me cover audio editing services, transcription services, and more. You can also support the show by sharing it with others and reviewing it on iTunes or your favorite podcast provider.
Welcome back to the PolicyViz Podcast. I’m your host, Jon Schwabish. I hope everybody is well, I hope you’re safe, and I hope you’re healthy. I’m excited to be back with more podcasts. I’ll be going through about the end of June with some more episodes and then taking a couple of months off for the summer; even though there probably won’t be a lot of travel going on in the Schwabish household, I’ll be taking a little bit of a rest as we get into the summer months. So as you may know, over the last few weeks, well, months now, I’ve been hosting these Data@Urban Digital Discussions. They are one-hour chats with me and a guest or two; we talk for a while and then take a little bit of Q&A from the folks who show up. They’ve been a great experience for me, a chance to talk with people in lots of different fields: data visualization, graphic design, people who are making tools, people who are working in data science. I’ve really enjoyed seeing lots of people come out and chat with me and, of course, with the guests. So on this week’s episode of the podcast, I’m going to repost one of those discussions. In this digital discussion, I chatted with two of my Urban Institute colleagues, Graham MacDonald and Claire Bowen. Graham is the chief data scientist at Urban and Claire is the lead data scientist for privacy and data security at the Urban Institute. We talk about their work, how they’re helping researchers at Urban understand the issues around data security and privacy, and other issues related to data security, data privacy, and working with administrative data. It’s a really interesting conversation, there’s a lot of interesting work happening in this space, and it was great to be able to sit down with Graham and Claire and talk through these various issues.
Just a couple of notes before we get into that discussion. I’ve been trying to post more blogs on my website, some shorter things, not as long as I usually write, just trying to get a little more writing done and a few more things out onto the blog. Lately, I’ve been writing about things like some success I had helping the Social Security Administration improve the data visualizations in their reports, and about a visualization I created on the benefits of wearing face masks in the era of COVID-19. And of course, I have a new book coming out later this year, Better Data Visualizations; I hope you’ll check that out on the show notes page, and over on Amazon you can preorder it right now. I’m very excited to see it coming out. So this week’s episode of the podcast is my discussion with Graham and Claire, and you’ll hear, of course, some other people chiming in with their questions after our discussion. I hope you enjoy this week’s episode of the podcast with Graham MacDonald and Claire Bowen.
Jon Schwabish: Good afternoon everyone. I’m Jon Schwabish, thanks so much for coming this afternoon to another Data@Urban Digital Discussion, digital chat. Hopefully, you’ve been able to tune into some of these in the past. The plan is pretty simple – we have two great guests, I don’t know what to call folks who show up for these, two great folks, two colleagues of mine at the Urban Institute. We’re going to chat for 10 or 15 minutes about the work they’re doing, and then we’ll open it up for questions and have a discussion. It’s very casual, very low key. So if you have questions, just pop them into the chat window, I’ll build out a queue, and then you’ll be able to unmute yourself and join the discussion. Right now, there are only about 35 of us, so we can just unmute ourselves and have at it. So I thought we would start by having our two guests introduce themselves: we have Graham MacDonald from the Urban Institute and Claire Bowen, also from the Urban Institute, and then we’ll just take it from there. So again, super laid-back. Claire, do you want to start?
Claire Bowen: Hi, I’m Claire Bowen, and I’m the lead data scientist at the Urban Institute, and I specialize in data privacy and data security. Am I supposed to say any more?
JS: No, that’s good.
CB: All right.
JS: And just as a point of reference, you’re out in Santa Fe.
CB: Yes, so right now, I’m in Santa Fe, New Mexico. I was going to be remote working anyway.
JS: Yeah, so this worked out.
CB: This totally worked out.
JS: Okay, and then Graham, who is in DC.
GM: I am in DC, Metro map back here. It’s a little dated, there’s no Silver Line on this one. But yeah, I’m in DC, chief data scientist here at the Urban Institute. I started as a housing researcher and did a lot of data viz as you were coming on to Urban, Jon, and then I left and came back, and have sort of built up the data science team. We’re now seven folks working across all the researchers at Urban to integrate anything you could think to call data science into our research at Urban.
JS: Yeah, maybe we should start there, Graham. You can talk a little bit about the team you’ve built out and, I don’t know, the evolution of what data science means at Urban, and maybe for researchers at places like ours in general; and then, Claire, I know you [inaudible 00:05:22] a couple things in the chat box for people to take a look at on data security and privacy, so maybe we can segue into the data privacy issue as well. But I think it would be interesting for people, Graham, to hear how you’ve been working to change how Urban uses data every day.
GM: Yeah, and let me know, my internet’s going in and out these days, so just let me know if I’m slow and just, I don’t know, yell or put your hands up or something.
JS: All right.
GM: Okay, so yeah, it’s a little under four years ago now that I came back to Urban from Berkeley, where I got my policy degree but really spent my time taking a bunch of programming courses at the school of computer science and information there. I really had this vision for Urban where we could use some of these new tools, mostly cloud technology, but also data science methods like machine learning, natural language processing, and I would put web scraping in there, that we could use in our research to collect new data, to do new types of analyses, to have better real time data collection, because often in our policy world the datasets we have at the neighborhood level are something like five-year averages from the American Community Survey, 2014 to 2018, being used to talk about today. So there are all these ways in which we want to better use data here at Urban. And Jon, as you know, before, when I was at Urban, we had been just putting out PDFs, and I’m sure when you were at CBO they were doing the same thing. You put out these PDFs, and you see the graph of who reads a PDF, and no one here is going to be surprised that sometimes it’s not even your mum; at best you get like 10 readers, and some get zero. And so building out data science at Urban is, when I think about how building data viz into policy organizations was sort of this obvious next step in how to communicate your work, something that took a lot of effort but was really valuable, I see the same thing in data science at policy organizations today.
I mean, don’t get me wrong, I still need more data viz at policy organizations, and better data viz, but I also need more and better data science. For example, we’re doing projects where we use natural language processing on news articles to collect instances of major zoning reform so we can understand the impact of zoning reform on housing affordability. Or think about the COVID crisis right now: there are a lot of creative people, including us, thinking about how we understand which neighborhoods or which areas are going to be most impacted by this crisis, economically or socially or whatever that may be, in real time, because we don’t want to wait three years to get that data; we want to have some proactive response. So I feel like data science is really important in this policy realm, not just for real time data but also for using these new methods creatively to come at new data sources, and that’s where we use big data methodology, new cloud architectures, APIs, things like that.
JS: Can you talk a little bit about, I don’t want to say the pushback, I guess the right word is challenge, because it’s not necessarily pushback, but it is changing the way people who have worked with data or built models for a long time have to work. Like, you reference your dataset on the C drive of your computer, and then going to something in the cloud is a different way to think and a different way to act. So I don’t want to say problems, I think it’s challenges, right, and maybe not even convincing people that it’s a better way, but just helping them get to that better state.
GM: Yeah, that’s a super good question. I’m going to give you a boring answer, which is the win-win mentality; go back to Stephen Covey’s 7 Habits, right? You go and say, hey, where’s the win-win situation here? I’m not going to people by fiat and saying, hey, you have to switch from your personal computer running SAS to running R on the cloud now. But there have been projects, we were working with [inaudible 00:09:16], this company that’s the largest online app for connecting people with hourly jobs. I had never heard of it before, but they have tens of millions of users across the US; I’m used to applying for jobs with a paper application from when I was working at my minimum wage grocery store job, but they have this app now. And these researchers were sitting with data on their personal computers, trying to merge a 72 million record dataset with a 24 million record dataset, just crashing their computers constantly and taking days and days. And then we come in and we’re like, hey, let me show you how to do that in five minutes. Whereas in a previous project, where it was, I have 10,000 records here, 20,000 there, they could figure it out themselves. But now there’s this huge problem, and it’s an easy win-win for me. And the same thing with the zoning reform I mentioned: how do we collect this totally new data that we never had before? In that case, you’re like, wow, I can do all this new research that no other researcher can do because I have the [inaudible 00:10:16] capacity. So we’re not pushing in every area, and 80% of the research projects are still doing the same things they were doing, but in areas where there’s this real opportunity to innovate and do new things, it seems like an obvious win-win.
And even then you’re right, there’s still some, let’s say, extreme skepticism as we are trained to have in the research field [inaudible 00:10:37]. So that’s also hard and I don’t want to minimize that.
JS: Right. So maybe this is a good segue to the data privacy and security issues that Claire specializes in, because you already mentioned these huge datasets; they may have specific individualized observations, and I don’t think at Urban we’ve really ignored those security issues, but they definitely change, and maybe become even more important, when we go from the ACS or an aggregate dataset to your credit record. I don’t know, Claire, do you want to, I don’t even know what the question is, to be honest; maybe give a baseline, a framework for people to start thinking about what data privacy means when it comes to research.
CB: Yeah, so data privacy is a huge umbrella, and the specific area I specialize in is the data releasing side of data privacy. The example I usually give people is healthcare data, because it’s on everybody’s mind, especially now. Healthcare data is a very useful dataset for finding early correlations on diseases, cancer, HIV, and with COVID right now it would be really useful for figuring out which symptoms are more common or which people are more affected. However, at the individual level, there’s a lot of very personal information in those healthcare datasets. Researchers who use them shouldn’t know who in the dataset has COVID or cancer or HIV and so on, but they still need to have access to the data. So the kind of research I try to do is to make what we call synthetic or pseudo records, a fake version of the dataset that is statistically representative of the original but still useful for those medical researchers. Another caveat I forgot to mention is that our taxpayer money goes toward collecting that data, so people should have access to it at some level. And here at Urban, as Graham was talking about, we have a lot of very large datasets, and nowadays it’s becoming much harder to protect people because we have these computers in our pockets. I talked a little bit about that in my blog: when the 2010 decennial US Census came out, the methods they used, which have existed for many years, basically worked. But nowadays it’s a lot harder to just remove PII, personally identifiable information. One of the techniques the decennial census used is called swapping: they would randomly swap people’s attributes with somebody else’s in another state. That’s how they were protecting individuals. But in 2010, smartphones were barely a thing; today’s smartphones are more powerful than the laptops or desktops we had in 2010.
So it’s really hard to protect against the brute force computational power you can just find now. Look, we now have social media; maybe people can be linked through those external datasets, and we can find them in the decennial census and in our [inaudible 00:13:37] data, which happens quite often.
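The synthetic-data idea Claire describes can be sketched in a few lines: fit a simple model to the confidential records and publish draws from the model instead of the records themselves. This is only a toy illustration with made-up numbers, not Urban's actual method; real synthetic-data work uses much richer models and formal privacy guarantees.

```python
# Toy sketch of synthetic data release: fit a simple parametric model
# (mean and covariance) to the confidential data, then sample fake rows
# that are statistically similar but correspond to no real person.
# The "confidential" dataset here is invented for illustration.
import numpy as np

rng = np.random.default_rng(42)

# Pretend this is a confidential dataset: columns = age, income.
confidential = np.column_stack([
    rng.normal(45, 12, size=1000),         # age
    rng.normal(60_000, 15_000, size=1000)  # income
])

# Fit the model: empirical mean vector and covariance matrix.
mean = confidential.mean(axis=0)
cov = np.cov(confidential, rowvar=False)

# Release synthetic rows sampled from the fitted model.
synthetic = rng.multivariate_normal(mean, cov, size=1000)

print(synthetic.shape)  # (1000, 2)
print(synthetic.mean(axis=0))  # close to the confidential means
```

A real release would also have to protect the fitted parameters themselves (for example with differential privacy), since even means and covariances can leak information about outliers.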
And so, I guess, segueing into some of the links I provided: one kind of dataset that’s really hard to protect is spatial data. The two links are about cell phone metadata, because it’s pretty accurate, right; all of us carry smartphones, and that information is being collected. I think it’s the second link, the older one, from December of 2019, from the New York Times, that showed how you could find out where people live and work just from how frequently, or at what time of day, they moved between places. One of the things they noticed, and I might invert this, was that somebody obviously worked at Amazon, and then one time they went to Google, and two months later they switched jobs and started working there. Something like that. And there was an article a couple of years back, I believe from Stanford, about how they could figure out who had heart conditions because they would go visit a particular doctor, and that’s kind of scary.
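The home-and-work inference Claire describes from that New York Times piece can be sketched with a toy example: guess "home" as the most common nighttime location and "work" as the most common daytime location. The device, places, and pings below are all fabricated for illustration.

```python
# Toy home/work inference from timestamped location pings.
from collections import Counter

# (hour_of_day, place) pings for one device over several days.
pings = (
    [(h, "apartment_on_5th") for h in (0, 1, 2, 6, 22, 23)] * 5 +
    [(h, "office_downtown") for h in (10, 11, 14, 16)] * 5 +
    [(12, "cafe"), (18, "gym")]
)

def infer_home_work(pings, night_hours=range(0, 7)):
    """Most common nighttime place ~ home; most common daytime place ~ work."""
    nighttime = Counter(place for hour, place in pings if hour in night_hours)
    daytime = Counter(place for hour, place in pings if hour not in night_hours)
    return nighttime.most_common(1)[0][0], daytime.most_common(1)[0][0]

home, work = infer_home_work(pings)
print(home, work)  # apartment_on_5th office_downtown
```

Real analyses are messier, as Graham notes below about app sampling rates, but the principle is the same: timing plus location is enough to reidentify a "anonymous" device.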
JS: So is all this data, especially on the cell phones, are companies able to get these data because we all just click that little agree button on every app that pops up and we don’t – no one ever reads it?
CB: Most of them, yes, because you’ll see that it’s like, oh, are you okay sharing your location data, and a lot of people say, yeah, I totally want Google to know where I’m at. And actually, one of the most identifying pieces of information, next to your Social Security number, is your cell phone number, because people are very resistant to changing it. And so when people ask, oh, would you like to be part of our membership, like if you go to Costco, I don’t know if Costco actually does this, but I was thinking of grocery stores. So like when I was in DC…
JS: My video has slowed down. Sorry.
CB: So in DC, there’s a Harris Teeter I went to, the grocery store there. And they asked, oh would you like a premium membership, and you just need your cell phone number, and so that’s how they track you.
JS: Great. This is going to be the most terrifying conversation we’re going to have, I think.
CB: I’m so sorry.
JS: So there is a question that was sent over. So the question is about the recent blog posts about reduction in movements based on cellphone data where DC got an A plus and Wyoming got an F. [inaudible 00:16:09] wants to know what level of aggregation did they use, and what kind of apps or sensors were they using to measure all that.
CB: That’s a good question. I haven’t been able to dig in deeper, but that was one of my questions: what was their baseline, were they comparing to people several months ago or not? They did talk about where the data is from; what was the company, there’s a startup company that’s actually gathering the data.
CB: So that’s what I’m very curious about, and I tweeted about it, because one of the things that bothered me when they said DC got an A+ and Wyoming got an F was that I thought they ignored the social aspects of these different areas. I’m from Idaho, and I actually lived in Wyoming for a little bit, and that’s where my mother is right now. When I was in DC, right off the Metro there was a Harris Teeter, a grocery store, and I would just pick things up when I needed to, which was pretty frequent, until COVID happened, and then I was like, okay, I’ll just go weekly. But in Wyoming and other more rural states, you live so far away that, even without the COVID situation, you wouldn’t be going to the store every other day; you’d be going once a week or every two weeks. When I was growing up, we made our monthly trip to Costco: you got the biggest car you could, sometimes you got a trailer, you went to [inaudible 00:17:34], for reference, a three hours’ drive away, and you loaded in like $1,000 worth of stuff that was supposed to hold you over until the next time.
JS: The next month, yeah.
GM: And Jon, I would add to that, too: besides the [inaudible 00:17:53] issue that Claire mentioned, there’s also the fact that when you’re analyzing these data, whether from that startup or other companies, you usually know the location and the unique user, but you don’t usually know which app is the one recording it, and each app records it in different ways and at different intervals. So we just have all these locations from the apps on someone’s phone, maybe just one app, maybe 10 apps because you pressed yes 10 times, once for every app you downloaded. Some people are super users, and you know exactly where they are every minute; some people you see once a day or once a week because they didn’t open that app all week. So there are definitely data quality issues there.
JS: Right. So can you two maybe talk about a project at Urban where this issue became really acute? As someone who’s done research for a long time with the ACS or the CPS, a lot of the Census Bureau data haven’t really had these particular issues, because the Census Bureau tries to address the security and privacy questions. Graham, as you’ve built out your team at Urban and introduced some of these other datasets, is there an example you can give where privacy and security issues came up that a more senior researcher may not have thought hard about?
GM: Yeah. So I mean, there’s a ton of examples here, and one of the reasons we brought on Claire was so we could get better at this.
JS: Yeah, right.
GM: Okay, it’s not like I can master all these problems at once and figure them out myself, so part of it is that Claire, at some point in the future, we haven’t decided when, she just recently started a few months ago, is going to [inaudible 00:19:36] to help disseminate these practices throughout Urban. I don’t want to commit her to exactly when that’s going to happen, but she’s interested in doing it soon. Generally speaking, I think right now we’re in the phase of trying to make people aware of the issue, especially when it’s related to deidentification. The most common use case where I think this can go wrong, and what we’re talking with a bunch of government agencies and internal researchers about, is this: whatever data collection effort you’ve gone through, when you want to deidentify the data and release it publicly, or release it to other researchers via an archive, what’s the risk of that data being reidentified? In other words, you have all this really personal information that people might not want shared; in the file it’s just a random individual, you don’t know who that is, but then someone uses a few pieces of information, like a zip code or especially a phone number, and all of a sudden the individual is reidentified. Birth date, age, location, things like that are really sensitive.
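One rough way to gauge the reidentification risk Graham describes is to count how many records in a "deidentified" file are unique on a handful of quasi-identifiers, since any unique combination is a candidate for linkage against an outside dataset. A minimal sketch (the records below are made up):

```python
# Count records that are unique on quasi-identifiers (a simple
# k-anonymity-style check): any group of size 1 is easy to link
# to an external dataset that shares these fields.
from collections import Counter

records = [
    {"zip": "20037", "birth_year": 1984, "sex": "F"},
    {"zip": "20037", "birth_year": 1984, "sex": "F"},
    {"zip": "20037", "birth_year": 1972, "sex": "M"},  # unique combination
    {"zip": "87501", "birth_year": 1990, "sex": "F"},  # unique combination
    {"zip": "87501", "birth_year": 1990, "sex": "M"},  # unique combination
]

quasi = ("zip", "birth_year", "sex")
groups = Counter(tuple(r[q] for q in quasi) for r in records)
unique = sum(1 for count in groups.values() if count == 1)

print(f"{unique} of {len(records)} records are unique on {quasi}")
```

On real microdata, even three or four such fields typically leave a large share of records unique, which is why releases aggregate, coarsen, swap, or synthesize before publishing.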
So the point is, how can we do that better? Claire and I, and I’ll let her talk in a second, have talked with a bunch of people who are releasing data both internally and externally. One great example, alongside the decennial census, is that we’re working directly with the Internal Revenue Service, the IRS, on how to release their data in a way that’s useful for researchers. Right now they release a public use file every year, and over the last few years, this is public information, it’s gotten less and less useful, because they’ve had to make more and more draconian privacy cuts to that dataset. So there’s this issue of how we best protect that dataset in a way that works for the research users but at the same time protects everyone’s individual privacy, because we don’t want to share that information publicly with anybody.
CB: It’s my time to jump in, all right. It’s kind of harder to tell [inaudible 00:21:37], it’s like, wave my hand. So one of the issues Graham was alluding to, and one of what I would call the top five problems in data privacy, is accessibility. Even though the field has existed for decades, there hasn’t been a lot of communication; I guess the motivation only became prevalent with computers coming along. When I say data privacy has existed for a while, it’s existed more formally since the 1960s or so, when the first more technical papers were coming out about what we should do with data like the decennial census. Right now the field is very scattered. Some people say it originated in computer science, some people say statistics, but it’s beyond just that: computer science, statistics, economics, social science, and the list goes on, because they all want to access data in some form, or they have data they want to share with others. The accessibility issue is that the work of those who specialize in releasing these datasets, or in the latest techniques for analyzing data that’s been more coarsely aggregated because of privacy concerns, and there are techniques to get better estimates from such data, tends to be very technical. The average user, or anybody interested in applying it to their own data or in using the data, can’t quite decipher it all. And even then, some people think, okay, there must be software packages, because right now we have a lot of great open source languages, why don’t we use those? There are a couple of existing packages, but they are either very specialized or not well known or well used.
So even on the computational front, it’s a little tricky. Sometimes the packages make assumptions; one package I found assumes you have a GPU, or access to a computer with a GPU, which not every person is going to have, especially when the one they said they used was like $3,000 to $5,000 just for the component. That’s very unrealistic even for a government agency, because there’s red tape everywhere when you’re trying to figure out, oh, I want to buy this, and your supervisor says, why. That’s definitely a big, big hurdle, and something I’m actively trying to work on more is creating better communication materials about what the latest techniques are and how people get access to them, or even just deciphering some of those technical papers a bit better for the average person.
GM: Can I frame this to [inaudible 00:24:15] for a second, do you mind?
JS: Yeah, of course.
GM: It’s like, the real issue here is that, as Claire explains pretty well in her blog post, the usefulness of the data, for us and for others to use to help inform policymakers and improve people’s lives, and the privacy of the folks who don’t want to be revealed are directly at odds. So what we’re trying to do is ensure we make the tradeoffs so that in the areas where the data is super needed publicly, we reveal a little bit more and the data is more useful, and on the other hand, where we really don’t need it, we give privacy back to the individuals and make sure it’s super secure. One of the reasons I’m really excited to have Claire, why we added her to the team and why we have so much work coming up with her, is that this is a policy discussion, not a computer science discussion, and where data privacy has lived is in the computer science field. They’re saying, here’s how you protect privacy, protect privacy like this, and it’s a really stringent standard for doing so, and we sort of give up all the usefulness. Now we need to actually have a conversation with both people in the room: I want to protect the privacy, great; here’s the usefulness of the data, great; how do we have that conversation when there are terms like epsilon, and no one really understands the mathematical definition of differential privacy, or what it means when you increase one versus the other for a dataset? That’s where I see a lot of these academic, wonky debates that have popped up over the last year, with folks from [inaudible 00:25:56] or the Census or others sort of saying, we’re doing it the right way, no, you’re doing it the wrong way, you should do it this way. And I think we want to have a much more productive discussion than that, because it’s really a valuable one to have.
CB: It’s like a food fight right now. Nicely, I guess.
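For readers who haven’t met the epsilon Graham mentions: it is differential privacy’s privacy-loss parameter, and the classic Laplace mechanism makes the usefulness-versus-privacy tradeoff concrete by adding noise with scale sensitivity/epsilon to a released statistic. A minimal sketch, not the Census Bureau’s or the IRS’s actual mechanism, with an invented count:

```python
# Laplace mechanism sketch: smaller epsilon = more privacy = noisier,
# less useful answers. The true count here is made up for illustration.
import numpy as np

rng = np.random.default_rng(0)

true_count = 1_234   # e.g., people with some attribute in a zip code
sensitivity = 1      # one person changes a count by at most 1

def laplace_release(value, epsilon, size=10_000):
    """Release `value` with Laplace noise of scale sensitivity/epsilon."""
    scale = sensitivity / epsilon
    return value + rng.laplace(0.0, scale, size=size)

strict = laplace_release(true_count, epsilon=0.1)  # more private
loose = laplace_release(true_count, epsilon=5.0)   # more useful

# Average absolute error is roughly sensitivity/epsilon, so the strict
# setting pays for its extra privacy in accuracy.
print(np.abs(strict - true_count).mean())  # around 10
print(np.abs(loose - true_count).mean())   # around 0.2
```

The policy conversation Graham describes is exactly about choosing epsilon: the math only exposes the dial, it cannot say where to set it for a given dataset.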
JS: So the easy answer – well, not the easy answer, but the easy next stage is, oh, let’s get everybody in a room and have these conversations. But that’s the answer for everything: let’s bring both parties together and have a conversation. So looking beyond just having these conversations, what’s the next step in getting these two sides, the computer science side and the research side, together, beyond just saying they need to talk it out? Let me rephrase the question. If you two were in charge of all federal data and privacy in the United States, what is the ideal policy framework you would put into place?
CB: Oh my gosh, that is a very good question.
JS: Well, that gives you a lot of power.
CB: Oh my god, yes…
GM: Starting with now and maybe we’ll work up to that big answer.
CB: Yeah, I know, that’s [inaudible 00:27:12]. Well, actually, one of the first things I was doing while I was in DC, just starting the job, was to start all those conversations with people, talking both to the computer scientists and to the Census, the IRS, the housing department, trying to get a feel for what their data needs are and why they’re trying to release data to other people, but then also talking with the data users, internally at Urban and externally as well, about what their needs are. Because of that, I got inspired to do this data communication initiative. So I’m currently in the process of submitting a proposal to make a series of written and computational communication materials targeted based on these kinds of meetings, and to bring people together and solicit feedback from them frequently, because even though we gather great information from all these meetings, figuring out what people want, or what they want to release, is not always perfect; sometimes you make poor assumptions about what the data users want, or about what the Census or the IRS or other government agencies actually want to release. And so we just try to have that open dialogue.
Now, hopefully, better communication among everybody will help release more data, because there are, frankly, a lot of great and valuable datasets that aren’t being released that could help inform policymakers. And, I guess, the next step after that is, once we have this, getting more researchers to analyze that data, and then seeing how we can bring these analyses up to the policymakers to make better decisions. And for the big, big, big picture, you said, if we had control of everything: getting all the different government agencies to accept these better practices is actually harder than people realize. I don’t know how many people here have worked with or in government, but it can be very challenging at times to adopt new practices. And so, like, if I had that, I don’t know, that stamp…
JS: Magic wand.
CB: Yeah, magic wand would be like, you guys all accept us now, all right.
JS: Magic wand, right.
GM: To your point about what the next step would be: okay, we could just get people together all day, but that’s expensive and hard. Here’s the state of the field now: there are probably 10 to 15 people, Claire can name every single one of them off the top of her head, who have enough knowledge and expertise in this field to be able to bridge that gap and help those conversations happen. That’s one of the reasons we hired her, and it was a long hiring process, and I think a staffer who just asked a question helped us with that, [inaudible 00:30:01] also, so thank you. What I think is so important is to recognize that it’s such a small field, and look, we’re not only hearing from federal policymakers or from researchers here at Urban. We’ve been talking to local cities who say, we want to keep our city accountable by releasing local data, but we’re worried about releasing it, because, like, we’re [inaudible 00:30:27] equity promise, we want to show that we’re making progress toward hiring more people of color, but releasing that data is sensitive. So how do you [inaudible 00:30:37], and so, they’re worried about it. So there are a ton more people in this world who need to broker these conversations than the 10 or 15 people like Claire.
So I think our first goal is, how do we get that second group the information they need to quickly understand the problems and tradeoffs and be able to make those conversations happen much more broadly? Or, how do we get the government analysts, or the agency heads, you know, there are [inaudible 00:30:59] data officers in every government agency right now, and they’re [inaudible 00:31:03] popping up at the city level, how do we at least get them to understand the tradeoffs? So how do we, as Claire’s saying, produce communications materials? We’re partnering with folks from the Future of Privacy Forum and others who are real experts in this field and talk to policymakers all the time: how do we get them educated on how to talk about these tradeoffs? And so, I think producing those materials, and producing some of these open source programming packages in R and Python, so you can imagine a playbook or a quick checklist, things like that would help us get that next level involved. We’re not going to get everyone involved, but hopefully we can expand the field beyond just the few people that Claire knows.
JS: Right. Do you think it’s more important, or maybe it’s equally important, that this comes from the analyst level up or from the head of an agency down? In some ways, it’s easy to say from the top down, because an agency head says, we’re going to be doing this, and then that’s the rule. But I wonder if that’s actually not how the world works. And you’ve already talked about all the different people that you’re working with, so when you look at the landscape of people out there doing this sort of work, do you think about the analyst level first, or do you think about the CDO or the agency head, or is it just everybody on the ladder?
GM: Yeah, I’ll let you answer, Claire, [inaudible 00:32:32] super high-level answer, which is: the high-level people are like, I am super worried about privacy, I want to make this tradeoff in a responsible way, I want to keep this data out there, how do I do it? The analyst is like, man, this is way over my paygrade, I’m not a privacy researcher.
GM: So, this is [inaudible 00:32:49] and we want to help the analysts have the tools, and help the chief data officer have the one- or two-pager, like, here’s what you need to do, here’s how you say yes, and here’s how you empower your analysts. But as you know, different audiences, different products.
JS: Yeah, right.
GM: But I think they both have their own specific problem. Claire’s talked to more of them though, I will let her…
CB: I was going to say it’s definitely throughout that ladder. I’ll give an example: I was working at the Census Bureau early on when they were thinking about switching over to differential privacy. And it did come from higher-ups saying, this is what we’re going to do. And when I was working in the Center for Survey Research and Methodology, a lot of people were just like, what is epsilon, I have no idea. They were all very confused, and the bureau brought in these big-name researchers to give talks. And I just remember my supervisor and I kept staring at each other in this one talk that was supposed to help educate the group on differential privacy, and everyone was confused. He had this look of WTF on his face, and afterwards he got a private chat with me and said, Claire, you saw my face, right? I said, yes. He said, do you think I understood what was going on? No, I don’t think so. Okay, then you need to do better, because I’m going to make you give a talk on this. So anyway, that’s my story about how we do need to think about everybody in between, because if you don’t have the people on the ground floor, the ones working with the data and analyzing it, on board, then it’s going to be much harder to move the organization.
JS: Right. Yeah, that’s great. So we have a question from Grecia.
Grecia: Thank you. Hi. Thanks so much for doing this. My question relates to governments, specifically local city governments. Do you think they would be more likely to adopt these kinds of best practices, or, Graham, I think you mentioned a playbook kind of thing, if more of the big players started adopting them? So, for example, if you somehow convinced the US Census Bureau or the IRS or the City of New York or the City of LA to adopt them, would other people feel the pressure, or be more willing to try it out, since it’s kind of been standardized because these big players adopted it?
GM: That’s such a good question. I love that question. So yes, I definitely think you’re right. I think right now the fact of the matter is everyone’s looking at census.
CB: They are.
GM: For better or worse, that’s our first big picture example of what’s going to happen. And I think people are sort of holding their breath and waiting, Claire and I and Rob Santos who’s our chief methodologists at Urban have tons of opinions about what census has done well and then let me say diplomatically what they could have done better about this differentially private release. And I think people are waiting to see, like, what are those lessons learned, and might I apply that as a local government. I also think, frankly, when we’ve been talking to local governments, that’s not just like your LA or New York, folks like Austin or Kansas City are also interested in this stuff from my conversations with them. And I think they are really interested in it, it’s like give me a tool, a specific tool that I can apply level of interest. So like, whereas your folks at the IRS or HUD or census might be much more willing to say like, well, my data is super special, maybe we have a little bit of a budget to try and work through these issues of privacy and security versus usefulness, Kansas City or someone might be like, as you said, the next level after that once you’ve seen a few models and you’re like, well, this one is the right one for me because they use this type of data sort of thing. So I definitely agree with you that it’s going to cascade or waterfall down, but census right now is at the top of that waterfall, metaphorically, and we’re all sort of just waiting to see what happens. Claire, I don’t know if you’d like to add to that.
CB: The only thing I’m going to add is that I realize we haven’t talked about the fact that a lot of places, from the federal level all the way down to local government, actually don’t know about privacy issues either. There are some who think that if you just remove the personally identifiable information, that’s sufficient. But again, we have so much extra data out there, social media, and what I call innocent pieces of information that people don’t think about, which can be used against them because they have enough information to link them to a dataset. If people want a reference, I think the classic example, from 2008, is the Netflix Prize. For those who don’t know, Netflix released a dataset and was going to give $1 million to researchers who could improve their recommendation system; I think 10% improvement was the threshold. And one research group, instead of trying to improve the recommendation system, tried to see if they could identify the people in the dataset, and they could, because of the IMDb database. They were able to use that to find people. And this is back in 2008, so imagine what we can do now with much more powerful computers and much more information, because social media was only just emerging then; that was when MySpace was still a thing. Imagine what you could do now. And with the Netflix Prize dataset, some people say, I don’t care if people know what I like on Netflix.
GM: [inaudible 00:38:37] connect the dots on why social media is a good source of linking data, and why that’s valuable.
GM: Claire, can you just describe that, like, why is social media a problem when you’re talking about reidentifying?
CB: Oh yeah, so to expand a bit, Facebook is a really good one, because a lot of people will put down, oh, I was an alum of this university or high school, this is the year I graduated, and a lot of people actually have their date of birth on Facebook. And so you can link those to other datasets. Maybe a dataset doesn’t have the year of your birth, but it has the month and day, because they thought, oh, it’s fine, nobody needs to know. Or say there’s an education dataset tracking people through time, and on your Facebook feed you said, oh, I graduated from this place. Or, say, a workforce dataset: some people put down, and again, I’m picking on Facebook because it’s a really easy one, oh, this is the first place I worked, and now I work at this other location. Or, there was a case of a woman from Harvard who was able to link people posting about having an accident or something like that, like, oh, so-and-so went to the hospital, and you can link them to a healthcare dataset.
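To make the kind of linkage Claire describes concrete, here is a minimal Python sketch. All names and records are invented, and the quasi-identifier fields (zip, birth date, employer) are just illustrative; real attacks use the same join logic at much larger scale.

```python
# Toy illustration of quasi-identifier linkage: a "de-identified"
# dataset with direct identifiers removed can still be joined to a
# public profile dataset on attributes both share. All records invented.

# De-identified records: names removed, quasi-identifiers kept.
deidentified = [
    {"zip": "83467", "birth": "04-12", "employer": "Acme Co", "diagnosis": "asthma"},
    {"zip": "83467", "birth": "09-30", "employer": "Beta LLC", "diagnosis": "flu"},
]

# Public profiles (e.g., scraped from a social network) with the same
# quasi-identifiers plus a name.
profiles = [
    {"name": "A. Example", "zip": "83467", "birth": "04-12", "employer": "Acme Co"},
    {"name": "B. Sample", "zip": "90210", "birth": "01-01", "employer": "Gamma Inc"},
]

def link(records, profiles, keys=("zip", "birth", "employer")):
    """Re-identify records whose quasi-identifiers match exactly one profile."""
    matches = []
    for rec in records:
        hits = [p for p in profiles if all(p[k] == rec[k] for k in keys)]
        if len(hits) == 1:  # a unique match is a re-identification
            matches.append((hits[0]["name"], rec["diagnosis"]))
    return matches

print(link(deidentified, profiles))  # → [('A. Example', 'asthma')]
```

The point is that no name ever appears in the “de-identified” data; the unique combination of shared attributes is what gives the person away.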
GM: And depending on the social media platform, people don’t realize: you just take a picture of your pet in your home, which is a very popular social media activity, as we all know; why else do you go on social media, other than, I guess, to hear your relatives rant about politics? So you just take a picture, and a lot of our phones are geo-enabled by default, and the metadata of most of our images includes a geolocation. So now I know the block or zip code in which you live, which is super valuable for linking you across datasets, because your name may not be unique across the United States, but it’s probably pretty unique in your zip code, in your block.
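Graham’s point about image metadata comes down to a small piece of arithmetic. Extracting the EXIF GPSInfo tag from a photo normally requires a library such as Pillow; the coordinates below are hypothetical, and this sketch only shows the standard conversion from the stored (degrees, minutes, seconds) form to decimal degrees:

```python
# EXIF GPS coordinates are stored as (degrees, minutes, seconds)
# values plus a hemisphere reference ("N"/"S", "E"/"W"). Converting
# them to decimal degrees is simple arithmetic; pulling the GPSInfo
# tag out of a photo is normally done with a library such as Pillow.

def dms_to_decimal(dms, ref):
    """Convert (degrees, minutes, seconds) plus hemisphere to decimal degrees."""
    degrees, minutes, seconds = dms
    value = degrees + minutes / 60 + seconds / 3600
    return -value if ref in ("S", "W") else value

# Hypothetical values, as a library might return them for a photo.
lat = dms_to_decimal((38, 54, 36.0), "N")
lon = dms_to_decimal((77, 2, 24.0), "W")
print(round(lat, 4), round(lon, 4))  # → 38.91 -77.04
```

A coordinate at roughly this precision is enough to place a photo on a specific block, which is exactly the linkage risk Graham describes.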
JS: Yeah. I’m poking around my office; I’m in the middle of a book on algorithms and I can’t find it right now, but the introductory chapter is about Boston releasing some health data and saying, oh, it’s secure, and a reporter was like, well, let’s see. I think whoever released it, it might have been the governor or the mayor, said it was completely secure, and the reporter was able to track his specific address and location from the health records.
CB: It was actually Latanya Sweeney; she’s now a full professor at Harvard. She did this as her graduate student project, because she was at MIT at the time. And so, yeah, the Governor of Massachusetts said, we’re going to release all this data, and it’s useful; it was the federal, excuse me, not federal, the statewide employee dataset. And he said, we removed all the personally identifiable information, it’s perfectly fine. And she record-linked it to voter data and sent an envelope of all his personal healthcare records directly to his office. I was like, wow, that is…
JS: Yeah, that’s nicely done. So there are a few questions in the chat box. There’s one, I guess I’ll just read it quickly: do either of you have recommendations for secure cloud storage to use if there’s PII in a dataset? Elizabeth wrote in; she has a small organization, they don’t have internal servers, so they rely on cloud storage. I don’t know if you have thoughts on that, either of you.
GM: So we do, actually. I manage a lot of our cloud infrastructure along with our DevOps team here at Urban, and we have an IT department of over 30 people, so we’re in a privileged position; I definitely recognize that. There’s nothing inherently insecure about the cloud, but you do have to be really careful about how you architect the systems. So we have protocols around checks and security logs and things like that that you need to have in place, just like you would in your on-premise environment. And then you need to understand how data are being transferred between your organization and the cloud, because a lot of organizations have virtual private clouds that are connected directly to their organization’s network; in AWS speak, it’s called Direct Connect, a direct line between you and the cloud where no other internet traffic is going over that line. And if you are transferring data that’s not over a line like that, you need to ensure that it’s encrypted in transit and encrypted at rest. Most of the monitoring tools will check that, but what you really want to ensure is that the system is built right. We have peer or friend organizations that we’re on good terms with, where I and our DevOps team will just consult with them for a few hours to say, no, that’s not the way you should set it up, you should set it up this way. Just to be nice; it’s not Urban’s business model or anything, we just want to see people succeed.
But at minimum, you do want to make sure you have encryption at rest and encryption in transit, and if there’s any data governed by a federal law such as HIPAA or FERPA or any of these health, education, or other privacy laws, you need to ensure that you’re storing it in a system that is built for that, and that any data transferred to and from that system is also following those regulations.
The good news is a lot of the cloud providers do have that by default. You can access a list for any cloud provider; you can just Google “AWS HIPAA certified services” or “Azure FERPA certified services,” and they will give you a full list of everything that’s certified and available to use. So that’s a real benefit; it’s there out of the box, and it’s better than on-premise, because you can sort of just build it out. But generally speaking, you do want to have somebody, you know, we have five certified solutions architects on staff, but you probably want at least one person who knows the cloud really well and knows how you’re working with data in your organization to at least double-check that the system is secure, whether you get a consultant for a few hours or some other method.
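As one illustration of the encryption-in-transit point, this is a common AWS S3 bucket policy pattern that denies any request not made over TLS. The bucket name is a placeholder, and your provider’s documentation should be the authority on the exact policy your situation needs:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyUnencryptedTransport",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::example-bucket",
        "arn:aws:s3:::example-bucket/*"
      ],
      "Condition": {
        "Bool": { "aws:SecureTransport": "false" }
      }
    }
  ]
}
```

Attaching a policy like this means even a misconfigured client cannot move the data over an unencrypted connection; encryption at rest is configured separately, as a bucket default.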
JS: Great. I hope that was helpful. So, a couple of other questions. This one from Daniel; I like this question: do you expect the public’s tolerance for privacy to change given the current COVID-19 pandemic? After 9/11 we had the Patriot Act; at the time, in the immediate aftermath, we seemed willing to give up some of our privacy for security, and that has kind of waned, depending on who’s trying to sell the message. But it’s a good question whether you think people’s tolerance for privacy is going to change given the current pandemic. As an example, and this isn’t in the question, but as an example that I saw, DC is publishing the gender and age of everyone who’s identified as infected with COVID. So I wonder whether you think things will change in one direction or another once we’re through this moment in time.
GM: Claire do you want to take it first?
CB: Yeah, I’ll try to take a stab at it. There’s not a good answer, I think. The example I think of most prominently is that when the Facebook and Cambridge Analytica scandal came out, people were very upset about it. And then they tried to make some changes, but obviously things didn’t change a whole lot, other than that people can now reference it and say, this is an issue. I can see it going both ways. I can see people relaxing, because, for instance, that metadata is really useful for figuring out not just a COVID pandemic but also other emergency responses, such as if we get hit by hurricanes: trying to figure out where people are, where the choke points in our road systems are to get people out, and things like that. But then I can also see the other way: what if all this data being released, you said the gender and age of people who’ve been infected, gets targeted by insurance companies saying, hey, you had COVID before, so your lungs and your respiratory system are maybe more compromised now, and we’re going to raise your rates? Should that be fair? No. But that’s what insurance companies will do. There have been cases where, if they know that you’re smoking, it isn’t just your health insurance that goes up; your car insurance goes up too, because they think you’re more likely to pass away. So I guess that wasn’t a very precise answer one way or the other, but those are some of the implications.
GM: And I will just add one thing. I was in a sort of closed-door session, where I can’t name the funder or the organization, where they’re doing market research with average, everyday Americans on data usage and data privacy. And one concept was, how can we better market integrated data systems, like merging your healthcare data with your criminal justice data to do good in the world? Because there are a lot of systems around the country that are trying to merge datasets across fields to help homeless people better access multiple services, instead of each provider treating them like they’ve never seen them before. Right, so that is super valuable, and the summary I took away from that conversation is that people are okay with it if you tell them exactly how the data is being used and they’re okay with that use. So to Claire’s point, if an insurance company could go use this, people’s default view, on average, is going to be, well, someone’s going to do something bad with this data, so it’s not going to be good for me. And I can’t blame them; I’m probably on the same page with them. I must [inaudible 00:47:57].
But when you tell them, hey, this is going to be used so we can better teach the kids in your school how to do X, or better support the kids who are having trouble at home, or help the firefighters better prioritize where to go, in those sorts of cases where there’s a clear good and we’re using the data to do this specific thing, that’s what resonated most with most people: there was actually something concrete and good that was going to happen with the data. And I think if we had some protections and ideas built around that, it would help us in the long run. But I am a little bit on Claire’s side; I think I am a bit pessimistic that if we do relax it, there will be a lot of those struggles with the insurance companies.
CB: Actually, there was a paper I read where they surveyed a bunch of people about what they thought about privacy, and the general consensus was that people were okay, like Graham said, as long as you told them what the data was going to be used for, and as long as whoever collected the data did the best they could to protect it and was open about the process. Because one of the theoretical questions was, what if the data got broken into? And people were more okay if it got broken into after it had been password-protected and encrypted and all these other things. They’re like, oh, it happens, they tried their best, and it was for a greater good or something like that.
JS: Right. We’re almost out of time here, but Sarah has a question that I thought was really interesting. She unmuted, so Sarah, go ahead.
Sarah: Yeah, this kind of picks up on what Graham was just saying, but I was wondering if there’s work that folks have done on the equity dynamics of reidentification risk. Like, are different groups more or less vulnerable to being reidentified, and, given that, are different groups more likely to see harm or benefit from the identification?
GM: Claire, I don’t know of any that have taken an explicit equity focus, but that could just be me not knowing the literature as well; do you know of any?
CB: No, not specifically, other than what we call the small population problem. It’s one of what I’d call the top five challenges of data privacy, because on one hand you need that finer-grained, detailed data to have more targeted benefits. For instance, I’m writing a proposal with somebody in our Metro Center on trying to get access to employee and employer data to be better about helping rural communities do startups and help their small businesses flourish. However, at such a fine-grained level of geographic and demographic detail, the data become very identifying. For example, I come from Salmon, Idaho, and I talk about this in my blog: there are 3,000 people there, and I was the only Asian American high schooler. So people could definitely find me there. So unfortunately, not to my knowledge are there any papers looking into the equity side; it’s a very important issue. I like to tell people that there are so many cool and interesting problems that we really need to address and work on, but there are not enough Claires to work on them.
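The small population problem Claire describes can be sketched in a few lines of Python: count how many records share each combination of quasi-identifiers (the k-anonymity view of a dataset), and any record in a group of size one is uniquely identifying. The records below are invented:

```python
from collections import Counter

# Toy illustration of the small population problem: count how many
# records share each (zip, race, age band) combination. A record whose
# combination occurs only once (k = 1) is uniquely identifying.
records = [
    ("83467", "Asian", "14-18"),   # the only Asian American teenager in town
    ("83467", "White", "14-18"),
    ("83467", "White", "14-18"),
    ("83467", "White", "45-54"),
    ("83467", "White", "45-54"),
]

group_sizes = Counter(records)
k = min(group_sizes.values())  # the dataset's k-anonymity level
unique = [r for r, n in group_sizes.items() if n == 1]

print(k)       # → 1
print(unique)  # → [('83467', 'Asian', '14-18')]
```

Here the single Asian American teenager in a small zip code forms a group of one, which is exactly the situation Claire describes from Salmon, Idaho.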
GM: It’s a really open space. Why we haven’t seen any action on it is also a good question, and I think we should see more research in this field. I think part of the reason we haven’t, and maybe Census might be the first to do this, is, as you said, Claire, there’s the small population problem, and then there’s the non-response problem, the question of whether people are included in the data at all. In some ways we over-surveil people in criminal justice data, but in other datasets we often have to correct or oversample lower-income people or people of color or Latino people because they aren’t responding to the surveys at as high a rate, and that creates the exact same problem Claire was mentioning, in terms of representation risk. We were talking with the City of LA a couple of months ago about a spatial equity representation tool; essentially, it’s a tool we’re building where local governments give us point data on where they’re investing, and we can say, well, you’re over-investing in higher-income white neighborhoods. We use underlying census data to do that analysis, and the City of LA said, yeah, that’s a great tool, and we can see why we really need it. On the other hand, they said, we have a huge census non-response rate, one of the highest in the country, so why do I even trust your equity analysis? There are so many people who aren’t included in the data that we treat as ground truth today that I question anything you do with that data. So I think that’s an interesting larger question to consider.
JS: Yeah. I don’t know if I can call it research, but in Cathy O’Neil’s book, Weapons of Math Destruction, she has a lot of examples of these. I don’t know if I’d call that research per se; I mean, she did research to write the book, but probably not in the sense we were just talking about. So I think we have just a couple of minutes, and I want to let Daniel come back in on this question about whether people’s perspective on data security and privacy will change. He had a follow-up question. Daniel, go ahead.
Daniel: Hello, can you hear me?
Daniel: Okay, great. Hey Claire, hey Graham. So I wanted to push back a little bit on your comment, Claire, because while the surveys are nice, I haven’t really seen a legislative or popular pushback, and I was curious whether you thought the light punishment for the Equifax hack, or the lack of repercussions for Facebook over the Cambridge Analytica scandal, reinforces the idea that the public perhaps doesn’t care, or doesn’t understand privacy sufficiently to do something about it. Do you have examples of companies being properly punished, or of organizations that are really promoting legislative change?
CB: That’s a good question. This is a little bit outside of my expertise; it’s more that I’m aware of these things because they are tangentially related to my research. In terms of companies being directly punished, I can’t think of any off the top of my head, but I know that after things like what happened with Facebook, other companies responded in their own practices, either because they were afraid users wouldn’t trust them anymore and would stop using their products, or because, and I shouldn’t say it like this, but some of them do care and try to be good corporate citizens. So for instance, right after the Cambridge Analytica scandal, and actually that incident happened two days before I defended my thesis, which was very interesting and timely, when I went out on the job market there were a lot more positions for data privacy specialists, and I think that was companies [inaudible 00:55:14] going, hey, we don’t want this to happen at our company, we don’t want to be under the scrutiny of Congress, we need to respond and be better citizens. On the policy side, if you’re interested in that area, you should look at the Future of Privacy Forum. They are a nonprofit who very specifically talk to Congressmen and women about enacting better laws. They were following Washington State closely when it was trying to enact privacy legislation [inaudible 00:55:43]; eventually that fell through, so they’re going to push again to try to get better privacy laws. I’m trying not to [inaudible 00:55:50] with the time, but there are other privacy issues too, such as when you go shopping and they ask, oh, do you want to be part of our newsletter? That’s actually a privacy risk, because they are indirectly pushing you onto their mailing list. So, yeah.
JS: Wow, interesting.
GM: And I will just make a comment [inaudible 00:56:10] GDPR has actually forced a lot of innovation in this area. I hate to say that, because people are really anti-GDPR most of the time, but I think it has, and companies are basically trying to do this. Social Science One is an example: Facebook put out a bunch of public data on Facebook shares, and it took them two years to build a private system that researchers can access. And if you look at their public blog on that, their lawyers almost didn’t release the data because they were so worried about GDPR fines and penalties. So I think there really has been a shift [inaudible 00:56:45] some of these private companies, like LinkedIn and others, really better protecting people’s data.
JS: Yeah. We started this hour by saying, what are we going to talk about, I don’t know, we’ll just kind of [inaudible 00:56:57] and figure it out, and we just hit the hour mark, and there’s more we could talk about, I’m sure. And I’m sorry to the folks who sent in questions we didn’t get to; I’m sure everybody has another Zoom call to get to, so I’ll just say thanks, everyone, for tuning in today, and a special thanks to Graham and Claire for chatting; this was really cool. I put up the link to the rest of this week’s lineup: tomorrow, two more Urban colleagues of mine, Rob Santos, who’s on this call, I see, [inaudible 00:57:27] and Diana Elliott, will be talking about their project from last year on the 2020 census and what COVID will mean for the counts that are going on right now. And then more great folks coming up the rest of the week. So thanks, everyone, for tuning in; I appreciate it. Keep in touch, and let me know who else you’d like to hear from on these chats. Thanks, everybody, have a good one, stay safe, stay healthy. Take care.
Thanks everyone for tuning in to this week’s episode. I hope you enjoyed that. I hope you found it interesting. I am going to post some more of those discussions in the coming weeks that we had on the digital discussion series. If you’re interested in hearing more of them or even seeing more of them, you can head over to the Urban Events page where we have posted recordings of nearly all of those talks on that page. So you can go watch them over there at the Urban Events page that’ll link you over to the Urban YouTube channel. So I hope everyone is well and safe and healthy. So until next time, this has been the PolicyViz podcast. Thanks so much for listening.