This post was originally published on the Urban Institute’s blog, Urban Wire on December 17, 2015.
Earlier this week, I wrote about how new and nontraditional forms of data hold promise for public policy researchers, but also raise a lot of new questions. I want to now skip ahead a bit and posit the following: In a world where researchers understand how to access, download, and analyze new data sources, are they prepared to do so responsibly and ethically?
While data breaches at your favorite store or online dating website are now almost commonplace, researchers have worried about data privacy for some time. Anyone using administrative data from the Internal Revenue Service or Social Security Administration, for example, must go through extensive training to access the data and often must use it in secure buildings on local networks. The Census Bureau takes data privacy very seriously because without it, people might be less likely to answer surveys (other issues of nonresponse and imputation notwithstanding).
But say a researcher matches credit card transaction data to administrative earnings records or data on participation in some government program. These data might initially be considered “big data,” but if the researcher merges the datasets and then focuses on a specific, small group, the researcher or others may easily identify individuals. Anonymity is fundamental to ethical research, but are researchers prepared to recognize these issues with such data? Or will the excitement of using new, real-time data overwhelm the requirements to be responsible?
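To make the re-identification risk concrete, here is a minimal sketch using entirely hypothetical data. Two datasets that look harmless on their own are merged on shared quasi-identifiers (ZIP code and age); once the analysis narrows to a small subgroup, a “cell” of size one uniquely pins a person to a program record:

```python
# Hypothetical illustration of re-identification risk: two "anonymized"
# datasets, each harmless alone, merged on quasi-identifiers (ZIP, age).
transactions = [
    {"zip": "20001", "age": 34, "monthly_spend": 2150},
    {"zip": "20001", "age": 61, "monthly_spend": 430},
    {"zip": "20002", "age": 34, "monthly_spend": 980},
]
program_records = [
    {"zip": "20001", "age": 34, "benefit": "housing_assistance"},
    {"zip": "20002", "age": 34, "benefit": "snap"},
]

# Merge the two datasets on the shared quasi-identifiers.
merged = [
    {**t, **p}
    for t in transactions
    for p in program_records
    if t["zip"] == p["zip"] and t["age"] == p["age"]
]

# Focus the analysis on a small group: a single ZIP code.
subgroup = [row for row in merged if row["zip"] == "20001"]
print(len(subgroup))  # a subgroup of size 1 is a uniquely identifiable person
```

Neither dataset alone names anyone, yet the merged subgroup ties one person's spending to their benefit receipt — exactly the kind of disclosure researchers must anticipate before combining sources.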
Furthermore, will researchers be prepared to use analytic techniques that are accurate and replicable? How many graduate programs in economics are offering lessons in machine learning (which can move the typical hypothesis-driven research approach to a data-driven approach)? I have seen a number of papers using, for example, Twitter data, where the researchers are themselves hand-coding and categorizing responses. Because such approaches are subject to error and can restrict researchers to small(er) sample sizes, this method is neither responsible nor replicable.
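One modest step toward replicability is to replace ad hoc hand-coding with a documented coding scheme that anyone can rerun on the full sample. The sketch below uses hypothetical categories and keyword lists (a real project might instead train a supervised classifier); the point is that the rule is explicit, deterministic, and auditable:

```python
# A minimal sketch of a documented, rerunnable coding scheme in place of
# hand-coding. The categories and keywords are hypothetical examples.
CODEBOOK = {
    "employment": {"job", "hiring", "unemployed", "wages"},
    "housing": {"rent", "eviction", "mortgage", "landlord"},
}

def code_tweet(text: str) -> str:
    """Assign a category by keyword match; deterministic and auditable."""
    tokens = set(text.lower().split())
    for category, keywords in CODEBOOK.items():
        if tokens & keywords:
            return category
    return "other"

tweets = [
    "Still unemployed after six months of applications",
    "My landlord just raised the rent again",
    "Great weather today",
]
labels = [code_tweet(t) for t in tweets]
print(labels)  # ['employment', 'housing', 'other']
```

Because the codebook is published alongside the results, another researcher can apply the identical rule to the same corpus and reproduce every label — something hand-coding cannot guarantee.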
As “big data” and nontraditional data become a larger part of the tools available to researchers, we must be prepared to answer difficult questions about data security and privacy, computing power, and analytic methodologies.
This is a very timely and insightful essay; these are significant points. Especially coming out of the Urban Institute, the call is for attention to the substance of data of any size. Data-driven means more than massive parallel processing, coding, and minimal visualization, even in R. Health care policy certainly uses big data for professional substantive decisions, and smart-city efforts do as well. True decision making, ethics, data privacy, and intellectual property need attention, and less so the sometimes vainglorious notions of “coding.” Policy analysis must be guided by substantive analysis as well as sound and viable techniques. Even the notions of volume, variety, velocity, and veracity need to be applied to the growth of big data applications themselves. Even major big data applications use semantic and graph triple stores rather than plain old relational or NoSQL databases, which is my cup of tea. Anyway, that’s my soapbox speech.