It’s not clear to me that public policy researchers agree on what the term “Big Data” means or how to use such data in their work. Take yesterday’s preconference “Big Data and Public Policy Workshop,” which preceded this year’s Association for Public Policy Analysis and Management (APPAM) conference. The workshop, I think, failed to foster these and other relevant questions and discussions about new forms of data in research.
What does “Big Data” mean to researchers? Are administrative data “Big Data”? What about linked administrative and survey data? How about Electronic Benefit Transfer (EBT) card data, which would consist of food transactions for the more than 46 million people receiving benefits? How do researchers get these data and how do they use them? Yesterday’s workshop missed an opportunity to provide researchers with tools, knowledge, and skills to tap into new and different data sources now available.
The workshop description starts with what I had hoped I would see:
In the social science context, the new data can potentially offer information for policy-makers that is much more current, granular and richer in environmental information than data produced by statistical agencies from surveys.
I was looking forward to seeing and engaging in discussions about how new data sources might affect the practice of science and research. Unfortunately, the workshop was closer to the last sentence of the description:
This JPAM workshop seeks to assess as well as showcase cutting edge empirical work in this vein.
While having a day dedicated to showcasing papers that use new kinds of data is perhaps valuable (though I’ll note that there is at least one session in the regular conference on “Big Data”), there could have been more discussion about what constitutes “Big Data,” how to get it, and how to use it.
Papers at yesterday’s workshop featured city-level administrative data on restaurants and education; education data from online universities; and a couple of months of Twitter data. Are those files considered “Big Data”? Perhaps for public policy research; perhaps not if you work at Google.
There was also very little time devoted to the process by which researchers can get, process, and analyze “Big Data.” I saw only one paper (there were two sets of concurrent sessions, so I couldn’t see everything) that discussed process and mentioned tools like R and Python and the importance of version control (which need not be unique to new data sources; tools like Github and Subversion should become part of every researcher’s toolbox). And issues of process are clearly important: Researchers on two papers were manually assessing their “Big Data” and conducting sentiment analysis one record at a time. Budget constraints and lack of knowledge seemed to curb the use of platforms like Amazon’s Mechanical Turk.
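To make the point concrete, here is a minimal sketch of automating that kind of sentiment scoring instead of coding records one at a time. The word lists below are hypothetical toys; real work would use a validated lexicon or a trained classifier.

```python
# Toy lexicon-based sentiment scoring (word lists are illustrative only).
POSITIVE = {"good", "great", "love", "excellent", "helpful"}
NEGATIVE = {"bad", "terrible", "hate", "poor", "useless"}

def sentiment_score(text):
    """Return (# positive words) minus (# negative words) in a string."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

# Score a batch of tweet-like records in one pass:
tweets = [
    "great program, very helpful staff",
    "terrible wait times and poor service",
    "the office opens at nine",
]
print([sentiment_score(t) for t in tweets])  # [2, -2, 0]
```

Even a crude script like this scales to millions of records, which is exactly where hand-coding breaks down.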
In the area of social and policy research, I think “new and nontraditional data” is perhaps a better phrase than “Big Data.” And yes, “new and nontraditional sources of data” (do you think “NANSOD” will take off?) can offer researchers different ways to explore important policy topics. But such data sources are not a panacea. We can’t move away from in-depth, sophisticated analysis simply because we have new, large data from new sources. Although the supply of and demand for these data continue to increase, there is reason to worry that researchers (and others) will use these data without grounding them in high-quality statistical analysis, theory, and previous work. Access to good data does not replace good analysis.
I wouldn’t really classify yesterday’s meeting as a workshop; it was more an extension of the regular conference with multiple papers presented in 75-minute sessions. Instead, I would have liked to see a series of presentations, panel discussions, hands-on tutorials, or other roundtables on topics to help researchers understand what is meant by “Big Data” and how they can use it in their own work. Here is a list of examples of topics that I think would benefit researchers:
- What is “Big Data”? What are new and nontraditional sources of data?
- What are the advantages and disadvantages of using large or new sources of data?
- How can we bridge different fields—such as economics, computer science, psychology, mathematics, and sociology—to use data in better ways?
- What are the technical constraints of using large data sources?
- What is an API, how does one work, and how can a researcher use one to extract data (with perhaps an applied exercise using Twitter or Facebook)?
- What does “data scraping” mean, and how can researchers do it?
- What are the advantages and disadvantages of open source software programs and packages?
- How does version control work? What are the best platforms or software tools, and how do they work (again, perhaps with an applied exercise using Github)?
- What tools can researchers use to extract, process, visualize, and publish their data and analysis? What are the advantages and disadvantages of those tools, especially for organizations that may have staffing or budgetary constraints?
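On the API question above, the core pattern is short enough to teach in an afternoon: request JSON over HTTP, then flatten it into rows for analysis. The endpoint and field names below are hypothetical; real APIs such as Twitter’s require registered credentials and have their own schemas and rate limits, so the demonstration here runs on a canned response.

```python
# Sketch of the API pattern: fetch JSON, then reshape it for analysis.
import json
import urllib.request

def fetch_json(url):
    """Download and decode a JSON response from an API endpoint."""
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))

def to_rows(payload):
    """Flatten a list of record dicts into (id, text) tuples."""
    return [(rec["id"], rec["text"]) for rec in payload["records"]]

# Offline demonstration with a canned response (no network needed):
sample = '{"records": [{"id": 1, "text": "first post"}, {"id": 2, "text": "second post"}]}'
print(to_rows(json.loads(sample)))  # [(1, 'first post'), (2, 'second post')]
```

A hands-on exercise could swap `sample` for a live `fetch_json` call against a real endpoint once participants have credentials.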
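Likewise, “data scraping” is less mysterious than it sounds: it means pulling structured values out of raw web pages. Here is a minimal sketch using only Python’s standard library; the HTML snippet is made up, and real scraping would fetch pages over HTTP and respect each site’s terms of service.

```python
# Sketch of scraping: extract table cells from raw HTML with the stdlib.
from html.parser import HTMLParser

class TableScraper(HTMLParser):
    """Collect the text of every <td> cell on a page."""
    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell and data.strip():
            self.cells.append(data.strip())

# A made-up page fragment standing in for a downloaded city data page:
page = "<table><tr><td>School A</td><td>92%</td></tr><tr><td>School B</td><td>85%</td></tr></table>"
scraper = TableScraper()
scraper.feed(page)
print(scraper.cells)  # ['School A', '92%', 'School B', '85%']
```

Dedicated libraries make this easier, but even the bare pattern shows researchers that scraping is an afternoon’s skill, not a specialist’s.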
I also think there are other skills that researchers might benefit from learning, including developing a social media presence, communicating with reporters, blog writing, data visualization, and presentation skills. These kinds of workshops—which we’ve been developing at the Urban Institute—can help researchers in many ways.
The closing keynote was the bright spot for me. Brett Goldstein, former Chief Data Officer in Chicago and now a Senior Fellow in Urban Science at the Harris School of Public Policy at the University of Chicago, talked about his various projects while with the city and now at the Harris School. His Plenar.io open data portal looks really interesting, as does his Weather-and-Crime project (though the red-green color palette will anger the color police). His broader theme, that bringing together people from different fields and areas of expertise matters more than ever in this era of more data, is one that researchers should take to heart when conducting research with either traditional or new data sources.
The workshop organizers remarked that Goldstein’s presentation was exactly what the workshop was about. I would agree that his presentation embodied what the workshop should have been about, but unfortunately I don’t think the workshop sufficiently addressed the issues and challenges he raised.
As the value and availability of data continue to grow, researchers need to understand what those data sources are and how they can get and use them. Yesterday’s workshop was a missed opportunity to provide researchers with tools, resources, and knowledge to tap into these new sources of data.