As a computer scientist, I have always been fascinated by constructing useful insight from large-scale data. When I was a junior faculty member, I wanted to design “clever” algorithms that were efficient. I may have been atypical because I was less connected to the computational task and more connected to the data. I wanted to understand what the data represented, how it was generated, what processing had been done on it, and then develop new ways to describe and use it. Thinking about traditional computational/data mining tasks (association rules, dynamic clustering, anomaly detection) and new ones (stable alliance, prominent actors, bias, and event detection) was a fun mathematical puzzle. I was also very partial to algorithms that used certain data structures (different representations of the data for more efficient processing), particularly graphs (what many other disciplines would call ‘networks’).
Once I learned all I could about the data, I typically moved on. This focus on different types of data may explain how I stumbled into interdisciplinary research. From dolphin observational data to experimental medical data to streaming financial/purchase data, my focus was on developing algorithms that highlighted some aspect of the data that was hidden because of their size or that I needed to keep hidden because of privacy constraints. The algorithms I developed helped other scientists understand their populations better, but even more importantly, the disciplinary theories, ideas, and methods we shared opened up new ways of thinking about old problems in our disciplines.
In the last few years, I have shifted my focus to organic data. Organic data are data that are not designed. Survey data, by contrast, are designed by researchers to investigate specific hypotheses. Organic data are instead considered “data in the wild,” generated in a natural setting. Social media is the largest example of organic data. But instead of just looking at classic and innovative data mining tasks, I have been working with a team of interdisciplinary researchers to help answer social science questions using these data. What is the impact of news and social media on elections? Or, more generally, on public opinion? How does social media shape parenting attitudes? How can we use organic data in conjunction with traditional data sources to help forecast forced migration? How can we better understand representativeness of different online populations? How does online conversation drive policy and cultural change? These data are a game changer for social science and public health researchers. They provide a new avenue for learning about human behavior, attitudes, and decision making that are hard to capture in surveys or at scale using traditional ethnographic research methods.
As I began working on these different questions using organic data sources, it became clear that every project I was involved with was reinventing a “new” methodology – from study design to analysis and interpretation. I understood a great deal about the data, but very little about the social science disciplines that wanted to use these data – from their methodological traditions to their substantive theories. It became clear to me (and many of my collaborators) that a meta-problem existed. These data contained properties that differ from some of the more traditional forms of data used in social, behavioral, and economic (SBE) disciplines, and every discipline was developing its own independent standard for using organic data sets. Actually, it would be more accurate to say that every discipline was applying new customized methods for using social media data to help them advance their research. There are cases when the organic data are similar to disciplinary data. In those cases, the scale of the data made it impossible to use traditional approaches for measuring and modeling the data. Every study being conducted was fairly ad hoc, focusing on the disciplinary research question instead of a robust, repeatable research design.
So how could/should we tackle this? A group of 12 faculty in 7 disciplines at Georgetown University and the University of Michigan (s3mc.org) recognized the need to bring together researchers from different disciplinary traditions to develop frameworks, standards, and designs for extracting significant research value from social media data and other new forms of publicly available text data. Together, we have all begun learning from each other. We have a clearer understanding of why disciplinary questions require social scientists to understand the data generation process completely to answer their research questions, while computer science disciplinary questions do not. Because of the scale of these data and the mismatch in disciplinary research traditions and outputs, computer scientists did not always understand how their algorithms would be used by those in other disciplines, and those in other disciplines were sometimes using algorithms without really understanding their underlying assumptions and limitations. It became clear that we had a lot to figure out.
Recently, this group received funding from the National Science Foundation to establish standards for using social media data in SBE disciplines. We are very excited. At a high level our plan is to (1) develop a cross-disciplinary methodology for using social media data in the context of different study designs, (2) create research exemplars (case studies) that use the methodology, (3) build a community of scholars interested in tackling relevant issues, and (4) help develop materials to teach scholars and students how to use these new forms of data in their research.
It is very exciting for me to be part of this large interdisciplinary team looking at how to responsibly use social media data for social, computer, and data science research. I hope that as the project moves forward, we teach each other the important ideas that our individual disciplines can bring to this problem, while learning how to combine them to generate new ideas and thinking that transcend any one of them. My long-term vision is for researchers to have the ability to blend data from many sources that capture data at different speeds, resolutions, and forms using new computer science approaches and algorithms that incorporate clear notions of validity and reliability, enabling SBE research that addresses disciplinary questions in a more holistic way – explaining and predicting a wide range of social, behavioral, and/or economic phenomena. If this vision is going to become a reality, we are going to have to leave our disciplinary comfort zones, toss out our disciplinary hubris, and try a few things that we would normally not consider. I hope that in five years I write a blog post about how we came together, converged, and made this vision a reality, and about how we fundamentally changed the way quantitative research involving organic data is approached and conducted.
If you are interested in helping create a roadmap and establish best practices for responsible, replicable, and reliable social media research, connect with us. We need everyone to participate. Below are links to different parts of our project.
- The Social Science and Social Media Collaborative – https://s3mc.org
- The NSF Supported Social Media Methodology Project – http://www.smrconverge.org
- The Google group we are using to help create a broader scholarly community interested in these issues – SMRConverge@googlegroups.com
- Hashtag for social media – #SMRconverge