
What’s in a Sample? Acquiring and Preparing Social Media Data for Analysis

In the midst of the COVID-19 pandemic, we maintained our accelerated efforts to identify key features of social media data collection and analysis that scholars need to consider in their research. Our third methodology meeting focused on data acquisition, sampling, and data preparation. More than twenty researchers covering ten disciplines (statistics, survey methodology, sociology, psychology, communication, political science, public policy, economics, linguistics, and computer science) met virtually over a day and a half to discuss considerations related to these topics. These discussions culminated in a white paper that was produced in the following months. The meeting discussions and the follow-up paper raised critical considerations and led to reflections and recommendations related to acquiring, sampling, and preparing social media data for social science research. Collectively, these considerations frame the data collection process and help determine what any given study using social media data will be able to conclude.

The first consideration focuses on how social media data are acquired. Researchers primarily collect social media data in three ways: through collaborations with social media platforms, through APIs, and through scraping. Researchers vary in their technical capabilities and in their access to these strategies, in ways that can challenge replicability and introduce inequities in who can reach which conclusions. Collaborations with social media platforms are sometimes required for understanding the intricacies behind the sample acquired and can allow for more control over research design. Unfortunately, this is not the most common method, and such collaborations are not equally accessible to all researchers. The two more commonly used methods, acquiring data through APIs and scraping, raise other inequity concerns because they require different technical resources, skill sets, and scaling budgets. Thus, teams with different levels of skill and institutions with limited resources may be forced to use one method rather than another, leading to differences in how reliably continuous data can be acquired and to difficulty with replication. In light of these considerations, the paper highlights the need for shared data sources and data portals that can make social media data more accessible to researchers with different sets of skills and institutions with different available resources.

A second consideration is the type of sample and how it relates to the research questions. An interesting discussion between computer science and social science researchers revolved around which study designs lend themselves to focusing on the particular data acquired and which designs yield inferences based on a deeper understanding of how the sample was generated or how the data collected might relate to a larger population (i.e., beyond the sample itself). The paper establishes principles for understanding which social media study designs align well with the former and what approaches might be used with a population inference goal. Social media studies with population inference objectives could theoretically rely on probability sample designs (where every member of the population has a known non-zero probability of being selected) or on non-probability samples that are adjusted or analyzed in particular ways. In addition to differences in the feasibility of acquiring a probability vs. non-probability sample of social media, this distinction raises another equity concern. If findings from social media research studies are being used to generalize to a larger population or to drive policy decisions, who are the entities that made it into the data sample, and do they in fact represent others (not selected) in the population? Are there hidden or silent groups? Probability sampling, in theory, ensures that every member of the population has a known chance of being selected and is thus viewed as yielding a more equitable sample. When a probability sample of social media posts or users cannot be acquired, and the research aims to represent a larger population beyond the observed sample, the white paper discusses what we need to know about the acquired sample and its entities, the population it came from, and the methods and conditions needed to bridge the gap between the sample and the population.
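To make the idea of an "adjusted" non-probability sample concrete, here is a minimal sketch of one common adjustment, post-stratification. It is an illustration, not the method the white paper prescribes. The age groups, the population shares, and the function name `poststratified_mean` are all hypothetical, and the sketch assumes the population shares are known from an outside source such as a census.

```python
from collections import Counter


def poststratified_mean(sample, population_shares):
    """Reweight a non-probability sample so its group mix matches the population.

    sample: list of (group, value) pairs observed on the platform.
    population_shares: dict mapping each group to its share of the target
        population (assumed known from an external source).
    """
    counts = Counter(group for group, _ in sample)

    # A group present in the population but absent from the sample is a
    # "silent" group: no amount of reweighting can recover it.
    missing = set(population_shares) - set(counts)
    if missing:
        raise ValueError(f"groups with no sampled members: {sorted(missing)}")

    n = len(sample)
    # Weight for each group = population share / sample share.
    weights = {g: population_shares[g] / (counts[g] / n) for g in counts}

    weighted_sum = sum(weights[g] * value for g, value in sample)
    total_weight = sum(weights[g] for g, _ in sample)
    return weighted_sum / total_weight


# Hypothetical data: younger users are over-represented in the sample.
sample = [("18-29", 1), ("18-29", 1), ("18-29", 0), ("30+", 0)]
population_shares = {"18-29": 0.25, "30+": 0.75}

raw_mean = sum(value for _, value in sample) / len(sample)
adjusted_mean = poststratified_mean(sample, population_shares)
```

In this toy example the unadjusted mean is 0.5, while the adjusted mean is about 0.17, because the over-represented group is weighted down. The adjustment is only as good as the assumption that sampled members of a group resemble the unsampled members of that same group.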

The third set of considerations relates to preparing the sampled social media data before analyzing them. Decisions about data exclusions, cleaning, and aggregation have considerable implications for research conclusions. These implications call for transparent decisions and documentation of these essential steps if we are to have any hope of building a replicable and reproducible science based on social media data.
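One lightweight way to document such steps is to record how many records each exclusion rule removes. The sketch below is only an illustration under assumed inputs: the post fields (`text`, `lang`), the two rules, and the function name `apply_exclusions` are hypothetical and do not come from the white paper.

```python
def apply_exclusions(posts, rules):
    """Apply exclusion rules in order and log the effect of each one.

    posts: list of dicts describing posts.
    rules: list of (name, keep) pairs, where keep(post) returns True
        if the post should be retained.
    Returns the retained posts and a log of counts before and after each rule.
    """
    remaining = list(posts)
    log = []
    for name, keep in rules:
        before = len(remaining)
        remaining = [post for post in remaining if keep(post)]
        log.append({"step": name, "before": before, "after": len(remaining)})
    return remaining, log


# Hypothetical posts and rules.
posts = [
    {"text": "Hello world", "lang": "en"},
    {"text": "", "lang": "en"},
    {"text": "Bonjour", "lang": "fr"},
    {"text": "Good morning", "lang": "en"},
]
rules = [
    ("drop empty text", lambda post: post["text"].strip() != ""),
    ("keep English only", lambda post: post["lang"] == "en"),
]

kept, log = apply_exclusions(posts, rules)
for entry in log:
    print(entry)
```

Publishing a log like this alongside the results lets another team see exactly which decisions shaped the final dataset, and in what order they were applied.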

Finally, the paper dedicates a section to the ethical considerations raised by the use of social media data, as they relate to Institutional Review Board (or ethics committee) approvals, informed consent, and data dissemination. In terms of informed consent, the paper describes a general hierarchy of conditions that could contribute to establishing a framework for when informed consent is needed in social media research.

The importance of these considerations and others discussed in the white paper is particularly pronounced given the ever-changing nature of social media data and the evolving tools researchers are using to collect, prepare, and examine them. The white paper thus articulates a formal approach to considering these choices. Without a clear understanding of the set of decisions researchers make, there is reason to worry that researchers may reach incompatible conclusions because of different norms of data collection and preparation. The set of options underlying each of these decisions will continue to evolve. As it does so, we need a shared vocabulary for articulating which methods were chosen. Future conversations should continue the interdisciplinary dialogue begun in these meetings to ensure a collective understanding of the set of procedures that define the scope of social media research.