The key problem in applying computer scientists' data selection methods to a social science phenomenon is that, unlike a pre-planned survey or observational study, organic data are typically collected through APIs, third-party companies, or web scraping, without full knowledge of or control over the frame from which the samples are generated, and without an explicit description of the algorithmic selection properties. Moreover, most researchers who mine social media data do not understand the characteristics of the individuals in their sample or the mismatch between their sample and the target population. Adjusting for this mismatch between the social media population and the research target population is not straightforward: many social media platforms do not expose structured profile fields and do not require users to report the socio-demographics or other characteristics essential for understanding the pool from which participants originate. While some work on these issues exists, standards need to be established across disciplines for documenting the samples researchers use: sampling frame coverage, sampling procedure, sample design features and size, and adjustments for population mismatch.
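One common adjustment for this kind of sample-to-population mismatch is post-stratification weighting, where each respondent is weighted by the ratio of their group's share in the target population to its share in the sample. The sketch below is purely illustrative, assuming a single observed demographic (age group) with known population shares; the function name and the numbers are hypothetical, not taken from the project's data.

```python
# Minimal sketch of post-stratification weighting for a social media sample.
# Assumes one known demographic variable and known population shares
# (an assumption that often fails on platforms lacking profile fields).
from collections import Counter

def poststratification_weights(sample_groups, population_shares):
    """Weight each respondent by (population share) / (sample share) of their group."""
    n = len(sample_groups)
    sample_share = {g: c / n for g, c in Counter(sample_groups).items()}
    return [population_shares[g] / sample_share[g] for g in sample_groups]

# Toy example: the platform sample skews young relative to the target population.
sample = ["18-29"] * 6 + ["30-49"] * 3 + ["50+"] * 1
population = {"18-29": 0.20, "30-49": 0.35, "50+": 0.45}
weights = poststratification_weights(sample, population)
# Weights sum to the sample size; under-represented groups get weights > 1.
```

In practice the weighting variables themselves may be missing or self-reported, which is exactly why the sampling-frame documentation standards discussed above matter.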
The meeting took place online with 26 attendees, including 14 guests from outside the project team, who spent a day and a half discussing, presenting, and writing about measurement issues. The team is currently drafting a white paper on the data acquisition and sampling challenges of research involving social media data. We are also updating our interactive glossary of terms used differently across disciplines and growing our new Google group for discussing social media research.