With the modeling meeting coming up soon, it is a good time to reflect on the main takeaways from the white paper on Measurement that emerged from our meeting in September 2020. That meeting’s focus was on the role of standard social science measurement criteria in evaluating measures from social media data. The paper discusses some of these classic measurement criteria, such as construct validity, reliability and bias, and whether they apply to social media measures. This question is especially tricky when these measures are generated with computer science methods. Computer scientists generally do not use these criteria to evaluate their measures. Instead, they use concepts like “correctness,” “accuracy” (which can include concepts of precision and recall), “efficiency” (as defined in computer science) and “reliability” (as defined in computer science).
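For readers less familiar with the computer science vocabulary, here is a minimal sketch of how precision and recall are typically computed for a binary classifier. The function name, the labels, and the toy data are all hypothetical and chosen only for illustration; none of it comes from the white paper.

```python
def precision_recall(gold, predicted, positive=1):
    """Compute precision and recall for one class.

    gold      -- labels treated as correct (e.g., assigned by human coders)
    predicted -- labels produced by the automated method
    positive  -- the label counted as the class of interest
    """
    pairs = list(zip(gold, predicted))
    true_pos = sum(1 for g, p in pairs if g == positive and p == positive)
    false_pos = sum(1 for g, p in pairs if g != positive and p == positive)
    false_neg = sum(1 for g, p in pairs if g == positive and p != positive)

    # Precision: of the posts the method flagged, what share was correct?
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    # Recall: of the posts that should have been flagged, what share was found?
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return precision, recall


# Hypothetical toy data: 1 = post is about the topic, 0 = it is not.
gold_labels = [1, 1, 1, 0, 0, 0, 1, 0]
predicted_labels = [1, 0, 1, 1, 0, 0, 1, 0]
precision, recall = precision_recall(gold_labels, predicted_labels)
```

Note that both quantities compare the method's output to a set of labels taken as correct, not to a theoretical construct. Whether those labels themselves capture the construct is a separate question.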
One of our goals in the white paper is to help social scientists and computer scientists understand the different criteria and vocabulary that they use to evaluate measures. One major difference we found is that social scientists make these evaluations by comparing a measure to a theoretical construct; each of the social science criteria is a different way of making that comparison. Computer scientists, by contrast, usually do not conceive of a theoretical construct that is separate from their measures, and so do not evaluate measures on that basis. Most computer science evaluation criteria are instead based on different ways of conceptualizing measurement. We do not believe that one approach to measurement evaluation is better than the other. But we do think that, because some of the methods used to generate social media measures were developed in computer science and are still used predominantly by computer scientists, they have not been sufficiently evaluated by social science criteria. Many have not been assessed for how well they correspond to the theoretical constructs they are intended to capture, and they should be.
Thus, a second (related) goal of the white paper, in addition to teaching computer scientists and social scientists about each other’s vocabulary and evaluative methods, is to assess major existing measurement methods using traditional social science criteria. In the white paper, we discuss strategies for ensuring reliability, such as formalizing procedures and metrics for calculating interrater reliability on codes or labels, and strategies for establishing validity, including the use and selection of ground truth.
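As one concrete illustration of an interrater reliability metric, the sketch below computes Cohen's kappa, a common chance-corrected agreement statistic for two coders. We use it here only as an example and do not mean to imply it is the specific metric the white paper recommends. The function name and the toy labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labeled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty set of items.")
    n = len(rater_a)

    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Expected agreement by chance, from each rater's label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy data: two coders labeling ten posts.
coder_1 = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg", "pos", "neg"]
coder_2 = ["pos", "neg", "neg", "neg", "pos", "neg", "pos", "pos", "pos", "neg"]
kappa = cohens_kappa(coder_1, coder_2)  # about 0.6 for this toy data
```

In this toy example the coders agree on 8 of 10 posts, but half of that agreement would be expected by chance given how often each coder uses each label, so kappa is noticeably lower than the raw agreement rate.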
Particularly relevant to our next meeting and planned white paper on modeling is our discussion of the difference in goals between creating measures for constructs and creating features for algorithms. We argue that some of what computer scientists call “features” are what social scientists call “measures,” but some are not. Those features may instead serve to represent the data in a way that helps improve an ML classifier’s model. In those cases, assessing reliability and validity may be hard, or at least the bar for reliability and validity does not need to be as high. Defining more clearly the distinction between feature engineering and measurement as social scientists understand it (and thus between “features” and “measures”) was a challenge for us in this paper, and one we hope to continue discussing in the modeling meeting next week.
Another take-home lesson from the measurement meeting and white paper that is relevant to our upcoming modeling discussions is that measurement methods need some level of human involvement to work well. This was one of the key themes of the conclusion of the measurement white paper. What is the role of “fully automated” methods for analysis of social media data today, if any, and what are the best practices for human involvement in building models, particularly those that produce measures for use in social science? We conclude that, at this point, we would rarely recommend “fully automated” measurement strategies in social media research. Computational methods should usually be semi-automated, with the researcher still playing an active role by reviewing the results and modifying the method when necessary. The main reason is that the brevity of social media posts, together with the constantly changing nature of the posts and of the language conventions in them, means these algorithms are constantly being applied to changing circumstances. Human monitoring and (when necessary) modification are, we believe, the best way to ensure that automated methods produce valid, reliable, precise and relatively unbiased measures.
Finally, writing the white paper on measurement made especially clear to us how much all of the steps in the process that we are examining in this Convergence initiative (design, sampling, measurement, modeling, and visualizing) speak to and depend on each other. We hope that this brief review helps us connect our discussions last fall to our final two meetings, on modeling (next week!) and on analysis and visualization in March.