computational ethnography #3: the battle continues
A few weeks ago, I suggested that techniques from computer science can be used to assess, measure, and analyze the field notes and interviews that one collects during fieldwork. The reason is that computer scientists have made progress in writing algorithms that pick up the emotional tenor or meaning of texts. These tools are not perfect by any means, but they can be valuable in helping qualitative researchers identify themes and patterns in text.
In the last round, there were two comments that I want to address. First, Krippendorf wrote: “Why call it computational ethnography and not just text analysis?” Answer: there are two existing modes of analyzing text, and techniques like sentiment analysis and topic modeling do new things in new ways. Allow me to explain:
- The traditional way of handling qualitative texts is simply for the researcher to read them and develop a grounded understanding of the meaning they represent. This is the standard mode among historians, most anthropologists, and some sociologists. Richard Biernacki, in Reinventing Evidence in Social Inquiry, argued that this is the only valid mode of qualitative analysis.
- The other major way to deal with qualitative materials is a two-step operation: have people code the data (using key words or other instructions) and then perform an intercoder reliability analysis (i.e., assign codes to texts and compute Krippendorff’s alpha).
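To make that second mode concrete, here is a minimal sketch in Python of the reliability check itself: Krippendorff’s alpha for nominal codes from two coders with no missing data. The coder labels below are made up for illustration; a real project would use a full-featured implementation that handles missing values and more than two coders.

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(coder1, coder2):
    """Krippendorff's alpha for nominal data: two coders, no missing values."""
    # Coincidence matrix: each unit contributes every ordered pair of its
    # two values, weighted by 1/(m - 1), where m = raters per unit (2 here).
    coincidences = Counter()
    for a, b in zip(coder1, coder2):
        for c, k in permutations((a, b)):
            coincidences[(c, k)] += 1  # weight 1 / (2 - 1)
    # Marginal totals per category and grand total (= 2 * number of units).
    n_c = Counter()
    for (c, _k), count in coincidences.items():
        n_c[c] += count
    n = sum(n_c.values())
    # Observed vs. expected disagreement, summed over unequal category pairs.
    d_obs = sum(cnt for (c, k), cnt in coincidences.items() if c != k) / n
    d_exp = sum(n_c[c] * n_c[k] for c in n_c for k in n_c if c != k) / (n * (n - 1))
    return 1.0 - d_obs / d_exp

# Hypothetical codes for four interview passages:
coder1 = ["protest", "protest", "media", "media"]
coder2 = ["protest", "protest", "media", "protest"]
print(round(krippendorff_alpha_nominal(coder1, coder2), 3))  # 0.533
```

One disagreement out of four units already pulls alpha down to about 0.53, which is why the human-coding mode demands careful codebooks and training.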
So what is new? Techniques like topic models or sentiment analysis do not use people to code data. After you train the algorithms, everything is automated. This has advantages – speed, reproducibility, and so forth – for large data. Another novel aspect is that these algorithms are usually built with some model of language in mind, which gives you insight into how the text was coded. For example, the Stanford NLP package essentially breaks down sentences by grammar and then estimates the distribution of words with a specific sentiment. Thus, there is an explanation for every output. In contrast, I can’t reproduce even my own codes over time. Give me the same set of texts next week, and they will be coded a little differently.
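The grammar-aware models in packages like Stanford’s are too big to sketch here, but even the simplest automated approach – a lexicon-based scorer – shows the basic move of replacing a human coder with a fixed, reproducible rule. The tiny lexicon and negation rule below are illustrative assumptions of mine, not any package’s actual word list:

```python
import re

# Illustrative lexicon; a real system would use a large, validated word list.
LEXICON = {"good": 1, "great": 2, "valuable": 1, "bad": -1, "terrible": -2}
NEGATORS = {"not", "never", "no"}

def sentiment_score(text):
    """Sum lexicon weights over tokens, flipping the sign after a negator."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, tok in enumerate(tokens):
        if tok in LEXICON:
            weight = LEXICON[tok]
            # Crude negation rule: an immediately preceding negator flips sign.
            if i > 0 and tokens[i - 1] in NEGATORS:
                weight = -weight
            score += weight
    return score

print(sentiment_score("The interviews were great"))       # 2
print(sentiment_score("The notes were not good at all"))  # -1
```

The point is the contrast with human coding: run this on the same field notes next week and it returns exactly the same scores, and every score can be traced back to the rule and word list that produced it.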
Second, a number of commenters were concerned about the open-ended nature of notes, the volume of materials, and whether the sorts of things that might be extracted would be useful to sociologists. These concerns are easily addressed. Lots of projects produce tons of notes. I recently collected 194 open-ended interviews. My antiwar project resulted in dozens and dozens of interviews. We have the volume. Sometimes the materials are standardized, sometimes not. That’s an empirical issue – how badly do these methods do with unstructured text? Maybe better than we expect. There is no reason for an a priori dismissal. Finally, I think a little induction is helpful. Yes, we can now pick up sentiment, which is an indicator of emotion, but why not let the data speak to us a little? In other words, there’s a whole new world around the corner. This is one step in that direction.