unrepresentative samples, part deux
Today, I’ll situate my argument about unrepresentative samples a little bit more. Next week, I’ll provide some commentary on Sam Lucas’ (2012) article, “Beyond the Existence Proof: Ontological conditions, epistemological implications, and in-depth interview research,” which provides a counter-argument to my view.
First, it’s a relief to not be alone on this issue. For example, Pete Warden, who runs a consulting firm, made the following comment on his blog:
It’s uncontroversial in the commercial world that biased samples can still produce useful results, as long as you are careful. There are techniques that help you understand your sample, like bootstrapping, and we’re lucky enough to have frequent external validation because we’re almost always measuring so we can make changes, and then we see if they work according to our models. The comments on this post are worth reading because the approach seems to offend some sociologists viscerally.
The main issue is that academics strive toward purity. We want perfect experiments, logical proofs, and random samples. These aren’t necessarily bad things, but they aren’t needed in many cases when you have evidence that a non-random sample is close enough. We prefer “a reviewer can’t argue with me” to “this is probably correct.” Another commenter also draws our attention to a debate in the epidemiology literature, which is worth reading.
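The bootstrapping Warden mentions is one way to "understand your sample": resample your own data with replacement, many times, to see how much your estimate moves around. A minimal sketch, with invented numbers (the data and interval are hypothetical, purely for illustration):

```python
import random

random.seed(0)

# Hypothetical survey responses (e.g., hours per week on some activity)
sample = [2, 5, 3, 8, 4, 6, 3, 7, 5, 4, 9, 2, 6, 5, 3]

def bootstrap_means(data, n_resamples=10_000):
    """Resample the data with replacement; record each resample's mean."""
    n = len(data)
    return [sum(random.choices(data, k=n)) / n for _ in range(n_resamples)]

means = sorted(bootstrap_means(sample))

# A rough 95% interval for the mean, read off the bootstrap distribution
lo = means[int(0.025 * len(means))]
hi = means[int(0.975 * len(means))]
print(f"sample mean: {sum(sample)/len(sample):.2f}, "
      f"95% interval: [{lo:.2f}, {hi:.2f}]")
```

Note what this does and doesn't buy you: the bootstrap quantifies the variability of your estimate, but it cannot by itself detect bias in how the sample was drawn; that is why Warden leans on external validation as well.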
Second, I wanted to clarify a few things. For example, no one should read my post and think that I don’t like random samples. Just for the record:
- Random samples are good!
- If you have a choice between a good representative/random sample and one with bias, go for the random sample!
- If you are conducting research, random samples should be your first option in research design. Reject random samples only if you have an overwhelmingly good reason.
- You can’t automatically draw broad conclusions from non-random samples.
- I use random samples in my research. When I can’t, I try to approximate them as best as I can.
So, where do I differ with mainstream consensus?
- The lack of randomness doesn’t automatically mean that model estimates are biased. The situation, for lack of a better word, is “indeterminate.” The model may be way off, or a little off, or off in a way that can be corrected. Or it may be right on. You just don’t know. This is just basic logic: X → Y does not imply ¬X → ¬Y. Random sampling guarantees unbiased estimates, but it doesn’t follow that non-random sampling guarantees biased ones.
- When we have good reason to use them, non-random samples should be paired with further research that provides external validation.
- For specific research designs that produce unrepresentative samples, we can probably provide a well-motivated analysis of when they yield model estimates that are very close to the true model, or that can be systematically corrected.
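The "systematically corrected" case is the easiest to make concrete. If we know from an outside source (say, a census) how the sample over-represents a group, we can post-stratify: estimate within groups, then reweight by the true population shares. A toy sketch, with entirely hypothetical numbers:

```python
# A convenience sample over-represents the "young" group, but we know the
# true population shares from an external source, so we can reweight.

# (group, outcome) pairs from a non-random sample: 80 young, 20 old
sample = ([("young", 1)] * 60 + [("young", 0)] * 20 +
          [("old", 1)] * 5 + [("old", 0)] * 15)

population_share = {"young": 0.5, "old": 0.5}  # assumed known externally

# Raw (biased) estimate of the outcome rate: young dominate the average
raw = sum(y for _, y in sample) / len(sample)

# Post-stratified estimate: within-group rates, weighted by true shares
adjusted = 0.0
for g in population_share:
    ys = [y for grp, y in sample if grp == g]
    adjusted += population_share[g] * (sum(ys) / len(ys))

print(f"raw: {raw:.2f}, post-stratified: {adjusted:.2f}")
# The raw estimate is 0.65; reweighting pulls it to 0.50.
```

The correction only works because the bias runs along a variable we observe and can benchmark externally. When the selection mechanism is unobserved, no reweighting scheme rescues the estimate, which is exactly the "indeterminate" situation above.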
Why is this an important issue?
- Some populations resist random sampling. The homeless population, for example, is notoriously hard to sample.
- Other populations are impossible to sample randomly. For example, we cannot draw a random sample of Paris residents circa 1500; we’re stuck with whatever non-random samples the historical record preserves.
- Some fields are riddled with vagueness. In social movement studies, there is no universally accepted definition of “activist,” and any definition will introduce some bias.
So we have to work with non-random samples to make progress in many important social science areas.
Finally, let me lay out my basic assumptions. At heart, I’m a Bayesian. I don’t believe that any study stands alone; each one arrives amid lots of prior information. That doesn’t mean I accept all unrepresentative samples. Rather, it means I accept them when the additional information suggests that the bias is small, or systematic and correctable.
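That Bayesian stance can be made concrete with a conjugate update: prior studies pin down a rough belief about some rate, and a new non-random sample shifts that belief rather than replacing it. All numbers below are invented for illustration:

```python
# Beta-binomial update: a Beta(a, b) prior summarizing earlier
# (random-sample) studies, updated with a new, possibly biased sample.

prior_a, prior_b = 30, 70        # prior centered near 0.30, from earlier work
successes, failures = 55, 45     # counts from the new non-random sample

# Conjugacy: posterior is Beta(a + successes, b + failures)
post_a = prior_a + successes
post_b = prior_b + failures

prior_mean = prior_a / (prior_a + prior_b)
post_mean = post_a / (post_a + post_b)
print(f"prior mean: {prior_mean:.2f}, posterior mean: {post_mean:.2f}")
# prior mean: 0.30, posterior mean: 0.42
```

The new sample moves the estimate but the prior keeps it anchored; how far you let the biased data pull you depends on how much you trust it, which is the judgment call the whole post is about.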