unrepresentative samples, part deux

Today, I’ll situate my argument about unrepresentative samples a little bit more. Next week, I’ll provide some commentary on Sam Lucas’ (2012) article, “Beyond the Existence Proof: Ontological conditions, epistemological implications, and in-depth interview research,” which provides a counter-argument to my view.

First, it’s a relief to not be alone on this issue. For example, Pete Warden, who runs a consulting firm, made the following comment on his blog:

It’s uncontroversial in the commercial world that biased samples can still produce useful results, as long as you are careful. There are techniques that help you understand your sample, like bootstrapping, and we’re lucky enough to have frequent external validation because we’re almost always measuring so we can make changes, and then we see if they work according to our models. The comments on this post are worth reading because the approach seems to offend some sociologists viscerally.

The main issue is that academics strive toward purity. We want perfect experiments, logical proofs, and random samples. These aren’t necessarily bad things, but they aren’t needed in many cases when you have evidence that a non-random sample is close enough. We prefer “a reviewer can’t argue with me” to “this is probably correct.” Another commenter also draws our attention to a debate in the epidemiology literature, which is worth reading.

Second, I wanted to clarify a few things. For example, no one should read my post and think that I don’t like random samples. Just for the record:

  • Random samples are good!
  • If you have a choice between a good representative/random sample and one with bias, go for the random sample!
  • If you are conducting research, random samples should be your first option in research design. Reject random samples only if you have an overwhelmingly good reason.
  • You can’t just automatically make broad conclusions with non-random samples.
  • I use random samples in my research. When I can’t, I try to approximate them as best as I can.

So, where do I differ with mainstream consensus?

  • The lack of randomness doesn’t automatically mean that model estimates are biased. The situation, for lack of a better word, is “indeterminate.” The model may be way off, or a little off, or off in a way that can be corrected. Or it may be right on. You just don’t know. This is simply the basic logic of mathematics (e.g., X –> Y does not imply not X –> not Y).
  • If we have good reason, non-random samples should be the focus for more research that provides external assessment.
  • For specific research designs that produce unrepresentative samples, we can probably provide a well-motivated analysis of when the design produces data that yield model estimates very close to the true model, or estimates that can be systematically corrected.
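
To make the "indeterminate" point concrete, here is a minimal simulation sketch. Everything in it is an assumption made up for illustration: the two groups ("young" and "old"), their outcome means, the selection probabilities, and the helper names are hypothetical, and the correction shown (post-stratification on a known population share) is just one standard technique. It shows a non-random sample that is badly off but correctable, and the same selection process producing an estimate that is close to right.

```python
import random

def make_population(n, share_young, mean_young, mean_old, rng):
    """Build a hypothetical population as a list of (group, outcome) pairs."""
    people = []
    for _ in range(n):
        group = "young" if rng.random() < share_young else "old"
        center = mean_young if group == "young" else mean_old
        people.append((group, rng.gauss(center, 1.0)))
    return people

def biased_sample(population, p_young, p_old, rng):
    """Non-random selection: inclusion probability depends on group."""
    keep = {"young": p_young, "old": p_old}
    return [(g, y) for g, y in population if rng.random() < keep[g]]

def mean(values):
    return sum(values) / len(values)

def poststratified_mean(sample, population_shares):
    """Weight each group's sample mean by its (assumed known) population share."""
    return sum(
        share * mean([y for g, y in sample if g == group])
        for group, share in population_shares.items()
    )

rng = random.Random(0)  # fixed seed so the sketch is reproducible
shares = {"young": 0.5, "old": 0.5}  # assumed known from external data, e.g. a census

# Case 1: the outcome differs by group, so selecting on group biases the naive mean.
pop1 = make_population(100_000, 0.5, 2.0, 4.0, rng)
s1 = biased_sample(pop1, 0.02, 0.10, rng)
true1 = mean([y for _, y in pop1])
naive1 = mean([y for _, y in s1])
adjusted1 = poststratified_mean(s1, shares)

# Case 2: the outcome does not differ by group, so the same selection does little harm.
pop2 = make_population(100_000, 0.5, 3.0, 3.0, rng)
s2 = biased_sample(pop2, 0.02, 0.10, rng)
true2 = mean([y for _, y in pop2])
naive2 = mean([y for _, y in s2])

print("case 1: true, naive, adjusted =", round(true1, 2), round(naive1, 2), round(adjusted1, 2))
print("case 2: true, naive =", round(true2, 2), round(naive2, 2))
```

The sample alone cannot tell you which case you are in. That takes outside information about how the selection variable relates to the outcome.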

Why is this an important issue?

  • Some fields are resistant to random samples. The homeless population, for example, is notoriously hard to sample.
  • Other fields are impossible to sample at random. For example, it would be hard to draw a random sample of Paris residents circa 1500. We’re stuck with non-random samples from the historical record.
  • Some fields are riddled with vagueness. In social movement studies, there is not a universally accepted definition of “activist.” Any definition will probably introduce some biases.

So we have to work with non-random samples to make progress in many important social science areas.

Finally, let me lay out my basic assumptions. At heart, I’m a Bayesian. I don’t believe that every study lives by itself. Instead, I have lots of information. That doesn’t mean that I accept all unrepresentative samples. Rather, it means that I accept them when I have additional information that suggests that the bias is small or systematic and correctable.
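
As a rough sketch of that Bayesian logic, consider a normal-normal update in which the sample estimate equals the truth plus an unknown bias plus sampling noise. The model, the function name `posterior`, and all the numbers are assumptions chosen for illustration. In particular, `bias_sd` is a hypothetical summary of how much outside evidence constrains the bias.

```python
def posterior(prior_mean, prior_sd, estimate, se, bias_sd):
    """Conjugate normal update for: estimate = truth + bias + noise.

    Assumes bias ~ Normal(0, bias_sd**2) and noise ~ Normal(0, se**2),
    so the estimate's total variance around the truth is se**2 + bias_sd**2.
    Returns (posterior mean, posterior standard deviation).
    """
    obs_var = se ** 2 + bias_sd ** 2
    prior_prec = 1.0 / prior_sd ** 2
    obs_prec = 1.0 / obs_var
    post_var = 1.0 / (prior_prec + obs_prec)
    post_mean = post_var * (prior_prec * prior_mean + obs_prec * estimate)
    return post_mean, post_var ** 0.5

# No outside knowledge about the bias: the study barely moves the prior.
vague = posterior(prior_mean=0.0, prior_sd=1.0, estimate=2.0, se=0.1, bias_sd=100.0)

# Outside evidence says the bias is small: the study is informative.
informed = posterior(prior_mean=0.0, prior_sd=1.0, estimate=2.0, se=0.1, bias_sd=0.2)

print("bias unconstrained:", vague)
print("bias bounded:", informed)
```

With the bias unconstrained, the posterior is essentially the prior. With the bias bounded by other evidence, the same unrepresentative sample shifts beliefs substantially.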

Adverts: From Black Power/Grad Skool Rulz

Written by fabiorojas

October 1, 2013 at 12:01 am

10 Responses


  1. Looking forward to hearing your comments on Sam’s article.



    October 1, 2013 at 1:18 am

  2. I am a user of longitudinal studies, which, over time, inevitably become unrepresentative. We do what we can to deal with biases using statistics, etc., but the samples are not usually representative after 50 years of attrition. The results from analyses are nevertheless useful in helping us try to figure out underlying mechanisms behind the observed findings. Then, you need to go and test your hypotheses about the mechanisms in other datasets.


    Michelle Kelly-Irving

    October 1, 2013 at 6:32 am

  3. I had to deal with a similar critique of my work but was always uncomfortable with the argument that one sample in a May Day union rally in one city (NYC) months after the core activities of OWS was going to yield a “true random sample” of this national and global movement. When I look at the methodology and results of this one study, it looks to me that sampling in a union-OWS-immigrant rights rally yields a particular type of sample (with heavy representation from union members, as is confirmed in the data). There are perceived gains to be had by criticizing some work as “non-random” while presenting other work as the “result of a random sample…” So, I am not just concerned with the methodological and analytical issues involved (which are both interesting and complex) but also with “research spin.” Here is one sociologist criticizing others and work they don’t know or haven’t bothered to learn about while positioning work that, at the time this article was written, had not even been done yet as a “random sample” before a single piece of empirical data had been collected…a random sample of who or what is what I want to know…in order to understand some social phenomena we need different tools, strategies and techniques that, together, draw a picture (an approximation) of the underlying social complexity and reality…I agree with what you say in your post and relate to the complexities you are bringing to the table…



    October 1, 2013 at 11:49 am

  4. I was going to write about Sam Lucas’s article but you beat me to it. I may do it anyway in the next few weeks, because you don’t really address it here more than just to note its existence.

    One glaring concern with what you’ve posted here is that you conflate “random” with “representative.” Randomness is one (probably the best) way of approaching representativeness, but the two are not synonymous.

    The other major issue is your fascination with the indeterminacy of the findings achieved from nonrepresentative samples. Consider a situation in which there are two possible findings, A and B. Prior to research, it is possible that A is correct, and possible that B is correct. In other words, the state of knowledge prior to the study is indeterminate. You do research using a nonrepresentative sample, and in your own words, the findings are indeterminate: “you just don’t know.” You have now advanced the state of knowledge not at all, as you’ve moved from indeterminacy to indeterminacy. The fact that your nonrepresentative finding might be the same as the finding from the representative sample you didn’t use is of no use whatsoever.



    October 1, 2013 at 6:39 pm

  5. Andrew:

    1. Lucas’ article has a lot of important arguments. It needs its own separate discussion. I’m glad that it will get multiple comments.

    2. Ok: guilty on the ambiguousness of random vs. representative.

    3. I think you should consider a Bayesian framework. Absent *any other knowledge* you remain indeterminate in the face of a non-random/biased sample. But we aren’t in that world. We can start to resolve indeterminacy by looking for other samples or other forms of evidence that would help us move from a flat prior to a non-flat prior.



    October 1, 2013 at 7:06 pm

  6. @andrewperrin: “You have now advanced the state of knowledge not at all, as you’ve moved from indeterminacy to indeterminacy.” Well, that was quick! But it hardly seems sporting. Shouldn’t you give our host a bit more rope before you pull the floor out from under him?



    October 1, 2013 at 7:35 pm

  7. Fabio, the question in your “Bayesian” world is: what additional knowledge does the nonrepresentative sample add? If we add to my scenario that we already have a vector of knowledge about the field, X, and A vs. B remains indeterminate, then we’re right back where we started after the study: X remains true, and A vs B remains indeterminate!

    To hint at what I plan to write next week (after my son’s bar mitzvah is over), I think there are more uses for what Lucas calls the “existence proof” than he gives credit to. But I agree with him that nonrepresentative samples are limited to the existence proof.



    October 1, 2013 at 8:25 pm

  8. Andrew: Think about phone surveys. Lots of evidence that they are biased. Method (cell vs. landline), SES effects (less educated answer them), age effects, etc. These biases are strong enough that we might get seriously misleading results.

    But hold on – I have other evidence that can help me judge the survey. For example, we have census data. Also, I might have surveys that are linked with other external predictors (e.g., we can compare political opinion surveys with actual voting, or survey data with voter registration records). There might be other surveys of populations that were excluded or under/over reported in my phone survey.

    With this extra data, I can now make a judgment about how good or bad my phone survey is likely to be. For example, we can use census data or voter registration data to create sample weights. Or, if we have data that young people (who don’t pick up the phone) are not much different than old people for the variable we care about, we are justified in believing the error is small.

    So to return to your question: what does the non-representative sample add? In my example, it adds information about a current population of people, or about an issue that wasn’t pertinent before. Prior research gives us bounds on the size of bias or the sorts of bias that matter.



    October 1, 2013 at 9:11 pm

  9. Fabio, the phone survey example is revealing. It is useful insofar as you know parameters of the population being sampled — precisely because you can use this information to approximate representativeness! This is why the conflation of “random” and “representative” is relevant.

    Another way of putting it, based on your final paragraph: it only adds information about “a current population of people” if it is representative *of that current population*. There’s no inherent reason why “representative” means “nationally representative” or “globally representative”.



    October 1, 2013 at 11:50 pm

  10. I’m finding this discussion annoying because it ignores the central importance of sampling frames and the distinction between the population from which a sample is drawn (the population framed by the sampling frame) and the population of theoretical or substantive interest. A random sample from a bad frame is a bad sample, in the sense that results based on it have no relation to the population of interest. A sample from a good frame that is drawn with some attention to achieving representativeness and avoiding gross types of selection bias, even if that is done purposively rather than randomly, has a better chance of providing useful information about the population of interest.

    Some criticisms of sampling are not about their randomness but about the poor fit between the population of interest and the population from which the sample was drawn.



    October 2, 2013 at 12:20 am

