# orgtheory.net

## the joy of unrepresentative samples

Let’s start with a question: Why should you believe that representative samples are good? The answer: representative samples produce estimates of the mean that are normally distributed around the true value. In plain English, a nice, big random sample will produce an estimate that is probably close to the real answer.
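
A minimal simulation sketch of that claim, using only Python's standard library. The skewed population, the sample size of 400, and the 500 repeated draws are made-up assumptions chosen for illustration, not real data.

```python
import random
import statistics

def sample_means(population, n, draws, rng):
    """Return the mean of each of `draws` simple random samples of size `n`."""
    return [statistics.fmean(rng.sample(population, n)) for _ in range(draws)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical, deliberately skewed population with a mean of roughly 10.
population = [rng.expovariate(1 / 10) for _ in range(20_000)]
true_mean = statistics.fmean(population)

means = sample_means(population, n=400, draws=500, rng=rng)
center = statistics.fmean(means)  # average of the 500 estimates
spread = statistics.stdev(means)  # how far a typical estimate strays

print(f"true mean {true_mean:.2f}, average estimate {center:.2f}, spread {spread:.2f}")
```

Even though the population itself is far from normal, the estimates should pile up near the true mean, with a spread close to the population standard deviation divided by the square root of the sample size.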

Follow-up question: Does it logically follow that all unrepresentative samples produce systematically biased estimates? No, it doesn’t. To see why, you need a little Logic 101. In logic, it is well known that (A –> B) does not automatically entail ( not A –> not B), which is called the “inversion.” To see why ( not A –> not B) might be false, think about this conditional: “If it is a bat, it is a mammal.” Obviously, “if it is not a bat, then it is not a mammal” is not true. In general, you can’t really make any inference about an inversion from a conditional. They are simply different animals.
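
The point can be checked mechanically with a four-row truth table. This is a small sketch; the helper name `implies` is just an illustrative choice.

```python
from itertools import product

def implies(p, q):
    """Material conditional: 'p -> q' is false only when p is true and q is false."""
    return (not p) or q

# Truth assignments where "A -> B" holds but "not A -> not B" fails.
counterexamples = [
    (a, b)
    for a, b in product([False, True], repeat=2)
    if implies(a, b) and not implies(not a, not b)
]
print(counterexamples)
```

The only counterexample is A false and B true, which is exactly the mammal that is not a bat.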

Let’s return to social science research methods. Our original conditional is: ( the sample is random –> the estimates are normally distributed around the real mean). It doesn’t automatically follow that ( sample is not random –> the estimates are not normally distributed around the real mean). It *might* be true, but it’s not automatically true. It requires a separate argument.

So far, I have not read a general argument showing that unrepresentative or biased samples in *all* cases leads to systematic biases in the estimated parameters. That’s probably because no such argument exists: samples can be biased for all kinds of reasons. In some cases the bias may matter, but in other cases it may be irrelevant.

What’s the point? The point is that social scientists should abandon their knee-jerk rejection of unrepresentative samples. Instead, we should take a “case by case” approach to unrepresentative samples. We have to individually investigate each type of unrepresentative sample to determine if it can be used to estimate a parameter.

If you can accept that, then you open up a whole new world of data. For example, a lot of people can’t accept that futures markets are accurate forecasters of events. One of the arguments is that traders do not accurately resemble voters. True, but that doesn’t logically entail that it is impossible for trader behavior to mimic voter preferences. It *might* be true, but it is not automatically true based on the principle of random sampling. So what does the research say? Well, trading markets predict presidential election tallies better than random samples of voters (polls) 74% of the time. Not bad.

Bottom line: Yes, random samples are good, but they aren’t the last word. Social scientists should be on the lookout for data sources that perform well despite biases in sampling.

Written by fabiorojas

September 19, 2013 at 12:01 am

### 36 Responses

1. This is really minor, but there’s a typo in the title of the post. (Please delete this comment once you’ve read it.)

Chris M

September 19, 2013 at 1:15 am

2. Thanks!

fabiorojas

September 19, 2013 at 1:19 am

3. Bullshit. We are in a crisis of method, and the orientation that large convenience samples collected online are somehow able to adjudicate sociological questions is simply pathetic. What this represents, mostly, is the desire for a few people who have access to grant money outside of scientific realm to further their research agendas using shit data. Either some of you were not paying attention in basic statistical analysis, or you no longer care–since your careerist interests trump your dedication to social science. This isn’t about argumentation, this is about statistical theory and how we make inferences about populations based on samples which are supposed to be randomly drawn from those populations. If we don’t have the latter, we can’t make the former.

People can assert their heterodox beliefs about basic research methods all they want, but I can assure you that I will never advocate publishing a paper using crap data. I’m disturbed that this wasn’t the lesson learned by the Regnerus affair…but then again many others want to use shit data to prove their own points.

sherkat

September 19, 2013 at 1:30 am

4. Fabio, I think you’re mixing up two issues. The first question is whether a nonrepresentative sample will sometimes be right in its predictions using probabilistic statistical inference. The answer is: yes, sometimes it will be right by chance alone, but we have no way of knowing when it’s right and when it’s not, so the random chance of it being right is useless.

The traders example is a different matter. Trading markets aren’t right because they’re representative; they’re right for an entirely different reason, which is that a large number of people with good information and an incentive to get it right will tend, overall, to predict correctly. But traders don’t report information on themselves (in which case you’d need them to be representative of the population). They report information on what they believe to be true of others. For that, they don’t need to be representative, just relatively observant. (It doesn’t even matter if they’re right for the right reasons – all you measure there is predictivity, not dimensions of the phenomenon being predicted.)

andrewperrin

September 19, 2013 at 1:57 am

5. Was the non-randomness really the problem with Regnerus’ work? I didn’t follow that affair particularly closely but I thought his problems were more about misrepresenting what his survey items actually measured. Aside from that, do we really have good reason to believe that phone surveys (with abysmally low response rates) are actually doing better than the type of online survey that Regnerus used in terms of producing representative samples?

JD

September 19, 2013 at 1:59 am

6. @darren: First, I want to thank you for not bringing bunnies into this.

Second, please don’t say the name of you-know-who-in-Texas because you-know-who-insane-commenter will show up.

Third, I have no idea what you are talking about when you say: “What this represents, mostly, is the desire for a few people who have access to grant money outside of scientific realm to further their research agendas using shit data.”

1. The belief that unrepresentative samples are to be rejected is based on a misunderstanding of the central limit theorem. I.e., the inversion of a conditional doesn’t follow from the original conditional. This is philosophy 101, nothing controversial or heterodox about that.

2. There is no a priori argument that precludes the possibility that unrepresentative samples can, in some cases, provide accurate estimates. This is the controversial argument. The way to rebut it is to show me your argument. And you can’t simply cite the CLT, because it doesn’t imply its inversion. You need a separate argument. Don’t go R—–us on my sorry self.

3. There are well documented cases where unrepresentative samples of people appear to provide data that leads to accurate estimates of behavior in larger populations. As Jeremy noted in another post last year, telephone surveys, which have low response rates and all kinds of biases, actually do pretty well. This suggests that I am right on #2.

The conclusion I draw from #1, #2 and #3 is that we should assess the value of non-representative samples rather than just reject them.

In the case of you-know-who-in-Texas, no such effort was made. In fact, the inconsistency of that research with prior research using better methods suggests that the sample was biased in a bad way. That, however, doesn’t imply that other samples can’t be useful.

fabiorojas

September 19, 2013 at 2:00 am

7. @sherkat: the lesson of the Regnerus affair is that an ideologue with a lot of money and an ax to grind can manipulate the scientific apparatus into publishing nonsense as science. The quality of the data is the least of the problem there.

andrewperrin

September 19, 2013 at 2:00 am

8. @Andrew: Good point. I think a better example would be telephone surveys (low response rates, biased, but they often give quite good answers).

fabiorojas

September 19, 2013 at 2:02 am

9. Generally telephone surveys are weighted later based on known distributions in the population, so they are pseudo-representative. Again, a different beast.
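
A rough sketch of what that kind of weighting looks like in its simplest form, post-stratification on a single grouping variable. The function name, the two age groups, the survey answers, and the population shares are all hypothetical.

```python
from collections import defaultdict

def poststratified_mean(sample, population_shares):
    """Weight each group's sample mean by that group's known population share.

    sample: list of (group, value) pairs
    population_shares: dict mapping group -> share of the population (sums to 1)
    Assumes every group in population_shares appears at least once in the sample.
    """
    values_by_group = defaultdict(list)
    for group, value in sample:
        values_by_group[group].append(value)
    return sum(
        share * sum(values_by_group[group]) / len(values_by_group[group])
        for group, share in population_shares.items()
    )

# Made-up survey in which older respondents are heavily over-represented.
sample = [("young", 1)] * 2 + [("old", 1)] * 4 + [("old", 0)] * 4
shares = {"young": 0.5, "old": 0.5}  # hypothetical known population shares

raw_mean = sum(value for _, value in sample) / len(sample)
weighted_mean = poststratified_mean(sample, shares)
print(raw_mean, weighted_mean)
```

In this toy case the raw mean is 0.6 and the weighted mean is 0.75. The correction only helps to the extent that respondents within each group resemble the non-respondents in that group.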

andrewperrin

September 19, 2013 at 2:03 am

10. “yes, sometimes it will be right by chance alone, but we have no way of knowing when it’s right and when it’s not, so the random chance of it being right is useless.”

That’s where external validation comes into play. If we obtain multiple samples from the same process and we have some external measure (e.g., repeated phone surveys vs. election results or census data), then we can look at the mean and the distribution of errors. It might be the case that it yields systematically biased results, but it might not. It’s an empirical, not a theoretical, question.

So I am not claiming: non-representative samples –> accurate results.

Instead, I am claiming: non-rep samples + external validation –> judgment about the degree of bias, which may be small or even zero.
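
As a sketch of what that judgment could look like in practice: compare repeated estimates from the same process against an external benchmark and summarize the errors. The function name and all of the numbers below are invented for illustration.

```python
import statistics

def error_summary(estimates, benchmarks):
    """Summarize the errors of repeated estimates against external benchmark values."""
    errors = [estimate - truth for estimate, truth in zip(estimates, benchmarks)]
    sd_error = statistics.stdev(errors)
    return {
        "mean_error": statistics.fmean(errors),  # systematic bias, if any
        "sd_error": sd_error,                    # typical size of a miss
        "std_error_of_mean": sd_error / len(errors) ** 0.5,
    }

# Invented figures: six survey estimates and the matching official results.
polls = [52.0, 47.5, 50.5, 49.0, 53.5, 48.0]
results = [51.0, 48.5, 50.0, 50.0, 52.0, 48.5]

summary = error_summary(polls, results)
print(summary)
```

A mean error that is small relative to its standard error gives no evidence of systematic bias in these comparisons. It does not prove there is none, and it says nothing about questions the benchmark does not cover.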

Phone surveys have multiple issues: There are biases introduced by response rates and biases introduced by people who don’t use phones or are unwilling to speak with surveyors. On the second issue, we have a concrete theoretical reason to believe that a specific bias in the data is connected to our dependent variable – race is correlated with telephone survey response and race is a predictor of voting. Thus, we have a good theoretical reason to reject biased, unweighted samples *for this question.* However, that doesn’t imply that all unrepresentative samples will yield incorrect answers to other questions.

fabiorojas

September 19, 2013 at 2:20 am

11. I’m going to ignore the Sherkat gambit and NOT tie this to the he-shall-not-be-named debate which, in my opinion, hinged on very different issues than whether the sample was random. Seems to me that we have a crisis all around as response rates on random samples have fallen well below 50%. My own thinking is pretty close to Fabio’s but I think Fabio puts the emphasis wrong. The question is NOT whether a well-constructed random sample with a high response rate from a well-defined universe is better than a non-random sample or a sample with a poor response rate. Of course it is.

The question is about whether and when SOME data is better than NO data, or whether the data being used is the best available to address the topic at issue. This becomes a matter of trying to get evidence about the nature of the sample biases and then trying to figure out how badly they distort the particular results. I think consensus is that sample biases are much more likely to give simply wrong answers for descriptive one-variable studies but that sample bias MAY be less of a problem in assessing correlations, especially if you can control for other factors and/or do fancy tests for the sensitivity of results to various kinds of sample distortions.

And see Becky Pettit’s *Invisible Men* for an indictment of all the official random samples that systematically undercount Black men because they don’t count prisoners and they don’t count people with no fixed address.

Those of us who work in, say, social movements, where there are no equivalents of the Census, constantly have to explain why the imperfect samples we study are better than no samples at all, even as I and other knowledgeable reviewers push people in the field to think hard about what the sample constraints are and how that could be affecting the results.

olderwoman

September 19, 2013 at 2:54 am

12. This is an interesting topic to come up, especially with Sherkat’s response, on the heels of the epic threads on critical realism. There’s a particularly American problem at the heart of this, and it’s a problem of–wait for it–ontology.

Say all of us were in Norway having this discussion. The debate would be basically, “You don’t need to have a representative sample of Norway to say interesting things about social life” vs. “You must have a representative sample of Norway to adjudicate any important sociological question.”

And anybody from elsewhere looking on would think it remarkable that the first position was cast as provocative, and the second position would seem hopelessly provincial. Don’t these people realize there is an entire freaking world out there that is not Norway? Or do they think once you have a representative sample about Norway you can speak in transcendent language about the entire world?

Same logic applies to the US: Why would the sine qua non of “social science” be an average causal effect estimate that is based on the American population? Is the idea that people in other countries should all be using American samples, because the exact proportions of diversity of America are the only diversity that matters for social science? Or should everybody be using their own countries? There’s lots of datasets that have representative data on a single state, or a single metropolitan area. Are these completely worthless data that only careerists would use, or is the idea that it’s important that a sample needs to represent something in order to be useful, but it doesn’t really matter what?

Perhaps it seems like there is something wrong with all this, maybe even more than wrong, maybe confused from top to bottom. Especially for the whole raft of social science that is trying to understand cause-and-effect relationships. Trying to boil causal inference in social science down to population parameter estimates is inherently problematic, however you feel about the role of different sampling techniques at achieving such estimates. To a first approximation, science is about trying to understand how the world works, not trying to assign summary numbers to populations.

jeremy

September 19, 2013 at 3:45 am

13. Predicting the outcome based on futures markets is not the same at all as inferring a population parameter from a measurement of that parameter in a sample (i.e., a statistic). It would only be the same if our prediction were based on asking futures traders who they would vote for. Instead the information we get from the financial market is what outcome is predicted by people who are in the business of being right about their predictions.

Also, it is important to distinguish between non-representative samples (which have no formal meaning, and are not necessarily biased) and biased samples, which have a formal meaning: the mean of the sampling distribution produced by the sampling procedure has a different value than the mean of the population. So your statement “So far, I have not read a general argument showing that unrepresentative or biased samples in *all* cases leads to systematic biases in the estimated parameters” is both vacuous and a tautology: vacuous because “representativeness” is not a statistical concept but a heuristic one that basically means, “sample in such a way as to avoid bias,” tautological because, yes, in all cases biased sampling techniques lead to systematic biases in the estimated parameters, that is the definition of statistical bias.
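
Written in symbols, that formal definition looks roughly as follows. This is a sketch under a simplifying assumption not stated in the thread: each unit $i$ in a population of size $N$ enters the sample independently with its own probability $\pi_i$, where $\bar{y}_s$ is the sample mean, $\bar{Y}$ the population mean, and $\bar{\pi}$ the average inclusion probability.

$$
\operatorname{Bias}(\bar{y}_s) = E[\bar{y}_s] - \bar{Y},
\qquad
E[\bar{y}_s] \approx \frac{\sum_{i=1}^{N} \pi_i y_i}{\sum_{i=1}^{N} \pi_i}
\quad\Longrightarrow\quad
\operatorname{Bias}(\bar{y}_s) \approx \frac{\operatorname{Cov}(\pi, y)}{\bar{\pi}},
\qquad
\operatorname{Cov}(\pi, y) = \frac{1}{N}\sum_{i=1}^{N} (\pi_i - \bar{\pi})(y_i - \bar{Y}).
$$

On this approximation, the bias of the sample mean is driven by how strongly the chance of inclusion covaries with the outcome. Unequal inclusion probabilities that are unrelated to $y$ leave the mean approximately unbiased.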

seriously?

September 19, 2013 at 7:17 am

14. Fascinating how this evoked such different responses. Jeremy: Fabio didn’t write “representative sample of Americans,” he wrote “representative sample.” OW: he didn’t write “official random samples,” he wrote “representative sample.”

@Fabio: I don’t think there is anything controversial about saying that external validation plus some kind of data lets you say something interesting. But I don’t think it approaches representativity.

andrewperrin

September 19, 2013 at 10:52 am

15. @seriously?: Thank you. Mark this date/post. It will be remembered as the precise moment that Orgtheory jumped the shark. Embarrassing. Something is broken here. Irreparably. I’m out. [drops mic]

otnomore

September 19, 2013 at 11:16 am

16. Random samples need comprehensive sample frames. Many populations lack anything that resembles a good sample frame: groups such as homeless people, political protesters, or sexual minorities. The fascination with random samples is great for some research questions, but it can block the publication of studies of important subpopulations in the world. Sometimes data with weaker research designs is better than no data.

Eric Swank

September 19, 2013 at 12:58 pm

17. Andrew: I wasn’t attacking the Census, I was alluding (albeit cryptically) to the need always to think about who is or is not in the sampling frame, as well as how the sample is drawn from the frame. The problem with “representivity” is figuring out what or who is being represented and a random sample from a poor frame is no better and possibly worse than a purposive sample from a better frame. I’m not trying to slap anybody, I’m trying to say that sampling is actually a deep topic that requires a lot of thinking and work and testing, not a set of bullet points you can fit on a placard. AND qua Jeremy that samples for identifying causal relations ARE different in their requirements from samples for estimating population parameters.

olderwoman

September 19, 2013 at 2:51 pm

18. I never took Logics 101 but I think there is a distinction here between a systematically biased estimate and an estimate that is close to the true or real value (hence one connection to the CR debate).

An unrepresentative sample will, by definition, always yield a systematically biased estimate (in the probabilistic sense of bias). A systematically biased estimate could, by chance, be close to the true value in the population. I think this is andrewperrin’s point. If non-response is not randomly distributed, then there will be systematic bias in the estimates. So from a pure probability perspective, an unrepresentative sample is a useless means of obtaining the true population estimate. How useless it is depends on how unrepresentative the sample is.

That being said it seems to me that the broader question of unrepresentative samples being useful relates to the issue that sample error may be the least of the problems researchers are confronted with. Social scientists often want to make universal claims about their samples without explicitly noting the limits of the population sampled (this is jeremy’s point).

Furthermore there is an assumption in sampling methods that populations are stable, or even definable. Here is where Fabio’s insights about the value of non-representative samples are most useful. Is there such a thing as a population of social movements, for example? What does that mean? At any moment in time? We see the same issues in marketing, where representative samples provide incorrect estimates of the true value (e.g., do you prefer Pepsi or the New Coke?). Focus groups, which are non-representative, may provide better indication of the success of new product introductions.

This discussion does relate to the CR debate because random samples of the population of the actual do not typically reveal the essence of the mechanisms we are trying to uncover. What an actual population looks like, at a particular time and place, is what sampling methods provide. If that is what you are looking for, you should seek representative samples. But I agree with Jeremy that this has little to do with helping us understand how the world works.

William Ocasio

September 19, 2013 at 5:15 pm

19. @otnomore: Agreed. You could see the Fonz slipping on a pair of swimming trunks (and even reaching for his water skis) during some of the worst of the a\$\$hattery surrounding the CR debate, but he’s surely jumped the shark now!

But seriously – what is it called when you troll your own blog? – the OP has taken to saying all sorts of things he may or may not believe lately. (He can’t really mean it when he feigns doubt that biased samples will yield…biased estimates.) In short, I believe we are all being played. Regardless, stick around and look on the bright side: we all get access to a fair bit of properly “backstage” information as those trolled by the OP rush to express support/indignation, useful info to keep in one’s back pocket for non-virtual academic gamesmanship going forward.

MLJ

September 19, 2013 at 6:38 pm

20. @MLJ: I am dead serious. Write out the equations – the only way that selection bias *automatically* entails biased estimates is when the selection bias is correlated with the estimated parameter. When selection bias is uncorrelated anything can happen. The final bias can be large. Or the empirical size of the bias may be small, or there may be none at all. There is no a-priori argument for predicting the size of biased estimates. It really has to be judged on a case by case basis.
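
A toy simulation of the two situations, offered as a sketch rather than a proof. The population, the response probabilities, and the unrelated trait `z` are all invented for illustration.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed for repeatability

# Hypothetical population: outcome y, plus a trait z that is independent of y.
population = [(rng.gauss(50, 10), rng.random() < 0.5) for _ in range(50_000)]
true_mean = statistics.fmean([y for y, _ in population])

def responds(favored):
    """Favored units respond with probability 0.45, everyone else with 0.05."""
    return rng.random() < (0.45 if favored else 0.05)

# Selection driven by z only: people with the trait respond 9 times as often.
by_trait = [y for y, z in population if responds(z)]
# Selection driven by y itself: high scorers respond 9 times as often.
by_outcome = [y for y, z in population if responds(y > 50)]

bias_trait = statistics.fmean(by_trait) - true_mean
bias_outcome = statistics.fmean(by_outcome) - true_mean
print(f"selection on an unrelated trait: bias {bias_trait:+.2f}")
print(f"selection on the outcome:        bias {bias_outcome:+.2f}")
```

Both samples are badly unrepresentative, yet only the one whose selection mechanism is tied to the outcome should show a sizable error in the mean. In real data the hard part is knowing which of the two situations you are in.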

On a less serious note, this blog combines serious scholarly commentary, announcements, humor, hype, taunting (mainly from me), and whatever else the authors think is interesting. Except for April Fools’ Day, I don’t think I have ever read any post that was written as a “put on.” We mean what we write. And finally, I accepted a long time ago that I’ve jumped the shark and have moved into an Aqua Man phase of my career. It suits me well.

fabiorojas

September 19, 2013 at 6:48 pm

21. There are many examples in which people use “nonrepresentative samples” to refer to samples in which the distribution of control variables departs from the distribution of those same variables in the population. So, for example, people take Mechanical Turk samples to be “non-representative” because they are much more female, much younger, and lean much more to the left than the general population.

If one showed that regression estimates for some study using a MT sample were similar to population parameters, I don’t think usual usage would lead people to conclude that, despite being radically different from the population in terms of the univariate statistics, MT is nevertheless a “representative sample” for the purposes of the study. Instead, people would say, “huh, it’s surprising how good those estimates turned out considering what a non-representative sample it is.”

I guess one could say that folks are just using the word “non-representative sample” incorrectly, but if that’s the position, it would be useful for somebody to explain what terminology should be used for samples whose univariate characteristics diverge dramatically from that of the population. Because it’s an important concept.

Anyway, as Fabio points out, it’s a fairly basic point that regression estimates are unbiased if the model is correctly specified even when the sample distribution of covariates differs dramatically from the population covariates. So, while I sort of disagree with the entire premise of his post in terms of the place it gives to population parameter estimation in the work of social science, this idea that his argument is “tautological” and brings shame to the whole of sociology blogging, or whatever, is wrong.
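
A small simulation sketch of that basic point, under the strong assumption that the linear model is exactly right and that selection depends only on the covariate. The data-generating numbers and the helper `ols_slope` are made up for illustration.

```python
import random
import statistics

def ols_slope(xs, ys):
    """Slope of the least-squares line of ys on xs (one predictor plus intercept)."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

rng = random.Random(2)  # fixed seed for repeatability

# Hypothetical population where the linear model holds: y = 2 + 3x + noise.
xs = [rng.uniform(0, 10) for _ in range(50_000)]
ys = [2 + 3 * x + rng.gauss(0, 5) for x in xs]

# Unrepresentative sample: units with x < 2 are 15 times as likely to be included.
# Selection depends on x only, never on the noise term.
kept = [(x, y) for x, y in zip(xs, ys) if rng.random() < (0.30 if x < 2 else 0.02)]
sample_xs = [x for x, _ in kept]
sample_ys = [y for _, y in kept]

pop_mean_x = statistics.fmean(xs)
sample_mean_x = statistics.fmean(sample_xs)
sample_slope = ols_slope(sample_xs, sample_ys)
print(f"mean of x: population {pop_mean_x:.2f}, sample {sample_mean_x:.2f}")
print(f"slope from the skewed sample: {sample_slope:.2f} (true value 3)")
```

The sample's mean of `x` should land far from the population's, while the estimated slope stays close to 3. If selection depended on the noise term, or if the straight-line model were wrong, that would no longer be guaranteed.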

Jeremy

September 20, 2013 at 5:53 am

22. I think it’s just a new thing for commenters to get huffy and drop the mic and stomp out of the room whenever they don’t like something Fabio or somebody else on the blog says. I hope it keeps happening because it adds a sorely needed melodramatic element to our interactions. I’m really hoping that somebody will challenge Fabio to a duel soon as this is something I would fly in to see.

brayden king

September 20, 2013 at 12:06 pm

23. Jeremy’s comment made me think of Winship and Radbill’s work on the use of sampling weights in regression. The basic gist of their argument is that while you probably want to use sampling weights when doing things like constructing univariate statistics, there are very good reasons to forget about weights when running regression. In the presence of a large oversample of one kind or another, dropping the weights is akin to using an unrepresentative sample, no?

To be fair, Winship and Radbill do walk through conditions where the failure to include weights in the regression can lead to problems (some of which are easily solved), so this is in no way a completely unqualified argument in favor of unrepresentative samples. So why bother mentioning this article at all? Two reasons:

1. It does a nice job at articulating the logic behind why a sample that is unrepresentative with respect to, say, univariate measures of central tendency can still produce valid estimates of the relationship between those variables

2. It makes clear that our ability to use unrepresentative samples in this manner is not a function of magic or chance: we can walk through the conditions under which unrepresentative samples will succeed or fail

Adam Slez

September 20, 2013 at 12:07 pm

24. Right on, Brayden! (Does anyone still say this?) As I stumble inexorably forward into my dotage, I often forget why I have been reading this blog since its inception. I am not a sociologist, but I recollect believing that organizational sociology/org theory had much to offer to my prosaic, pedestrian, applied work. And this blogsite has been an important portal. The portal has diminished to become a peephole in a construction fence, but luckily, the discernible but disorderly activity inside the fence often breaks into an amusing cat fight. Pass the popcorn.

Randy

September 20, 2013 at 1:33 pm

25. A duel with Aqua Man?…. I dunno….Would it occur in the domain of the Actual?….Is a duel a representative sample?….Will it empirically validate his causal powers?.

willieo

September 20, 2013 at 2:09 pm

26. Fabio, I don’t buy the “deadly serious” and still believe there has been some deep trolling going on here lately. (Mercenary trolling is still trolling, isn’t it?) @Adam Slez: There is a fundamental difference between what Winship is arguing and Fabio’s “go out and grab some data from the internets and use classical statistics to make inferences to a population…it *might be* all good…as long as you are realllly smart and think realllly hard about it.” But I digress: Fabio, Aqua Man or Ocean Master? And do I detect a whiff of the critical realism fellow traveller in the inequality, brains+hard thinking > GIGO?

MLJ

September 20, 2013 at 4:04 pm

27. @MLJ: I will freely admit to hyping blog posts, but at the core of every post is a very serious issue. In the critical realism debate, I really do think that sociologists are best served by avoiding that school of thought. In the case of this post, every line is serious. I really do think that sociologists have a knee-jerk reaction against non-representative samples that is simply not justified. No “deep trolling.” Just my opinions, served with some rhetoric. There is no Straussian interpretation of this blog (http://en.wikipedia.org/wiki/Leo_Strauss#Strauss_on_reading).

fabiorojas

September 20, 2013 at 5:01 pm

28. Oh my … I just don’t know where I would start in engaging this. Just a few points. On the narrow point, Fabio is correct that, by chance, a sample that is based on a design with none of the virtuous random draw schemes invented by survey statisticians, can indeed generate a sample mean that is equal to the true mean in the population. His usage of the central limit theorem, however, is incorrect. This is for the convergence of an estimator in a general asymptotic framework. (The law of large numbers says that the parameter converges to the true point defined by the population, and the central limit theorem comes in to say that the bias around the true point, across repeated sampling, is normally distributed. The math is a bit more complicated than that, but that’s the essence, and it’s typically all done within a fictitious super population anyway to avoid finite-population corrections, etc.) So, Fabio just means the more basic point in this part of his post: One can get lucky, and you can’t develop a mathematical argument to eliminate such luck. The related point he could have made is that even a simple random sample can produce a sample mean that is quite far from the population mean, especially if the sample is relatively small. This happens much more than people realize, but they don’t know it because they don’t know what the target true parameter is.

The broader point, which I think is what Fabio really means (just channeling here) is: When analyzing data of some type, we are interested in estimating parameters of some type. Rarely in sociology are we interested in the population mean. So, it isn’t at all clear why we have a fascination with random samples of one sort or another. This is especially true in cases where we are considering data on collections of units that are countable (number of events or administrative units, etc.) or where the population is evolving so rapidly that a random sample at the time of the study isn’t worth that much going forward. In other words, if we have information about the population, then we can get a handle on biases in parameter estimates and discuss them. We need to do this even for simple random samples, and we don’t do that often enough.

I do think that sociology graduate programs don’t spend enough time teaching basic fundamentals of probability theory (or requiring that students somehow learn this material). We run off into models too quickly, start producing coefficients, standard errors, and too much muck results.

BTW: Sociological Science doesn’t require random samples, which we are clear about in our submission guidelines. And I have good reason to believe that at least some of our first published papers will not be based upon them.

Steve Morgan

September 20, 2013 at 5:52 pm

29. Thanks, Steve. You channel me well.

A random idea: maybe down the road SS could publish symposia exploring new data sources, or controversial data sources. In a short, concise format!

fabiorojas

September 20, 2013 at 5:57 pm

30. Also the claim about regression models and the correct specification is correct but usually misinterpreted. If you drop constant coefficients (which are basically never justified for the specification in textbooks or in practice), it all starts to unravel. You have to adopt a very encompassing definition of “correct specification” to save the claim, which essentially brings everything relevant about the study design into the model and higher-order interactions between everything. If you do that, you can get back to a point where you can in theory accept the claim based on the structural “correct specification” logic.
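
One way to read the point about constant coefficients, as a hedged sketch: if the slope differs across groups and the sample over-represents one group, a pooled regression that ignores group membership recovers a different slope than it would in the population. The two groups, their slopes of 1 and 5, and the inclusion rates are invented for illustration.

```python
import random
import statistics

def ols_slope(xs, ys):
    """Slope of the least-squares line of ys on xs (one predictor plus intercept)."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

rng = random.Random(3)  # fixed seed for repeatability

# Hypothetical population: two equally sized groups whose slopes differ (1 vs 5),
# so the pooled slope across the whole population is about 3.
population = []
for _ in range(40_000):
    group = "A" if rng.random() < 0.5 else "B"
    x = rng.uniform(0, 10)
    slope = 1 if group == "A" else 5
    population.append((group, x, slope * x + rng.gauss(0, 2)))

# Unrepresentative sample: group A is 9 times as likely to be included as group B.
sample = [row for row in population if rng.random() < (0.18 if row[0] == "A" else 0.02)]

pop_slope = ols_slope([x for _, x, _ in population], [y for _, _, y in population])
sample_slope = ols_slope([x for _, x, _ in sample], [y for _, _, y in sample])
print(f"pooled slope: population {pop_slope:.2f}, skewed sample {sample_slope:.2f}")
```

The pooled slope from the skewed sample should be pulled toward group A's value. Putting the group and its interaction with `x` into the model would recover each group's slope, which is the sense in which "correct specification" has to absorb the features of the study design.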

Steve Morgan

September 20, 2013 at 6:08 pm

31. Fabio: Never, ever use that acronym. It is SocSci. Don’t forget it.

Steve Morgan

September 20, 2013 at 6:10 pm

32. Branding message received!!! SocSci should run debates about new data or controversial data!!!

fabiorojas

September 20, 2013 at 6:11 pm

33. @jeremy, forgive me for having a more specific criticism of fabio than “I kind of disagree with the entire premise of his post.” Anyway, this is encouraging — I’ll assume that you all follow fabio around in seminars, meetings, panels etc. clarifying his “basic” and “broader points” and why they’re actually not so unreasonable after all.

seriously?

September 20, 2013 at 8:04 pm

34. […] The joy of unrepresentative samples – It’s uncontroversial in the commercial world that biased samples can still produce useful results, as long as you are careful. There are techniques that help you understand your sample, like bootstrapping, and we’re lucky enough to have frequent external validation because we’re almost always measuring so we can make changes, and then we see if they work according to our models. The comments on this post are worth reading because the approach seems to offend some sociologists viscerally. (via Trey Causey and Benjamin Lind) […]

35. There’s a point-counterpoint debate on the same topic in the current issue of International Journal of Epidemiology:

Rothman et al. “Why representativeness should be avoided”
http://ije.oxfordjournals.org/content/42/4/1012.extract
“Why do so many believe that selecting representative study populations is a fundamental research aim for scientific studies? This view is widely held: representativeness is exalted along with motherhood, apple pie and statistical significance. …”

Markus

September 25, 2013 at 1:24 pm

36. I’m surprised you haven’t connected this argument to the logic underlying qualitative research – “oversampling” on the variable of interest in order to discover properties, different assumptions about generalizability/reliability, etc… It seems like we have some pretty good answers in other domains.

Erica

September 26, 2013 at 3:54 am