three ways of talking about the variables that you don’t care about

It would be nice if the world of quantitative data analysis in social science were like the one envisioned by people like Christopher Achen, where every regression was a bivariate regression or, at worst, a model with 2-4 right-hand-side variables.  Instead, we are stuck living in a “garbage can” world where, even though we only care about the relation between one thing and some other thing, we write papers that end up including long vectors of other stuff that we don’t care about.

How we talk about the rest of the garbage can, however, is important, since it reflects mini-theories about the research process (sometimes consciously held, most of the time mindlessly deployed) about what you (think you) are doing when you regress Y on whatever.  Different people recommend different lingo, and this lingo is related to what they think you are accomplishing by augmenting your regression specification.  So get clear on what you are doing and modify your vocabulary accordingly.

In my view, there are two broad practical purposes of regression analysis and they are reflected in the relevant vocabulary.

Descriptive inference.- For instance, Andrew Gelman prefers to talk about “input” variables (rather than independent variables); and what do you do with input variables? Well, you “adjust” for them.  This kind of language is, in my view, appropriate for the bulk of quantitative analysis of observational data in social science. Here, the researcher cannot, does not, or should not be making strong claims that their favorite X has a causal effect on their favorite Y. Instead, the researcher is just interested in the “net effect” of X on Y, and adjusts for other inputs to make sure that the effect is there within levels of their (additive) combination.  This language is nice and neutral and does not commit you to problematic assumptions about inference.
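A minimal sketch of what “adjusting” amounts to mechanically (all variable names, coefficients, and sample sizes here are made up for illustration): fit the outcome on your favorite input and the rest of the garbage can additively, and read off the coefficient on your favorite input as the “net” association within levels of the others.

```python
import numpy as np

# Hypothetical inputs: x1 is the input we care about, x2 stands in for
# "the rest of the garbage can". Everything here is simulated.
rng = np.random.default_rng(0)
n = 1000
x2 = rng.normal(size=n)
x1 = 0.5 * x2 + rng.normal(size=n)
y = 2.0 * x1 + 1.0 * x2 + rng.normal(size=n)

# "Adjusting for" x2: include it additively and read off the
# coefficient on x1 -- the net association of x1 with y within
# levels of x2. No causal claim is implied.
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Nothing in the fitting step distinguishes “adjusting” from “conditioning”; the difference the post is after lies entirely in what you claim the coefficient means.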

Causal inference.- People like Stephen Morgan, on the other hand, like to talk about “conditioning on” the values of the other Xs. This language is borrowed from a long tradition of causal inference in experimental and non-experimental research, and it is appropriate when that’s exactly what the researcher thinks that he or she is doing.  In particular, if you follow the recent systematization, using directed acyclic graphs, of what it is that those other Xs that you don’t directly care about actually do in the context of causal inference (in the recent writings by Pearl and by Morgan and Winship), then it is clear why it is that you are “conditioning” on them: you block backdoor paths linking your favorite X to your favorite Y by conditioning on those other pesky Xs (and only those) that connect the fave with the outcome.  The language of “conditioning” here also makes explicit that you are working from within the conditional independence assumption, which is the fundamental bedrock of causal inference with observational data.
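A tiny made-up simulation can illustrate the backdoor logic: a confounder Z sits on the path X ← Z → Y, the naive bivariate fit is biased, and conditioning on Z blocks the backdoor path and recovers the causal coefficient. All numbers here are invented for the sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
z = rng.normal(size=n)                        # confounder on the backdoor path X <- Z -> Y
x = z + rng.normal(size=n)                    # favorite X, partly driven by Z
y = 1.0 * x + 2.0 * z + rng.normal(size=n)    # true causal effect of X on Y is 1.0

# Naive bivariate fit: the backdoor path is open, so the slope is biased.
b_naive = np.polyfit(x, y, 1)[0]

# Conditioning on Z blocks the backdoor path; the coefficient on x
# now recovers the causal effect (under conditional independence).
X = np.column_stack([np.ones(n), x, z])
b_cond = np.linalg.lstsq(X, y, rcond=None)[0][1]
```

Here the naive slope lands near 2.0 while the conditioned one lands near the true 1.0, which is exactly the sense in which those pesky Xs earn their place in the specification.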

So there you have it.  If you are just running descriptive regressions to see what goes with what and are interested in “effects” (as long as you don’t fall into the trap of talking about these effects as having anything to do with causes), then when you write your papers, you just say “adjust for.”  Also, I think it would be good to follow Gelman and stay away from the language of “dependent” and “independent” variables; this is just mindless importation of lingo that comes from the experimental tradition into a setting where it does not belong.  On the other hand, if you are working from an explicit framework in which your main interest is in the estimation of causal effects, then you should say “conditioning on.”  This makes it clear what you are referring to (“conditional on X2, when I jiggle my favorite X my favorite Y also jiggles”).

What does that leave out?  Well, it leaves out the language that most people use when talking about the Xs that they don’t care about:  the language of “controlling for.”  I think this is a generally stupid way of referring to what you are doing.  This vocabulary reveals that you may not even be self-conscious as to what exactly you are doing in that piece of research. It imports categories from the experimental tradition (the notion of a “control group”) that have no business in observational data analysis, and it makes it sound as if you are not aware of the spectacular inappropriateness of such assumptions in that context (every paper I’ve written that analyzes quantitative data uses this phrase, making me moron number one).  So stop saying “control for”!

Of course, I am under no illusions that people will actually stop using this phrase given how mindlessly and insidiously it has become insinuated into the language of social research; but you can always hope.

Written by Omar

December 3, 2011 at 2:42 pm

7 Responses


  1. I appreciate your argument, Omar. But can we get back to your seminal point: the literature is rife with articles that use the basketful-of-variables approach. And, of course, we have to sit through tedious discussions of Model 1, Model 2, … Model N, where handfuls of variables are thrown into the basket with minimal justification, then pulled out and replaced (repeat). So arguing whether we are conditioning or inputting is a bit like polishing a cow flop, as grandma used to say.



    December 3, 2011 at 10:11 pm

  2. Randy: Right. But I think part of the point is that if you are just doing descriptive inference and doing the model 1, model 2, model X… thing, then you should certainly not use the pseudo-experimental language of controls, independent, dependent, etc. Just talk about response variables, inputs, and net effects. But if you are following a more disciplined approach (e.g. you are interested in causal effects), then there are rules that determine inclusion and exclusion, so that a reviewer can, with all legitimacy, say “you conditioned on an irrelevant/inappropriate variable” (e.g. you conditioned on a variable X2 that happens after X1 when you were interested in the causal effect of X1 on Y, like the silly specification that adjusts for characteristics of occupations when estimating the causal effect of education on wages). This sort of determination is not possible when all that you are doing is fitting descriptive models to observational data.
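    To make the occupation example concrete, here is a made-up simulation (all names and coefficients invented) in which occupation is a consequence of education: adjusting for it blocks part of the very effect you wanted to estimate.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
educ = rng.normal(size=n)
occ = educ + rng.normal(size=n)                       # occupation: a consequence of education
wage = 1.0 * educ + 1.0 * occ + rng.normal(size=n)    # total causal effect of educ on wage = 2.0

# Total effect: regress wage on education alone.
b_total = np.polyfit(educ, wage, 1)[0]

# Conditioning on the post-treatment occupation variable
# blocks the mediated part of the effect -- the "silly
# specification" from the comment above.
X = np.column_stack([np.ones(n), educ, occ])
b_bad = np.linalg.lstsq(X, wage, rcond=None)[0][1]
```

    The first fit lands near the full 2.0; the second lands near 1.0, the direct path only, which is precisely why a reviewer can legitimately object to conditioning on a variable that happens after the treatment.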



    December 3, 2011 at 10:34 pm

  3. I’m curious about what changed your mind about how to write about observational models, especially since you had up to now been using the language you now think is stupid. I’m putting the finishing (I hope) touches on a quantitative article where I’m clearly doing thing #1, so this is very relevant…


    Daniel L

    December 4, 2011 at 4:53 am

  4. I am pretty unsophisticated about such stuff, but I am pretty confident about this: Our multiple regression models (and their elaborated upgrades) are efforts to simulate experiments in which we would have manipulated treatments. The models are poor efforts at simulation but often all we have, given the limits on our ability to experiment on people, organizations, or nations. At the least, however, our write-ups of these ersatz causal tests should remain humble.



    December 4, 2011 at 9:03 pm

  5. The real problem is the fallacy that sociology can only be a science when humans are reduced to billiard balls. To skip and dance around the lack of a true scientific method, caused by poor epistemology, buzz phrases such as “data conditioned on” attempt to fend off attacks from postmodernists.

    People are complicated. They lie. They lie to themselves. Researching atheism, I found a Pew Poll reporting that 3% of the people who self-identify as “evangelical Christians” are uncertain about the divinity of Christ and (perhaps the same 3%, but maybe not) the existence of God. Clearly, these people have no idea what they are talking about. Yet, 3% is the number of Jews and Muslims in America. We assume that they know what they mean when they self-identify.

    A long time ago, working on a TRS-80 with census data from the Lansing, Michigan, tri-county area, I was surprised at the large number of Eskimos living in our community. Then, I caught on…

    Bivariate that.


    Michael E. Marotta

    December 4, 2011 at 10:56 pm

  6. Omar, Thanks for helping us to sort out our semantics (or as Lazarsfeld called it a while back, ‘the language of social research’).

    I favor ‘outcome variable’, ‘causal variable’, and either ‘covariate’ (when doing descriptive stuff) or ‘adjustment variable’ (when doing causal analysis). I tend to use ‘conditioning on’ language when discussing identification issues, shifting toward adjustment language when talking about what models actually do (when they are models that don’t actually stratify the sample).

    I don’t really like using the word “effect” at all unless I am doing causal analysis. I seem to get by fine with just labeling coefficients as coefficients and using associational language in descriptive analysis projects. Effect just connotes too much causality for me, and so I don’t like the practice of labeling coefficients as ‘net effects’ unless the effect part is supposed to be causal.

    Of course, it is hard to be a purist about any of this stuff, since in the end reviewers and editors make us do all kinds of stuff that we would prefer not to. Even in moments of weakness, I am sure I’ve strayed from my own standards. And I don’t judge others too harshly on such things, as long as what is behind the language has merit.


    Steve Morgan

    December 5, 2011 at 5:30 pm

  7. […] Three ways of talking about the variables that you don’t care about. Omar of OrgTheory posted two excellent, medium-length think pieces in the last few weeks. The first concerns “control variables” and why we usually shouldn’t use that term, with interesting thoughts about different modes of thinking about what we are actually doing when we “control” for X1..Xn. In teaching a research methods course, I struggled with this language and my students’ attempts to employ it, and this post was an excellent quick summary of my frustrations. […]


