is this the beginning of the end?
Brayden
I’m not talking about the end of the world (which would need much better statistical forecasting tools than we have now), but the end of a dominant paradigm in sociology and in most social sciences: the belief that statistical significance tests are useful criteria for distinguishing real effects. Here is the abstract of a recently published article by J. Scott Armstrong in the International Journal of Forecasting:
I briefly summarize prior research showing that tests of statistical significance are improperly used even in leading scholarly journals. Attempts to educate researchers to avoid pitfalls have had little success. Even when done properly, however, statistical significance tests are of no value. Other researchers have discussed reasons for these failures. I was unable to find empirical evidence to support the use of significance tests under any conditions. I then show that tests of statistical significance are harmful to the development of scientific knowledge because they distract the researcher from the use of proper methods. I illustrate the dangers of significance tests by examining a re-analysis of the M3-Competition. Although the authors of the re-analysis conducted a proper series of statistical tests, they suggested that the original M3-Competition was not justified in concluding that combined forecasts reduce errors, and that the selection of the best method is dependent on the selection of a proper error measure. I show that the original conclusions were correct. Authors should avoid tests of statistical significance; instead, they should report on effect sizes, confidence intervals, replications/extensions, and meta-analyses. Practitioners should ignore significance tests and journals should discourage them. (italics added for emphasis)
Andrew Gelman has some thoughts on the paper. Erin Leahey has written about the development of statistical significance standards in sociology. Jeremy blogged about how statistical significance criteria may shape research practices.
I haven’t read the article so I can’t say too much. That said, watch as I say too much!
I agree that statistical tests are often misused, particularly as people frequently attach more import to significance tests than is warranted. This is doubly true when we consider how often the assumptions of various tests are casually violated. Suppressing this sort of behavior is a necessary part of the scientific process. I also agree with the author that attention must be paid to methods: stats are a way of processing data but, really, garbage in, garbage out.
That said, I think that the use of stats as a counter-agent to confirmation bias is a very good idea. In many ways, I find them indispensable in this role. Similarly, I’m amused at the emphasis on confidence intervals. A confidence interval is most often based on the same probability and sampling logic as a significance test. In fact, confidence intervals can be, and are, used for a crude type of significance testing. In my view, arguing for confidence intervals instead of significance tests is a little like arguing for a yard stick instead of a meter stick. Different instruments, but they’re still measuring the same stuff.
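To make the yard stick versus meter stick point concrete, here is a minimal sketch, assuming a normally distributed estimate with a known standard error (a simplification; the function name and the numbers are made up for illustration). Under that assumption, a two-sided test at the .05 level rejects a zero effect exactly when the 95% confidence interval excludes zero.

```python
from statistics import NormalDist


def z_test_and_ci(estimate, std_error, alpha=0.05):
    """Two-sided z-test of H0: effect = 0, plus the matching (1 - alpha) interval.

    Hypothetical helper: assumes a normal sampling distribution with a known
    standard error, which real analyses often cannot take for granted.
    """
    std_normal = NormalDist()
    z = estimate / std_error
    p_value = 2 * (1 - std_normal.cdf(abs(z)))
    critical = std_normal.inv_cdf(1 - alpha / 2)
    lower = estimate - critical * std_error
    upper = estimate + critical * std_error
    return p_value, (lower, upper)


# Illustrative estimates, all with a standard error of 1.0.
for estimate in (0.5, 1.9, 2.0, 3.0):
    p_value, (lower, upper) = z_test_and_ci(estimate, std_error=1.0)
    rejects_null = p_value < 0.05
    ci_excludes_zero = lower > 0 or upper < 0
    print(
        f"estimate={estimate:.1f} p={p_value:.3f} "
        f"CI=({lower:.2f}, {upper:.2f}) "
        f"reject={rejects_null} excludes_zero={ci_excludes_zero}"
    )
```

The two yes/no answers agree on every row, but the interval also reports a range of plausible effect sizes, which the p-value alone does not.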
Just my admittedly poorly thought out $0.02 worth…
Drek
June 18, 2007 at 9:40 pm
“Even when done properly, however, statistical significance tests are of no value. Other researchers have discussed reasons for these failures. I was unable to find empirical evidence to support the use of significance tests under any conditions.”
My Sciencedirect thingy isn’t working tonight, but I HOPE there is an implied “in forecasting” in both sentences. Otherwise, this is literal nonsense…as in it Makes. No. Sense. At. All. Armstrong is simply MSO…which I suppose is par for the course at the “International Journal of Forecasting.”
james
June 19, 2007 at 9:16 am
Yes, the implied conditional in that statement is “in forecasting.” From a quick look at the peer commentary following the article, I’d say that Drek is seriously undervaluing his own currency, as most of the opinions appear to concur with his assessment. For people who do theoretical science, I doubt that Armstrong’s diatribe against significance testing is going to be of much consequence. Sometimes, for purposes of testing theory, statistical significance is all that you are looking for. Unfortunately, the word significance is usually interpreted as “it makes a huge difference.” However, a tiny effect (say, a mean difference of 0.02 on some arbitrary scale) can often be statistically significant. Conversely, a giant difference can fail to reach some conventional threshold because of sampling issues or low statistical power. In writing results sections, I usually follow McCloskey’s (1985) advice and write “statistical significance” when I mean either a coefficient that is discernibly different from zero or a difference that passes the conventional threshold. I only use the term “significance” without the statistical qualifier when I mean “huge” or “really big difference” between two groups or “really big effect” of X on Y (like the effect of WORDSUM on the logged odds of not knowing that the earth goes around the sun). It’s cumbersome, but it does the job.
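A rough sketch of the arithmetic behind that contrast, assuming a two-sample z-test with equal group sizes and a known, common standard deviation of 1 (a simplification; with five cases per group a t-test would be the usual choice and would be even less likely to reject). The function name and sample sizes are invented for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z_p_value(mean_diff, sd, n_per_group):
    """Two-sided p-value for a difference in means between two equal-sized groups.

    Hypothetical helper: treats the standard deviation as known and shared.
    """
    std_error = sd * sqrt(2 / n_per_group)
    z = mean_diff / std_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


# A tiny difference (0.02) measured on 100,000 cases per group.
tiny_effect_p = two_sample_z_p_value(mean_diff=0.02, sd=1.0, n_per_group=100_000)

# A large difference (0.8) measured on only 5 cases per group.
large_effect_p = two_sample_z_p_value(mean_diff=0.8, sd=1.0, n_per_group=5)

print(f"tiny effect, huge sample:  p = {tiny_effect_p:.6f}")
print(f"large effect, tiny sample: p = {large_effect_p:.3f}")
```

With these made-up numbers the 0.02 difference clears the .05 threshold easily while the 0.8 difference does not, which is why the p-value by itself says little about how big an effect is.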
Omar
June 19, 2007 at 1:30 pm
First, I do not see the harm in significance tests, especially, as Omar noted, if one is doing theoretically driven work rather than simply testing variables with synecdochical value. Second, significance tests, like any other social science method, are good tools for historical analysis.
Here are a few questions:
(1) How can one possibly make predictions based on past information (b.k.a. data)?
(2) Is it possible that all of the “possibilities” can be known to the researcher with only past information (b.k.a. data)?
After you chew on these, it is not so difficult to discern the problems with forecasting in the social sciences.
Brian Pitt
June 19, 2007 at 2:31 pm
First, let me say that this isn’t my new cause or anything, but I think it’s interesting that there is a raging debate about the utility of statistical significance testing among statisticians, and yet most social scientists are relatively silent on the topic. I’m only suggesting that we be a little more reflexive in our use of the practice. I, for one, am not about to give up the use of p-values. I like getting published and if that is our main tool of inference, then I’ll use it.
That said, I think Omar is downplaying the reach of Armstrong’s comments. He actually intends that statistical significance tests not be used in scientific research, ever! Strong statement from a guy who’s well respected for his knowledge of methods. Sure, the International Journal of Forecasting isn’t well known to most of us, but this argument has been made before in other places. For example, Armstrong cites a paper written by Hubbard and Bayarri in the American Statistician (2003).
There are numerous problems with significance tests, many of which are common to our field of study. For example,
1. Statistical significance tests only tell you when you can assume that an effect differs from the null hypothesis that the effect is zero. Much of the time we are already pretty sure that the effect is not zero – what we want to know is how big is that effect. Statistical significance tests obscure our aim to determine the size of effects.
2. Related to this, as Omar pointed out, statistical significance is not the same thing as substantive significance.
3. P-values are often interpreted as the probability that the null hypothesis is true, but this is an incorrect interpretation. As Hubbard and Bayarri (2003) argue, “This is not only wrong, but p values and posterior probabilities of the null can differ by several orders of magnitude, the posterior probability always being larger.” So the use of p-values may sometimes boost our confidence in the probability that a result is real (even when we’re not trying to forecast). And as Taleb argues forcefully in his book, The Black Swan, this only makes us more susceptible to rare events.
4. My personal favorite – p-values are used by editors and reviewers to adjudicate when an effect is real and when it is not. Really, there isn’t a huge difference in our ability to make an inference about an effect that is significant at the .05 level and one that is significant at the .06 level. I’m not saying they’re the same, but shouldn’t we be more concerned about the confidence interval of the coefficient?
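Point 3 can be illustrated with plain Bayes’ rule. This is only a sketch under loud assumptions: it conditions on “the result was significant at alpha” rather than on the exact observed p-value that Hubbard and Bayarri discuss, and the prior share of true nulls and the power are numbers invented for illustration, not estimates from any literature.

```python
def prob_null_given_significant(prior_null, alpha, power):
    """P(null is true | test rejected at level alpha), by Bayes' rule.

    Hypothetical helper. prior_null is the assumed share of tested hypotheses
    for which the null really holds; power is the assumed chance of rejecting
    when the null is false.
    """
    false_positive = alpha * prior_null
    true_positive = power * (1 - prior_null)
    return false_positive / (false_positive + true_positive)


# Illustrative scenarios only: (share of true nulls, power).
scenarios = [(0.5, 0.8), (0.5, 0.5), (0.9, 0.3)]
for prior_null, power in scenarios:
    posterior = prob_null_given_significant(prior_null, alpha=0.05, power=power)
    print(
        f"prior P(null)={prior_null:.1f}, power={power:.1f} "
        f"-> P(null | significant)={posterior:.3f}"
    )
```

In every scenario the probability that the null is true after a significant result depends on the prior and the power, so it cannot simply be read off as .05; in the low-power scenarios it is well above .05.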
Drek is right that the use of confidence intervals is based on the same sampling logic, but I think the point of statisticians who advise using confidence intervals is that they give you more information than p-values. Another alternative is to use Bayesian methods, which is what Gelman supports (I think). Researchers could calculate conditional error probabilities, which you could actually use to ascertain the probability that you falsely rejected the null hypothesis.
brayden
June 19, 2007 at 2:41 pm
I suggest a skim of Gadamer’s Truth and Method, but then I go to the university-in-exile.
Sam
June 19, 2007 at 4:39 pm