Archive for the ‘mere empirics’ Category
Very few models in statistics have nice, clean closed-form solutions. Usually, coefficients in models must be estimated by taking an initial guess and improving the estimate step by step (e.g., the Newton-Raphson method). If your estimates stabilize, then you say “Mission accomplished, the coefficient is X!” Sometimes, your statistical software will say “I stop because the model does not converge – the estimates bounce around.”
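For readers who have not seen the "guess and improve" loop written out, here is a minimal Python sketch of Newton-Raphson. It is not what Stata does internally. It assumes a toy one-coefficient Poisson regression, log E[y] = b*x, and the data, the function name, and the tolerance are all made up for illustration.

```python
import math

def newton_raphson_poisson(x, y, start=0.0, tol=1e-8, max_iter=50):
    """Estimate b in log(E[y]) = b * x by Newton-Raphson.

    Returns (estimate, converged_flag, history_of_estimates).
    """
    b = start
    history = [b]
    for _ in range(max_iter):
        # First derivative of the Poisson log-likelihood with respect to b.
        score = sum(xi * (yi - math.exp(b * xi)) for xi, yi in zip(x, y))
        # Second derivative; negative here because at least one x is nonzero.
        hessian = -sum(xi * xi * math.exp(b * xi) for xi in x)
        step = score / hessian
        b = b - step
        history.append(b)
        # "Converged" only means the last step was smaller than tol.
        if abs(step) < tol:
            return b, True, history
    return b, False, history

# Hypothetical toy data, not from any real study.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1, 2, 4, 7, 15]

b_hat, converged, history = newton_raphson_poisson(x, y)
print(converged, round(b_hat, 4))
print([round(b, 3) for b in history])
```

The history list shows the estimates overshooting and then settling down. Note that the "converged" flag depends entirely on tol and max_iter, which are choices someone made, not facts about the data.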
Normally, people throw up their hands and say “too bad, this model is inconclusive” and they move on. This is wrong. Why? The convergence/non-convergence of a model estimate is the result of completely arbitrary choices. Simple example:
I am estimating the effect of a new drug on the number of days that people live after treatment. Assume that I have nice data from a clean experiment. I will estimate the # of days using a negative binomial regression since I have count data which may/may not be over-dispersed. Stata says “sorry, likelihood function is not concave, model won’t converge.” So I actually ask Stata to show me the likelihood function and it bounces around by about 3% – more than the default tolerance allows. Furthermore, my coefficient estimates bounce around a little. The effect of treatment is about two months +/- a week, depending on the settings.
As you can see, the data clearly supports the hypothesis that the treatment works (i.e., extra days alive >0). All “non-convergence” means is that there might be multiple likelihood function maxima and they are all close in terms of practical significance, or that the ML surface is very “wiggly” around the likely maximum point.
Does that mean you can ignore convergence issues in maximum likelihood estimation? No! Another example:
Same example as above – I am trying to measure the effectiveness of a drug and I get “non-convergence” from Stata. But in this case, I look at the ML estimates and notice they bounce around a lot. Then, I ask Stata to estimate with different sensitivity settings and discover that the coefficients are often near zero and sometimes far from zero.
The evidence here supports the null hypothesis. Same error message, but different substantive conclusions.
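The two cases can be told apart by collecting the coefficient from each run (different starting values, tolerances, or optimizers) and looking at the whole range instead of the error message. Below is a small Python sketch of that bookkeeping. The function name and the numbers are hypothetical, chosen only to mimic the two stories above; they are not output from Stata or from any real data.

```python
def summarize_runs(estimates):
    """Summarize treatment-effect estimates collected across optimizer settings."""
    low, high = min(estimates), max(estimates)
    return {
        "low": low,
        "high": high,
        "spread": high - low,
        # True only if every run points in the same direction.
        "same_sign": low > 0 or high < 0,
    }

# Hypothetical estimates (extra days alive), one per optimizer setting.
case_one = [53.0, 58.0, 60.0, 64.0, 67.0]   # "about two months +/- a week"
case_two = [0.5, -1.2, 45.0, 2.0, 61.0]     # often near zero, sometimes far from it

for name, runs in [("case one", case_one), ("case two", case_two)]:
    print(name, summarize_runs(runs))
```

In the first case every run says the drug adds weeks of life, so the failure to converge is a nuisance. In the second case the runs disagree about whether there is any effect at all, and that disagreement is the finding.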
The lesson is simple. In applied statistics, we get lazy and rely on simple answers: p-values, r-squared, and error messages. What they all have in common is that they are arbitrary rules. To really understand your model, you need to actually look at the full range of information and not just rely on cut-offs. This makes publication harder (referees can’t just look for asterisks in tables) but it’s better thinking.
A recent meta-analysis of studies of critical thinking (e.g., seeing if students can formulate criticisms of arguments) shows that, on average, college education is associated with critical thinking. From “Does College Teach Critical Thinking? A Meta-Analysis” by Christopher Huber and Nathan Kuncel in Review of Educational Research:
This meta-analysis synthesizes research on gains in critical thinking skills and attitudinal dispositions over various time frames in college. The results suggest that both critical thinking skills and dispositions improve substantially over a normal college experience.
Now, my beef with the whole critical thinking stream is that there is a special domain of teaching called “critical thinking.” They looked at studies where students were in special critical thinking instruction:
Although college education may lag in other ways, it is not clear that more time and resources should be invested in teaching domain-general critical thinking.
Students are learning critical-thinking skills, but adding instruction focused on critical thinking specifically doesn’t work. Students in programs that stress critical thinking still saw their critical-thinking skills improve, but the improvements did not surpass those of students in other programs.
Bottom line: Take regular courses on regular topics and pay close attention to how people in specific areas figure out problems. Skip the critical thinking stuff, it’s fluff talk.
The New York Times has run an op-ed by Molly Worthen, a professor of history, who argues against active learning in college classes and wants to retain the lecture format:
Good lecturers communicate the emotional vitality of the intellectual endeavor (“the way she lectured always made you make connections to your own life,” wrote one of Ms. Severson’s students in an online review). But we also must persuade students to value that aspect of a lecture course often regarded as drudgery: note-taking. Note-taking is important partly for the record it creates, but let’s be honest. Students forget most of the facts we teach them not long after the final exam, if not sooner. The real power of good notes lies in how they shape the mind.
“Note-taking should be just as eloquent as speaking,” said Medora Ahern, a recent graduate of New Saint Andrews College in Idaho. I tracked her down after a visit there persuaded me that this tiny Christian college has preserved some of the best features of a traditional liberal arts education. She told me how learning to take attentive, analytical notes helped her succeed in debates with her classmates. “Debate is really all about note-taking, dissecting your opponent’s idea, reducing it into a single sentence. There’s something about the brevity of notes, putting an idea into a smaller space, that allows you psychologically to overcome that idea.”
As we noted on this blog, there is actually a massive amount of research comparing lecturing to other forms of classroom instruction and lectures do very poorly:
To weigh the evidence, Freeman and a group of colleagues analyzed 225 studies of undergraduate STEM teaching methods. The meta-analysis, published online today in the Proceedings of the National Academy of Sciences, concluded that teaching approaches that turned students into active participants rather than passive listeners reduced failure rates and boosted scores on exams by almost one-half a standard deviation. “The change in the failure rates is whopping,” Freeman says. And the exam improvement—about 6%—could, for example, “bump [a student’s] grades from a B– to a B.”
If you’d like your students to master the art of eloquent note taking, continue lecturing. If you’d like them to learn things, adopt active learning.
It is rather obvious that scholars have almost no incentives for replication or verification of others’ work. Even the guys who busted LaCour will get little for their efforts aside from a few pats on the back. But what is less noted is that editors have little incentive to issue corrections or to publish replications and commentaries:
- Editing a journal is a huge workload. Imagine an additional steady stream of replication notes that need to be refereed.
- Replication notes will never get cited like the original, so they drag down your journal’s impact factor.
- Replication studies, except in cases of fraud (e.g., the LaCour case), will rarely change the minds of people after they read the original. For example, the Bendor, Moe, and Shotts APSR replication essentially pointed out that a key point of garbage can theory is wrong, yet the garbage can model still gets piles of cites.
- Processing replication notes creates more conflict that editors need to deal with.
It’s sad that correcting the record and verification receive little reward. It’s a very anti-intellectual situation. Still, I think there are some good alternatives. One possible model is that folks interested in replication can simply archive their replications on arXiv, SSRN, or other venues. Very important replications can be published in venues like PLoS One or Sociological Science as formal articles. Thus, there can be a record of which studies hold water and which don’t without demanding that journals spend time as arbitrators between replicators and the original authors.
Here’s the list (so far):
Some people might want to hand wave the problem away or jump to the conclusion that science is broken. There’s a more intuitive explanation – science is “brittle.” That is, once you get past some basic and important findings, you get to findings that are small in size, require many technical assumptions, or rely on very specific laboratory/data collection conditions.
There should be two responses. First, editors should reject submissions which might depend on “local conditions” or very small results or send them to lower tier journals. Second, other researchers should feel free to try to replicate research. This is appropriate work for early career academics who need to learn how work is done. Of course, people who publish in top journals, or obtain famous results, should expect replication requests.
Science just published a piece showing that only a third of articles from major psychology journals can be replicated. That is, if you reran the experiments, only a third of experiments will have statistically significant results. The details of the studies matter as well. The higher the original p-value, the less likely the result was to replicate, and “flashy” results were also less likely to replicate.
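One way to see why borderline p-values replicate poorly is a back-of-the-envelope calculation. The Python sketch below is my own illustration, not the method used in the Science piece. It assumes a two-sided z-test at the 0.05 level, a replication with the same sample size, and (optimistically) a true effect exactly equal to the original estimate. The function names are made up for the example.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def replication_probability(z_original, z_crit=1.96):
    """Chance that a same-sized replication is significant (two-sided),
    assuming the true effect equals the original estimate."""
    return normal_cdf(z_original - z_crit) + normal_cdf(-z_original - z_crit)

# Approximate z-statistics for familiar two-sided p-values.
for p_label, z in [("p = 0.05", 1.96), ("p = 0.01", 2.576), ("p = 0.001", 3.291)]:
    print(p_label, round(replication_probability(z), 2))
```

Under these assumptions, a result that just clears p = 0.05 has only about a coin flip's chance of clearing it again, and the odds improve as the original p-value shrinks. If the original estimate was inflated, which is plausible for "flashy" results, the real odds are worse.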
Inside Higher Ed spoke to me and other sociologists about the replication issue in our discipline. A major issue is that there is no incentive to actually assess research since it seems to be nearly impossible to publish replications and statistical criticisms in our major journals:
Recent research controversies in sociology also have brought replication concerns to the fore. Andrew Gelman, a professor of statistics and political science at Columbia University, for example, recently published a paper about the difficulty of pointing out possible statistical errors in a study published in the American Sociological Review. A field experiment at Stanford University suggested that only 15 of 53 authors contacted were able or willing to provide a replication package for their research. And the recent controversy over the star sociologist Alice Goffman, now an assistant professor at the University of Wisconsin at Madison, regarding the validity of her research studying youths in inner-city Philadelphia lingers — in part because she said she destroyed some of her research to protect her subjects.
Philip Cohen, a professor of sociology at the University of Maryland, recently wrote a personal blog post similar to Gelman’s, saying how hard it is to publish articles that question other research. (Cohen was trying to respond to Goffman’s work in the American Sociological Review.)
“Goffman included a survey with her ethnographic study, which in theory could have been replicable,” Cohen said via email. “If we could compare her research site to other populations by using her survey data, we could have learned something more about how common the problems and situations she discussed actually are. That would help evaluate the veracity of her research. But the survey was not reported in such a way as to permit a meaningful interpretation or replication. As a result, her research has much less reach or generalizability, because we don’t know how unique her experience was.”
Readers can judge whether Gelman’s or Cohen’s critiques are correct. But the broader issue is serious. Sociology journals simply aren’t publishing error correction or replication, with the honorable exception of Sociological Science which published a replication/critique of the Brooks/Manza (2006) ASR article. For now, debate on the technical merits of particular research seems to be the purview of blog posts and book reviews that are quickly forgotten. That’s not good.