orgtheory.net

Reproducible Science

First, a quick introduction: I am a sociologist from Stanford working in organizational theory at the Tepper B-school at CMU. My interest is primarily in information within social networks (i.e. how people and firms find it, change it, share it, learn from it, etc.). Currently, I am investigating corruption in communication networks.

Given my interest in information and corruption, coupled with the recent data frauds in academia, I have been thinking a lot about reproducible science, particularly for the social sciences. Creating norms or policies that enforce reproducible science may not only be cheap insurance to mitigate academic fraud but also improve our field.

Obviously, I am not the first to see the benefits of making data and methods available. There are quite a few wonderful archives out there that provide the infrastructure, but they are frightfully empty. For example, a quick search on ICPSR gave me a total of 8,369 studies. Harvard’s IQSS also has the wonderful Dataverse, but I don’t know of many scholars who actually use it.

What we have is a good old-fashioned social dilemma. Why should I make all my data and syntax available if you don’t? Here, I wonder if the journals themselves shouldn’t shoulder more of this effort, requiring data and syntax files from authors. (I do recognize that this is easy to suggest, since I am not an editor and never have been.) But I would be in complete favor of a journal requiring that I produce the data & the syntax file as a requirement for publishing my work. Now, I know this won’t deter particularly crafty fraudsters, but it’s just too easy for some researchers to fake findings. And I think the pressure is only going to increase. It’s Merton’s innovation: scholars feel pressure to succeed, but the “legitimate” avenues don’t produce the right p-values, so they commit “creative data analysis.” As researchers, shouldn’t we really look to creating an atmosphere of openness and transparency with our data and methods?

There are other benefits as well. At Stanford, Bill Barnett teaches a wonderful methods course. Through his own connections and personal solicitations, Bill has gathered the data files for many important org theory papers. In order to get the files, he assured the authors that the data would only be used for pedagogical purposes. In his course, Bill asks students to reproduce the findings from these papers and perhaps improve upon the studies with new statistical techniques or methods. It’s a wonderful way to teach graduate students a number of important skills. But couldn’t all graduate students benefit from reproducing important studies?

Okay, that was far preachier than I intended. Maybe for my next post I can report on drunken scholar sightings at AoM.

Written by brandyaven

August 3, 2012 at 12:42 am

11 Responses


  1. The AER, the flagship economics journal, requires all empirical papers with freely accessible data to post data + programs online. There’s no reason the same can’t be the case for every social science journal.

    Stanford’s Bill shouldn’t have had to reassure authors about the data files; any reasonable request for data should be honored by researchers. Ideally, the data should be on the author’s website (excluding proprietary data).


    AG

    August 3, 2012 at 12:58 am

  2. You write:

    I have been thinking a lot about reproducible science, particularly for the social sciences. Creating norms or policies that enforce reproducible science may not only be cheap insurance to mitigate academic fraud but also improve our field. . . . There are quite a few wonderful archives out there that provide the infrastructure, but they are frightfully empty.

    So far, I’m with you. It seems like a good idea to make all research materials available, but not many people do so.

    But then you continue:

    What we have is a good old-fashioned social dilemma. Why should I make all my data and syntax available if you don’t?

    The answer is clear to me: by making your data available, you are making it more likely that others will replicate your results, continue the directions of your research, cite you, etc. Fame and fortune await.

    In that case, why not make data and code publicly accessible? If it’s so good to do, why isn’t everybody doing it? Let’s set aside the cheaters and the insecure people, those scholars who are worried that if someone else gets their hands on their data, they will come to different conclusions. And let’s set aside those researchers who are so clueless that they honestly seem to think that their particular analysis is the last word on the subject.

    What about the rest of us, the vast majority (I assume) of researchers who are doing science to learn about reality and who would be thrilled if others pick up our torch and continue our research directions where we left off? Why don’t we always share our data? I know I don’t, and it’s not because I don’t want other people to take a look at what I’m doing.

    I can think of two reasons we (those of us who would actually like our research to be reproduced) don’t routinely share data and code:

    1. Effort. This for me is the biggie. As Aven notes, there are social benefits to making data and code available, and as I note above, there are direct personal benefits as well. But these benefits are all medium-to-long-term and they pale beside the short-term costs of getting my act together to put the data in a convenient place. In fact, when I do actually organize my data, it’s often motivated by a desire to make my life easier when handling repeated requests.

    2. Rules. The default is for data and code not to be released. Often there are silly IRB rules or commercial restrictions on data. In other cases it seems like too much effort to find out. Again, though, it can be good self-interest to make data available. For example, in our wildly-popular (not yet but eventually, I hope) mrp package in R, we use CCES data, not Annenberg or Pew, for our examples. Why? Because the people at CCES were cool about it. Not only do they release their (old) data for free, they don’t mind us reposting it. Ultimately CCES benefits from this. The freer the data, the easier it will be for people to do analyses, cite CCES, suggest improvements to CCES, etc.

    In short, I agree with most of what you wrote in your post, and I agree that it would be good to change the incentives to increase data sharing. But I don’t think it’s fundamentally a coordination problem, or a prisoner’s dilemma, or a tragedy of the commons, or a first-mover problem, or whatever you want to call it. I think it’s mostly about defaults and laziness (I guess that would be something like “intertemporal preference conflicts” in decision-analytic jargon).

    The big picture

    The concept of a social dilemma or cooperation problem or prisoner’s dilemma is appealing, I think because it is a crisp way of understanding why something that is evidently a good idea is not actually being done. But sometimes there is a more direct explanation. I think we should be careful about reaching for the “social dilemma” model whenever we see a frustrating or mysterious outcome.


    Andrew Gelman

    August 3, 2012 at 12:38 pm

  3. This gets very interesting when the science is from a governmental entity. Since the nonpartisan Congressional Budget Office backed Democratic claims about the Affordable Care Act, there has been a recent call amongst Republican lawmakers for the CBO to reveal how it comes up with its numbers. The CBO has thus far refused to do so, and now there is proposed legislation that would require it to.


    KMD

    August 3, 2012 at 12:40 pm

  4. “But I would be in complete favor of a journal requiring that I produce the data & the syntax file as a requirement for publishing my work.” If you mean after publication, OK. Or I suppose even depositing it with submission. But there is another side to the story. I was asked to review an article based on a computer simulation for a journal that had decided that the simulation code needed to be reviewed, not just the text of the article. This was a total pain: it took an enormous amount of time to download and set up the relevant files, and at the end I was no more able to evaluate the product than before. Checking and debugging code is a huge job, and being asked to do it for free as a journal reviewer is over the top. I know from the editor that the other reviewer was also complaining.


    olderwoman

    August 3, 2012 at 4:41 pm

  5. Great post, Brandy! I tend to agree with Andrew Gelman that the answer is simple (i.e., making one’s data and code publicly available), but I would argue that it’s the execution that is difficult. This comes down to the question: to what extent should a researcher take responsibility for guaranteeing that a paper’s empirical results are easily reproducible? Don’t get me wrong, I don’t mean to say that academics intentionally obfuscate the details of their methods to avoid oversight, but rather that the nature of some analyses is so complicated that, with just the code and data alone, it might be almost impossible for someone else to replicate a given result. For example, the analysis in my papers sometimes spans several coding and data management platforms, including R, Python, and SQL, and familiarity with each would be required to properly replicate basic findings (one could argue, of course, that I probably don’t need to resort to so many programs). Even if the code is well commented, another researcher would likely need a how-to manual to put these various pieces together properly. Ideally, the methods section of a paper should be written clearly and completely enough to facilitate this, but in many cases authors omit details because of space constraints, reviewer comments, etc.

    The main dilemma in all this is the following: is it the researcher’s responsibility to post enough information, including the code and data, such that someone else could replicate an analysis in exactly the same manner in which the researcher conducted it originally? Or should a researcher provide access just to code and data and let others take the responsibility to acquire enough knowledge to replicate the analysis themselves as long as the general methods in a paper are detailed clearly? I’d say the former would take a good deal more work, but at least researchers can have more control over how others go about replicating their work. The latter involves less work on the part of the researcher but risks others misinterpreting details about the methods and code. There are, of course, countless alternatives that fall between these two options.


    Dan Wang

    August 3, 2012 at 4:45 pm

  6. While I do agree that making data and syntax available is difficult, and that there are many issues that would need to be ironed out, such as proprietary data (e.g. financial data) or private data (e.g. Add Health), that’s not reason enough to discard the idea completely.

    @Andrew Gelman,
    I completely take your point that this is not really about small-minded individuals hoarding data but rather harried academics without the time, incentives, etc. to “do the right thing.” And that is precisely why it’s a social dilemma: it’s a public good, like a bridge. If only a few people are willing to take the necessary steps (publish data = pay for the bridge), it will not be enough to create the desired outcome (uniform reproducible science in sociology = a bridge). Generally, a third party is needed to step in and collect funds/taxes from all (i.e. force us all to submit our data and syntax). This is why I suggest that as a community we institute policies to enforce the behaviors we would like everyone to engage in.
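The bridge analogy above maps onto a threshold public-goods game. Here is a minimal sketch; the `payoff` function and all payoff numbers are illustrative assumptions, not anything from the discussion:

```python
# Threshold public-goods game: "publish data" = "pay for the bridge".
# All payoff numbers below are made up for illustration.

def payoff(contributes, n_contributors, threshold=15, cost=1.0, value=3.0):
    """Payoff to one researcher, given their choice and the total turnout."""
    provided = n_contributors >= threshold   # enough contributors -> the "bridge" is built
    return (value if provided else 0.0) - (cost if contributes else 0.0)

# With only a few volunteers, a lone contributor pays the cost for nothing:
lone = payoff(True, 3)          # -1.0: paid, no bridge
free_rider = payoff(False, 3)   #  0.0: paid nothing, no bridge

# A third party (a journal policy) forcing all 20 researchers to contribute
# leaves everyone better off than the no-sharing equilibrium:
enforced = payoff(True, 20)     #  2.0: paid, bridge built
nobody = payoff(False, 0)       #  0.0: no one pays, no bridge
```

Individually, not contributing dominates (0.0 beats -1.0), yet universal enforced contribution (2.0 each) beats the status quo (0.0), which is exactly the structure a journal mandate resolves.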

    @Dan,
    We are in very similar boats here, in that we both use multiple systems, languages, and techniques for our analyses. I would say that the ideal would be the complete recipe and all the ingredients, though I am not going to argue that it will be easy. However, I think the benefits outweigh the costs, given what is at stake: scholars producing fraudulent results and undermining the public’s trust in social science.

    @Olderwoman,
    I am not sure that I would advocate for reviewers running the models as a condition of their decision. There have been a number of discussions here (https://orgtheory.wordpress.com/2012/07/10/recommendations-for-speedy-journal-reviews/) and here (https://orgtheory.wordpress.com/2012/06/27/slow-journal-review-what-do-you-do-when-it-gets-out-of-hand/) about the time reviewers currently take to complete reviews, and I am hesitant to advocate one more task that would delay the review process. That said, as a reviewer, I would want to confirm the analysis, because I would not want to endorse research that I hadn’t fully vetted if given the opportunity.


    brandyaven

    August 4, 2012 at 6:32 pm

  7. This past winter, I enjoyed reading The Same and Not the Same, an apologia for chemistry by Nobel laureate Roald Hoffmann. Chemists do replicate the work of others, because they want to use those new chemicals as tools for their own research. Not everything published proves replicable; some of it falls aside as irreproducible, but the process is practical and bountiful. As Thomas Kuhn pointed out, though, chemists largely share the same paradigm, and social scientists do not. Thus, sociologists (and economists) talk past each other and fail to benefit from the discoveries made by practitioners of different schools of thought.

    As in the discussion above about Mitt Romney’s interpretation of Prof. Diamond’s book, first we question the interpretation, then the facts, then the method, then the theory. All of that makes replication difficult.

    I like the idea of replicating research as a teaching tool, though some finesse would be required as undergraduates typically lack the resources to fully replicate a significant study.


    Michael E. Marotta

    August 7, 2012 at 1:43 pm

  8. For example, a quick search on ICPSR gave me a total of 8,369 studies. Harvard’s IQSS also has the wonderful Dataverse, but I don’t know of many scholars who actually use it.

    As an undergraduate at Eastern Michigan University, I was directed to the Social Science Data Analysis Network of the University of Michigan. I confess that I never used it… neither was it required. Hence, mea culpa for everyone.

    What we have is a good old-fashioned social dilemma. Why should I make all my data and syntax available if you don’t?

    The Prisoner’s Dilemma demonstrated that the moral high ground is profitable… in case 100 years of capitalism had not taught that. (“Caveat emptor” had been the slogan until Alexander Turney Stewart taught us that the customer is always right.)
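The commenter’s claim that cooperation pays can be illustrated with a toy iterated Prisoner’s Dilemma. The payoff table below uses the standard textbook numbers, which are an assumption for illustration, not anything from the comment:

```python
# Iterated Prisoner's Dilemma with the usual textbook payoffs:
# mutual cooperation pays 3 each, mutual defection 1 each,
# a lone defector gets 5, and the exploited cooperator gets 0.
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def play(strategy_a, strategy_b, rounds=10):
    """Total scores for two strategies over repeated rounds."""
    score_a = score_b = 0
    last_a = last_b = "C"            # tit-for-tat opens by cooperating
    for _ in range(rounds):
        move_a, move_b = strategy_a(last_b), strategy_b(last_a)
        gain_a, gain_b = PAYOFF[(move_a, move_b)]
        score_a, score_b = score_a + gain_a, score_b + gain_b
        last_a, last_b = move_a, move_b
    return score_a, score_b

tit_for_tat = lambda opponents_last: opponents_last   # copy the last move seen
always_defect = lambda opponents_last: "D"

cooperators, _ = play(tit_for_tat, tit_for_tat)       # 30 each over 10 rounds
defectors, _ = play(always_defect, always_defect)     # 10 each over 10 rounds
```

Over repeated interactions, mutual cooperation (30 points each) beats mutual defection (10 each), which is the sense in which the moral high ground is profitable, even though defection wins any single round.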

    It’s Merton’s innovation: scholars feel pressure to succeed, but the “legitimate” avenues don’t produce the right p-values, so they commit “creative data analysis.”

    When I tell people that I am a criminologist, I explain that I am not an investigator or a police officer, but a sociologist who studies crime. That said, the sociology of crime being as it may, crimes are committed by individuals who make bad choices. No other explanation is complete and consistent. Granted that some social conditions such as poverty and ignorance are conducive to crime; and granted that youngsters who age out of crime come into it via differential learning, differential association, differential rewards, social strain, general strain, anomie, etc., etc., the bottom line is that white collar crimes in general and academic crime in particular fit none of the social models. The criminals in the university (and corporation) are educated, intelligent, rewarded, disciplined, and planfully competent. I have delivered presentations on this to high school and junior high science classes. Kids who cheat now, cheat later as adults because they continue to make bad choices despite their educations, incomes, and social status. How we deal with these criminals is another matter. I believe that the HHS Office of Research Integrity has the best workable model, so far.

    But couldn’t all graduate students benefit from reproducing important studies?

    Indeed!


    Michael Marotta

    August 7, 2012 at 5:34 pm

  9. Michael Bishop

    August 9, 2012 at 11:54 pm

  10. Well, this is an interesting question. I have a quantitative paper forthcoming in SMJ, and I suppose I should make the data available. There are two issues:
    1. Copyright: The dataset is partially from SDC, so I am not 100% sure I can release it. Of course, almost all of the variables are calculated from various sources of data, and there might be very little that comes directly from SDC (company names, dates of alliance deals).
    2. Insecurity: I think the dataset is fine, I ran lots of alternative analyses, and I believe in my results. But given how few people publish their datasets, if I had included a footnote with a URL to download the data, I would have been drawing quite significant attention from exactly the kind of people who like to look for flaws in analyses. I don’t see myself as a quant researcher (this is an older paper from my quant dissertation), and as a mostly qualitative researcher I can surely do without a possible reputation for incompetence in advanced stats. Also, if there were something wrong with the analysis (even if that seems highly unlikely), it would be unfair to risk the reputation of my co-authors.

    Of course, insecurity shouldn’t be a reason for withholding datasets. Science progresses through constructive criticism and all that, but as an individual it takes quite a bit of self-confidence to open yourself to potential attacks (however unlikely) for no particular private gain.


    henri

    August 10, 2012 at 8:51 am

  11. […] Coming up on orgtheory: in September, a review of Gabriel Rossman’s book on the radio industry and in October we’ll do our book forum on Andreas Glaeser’s Political Epistemics book. And don’t forget guest posts by Jenn Lena, Katherine Chen, and Brandy Aven. […]


