orgtheory.net

should stata just give up and die?

I love Stata. I use Stata. I make my students use Stata. But I’ve got a problem and it’s called R. The problem is that R is also amazing. My friends use it. My students use it and a lot of social science/data science is now R or Python.

Why is Stata cool? Simple – it is built for stats. Regression is simply reg y x. Weights, clustering, and using subsamples are easy. The manuals and tutorials are cool. It is also way, way stable and there are great libraries that archive algorithms and commands. There is a Stata journal showing you how to implement the latest models.

But while Stata is amazing, R has two major advantages over it: R is free, and R is consistent with the broader computer science world (i.e., once you know how to program in general, it is easy to get R and Python, while Stata has its own logic).

What should I do? The answer is pretty simple. Learn some R. But the deeper question is what should we do with Stata? Should we continue to use Stata in teaching? Why not dump Stata entirely and make all social science students learn R? What reason do social scientists have for continuing to use Stata, or SAS, or SPSS? Why not just make statistical education and computer science education come together? Why not just say, “look if you want to get a degree in economics, or sociology, or political science, or any field where statistics is common, you will just have to suck it up and learn a little about computer code?” It would be cheaper and prepare students better for a world where statistical programming is now subsumed into basic computer programming. The world of SAS specialists, or COBOL or FORTRAN specialists, is coming to an end. Maybe we should admit it and move on.

As every Marine is a rifleman, every social scientist is an R coder.

Written by fabiorojas

September 8, 2017 at 4:01 am

28 Responses

  1. My concern with R is precisely that it is free and therefore has no company with a vested interest in providing customer support. It’s partly the same reason why I, and many corporations I think, favor Microsoft Windows over cheaper Linux systems. Linux is great – until you have a problem running a basic command. So too is R great, until you spend hours trying to get a basic logit to run. I’m all in favor of teaching people to code, but I don’t think R is the right product to do that. I would be in favor of Stata’s company ending its current program and releasing a revamped program. A ‘stata 2’ that is closer to other programming languages, but has reliable customer support.

    Michelangelo Landgrave

    September 8, 2017 at 7:23 am

  2. Michelangelo:

    If it takes you hours to get a basic logit to run in R, I suggest you find some good tutorial material. For example my book with Jennifer Hill explains how to do it, and our book costs a lot less than it would cost to buy Stata!

    Andrew Gelman

    September 8, 2017 at 11:01 am

  3. I remember a time long ago learning SAS on Unix. It turned a number of people off stats entirely. My problem with R for myself is that I’m too lazy to update. For my students it’s the danger of turning people off stats if they can’t figure out how to code.

    Kathy

    September 8, 2017 at 11:08 am

  4. I’m a Stata user who learned a little bit of R to do some modeling that at the time was not possible in Stata. I needed a specific kind of model, I didn’t have grant money for the specialized software to do it, and I found that there was an R package for it. So I downloaded the program and tried to follow the instructions in the paper describing the package.

    I found it a horrific experience. I’m not an experienced coder and the logic of object-based programming was not easy for me to grasp. But much worse, when I asked for help with R, experienced users responded as if I was a 10 year old who wasn’t out of diapers yet. People laughed at my code, scorned me, and told me that I was a fool for trying to use the program for a specific purpose before developing general programming expertise. I wanted to use R to run a couple of models for one paper. They made it seem like I shouldn’t even bother unless I was willing to invest months into learning that mode of programming.

    There may be some validity to the idea that one shouldn’t try to use R without really understanding how it works, but in my situation and that of many others, this steep learning curve and corresponding scorn from the knowledge keepers is a big problem. R users, at least the ones I asked for help, both on my campus and online, were condescending and totally unwilling to actually be helpful. I finished that project, went back to Stata, and have no intention of using R again. How is a relatively new AP who learned Stata but has no other coding knowledge supposed to just switch to R when this is how the learning process works?

    Sea Gull

    September 8, 2017 at 1:17 pm

  5. As a former student of computer science and a current PhD student, I have grappled with the STATA vs R vs Python choice. I have used all three for different applications, but I have settled on using Python and Pandas for computational modelling and data manipulation, and STATA for statistical analysis. I found Python (and Pandas) excellent and intuitive when it came to doing tasks involving loops or programming logic in general. STATA seems extremely logical and clean for statistical analysis. To put it in more general terms- STATA is a statistical package for statistical analysis created by statisticians. Python is a programming language written for programming by programmers.
    R seems neither here nor there. Some of its syntax and grammatical structure is a weird mix of programming and statistics. It comes across as a programming language created by statisticians- with some object oriented concepts thrown in. Given that Python is becoming such a widely accepted programming language to carry out a wide variety of tasks- models, data scraping, machine learning- and is easy to learn, I think it is R that should be dealing with existential crises (albeit it is some way off into the future).

    HKet

    September 8, 2017 at 1:37 pm

  6. I’m a regular Stata user and intermittent R user.
    I know there’s an idea out there that everyone needs to be a second-degree black belt in computer science to run a logit. But there’s nothing wrong with something that is easy and accessible if it works correctly and has an internally consistent scripting language.
    I’m much more concerned with basic statistical literacy and research logic than what program people use. While using R might provide some marginal advantage to sociology undergraduate and graduate students, when it comes to statistics education I’d prefer a different hill to die on.

    Jonathan

    September 8, 2017 at 1:43 pm

  7. I’m a big proponent of R! But most people in my phd program are trained in stata. Most folks do applied econometrics–lots of fixed effects regressions (etc.) with robust or clustered standard errors (etc.). I’ve tried a bit to convince others to convert to R, but it is tough to get people on board when running a regression with robust SEs in stata is much easier than in R. The benefit of R, at least in this regard, is that you need to engage in more deliberate decision-making regarding your model. You need to run the regression and then back in the robust SEs through the sandwich package. It requires a bit more intentionality. It’s just hard to get people to make the switch (and face the learning curve) when stata satisfies their needs just fine. Which I think for many social scientists doing applied statistics, it does. And there is enough griping about learning stata in my department! R might take that to a whole new level.

    I got into R because I do network analysis, and it’s really the only game in town. Without that, I might still be in stata. I’m really glad I’ve learned R and I do everything in it now. So, in general, I agree, I think you get more value from learning R. Using R has helped supplement my statistics training.

    Also, as a sidenote, it is pretty easy to move back and forth between the two programs, so I can see value in doing, for example, data manipulation and visualization in R and analysis in stata.

    Richard

    September 8, 2017 at 2:47 pm

  8. I haven’t read Gelman’s book but in my experience R tutorials make you spend forever learning fundamentals like vector vs frame vs list before they teach you basic stuff for a realistic workflow. For instance the O’Reilly book Learning R doesn’t illustrate the read.table() function until Chapter 12, which is absurd. I don’t doubt that all the stuff in chapters 1-11 is really important to know eventually, but it creates a sense of futility that you can’t really do anything useful until you learn the whole thing. In contrast, the “Getting Started With Stata for [your operating system here]” PDF/manual that comes with every copy of Stata loads data on the very first page and then walks you through a very typical statistical workflow. There is no good reason that R documentation can’t start with “Your first R session” and have the first line be “my.df <- read.table("mydata.csv")” instead of “my.df <- data.frame(c=(1:4),c(1:4))” or “2+2”, but for some reason R has a culture that treats loading data from disk as some kind of arcane knowledge not suitable for neophytes.

    gabrielrossman

    September 8, 2017 at 4:36 pm

  9. I especially think that the last argument is important for the students themselves; I guess that quant social scientists who can code make fantastic job market candidates outside of academia, particularly because they are able to manipulate as well as interpret their data in a meaningful way.

    bashofstra

    September 8, 2017 at 5:45 pm

  10. So, there’s lots of arguments for R, and I suspect it’s more a matter of when rather than if I will switch to it in my teaching. But, I don’t agree with this account of the differences. For one, I don’t think it’s the case that, for someone who knows how to program “in general,” it is easier to pick up R than Stata. Stata has (by far) a more consistent syntax; it is closer to natural language syntax; and, if one makes some effort at literate programming, somebody new to the language has much more of a fighting chance at being able to read Stata code than R code.

    The other big advantage of Stata is the efficiency for doing ordinary social science. Say you were to provide a random quant sociology article that uses ordinary methods of analysis and the raw (uncleaned, perhaps unmerged) data used. I’m very confident that I could reproduce all the results in the paper faster using Stata than any sociology R user could using R. (Which again, isn’t to say there aren’t some big advantages to R, but the stickiness of Stata is not just path dependence.)

    jeremy

    September 8, 2017 at 5:45 pm

  11. I am glad I am not the only one who, like Sea Gull, finds the R community not particularly helpful. When I am stuck and I google for help, I often run into queries posted to R with a similar issue that I have, but find the responses are grumbling about etiquette, or telling the user their basic approach is wrong. Not helpful. Don’t get me wrong, those searches have often saved me, but I am guessing that the ratio of helpful Google links to unhelpful Google links is higher for my Stata queries than my R queries.

    Josh Klugman

    September 8, 2017 at 7:18 pm

  12. In the mid-1990s, I switched to Stata from SPSS (which I had used since 1969, originally working from the mimeographed version!) after my SPSS install committed suicide 2 months into my one year license and I vowed never to do business with them again. The hardest part about switching to Stata was getting used to the ` and ‘ marks for macros (and telling them apart). But the on-line help coupled with coaching from a couple of savvy grad students made the transition not too painful. Although I’ll admit I’ve never successfully written a program (as opposed to a .do file) and I’m just barely functional in Mata.

    Another good thing about Stata is its user community and modularity. Nearly every time I’ve tried to figure out how to do something in Stata, if I Google Stata + keywords for what I want to do, I have found either a package to do it or someone else who has written up the Stata language for the thing. Over the years, I’ve figured out how to make Stata do things it wasn’t really meant to do. I can, for example, use Stata to write Stata code, and use Stata to parse lines of text. Both of these are jobs that would be much better and more easily done in Python, but first I’d have to get far enough along in Python to figure out which package to load and look again at the introductory tutorial to remind myself of the basic syntax.

    I’ve bought licenses for Stata and the ~$100 gradplan price has always seemed quite reasonable and affordable, given that I have a job. Now my university provides it for “free” on a campus license.

    I’ve been advising younger people to learn R because it is open source and free and they may end up on a campus with a poor computer facility, but the point about the lack of good support is worth considering.

    olderwoman

    September 8, 2017 at 7:29 pm

  13. @Gelman The case of running a logit is of course an example. I agree that reading a good book for R (I’ve heard praises about yours) is a must. However, my experience is that people who aren’t inclined towards coding have an awful time learning R. Stata coding, like any language, takes time as well, but its learning curve is much gentler.

    Michelangelo Landgrave

    September 9, 2017 at 3:08 am

  14. I just wonder whether someone can create a library that restructures the command functions for R so that it uses Stata’s function and option names, as well as loading the required packages within this wrapper package. For example, if we want robust standard errors in linear regression, we wouldn’t need to use lm and load the sandwich package ourselves; we would just load this wrapper package and call reg(y~x, vce="robust").

    TW

    September 9, 2017 at 4:39 am
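
The two-step workflow from comment 7 (fit the regression, then compute robust standard errors) and the one-call wrapper imagined in comment 14 can be sketched in a few lines. This is only an illustration under stated assumptions, not an existing package: the function name reg and its vce argument are borrowed from the hypothetical call above, the data are made up, and the sketch handles a single predictor in plain Python (a language several commenters mention) rather than in R. It applies the HC1 small-sample correction, which is the one Stata's robust option uses for linear regression.

```python
import math


def reg(y, x, vce="ols"):
    """Simple linear regression of y on x with an intercept.

    Hypothetical Stata-style wrapper (illustration only):
      vce="ols"    -> classical standard error for the slope
      vce="robust" -> heteroskedasticity-robust (HC1) standard error
    """
    n = len(y)
    if n != len(x) or n < 3:
        raise ValueError("x and y need the same length, at least 3")

    x_bar = sum(x) / n
    y_bar = sum(y) / n
    dx = [xi - x_bar for xi in x]          # centered predictor
    sxx = sum(d * d for d in dx)
    if sxx == 0:
        raise ValueError("x has no variation")

    # Step 1: ordinary least squares point estimates.
    slope = sum(d * (yi - y_bar) for d, yi in zip(dx, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]

    # Step 2: choose the variance estimator for the slope.
    if vce == "ols":
        sigma2 = sum(e * e for e in resid) / (n - 2)
        var_slope = sigma2 / sxx
    elif vce == "robust":
        # Sandwich estimator: "meat" weights squared residuals by
        # squared centered x; n / (n - 2) is the HC1 correction.
        meat = sum((d * e) ** 2 for d, e in zip(dx, resid))
        var_slope = (n / (n - 2)) * meat / sxx ** 2
    else:
        raise ValueError("vce must be 'ols' or 'robust'")

    return {
        "intercept": intercept,
        "slope": slope,
        "se_slope": math.sqrt(var_slope),
    }


if __name__ == "__main__":
    # Made-up data, purely for demonstration.
    x = [1, 2, 3, 4, 5]
    y = [1, 3, 2, 5, 4]
    print(reg(y, x))                  # classical standard error
    print(reg(y, x, vce="robust"))    # HC1 robust standard error
```

The point estimates are identical in both calls; only the standard error changes, which is why the robust version can be offered as an option on a single command rather than as a separate modeling step.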

  15. […] is a partial response to Fabio Rojas recent post on the fate of Stata, a statistics package, given the rise of a free alternative, R. Rojas and […]

  16. What an autodidact thinks is easy to use probably isn’t the best choice for teaching students that may be somewhat resistant to the topic. That said, once past regression I don’t see why R shouldn’t be part of the curriculum – or even before, it isn’t that bad. Well, it sort of is for data cleaning. Anyway, students that refuse to learn how to do any programming will find a way to do that regardless of what package the instructor chooses.

    I rather like using Stata to teach, but there is a scarcity of labs with it at my new place, and there are no cheap licenses for students (other than the Grad plan – which isn’t cheap). So I am much more motivated to push R than I have been in the past.

    Re: Stata’s learning curve: That may be true, but if one learns to program on Stata first they’ll have picked up so many bad habits that moving to something else later will be more painful than if they had done it the other way around.

    micah

    September 9, 2017 at 3:42 pm

  17. I don’t see why you should have to be a programmer to use statistics in your social science research. We would be better off if this wasn’t necessary. They’re different skills/languages.

    Philip N. Cohen

    September 10, 2017 at 9:04 pm

  18. Gabriel Rossman: Sorry that my book has completely put you off learning R. That was absolutely not my intention. And I agree that the chapter order makes it a slow start. If I ever get around to writing a second edition, I’d definitely make it easier to get started faster.

    richierocks

    September 12, 2017 at 3:36 pm

  19. Richie Rocks,
    I was only using yours as an example. Like Mark Twain quitting smoking, I’ve learned R dozens of times. I’ve read/used half a dozen R books/MOOCs and pretty much all of them are like that so it’s not just you who doesn’t fit w my need for instant gratification.
    A 2nd edition that starts off with a “your first R session” walkthrough (especially one that uses real data, preferably using read.table but mtcars or something could work) on the model of the Getting Started with Stata PDF would be a great improvement and I would be eager to try it. Obviously starting off w a complete workflow means basically everything would be treated in a cursory fashion when first introduced, but there’s nothing wrong with walking through a workflow and then backfilling detail.

    gabriel

    gabrielrossman

    September 12, 2017 at 4:15 pm

  20. I am actually switching from Stata to R in my undergraduate statistics course this semester. This was partially prompted by the fact that my current institution does not have a campus-wide Stata license, but was also a switch I had been planning to make for some time.

    It’s true Stata has a somewhat gentler learning curve than R and is a nicer environment for regression analysis and other standard social science methods. That said, I think it’s increasingly important for students to get an introduction to a general purpose programming language like R – it seems like this is, to an increasing extent, a basic form of literacy that everyone should have leaving college. I think it’s also worth considering that R is a much, much more valuable skill on the non-academic job market than Stata. This is obviously a concern for undergraduate students, most of whom won’t become academics, but I think it’s also a concern we need to start taking more seriously for graduate students.

    JPD

    September 14, 2017 at 8:34 pm

  21. I’d like to echo HKet’s comment above. If I need to run quick and dirty analysis on some data, STATA is still the best. If I need to do something that requires doing something “interesting,” modeling that is not available quick and dirty in STATA, R is much better–I found writing dofiles to do something “interesting” in STATA a horrific experience, although it is a lot better in current versions than in the past. But R is not a real programming language and there are weird quirks that you run into if you want to write a real program–for that, you really need some Python knowledge (I will confess that I am still thankful that I’m not a real programmer, though.)

    Henry Kim

    September 18, 2017 at 12:44 am

  22. @Gabriel Rossman,

    I think a significant part of the problem is that a lot of R books/manuals are written by and for programmers. Most of us in social sciences are trained to do stats. The programmer-stats people gulf is huge and relatively little has been done to address this.

    Henry Kim

    September 18, 2017 at 12:47 am

  23. I switched from Stata to R a year ago. @GabrielRossman is right that most tutorials are not friendly for the beginner.

    A notable exception is found here: http://r4ds.had.co.nz/. R for Data Science is the best introduction to R I’ve encountered. It begins right away with data visualization, transformation, and analysis. Then the remaining chapters take a deeper dive into cleaning, programming, and modeling data.

    Daniel

    September 18, 2017 at 3:53 pm

  24. Since I posted my comment last week, Chris Bail told me Kosuke Imai’s Quantitative Social Science has a good topic progression. I checked it out and it deals w reading data into memory on p 21 and generally seems well written. https://books.google.com/books?id=KLwODgAAQBAJ&lpg=PP1&dq=quantitative%20social%20science&pg=PA21#v=onepage&q&f=false

    Daniel,
    Nice recommendation, will check it out too.

    gabrielrossman

    September 18, 2017 at 6:57 pm

  25. I’m very late to the party but I have some thoughts on this. I’m a longtime Stata user (15 years!) and I love it. But I switched to R for my grad teaching this year for a few reasons:

    1) R is the package Kieran Healy, Chris Bail, and Scott Lynch use to teach their later courses, so I want to pass knowledgeable students along to them. Learning advanced stuff is hard enough without simultaneously having to learn what an object is for the first time.

    2) It seems like the balance has tipped where most of the quant social science “cool kids” are now using R. I want my students to be cool kids, so they probably need to know R.

    3) It’s likely much easier to pick up Stata as a second language and first-year students have time (and willingness) to learn hard things.

    I’d like to point out that one can entirely believe that Stata is “better” and still decide to use R. Betamax was probably better but VHS won anyway. Network externalities are an important thing to consider when choosing what to learn (and teach).

    Steve Vaisey

    September 19, 2017 at 8:18 pm

  26. I started out with SPSS. I now think Stata is a lot better than SPSS and it’s easy to run Mplus from within Stata. I realize that cost is an issue but, come on! If you’re willing to pay for quality when it comes to cell phones, cars, and computers, is it really necessary to try to get free statistical software?

    I’ve tried to use R three times, and found that the manuals had mistakes in them, the user community was not necessarily helpful, and that the interface was not user friendly. (Before I became a graduate student, I was in the human-computer interaction world, and I think user-friendliness should be a high priority for everyone.) I could go for a decade without using Stata or SPSS, and still remember how to use them. If I go for a week without using R, I can barely remember how to use it. In any case, I’d be genuinely curious to see empirical work on whether students who learn R can still remember how to use it a year later if they haven’t used it in the interim.

    chrismartin76

    September 20, 2017 at 2:44 pm

  27. Re: externalities, Joseph Cohen, Leslie Hinkson, and Gabriel Rossman talked about this on the Annex podcast a couple of days ago. Cohen raised a good point about how Stata is confined to an academic niche but R has gained broader acceptance outside of academia, and that motivated his Stata-loving colleagues to switch to R for teaching. Something to think about for folks like me who teach in post-grad programs grooming students for applied work.

    Josh Klugman

    September 20, 2017 at 3:12 pm

  28. I was first taught SPSS in my graduate statistics courses, but was encouraged by one of my statistics instructors (who was not an SPSS user but required to teach it per department policy) to learn R, which I did slowly but surely. I ultimately did work that just couldn’t be done in SPSS, which kind of presses the issue and is a helpful motivator. I had no real background in programming short of playing around with very basic web development when I was a pre-teen.

    I think the lack of programming background may have been something of a plus for me coming to R. I have the sense that the real programmers get irked by R for the same kinds of reasons that general programming languages irk me (why is this object out of scope? why do I have to define this thing? why can’t I change this object to a different type later? why won’t Python let me indent my code however I want?). Compared to Python et al., R is far more designed for the sporadic user, at least in my view. Many of the design decisions seemed driven by the goal of a person starting up an interactive session and getting something done as quickly as possible.

    I consider Stata my second favorite statistical software, but it’s a pretty distant second. The syntax is a bit harder for me to understand, especially with how many things are implicit. Documentation is good and point and click is nice when you, like me, don’t want to learn the syntax. Part of me feels like if you could know in advance that you’d probably never need to do anything that Stata can’t do reasonably well, then maybe it isn’t worth the startup costs (of time and effort) to get going in R. But just speaking for myself, I wanted to go the route that left me most flexible for whatever lay ahead.

    The huge plus that R now has that I didn’t know about when I learned it is RMarkdown. Sharing data analysis results using RMarkdown is so much more convenient than with SPSS or Stata. I now do all my “production” analyses for my published works in RMarkdown; the entire analysis lives in the document, which I can turn into an annotated set of analyses to share with non-R-using colleagues. I’ve heard Stata 15 has a similar function, which is great if true.

    Jacob

    September 22, 2017 at 4:18 am

