what programming tools should computational sociologists know?

Right now, very few sociology programs teach what you need to know to participate in the current revolution that is data science. In many cases, people self-educate. They pick up some R in a stats class, then Python and so forth. But let’s be systematic about this. If you were designing a curriculum for a graduate or upper division computational sociology core, what would it look like?

Let’s get the discussion rolling and we’ll pick up after Christmas. Until then, happy holidays.

Written by fabiorojas

December 23, 2014 at 12:23 am

Posted in academia, fabio, sociology

10 Responses

  1. I would argue that the question is not only what “programming tools” computational sociologists should use, but also “what should sociologists know about using computational tools?” Speaking from experience, I have found it relatively straightforward to understand the big picture of what tools like NLP packages do and to write code (in R and Python) to compute variables, and much harder to integrate these new tools with existing theory and methodologies. Surprisingly, I have found that the humanities (English, history) have been giving this issue a lot of thought (e.g., can you use topic models to examine the emergence of a historical phenomenon, or to map out a literary trend?). So I’ve sometimes turned to their blogs for guidance.

    Andreea Gorbatai

    December 23, 2014 at 2:42 am

  2. Reblogged this on SocioTech'nowledge.


    Pedro Calado

    December 23, 2014 at 2:44 am

  3. I tend to be asked this question quite a bit in some form or another. My first reaction is to try and lecture people on the importance of knowing how to code in general, and of not feeling like any language is a limitation. For instance, I forgot all the Stata and Visual Basic I ever knew, but that does not mean I would feel too scared taking on a project using these languages, though I might curse a bit.

    Here’s my list:

    0. The UNIX / Linux command-line
    I think this is often neglected, and it’s really here that I see a lot of people get stuck. Of course, basic text utils like head, tail, cat, cut, paste, tr are worth mentioning here. The idea of piping needs to be explained properly. Also, some basic operations in bash scripts: loops, command-line args, etc. This is likewise the place where it’s worth introducing sed, awk, and even a bit of command-line Python. Web stuff should also go in here: wget, curl. Finally, Linux stuff: package managers (apt-get / yum), dealing with errors, reading the Linux forums, knowing what to look for. Also, this is where the students get to choose sides in the vim-vs-emacs debate (vim rules!).

    1. R
    I think R is really the foundational tool. It’s a freewheeling mess of a language, but for researchers that can also be a strength. You can do a lot of stuff really fast in R. Need a choropleth? You can do that in something like 5 lines of code. An HMM? There’s a package for that. Need to mine Twitter? There’s a package for that too. I am always in awe of just how many packages exist for R.

    2. R extensions
    I think it’s also worth emphasizing that R has a bunch of packages which more or less have their own DSLs (domain-specific languages). data.table comes to mind (for manipulating very large datasets), as does plyr (still alien to me). Also, ggplot with its functional approach to plotting, and the geo packages (sp, maps, maptools, rgdal, etc.). And I couldn’t forget igraph here either. You don’t really need to know any of these packages to use the core functionality of R, but they are definitely very handy tools to have.

    3. Python
    I prefer Python for two things: string manipulation and streaming computation. It’s way superior to R for those tasks, and it’s a must for pretty much anyone in the social sciences. Seriously, even if you never touch statistics, a little bit of Python will probably make your life a LOT better.
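
    A minimal sketch of what string manipulation plus streaming computation can look like in Python. The hashtag-counting task, the `count_hashtags` name, and the sample lines are illustrative assumptions, not something taken from the comment above:

```python
import re
from collections import Counter

# Assumed pattern for illustration: a '#' followed by word characters.
HASHTAG = re.compile(r"#(\w+)")


def count_hashtags(lines):
    """Count hashtags case-insensitively, reading one line at a time.

    `lines` can be any iterable of strings (a list, an open file, sys.stdin),
    so the whole dataset never has to sit in memory at once.
    """
    counts = Counter()
    for line in lines:
        counts.update(tag.lower() for tag in HASHTAG.findall(line))
    return counts


if __name__ == "__main__":
    sample = ["Vote today #Election #vote", "#election results"]  # made-up data
    for tag, total in count_hashtags(sample).most_common():
        print(tag, total)
```

    Because the function only iterates over its input, the same code would work unchanged on an open file handle with millions of lines.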

    4. Basic MapReduce / Hadoop
    This is not a programming language, but I feel like its idiom requires its own section. To me, coupling it with Python and the UNIX command line makes a lot of sense, since you can replicate Streaming Hadoop with pipes (e.g. cat input | python | sort -k1,1 | python > output ).
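
    The map, sort, reduce idiom behind that pipe can be sketched in a single Python file. This is only an illustration under stated assumptions: the word-count task, the function names, and the sample lines are hypothetical, and the in-memory `sorted` call merely stands in for `sort -k1,1`:

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Map step: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def reducer(sorted_pairs):
    """Reduce step: sum the counts of consecutive pairs that share a key.

    Like a streaming reducer, this relies on its input being sorted by key.
    """
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def word_count(lines):
    """Chain map -> sort -> reduce, mirroring `mapper | sort | reducer`."""
    mapped = mapper(lines)
    shuffled = sorted(mapped, key=itemgetter(0))  # stands in for `sort -k1,1`
    return list(reducer(shuffled))


if __name__ == "__main__":
    sample = ["the cat sat", "the dog sat"]  # made-up input lines
    for word, total in word_count(sample):
        print(f"{word}\t{total}")
```

    In a real streaming setup the mapper and reducer would be separate scripts reading standard input and writing tab-separated lines, with the sort happening between them.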

    5. SQL
    Strange to have Hadoop before SQL, I know, but I think it makes more sense (for people who will deal with large datasets) to piggy-back on the idea of pipes and Python streaming to emphasize the streaming nature of computation in these cases. SQL is a lot more declarative as an idiom, and I think it makes sense to introduce it later in a sequence.
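
    For contrast with the streaming idiom, here is a minimal sketch of the SQL one, using Python’s built-in `sqlite3` module. The `respondents` table, its columns, and the sample rows are assumptions made up for the example; the point is that the query states which result is wanted and leaves the iteration to the database engine:

```python
import sqlite3


def mean_income_by_region(rows):
    """Load (region, income) rows into an in-memory table and aggregate them.

    The table and column names are hypothetical, chosen for illustration.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE respondents (region TEXT, income REAL)")
        conn.executemany("INSERT INTO respondents VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT region, AVG(income) "
            "FROM respondents "
            "GROUP BY region "
            "ORDER BY region"
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    sample = [("north", 30000.0), ("south", 20000.0), ("north", 50000.0)]
    for region, mean_income in mean_income_by_region(sample):
        print(region, mean_income)
```

    There is no explicit loop over respondents here, which is the main difference from the mapper and reducer style of thinking.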

    6. Java
    I think students also need exposure to an object-oriented language and everything that comes with it. Of course, you could teach the idea of classes and inheritance with R (shudder) or with Python, but there’s hardly any other language where this idea is easier to grasp than in Java. Also, Java opens up a lot of other programming languages (e.g. at Stanford there is a natural transition from Java to C++ in the CS intro sequence; similarly, writing Hack, Facebook’s heavily-restructured version of PHP, oftentimes feels like writing Java).
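
    For readers who have not met classes and inheritance yet, here is a minimal sketch of the idea, written in Python only to stay consistent with the rest of the list. The `Respondent` and `PanelRespondent` classes are hypothetical examples, not part of any library:

```python
class Respondent:
    """Base class: any person who answers a survey."""

    def __init__(self, name):
        self.name = name

    def describe(self):
        return f"{self.name} is a respondent"


class PanelRespondent(Respondent):
    """Subclass: a respondent who is interviewed over several waves."""

    def __init__(self, name, waves):
        super().__init__(name)  # reuse the parent's initialisation
        self.waves = waves

    def describe(self):
        # Extend, rather than replace, the inherited behaviour.
        return f"{super().describe()} followed for {self.waves} waves"


if __name__ == "__main__":
    for person in (Respondent("Ana"), PanelRespondent("Ben", 3)):
        print(person.describe())
```

    The subclass inherits everything it does not override, which is the part of the object-oriented idiom that Java makes explicit through its class declarations.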

    I think there are also a few slightly more marginal languages and frameworks that are nonetheless useful to know: C++ (for huge projects where speed matters a lot), Matlab (for people who will be writing their own estimators), C (hard to think of a directly practical use for social scientists, but learning about it will be a deep excursion into the inner workings of a computer, or into about the seventh circle of hell, depending mostly on how you frame it), Scala (seems like a cool language), Spark (a very promising in-memory computation framework), Giraph (graph computation), Mahout (very large matrices).

    As someone who is self-taught, I would emphasize the need for infrastructure and support rather than having “the programming class” as part of a Sociology program. Even UNIX is something that’s hard to pick up if all your departmental computers run Windows. A good department should also have a Hadoop cluster for students to muck around on. Also, give students freedom to take classes in Computer Science, Statistics, even Electrical Engineering.

    Finally, I think the learning process would be a lot more fun if it were driven by real Sociological questions rather than by the desire to burnish one’s data science skills. We’re in this because we care about people and all these tools are very handy means to an end.

    Bogdan State

    December 23, 2014 at 7:46 am

  4. I agree with the above, but would perhaps replace straight Java with a functional language like Clojure, which might be better for teaching parallelism. Ideally, you’d leave R until later, as it’s pretty terrible as a programming language and students might pick up bad habits if they learn it first.

    Python is definitely a must as the mainstay of OO scientific computing.

    I’d also make a case for PowerShell for die-hard Windows users. It’s more powerful than the UNIX command set: CSV, XML and JSON can be treated directly as hashes, and it has the LINQ query syntax too, which makes for very simple data manipulation. It wouldn’t be difficult to use SQL in the same way as well.


    Naadir Jeewa

    December 23, 2014 at 8:57 am

  5. Being a wildlife ecologist focused on conservation biology, I cannot overemphasize the significance of the free and open source software (FOSS) under the broader umbrella of GNU (GNU’s Not Unix) and the GPL (General Public License). If you are a serious conservation ecologist (sociology is an important element of conservation science), you need to master fundamental mathematical skills: quadratic, logarithmic, exponential, and trigonometric functions, systems of linear equations, matrices, polynomial and rational functions, complex numbers, analytic geometry, linear algebra, calculus (both differential and integral, in a 100-level course), and mathematical modeling. You also need a solid grounding in a 100-level statistics course that lays the foundation for analysis of variance (ANOVA), the chi-square test, regression analysis, logistic regression, Bayes’ theorem, Student’s t-test, and probability density functions (pdf, which require knowledge of calculus). A serious student will therefore probably depart from Windows-based software and venture into GNU GPL, Linux-based software in order to stay at the forefront of contemporary technology. I always advise my students to start with one of several Linux distributions, notably Ubuntu, Linux Mint, OpenMandriva (you will need 4 GiB of RAM and possibly a 64-bit machine), openSUSE, PCLinuxOS, and Bodhi Linux. These are, generally speaking, relatively easy to install and use, and students may choose to learn how to navigate the command-line interface (CLI) alongside the commonly favored GUI. However, the emphasis on the CLI can be relaxed if students have other priorities, such as gaining the mathematical skills mentioned above.

    Therefore, in order to master these math skills, a student can start with the GNU GPL mathematical software that is packaged with all the Linux distros I have mentioned above. The easy way to climb the learning curve would be to start with LaTeX, a typesetting markup environment that enables students to write all their mathematical formulas and equations without relying on word processors. Linux most often comes with various LaTeX-based software, but I personally advise students to install TeX and LyX. Once students learn how to use either of these, they are unlikely to want to return to word processors, considering that LaTeX (TeX or LyX) is far more powerful than word processors when it comes to mathematical and scientific writing and publishing. Also, once students master these skills, they can integrate LaTeX into WordPress very easily and so write mathematically literate blog articles with ease.

    After LaTeX, students may prefer to learn how to carry out mathematical operations using the Python programming language. Linux comes with Python, and my suggestion to students would be to install IDLE (Python’s Integrated Development and Learning Environment), a solid Python programming environment. IDLE is easy to operate, and it is also packaged with the Linux distros I mentioned above. I would suggest that students use Python 2.7 and not 3.0. Once students know how to use Python, they will probably never need the commercial mathematical software that Windows pushes students to purchase, for example MATLAB and Maple. Python will do all that work and more, in a better and more powerful manner than MATLAB or Maple. Students can also choose R Commander, a GUI for R, which is a lot easier than R itself, and once again it is available on Linux. However, students will need to install R to access R Commander. I personally use R Commander, and there are some excellent books students can read to master it. There is not much need to learn the R programming environment itself (R is as much an environment for statistical computing as a programming language, just as LaTeX is a typesetting markup environment). Python and R Commander are more than sufficient to carry out all the work I mentioned above. In a nutshell: start with a solid grounding in mathematics (the main focus should be on functions, statistical and algebraic modeling, and calculus), install Linux, get a grip on LaTeX, learn Python, and learn R Commander. If you do that, you should be more than fine, and you will be able to beat the general Windows-based software users, who will probably remain many years behind you in their academic and technical skills.


    Ecosphere Science

    December 23, 2014 at 10:55 am

  6. Bash, R, Python/Perl/Ruby, LaTeX, NetLogo


    December 23, 2014 at 9:31 pm

  7. Let’s not forget about Git/Mercurial/Subversion, some basic principles of project organisation, and literate programming. My worst experiences have been with co-authors whose code was messy and whose results were irreproducible.

    I would also add that Java is a nice language to learn because it is the de facto lingua franca in agent-based modelling: MASON and RePAST depend on it, and NetLogo modelling can greatly benefit from Java skills.



    December 29, 2014 at 5:34 pm

  8. […] the holiday, we asked – what should computational sociologists know? In this post, I’ll discuss what can be done from the view point of sociology […]


  9. For learning, I would say Python (NetworkX, NLTK, igraph, matplotlib, NumPy, pandas, IPython), NetLogo, LaTeX, Git, and the shell. Parallelization would be the next step, if necessary, after gaining basic knowledge of the former tools.



    January 2, 2015 at 6:08 pm

  10. From my personal experience, I would make the following recommendations:
    1. A statistics language, e.g. R or SAS. R is very powerful, and many of its packages are worth learning thoroughly.
    2. Website design, e.g. HTML5 + CSS, jQuery, PHP
    3. Python. Life is short, thus you need Python…
    4. LaTeX is necessary for writing up.
    5. Other useful languages, e.g. Ruby, JavaScript
    6. With programming languages as a foundation, algorithms courses should be taken.


    Alex Jiang

    January 5, 2015 at 8:11 am
