waiting for big data to change my world

I keep hearing about the coming big data revolution. Data scientists are now using huge data sets, many produced through online interactions and media, that shed light on basic social processes. Big data sets from sources like Twitter, Facebook, or mobile phones give social scientists ways to tap into interactions and cultural output at a scale that has never been seen before in social science. The way we analyze data in sociology and organizational theory is bound to change due to this influx of new data.

Unfortunately, the big data revolution has yet to happen. When I see job candidates or new scholars present their research, they are mostly using the same methods that their predecessors did, although with incremental improvements to study design. I see more field experiments for sure, and scholars seem more attuned to identification issues, but the data sources are fairly similar to what you would have seen in 2003. With a few notable exceptions, big data have yet to change the way we do our work. Why is that?

Last week Fabio had a really interesting post about brain drain in academia. One reason we might see less big data than we’d like is that the skills needed to handle this type of analysis are rare, and much of the talent in this area is finding that research jobs in the for-profit world are more lucrative and rewarding than what they’re being offered in academia. I believe that’s true, especially for the kinds of people who are attracted to data mining techniques. The other problem though, I think, is that social scientists are having a hard time figuring out how to fit big data techniques into the traditional milieu of social science. Sociologists, for example, want studies to be framed in a theoretically compelling way. Organizational theorists would like scholars to use data that map on to the conceptual problems of the field. It’s not always clear in many of the studies that I’ve read and reviewed that big data analyses are doing anything new other than using big data. If big data studies are going to take over the field they need to address pressing theoretical problems.

With that in mind, you should really read a new paper by Chris Bail (forthcoming in Theory and Society) about using big data in cultural sociology. Chris makes the case that cultural sociology, a subfield that is obsessed with understanding the origins and practical uses of meaning, is primed for a big data surge. Cultural sociology has the theoretical questions, and big data research offers the methods.

More data were accumulated in 2002 than all previous years of human history combined. By 2011, the amount of data collected prior to 2002 was being collected every two days. This dramatic growth in data spans nearly every part of our lives from gene sequencing to consumer behavior. While much of these data are binary and quantitative, text-based data is also being accumulated on an unprecedented scale. In an era of social science research plagued by declining survey response rates and concerns about the generalizability of qualitative research, these data hold considerable potential. Yet social scientists – and cultural sociologists in particular – have ignored the promise of so-called ‘big data.’ Instead, cultural sociologists have left this wellspring of information about the arguments, worldviews, or values of hundreds of millions of people from internet sites and other digitized texts to computer scientists who possess the technological expertise to extract and manage such data but lack the theoretical direction to interpret their meaning in situ….[C]ultural sociologists have made very few ventures into the universe of big data. In this article, I argue inattention to big data among cultural sociologists is particularly surprising since it is naturally occurring – unlike survey research or cross-sectional qualitative interviews – and therefore critical to understanding the evolution of meaning structures in situ. That is, many archived texts are the product of conversations between individuals, groups, or organizations instead of responses to questions created by researchers who usually have only post-hoc intuition about the relevant factors in meaning-making – much less how culture evolves in ‘real time’ (note: footnotes and references removed).

Chris goes on to offer suggestions about how cultural sociology might use big data to address big theoretical questions. For example, he believes that scholars studying discursive fields would be wise to use big data methods to evaluate the content of such fields, the relationships between actors and ideas, and the relationships between different fields. Of course, much of the paper is about how to use big data analysis to enhance or replace traditional methods used in cultural sociology.  He discusses how Twitter and Facebook data might supplement newspaper analysis, a fairly common method in cultural and political sociology. Although he doesn’t go into great detail about how you would do it, an implicit argument he makes is that big data analysis might replace some survey methods as ways to explore public opinion.

I continue to think there is enormous potential for using big data in the social sciences. The key for having it accepted more broadly is for data scientists to figure out how to use big data to address important theoretical questions. If you can do that, you’re gold.

Written by brayden king

June 28, 2013 at 8:17 pm

13 Responses


  1. Assuming you have enough statistical power, the biggest problem faced in empirical analysis is that we don’t get to directly observe the data generating process most of the time, so theory-building has to be based on a combination of hand-waving and empirical evidence. “Big data” oftentimes represents “wide data” rather than “deep data”, and in the absence of the latter, for example, a billion data points about Wikipedia edits isn’t much different from (and is sometimes worse than) smaller but deeper datasets about (say) collaboration. So, yes, I agree big data is great to have, but apart from the joy of seeing more stars in regression models I think it offers little for theory-building unless it also contains more observable information about the data-generating process (than say, an ethnography, or a hand-collected newspaper dataset).



    June 28, 2013 at 9:05 pm

  2. This post could be written with any data in mind. Surveys, field notes, experiments – all useless without good questions. Once you’ve been in academia you realize that the quality of research is more often about the idea and the researcher than any specific data.

    In my own work on big data, I have tried to shy away from novelty and focus on questions. Instead, I have come to think of big data as a complement to older models. The “more tweets, more votes” paper, for example, builds on prior political science work and shows that social media is picking something up that other models don’t. The follow up paper then looks at use styles and so forth.

    In the end, there will be a process where we figure out what works. The revolution is coming. The only question is what we’ll learn from it.



    June 28, 2013 at 10:23 pm

  3. I agree with Fabio’s and Brayden’s arguments about the need for a core theoretical orientation and that big data analysis must be question-driven if it’s going to be relevant to sociology. To Brayden’s underlying question, “when is big data going to change my world?” I think we are in a position where the data has come before the question. To borrow from Mel Brooks, we didn’t land on Wikipedia. Wikipedia landed on us. In a very short amount of time, we gained access to a lot of large, unstructured, typically administrative data; and we have to figure out what questions we want to ask and what questions we might be able to answer.

    To Brayden’s re-introduction of Fabio’s point about a brain drain, I’ve actually taken a co-op job at IBM, in part, to work on these kinds of questions. You’ll have to get back to me in six months or a year before I can tell you whether the other side is more attractive, but I’ve seen a lot of good social scientists start looking into the private sector for opportunities. The big impetus seems to be the competitive academic job market; but, in my own job search, I saw many rewarding opportunities out there and had a number of employers tell me I was the kind of person they were looking for. The door is open for computational social scientists on the private side; and I think more and more people are going to go in that direction.

    I think a wave of change is coming and it will really put a spotlight on social science. Right now, I think it’s up to us to help bring about that transition by making something scientifically meaningful out of these data.


    Jason Radford

    June 28, 2013 at 11:59 pm

  4. Shameless plug:


    Prison Rodeo

    June 29, 2013 at 12:19 am

  5. Thank you for this generous mention of my work, Brayden. I certainly share “sd” and Fabio’s concerns about the limits of big data. In fact, I devote nearly half of my paper to such concerns (not only the very important issue of the social context in which social media texts are produced but also the equally formidable logistical and technical issues that inevitably arise when one works at this level of analysis).

    In another paper, I propose a solution to the point about social context in the form of social media research “apps” that not only integrate standard survey research with social media extraction, but also provide incentive for public participation in social science research in an era where our relevance is repeatedly questioned. Still, this new strategy presents additional challenges related to sampling, and specifically respondent-driven sampling, which will not be resolved until we have better baseline measures of the parts of the population that frequent social media sites.

    I also do not quite agree that big data does not enable new lines of theoretical inquiry. To my mind, the greatest strength of big data is that it enables analysis of meso and macro-level processes that were once thought to be beyond the scope of conventional qualitative methods. Indeed, I build upon Ann Swidler’s suggestion that “the greatest unanswered question in cultural sociology is whether and how certain cultural elements anchor and constrain others.” My own interest is in ecological and evolutionary theory at the macro-level, but I also suggest many other applications. But there are also many others who have signaled this as an important issue (e.g. Ghaziani and Baldassarri).

    Also, I should clarify that the paper does not suggest that big data methods should or can supplant the existing options within the cultural sociologist’s toolkit. Rather, I argue that the future of supervised learning techniques, which are currently quite popular among proponents of big data, hinges quite critically upon the integration of inductive, iterative, and middle-range coding strategies pioneered by cultural sociologists. As an example I point to the fine work of John Mohr and Robin Wagner-Pacifici, though Paul DiMaggio, Francesca Polletta, and Gabe Ignatow are also doing exciting work in this area.

    In sum, the paper is cautiously optimistic about the integration of big data and cultural sociology but also insists that the former is no panacea. I have already mentioned several limitations above but the paper also discusses the formidable privacy issues that can arise in rigorous, detailed, big data research. Above all, it is a call for cultural sociologists to begin exploring big data, since we are already far behind other fields such as political science (see Gary King’s recent work on internet crackdowns in China), the digital humanities, and communications scholars such as Daniel Kreiss, Zeynep Tufekci, and Phil Howard. This call might also be extended toward sociology more broadly. Apart from Fabio, Michael Macy, and Matt Salganik, I am not aware of any other sociologists working with big data, though I’m sure/hope there are many more that I have not yet encountered.


    Chris Bail

    June 29, 2013 at 2:43 am

  6. In response to your question of “theoretical inquiry,” I want to paraphrase/distort the question this way: What kinds of theories would be required to make internet data-mining (i.e. Big Data) more central to sociological practice? So that I can respond in the following way:

    An answer that jumps out to me is the internet values theories fully conversational with the sociology of knowledge and the sociology of place, because the internet mostly stores knowledge (content) and geo-spatial (location) data. Such theories would have to see beliefs and locations as facts, as data, because the internet does.

    In practice the internet values methods ordered around analyzing text, both numerically and in context. The internet is code, code is programmed text, and the content produced is basically text. Problem is, most sociologists study variables, not text.

    I agree the skills made valuable by the internet are largely priced out of academia. Academia’s best hope is in the freedom departmental and intellectual life allows, and the prestige it bestows. Or put another way, what techies think about the freedom and prestige of academia could decide whether the skills enter social-science departments.

    The obvious disclaimer to the above is that no understanding of what these data will mean for social structure can be real confident, at least at this moment. Internet text might turn out to be noise, not signal.



    June 29, 2013 at 9:47 pm

  7. The skills necessary for big data are rare in sociology, not in economics and (to a lesser extent) political science. But economists don’t do language because Behaviorism (a position which ignores Speech Acts). A friend of mine actually left his econ grad program because his advisors discouraged him from doing programming work and working with internet corpora — he’s now getting emails from JP Morgan because they want him to predict stock prices with sentiment analysis.

    There are, as I understand it, almost no behavioral models to guide searches through big data. So it’s all correlational. A lot of the early results here are very obvious — more talk is correlated with more voting; more Nazis were correlated with less talk about Jews, etc. But nobody was blown away either when economists first started aggregating macro-data in GDP figures and discovering that for instance agricultural production was a large piece of the GDP pie in the early 20th century. Nevertheless, macroeconomics would never have become a subfield without these data (though we ought to be warned that we will face the same aggregation problems that empirical macroeconomics has always faced).

    Computational linguists can probably help here, and a collaboration with sociologists who can put speech acts into their pragmatic, structural context (as measured by the SES and other meta-data attached to corpora) will be fruitful. Even linguists who do pragmatics, which is the “social context matters” branch of linguistics, seem to migrate toward language-as-philosophical-object, rather than language-as-proxy-for-social-forces. That’s the impression I got on an early reading at least. And I don’t think there are a lot of sociolinguists who do content and corpus linguistics as well; most seem to focus on survey methods and subject-observation.

    It’s worth noting that there are I think fully four Corpus Linguistics journals already long-established, and linguists are relatively hostile to people creeping on their turf. Erez and Jean Michel made linguists and the folks in the humanities very angry by publishing their (extraordinarily preliminary and exploratory) Google Ngram stuff in Science, and relying on Pinker as a guide (though, in their defense, both guys have herculean brain power and are extremely collegial). Even the digital humanities people, from within the English department, are facing quite a bit of hostility.

    I think the movement is pretty rad though, and will be an enormous boon. There is at least some contradiction in claiming that a majority of importantly motivating social forces are macro-social, and then sampling them with descriptively thick micro-social snippets. Corpus and network analyses will round out and help-along, rather than replace, these traditional results.

    Professor Bail: To add to your list, James Evans’ computation lab at Chicago is growing, and he’s pretty vigorously invested in content analysis.


    Graham Peterson

    June 30, 2013 at 1:51 am

  8. While I’ve generally been intrigued by “big data” (and, being a bit of a hardware nut, the kind of computing resources it demands/allows you to claim as a research expense), like Brayden I’m still waiting for the “there” to be there. When I’ve seen job talks heavily reliant on “big data,” their interface with theory has generally fallen into at least one of three categories: (1) empiricist claims about investigating heretofore-unreachable areas of social life; (2) synecdochal claims that by looking at Twitter data, for example, you have a somehow “purified” view of social relations from big data; or (3) causal claims (that whatever “big data” data being analyzed has an important effect on more general cultural, political, or social processes). All fair enough, and sometimes quite interesting. But I don’t see how any of these (except for #2, which, as Chris notes in his article’s conclusion, suffers severe epistemological challenges) actually amount to a theoretical contribution. To transform this into a question: is there a theoretical insight or finding that big data has provided that didn’t exist/wasn’t broadly persuasive before it?

    It seems to me, moreover, that there’s also a social-studies-of-science kind of explanation for the rise of “big data” that works well: big data, taken together (integrating as much data as possible), can predict certain classes of behavior (perhaps voting, perhaps cultural consumption) that business and government interests (i.e., corporate marketing and government agencies that suspiciously need liquid cooled mass-GPU supercomputer farms with dedicated power plants) are quite interested in. Translation: the driving interest for “big data” in the university seems to be a technical one about increasingly precise measurement and prediction, and not really theoretical puzzles that demand its development (though I suppose its momentum could be bent to these purposes.)

    Anyway, I enthusiastically support the pluralistic spirit of Chris’ piece. More data, more analytic techniques, etc. are always better for social science. But for both intellectual and political reasons, I’m much more CAUTIOUSLY optimistic than cautiously OPTIMISTIC.



    June 30, 2013 at 1:43 pm

  9. To borrow from Mel Brooks, we didn’t land on Wikipedia. Wikipedia landed on us.

    And by “Mel Brooks,” we mean “Malcolm X.”



    July 2, 2013 at 5:35 pm

  10. An interesting point (towards the end of the video, from some folks doing more practitioner research in big data contexts): you have to worry about bots and spam. One of the researchers in the audience makes the point that we assume that interactions in contexts like social networks are actually people, but that might not always be the case.

    Depending on the volume of traffic generated by bots, one could be drawing inferences about scripts rather than actual interaction. I know that there’s a real concern about fake reviews on Yelp, Amazon and several other platforms. Similarly, algorithms used by these platforms can significantly influence results – I know this from my own work with a large data set.

    My biggest worry with the buzz around big data is that the efforts to wrangle it may come at the expense of researchers getting to understand their data. Of course this is a problem in “small data” research too… but it’s not as much of a time investment to understand what’s going on.

    And I think that’s actually the sentiment of a bunch of big data researchers out there. When I attended a computational social science at Columbia last year, people were far more excited about creative field experiments on online platforms than the big data analysis. These experiments were in some sense “big” in that they capture a lot of passive data on the participants but were often relatively “small” in sample size – more along the lines of Salganik & Watts’s music lab stuff. And here’s where I agree with Chris that these may be opportunities to really make micro to macro links.

    Personally, I think this approach makes a lot of sense. After all, if there are standard selection/unobserved heterogeneity issues in data, the size and the detail don’t really matter. And my experience has been that practitioners but also academics often equate big data with greater confidence about causal relationships, a belief that I feel is seriously flawed.

    So the bottom line is that I agree with Brayden – the same techniques on big data don’t solve the problems that plague small data. In fact, I’ve argued that they might exacerbate some of them. That said, they seem to hold a lot of promise for new ways of understanding more micro mechanisms in larger (and real) social contexts. Just beware of the bots.




    July 3, 2013 at 6:39 am

  11. Elizabeth, what an incredibly embarrassing mistake on my part. Thank you for your correction.

    I think Jacob points to a wonderfully interesting direction in online and big data research: the ability to perform wide-ranging experimental and interview-based research. Whether it’s deploying different get-out-the-vote messages through Facebook friends or aggregating opinion through a series of one-to-one comparisons, the scale and the technologies for accessing participants are very different from what we’ve had before. Maybe we would benefit most from further basic research on internet-based methodologies, whether running experiments across platforms, content analysis across modes of communication, alternative survey questionnaire formats, new forms of internet-mediated semi-structured interviewing, and so on.


    Jason Radford

    July 3, 2013 at 10:32 am

  12. My guess is that there are more sociologists using big data than you’d think but most of us are not aware of their research because it’s not appearing in sociology journals. It’s showing up instead in traditional science journals. Take, for example, Brian Uzzi’s research on instant messaging and trading:


    brayden king

    July 3, 2013 at 6:45 pm

  13. Could we have a diffusion scholar weigh in about what elements of the big data bundle of methods and data are likely to be imported into sociology?

    It strikes me that Stata 13’s ability to read text files directly into variables likely means that a lot more sociologists will become interested in quantitative text analysis. That said, it doesn’t seem like Stata has any new built-in tools for doing anything with the long text files after you’ve imported them. Most areas of sociology don’t really do experiments, so I would be surprised if we started doing them just because they were easier to do. Many of the CS methods used in big data are designed for prediction and aren’t really that useful when you need to show a reviewer whether a specific parameter’s p value is less than .05, so I don’t think we’ll see many random forest models showing up in ASR anytime soon, but I could be wrong.


    neal caren

    July 4, 2013 at 10:20 pm
