big data: a definition

People often complain, justifiably, that “big data” is a catchy phrase, not a real concept. And yes, it certainly is hot, but that doesn’t mean you can’t come up with a useful definition that can guide research. Here is my definition: big data is data that has the following properties:

  • Size: The data is “large” compared to the data normally used in social science. Surveys typically cover only a few thousand people. The World Values Survey, probably the largest conventional data set used by social scientists, has about two hundred thousand people in it. “Big data” starts in the millions of observations.
  • Source: The data is generated through the use of the Internet – email, social media, web sites, etc.
  • Natural: It is generated through routine daily activity (e.g., email or Facebook likes). It is not, primarily, created in the artificial environment of a survey or an experiment.

In other words, the data is bigger than normal social science data; it is “native” to the Internet; and it is not mainly concocted by the researcher. This is a definition meant for social scientists: it is useful because it marks a fairly intuitive boundary between big data and older data types like surveys. It also identifies the need for a skill set that combines social science research tools and computer science techniques.



Written by fabiorojas

February 25, 2014 at 12:32 am

15 Responses


  1. In our computational social science group at UMass, there is an additional contrast with conventional data: big data is often “relational,” and so there is a role for relational theory and methods (network or field theory). While the internet is certainly a source of such “big data,” other machine-readable data can be big in these senses. For example, social scientists are now using administrative data on whole countries over time. These data allow us to embed people in workplaces or neighborhoods. Administrative data, while less cool than internet data, often have better sampling characteristics (i.e., no sampling at all). The US has bad administrative data, but other places – the Nordic countries, Germany, France, Portugal, Slovenia, the Czech Republic, New Zealand, Korea – have some pretty good big, non-internet data.


    Don Tomaskovic-Devey

    February 25, 2014 at 2:19 am

  2. Also, though less obviously useful to social scientists, a lot of genomic data would probably qualify as “big” data as well.



    February 25, 2014 at 2:39 am

  3. My frustration with “big data” is the flurry of people who talk about how we need radically different methods to tackle it. But there’s not really _that_ much difference between ‘big data’ and other data, except maybe computational time. It seems like a big deal is being made of it mostly by people who never interacted with data in the first place (and probably still won’t).



    February 25, 2014 at 4:14 am

  4. ^ what he said. the worrying thing is that projects based on “big data” often have the same (or magnified) problems with endogeneity and selection on the DV, so apart from “bigness” there isn’t as much empirical value in many such datasets



    February 25, 2014 at 7:16 am

  5. I am hesitant to get involved in this discussion, which has been had a million times, but here we go. One common way that big data has been defined is not only by its size, but by the “three Vs”: volume, variety, and velocity. I won’t take up the debate as to whether there is “empirical value” or if different methods are required.



    February 25, 2014 at 1:49 pm

  6. The size criterion does very little work in the definition, even if it is the foundation for the “big data” label. Twenty years ago, I wrote a paper using a data set of 250 million cases drawn from pooled censuses. That doesn’t make me a “big data” scientist, just a boring old demographer with a lot of CPU.

    The “big data” discussions always remind me of a bunch of schoolboys comparing the size of their wankers. “Oh yeah, but my (nonrandom and biased) data set has 18 TRILLION cases…” Not surprising, I guess, given “data science” seems to be the academic version of Silicon Valley.



    February 25, 2014 at 3:04 pm

  7. I don’t buy the argument that big data don’t require different methods/skills. They certainly require a different set of data management skills. This is one reason that social scientists working with big data often partner with computer scientists. Big data also often require a different kind of computational power, i.e., high performance computing, to which not every researcher has access. When dealing with terabytes of data, rather than good old megabytes, you simply can’t run data analysis off your desktop if you want to get it done in a reasonable amount of time.

    I also think there is a slightly different valuation of research among data scientists. Empirical description is just as valued as causal arguments, perhaps even more so. If you can capture a population of observations, then it’s interesting to note particular empirical patterns without having to make hypotheses or identify a causal model. It’s not to say that one replaces the other, but data scientists are more likely to see empirical patterns/descriptive data as interesting in their own right.
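    To make the data-management point concrete, here is a minimal sketch (the file name and column are hypothetical, not from any real study) of the streaming pattern that terabyte-scale files force on you: aggregate as you read, so memory use stays constant no matter how big the file is.

    ```python
    import csv

    def column_mean(path, column):
        """Stream a CSV row by row and keep a running total,
        so memory use is constant even for very large files."""
        total, count = 0.0, 0
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                total += float(row[column])
                count += 1
        return total / count if count else float("nan")

    # Hypothetical usage: column_mean("tweets.csv", "sentiment")
    ```

    Nothing here is exotic; the skill is knowing that you must structure the analysis this way before the data will fit through your machine at all.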


    brayden king

    February 25, 2014 at 5:23 pm

  8. I was into “big data” before it was cool – yea, I have it on vinyl – but I have to agree with “anon” that the evangelists often come too hard with the hand-waving and smoke and mirrors. I get it, and I suppose that, like other academic fashions, it will generate some jobs for technicians and some of the more clever evangelists before we all move on to the next world-shaking development, but, in the end, it is just data. The really hard part – as always – is doing the science and, unfortunately, writing python script != social science.



    February 25, 2014 at 5:31 pm

  9. I agree with Don Tomaskovic-Devey: why do you specify the source as the Internet? There are plenty of other machine-readable data that should count as big data. For example, New York City records every turnstile click in its subway system – surely that is big data? Traffic/commute patterns through tolls? Items scanned and bought at malls or grocery stores? Etc.



    February 25, 2014 at 6:56 pm

  10. I’ve found a lot of value in big data. The reason for this is that big does not JUST mean large n. It is often, but not always, accompanied with a lot of opportunities for identification — you just have to be able to understand the institutional details of the data. And this can take time and patience that some people are not willing to dedicate.

    In addition to the data management skills mentioned by Brayden, you need algorithmic skills. Lots of network algorithms, for instance, need to be modified or finessed to work with larger datasets. A lot of old tools just don’t work at scale!
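    A back-of-the-envelope sketch of why old tools break (the network sizes are purely illustrative): many classic network routines assume a dense adjacency matrix, which stores a cell for every *pair* of nodes, while a sparse edge list only pays for ties that exist.

    ```python
    def dense_matrix_bytes(n_nodes, bytes_per_cell=8):
        # A dense n x n adjacency matrix stores a cell for every node pair.
        return n_nodes ** 2 * bytes_per_cell

    def edge_list_bytes(n_edges, bytes_per_edge=16):
        # A sparse edge list only stores the ties that actually exist.
        return n_edges * bytes_per_edge

    # A million-node network with ten million edges:
    dense = dense_matrix_bytes(1_000_000)   # 8,000,000,000,000 bytes, ~8 TB
    sparse = edge_list_bytes(10_000_000)    # 160,000,000 bytes, ~160 MB
    ```

    The quadratic term is the whole story: the dense representation is hopeless at a million nodes, so the algorithm has to be rewritten around the sparse one.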



    February 25, 2014 at 11:40 pm

  11. I’ve been thinking a lot about getting data from the Internet (realizing, of course, that data from the web is only one type of “big” data — e.g., genomic and astrophysical (dark matter) data, as well as administrative (e.g,. micro census, linked employer/household) data can also be BIG). That said, data gathered from the web differ from data gathered from traditional sources in five key ways (Einav and Levin 2013 — “The data revolution and economic analysis” –NBER Working Paper) that shape the direction future research can take. First, the web provides access to real-time data, in sharp contrast to traditional sources, which often report on conditions months or even years after they occurred. As a result, social scientists can develop, test, and update theories about ongoing events (such as political campaigns or behavioral responses to weather-related emergencies) immediately, which will dramatically accelerate the knowledge-production cycle. Second, the web provides access to many new kinds of data and to data on events that older technologies could not capture. For example, Yelp and Rotten Tomatoes give us more information about ordinary people’s views of restaurants and movies than we could possibly learn from survey or interview methods.

    The last three differences between web-derived databases and traditional databases derive from the sheer magnitude of the former. Many websites provide access to very fine-grained data, which makes it possible to trace social structures and dynamics in greater detail than ever before. For instance, we can use data from social-networking sites like LinkedIn to trace how people’s social circles evolve on a daily basis, and data from Factiva to trace the rapid evolution of public opinion and news media coverage on “hot” topics. Having very large web-derived databases makes it possible to test the robustness of statistical analyses in many more ways than traditional, much smaller, datasets allow; it also makes it possible to investigate heterogeneous effects of causal processes, rather than estimating homogeneous “average” effects. Researchers could, for example, divide up Rotten Tomatoes’ movie reviewers by age, gender, ethnicity/race, and location, and assess the impact of sampling bias on their findings, and heterogeneity in the impact of a movie’s straddling multiple genres (rather than staying within a single genre).

    The sheer magnitude of some web-derived databases will also force social scientists to explore their “big” data, discover what patterns exist in them, and imaginatively induce explanations for those patterns before they conduct multivariate analyses to test hypotheses derived from extant theories. This will push social scientists out of their comfort zone, where they empirically test theory and refine it (the context of justification), and into a zone where they theorize from empirical data (the context of discovery) (Swedberg 2012 — “Theorizing in sociology and social science” — Theory & Society). Theorizing involves several related processes: labelling the observed phenomena (clarifying their nature), conceptualizing them (abstracting away from the specific cases of data under study), constructing typologies (assessing similarities and differences between cases and phenomena), and abduction (conjecturing plausible but unproven explanations for observed patterns) (Peirce 1957 — Essays in the Philosophy of Science). All of these processes will require social scientists to harness their “disciplined imagination” (Weick 1989 — Theory construction as disciplined imagination — Academy of Management Review; see also Mills 1959 — The Sociological Imagination), to employ analogies, metaphors, and inspired (if sometimes unlikely) comparisons in order to associate the phenomena under study with something else that is better known. Theorizing can involve counterfactual thinking, conjuring up alternative patterns to the ones discovered in the data; it can also involve contrasting patterns found in one part of the data with those found in another, a process that is considerably easier with the kinds of very large databases that can be gathered from the web.


    Heather A. Haveman

    February 26, 2014 at 2:34 am

  12. One more thing about “big data” & I’ll give it a rest. It’s not really a comment — just a notice to sociologists that there are FREE resources for managing very large datasets. The existence of these resources helps solve one of the problems Brayden noted in his original post.

    The NSF has funded the development of supercomputer clusters at several locations (e.g., U Pittsburgh, UCSD, UIUC), and the administrators of those supercomputers have joined under the umbrella XSEDE — the Extreme Science and Engineering Discovery Environment. This virtual infrastructure, which is free to academic researchers, allows university scientists to interactively share computing resources, data, and expertise.

    The XSEDE team is very interested in reaching out to areas, such as sociology and political science, that do not generally use large computing resources. For example, the recent XSEDE conference best paper award went to a project that is digitizing the 1940 census micro records.

    You can easily sign up for a startup account & learn about the system from online tutorials/lectures as well as from “campus champions,” computational scientists who serve as liaisons & guides for novice users of XSEDE’s resources.


    Heather A. Haveman

    February 26, 2014 at 5:42 pm

  13. “When dealing with terabytes of data, rather than good old megabytes, you simply can’t run data analysis off your desktop if you want to get it done in a reasonable amount of time.”

    Funny, that’s exactly what I said 20 years ago when writing the paper with 250 million census cases. Models that took 8 days to converge on a mainframe unix machine back then would now converge in about 30 seconds on my laptop. My point? “Big data” aren’t particularly special in this regard, either.

    The “real time”-ness of “big data” is, I think, special, and as Heather says it opens up opportunities to generate new theories and to test old theories in different ways. But I also question whether we really need an interdisciplinary team of sociologists and computer scientists to work on a database with 500 million tweets in order to learn, for example, that people are happier on weekends. (Correction: people *who tweet,* a group that is likely self-selected for sociability, among other characteristics.)



    February 26, 2014 at 7:43 pm

  14. @anon, I’m not sure I follow your point. Are you suggesting that we shouldn’t learn the data management skills necessary to do this work today with today’s technology, because in 20 years it will probably be easy to do on our laptops (or whatever the equivalent of a laptop is then)?



    February 26, 2014 at 7:52 pm

  15. Reblogged this on The Cyber Tribe.


    Richard Hudak

    February 28, 2014 at 12:14 pm

