big data: a definition
People often complain, justifiably, that “big data” is a catchy phrase, not a real concept. And yes, it certainly is hot, but that doesn’t mean that you can’t come up with a useful definition that can guide research. Here is my definition – big data is data that has the following properties:
- Size: The data is “large” when compared to the data normally used in social science. Normally, surveys only have data from a few thousand people. The World Values Survey, probably the largest conventional data set used by social scientists, has about two hundred thousand people in it. “Big data” starts in the millions of observations.
- Source: The data is generated through the use of the Internet – email, social media, web sites, etc.
- Natural: It is generated through routine daily activity (e.g., email or Facebook likes). It is not, primarily, created in the artificial environment of a survey or an experiment.
In other words, the data is bigger than normal social science data; it is “native” to the Internet; and it is not mainly concocted by the researcher. This is a definition meant for social scientists. It is useful because it marks a fairly intuitive boundary between big data and older data types like surveys. It also identifies the need for a skill set that combines social science research tools and computer science techniques.