one limitation of big data: sampling people
I am a huge proponent of “big data” – the data that millions, no – billions, of people generate as they use the Internet. For the first time in human history, we have a portal into the collective chatter of humanity. And if you can’t see the promise in that, you sorely lack in imagination.
But, as with anything, big data has problems and limitations. Perhaps the most fundamental limitation of big data is that it is a corpus of text, not a sample of people. In other words, the typical big data project uses information gleaned from social media, search engines, email, crowd-sourced editing (e.g., GitHub or wikis), and the World Wide Web. It simply is not a sample of people.
Skeptics will just wave their hands and say “I told you so! It’s garbage!” My response is different. First, there are lots of data that are immensely useful even if they aren’t perfect random samples – historical archives, dinosaur bones, field notes, etc. What is important is that the data source tells you something important about a case that you care about, or addresses the possibility that something might be true.
Second, even though big data is not a sample of people, it is still a sample of a very important type of communication. And surely, this would be of enormous value to social science.
Third, when I hear that method X has a limitation, I usually see it as an opportunity. For example, survey data on income are often censored on the left or the right (i.e., the poorest or the wealthiest respondents are lumped into a single category). Instead of saying, “Garbage! No survey data for the study of income!” – you should say, “Can we model the bias? How do we qualify the findings?” This is exactly what was done in the Tobit model for censored data, the Heckman model for sample selection, and other models of survey bias.
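To make the “model the bias” idea concrete, here is a minimal sketch in Python. It is not the Heckman model. It is a simpler, intercept-only censored-normal model (in the spirit of a Tobit model), fit with an EM loop, and it assumes that log income is normally distributed and that every value above a known cap was recorded as the cap. The data are simulated, and the function name, the cap, and the parameter values are all made up for illustration.

```python
import random
from statistics import NormalDist, fmean, pstdev

STD_NORMAL = NormalDist()  # standard normal, used for its pdf and cdf


def fit_top_coded_normal(values, cap, iterations=200):
    """EM estimate of (mu, sigma) for normal data in which every
    value above `cap` was recorded as `cap` (hypothetical helper)."""
    observed = [v for v in values if v < cap]
    n = len(values)
    n_capped = n - len(observed)
    sum_obs = sum(observed)
    sum_sq_obs = sum(v * v for v in observed)

    # Start from the naive (biased) estimates.
    mu, sigma = fmean(values), pstdev(values)
    for _ in range(iterations):
        # E-step: expected y and y^2 for a capped case, given y > cap.
        a = (cap - mu) / sigma
        hazard = STD_NORMAL.pdf(a) / (1.0 - STD_NORMAL.cdf(a))
        e_y = mu + sigma * hazard
        e_y2 = mu * mu + 2 * mu * sigma * hazard + sigma * sigma * (1 + a * hazard)
        # M-step: ordinary mean and variance on the completed data.
        new_mu = (sum_obs + n_capped * e_y) / n
        new_var = (sum_sq_obs + n_capped * e_y2) / n - new_mu ** 2
        mu, sigma = new_mu, new_var ** 0.5
    return mu, sigma


# Simulated log incomes; all numbers here are assumptions for the demo.
random.seed(42)
TRUE_MU, TRUE_SIGMA, CAP = 10.5, 0.8, 11.0
true_values = [random.gauss(TRUE_MU, TRUE_SIGMA) for _ in range(5000)]
recorded = [min(v, CAP) for v in true_values]  # the survey top-codes at CAP

naive_mu = fmean(recorded)
model_mu, model_sigma = fit_top_coded_normal(recorded, CAP)
print(f"true mean {TRUE_MU:.3f}, naive {naive_mu:.3f}, modeled {model_mu:.3f}")
```

The naive average of the recorded values is pulled down because the wealthy are lumped together at the cap, while the modeled estimate should land much closer to the true mean. The correction is only as good as the normality assumption, which is exactly the kind of qualification the findings would need.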
So yes, big data has problems. But so do all data, and if you take a moment to think about it, a lot of those problems are simply opportunities for developing new insights into research methods.