orgtheory.net

one limitation of big data: sampling people

with 2 comments

I am a huge proponent of “big data” – the data that millions, no, billions of people generate as they use the Internet. For the first time in human history, we have a portal into the collective chatter of humanity. And if you can’t see the promise in that, you are sorely lacking in imagination.

But, as with anything, big data has problems and limitations. Perhaps the most fundamental limitation of big data is that it is a corpus of text, not a sample of people. The typical big data project uses information gleaned from social media, search engines, email, collaborative editing platforms (e.g., GitHub or wikis), and the World Wide Web. What you get is whatever people happened to write or click, not a set of individuals drawn from a known population.

Skeptics will just wave their hands and say “I told you so! It’s garbage!” My response is different. First, there are lots of data that are immensely useful even if they aren’t perfect random samples – historical archives, dinosaur bones, field notes, etc. What is important is that the data source tells you something important about a case that you care about, or that it addresses the possibility that something might be true.

Second, even though big data is not a sample of people, it is still a sample of a very important type of communication. And surely, this would be of enormous value to social science.

Third, when I hear that method X has a limitation, I usually see it as an opportunity. For example, survey data on income are often censored at the bottom or the top (i.e., the poorest or wealthiest respondents are lumped into a single category). Instead of saying, “Garbage! No survey data for the study of income!” – you should say, “Can we model the bias? How do we qualify the findings?” This is the spirit of the Heckman selection model and other models of survey bias.
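To make “Can we model the bias?” concrete, here is a minimal sketch of one simple correction, post-stratification: estimate the quantity within each group, then reweight by the groups' known population shares. This is not the Heckman model, just an illustration of the general idea. The group names, the sample, and the population shares are all made up for the example.

```python
def poststratify(sample, population_shares):
    """Reweight group means by known population shares.

    sample: list of (group, value) pairs from a non-representative source.
    population_shares: dict mapping group -> share of the target population.
    Both inputs are hypothetical; real shares would come from a census
    or a similar benchmark.
    """
    totals, counts = {}, {}
    for group, value in sample:
        totals[group] = totals.get(group, 0.0) + value
        counts[group] = counts.get(group, 0) + 1

    # A group with no observations cannot be reweighted.
    missing = set(population_shares) - set(counts)
    if missing:
        raise ValueError(f"no observations for groups: {sorted(missing)}")

    return sum(
        share * totals[group] / counts[group]
        for group, share in population_shares.items()
    )


# Made-up data: 1 means the person expressed support for some policy.
# The online sample is 80% "young", but the population is only 30% "young".
sample = (
    [("young", 1)] * 60 + [("young", 0)] * 20
    + [("old", 1)] * 5 + [("old", 0)] * 15
)
population_shares = {"young": 0.3, "old": 0.7}

raw_mean = sum(value for _, value in sample) / len(sample)
adjusted_mean = poststratify(sample, population_shares)

print(f"raw: {raw_mean:.2f}, adjusted: {adjusted_mean:.2f}")
```

The raw figure overstates support because the over-represented group is also the more supportive one. The adjustment only helps if the groups you reweight on actually capture the bias, which is itself an assumption to be checked.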

So yes, big data has problems. But so do all data, and if you take a moment to think about it, a lot of problems are simply opportunities for developing new insights into research methods.

Written by fabiorojas

January 9, 2017 at 3:08 am

Posted in fabio, uncategorized

2 Responses

  1. There’s been interesting work on correcting for demographic bias for political forecasting, e.g.

    http://www.stat.columbia.edu/~gelman/research/published/forecasting-with-nonrepresentative-polls.pdf

    Alex Hanna (@alexhanna)

    January 10, 2017 at 2:41 pm

  2. Thank you. This is important.

    fabiorojas

    January 10, 2017 at 4:05 pm

