the internet is an unstable research environment
I am a huge advocate for big data research and it pains me to see otherwise rational people get in a tizzy over what is simply very interesting and rich observational data. What I find puzzling is that there actually is a big drawback to “big data” that few people mention – the lack of reproducibility that results from the inherent instability of the Internet. Let me give you a few examples:
- You are harvesting tweets and then the API changes and you no longer get the same data.
- You are working with a data provider and they lose their license.
- The startup you were working with gets new management and you lose access to the data.
- A startup goes bankrupt, or nearly so, and all the data is removed or simply erased. For example, you can no longer get MySpace data.
- Google changes what you can get from their Google search data.
These are all serious problems because they mean you can never go back and reproduce results. From the firms' perspective, it's fine; they probably made the change for a good business reason. For a researcher, it is a disaster. You can never fix problems, explore new models, or reproduce the research for later generations. Since most firms don't archive data – poof, it's gone!
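One partial defense, until firms do the archiving themselves, is to snapshot the raw responses you harvest before running any analysis on them. A minimal sketch in Python (the function name and file layout here are my own illustration, not any standard tool): each raw payload is written to disk under a name that records its source, a UTC timestamp, and a content hash, so the analysis can later be re-run from the exact bytes you originally received.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def archive_response(raw_bytes: bytes, source: str, archive_dir: str = "archive") -> Path:
    """Write one raw API response to disk, named by source, UTC timestamp,
    and content hash, so an analysis can later be re-run from the snapshot."""
    digest = hashlib.sha256(raw_bytes).hexdigest()[:12]
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    out_dir = Path(archive_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{source}-{stamp}-{digest}.json"
    path.write_bytes(raw_bytes)
    return path

# e.g. archive the exact bytes returned by a (hypothetical) tweet-harvesting call
snapshot = archive_response(json.dumps({"id": 1, "text": "hello"}).encode(), "twitter")
```

This doesn't solve the licensing problems above, but it at least means an API change or a vanished provider can't retroactively erase the data your published results rest on.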
I hope that one day we'll get to a "big data GitHub," where firms archive data so that people generations from now can use it. Sadly, we are in the opposite situation: we are lucky if we can access data that was used last week.