data science is engineering – a guest post by karissa mckelvey
This is a guest post by Karissa McKelvey. She is affiliated with the Complex Systems PhD program at Indiana University’s School of Informatics. She works on the intersection of social media and political mobilization and has co-authored papers on Occupy Wall Street and the More Tweets/More Votes phenomenon.
Why Data Science is just a fad, and the future of the academy
We expect students to write research papers as well as do statistics in R or STATA or Matlab on small datasets. Why don’t we expect them to deal with very very large datasets? We are told that “Data Science” is the answer to this “Big Data” problem.
I’d like to redefine Data Science: it is the act of gluing toolkits together to create a pipeline from raw data to information to knowledge. There are no innovations to be made in Data Science. The innovations to be made here are in Computer Science, Informatics, Statistics, Sociology, Visualization, Math, etc., and they always will be.
Data Science is just engineering.
A Bit of Background
Recently named the Sexiest Job of the 21st Century by Harvard Business Review, Data Science has emerged as a new discipline, with skill sets applicable to handling large datasets from social media, mobile phones, online purchases, genomes, and other sources.
“Data science incorporates varying elements and builds on techniques and theories from many fields, including math, statistics, data engineering, pattern recognition and learning, advanced computing, visualization, uncertainty modeling, data warehousing, and high performance computing with the goal of extracting meaning from data and creating data products.”
The description goes on to reinforce the perception that data science is very difficult, by stating that “there is probably no living person who is an expert in all of these disciplines.” Of course there isn’t. Data science is usually handled by teams of competent individuals with varied backgrounds that can cover most of these areas.
The truth: only 1 of the 9 disciplines above is taught in primary and secondary schools in the United States. Students are lucky to have a Computer Science AP course in high school, and even those courses do a sub-par job of preparing them for the real world of computing.
Thus, we are hitting an impasse, wherein the computational abilities required to handle our data output are far outpacing our abilities or workforce size. I argue that Data Science has simply emerged from the inability to teach a generation — my generation — about software engineering.
Why do we trust you, Karissa?
I have been a research assistant at Indiana University for the past year. By using my computer science skills in conjunction with some social science collaborators with good questions, I’ve been able to put myself in a position that most people don’t find themselves in so early in their academic careers. The social sciences are loving studies that use these “Big Data”-sets, as we are able to compute things about humans on a larger scale than was ever possible before in human history. The field is wide open. You can easily become a wizard … or at least make them think you are.
Run through 400 million social media posts? Get access to a big enough computer, know which library to use, and everything is practically done for you. Something that took a couple hundred lines of code just a few years ago can now be done like this:
res = db.execute(map_fun, reduce_fun)
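For readers who have not seen this pattern, here is a minimal sketch of what a call like that might be doing underneath. The post does not say which database or library `db` is, so everything here is an assumption for illustration: `execute` is a hypothetical in-memory stand-in, and the hashtag-counting `map_fun` and `reduce_fun` are made-up examples rather than the functions used in the actual research.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_fun(post: str) -> Iterator[Tuple[str, int]]:
    """Map step (illustrative): emit (hashtag, 1) for every hashtag in one post."""
    for word in post.split():
        if word.startswith("#") and len(word) > 1:
            yield word.lower(), 1


def reduce_fun(key: str, values: List[int]) -> int:
    """Reduce step (illustrative): total the counts collected for one hashtag."""
    return sum(values)


def execute(
    records: Iterable[str],
    map_fun: Callable[[str], Iterator[Tuple[str, int]]],
    reduce_fun: Callable[[str, List[int]], int],
) -> Dict[str, int]:
    """Hypothetical single-machine stand-in for a library's map-reduce call."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    # Apply the map function to every record and group emitted values by key.
    for record in records:
        for key, value in map_fun(record):
            grouped[key].append(value)
    # Apply the reduce function once per key.
    return {key: reduce_fun(key, values) for key, values in grouped.items()}


# Three made-up posts standing in for the 400 million real ones.
posts = [
    "Marching today #OWS",
    "#ows general assembly at noon",
    "Go vote! #election",
]
res = execute(posts, map_fun, reduce_fun)
print(res)
```

A real library would spread the map and reduce steps across many machines, but the code the researcher writes is roughly this small: two short functions and one call.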
The truth is, someone equipped with a couple years of computer science training can complete these tasks (I did!). They are not so difficult. The incentives to pass these skills on, however, are not institutionalized. As demand increases, my salary only goes up… especially if the “data scientist” supply is stagnating.
So, what does this mean for the academy? The picture at the top of this blog post shows how many schools have pounced on the idea of a Data Science program, fellowship, master’s, or PhD. But what is actually happening here? We are funneling people into “Data Science” who would otherwise be studying Biology, Physics, Sociology, Political Science, or other fields.
What if sociologists didn’t have to hire a data scientist, or become data scientists; what if political scientists didn’t have to get an MS in computer science; what if… what if… we saw programming as a fundamental life skill? What if instead of focusing on learning computational infrastructure in a program like Data Science, students could just dive into their field of interest, using computational tools?
What if along with Reading, wRiting, and aRithmetic, there was pRogramming? The “fourth R” (in the past it has also been called ‘rithms, for “algorithms”) has been touted as the 21st-century literacy. Once the academy builds programming and algorithms into the basic curriculum, I think we will see Data Science stop being this catch-all for every student who wants to learn how to harness “Big Data.”
Don’t get me wrong: I’m sure Data Science will continue to exist for a long time, and the current setup is great for people like me. Not to mention, having pods of people who do similar things allows resources to be shared easily. But I predict this field will flow back into its respective subfields as computer science literacy increases. The programming parts of “Data Science” are actually pretty easy.
I imagine an academy where computer science becomes a fundamental skill in Sociology, Biology, Journalism, Economics, Physics, Political Science; where the idea of “Data Science” as a separate entity seems absurd, because that’s just engineering. That’s just another way we can produce and transmit results, a standard practice taught alongside reading and writing.