orgtheory.net

data science is engineering – a guest post by karissa mckelvey

This is a guest post by Karissa McKelvey. She is affiliated with the Complex Systems PhD program at Indiana University’s School of Informatics. She works on the intersection of social media and political mobilization and has co-authored papers on Occupy Wall Street and the More Tweets/More Votes phenomenon.

Why Data Science is just a fad, and the future of the academy

We expect students to write research papers as well as do statistics in R or STATA or Matlab on small datasets. Why don’t we expect them to deal with very very large datasets? We are told that “Data Science” is the answer to this “Big Data” problem.

I’d like to redefine Data Science: it is the act of gluing toolkits together to create a pipeline from raw data to information to knowledge. There are no innovations to be made in Data Science. The innovations to be made here are in Computer Science, Informatics, Statistics, Sociology, Visualization, Math, etc., and they always will be.

Data Science is just engineering.

A Bit of Background

Recently named the Sexiest Job of the 21st Century by Harvard Business Review, Data Science has emerged as a new discipline, with skillsets applicable to handling large datasets from social media, mobile phones, online purchases, genomes, and other sources.

From Wikipedia:

“Data science incorporates varying elements and builds on techniques and theories from many fields, including math, statistics, data engineering, pattern recognition and learning, advanced computing, visualization, uncertainty modeling, data warehousing, and high performance computing with the goal of extracting meaning from data and creating data products.”

The description goes on to reinforce the perception that data science is very difficult, by stating that “there is probably no living person who is an expert in all of these disciplines.” Of course there isn’t. Data science is usually handled by teams of competent individuals with varied backgrounds that can cover most of these areas.

The truth: only 1 of the 9 disciplines above is taught in primary and secondary schools in the United States. Students are lucky to have a Computer Science AP course in high school, and even those courses are sub-par preparation for the real world of computing.

Thus, we are hitting an impasse, wherein the computational abilities required to handle our data output are far outpacing our abilities or workforce size. I argue that Data Science has simply emerged from the inability to teach a generation — my generation — about software engineering.

Why do we trust you, Karissa?

I have been a research assistant at Indiana University for the past year. By using my computer science skills in conjunction with some social science collaborators with good questions, I’ve been able to put myself in a position that most people don’t find themselves in so early in their academic career. The social sciences are loving studies that use these “Big Data”-sets, as we are able to compute things about humans on a larger scale than ever possible before in human history. The field is wide open. You can easily become a wizard … or at least make them think you are.

Run through 400 million social media posts? Get access to a big enough computer and know the library to use, and everything is practically done for you. Something that took a couple hundred lines of code just a few years ago can now be done like this:

res = db.execute(map_fun, reduce_fun)
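To make that one-liner concrete, here is a minimal in-memory sketch of what a call like it might do behind the scenes. The post does not say which database or library `db` refers to, so the `execute` helper, the hashtag-counting task, and the sample posts are all assumptions made up for illustration. A real system would also spread this work across many machines.

```python
from collections import defaultdict

# Hypothetical stand-in for db.execute: the post does not name a
# database or library, so this version simply runs in memory.
def execute(records, map_fun, reduce_fun):
    groups = defaultdict(list)
    for record in records:
        # Map step: each record emits zero or more (key, value) pairs.
        for key, value in map_fun(record):
            groups[key].append(value)
    # Reduce step: collapse the values collected under each key.
    return {key: reduce_fun(key, values) for key, values in groups.items()}

def map_fun(post):
    # Emit (hashtag, 1) for every hashtag found in a post.
    for word in post.split():
        if word.startswith("#"):
            yield word.lower(), 1

def reduce_fun(key, values):
    # Total the counts emitted for one hashtag.
    return sum(values)

# Made-up sample data, standing in for 400 million real posts.
posts = [
    "Marching downtown today #ows",
    "Vote tomorrow! #election #OWS",
    "no tags here",
]

res = execute(posts, map_fun, reduce_fun)
print(res)  # {'#ows': 2, '#election': 1}
```

The researcher only writes `map_fun` and `reduce_fun`. The grouping, and on a real cluster the distribution of work, is the part the library handles for you.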

The truth is, someone equipped with a couple years of computer science training can complete these tasks (I did!). They are not so difficult. The incentives to pass these skills on, however, are not institutionalized. As demand increases, my salary only goes up… especially if the “data scientist” supply is stagnating.

The Academy

So, what does this mean for the academy? The picture at the top of this blog post shows how many schools have pounced on the idea of a Data Science program, fellowship, master’s, or PhD. But what is actually happening here? We are funneling people into “Data Science” who would otherwise be studying Biology, Physics, Sociology, Political Science, or other fields.

Here we see the transformation of academia. Students are funneled from various disciplines to this ephemeral “Data Science,” where they learn parallel computing, machine learning, and other applicable skills from Computer Science and Informatics. As programming is adopted as the “fourth R” alongside reading, writing, and arithmetic, we see the disciplines separate and data science disappear.

What if sociologists didn’t have to hire a data scientist, or become data scientists; what if political scientists didn’t have to get an MS in computer science; what if… what if… we saw programming as a fundamental life skill? What if instead of focusing on learning computational infrastructure in a program like Data Science, students could just dive into their field of interest, using computational tools?

What if along with Reading, wRiting, and aRithmetic, there was pRogramming? The “fourth R” (or ‘rithms, for “algorithms,” as it has been called in the past) has been touted as the 21st-century literacy. Once the academy builds programming and algorithms into the basic curriculum, I think we will see Data Science stop being this catch-all for every student who wants to learn how to harness “Big Data.”

Don’t get me wrong: I’m sure Data Science will continue to exist for a long time, and the current setup is great for people like me. Not to mention, having pods of people who do similar things allows resources to be shared easily. But I predict this field will flow back into its respective subfields as computer science literacy increases. The programming parts of “Data Science” are actually pretty easy.

I imagine an academy where computer science becomes a fundamental skill in Sociology, Biology, Journalism, Economics, Physics, Political Science; where the idea of “Data Science” as a separate entity seems absurd, because that’s just engineering. That’s just another way we can produce and transmit results, a standard practice taught alongside reading and writing.

Written by fabiorojas

July 10, 2013 at 12:01 am

17 Responses

  1. Please make a clarification about how the study of data or Big Data is relevant. Did I miss something, or is there a topic here? How does someone study data analysis or programming unless it is about something?
    Let me guess: this is value-free data analysis?

    Fred Welfare

    July 10, 2013 at 3:20 am

  2. Fred: The issue is that some people are arguing that Big Data is so specialized in its skills that it deserves its own discipline, or academic space. And yes, it’s independent of content.

    fabiorojas

    July 10, 2013 at 3:39 am

  3. @Fred: “How does someone study data analysis or programming unless it is about something?” – The entire point of the post is to criticize the notion that “data science” should be a stand-alone discipline divorced from subject areas. That being said, it’s definitely possible to study data analysis and programming without it being “about something.”

    JD

    July 10, 2013 at 3:40 am

  4. Then you agree that Big Data is value-free?

    Fred Welfare

    July 10, 2013 at 3:51 am

  5. Some aspects are clearly technical and value free. For example, how you store these enormous data sets is by itself a question. As K notes above, the people calling for Big Data science want a special role for people who design the tools for accessing these data sets. But of course, how you analyze them and for what purpose is obviously value laden.

    fabiorojas

    July 10, 2013 at 3:57 am

  6. Sorry — without context this post can be a little dense. I’m glad you asked for clarification, Fred.

    First, I’d like to distinguish between Big Data and Data Science. The technical side of “Big Data” is computer science, usually landing in the subfields of Distributed Systems or Databases. However, “Data Science” as a discipline is starting to rise (type “Data Science degree” into Google). It applies statistics and programming to another discipline (sociology, biology, and political science being prime targets) without necessarily learning anything about those target disciplines.

    They publish in Computer Science journals and conferences (and Science) using simulation and large datasets (there is even a Data Science journal http://www.epjdatascience.com/). Often, they are publishing findings that are devoid of theory, or that claim to be the “first” but really duplicate some sociology work from 30 years ago without ever citing it. These are just a few of the crucial issues when you have a bunch of Data Scientists running around playing Sociologist.

    We might want to be careful teaching people that Big Data is value-free, because a) data does have a context, and to do good science we must be mindful of that, and b) what is good applied science in one discipline could be poorly applied in another (e.g., an r² of .15 is really bad in some disciplines, but publishable in sociology). What I’m pointing out is that Data Science is not going to last once we incorporate computational tools into the curriculum.

    Karissa McKelvey

    July 10, 2013 at 5:23 am

  7. At least hopefully it won’t last…. and then people can use computational tools while studying the discipline, and actually have value laden analysis without the trouble of the data science bubble.

    Karissa McKelvey

    July 10, 2013 at 5:25 am

  8. Drawing the analogy with reading, writing, arithmetic and also mathematics, statistics, narrative analysis, content analysis and text processing, interviewing, archival methods etc., there is obviously a distinction between and need for both scholars who study and advance the methods themselves and those who use the methods as tools for other substantive agendas. We still have statistics departments and mathematics departments even as the applications of statistics and mathematics are housed in substantive departments. We still have departments of English and Communication even though everybody writes. We have more turf disputes over whether everyone who is working in a historical archive belongs in a history department.

    So it seems like the meta-issue of whether people should be getting degrees in data science depends on the analogy. In the short run, I could imagine a master’s program in data science being a useful add-on to a substantive PhD, to provide skills.

    The other side of the discussion is the critique of mindless analysis of social data by people who have computational skills but no social science theory or knowledge to tell them what questions to ask or how to make sense of all the data. Perhaps the answer to that is to require people earning degrees in “data science” to take a master’s degree in a substantive field relevant to the data they are analyzing.

    On another tangent, when I was in college in the late 1960s, I took a programming class even though I was not a computer science student, and the idea that basic programming was a skill that everyone should have was pretty widespread. In the 1970s, many people who were not programmers rolled their own computer programs to get little jobs done. Programming was not taught in high school but it was taught in college. These days, the computer science courses on my campus are over-enrolled and it is difficult to get into them. And these days you can use a computer with no programming skills at all, or any idea about what it is doing in the background. I’m wondering whether the percentage of college students who take a programming course has gone up or down in the past few decades.

    olderwoman

    July 10, 2013 at 1:27 pm

  9. I can understand the seeking of a credential in Computer Programming or Data Analysis as a generic skill. Given a data set or archive, a programmer could fashion a data mining system. What is at issue is the purpose, origin, or effect of particular hypotheses upon the outcome of a particular program. Context is everything. I wonder now how the notion of statistical validity is understood by ‘virtual’ programs.

    Fred Welfare

    July 11, 2013 at 5:19 am

  10. Nice post. You are right, large amounts of data will not likely go away anytime soon. Yet I challenge you: “The programming parts of ‘Data Science’ are actually pretty easy.” If this is true for you, learn Java, or Erlang, or anything you don’t know. Challenge yourself to learn more, which, as a data scientist, is more important than ‘knowing’ anything. The ability to figure out what you don’t know is the asset.

    jacob

    July 11, 2013 at 8:57 am

  11. @jacob the programming parts of Data Science are trivial; let me explain. Part of the goal of a true computer science education, and what makes the difference between an engineer and a programmer, is the fundamental understanding of programming languages. Programming languages aren’t that different, actually, once you understand how they work from the inside. If we incorporate computer science into the educational framework, I’d expect this understanding to also be passed down.

    Understanding the fundamentals of programming languages is akin to understanding grammar or verb conjugations in Latin-based languages. Once you understand how it works, you just need to memorize the spelling. Learning just R or Python for data analysis is akin to learning spelling.

    Karissa McKelvey

    July 11, 2013 at 4:21 pm

  12. Preach it.

    Ariel

    July 13, 2013 at 2:32 am

  13. I’ve been following this thread and remain confused about the argument. Is the argument that all students should have two years of computer programming courses as part of a basic education? Does this proposal include adding another year to all students’ time to degree, or does it propose eliminating some other courses from the standard curriculum? If so, what courses do you think are now expendable in the age of computer programming?

    I’m also confused about some of your (Karissa’s) comments about programming. I learned basic FORTRAN syntax as a child in the 1960s, from my father, and I took one computer science class in college and a number of short courses in various computer languages over the years. In the 1980s, I wrote simulation programs in SIMSCRIPT, a special language that was built on top of FORTRAN. I have gotten good at writing Stata macros, although I have still not figured out how to manipulate MATA or write a Stata program.

    So I know what programming languages are, I have written programs, and I agree that the syntax of one language versus another is a trivial problem. (Almost. Some languages are so different in their underlying logic that this does not apply. Have you ever tried to program in APL? That was kind of like trying to program in Martian. I also never could get my head around how Mathematica was thinking.)

    BUT, and here is the thing, my spouse is a systems programmer, and I know that what I know about programming is not the same thing as being a computer programmer. REAL programming involves knowing how to solve problems using a computer, and that is a much higher-level skill that is very different from just learning language syntax and simple routines. I had an RA once–a smart guy–who wrote a program for us that tied up the whole computer system all night. It took forever to run. I asked my spouse to look at the code. It turned out that the problem was that the program had been structured naively in a way that used up all the computer resources unnecessarily. I’m not technically trained enough to remember the language, but it was something like an infinite stack. A computer science issue, not a syntax issue is my point. One that involved understanding what the computer was doing, what the machine-meaning of the code was. And the problem was so deeply embedded in the code that there was no way to fix it. The only solution would have been to throw all the code away and start over.

    So back to my original question. What exactly do you think the curriculum should include? We all drive cars and only some of us know how to repair them. We all use computers without knowing really how they work. What kind of skills are you envisioning as the “normal” curriculum everyone should have?

    olderwoman

    July 13, 2013 at 2:15 pm

  14. @olderwoman, thanks a lot for your response. It’s great to hear your experience with FORTRAN and your spouse’s work. I’ve tried to program in APL, and yes it has a particular paradigm, but it is actually not that different than basic linear algebra and functional programming.

    Your anecdote about your spouse fixing the RA’s code sounds like a classic Data Structures 101 problem: the way we structure how we store and search for data matters a lot for very large datasets. However, this sort of stuff is arguably the most important part of a computer science education.

    Like you said, “We all use computers without knowing really how they work.” I think that has to change. Unfortunately, though, I think your analogy to cars is not a useful one. Not only do cars have little to do with research (:p), we can still drive the car even when we can’t repair it. I’d rather liken it to writing, wherein you may be able to write complete sentences, but it doesn’t mean you can write a 5-paragraph essay.

    The curriculum? I’d like to start with the 5-paragraph essay of programming: algorithms, data structures, and good style.

    Karissa McKelvey

    July 14, 2013 at 5:50 am

  15. Apropos of this and other recent discussions about big data and data science: Crooked Timber

    krippendorf

    July 18, 2013 at 10:01 pm

  16. […] of gene-behavior links. The appearance of “Big Data,” which we’ve argued about on this blog. The demand for experiments from the policy […]

  17. While it’s not yet official at educational institutions, everyone I know has used or is using this pipeline from raw data to information to knowledge; we really need to emphasize it.

    Steve Lewis

    July 23, 2013 at 4:52 pm

