the end of polling is near

In my Washington Post column, I discussed the possibility that social media data might displace the traditional political poll. After writing the column, I thought that I might have gone overboard. But after reading some recent research, I realized that I am really onto something. Recent research shows that social media data, when modeled correctly, does provide very good measurements of public opinion trends.

Nick Beauchamp is a political scientist at Northeastern University. He has a new working paper called “Predicting and Interpolating State-level Polling using Twitter Textual Data.” This paper is the vital intermediate step between noticing that tweets correlate with votes and using social media data by itself to forecast elections. The abstract:

Presidential, gubernatorial, and senatorial elections all require state-level polling, but continuous real-time polling of every state during a campaign remains prohibitively expensive, and quite neglected for less competitive states. This paper employs a new dataset of over 500GB of politics-related Tweets from the final months of the 2012 presidential campaign to interpolate and predict state-level polling at the daily level. By modeling the correlations between existing state-level polls and the textual content of state-located Twitter data using a new combination of time-series cross-sectional methods plus Bayesian shrinkage and model averaging, it is shown through forward-in-time out-of-sample testing that the textual content of Twitter data can predict changes in fully representative opinion polls with a precision currently unfeasible with existing polling data. This could potentially allow us to estimate polling not just in less-polled states, but in unpolled states, in sub-state regions, and even on time-scales shorter than a day, given the immense density of Twitter usage. Substantively, we can also examine the words most associated with changes in vote intention to discern the rich psychology and speech associated with a rapidly shifting national campaign.

In other words, if you do some sensible model fits and combine with content analysis, social media time series mimic the trends produced by polls. The next step is obvious: combine election results and social media data, model the error, and if the results are reasonable, you will no longer need big polls.
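To make "model the error" concrete, here is a minimal sketch of the idea, not Beauchamp's method (his paper uses time-series cross-sectional models with Bayesian shrinkage and model averaging). It fits a plain least-squares line from a candidate's share of tweet mentions to final vote share, then checks the error on races held out of the fit. Every number below is synthetic and invented for illustration, and the names `fit_line` and `mean_abs_error` are hypothetical helpers.

```python
# Toy sketch only: all figures are synthetic, invented for illustration.

def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = intercept + slope * xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_abs_error(xs, ys, slope, intercept):
    """Average absolute gap between predicted and actual values."""
    gaps = [abs(y - (intercept + slope * x)) for x, y in zip(xs, ys)]
    return sum(gaps) / len(gaps)


# Hypothetical past races: share of tweet mentions vs. final vote share.
train_tweet_share = [0.30, 0.45, 0.50, 0.62, 0.70]
train_vote_share = [0.41, 0.47, 0.50, 0.55, 0.58]

# Hypothetical later races, held out to test the fit forward in time.
test_tweet_share = [0.38, 0.55, 0.66]
test_vote_share = [0.45, 0.51, 0.57]

slope, intercept = fit_line(train_tweet_share, train_vote_share)
error = mean_abs_error(test_tweet_share, test_vote_share, slope, intercept)

print(f"slope={slope:.3f} intercept={intercept:.3f}")
print(f"held-out mean absolute error={error:.3f}")
```

If the held-out error from an exercise like this were comparable to a typical poll's margin of error, that would be the "reasonable results" threshold I have in mind.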


Written by fabiorojas

October 23, 2013 at 12:01 am

9 Responses


  1. Do you really think the big pollsters will roll over and die though?



    October 23, 2013 at 1:49 am

  2. They won’t go. That’s why we have to pull the plug.

    On a more serious note, polls will continue because they’re fun. The NY Times can instantly generate news with a poll. But for a lot of candidates, it’s not fun. It’s $$$. If you can get more or less the same answer at a fraction of the price, why not cut back on pollsters and go to social media?

    My prediction is that the big news organizations and campaigns will always have polls, but a lot of smaller outfits, who have small budgets, will cut back once it is common knowledge that there’s a real alternative.



    October 23, 2013 at 1:52 am

  3. FYI: Something seems to have broken your RSS feed as of this post.



    October 23, 2013 at 4:01 am

  4. If any large segment of the pundit class started relying on tweet-analysis instead of polls, wouldn’t tweet-analysis immediately lose its validity as campaigns tried to manipulate the tweet stream to give the media the illusion that their candidate has momentum? Campaigns aside, even supporters will change their tweeting behavior based on the amount of coverage that the media gives to tweets as an indicator of candidate strength. Sure, the tweet-analyzers can modify algorithms and screen this stuff out, but the campaigns can be just as reactive. And at any given point in the arms race, the only way to know whether the tweet-analysts or the campaign-manipulators currently have the advantage is . . . to see how well the tweet-counts correlate with polling. And the sensitivity of different subpopulations’ tweeting behavior to changes in the tweeting-incentive structure is surely going to vary across time and campaigns, so again tweet-analysis will need polling for real-time calibration.

    Polls have all the external validity, so I can only ever imagine tweets playing a subordinate, niche role. It’s great that (as the abstract notes) tweets can help us analyze less-polled states. But states are generally less-polled because they aren’t competitive. Do I need to know whether Idaho is going to give the Republican nominee 65% vs 70% of the vote in 2016? And the utility of finer time resolution in polling is, in many ways, the opposite of what the political world needs. We already have a fairly fine time resolution, and poll aggregators try to factor bounces out of the data in anticipation of regression to the mean. So saying that tweet-analysis can tell us about shifts in campaign momentum in periods shorter than a day is like saying that tweet-analysis has a greater noise-to-signal ratio than polling does. If anything, tweet-analysis would exacerbate the unhelpful horse-race approach that political journalism already brings to campaign coverage.



    October 23, 2013 at 5:25 am

  5. Two thoughts @commenter:

    1. We have some empirical evidence that “gaming” isn’t changing the basic dynamic. The bivariate correlation between votes and tweets remains about the same in the 2010 and 2012 elections – the latter election is one where people were actively working through social media.

    2. Theoretically, social media spam should correlate with finances (to hire programmers, media people, etc), which in turn correlates with political strength. So the spam levels should reflect relative popularity.

    I could be wrong – maybe 2014 will be way different and my argument is incorrect. But so far, things are looking good for the more tweets/more votes hypothesis.



    October 23, 2013 at 4:54 pm

  6. It’s basically a self-refuting mechanism. If the polling organisations go out of business and media relies on tweets, then tweets will surely be skewed. True, maybe the skewedness reflects financial resources, and one can tell spam from proper tweets, but how do you distinguish between tweets that you have paid pennies for and those that are not “purchased”? Basically impossible without polling evidence. Of course stats based on tweets will be used, as many other data sources will as well. But “commenter” very clearly illustrated that the headline is necessarily wrong.



    October 23, 2013 at 5:50 pm

  7. Wonks: We have the final election tallies, which is all you need to build forecasting models.



    October 23, 2013 at 6:24 pm

  8. I have read the more tweets, more votes paper. Is there any way you could devote a little more time to elucidating the theory as to why you think this works? I’m not necessarily questioning the findings, but I would like to hear more about the logic that helps to explain the relationship. Or could you even suggest some of the literature worth reading?


    Scott Dolan

    October 23, 2013 at 7:12 pm

  9. We have a working paper that suggests the beginning of an answer. When you compare different types of tweets, those that use simple syntax (e.g., “Boehner is a winner!” v. “@Boehner #sucks”) correlate with final vote shares while complex ones do not. This suggests that as average people get excited about the race they start talking about the candidate. So: generic excitement –> social media use [correlates with] voting.

    I should note that I am NOT making a causal argument. In fact, my suspicion is that candidate strength precedes social media use. But I have no way of directly testing that hypothesis.



    October 23, 2013 at 7:18 pm

