more tweets, more votes – q&a and erratum

This week, there has been substantial media coverage of the More Tweets, More Votes paper, which was presented on Monday at the ASA meeting in New York. Scholars and campaign professionals have been asking questions about the draft of the paper, which can be found here. Since we have received many questions and requests for clarification, I will address them through this blog post.

1. Your tweets/votes R-squared is small. The correlation between tweets and votes is actually really small when compared with other factors (such as incumbency).

Commenters have asked about the size of the Twitter correlation in comparison with other models. First, no claim was made about this issue, and it is not relevant to the major point of the paper. The point of the paper is that social media has important information. This information may be correlated with other data. However, we can compare the Twitter bivariate correlation with other correlations. The Twitter correlation with Republican vote margin, for example, is .53. Incumbency has a correlation of .73 with vote margin. The proportion of people with a college education has a correlation of .15. Thus, the Twitter measure is in the middle of the range of the variables we look at.

2. 404 out of 406?: In your SSRN draft, the analysis does not predict the winner in 404 out of 406 competitive races, which is what Fabio Rojas said in the WaPo op-ed. (http://www.washingtonpost.com/opinions/how-twitter-can-predict-an-election/2013/08/11/35ef885a-0108-11e3-96a8-d3b921c0924a_story.html?wpisrc=emailtoafriend)

A number of commenters have asked about the number of correctly predicted races. In the original paper, we do not perform this analysis. For the purposes of presenting the research to the public, we computed the rate of correct predictions (within the data), which was about 92.5%. I then multiplied this by all races (435). Therefore, the extrapolated number of correctly predicted races is 404 out of 435. If we use only the contested race subsample, we get 375 races out of 406 contested races. This is a correction of what I wrote in the op-ed, which accidentally combined these two estimates. The op-ed now contains the correction.
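As a quick check on the arithmetic, the counts above can be turned back into rates. This is only an illustrative sketch: the function name is hypothetical, and the in-sample rate is given here only as "about 92.5%", so the exact figure used in the original calculation is not known.

```python
def implied_rate(correct: int, total: int) -> float:
    """Share of races called correctly, given a pair of counts."""
    if total <= 0:
        raise ValueError("total must be positive")
    return correct / total


# Counts as stated in this post.
all_races = implied_rate(404, 435)   # extrapolation to all House races
contested = implied_rate(375, 406)   # contested-race subsample

# The mistaken pairing from the original op-ed (404 out of 406).
combined = implied_rate(404, 406)

print(f"all races: {all_races:.1%}")
print(f"contested: {contested:.1%}")
print(f"combined (erroneous): {combined:.1%}")
```

Both corrected figures come out within half a percentage point of 92.5%, while the mistaken pairing of 404 with 406 would imply a rate above 99%, which is why the two estimates should not be combined.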

3. You don’t predict an election. “[…] just in case someone is paying attention: You, Have, To, Predict, In, Advance. If you don’t want to follow my advice follow that of Lewis-Beck (2005): ‘the forecast must be made before the event. The farther in advance […] the better’.” Gayo-Avello (http://di002.edv.uniovi.es/~dani/PFCblog/)

Professor Gayo-Avello and other commenters have raised the issue of prediction. He is correct in that we didn’t use contemporary data to predict elections in the future. Rather, we use “predict” in the statistical sense. We use social media data to estimate a dependent variable within the sample.

4. The Pollyanna effect is unsubstantiated. There is no support to say negative tweets are a good thing for a candidate.

The Pollyanna effect is merely a hypothesized explanation for what we find. It requires further research and study. We make no claim that it has been established.

5. Twitter user base is not representative of the population, self-selection bias, spam, propaganda, lack of geolocation of tweets.

A number of commenters have focused on the fact that we know little about the people who write tweets, nor do we estimate whether tweets are positive or negative. This is true, but the point of the paper is not to make an estimate of who people are, or to interpret what they say. Rather, it is simply to show that social media contains informative signals of what people might do. Remarkably, the data show a correlation even though Twitter users are not a random sample of the population. We are simply measuring the relative attention given to a political candidate.

6. Vote share is a more natural way than vote margin to analyze and present the results, as well as consistent with prior Political Science research. (http://themonkeycage.org/2013/04/24/the-tweets-votes-curve/)

Some readers noted that traditional political science uses vote share rather than vote margin. Our updated paper corrects that. The original paper is a non-peer-reviewed draft. It is in the process of being corrected, updated, and revised for publication. Many of these criticisms have already been incorporated into the current draft of the paper, which will be published within the next few months.

Written by fabiorojas

August 16, 2013 at 8:27 pm

Posted in fabio, mere empirics, political science, poll

8 Responses

amateurs?? that’s some thoughtless rhetoric. I am all about raising the level of dignity in academic discourse. let’s try harder on that front, people!

doug

August 17, 2013 at 6:52 pm
Concerning “predictions” and the intention of the paper. An author of the paper wrote this in April: “Nor, as Andrew suggested in his post, do we attempt to predict election outcomes in the paper. We are simply trying to show that it is possible to construct a metric from social media data that is reliably correlated with real-world, offline behavior.” (the discussion on monkeycage)

Anonymous

August 17, 2013 at 9:37 pm
I am the person who wrote that, Anonymous. Strangely, I will now respond to you as Andrew responded to me when I said that on the Monkey Cage: we use the word predict in a statistical sense. We make no claim that we predicted any election results in advance. We do, however, use a regression model that produces predicted vote share measures for each candidate. In the actual paper we spend no time evaluating the accuracy of predicted values for individual races. The paper is about establishing linear trends and explaining variance. That being said, media outlets have asked about the accuracy of individual predicted values, so we provided answers to those questions.

Joe DiGrazia

August 17, 2013 at 10:14 pm
I would like to offer a few suggestions.

Re: 1, add a heatmap of Pearson coefficients to the paper? It does not hurt to show the spectrum of associations if you cite one of them. (Releasing the data would solve the problem entirely.)

Re: 3, publish the RMSE instead of R^2? It would clarify the prediction/retrospective vs. forecast/anticipatory modeling approach.

Also, there’s no descriptives table in the paper so I don’t get the data very well, but isn’t the sample a good candidate for bootstrapped estimates?

Fr.

August 19, 2013 at 8:04 am
[…] this analysis is nowhere in their paper. Fabio Rojas has now posted errata/rebuttals about the op-ed and described this analysis they did here. There are several major issues off […]

Some analysis of tweet shares and “predicting” election outcomes | AI and Social Science – Brendan O'Connor

August 20, 2013 at 3:19 am
Yes, releasing the data will be done in time. We are waiting for our publication to be finally peer reviewed and published, so the data can be consistent with those results after these months of editing and recalculating. Please stay tuned!

Karissa McKelvey

August 20, 2013 at 11:47 pm
Hi, I wrote this in my post already (linked above), but you can’t have it both ways. With regards to the op-ed claiming that your study “predicts” election outcomes,

“we use the word predict in a statistical sense. We make no claim that we predicted any election results in advance.”

That’s simply not true in the op-ed. I’m sure that 99.9% of the readers of your folks’ op-ed didn’t get that, and you (or your coauthor, I can’t tell who’s responsible for the op-ed) really should have known that in advance. “Predict” doesn’t mean in-sample predictions for a layperson! Or even many scientific audiences…

brendan o'connor (@brendan642)

August 21, 2013 at 2:25 pm
@Brendan – A few points: first, the op-ed, as is clearly stated in the byline, was written by Fabio, not me. It represents Fabio’s opinions about the potential future implications of our research. Second, we use the term ‘predict’ to refer to in-sample predictions all the time in the social sciences. Even though we did not use this term in the actual paper, we have occasionally used it in public discussions of our work and Fabio used it in his op-ed. We never claimed that we made these predictions in advance. I know this type of usage has a lot of baggage associated with it (and a lot of people were clearly very upset by it), but I’m not sure what other terms would have been better: ‘explained,’ maybe? Third, we actually did test this out of sample, though again not in advance, with data from 2012 and found almost the exact same coefficient for tweet share.

Joe DiGrazia

August 23, 2013 at 1:47 am

orgtheory.net