Archive for the ‘big data’ Category

“distributed framing” in social movements

In the special issue of Ethnic and Racial Studies on Black Lives Matter, BGS* Jelani Ince, Clay Davis and myself break down hashtag networks:

This paper focuses on the social media presence of Black Lives Matter (BLM). Specifically, we examine how social media users interact with BLM by using hashtags and thus modify the framing of the movement. We call this decentralized interaction with the movement “distributed framing”. Empirically, we illustrate this idea with an analysis of 66,159 tweets that mention #BlackLivesMatter in 2014, when #BlackLivesMatter becomes prominent on social media. We also tally the other hashtags that appear with #BlackLivesMatter in order to measure how online communities influence the framing of the movement. We find that #BlackLivesMatter is associated with five types of hashtags. These hashtags mention solidarity or approval of the movement, refer to police violence, mention movement tactics, mention Ferguson, or express counter-movement sentiments. The paper concludes with hypotheses about the development of movement framings that can be addressed in future research.

Check it out.
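For readers who want to try the hashtag tally described in the abstract, here is a minimal sketch of counting which hashtags co-occur with a focal hashtag. It is not the paper's actual pipeline: the function name `cooccurring_hashtags`, the simple `#\w+` pattern, and the sample tweets are all assumptions made up for illustration.

```python
import re
from collections import Counter

# Simplified hashtag pattern (an assumption; real tweet parsing is messier).
HASHTAG = re.compile(r"#\w+")

def cooccurring_hashtags(tweets, focal="#blacklivesmatter"):
    """Count hashtags that appear in the same tweet as the focal hashtag."""
    counts = Counter()
    for text in tweets:
        # Lowercase so that #Ferguson and #ferguson are counted together.
        tags = {tag.lower() for tag in HASHTAG.findall(text)}
        if focal in tags:
            counts.update(tags - {focal})
    return counts

# Made-up example tweets, for illustration only (not data from the paper).
sample_tweets = [
    "Marching tonight #BlackLivesMatter #Ferguson",
    "#BlackLivesMatter #Ferguson #HandsUpDontShoot",
    "Standing in solidarity #blacklivesmatter",
    "Unrelated tweet about #Ferguson",
]

tally = cooccurring_hashtags(sample_tweets)
print(tally.most_common())
```

Each tweet contributes a given hashtag at most once, so the counts read as "number of tweets in which this tag appears alongside the focal tag." Grouping the resulting tags into types (solidarity, police violence, tactics, and so on) would still be a separate coding step.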

50+ chapters of grad skool advice goodness: Grad Skool Rulz ($4.44 – cheap!!!!)/Theory for the Working Sociologist (discount code: ROJAS – 30% off!!)/From Black Power/Party in the Street  

Written by fabiorojas

July 20, 2017 at 4:25 am

ethnic and racial studies covers black lives matter

Ethnic and Racial Studies has a special issue on Black Lives Matter. From the lead article, an analysis of counter-protest and collective identity:

Recent events related to police brutality and the evolution of #BlackLivesMatter provides an empirical case to explore the vitality of social media data for social movements and the evolution of collective identities. Social media data provide a portal into how organizing and communicating generate narratives that survive over time. We analyse 31.65 million tweets about Ferguson across four meaningful time periods: the death of Michael Brown, the non-indictment of police officer Darren Wilson, the Department of Justice report on Ferguson, and the one year aftermath of Brown’s death. Our analysis shows that #BlackLivesMatter evolved in concert with protests opposing police brutality occurring on the ground. We also show how #TCOT (Top Conservatives on Twitter) has operated as the primary counter narrative to #BlackLivesMatter. We conclude by discussing the implications our research has for the #BlackLivesMatter movement and increased political polarization following the election of Donald Trump.

From “Ferguson and the death of Michael Brown on Twitter: #BlackLivesMatter, #TCOT, and the evolution of collective identities” by Rashawn Ray, Melissa Brown, Neil Fraistat and Edward Summers.

Written by fabiorojas

July 18, 2017 at 4:22 am

more tweets, more votes: wired magazine

A few weeks ago, Wired magazine discussed how you can use social media data to improve political forecasting. From Emma Ellis:

Traditional polling methods aren’t working the way they used to. Upstart analytics firms like Civis and conventional pollsters like PPP, Ipsos, and Pew Research Institute have all been hunting for new, more data-centric ways to uncover the will of the whole public, rather than just the tiny slice willing to answer a random call on their landline. The trending solution is to incorporate data mined from the Internet, especially from social media. It’s a crucial, overdue shift. Even though the Internet is a cesspool of trolls, it’s also where millions of Americans go to express opinions that pollsters might not even think to ask about.

And they were kind enough to cite the More Tweets/More Votes research:

According to Fabio Rojas, a sociologist at Indiana University who conducted a study correlating Twitter mentions and candidate success, “More tweets equals more votes.”…

Social media data gives you a sense of the zeitgeist in a way that multiple choice questions never will. “Say I wanted to learn about what music people are listening to,” says Rojas. “I would have to sit down beforehand and come up with the list. But what if I don’t know about Taylor Swift or Justin Bieber?” Polls are generated by a small group of people, and they can’t know everything. Social media is a sample of what people actually talk about, what actually draws their attention, and the issues that really matter to them.

That sentiment matters, and pollsters can (and in PPP’s case, do) use it to direct their questioning. “People clue us in on stuff online all the time,” says Jim Williams, a polling analyst at PPP. They even ask the Internet where and on what they should poll next, hence Harambe’s presence in its poll. But, Williams says, joke suggestions aside, Twitter’s input also helps pollsters include the finer points of local and national politics. And even the Harambe question itself actually tells the pollsters something interesting.

Interesting stuff.

Written by fabiorojas

September 2, 2016 at 12:01 am

w.e.b. dubois’ illustrations of black social science data


The website Hyperallergic has a nice article on the drawings that DuBois did to visualize some of his data. For a 1900 exhibition, DuBois made, by hand, these interesting visualizations. Tufte, eat yer heart out!

Written by fabiorojas

July 11, 2016 at 12:01 am

forum on data analytics and inclusivity, part 2

This post is the second part of our forum on data analytics and inclusivity. The forum was inspired by an essay written by Michael Wilbon about African Americans and analytics. I’ve asked several people who work in analytics to comment on the problems with and opportunities for inclusivity in data analytics, especially as it relates to sports analytics. The first set of essays can be found here.

Today’s essays are written by three contributors who have direct experience in data analytics and sports. The essays all deal with, in some way, root causes of a racial gap in analytics. Michael Lopez is a statistician at Skidmore College who has written extensively about sports analytics at places like Sports Illustrated and FiveThirtyEight. Jerry Kim is an economic sociologist who has been at Columbia University since 2006 and will soon join the business school at Rutgers University. His research focuses on the consequences of status for evaluation, and he has written about the effects of status bias on umpires’ decision-making in the MLB (a paper that I can say with zero bias is amazing). Our final contributor is Trey Causey, a computational social scientist who has done considerable work as a data analyst and consultant for the NFL and who is now a data scientist at ChefSteps.

I know that this won’t be, nor should it be, the last word on this topic. Going forward, we need more discussions of this type, especially as analytics becomes increasingly central to how business and sports operate.

Written by brayden king

June 15, 2016 at 12:24 am

forum on data analytics and inclusivity, part 1

Data analytics is a buzzword in the business world these days. One of the industries in which data analytics has made the biggest impact is sports. The publication of Moneyball in 2003 signaled a sea change in how baseball teams used data and statistics to make personnel changes. Basketball wasn’t far behind in implementing advanced statistics in the front office. The MIT Sloan Sports Analytics conference has become a hub of industry activity, attracting academics, journalists, and sports insiders.

In general, data analytics has been celebrated as a more enlightened way to approach sports management. But it was only a matter of time before sports analytics got some backlash. The most recent criticism comes from the respected sports journalist, Michael Wilbon, who wrote a piece for The Undefeated about how data analytics has fallen on deaf ears in the African American community.

Log onto any mainstream website or media outlet (certainly any program within the ESPN empire) and 30 seconds cannot pass without extreme statistical analysis, which didn’t exist 20 years ago, hijacking the conversation. But not in “BlackWorld,” where never is heard an advanced analytical word. Not in urban barbershops. Not in text chains during three-hour games. Not around office water coolers. Not even in pressrooms or locker rooms where black folks who make a living in the industry spend all day and half the night talking about the most intimate details of sports.

Wilbon makes the point that, in sports, data analytics has become just one more way for the Old Boy Network to reassert its status. Of course, I’ve heard other people make the case that analytics levels the playing field, given that it doesn’t require any sort of credentials to participate and is potentially race- and gender-blind. Other journalists have already criticized Wilbon’s claims and methodology (including this response by Dave Schilling), but it’s undoubtedly true that Wilbon’s point of view is shared by others in sports.

We’re using Wilbon’s essay as an opportunity to have a discussion about data analytics and inclusivity. This is an issue that doesn’t just affect sports. As analytics become more integral to the business world, organizations will use analytics to sort talent in many of the most lucrative jobs. Academia, especially in STEM fields, continually wrestles with questions about inclusivity as well.

I’ve invited a handful of scholars and practitioners in the field of data analytics, many of whom work in the world of sports analytics, to comment on this issue. I’ll post their responses in two parts: half today and the other set tomorrow. Today’s essays are written by three contributors who all have different takes on how analytics can be used to overcome the problems Wilbon identified in his essay. Brian Mills is a sports economist at the University of Florida. His research applies “economic lessons and quantitative analysis to problems that sport managers face in their everyday decision making.” Sekou Bermiss is an organizational theorist at the University of Texas McCombs School of Business. He studies the relationships between human capital, reputation, and firm performance. Laura Nelson is currently a postdoc at Northwestern’s Kellogg School of Management. She uses computational methods to analyze organizational histories and changes in the feminist movement. She’s also, like me, a San Francisco Giants fan.

Thanks to all of the contributors. Come back for more commentary later, and please feel free to leave comments below.

Written by brayden king

June 14, 2016 at 2:06 am

tying our own noose with data? higher ed edition

I wanted to start this post with a dramatic question about whether some knowledge is too dangerous to pursue. The H-bomb is probably the archetypal example of this dilemma, and brings to mind Oppenheimer’s quotation of the Bhagavad Gita upon the detonation of Trinity: “Now I am become Death, the destroyer of worlds.”

But really, that’s way too melodramatic for the example I have in mind, which is much more mundane. Much more bureaucratic. It’s less about knowledge that is too dangerous to pursue and more about blindness to the unanticipated — but not unanticipatable — consequences of some kinds of knowledge.


Maybe this wasn’t such a good idea.

The knowledge I have in mind is the student-unit record. See? I told you it was boring.

The student-unit record is simply a government record that tracks a specific student across multiple educational institutions and into the workforce. Right now, this does not exist for all college students.

There are records of students who apply for federal aid, and those can be tied to tax data down the road. This is what the Department of Education’s College Scorecard is based on: earnings 6-10 years after entry into a particular college. But this leaves out the 30% of students who don’t receive federal aid.

There are states with unit-record systems. Virginia’s is particularly strong: it follows students from Virginia high schools through enrollment in any not-for-profit Virginia college and then into the workforce as reflected in unemployment insurance records. But it loses students who enter or leave Virginia, which is presumably a considerable number.

But there’s currently no comprehensive federal student-unit record system. In fact, at the moment, creating one is actually illegal. It was banned in an amendment to the Higher Education Act reauthorization in 2008, largely because the higher ed lobby hates the idea.

Having student-unit records available would open up all kinds of research possibilities. It would help us see the payoffs not just to college in general, but to specific colleges, or specific majors. It would help us disentangle the effects of the multiple institutions attended by the typical college student. It would allow us to think more precisely about when student loans do, and don’t, pay off. Academics and policy wonks have argued for it on just these grounds.

In fact, basically every social scientist I know would love to see student-unit records become available. And I get it. I really do. I’d like to know the answers to those questions, too.

But I’m really leery of student-unit records. Maybe not quite enough to stand up and say, This is a terrible idea and I totally oppose it. Because I also see the potential benefits. But leery enough to want to point out the consequences that seem likely to follow a student-unit record system. Because I think some of the same people who really love the idea of having this data available would be less enthused about the kind of world it might help, in some marginal way, create.

So, with that as background, here are three things I’d like to see data enthusiasts really think about before jumping on this bandwagon.

First, it is a short path from data to governance. For researchers, the point of student-level data is to provide new insights into what’s working and what isn’t: to better understand what the effects of higher education, and the financial aid that makes it possible, actually are.

But for policy types, the main point is accountability. The main point of collecting student-level data is to force colleges to take responsibility for the eventual labor market outcomes of their students.

Sometimes, that’s phrased more neutrally as “transparency”. But then it’s quickly tied to proposals to “directly tie financial aid availability to institutional performance” and called “an essential tool in quality assurance.”

Now, I am not suggesting that higher education institutions should be free to just take all the federal money they can get and do whatever the heck they want with it. But I am very skeptical that the net effect of accountability schemes is generally positive. They add bureaucracy, they create new measures to game, and the behaviors they actually encourage tend to be remote from the behaviors they are intended to encourage.

Could there be some positive value in cutting off aid to institutions with truly terrible outcomes? Absolutely. But what makes us think that we’ll end up with that system, versus, say, one that incentivizes schools to maximize students’ earnings, with all the bad behavior that might entail? Anyone who seriously thinks that we would use more comprehensive data to actually improve governance of higher ed should take a long hard look at what’s going on in the UK these days.

Second, student-unit records will intensify our already strong focus on the economic return to college, and further devalue other benefits. Education does many things for people. Helping them earn more money is an important one of those things. It is not, however, the only one.

Education expands people’s minds. It gives them tools for taking advantage of opportunities that present themselves. It gives them options. It helps them to find work they find meaningful, in workplaces where they are treated with respect. And yes, selection effects — or maybe it’s just because they’re richer — but college graduates are happier and healthier than nongraduates.

The thing is, all these noneconomic benefits are difficult to measure. We have no administrative data that tracks people’s happiness, or their health, let alone whether higher education has expanded their internal life.

What we’ve got is the big two: death and taxes. And while it might be nice to know whether today’s 30-year-old graduates are outliving their nongraduate peers in 50 years, in reality it’s tax data we’ll focus on. What’s the economic return to college, by type of student, by institution, by major? And that will drive the conversation even more than it already does. Which to my mind is already too much.

Third, social scientists are occupationally prone to overestimate the practical benefit of more data. Are there things we would learn from student-unit records that we don’t know? Of course. There are all kinds of natural experiments, regression discontinuities, and instrumental variables that could be exploited, particularly around financial aid questions. And it would be great to be able to distinguish between the effects of “college” and the effects of that major at this college.

But we all realize that a lot of the benefit of “college” isn’t a treatment effect. It’s either selection — you were a better student going in, or from a better-off family — or signal — you’re the kind of person who can make it through college; what you did there is really secondary.

Proposals to use income data to understand the effects of college assume that we can at least adjust for the selection effects, for example through some kind of value-added model. But this is pretty sketchy. I mean, it might provide some insights for us to think about. But using it as a basis for concluding that Caltech, Colgate, MIT, and Rose-Hulman Institute of Technology (all among the top five on Brookings’ list) provide the most value, rather than that they have select students who are distinctive in ways that aren’t reflected by adjusting for race, gender, age, financial aid status, and SAT scores, is a little ridiculous.

So, yeah. I want more information about the real impact of college, too. But I just don’t see the evidence out there that having more information is going to lead to policy improvements.

If there weren’t such clear potential negative consequences, I’d say sure, try, it’s worth learning more even if we can’t figure out how to use it effectively. But in a case where there are very clear paths to using this kind of information in ways that are detrimental to higher education, I’d like to see a little more careful thinking about the real likely impacts of student-unit records versus the ones in our technocratic fantasies.

Written by epopp

June 3, 2016 at 2:06 pm

happiness paradox

My colleague Johan Bollen and his colleagues have been working on a project that tries to measure and verify the “happiness paradox,” which is an extension and elaboration of the “friendship paradox.” From MIT Technology Review:

The friendship paradox is straightforward to explain. It comes about because of the skewed way people collect friends on online social networks such as Twitter and Facebook. Most people have a small number of friends, a few dozen or so. But a tiny fraction of people have huge numbers of friends: millions or tens of millions of followers in some cases.

This has two effects. First, it makes them much more likely to appear in a random person’s list of friends. And second, it dramatically skews the answer when calculating the average number of friends that a person’s friends have.
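The two effects just described are easy to check on a toy network. The sketch below is only an illustration under stated assumptions: the edge list is hypothetical (one "hub" plus a handful of ordinary ties), friendships are treated as undirected, and the helper names `build_friends` and `friendship_paradox` are invented here rather than taken from the paper.

```python
from statistics import mean

# Hypothetical toy network: one hub tied to everyone, plus two ordinary ties.
edges = [
    ("hub", "a"), ("hub", "b"), ("hub", "c"), ("hub", "d"), ("hub", "e"),
    ("a", "b"), ("c", "d"),
]

def build_friends(edges):
    """Turn an undirected edge list into a person -> set-of-friends mapping."""
    friends = {}
    for u, v in edges:
        friends.setdefault(u, set()).add(v)
        friends.setdefault(v, set()).add(u)
    return friends

def friendship_paradox(friends):
    """Return (mean friend count, mean of friends' mean friend count,
    share of people whose friends have more friends on average)."""
    degree = {person: len(fs) for person, fs in friends.items()}
    mean_own = mean(degree.values())
    # For each person, average the friend counts of that person's friends.
    friends_avg = {p: mean(degree[f] for f in fs) for p, fs in friends.items()}
    mean_of_friends = mean(friends_avg.values())
    share = mean(1 if friends_avg[p] > degree[p] else 0 for p in friends)
    return mean_own, mean_of_friends, share

mean_own, mean_of_friends, share = friendship_paradox(build_friends(edges))
print(mean_own, mean_of_friends, share)
```

In this toy network the hub shows up in everyone's friend list, so the average over friends is pulled upward even though only one person is actually highly connected. The same comparison could be run with a per-user sentiment score in place of friend counts to probe a happiness paradox, though that is a suggestion here and not a description of the authors' method.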

Bollen et al. then explain the analogous happiness paradox:

Bollen and co begin by analyzing the most recent 3,000 tweets sent by some 40,000 Twitter users. They use a standard algorithm to analyze each tweet to determine its sentiment—whether positive or negative—and then assume this gives a sense of the user’s happiness level. In other words, they assume that people who are less happy send more negative tweets. They also include in the analysis the number of followers and followees for each individual.

The results make for interesting reading. Bollen and co say there is clear friendship paradox at work in this network, as expected. But they also say there is a less striking but nonetheless significant happiness paradox at work, too.

Indeed, Bollen and co say their evidence suggests that the more unhappy the individual, the stronger the happiness paradox they face. “Although happy and unhappy groups of subjects are both affected by a significant happiness paradox, unhappy subjects are most strongly affected,” they say.

The original paper is here. Recommended.

Written by fabiorojas

March 7, 2016 at 4:25 am

congratulations to karissa mckelvey

Guest blogger emeritus Karissa McKelvey just won a huuuge award. Her project just won a Knight Foundation grant. Her team is going to build a search engine that allows people to access data and make sure the data is up to date. Think of it as BitTorrent for data, not illegal downloads. Good job!

Written by fabiorojas

February 1, 2016 at 12:01 am

big data and social movements

Mobilizing Ideas has a month-long discussion about big data and movement research. From Part I:

Part II:

Written by fabiorojas

April 8, 2015 at 12:11 am

Posted in big data, fabio, social movements

orchestra data

Via Shamus – a website with a big data set on the history of the New York Philharmonic:

The New York Philharmonic played its first concert on December 7, 1842. Since then, it has merged with the New York Symphony, the New/National Symphony, and had a long-running summer season at New York’s Lewisohn Stadium. This Performance History database documents all known concerts of all of these organizations, amounting to more than 20,000 performances. The New York Philharmonic Leon Levy Digital Archives provides an additional interface for searching printed programs alongside other digitized items such as marked music scores, marked orchestral parts, business records, and photos.

In an effort to make this data available for study, analysis, and reuse, the New York Philharmonic joins organizations like The Tate and the Cooper-Hewitt in making its own contribution to the Open Data movement.

The metadata, which is released under the Creative Commons Public Domain CC0 licence, is located on the New York Philharmonic’s GitHub page.

Sounds like a feast for a musical orgtheorist. Go to it!

Written by fabiorojas

March 31, 2015 at 12:03 am

Posted in big data, culture, fabio