Archive for the ‘computer science’ Category
At Degenerate State, there was an interesting post where someone applied natural language processing models to heavy metal lyrics. From the article:
To get the lyrics, I scraped www.darklyrics.com. While darklyrics doesn’t have a robots.txt file, I tried to be gentle with my requests. After cleaning the data up, identifying the languages and splitting albums into songs, we are left with a dataset containing lyrics to 222,623 songs from 7,364 bands spread over 22,314 albums.
Before anyone asks, I have no intention of releasing either the raw lyric files or the code used to scrape the website. I collected the lyrics for my own entertainment, and it would be too easy for someone to use this data to copy darklyrics. If people are interested I may release some n-gram data of the corpus.
So what do you find? A few tidbits – the heavy metal word cloud:
Then, the most “metal” words:
And the least metal words:
The bottom line? Academia, the law and administration are the least metal topics of all time. Who knew?
“How do you feel about programming in SAS?”
“Here’s how I feel. When I program in SAS, I feel like I got my master’s degree in statistics in 1980 and I’ve been running the same basic analysis over and over again for my corporate bosses for the last twenty years. I then feel like it’s Friday afternoon and I’m just slogging through this code so I can meet my buddies after work at Chili’s and talk about this weekend’s big game.”
“That is exactly how I feel.”
This week, we’ll have a few posts about the candidacy of Donald Trump. It will have three parts:
- Tom Gill will post on Trump as a political performer.
- Then, Josh Pacewicz will dig into Trump’s poll numbers.
- We’ll wrap up with my own thoughts on Trump.
I’ll focus on the following points. Interested readers should send me questions:
- How predictable/unpredictable was the Trump candidacy?
- Using the Entertainment Theory of the GOP to understand Trump’s nomination and likely November loss.
- Using Trump to explain when social science theories do/do not work.
What do you want to know about Trump? Use the comments or send me email.
Kieran has linked to a very interesting set of remarks at the SASE conference by Maciej Cegłowski about the mentality of computer programmers. A few choice clips:
But as anyone who’s worked with tech people knows, this intellectual background can also lead to arrogance. People who excel at software design become convinced that they have a unique ability to understand any kind of system at all, from first principles, without prior training, thanks to their superior powers of analysis. Success in the artificially constructed world of software design promotes a dangerous confidence.
About the economy of collecting information:
Surveillance capitalism has some of the features of a zero-sum game. The actual value of the data collected is not clear, but it is definitely an advantage to collect more than your rivals do. Because human beings develop an immune response to new forms of tracking and manipulation, the only way to stay successful is to keep finding novel ways to peer into people’s private lives. And because much of the surveillance economy is funded by speculators, there is an incentive to try flashy things that will capture the speculators’ imagination, and attract their money.
This creates a ratcheting effect where the behavior of ever more people is tracked ever more closely, and the collected information retained, in the hopes that further dollars can be squeezed out of it.
Read the whole thing.
A few days ago, we had a discussion about the different meanings of the word “computational sociology.” A commenter wrote the following:
Are agent based models/simulations a dead end? Are smart people still using that technique? Have there been any important results? I didn’t realize it peaked in the 1980s.
I’m a current doctoral student considering pursuing ABM, but if it’s a dead end then maybe not.
I think that olderwoman’s response is on target. There is nothing out of style about ABMs, but sociology is mainly a discipline of empiricists. You will find scholars who occasionally do ABMs, but someone who ONLY does simulations is very, very rare. Examples of people who have done simulations: Damon Centola, Kathleen Carley, and Carter Butts. In my own department, I can think of people who have published simulations (Clem Brooks, Steve Benard, and myself), and those who do methods research often employ simulations. Olderwoman is also correct that writing simulations helps you develop programming skills that are now required for “big data” work and for industry.
So don’t write an all simulation dissertation, but by all means, if you have good ideas, simulate them!
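For doctoral students wondering what an agent-based model actually involves, here is a minimal sketch in Python of a Schelling-style segregation model, a classic ABM in which agents relocate when too few of their neighbors share their type. The grid size, threshold, and update rule below are illustrative choices for a toy model, not any particular published implementation.

```python
import random

def run_schelling(size=20, threshold=0.3, steps=50, seed=42):
    """Minimal Schelling segregation model: agents of two types move to a
    random empty cell when too few of their neighbors share their type."""
    rng = random.Random(seed)
    # 0 = empty cell; 1 and 2 are the two agent types (about 20% empty).
    grid = [[rng.choice([0, 1, 1, 2, 2]) for _ in range(size)]
            for _ in range(size)]

    def neighbors(r, c):
        # The eight surrounding cells, with the grid wrapped as a torus.
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                if dr or dc:
                    yield grid[(r + dr) % size][(c + dc) % size]

    def unhappy(r, c):
        me = grid[r][c]
        occupied = [n for n in neighbors(r, c) if n]
        if not occupied:
            return False
        return sum(n == me for n in occupied) / len(occupied) < threshold

    for _ in range(steps):
        movers = [(r, c) for r in range(size) for c in range(size)
                  if grid[r][c] and unhappy(r, c)]
        empties = [(r, c) for r in range(size) for c in range(size)
                   if not grid[r][c]]
        rng.shuffle(movers)
        for r, c in movers:
            if not empties:
                break
            er, ec = empties.pop(rng.randrange(len(empties)))
            grid[er][ec], grid[r][c] = grid[r][c], 0
            empties.append((r, c))
    return grid

grid = run_schelling(size=15, steps=30)
```

Even a toy like this shows the characteristic ABM result: mild individual preferences about neighbors can produce sharp aggregate segregation that no individual agent intended.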
I was having a discussion with a visiting scholar about what computational sociology means right now. In my career, the term has been used in at least three different ways:
- Statistics – for the baby boomer generation of social scientists, “computing in social science” meant applied statistics. Remember, it required a lot of knowledge and skill to store data and estimate models on computers with limited computing power.
- Agent based models – in the 1980s and 1990s, “computational” meant running simulations.
- Big data/CS techniques – currently, the term seems to refer to either (a) large data generated by online behavior and/or (b) using computer science techniques (e.g., topic models or sentiment analysis) to study social science data.
Use the comments to discuss other uses of the term.
Jure Leskovec describes his bail data… in technicolor.
Jure Leskovec is one of the best computer scientists currently working on big data and social behavior. We were lucky to have him speak at the New Computational Sociology conference so I was thrilled to see he was visiting IU’s School of Informatics. His talk is available here.
Overall, Jure explained how data science can improve human decision making and illustrated his ideas with analysis of bail decision data (e.g., given data X, Y, and Z on a defendant, when do judges release someone on bail?). It was a fun and important talk. Here, I’ll offer one positive comment and one negative comment.
Positive comment – how decision theory can assist data analysis and institutional design: A problem with traditional social science data analysis is that we have a bunch of variables that work in non-linear ways. E.g., there isn’t a linear path from the outcome determined by X=1 and Y=1 to the outcome determined by X=1 and Y=0. In statistics and decision-theoretic computer science, the solution is to work with regression trees. If X=1 then use Model 1, if X=2 then use Model 2, and so forth.
Traditionally, the problem is that social scientists don’t have enough data to make these models work. If you have, say, 1000 cases from a survey, chopping up the sample into 12 models will wipe out all statistical power. This is why you almost never see regression trees in social science journals.
One of the advantages of big data is simply that you now have so much data that you can chop it up and be fine. Jure chopped up data from a million bail decisions in a Federal government database. With so much power, you can actually estimate the trade-off curves and detect biases in judicial decision making. This is a great example of where decision theory, big data, and social science really come together.
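To make the “if X=1 then use Model 1” idea concrete, here is a toy sketch of the core regression-tree operation: searching for the single split on a feature that minimizes the squared error of piecewise-constant predictions. The data and function are hypothetical illustrations; real trees run this search recursively over many features.

```python
import statistics

def best_split(xs, ys):
    """Search every candidate threshold on one feature and return the
    (squared error, threshold) pair for the best piecewise-constant fit."""
    best = None
    for t in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        sse = (sum((y - statistics.mean(left)) ** 2 for y in left)
               + sum((y - statistics.mean(right)) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, t)
    return best

# Hypothetical data with a clean break: the search finds it exactly.
error, threshold = best_split([1, 2, 3, 10, 11, 12], [0, 0, 0, 5, 5, 5])
```

The sample-size worry is visible even here: every split leaves fewer cases in each leaf, so the leaf means get noisy fast unless you start with a lot of data, which is exactly why a million bail decisions makes the approach viable.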
Criticism – “algorithms in society:” There was a series of comments from audience members (myself included) about how this algorithm would actually “fit in” to society. For example, one audience member asked how people could “game the system.” Jure responded that he was using only “hard” data that can’t be gamed, like the charge, prior record, etc. That is not quite right. For example, what you are charged with is a decision made by prosecutors.
In the Q&A, I followed up and pointed out that race is highly correlated with charges. For example, from the Rehavi & Starr paper in the Yale Law Journal, we know that a lot of the racial difference in time spent in jail is attributable to racial differences in charging. Using Federal arrest data, Blacks get charged with more serious crimes for the same actions. Statistically, this means that race is strongly correlated with the severity of the charge. In the Q&A, Jure said that adding race did not improve the model’s accuracy. But why would it, if we know that race and charge are highly intercorrelated? That comment misses the point.
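A toy illustration of that last point (hypothetical data, not Jure’s model): if race is statistically tied to charge severity and the outcome depends on the charge, then a model that already conditions on the charge gains nothing from adding race, even though race on its own is a strong predictor of the outcome.

```python
from collections import Counter, defaultdict

def table_accuracy(rows, features, label="detained"):
    """Predict the modal label within each cell defined by `features`
    and return the resulting accuracy on the same rows."""
    cells = defaultdict(Counter)
    for row in rows:
        cells[tuple(row[f] for f in features)][row[label]] += 1
    correct = sum(counts.most_common(1)[0][1] for counts in cells.values())
    return correct / len(rows)

# Hypothetical toy data: race is strongly associated with charge
# severity, and detention here depends only on the charge.
rows = []
for race, charge, n in [("A", "felony", 40), ("B", "felony", 10),
                        ("A", "misdemeanor", 10), ("B", "misdemeanor", 40)]:
    rows += [{"race": race, "charge": charge,
              "detained": charge == "felony"}] * n

acc_charge = table_accuracy(rows, ["charge"])
acc_both = table_accuracy(rows, ["charge", "race"])
# acc_both equals acc_charge: adding race improves nothing, yet race
# alone predicts detention well above chance because of the correlation.
```

So “adding race did not improve accuracy” is exactly what you would expect when the racial disparity is already baked into the charge variable; it does not show that the predictions are race-neutral.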
These two comments can be summarized as “society changes algorithms and algorithms change society.” Ideally, Jure (and I!) would love to see better decisions. We want algorithms to improve society. At the same time, we have to understand that (a) algorithms depend on society for data, so we have to understand how the data is created (i.e., charge and race are tied together) and (b) algorithms create incentives to come up with ways to influence the algorithm.
Good talk and I hope Jure inspired more people to think about these issues.