Archive for the ‘mere empirics’ Category
Over at Econ Talk, Russ Roberts interviews James Heckman about censored data and other statistical issues. At one point, Roberts asks Heckman what he thinks of the current identification fad in economics (my phrasing). Heckman has a few insightful responses. One is that a lot of the “new methods” (experiments, instrumental variables, etc.) are not new at all. Also, experiments need to be done with care and the results need to be properly contextualized. A lot of economists and identification-obsessed folks think that “the facts speak for themselves.” Not true. Supposedly clean experiments can be understood in the wrong way.
For me, the most interesting section of the interview is when Heckman makes a distinction between statistics and econometrics. Here’s his example:
- Identification – statistics, not economics. The point of identification is to ensure that your correlation is not attributable to an unobserved variable. This is either a mathematical point (IV) or a feature of research design (RCT). There is nothing economic about identification in the sense that you need to understand human decision making in order to carry out identification.
In contrast, he thought that “real” econometrics was about using economics to guide statistical modelling or using statistical modelling to plausibly tell us how economic principles play out in real world situations. This, I think, is the spirit of structural econometrics, which demands the researcher define the economic relation between variables and use that as a constraint in statistical estimation. Heckman and Roberts discuss minimum wage studies, where the statistical point is clear (raising wages does not always increase unemployment) but the economic point still needs to be teased out (moderate wage increases can be offset by firms in other ways) using theory and knowledge of labor markets.
The deeper point I took away from the exchange is that long term progress in knowledge is not generated by a single method, but rather through careful data collection and knowledge of social context. The academic profession may reward clever identification strategies and they are useful, but that can lead to bizarre papers when the authors shift from economic thinking to an obsession with unobserved variables.
Jure Leskovec describes his bail data… in technicolor.
Jure Leskovec is one of the best computer scientists currently working on big data and social behavior. We were lucky to have him speak at the New Computational Sociology conference so I was thrilled to see he was visiting IU’s School of Informatics. His talk is available here.
Overall, Jure explained how data science can improve human decision making and illustrated his ideas with analysis of bail decision data (e.g., given data X, Y, and Z on a defendant, when do judges release someone on bail?). It was a fun and important talk. Here, I’ll offer one positive comment and one negative comment.
Positive comment – how decision theory can assist data analysis and institutional design: A problem with traditional social science data analysis is that we have a bunch of variables that work in non-linear ways. E.g., there isn’t a linear path from the outcome determined by X=1 and Y=1 to the outcome determined by X=1 and Y=0. In statistics and decision-theoretic computer science, the solution is to work with regression trees. If X=1 then use Model 1, if X=2 then use Model 2, and so forth.
Traditionally, the problem is that social scientists don’t have enough data to make these models work. If you have, say, 1000 cases from a survey, chopping up the sample into 12 models will wipe out all statistical power. This is why you almost never see regression trees in social science journals.
One of the advantages of big data is simply that you now have so much data that you can chop it up and be fine. Jure chopped up data from a million bail decisions from a Federal government database. With so much power, you can actually estimate the trade-off curves and detect biases in judicial decision making. This is a great example of where decision theory, big data, and social science really come together.
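The sample-size point can be made concrete with a little arithmetic. The sketch below is only an illustration under assumptions that are mine, not Jure's: the outcome is a binary release/no-release rate, the true rate is 0.5 (the worst case for precision), and the sample is split evenly across 12 sub-models. The function names are hypothetical.

```python
import math


def se_proportion(p: float, n: int) -> float:
    """Standard error of a sample proportion p estimated from n cases."""
    return math.sqrt(p * (1.0 - p) / n)


def per_cell_se(total_cases: int, n_cells: int, p: float = 0.5) -> float:
    """Standard error inside one cell when the sample is split evenly.

    Assumes equal-sized cells and a true rate p; both are illustrative
    assumptions, not features of any real data set.
    """
    cases_per_cell = total_cases // n_cells
    return se_proportion(p, cases_per_cell)


# A 1,000-case survey versus a million bail decisions, each cut into 12 pieces.
survey_se = per_cell_se(1_000, 12)
bail_se = per_cell_se(1_000_000, 12)

print(f"survey: {1_000 // 12} cases per cell, SE about {survey_se:.3f}")
print(f"bail data: {1_000_000 // 12} cases per cell, SE about {bail_se:.4f}")
```

With roughly 83 cases per cell, the survey estimate carries a standard error of more than five percentage points, which swamps most effects of interest. The million-case data set leaves more than 83,000 cases per cell and a standard error well under a fifth of a point.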
Criticism – “algorithms in society:” There was a series of comments by audience members and myself about how this algorithm would actually “fit in” to society. For example, one audience member asked how people could “game the system.” Jure responded that he was using only “hard” data that can’t be gamed, like the charge, prior record, etc. That is not quite right. For example, what you are charged with is a decision made by prosecutors.
In the Q&A, I followed up and pointed out that race is highly correlated with charges. For example, in the Rehavi & Starr paper at the Yale Law Journal, we know that a lot of the difference in time spent in jail is attributable to racial difference in charges. Using Federal arrest data, Blacks get charged with more serious crimes for the same actions. Statistically, this means that race is strongly correlated with the severity of the charge. In the Q&A, Jure said that adding race did not improve the model’s accuracy. But why would it if we know that race and charge are highly inter-correlated? That comment misses the point.
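A toy example shows why "adding race did not improve accuracy" is what you would expect when charges already carry the racial signal. Everything below is synthetic: the counts, the group labels, and the majority-vote "model" are assumptions made up for illustration and come from neither Jure's data nor Rehavi & Starr. In this made-up world the detention decision depends only on the charge, but the two groups are charged differently.

```python
from collections import Counter, defaultdict

# Synthetic counts: (group, charge, detained) -> number of cases.
# Detention is 80% for severe charges and 20% for minor ones in BOTH groups;
# the groups differ only in how often they receive a severe charge.
counts = {
    ("group_1", "severe", True): 56, ("group_1", "severe", False): 14,
    ("group_1", "minor", True): 6, ("group_1", "minor", False): 24,
    ("group_2", "severe", True): 24, ("group_2", "severe", False): 6,
    ("group_2", "minor", True): 14, ("group_2", "minor", False): 56,
}


def accuracy(feature_fn) -> float:
    """Accuracy of predicting the majority outcome within each feature cell."""
    cells = defaultdict(Counter)
    for (group, charge, detained), n in counts.items():
        cells[feature_fn(group, charge)][detained] += n
    correct = sum(max(outcomes.values()) for outcomes in cells.values())
    return correct / sum(counts.values())


def detention_rate(group: str) -> float:
    """Share of cases in a group that end in detention."""
    detained = sum(n for (g, _, d), n in counts.items() if g == group and d)
    total = sum(n for (g, _, _), n in counts.items() if g == group)
    return detained / total


acc_charge_only = accuracy(lambda group, charge: charge)
acc_charge_and_group = accuracy(lambda group, charge: (group, charge))

print("accuracy, charge only:", acc_charge_only)
print("accuracy, charge + group:", acc_charge_and_group)
print("detention rate, group_1:", detention_rate("group_1"))
print("detention rate, group_2:", detention_rate("group_2"))
```

Both rules reach the same accuracy, so the group variable looks redundant, yet the two groups are detained at very different rates because the disparity enters upstream through the charge.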
These two comments can be summarized as “society changes algorithms and algorithms change society.” Ideally, Jure (and myself!) would love to see better decisions. We want algorithms to improve society. At the same time, we have to understand that (a) algorithms depend on society for data so we have to understand how the data is created (i.e., charge and race are tied together) and (b) algorithms create incentives to come up with ways to influence the algorithm.
Good talk and I hope Jure inspired more people to think about these issues.
To defeat Hillary Clinton in a Democratic primary, a challenger has to do the following:
- Push HRC’s solid support from about 45% of Democratic voters to about 40%.
Obama did that by swaying Black voters and some educated professionals. #Bern has not moved an inch on that 45%. Without losing HRC’s base, it is relatively easy for HRC to acquire a few more undecideds and win. So far, the difference between Obama and Sanders is that “last yard.”
Also, this time around, HRC is prepared. I still think she is a horrible campaigner, as she’s now losing the fund raising battle and has blown big leads. Amazing, since she’s the person that sitting vice presidents have made space for. Still, though, HRC now understands the importance of caucuses and has held up in two caucus states. We are not seeing a repeat of 2008 when HRC bungled by leaving caucuses and later primary states uncontested.
So the writing is on the wall, #Bern needs to win caucuses AND needs to move a much larger portion of the Black vote. We haven’t seen either yet.
The basic truth of politics is that incumbents have huge advantages and those favored by incumbents also have an advantage. Thus, the fundamentals favor a candidate like Hillary Clinton over Bernie Sanders. It’s not an ironclad law, but it is a big factor that shapes most political races.
In both parties, the presidential primary process can be about four months long, and it is very complex. Historically, the winners rarely knock out all opponents in the early states of Iowa and New Hampshire. Usually, early states weed out weak candidates and leave only two or three serious competitors who “settle” the election on Super Tuesday. You have to go back about 40 years, to the 1960 election, to find a primary where candidates were struggling into the summer. Since the birth of the modern primary system in the 1970s, those who lead after Super Tuesday tend to win it all since it is hard to overcome the delegate lead at that point.
So that is where Nevada fits in. Previously an unimportant state, Nevada is now important. Sanders needs to keep a positive narrative going into Super Tuesday so that he can continue to fundraise and swamp the Clinton campaign in the media and on the ground in the Super Tuesday states. A tie or loss in Nevada would dampen things and make it harder to sway Black voters in South Carolina, who might only defect in sufficiently large numbers after a Sanders win. Also, a Nevada win could soften the blow of a close loss in South Carolina since Sanders could claim that he’s 2-2 against HRC. In other words, Sanders needs a chain of wins to overcome the advantages that Clinton has in terms of name recognition and access to easy money. If Nevada #Berns this weekend, then I will see it as the first actively visible sign that the Democratic party is tipping away from the DLC/Clinton centrist faction of the 1990s. Until then, “advantage incumbent.”
One of the basic measures of health is body mass index (BMI). It is meant to be a simple measure of a person’s obesity which is also correlated with morbidity. Recently, I spent some time researching the validity of BMI. Does it actually measure fatness? The answer is extremely confusing.
If you google “validity of BMI as a measurement of obesity,” you get this article that summarizes a few studies. The original definition of obesity is that you need to have at least 25% body fat for men and 35% for women. This is hard to measure without special equipment, so BMI is the default. Thus, to measure BMI validity, you need a sample of people and then compute the BMI and the body fat percentage (BF%). Examine.com reports that a number of studies have compared BF% and BMI. In some ways, BMI survives scrutiny. In non-obese people, as defined by BF%, BMI works well, but it seems to under-report obesity in others. In a few odd cases, mainly athletes with a lot of muscle weight, obesity is over-reported. Thus, you get a lot of mis-classification: “One meta-analysis on the subject suggests that BMI fails to classify half of persons with excess body fat, reporting them as normal or overweight despite having a body fat percentage classifying them as obese.” Translation: we are much fatter than we appear to be.
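To see how the two measures can disagree, here is a minimal sketch of the comparison those validity studies make. BMI is weight in kilograms divided by height in meters squared. The body fat thresholds (25% for men, 35% for women) come from the definition above. The BMI cut-off of 30 for "obese" is the conventional one and is my addition, not something stated in the article. The example people are hypothetical, as are the function names.

```python
# Body fat thresholds for obesity, as given in the post.
BF_OBESE_THRESHOLD = {"male": 25.0, "female": 35.0}

# Conventional BMI cut-off for obesity (an assumption added for illustration).
BMI_OBESE_CUTOFF = 30.0


def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kg divided by height in meters squared."""
    return weight_kg / height_m ** 2


def compare(sex: str, weight_kg: float, height_m: float, body_fat_pct: float) -> str:
    """Compare the BMI classification with the body-fat classification."""
    obese_by_bmi = bmi(weight_kg, height_m) >= BMI_OBESE_CUTOFF
    obese_by_bf = body_fat_pct >= BF_OBESE_THRESHOLD[sex]
    if obese_by_bf and not obese_by_bmi:
        return "missed by BMI"
    if obese_by_bmi and not obese_by_bf:
        return "over-reported by BMI"
    return "agree"


# Hypothetical people, not data from any study.
print(compare("male", 85, 1.80, 28))    # high body fat, BMI under 30
print(compare("male", 100, 1.80, 12))   # muscular athlete, BMI over 30
print(compare("female", 60, 1.65, 25))  # both measures say not obese
```

The first case is the common error the meta-analysis describes: obese by body fat but not flagged by BMI. The second is the rarer athlete case, where BMI flags someone with very little fat.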
Then, soon after I read some of these studies, the LA Times reported on a new study from UCLA that examines the BMI-morbidity correlation. In that study, researchers measured BMI and then collected data on biomarkers of health. This is done using the NHANES data set. See the study here. Result? A lot of fat people are actually quite healthy in the sense that BMI is not associated with cardio-pulmonary health (i.e., your heart stopping). This reminds me of an earlier discussion on this blog, where there were conflicting estimates of the obesity-mortality link and a meta-analysis kind of, sort of, shows an aggregate positive effect.
How do I approach the BMI issue as of today?
- BMI is a rough measure of fatness (“adiposity”), but not precise enough for doctors to be making big judgments about patients on a single number/measurement.
- BMI is not a terribly good predictor of mortality, even if there is a mild overall correlation that can be detected through meta-analysis.
- BMI is probably not correlated with a lot of morbidity that we care about with some important exceptions like diabetes.
The lead author of the UCLA study said that this was the “last nail in the coffin” for BMI. She might be right.
The Obama strategy in 2008 had a plan A and a plan B. Plan A was to knock out Hillary with big victories in Iowa and New Hampshire. Didn’t work. Plan B was to pad the delegate lead by exploiting small state caucuses and minimizing the damage in Hillary friendly places like New York. That worked, especially since the Hillary campaign was simply incompetent.
Sanders has a similar plan. His Plan A, the early knock out, almost worked. I suspect that Bernie might have even won the popular vote in Iowa, given that the Iowa Democratic Party is refusing to release vote tallies as they did in previous years. So Bernie is on to Plan B. That means he has to accomplish two things:
- Max out caucus states.
- Minimize losses in large primary states.
This is the list of remaining states in February and Super Tuesday and delegate totals for Democrats according to US election central:
- Alabama 60
- American Samoa caucus 10
- Arkansas 37
- Colorado caucus 79
- Georgia 116
- Massachusetts 116
- Minnesota caucus 93
- Nevada 43
- Oklahoma 42
- South Carolina 59
- Tennessee 76
- Texas 252
- Vermont 26
- Virginia 110
You will notice that Bernie has at least three easy states: Vermont, Massachusetts, and probably Minnesota. Then, it gets really hard, really fast. This is not because Hillary will magically become a great campaigner, but the fundamentals favor Hillary.
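To put rough numbers on the list, here is a short tally. The delegate counts are copied from the list above, and a contest is marked as a caucus only where the list itself labels it that way. The "easy states" grouping follows my guess in the previous paragraph. It is a sketch of the arithmetic, not a forecast.

```python
# state -> (delegates, labeled as a caucus in the list above)
contests = {
    "Alabama": (60, False),
    "American Samoa": (10, True),
    "Arkansas": (37, False),
    "Colorado": (79, True),
    "Georgia": (116, False),
    "Massachusetts": (116, False),
    "Minnesota": (93, True),
    "Nevada": (43, False),
    "Oklahoma": (42, False),
    "South Carolina": (59, False),
    "Tennessee": (76, False),
    "Texas": (252, False),
    "Vermont": (26, False),
    "Virginia": (110, False),
}

# States described in the text as relatively easy for Sanders.
easy_states = ["Vermont", "Massachusetts", "Minnesota"]

total = sum(delegates for delegates, _ in contests.values())
caucus_total = sum(d for d, is_caucus in contests.values() if is_caucus)
easy_total = sum(contests[state][0] for state in easy_states)

print(f"total delegates listed: {total}")
print(f"in contests labeled as caucuses: {caucus_total} ({caucus_total / total:.0%})")
print(f"in the three 'easy' states: {easy_total} ({easy_total / total:.0%})")
```

The list adds up to 1,119 delegates. The three easy states account for 235 of them, about a fifth, while Texas alone has 252.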
There are two reasons. First, you win Southern states in the Democratic primary by doing well among Black voters. South Carolina (Feb 27) will be the first test of how well Bernie can move these voters. If he comes up short in South Carolina, it’s bad news because you have more Southern states coming up real fast such as Alabama and Georgia on Super Tuesday and other Southern states soon after that. Second, in March, you will see the types of big states that Hillary dominated in 2008 because of superior name recognition, such as Texas (51% for HRC in 2008), New York (57%), California (51%), Ohio (53%), and Pennsylvania (54%).
Is it impossible for Bernie to win the nomination? Of course not, but he needs to really dominate outside of the establishment friendly mega-states like Ohio and California. That means an immediate and massive turn around in the Black vote, a wipe out in the caucus states, and some strategy for containing the losses from the big states, which even challenged Obama. That sounds really hard to me.
Ever since the publication of Piketty’s Capital in the 21st Century, there’s been a lot of debate about the theory and empirical work. One strand of the discussion focuses on how Piketty handles the data. A number of critics have argued that the main results are sensitive to choices made in the data analysis (e.g., see this working paper). The trends in inequality reported by Piketty are amplified by how he handles the data.
Perhaps the strongest criticism in this vein is made by UC Riverside’s Richard Sutch, who has a working paper claiming that some of Piketty’s major empirical points are simply unreliable. The abstract:
Here I examine only Piketty’s U.S. data for the period 1810 to 2010 for the top ten percent and the top one percent of the wealth distribution. I conclude that Piketty’s data for the wealth share of the top ten percent for the period 1870-1970 are unreliable. The values he reported are manufactured from the observations for the top one percent inflated by a constant 36 percentage points. Piketty’s data for the top one percent of the distribution for the nineteenth century (1810-1910) are also unreliable. They are based on a single mid-century observation that provides no guidance about the antebellum trend and only very tenuous information about trends in inequality during the Gilded Age. The values Piketty reported for the twentieth-century (1910-2010) are based on more solid ground, but have the disadvantage of muting the marked rise of inequality during the Roaring Twenties and the decline associated with the Great Depression. The reversal of the decline in inequality during the 1960s and 1970s and subsequent sharp rise in the 1980s is hidden by a fifteen-year straight-line interpolation. This neglect of the shorter-run changes is unfortunate because it makes it difficult to discern the impact of policy changes (income and estate tax rates) and shifts in the structure and performance of the economy (depression, inflation, executive compensation) on changes in wealth inequality.
From inside the working paper, an attempt to replicate Piketty’s estimate of intergenerational wealth transfer among the wealthy:
The first available data point based on an SCF survey is for 1962. As reported by Wolff the top one percent of the wealth distribution held 33.4 percent of total wealth that year [Wolff 1994: Table 4, 153; and Wolff 2014: Table 2, 50]. Without explanation Piketty adjusted this downward to 31.4 by subtracting 2 percentage points. Piketty’s adjusted number is represented by the cross plotted for 1962 in Figure 1. Chris Giles, a reporter for the Financial Times, described this procedure as “seemingly arbitrary” [Giles 2014]. In a follow-up response to Giles, Piketty failed to explain this adjustment [Piketty 2014c “Addendum”].
There is a bit of a mystery as to where the 1.2 and 1.25 multipliers used to adjust the Kopczuk-Saez estimates upward came from. The spreadsheet that generated the data (TS10.1DetailsUS) suggests that Piketty was influenced in this choice by the inflation factor that would be required to bring the solid line up to reach his adjusted SCF estimate for 1962. Piketty did not explain why the adjustment multiplier jumps from 1.2 to 1.25 in 1930.
This comes up quite a bit, according to Sutch. There is reasonable data and then Piketty makes adjustments that are odd or simply unexplained. It is also important to note that Sutch is not trying to make inequality in the data go away. He notes that Piketty is likely under-reporting early 20th century inequality while over-reporting the more recent increase in inequality.
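The adjustments Sutch describes are simple arithmetic, and writing them out shows how much they can move a series. The sketch below uses only numbers quoted from the working paper: the 2 point subtraction, the 1.2 and 1.25 multipliers, and the constant 36 points. The input share of 20.0 is a hypothetical value chosen for illustration. Treating 1930 itself as the first year of the 1.25 multiplier is my assumption, and the function names are mine, not Piketty's or Sutch's.

```python
def adjusted_scf_1962(wolff_share: float = 33.4) -> float:
    """Wolff's 1962 top-1% share minus the unexplained 2 percentage points."""
    return wolff_share - 2.0


def multiplier(year: int) -> float:
    """Upward multiplier on the Kopczuk-Saez estimates.

    Assumes the switch from 1.2 to 1.25 applies from 1930 onward.
    """
    return 1.2 if year < 1930 else 1.25


def adjust_kopczuk_saez(year: int, share: float) -> float:
    """Apply the year-dependent multiplier to a top-1% wealth share."""
    return share * multiplier(year)


def top10_from_top1(top1_share: float) -> float:
    """Top-10% share built as the top-1% share plus a constant 36 points."""
    return top1_share + 36.0


hypothetical_share = 20.0  # illustrative input, not a real estimate

print("adjusted 1962 SCF share:", round(adjusted_scf_1962(), 1))
print("1929 adjusted:", round(adjust_kopczuk_saez(1929, hypothetical_share), 2))
print("1930 adjusted:", round(adjust_kopczuk_saez(1930, hypothetical_share), 2))
print("implied top-10% share:", round(top10_from_top1(hypothetical_share), 1))
```

Holding the underlying estimate fixed at the hypothetical 20 percent, the multiplier change alone adds a full percentage point between 1929 and 1930. A break like that can look like a real change in inequality when it is only a change in the adjustment.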
A lot of Piketty’s argument comes from international comparisons and longitudinal studies with historical data. I have a lot of sympathy for Piketty. Data is imperfect, collected irregularly, and prone to error. So I am slow to criticize. Still, given that Piketty’s theory is now one of the major contenders in the study of global inequality, we want the answer to be robust.