A great read – an exposé of a bunch of standard pitfalls of econometrics (done in an ever so slightly dodgy way)


From Mark Thoma. Click through to his site or read over the fold.

Bill Easterly sent me a link to the post The Vortex of Vacuousness that I posted the other day, but I like this one better:

Maybe we should put rats in charge of foreign aid research, by William Easterly: Laboratory experiments show that rats outperform humans in interpreting data… The amazing finding on rats is described in an equally amazing book by Leonard Mlodinow. The experiment consists of drawing green and red balls at random, with the probabilities rigged so that greens occur 75 percent of the time. The subject is asked to watch for a while and then predict whether the next ball will be green or red. The rats followed the optimal strategy of always predicting green (I am a little unclear how the rats communicated, but never mind). But the human subjects did not always predict green; wanting to do better, they tried to predict when red would come up too, engaging in reasoning like "after three straight greens, we are due for a red." As Mlodinow says, humans usually try to guess the pattern, and in the process we allow ourselves to be outperformed by a rat.
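(An aside of my own, not part of Easterly's post: the arithmetic behind the rat's win is easy to check. Always predicting green is right 75 per cent of the time, while "probability matching", guessing green 75 per cent of the time and red 25 per cent, does worse:)

```python
# Expected accuracy of two strategies on a stream that is green 75% of the time
p = 0.75
always_green = p                      # right exactly when the ball is green
matching = p * p + (1 - p) * (1 - p)  # guess green with prob p, red with prob 1-p
print(f"always green: {always_green:.3f}, probability matching: {matching:.3f}")
# -> always green: 0.750, probability matching: 0.625
```

The rat's strategy beats pattern-guessing by a full 12.5 percentage points.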

Unfortunately, spurious patterns show up in some important real world settings, like research on the effect of foreign aid on growth. Without going into any unnecessary technical detail, research looks for an association between economic growth and some measure of foreign aid, controlling for other likely determinants of economic growth. Of course, since there is some random variation in both growth and aid, there is always the possibility that an association appears by pure chance. The usual statistical procedures are designed to keep this possibility small. The convention is that we believe a result if there is only a 1 in 20 chance that the result arose at random. So if a researcher does a study that finds a positive effect of aid on growth and it passes this 1 in 20 test (referred to as a statistically significant result), we are fine, right?

Alas, not so fast. A researcher is very eager to find a result, and such eagerness usually involves running many statistical exercises (known as regressions). But the 1 in 20 safeguard only applies if you only did ONE regression. What if you did 20 regressions? Even if there is no relationship between growth and aid whatsoever, on average you will get one significant result out of 20 by design. Suppose you only report the one significant result and don't mention the other 19 unsuccessful attempts. You can do twenty different regressions by varying the definition of aid, the time periods, and the control variables. In aid research, the aid variable has been tried, among other ways, as aid per capita, logarithm of aid per capita, aid/GDP, logarithm of aid/GDP, aid/GDP squared, [1], aid/GDP*[2], aid/GDP squared*[2], aid/GDP*[3], etc. Time periods have varied from averages over 24 years to 12 years to 8 years to 4 years. The list of possible control variables is endless. One of the most exotic I ever saw was: the probability that two individuals in a country belonged to different ethnic groups TIMES the number of political assassinations in that country. So it's not so hard to run many different aid and growth regressions and report only the one that is significant.

This practice is known as data mining. It is NOT acceptable practice, but this is very hard to enforce since nobody is watching when a researcher runs multiple regressions. It is seldom intentional dishonesty by the researcher. Because of our non-rat-like propensity to see patterns everywhere, it is easy for researchers to convince themselves that the failed exercises were just done incorrectly, and that they finally found the real result when they got the significant one. Even more insidious, the 20 regressions could be spread across 20 different researchers. Each of these obediently does only one pre-specified regression; 19 of them do not publish a paper since they had no significant results, but the 20th publishes a spuriously significant finding (this is known as publication bias).
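Easterly's 1-in-20 arithmetic is easy to verify by simulation. The sketch below (mine, not his) regresses pure-noise "growth" on pure-noise "aid" many times; even with no relationship at all, roughly 5 per cent of regressions come up "significant", so an eager researcher running 20 specifications can expect about one publishable result:

```python
import numpy as np

rng = np.random.default_rng(0)

def noise_regression_significant(n=100, crit=1.984):
    """Regress pure-noise 'growth' on pure-noise 'aid' and report whether the
    slope clears a 5% two-sided t-test (critical value ~1.984 for n=100)."""
    aid = rng.standard_normal(n)
    growth = rng.standard_normal(n)           # no true relationship whatsoever
    aid_c = aid - aid.mean()
    slope = aid_c @ growth / (aid_c @ aid_c)  # OLS slope
    resid = growth - growth.mean() - slope * aid_c
    se = np.sqrt(resid @ resid / (n - 2) / (aid_c @ aid_c))
    return abs(slope / se) > crit

# One eager researcher's 20 specifications:
hits = sum(noise_regression_significant() for _ in range(20))
print(f"'significant' results out of 20 noise regressions: {hits}")

# Across thousands of regressions the false-positive rate sits near 5%:
rate = np.mean([noise_regression_significant() for _ in range(5000)])
print(f"false-positive rate: {rate:.3f}")
```

Report only the winning specification and the 1-in-20 safeguard has quietly been spent.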

But don't give up on all damned lies and statistics, there ARE ways to catch data mining. A significant result that is really spurious will only hold in the original data sample, with the original time periods, with the original specification. If new data becomes available as time passes, you can test the result with the new data, where it will vanish if it was spurious data mining. You can also try different time periods, or slightly different but equally plausible definitions of aid and the control variables.

So a few years ago, some World Bank research found that aid works (raises economic growth) in a good policy environment. This study got published in a premier journal, got huge publicity, and eventually led President George W. Bush (in his only known use of econometric research) to create the Millennium Challenge Corporation, which he set up precisely to direct aid to countries with good policy environments.

Unfortunately, this result later turned out to fail the data mining tests. Subsequent published studies found that it failed the new data test, the different time periods test, and the slightly different specifications test.

The original result that aid works in a good policy environment was a spurious association. Of course, the MCC is still operating; it may be good or bad for other reasons.

Moral of the story: beware of these kinds of statistical results that are used to determine aid policy! Unfortunately, the media and policy community don't really get this, and they take the original studies at face value (not only on aid and growth, but also in work on determinants of civil war, fixing failed states, peacekeeping, democracy, etc., etc.) At the very least, make sure the finding is replicated by other researchers and passes the data mining tests. …

I saw Milton Friedman provide an interesting example of avoiding data mining. I was at a SF Fed conference where he was a speaker, and his talk was about a paper he had written 20 years earlier on “The Plucking Model.” From a post in January 2006, New Support for Friedman’s Plucking Model:

Friedman found evidence for the Plucking Model of aggregate fluctuations in a 1993 paper in Economic Inquiry. One reason I’ve always liked this paper is that Friedman first wrote it in 1964. He then waited for more than twenty years for new data to arrive and retested his model using only the new data. In macroeconomics, we often encounter a problem in testing theoretical models. We know what the data look like and what facts need to be explained by our models. Is it sensible to build a model to fit the data and then use that data to test it to see if it fits? Of course the model will fit the data; it was built to do so. Friedman avoided this problem since he had no way of knowing if the next twenty years of data would fit the model or not. It did.

The other thing I’ll note is that there is a literature on how test statistics are affected by pretesting, but it is ignored for the most part (e.g. if you run a regression, then throw out an insignificant variable, anything you do later must take account of the fact that you could have made a type I or type II error during the pretesting phase). The bottom line is that the test statistics from the final version of the model are almost always non-normal, and the distribution of the test statistics is not generally known.
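Thoma's pretesting point can be made concrete with a small simulation (a sketch of mine, not from any of the quoted posts). Even the mildest selection step, trying two noise regressors and keeping whichever looks better, roughly doubles the rejection rate of a nominal 5% test, so the reported statistic no longer behaves like a standard t-statistic:

```python
import numpy as np

rng = np.random.default_rng(3)

def reported_t(n=100):
    """Regress noise y on two noise regressors separately, keep whichever
    looks better (the 'pretest'), and report its t-statistic."""
    y = rng.standard_normal(n)
    ts = []
    for _ in range(2):
        x = rng.standard_normal(n)
        xc = x - x.mean()
        slope = xc @ y / (xc @ xc)
        resid = y - y.mean() - slope * xc
        se = np.sqrt(resid @ resid / (n - 2) / (xc @ xc))
        ts.append(slope / se)
    return max(ts, key=abs)          # keep the "significant-looking" one

t_final = np.array([reported_t() for _ in range(4000)])

# A naive 5% test applied to the post-pretest statistic rejects far too often:
size = np.mean(np.abs(t_final) > 1.96)
print(f"actual rejection rate at nominal 5%: {size:.3f}")
```

With two candidate regressors the true size is roughly 1 − 0.95², about 10 per cent; add more candidates and it climbs further.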


Update: Seems like a good time to rerun this graph on publications in political science journals:

Lies, Damn Lies, and…. Via Kieran Healy: …It is, at first glance, just what it says it is: a study of publication bias, the tendency of academic journals to publish studies that find positive results but not to publish studies that fail to find results. …

The chart on the right shows G&M’s basic result. In statistics jargon, a significant result is anything with a “z-score” higher than 1.96, and if journals accepted articles based solely on the quality of the work, with no regard to z-scores, you’d expect the z-score of studies to resemble a bell curve. But that’s not what Gerber and Malhotra found. Above a z-score of 1.96, the results fit the bell curve pretty well, but below a z-score of 1.96 there are far fewer studies than you’d expect. Apparently, studies that fail to show significant results have a hard time getting published.

So far, this is unsurprising. Publication bias is a well-known and widely studied effect, and it would be surprising if G&M hadn’t found evidence of it. But take a closer look at the graph. In particular, take a look at the two bars directly adjacent to the magic number of 1.96. That’s kind of funny, isn’t it? They should be roughly the same height, but they aren’t even close. There are a lot of studies that just barely show significant results, and there are hardly any that fall just barely short of significance. There’s a pretty obvious conclusion here, and it has nothing to do with publication bias: data is being massaged on a wide scale. A lot of researchers who almost find significant results are fiddling with the data to get themselves just over the line into significance. … Message to political science professors: you are being watched. And if you report results just barely above the significance level, we want to see your work….
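A caricature of how that picture arises (illustrative numbers of my own, not Gerber and Malhotra's data): start with a smooth bell of underlying z-scores, always publish the significant ones, nudge most near misses over the line, and let plain publication bias cull the rest. The bars adjacent to 1.96 come out wildly unequal:

```python
import numpy as np

rng = np.random.default_rng(1)

# Underlying |z|-scores drawn from a smooth distribution (illustrative only)
z = np.abs(rng.normal(loc=1.5, scale=1.0, size=20_000))

published = []
for zi in z:
    if zi >= 1.96:
        published.append(zi)                    # significant: published as-is
    elif zi >= 1.76 and rng.random() < 0.8:
        published.append(1.96 + (1.96 - zi))    # near miss: massaged over the line
    elif rng.random() < 0.3:
        published.append(zi)                    # everything else: mostly unpublished
published = np.array(published)

# The two bins adjacent to 1.96 end up nowhere near equal:
just_below = int(np.sum((published >= 1.76) & (published < 1.96)))
just_above = int(np.sum((published >= 1.96) & (published < 2.16)))
print(f"just below 1.96: {just_below}, just above 1.96: {just_above}")
```

Pure publication bias alone would thin the whole left side smoothly; only massaging produces the cliff right at the threshold.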

  1. log(aid/GDP) – aid loan repayments
  2. average of indexes of budget deficit/GDP, inflation, and free trade
  3. quality of institutions
  4. One more note. I wrote a paper on Friedman’s Plucking Model and had a revise-and-resubmit at a pretty good journal. I satisfied all the referees’ objections, or at least I thought I had, and it was all set to go. I had sent the first version of the paper to Friedman, and he wrote back with a long, multi-page letter that was very encouraging, and I incorporated his suggestions into the revision (a reason I’ll always have a soft spot for him: his time was valuable, yet he took the time to do this). But the final results weren’t robust; they had come about through trying different specifications until one worked. The final specification worked well, very well in fact, but the results were pretty fragile. As a result, I pulled the paper and did not resubmit it. The paper had been completely redone and rewritten, but after thinking it over I decided it wasn’t robust enough to publish. I find myself regretting that sometimes; the referees would probably have taken the paper since the final version satisfied all their objections, and it was a good journal, and I told myself I had simply done what everyone else does, etc. But hard as it was for an assistant professor in need of publications to pull a paper, especially one Friedman himself had endorsed (this was just before going up for tenure, so it could have mattered a lot), pulling the paper was the right thing to do. The only way to solve this problem (and data mining in economics is a problem) is for the people involved in the research to self-police the integrity of the process.
This entry was posted in Economics and public policy.
15 years ago

“There’s a pretty obvious conclusion here, and it has nothing to do with publication bias: data is being massaged on a wide scale.”
I’m not sure how you infer that. The graph is entirely unsurprising to me, and would occur even if people didn’t massage their data (although I’m sure that happens too). The reason is that many people (perhaps most in some areas) are looking for tiny effects with low power that are supposed to be interesting theoretically, whereas other people are running more quantitative stuff where you do have big effects with lots of power. This means that what you are seeing is basically noise from the fiddly little effects, where someone has collected a data set that many other people have, but they just happened to have sampled from the tail of the distribution. Getting to the next percentile will be very much harder (since the z = 1.96 data sets might already come from samples a few SDs away from the real population mean), so you should see a huge drop-off: the data sets that were just significant were outliers in the first place. If you then mix the fiddly little effects with the big effects, you end up with a normal curve generated by the latter plus the 1.96 blip from the fiddly-little-effects people.

I think the real test would be for someone else to re-run a pile of other people’s experiments from the same journal (which would be possible in some areas but not others). That way you could see to what extent results really are getting exaggerated by everybody trying to find the magical .05 value. Unfortunately, in many areas, this is an impossible strategy, and failures to replicate almost never get published. This is why we end up with a massive proliferation of little effects that don’t really matter.

Don Arthur
15 years ago

Great post!

As a non-economist I’ve always been puzzled by the rituals of the discipline (how to get published, win peer respect etc). Each discipline seems to have its own.

Psychologists like experiments. But they tend to run a lot of them with American college students and under conditions that make it difficult to generalise the results outside the laboratory.

For a while it was fashionable in psychology to practice a crude kind of operationalism where it was forbidden to appeal to any unobservable process (e.g. thoughts, feelings or neural processes).

Anthropologists like ethnography and are very impressed when a researcher immerses themselves for years in an interesting foreign culture (an economics department perhaps?).

Economists love regressions and seem to hover like seagulls around organisations with large data sets.

If an economist was researching unemployment and you offered to let them spend a week watching people in a welfare to work office do their job, they’d probably think you were trying to contaminate the research process (or waste their valuable time).

The odd thing is, if there are any good theories in social science, they’d have to apply across these disciplinary boundaries.

Sorry if this is off topic.

derrida derider
15 years ago

“[humans] usually want to do better and predict when red will come up too, engaging in reasoning like after three straight greens, we are due for a red.”

Of course if it’s sampling without replacement and the population size is small, this is perfectly rational.

Economists aren’t the only ones who data mine – in fact, they’re far from the worst (marketing people IME are the worst). It’s widespread wherever statistics are used and a risk in all frequentist approaches. So maybe the best remedy is to be more Bayesian.

But ultimately we have to understand that there are no shortcuts to truth – definitely establishing whether A causes B is often far harder than it seems.

john r walker
15 years ago

Great article!

Something that annoys me is the unwillingness of proponents of econometrics to adopt a more visceral and honest approach to research.

To admit that the crafting of variables is probably THE MOST important part of the process, and that everything else (the relationships between them) can flow from there.

Sadly, variables are restricted to those which have data available.

NB I asked Andrew Leigh about this via blog comments (I decline to use private correspondence, in a principled attempt to make these discussions public; but I guess people of status have got there because they care about such things, and discussions with the likes of me may undermine that status) but received no response.

Bruce Bradbury
15 years ago

My explanation for the figure is that it represents a mixture of two types of studies. Type 1 are studies which are only interesting (and hence published) if the result is significant. Type 2 are studies where the result is of interest even if it is not significant.

As an aside, the term ‘data mining’ is not entirely negative. There is a large industry of software and practitioners who use data mining in a statistically respectable manner. The trick is to have a very large dataset, do your ‘mining’ in one half of the dataset, then test the estimated model in the other half.
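Bruce's split-sample discipline is easy to illustrate (a sketch of mine, on pure noise): mine one half of the data for the best-looking of 50 candidate predictors, then check it in the held-out half, where the "discovery" will typically collapse back toward zero:

```python
import numpy as np

rng = np.random.default_rng(2)

n, k = 400, 50                       # 400 observations, 50 candidate predictors
X = rng.standard_normal((n, k))      # all noise -- nothing truly predicts y
y = rng.standard_normal(n)

train, test = slice(0, n // 2), slice(n // 2, n)

def abs_corr(x, yy):
    """Absolute Pearson correlation between two vectors."""
    return abs(np.corrcoef(x, yy)[0, 1])

# "Mining": pick the predictor that looks best in the first half
best = max(range(k), key=lambda j: abs_corr(X[train, j], y[train]))
r_train = abs_corr(X[train, best], y[train])
r_test = abs_corr(X[test, best], y[test])
print(f"best-of-{k} |correlation| in mining half: {r_train:.3f}")
print(f"same variable in the held-out half:      {r_test:.3f}")
```

The best of 50 noise correlations is inflated by selection; the held-out half, which played no part in the selection, gives an honest read.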

15 years ago

I’m not into political science enough to be sure, but I would guess that a lot of the studies creating the graph from political science journals given above are not “data mining”. I would think that many are essentially experimental and are looking at differences across groups based on questions derived a priori from different theories. For example, I could test something like “do labor voters love their pets more than liberal voters” and base it on some a priori theory of altruism. When I didn’t find a difference, the result would go into my silly-experiments-I-ran pile. However, when the 50th person does essentially the same experiment and gets a Type I error with a p value of .0499, then they think it’s a great thing to publish. So the effect may be coming from things that aren’t to do with “data mining” but from the fact that many people run lots and lots of little experiments.