Andrew Leigh’s excellent speech launching Randomistas

Robert Solow once referred to the law and economics scholar Richard Posner as writing books the way the rest of us breathe. Andrew Leigh seems to be in this category, his output apparently accelerating on top of his no doubt gruelling schedule as an MP, not to mention being a father of three.

Anyway, I’ve not yet read his latest but I did go to the Melbourne launch of his book where he lavished the breadth of his learning on his audience. I would have liked his speech to show somewhat greater awareness of the foibles of what Hayek called scientism.

Randomised controlled trials definitely have some very worthwhile things to offer policy making and Andrew’s speech makes that case compellingly. I also endorse his support for randomisation as a modus operandi – not just for all-singing, all-dancing RCTs costing hundreds of thousands of dollars run by academics, but also for everyday randomisation in the way that’s proposed in the Lean Start-up and practised by the most successful IT firms like Google and Amazon.

But I’ve got an uneasy feeling about how randomisation so easily takes on the mantle of ‘gold standard’ for evidence – something repudiated by numerous scholars such as Angus Deaton and James Heckman. Here’s Hayek in 1942, but he held the same views up to his death around forty years later:

In the hundred and twenty years or so during which this ambition to imitate Science in its methods rather than its spirit has now dominated social studies, it has contributed scarcely anything to our understanding of social phenomena… Demands for further attempts in this direction are still presented to us as the latest revolutionary innovations which, if adopted, will secure rapid undreamed of progress.

This idea that we can prove up ‘what works’ and then build a management system around it is OK as a meta-idea but only if it’s pursued with the scientific caveats that it requires. Alas managers and politicians are impatient with such things. I fear Andrew might be a little impatient with it also. And so, just as academia pumps out graduates who have been carefully trained to generate and operate any number of sophisticated models but have been poorly trained, if they’ve been trained at all, to understand their respective merits and limitations, so it would be easy for whole systems to be built which generate knowledge using randomised trials, but show little care in understanding precisely how far that knowledge can be generalised – how constrained to its context it is. I tried to explore this terrain in my own dinner address to the Australian Evaluation Society Annual Conference last year.

In any event, these issues may be dealt with in the book. Be that as it may, Andrew gave a great account of himself and I warmly recommend his speech, reproduced below the fold, to all. You’ll learn a lot. I did anyway.   

Andrew Leigh’s Launch Speech for his book Randomistas: How Radical Researchers Changed Our World

In 2013, a group of Finnish doctors published the results of a randomised trial of knee surgery performed for a torn meniscus, the piece of cartilage that provides a cushion between the thighbone and shinbone. This operation, known as a meniscectomy, is performed millions of times a year, making it the most common orthopaedic procedure in countries such as Australia and the United States.

The randomised trial was based on ‘sham surgery’, in which patients consent to being assigned either to a regular treatment, or to being cut open and sewn up again without the operation being performed. Not only is the patient assigned to true surgery or placebo surgery based on the toss of a coin – they are not even told afterwards what happened to them.

The 2013 randomised experiment showed that among middle-aged patients, surgery for a torn meniscus was no more effective than sham surgery. Not everyone welcomed the finding. An editorial in the journal Arthroscopy thundered that sham surgery randomised trials were ‘ludicrous’. The editors went so far as to argue that because no ‘right-minded patients’ would participate in sham surgeries, the results would ‘not be generalizable to mentally healthy patients’.

Yet sham surgeries are growing in importance, as people realise that the placebo effect in surgery is probably bigger than in any other area of medicine. A recent study found that three-quarters of patients say they feel better after surgery; but that in half the cases, those who got sham surgery experience just as big an improvement as those who got real surgery. The results suggest that millions of people every year are undergoing surgeries that make them feel a bit better – yet they would feel just as good if they had undergone placebo surgery instead.

Despite the advocacy of surgeons such as Melbourne’s Peter Choong, sham surgery remains in its infancy. Part of the challenge comes down to how surgeons approach their job. Sydney surgeon Ian Harris points out that patients sometimes regard aggressive surgeons as heroic and conservative surgeons as cowardly.

* * *

What does a typical randomised trial look like? Suppose that we decided to test the impact of sleep on happiness by doing an experiment with the 100 people in this room. If we tossed coins, we would end up with 50 people in the heads group, and 50 in the tails group. Now imagine we asked the heads group to get an extra hour’s sleep that evening, and then surveyed people the next night, asking them to rate how happy they were with their lives. If we found that the heads group were happier than the tails group, it would be reasonable to conclude that a little more snooze helps lose the blues.
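For the technically minded, the coin-toss design is easy to sketch in code. Below is a toy simulation with invented numbers – a 0–10 happiness scale and a hypothetical half-point boost from extra sleep – not data from any real study:

```python
import random
import statistics

random.seed(0)  # so the illustration is reproducible

# Toss a coin for each of 100 people: heads gets extra sleep, tails doesn't.
heads, tails = [], []
for person in range(100):
    (heads if random.random() < 0.5 else tails).append(person)

# Invented outcome model: happiness is roughly normal around 6 on a 0-10
# scale, and the extra-sleep group gets a hypothetical half-point boost.
true_effect = 0.5
happiness = {p: min(10, random.gauss(6, 1) + (true_effect if p in heads else 0))
             for p in range(100)}

# Because assignment was random, comparing group means estimates the effect.
diff = (statistics.mean(happiness[p] for p in heads)
        - statistics.mean(happiness[p] for p in tails))
print(f"extra-sleep group minus control: {diff:+.2f} points")
```

With random assignment, any pre-existing differences in temperament between the two groups wash out on average, so the difference in means is an unbiased estimate of the effect of the extra sleep itself.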

The beauty of a randomised trial is that it gets around problems that might plague an observational analysis, such as the possibility that happiness causes sleep – good-tempered people tend to hit the pillow early.

Randomised trials have a long history in medicine, going back to James Lind’s work on scurvy, and Ambroise Paré’s work on treating battlefield burns. In the 1800s, a randomised trial showed that bloodletting didn’t cure patients. Alas, the result was only accepted after doctors had decided to call one of their leading journals The Lancet.

In the 1940s, British researcher Austin Bradford Hill was working on streptomycin, a promising new treatment for tuberculosis. The disease had nearly killed Hill as a child, and still claimed the lives of nearly 200,000 Britons annually. Hill used scarcity as an argument for doing a randomised trial, rather than rolling out the treatment across the country. ‘We had no dollars and the amount we were allowed by the Treasury was enough only for, so to speak, a handful of patients. In that situation I said it would be unethical not to make a randomised controlled trial.’

A trial in 1954 randomly injected 600,000 US children with either polio vaccine or salt water. The vaccine proved effective, and immunisation of all American children began the following year. The 1960s saw randomised trials used to test drugs for diabetes and blood pressure, and the contraceptive pill.

In between, there have been plenty of randomised trials of ineffective treatments. Today, only one in ten drugs that look promising in the laboratory ends up finding its way onto the market.

In each case, those taking the new drug are compared against people taking a fake drug, or placebo. For alleviating discomfort, the placebo effect works in surprising ways. For example, placebo injections produce a larger effect than placebo pills. Even the colour of a tablet changes the way in which patients perceive its effect. Thanks to randomised trials, we know that if you want to reduce depression, you should give the patient a yellow tablet. For reducing pain, use a white pill. For lowering anxiety, offer a green one. Sedatives work best when delivered in blue pills, while stimulants are most effective as red pills. The makers of the movie The Matrix clearly knew this when they devised a moment for the hero to choose between a blue pill and a red pill.

For my own part, randomised trials have helped shape how I look after my health. I used to take a daily multivitamin tablet, until I read a study that found that for otherwise healthy people, there is no evidence that extra vitamins make you live longer. Nor do the randomised trials support fish oil supplements. I wear compression socks after an Australian randomised trial of marathoners showed that they aid recovery, and I remove my sons’ bandaids quickly rather than slowly after a study at James Cook University reported that it was less painful.

The randomistas are reshaping social policy too.

In Melbourne, the ‘Journey to Social Inclusion’ experiment was Australia’s first randomised trial of a homelessness program. The intervention lasted for three years, and provided the 40 people in the treatment group with intensive support from a social worker. This caseworker might help them find housing, reconnect with family and access job training. The other 40 people, in the control group, did not receive any extra support.

What might we expect from the program? If you’re like me, you’d have hoped that three years of intensive support would see all partici­pants healthy, clean and employed. But by and large, that’s not what the program found. Those who were randomly selected into the program were indeed more likely to have housing, and less likely to be in physical pain. But Journey to Social Inclusion had no impact on reducing drug use or improving mental health. At the end of three years, just two people in the treatment group had a job – the same number as in the control group.

The Journey to Social Inclusion program is a reminder of how hard it is to turn around the living standards of the most disadvantaged. Hollywood loves to depict overnight transformations, but the more common trajectory for someone recovering from deep trauma looks more like two steps forward and one step back.

Unless we properly evaluate programs designed to help the long-term homeless, there’s a risk that people of goodwill – social workers, public servants and philanthropists – will fall into the trap of thinking it’s easy to change lives. There are plenty of evaluations of Australian homelessness programs that have produced better results than this one. But because none of those evaluations was as rigorously conducted as this one, there’s a good chance they’re overstating their achievements.

Researchers in Canberra have run world-leading randomised trials of ‘restorative justice conferencing’ – bringing offender and victim together to discuss what the perpetrator should do to repair the harm. Cases judged suitable for restorative justice are randomly allocated to it or to the traditional process. The studies in Australia and around the world conclude not only that restorative justice reduces crime, but also that it helps victims. In one study, victims of violence were asked if they would harm the offender if they got the chance. When cases went to court, nearly half the victims said afterwards that they still wanted to take revenge – compared with fewer than one in ten whose cases went through restorative justice.

If only we had randomised evidence on the impact of prisons. Then again, it’s hard to imagine that any prison authority would agree to run an experiment to answer this question. Courts and parole boards aim to dispense equal justice, not rely on luck. To have enough statistical power would require thousands of prisoners. There would need to be big differences in the sentences of the two groups, based on nothing more than chance. The cries of unfairness would be deafening…

Or so you might think. In 1970 the California parole board agreed to run just such an experiment. That year, 3000 prisoners who were coming up for release were divided into two groups. Using a table of random numbers, half of the prisoners had their sentences shortened by six months, while the rest served their regular terms. After release, the authorities looked to see who reoffended. They found no difference between the two groups, suggesting that another six months behind bars didn’t make the streets any safer.

In the classroom, we’re learning a lot from randomised trials.

In one experiment, the Bill & Melinda Gates Foundation ran a randomised trial of coaching programs for teachers. Each month, teachers sent videos of their lessons to an expert coach, who worked with them to eliminate bad habits and try new techniques. By the end of the year, teachers in the coaching program had seen gains in their classrooms equivalent to several additional months of learning.

Another study looked at the Promise Academy, a school in Harlem that operates on a ‘no excuses’ model, with classes sometimes running from 8am to 7pm. Across the United States, the average black high school student is two to four years behind his or her white counterparts. Students who won a lottery to attend the Promise Academy improved their performance by enough to close the black–white test score gap. As lead researcher Roland Fryer points out, this overturns the fatalistic view that poverty is entrenched, and schools are incapable of making a transformational difference. He claims that the achievements of the Promise Academy are ‘the equivalent of curing cancer for these kids’.

Developing countries are awash with randomised trials. In Indonesia, a randomised trial tested the impact on students of randomly doubling teachers’ pay. In India, a randomised trial covering 19 million people estimated the impact on corruption of rolling out biometrically identified smartcards.

When the Mexican city of Acayucan found that the council only had money to pave about half the streets, the mayor saw an opportunity to avert some voter anger and learn about the impacts of road paving. Rather than selecting the roads herself, she let researchers randomly choose which streets to upgrade. In Kenya, economists worked with the national electricity utility to randomly give some households a discount on their connection fee. By varying the subsidy, the researchers were able to see how much households valued being connected to the grid.

Businesses are working on randomised trials too.

Quora, a question-and-answer website, devotes a tenth of its staff to running randomised trials, and is conducting about thirty experiments at any given time. Amazon is virtually built on randomised trials. As one commentator observes, ‘every pixel on the [Amazon] home page has had to justify its existence through repeated testing of alternative layouts’. In retail, if you’re wondering why half of all prices end in nine, you can blame the use of randomised marketing trials.

If you have a Coles FlyBuys card, you’re part of a randomised trial. One in 100 cards is randomly selected to be a control group, which does not receive any promotional material. This lets the company benchmark the impact of its promotions.

The shade of blue on the Google toolbar is the result of a randomised trial run by Marissa Mayer, then a vice-president at Google. She proposed an experiment that tested 40 different shades of blue. With billions of clicks, even a small difference means big bucks. One estimate is that finding the perfect colour for the toolbar added US$200 million to Google’s bottom line.

Google’s scientists have access to around 15 exabytes of data and around 40,000 searches each second. Yet big data isn’t an alternative to randomised trials: if Google still gets value from randomised experiments, then the same surely goes for every other researcher on the planet.

Running a randomised experiment in business is often called ‘A/B testing’, and has become integral to the operation of firms such as Netflix, eBay, Intuit, Humana, Chrysler, United Airlines, Lyft and Uber. One US executive says that his firm has three cardinal rules: ‘you don’t harass women, you don’t steal and you’ve got to have a control group’. Yes, that’s right – you can lose your job for not having a control group.
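The arithmetic behind such an A/B test is short. Here is a minimal sketch using a standard two-proportion z-test; the function name and click counts are invented for illustration, not taken from any firm above:

```python
import math

def ab_test(clicks_a, views_a, clicks_b, views_b):
    """Two-proportion z-test: is variant B's click rate different from A's?"""
    p_a, p_b = clicks_a / views_a, clicks_b / views_b
    p_pool = (clicks_a + clicks_b) / (views_a + views_b)  # rate under the null
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / views_a + 1 / views_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF (via math.erf)
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return p_a, p_b, z, p_value

# Illustrative numbers only: variant B's layout draws more clicks.
p_a, p_b, z, p = ab_test(clicks_a=30, views_a=2000, clicks_b=55, views_b=2000)
print(f"A: {p_a:.2%}  B: {p_b:.2%}  z = {z:.2f}  p = {p:.4f}")
```

At web scale the samples are so large that even tiny differences in click rates clear conventional significance thresholds – which is why ‘every pixel’ can be made to justify its existence.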

* * *

You can even use randomised trials in your own life. Last year, I used Google ads to run a small experiment of my own. Anyone who searched the web might have seen an ad for a new book about randomised trials. Web surfers were randomly shown one of twelve possible book titles. My editors and I each had our favourite titles, but we had agreed to leave the final decision to a randomised experiment.

A week later, over 4000 people had seen one of the advertisements. The worst performing title (not a single person clicked on it) was Randomistas: How a Powerful Tool Changed Our World. Second place was Randomistas: The Secret Power of Experiments. And the clear winner was Randomistas: How Radical Researchers Changed Our World. The experiment took about an hour to set up, and cost me about $50.

A few years earlier, I had written a book on inequality for the same publisher. My editor wanted to call it Fair Enough? My mother suggested Battlers and Billionaires. After running Google ads for a few days, we found that the click rate for my mother’s title was nearly three times higher. My editor graciously conceded that the evidence was in, and Battlers and Billionaires hit the shelves the following year.

* * *

In the early 2000s, successful businessman Blake Mycoskie visited villages outside Buenos Aires, and was struck by what he saw: ‘I knew somewhere in the back of my mind that poor children around the world often went barefoot, but now, for the first time, I saw the real effects of being shoeless: the blisters, the sores, the infections.’

To provide shoes to those children, Mycoskie founded ‘Shoes for Better Tomorrows’, which was soon shortened to TOMS. The company made its customers a one-for-one promise: buy a pair of shoes and TOMS will donate a pair to a needy child. TOMS has given away over 60 million pairs of shoes.

Six years in, Mycoskie and his team wanted to know what impact TOMS was having, so they made the brave decision to let economists randomise shoe distribution across eighteen communities in El Salvador. The study showed that the canvas loafers didn’t go to waste: most children wore their new shoes most of the time. But the children’s health wasn’t any better, as the TOMS shoes were generally replacing older footwear. Free shoes didn’t improve children’s self-esteem, but did make them feel more dependent on outsiders.

Let’s be clear about what this meant. Corporate philanthropy wasn’t an add-on for TOMS – it was the firm’s founding credo. Now a randomised trial showed that among recipients in El Salvador, free shoes weren’t doing much to improve child outcomes, and may even have been fostering a sense of dependency. Yet rather than trying to discredit the evaluation, TOMS responded promptly.

As lead researcher Bruce Wydick wrote: ‘TOMS is perhaps the most nimble organization any of us has ever worked with, an organization that truly cares about what it is doing, seeks evidence-based results on its program, and is committed to re-orienting the nature of its intervention in order to maximize results. In response to children saying that the canvas loafer isn’t their first choice, they now often give away sports shoes . . . In response to the dependency issue, they now want to pursue giving the shoes to kids as rewards for school attendance and performance . . . Never once as researchers did we feel pressure to hide results that could shed an unfavourable light on the company… we applaud them for their transparency and commitment to evidence-based action among the poor.’

No-one should fault Blake Mycoskie for setting up TOMS shoes as he did – he acted on the best available evidence at the time. As the poet W.H. Auden once put it, ‘We may not know very much, but we do know something, and while we must always be prepared to change our minds, we must act as best we can in the light of what we do know.’

But when new facts arrived, TOMS shifted. And because of that, the TOMS randomised trial doesn’t look like a failure at all. Blake Mycoskie’s goal in establishing the firm was to improve the health of poor children. The company evaluated its approach. It didn’t work. So it changed tack. The philosophy of test-learn-adapt is at the heart of randomisation.

Randomised trials flourish where modesty meets numeracy. An experimenting society doesn’t just mean we do more rigorous evaluation, it also means we pay more attention to the facts. We are less dogmatic, more honest, more open to criticism, less defensive. We are more willing to change our theories when the data prove them wrong.

Ethically done, randomised experiments can change our world for the better. Time to toss a few more coins?


This entry was posted in Best From Elsewhere, Economics and public policy.

8 Responses to Andrew Leigh’s excellent speech launching Randomistas

  1. conrad says:

    The flipside of RCTs is that they are an extremely expensive way to collect data in many areas. Simple experiments where people participate in both conditions generally need far fewer participants than between-groups designs and often can give data that is good enough for what you need. As noted, it is also very hard to get ethics approval for sham conditions — I doubt the meniscus study would ever get through an ethics board in Aus.

    This extra cost can be thought of as an opportunity cost in learning. For example, let’s say an RCT requires 10 times the number of participants as a simple experiment and I was interested in early childhood outcomes. With an RCT, the best I might have the money for is: “This program works better than that one”. With 10 experiments, I could actually look at different aspects of the programs and hence learn about what makes them good or bad.

  2. Bruce Bradbury says:

    The Journey to Social Inclusion example is illustrative of a key challenge facing randomised trials – indeed any form of evaluation. If such a program were able to reduce drug dependence or increase employment it would have very large net social and personal benefits. The flip side of this is that a program which increased the probability of such a successful outcome by only a small amount would be worth funding.

    However, to observe such small effects, a large study is required. With only 40 participants we can only test for large effects. We need to draw on other theory and evidence to make decisions when important but small effects are possible.
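Bruce’s point about statistical power can be made concrete with a quick simulation. The employment rates below are invented for illustration; the question is simply how often a two-arm trial detects a given effect at a given sample size:

```python
import math
import random

random.seed(1)

def rejects(p_control, p_treat, n):
    """Simulate one trial with n people per arm; True if a two-proportion
    z-test rejects the null of no difference at the 5% level."""
    c = sum(random.random() < p_control for _ in range(n))
    t = sum(random.random() < p_treat for _ in range(n))
    pool = (c + t) / (2 * n)
    if pool in (0.0, 1.0):  # no variation at all: cannot reject
        return False
    se = math.sqrt(pool * (1 - pool) * 2 / n)
    return abs(t / n - c / n) / se > 1.96

def power(p_control, p_treat, n, sims=2000):
    """Share of simulated trials in which the effect is detected."""
    return sum(rejects(p_control, p_treat, n) for _ in range(sims)) / sims

# Hypothetical employment rates: 5% in control, 15% under treatment.
low = power(0.05, 0.15, n=40)
high = power(0.05, 0.15, n=400)
print(f"power with 40 per arm:  ~{low:.2f}")
print(f"power with 400 per arm: ~{high:.2f}")
```

With 40 people per arm, even a tripling of the employment rate goes undetected most of the time; picking up the small-but-worthwhile effects Bruce describes needs samples an order of magnitude larger.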

  3. Nicholas Gruen says:

    Thanks Bruce,

    I think 40 is enough to know a lot.

    The Sacred Heart Mission still spruik their program in lavish terms on their website.

    The J2SI pilot, which supported 40 people over three years, delivered impressive results. A study undertaken a year after service delivery came to an end, showed that 75% of participants remained in stable housing after four years, 80% had seen a decline in the need for health services and the pilot offered savings to government of up to $32,080 per participant.

    The Australian Government acknowledged the pilot with a National Homelessness Services Award for excellence and innovation in 2013. It has also received a Council to Homeless Persons award for excellence in ending homelessness for adults.

    When it says “the pilot offered savings to government of up to $32,080 per participant” that’s gross savings. The net present cost to government was $80,326 reported on p. 25 of this document!

    Here’s one conclusion from the study:

    To summarise, although some important benefits defy quantification, the CBA shows that the J2SI program generates some positive economic outcomes in the areas of health service use, as well as accommodation and support service use. However, it also shows that the short-term costs are higher than the short-term economic benefits.

    The researchers did conclude that there were net benefits to society by including the prospect of the program saving lives for which there was some evidence but it would have fallen well short of statistical significance. That’s fine with me. And even if it doesn’t pay dollar for dollar, I’m ok with my money being spent like this – though I’d like to know if focusing more on peer-to-peer support would have helped things – my guess is it would have done so substantially, but I’m biased.

  4. Stephen says:

    Andrew gave much the same speech at the launch I attended in Canberra of Randomistas – but added something absent from the published version, which is that RCTs are not appropriate for everything. He cited the famous (and somewhat tongue in cheek) article from the British Medical Journal which challenged people who want every intervention to be subject to an RCT to volunteer to participate in a randomised trial of whether parachutes improve the safety of people jumping out of planes.

    My own view is that the idea of an “evidence hierarchy” is misleading. Sometimes a meta-study of numerous randomised controlled trials is suitable, sometimes not.

    What counts is to use the best available evidence for decision making. Where there is a large population and a lot at stake, a well designed RCT is often the best approach. In other circumstances, if observational evidence is sufficient to reach a sound conclusion, use that instead: it will probably be cheaper.

    Either way, what shouldn’t happen is companies or governments avoiding using evidence on the basis that it costs too much – there are way too many examples of bad practices or policies inflicted on people because nobody bothered to collect evidence of whether or not they worked.

  5. paul frijters says:

    I agree with Hayek that there is a large degree to which the promise of randomisation is like a sermon: heard often, yielding little benefit that is visible. Theories of how the world works (ie, without a parachute you get pulled to earth so fast that you die upon impact) remain the goal, with experiments merely helping to refine those theories in case our theories are not good enough.

    Still, RCTs are not only elegant ways of wasting money and socially approved forms of magic and ritual to sanctify the theories we already believe in. Precisely because of their elite aura and magical elements they help convince people of ideas they would otherwise bitterly resist. Such as the idea that mental health suffering is ubiquitous in our population, can be caused by such unlikely culprits as air pollution and intestinal flora, and that it is surprisingly cheap and effective to do something about many of the most prevalent mental health problems. Many professionals simply would not believe that kind of message without solid proof (ie, modern magic). They would keep on believing that nothing would affect them without their knowledge, that they are not suffering from anything serious, and that those who do suffer are few, need pills or can’t be helped.

    The use of experimentation for small fry stuff in marketing is of course a distraction from a public good point of view. Fine, but irrelevant from a big picture point of view.

  6. derrida derider says:

    I’ve always been very comfortable with RCTs – yes, there should be more of them – but I’m very UNcomfortable with the way they are sold, and as someone who has enormous respect for Andrew it pains me to say so. I particularly do not like the “gold standard” “hierarchy of evidence” framing – I’d far rather a “find the right tool for the job at hand” approach.

    Nic and Bruce have both adverted to two drawbacks – RCTs are not always ethically possible and they often have to be big, and hence slow and expensive, if we are to trust them.

    But there are two other problems I think are even bigger:

    – just like other social science methods, results can be influenced by the experimenters’ wishes, but unlike non-experimental methods end users will be less aware of this with RCTs. You have to be very careful of things like your randomisation protocols in program evaluation, for example, because the program deliverers have a way of frustrating them “to make sure resources go to those clients who’ll benefit most”. In drug tests double blind methods help get around this, but double blind is rarely possible in social science.

    – people like Angus Deaton and Dani Rodrik are right to point out the problems of EXTERNAL validity; the results of an RCT are often context-dependent and so cannot be generalised. Just because you’ve estimated the LATE (local average treatment effect) well doesn’t mean you can estimate the population ATE (average treatment effect) well.

    But hell, I’m just a soon-to-be-superannuated public servant – what would I know? But then I’ll be free to put up long rants like this on my own blog or twitter :-)

  7. Epiphyte says:

    “What would happen if I click this link?” That’s what I wondered when I saw the link in your comment on Henry Farrell’s blog entry about the allocation of attention. I conducted the tiny experiment by clicking the link and… voila! Here I am.

    Everything boils down to how options are ranked/sorted/ordered. Each comment on Farrell’s blog entry is a different option. The options are sorted by date.

    On Reddit the comments are sorted by votes. The more votes a comment receives, the “better” it is, and the higher its placement on the page. The options are ranked by the Democratic Hand (DH). Higher ranked comments receive more attention than lower ranked comments. Therefore, the DH allocates people’s attention. This is also true for academic papers (citations = votes) and webpages (links = votes).

    At a dog show, the dogs are ranked by a committee. The committee defines “best”. A relatively small group of experts decides which dog to put on the pedestal. This is an example of the options being ranked by the Visible Hand (VH). The VH allocates people’s attention.

    The content on Netflix is ranked by a combination of the DH (thumbs up) and the VH. But it’s ultimately up to the VH to decide how to divide all the subscription money among all the content.

    The products at the grocery store are ranked by consumers spending their money. “Best” is defined by dollars. This is an example of options being ranked by the Invisible Hand (IH). The IH allocates people’s attention.

    The DH, VH and IH are all very different ways of ranking options. Since they are so very different, they must be very unequally effective. We really need science to test these different systems in order to determine which one is the most effective.

    My theory is that the IH is by far the best way to rank options. Given that the IH doesn’t currently rank scholarly papers… this would explain why scientists haven’t bothered to test the DH, VH and IH. There’s a giant disparity between what scientists are studying and what they should be studying. If this isn’t the case, then the credible feedback that markets provide on the usefulness of our behavior to others, really isn’t that necessary.
