Evaluation is not a thing

An earlier version of this piece was published last week on The Mandarin.

Because the idea I have called “the Evaluator General” is several ideas knitted together to try to resolve a number of dilemmas, it comes with numerous implications that are often missed or misunderstood. So I’ve addressed them separately in specific articles. This article does the same, explaining that a central goal for me is for evaluation to become less of a ‘thing’ – separate from the activity it’s evaluating.

For better or worse, policymakers tend to come at evaluation from one of just two perspectives. First, it can give program managers an independent set of eyes to help assess how they’re going and how to improve. This can be particularly important in the public sector, where objectives are multiple and won’t generally map onto a financial metric in the way that profit serves as the ultimate indicator of success in the private sector. Second, for those governing and funding programs, separate evaluation also meets accountability needs.

This and various other exigencies, such as people’s desire to build and participate in ‘professions’, have led to the growing institutionalisation and professionalisation of evaluation. There’s plenty to like about this. And as a new discipline and profession, evaluation is much fresher than mature disciplines whose intellectual foundations ossified years ago even when palpably unsatisfactory. This is true of my discipline, economics,1 but of others too, their commanding heights confined to academia, an increasingly bureaucratised, fast-foodified institution.

The discipline of evaluation contains riches. But it is also a vast, loose network of approaches. Alas, in the push for more evaluation, it is being taken to be something far more settled and definitive than it is – as if getting something evaluated were like getting an auditor to check financial accounts or an engineer to check the structural integrity of a bridge. 

Indeed, so-called ‘goal-free evaluation’ is an interesting and productive area of the discipline. There, the evaluator assesses the impact of the program without calibrating it against – or ideally even knowing – the program’s stated goals. This can improve program hygiene just as double blindness adds to the hygiene of a randomised controlled trial. It can also facilitate wider, and so potentially more powerful, evaluative insights. These include unintended and/or negative consequences of a program, as well as its efficiency and effectiveness, including system/network effects normally outside the program’s defined scope. (Nothing could demonstrate its value better than the central agencies’ obliviousness to its existence. It rarely dawns on the Great and the Good to forbear from exhaustively specifying the goals of the endeavours they fund.)

Further, ‘evaluation’ didn’t play much of a role in the great technical achievements of humanity – the Apollo program or the development of the internet. And nor did ‘evaluation’ – conceived as formal and separate from delivering the goods – play much of a role in the delivery of AlphaZero’s technical wizardry in chess or the miracle of the Toyota Production System.

All those achievements required endless evaluative thinking, but it took place as part of the process of doing the work, not as a ‘thing’ delivered from outside. This isn’t how professions work. Professions sell services, and so ‘evaluation’ is being brought into the production of government services as plumbing or landscaping would be delivered on a building site. That’s just one reason why it’s not working well and won’t if we continue to misunderstand it.

The Productivity Commission’s (PC’s) recent work on indigenous evaluation argues that:

Evaluation is most effective when it is integrated into each stage of policy and program development, from setting policy objectives and collecting baseline data, through to using evaluation findings to inform future policy and program design.

But it’s hard to operationalise these requirements except by bringing evaluation into and alongside operations in an ongoing capacity. Evaluative thinking is of the essence in most of the improvement organisations manage. And it’s in short supply. Thus, for instance, the New Zealand Government’s Wellbeing strategy is focused on measuring wellbeing without directly considering how to improve it. I hope an evaluator draws their attention to that sometime. But they’d be much better off reflecting on it now.

Good program design should contain a great deal of evaluation. If a particular mechanism is important – that children with particular learning needs are best handled in some particular way – it can be tested before we commit to it. And then again and again after we have. This is one of the things that, sad to say, it took ‘nudge units’ to introduce into many government programs – and then as consultants from outside the programs. But evaluation and testing go on all the time in a well-run organisation. They’re going on at Facebook, Google, Amazon and Toyota, in numerous sites and programs, as we speak.

Sometimes there’ll be a case for stepping back and so putting some space between operations and their evaluation. But that’s really quite rare in well-run organisations. In many, if not all, of the numerous examples presented in boxes in the PC’s work on indigenous evaluation, evaluation answers questions that come up and that could easily be handled as the program goes on.

Be that as it may, this was one of the things I wanted to encourage with my proposal for an Evaluator General. Under the arrangements as I envisage them, those delivering services work away for their line agency alongside those with expertise in evaluation, who report to the line agency but are formally under the direction of the Evaluator General. Together, those whose job is to do and those whose job is to know collaborate to understand and improve the program, day in, day out.

In his best-seller The Lean Startup, Eric Ries writes about how start-ups should use their presence in the market to learn. Instead of making complex plans based on lots of assumptions, he recommends making:2

constant adjustments with a steering wheel called the Build-Measure-Learn feedback loop. Through this process of steering, we can learn when and if it’s time to make a sharp turn called a pivot or whether we should persevere along our current path.

Now re-read the earlier passage from the PC. I defy you to explain how what’s called for can be delivered if evaluation is separated from what it’s evaluating. That’s why, in my model, the Evaluator General is responsible for monitoring and evaluation. It also creates a scaffolding in which the different types of evaluation distinguished in the literature, for instance those with a ‘summative’ focus (on accountability for impact) and those with a ‘formative’ one (on program improvement), can often reinforce one another rather than be formally separated.

The Evaluator General’s officers are tasked with knowing and recording, and prompting the evaluative thinking which, while it should assist with meeting pre-set program goals, should also range more broadly around all the things the program is achieving and might be brought to achieve.

Thanks to Keryn Hassall and Alexandra Ellinson for helpful comments on earlier drafts.

  1. As the philosopher Martha Nussbaum put it, “we have to grapple with the sad fact that contemporary economics has not yet put itself onto the map of conceptually respectable theories of human action. (Indeed, it has repudiated the rich foundations that the philosophical anthropology of Adam Smith offered it)”.
  2. Eric Ries. The Lean Startup: How Today’s Entrepreneurs Use Continuous Innovation to Create Radically Successful Businesses, Currency, p. 41.
1 Comment
paul frijters
3 years ago

Yes, agreed with this. Evaluation should also be part of design and implementation as they are both very similar activities to ex-post judgment. Indeed, almost any form of looking ahead has an element of evaluating that imagined future.
On the use of language, you know how public servants try to use different words for things the rest of the population lumps together. So evaluation is ex-post, always of something that has been done. When one looks ahead and judges whether something is worth doing, it is called appraisal.