Hidden Rules of Development

What to Do When the Evidence Wasn’t Made for Your Country

A synthesis of the recent debate on rigorous evidence, and three questions for choosing an intervention that fits.

Stephen Brien
Jun 15, 2026
Cross-posted by Hidden Rules of Development
"My friend Stephen Brien has a great Substack page "Hidden Rules of Development" and wrote this clear non-technical note about the practicalities of external validity of evidence and making contextual decisions. "
- Lant Pritchett

I’ve been writing here mostly about political economy — the deeper structures that shape why governments behave as they do. This post turns to something more tangible: how a decision-maker should choose between specific policy interventions. The question has drawn several contributions lately, and rather than add another, I want to pull them together into a perspective on what a policy-maker can actually do in light of them.

Four recent contributions to the development economics debate have converged on the same problem from different directions.

On the Ideas in Development podcast, Michelle Rao reports on her research, which shows no overall relationship between what rigorous evaluations find and what governments subsequently spend. The evidence gets produced. It does not reach decisions.[1]

Rafe Meager, on another episode of the same podcast, argued that systematic reviews are answering the wrong question. Decision-makers do not need to know whether a programme works on average. They need to know whether their system can deliver it.[2]

Lant Pritchett’s research on research delivers the empirical verdict: in every published test, applying the international average of rigorous estimates to a country without its own trial produces worse predictions than correcting local data. The standard “rely on the rigorous evidence” approach loses every time.[3]

Research by Jason Coupet and colleagues in the public management literature found a further problem: the “credibility revolution” transforming economics has largely failed to reach the field producing governance recommendations. The evidence being pooled is mostly correlational research.[4]

These are distinct failures. Rao’s is about uptake: whether evidence reaches decisions at all. Pritchett’s and Meager’s are about method: whether the evidence that does reach decisions is being applied correctly. Coupet’s is about the underlying evidence: whether what is being pooled is rigorous enough to bear the claims made on its behalf.

Different as they are, these failures point in the same direction. Evidence-based policy was built to answer whether something works in general, but the decisions it is supposed to inform require something more specific: whether it will work here and now.

What follows addresses the method problem. It is the problem a minister can directly contest: Rao’s failure lies upstream, in whether evidence reaches decisions at all; Coupet’s lies in the research infrastructure that generated the recommendation.

Addressing the method problem is within reach: it is not upstream of the recommendation or beneath it. It is the recommendation, and the published comparisons that give the minister grounds to push back are drawn from the same research. Three questions help a minister who wants to use evidence well to identify when the standard recommendation method is likely to mislead, and to ask for something better. For a genuinely novel intervention with no local implementation history, the international average remains the appropriate starting point; the argument below does not challenge that case.

Same evidence. Opposite recommendations.

When a randomised trial tests a programme, it generates two outputs. The first is an estimate of the true causal effect in the place where it was run. How much better did participants do because of the programme? The second comes from comparing that rigorous result against the local observational data from the same setting. How far did the local data overstate or understate the true effect? This second output is the bias estimate. Both come from the same study.

A systematic review is a structured synthesis of many such trials across different countries — in clinical medicine, mostly randomised; in governance, research suggests, often not — pooling their estimates of causal effect into an international average. This is what gets presented as the evidence base for a recommendation. But the bias estimates from those same studies are not used.

The alternative approach uses them. It takes each country’s local data as the starting point and adjusts it for the typical gap between observational data and the true effect observed in rigorous trials. The result is neither the raw local data nor the international average, but local data with its known distortion removed.
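The two calculations can be sketched in a few lines. This is a minimal illustration, assuming the bias is additive and summarised by a simple mean; the published comparisons may use a different estimator. The function names and every number below are hypothetical, invented only to show the arithmetic.

```python
from statistics import mean


def typical_bias(trial_pairs):
    """Average gap between observational and trial estimates.

    trial_pairs: (observational_estimate, trial_estimate) tuples, one per
    setting where both exist. Additive bias is an assumption of this sketch.
    """
    return mean(obs - trial for obs, trial in trial_pairs)


def bias_corrected(local_observational, trial_pairs):
    """Local observational estimate with the typical bias subtracted."""
    return local_observational - typical_bias(trial_pairs)


# Hypothetical settings that have both kinds of estimate (invented numbers).
pairs = [(0.50, 0.25), (0.75, 0.50), (0.25, 0.00)]

# Standard approach: pool the trial estimates and apply the average everywhere.
international_average = mean(trial for _, trial in pairs)

# Alternative approach: start from the country's own observational estimate
# (here a hypothetical 1.0) and remove the typical bias.
corrected = bias_corrected(1.0, pairs)
```

Both numbers come from the same list of studies. The standard approach reads only the second element of each pair; the alternative reads the gap between the two.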

Local observational studies typically control for measured socio-demographic factors (age, income, prior education levels). The gap between these adjusted estimates and the results of a randomised trial reflects selection on less visible characteristics: motivation, ambition, and the disposition to seek out a programme. These are what the bias estimate captures.

Three published comparisons have tested which output is the better guide to country-level outcomes. In all three (private schooling across 32 countries, migration wages from 42 countries of origin, and returns to microcredit), using the bias estimates to correct local data outperforms applying the international average of causal effects. Bias correction wins every published comparison; the standard approach of using only the impact estimates has never won.[5]
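One way to picture such a comparison is a leave-one-country-out test: hold out each country in turn, predict its true effect from the others by each method, and compare the errors. The sketch below is an assumption about the general shape of such a test, not a reproduction of the published ones. The data are invented and deliberately built so that the bias varies less across countries than the true effect does, so the output illustrates the mechanics rather than any empirical result.

```python
from statistics import mean

# Hypothetical (observational_estimate, true_effect) pairs, invented for
# illustration. True effects range widely; the gap between the two is stabler.
countries = {
    "A": (0.50, 0.00),
    "B": (0.50, 0.25),
    "C": (1.00, 0.50),
    "D": (1.75, 1.00),
    "E": (1.75, 1.25),
}


def leave_one_out_errors(data):
    """Mean absolute error of each method when predicting a held-out country."""
    average_errors, corrected_errors = [], []
    for held_out, (obs, truth) in data.items():
        others = [pair for name, pair in data.items() if name != held_out]
        # Standard approach: average of true effects measured elsewhere.
        international_average = mean(t for _, t in others)
        # Alternative: the held-out country's own data minus the typical bias.
        bias = mean(o - t for o, t in others)
        average_errors.append(abs(international_average - truth))
        corrected_errors.append(abs((obs - bias) - truth))
    return mean(average_errors), mean(corrected_errors)


average_error, corrected_error = leave_one_out_errors(countries)
```

With these constructed numbers the corrected estimate has the smaller error. With data in which the bias varied more than the effect, the ranking would reverse, which is why the comparison has to be run on real estimates.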

In practice, the recommendation ministers receive is based on the average causal effect. The bias estimates from the same research perform better but are not used. This is not an analytical preference grounded in the evidence: the results are publicly available. It is an institutional one. Recommendation-making bodies were designed around the causal effect average, and the methodology has not changed in response to the published comparisons.



Why the range of effects matters more than the average

The reason the alternative approach outperforms the standard one is visible in how the effects of these programmes vary across countries. Private school learning premiums are near zero where public schools function reasonably well, and large where they are chronically weak. The same intervention produces starkly different results depending on where it is applied.

Standard evidence review treats this kind of variation as an analytic problem. If the same programme produces markedly different results across places, the pooled average becomes unreliable.

For a minister, the variation is exactly the point. It tells you that the effect depends on local conditions.

Effects travel when delivery conditions travel. The strongest predictor of whether a result from one setting replicates in another is not the size of the original estimate but whether the programme can be run as it was designed. This is the replication question Meager puts at the centre: whether the conditions that produced the effect elsewhere can be reproduced here.

An international average collapses this range into a single number by combining countries at both ends, near-zero cases and large-effect cases. The result is a middle estimate that is wrong in opposite directions for countries at either extreme. A minister who knows whether local public schools are functioning or failing knows which end of that range their country sits at. The international average discards precisely that knowledge.

The minister who has spent years working within their country’s schools, labour markets, or agricultural system carries structural knowledge no international average can replicate. It is this knowledge (which direction the selection bias runs, whether public services are chronically weak or functioning reasonably, where this country’s conditions sit relative to comparable countries) that allows local data to be read correctly.

The local observational data and the minister’s structural understanding of the setting are not in competition: together, they are the basis for a corrected estimate that neither could produce alone. The three questions at the end of this post are how a minister puts that combination to work.

Why corrected local data wins

International evidence is asked to do two things. It does one of them reliably.

The first is what the standard approach asks of it: estimate the true causal effect of an intervention for a country with local observational data but no controlled test. The previous section established why this fails. The effect depends on local conditions that vary across countries, and the international average cannot tell a minister which end of that range their country occupies.

The second task is one the same evidence base can do reliably: estimate the bias in local observational data. Selection bias means that programme participants are not a random sample. On average, those who choose to enrol, migrate, or send their children to a private school are more motivated and better positioned than those who do not.

Dominican officials, for instance, know from local observation that their migrants are positively selected. The capable and ambitious leave first. The same pattern operates across most countries, because it reflects a consistent feature of how people make choices rather than anything specific to any single country.

The direction of the distortion is therefore similar across contexts. Local observational data overstates the programme’s benefit because the people who chose it would have done better anyway. International evidence pooled across many settings provides a reliable estimate of the typical gap between what local data shows and the true effect.

Correcting local data using the international bias estimate exploits what international evidence reliably provides. Replacing local data with the international causal average asks that evidence to do what it cannot. This asymmetry is not an abstract claim: the three published comparisons provide direct evidence for it. If selection bias were no more consistent across contexts than true causal effects are, bias correction would not outperform the average. That it has done so in every published comparison is itself evidence that bias is more consistent across contexts than the true effect is.
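The asymmetry can be written down in a few lines. The notation is an assumption introduced here for illustration, not taken from the published comparisons: $\theta_c$ is the true effect in country $c$, $b_c$ is the bias in its observational estimate $o_c$, and bars denote averages across the countries that have trials. Treating the bias as additive and ignoring sampling error in the averages:

```latex
\begin{align*}
o_c &= \theta_c + b_c \\
\hat{\theta}_c^{\text{avg}} = \bar{\theta}
  \quad&\Rightarrow\quad \hat{\theta}_c^{\text{avg}} - \theta_c = \bar{\theta} - \theta_c \\
\hat{\theta}_c^{\text{corr}} = o_c - \bar{b}
  \quad&\Rightarrow\quad \hat{\theta}_c^{\text{corr}} - \theta_c = b_c - \bar{b} \\
\mathbb{E}\big[(\bar{\theta} - \theta_c)^2\big] &\approx \operatorname{Var}(\theta_c),
  \qquad
\mathbb{E}\big[(b_c - \bar{b})^2\big] \approx \operatorname{Var}(b_c)
\end{align*}
```

Under these assumptions, bias correction has the smaller expected squared error exactly when $\operatorname{Var}(b_c) < \operatorname{Var}(\theta_c)$, that is, when the bias varies less across countries than the true effect does.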

The standard approach rests on the assumption that the international causal average is a reliable guide regardless of local conditions. The recommendation-making bodies that institutionalised this approach did not test it against a local-data alternative before adopting it, and have not done so since. The published comparisons were conducted by researchers outside those bodies. The evidence and the method used to apply it are therefore separable, which is what makes it possible to challenge the recommendation without disputing the evidence.

The accountability is asymmetric

A minister who accepts a systematic review recommendation accepts more than its conclusion. Every such recommendation is the result of a methodological choice: whether to apply the international causal effect average or to adjust local data using international bias estimates. That choice belongs to the technical team. Accountability for the decision that follows rests with the minister. One party chooses the method; another bears the consequences — and that is before one reaches Rao’s harder question about whether the evidence was sought at all.

The minister may hold local knowledge that points in a different direction from the recommendation. That knowledge is precisely what matters for getting the estimate right: who participates in these programmes, whether local conditions resemble those in the studies, and where this country sits relative to the international average. But it cannot enter the process.

Local observational data is structurally excluded from systematic reviews. Specifically, the inclusion criteria require randomised trials or comparable rigorous designs, which observational data cannot meet regardless of its quality. A minister who raises local observational data in response to a recommendation will be told, correctly within those rules, that it does not meet the inclusion standard. The recourse offered is to commission a randomised trial, which takes years and does not resolve the decision at hand.

This is not a dispute between evidence and intuition. It is a dispute between two different uses of evidence: one built into the process and one not. The methodology was designed before the published comparisons showed that bias-corrected local data outperforms the international average. It has not been revised since.

A recent audit of leading public management and administration journals found that the great majority of papers (87%) rely on designs that cannot establish causation, a share that has not improved in a decade. In the governance domain, the minister often receives a synthesis of correlational research, pooled using the wrong method.[4]

The minister who understands what the recommendation actually is (a methodological choice, not simply the evidence) has grounds to ask for a different application of the same research.

What to ask when the recommendation arrives

When a systematic review recommendation arrives, three questions establish whether to press for the alternative.

First: does the true effect of this intervention depend on local conditions that vary substantially across countries? If the effect is sensitive to the quality of public services, the structure of local labour markets, or institutional capacity, the international average is unlikely to be the right guide. What matters is the full range of effects across countries and the conditions that predict where this country sits within it.

Recent work on evidence aggregation has reframed the diagnostic: not ‘does this programme work?’ but ‘can my system actually deliver this?’ Implementation fidelity — whether the programme will be run as designed, by staff with the capacity assumed in the trials — is among the strongest predictors of whether an effect from elsewhere materialises locally.[2]

Second: does the country have its own evidence on this question, even if imperfect? If yes, that is the appropriate starting point. Local data reflects local conditions. Its imperfection is a bias, usually directional and correctable, not a reason to discard the signal. In every published comparison, correcting local data with the international bias estimate outperforms replacing it with the international average. Where no local data exists, the international average is the appropriate default.

Third: has anyone estimated how far local data overstates or understates the true effect, and in which direction? If not, that is what the minister should commission: not another systematic review but a bias-characterisation study, that is, an estimate of the numerical gap between what local observational data shows and what the true causal effect is. This is distinct from a needs assessment or country diagnostic. A single such study improves the reliability of every existing local estimate in the same sector.

None of these questions disputes the evidence. All three are about which part of it gets used.

The standard recommendation and the better alternative draw on the same rigorous research. What separates them is which output of that research gets applied: the causal effect estimates, aggregated into an international average, or the bias estimates, used to correct local observations.

The international average approach was designed in fields where uniformity of response is a reasonable assumption, such as clinical medicine, where a drug’s mechanism does not depend on whether a country’s public institutions are functioning. In those fields, pooling estimates across contexts to produce a reliable average is defensible; the effect is largely the same wherever the treatment is applied. Development economics is not such a field.

The same programme produces starkly different results depending on local institutional conditions, as the variation in effects across countries makes plain. Importing the methodology without adjusting for this distinction is why the standard approach has not won a single published comparison.

The minister who presses for the alternative is not rejecting rigorous evidence. They are asking for a method suited to the actual question, in a world where the same programme produces very different results depending on where it is applied. That leaves Rao’s harder problem untouched, and Coupet’s prior one. But it is the problem the minister has leverage over.


[1] Michelle Rao, “The Evidence on Evidence,” Ideas in Development, hosted by Oliver Hanney, May 26, 2026. Underlying paper: Rao, “Program Evaluations and Policy Spending: Evidence from Conditional Cash Transfers in Latin America” (working paper), covering 128 evaluations across 17 Latin American and Caribbean countries, 2000–2015.

[2] Rafe Meager, “Aggregating Evidence,” Ideas in Development, hosted by Oliver Hanney (VoxDev), June 2, 2026.

[3] Lant Pritchett, “The Incredible Credulity Revolution: The Incoherence of External Validity (concrete example),” Substack, May 25, 2026.

[4] Coupet, Diebold, Greathouse, Oxley, Siciliano, and Benitez (2026), “Did the Credibility Revolution Skip Public Management?” Across 3,227 papers in twelve leading public administration and management journals (2015–2024), 87% relied on naive statistical designs. Excluding the Journal of Policy Analysis and Management, which attracts substantial economics research, only 7.4% of papers used designs capable of supporting causal claims.

[5] Published comparisons: Pritchett (2024), Review of Development Economics, 28(4), 2034–2058; Pritchett and Sandefur (2015), AER: Papers & Proceedings, 105(5), 471–475.

