Every impact evaluation is conducted in a very specific environment and results cannot generalize to different contexts. In particular, the effects of an intervention might depend on factors such as the pre-existing cultural/political/institutional setting. How do you deal with this?

This is known in the literature as the “external validity” problem, and it is a strong motivation to conduct and bring together many evaluations on a topic in order to look at the range of effects that a particular kind of intervention has had in different contexts. We do know that some interventions, such as insecticide-treated bed-nets, do tend to improve well-being on average. The more impact evaluations we see in different settings, the better. Ultimately, this is an argument for more impact evaluations rather than fewer.

Whether the results of an impact evaluation are widely applicable is ultimately an empirical question that we will be uniquely situated to answer. Without the data we are collecting, we wouldn’t know how to generalize from one setting to another. Again, this is a strength of our process rather than a weakness. Preliminary results suggest that while outcomes are context-dependent, even limiting attention to those results obtained in the same country does not improve predictive power.
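One way to make the question of context-dependence concrete is to pool the estimates from several evaluations and ask how much they disagree beyond what sampling noise alone would produce. The sketch below is only an illustration under simple assumptions, not a description of our actual procedure: it computes a fixed-effect, inverse-variance pooled estimate together with Cochran's Q and the I² heterogeneity statistic. The function name `pool_estimates` and its inputs are hypothetical.

```python
from typing import Sequence, Tuple


def pool_estimates(
    effects: Sequence[float], std_errors: Sequence[float]
) -> Tuple[float, float, float]:
    """Return (pooled effect, Cochran's Q, I^2) for a set of study estimates.

    Hypothetical helper for illustration: each study contributes one effect
    estimate and its standard error, and studies are weighted by 1 / SE^2.
    """
    if len(effects) != len(std_errors) or len(effects) < 2:
        raise ValueError("need at least two studies with matching inputs")
    if any(se <= 0 for se in std_errors):
        raise ValueError("standard errors must be positive")

    # More precise studies (smaller standard errors) get larger weights.
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)

    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1

    # I^2: share of total variation attributed to between-study differences.
    i_squared = max(0.0, (q - df) / q) if q > 0 else 0.0
    return pooled, q, i_squared
```

An I² near zero would suggest the studies are telling a consistent story across settings; a large I² would suggest that context matters and that a single pooled number should be read with caution.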


Programs are not randomly selected for evaluation – how can one be sure, then, that one is getting a good sense of the effectiveness of a particular program?

In an ideal world, programs would be randomly selected for evaluation. Unfortunately, this is not an ideal world, and we suspect that, within any given intervention, usually only the best programs are evaluated.

If this problem were the same across all interventions, there would be little concern that it was biasing comparisons between interventions. The problem would simply be that the estimated effects, across all interventions, would on average be higher than the interventions’ typical effects. If we took a conservative enough approach and down-weighted each estimate, we could mitigate this problem.
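The down-weighting argument can be illustrated with a small sketch. It assumes, hypothetically, that selection inflates every estimate by the same proportion, so that multiplying each estimate by a common discount factor lowers the levels while leaving the ranking of interventions unchanged. The function names, the discount value, and the intervention labels are invented for illustration.

```python
from typing import Dict, List


def discount_estimates(estimates: Dict[str, float], discount: float) -> Dict[str, float]:
    """Scale every estimated effect by the same factor in (0, 1].

    Hypothetical helper: `discount` stands in for a conservative guess at how
    much selection of the best programs inflates the published estimates.
    """
    if not 0 < discount <= 1:
        raise ValueError("discount must be in (0, 1]")
    return {name: effect * discount for name, effect in estimates.items()}


def ranking(estimates: Dict[str, float]) -> List[str]:
    """Return intervention names ordered from largest to smallest effect."""
    return sorted(estimates, key=lambda name: estimates[name], reverse=True)


# Made-up effect sizes, for illustration only.
raw = {"intervention_a": 0.40, "intervention_b": 0.10, "intervention_c": 0.25}
conservative = discount_estimates(raw, discount=0.5)

# A uniform positive discount changes the levels but not the ordering.
same_order = ranking(raw) == ranking(conservative)
```

If the inflation differed by field, no single discount factor would restore the correct ordering, which is the harder case.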

The bigger problem would be if, in some fields, the most effective programs were chosen to be evaluated, while in other fields there was no such selection bias. For now, though, an even bigger concern is that poorly performing organizations can choose not to have their work evaluated at all and are not socially sanctioned for this. While AidGrade cannot yet solve the problem of organizations and researchers selecting which programs to have evaluated, we can encourage more organizations to conduct some sort of external evaluation, attacking the larger, first-order problem.


RCTs and other impact evaluations do not evaluate all the effects that an intervention may have. Receiving aid can sometimes disempower recipients; it can draw the more highly skilled into aid work or into professions serving aid workers; it could result in Dutch disease or have broad macroeconomic effects. Impact evaluations are typically designed to capture a narrow set of potential good effects, with less regard for potentially negative or broader effects. Won’t your focus on impact evaluations mean missing these effects?

We do want an impact evaluation to pick up all the effects an intervention has, not just narrowly defined ones, in the same way that we would want to know all the side effects and potentially dangerous interactions a medical drug could have. To some extent, researchers can be encouraged to include more areas in which there could be spillovers in their evaluations. Apart from this, given suspicion that these broader effects exist, our response is to be very conservative with our evaluations and to only support those projects that seem to have such beneficial effects that it seems unlikely that potential downsides could outweigh them. As in the case of a drug, if an intervention has a common side effect we expect that it will come out in time. Overall, despite the complexities, providing more information can only help the discussion.


Why evaluate by outcome?

Evaluating interventions by outcome compares apples to apples. One needs to make more assumptions about what improves people’s well-being in order to know whether an intervention that increases life expectancy by X is “better” or “worse” than an intervention that increases cognitive skills by Y. Work is underway to estimate models of well-being, but in the interim this is the best we can do without being dishonest.


By focusing on projects evaluated by impact evaluations, aren’t you effectively shutting out smaller NGOs that may have more worthwhile projects but fewer resources?

Not all projects are hard to evaluate; a lot depends on the particular project. Some programs are already set up in such a way that evaluation is easy, as with some educational programs that can be evaluated using state test scores.[1] Further, it should be noted that impact evaluations have traditionally been carried out by academics, and academics do not have the right incentives to produce cost-effective evaluations. We see plenty of scope for lower-cost evaluations.

NGO staff time may be constrained, but there are outside foundations that sometimes provide funds for evaluation that can help. Thus, while smaller organizations may sometimes have a harder time conducting an evaluation, they are not completely shut out.

AidGrade is also working to lessen this inequality. We solicit donations to fund impact evaluations for small organizations, providing funding for both the program to be evaluated and any additional resources required for the evaluation, such as data collection. We offer donors detailed results in exchange for their donation.


By focusing on projects evaluated by impact evaluations, aren’t you effectively shutting out projects that cannot be evaluated by impact evaluation?

This is an important issue. If carefully designed, randomized controlled trials (RCTs) and other impact evaluations can capture many spillover effects, but the wider and more dispersed the array of effects, the less likely the impact evaluation is to capture them. RCTs and other impact evaluations are also inappropriate for some projects for ethical reasons, such as if a drug were available to treat a disease and there were no resource constraints. Finally, impact evaluations do cost some money. While we noted earlier that they are not necessarily that expensive, costs matter, and it is not the case that spending any amount of money would be worthwhile in order to document effectiveness. At a certain point, it is no longer worth it to evaluate a program. If one’s mental model is that most small charities are slightly helpful, one may not find reason to embark on an expensive evaluation. However, we believe that, overall, AidGrade’s emphasis on more evaluation is helpful, as resources are presently too often allocated to less effective projects.

It is also possible to evaluate many outcomes that people often do not immediately think are quantifiable. For example, there are evaluations that focus on whether a program empowered its beneficiaries.


What if a program or organization targets a population that is particularly difficult to treat? These organizations should not be penalized for taking on a harder task!

We are aware of this issue and would make the following argument. An organization typically faces smaller gains from treatment for one of two reasons: either the population is doing well enough that further improvements are more difficult than they would be otherwise, or the population is doing particularly poorly, so that many obstacles reduce the effect of treatment. If only small gains are possible because a population is already doing very well, it is acceptable to penalize the organization, because we believe that aid should target those who are not as well-off. If only small gains are possible because the population is doing poorly across many dimensions and faces many obstacles, the fact that little progress can be made in this population should still matter and serve as a warning sign. For example, suppose a study found that providing textbooks alone does not improve curricular achievement when teachers are bad. If this is generally true, it makes less sense to fund programs that only provide textbooks with no other support, because the returns will be negligible.

Some unfairness may remain, but we believe this is the best and most objective way of proceeding. For the sake of efficiency, we do want to encourage organizations to work in areas in which they can see the highest returns.


By opening up development data in this way, aren’t you running the risk that someone will use it to get any results they want by manipulating the filters they select?

This is definitely a concern, and we warn against it. However, we think the benefits of putting all the data out there outweigh the drawbacks; by making it open and encouraging discussion, we believe the more plausible filters will prevail.


What do you think of quasi-experimental methods?

They are great and certainly can be very useful depending on the context. While we refer mostly to RCTs throughout this text, we welcome all well-designed studies.

[1] For example, Bettinger and Baker evaluated InsideTrack college coaching in an evaluation involving 13,000 students at eight colleges for less than $20,000, and Roland Fryer studied a teacher incentive program covering 396 public schools for $50,000. Source: Robert Slavin, 2011. “Gold-standard program evaluations, on a shoestring budget”, Education Week. http://blogs.edweek.org/edweek/sputnik/2011/10/gold-standard_program_evaluations_on_a_shoestring_budget.html , accessed Dec. 25, 2011.