Originally posted on Eva Vivalt’s blog.
An excerpt from ongoing work using AidGrade’s database of impact evaluation results in development economics.
These are results from caliper tests, which compare the number of results just above a critical threshold (t = 1.96) with the number just below it. You can vary the width of the band; for example, a 5% caliper looks at the range 1.862 to 2.058. If you see a jump at 1.96, you might suspect specification searching, in which researchers report only the results they like, which biases the reported findings.
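To make the band concrete, here is a minimal sketch of how a caliper window and its over/under counts could be computed. The function names and the example t-statistics are made up for illustration. Using absolute values and counting a result at exactly 1.96 as "over" are assumptions, not something taken from the working paper.

```python
CRITICAL_T = 1.96  # critical threshold used in the post


def caliper_bounds(width, critical=CRITICAL_T):
    """Return (lower, upper) for a relative caliper width, e.g. 0.05 for 5%."""
    return critical * (1 - width), critical * (1 + width)


def caliper_counts(t_stats, width, critical=CRITICAL_T):
    """Count results just over and just under the threshold inside the caliper.

    Assumption: the test is applied to |t|, and t == critical counts as "over".
    """
    lower, upper = caliper_bounds(width, critical)
    over = sum(1 for t in t_stats if critical <= abs(t) <= upper)
    under = sum(1 for t in t_stats if lower <= abs(t) < critical)
    return over, under


# Hypothetical t-statistics, NOT drawn from AidGrade's data.
example_t = [1.70, 1.88, 1.93, 1.97, 2.01, -2.04, 2.30]

lower, upper = caliper_bounds(0.05)
print(round(lower, 3), round(upper, 3))   # 1.862 2.058
print(caliper_counts(example_t, 0.05))    # (3, 2)
```

Widening the caliper pulls in more results but also compares results that are less alike, which is why the tables report several widths.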
Sample / caliper | Over | Under | p-value | Significance level |
---|---|---|---|---|
All studies | ||||
2.5% Caliper | 45 | 26 | 0.02 | <0.05 |
5% Caliper | 73 | 51 | 0.03 | <0.05 |
10% Caliper | 127 | 117 | 0.28 | |
15% Caliper | 182 | 185 | 0.58 | |
20% Caliper | 220 | 231 | 0.71 | |
RCTs | ||||
2.5% Caliper | 24 | 14 | 0.07 | <0.10 |
5% Caliper | 35 | 28 | 0.22 | |
10% Caliper | 64 | 68 | 0.67 | |
15% Caliper | 97 | 107 | 0.78 | |
20% Caliper | 119 | 134 | 0.84 | |
Non-RCTs | ||||
2.5% Caliper | 21 | 12 | 0.08 | <0.10 |
5% Caliper | 38 | 23 | 0.04 | <0.05 |
10% Caliper | 63 | 49 | 0.11 | |
15% Caliper | 85 | 78 | 0.32 | |
20% Caliper | 101 | 97 | 0.42 | |
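For readers who want to see where a p-value like these could come from, here is a minimal sketch. It assumes a one-sided exact binomial test: if nothing odd is happening at the threshold, a result inside a narrow caliper should be about equally likely to fall on either side of 1.96. The post does not spell out the exact test used, so treat this as an illustration. The function name is hypothetical.

```python
from math import comb


def caliper_p_value(over, under):
    """One-sided exact binomial p-value.

    Assumption: under the null, each result in the caliper lands over or
    under the threshold with probability 0.5, so we compute
    P(X >= over) for X ~ Binomial(over + under, 0.5).
    """
    n = over + under
    tail = sum(comb(n, k) for k in range(over, n + 1))
    return tail / 2 ** n


# Counts taken from the "All studies" rows of the table above.
for label, over, under in [("2.5%", 45, 26), ("10%", 127, 117), ("15%", 182, 185)]:
    print(label, round(caliper_p_value(over, under), 2))
```

If the assumption is right, the printed values should land close to the p-value column in the table. Note that when "under" exceeds "over", a one-sided p-value comes out above 0.5.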
Okay, there seems to be a jump, possibly more so among quasi-experimental studies (the non-RCTs) than among RCTs.
Overall, though, this jump is actually quite small. Gerber and Malhotra did the same kinds of tests for political science and sociology. They used different selection criteria when gathering their papers, essentially maximizing the probability they would see a jump, but take a look at their numbers:
Political science:
Journal / caliper | Over | Under | p-value |
---|---|---|---|
A. APSR | |||
Vol. 89-101 | |||
10% Caliper | 49 | 15 | <0.001 |
15% Caliper | 67 | 23 | <0.001 |
20% Caliper | 83 | 33 | <0.001 |
Vol. 96-101 | |||
10% Caliper | 36 | 11 | <0.001 |
15% Caliper | 46 | 17 | <0.001 |
20% Caliper | 55 | 21 | <0.001 |
Vol. 89-95 | |||
10% Caliper | 13 | 4 | 0.02 |
15% Caliper | 21 | 6 | 0.003 |
20% Caliper | 28 | 12 | 0.008 |
B. AJPS | |||
Vol. 39-51 | |||
10% Caliper | 90 | 38 | <0.001 |
15% Caliper | 128 | 66 | <0.001 |
20% Caliper | 165 | 95 | <0.001 |
Vol. 46-51 | |||
10% Caliper | 56 | 25 | <0.001 |
15% Caliper | 80 | 45 | 0.001 |
20% Caliper | 105 | 66 | 0.002 |
Vol. 39-45 | |||
10% Caliper | 34 | 13 | 0.002 |
15% Caliper | 48 | 21 | <0.001 |
20% Caliper | 60 | 29 | <0.001 |
Sociology:
Journal / caliper | Over | Under | p-value |
---|---|---|---|
ASR (Vols. 68-70) | |||
5% Caliper | 15 | 4 | 0.01 |
10% Caliper | 26 | 15 | 0.06 |
15% Caliper | 47 | 17 | <0.001 |
20% Caliper | 54 | 19 | <0.001 |
AJS (Vols. 109-111) | |||
5% Caliper | 16 | 4 | 0.006 |
10% Caliper | 25 | 11 | 0.01 |
15% Caliper | 41 | 14 | <0.001 |
20% Caliper | 48 | 18 | <0.001 |
TSQ (Vols. 44-46) | |||
5% Caliper | 13 | 4 | 0.02 |
10% Caliper | 22 | 7 | 0.004 |
15% Caliper | 26 | 11 | 0.01 |
20% Caliper | 30 | 20 | 0.1 |
Combined (recent vols.) | |||
5% Caliper | 44 | 12 | <0.001 |
10% Caliper | 73 | 33 | <0.001 |
15% Caliper | 114 | 42 | <0.001 |
20% Caliper | 132 | 57 | <0.001 |
ASR (Vols. 58-60) | |||
5% Caliper | 17 | 2 | <0.001 |
10% Caliper | 22 | 5 | <0.001 |
15% Caliper | 27 | 11 | 0.007 |
20% Caliper | 30 | 15 | 0.02 |
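One informal way to put these tables side by side is the ratio of results just over the threshold to results just under it, at the same caliper width. This ratio is not a statistic the post itself reports; it is a rough sketch for comparison, and the function name is hypothetical. The counts are the 10% caliper rows from the tables above.

```python
def over_under_ratio(over, under):
    """Ratio of results just over the threshold to results just under it."""
    return over / under


# 10% caliper counts copied from the tables above.
rows = {
    "AidGrade, all studies": (127, 117),
    "APSR, Vol. 89-101": (49, 15),
    "AJPS, Vol. 39-51": (90, 38),
    "Sociology, combined (recent vols.)": (73, 33),
}

for label, (over, under) in rows.items():
    print(f"{label}: {over_under_ratio(over, under):.2f}")
```

A ratio near 1 means results are spread about evenly around 1.96; a ratio well above 1 means far more results sit just over the line than just under it.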
Wow! Economics is not doing so badly after all! (Some public health papers are also included, but results are comparable if you break it down.) To match Gerber and Malhotra, these tables all report the number of results rather than the number of papers. Papers sometimes report more than one result, so there are some subtleties here that I get into in the longer working paper. Data are still being gathered, and there is much more to be said on this topic. If you’d like to see more of this kind of work on research credibility, please support us in the last few days of our Indiegogo campaign!