Do randomized controlled trials engage in less specification searching?

Originally posted on Eva Vivalt’s blog.

An excerpt from ongoing work, drawing on AidGrade’s database of impact evaluation results in development economics.

These are results from caliper tests, which essentially compare the number of results just above a critical threshold (t=1.96) with the number just below it. You can vary the width of the band; for example, a 5% caliper would look at the range 1.862 – 2.058, i.e. 1.96 plus or minus 5%. If you see a jump at 1.96, you might suspect specification searching is going on, in which researchers only report the results they like, biasing the results.
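To make the band concrete, here is a minimal Python sketch of how the caliper edges and the over/under counts could be computed. The function names, the use of absolute t-values, and the example numbers are all assumptions for illustration; they are not taken from AidGrade's code or data.

```python
def caliper_bounds(threshold=1.96, width=0.05):
    """Return the (lower, upper) edges of a caliper around the threshold."""
    return threshold * (1 - width), threshold * (1 + width)


def caliper_counts(t_stats, threshold=1.96, width=0.05):
    """Count results just over and just under the threshold within the caliper."""
    lower, upper = caliper_bounds(threshold, width)
    # Assumption: significance is two-sided, so only the magnitude of t matters.
    magnitudes = [abs(t) for t in t_stats]
    over = sum(1 for t in magnitudes if threshold <= t <= upper)
    under = sum(1 for t in magnitudes if lower <= t < threshold)
    return over, under


# Made-up t-statistics, purely for illustration (not AidGrade data).
example_t_stats = [0.40, 1.88, 1.91, 1.97, 2.00, -2.03, 2.05, 2.50, 3.10]

lower, upper = caliper_bounds()
print(round(lower, 3), round(upper, 3))
print(caliper_counts(example_t_stats))
```

Results far from the threshold (such as 0.40 or 3.10 here) fall outside the band and are ignored, which is the point of the caliper: only the results right around 1.96 are compared.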

	Over	Under	p-value	*
All studies
2.5% Caliper	45	26	0.02	<0.05
5% Caliper	73	51	0.03	<0.05
10% Caliper	127	117	0.28
15% Caliper	182	185	0.58
20% Caliper	220	231	0.71
RCTs
2.5% Caliper	24	14	0.07	<0.10
5% Caliper	35	28	0.22
10% Caliper	64	68	0.67
15% Caliper	97	107	0.78
20% Caliper	119	134	0.84
Non-RCTs
2.5% Caliper	21	12	0.08	<0.10
5% Caliper	38	23	0.04	<0.05
10% Caliper	63	49	0.11
15% Caliper	85	78	0.32
20% Caliper	101	97	0.42

Okay, there seems to be a jump at the narrowest calipers, possibly more so among quasi-experimental studies (the non-RCTs) than among RCTs.
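The post does not say which test produced the p-values above. As a sketch, assume a one-sided exact binomial test: if nothing unusual happens at 1.96, a result inside the caliper should be about equally likely to land over or under the threshold, so the "over" count is compared against a fair coin. The function name below is hypothetical.

```python
from math import comb


def one_sided_binomial_p(over, under):
    """P(at least `over` successes in over + under fair coin flips)."""
    n = over + under
    # Sum the upper tail of Binomial(n, 0.5) exactly using integer arithmetic.
    tail = sum(comb(n, k) for k in range(over, n + 1))
    return tail / 2 ** n


# (label, over, under) rows copied from the "All studies" block of the table.
rows = [
    ("2.5% Caliper", 45, 26),
    ("5% Caliper", 73, 51),
    ("10% Caliper", 127, 117),
    ("15% Caliper", 182, 185),
    ("20% Caliper", 220, 231),
]

for label, over, under in rows:
    print(label, round(one_sided_binomial_p(over, under), 2))
```

Under this assumption the computed values should land close to the reported ones, small for the narrow calipers and above 0.5 once "under" exceeds "over", but treat this as a plausibility check rather than a reproduction of the actual analysis.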

Overall, though, this jump is actually quite small. Gerber and Malhotra did the same kinds of tests for political science and sociology. They used different selection criteria when gathering their papers, essentially maximizing the probability they would see a jump, but take a look at their numbers:

Political science:

	Over	Under	p-value
A. APSR
Vol. 89-101
10% Caliper	49	15	<0.001
15% Caliper	67	23	<0.001
20% Caliper	83	33	<0.001
Vol. 96-101
10% Caliper	36	11	<0.001
15% Caliper	46	17	<0.001
20% Caliper	55	21	<0.001
Vol. 89-95
10% Caliper	13	4	0.02
15% Caliper	21	6	0.003
20% Caliper	28	12	0.008
B. AJPS
Vol. 39-51
10% Caliper	90	38	<0.001
15% Caliper	128	66	<0.001
20% Caliper	165	95	<0.001
Vol. 46-51
10% Caliper	56	25	<0.001
15% Caliper	80	45	0.001
20% Caliper	105	66	0.002
Vol. 39-45
10% Caliper	34	13	0.002
15% Caliper	48	21	<0.001
20% Caliper	60	29	<0.001

Sociology:

	Over	Under	p-value
ASR (Vols. 68-70)
5% Caliper	15	4	0.01
10% Caliper	26	15	0.06
15% Caliper	47	17	<0.001
20% Caliper	54	19	<0.001
AJS (Vols. 109-111)
5% Caliper	16	4	0.006
10% Caliper	25	11	0.01
15% Caliper	41	14	<0.001
20% Caliper	48	18	<0.001
TSQ (Vols. 44-46)
5% Caliper	13	4	0.02
10% Caliper	22	7	0.004
15% Caliper	26	11	0.01
20% Caliper	30	20	0.1
Combined (recent vols.)
5% Caliper	44	12	<0.001
10% Caliper	73	33	<0.001
15% Caliper	114	42	<0.001
20% Caliper	132	57	<0.001
ASR (Vols. 58-60)
5% Caliper	17	2	<0.001
10% Caliper	22	5	<0.001
15% Caliper	27	11	0.007
20% Caliper	30	15	0.02

Wow! Economics is not doing so badly after all! (Some public health papers are also included, but results are comparable if you break it down.) To match Gerber and Malhotra, these tables all report the number of results rather than the number of papers, and papers sometimes report more than one result, so there are some subtleties here that I get into in the longer working paper. Data are still being gathered, and there is much more to be said on this topic. If you’d like to see more of this kind of work on research credibility, please support us in the last few days of our Indiegogo campaign!