Crowdsourcing the credibility of empirical research.

Scientists can only rely on an empirical finding if it is credible. In science, a credible finding is one that has (at minimum) survived scrutiny along 3 dimensions: (1) method/data transparency, (2) analytic reproducibility/robustness, and (3) effect replicability. Curate Science is a platform to crowdsource the credibility of empirical research by curating its transparency, reproducibility/robustness, and replicability.

UPDATE (April 19, 2018): New unified curation framework released (version 5.2.0) and important 2-year grant secured to scale up the platform (see here for details).

Curated List of Large-Scale Replication Efforts

Searchable table of N=1,058 replications of 168 effects from the published cognitive and social psychology literature.

Examples: "RPP" for Reproducibility Project: Psychology replications; "ML1" or "ML3" for Many Labs 1 or 3 replications; "RRR" for Registered Replication Reports; "SP: Spec" for Social Psychology's Special Issue replications. For topical searches, try "priming", "anchoring", "gambler's fallacy", "love", "moral" (for morality), or "power posing."

Icon legend: icons indicate open data, open study materials, a pre-registered protocol, and a link to a replication's associated evidence collection. To sort replications, click on column headers. Reveal overflow text (...) by hovering over a cell. For details about replication outcome values, hover over the cell and see the About section. For additional replication study characteristics (and to see hidden imprecise large-scale-effort replications), see our public gSheet (see also our GitHub repo).

Evidence collections (or "replication collections") group together replications of specific effects/hypotheses, organizing them by different ways of operationalizing/generalizing an effect. This structure allows enhanced visualization of replication effect sizes via familiar forest plots and depicts meta-analytic effect size estimates of replications for each operationalization of an effect (assuming >1 replication is available). Evidence collections also allow the curation of replications of sets of effects as predicted by a broader theory (e.g., see ego depletion theory evidence collection).
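
To make this structure concrete, below is a minimal sketch, in R with the metafor package, of how an evidence collection's replications might be grouped by operationalization and meta-analyzed per group. The column names and effect sizes are hypothetical placeholders, not the platform's actual schema or data.

```r
# Sketch of an evidence collection: replications grouped by
# operationalization/generalization, with one random-effects
# meta-analysis per group. All values are hypothetical placeholders.
library(metafor)

collection <- data.frame(
  study = c("Replication A", "Replication B", "Replication C", "Replication D"),
  group = c("Operationalization #1", "Operationalization #1",
            "Generalization #1", "Generalization #1"),
  d     = c(0.10, -0.05, 0.20, 0.02),  # replication effect sizes (Cohen's d)
  se    = c(0.15, 0.12, 0.18, 0.10)    # standard errors of d
)

# Meta-analytic estimate per group (assuming >1 replication per group)
estimates <- lapply(split(collection, collection$group), function(grp) {
  res <- rma(yi = grp$d, sei = grp$se, method = "REML")  # random-effects model
  c(est = as.numeric(res$b), ci.lb = res$ci.lb, ci.ub = res$ci.ub)
})
estimates
```

Forest plots like those shown in the collections below can then be drawn per group with metafor's forest() function.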

Below are example evidence collections on the following effects/phenomena: cleanliness priming, money priming, Macbeth effect, warmth embodiment, ego depletion, mood boosts helping, verbal overshadowing.

Cleanliness priming -- Replications (7)  

Original Studies & Replications N Effect size (d) [95% CI]
Operationalization #1: Cleanliness priming (scrambled task) reduces severity of moral judgments
Schnall et al. (2008a) Study 1 40
Arbesfeld et al. (2014) 60
Besman et al. (2013) 60
Huang (2014) Study 1 189
Lee et al. (2013) 90
Johnson et al. (2014a) Study 1 208
Johnson et al. (2014b) 736
Current meta-analytic estimate of operationalization #1 replications:
Generalization #1: Cleanliness priming (hand washing) reduces severity of moral judgments
Schnall et al. (2008a) Study 2 43
Johnson et al. (2014a) Study 2 126
[Underlying data (CSV)] [R-code]

Summary: The main finding that cleanliness priming reduces the severity of moral judgments currently exhibits replication difficulties (overall meta-analytic effect: r = -.08 [+/-.13]). In a follow-up commentary, Schnall argued that a ceiling effect in Johnson et al.'s (2014a) studies renders their results uninterpretable and hence that their replication results should be dismissed. However, independent re-analyses by Simonsohn, Yarkoni, Schönbrodt, Inbar, Fraley, and Simkovic appear to rule out such a ceiling-effect explanation; hence, Johnson et al.'s (2014a) results should be retained in gauging the replicability of the original cleanliness priming effect. Of course, it's possible "cleanliness priming" may be replicable under different operationalizations, conditions, and/or experimental designs (e.g., within-subjects).

Money priming -- Replications (42)  

Original Studies & Replications N Effect size (d) [95% CI]
Operationalization #1: Exposure to money (scrambled sentence) reduces helping (coding sheets for RA)
Vohs et al. (2006) Study 3 39
Grenier et al. (2012) 40
Generalization #1: Exposure to money (scrambled sentence) strengthens social inequality beliefs (just world beliefs)
Caruso et al. (2013) Study 2 168
Schuler & Wänke (in press) Study 2 115
Rohrer et al. (2015) Study 2 420
Current meta-analytic estimate of generalization #1 replications:
Generalization #2: Exposure to money (scrambled sentence) strengthens social inequality beliefs (social dominance orientation)
Caruso et al. (2013) Study 3 80
Rohrer et al. (2015) Study 3 156
Generalization #3: Exposure to money (instruction background) strengthens social inequality beliefs (fair market ideology)
Caruso et al. (2013) Study 4 48
Rohrer et al. (2015) Study 4 116
Generalization #4: Exposure to money (instruction background) strengthens social inequality beliefs (system justification scale)
Caruso et al. (2013) Study 1 30
Hunt & Krueger (2014) 87
Cheong (2014) 102
Devos (2014) 162
Swol (2014) 96
John & Skorinko (2014) 87
Davis & Hicks (2014) Study 1 187
Kappes (2014) 277
Klein et al. (2014) 127
Packard (2014) 112
Vranka (2014) 84
Cemalcilar (2014) 113
Bocian & Frankowska (2014) Study 2 169
Huntsinger & Mallett (2014) 146
Rohrer et al. (2015) Study 1 136
Schmidt & Nosek (2014, MTURK) 1000
Hovermale & Joy-Gaba (2014) 108
Vianello & Galliani (2014) 144
Schmidt & Nosek (2014, PI) 1329
Bernstein (2014) 84
Adams & Nelson (2014) 95
Rutchick (2014) 96
Vaughn (2014) 90
Levitan (2014) 123
Brumbaugh & Storbeck (2014) Study 1 103
Smith (2014) 107
Kurtz (2014) 174
Brumbaugh & Storbeck (2014) Study 2 86
Wichman (2014) 103
Pilati (2014) 120
Davis & Hicks (2014) Study 2 225
Furrow & Thompson (2014) 85
Bocian & Frankowska (2014) Study 1 79
Brandt et al. (2014) 80
Nier (2014) 95
Woodzicka (2014) 90
Schmidt & Nosek (2014) 81
Morris (2014) 98
Current meta-analytic estimate of generalization #4 replications:
[Underlying data (CSV)] [R-code]

Summary: The claim that incidental exposure to money influences social behavior/beliefs currently exhibits replication difficulties (overall meta-analytic effect: d = -.01 [+/-.05]). This appears to be the case whether money exposure is manipulated via instruction background images (Caruso et al., 2013, Studies 1 & 4) or via a scrambled-sentence task (Vohs et al., 2006, Study 3), and whether the outcome variable is helping others (Vohs et al., 2006, Study 3), system justification beliefs (Caruso et al., 2013, Study 1), just world beliefs (Caruso et al., 2013, Study 2), social dominance beliefs (Caruso et al., 2013, Study 3), or fair market beliefs (Caruso et al., 2013, Study 4). Of course, it's possible money exposure reliably influences behavior under other (currently unknown) conditions, via other operationalizations, and/or using other experimental designs (e.g., within-subjects).

Macbeth effect -- Replications (11)  

Original Studies & Replications N Effect size (r) [95% CI]
Operationalization #1: Moral purity threat (transcribe text) boosts need to cleanse oneself (cleaning products desirability)
Zhong & Liljenquist (2006) Study 2 27
Earp et al. (2014) Study 3 286
Siev (2012) Study 2 148
Earp et al. (2014) Study 2 156
Siev (2012) Study 1 335
Earp et al. (2014) Study 1 153
Gamez et al. (2011) Study 2 36
Current meta-analytic estimate of operationalization #1 replications:
Generalization #1: Moral purity threat (recall [un]ethical act) boosts need to cleanse oneself (product choice)
Zhong & Liljenquist (2006) Study 3 32
Fayard et al. (2009) Study 1 210
Gamez et al. (2011) Study 3 45
Current meta-analytic estimate of generalization #1 replications:
Generalization #2: Physical cleansing (antiseptic wipe) reduces volunteerism (helping RA)
Zhong & Liljenquist (2006) Study 4 45
Fayard et al. (2009) Study 2 115
Gamez et al. (2011) Study 4 28
Reuven et al. (2013) 29
Current meta-analytic estimate of generalization #2 replications:
[Underlying data (CSV)] [R-code]

Summary: The claim that a threat to one's moral purity induces the need to cleanse oneself (the "Macbeth effect") currently exhibits replication difficulties (overall meta-analytic effect: r = -.02 [+/-.05]). This appears to be the case whether moral purity threat is manipulated via transcribing text describing an unethical vs. ethical act (Study 2) or via recalling an unethical vs. ethical deed (Study 3), and whether the need to cleanse oneself is measured via desirability of cleaning products (Study 2), product choice (Study 3), or reduced volunteerism after cleansing (Study 4). Of course, it is possible the "Macbeth effect" is replicable under different operationalizations and/or experimental designs (e.g., within-subjects).

Physical warmth embodiment -- Replications (14)  

Original Studies & Replications N Effect size (r) [95% CI]
Operationalization #1: Trait loneliness (UCLA loneliness scale) positively associated with warmer bathing
Bargh & Shalev (2012) Study 1a 51
Bargh & Shalev (2012) Study 1b 41
Donnellan et al. (2015a) Study 9 197
Donnellan et al. (2015a) Study 4 228
Donnellan et al. (2015a) Study 1 235
Donnellan et al. (2015b) 291
Ferrell et al. (2013) 365
McDonald & Donnellan (2015) 356
Donnellan et al. (2015a) Study 2 480
Donnellan et al. (2015a) Study 8 365
Donnellan et al. (2015a) Study 7 311
Donnellan & Lucas (2014) 531
Donnellan et al. (2015a) Study 6 553
Donnellan et al. (2015a) Study 5 494
Donnellan et al. (2015a) Study 3 210
Current meta-analytic estimate of operationalization #1 replications:
Generalization #1: Physical coldness (frozen cold-pack) boosts reported feelings of chronic loneliness
Bargh & Shalev (2012) Study 2 75
Wortman et al. (2014) 260
[Underlying data (CSV)] [R-code]

Summary: The claim that physical warmth influences psychological social warmth currently exhibits replication difficulties (overall meta-analytic effect: r = .007 [+/-.035]), at least via Bargh and Shalev's (2012) Study 1 and 2 operational tests (Study 1: trait loneliness is positively associated with warmer bathing; Study 2: briefly holding a frozen cold-pack boosts reported feelings of chronic loneliness). Regarding the first operational test, the loneliness-shower effect doesn't appear replicable whether (1) trait loneliness is measured using the complete 20-item UCLA Loneliness Scale (Donnellan et al., 2015, Studies 1-4) or a 10-item modified version of the scale (Donnellan et al., 2015, Studies 5-9, as in Bargh & Shalev, 2012, Studies 1a and 1b), (2) warm bathing is measured via a "physical warmth index" (all replications, as in Bargh & Shalev, 2012, Studies 1a and 1b) or via the arguably more hypothesis-relevant water temperature item (all replications of Bargh & Shalev Study 1), or (3) participants were sampled from Michigan (Donnellan et al., 2015, Studies 1-9), Texas (Ferrell et al., 2013), or Israel (McDonald & Donnellan, 2015). Of course, different operationalizations of the idea may yield replicable evidence, e.g., in different domains, contexts, or using other experimental designs (e.g., within-subjects).

Ego depletion theory -- Replications (32)  
Muraven, Tice, & Baumeister (1998) 
Self-control as limited resource: Regulatory depletion patterns
Baumeister, Bratslavsky, Muraven, & Tice (1998) 
Ego depletion: Is the active self a limited resource?

Original Studies & Replications N Effect size (d) [95% CI]
Prediction 1: Glucose consumption counteracts ego depletion
Gaillot, Baumeister et al. (2007) Study 7 61
Cesario & Corker (2010) 119
Wang & Dvorak (2010) 61
Lange & Eggert (2014) Study 1 70
Current meta-analytic estimate of Prediction 1 replications:
Prediction 2: Self-control impairs further self-control (ego depletion)
Muraven, Tice et al. (1998) Study 2 34
Murtagh & Todd (2004) Study 2 51
Schmeichel, Vohs et al. (2003) Study 1 24
Pond et al. (2011) Study 3 128
Schmeichel (2007) Study 1 79
Healy et al. (2011) Study 1 38
Carter & McCullough (2013) 138
Lurquin et al. (2016) 200
Inzlicht & Gutsell (2007) 33
Wang, Yang, & Wang (2014) 31
Sripada, Kessler, & Jonides (2014) 47
Ringos & Carlucci (2016) 68
Wolff, Muzzi & Brand (2016) 87
Calvillo & Mills (2016) 75
Crowell, Finley et al. (2016) 73
Lynch, vanDellen et al. (2016) 79
Birt & Muise (2016) 59
Yusainy, Wimbarti et al. (2016) 156
Lau & Brewer (2016) 99
Ullrich, Primoceri et al. (2016) 103
Elson (2016) 90
Cheung, Kroese et al. (2016) 181
Hagger & Chatzisarantis (2016) 101
Schlinkert, Schrama et al. (2016) 79
Philipp & Cannon (2016) 75
Carruth & Miyake (2016) 126
Brandt (2016) 102
Stamos, Bruyneel et al. (2016) 93
Rentzsch, Nalis et al. (2016) 103
Francis & Inzlicht (2016) 50
Lange, Heise et al. (2016) 106
Evans, Fay, & Mosser (2016) 89
Tinghög & Koppel (2016) 82
Otgaar, Martijn et al. (2016) 69
Muller, Zerhouni et al. (2016) 78
Current meta-analytic estimate of Prediction 2 replications:
[Underlying data (CSV)] [R-code]
Original Studies & Replications | Independent Variables | Dependent Variables | Design Differences | Active Sample Evidence
Prediction 1: Glucose consumption counteracts ego depletion
Gaillot, Baumeister et al. (2007) Study 7 sugar vs. splenda
video attention task vs. control
Stroop performance -
Cesario & Corker (2010) sugar vs. splenda
video attention task vs. control
Stroop performance No manipulation check Positive correlation between baseline & post-manipulation error rates, r = .36, p < .001
Wang & Dvorak (2010) sugar vs. splenda
future-discounting t1 vs. t2
future-discounting task -
Lange & Eggert (2014) Study 1 sugar vs. splenda
future-discounting t1 vs. t2
future-discounting task different choices in future-discounting task test-retest reliability of r = .80 across t1 and t2 scores
Prediction 2: Self-control impairs further self-control (ego depletion)
Muraven, Tice et al. (1998) Study 2 thought suppression vs. control anagram performance -
Murtagh & Todd (2004) Study 2 thought suppression vs. control anagram performance very difficult solvable anagrams used rather than "unsolvable"
Schmeichel, Vohs et al. (2003) Study 1 video attention task vs. control GRE standardized test -
Pond et al. (2011) Study 3 video attention task vs. control GRE standardized test 10 verbal GRE items used (instead of 13 analytic GRE items)
Schmeichel (2007) Study 1 video attention task vs. control working memory (OSPAN) -
Healy et al. (2011) Study 1 video attention task vs. control working memory (OSPAN) % of target words recalled (rather than total)
Carter & McCullough (2013) video attention task vs. control working memory (OSPAN) Effortful essay task vs. control in between IV and DV (perfectly confounded w/ IV)
Lurquin et al. (2016) video attention task vs. control working memory (OSPAN) 40 target words in OSPAN (rather than 48) Main effect of OSPAN set sizes on performance, F(1, 199) = 4439.81, p < .001
Inzlicht & Gutsell (2007) emotion suppression (video) vs. control EEG ERN during stroop task -
Wang, Yang, & Wang (2014) emotion suppression (video) vs. control EEG ERN during stroop task
Sripada, Kessler, & Jonides (2014) effortful letter crossing vs. control multi-source interference task (MSIT; RTV) -
Ringos & Carlucci (2016) effortful letter crossing vs. control multi-source interference task (MSIT; RTV)
Wolff, Muzzi & Brand (2016) effortful letter crossing vs. control multi-source interference task (MSIT; RTV) German language
Calvillo & Mills (2016) effortful letter crossing vs. control multi-source interference task (MSIT; RTV)
Crowell, Finley et al. (2016) effortful letter crossing vs. control multi-source interference task (MSIT; RTV)
Lynch, VanDellen et al. (2016) effortful letter crossing vs. control multi-source interference task (MSIT; RTV)
Birt & Muise (2016) effortful letter crossing vs. control multi-source interference task (MSIT; RTV)
Yusainy, Wimbarti et al. (2016) effortful letter crossing vs. control multi-source interference task (MSIT; RTV) Indonesian language
Lau & Brewer (2016) effortful letter crossing vs. control multi-source interference task (MSIT; RTV)
Ullrich, Primoceri et al. (2016) effortful letter crossing vs. control multi-source interference task (MSIT; RTV) German language
Elson (2016) effortful letter crossing vs. control multi-source interference task (MSIT; RTV) German language
Cheung, Kroese et al. (2016) effortful letter crossing vs. control multi-source interference task (MSIT; RTV) Dutch language
Hagger & Chatzisarantis (2016) effortful letter crossing vs. control multi-source interference task (MSIT; RTV)
Schlinkert, Schrama et al. (2016) effortful letter crossing vs. control multi-source interference task (MSIT; RTV) Dutch language
Philipp & Cannon (2016) effortful letter crossing vs. control multi-source interference task (MSIT; RTV)
Carruth & Miyake (2016) effortful letter crossing vs. control multi-source interference task (MSIT; RTV)
Brandt (2016) effortful letter crossing vs. control multi-source interference task (MSIT; RTV) Dutch language
Stamos, Bruyneel et al. (2016) effortful letter crossing vs. control multi-source interference task (MSIT; RTV) Dutch language
Rentzsch, Nalis et al. (2016) effortful letter crossing vs. control multi-source interference task (MSIT; RTV) German language
Francis & Inzlicht (2016) effortful letter crossing vs. control multi-source interference task (MSIT; RTV)
Lange, Heise et al. (2016) effortful letter crossing vs. control multi-source interference task (MSIT; RTV) German language
Evans, Fay & Mosser (2016) effortful letter crossing vs. control multi-source interference task (MSIT; RTV)
Tinghög & Koppel (2016) effortful letter crossing vs. control multi-source interference task (MSIT; RTV)
Otgaar, Martijn et al. (2016) effortful letter crossing vs. control multi-source interference task (MSIT; RTV) Dutch language
Muller, Zerhouni et al. (2016) effortful letter crossing vs. control multi-source interference task (MSIT; RTV) French language
[Underlying data (CSV)] [R-code]

Summary: There appear to be replication difficulties across 6 different operationalizations of original studies supporting the two main predictions of the strength model of self-control (Baumeister et al., 2007). Prediction 1: Independent researchers appear unable to replicate the finding that glucose consumption counteracts ego depletion, whether self-control is measured via the Stroop task (Cesario & Corker, 2010, as in Gaillot et al., 2007, Study 7) or a future-discounting task (Lange & Eggert, 2014, Study 1, as in Wang & Dvorak, 2010). Prediction 2: There also appear to be replication difficulties (across 4 distinct operationalizations) for the basic ego depletion effect. This is the case whether the IV is manipulated via thought suppression, a video attention task, emotion suppression during video watching, or an effortful letter-crossing task, and whether the DV is measured via anagram performance, standardized tests, working memory, or a multi-source interference task. Wang et al. (2014) do appear to successfully replicate Inzlicht & Gutsell's (2007) finding that ego depletion led to reduced activity in the anterior cingulate (a region previously associated with conflict monitoring); however, this finding should be interpreted with caution given potential bias due to analytic flexibility in data exclusions and EEG analyses. Of course, ego depletion may reflect a replicable phenomenon under different conditions, contexts, and/or operationalizations; however, the replication difficulties across 6 different operationalizations suggest ego depletion might be much more nuanced than previously thought. Indeed, alternative models have recently been proposed (e.g., motivation/attention-based accounts, Inzlicht et al., 2014; mental fatigue, Inzlicht & Berkman, 2015) and novel intra-individual paradigms to measure ego depletion have emerged (Francis, 2014; Francis et al., 2015) that offer promising avenues for future research.

Mood on helping -- Replications (3)  

Original Studies & Replications N Effect size (Risk Difference) [95% CI]
Operationalization #1: Positive mood (finding dime in telephone booth) boosts helping (picking up dropped papers)
Isen & Levin (1972) Study 2 41
Blevins & Murphy (1974) 50
Generalization #1: Positive mood (finding dime in telephone booth) boosts helping (mailing "forgotten letter")
Levin & Isen (1975) Study 1 24
Weyant & Clark (1977) Study 2 106
Weyant & Clark (1977) Study 1 32
Current meta-analytic estimate of generalization #1 replications:
[Underlying data & R-code]

Summary: The claim that positive mood boosts helping appears to have replicability problems. Across three replications, individuals presumably in a positive mood (induced via finding a dime in a telephone booth) helped at about the same rate (29.6%) as those not finding a dime (29.8%; meta-analytic risk difference estimate = .03 [+/-.19]; in the original studies, 88.8% of dime-finding Ps helped compared to 13.9% of Ps in the control condition). This was the case whether helping was measured via picking up dropped papers (Blevins & Murphy, 1974, as in Isen & Levin, 1972, Study 2) or via mailing a "forgotten letter" (Weyant & Clark, 1977, Studies 1 & 2, as in Levin & Isen, 1975, Study 1). These negative replication results are insufficient to declare the mood-helping link unreplicable; however, they do warrant concern that additional unmodeled factors may be at play. For instance, it seems plausible that mood may influence helping in different ways for different individuals (e.g., negative, rather than positive, mood may boost helping in some individuals) and may also influence the same person differently on different occasions. Highly-repeated within-person (HRWP) designs (e.g., Whitsett & Shoda, 2014) would be a fruitful avenue to empirically investigate these more nuanced links between mood and helping behavior.
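
To illustrate the risk-difference metric used in this summary, here is a minimal sketch in R via metafor's escalc; the helping counts below are hypothetical placeholders, not the actual study data.

```r
# Risk difference (helping rate in the dime condition minus the control
# rate) with a Wald 95% CI. Counts below are hypothetical placeholders.
library(metafor)

dat <- escalc(measure = "RD",
              ai = 15, n1i = 50,  # dime condition: 15 of 50 helped
              ci = 14, n2i = 47)  # control condition: 14 of 47 helped

rd <- as.numeric(dat$yi)                   # risk difference estimate
ci <- rd + c(-1, 1) * 1.96 * sqrt(dat$vi)  # 95% confidence interval
round(c(RD = rd, lower = ci[1], upper = ci[2]), 3)
```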

Verbal overshadowing (RRR1 & RRR2) -- Replications (23)
Schooler & Engstler-Schooler (1990)      [For RRR1 studies, see replication table above (type "RRR1")]
Verbal overshadowing of visual memories: Some things are better left unsaid

Original Studies & Replications N Effect size [95% CI]
Operationalization #1: Verbal description (bank robber) reduces successful identification of perpetrator (lineup task)
Schooler & Engstler-Schooler (1990) Study 1 88
Poirer et al. (2014) 95
Delvenne et al. (2014) 98
Birt & Aucoin (2014) 65
Susa et al. (2014) 111
Carlson et al. (2014) 160
Musselman & Colarusso (2014) 78
Echterhoff & Kopietz (2014) 124
Mammarella et al. (2014) 104
Dellapaolera & Bornstein (2014) 164
Mitchell & Petro (2014) 109
Ulatowska & Cislak (2014) 106
Wade et al. (2014) 121
Birch (2014) 156
McCoy & Rancourt (2014) 89
Greenberg et al. (2014) 75
Alogna et al. (2014) 137
Michael et al. (2014, mTurk) 615
Koch et al. (2014) 67
Thompson (2014) 102
Rubinova et al. (2014) 110
Brandimonte (2014) 100
Eggleston et al. (2014) 93
Kehn et al. (2014) 113
Current meta-analytic estimate of operationalization #1 replications:
[Underlying data (CSV) & R-code]

Summary: The verbal overshadowing effect currently appears to be replicable; verbally describing a robber after a 20-minute delay decreased the correct identification rate in a lineup by 16% (from 54% [control] to 38% [verbal]; meta-analytic estimate = -16% [+/-.04], equivalent to r = .17). Still in question, however, is the validity and generalizability of the effect; hence it is still premature for public policy to be informed by verbal overshadowing evidence. Validity-wise, it is unclear whether verbal overshadowing is driven by a more conservative judgmental response bias or by reduced memory discriminability, because no "suspect-absent" lineups were used. This is important to clarify because it directly influences how eyewitness testimony should be treated (e.g., if verbal overshadowing is primarily driven by a more conservative response bias, identifications made after a verbal description should actually be given *more* [rather than less] weight; see Mickes & Wixted, 2015). Generalizability-wise, in a slight variant of RRR2 (i.e., RRR1), a much smaller overall verbal deficit of -4% [+/-.03] emerged when the lineup identification occurred 20 minutes after the verbal description (which itself occurred immediately after seeing the robbery). Future research needs to determine the size of verbal overshadowing when there is a delay both between the crime and the verbal description and before the lineup identification, conditions that better reflect real-world situations.

Scientists can only rely on an empirical finding if it is credible. In science, a credible finding is one that has (at minimum) survived scrutiny along 3 dimensions: (1) method/data transparency, (2) analytic reproducibility/robustness, and (3) effect replicability. However, no platform currently exists to find such information. Curate Science aims to fill this gap. It is a platform to crowdsource the credibility of empirical research by allowing researchers to curate the transparency, reproducibility/robustness, and replicability of published findings (for full details of our current unified curation framework, see here [version 5.2.0]). Our mission is to increase the cumulative and self-correcting nature of empirical research to accelerate the development of scientific knowledge and evidence-based applied innovations.

Crowdsourcing the credibility of empirical research creates the following value for various stakeholders of scientific research:

  1. Theory building and application:
    • It allows researchers to base beliefs about the credibility of effects on empirical evidence rather than authority (e.g., journal or university prestige).
    • It allows researchers to identify replicable effects that are ready to be extended (particularly useful for graduate students and early-career researchers).
    • It allows researchers to more accurately estimate effect sizes within a research area, yielding better estimates of sample sizes needed to achieve sufficient statistical power.
    • It allows researchers to identify important findings that have not yet been replicated and to commission such replications (via, e.g., StudySwap or the Psychological Science Accelerator).
  2. Meta-scientific:
    • It yields a rich database of transparently reported studies and replications that can be used for meta-science research to deepen our understanding of the predictors of replicability (e.g., original study p-value, sample size, study design); see the sketch after this list.
    • The platform can be used to track the transparency, reproducibility, robustness, and replicability of disciplines over time to gauge progress in achieving higher research integrity.
  3. Teaching/pedagogical: The searchable database can be used to teach about transparency and replication (e.g., by showing real-world examples of effects exhibiting different levels of replicability); it can also inform teachers about replicable effects that can justifiably be taught.
  4. Practical: It helps researchers locate publicly-available study materials for follow-up research and publicly available data sets for secondary (re-)analyses from alternative theoretical perspectives.
  5. Social normative:
    • Making it easier to find transparently reported research increases the likelihood that ambivalent or unaware researchers will adopt transparent practices themselves, accelerating a cultural shift toward transparent reporting as the research community's social norm.
    • By increasing the visibility of replication studies, the platform rewards the contributions of researchers who devote their time to replicating the work of others (a crucial activity for research to be cumulative).
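
To illustrate the meta-scientific point above, here is a minimal sketch in R of the kind of predictor analysis such a database would enable; the variable names and values are hypothetical stand-ins, not actual curated records.

```r
# Sketch: do an original study's p-value and sample size predict whether
# its replication detected a signal? All values below are hypothetical
# stand-ins for curated replication records.
replications <- data.frame(
  orig_p     = c(0.049, 0.004, 0.001, 0.020, 0.030, 0.002),  # original p-values
  orig_n     = c(40, 500, 300, 80, 450, 120),                # original sample sizes
  replicated = c(0, 0, 1, 0, 1, 0)   # 1 = signal detected in replication
)

# Logistic regression of replication success on original study features
fit <- glm(replicated ~ log(orig_p) + log(orig_n),
           family = binomial, data = replications)
summary(fit)
```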

Replicability Curation Approach

Our replicability curation approach involves 3 key steps:

  1. Organize replications by the different operationalizations of an effect/hypothesis (in "evidence collections")
  2. Gauge replication method similarity, replication deviations, & plausibility of auxiliary hypotheses/assumptions
  3. Evaluate statistical evidence of replications meta-analytically using a nuanced and principled approach.

1. Organize Replications in Evidence Collections

Replications are first organized in evidence collections (or "replication collections") by grouping them according to the different operationalizations/generalizations of the target effect/hypothesis. For example, shown below are replications of two different operationalizations of the "Macbeth effect" (i.e., threatened moral purity increases the need to cleanse oneself): (1) the effect of transcribing text describing unethical vs. ethical acts on cleaning product desirability (Zhong & Liljenquist, 2006, Study 2 [left]) and (2) the effect of recalling an unethical vs. ethical deed on product choice (Zhong & Liljenquist, 2006, Study 3 [right]):

2. Gauge Replication Method Similarity, Deviations, & Plausibility of Auxiliary Hypotheses

Beyond the transparency (compliance with reporting standards, open materials, pre-registration information, and open data) and analytic reproducibility of the replication studies (both crucial to the credibility of any study), three key replication study characteristics should be considered:

2.1 Replications need to be sufficiently methodologically similar to an original study, because only such replications can cast doubt on an effect/hypothesis and hence speak to replicability. Methodologically dissimilar "conceptual replications," on the other hand, can only speak to the generalizability of an effect, given that intentionally introduced major design differences render unsupportive evidence ambiguous (i.e., it could be due to the falsity of the original hypothesis or to one or more of the methodological changes). We have developed a replication taxonomy to help in this regard (see below). It distinguishes sufficiently methodologically similar direct replications ("Exact", "Very Close", or "Close") from insufficiently similar "conceptual replications" (i.e., generalizability studies, labeled "Far" or "Very Far").

2.2 All known methodological deviations of a replication (relative to an original study) should be considered. When replication evidence is favorable, such design differences show that the effect is robust across them. When replication evidence is unfavorable, such differences provide initial clues regarding potential boundary conditions of an effect (though the parsimonious possibility that an initially reported effect was due to a statistical fluke or error should also always be considered).

2.3 It is also important to consider evidence regarding the plausibility that auxiliary hypotheses held true (the assumptions required to test a substantive hypothesis; Meehl, 1990, p. 200), e.g., that the measurement instruments operated correctly and that participants understood the instructions. This is particularly important when a replication study reports a null finding, so that more mundane explanations for not detecting an effect can be ruled out (e.g., evidence of a successful manipulation check, or detection of a known replicable effect such as a semantic priming effect, can help rule out a data processing or experimenter error as the cause of an observed null finding).

3. Evaluate Replication Evidence in Nuanced Manner

Replicability is then gauged by statistically evaluating replication evidence at both (1) the individual replication study level (when only one or a few replications are available) and (2) the meta-analytic level (when several replications are available), synthesizing evidence across replication studies nested within distinct operationalizations/generalizations of an effect. Replication evidence needs to be interpreted in a nuanced manner by considering 3 distinct aspects: (1) whether a signal was detected, (2) the consistency of the replication effect size (ES) relative to the original, and (3) the precision of the replication ES estimate (LeBel, Vanpaemel, Cheung, & Campbell, 2018). These considerations yield the following "replication outcomes" (see below for visual depictions):

  1. signal - consistent: replication ES 95% CI excludes 0 and includes original ES point estimate.
  2. signal - inconsistent: replication ES 95% CI excludes 0 but also excludes original ES point estimate. Three subcategorizations:
    • larger (same direction): replication ES is larger and in same direction as original ES.
    • smaller (same direction): replication ES is smaller and in same direction as original ES.
    • opposite direction/pattern: replication ES is in opposite direction (or reflects inconsistent pattern) relative to original ES direction/pattern.
  3. no signal - consistent: replication ES 95% CI includes 0 but also includes original ES point estimate.
  4. no signal - inconsistent: replication ES 95% CI includes 0 but excludes original ES point estimate.

(In cases where a replication effect size estimate is less precise than the original, i.e., the replication ES CI is wider than the original's, the label "imprecise" is added to warn readers that such a replication should only be interpreted meta-analytically.)
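
The classification logic above is mechanical enough to sketch in a few lines of R; the function below is an illustrative reading of the scheme (including the "imprecise" qualifier), not Curate Science's actual implementation.

```r
# Classify a replication outcome from the replication's 95% CI and the
# original effect size (with its CI, for the "imprecise" qualifier).
classify_replication <- function(rep_lo, rep_hi, orig_es, orig_lo, orig_hi) {
  signal     <- rep_lo > 0 || rep_hi < 0                # replication CI excludes 0?
  consistent <- orig_es >= rep_lo && orig_es <= rep_hi  # CI includes original ES?
  label <- if (signal && consistent) {
    "signal - consistent"
  } else if (signal) {
    "signal - inconsistent"  # subcategorize as larger/smaller/opposite direction
  } else if (consistent) {
    "no signal - consistent"
  } else {
    "no signal - inconsistent"
  }
  # Replication CI wider than the original's => interpret meta-analytically only
  if ((rep_hi - rep_lo) > (orig_hi - orig_lo)) label <- paste(label, "(imprecise)")
  label
}

# E.g., the Bem Study 8 numbers discussed below (replication DR% = -.03 +/- .4;
# original DR% = 2.3 +/- 2.3):
classify_replication(rep_lo = -0.43, rep_hi = 0.37,
                     orig_es = 2.3, orig_lo = 0, orig_hi = 4.6)
#> [1] "no signal - inconsistent"
```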

For example, as shown below, for the original instantiation of Bem's “retroactive recall” effect (Study 8), meta-analytic replication evidence of all known eligible replications does not detect a signal given that the meta-analytic replication ES estimate of DR% is -.03% +/- .4% (which is inconsistent with the original ES estimate of DR% = 2.3% +/- 2.3%). For the method generalization (Bem’s Study 9), meta-analytic replication evidence also does not detect a signal (meta-analytic replication ES estimate of DR% = -.2% +/- .98%, which is also inconsistent with the original ES estimate of 4.2% +/- 3.0%). Consequently, the effect that rehearsing words after a memory test can boost our memory for those words does not appear to be credible given that it has not survived the severe falsification attempts provided by the direct replications.

Products in Development

Search for Transparently Reported Studies

Evidence can only be considered credible if it has survived scrutiny by independent researchers. Such scrutiny is only possible if a study's methodological details and data are transparently reported. In this spirit, we're designing an interface to search for (and filter by) studies reported transparently along different transparency dimensions (i.e., compliance with reporting standards, open/public study materials, preregistered protocols/Registered Reports, open/public data, and reproducible code). Researchers will be able to add and curate their own transparently reported articles. See the interactive prototype in development below (e.g., filter by transparency component checkboxes; hover over and/or click badge icons to interact with curated transparency components).

Show only articles with:
  • Reporting standard/Methodological disclosure statement
  • Open/public study materials
  • Preregistered/Registered Reports protocol
  • Open/public data
  • Reproducible code/Computational capsule

authors/study.number | article.title | journal.name | DOI | rs (reporting standard) | om (open materials) | prereg (preregistration) | od (open data) | rc (reproducible code)
Pittarello, Leib et al. (2015)
Studies 1 & 2 each comply with the Basic 4 (at submission) reporting standard: excluded data (subjects/observations), experimental conditions, outcome measures, and sample size determination all fully reported in the article.
Justifications Shape Ethical Blind Spots Psychological Science 10.1177/0956797615571018 rs om prereg od
Colby, DeWitt & Chapman (2015)
Studies 1-3 each comply with the Basic 4 (at submission) reporting standard: excluded data (subjects/observations), experimental conditions, outcome measures, and sample size determination all fully reported in the article.
Grouping Promotes Equality: The Effect of Recipient Grouping on Allocation of Limited Medical Resources Psychological Science 10.1177/0956797615583978 rs om prereg od
Birmingham et al. (2015)
Study complies with the Basic 4 (at submission) reporting standard: excluded data (subjects/observations), experimental conditions, outcome measures, and sample size determination all fully reported in the article.
Implicit Social Biases in People With Autism Psychological Science 10.1177/0956797615595607 rs om od
Tworek & Cimpian (2016)
Studies 1-5 each comply with the Basic 4 (at submission) reporting standard: excluded data (subjects/observations), experimental conditions, outcome measures, and sample size determination all fully reported in the article.
Why Do People Tend to Infer 'Ought' From 'Is'? The Role of Biases in Explanation Psychological Science 10.1177/0956797616650875 rs om prereg od
Willén & Strömwall (2012)
Study complies with the Basic 7 (retroactive) reporting standard:
  1. Excluded data (subjects/observations): Initially, 36 offenders participated in the experiment. The narratives from six participants were excluded for different reasons. Three respondents were excluded because – on the respondents' initiative – the narratives were not focused on the purpose of the interview (i.e., the experiment). One was excluded because the false confession was so lengthy that the interviewer had to end the interview prematurely. One respondent was excluded because the narratives were severely incoherent to the extent that they were not possible to understand. One respondent was excluded because the interviewer had reason to believe that the event behind the supposedly true confession actually had taken place. All six respondents were excluded prior to the statistical analyses being made. Some of the excluded narratives were used by the research assistants/coders for training purposes.
  2. Experimental conditions: Full details reported in article.
  3. Outcome measures: Full details reported in article.
  4. Sample size determination: A rough sample size of about 30 participants was decided in advance. No formal power calculation was conducted and the participation rate was expected to be low.
  5. Analytic plans: It was initially predicted that gender and interview experience would influence the outcome on CBCA and RM scores. These analyses were not statistically significant (p > .05). In line with common publication practice (John et al., 2012), these predictions were therefore deleted from the report and the two variables instead included as covariates.
  6. Unreported related studies: N/A
  7. Other disclosures: For the purpose of any future research aiming to replicate the study, it should be noted that the interviewer also had criminal experience and that all respondents were briefly informed about this (albeit not the nature of the experience) during the recruitment process. Before granting access, the prisons' heads of security did a thorough check-up to ensure that the interviewer was not familiar with any of the current prisoners (i.e., not only potential respondents).
Date of retroactive disclosure: February 1, 2018.
Original retroactive disclosure statement
Offenders’ uncoerced false confessions: A new application of statement analysis? Legal and Criminological Psychology 10.1111/j.2044-8333.2011.02018.x rs
Campbell et al. (2018)
Study complies with the Basic 4 (retroactive) reporting standard: excluded data (subjects/observations), experimental conditions, outcome measures, and sample size determination all fully reported in the article.
Self-esteem, relationship threat, and dependency regulation: Independent replication of Murray, Rose, Bellavia, Holmes, and Kusche (2002) Study 3 Preprint Journal of Research in Personality 10.1016/j.jrp.2017.04.001 rs om prereg od rc
Vize, Collison et al. (2018) Examining the Effects of Controlling for Shared Variance among the Dark Triad Using Meta-analytic Structural Equation Modelling European Journal of Personality 10.1002/per.2137 om od
Butler, Karpowitz et al. (2017) Who Gets the Credit? Legislative Responsiveness and Evaluations of Members, Parties, and the US Congress HTML Political Science Research and Methods 10.1017/psrm.2015.83 od rc
Gilad & Mizrahi-Man (2015) A reanalysis of mouse ENCODE comparative gene expression data HTML F1000Research 10.12688/f1000research.6536.1 od rc
Eriksson, Andersson, & Strimling (2018)
Studies 1-4
When is it appropriate to reprimand a norm violation? The roles of anger, behavioral consequences, violation severity, and social distance Judgment and Decision Making jdm17127 od
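
As a sketch of how the transparency filters above could work against the curated table, here is an illustrative R snippet; the boolean columns mirror the rs/om/prereg/od/rc abbreviations in the table header, and the data frame is a stand-in for the real database (flags follow the example entries above).

```r
# Filter curated articles by transparency components (rs = reporting
# standard, om = open materials, prereg = preregistration, od = open
# data, rc = reproducible code). Flags follow the example table above.
articles <- data.frame(
  article = c("Pittarello, Leib et al. (2015)",
              "Willén & Strömwall (2012)",
              "Gilad & Mizrahi-Man (2015)"),
  rs     = c(TRUE, TRUE, FALSE),
  om     = c(TRUE, FALSE, FALSE),
  prereg = c(TRUE, FALSE, FALSE),
  od     = c(TRUE, FALSE, TRUE),
  rc     = c(FALSE, FALSE, TRUE)
)

# "Show only articles with" open/public data AND reproducible code:
subset(articles, od & rc)
```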

Replications User Interface Improvements

Replications user interface improvements currently in development include: (1) meta-analyze selected replications, (2) enhanced visualization of complex designs, (3) curating and visualizing multiple outcomes, and (4) public crowdsourcing and replication alerts.


1. Meta-analyze Selected Replications

Meta-analyze and generate forest plots of user-selected replications, including displaying estimates of heterogeneity across replications (replications automatically grouped by distinct operationalizations of an effect). Try it by clicking the top "Meta-analyze" button on the left-hand side of the table.
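
Under the hood, this feature amounts to something like the following minimal metafor sketch; the effect sizes and labels are hypothetical placeholders, not the platform's actual computation.

```r
# Meta-analyze a user-selected set of replications, report heterogeneity,
# and draw a forest plot. Effect sizes below are hypothetical placeholders.
library(metafor)

selected <- data.frame(
  study = c("Replication 1", "Replication 2", "Replication 3", "Replication 4"),
  d     = c(0.05, -0.10, 0.12, 0.01),
  se    = c(0.10, 0.08, 0.14, 0.06)
)

res <- rma(yi = selected$d, sei = selected$se, method = "REML",
           slab = selected$study)
res$I2      # percentage of variability attributable to heterogeneity
res$tau2    # estimated between-replication variance
forest(res) # forest plot of the selected replications with pooled estimate
```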

2. Enhanced Visualization of Complex Designs

Enhanced visualization of effect sizes for more complex designs by displaying small (side-by-side) plots of original and replication data patterns on mouseover (e.g., complex interaction patterns). Try it by hovering over (or clicking for touch screens) the original or replication effect size cells (dotted underline).

3. Curating and Visualizing Multiple Outcomes

Replications sometimes involve more than one outcome measure (dependent variable). We will soon offer the ability to add, meta-analyze, and visualize multiple outcomes (e.g., primary, secondary, and auxiliary outcomes):

Infidelity Distress Effect -- Replications (3)  

4. Crowdsourcing and Replication Alerts

We will soon offer the ability for the broader community of researchers to add missing replications and update/modify replication study characteristics (we currently use an internal crowdsourcing mechanism). Registered users will also be able to create alerts to be automatically notified when new replications are posted based on the name of a target effect or keyword/topic.

Current Contributors
Current contributors are helping with the conceptual development of Curate Science, the writing/editing of related manuscripts, and/or curation.
Etienne P. LeBel
University of Western Ontario
Founder & Lead
Wolf Vanpaemel
KU Leuven
Touko Kuusi
University of Helsinki
Randy McCarthy
Northern Illinois University
Brian Earp
University of Oxford

Malte Elson
Ruhr University Bochum
Current Advisory Board (as of June 2017)
Advisory board members periodically provide feedback on grant proposals and related manuscripts, as well as general advice regarding Curate Science's current focus areas and future directions.
Susann Fiedler
Max Planck Institute
Anna van't Veer
Leiden University
Julia Rohrer
Max Planck Institute
Michèle Nuijten
Tilburg University
Dorothy Bishop
University of Oxford

Brent Roberts
University of Illinois - Urbana-Champaign
Hal Pashler
University of California - San Diego
Daniel Simons
University of Illinois - Urbana-Champaign
Alex Holcombe
University of Sydney
E-J Wagenmakers
University of Amsterdam

Katie Corker
Grand Valley State University
Simine Vazire
University of California – Davis
Richard Lucas
Michigan State University
Marco Perugini
University of Milan-Bicocca
Lorne Campbell
University of Western Ontario

Eric Eich
University of British Columbia
Mark Brandt
Tilburg University
Please sign up below to receive the Curate Science Newsletter and be automatically notified about news and updates. See past announcements.

Previous Funders

Current Funders

Current Partners

Contact Details

University of Western Ontario
1151 Richmond St
London, Ontario, CANADA, N6A 3K7
email: curatescience@gmail.com