Crowd-sourcing the tracking and interpretation of replication evidence.

Published scientific findings can only be considered trustworthy -- for theory and applications (e.g., health interventions) -- once successfully replicated and generalized by independent researchers. No database, however, currently exists that systematically tracks and meta-analytically summarizes independent direct replications to gauge the replicability and generalizability of social science findings over time. Curate Science is a crowd-sourced effort to fill this gap and thereby accelerate the development of trustworthy knowledge that can soundly inform theory and effective public policy to improve human welfare (see the About section for more details).

Update (October 28, 2016): We've just released a new framework (version 3.0.1) for curating replication evidence of social science findings and are now soliciting feedback (please email us at curatescience-anti-bot-bit@gmail.com). Details of our previous approaches can be found here (version 2.0.4) and here (version 1.0.5).

Large-Scale Replication Projects (970 replications)
  • Reproducibility Project: Psychology [100 replications; view studies ]
  • Social Psych Special Issue [31 replications]
    • Many Labs 1 [12 effects x 36 labs = 432 replications]
  • Many Labs 2 [26 effects, N = ~15,000]
  • Many Labs 3 [10 effects x 21 labs = 210 replications]
  • Many Labs 4: Impact of "expertise" on replicability
  • Many Labs 5: Can peer-review of protocols boost replicability?
  • Registered Replication Reports (RRRs) at Perspectives on Psychological Science
    • RRR1 & RRR2: Verbal overshadowing [23 replications; view studies ]
    • RRR3: Grammar on intentionality [13 replications]
    • RRR4: Ego depletion [23 replications; view studies ]
    • RRR5: Facial feedback hypothesis [17 replications]
    • RRR6: Commitment on forgiveness [16 replications]
    • RRR7: Intuitive-cooperation effect [20 replications]
    • RRR8: Trivial pursuit effect [data being collected]
    • RRR9: Hostility priming increases perceptions of hostility [data being collected]
    • RRR10: Moral reminder reduces cheating [data being collected]
  • Economics Reproducibility Project [67 replications; Chang & Li, 2015]
  • Economics Lab Experiments Replicability Project [18 replications; Camerer et al, 2016]

Last updated: December 3, 2016

All Replications (1047 replications; 370 curated, 677 being curated [see list])

List of known direct replications ("Exact", "Very Close", or "Close" direct replications) of generalizations of effects according to our working taxonomy. PDFs of study details are linked via study author names. Replication methodological details can be found here. Please let us know about any missing replications or errors at curatescience-anti-bot-bit@gmail.com

Search via Ctrl+F (Windows) or ⌘+F (Mac). Last updated: November 14, 2016

Reproducibility Project: Psychology (100 replications; 57 Social, 43 Cognition)

Replication protocols and data/materials available at specified protocol URL and OSF URL, respectively. More details about original and replication studies can be found here. Please let us know about any errors at curatescience-anti-bot-bit@gmail.com

Search via Ctrl+F (Windows) or ⌘+F (Mac). Last updated: November 21, 2016

Social Priming / Embodiment

Cleanliness priming -- Replications (7)  
Schnall, Benton, & Harvey (2008a)
With a Clean Conscience: Cleanliness Reduces the Severity of Moral Judgments
DOI:10.1111/j.1467-9280.2008.02227.x  

Original Studies & Replications N Effect size (d) [95% CI]
Schnall et al. (2008a) Study 1 40
Arbesfeld et al. (2014) 60
Besman et al. (2013) 60
Huang (2014) Study 1 189
Lee et al. (2013) 90
Johnson et al. (2014a) Study 1 208
Johnson et al. (2014b) 736
Current meta-analytic estimate of replications of SBH's Study 1 (random-effects):
Schnall et al. (2008a) Study 2 43
Johnson et al. (2014a) Study 2 126
Current meta-analytic estimate of all replications (random-effects):
[Underlying data (CSV)] [R-code]
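
For readers who want to reproduce this kind of pooling themselves, the snippet below is a minimal sketch (not the linked R-code) of how the random-effects estimates above can be computed with the metafor package in R; the file name and column names (yi for the standardized mean difference, vi for its sampling variance, and study_type) are assumptions about the underlying CSV, not its actual contents.

    # Minimal sketch (not the linked R-code): random-effects meta-analysis of
    # replication effect sizes using metafor. The file name and the columns
    # 'yi' (Cohen's d), 'vi' (sampling variance), and 'study_type' are assumed.
    library(metafor)

    dat <- read.csv("cleanliness_priming_replications.csv")   # hypothetical file

    # Pool only the independent replications (excluding the original studies)
    # with a random-effects model estimated via REML.
    res <- rma(yi, vi, data = subset(dat, study_type == "replication"),
               method = "REML")

    summary(res)   # meta-analytic estimate, 95% CI, and heterogeneity (tau^2, I^2)
    forest(res)    # forest plot analogous to the one displayed above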

Summary (Last updated: April 7, 2016): The main finding that cleanliness priming reduces the severity of moral judgments does not (yet) appear to be replicable (overall meta-analytic effect: r = -.08 [+/-.13]). In a follow-up commentary, Schnall argued that a ceiling effect in Johnson et al.'s (2014a) studies renders their results uninterpretable and hence that their replication results should be dismissed. However, independent re-analyses by Simonsohn, Yarkoni, Schönbrodt, Inbar, Fraley, and Simkovic appear to rule out such a ceiling-effect explanation; hence, Johnson et al.'s (2014a) results should be retained in gauging the replicability of the original cleanliness priming effect. Of course, it's possible "cleanliness priming" may be replicable under different operationalizations, conditions, and/or experimental designs (e.g., within-subjects). Indeed, Huang (2014) has reported new evidence suggesting cleanliness priming may only reduce the severity of moral judgments under conditions of "low response effort"; however, that research appears underpowered (<50% power) to detect the small interaction effect found (r = .12). Regardless, independent corroboration of Huang's interaction effect is required before confidence is placed in such a moderated cleanliness priming effect.

Original authors' and replicators' comments: F. Cheung mentioned a note should be added that data for the Besman et al. (2013) replication have been lost (communicated to him by K. Daubman, who has not yet responded to our request for links to the original data of both her Arbesfeld et al. and Besman et al. replications). M. Frank mentioned we should consider including some of Huang's (2014) studies (baseline un-moderated conditions only), which led us to add Huang's Study 1 (the only study with a baseline condition comparable to Schnall et al.'s Study 1 design). S. Schnall has yet to respond (email sent March 11, 2016).

Related Commentary

Money priming -- Replications (42)  
Vohs, Mead, & Goode (2006) 
The psychological consequences of money
Caruso, Vohs, Baxter, & Waytz (2013) 
Mere exposure to money increases endorsement of free-market systems and social inequality

Original Studies & Replications N Effect size (d) [95% CI]
Vohs et al. (2006) Study 3 39
Grenier et al. (2012) 40
Caruso et al. (2013) Study 2 168
Schuler & Wänke (in press) Study 2 115
Rohrer et al. (2015) Study 2 420
Current meta-analytic estimate of replications of CVBW's Study 2 (random-effects):
Caruso et al. (2013) Study 3 80
Rohrer et al. (2015) Study 3 156
Caruso et al. (2013) Study 4 48
Rohrer et al. (2015) Study 4 116
Caruso et al. (2013) Study 1 30
Hunt & Krueger (2014) 87
Cheong (2014) 102
Devos (2014) 162
Swol (2014) 96
John & Skorinko (2014) 87
Davis & Hicks (2014) Study 1 187
Kappes (2014) 277
Klein et al. (2014) 127
Packard (2014) 112
Vranka (2014) 84
Cemalcilar (2014) 113
Bocian & Frankowska (2014) Study 2 169
Huntsinger & Mallett (2014) 146
Rohrer et al. (2015) Study 1 136
Schmidt & Nosek (2014, MTURK) 1000
Hovermale & Joy-Gaba (2014) 108
Vianello & Galliani (2014) 144
Schmidt & Nosek (2014, PI) 1329
Bernstein (2014) 84
Adams & Nelson (2014) 95
Rutchick (2014) 96
Vaughn (2014) 90
Levitan (2014) 123
Brumbaugh & Storbeck (2014) Study 1 103
Smith (2014) 107
Kurtz (2014) 174
Brumbaugh & Storbeck (2014) Study 2 86
Wichman (2014) 103
Pilati (2014) 120
Davis & Hicks (2014) Study 2 225
Furrow & Thompson (2014) 85
Bocian & Frankowska (2014) Study 1 79
Brandt et al. (2014) 80
Nier (2014) 95
Woodzicka (2014) 90
Schmidt & Nosek (2014) 81
Morris (2014) 98
Current meta-analytic estimate of replications of CVBW's Study 1 (random-effects):
Current meta-analytic estimate of all replications (random-effects):
[Underlying data (CSV)] [R-code]
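
For this collection the effect sizes are standardized mean differences; the sketch below illustrates how a d value and its sampling variance can be derived from per-group summary statistics before pooling. It is a generic illustration, not the linked R-code, and the file name and columns (m1i/sd1i/n1i for the money-prime group, m2i/sd2i/n2i for the control group) are assumptions.

    # Generic sketch: standardized mean differences from per-group summary
    # statistics, then pooled with a random-effects model. Column names assumed.
    library(metafor)

    dat <- read.csv("money_priming_replications.csv")   # hypothetical file

    # escalc's "SMD" applies the usual small-sample correction (i.e., Hedges' g).
    dat <- escalc(measure = "SMD",
                  m1i = m1i, sd1i = sd1i, n1i = n1i,
                  m2i = m2i, sd2i = sd2i, n2i = n2i, data = dat)

    res <- rma(yi, vi, data = dat, method = "REML")
    summary(res)   # pooled estimate and 95% CI across replications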

Summary (Last updated: March 24, 2016): The claim that incidental exposure to money influences social behavior and beliefs does not (yet) appear to be replicable (overall meta-analytic effect: d = -.01 [+/-.05]). This appears to be the case whether money exposure is manipulated via background images on instructions (Caruso et al., 2013, Studies 1 & 4) or a sentence-descrambling task (Vohs et al., 2006, Study 3) and whether the outcome variable is helping others (Vohs et al., 2006, Study 3), system justification beliefs (Caruso et al., 2013, Study 1), just world beliefs (Caruso et al., 2013, Study 2), social dominance beliefs (Caruso et al., 2013, Study 3), or fair market beliefs (Caruso et al., 2013, Study 4). Of course, it's possible money exposure reliably influences behavior under other (currently unknown) conditions, via other operationalizations, and/or using other experimental designs (e.g., within-subjects).

Original authors' comments: K. Vohs responded and mentioned Schuler & Wänke's (in press) replication of Caruso et al. (2013) was missing; this led us to add Schuler & Wänke (in press) Study 2 (main effect) as a direct replication of Caruso et al. (2013) Study 2. Vohs pointed out several design differences between Grenier et al. (2012) and Vohs et al.'s (2006) original Study 3, but these deviations are minor (e.g., different priming stimuli, different help target); given that Grenier et al. (2012) used the same general methodology as Vohs et al. (2006) Study 3 for the independent variable (unscrambling priming task) and dependent variable (offering help to code data sheets), the study satisfies the eligibility criteria for a sufficiently similar direct replication according to Curate Science's taxonomy and hence was retained. Vohs also pointed out design differences between Tate (2009) and Vohs et al. (2006) Study 3; given that Tate (2009) employed a different general methodology for the IV (background image on a poster instead of an unscrambling task), the study does *not* satisfy the eligibility criteria for a direct replication and hence was excluded. Finally, Vohs mentioned that "replication studies" for Vohs et al. (2006) are reported in Vohs (2015); however, none of these studies were sufficiently similar methodologically to meet the direct replication eligibility criteria and hence were not added.

Related Commentary

Macbeth effect -- Replications (11)  
Zhong & Liljenquist (2006)
Washing away your sins: Threatened morality and physical cleansing
DOI:10.1126/science.1130726  

Original Studies & Replications N Effect size (r) [95% CI]
Zhong & Liljenquist (2006) Study 2 27
Earp et al. (2014) Study 3 286
Siev (2012) Study 2 148
Earp et al. (2014) Study 2 156
Siev (2012) Study 1 335
Earp et al. (2014) Study 1 153
Gamez et al. (2011) Study 2 36
Current meta-analytic estimate of replications of Z&L's Study 2 (random-effects):
Zhong & Liljenquist (2006) Study 3 32
Fayard et al. (2009) Study 1 210
Gamez et al. (2011) Study 3 45
Current meta-analytic estimate of replications of Z&L's Study 3 (random-effects):
Zhong & Liljenquist (2006) Study 4 45
Fayard et al. (2009) Study 2 115
Gamez et al. (2011) Study 4 28
Reuven et al. (2013) 29
Current meta-analytic estimate of replications of Z&L's Study 4 (random-effects):
Current meta-analytic estimate of all replications (random-effects):
[Underlying data (CSV)] [R-code]
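
Because the effect sizes in this collection are correlations, pooling is typically done on the Fisher z scale and back-transformed to r for reporting; the snippet below is a generic sketch of that workflow (again with assumed file and column names, not the linked R-code).

    # Generic sketch: pooling correlations via Fisher's r-to-z transformation.
    # Assumed columns: 'ri' (observed correlation) and 'ni' (sample size).
    library(metafor)

    dat <- read.csv("macbeth_effect_replications.csv")   # hypothetical file

    dat <- escalc(measure = "ZCOR", ri = ri, ni = ni, data = dat)  # yi (z) and vi
    res <- rma(yi, vi, data = dat, method = "REML")                # random-effects

    predict(res, transf = transf.ztor)   # back-transform pooled estimate and CI to r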

Summary (Last updated: November 11, 2016): The main finding that a threat to one's moral purity induces the need to cleanse oneself (the "Macbeth effect") does not (yet) appear to be replicable (overall meta-analytic effect: r = -.02 [+/-.05]). This appears to be the case whether moral purity threat is manipulated via recalling an unethical vs. ethical deed (Studies 3 and 4) or transcribing text describing an unethical vs. ethical act (Study 2) and whether the need to cleanse oneself is measured via desirability of cleansing products (Study 2), product choice (Study 3), or reduced volunteerism after cleansing (Study 4). Of course, it is possible the "Macbeth effect" is replicable under different operationalizations and/or experimental designs (e.g., within-subjects).

Original authors' comments: We shared a draft of the curated set of replications with both original authors, and invited them to provide feedback. Chenbo Zhong replied thanking us for the notice and mentioned two published articles that should potentially be considered (i.e., Denke et al., 2014; Reuven et al., 2013). Reuven et al. do indeed report a sufficiently close replication (in their non-OCD control group) of Zhong & Liljenquist's Study 4 and hence the control group replication was added (though we're currently clarifying an issue with their reported t-value).

Related Commentary

Physical warmth embodiment -- Replications (14)  
Bargh & Shalev (2012)
The Substitutability of Physical and Social Warmth in Daily Life
DOI:10.1037/a0023527  

Original Studies & Replications N Effect size (r) [95% CI]
Bargh & Shalev (2012) Study 1a 51
Bargh & Shalev (2012) Study 1b 41
Donnellan et al. (2015a) Study 9 197
Donnellan et al. (2015a) Study 4 228
Donnellan et al. (2015a) Study 1 235
Donnellan et al.(2015b) 291
Ferrell et al. (2013) 365
McDonald & Donnellan (2015) 356
Donnellan et al. (2015a) Study 2 480
Donnellan et al. (2015a) Study 8 365
Donnellan et al. (2015a) Study 7 311
Donnellan & Lucas (2014) 531
Donnellan et al. (2015a) Study 6 553
Donnellan et al. (2015a) Study 5 494
Donnellan et al. (2015a) Study 3 210
Current meta-analytic estimate of replications of B&S' Study 1 (random-effects):
Bargh & Shalev (2012) Study 2 75
Wortman et al. (2014) 260
Current meta-analytic estimate of all replications (random-effects):
[Underlying data (CSV)] [R-code]

Summary (Last updated: April 7, 2016): The notion that physical warmth influences psychological social warmth does not appear to be well-supported by the independent replication evidence (overall meta-analytic effect: r = .007 [+/-.035]), at least via Bargh and Shalev's (2012) Study 1 and 2 operational tests (Study 1: trait loneliness is positively associated with warmer bathing; Study 2: briefly holding a frozen cold-pack boosts reported feelings of chronic loneliness). Regarding the first operational test, the loneliness-shower effect does not appear replicable whether (1) trait loneliness is measured using the complete 20-item UCLA Loneliness Scale (Donnellan et al., 2015 Studies 1-4) or a 10-item modified version of the UCLA Loneliness Scale (Donnellan et al., 2015 Studies 5-9, as in Bargh & Shalev, 2012 Studies 1a and 1b), (2) warm bathing is measured via a "physical warmth index" (all replications, as in Bargh & Shalev, 2012 Studies 1a and 1b) or via the arguably more hypothesis-relevant water temperature item (all replications of Bargh & Shalev Study 1), or (3) participants were sampled from Michigan (Donnellan et al., 2015 Studies 1-9), Texas (Ferrell et al., 2013), or Israel (McDonald & Donnellan, 2015). Of course, different operationalizations of the idea may yield replicable evidence, e.g., in different domains, contexts, or using other experimental designs (e.g., within-subjects). In a response, Shalev & Bargh (2015) point out design differences in Donnellan et al.'s (2015) replications that could have led to discrepant results (e.g., participant awareness not probed) and report three additional studies yielding small positive correlations between loneliness and new bathing and showering items (measured separately; r = .09 [+/-.09, N=491] and r = .14 [+/-.08, N=552]). These new findings, however, await independent corroboration (the additional studies were not included in the meta-analysis because they were conducted by non-independent researchers; see the FAQ for more details). In a rejoinder, Donnellan et al. (2015b) report an additional study that (1) probed participant awareness and found the effect size unaltered by excluding participants suspected of study awareness (r=-.04, N=291 vs. r=-.05, N=323 total sample) and (2) found no evidence that individual differences in attachment style moderated the loneliness-showering link.
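
The bracketed margins reported above (e.g., r = .09 [+/-.09, N=491]) are 95% confidence-interval half-widths. The short sketch below shows one standard way such an interval can be obtained for a single correlation via the Fisher z transformation; it is an illustration, not the analysis code used here.

    # 95% confidence interval for a single correlation via Fisher's z.
    ci_for_r <- function(r, n, level = 0.95) {
      z    <- atanh(r)                    # r-to-z transformation
      se   <- 1 / sqrt(n - 3)             # standard error on the z scale
      crit <- qnorm(1 - (1 - level) / 2)
      tanh(z + c(-1, 1) * crit * se)      # back-transform the bounds to r
    }

    ci_for_r(0.09, 491)   # roughly [.00, .18], i.e., about +/- .09
    ci_for_r(0.14, 552)   # roughly [.06, .22], i.e., about +/- .08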

Original authors' comments: I. Shalev responded stating that they have already publicly responded to these replications, reporting three additional studies in their response, and asked that readers be referred to that article (Shalev & Bargh, 2015). B. Donnellan responded stating that several open questions remain, including (1) unexplained anomalies in Bargh & Shalev's (2012) Study 1a data (i.e., 46 of the 51 participants (90%) reported taking less than one shower or bath per week) and (2) concerns regarding unclear exclusion criteria for Shalev & Bargh's (2015) new studies. Donnellan further stated that he is unconvinced by Shalev & Bargh's reply and that replication attempts by multiple independent labs would be the most constructive step forward.

Related Commentary

Self

Strength model of self-control -- Replications (32)  
Muraven, Tice, & Baumeister (1998) 
Self-control as limited resource: Regulatory depletion patterns
Baumeister, Bratslavsky, Muraven, & Tice (1998) 
Ego depletion: Is the active self a limited resource?

Original Studies & Replications N Effect size (d) [95% CI]
Prediction 1: Glucose consumption counteracts ego depletion
Gaillot, Baumeister et al. (2007) Study 7 61
Cesario & Corker (2010) 119
Wang & Dvorak (2010) 61
Lange & Eggert (2014) Study 1 70
Current meta-analytic estimate of Prediction 1 replications (random-effects):
Prediction 2: Self-control impairs further self-control (ego depletion)
Muraven, Tice et al. (1998) Study 2 34
Murtagh & Todd (2004) Study 2 51
Schmeichel, Vohs et al. (2003) Study 1 24
Pond et al. (2011) Study 3 128
Schmeichel (2007) Study 1 79
Healy et al. (2011) Study 1 38
Carter & McCullough (2013) 138
Lurquin et al. (2016) 200
Inzlicht & Gutsell (2007) 33
Wang, Yang, & Wang (2014) 31
Sripada, Kessler, & Jonides (2014) 47
Ringos & Carlucci (2016) 68
Wolff, Muzzi & Brand (2016) 87
Calvillo & Mills (2016) 75
Crowell, Finley et al. (2016) 73
Lynch, vanDellen et al. (2016) 79
Birt & Muise (2016) 59
Yusainy, Wimbarti et al. (2016) 156
Lau & Brewer (2016) 99
Ullrich, Primoceri et al. (2016) 103
Elson (2016) 90
Cheung, Kroese et al. (2016) 181
Hagger & Chatzisarantis (2016) 101
Schlinkert, Schrama et al. (2016) 79
Philipp & Cannon (2016) 75
Carruth & Miyake (2016) 126
Brandt (2016) 102
Stamos, Bruyneel et al. (2016) 93
Rentzsch, Nalis et al. (2016) 103
Francis & Inzlicht (2016) 50
Lange, Heise et al. (2016) 106
Evans, Fay, & Mosser (2016) 89
Tinghög & Koppel (2016) 82
Otgaar, Martijn et al. (2016) 69
Muller, Zerhouni et al. (2016) 78
Current meta-analytic estimate of Prediction 2 replications (random-effects):
[Underlying data (CSV)] [R-code]
Original Studies & Replications Independent Variables Dependent Variables Design Differences Active Sample Evidence
Prediction 1: Glucose consumption counteracts ego depletion
Gaillot, Baumeister et al. (2007) Study 7 sugar vs. splenda
video attention task vs. control
Stroop performance -
Cesario & Corker (2010) sugar vs. splenda
video attention task vs. control
Stroop performance No manipulation check Positive correlation between baseline & post-manipulation error rates, r = .36, p < .001
Wang & Dvorak (2010) sugar vs. splenda
future-discounting t1 vs. t2
future-discounting task -
Lange & Eggert (2014) Study 1 sugar vs. splenda
future-discounting t1 vs. t2
future-discounting task different choices in future-discounting task test-retest reliability of r = .80 across t1 and t2 scores
Prediction 2: Self-control impairs further self-control (ego depletion)
Muraven, Tice et al. (1998) Study 2 thought suppression vs. control anagram performance -
Murtagh & Todd (2004) Study 2 thought suppression vs. control anagram performance very difficult solvable anagrams used rather than "unsolvable"
Schmeichel, Vohs et al. (2003) Study 1 video attention task vs. control GRE standardized test -
Pond et al. (2011) Study 3 video attention task vs. control GRE standardized test 10 verbal GRE items used (instead of 13 analytic GRE items)
Schmeichel (2007) Study 1 video attention task vs. control working memory (OSPAN) -
Healy et al. (2011) Study 1 video attention task vs. control working memory (OSPAN) % of target words recalled (rather than total)
Carter & McCullough (2013) video attention task vs. control working memory (OSPAN) Effortful essay task vs. control in between IV and DV (perfectly confounded w/ IV)
Lurquin et al. (2016) video attention task vs. control working memory (OSPAN) 40 target words in OSPAN (rather than 48) Main effect of OSPAN set sizes on performance, F(1, 199) = 4439.81, p < .001
Inzlicht & Gutsell (2007) emotion suppression (video) vs. control EEG ERN during stroop task -
Wang, Yang, & Wang (2014) emotion suppression (video) vs. control EEG ERN during stroop task
Sripada, Kessler, & Jonides (2014) effortful letter crossing vs. control multi-source interference task (MSIT; RTV) -
Ringos & Carlucci (2016) effortful letter crossing vs. control multi-source interference task (MSIT; RTV)
Wolff, Muzzi & Brand (2016) effortful letter crossing vs. control multi-source interference task (MSIT; RTV) German language
Calvillo & Mills (2016) effortful letter crossing vs. control multi-source interference task (MSIT; RTV)
Crowell, Finley et al. (2016) effortful letter crossing vs. control multi-source interference task (MSIT; RTV)
Lynch, VanDellen et al. (2016) effortful letter crossing vs. control multi-source interference task (MSIT; RTV)
Birt & Muise (2016) effortful letter crossing vs. control multi-source interference task (MSIT; RTV)
Yusainy, Wimbarti et al. (2016) effortful letter crossing vs. control multi-source interference task (MSIT; RTV) Indonesian language
Lau & Brewer (2016) effortful letter crossing vs. control multi-source interference task (MSIT; RTV)
Ullrich, Primoceri et al. (2016) effortful letter crossing vs. control multi-source interference task (MSIT; RTV) German language
Elson (2016) effortful letter crossing vs. control multi-source interference task (MSIT; RTV) German language
Cheung, Kroese et al. (2016) effortful letter crossing vs. control multi-source interference task (MSIT; RTV) Dutch language
Hagger & Chatzisarantis (2016) effortful letter crossing vs. control multi-source interference task (MSIT; RTV)
Schlinkert, Schrama et al. (2016) effortful letter crossing vs. control multi-source interference task (MSIT; RTV) Dutch language
Philipp & Cannon (2016) effortful letter crossing vs. control multi-source interference task (MSIT; RTV)
Carruth & Miyake (2016) effortful letter crossing vs. control multi-source interference task (MSIT; RTV)
Brandt (2016) effortful letter crossing vs. control multi-source interference task (MSIT; RTV) Dutch language
Stamos, Bruyneel et al. (2016) effortful letter crossing vs. control multi-source interference task (MSIT; RTV) Dutch language
Rentzsch, Nalis et al. (2016) effortful letter crossing vs. control multi-source interference task (MSIT; RTV) German language
Francis & Inzlicht (2016) effortful letter crossing vs. control multi-source interference task (MSIT; RTV)
Lange, Heise et al. (2016) effortful letter crossing vs. control multi-source interference task (MSIT; RTV) German language
Evans, Fay & Mosser (2016) effortful letter crossing vs. control multi-source interference task (MSIT; RTV)
Tinghög & Koppel (2016) effortful letter crossing vs. control multi-source interference task (MSIT; RTV)
Otgaar, Martijn et al. (2016) effortful letter crossing vs. control multi-source interference task (MSIT; RTV) Dutch language
Muller, Zerhouni et al. (2016) effortful letter crossing vs. control multi-source interference task (MSIT; RTV) French language
[Underlying data (CSV)] [R-code]

Summary (Last updated: November 11, 2016): There appear to be replication difficulties across 6 different operationalizations of original studies supporting the two main predictions of the strength model of self-control (Baumeister et al., 2007). Prediction 1: Independent researchers appear unable to replicate the finding that glucose consumption counteracts ego depletion, whether self-control is measured via the Stroop task (Cesario & Corker, 2010, as in Gaillot et al., 2007, Study 7) or a future-discounting task (Lange & Eggert, 2014, Study 1, as in Wang & Dvorak, 2010). Prediction 2: There also appear to be replication difficulties (across 4 distinct operationalizations) for the basic ego depletion effect. This is the case whether the IV is manipulated via thought suppression, a video attention task, emotion suppression during video watching, or an effortful letter-crossing task, and whether the DV is measured via anagram performance, standardized tests, working memory, or a multi-source interference task. Wang et al. (2014) do appear to successfully replicate Inzlicht & Gutsell's (2007) finding that ego depletion led to reduced activity in the anterior cingulate (a region previously associated with conflict monitoring); however, this finding should be interpreted with caution given potential bias due to analytic flexibility in data exclusions and EEG analyses. Of course, ego depletion may reflect a replicable phenomenon under different conditions, contexts, and/or operationalizations; however, the replication difficulties across 6 different operationalizations suggest ego depletion might be much more nuanced than previously thought. Indeed, alternative models have recently been proposed (e.g., motivation/attention-based accounts, Inzlicht et al., 2014; mental fatigue, Inzlicht & Berkman, 2015) and novel intra-individual paradigms to measure ego depletion have also emerged (Francis, 2014; Francis et al., 2015) that offer promising avenues for future research.

Original authors' and replicators' comments: B. Schmeichel pointed out a missing replication (Healy et al., 2011, Study 1) of Schmeichel (2007, Study 1); we've added the study, though we are currently clarifying with K. Healey a potential issue with their reported effect size. F. Lange mentioned that effect sizes for the RRR ego depletion replications seemed off (also pointed out by B. Schmeichel); indeed, we inadvertently sourced the effect sizes from an RRR dataset that included all exclusions (these have now been corrected and match the values reported in Figure 1 of the Sripada et al. RRR article). M. Inzlicht responded that he is currently developing a pre-registered study of the basic ego depletion effect using a much longer initial depletion task, adapted to be effortful for everyone, via a more powerful pre-post mixed design. R. Dvorak stated their study was not a replication of ego depletion; we clarified that the Wang & Dvorak (2010) study is used as an original study whose finding is consistent with the glucose claim of Baumeister et al.'s (2007) strength model. J. Lurquin mentioned their effect size was d=0.22 (not d=0.21); .21 is actually correct given that we apply the Hedges' g bias correction (though we still call it d because that notation is more familiar to researchers).
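
For context on the d-versus-g point above, the snippet below shows the standard small-sample correction factor that converts Cohen's d into Hedges' g; the group sizes used in the example call are hypothetical, chosen purely for illustration.

    # Hedges' g small-sample bias correction applied to Cohen's d.
    # The correction factor J approaches 1 as degrees of freedom grow, so it
    # mainly matters for small studies.
    hedges_g <- function(d, n1, n2) {
      df <- n1 + n2 - 2
      J  <- 1 - 3 / (4 * df - 1)   # common approximation to the exact correction
      d * J
    }

    hedges_g(d = 0.22, n1 = 30, n2 = 30)   # hypothetical group sizes, ~0.217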

Related Commentary

Classic Social Psychology

Mood on helping -- Replications (3)  
Isen & Levin (1972) 
Effect of feeling good on helping: Cookies and kindness
Levin & Isen (1975) 
Further studies on the effect of feeling good on helping

Original Studies & Replications N Effect size (Risk Difference) [95% CI]
Isen & Levin (1972) Study 2 41
Blevins & Murphy (1974) 50
Levin & Isen (1975) Study 1 24
Weyant & Clark (1977) Study 2 106
Weyant & Clark (1977) Study 1 32
Current meta-analytic estimate of L&I Study 1 replications (random-effects):
Current meta-analytic estimate of all replications (random-effects):
[Underlying data & R-code]

Summary (Last updated: March 24, 2016): The finding that positive mood boosts helping appears to have replicability problems. Across three replications, individuals presumably in a positive mood (induced via finding a dime in a telephone booth) helped at about the same rate (29.6%) as those not finding a dime (29.8%; meta-analytic risk difference estimate = .03 [+/-.19]; in the original studies, 88.8% of dime-finding participants helped compared to 13.9% of participants in the control condition). This was the case whether helping was measured via picking up dropped papers (Blevins & Murphy, 1974, as in Isen & Levin, 1972, Study 2) or via mailing a "forgotten letter" (Weyant & Clark, 1977, Studies 1 & 2, as in Levin & Isen, 1975, Study 1). These negative replication results are insufficient to declare the mood-helping link unreplicable; however, they do warrant concern that additional unmodeled factors should perhaps be considered. For instance, it seems plausible that mood may influence helping in different ways for different individuals (e.g., negative, rather than positive, mood may boost helping in some individuals), and may also influence the same person differently on different occasions. Using highly-repeated within-person (HRWP) designs (e.g., Whitsett & Shoda, 2014) would be a fruitful avenue to empirically investigate these more plausible links between mood and helping behavior.
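
Because helping here is a binary outcome, the effect size is a risk difference (the helping rate in the positive-mood condition minus the rate in the control condition). The sketch below shows how risk differences can be computed from 2x2 counts and pooled with a random-effects model; the counts and study labels are placeholders, not the actual underlying data.

    # Generic sketch: risk differences from 2x2 counts, pooled meta-analytically.
    # The counts below are placeholders, not the actual replication data.
    library(metafor)

    dat <- data.frame(
      study = c("Replication A", "Replication B", "Replication C"),
      ai  = c(8, 7, 15),  n1i = c(25, 24, 53),   # helped / total, dime condition
      ci  = c(7, 8, 16),  n2i = c(25, 26, 53)    # helped / total, control condition
    )

    dat <- escalc(measure = "RD", ai = ai, n1i = n1i, ci = ci, n2i = n2i, data = dat)
    res <- rma(yi, vi, data = dat, method = "REML")
    summary(res)   # pooled risk difference and its 95% CI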

Original authors' comments: Report your research and results thoroughly; you may no longer be around when future researchers interpret replication results of your work!

Registered Replication Reports (RRR) @PoPS

Verbal overshadowing (RRR1 & RRR2 ) -- Replications (23)   
Schooler & Engstler-Schooler (1990)
Verbal overshadowing of visual memories: Some things are better left unsaid
DOI:10.1016/0010-0285(90)90003-M  

Original Studies & Replications N Effect size [95% CI]
Schooler & Engstler-Schooler (1990) Study 1 88
Poirer et al. (2014) 95
Delvenne et al. (2014) 98
Birt & Aucoin (2014) 65
Susa et al. (2014) 111
Carlson et al. (2014) 160
Musselman & Colarusso (2014) 78
Echterhoff & Kopietz (2014) 124
Mammarella et al. (2014) 104
Dellapaolera & Bornstein (2014) 164
Mitchell & Petro (2014) 109
Ulatowska & Cislak (2014) 106
Wade et al. (2014) 121
Birch (2014) 156
McCoy & Rancourt (2014) 89
Greenberg et al. (2014) 75
Alogna et al. (2014) 137
Michael et al. (2014, mTurk) 615
Koch et al. (2014) 67
Thompson (2014) 102
Rubinova et al. (2014) 110
Brandimonte (2014) 100
Eggleston et al. (2014) 93
Kehn et al. (2014) 113
Current meta-analytic estimate of all lab replications (random-effects):
[Underlying data (CSV) & R-code]

Summary (Last updated: March 3, 2016): The verbal overshadowing effect appears to be replicable; verbally describing a robber after a 20-minute delay decreased the correct identification rate in a lineup by 16% (from 54% [control] to 38% [verbal]; meta-analytic estimate = -16% [+/-.04], equivalent to r = .17). Still in question, however, is the validity and generalizability of the effect; hence it is still premature for public policy to be informed by verbal overshadowing evidence. Validity-wise, it is unclear whether verbal overshadowing is driven by a more conservative judgmental response bias process or by a reduced memory discriminability process, because no "suspect-absent" lineups were used. This is important to clarify because it directly influences how eye-witness testimony should be treated (e.g., if verbal overshadowing is primarily driven by a more conservative response bias process, identifications made after a verbal description should actually be given *more* [rather than less] weight; see Mickes & Wixted, 2015). Generalizability-wise, in a slight variant of RRR2 (i.e., RRR1), a much smaller overall verbal deficit of -4% [+/-.03] emerged when the lineup identification occurred 20 minutes after the verbal description (which occurred immediately after seeing the robbery). Future research needs to determine the size of verbal overshadowing when there is a delay both between the crime and the verbal description and before the lineup identification, which better reflects real-world conditions.

Original authors' comments: We shared a draft of the curated set of replications with original authors, and invited them to provide feedback. Jonathan Schooler replied stating that the information seemed fine to him.

Related Commentary

(For full details of our replication evidence curation framework, please see here.)

Every year, society spends billions of dollars (primarily public taxpayer money) to fund scientific studies aimed at deepening our understanding of the natural and social world. The hope is that the findings yielded by these studies will help us address important societal problems (e.g., cancer; suicide; racial discrimination; voter turnout). These findings, however, can only be considered sufficiently trustworthy knowledge ready to inform public-policy decisions once they have been successfully replicated and generalized by independent researchers. Successful replication is taken to mean that independent researchers have been able to consistently observe results similar to those originally reported, using methodology and conditions similar to the original study's. Successful generalization is taken to mean that independent researchers have been able to consistently observe results similar to those originally reported in situations that use different methodologies (often superior methodologies or measurement instruments), contexts, and populations, consequently producing evidence that the original results generalize to these different situations.

Current approaches (i.e., traditional meta-analyses) to synthesizing evidence are unable to produce the trustworthy knowledge we seek because they cannot fully account for publication bias (Ferguson & Heene, 2012; McShane, Bockenholt, & Hansen, 2016; Rosenthal, 1979), questionable research practices (John et al., 2012), unintentional exploitation of design and analytic flexibility (Simmons et al., 2011; Gelman & Loken, 2013), or the various unknowable interactions among these factors.

To achieve our goal of creating trustworthy knowledge then, we need to systematically track the replicability and generalizability of social science findings over time. Curate Science is a general and unified framework for the tracking and curation of replicability and generalizability evidence of social science findings, with the goal of producing a dynamic, living, and continuously evolving body of knowledge that can soundly inform public-policy. The general framework needs to be very flexible to overcome several distinct conceptual, epistemological, and statistical challenges that arise when tracking and gauging replicability and generalizability. Each of the following challenges needs to be overcome to achieve our goal:

  1. Accommodation of different approaches to replication: The current focus in economics and political science is on analytic reproducibility and robustness analyses, whereas the current focus in psychology is on new-sample replications. Each of these approaches is important, and the order in which they are implemented is crucial to maximize research efficiency. Findings that are not analytically reproducible and/or analytically robust may not be worth the considerable expense of attempting to replicate them in a new sample. Also, for maximal knowledge creation, it is crucial to have inter-disciplinary curation of replication evidence rather than having economists, political scientists, and psychologists maintain their own replication databases (as is currently the case).
  2. Accumulation of replication evidence that speaks to the replicability and generalizability of an effect/hypothesis (i.e., Replicability replication evidence vs. Generalizability replication evidence). To achieve this, we need flexible ontological structures – what we’re calling evidence collections – to accommodate replication studies of specific effects/hypotheses being nested in different ways in relation to original studies that test an effect across different generalizations and/or operationalizations of the target constructs (e.g., replications of an effect via a single vs. multiple generalization(s)/operationalization(s) originating from a single published article; replications of an effect via multiple generalizations and/or operationalizations originating from several different published articles).
  3. Accommodation of different kinds of studies (e.g., experimental studies [RCTs], observational and correlational studies) and study designs (e.g., between-subjects designs, within-subject designs, interaction designs, etc.).
  4. Development of a working replication taxonomy to allow a principled and justifiable approach to distinguishing replications that are sufficiently methodologically similar to an original study vs. insufficiently methodologically similar. Such a taxonomy also guides what kind of original studies are eligible to be included in evidence collections as separate generalization branches under which direct replication studies are curated (see below for details).
  5. Taking into account study quality of replications and ability to pool across different subsets of replications that vary on the following study quality dimensions: (1) verifiability (e.g., open data/materials availability), (2) pre-registration status, (3) analytic reproducibility verification status, (4) analytic robustness verification status, (5) active sample evidence (also known as positive controls), and (6) replication design differences.
  6. Development of a principled approach to meta-analytically combining replication evidence within and across generalizations of an empirical effect and interpreting the overall meta-analytic results (e.g., fixed-effect vs. random-effects model, possibly hierarchical in the case of multiple generalizations and correlated outcomes; Bayesian approaches to yield more principled and meaningful credible intervals; the small-telescopes approach in the case of very few replication studies). A minimal sketch of one possible hierarchical specification is shown after this list.
  7. Creation of a viable crowd-sourcing system that includes key features to (i) incentivize number and frequency of contributions (low-barrier-to-entry approach, user contributions prominently displayed on public user profile and home page) and (ii) ensure quality-control (e.g., light-touch editorial review whereby posted information appears as "unverified" until an editor reviews and approves it).
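
To make point 6 above concrete, here is a minimal sketch of one possible hierarchical specification in which replication effects are nested within generalizations (i.e., distinct operationalizations) of an effect. The file name, grouping variables, and column names are illustrative assumptions, not a finalized Curate Science analysis pipeline.

    # Minimal sketch: multilevel random-effects model with replication effects
    # nested within generalizations of an effect. All names are assumed.
    library(metafor)

    dat <- read.csv("evidence_collection.csv")   # hypothetical file

    res <- rma.mv(yi, vi,
                  random = ~ 1 | generalization/study,  # between- and within-generalization heterogeneity
                  data   = dat,
                  method = "REML")

    summary(res)   # overall estimate across generalizations, plus variance components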

Ironing out these conceptual, epistemological, and statistical issues is a pre-requisite for setting out to build an actual web interface that researchers can use to track and gauge replicability and generalizability, ultimately producing a living and dynamically evolving body of knowledge that can soundly inform public-policy. The framework is developed with an initial focus on social science findings that have applied implications, given that such findings have much greater potential to influence society (e.g., the effect of font disfluency on math performance; stereotype threat; "wise" interventions on voting; the Mozart effect). That said, the framework will also be able to handle basic social science findings that may not necessarily have direct societal implications.


Replication Taxonomy

Our proposed conceptual framework also needs some kind of workable replication taxonomy to allow a principled and justifiable approach to distinguishing replications that are sufficiently methodologically similar to an original study from replications that are insufficiently similar. Contrary to some current views in the field of psychology, replications actually lie on an ordered continuum of methodological similarity relative to an original study, with exact and conceptual replications occupying the extremes. A direct replication repeats a study using methods as similar as is reasonably possible to the original study, whereas a conceptual replication repeats a study using different general methodology and tests whether a finding generalizes to different manipulations, measurements, domains, and/or contexts (Asendorpf et al., 2013; Brandt et al., 2014; Lykken, 1968; Simons, 2014). To guide the classification of replications based on methodological similarity to an original study, we use the replication taxonomy depicted below, which is a simplification of Schmidt’s (2009) replication classification scheme, itself a simplification of an earlier taxonomy proposed by Hendrick (1991).

[Figure: Replication Taxonomy]

As can be seen, different types of increasingly methodologically dissimilar replications exist between these two poles, each of which serves different purposes. In an “Exact” replication (1st column), every controllable methodological facet would be the same except for contextual variables, which is typically only possible for the original lab and hence is of limited utility for our purposes here. “Very Close” replications (2nd column) employ the same IV and DV operationalizations and IV and DV stimuli as an original study, but can differ in terms of procedural details, physical setting, and contextual variables (with any required linguistic and/or cultural adaptations of the IV or DV stimuli considered part of “Contextual variables”). “Close” replications (3rd column) employ the same IV and DV operationalizations, but can employ different sets of IV or DV stimuli (or different scale items or a shorter version of a scale) and different procedural details and contextual factors. “Far” replications (4th column) involve different operationalizations for the IV or DV constructs, whereas for “Very Far” replications (5th column) everything can be different, including different constructs altogether (via different operationalizations) in different domains of inquiry (as in Bargh, Chen, & Burrows’, 1996, Studies 1, 2, and 3). Hence, “Exact”, “Very Close”, and “Close” replications reflect increasingly methodologically dissimilar types of direct replications that provide sufficient levels of falsifiability to (1) test and confirm the reliability (i.e., basic existence) of a phenomenon and (2) systematically test relevant contextual factors and other auxiliary assumptions, which contribute to validity and generalizability (Srivastava, 2014; Meehl, 1967, 1978). On the other hand, “Far” and “Very Far” replications reflect conceptual replications that can only speak to validity and generalizability, given the major design differences intentionally introduced.

To achieve our goal, only the three types of direct replications (“Exact”, “Very Close”, and “Close”) are eligible for inclusion in evidence collections. We need such a demarcation because sufficiently methodologically similar replications naturally constrain design and analytic flexibility (old-school “poor person’s pre-registration”) and so ensure sufficient levels of falsifiability to refute an original claim, assuming auxiliary assumptions are met (Earp & Trafimow, 2015; Meehl, 1967, 1978). If a follow-up study differs on every methodological facet, then it can never refute an original claim because unsupportive evidence can always be attributed to one of the intentionally introduced design differences rather than to the falsity of the original hypothesis (Hendrick, 1991; LeBel & Peters, 2011; see also Feynman, 1974). Without such constraints, a popular field where numerous researchers are testing the same (false) hypothesis will inevitably produce false positives with enough determination, given that typically an infinite number of analytic and design specifications (across different operationalizations) exist to test a specific effect/phenomenon (Ioannidis, 2005; 2012). The historical case of cold fusion provides a compelling example of this. As recounted by Pashler and Harris (2012, p. 534), only follow-up studies using very different methodology yielded a trickle of positive results observing a cold fusion effect, whereas more methodologically similar replications yielded overwhelmingly negative results (Taubes & Bond, 1993).

Note: These features are from an older version (2.0.4) of Curate Science. We will soon be releasing revamped UI designs and features based on a new curation framework (version 3.0.1). You can also check out various other features under development in our sandbox.

Lightning-fast Search with Auto-complete

Our homepage will feature a lightning-fast search with auto-complete so that you can quickly find what you're looking for. To browse, you can select from the Most Curated or Recently Updated articles lists.

Innovative Search Results Page

Easily find relevant articles via icons that indicate availability of data/syntax, materials, replication studies, reproducibility info, and pre-registration info. Looking for articles that have specific components available? Use custom filters to only display those articles (e.g., only display articles with available data/syntax)!

Article Page: Putting it all Together

Our flagship feature is the consolidation and curation of key information about published articles, all of which comes together on the article page. The page will feature automatically updating meta-analytic effect size forest plots, in-browser R analyses to verify the reproducibility of results, editable fields to add, modify, or update study information, and element-specific in-line commenting.

User Profile Dashboard Page

The user dashboard will display a user's recent contributions, a list of their own articles, their reading and analysis history, recent activity by other users, and notification customization options.

Why are only "direct replications" considered on Curate Science?
"Direct replications" involve repeating a study using the same general methodology as the original study (except any required cultural or linguistic modifications). On the other hand, "conceptual replications" involve repeating a study using a different general methodology, to test whether a finding generalizes to other manipulations and measurements of the focal constructs (we argue such studies should thus more accurately be called "generalizability studies"). Curate Science only considers direct replications because only such studies can falsify original findings. Failed "conceptual replications" are completely ambiguous as to whether negative results are due to (1) the falsity of original finding or (2) the different methodology employed. Consequently, an over-emphasis on "conceptual replications", in combination with publication bias and unintentional exploitation of design and analytic flexibility, can grossly mischaracterize the evidence base for the reliability of empirical findings (Pashler & Harris, 2012; LeBel & Peters, 2011).
How methodologically close does a direct replication have to be to be added on Curate Science?
As mentioned, a "direct replication" involves repeating a study using the same general methodology as the original study. This means using the same experimental manipulation(s) for independent variables and the same measures for dependent/outcome variables. Minor deviations from the original methodology, however, are acceptable, including using different stimuli, different versions of a questionnaire (e.g., 18-item short-version of Need for cognition scale [NFC] instead of original 34-item version), and any cultural and/or linguistic changes required to execute a direct replication. For example, in Reuven et al.'s (2013) replication of Zhong & Liljenquist's (2006) Study 4, they used the same outcome measure (volunteer behavior), but used a slightly different operationalization (trichotomous rather than dichotomous volunteer choice). The important part is to disclose these design differences so that readers can judge for themselves the extent to which the design differences might be responsible for discrepant results (see Simonsohn (2016) for more on this). Indeed, in the near future, we will explicitly note any known design difference for curated replications (e.g., an icon, which when clicked, expands the row below revealing design differences for that replication).
What does "independent" in "independent replication" mean?
To prevent bias, replications must be carried out by researchers who are sufficiently independent from the researchers who executed the original studies. We conceptualize "sufficiently independent" following the "arm's length principle" used in law. In our context, this means that replicators have not co-authored articles with any original authors and also do not have any familial or interpersonal ties with any original authors.
What is Curate Science's official policy regarding soliciting feedback from original authors?
Curating replication results involves publicly discussing original research; hence our policy is to contact authors of original studies under discussion *before* publicly posting curated information. Feedback from original authors will be used to improve the posted information (author comments may also be posted directly underneath replication results to further augment interpretation of the results).
Who is Curate Science's intended audience?
Our primary audience is the community of academic, government, and industry researchers. However, we are designing the website so that the organized information is also useful to students, educators, journalists, and public policy makers (e.g., a journalist could look up an article to see whether limitations/flaws have been identified enabling them to write a more balanced news article). Our initial focus involves published articles in the life and social sciences (starting with psychology/neuroscience), however we may eventually expand to other areas.
What is curation?
Digital curation is the process of selecting, filtering, and extracting information so as to increase its quality. Curate Science organizes, selects, filters, and extracts information from a diverse set of sources with the goal of increasing the quality of fundamental information about scientific articles.
Who can access and consume information about articles on Curate Science?
Anyone, including non-registered users, can look up information on Curate Science.
Who can add, modify, and update article information on Curate Science?
Only registered users will be able to add, modify, and update information about articles. That being said, anyone who is affiliated with a research organization can become a registered user, as long as they provide their real name, email address, affiliation, and title (e.g., post-doc, graduate student, undergraduate student, research assistant).
How will you ensure quality control of the information posted about articles on Curate Science?
We will employ a two-stage verification process for some of the information whereby information initially posted will be labeled as "unverified" until a second user confirms it, at which time it will appear as verified. This will be the case for key statistics, independent replication information, and publication bias indices. Like Wikipedia, we will also have a revision history for each editable/updatable field showing which user changed what information on what date (and any notes regarding the edit left by the user).
Is it really feasible to organize and curate information for scientific articles? In other words, why would researchers be willing to spend their precious time curating information on Curate Science?
Our view is that researchers should be highly motivated to add and update information regarding published articles in their own area of research because there is an intrinsic interest in updating the scientific record to more accurately reflect the totality of the evidence. We also expect -- just as happened with Wikipedia -- that more influential and/or controversial articles will be curated first, given that a large number of researchers are interested in these articles. Other articles will likely be curated commensurate with the level of interest commanded by the market, though of course article authors are free to curate their own articles as much as they want (and for good reason, e.g., the citation advantage enjoyed by articles with available data; see Piwowar & Vision, 2013).
Who is funding Curate Science?
We have received a $10,000 USD seed grant from the Center for Open Science (COS) to help with initial development. An anonymous donor has also allocated funds to support Curate Science as part of a renewal grant given to COS. The Templeton Foundation has accepted our initial grant proposal and we will soon be submitting a full proposal to them. We are also currently in discussions with the Sloan Foundation and the Laura and John Arnold Foundation.
Is Curate Science associated with the Open Science Framework hosted by the Center for Open Science?
Though Curate Science has formed an informal partnership with the Center for Open Science (with respect to funding, see above), our web application is completely independent from the Open Science Framework (though we will of course be linking to available data, materials, and pre-registration information hosted on the OSF).
How is Curate Science different from Figshare.com, DataDryad.org, and Harvard's Dataverse?
Figshare.com, DataDryad.org, and Harvard's Dataverse (and many other similar websites) are data repositories where researchers can make their data, syntax, and materials publicly available and get credit for doing so. Curate Science organizes, consolidates, and curates all of this publicly available information from as many different sources as possible at the study-level for all published articles with a DOI. Curate Science also provides a platform for the crowd to verify analyses and post comments regarding specific issues such as reproducibility of analyses, problems with posted materials/stimuli, etc.
How is Curate Science different from the Open Science Framework?
The Open Science Framework is a place for researchers to archive, collaborate on, and pre-register their research projects; it facilitates researchers' workflow to help increase the alignment between scientific values and scientific practices (see more details on the OSF's about page). In contrast, Curate Science, as its name implies, is focused primarily on the curation of scientific information tied to published articles, providing a platform for users to add, modify, update, and comment on published articles' replication and reproducibility information (among other things, see features).
How is Curate Science different from PsychFileDrawer.org?
PsychFileDrawer.org is a highly useful website designed to overcome the pernicious file drawer problem in psychology; researchers can manually upload serious replication attempts there, whether they succeeded or failed. Curate Science aims to significantly build upon PsychFileDrawer's venerable efforts by automatically identifying as many extant independent replication results as possible (via text mining) and will also provide a simple interface for the crowd to add any missing replication results. Curate Science will also feature an innovative article page that visually depicts the complex inter-relationships between original and replication studies (to facilitate the difficult task of interpreting replication results) and that will also allow the crowd to curate key information about the original and replication studies (in addition to several other features).
Will I be able to post data/materials directly on an article page on Curate Science?
Yes, eventually. Our current focus is to organize and curate the available data/materials that are already hosted by the many public repositories that already exist. However, in our quest to radically simplify the fundamental scientific practice of sharing data/materials, we are working on forging a partnership with a major data repository website so that users can easily drag-and-drop data for their articles via our interface using our partner's infrastructure (indeed, we're currently in discussions with Figshare.com and the OSF in this regard).
Main Team
Etienne P. LeBel
Founder & Lead
Alex Kyllo
Technical Advisor
Fred Hasselman
Lead Statistician


Advisory Board
Denny Borsboom
University of Amsterdam
Hal Pashler
University of California - San Diego
Daniel Simons
University of Illinois
Alex Holcombe
University of Sydney
E-J Wagenmakers
University of Amsterdam

Brent Roberts
University of Illinois - Urbana-Champaign
Eric Eich
University of British Columbia
Rogier Kievit
University of Cambridge
Leslie John
Harvard University
Brian Earp
Oxford University

Uli Schimmack
University of Toronto
Simine Vazire
Washington University in St. Louis
Axel Cleeremans
Universite Libre de Bruxelles
Brent Donnellan
Michigan State University
Richard Lucas
Michigan State University

Marco Perugini
University of Milan-Bicocca
Mark Brandt
Tilburg University
Joe Cesario
Michigan State University
Ap Dijksterhuis
Radboud University Nijmegen

Jeffry Simpson
University of Minnesota
Jan De Houwer
Ghent University
Lorne Campbell
Western University


Foundational Members
Christian Battista
Technical Advisor
Ben Coe
Technical Advisor
Stephen Demjanenko
Technical Advisor
Please sign up below to receive the Curate Science Newsletter to be automatically notified about news and updates.

*Thanks to Felix Schönbrodt who is currently hosting Curate Science.