Crowd-sourcing the tracking and interpretation of replication evidence.
Published scientific findings can only be considered trustworthy -- for theory and applications (e.g., health interventions) -- once successfully replicated and generalized by independent researchers. No database, however, currently exists that systematically tracks and meta-analytically summarizes independent direct replications to gauge the replicability and generalizability of social science findings over time. Curate Science is a crowd-sourced effort to do exactly this, accelerating the development of trustworthy knowledge that can soundly inform theory and effective public policy to improve human welfare (see the About section for more details).
Update (October 28, 2016): We've just released a new framework (version 3.0.1) for curating replication evidence of social science findings and are now soliciting feedback (please email us at curatescience@gmail.com). Details of our previous approaches can be found here (version 2.0.4) and here (version 1.0.5).
- Reproducibility Project: Psychology [100 replications; view studies]
- Social Psych Special Issue [31 replications]
- Many Labs 1 [12 effects x 36 labs = 432 replications]
- Many Labs 2 [26 effects, N = ~15,000]
- Many Labs 3 [10 effects x 21 labs = 210 replications]
- Many Labs 4: Impact of "expertise" on replicability
- Many Labs 5: Can peer-review of protocols boost replicability?
- Registered Replication Reports (RRRs) at Perspectives on Psychological Science
- RRR1 & RRR2: Verbal overshadowing [23 replications; view studies]
- RRR3: Grammar on intentionality [13 replications]
- RRR4: Ego depletion [23 replications; view studies]
- RRR5: Facial feedback hypothesis [17 replications]
- RRR6: Commitment on forgiveness [16 replications]
- RRR7: Intuitive-cooperation effect [20 replications]
- RRR8: Trivial pursuit effect [data being collected]
- RRR9: Hostility priming increases perceptions of hostility [data being collected]
- RRR10: Moral reminder reduces cheating [data being collected]
- Economics Reproducibility Project [67 replications; Chang & Li, 2015]
- Economics Lab Experiments Replicability Project [18 replications; Camerer et al, 2016]
Last updated: December 3, 2016
All Replications (1047 replications; 370 curated, 677 being curated [see list])
List of known direct replications ("Exact", "Very Close", or "Close" direct replications, according to our working taxonomy) of generalizations of effects. PDFs of study details are linked via study author names. Replication methodological details can be found here. Please let us know about any missing replications or errors at curatescience@gmail.com
Search via CTRL+F (Windows) or ⌘+F (Mac). Last updated: November 14, 2016
Replication protocols and data/materials available at the specified protocol URL and OSF URL, respectively. More details about original and replication studies can be found here. Please let us know about any errors at curatescience@gmail.com
Search via CTRL+F (Windows) or ⌘+F (Mac). Last updated: November 21, 2016
Social Priming / Embodiment
Cleanliness priming -- Replications (7)
Schnall, Benton, & Harvey (2008a)
With a Clean Conscience: Cleanliness Reduces the Severity of Moral Judgments
| Original Studies & Replications | N | Effect size (d) [95% CI] |
| --- | --- | --- |
| Schnall et al. (2008a) Study 1 | 40 | |
| Arbesfeld et al. (2014) | 60 | |
| Besman et al. (2013) | 60 | |
| Huang (2014) Study 1 | 189 | |
| Lee et al. (2013) | 90 | |
| Johnson et al. (2014a) Study 1 | 208 | |
| Johnson et al. (2014b) | 736 | |
| Current meta-analytic estimate of replications of SBH's Study 1 (random-effects): | | |
| Schnall et al. (2008a) Study 2 | 43 | |
| Johnson et al. (2014a) Study 2 | 126 | |
| Current meta-analytic estimate of all replications (random-effects): | | |

[Underlying data (CSV)] [R-code]
Summary (Last updated: April 7, 2016): The main finding that cleanliness priming reduces the severity of moral judgments does not (yet) appear to be replicable (overall meta-analytic effect: r = -.08 [+/-.13]). In a follow-up commentary, Schnall argued that a ceiling effect in Johnson et al.'s (2014a) studies renders their results uninterpretable and hence that their replication results should be dismissed. However, independent re-analyses by Simonsohn, Yarkoni, Schönbrodt, Inbar, Fraley, and Simkovic appear to rule out such a ceiling-effect explanation; hence, Johnson et al.'s (2014a) results should be retained in gauging the replicability of the original cleanliness priming effect. Of course, it's possible "cleanliness priming" may be replicable under different operationalizations, conditions, and/or experimental designs (e.g., within-subjects). Indeed, Huang (2014) has reported new evidence suggesting cleanliness priming may only reduce the severity of moral judgments under conditions of "low response effort"; however, that research appears to have been underpowered (<50% power) to detect the small interaction effect found (r = .12). Regardless, independent corroboration of Huang's interaction effect is required before confidence is placed in such a moderated cleanliness priming effect.
Original authors' and replicators' comments: F. Cheung mentioned a note should be added that data for the Besman et al. (2013) replication have been lost (communicated to him by K. Daubman, who has not yet responded to my request for links to the original data of both her Arbesfeld et al. and Besman et al. replications). M. Frank mentioned we should consider including some of Huang's (2014) studies (baseline un-moderated conditions only), which led us to add Huang's Study 1 (the only study with a baseline condition comparable to Schnall et al.'s Study 1 design). S. Schnall has yet to respond (email sent March 11, 2016).
- Brent Donnellan (May 21, 2014): Random Reflections on Ceiling Effects and Replication Studies
- Simone Schnall (May 23, 2014): An Experience with a Registered Replication Project (Comments section with heated exchanges)
- Carol Tweten (May 25, 2014): I'm disappointed: A graduate student's perspective
- Sanjay Srivastava (May 25, 2014): Does the replication debate have a diversity problem?
- Michael Frank (May 26, 2014): Another replication of Schnall, Benton, & Harvey (2008)
- Nicole Janz (May 25, 2014): "Replication bullying": Who replicates the replicators?
- Felix Schönbrodt (May 26, 2014): About replication bullies and scientific progress
- Etienne P. LeBel (May 26, 2014): Unsuccessful replications are beginnings not ends – Part I
- Michael Kraus (May 27, 2014): Notes on replication from an un-tenured social psychologist
- Simone Schnall (May 31, 2014): Further Thoughts on Replications, Ceiling Effects and Bullying
- Commentary related to ceiling effect re-analyses
- Uri Simonsohn (June 4, 2014): Ceiling effects and replications
- Tal Yarkoni (June 1, 2014): There is no ceiling effect in Johnson, Cheung, & Donnellan (2014)
- Felix Schönbrodt (June 2, 2014): Reanalyzing the Schnall/Johnson “cleanliness” data sets: New insights from Bayesian and robust approaches
- R. Chris Fraley (May 24, 2014): Additional Reflections on Ceiling Effects in Recent Replication Research
- Yoel Inbar (May 31, 2014): Ceiling effects?
- Matus Simkovic (June 25, 2014): Guess what? Another Analysis of the Schnall-Johnson Data
Money priming -- Replications (42)
Vohs, Mead, & Goode (2006)
The psychological consequences of money
Caruso, Vohs, Baxter, & Waytz (2013)
Mere exposure to money increases endorsement of free-market systems and social inequality
Summary (Last updated: March 24, 2016): The claim that incidental exposure to money influences social behavior and beliefs does not (yet) appear to be replicable (overall meta-analytic effect: d = -.01 [+/-.05]). This appears to be the case whether money exposure is manipulated via instruction background images (Caruso et al., 2013, Study 1 & 4) or a descrambling sentence task (Vohs et al., 2006, Study 3), and whether the outcome variable is helping others (Vohs et al., 2006, Study 3), system justification beliefs (Caruso et al., 2013, Study 1), just world beliefs (Caruso et al., 2013, Study 2), social dominance beliefs (Caruso et al., 2013, Study 3), or fair market beliefs (Caruso et al., 2013, Study 4). Of course, it's possible money exposure reliably influences behavior under other (currently unknown) conditions, via other operationalizations, and/or using other experimental designs (e.g., within-subjects).
Original authors' comments: K. Vohs responded and mentioned Schuler & Wänke's (in press) replication of Caruso et al. (2013) was missing; this led us to add Schuler & Wänke (in press) Study 2 (main effect) as a direct replication of Caruso et al. (2013) Study 2. Vohs pointed out several design differences between Grenier et al. (2012) and Vohs et al.'s (2006) original Study 3, but these deviations are minor (e.g., different priming stimuli, different help target); given Grenier et al. (2012) used the same general methodology as Vohs et al. (2006) Study 3 for the independent variable (unscrambling priming task) and dependent variable (offering help to code data sheets), the study satisfies eligibility criteria for a sufficiently similar direct replication according to Curate Science's taxonomy and hence was retained. Vohs also pointed out design differences between Tate (2009) and Vohs et al. (2006) Study 3; given Tate (2009) employed a different general methodology for the IV (background image on a poster instead of an unscrambling task), the study does *not* satisfy eligibility criteria for a direct replication and hence was excluded. Finally, Vohs mentioned that "replication studies" for Vohs et al. (2006) are reported in Vohs (2015); however, none of these studies were sufficiently similar methodologically to meet direct replication eligibility criteria and hence were not added.
- Joe Pinsker (October 30, 2014): Just Looking at Cash Makes People Selfish and Less Social
- Neuroskeptic (July, 2015): Social Priming: Money for Nothing?
| Original Studies & Replications | N | Effect size (r) [95% CI] |
| --- | --- | --- |
| Zhong & Liljenquist (2006) Study 2 | 27 | |
| Earp et al. (2014) Study 3 | 286 | |
| Siev (2012) Study 2 | 148 | |
| Earp et al. (2014) Study 2 | 156 | |
| Siev (2012) Study 1 | 335 | |
| Earp et al. (2014) Study 1 | 153 | |
| Gamez et al. (2011) Study 2 | 36 | |
| Current meta-analytic estimate of replications of Z&L's Study 2 (random-effects): | | |
| Zhong & Liljenquist (2006) Study 3 | 32 | |
| Fayard et al. (2009) Study 1 | 210 | |
| Gamez et al. (2011) Study 3 | 45 | |
| Current meta-analytic estimate of replications of Z&L's Study 3 (random-effects): | | |
| Zhong & Liljenquist (2006) Study 4 | 45 | |
| Fayard et al. (2009) Study 2 | 115 | |
| Gamez et al. (2011) Study 4 | 28 | |
| Reuven et al. (2013) | 29 | |
| Current meta-analytic estimate of replications of Z&L's Study 4 (random-effects): | | |
| Current meta-analytic estimate of all replications (random-effects): | | |

[Underlying data (CSV)] [R-code]
Summary (Last updated: November 11, 2016): The main finding that a threat to one's moral purity induces the need to cleanse oneself (the "Macbeth effect") does not (yet) appear to be replicable (overall meta-analytic effect: r = -.02 [+/-.05]). This appears to be the case whether moral purity threat is manipulated via recalling an unethical vs. ethical deed (Studies 3 and 4) or transcribing text describing an unethical vs. ethical act (Study 2), and whether the need to cleanse oneself is measured via desirability of cleansing products (Study 2), product choice (Study 3), or reduced volunteerism after cleansing (Study 4). Of course, it is possible the "Macbeth effect" is replicable under different operationalizations and/or experimental designs (e.g., within-subjects).
Original authors' comments: We shared a draft of the curated set of replications with both original authors, and invited them to provide feedback. Chenbo Zhong replied thanking us for the notice and mentioned two published articles that should potentially be considered (i.e., Denke et al., 2014; Reuven et al., 2013). Reuven et al. do indeed report a sufficiently close replication (in their non-OCD control group) of Zhong & Liljenquist's Study 4 and hence the control group replication was added (though we're currently clarifying an issue with their reported t-value).
- Christian Jarrett (November 18, 2013): Not so easy to spot: A failure to replicate the Macbeth Effect across three continents
- David Berreby (March 2011): Three Cheers for Failure!
| Original Studies & Replications | N | Effect size (r) [95% CI] |
| --- | --- | --- |
| Bargh & Shalev (2012) Study 1a | 51 | |
| Bargh & Shalev (2012) Study 1b | 41 | |
| Donnellan et al. (2015a) Study 9 | 197 | |
| Donnellan et al. (2015a) Study 4 | 228 | |
| Donnellan et al. (2015a) Study 1 | 235 | |
| Donnellan et al. (2015b) | 291 | |
| Ferrell et al. (2013) | 365 | |
| McDonald & Donnellan (2015) | 356 | |
| Donnellan et al. (2015a) Study 2 | 480 | |
| Donnellan et al. (2015a) Study 8 | 365 | |
| Donnellan et al. (2015a) Study 7 | 311 | |
| Donnellan & Lucas (2014) | 531 | |
| Donnellan et al. (2015a) Study 6 | 553 | |
| Donnellan et al. (2015a) Study 5 | 494 | |
| Donnellan et al. (2015a) Study 3 | 210 | |
| Current meta-analytic estimate of replications of B&S' Study 1 (random-effects): | | |
| Bargh & Shalev (2012) Study 2 | 75 | |
| Wortman et al. (2014) | 260 | |
| Current meta-analytic estimate of all replications (random-effects): | | |

[Underlying data (CSV)] [R-code]
Summary (Last updated: April 7, 2016): The notion that physical warmth influences psychological social warmth does not appear to be well-supported by the independent replication evidence (overall meta-analytic effect: r = .007 [+/-.035]), at least via Bargh and Shalev's (2012) Study 1 and 2 operational tests (Study 1: trait loneliness is positively associated with warmer bathing; Study 2: briefly holding a frozen cold-pack boosts reported feelings of chronic loneliness). Regarding the first operational test, the loneliness-shower effect doesn't appear replicable whether (1) trait loneliness is measured using the complete 20-item UCLA Loneliness Scale (Donnellan et al., 2015 Studies 1-4) or a 10-item modified version of the UCLA Loneliness Scale (Donnellan et al., 2015 Studies 5-9, as in Bargh & Shalev, 2012 Studies 1a and 1b), (2) warm bathing is measured via a "physical warmth index" (all replications, as in Bargh & Shalev, 2012 Studies 1a and 1b) or via the arguably more hypothesis-relevant water temperature item (all replications of Bargh & Shalev Study 1), or (3) participants were sampled from Michigan (Donnellan et al., 2015 Studies 1-9), Texas (Ferrell et al., 2013), or Israel (McDonald & Donnellan, 2015). Of course, different operationalizations of the idea may yield replicable evidence, e.g., in different domains, contexts, or using other experimental designs (e.g., within-subjects). In a response, Shalev & Bargh (2015) point out design differences in Donnellan et al.'s (2015) replications that could have led to discrepant results (e.g., participant awareness not probed) and report three additional studies yielding small positive correlations between loneliness and new bathing and showering items (measured separately; r = .09 [+/-.09, N=491] and r = .14 [+/-.08, N=552]).
These new findings, however, await independent corroboration (these additional studies are not included in the meta-analysis because they were executed by non-independent researchers; see FAQ for more details). In a rejoinder, Donnellan et al. (2015b) report an additional study that (1) probed participant awareness and found the effect size unaltered by excluding participants suspected of study awareness (r=-.04, N=291 vs. r=-.05, N=323 total sample) and (2) found no evidence that individual differences in attachment style moderated the loneliness-showering link.
Original authors' comments: I. Shalev responded stating that they have already publicly responded to these replications, reporting three additional studies in their response, and asked that readers be referred to that article (Shalev & Bargh, 2015). B. Donnellan responded stating that several open questions remain, including (1) unexplained anomalies in Bargh & Shalev's (2012) Study 1a data (i.e., 46 of the 51 participants (90%) reported taking less than one shower or bath per week) and (2) concerns regarding unclear exclusion criteria for Shalev & Bargh's (2015) new studies. Donnellan further stated that he's unconvinced by Shalev & Bargh's reply and that replication attempts by multiple independent labs would be the most constructive step forward.
- Christian Jarrett (June 20, 2011): Feeling lonely? Have a bath
- Unknown (July 2, 2011): Hot Baths May Cure Loneliness
- Unknown (July 5, 2011): Wash the loneliness away with a long, hot bath
- Elizabeth Angell (July 11, 2011): What a Long, Hot Shower Says About You
- Ian Birch: Important study has implications for treatment of social anxiety
- Unknown (June 23, 2011): Having a hot bath dispels loneliness
- Unknown (June 24, 2011): How soaking in a warm bath can stop you feeling lonely
- Brent Donnellan (September 20, 2012): What’s the First Rule about John Bargh’s Data?
- Sian Beilock (January 25, 2012): Feeling Lonely? Take a Warm Bath
- Brent Donnellan (January 24, 2014): Warm Water and Loneliness
- Brent Donnellan (May 1, 2014): Warm Water and Loneliness Again?!?!
- Brent Donnellan (November 19, 2014): (Hopefully) The Last Thing We Write About Warm Water and Loneliness
Strength model of self-control -- Replications (32)
Muraven, Tice, & Baumeister (1998)
Self-control as limited resource: Regulatory depletion patterns
Baumeister, Bratslavsky, Muraven, & Tice (1998)
Ego depletion: Is the active self a limited resource?
Summary (Last updated: November 11, 2016): There appear to be replication difficulties across 6 different operationalizations of original studies supporting the two main predictions of the strength model of self-control (Baumeister et al., 2007). Prediction 1: Independent researchers appear unable to replicate the finding that glucose consumption counteracts ego depletion, whether self-control is measured via Stroop (Cesario & Corker, 2010, as in Gailliot et al., 2007, Study 7) or a future-discounting task (Lange & Eggert, 2014, Study 1, as in Wang & Dvorak, 2010). Prediction 2: There also appear to be replication difficulties (across 4 distinct operationalizations) for the basic ego depletion effect. This is the case whether the IV is manipulated via thought suppression, video attention task, emotion suppression during video watching, or effortful letter crossing task, and whether the DV is measured via anagram performance, standardized tests, working memory, or a multi-source interference task. Wang et al. (2014) do appear to successfully replicate Inzlicht & Gutsell's (2007) finding that ego depletion led to reduced activity in the anterior cingulate (a region previously associated with conflict monitoring); however, this finding should be interpreted with caution given potential bias due to analytic flexibility in data exclusions and EEG analyses. Of course, ego depletion may reflect a replicable phenomenon under different conditions, contexts, and/or operationalizations; however, the replication difficulties across 6 different operationalizations suggest ego depletion might be much more nuanced than previously thought. Indeed, alternative models have recently been proposed (e.g., motivation/attention-based accounts, Inzlicht et al., 2014; mental fatigue, Inzlicht & Berkman, 2015) and novel intra-individual paradigms to measure ego depletion have also emerged (Francis, 2014; Francis et al., 2015) that offer promising avenues for future research.
Original authors' and replicators' comments: B. Schmeichel pointed out a missing replication (Healy et al., 2011, Study 1) of Schmeichel (2007, Study 1); we've added the study, though we are currently clarifying with K. Healey a potential issue with their reported effect size. F. Lange mentioned that effect sizes for the RRR ego depletion replications seemed off (also pointed out by B. Schmeichel); indeed, we inadvertently sourced the effect sizes from an RRR dataset that included all exclusions (these have now been corrected and match the values reported in Figure 1 of Sripada et al.'s RRR article). M. Inzlicht responded that he's currently developing a pre-registered study of the basic ego depletion effect using a much longer initial depletion task, adapted to be effortful for everyone, via a more powerful pre-post mixed design. R. Dvorak stated their study was not a replication of ego depletion; we clarified that the Wang & Dvorak (2010) study is used as an original study whose finding is consistent with the glucose claim of Baumeister et al.'s (2007) strength model. J. Lurquin mentioned their effect size was d=0.22 (not d=0.21), but .21 is actually correct given we apply the Hedges' g bias correction (we still label the estimate d because d is more familiar to researchers).
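The Hedges' g correction mentioned above shrinks Cohen's d by the small-sample factor J = 1 - 3/(4*df - 1), where df = n1 + n2 - 2. A minimal sketch (the numbers below are made up for illustration, not the actual Lurquin et al. data):

```python
def hedges_g(d, n1, n2):
    """Apply the small-sample bias correction J = 1 - 3/(4*df - 1)
    to Cohen's d, yielding Hedges' g (often still reported as d)."""
    df = n1 + n2 - 2
    j = 1 - 3 / (4 * df - 1)
    return d * j

# Hypothetical two-group study: d = 0.50 with 20 participants per cell
g = hedges_g(0.50, 20, 20)  # J = 1 - 3/151, so g is slightly below d
```

For large samples J approaches 1, which is why the correction only matters for the small-N studies common in this literature.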
- Michael Inzlicht (April 4, 2016): Updating beliefs
- Christopher J. Ferguson (March 29, 2016): The Reduction of Ego-Depletion
- Brian Resnick (March 25, 2016): What psychology’s crisis means for the future of science
- Michael Inzlicht (March 25, 2016): The Replication Crisis Is My Crisis
- Steven M. Ledbetter & Omar Ganai (March 17, 2016): We've all been wrong about willpower. And that's OK.
- Daniel Engber (March 6, 2016): Everything Is Crumbling: An influential psychological theory, borne out in hundreds of experiments, may have just been debunked. How can so many scientists have been so wrong?
- Melissa Dahl (March 4, 2016): If You Believe Your Willpower Is Limitless, It Is: Psychology's favorite theory about willpower may be totally wrong
- Sam Mcnerney (April, 2013): Ego Depletion, Motivation and Attention: A New Model of Self-Control
- Responses to Ego Depletion RRR:
- Roy Baumeister & Kathleen Vohs (March 17, 2016): Misguided Effort with Elusive Implications
- C. Sripada, D. Kessler, & J. Jonides (March, 2016): Sifting Signal From Noise With Replication Science
- Martin S. Hagger (March, 2016): Rumours of the Demise of Ego-Depletion are (Somewhat) Exaggerated
Classic Social Psychology
Mood on helping -- Replications (3)
Isen & Levin (1972)
Effect of feeling good on helping: Cookies and kindness
Levin & Isen (1975)
Further studies on the effect of feeling good on helping
| Original Studies & Replications | N | Effect size (Risk Difference) [95% CI] |
| --- | --- | --- |
| Isen & Levin (1972) Study 2 | 41 | |
| Blevins & Murphy (1974) | 50 | |
| Levin & Isen (1975) Study 1 | 24 | |
| Weyant & Clark (1977) Study 2 | 106 | |
| Weyant & Clark (1977) Study 1 | 32 | |
| Current meta-analytic estimate of L&I Study 1 replications (random-effects): | | |
| Current meta-analytic estimate of all replications (random-effects): | | |

[Underlying data & R-code]
Summary (Last updated: March 24, 2016): The finding that positive mood boosts helping appears to have replicability problems. Across three replications, individuals presumably in a positive mood (induced via finding a dime in a telephone booth) helped at about the same rate (29.6%) as those not finding a dime (29.8%; meta-analytic risk difference estimate = .03 [+/-.19]; in the original studies, 88.8% of dime-finding Ps helped compared to 13.9% of Ps in the control condition). This was the case whether helping was measured via picking up dropped papers (Blevins & Murphy, 1974, as in Isen & Levin, 1972, Study 2) or via mailing a "forgotten letter" (Weyant & Clark, 1977 Study 1 & 2, as in Levin & Isen, 1975, Study 1). These negative replication results are insufficient to declare the mood-helping link unreplicable; however, they do warrant concern that additional unmodeled factors should perhaps be considered. For instance, it seems plausible that mood may influence helping in different ways for different individuals (e.g., negative, rather than positive, mood may boost helping in some individuals), and may also influence the same person differently on different occasions. Using highly-repeated within-person (HRWP) designs (e.g., Whitsett & Shoda, 2014) would be a fruitful avenue to empirically investigate these more plausible links between mood and helping behavior.
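For reference, the risk-difference effect size used above is simply the difference in helping rates between conditions, and a Wald-type 95% CI follows from the binomial standard errors. A minimal illustrative sketch (the counts below are hypothetical, not taken from the studies above):

```python
import math

def risk_difference(helped_t, n_t, helped_c, n_c):
    """Risk difference (treatment minus control) with a Wald 95% CI."""
    p_t, p_c = helped_t / n_t, helped_c / n_c
    rd = p_t - p_c
    # Standard error of a difference in independent proportions
    se = math.sqrt(p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c)
    return rd, (rd - 1.96 * se, rd + 1.96 * se)

# Hypothetical counts: 8 of 27 dime-finders helped vs. 7 of 25 controls
rd, ci = risk_difference(8, 27, 7, 25)  # rd near zero, wide CI
```

With cell sizes this small the CI is very wide, which is exactly why single replications of these classic studies are individually uninformative and meta-analytic pooling is needed.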
Original authors' comments: Report your research and results thoroughly; you may no longer be around when future researchers interpret replication results of your work!
Registered Replication Reports (RRR) @PoPS
Verbal overshadowing (RRR1 & RRR2) -- Replications (23)
Schooler & Engstler-Schooler (1990)
Verbal overshadowing of visual memories: Some things are better left unsaid
| Original Studies & Replications | N | Effect size [95% CI] |
| --- | --- | --- |
| Schooler & Engstler-Schooler (1990) Study 1 | 88 | |
| Poirer et al. (2014) | 95 | |
| Delvenne et al. (2014) | 98 | |
| Birt & Aucoin (2014) | 65 | |
| Susa et al. (2014) | 111 | |
| Carlson et al. (2014) | 160 | |
| Musselman & Colarusso (2014) | 78 | |
| Echterhoff & Kopietz (2014) | 124 | |
| Mammarella et al. (2014) | 104 | |
| Dellapaolera & Bornstein (2014) | 164 | |
| Mitchell & Petro (2014) | 109 | |
| Ulatowska & Cislak (2014) | 106 | |
| Wade et al. (2014) | 121 | |
| McCoy & Rancourt (2014) | 89 | |
| Greenberg et al. (2014) | 75 | |
| Alogna et al. (2014) | 137 | |
| Michael et al. (2014, mTurk) | 615 | |
| Koch et al. (2014) | 67 | |
| Rubinova et al. (2014) | 110 | |
| Eggleston et al. (2014) | 93 | |
| Kehn et al. (2014) | 113 | |
| Current meta-analytic estimate of all lab replications (random-effects): | | |

[Underlying data (CSV) & R-code]
Summary (Last updated: March 3, 2016): The verbal overshadowing effect appears to be replicable; verbally describing a robber after a 20-minute delay decreased the correct identification rate in a lineup by 16% (from 54% [control] to 38% [verbal]; meta-analytic estimate = -16% [+/-.04], equivalent to r = .17). Still in question, however, is the validity and generalizability of the effect, hence it's still premature for public policy to be informed by verbal overshadowing evidence. Validity-wise, it's unclear whether verbal overshadowing is driven by a more conservative judgmental response bias process or by a reduced memory discriminability process, because no "suspect-absent" lineups were used. This is important to clarify because it directly influences how eye-witness testimony should be treated (e.g., if verbal overshadowing is primarily driven by a more conservative response bias process, identifications made after a verbal description should actually be given *more* [rather than less] weight; see Mickes & Wixted, 2015). Generalizability-wise, in a slight variant of RRR2 (i.e., RRR1), a much smaller overall verbal deficit of -4% [+/-.03] emerged when the lineup identification occurred 20 minutes after the verbal description (which occurred immediately after seeing the robbery). Future research needs to determine the size of verbal overshadowing when there is a delay between the crime and the verbal description and before the lineup identification, which would better reflect real-world conditions.
Original authors' comments: We shared a draft of the curated set of replications with original authors, and invited them to provide feedback. Jonathan Schooler replied stating that the information seemed fine to him.
- Rolf Zwaan (September 18, 2014): Verbal overshadowing: What can we learn from the First APS Registered Replication Report?
- Mickes & Wixted (2015) follow-up article: On the applied implications of the verbal overshadowing effect
(For full details of our replication evidence curation framework, please see here.)
Every year, society spends billions of dollars (primarily public taxpayer money) to fund scientific studies aimed at deepening our understanding of the natural and social world. The hope is that the findings yielded by these studies will help us address important societal problems (e.g., cancer; suicide; racial discrimination; voter turnout). These findings, however, can only be considered sufficiently trustworthy knowledge, ready to inform public-policy decisions, once they have been successfully replicated and generalized by independent researchers. Successful replication means that independent researchers have been able to consistently observe similar results as originally reported, using methodology and conditions similar to the original study. Successful generalization means that independent researchers have been able to consistently observe similar results as originally reported under situations that use different methodologies (often superior methodologies or measurement instruments), contexts, and populations, consequently producing evidence that the original results generalize to these different situations.
Current approaches (i.e., traditional meta-analyses) to synthesizing evidence are unable to produce the trustworthy knowledge we seek because they cannot fully account for publication bias (Ferguson & Heene, 2012; McShane, Bockenholt, & Hansen, 2016; Rosenthal, 1979), questionable research practices (John et al., 2012), unintentional exploitation of design and analytic flexibility (Simmons et al., 2011; Gelman & Loken, 2013), or the various unknowable interactions among these factors.
To achieve our goal of creating trustworthy knowledge then, we need to systematically track the replicability and generalizability of social science findings over time. Curate Science is a general and unified framework for the tracking and curation of replicability and generalizability evidence of social science findings, with the goal of producing a dynamic, living, and continuously evolving body of knowledge that can soundly inform public-policy. The general framework needs to be very flexible to overcome several distinct conceptual, epistemological, and statistical challenges that arise when tracking and gauging replicability and generalizability. Each of the following challenges needs to be overcome to achieve our goal:
- Accommodation of different approaches to replication: The current focus in economics and political science is on analytic reproducibility and robustness analyses, whereas the current focus in psychology is on new-sample replications. Each of these approaches is important, and the order in which these approaches are implemented is crucial to maximize research efficiency. Findings that are not analytically reproducible and/or analytically robust may not be worth the costly expenses required to attempt to replicate in a new sample. Also, for maximal knowledge creation, it is crucial to have inter-disciplinary curation of replication evidence rather than having economists, political scientists, and psychologists maintain their own replication databases (as is currently the case).
- Accumulation of replication evidence that speaks to the replicability and generalizability of an effect/hypothesis (i.e., Replicability replication evidence vs. Generalizability replication evidence). To achieve this, we need flexible ontological structures – what we’re calling evidence collections – to accommodate replication studies of specific effects/hypotheses being nested in different ways in relation to original studies that test an effect across different generalizations and/or operationalizations of the target constructs (e.g., replications of an effect via a single vs. multiple generalization(s)/operationalization(s) originating from a single published article; replications of an effect via multiple generalizations and/or operationalizations originating from several different published articles).
- Accommodation of different kinds of studies (e.g., experimental studies [RCTs], observational and correlational studies) and study designs (e.g., between-subjects designs, within-subject designs, interaction designs, etc.).
- Development of a working replication taxonomy to allow a principled and justifiable approach to distinguishing replications that are sufficiently methodologically similar to an original study vs. insufficiently methodologically similar. Such a taxonomy also guides what kind of original studies are eligible to be included in evidence collections as separate generalization branches under which direct replication studies are curated (see below for details).
- Taking into account study quality of replications and ability to pool across different subsets of replications that vary on the following study quality dimensions: (1) verifiability (e.g., open data/materials availability), (2) pre-registration status, (3) analytic reproducibility verification status, (4) analytic robustness verification status, (5) active sample evidence (also known as positive controls), and (6) replication design differences.
- Development of a principled approach to meta-analytically combining replication evidence within and across generalizations of an empirical effect and interpreting the overall meta-analytic results (e.g., fixed-effect vs. random-effects model, possibly hierarchical in the case of multiple generalizations and correlated outcomes; Bayesian approaches to yield more principled and meaningful credible intervals; small telescope approach in the case of very few replication studies).
- Creation of a viable crowd-sourcing system that includes key features to (i) incentivize number and frequency of contributions (low-barrier-to-entry approach, user contributions prominently displayed on public user profile and home page) and (ii) ensure quality-control (e.g., light-touch editorial review whereby posted information appears as "unverified" until an editor reviews and approves it).
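As a rough illustration of the nesting that evidence collections must support (replications of an effect curated under one or more generalizations/operationalizations, which may originate from different published articles), the structure might be sketched as follows. All names here are hypothetical; this is a sketch of the idea, not the actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class ReplicationStudy:
    """A direct replication of one operationalization of an effect."""
    source_article: str   # article the replication appeared in
    effect_size: float
    n: int

@dataclass
class Generalization:
    """One operationalization of the target construct(s); may originate
    from a different published article than other generalizations."""
    operationalization: str
    original_article: str
    replications: list[ReplicationStudy] = field(default_factory=list)

@dataclass
class EvidenceCollection:
    """An effect/hypothesis with replications nested under one or
    more generalization branches."""
    effect_label: str
    generalizations: list[Generalization] = field(default_factory=list)
```

An effect replicated via a single operationalization maps to a collection with one `Generalization`; an effect tested across several operationalizations from several articles simply gains additional branches.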
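To make the quality-subsetting and meta-analytic pooling ideas above concrete, here is a minimal sketch that filters replications by quality flags and pools the survivors with a standard inverse-variance-weighted fixed-effect estimate. Field names are illustrative, and the fixed-effect formula is only one of the modeling options mentioned above (the actual framework may use random-effects, hierarchical, or Bayesian models):

```python
import math
from dataclasses import dataclass

@dataclass
class Replication:
    effect_size: float
    variance: float
    open_data: bool = False       # (1) verifiability
    preregistered: bool = False   # (2) pre-registration status
    reproducible: bool = False    # (3) analytic reproducibility verified

def subset(reps, **flags):
    """Keep only replications matching the requested quality flags."""
    return [r for r in reps
            if all(getattr(r, k) == v for k, v in flags.items())]

def fixed_effect_meta(reps):
    """Inverse-variance-weighted fixed-effect pooled estimate,
    with a 95% confidence interval."""
    weights = [1.0 / r.variance for r in reps]
    pooled = sum(w * r.effect_size
                 for w, r in zip(weights, reps)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    return pooled, (pooled - 1.96 * se, pooled + 1.96 * se)
```

Pooling over, say, only the pre-registered replications is then just `fixed_effect_meta(subset(reps, preregistered=True))`, and the same pattern extends to any combination of the six quality dimensions.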
Ironing out these conceptual, epistemological, and statistical issues is a prerequisite to building an actual web interface that researchers can use to track and gauge replicability and generalizability, ultimately producing a living and dynamically evolving body of knowledge that can soundly inform public policy. The framework is being developed with an initial focus on social science findings that have applied implications, given that such findings have much greater potential to influence society (e.g., the font-disfluency-boosts-math-performance effect; stereotype threat; "wise" interventions on voting; the Mozart effect). That said, the framework will also be able to handle basic social science findings that may not necessarily have direct societal implications.
Our proposed conceptual framework also needs some kind of workable replication taxonomy to allow a principled and justifiable approach to distinguishing replications that are sufficiently methodologically similar to an original study from replications that are insufficiently similar. Contrary to some current views in the field of psychology, replications actually lie on an ordered continuum of methodological similarity relative to an original study, with exact and conceptual replications occupying the extremes. A direct replication repeats a study using methods as similar as is reasonably possible to the original study, whereas a conceptual replication repeats a study using different general methodology and tests whether a finding generalizes to different manipulations, measurements, domains, and/or contexts (Asendorpf et al., 2013; Brandt et al., 2014; Lykken, 1968; Simons, 2014). To guide the classification of replications based on methodological similarity to an original study, we use the replication taxonomy depicted below, which is a simplification of Schmidt’s (2009) replication classification scheme, itself a simplification of an earlier taxonomy proposed by Hendrick (1991).
As can be seen, different types of increasingly methodologically dissimilar replications exist between these two poles, each of which serves different purposes. In an “Exact” replication (1st column), every controllable methodological facet would be the same except for contextual variables, which is typically only possible for the original lab and hence is of limited utility for our purposes here. “Very Close” replications (2nd column) employ the same IV and DV operationalizations and IV and DV stimuli as an original study, but can differ in terms of procedural details, physical setting, and contextual variables (with any required linguistic and/or cultural adaptations of the IV or DV stimuli considered part of “contextual variables”). “Close” replications (3rd column) employ the same IV and DV operationalizations, but can employ different sets of IV or DV stimuli (or different scale items or a shorter version of a scale) and different procedural details and contextual factors. “Far” replications (4th column) involve different operationalizations for the IV or DV constructs, whereas in “Very Far” replications (5th column) everything can be different, including different constructs altogether (via different operationalizations) in different domains of inquiry (as in Bargh, Chen, & Burrows’, 1996, Studies 1, 2, and 3). Hence, “Exact”, “Very Close”, and “Close” replications reflect increasingly methodologically dissimilar types of direct replications that provide sufficient levels of falsifiability to (1) test and confirm the reliability (i.e., basic existence) of a phenomenon and (2) systematically test relevant contextual factors and other auxiliary assumptions, which contribute to validity and generalizability (Srivastava, 2014; Meehl, 1967, 1978). On the other hand, “Far” and “Very Far” replications reflect conceptual replications that can only speak to validity and generalizability, given the major design differences intentionally introduced.
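For illustration, the continuum just described can be encoded as a mapping from replication type to the set of methodological facets allowed to differ from the original study. This treats the taxonomy as cumulative (each type permits every difference the previous one does, plus more), which is a simplifying assumption, and the facet labels are paraphrased from the descriptions above:

```python
# Which methodological facets may DIFFER from the original study,
# per replication type (cumulative along the similarity continuum;
# a simplified encoding of the taxonomy described above).
REPLICATION_TAXONOMY = {
    "Exact":      {"contextual variables"},
    "Very Close": {"contextual variables", "procedural details",
                   "physical setting"},
    "Close":      {"contextual variables", "procedural details",
                   "physical setting", "IV/DV stimuli"},
    "Far":        {"contextual variables", "procedural details",
                   "physical setting", "IV/DV stimuli",
                   "IV/DV operationalizations"},
    "Very Far":   {"contextual variables", "procedural details",
                   "physical setting", "IV/DV stimuli",
                   "IV/DV operationalizations", "constructs"},
}

# Only these three types count as direct replications and are
# eligible for inclusion in evidence collections.
DIRECT_REPLICATIONS = {"Exact", "Very Close", "Close"}
```

Checking a candidate study is then a matter of verifying that its set of design differences is contained in the facet set of one of the three direct-replication types.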
To achieve our goal, only the three types of direct replications (“Exact”, “Very Close”, and “Close”) are eligible for inclusion in evidence collections. We need such a demarcation because sufficiently methodologically similar replications naturally constrain design and analytic flexibility (the old-school “poor person’s pre-registration”) and so ensure sufficient levels of falsifiability to refute an original claim, assuming auxiliary assumptions are met (Earp & Trafimow, 2015; Meehl, 1967, 1978). If a follow-up study differs on every methodological facet, then it can never refute an original claim, because unsupportive evidence can always be attributed to one of the intentionally introduced design differences rather than to the falsity of the original hypothesis (Hendrick, 1991; LeBel & Peters, 2011; see also Feynman, 1974). Without such constraints, a popular field in which numerous researchers are testing the same (false) hypothesis will, with enough determination, inevitably produce false positives, given that a virtually infinite number of analytic and design specifications (across different operationalizations) typically exist to test a specific effect/phenomenon (Ioannidis, 2005; 2012). The historical case of cold fusion provides a compelling example: as recounted by Pashler and Harris (2012, p. 534), only follow-up studies using very different methodology yielded a trickle of positive results for a cold fusion effect, whereas more methodologically similar replications yielded overwhelmingly negative results (Taubes & Bond, 1993).
Note: These features are from an older version (2.0.4) of Curate Science. We will soon be releasing revamped UI designs and features based on a new curation framework (version 3.0.1). You can also check out various other features under development in our sandbox.
Lightning-fast Search with Auto-complete
Our homepage will feature a lightning-fast search with auto-complete so that you can quickly find what you're looking for. To browse, you can select from the Most Curated or Recently Updated articles lists.
Innovative Search Results Page
Easily find relevant articles via icons that indicate availability of data/syntax, materials, replication studies, reproducibility info, and pre-registration info. Looking for articles that have specific components available? Use custom filters to only display those articles (e.g., only display articles with available data/syntax)!
Article Page: Putting it all Together
Our flagship feature is the consolidation and curation of key information about published articles, all of which comes together on the article page. The page will feature automatically updating meta-analytic effect-size forest plots, in-browser R analyses to verify the reproducibility of results, editable fields to add, modify, or update study information, and element-specific in-line commenting.
User Profile Dashboard Page
The user dashboard will display a user's recent contributions, a list of their own articles, their reading and analysis history, recent activity by other users, and notification settings.