Best practice science in the age of irreplicability

Replication – the ability to find the same effect repeatedly in independent experiments – is a cornerstone of experimental science. However, the published literature is likely full of irreplicable results [1, 2]. There are many reasons for this problem, but the root cause is arguably that the incentive structure of science has selected for flashiness and surprise rather than for truth and rigour. Authors who publish in high-impact journals tend to be rewarded with jobs, grants, and career success, whether or not the result turns out to actually be replicable [3].

These incentives can facilitate poor experimental and statistical practices that make faulty conclusions more likely. Jens Klinzing wrote a nice overview of these issues for the last edition of Neuromag [4]. In this article, I’m going to take his lead and discuss in more detail a few practices that you can incorporate into your work, now and into the future, to help ensure the quality of scientific output.

CC-BY (2012) “Floating azalea maze, Central Garden” by Christopher Lance

The garden of forking paths

I believe that the vast majority of scientists are honestly trying to do the best and most accurate science they can. One of the most startling realisations I have had over the past few years, however, is how easy it is for even well-intentioned researchers to unconsciously mislead themselves (and thus also the larger scientific community) [5].

In practice, it’s almost always the case that numerous decisions about how to test the research question of interest are made after or during the process of data collection. By allowing our analyses to depend on the particular data we observe in an experiment we invite the possibility that we are shining a spotlight on noise: random fluctuations that are a property of this dataset and not a replicable property of the world at large.

In an article you should definitely read, Gelman and Loken characterise this as walking through a “garden of forking paths” [6]. The point of their article, which should give us all pause, is that even if you do not sit and try a bunch of different analyses until you find the one that “worked” (i.e. gave the result you wanted), you might still be on thin inferential ice. Given a different dataset (but the same experiment) you might have done a different analysis and possibly drawn a different conclusion.

Distinguishing exploratory and confirmatory experiments

When your analyses depend on your data, you are conducting exploratory research. A confirmatory test, in contrast, is one in which everything is pre-specified before data collection. Among other things, this distinction is crucial for the interpretation of p-values. The fabled “0.05” cutoff should, in theory, ensure that when the null hypothesis is actually true, no more than 5% of tests are declared “significant” (i.e. are false positives). However, p-values only correspond to their nominal false-positive rates for confirmatory research – when your hypotheses, design and analysis plan are defined before data collection. For exploratory analyses the true false-positive rates can be far higher (see Jens’ article in the last Neuromag [4]).
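To make this inflation concrete, here is a minimal simulation sketch. It is a deliberately crude stand-in for the forking-paths problem, not a model of any real study: every function and parameter name is hypothetical, the null hypothesis is true in every simulated experiment, each experiment is assumed to record five independent outcome measures, and a simple z-test with a known standard deviation stands in for whatever test you would actually use. The “confirmatory” analyst always tests the first (pre-specified) outcome; the “exploratory” analyst reports whichever outcome happens to give the smallest p-value.

```python
import math
import random
import statistics


def z_test_p(a, b, sd=1.0):
    """Two-sided p-value for a difference in means of two equal-sized
    groups, assuming a known common standard deviation (illustrative)."""
    n = len(a)
    se = sd * math.sqrt(2.0 / n)  # standard error of the mean difference
    z = (statistics.fmean(a) - statistics.fmean(b)) / se
    return math.erfc(abs(z) / math.sqrt(2.0))


def simulate(n_experiments=4000, n_per_group=20, n_outcomes=5,
             alpha=0.05, seed=1):
    """Return (confirmatory, exploratory) false-positive rates when
    there is no true effect on any outcome."""
    rng = random.Random(seed)
    confirmatory_hits = 0
    exploratory_hits = 0
    for _ in range(n_experiments):
        p_values = []
        for _ in range(n_outcomes):
            # Both groups come from the same distribution: the null is true.
            a = [rng.gauss(0, 1) for _ in range(n_per_group)]
            b = [rng.gauss(0, 1) for _ in range(n_per_group)]
            p_values.append(z_test_p(a, b))
        confirmatory_hits += p_values[0] < alpha   # outcome fixed in advance
        exploratory_hits += min(p_values) < alpha  # outcome chosen after the fact
    return (confirmatory_hits / n_experiments,
            exploratory_hits / n_experiments)


if __name__ == "__main__":
    confirmatory, exploratory = simulate()
    print(f"pre-specified outcome:  {confirmatory:.3f}")
    print(f"best-looking outcome:   {exploratory:.3f}")
```

Under these assumptions the pre-specified analysis should land close to the nominal 0.05, while the data-dependent one should approach 1 − 0.95^5 ≈ 0.23. Real forking paths are usually subtler than explicitly picking the best of five tests, but the mechanism is the same.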

Currently, exploratory research is almost always presented as if it is confirmatory. That is a big problem, and one of the major causes of the replication crisis and the growing public mistrust of science. How can we ensure that we clearly distinguish exploratory and confirmatory analyses, both for ourselves and others?

Report exploratory research honestly

If you know you are doing exploratory research, then report it as such! There should be big red flags in your discussion saying things like “these findings are exploratory; a confirmatory study is needed before firm conclusions can be drawn”. There is nothing wrong with exploratory research – it must continue to be published – but we need more confirmatory experiments in the literature (particularly for surprising or weak effects). And in my opinion, the media should be barred from reporting on anything in the experimental / biological sciences until it has been independently replicated.

But as I discussed above, it can be all too easy to fool yourself (“oh, of course I was going to normalise that way all along”). How can you ensure you are doing what you said you would do before you saw the data?

Pre-registration

Pre-registering an experiment means writing down your experimental hypotheses, data collection plan, experimental design, outcome measures, data preprocessing, exclusion criteria, analyses, and the final test of your hypothesis all before having begun data collection. You consider everything you can think of (greatly helped by running a pilot experiment) as explicitly as you can. Having analysis scripts prepared ahead of time would be the gold standard. Once you have collected the data you perform the analysis and report exactly that result.
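As a rough illustration of what “analysis scripts prepared ahead of time” might look like, the sketch below fixes exclusion criteria in code before any real data exist. Everything here is hypothetical: the reaction-time measure, the thresholds, and the names AnalysisPlan and participant_means are invented for the example and would be replaced by whatever your own pre-registration specifies.

```python
from dataclasses import dataclass
from statistics import fmean


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class AnalysisPlan:
    """Hypothetical pre-specified preprocessing and exclusion criteria."""
    min_rt_ms: float = 200.0    # discard implausibly fast responses
    max_rt_ms: float = 2000.0   # discard implausibly slow responses
    min_valid_trials: int = 3   # exclude participants with too few valid trials


PLAN = AnalysisPlan()


def participant_means(trials, plan=PLAN):
    """Apply the pre-specified exclusions and return mean RT per participant.

    trials: dict mapping participant id -> list of reaction times in ms.
    """
    means = {}
    for pid, rts in trials.items():
        valid = [rt for rt in rts if plan.min_rt_ms <= rt <= plan.max_rt_ms]
        if len(valid) >= plan.min_valid_trials:
            means[pid] = fmean(valid)
    return means


if __name__ == "__main__":
    # Made-up pilot data, used only to check that the script runs.
    pilot = {"p01": [350, 420, 150, 510], "p02": [2500, 300]}
    print(participant_means(pilot))
```

Writing the plan down as a frozen object, and testing the script on pilot data, means that any later change to a threshold has to be made deliberately and can be reported as a deviation.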

As someone who has been doing experiments for years without pre-registration, I can tell you that it is hard. I am used to having some theoretical hypothesis and concrete experimental idea, then rushing in and testing it, then sorting out the details later (in the same dataset). Pre-registration means trying to specify lots of things that I would have sorted out on the way. Nevertheless, it is a rewarding way to guard yourself against the garden of forking paths.

What happens when (not if) you realise you really should do that normalisation or change the analysis, after you have collected the data? That is fine: just report it as such (along with the results of the original analysis). Pre-registration is a mechanism to help you be honest about what is confirmatory and what is exploratory. Readers of your paper can decide how much the deviations from what you had said you would do matter.

So how do you actually pre-register an experiment? This can be as simple as a text document on your computer. However, an even better idea is to pre-register the experiment online in a time-stamped repository (e.g. GitHub or the Open Science Framework). Then, when you write up the study, you can point readers to a link with your pre-registration as evidence that you are really trying to delineate exploratory and confirmatory analyses. For this purpose I really like aspredicted.org. This site has a 9-question template for pre-registration, which encourages a short and structured document. At a later time, you can choose to make your pre-registration PDF publicly available at a static URL, which could be included in the methods of your paper.
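If you start with a plain text document on your own computer, one small extra step is to record a cryptographic fingerprint of it. The sketch below uses only the Python standard library; the function name and the example text are hypothetical. Note the limits of this approach: a hash shows that a document has not been altered since the hash was made, but the locally recorded time is only as trustworthy as your own clock, so this complements rather than replaces depositing the document in a third-party time-stamped repository.

```python
import hashlib
from datetime import datetime, timezone


def fingerprint(text):
    """Return a SHA-256 digest of the pre-registration text together with
    the local UTC time at which the digest was computed."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return {
        "sha256": digest,
        "recorded_utc": datetime.now(timezone.utc).isoformat(),
    }


if __name__ == "__main__":
    # Hypothetical pre-registration text; in practice, read your document from disk.
    plan_text = "Hypothesis: ...\nExclusion criteria: ...\nAnalysis: ...\n"
    print(fingerprint(plan_text))
```

Any later edit to the document, even a single character, produces a different digest, so readers who are given the original digest can check that the plan you show them is the one you wrote.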

Registered reports

At the next level in scientific transparency are registered reports. A scientific publication in a registered report format is one in which (a) the authors submit a manuscript consisting of an introduction, methods, and an analysis plan; (b) reviewers critique the methods and planned analyses, suggesting changes; (c) if the authors and reviewers agree on a protocol, then the paper is accepted in principle at the journal; (d) the authors collect and analyse the data; (e) the journal publishes the results. There are now over 20 journals that offer this format, including Cortex; Attention, Perception, & Psychophysics; and Nature Human Behaviour [7].

In my opinion, this format is excellent for guarding against researcher degrees of freedom and is particularly suited for either direct replication studies or for studies whose outcome you expect will be contentious. Registered reports could allow you to sidestep fighting with reviewers who ultimately simply do not like your conclusions. After all, if they agreed on your experimental protocol, then the data could presumably have gone their way. The data could have confirmed their theory.

However, the time-scale required for registered reports makes them unsuitable for use in something like a Master’s project or lab rotation. By the time you have received comments from reviewers and adjusted your protocols, your time in the lab is likely up. Self-pre-registration, on the other hand, will also improve your ability to discriminate exploratory and confirmatory research, and will typically take you only a few hours.

Open data and materials

As scientists, we should want to make it as easy as possible for others to independently check our results and attempt replications. To facilitate these outcomes and to promote transparency in the research process more generally, it is becoming increasingly standard to make raw data and study materials (such as stimuli or analysis code) publicly available at the time of publication. This not only helps others to verify your work, but can also help you. Proper archiving of study material is a requirement of most research funding, and this way your materials and data are archived as part of the publication process.

There are a number of ways to share data and materials online. My favourite resource for this is zenodo.org. It is a freely available EC-funded research sharing framework capable of hosting tens of gigabytes of data (terabytes available upon request), and it is hosted on the same infrastructure as the LHC data from CERN. Every upload gets a DOI (digital object identifier), meaning you can cite your materials in the text of your paper with a reference that is never lost (no more self- or publisher-hosted content going missing after a few years).

What if you expect to get several papers from the same dataset? Justifiably you want the first crack at the data you have painstakingly collected. This is no problem. You can arrange a data embargo period (say, one year). Your data is uploaded along with the first publication arising from it, and you know you have one year from that point in which only you can analyse it further. The advantage of this system is that we continue to reward hard experimental work while reducing the incidence of valuable data languishing on someone’s hard drive for years (“I’ll get to that other analysis sometime”).

Encouraging others to adopt these practices

Changing the way you do science is hard. There is a lot of inertia, which is greater for those who have been doing it longer. Students are at an advantage in this respect, but also have to abide by their supervisors’ wishes. Even if you agree with my suggestions above, it is probably not possible for you to adopt them all right now. That is fine. Start by telling others about them and encouraging others to do the same thing. Slowly, best practice is changing.

What can you do right now? Consider signing the Peer Reviewers’ Openness Initiative, or PRO [8]. As a signatory to PRO, you ask authors to make their data and materials available as a condition of peer review.

Imagine you accept a review request. You check whether the authors make their materials available and if not then you write back to the editor asking the authors to either (a) make their data and materials available or (b) write in the manuscript why they do not. If the authors refuse to do either (a) or (b), your recommendation is to reject the manuscript because it does not meet the minimum standards for a scientific paper. This might seem harsh, but consider that you are not judging the justification for (b) – it could be something more (“our raw data contain identifiable patient information, but we make anonymised summary data available”) or less (“we’re too lazy to clean up our code and upload our data”) legitimate. You don’t care, so long as it appears in the main text of the manuscript.

As a reviewer, you can give open science and transparent practices a push along.

Conclusion

Science is in a replication crisis, but thankfully people are becoming increasingly aware of the issues and implementing ways to improve them. Consider trying to adopt the following practices in your scientific work:

Clearly discriminate exploratory and confirmatory analyses
Pre-register your next experiment
Make data and materials openly available
Encourage others to do the same

While these measures will not fix the underlying problem (incentives for scientific career advancement), they will help to improve the quality of scientific output.

Author’s note

Parts of this article were adapted from the author’s earlier blog post, found here. I use the term “irreplicability” rather than “irreproducibility” because “reproducible” research is when you get the same result from the same data and analysis (i.e. there’s no silly error in the analysis script or reporting) whereas a “replicable” finding is one that can be found repeatedly in independent (but as-far-as-possible-identical) experiments [9, 10].

Tom Wallis is a project leader in the CRC 1233 “Robust Vision”; Centre for Integrative Neuroscience, Tübingen, Germany. He blogs infrequently at www.tomwallis.info.

Citations

[1] en.wikipedia.org/wiki/Replication_crisis

[2] nymag.com/scienceofus/2016/09/a-helpful-rundown-of-psychologys-replication-crisis.html

[3] I’m not saying that high-impact papers are generally less likely to be true than papers in other outlets, but there is some worrying evidence that high-impact papers tend to feature lower statistical power and larger bias in effect size estimates in some fields: journal.frontiersin.org/article/10.3389/fnhum.2013.00291/full

[4] neuromag.wordpress.com/2016/01/12/why-science-is-broken/

[5] nature.com/news/how-scientists-fool-themselves-and-how-they-can-stop-1.18517

[6] stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf

[7] cos.io/our-services/registered-reports/

[8] opennessinitiative.org/

[9] replicability.tau.ac.il/index.php/replicability-in-science/replicability-vs-reproducibility.html

[10] languagelog.ldc.upenn.edu/nll/?p=21956