I love science. I really love science (and particularly neuroscience), humanity’s great cultural endeavor to understand the world and ourselves. Unfortunately, science is broken. It is broken in many ways, not only in the way it distributes funding, its bizarre publishing scheme, or the way scientists are employed. It is also fundamentally broken in the way we obtain and evaluate our findings.
Most published scientific literature is false
News traveled quickly when the results of a large replication initiative were published last year in the journal Science. The Open Science Collaboration, a team of 270 authors, had tried to replicate 100 studies from three important psychology journals. The replication studies were designed to be as similar to the original studies as possible, often in close cooperation with the original authors. The results were alarming. In only 39 of the 100 replications could the original finding be successfully reproduced*. Considering the average statistical power of the replications, one would have expected around 89 successful reproductions if all the original effects were real. And it gets worse. The replication outcomes did not even correlate well with the statistical significance of the original studies. The most significant effects, those with very low p-values, were not much more likely to be reproduced than moderately significant effects.
Outcomes of the replication initiative hardly resembled the original studies. Almost all original studies had been reported as ‘significant’ (97 out of 100) with p-values < .05, the traditional threshold for significance. P-values in the replications, however, were spread all across the range between 0 and 1 and were mostly ‘not significant’ (p > .05). Even studies with very small p-values < .01 did not have a good chance of being replicated. Adapted from Open Science Collaboration (2015).
In my opinion, all this teaches us two things:
- P-values are a poor indicator of how likely an effect is to be reproduced. Next time you read a publication and there are three impressive asterisks (***) over a bar graph, remember that this does not mean the effect is much more reliable than if there were only one asterisk (*). Why this is the case can be seen in this video (dance of the p-values) in quite an entertaining way.
- We have way too many false positives in science. Low reproducibility is certainly not a problem exclusive to psychology. There are similar findings for neuroscience, genetics, epidemiology, etc. It is a general problem of science, particularly when samples (and thus statistical power) are small, effects are small, and variance is high. We have good reason to believe that more than half of all peer-reviewed scientific literature is false. More than half of all the papers you read. More than half of all posters you visit, talks you listen to, and more than half of all studies you are going to publish yourself.
But where do all the false positives come from?
Problem 1: Underpowered studies
Collecting data is expensive, not only in terms of money but also in terms of time, effort, technician hours, materials, lab space and so on. However, operating with sample sizes too small for the effect in question is dangerous. Most people think that low power only affects your chance of missing a true effect (i.e. it raises beta, your type II error rate, or more precisely: the probability of incorrectly retaining the null hypothesis when it is false). It certainly does that. Power (1 – beta) gives you the probability of recognizing an effect when it is indeed there (i.e. the probability of correctly rejecting the null hypothesis). Low power does not affect your type I error rate, which is given by your alpha value, set by you in advance. So you could argue that if you plan an underpowered study, it is at your own risk. You will be the one missing an amazing effect because you chose to test 10 mice instead of 20.
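To make the mice example concrete, here is a minimal sketch of how power depends on sample size. It assumes a two-sided, two-sample z-test with a known standard deviation of 1 (a simplification; a real analysis would more likely use a t-test). The standardized effect size of 1.0 is a made-up value for illustration, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def z_test_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test.

    Assumes equal group sizes and a known standard deviation of 1, so
    effect_size is the true difference in means in units of that sd.
    """
    critical_z = STD_NORMAL.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic if the effect is real
    shift = effect_size * sqrt(n_per_group / 2)
    # Probability that the statistic lands beyond either critical value
    return STD_NORMAL.cdf(shift - critical_z) + STD_NORMAL.cdf(-shift - critical_z)


if __name__ == "__main__":
    for n in (10, 20):
        print(f"n = {n} per group: power = {z_test_power(1.0, n):.2f}")
    # With no true effect, the rejection probability is just alpha
    print(f"no effect: rejection rate = {z_test_power(0.0, 10):.2f}")
```

Under these assumptions, going from 10 to 20 animals per group raises power from roughly 0.6 to nearly 0.9, while the rejection rate without a true effect stays at alpha regardless of sample size.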
However, in addition to potentially wasting taxpayers’ money, which could otherwise be used to build schools and hospitals, you are also distorting the publication landscape. Why? There is more to consider than the type I error rate or the p-value of your study. The p-value tells you: Given there is no effect, how likely is the data you observed (i.e. if the null hypothesis is true, what are the chances that you will obtain an effect at least as large as the one actually observed). You will reject the null hypothesis only if your data was very unlikely. But as someone who reads through the scientific literature in a particular area, you want to know: If this study tells me there is an effect, how likely is it that there actually is an effect? This is called the positive predictive value (PPV) of a study**. Please note that the p-value tells you how likely your data is given there is no effect. The PPV tells you how likely it is that the effect is true, given the study (which is closer to what we actually want to know).
Let us further explore the difference between the p-value and the positive predictive value. Imagine you work on homeopathy and you operate with an alpha value of .05. For a given study, you as an experimenter will set the type I error rate to 5 % by rejecting the null hypothesis only if your observed p-value is below your alpha value of .05. However, if you look at the whole field of homeopathy research, what is the rate of false positive studies in that field? Since ideal homeopathic remedies are identical to placebo, every positive study in that field will be a false positive, i.e. the positive predictive value of a positive study is 0 (the homeopathy example is from here). That means if you read a positive study from that field showing a significant effect, this does not increase the chance of that effect being true (admittedly this is an extreme case, but you get the point). When it comes to your own studies, you want the positive predictive value to be high.
The positive predictive value of a study is higher
- the more likely your effect is to be real, i.e. the higher the ratio of true to false hypotheses investigated in your field (which is very hard to estimate),
- the lower your alpha value, and
- the higher your power (and that is crucial).
In this web app you can play with the parameters a bit and see how the PPV changes. Learn more about the PPV and the problem of underpowered studies here and here. The average PPV in neuroscience is probably somewhere around 50 % given well-intentioned estimates of average power and likelihood of an effect. That means, even if the researcher conducts a perfectly unbiased and honest study, the chance that a demonstrated effect is actually real is about the same as the chance of winning a coin toss.
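The dependence of the PPV on these three quantities can be sketched in a few lines. This is a minimal illustration, assuming an unbiased study and expressing the likelihood of an effect as a prior probability (the share of tested hypotheses that are true) rather than as a ratio. The function name and all input values are hypothetical and are not the estimates behind the 50 % figure.

```python
def positive_predictive_value(prior, power, alpha=0.05):
    """Probability that a significant result reflects a real effect.

    prior: assumed share of tested hypotheses that are actually true
    power: probability of detecting a true effect (1 - beta)
    alpha: significance threshold (type I error rate)
    """
    true_positives = power * prior
    false_positives = alpha * (1.0 - prior)
    if true_positives + false_positives == 0:
        return float("nan")  # no positive results are possible at all
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    # Illustrative inputs only, not estimates for any real field
    for prior, power in [(0.5, 0.8), (0.2, 0.8), (0.2, 0.2), (0.0, 0.8)]:
        ppv = positive_predictive_value(prior, power)
        print(f"prior = {prior}, power = {power}: PPV = {ppv:.2f}")
```

With a prior of 0, as in the homeopathy example, the PPV is 0 no matter how high the power is. With an assumed prior of 0.2 and power of 0.2, the PPV at alpha = .05 comes out at 0.5, the coin toss described above. The false discovery rate is simply 1 minus this value.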
Problem 2: Scientific misconduct
Another obvious explanation is plain fraud. Despite several recent prominent cases of fraud (like this one, or this, or this, for a list of notable cases look here) people seem to believe that scientific misconduct is very rare or a problem confined to countries outside of Europe and North America. However, I think fraud is more common than we like to think and the limited data we have on this topic seem to confirm this. I do not think this is surprising given the high incentives for ‘clean’ positive results, given how difficult it is to publish negative results, given how easy it is to make some ‘minor adjustments’ to the numbers in your Excel table, and given the extreme competition scientists are confronted with. But even these ‘minor adjustments’, impossible to detect and maybe just enough to lift the effect over the significance barrier, deceive your colleagues, distort our view of reality, waste taxpayers’ money, and may ultimately have real-world effects, such as patients receiving useless or even dangerous treatments (which for example happened here and here lately).
I am not suggesting that the non-reproduced studies in the above publication operated with fabricated data or that most false positive studies do. I am merely trying to describe some potential problems in science, problems that we need to take seriously and fix.
Problem 3: The researcher degrees of freedom
Every study comes with a large number of decisions, or “degrees of freedom”. You decide how much data to collect, how to identify outliers, which subgroups to introduce, which variables and interactions to analyze, which statistical tests to use, and how much of all this to report. The problem with our commonly used statistical framework is that it assumes all of these decisions are made before you look at the first data point. In the best case, your analysis should be entirely independent of your data, should be written down in advance, and should give you a definite result once you feed the data into it. Sure, this is not always realistic. Science can be messy and sometimes we can only develop the right analysis once we examine the actual data. But the further we diverge from the right path, the more we will bias the analysis, often unconsciously, often with good intentions, but often with detrimental effects on our scientific field. And this can happen faster than you may think. Simulations have shown that by exploiting only four degrees of freedom you will be able to get a positive finding from randomly drawn data up to 60 % of the time.
Likelihood of getting a false-positive result when exploiting researcher degrees of freedom. Simulations have shown that by combining four degrees of freedom you will be able to obtain a positive result from randomly drawn data up to 60 % of the time. Adapted from Simmons et al. (2011).
A particular and often encountered problem is the collection of more data if the effect is “not yet significant”. This can strongly inflate your false positive rate, particularly when you collect only a few more observations per condition or when you repeatedly collect more data if your result is ‘still not significant’. If you exploit your researcher degrees of freedom you may end up deceiving yourself and everyone else. This does not mean that a study cannot be exploratory or have exploratory components. But this should be stated clearly in the manuscript, since it weakens the conclusions that can be drawn from its findings.
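The effect of collecting more data whenever a result is “not yet significant” can be checked with a small simulation. The sketch below is not a reproduction of the simulations cited above. It illustrates only this one degree of freedom and assumes normally distributed data with no true effect, a z-test with a known standard deviation of 1, and arbitrary sample sizes. All function names and parameter values are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

STD_NORMAL = NormalDist()


def z_test_p_value(group_a, group_b):
    """Two-sided p-value for a difference in means, assuming sd = 1 is known."""
    standard_error = sqrt(1 / len(group_a) + 1 / len(group_b))
    z = (fmean(group_a) - fmean(group_b)) / standard_error
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


def false_positive_rate(n_start, n_add, max_looks, n_sims, alpha=0.05, seed=0):
    """Share of null experiments that reach p < alpha at any look.

    Each experiment starts with n_start observations per group, is tested,
    and gets n_add more per group before each further test, for at most
    max_looks tests. max_looks=1 is an honest fixed-sample design.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        # Both groups come from the same distribution: there is no effect
        group_a = [rng.gauss(0, 1) for _ in range(n_start)]
        group_b = [rng.gauss(0, 1) for _ in range(n_start)]
        for look in range(max_looks):
            if look > 0:
                group_a.extend(rng.gauss(0, 1) for _ in range(n_add))
                group_b.extend(rng.gauss(0, 1) for _ in range(n_add))
            if z_test_p_value(group_a, group_b) < alpha:
                hits += 1
                break  # stop collecting as soon as the result is "significant"
    return hits / n_sims


if __name__ == "__main__":
    print("fixed sample size:", false_positive_rate(20, 10, 1, 4000))
    print("up to five looks: ", false_positive_rate(20, 10, 5, 4000))
```

The fixed-sample rate should come out close to alpha, while the rate with repeated looks should be noticeably higher even though no effect exists in the simulated data.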
What can we do about it?
We can always argue that the error is in the system. If the journals more readily accepted “negative” findings, if our job situation were better and the pressure to publish lower, if replication studies were appreciated by higher-ranked journals and were not almost useless for our careers… Yes, all these things would substantially improve the situation. But they are measures that require long-term collaborative effort. What can we do in our own workgroup, right now, while planning, conducting, or publishing the next study? Needless to say, I do not have the final solution. But I have collected some suggestions from papers, comments, and discussions with colleagues and fellow students. Here are some of them:
- Introduce strict blinding during data collection & analysis. This is already quite common for data collection, at least in some scientific disciplines, but I guess it is still very rare during data analysis.
- Collect more data. This can be hard if data collection is very time- and money-consuming. But often it is worth adding another month of work to the project to be a little bit more confident about the data in the end.
- Have strict rules on when to stop data collection and make sure everyone in the project agrees on them beforehand.
- Keep lab notebooks that document and enforce study design, investigated variables, sample size, in-/exclusion criteria, and data analysis scheme.
- Register your studies before beginning data collection. This has very successfully decreased the rate of positive studies (and probably increased the positive predictive value of studies) for clinical trials in the US.
- Analyze the important dependent variables at the very end and decide on the inclusion or exclusion of data points beforehand.
- Try out preprint publication, maybe even publishing the raw data as well. This has been very successful in physics and is now slowly coming to the life sciences as well. One example is bioRxiv.
Which of these suggestions can help, which are naïve, which are missing? Please start a discussion with us in the comments section.
The situation in science is bad but not hopeless. One of the much acclaimed features of science is that it is self-correcting. However, in this case it will not literally correct itself; it will have to be corrected by the people conducting it. As the next generation of scientists, we will have to do this job. We shall see how we do.
Update: In March 2016, the journal Science published a comment on the large reproducibility study mentioned above, in which Gilbert et al. took a critical look at how the Open Science Collaboration (OSC) conducted their replications. Apparently the replications were, at least in some cases, far from being ‘exact replications’. Gilbert et al. also looked at other replication initiatives and used their results to estimate the statistical error in the OSC data. They ended up concluding that actually “the reproducibility of psychological science is quite high”. Let’s wait for the OSC’s reply.
Jens Klinzing is a doctoral student at the Graduate Training Centre for Neuroscience and the Institute for Medical Psychology and Behavioural Neurobiology in Tübingen, Germany.
* Reproduction success was measured not only by significant test statistics in the replication study (which would have resulted in only 35 successful reproductions) but by a total of 5 indicators, including combining the original and the replication data, effect size, and the subjective assessment “did it replicate?”.
** While people tend to call the type I error rate alpha the “false positive rate”, this can be misleading: the rate of false positives within a set of studies that used a particular alpha value and claim an effect is given by the false discovery rate = 1 – PPV, not by the alpha value used.