In a new study published in Nature, 70 teams of neuroimaging researchers analyzed the same brain scan data to test the same nine hypotheses about the results. No two teams chose the same way to analyze the data, and their results varied wildly.
The neuroimaging data comprised results from functional magnetic resonance imaging, or fMRI, brain scans of 108 people performing a task. All but one of the 70 research teams had previously published research using fMRI data and could be considered experienced in analyzing neuroimaging results. Each of the nine hypotheses they were asked to test related to activation in a different brain region.
On average, 1 in 5 teams disagreed with the rest about whether each of the nine hypotheses was supported. The researchers write:
“For every hypothesis, at least four different analysis pipelines could be found that were used in practice by research groups in the field and resulted in a significant outcome.”
That is, no matter what the hypothesis, there are always multiple, scientifically accepted ways to come up with a positive finding—even if there is likely no true difference.
Even “teams with highly correlated underlying unthresholded statistical maps nonetheless reported different hypothesis outcomes” after finishing their analyses in different ways.
Perhaps even more concerning, researchers consistently expected to find significant results. This held both for the researchers who analyzed the data in the study and for outside researchers who did not.
The researchers found “a general overestimation by researchers of the likelihood of significant results across hypotheses—even by those researchers who had analyzed the data themselves—reflecting a marked optimism bias by researchers in the field.”
The researchers expect to find an effect and may design their study—making choices in their data analysis—in such a way as to ensure that they find one. This can also lead to publication bias, in which negative results are unexpected and considered wrong, while positive results are published without much thought.
fMRI data can be extraordinarily complex and contains a large amount of what is considered “noise”—random fluctuations that need to be removed for the data to make sense.
To do that, researchers build processing pipelines that attempt to separate extraneous data from vitally important data. It is a multi-stage process, and researchers must make choices at every stage. There is no consensus on the "proper" way to analyze a neuroimaging dataset. A 2012 study found 6,912 different ways of analyzing a dataset, with five further ways of correcting those results, yielding 34,560 possible final results, all of which are considered "correct."
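The combinatorial explosion behind such numbers is simple arithmetic: if each stage of a pipeline offers a handful of accepted options, the counts multiply. A minimal sketch, in which the stage names and option counts are hypothetical illustrations (not the 2012 study's actual taxonomy), though they happen to multiply to the same totals:

```python
from math import prod

# Hypothetical fMRI pipeline stages, each with several accepted options.
# Names and counts are illustrative, not taken from the 2012 study.
stage_options = {
    "motion_correction": 2,
    "slice_timing": 2,
    "normalization_template": 2,
    "temporal_filtering": 2,
    "spatial_smoothing_kernel": 4,
    "hemodynamic_model": 4,
    "head_motion_regressors": 3,
    "software_package": 3,
    "autocorrelation_model": 3,
}

pipelines = prod(stage_options.values())
print(pipelines)  # 2*2*2*2 * 4*4 * 3*3*3 = 6912 distinct analysis pipelines

# Five further ways of correcting the results multiplies the total again:
print(pipelines * 5)  # 34560 possible final results
```

The point is not the particular stages, but that a modest number of defensible choices per stage yields thousands of defensible end-to-end analyses.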
One previous study found that almost every single study that used neuroimaging analyzed the data differently. That study also found that most publications did not even report the specific choices they made when analyzing the data. The researchers, in that case, concluded that because the rate of false-positive results is thought to increase with the flexibility of experimental designs, “the field of functional neuroimaging may be particularly vulnerable to false positives.”
“False positives” occur when the researchers find an effect or a difference, but it is due to random elements of the data (it is not a real effect or difference).
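A toy simulation shows why analytic flexibility inflates false positives. The assumptions here are deliberately simple: pure-noise data (so every "finding" is false), four available pipelines per hypothesis (echoing the "at least four" quoted above), pipelines treated as independent tests, and a researcher who reports a result if any pipeline reaches p < 0.05:

```python
import random

random.seed(42)

ALPHA = 0.05          # nominal false-positive rate of a single test
N_PIPELINES = 4       # available analysis pipelines per hypothesis
N_EXPERIMENTS = 100_000

# Under the null hypothesis (no real effect), each independent pipeline's
# p-value is uniform on [0, 1], so it crosses ALPHA with probability ALPHA.
false_positives = 0
for _ in range(N_EXPERIMENTS):
    p_values = [random.random() for _ in range(N_PIPELINES)]
    if min(p_values) < ALPHA:   # report a "finding" if ANY pipeline works
        false_positives += 1

rate = false_positives / N_EXPERIMENTS
print(rate)                               # roughly 0.185, not the nominal 0.05
print(1 - (1 - ALPHA) ** N_PIPELINES)     # analytic value: ~0.1855
```

In practice, different pipelines applied to the same data are correlated rather than independent, so the inflation is smaller than this worst case, but the direction is the same: the more defensible analyses exist, the higher the chance that at least one of them yields a spurious positive.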
A previous study also found that when fMRI data were analyzed differently, the supposed "normal" brain development of children looked completely different. The researchers in that study weighted parts of the sample differently (for instance, giving more weight to participants living in poverty) and found a different trajectory of brain development. This demonstrates that averaging very different data, as most neuroimaging studies do, can lead to incorrect conclusions about what is "normal."
Unfortunately, despite the high degree of uncertainty, incorrect conclusions (false positives), and the lack of transparent methodological reporting, people are more inclined to trust studies that present neuroimages—even when those images are unrelated to the actual research. Another study demonstrated that students of psychology were especially susceptible to these misrepresentations.
Perhaps the most prominent real-world example of this problem is a controversial study from 2017 that claimed to find brain differences between "normal" children and children with a diagnosis of "ADHD." That study resulted in calls for a retraction, and Lancet Psychiatry devoted an entire issue to criticisms of it. High-profile critics, like Allen Frances (chair of the DSM-IV task force) and Keith Conners (one of the first and most famous researchers on ADHD), re-analyzed the data and found no actual brain differences. Other researchers did find differences, but they were explained by intelligence, not by the presence of the ADHD diagnosis.
Botvinik-Nezer, R., Holzmeister, F., Camerer, C.F., et al. (2020). Variability in the analysis of a single neuroimaging dataset by many teams. Nature, 582, 84–88. https://doi.org/10.1038/s41586-020-2314-9