In a new study in Nature, researchers found that the most common brain imaging studies in psychiatry—those that use a small sample to compare brain structure or function with psychological measures—are likely to be false.
According to the researchers, these studies tend to find false positive results: findings due to chance statistical correlation rather than an actual effect. And such strikingly positive results, even when false, are the ones most likely to be published.
Then, when future researchers attempt to replicate the findings by conducting another study of the same correlation, they often find a negative result instead. This has been termed the “replication crisis” in psychological research.
The researchers refer to these types of studies as BWAS or brain-wide association studies.
“BWAS associations were smaller than previously thought, resulting in statistically underpowered studies, inflated effect sizes, and replication failures at typical sample sizes,” the researchers write.
The research was led by neuroscientist Scott Marek at Washington University in St. Louis. The study was also reported on by The New York Times.
Marek and his colleagues studied brain scan correlations from around 50,000 participants using three enormous datasets. They found that the correlations between brain volume and function and psychological states were much smaller than individual brain imaging studies have suggested.
In statistics, correlations like these are measured on a scale from -1 to 1. A correlation of 0 means there is no linear relationship between the variables, while a correlation of 1 (or -1) is a perfect match. (However, even random data will usually show a small, nonzero correlation by chance.)
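To make the scale concrete, here is a minimal sketch in Python with NumPy (my own illustration, not code from the study) contrasting a perfect correlation with the small chance correlation that purely random data produces:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two perfectly linearly related variables: the correlation is exactly 1
x = np.arange(100, dtype=float)
y = 2 * x + 5
r_perfect = np.corrcoef(x, y)[0, 1]

# Two completely unrelated random variables: the correlation is
# small, but almost never exactly 0 in a finite sample
a = rng.normal(size=100)
b = rng.normal(size=100)
r_random = np.corrcoef(a, b)[0, 1]

print(round(r_perfect, 2))  # 1.0
print(r_random)             # small but nonzero, purely by chance
```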
In their study, the average correlation between brain measures and psychological measures was 0.01—about as close to 0 as a test like this will ever reach. The largest correlation that they were able to replicate reached 0.16—still a far cry from a clinically relevant correlation.
A good correlation, one approaching 1, appears in a scatter plot as points falling close to a straight line. [Figure omitted.]

By contrast, here is an example of one of the correlations from the study, between cognitive ability and resting-state functional connectivity. [Figure omitted.]
The fact that these correlations are so small indicates that almost everyone overlaps on these measures. For example, the brain connectivity of almost any person diagnosed with “depression” will be indistinguishable from that of someone without the diagnosis. Likewise, the brain volume of almost any person diagnosed with “ADHD” will fall within the range seen in people without ADHD.
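To see what this degree of overlap means, consider a hypothetical sketch (my own illustration, not the study’s code). With equal group sizes, a correlation of 0.16, the largest the study could replicate, corresponds to a standardized mean difference (Cohen’s d) between groups of roughly 0.32:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical brain measure for two large groups separated by
# d = 0.32, the group difference implied by a correlation of 0.16
# (assuming equal group sizes)
d = 0.32
diagnosed = rng.normal(loc=d, scale=1.0, size=100_000)
controls = rng.normal(loc=0.0, scale=1.0, size=100_000)

# Fraction of each group falling below the same cutoff: the two
# distributions overlap almost completely
cutoff = 1.0
print(np.mean(diagnosed < cutoff))  # ~0.75
print(np.mean(controls < cutoff))   # ~0.84
```

In other words, even at the study’s best replicable correlation, knowing a person’s brain measure tells you almost nothing about which group they belong to.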
Yet, in the smaller studies that are far more common in psychological research, correlations are almost always greater than 0.2 and sometimes much larger.
So why the discrepancy? According to Marek and his colleagues, these smaller studies are inflating these correlations due to chance variability—and then only the most inflated actually end up being published.
The most common sample size for these studies is 25 people. At that size, two otherwise identical studies could easily reach opposite conclusions about the correlation between brain findings and mental health.
“High sampling variability in smaller samples frequently generates strong associations by chance,” the researchers write.
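That claim is easy to check by simulation. The following sketch (an illustration with assumed numbers, not the study’s analysis) draws many hypothetical 25-person studies from a population where the true correlation is only 0.01 and records the correlation each one observes:

```python
import numpy as np

rng = np.random.default_rng(42)

true_r = 0.01   # roughly the average correlation reported in the study
n = 25          # the most common sample size in this literature
n_studies = 1000

observed = []
for _ in range(n_studies):
    # Draw n participants whose two measures share the true correlation
    x = rng.normal(size=n)
    y = true_r * x + np.sqrt(1 - true_r**2) * rng.normal(size=n)
    observed.append(np.corrcoef(x, y)[0, 1])

observed = np.array(observed)
# Even though the true correlation is nearly 0, a large share of
# these small "studies" observe |r| > 0.2, in both directions
print(np.mean(np.abs(observed) > 0.2))
print(observed.min().round(2), observed.max().round(2))
```

Roughly a third of these simulated studies report a correlation above 0.2 in magnitude, and the extremes point in opposite directions, even though every one of them sampled the same near-zero effect.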
The established method for dealing with this is to apply a stricter threshold for statistical significance (a multiple-comparison correction). However, according to the researchers, this can actually backfire in small MRI studies: it inadvertently ensures that only the largest, and thus least likely to be true, brain differences pass the significance test and end up being published.
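The backfire effect can also be sketched in simulation (again, an illustration with assumed numbers, not the study’s analysis). At n = 25, a two-tailed p < 0.001 corresponds to an observed correlation of roughly |r| > 0.62, so any result that survives such a threshold must look far larger than a plausible true effect:

```python
import numpy as np

rng = np.random.default_rng(7)

true_r = 0.1    # a hypothetical small true effect
n = 25
n_studies = 20_000
r_crit = 0.62   # approximate |r| needed for two-tailed p < 0.001 at n = 25

passed = []
for _ in range(n_studies):
    x = rng.normal(size=n)
    y = true_r * x + np.sqrt(1 - true_r**2) * rng.normal(size=n)
    r = np.corrcoef(x, y)[0, 1]
    if abs(r) > r_crit:    # only "significant" results survive the filter
        passed.append(r)

passed = np.array(passed)
# Every correlation that clears the strict threshold is inflated to at
# least six times the assumed true effect of 0.1
print(len(passed), round(np.abs(passed).mean(), 2))
```

The filter itself guarantees the inflation: no honest estimate near the true value of 0.1 can ever pass it at this sample size.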
These chance findings and inflated results are ubiquitous in these studies. And even larger samples did not solve the problem. Only massive studies in the tens of thousands began to find more reliable (and tiny) correlations.
“Statistical errors were pervasive across BWAS sample sizes. Even for samples as large as 1,000, false-negative rates were very high (75–100%), and half of the statistically significant associations were inflated by at least 100%,” Marek and his colleagues wrote.
This is far from the first time researchers have noted that brain imaging is unreliable. MRI data is massively complex and notoriously “noisy”—full of random fluctuations that the researchers have to account for in order to find meaningful results. Computer algorithms are used to guess which data is “noise” and which data is important.
In a 2020 study in Nature, 70 teams of researchers analyzed the same brain imaging data. Each team picked a different method to analyze it, and they came to wildly different conclusions, disagreeing on each outcome measure.
A study from 2012 found thousands of ways of analyzing the same MRI results and multiple ways to try to “correct” those analyses. In the end, there were 34,560 possible final results and no way to choose which of these was “correct.”
In a 2020 commentary in JAMA Psychiatry, researchers argued that any conclusions based on MRI scans needed to be considered inconclusive and preliminary. Other researchers suggested that brain imaging was too unreliable to be a useful tool in psychological research.
Marek, S., Tervo-Clemmens, B., Calabro, F. J., Montez, D. F., Kay, B. P., Hatoum, A. S., . . . & Dosenbach, N. U. F. (2022). Reproducible brain-wide association studies require thousands of individuals. Nature. doi:10.1038/s41586-022-04492-9 (Link)