In a new study, researchers analyzed existing task-based fMRI (functional brain scan) studies to determine their test-retest reliability. They then conducted new analyses in two large, recently published datasets. They found that conclusions based on task-fMRI are highly unreliable.
“These findings demonstrate that common task-fMRI measures are not currently suitable for brain biomarker discovery or for individual-differences research,” the authors write.
In order for a test to be useful, it must be reliable, meaning it gives consistent results each time it is administered. The researchers explain:
“If a measure is going to be used by clinicians to predict the likelihood that a patient will develop an illness in the future, then the patient cannot score randomly high on the measure at one assessment and low on the measure at the next assessment.”
The same is true for clinical research. If a test randomly shows a brain difference in the group of interest in one study, but then shows no brain difference (or a completely different brain difference) in another study, it is useless for doing research on those differences.
Reliability is measured using the ICC (intraclass correlation coefficient). A score of at least .6 is conventionally considered “good” reliability, while a score of at least .75 is considered “excellent.” The meta-analysis conducted by the researchers, however, found that task-fMRI measures averaged a score of .397, or “poor.”
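To make the concept concrete, here is a minimal sketch of how a test-retest ICC can be computed from scratch. This uses the common two-way, consistency, single-measure formulation, ICC(3,1); the function name, the simulated data, and the exact ICC variant are illustrative assumptions, not the authors’ code.

```python
import numpy as np

def icc_3_1(ratings):
    """ICC(3,1): two-way mixed, consistency, single measure.
    ratings: (n_subjects, k_sessions) array of test-retest scores."""
    n, k = ratings.shape
    grand = ratings.mean()
    subj_means = ratings.mean(axis=1)
    sess_means = ratings.mean(axis=0)
    # Decompose total variation into subject, session, and error sums of squares.
    ss_total = ((ratings - grand) ** 2).sum()
    ss_subj = k * ((subj_means - grand) ** 2).sum()
    ss_sess = n * ((sess_means - grand) ** 2).sum()
    ss_err = ss_total - ss_subj - ss_sess
    ms_subj = ss_subj / (n - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_subj - ms_err) / (ms_subj + (k - 1) * ms_err)

# Illustrative example: 50 subjects scanned twice; each subject's score is
# stable across sessions apart from small measurement noise.
rng = np.random.default_rng(0)
trait = rng.normal(size=(50, 1))                  # each subject's "true" score
scans = trait + rng.normal(0, 0.1, size=(50, 2))  # two noisy measurements
print(icc_3_1(scans))  # close to 1: the measure is reliable
```

If the two sessions instead shared no subject-level signal (pure noise), the same function would return an ICC near zero, which is the situation the meta-analysis suggests many task-fMRI measures approach.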
The current study was led by Maxwell Elliott and Annchen Knodt at Duke University and published in the journal Psychological Science. Elliott and Knodt analyzed 56 studies (90 total sub-studies) that measured the reliability of MRI data, with 1,088 participants total. This meta-analysis provided the average ICC of .397.
Elliott and Knodt then conducted their own new analyses, using MRI data from both the Human Connectome Project (HCP) and the Dunedin Study. Structural MRI measures in these datasets were highly reliable (ICCs for cortical thickness, surface area, and subcortical volume were all in the .90s). However, the task-fMRI measures of actual brain activity in specific regions of interest averaged an ICC of just .251, even lower than their meta-analytic average.
Selective Reporting of Reliability Results
Concerningly, Elliott and Knodt found that ICCs in the studies included in their meta-analysis were selectively reported—some of the previous researchers had reported only the significant findings after conducting multiple ICC tests.
“During the study selection process, we discovered that some researchers calculated many different ICCs (across multiple ROIs, contrasts, and tasks) but reported only a subset of the estimated ICCs that were either statistically significant or reached a minimum ICC threshold.”
This is an unethical research practice that leads to an inflated sense of reliability. It suggests that when the researchers tested reliability and found that it was very poor, they would then do a different reliability analysis. If that came out poor as well, they would then do a third. And a fourth. Eventually, they’d find a positive result—supposed high reliability. They would then report this in their publication as if it were the only analysis they had done.
It was only by digging through the unreported data that Elliott and Knodt were able to determine that this manipulative practice had occurred and to account for it in their analysis.
Elliott and Knodt first analyzed the average ICC for the 77 sub-studies that reported all of their ICC data. The result was an ICC of .397, considered “poor.”
Then they analyzed the average ICC for the 13 sub-studies that did not report all ICC data. The result there was an inflated ICC of .705, considered “good.”
That is, the unethical selective reporting of reliability measures inflated the average ICC from being considered “poor” to being considered “good.”
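The inflation described above is easy to demonstrate with a toy simulation. The numbers here (a true ICC distribution centered at .35, and 20 candidate ICCs per study) are illustrative assumptions chosen to echo the article’s figures, not data from the actual paper; the point is only the mechanism, that reporting the best of many estimates biases the apparent average upward.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate 1,000 studies, each estimating ICCs for 20 regions/contrasts/tasks.
# True reliability is poor: estimates scatter around 0.35.
estimates = np.clip(rng.normal(0.35, 0.15, size=(1000, 20)), -1, 1)

honest = estimates.mean()                 # report every ICC computed
selective = estimates.max(axis=1).mean()  # report only the best ICC per study

print(f"honest average:    {honest:.2f}")     # near the true 0.35 ("poor")
print(f"selective average: {selective:.2f}")  # well above 0.55 (looks "good")
```

Even though no individual estimate is fabricated, picking the maximum of 20 noisy ICCs per study pushes the reported average from “poor” into “good” territory, mirroring the gap between the fully reported (.397) and selectively reported (.705) sub-studies.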
What About Other Studies?
Elliott and Knodt’s conclusions are supported by other recent studies. A study in Nature found that when the exact same fMRI dataset was analyzed by 70 different teams, the results were wildly variable and often contradictory. And an editorial in JAMA Psychiatry argued that any conclusions drawn from MRI data were “problematic if not unsubstantiated.”
Another study found that nearly every published study in the field used a different method of analyzing the data, and most did not even report their specific method used. A study from 2017 found that when the MRI data was weighted differently, the supposed “normal” development of the brain looked completely different.
This is especially unfortunate because readers tend to rate studies accompanied by brain scan images as more trustworthy, even when the scans have nothing to do with the actual content of the study. This effect of trusting incorrect or irrelevant brain scan imagery was most pronounced among psychology students.
Elliott, M. L., Knodt, A. R., Ireland, D., Morris, M. L., Poulton, R., Ramrakha, S., … & Hariri, A. R. (2020). What is the test-retest reliability of common task-functional MRI measures? New empirical evidence and a meta-analysis. Psychological Science, 31(7), 792–806. DOI: 10.1177/0956797620916786 (Link)