Stanford statistician and methodologist John P. A. Ioannidis, working with Don van Ravenzwaaij, examines how certain statistical methods—and the policies that rely on them—can skew the drug approval process in favor of ineffective drugs.
In the US, the approval process for new drug treatments is handled by the FDA (the EMA fulfills a similar role in Europe). The agency considers a number of factors in its decision, such as availability of other treatments, the severity of the disease under consideration, potential risks and harms of the drug treatment, and of course, the demonstrated efficacy of the drug in clinical trials.
The typical rule for demonstrating that a drug works is two positive trials. Notably, the number of failed trials does not matter, as long as at least two trials are positive. In some cases, the FDA has approved drugs based on a single positive trial if it considers other evidence to support it.
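To see why ignoring failed trials matters, here is a rough sketch of how often a drug with no real effect would still collect two positive trials by chance. It assumes independent trials and a 2.5% chance that any single trial of an ineffective drug comes out positive (a two-sided 0.05 threshold with the result in the drug's favor). Both assumptions, and the function name, are illustrative rather than taken from the study.

```python
def prob_at_least_two_positive(n_trials: int, alpha: float = 0.025) -> float:
    """Chance that an ineffective drug gets >= 2 'positive' trials out of n_trials.

    alpha is the assumed per-trial false-positive rate (illustrative value).
    Trials are assumed to be independent.
    """
    p_none = (1 - alpha) ** n_trials
    p_exactly_one = n_trials * alpha * (1 - alpha) ** (n_trials - 1)
    return 1 - p_none - p_exactly_one


# The more trials a sponsor is allowed to run, the easier it is to reach two.
for n in (2, 5, 10, 20):
    print(f"{n:2d} trials run -> P(two or more positive) = "
          f"{prob_at_least_two_positive(n):.4f}")
```

Under these assumptions the chance of a spurious approval grows with every additional trial, because the rule counts successes without counting failures.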
For instance, the antidepressant bupropion (Wellbutrin) was approved after three trials. Only one of those demonstrated efficacy. The antidepressant sertraline (Zoloft)—which, with over 37 million prescriptions, is the most commonly prescribed psychiatric drug in the US—was approved after five clinical trials. Only one of the five trials showed efficacy (one subscale of a second trial also did).
Even in cases where there were two positive trials, the data paints a picture of unclear efficacy. For instance, mirtazapine was approved after ten trials—only five of which showed evidence of efficacy. The other half of the clinical trials showed it to be no better than a placebo.
In fact, an influential study in the New England Journal of Medicine, a top-tier medical journal, found that only 51% of clinical trials for antidepressants demonstrated superiority over placebo.
Even in those trials, however, the evidence may be skewed by the reliance on p-values, which are used as dichotomous tests of statistical significance and do not consider clinical significance. For instance, even in positive trials, antidepressants typically demonstrate less than a three-point improvement over the placebo response, which is clinically insignificant: a three-point improvement isn’t noticeable by the patient or the clinician. But it might be statistically significant, and thus meet the criteria for the study to be considered evidence of the drug’s efficacy.
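The gap between statistical and clinical significance can be illustrated with a small calculation. The sketch below holds the drug-placebo difference fixed at two points and only changes the number of patients per arm. The two-point difference, the standard deviation of eight points, and the simple z-test are all assumptions chosen for illustration, not figures from any trial discussed here.

```python
from math import erfc, sqrt


def two_sided_p_value(mean_diff: float, sd: float, n_per_arm: int) -> float:
    """Two-sided z-test p-value for a difference between two arm means.

    Assumes equal arm sizes and a known, common standard deviation.
    """
    standard_error = sd * sqrt(2 / n_per_arm)
    z = mean_diff / standard_error
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for a standard normal.
    return erfc(abs(z) / sqrt(2))


# Same (assumed) 2-point difference every time; only the sample size changes.
for n in (20, 50, 100, 200, 400):
    p = two_sided_p_value(mean_diff=2.0, sd=8.0, n_per_arm=n)
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"n per arm = {n:3d}: p = {p:.4f} ({verdict})")
```

The size of the benefit never changes in this example, yet the verdict flips once the trial is large enough. That is the sense in which a p-value speaks to detectability, not to whether a patient would notice the difference.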
Ioannidis and van Ravenzwaaij suggest that other statistical methods may present a more objective accounting of the potential benefits of a new treatment. Two years ago, they demonstrated that Bayes factor statistics might provide a more accurate assessment of results:
“We recommend the use of Bayes factors as a routine tool to assess endorsement of new medications because Bayes factors consistently quantify the strength of evidence. Use of p-values may lead to paradoxical and spurious decision-making regarding the use of new medications.”
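For readers unfamiliar with Bayes factors, the following is a minimal sketch of one simple version: the evidence for "the drug has some effect" relative to "the drug has no effect," computed from a trial's z statistic. It uses a normal prior on the standardized effect with a standard deviation of 0.5, which is an assumption made here for convenience and not necessarily the formulation the authors used. The function name is likewise hypothetical.

```python
from math import exp, sqrt


def bayes_factor_10(z: float, n_per_arm: int, prior_sd: float = 0.5) -> float:
    """Bayes factor for H1 (effect exists) over H0 (no effect).

    H0: standardized effect = 0.
    H1: standardized effect ~ Normal(0, prior_sd**2)  (illustrative prior).
    z is the usual two-arm z statistic with equal arm sizes and known variance.
    """
    # Under H1 the z statistic is Normal(0, 1 + r); under H0 it is Normal(0, 1).
    r = (n_per_arm / 2) * prior_sd ** 2
    return exp(0.5 * z ** 2 * r / (1 + r)) / sqrt(1 + r)


# The same "just significant" result (z = 1.96, p of about 0.05) at different sizes.
for n in (50, 500, 5000):
    print(f"n per arm = {n:4d}: BF10 = {bayes_factor_10(1.96, n):.2f}")
```

Values above 1 favor an effect and values below 1 favor no effect. In this sketch, a result sitting right at the p = 0.05 line offers only weak support for an effect in a small trial and actually favors the null in a very large one, which is one example of the kind of paradox the authors describe.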
Their goal in the current study was to find the sweet spot for how much evidence should be required. If a criterion is too loose, it allows ineffective drugs to be approved. But if a criterion is too strict, it might prevent effective drugs from being approved. So the researchers wanted to find a balance between the two: keeping out most ineffective drugs, while still allowing effective drugs to be approved.
The authors conducted multiple statistical experiments, varying sample sizes and effect sizes, to try to determine where that sweet spot lies. They found that in almost all of the scenarios they tested, Bayesian statistics were better at endorsing treatments that truly worked and better at rejecting those that did not than standard p-value tests. Bayesian statistics were even better than meta-analytic techniques that synthesized multiple p-value-based studies. Only in situations with extremely large sample sizes did p-value tests achieve similar results.
According to the authors:
“The modest superiority of the Bayesian approach may be due to the fact that it considers all evidence in a cumulative manner, while the rule of having two statistically significant results adds a further dichotomization in counting ‘positive’ and ‘negative’ trials, with further loss of information.”
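The contrast between counting positive trials and weighing all the evidence together can be sketched with a toy simulation. This is not a reproduction of the authors' experiments. The effect size of 0.3, the 100 patients per arm, the five trials, the normal prior, and the Bayes factor cutoff of 20 are all assumptions picked for illustration.

```python
import random
from math import exp, sqrt


def bayes_factor_10(z: float, n_per_arm: int, prior_sd: float = 0.5) -> float:
    """BF for 'effect ~ Normal(0, prior_sd**2)' over 'effect = 0' (illustrative)."""
    r = (n_per_arm / 2) * prior_sd ** 2
    return exp(0.5 * z ** 2 * r / (1 + r)) / sqrt(1 + r)


def approval_rates(effect, n_per_arm, n_trials, n_sims=20000, seed=1):
    """Return (two-trial-rule approval rate, pooled-Bayes-factor approval rate)."""
    rng = random.Random(seed)
    shift = effect * sqrt(n_per_arm / 2)  # expected z of one trial
    two_trial_hits = 0
    bayes_hits = 0
    for _ in range(n_sims):
        zs = [rng.gauss(shift, 1.0) for _ in range(n_trials)]
        # Rule 1: count trials that are significant in the drug's favor.
        if sum(z > 1.96 for z in zs) >= 2:
            two_trial_hits += 1
        # Rule 2: pool every trial (equal sizes), then ask for strong evidence.
        pooled_z = sum(zs) / sqrt(n_trials)
        bf = bayes_factor_10(pooled_z, n_per_arm * n_trials)
        if pooled_z > 0 and bf > 20:  # cutoff of 20 is an assumed threshold
            bayes_hits += 1
    return two_trial_hits / n_sims, bayes_hits / n_sims


for label, effect in (("ineffective drug", 0.0), ("effective drug", 0.3)):
    two_trial, pooled_bf = approval_rates(effect, n_per_arm=100, n_trials=5)
    print(f"{label}: two-trial rule = {two_trial:.4f}, pooled BF = {pooled_bf:.4f}")
```

The two-trial rule throws away everything about a trial except whether it crossed a line, while the pooled rule lets negative trials count against the drug. How the two rules compare in practice depends heavily on the cutoff and prior chosen, so the numbers from this sketch should be read as a demonstration of the mechanism only.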
van Ravenzwaaij, D., & Ioannidis, J. P. A. (2019). True and false-positive rates for different criteria of evaluating statistical evidence from clinical trials. BMC Medical Research Methodology, 19(218). DOI: 10.1186/s12874-019-0865-y