Publication Bias and Meta-Analyses: Tainting the Gold Standard with Lead

Randy Paterson, PhD, RPsych

As Ben Goldacre notes in his excellent book Bad Pharma, for decades the gold standard for medical evidence was the review article – an essay looking at most or (hopefully) all of the research on a particular question and trying to divine a general trend in the data toward some conclusion (“therapy X seems to be good for condition Y,” for example).

More recently, the format of review articles has shifted – at least where the questions addressed have lent themselves to the new style. The idea has been to look at the original data for all of the studies available, and in effect reanalyze them as though the research participants were all taking part in one gigantic study. By increasing the number of data points and averaging across the vagaries of different studies, a clearer finding might emerge.

This approach, the meta-analysis, has gone on to be revered as a strategy for advancing healthcare. It has vulnerabilities, of course:

  • It depends on the availability of a number of original studies.
  • It can be distorted by a particularly strong result in one study with a lot of participants.
  • It can only be as strong as the research design of its constituent parts.

Nevertheless, if there are a number of well-designed studies with roughly similar formats addressing a similar question, the meta-analysis can provide a balanced, weighted result that points nicely toward treatment selection decisions.

But How are Meta-analyses Affected by Unpublished Studies? 

In my last post I discussed how a publication bias (most commonly, a bias against publishing negative results) leads to a situation in the literature roughly equivalent to reporting only the participants who benefited from a treatment – and sweeping under the rug the data from those who did not. This bias poses a problem for meta-analyses as well.

Imagine that we want to evaluate the effectiveness of a radical new therapy in which depressed individuals talk about their relationships with their pets to the therapist. I don’t practice this form of therapy myself, you’ll be happy to know, but I’m sure someone does. Call it “Talking About Cats Therapy,” or TACT. Studies examining it compare participants’ mood improvements from pre- to post-therapy with the improvement seen in a placebo therapy (PT; let’s make it a sugar pill, for simplicity’s sake, though you’d generally want something that looks more like the treatment being tested).

We look at the published literature and find that there are six published studies. By an amazing coincidence, all six had the same number of participants (100; 50 in each condition), roughly similar outcomes (TACT participants improved on average 4 points more on the Beck Depression Inventory than PT participants), and the same amount of variability in response (lots: in every case, some people improved a lot and some less; a few even worsened).

Given this wide variability, we’ll imagine that only two of the studies show an effect large enough to achieve statistical significance. In the other four studies TACT was statistically no better than PT, despite still showing a 2-3 point advantage for TACT.

We conduct our meta-analysis, combining the subjects of the 6 studies into one analysis with 600 participants – 300 in TACT and 300 in PT. We’ve averaged the greater gains made by the participants in TACT – which come to 4.0 points overall. But because we now have 300 people per group, our study is more powerful – and that 4-point difference is enough to reach statistical significance – at a higher level (p<.01) than the two original studies that were significant (both p<.05).
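
The arithmetic behind this can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the post gives the sample sizes (50 per condition) and the 4.0-point average, but the standard deviation (13 points) and the individual study advantages below are made-up values chosen to fit the story, and a simple normal approximation stands in for whatever test a real meta-analysis would use.

```python
import math

# Assumptions (not from the post): the SD of BDI change scores and the
# per-study advantages. The post gives only 50 per arm and a 4.0 average.
SD = 13.0
N_PER_ARM = 50
published = [2.5, 2.5, 3.0, 3.0, 6.5, 6.5]  # TACT minus placebo, BDI points


def two_sided_p(diff, n_per_arm, sd=SD):
    """Two-sided p-value for a difference in means (normal approximation)."""
    se = sd * math.sqrt(2.0 / n_per_arm)  # standard error of the difference
    z = diff / se
    return math.erfc(abs(z) / math.sqrt(2.0))


# Each study analysed on its own, with 50 participants per arm.
study_p = [two_sided_p(d, N_PER_ARM) for d in published]

# With equal sample sizes and equal SDs, pooling everyone into one big
# study gives the simple mean of the differences, now with 300 per arm.
pooled_diff = sum(published) / len(published)
pooled_p = two_sided_p(pooled_diff, N_PER_ARM * len(published))

for d, p in zip(published, study_p):
    print(f"study advantage {d:.1f}: p = {p:.3f}")
print(f"pooled advantage {pooled_diff:.1f}: p = {pooled_p:.5f}")
```

Under these assumed numbers, only the two 6.5-point studies fall below .05 on their own, while the pooled 4.0-point difference falls below .01, which is the pattern described above.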

But there’s a secret.

In our fantasy universe there weren’t just 6 studies of TACT versus PT. There were 10. In 4 of the studies the results suggested that TACT actually made people worse, and the people receiving sugar pills improved a little due to expectancy (about the same amount as they did in the published trials).

Those four studies, like most of the many unsupportive studies of antidepressant medication discussed in my last post, were not published.

The developers of TACT, who firmly believe in the therapy (and stand to make big money from a well-supported therapy via training workshops), decided that there must be some flaw with these negative studies. In retrospect, the therapists weren’t perhaps so well-trained, and somehow there were a lot of people who didn’t actually like their cats in the TACT condition. And anyway, the journals surely wouldn’t be interested in publishing articles about therapies that are worse than placebo, so no point in trying.

But this unpublished data is important.

If we conducted a meta-analysis on all 10 studies, we would find that the positive-ish and negative studies average out, leading to a difference between TACT and PT of 0.00: a complete null effect. The unavailability of negative trials causes our state-of-the-art meta-analysis to misperceive a null therapy as effective.
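
A minimal sketch of that ten-study calculation, for readers who like to see the numbers. Everything except the sample size is an assumption made for illustration: the standard deviation (13 points), the six published advantages, and the four unpublished results (set at 6 points in favour of placebo so that the ten studies average to zero, as in the story). The p-values use a simple normal approximation.

```python
import math

SD = 13.0        # assumed SD of BDI change scores (not given in the post)
N_PER_ARM = 50   # from the post: 50 participants per condition

# TACT-minus-placebo differences in BDI points. All values are assumed.
published = [2.5, 2.5, 3.0, 3.0, 6.5, 6.5]   # averages 4.0
unpublished = [-6.0, -6.0, -6.0, -6.0]       # TACT did worse than placebo


def pooled_result(diffs, n_per_arm=N_PER_ARM, sd=SD):
    """Pool equal-sized studies; return (mean difference, two-sided p)."""
    mean_diff = sum(diffs) / len(diffs)
    # Pooling k equal studies is like one study with k * n_per_arm per arm.
    se = sd * math.sqrt(2.0 / (n_per_arm * len(diffs)))
    z = mean_diff / se
    return mean_diff, math.erfc(abs(z) / math.sqrt(2.0))


published_only = pooled_result(published)
all_trials = pooled_result(published + unpublished)

print("published only: diff = %.2f, p = %.5f" % published_only)
print("all ten trials: diff = %.2f, p = %.5f" % all_trials)
```

The same pooling function applied to the same therapy gives a significant 4.0-point advantage or no difference at all, depending only on which trials it is allowed to see.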

Why Does this Matter?

When negative studies go unpublished, and when meta-analyses depend only on the published work, the problems of biased data are not averaged out; they are combined. The result can be a stronger finding for a null or harmful therapy than was found in ANY of the studies upon which the meta-analysis was based (stronger, that is, in terms of significance level). Theoretically, it would be possible to obtain a significant meta-analysis of a hundred studies, none of which had reached significance on their own.
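
The hundred-study claim can be checked with a rough sketch. The assumptions are mine, not data from any real trial: every published study shows the same modest 2.5-point advantage, each has 50 participants per arm, the standard deviation is 13 points, and significance is judged with a normal approximation.

```python
import math

SD = 13.0        # assumed SD of change scores
N_PER_ARM = 50   # participants per arm in each hypothetical study
K = 100          # number of published studies being pooled
DIFF = 2.5       # assumed advantage shown by every published study


def two_sided_p(diff, n_per_arm, sd=SD):
    """Two-sided p-value for a difference in means (normal approximation)."""
    se = sd * math.sqrt(2.0 / n_per_arm)
    return math.erfc(abs(diff / se) / math.sqrt(2.0))


single_p = two_sided_p(DIFF, N_PER_ARM)       # any one study on its own
pooled_p = two_sided_p(DIFF, N_PER_ARM * K)   # all 100 studies pooled

print(f"single study: p = {single_p:.3f}")
print(f"pooled (k={K}): p = {pooled_p:.2e}")
```

No single study comes close to significance, yet the pooled result is overwhelmingly significant. If those hundred studies are the survivors of publication bias, the pooled p-value reflects the filter as much as the therapy.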

Meta-analysis is often viewed as a way of averaging out results and flaws in constituent studies. The lack of representativeness brought about by the nonpublication of negative data (which is the most common type of publication bias) is not compensated for by combining the published studies – it is made worse.

The researchers working with the Cochrane Collaboration, a group dedicated to creating systematic reviews of medical therapies, attempt to correct this problem by locating research trials that have gone unpublished. The results are frequently at variance with the conclusions that would be reached by a review of the published data alone – largely because researchers (or funders) frequently opt not to publish trials that are unsupportive.

Does this really matter? After all, if you are arguing that it is possible for a human to climb Mount Everest without oxygen, it takes only one positive result to make your point. It is irrelevant how many previous attempts resulted in failure.

In healthcare research, however, it matters a great deal. We are looking to see not whether it is possible for a given therapy or approach to benefit at least one person who gets it. Every therapy – whether it is past-life regression, Vitamin C, or high-colonic enemas – will appear to have helped someone, whether because of expectancy, spontaneous recovery, or pure chance. It is for this reason that patient testimonials are not considered to be valid evidence in favour of health-related procedures.

The question we are always asking is whether a therapy is effective (or damaging) for a group of people, be they male airplane phobics, all diabetes sufferers, or post-transplant patients on immunosuppressive drugs. We look at the variability (versus consistency) of response across individuals in our target group, the magnitude of effect, and the size of the effect once the influences of expectancy are removed (usually by comparing the treatment group with a placebo condition). This is precisely the type of judgement likely to be affected by examining only a subset of the data.

What this means is that although meta-analysis is a tremendously useful tool in healthcare research, it remains subject to one of the largest sources of research bias – the selective publication of results.

What Should We Do?

The obvious solution, arrived at by anyone who looks at the problem, is to create a registry for trials before they are carried out, with the understanding that only pre-declared trials will be published, and that all pre-declared trials will be published regardless of the results.

This initiative, at least for pharmaceutical trials, has been agreed upon and declared by a consortium of prominent journals, leading many of us to believe that a big part of the problem had been solved. (At least for medications commencing trials now – it is still not helpful in resolving the situation for medications already on the market.) I have openly stated as much at numerous workshops on depression treatment.

Unfortunately, I may have spoken too soon. According to Goldacre, the solemn pronouncements of the editors of many of medicine’s most prestigious journals have meant what few of us were cynical enough to fear: Nothing at all. The journals have gone on publishing unregistered trials much as they did before.

There’s just one difference. Having seen and acknowledged a fundamental problem that compromises the validity of the research they promote, their actions constitute an overt and conscious (rather than simply neglectful) abandonment of the principles of science.

Whether it will be decided, perhaps by future editors, that the welfare of patients merits an improvement in practice remains to be seen. We can only hope.

* * * * *


Goldacre, Ben (2012). Bad Pharma. New York: Faber & Faber.



  1. I’d say it’s way too early for “hope” as I’m sure the drug companies are going to be fighting tooth and nail to suppress those results that wouldn’t sell their product. Who, after all, do the drug companies have in their corner? Why, the researchers who are writing the articles in the reviews, some of which are being funded by the very same drug companies. There are also articles ghostwritten by drug reps to which researchers and consulting psychiatrists merely add their names.

    Thanks for your post. Anyone who thinks the articles in these peer-reviewed journals are particularly scientific should now have much cause to pause.

  2. For anyone who is a real scientist, the failure to publish “negative” results is an obviously unacceptable situation. Anyone can cherry-pick positive studies and make a drug appear effective. But that’s what the FDA actually encourages – all you need are two positive studies to get a drug approved, even if there are 45 negative studies saying it doesn’t work! It should be obvious that this approach asks for bias and corruption. And the larger suppression of even publicizing the negative studies makes it impossible for anyone to really know and understand whether something is effective or not.

    Which brings us all back to not being able to trust medical science. Caveat Emptor!

    —- Steve

  3. Thanks for this. There are other problems, built into the FDA’s guidelines for clinical trials. Adverse events must be reported, but only if the researchers think they were caused by the intervention, not the condition the drugs are meant to treat (think drug trials in psychiatry). So, if a participant in a mania drug trial gets akathisia, it can be written off as mania. If an antidepressant has a paradoxical effect, sending a mildly depressed participant into depths of hell far worse than her condition at the outset, that can be coded as progression of her depression, or ignored, and not coded as an adverse drug event.

    This must explain the contrast between clinical trial results and lived reality for people who use them.

    I wrote a blog post that covers this, and more, in response to recent columns in the NYTimes and LATimes by experts who say depression caused the Germanwings crash and that more people should seek treatment. Neither writer mentioned the possibility of a drug reaction; Friedman misquoted a meta-analysis, saying no suicides had occurred when in fact there had been 8 (5 treated, 3 placebo).

    This is it. It links to both problematic columns.

    • I agree, thanks for this blog. I feel it’s very important to spread the word that the published medical journal articles and supposed “gold standard” of medical evidence are tainted. Especially since seemingly most within the medical community have chosen to adopt so-called “evidence-based medicine,” and thus have opted to be no more credible than the pharmaceutical industry’s tainted (mis)information.

      Medicine used to be known as an art, as well as a science, but it seems strict adherence to belief in “evidence-based medicine” only has eclipsed this wisdom, to the detriment of patients. But, of course, this is much easier, and more lawsuit-proof, for the doctors. However, we are discussing essentially corrupt scientific evidence that affects people’s lives and health, not inconsequential scientific fraud and apathy.

      “Adverse events must be reported, but only if the researchers think they were caused by the intervention, not the condition the drugs are meant to treat (think drug trials in psychiatry).” This current unscientific approach, for example, will never point out that the “gold standard” cure for schizophrenia actually causes the schizophrenia symptoms either. But it does:

      “neuroleptics … may result in … the anticholinergic intoxication syndrome … Central symptoms may include memory loss, disorientation, incoherence, hallucinations, psychosis, delirium, hyperactivity, twitching or jerking movements, stereotypy, and seizures.”

      How do we know the most common cause of “schizophrenia” is not doctors mistakenly claiming patients are psychotic (for example, because they don’t want to deal with issues like child abuse), then misdiagnosing the central symptoms of neuroleptic-induced anticholinergic intoxication syndrome as “schizophrenia”? Especially given Read’s research pointing out that 77% of children brought in to hospitals with medical evidence of child abuse also get a psychosis diagnosis. And the research shows that 85% of schizophrenics had dealt with adverse childhood experiences.

      Just curious, since our society spends millions or billions on schizophrenia research, and the medical evidence is now pointing out it may just be the “gold standard” cure that is actually the cause of most schizophrenia.

  4. I gained from reading this that the problem of biased research is not relegated simply to drug research. Biased research is evident in other areas of medicine, including ‘natural’ medicine. In other words, just because some promising new therapy is ‘natural’ or ‘non-harmful’ doesn’t mean that it is effective or that the evidence behind it isn’t biased or cherry picked. I think self empowerment is the number one healer of people. The mind/body connection is huge and to harness the power of the mind to heal, one needs to believe that one can heal. One needs to have hope. The enemy of hope is often found within institutions that promote the disease model.

  5. Yes, I did note this as a limitation when I published my dissertation, that I used only PUBLISHED research for my meta-analysis. I’m working on a methodology paper for conducting qualitative meta-analysis and I propose using sources such as open-source journals, trade publications, blogs, newsgroups and other non-traditional sources as well as the “gold standard” of peer-reviewed journals. However, the bias is strong against such sources! For my dissertation I did two things I didn’t want to do; one was to limit myself to the so-called “high quality” journals and publications. I would have liked to include information from some trade journals and some advocacy blogs. I know if I didn’t limit my sources to those approved by “the academy” I’d never get my dissertation approved!

    In short if I had a good answer to this problem I’d publish it!

  6. Meanwhile, the parallel problems with the culture of labelling and prescribing must infuriate you as well… and everyone from social workers to judges to talk therapists contentedly rely on the compliance model, in the main. The need to disabuse all involved that they have so much as learned a thing about the drug interaction in the specific case before them, while taking their clients as “sick”, is the other road to getting scientific and more cautious. If caregivers have limited their inquiries to checklists and interviews, in terms of supposed symptoms and side effect notions that their checklists require, they have not yet seen anything but the generic version of human organism attached to insurance documents coming to see them. They have decided in anticipation of this event that this payee was only going to be able to babble. Because what people “have” are best called “symptoms”, and anyone wise to the fact tires of holding up their fingers to show the “supposed–“. Those who want counselling and psychoactive drugs have really to concern what we see are worries and fatigue and persistent dislikes. They have unmanageable doubts and discomfort about real conflicts they are facing. And in most cases, they contend with situations in which no one listens capably and fairly who might exercise their authority to clarify such unfortuitous situations if they did. Furthermore, typically, one manner of reacting to conflicts always predominates, with the rules applied to settling them getting enacted according to the standard assumption that their intents and purposes are self-explanatory, and that justice is therefore comprehensively regarded.

    The efficacy of drug solutions will remain undemonstrated if clients are thought of as enumerable substitutes for the trial subjects that were the real thing. Isn’t this effort by Gary Greenberg, one of your counterparts here, the essential type of complementary and needed analysis of the evidence and publication problems, and the theory and practical approach problems, that you see we have to obtain more of in order to achieve the fullest appreciation of human science applied in clinical settings?

    We needed Turner and the heat turned up in subsequent exposés, truly. Yet, isn’t the focus in Greenberg’s article on the same issues of protections against fraud, from front to back? Isn’t cultural criticism absolutely necessary for arriving at the right decisions about models for psychopharmacological science, and for establishing protections against fraud and abuse? Aren’t fraud and abuse the last things we can expect to hear about after allied mental health industry drug revolutions, and the first thing we get called to our attention before some painless “reforms” that leave the fundamentals of the traditional program firmly in place? Haven’t we got to rely on human sciences, history, and sociocultural criticism for detailing the scope and the significance of the problem of missing and unreliable consumer protections in behavioral healthcare, and the problem of their being overlooked again and again? Will the human sciences ever study themselves, routinely? I would love to see the connections formulated in a future piece that reined in the generalities of the statistical criticisms, and that then went on to illuminate the unimaginably bureaucratic purposes behind the unmasterful intentions to do any of this behavioral science scientifically at all, so far, in this long ruinous first fifteen years of the century.

    Thanks, Randy

  7. Here is an idea. If psychiatrists want to do studies so badly, why don’t they do them on themselves like Freud did with cocaine? I wouldn’t wish this upon anyone, but if people are so enamored with “science,” can’t we find a bunch of psychiatrists and pharmaceutical sales reps who will volunteer to take these dangerous psychotropic drugs for a couple of weeks, and then withdraw cold-turkey from them to see what happens? Another problem is that studies cannot possibly measure the long-term deleterious effects of drugs that have just been introduced in the market. There is always a new drug for a new diagnosis with new so-called “side effects.” We need to have an honest discussion about the corruption of the medical industry, the profit motive, and the contortion of “science” to produce results in accordance with the demands of the pharmaceutical industry. These aren’t conspiracies. These are documented, observable realities.

  8. I think the dependent measures in drug trials are a huge problem.

    The Hamilton scale only has one question about mood per se.

    Trials also don’t specifically ask about adverse outcomes, like arguing, violence, and mania, as far as I know.

    Hamilton was not devised to measure outcomes of SSRIs, after all. Focusing only on a flawed depression scale has surely muddied the waters.

    There should be a scale that includes questions about all known drug effects, not just the desired effects.

  9. Re: (Kirsch et al, 2008)

    It is not even necessary to suppress failed trials. Why risk the file drawer when you can eject subjects early in a study for failure to improve on antidepressants, and hope that luck of the draw will work in your favor in round two:

    “Replacement of patients who investigators determined were not improving after 2 wk was allowed in three fluoxetine trials and in the three sertraline trials for which data were reported.”

    That is, “we threw out data we didn’t like.”

  10. Bad therapy ideas come from the problem oldhead pointed out.

    If, in response to something scary, my level of neurotransmitter x falls, that does not mean I should increase my x level to escape feeling scared. It could be that the drop in x that occurred was part of a coping response which, if reversed, would allow the perpetuation of my fear. Maybe x has to get out of the way so wonderful y can come in and calm me.

    Not sure but that example might describe what went wrong when SSRIs were developed to ease depression. Recent news is that depressed people have high serotonin availability, not low. Perhaps depression is experienced when (or because) serotonin levels are high for some reason, and begins to remit spontaneously as serotonin drops (or is caused to drop by endogenous or environmental events). If so, inhibiting serotonin’s re-uptake (and thus multiplying its effects, I assume) would prevent the natural decline of depressive symptoms.

    ^that second paragraph is to science what Beanie Babies are to zoology. Please forgive me for the flight of fancy, unless it happens to be correct.

  11. I want to start by telling you my mathematical skills aren’t just questionable. They are nearly nonexistent due to a complete and absolute fear of all things mathematics. However, on my first day of statistics, following the professor’s explanation of what statistics is, my question was about using this method to track the number of whales in the world. I could tag and track a population study, go back and again count the tracked whales and let this represent the number of whales in the world? But we live in Georgia…. does that mean there are no whales? The one thing I retained from that course is simply that any detail, no matter how seemingly innocuous, effectively stacks the deck. When the public began to believe there was a notable spike in mass shootings, I ran the numbers. My version is extremely overcomplicated because I take the legal equality view of statistics, meaning these two are exactly alike or they don’t count. I’m sure you know what I uncovered. Completely bogus. I used a similar method when I evaluated the published findings for Xanax. That was when statistics became a word game because, yes, the majority of patients did experience a notable decrease in panic attacks and anxiety FOR THE FIRST two months. After that, the majority experienced as much as 8 times as many instances with what appeared to be no end in sight. This is more than unethical. It’s a dangerous standard when you are dealing with a group that is already troubled and misunderstood. I don’t understand why they even bothered to conduct a study. Their patients don’t have the luxury of saying no most of the time, and then it clicked. As a branch of “medicine” with extremely limited actual established science, the falsified data serves to establish psychiatry as medicine instead of what it is…. legal drug dealing with notably more established profit margins.

    I think medical research, in order to be established, should be tested by a completely unbiased team who know as little as possible going in and have nothing to gain. I also think that psychiatrists found to deliberately skew statistics, even as a means to fear monger, should be very publicly questioned by an ethical review board.

    • You are onto something there. I had similar experiences with a regrettable course of prescribed speed, designed by a lunatic psychiatrist. I was taking Adderall, Ritalin, and Prozac, and recall having an “awake dream” during which I was lying on my bed in the morning, dreaming convincingly that I was out doing my errands for the day. I went to the bank, UPS, and grocery shopping, except I didn’t.

      When I was a young adult I often had sleep paralysis and hypnogogic hallucinations, as well as lucid dreams. I always liked them, during and after.

      If you love chaos it probably means you have the ability to enjoy and accept multiple stimuli at once, which sounds like a strength, not a symptom or deficiency.

      Whether Chantix only makes the bad crazies in people with atypical minds is something that is studied. I pored over so many pages yesterday that I can’t recall all the findings I saw. Some studies recruited individuals with diagnoses of schizophrenia, and bipolar, while others excluded anyone who had sought treatment for depression in recent times. Pfizer funded a lot of them; they are probably trying to create evidence for a “pre-existing condition defense” against slews of lawsuits.

      I too used Chantix, and experienced the unbidden and compelling idea that I ought to kill myself.
      I wasn’t unhappy with anything at the time, nor depressed. It was astounding, because it arose sui generis, not as something I idly considered and easily discarded. There was no possibility that I would commit suicide, but the feeling was a pure transmission of doom, which conveyed the idea that I really should pack it in at the first opportunity. It was a day or so before I remembered that I was on a drug and the drug might have side effects. Once I figured it out, I discontinued it and the suicidal imperative expired.

      PS I will re-examine the Chantix studies and see what they concocted for their findings with the schizophrenic people.

  12. Hi Acidpop, I have spent much of the day and night reading the research on Chantix, a smoking cessation drug. I know what you mean in your post above.

    In the Chantix (varenicline) studies, the authors often include someone who works for, accepts speaking gigs from, or holds stock in Pfizer, which makes the drug.

    I have been hoping to figure out how the various studies avoid detecting many serious adverse events when, in real life, by the second year of its availability, it had prompted 988 adverse reaction reports, the most reports received for that time period. For 769 other drugs, the median number of reports is 5.

    I think I found the dishonest data trick, which is what I complained about earlier–the throwing away of data. I did not discover this trick–all the critics of psychiatry write about it.

    In the Chantix meta-analyses, when the researchers are counting adverse events, they often discard any adverse-event data for events that only happened to 5% or fewer patients, or, in one study, to 10% or fewer. The ability of Chantix to cause murder, suicide, and unprovoked violence is known. How many users experience that, though? Probably fewer than 5% or 10%.

    Some of these meta-analyses include more than 10,000 smokers, so even 1% with lethal outcomes would be more than 100 people.

    Of course, the researchers also often exclude people who drink, or are depressed or anxious, “bipolar” or “schizophrenic,” or have been to a shrink in the last year, to further ensure few adverse events.

    Then, when they cannot hide a finding showing Chantix users do have neuropsych adverse effects, they say “smokers tend to have a high rate of mental illness” and “withdrawal symptoms for smoking cessation may be responsible.”

    Chantix does, remarkably, lead to a higher rate of smoking cessation than do nicotine patches. But still, only about 22% stay clean for a year after quitting with Chantix, so why do we have a murder/suicide pill out there, from which 80% of users will not benefit, and will risk destroying their lives and those of others?

    Here is a group of customer reviews. Many of them say that their bodies and minds are seriously damaged after using the drug only a short while.

    You would not know that from reading all the studies I have read today.

    This situation is possible because of research manipulations, Pfizer-designed research, and advertising that is false.

    FDA: False Data Always?

    On the FDA site, there are the usual platitudes, and the results of two massive studies done by the VA and the DOD. No differences in psychiatric hospitalizations between Chantix users and nicotine replacement users. But, that is from a subset of Chantix users. One of the studies excluded PTSD hospitalizations from the data, and the other one only considered the outcomes for the first 30 days of Chantix use.

    Meanwhile, a medical watchdog group analyzed reports made to the FDA, and learned what the FDA already knew.

    • Interestingly enough, I took Chantix, and I have a small curiosity as far as whether it affects those deemed mentally ill differently than those deemed otherwise. My experience was that the drug worked as far as smoking cessation, provided you continued to take the drug. Once you quit, all bets were off. I didn’t experience suicidal thoughts or violent tendencies. I had some of the most bizarre lucid dreams I have heard of though. At one point I woke early having set my clock because I had to put my phone back together. I recall with absolute clarity getting wet and taking it apart down to sorting the screws the way you would a laptop to make sure everything goes back in the way it came out. That never happened. However, at some point, I did set the alarm for precisely that reason. When I stopped taking it, I developed sleep paralysis. That was a new concept for me. I suppose the concept of sleep had seemed safe enough up until then. I would have these nightmares where I was attempting to hide, usually under a bed, and I would know I was asleep, but I couldn’t move or make any sound. Eventually, this bizarre whistling in my throat would wake me fully as I attempted to scream myself awake. I cannot say with any certainty that the Chantix was responsible for that, but that weirdly lucid feeling was both alien and prevalent in both series of experiences. The reason I wonder if it affects the group differently is another weird theory of mine… I am indecisive, largely incompetent, and given to moments of anxiety when dealing with things as normal as getting dressed. Give me an atmosphere of absolute chaos. Normal people seem to break down entirely. Introduce me to absolute chaos, and I am suddenly in my element. I’ve watched this phenomenon on several occasions and from several different perspectives.

      My current working theory is that sane people really can’t process even the slightest introduction of “crazy” so I wondered if the studies took mental state into account, whether they conducted separate studies, or they chose the most likely option to create skewed results and simply didn’t note it as a factor.