Way back in the 1970s when I first started studying psychology I heard about publication bias. It was easier to get a study published if it had significant results than if it didn’t.
That made a certain amount of sense. A study producing only nonsignificant results (group against group, variable against variable, pretest versus post-test) might be badly designed, underpowered (too weak to detect a genuine effect), or simply misconceived. No wonder no one wanted to publish it. And who cares about hypotheses that turn out not to be true anyway?
Partly, of course, the problem is obvious: if positive studies are much more likely to be published than negative ones, then erroneous positive results will tend to live on forever rather than being discredited.
More recently the problem of publication bias has been shaking the foundations of much of psychology and medicine. In the field of pharmacology, the problem is worse, because the majority of outcome trials (on which medication approval and physician information are based) are conducted by pharmaceutical firms that stand to benefit enormously from positive results, and run the risk of enormous financial loss from negative ones. Numerous studies have found that positive results tend to be published, while negative ones are quietly swept under the rug, as documented by Ben Goldacre in his excellent book Bad Pharma.
In one analysis of outcome trials of antidepressants (Turner et al., 2008), 48 of 51 published studies were framed as supportive of the drug being examined (meaning that the medication outperformed placebo). Of these, 11 were regarded by the US Food and Drug Administration as questionable or negative but were framed as positive outcomes in publication.
So the published data look like this (P = positive, N = negative): 48 P and 3 N.
Given that a great number of readers only look at the study abstract or conclusion, or lack the skills to detect spin, they’ll miss the reality that many of the positive trials aren’t so positive. The real published data look more like this: 37 P, 11 questionable or negative trials framed as positive, and 3 N.
In contrast, only 1 of 23 unpublished studies supported the idea that the medication being tested was effective.
So the real picture is more like this: of 74 trials in all, 38 P, 11 questionable or negative trials framed as positive, and 25 N.
Given that physicians, who are urged to prescribe based on the research, only have access to published data, the result is likely to be a systematic exaggeration of drug benefits.
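The arithmetic behind those three pictures is easy to check. Here is a minimal sketch in Python that uses only the counts quoted above; the variable names are mine, not Turner et al.'s.

```python
# Counts quoted in the text above (Turner et al., 2008).
published_total = 51
published_framed_positive = 48   # presented in print as supporting the drug
published_questionable = 11      # framed positive, but questionable/negative per the FDA
unpublished_total = 23
unpublished_positive = 1

# Published trials that were positive in the FDA's judgement.
published_truly_positive = published_framed_positive - published_questionable

all_trials = published_total + unpublished_total
all_truly_positive = published_truly_positive + unpublished_positive


def pct(part, whole):
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100 * part / whole, 1)


print("As framed in print:      ", pct(published_framed_positive, published_total), "%")
print("Published, spin removed: ", pct(published_truly_positive, published_total), "%")
print("All trials, spin removed:", pct(all_truly_positive, all_trials), "%")
```

The share of supportive trials drops from roughly 94% as framed in print, to about 73% once the spin is removed, to about 51% once the unpublished trials are counted.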
Smug psychologists (and others) have stood by smirking, unaware that their perspective is elevated only because they are being hoisted by their own petards. True, there are no billion-dollar fortunes to be made from a psychological theory or a therapeutic technique, but there remain more than enough influences to result in a publication bias for noncorporate research:
- A belief (often justified) that journals are more likely to reject articles with nonsignificant results.
- A tendency to research one’s own pet ideas, and a corresponding reluctance to trumpet their downfall.
- A bias to attribute nonsignificant results to inadequate design rather than to the falsehood of one’s hypotheses.
- Allegiance to a school of thought that promotes specific ideas (such as that cognitive behavior therapy is effective – one of my own pet beliefs) and a fear of opprobrium if one reports contrary data.
Does Publication Bias Fundamentally Violate the Principles of Science?
Although science can lead to discoveries of almost infinite complexity, science itself is based on a few relatively simple ideas.
- If you dream up an interesting idea, test it out to see if it works.
- Observation is more important than belief.
- Once you’ve tested an idea, tell others so they can argue about what the data mean.
- And so on.
Even science, in other words, isn’t rocket science. One would think that in execution it would be about as simple as in explanation. But no. In practice, it’s extremely easy for things to go wrong.
An early statistics instructor of mine showed our class an elementary problem with research design by discussing a study of telekinesis (the supposed ability to move things with the mind). The idea was to determine whether a talented subject could make someone else’s coin tosses come up “heads.” As the likelihood of a properly balanced coin coming up heads is 50%, anything significantly above this would support the idea that something unusual was going on. And indeed, the results showed that the coin came up as heads more often than random chance would suggest. The instructor invited us to guess the problem in the study.
A convoluted discussion ensued in which we all tried to impress with our (extremely limited) understanding of statistics and research design – and with our guesses about the tricks the subject might have employed. Then the instructor revealed what the experimenters had done.
They knew that psychics reported sometimes having a hard time “tuning in” to a task. So if they used all of the trials in the experiment, they might bury a genuine phenomenon in random noise – like trying to estimate the accuracy of a batter in baseball when half the time he is blindfolded. Instead they looked for sequences in which the subject “became hot,” scoring more accurately than chance would allow, and marked out these series for analysis. Sure enough, when compared statistically to chance, there were more ‘heads’ than random chance could account for.
We stared at the instructor, disappointed that his example wasn’t a bit, well, less obvious. How could reasonably sane people have deluded themselves so easily? Clearly this little exercise would have nothing useful to teach us in future.
Try it yourself sometime. Flip a coin (or have someone else do so), and try to make it come up heads. One thing it will almost certainly not do is this:
Instead, you’ll get something like this (I just tried it and this is what I got):
Totals: Heads = 63; Tails = 64
Now imagine that you only analyze sequences of 6 or more where I seem to have been “hot” at producing heads.
Drop the rest of the trials, assuming that I must have been distracted during those ones, and analyze the “hot” sequences:
Heads: 43 Tails: 12
Et voila: Support for my nonexistent telekinetic skills.
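If you would rather not flip a coin a hundred times, the same trick can be sketched in a few lines of Python. The selection rule here is my own assumption, not the one the experimenters used: cut the flips into consecutive blocks of 6 and keep only the blocks with at least 4 heads. The function names, block size, threshold and seed are all illustrative.

```python
import random


def flip_coins(n, seed=0):
    """Simulate n fair coin flips; True means heads."""
    rng = random.Random(seed)
    return [rng.random() < 0.5 for _ in range(n)]


def hot_streak_tally(flips, block=6, min_heads=4):
    """Keep only the 'hot' blocks and count the heads and tails inside them.

    The block length and threshold are illustrative assumptions,
    not the rule used in the original telekinesis study.
    """
    heads = tails = 0
    for start in range(0, len(flips) - block + 1, block):
        chunk = flips[start:start + block]
        h = sum(chunk)
        if h >= min_heads:  # the subject was supposedly "tuned in" here
            heads += h
            tails += block - h
    return heads, tails


flips = flip_coins(1200)
all_heads = sum(flips)
hot_heads, hot_tails = hot_streak_tally(flips)
print(f"All flips:  {all_heads} heads, {len(flips) - all_heads} tails")
print(f"'Hot' only: {hot_heads} heads, {hot_tails} tails")
```

The exact counts depend on the seed, but the pattern does not. The full record hovers around half heads, while the retained blocks are at least two-thirds heads by construction, because the selection rule guarantees it.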
Okay, So That Feels Belabored Because It Is So Completely Obvious. Why Bother With It?
Well, let’s shift the focus from different periods of a single subject’s performance, to between-subjects’ performances.
Imagine a drug trial in which half the subjects receive our new anti-pimple pill (“Antipimpline”) and half get a placebo. We’ll compare pre-to-post improvement in those getting the drug to those not getting it. And we’ll look at a variety of demographic variables that might have something to do with whether a person responds to the drug: gender, age group, frequency of eating junk food, marital status, income, racial group.
Damn. Overall, our drug is no better than placebo. But remember that data are never smooth, like HTHTHTHTHT. They’re chunky, like HTTTHTHTTH. Trawl the data enough and we are sure to find something. And look! White males under 25 clearly do better on the drug than on placebo! The title of our research paper practically writes itself: Antipimpline reduces acne amongst young Caucasian males.
Okay, well even that causes some eye-rolling. Surely no one would be foolish enough to allow for a fishing expedition like this one. Or if they did, they would demand that you replicate the finding on a new sample to verify that it didn’t just come about as a result of the lumpiness of your data.
Well, wrong. Fishing expeditions like this appear throughout the literature.
The point, however, is that if we are looking for an effect, we will almost always find it in at least some of our subjects.
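To get a feel for how often a fishing expedition pays off, here is a rough simulation of a trial in which the drug does nothing at all. Everything about it is an assumption made for illustration: 400 subjects, each of the six demographic variables reduced to two levels, improvement scores drawn from a standard normal distribution, and a two-sided normal-approximation test comparing drug to placebo within each subgroup.

```python
import random
from statistics import NormalDist, mean, variance

TRAITS = ["gender", "age_group", "junk_food", "marital_status", "income", "racial_group"]


def simulate_null_trial(rng, n=400):
    """One trial in which the drug does nothing: improvement ignores the arm."""
    return [
        {
            "drug": rng.random() < 0.5,
            "improvement": rng.gauss(0, 1),
            **{t: rng.random() < 0.5 for t in TRAITS},  # each trait cut to two levels
        }
        for _ in range(n)
    ]


def p_value(drug, placebo):
    """Two-sided p-value from a normal-approximation (Welch-style) z-test."""
    se = (variance(drug) / len(drug) + variance(placebo) / len(placebo)) ** 0.5
    z = (mean(drug) - mean(placebo)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def significant_subgroups(subjects, alpha=0.05):
    """Return every (trait, level) subgroup where drug and placebo differ at alpha."""
    hits = []
    for trait in TRAITS:
        for level in (False, True):
            group = [s for s in subjects if s[trait] == level]
            drug = [s["improvement"] for s in group if s["drug"]]
            placebo = [s["improvement"] for s in group if not s["drug"]]
            if len(drug) > 1 and len(placebo) > 1 and p_value(drug, placebo) < alpha:
                hits.append((trait, level))
    return hits


rng = random.Random(1)
n_trials = 300
trials_with_a_hit = sum(
    bool(significant_subgroups(simulate_null_trial(rng))) for _ in range(n_trials)
)
share_with_hit = trials_with_a_hit / n_trials
print(f"Null trials with at least one 'significant' subgroup: {share_with_hit:.0%}")
```

Under these assumptions, the share of useless-drug trials that turn up at least one "significant" subgroup should land well above the nominal 5%, and that is before slicing the data any finer (white males under 25, say).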
Let’s shift again – from comparing subject by subject data to study by study. We’ll do 20 studies of antipimpline, each on a hundred subjects. We’ll use the .05 level of statistical significance (meaning that we will get a random false positive about once in every 20 comparisons). Then we’ll define three primary outcomes (number of pimples, presence/absence of 5 or more severe lesions, and subject reports of skin pain) and two secondary outcomes (nurse ratings of improvement, reported self-consciousness about skin).
If these outcomes are not correlated with one another, we’ve just inflated the probability of getting at least one positive outcome per study from 5% to about 23% (1 − 0.95^5), or nearly 5 in 20. Nowhere will you see a study stating that the actual error rate is close to 25%, however. (In fact, the defined outcomes probably are correlated, so perhaps we’ve really only inflated our odds of success from 5% to 15% or so.)
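The inflated error rate can be checked directly. This small sketch assumes the outcomes are fully independent, which, as noted, they probably are not; the function name is my own.

```python
def familywise_error_rate(alpha, n_outcomes):
    """Chance of at least one false positive across independent outcomes."""
    return 1 - (1 - alpha) ** n_outcomes


for k in (1, 3, 5):
    print(k, "outcome(s):", round(familywise_error_rate(0.05, k), 3))
# 1 outcome(s): 0.05
# 3 outcome(s): 0.143
# 5 outcome(s): 0.226
```

With five independent outcomes the chance of at least one spurious "success" is a little under 23%, and any correlation between the outcomes pulls it back down toward 5%.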
And what happens? Imagine we count as positive (we’ll denote that as ‘P’) any study that is superior to placebo on at least one outcome measure, and negative (‘N’) if no measure is significantly better than placebo. Here’s what we get from our 20 studies:
From our 20 studies we get 4 showing antipimpline to be superior to placebo on at least one outcome measure. We publish those studies, plus one more (at the insistence of a particularly vociferous researcher). The others we dismiss as badly done, or uninteresting, or counter to an already established trend. Something must have gone wrong.
Publication is how studies become visible to science. So what’s visible? Five studies of antipimpline, of which 4 are positive:
Fully 80% of the published literature is supportive, so it seems likely we have a real acne therapy here. Antipimpline goes on to be a bestseller. What’s missing? This:
Lest we nonpharmacologists reactivate our smugness, swap out “mynewgreat therapy” for antipimpline and we can get the same outcome.
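The whole thought experiment can be run as a toy simulation. This sketch assumes a completely useless treatment, five independent outcomes per study, and the publication rule described above (every positive study gets published, plus one negative). The function name, parameters and seed are mine.

```python
import random


def run_literature(rng, n_studies=20, n_outcomes=5, alpha=0.05, extra_negatives=1):
    """Simulate studies of a useless treatment, then apply the publication filter."""
    results = [
        any(rng.random() < alpha for _ in range(n_outcomes))  # 'positive' on any outcome
        for _ in range(n_studies)
    ]
    positives = sum(results)
    negatives = n_studies - positives
    published_negatives = min(extra_negatives, negatives)
    return {
        "positive_all": positives,
        "total_all": n_studies,
        "positive_published": positives,  # every positive study is published
        "total_published": positives + published_negatives,
    }


rng = random.Random(2024)
lit = run_literature(rng)
print(f"All studies: {lit['positive_all']} of {lit['total_all']} positive")
print(f"Published:   {lit['positive_published']} of {lit['total_published']} positive")
```

Whatever the seed, the published record can never look worse than the full one, and usually it looks far better. The gap between the two lines is precisely the part a reader of the journals never gets to see.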
Way back in introductory stats class we could not believe that our instructor was giving us such a blatantly bad example of research. Obviously the deletion of trials not showing the “effect” meant that the work could no longer be considered science. It was firmly in the camp of pseudoscience.
Switch to reporting only some subjects’ data, and we have exactly the same thing: Pseudoscience.
And conduct multiple studies on the same question and publish only some of them? Once again: exactly the same problem. By deleting whole studies (and their statistical comparisons) we inflate the error rates in the published literature. And by how much? By an amount that cannot be calculated without access to the original studies – which you do not know about and cannot find.
As a result, unless all studies on a similar question are published (that is, unless there is no systematic publication bias), it becomes impossible to know the error rate of the statistics. Without that error rate, the statistics lose their meaning.
* * * * *
Goldacre, Ben (2012). Bad Pharma. New York: Faber & Faber.
Turner, EH, Matthews, AM, Linardatos, E, Tell, RA, & Rosenthal, R (2008). Selective publication of antidepressant trials and its influence on apparent efficacy. New England Journal of Medicine, 358, 252-260.