R. A. Fisher was the famous mathematician and statistician who introduced many of the concepts and procedures of modern statistics. Two of Fisher’s innovations were the ideas of randomization and statistical significance (Healy, 2012). These tactics helped Fisher solve the problem of how to tell whether or not different fertilizers affected crop yield. Essentially, Fisher’s problem was this: If Fertilizer A and Fertilizer B were used on two different fields of wheat, and more grain came from the field using Fertilizer A, how could he be sure that it was actually the fertilizer that produced the difference and not something else that he hadn’t taken into account?
Fisher’s ingenious idea was to randomly mix up the allocation of fertilizers to fields. So, if he had 10 fields, he would randomly allocate Fertilizer A and Fertilizer B among them. By using this procedure, if there were any differences between the fields, they would not systematically influence the crop yield associated with either fertilizer.
Also, Fisher was not prepared to accept the results of just one experiment (Healy, 2012). He reasoned that if the same experiment were repeated 20 times and, on at least 19 of those occasions, the field that was fed with Fertilizer A produced more grain than the field receiving Fertilizer B, then it would be reasonable to conclude that Fertilizer A was indeed a better fertilizer than Fertilizer B in terms of crop yield.
The idea of setting a standard for how often a particular result must be repeated forms the basis of statistical significance. Testing for statistical significance in this fertilizer example begins with the assumption that there is no difference in impact on crop yield between the two fertilizing agents. Suppose in one experiment there is a difference of 15 units of crop yield between Fertilizer A and Fertilizer B. Should we conclude that Fertilizer A is better? Or could it be that the soil in the field upon which Fertilizer A was sprinkled was already more enriched than the field to which Fertilizer B was scattered? If it is unlikely that this result would occur just by luck or chance or nature’s whim, then we conclude that it is a statistically significant result, indicating that there is an actual difference between the fertilizers.
Randomization and statistical significance are two of the fundamental ingredients of the research methodology known as the Randomized Controlled Trial (RCT) which is widely regarded as a “gold standard” of evidence (Healy, 2012). From this perspective, the results obtained through an RCT are considered to be more noteworthy and more believable or dependable than results obtained by other means. It is now increasingly recognised, however, that the idea of a hierarchy of evidence is “fundamentally wrong” (Jadad & Enkin, 2007, p. 106). The “best evidence” is obtained in any particular situation by matching an appropriate methodology to a well-articulated and meaningful research question.
An RCT is simply a research tool and, as a tool, it can be used in a variety of ways. Unfortunately, the idea of a hierarchy of evidence seems to be hypnotically seductive for many people and powerfully useful for the drug companies. In order to get a drug to market, regulators such as the Food and Drug Administration (FDA) in the US, and their counterparts in Europe, only require drug companies to produce two RCTs with statistically significant positive results (Healy, 2012). Perhaps this very low standard has contributed to the fact that RCTs can be much more useful as marketing tools for drug companies than for discovering new and useful ways for people to live healthy and meaningful lives.
Why should two RCTs with statistically significant positive results be considered a low standard? Partly because of the nature of statistical significance and partly because a standard such as this makes no comment about how many RCTs there might have been that showed a negative result.
Gotzsche (2013) illustrates just how fickle statistical significance can be. He describes an example in which 200 people received an active drug and were compared with 200 people who received a placebo. If 121 people in the drug group improved but only 100 people in the placebo group improved, the probability of obtaining a difference of 21 or greater if the treatment and the placebo were really having a similar effect is 0.04. In this case, then, the researchers could claim that this is a statistically significant result because the probability value is less than the conventional standard of 0.05.
If the numbers were only slightly different, however, the opposite result would be obtained (Gotzsche, 2013). So, if 119 people in the drug group improved compared with 100 people in the placebo group, the probability of this result is now 0.07 so it would not be considered to be statistically significant because it is greater than 0.05. A difference of only two people out of 400, therefore, can be the difference between a statistically significant result and a non-significant result.
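Gotzsche’s two scenarios can be reproduced with a continuity-corrected two-proportion z-test (equivalent to a Yates-corrected chi-square test on the 2×2 table). This is one plausible choice of test that yields his figures, not necessarily the exact procedure he used; a minimal sketch in Python:

```python
from math import erf, sqrt

def two_proportion_p(improved_drug, improved_placebo, n=200):
    """Two-sided z-test for the difference between two proportions,
    with a continuity correction (equivalent to a Yates-corrected
    chi-square test). Returns an approximate p-value."""
    p1 = improved_drug / n
    p2 = improved_placebo / n
    pooled = (improved_drug + improved_placebo) / (2 * n)
    se = sqrt(2 * pooled * (1 - pooled) / n)   # pooled standard error
    z = (abs(p1 - p2) - 1 / n) / se            # 1/n = continuity correction for equal groups
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))  # two-sided normal tail

# 121 vs 100 improvers out of 200 per group: p is roughly 0.04 (significant)
print(round(two_proportion_p(121, 100), 3))
# 119 vs 100 improvers out of 200 per group: p is roughly 0.07 (not significant)
print(round(two_proportion_p(119, 100), 3))
```

Moving just two people between outcomes drags the p-value across the 0.05 line in either direction, which is the fickleness Gotzsche is pointing at.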
In an experiment, the size of the effect, the number of participants, and statistical significance are all related. This means that as long as the effect is not zero, increasing the number of participants in a study will virtually guarantee that statistical significance is achieved. It may be this particular relationship between treatment effect, sample size, and statistical significance that led Healy (2012) to conclude that “the greater the number of people needed in a trial, the more closely the treatment resembles snake oil – which contains omega-3 fatty acids and can be shown in controlled trials to have benefits if sufficiently large numbers of people are recruited” (p. 68).
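The sample-size point can be illustrated with the same kind of two-proportion test: hold a small effect constant (say, 52% improving on the drug versus 50% on placebo, figures invented purely for illustration) and watch the p-value fall as the trial grows.

```python
from math import erf, sqrt

def two_proportion_p(x1, x2, n):
    """Two-sided continuity-corrected z-test for two proportions,
    each out of n participants."""
    pooled = (x1 + x2) / (2 * n)
    se = sqrt(2 * pooled * (1 - pooled) / n)
    z = (abs(x1 - x2) / n - 1 / n) / se
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))

# The same fixed, tiny effect (52% vs 50% improving) at growing sample sizes:
# the p-value shrinks steadily even though the effect never changes.
for n in (200, 2_000, 20_000):
    print(n, round(two_proportion_p(int(0.52 * n), int(0.50 * n), n), 4))
```

With these made-up numbers the 2-percentage-point difference is nowhere near significant at 200 per group but falls well below 0.001 at 20,000 per group, which is exactly Healy’s snake-oil point.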
The standard of only requiring two positive results allows drug companies to mask adverse outcomes. This means that while the difference on a rating scale may be statistically significant, the number of deaths or other serious adverse events might not be (Healy, 2012). While it is likely to be relatively straightforward to tell whether one field of wheat produces more grain than another field of wheat, it can be much more ambiguous to decide whether one group of people is healthier than another after receiving some treatment for their psychological turmoil.
Jachuck and colleagues (1982), for example, investigated the effect of stabilising blood pressure through drug therapy on the quality of life of a group of 75 patients. They asked the patient, the patient’s physician, and a relative or close companion of the patient about the effects of the drugs on the patient’s quality of life. The physicians rated all 75 patients as having an improved quality of life. The relatives rated 74 patients as having a worse quality of life and only one as improved. According to the patients, 36 had an improved quality of life, 7 had a worse quality of life, and 32 reported no change in their quality of life.
This type of variability in reported outcomes provides a lot of “wiggle room” for people such as those working in the marketing departments of drug companies. A compelling example of just how wiggly the evidence base can be is provided by Turner and colleagues (2008).
These researchers investigated 74 FDA-registered RCTs of 12 antidepressant agents. Of the 74 studies, the FDA determined that 38 had a positive result (Turner et al., 2008). Of these 74 studies, however, only 51 were published. All except one of the studies showing a positive result were published, three studies showing a negative result were published, and another 11 were published as positive results even though this was at odds with the conclusions the FDA had reached about these studies. So, whereas the published evidence base shows 94% (48 out of 51) of these studies having a positive benefit for antidepressants, the actual research results showed only 51% (38 out of 74) with a positive benefit (Turner et al., 2008).
Although Jadad and Enkin (2007) consider it unethical to conduct RCTs primarily for commercial interests, it appears that this is precisely what drug companies are currently doing with regard to drug treatments for psychological unrest. Whereas once the research and development budgets of drug companies were larger than their marketing budgets, the situation is now reversed (Healy, 2012). Healy (2012) points out that the aim of drug companies is to get their drugs to market in order to generate profits for their shareholders. The drug companies have been so successful in doing this that, in 2002, the combined profits for the 10 drug companies in Fortune 500 exceeded the profits of all the other 490 companies put together (Gotzsche, 2013).
The RCT has become a powerful device for helping drug companies achieve their aim of maximising profits. Virtually everything we know about drugs comes from what the drug companies tell us (Gotzsche, 2013). To say that a drug “works” for example, simply means that a drug company has been able to produce two studies that showed statistically significant differences on the average scores of rating scales or blood tests between an active drug group and a comparison placebo group (Healy, 2012).
It is not in a company’s best interest to design drugs that cure health problems. Companies will generate more profits for longer if they can market drugs as being necessary to take for long periods of time, perhaps even for the rest of a person’s life. The situation now has a “through the looking glass” quality to it. Currently, the pills people take are saving the lives of the drug companies that produce them (Healy, 2012) rather than correcting a deficient supply of some well-being chemical in our brains.
Fundamentally, even when a chemical does help a person feel better, we have literally no scientific understanding of why or how that has occurred (Healy, 2012). RCTs primarily produce associations between drugs and rating scales (Healy, 2012). We remain clueless, however, as to why any particular association might exist. If we came to see RCTs as identifying relationships that need to be explained, we might be just as interested in the studies that did not produce the desired associations as in those that did.
Will we look back on this period of our pharmacological treatment of psychological distress in the same way that we now think about the Thalidomide era? We need much better information than we currently have about the drugs we are continually pressured to ingest. Resources such as this website (www.madinamerica.com) and David Healy’s www.rxisk.org are helping to turn the tide.
People are not fields of wheat. To shift our research attention from fields of wheat to fields of dreams we need different methods and different understandings. If we are to help more people sow their own fields of dreams, and to harvest the benefits of all that a mind unrestrained by damaging drugs is capable of producing, we need a fundamental change in our approach. We need to break the spell of the omnipotence of RCTs and use different methodologies to thoroughly understand the nature of psychological torment and how it is resolved.
Medication needs to become an ancillary or supplementary aspect of treatment if it is used at all. People need to be understood as active agents who are somehow being thwarted in their attempts to live lives of meaning and value. To offer help that will be experienced as helpful, clinicians and researchers must focus on understanding the process of living as it is lived, not as it is observed, and do all that they can to assist and support this process rather than impeding or retarding it. In this endeavour the voice of the person being helped will be a central and guiding factor.
* * * * *
Carey, T. A. (2015). Some problems with randomized controlled trials and some viable alternatives. Clinical Psychology and Psychotherapy. DOI: 10.1002/cpp.1942
Gotzsche, P. C. (2013). Deadly medicines and organised crime: How big pharma has corrupted healthcare. London: Radcliffe Publishing.
Healy, D. (2012). Pharmageddon. Berkeley, CA: University of California Press.
Jachuck, S. J., Brierley, H., Jachuck, S., & Willcox, P. M. (1982). The effect of hypotensive drugs on quality of life. Journal of the Royal College of General Practitioners, 32, 103-105.
Jadad, A. R., & Enkin, M. W. (2007). Randomized controlled trials: Questions, answers and musings (2nd ed.). Malden, MA: Blackwell Publishing.
Turner, E. H., Matthews, A. M., Linardatos, E., Tell, R. A., & Rosenthal, R. (2008). Selective publication of antidepressant trials and its influence on apparent efficacy. The New England Journal of Medicine, 358, 252-260.
Mad in America hosts blogs by a diverse group of writers. These posts are designed to serve as a public forum for a discussion—broadly speaking—of psychiatry and its treatments. The opinions expressed are the writers’ own.
The arbitrary idea of “statistical significance” always seemed silly to me. Who ever decided that .05 was the magic level at which a result becomes important or non-important? Degrees of significance occur along a continuum or spectrum; results don’t suddenly become valid when they happen 96% of the time, and unimportant non-results when they happen 94% of the time.
Because there was a need to have some cutoff. In fact, in other circumstances people accept p values as high as 0.1 or insist on values as low as 0.01. It is arbitrary. To make it more complicated, as the author rightly suggests, making more comparisons increases the chance of getting a positive result, so a post hoc test should be applied. There are many post hoc tests, and deciding which one to use, if any, is even more arbitrary.
Yes, exactly. That’s just the sort of thing I wanted my article to highlight. There are moves to change our reliance on the 0.05 benchmark but it’s a hard tradition to change.
Great article exposing the many problems with corruption of “evidence based medicine”.
Thanks B. I’m really glad you liked it.
In addition to the author’s very valid observations, there is another contaminating factor, and that is that the actual definition of what is “better” can differ depending on values and priorities of the party doing the evaluating. Most psych drug studies focus only on symptom reduction as the goal of treatment. Using this as a guide, we could easily and correctly design a study showing that alcohol is an excellent treatment for anxiety, probably as effective as the benzos with a somewhat better side effect profile. In fact, I’d love to see low doses of marijuana tested against SSRIs for antidepressant effects – I’d bet marijuana would come out on top with ease.
But if you measure other data points, like long term health, employment, relationship stability, community engagement, personal satisfaction with life, you get very, very different results. I think part of the mesmerization has been the act of convincing everyone that symptom reduction is the ultimate measure of effectiveness, because the drug companies can ask “what symptom can we say this drug reduces” and then find or invent a disorder that encompasses the symptom in question.
Part of the reason Whitaker’s work has been so influential is that he cuts through that assumption and asks the important question, “Does symptom reduction in the short term lead to better lives for patients in the long term?” And the answer in every circumstance appears to be NO.
Yep, great point. Robert Whitaker has done a great job of highlighting these problems. As has David Healy. Not only do we focus almost exclusively on symptoms but we also investigate these symptoms using, mostly, self-report questionnaires.
We need a much greater focus on the effects our treatments have on the person’s life from the person’s perspective. Are they able to socialise more, or hold down a job, or build lasting relationships, and so on. Unfortunately, these things are hard to pin down and not necessarily changeable over the few months that most RCTs last.
Tim – Despite the very concentrated focus on the dry run of facts, I can tell we are not dealing with a ghostwriter, here. Thanks a million for getting your theoretical points about rapport and narrative solutions and the honest weighing of apples and oranges in partnership in there, in the concluding paragraphs especially. After looking between the lines from the start, that was satisfying to come away with–the math is just revealing for different reasons than the reductionists want it to be: what’s new? Nice way of very competently retaining the human element.
Thanks travailler-vous – I’m glad you liked the way it ended up.
Tim – It’s a keeper, math, science, ethics. Keep up the good work.
Why should I have swallowed a pill for my harrowing life?
I realized for the first time just the other day that I actually have a 6th grade education. Oh, I attended school all the way through senior year. Sort of. But because of my chronic absenteeism and tardiness I was going to fail my senior year so I signed myself out and didn’t graduate. I spent all four of my high school years in detentions and suspensions because of a severe sleeping disorder that went ignored in favor of the social control, mental illness, which essentially murdered me.
I never needed any of psychiatry’s drugs.
Yes, I was some sort of depressed. No, it was not a chemical imbalance or anything wrong with my brain. I was unfathomably traumatized in a way that the vast majority of people do not experience.
The point I’m making is this: no amount of studies will ever be proof of a true “medicine” when there is no true biological disease occurring in a trauma victim’s body.
There was never anything wrong with my brain. In fact, I was the golden-horned unicorn if there ever was one. Keyword: was.
I stopped learning in the 6th grade. That was right around the time I knew for certain just exactly how bad my home life was, and how hopeless and bleak my future was. It was when I arrived in foster care – foolishly believing those people would understand, and help – that I was forced into psychiatry and psychiatry forced onto and into me.
Just to say it again because it’s so true…
no amount of studies in all of ever will ever be proof of a true “medicine” when there is no true biological disease occurring in a trauma victim’s body.
Sleep disorders are not mental illnesses or psychological diseases. My sleep disorder is inherited, caused by my narcoleptic grandfather’s polio.
The question you started with is brilliant! That’s exactly the point, isn’t it? Thanks for your insights.
Part of the problem with sleep disorders is that the term “sleep disorder” can cover a wide range of problems. Just like a cough can indicate very different underlying problems I expect the same thing is going on with sleep disorders. As you illustrate so clearly, it’s important to understand each person and the nature of their particular problem rather than making assumptions based on inaccurate theories.
Here in the end of the project is one proposal for a new evidence model made by Tomlin & Borgetto. http://lup.lub.lu.se/luur/download?func=downloadFile&recordOId=4092155&fileOId=4092156
Thanks for your comments. Alternatives are definitely needed. One of my favourite books was published way back in 1991. It’s called Casting Nets and Testing Specimens. It’s written by Phil Runkel. Phil puts forward the idea that our current reliance on statistics to answer basically all questions from a quantitative perspective is a problem. The “Casting Nets” part of the title refers to statistical approaches and Phil argues that, in order to understand how people function, we need to complement statistical approaches with a “Testing Specimens” attitude as well. From this perspective, a model building approach is adopted based on understanding individuals and building up common principles from the establishment of functional, rigorous models. This is the approach used by Perceptual Control Theory (www.pctweb.org). One of the interesting results from conducting research this way is the discovery that we need a different model of causality when we consider living things. The independent variable – dependent variable (IV-DV) methodology is a limited way of understanding behaviour. Much like the holistic paradigm you suggest, we need a model of circular rather than linear causality in order to understand accurately how living things function.
Thank you, a great article!!!
Thanks Carina! I’m really glad you liked it.
Great article Tim. And great comments too. Of course, there is also the problem of how to make individual decisions based on group results. Even if drug A is truly more effective than a placebo on average — the null is truly false — it’s not necessarily true that any particular individual taking drug A will do better than they would had they taken the placebo. I think an individual can only expect to do better with drug A if the variance in the drug A and placebo conditions is not the result of a subject by treatment interaction. That is, if the effect of drug A is the same for all subjects and the same is true for the placebo. An interaction exists if the effect of drug A is different for each subject, as is the effect of the placebo. I don’t think you can test to see whether or not there is such an interaction without doing a within subjects version of the RCT with each subject tested in both the drug and placebo conditions several times. And, of course, this is never done (to my knowledge) and it would be quite impractical. I usually tell my students that group data, like that from RCTs, is relevant to people, such as policy makers, who are trying to get the best results for groups. I really don’t know what people who deal with individuals — clinicians like you — should do with the results of RCTs. I guess I’d suggest ignoring it; what do you think?
Thanks for coming in on this – you’re spot on with your comments. Even when RCTs are very good, the direction of inference is from the sample to the population but, of course, this is exactly the opposite of the direction clinicians are interested in. Clinicians want to know how they can use the results of studies to inform their work with individuals. Unfortunately, statistical inference is currently silent on that topic, although conducting studies like the ones you suggest would be a start. Of course, that still wouldn’t answer the questions that are relevant in routine clinical practice, such as what’s happening for clients when they are on a gallimaufry of different drugs in different combinations.
Statistics are fantastic resources when they’re used appropriately but very blunt instruments when they’re being used for purposes for which they were not designed.
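The subject-by-treatment interaction raised above can be sketched with a toy example (the numbers are entirely made up, not from any trial): a drug can beat placebo on average — and so produce a positive RCT — even while a substantial minority of individuals would have done better on the placebo.

```python
# Hypothetical individual treatment effects: each number is (improvement on
# the drug) minus (improvement that same person would have had on placebo).
# Positive means the drug helped that individual; negative means the placebo
# would have served them better.
individual_effects = [8, 6, 5, 4, 3, 3, 2, -1, -3, -7]

mean_effect = sum(individual_effects) / len(individual_effects)
worse_on_drug = sum(1 for e in individual_effects if e < 0)

print(f"average effect: {mean_effect}")  # positive, so the trial 'works'
print(f"worse on drug: {worse_on_drug} of {len(individual_effects)}")
```

The group average is positive, yet three of the ten hypothetical individuals are harmed relative to placebo — and a between-subjects RCT, reporting only the average, cannot tell the clinician which individuals those are.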
Tim – Implicit emphasis on psychoactive action here, right? The inference to indicating the effects of some neosporin for that cut aren’t so dicey, are they? But cuts don’t talk themselves into staying clean or dirty, and don’t decide how well they’ve done for themselves which ever way they go.
Thanks for highlighting this. Yes, I think psychological distress is quite different from a physical cut or a broken bone. Most of our current emphasis seems to be on understanding and treating particular symptom patterns which are considered to be analogous to various physical maladies such as diabetes. My focus, however, is on the distress associated with any particular set of symptoms rather than the symptoms themselves. For any particular symptom or symptom pattern there are almost always people in the general population that experience similar things but are not bothered by them at all. From my perspective, it’s the botheration or distress that is the defining feature of what we now call mental health disorders and it’s this distress – closely associated with a person’s agency – that should be the focus of our research and treatment efforts.
Very cool, Tim. My spirit of engagement in interjecting the analogy was impersonal, you understand, my indication of how I am keeping account of the drift to your scientific and clinical approaches. My motive had to do with Rick’s suggestion, however, in that it leads me to consider saying, Yessirree, metaphysics is very “impractical”–if you get my meaning. You just could never finish RCTs on the ineffable, and if that’s a happy go lucky suggestion for saying you’ve got to test and prompt verbal responses without unnecessary introduction of the scientific viewpoint, hooray. Finding out someone’s allergic to medication or goes unconscious at once from the minimum dose are consequences amenable to statistical analysis generally. Do I want to get high like this pill gets me tomorrow? is not a thought appropriately conceived about the drugs in question.
You’re right, there are many questions that RCTs are not designed to answer and, of course, the other point you raise is that RCTs won’t help you identify the someone who is allergic to the medication because RCTs aren’t interested in ‘someones’, they’re only interested in group averages. At best, an allergic reaction might get included in a count of adverse events, but that would require that it was identified in the study and judged to be an effect of the medication. And, as you note at the end of your comment, there’s always the question of why a particular individual decides to take a particular drug. RCTs are silent on this matter as well.
Thank you for the follow up and rich focus on all essential matters of context, for how we get from here to there in harm reduction, predicting and assessing efficacy, and deciding why to care, Dr. Carey. To leave my interests fully disclosed, and for one reason since I’m not aiming to practice in the fields of the human sciences myself, let me explain how this article and your views matter for me in respect to the universe of discourse and issues of surviving psychiatry, personally. Very abbreviatedly, anyway, and ala shorthand in that I won’t post links. My intention is to get well-acquainted with Bennett and Hacker’s respective volumes on the history and philosophy of neuroscience and their chief recommendation to trained neurologists, which is the book titled Attention by (if memory serves) philosopher Allen White. My introduction to nonphysicalist approaches in philosophy of science is adequate to appreciate your research interests as you explain your own program, already, at least as far as that means wanting to see that science and industry remain independently signifying terms referring to separate real objects in these slow to evolve fields. I feel that the exciting developments you mention for modelling the intentional and reflexive structures of living organisms and persons in terms of natural world conditions for their autonomous functioning (or something like that, to learn and understand eventually), can then make a great deal more sense to me with these other books read. The plan to read Bennett and Hacker came first chronologically, but the vision of the overall project you refer to in psychology (etc.) seems probably as apt as any to me for helping us to establish a future cognitive neuroscience worth the name. We deserve the promising possibilities these researches are creating, and I want to understand what it takes to see the work through, for myself.
Thanks for explaining where you’re coming from.