R. A. Fisher was the famous mathematician and statistician who introduced many of the concepts and procedures of modern statistics. Two of Fisher’s innovations were the ideas of randomization and statistical significance (Healy, 2012). These tactics helped Fisher solve the problem of how to tell whether different fertilizers affected crop yield. Essentially, Fisher’s problem was this: if Fertilizer A and Fertilizer B were used on two different fields of wheat, and more grain came from the field using Fertilizer A, how could he be sure that it was actually the fertilizer that produced the difference and not something else that he hadn’t taken into account?
Fisher’s ingenious idea was to randomize the allocation of fertilizer to field. So, if he had 10 fields, he would randomly allocate Fertilizer A and Fertilizer B between them. With this procedure, any differences between the fields would not systematically influence the crop yield associated with either fertilizer.
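To make the allocation step concrete, here is a minimal sketch. It assumes 10 fields numbered 1 to 10 and an even five/five split between the two fertilizers; both the numbering and the even split are illustrative assumptions, and the function name is hypothetical. The point is only that chance, not the experimenter, decides which field gets which fertilizer.

```python
import random

def allocate_fertilizers(field_ids, seed=None):
    """Randomly assign half of the fields to Fertilizer A and the rest to B.

    Hypothetical helper for illustration: shuffling removes any systematic
    link between a field's characteristics and the fertilizer it receives.
    """
    rng = random.Random(seed)      # seeded generator so a run can be reproduced
    shuffled = list(field_ids)
    rng.shuffle(shuffled)          # put the fields in a random order
    half = len(shuffled) // 2
    return {"A": sorted(shuffled[:half]), "B": sorted(shuffled[half:])}

allocation = allocate_fertilizers(range(1, 11), seed=1)
print(allocation)
```

Running it with a different seed (or none) gives a different, equally fair allocation, which is exactly what stops a rich corner of the farm from always ending up with the same fertilizer.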
Also, Fisher was not prepared to accept the results of just one experiment (Healy, 2012). He reasoned that if the same experiment were repeated 20 times and, on at least 19 of those occasions, the field fed with Fertilizer A produced more grain than the field receiving Fertilizer B, then it would be reasonable to conclude that Fertilizer A was indeed a better fertilizer than Fertilizer B in terms of crop yield.
The idea of having a standard for a particular result being repeated a specified number of times forms the basis of statistical significance. Testing for statistical significance in this fertilizer example begins with the assumption that there is no difference in impact on crop yield between the two fertilizing agents. Suppose in one experiment there is a difference of 15 units of crop yield between Fertilizer A and Fertilizer B. Should we conclude that Fertilizer A is better? Or could it be that the soil in the field on which Fertilizer A was sprinkled was already richer than the soil in the field on which Fertilizer B was scattered? If it is unlikely that a difference this large would occur just by luck, chance, or nature’s whim, then we conclude that it is a statistically significant result indicating that there is an actual difference between the fertilizers.
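One way to make "unlikely to occur by chance" concrete is a randomization (permutation) test, an approach usually credited to Fisher. The sketch below is illustrative only: the yields for 10 fields are invented so that the observed difference is 15 units, and the function name is hypothetical. Starting from the assumption of no difference, it reshuffles the fertilizer labels in every possible way and asks how often a difference at least as large as the observed one turns up.

```python
from itertools import combinations
from statistics import mean

def permutation_p_value(yields_a, yields_b):
    """Exact two-sided randomization test for a difference in mean yield.

    Under the assumption of no fertilizer effect, every way of labelling the
    fields as A or B was equally likely, so we enumerate them all.
    """
    pooled = yields_a + yields_b
    n_a = len(yields_a)
    observed = abs(mean(yields_a) - mean(yields_b))
    extreme = 0
    total = 0
    for idx in combinations(range(len(pooled)), n_a):
        chosen = set(idx)
        group_a = [pooled[i] for i in chosen]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Count relabellings at least as extreme as what was actually observed
        if abs(mean(group_a) - mean(group_b)) >= observed - 1e-9:
            extreme += 1
    return observed, extreme / total

# Hypothetical yields: five fields per fertilizer, means of 63 and 48
fertilizer_a = [60, 62, 65, 58, 70]
fertilizer_b = [44, 47, 50, 52, 47]
diff, p = permutation_p_value(fertilizer_a, fertilizer_b)
print(f"observed difference = {diff}, p = {p:.4f}")
```

With these made-up numbers only 2 of the 252 possible labellings produce a difference of 15 or more, so p is about 0.008, well under the conventional 0.05. With messier real data, the same procedure can just as easily show that a 15-unit gap is nothing unusual.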
Randomization and statistical significance are two of the fundamental ingredients of the research methodology known as the Randomized Controlled Trial (RCT), which is widely regarded as a “gold standard” of evidence (Healy, 2012). From this perspective, the results obtained through an RCT are considered to be more noteworthy and more dependable than results obtained by other means. It is now increasingly recognised, however, that the idea of a hierarchy of evidence is “fundamentally wrong” (Jadad & Enkin, 2007, p. 106). The “best evidence” is obtained in any particular situation by matching an appropriate methodology to a well-articulated and meaningful research question.
An RCT is simply a research tool and, as a tool, it can be used in a variety of ways. Unfortunately, the idea of a hierarchy of evidence seems to be hypnotically seductive for many people and powerfully useful for the drug companies. To get a drug to market, regulators in the US, such as the Food and Drug Administration (FDA), and in Europe require drug companies to produce only two RCTs with statistically significant positive results (Healy, 2012). Perhaps this very low standard has contributed to the fact that RCTs can be much more useful as marketing tools for drug companies than for discovering new and useful ways for people to live healthy and meaningful lives.
Why should two RCTs with statistically significant positive results be considered a low standard? Partly because of the nature of statistical significance and partly because a standard such as this says nothing about how many RCTs might have shown a negative result.
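A rough calculation shows why ignoring the negative trials matters. The sketch below assumes, purely for illustration, a drug with no real effect, independent trials, and a two-sided test at 0.05, so each trial has a 0.025 chance of a "significant" result in the drug's favour. It asks how likely it is that at least two such positive trials emerge if enough trials are run. The function name is hypothetical.

```python
def prob_at_least_two_positive(num_trials, alpha=0.05):
    """Chance that >= 2 of num_trials independent trials of an ineffective drug
    come out significantly positive (two-sided test, so alpha/2 per direction)."""
    q = alpha / 2                                   # chance of a false "win" per trial
    none = (1 - q) ** num_trials                    # binomial P(0 positives)
    exactly_one = num_trials * q * (1 - q) ** (num_trials - 1)  # binomial P(1 positive)
    return 1 - none - exactly_one

for k in (2, 10, 20, 40):
    print(k, round(prob_at_least_two_positive(k), 3))
```

Under these assumptions, a company that runs 20 trials of a useless drug has roughly a 9% chance of collecting its two "positive" studies, and with 40 trials the chance rises to roughly 26%, while the standard itself is silent about the other 18 or 38.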
Gotzsche (2013) illustrates just how fickle statistical significance can be. He describes an example in which 200 people received an active drug and were compared with 200 people who received a placebo. If 121 people in the drug group improved but only 100 people in the placebo group improved, the probability of obtaining a difference of 21 or greater, if the drug and the placebo really had the same effect, is 0.04. In this case, then, the researchers could claim that this is a statistically significant result because the probability value is less than the conventional standard of 0.05.
If the numbers were only slightly different, however, the opposite result would be obtained (Gotzsche, 2013). If 119 people in the drug group improved compared with 100 people in the placebo group, the probability of a difference this large is now 0.07, so it would not be considered statistically significant because it is greater than 0.05. A difference of only two people out of 400, therefore, can be the difference between a statistically significant result and a non-significant result.
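The text does not say which statistical test produced Gotzsche's figures. As a hedged sketch, a chi-squared test with Yates' continuity correction on the 2 x 2 table (improved or not, drug or placebo) gives values that round to the same 0.04 and 0.07. The function name is hypothetical, and the choice of test is our assumption.

```python
import math

def yates_chi2_p(improved_drug, n_drug, improved_placebo, n_placebo):
    """Two-sided p-value for a 2x2 table using chi-squared with Yates' correction (1 df)."""
    a, b = improved_drug, n_drug - improved_drug            # drug: improved / not improved
    c, d = improved_placebo, n_placebo - improved_placebo   # placebo: improved / not improved
    n = n_drug + n_placebo
    numerator = n * max(0.0, abs(a * d - b * c) - n / 2) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    chi2 = numerator / denominator
    # With 1 degree of freedom, P(chi2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2))
    return math.erfc(math.sqrt(chi2 / 2))

print(round(yates_chi2_p(121, 200, 100, 200), 3))  # 21 more improvers on the drug
print(round(yates_chi2_p(119, 200, 100, 200), 3))  # 19 more improvers on the drug
```

The two calls give roughly 0.044 and 0.071. Other reasonable tests shift the exact numbers a little (an uncorrected test gives about 0.035 and 0.056), but the same two-person swing still straddles the 0.05 line.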
In an experiment, the size of the effect, the number of participants, and statistical significance are all related: for an effect of a given size, the more participants there are, the smaller the probability value becomes. This means that as long as the effect is not zero, increasing the number of participants in a study will virtually guarantee that statistical significance is achieved. It may be this particular relationship between treatment effect, sample size, and statistical significance that led Healy (2012) to conclude that “the greater the number of people needed in a trial, the more closely the treatment resembles snake oil – which contains omega-3 fatty acids and can be shown in controlled trials to have benefits if sufficiently large numbers of people are recruited” (p. 68).
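The sketch below illustrates that relationship with invented numbers: the drug is assumed to help 52% of people and the placebo 50%, and only the size of the trial changes. It uses a standard pooled two-proportion z-test without continuity correction; the function name and the rates are hypothetical.

```python
import math

def two_proportion_p_value(rate_drug, rate_placebo, n_per_group):
    """Two-sided p-value from a pooled two-proportion z-test (equal group sizes)."""
    pooled = (rate_drug + rate_placebo) / 2          # equal groups, so a simple average
    se = math.sqrt(pooled * (1 - pooled) * 2 / n_per_group)
    z = (rate_drug - rate_placebo) / se
    return math.erfc(abs(z) / math.sqrt(2))          # two-sided tail of the normal

# The same tiny 2-percentage-point difference at three trial sizes
for n in (200, 2_000, 20_000):
    print(n, round(two_proportion_p_value(0.52, 0.50, n), 5))
```

The effect never changes, yet the p-value falls from about 0.69 to about 0.21 to well under 0.001. Recruit enough people and a 2-point difference becomes "significant", which is the sense in which a very large trial can make snake oil look effective.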
The standard of requiring only two positive results also allows drug companies to mask adverse outcomes: a difference on a rating scale may be statistically significant while a difference in the number of deaths or other serious adverse events is not (Healy, 2012). While it is likely to be relatively straightforward to tell whether one field of wheat produces more grain than another field of wheat, it can be much more ambiguous to decide whether one group of people is healthier than another group of people after receiving some treatment for their psychological turmoil.
Jachuck and colleagues (1982), for example, investigated the effect of stabilising blood pressure through drug therapy on the quality of life of a group of 75 patients. They asked each patient, the patient’s physician, and a relative or close companion of the patient about the effects of the drugs on the patient’s quality of life. The physicians rated all 75 patients as having an improved quality of life. The relatives rated 74 patients as having a worse quality of life and only one as improved. According to the patients, 36 had an improved quality of life, 7 had a worse quality of life, and 32 reported no change in their quality of life.
This type of variability in reported outcomes provides a lot of “wiggle room” for people such as those working in the marketing departments of drug companies. A compelling example of just how wiggly the evidence base can be is provided by Turner and colleagues (2008).
These researchers investigated 74 FDA-registered RCTs of 12 antidepressant agents. Of the 74 studies, the FDA determined that 38 had a positive result (Turner et al., 2008). Of these 74 studies, however, only 51 were published. All except one of the studies showing a positive result were published, three studies showing a negative result were published as negative, and another 11 were published as positive results even though this was at odds with the conclusions the FDA made about these studies. So, whereas the published evidence base shows 94% (48 out of 51) of these studies having a positive benefit for antidepressants, only 51% (38 out of 74) of the actual research results showed a positive benefit (Turner et al., 2008).
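The two percentages follow from simple counting. The sketch below rebuilds them from the figures given in the paragraph above (Turner et al., 2008); the variable names are ours, and it takes the 11 "spun" trials to be among those the FDA judged negative, as the paragraph states.

```python
# Counts as given in the text (Turner et al., 2008); variable names are illustrative
total_trials = 74
fda_positive = 38
fda_negative = total_trials - fda_positive              # 36 trials judged negative by the FDA

published_positive_as_positive = fda_positive - 1       # all but one positive trial published
published_negative_as_negative = 3
published_negative_as_positive = 11                     # negative per FDA, published as positive

published_total = (published_positive_as_positive
                   + published_negative_as_negative
                   + published_negative_as_positive)
published_looks_positive = published_positive_as_positive + published_negative_as_positive
unpublished_negative = fda_negative - published_negative_as_negative - published_negative_as_positive

print(f"Published: {published_total} of {total_trials}")
print(f"Positive in the published literature: {published_looks_positive}/{published_total}"
      f" = {published_looks_positive / published_total:.0%}")
print(f"Positive according to the FDA: {fda_positive}/{total_trials}"
      f" = {fda_positive / total_trials:.0%}")
print(f"Negative trials never published: {unpublished_negative}")
```

The arithmetic makes the mechanism visible: of the 36 trials the FDA judged negative, only 3 reached print as negative, so the published record shows 94% positive results against 51% in the full set.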
Although Jadad and Enkin (2007) consider it unethical to conduct RCTs primarily for commercial interests, it appears that this is precisely what drug companies are currently doing with regard to drug treatments for psychological unrest. Whereas once the research and development budgets of drug companies were larger than their marketing budgets, the situation is now reversed (Healy, 2012). Healy (2012) points out that the aim of drug companies is to get their drugs to market in order to generate profits for their shareholders. The drug companies have been so successful in doing this that, in 2002, the combined profits of the 10 drug companies in the Fortune 500 exceeded the profits of all the other 490 companies put together (Gotzsche, 2013).
The RCT has become a powerful device for helping drug companies achieve their aim of maximising profits. Virtually everything we know about drugs comes from what the drug companies tell us (Gotzsche, 2013). To say that a drug “works”, for example, simply means that a drug company has been able to produce two studies that showed statistically significant differences on the average scores of rating scales or blood tests between an active drug group and a comparison placebo group (Healy, 2012).
It is not in a company’s best interest to design drugs to cure health problems. Companies will generate more profits for longer if they can market drugs as being necessary to take for long periods of time, perhaps even for the rest of a person’s life. The situation now has a “through the looking glass” quality to it. Currently the pills people take are saving the lives of the drug companies that produce them (Healy, 2012) rather than correcting a deficient supply of some well-being chemical in our brains.
Fundamentally, even when a chemical does help a person feel better, we have literally no scientific understanding of why or how that has occurred (Healy, 2012). RCTs primarily produce associations between drugs and rating scales (Healy, 2012). We remain clueless, however, as to why any particular association might exist. If we came to see RCTs as identifying relationships that need to be explained, we might be just as interested in the studies that did not produce the desired associations as in those that did.
Will we look back on this period of our pharmacological treatment of psychological distress in the same way that we now think about the Thalidomide era? We need much better information than we currently have about the drugs we are continually pressured to ingest. Resources such as this website (www.madinamerica.com) and David Healy’s www.rxisk.org are helping to turn the tide.
People are not fields of wheat. To shift our research attention from fields of wheat to fields of dreams we need different methods and different understandings. If we are to help more people sow their own fields of dreams, and to harvest the benefits of all that a mind unrestrained by damaging drugs is capable of producing, we need a fundamental change in our approach. We need to break the spell of the omnipotence of RCTs and use different methodologies to thoroughly understand the nature of psychological torment and how it is resolved.
Medication needs to become an ancillary or supplementary aspect of treatment if it is used at all. People need to be understood as active agents who are somehow being thwarted in their attempts to live lives of meaning and value. To offer help that will be experienced as helpful, clinicians and researchers must focus on understanding the process of living as it is lived, not as it is observed, and do all that they can to assist and support this process rather than impeding or retarding it. In this endeavour the voice of the person being helped will be a central and guiding factor.
* * * * *
Carey, T. A. (2015). Some problems with randomized controlled trials and some viable alternatives. Clinical Psychology & Psychotherapy. doi:10.1002/cpp.1942
Gotzsche, P. C. (2013). Deadly medicines and organised crime: How big pharma has corrupted healthcare. London: Radcliffe Publishing.
Healy, D. (2012). Pharmageddon. Berkeley, CA: University of California Press.
Jachuck, S. J., Brierley, H., Jachuck, S., & Willcox, P. M. (1982). The effect of hypotensive drugs on quality of life. Journal of the Royal College of General Practitioners, 32, 103-105.
Jadad, A. R., & Enkin, M. W. (2007). Randomized controlled trials: Questions, answers and musings (2nd ed.). Malden, MA: Blackwell Publishing.
Turner, E. H., Matthews, A. M., Linardatos, E., Tell, R. A., & Rosenthal, R. (2008). Selective publication of antidepressant trials and its influence on apparent efficacy. The New England Journal of Medicine, 358, 252-260.