In my 2017 e-book Schizophrenia and Genetics: The End of an Illusion (and in previous publications), I showed that the famous Danish-American schizophrenia adoption studies of the 1960s-1990s were environmentally confounded, methodologically flawed, and genetically biased to an extreme degree, and therefore provide no scientifically acceptable evidence in favor of genetic influences on schizophrenia.1 Combined with the questionable validity of the “schizophrenia” concept, the faulty assumptions underlying psychiatric twin studies, and the failure to identify causative genes, a thorough reevaluation of the “genetics of schizophrenia” debacle is long overdue.
While I was working on the book, cognitive neuroscientist Chris Chambers of Cardiff University published The Seven Deadly Sins of Psychology: A Manifesto for Reforming the Culture of Scientific Practice.2 In this book, Chambers pointed to several problem areas in the research/publication process in psychology and other fields. These include the “deadly sins” of “bias,” “hidden flexibility,” “unreliability,” “data hoarding,” “corruptibility” (fraud), and “bean counting” (funding and publication issues). Although many other problem areas in social and behavioral science research were not covered in this book, Chambers provided a valuable framework for describing biased, deceptive, and even fraudulent research practices.
Although Chambers focused on research in psychology, his message is clearly relevant to most other areas of research, including psychiatric and behavioral research, as well as drug safety and effectiveness trials. Contrary to popular belief, science is not immune to the corrupting influences of the society it operates in. On an individual level, research findings and conclusions are influenced by confirmation bias, which is the tendency for people to search for, interpret, favor, and recall information in a way that confirms their preexisting beliefs or theories.
Like researchers in other fields, psychiatric investigators claim statistically significant findings after producing results that fall below the conventional .05 level of statistical significance. A result below this level means that, if the null hypothesis were true, results at least as extreme as those observed would be expected to occur by chance less than 5% of the time. Larger sample sizes increase the likelihood that differences between groups will reach statistical significance; smaller samples have the opposite effect. In scientific research, a probability value (“p-value”) below the .05 threshold is the researchers’ make-or-break gold standard, enabling them to conclude that they found statistically significant results.
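To make the sample-size point concrete, here is a minimal sketch in Python using invented, randomly generated data (not data from any actual study): the same modest group difference can fall on either side of the .05 threshold depending only on how many subjects are tested.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
true_difference = 0.3  # a fixed, modest difference between group means

for n in (20, 200):
    # Draw two groups whose means differ by the same amount in both runs
    group_a = rng.normal(loc=0.0, scale=1.0, size=n)
    group_b = rng.normal(loc=true_difference, scale=1.0, size=n)
    t, p = stats.ttest_ind(group_a, group_b)
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"n = {n} per group: p = {p:.3f} ({verdict})")
```

With a small sample the identical underlying difference typically fails to reach significance; with the larger sample it usually does.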
In normal experiments based on the standard “hypothetico-deductive” (H-D) scientific method, researchers proceed in sequence: they generate hypotheses, design a study, collect data, analyze the data and test their hypotheses, interpret the results and determine statistical significance, and submit their findings for publication. In the process they perform “null hypothesis significance testing” (NHST). The “null hypothesis” is a default position which states that there is no difference between the specified populations under study, and that any observed differences are due to chance or to experimental error. In schizophrenia adoption research, for example, the null hypothesis states that there is no difference in schizophrenia diagnoses between the schizophrenia experimental group and the control group, meaning that genetic factors play no role in causing the condition. If researchers find group comparisons below the .05 threshold in the genetic direction, they reject the null hypothesis and conclude that hereditary factors are responsible for the group differences.
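As an illustration, here is a minimal NHST sketch in Python. The counts are hypothetical, invented for the example, and are not taken from the Danish-American studies:

```python
from scipy import stats

# Rows: index (experimental) group, control group
# Columns: diagnosed, not diagnosed -- all counts are hypothetical
table = [[9, 38],
         [2, 45]]

# Fisher's exact test of the null hypothesis that diagnosis rates
# do not differ between the two groups
odds_ratio, p = stats.fisher_exact(table)
if p < 0.05:
    print(f"p = {p:.3f}: reject the null hypothesis")
else:
    print(f"p = {p:.3f}: fail to reject the null hypothesis")
```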
Researchers are expected to formulate their hypotheses before they obtain their data. After they collect, review, and analyze the data, they determine whether their results point to the acceptance or rejection of these hypotheses. Although a “cardinal rule in experimental design” is “that any decision regarding the treatment of data must be made prior to an inspection of the data,” in behavioral research as currently practiced, compliance with this rule is difficult to verify.3
P-Hacking, HARKing, and Data Dredging
P-Hacking. P-hacking is the practice of consciously or unconsciously manipulating data to produce results that fall below the .05 level of statistical significance. Researchers have “degrees of freedom” that allow them the “hidden flexibility” to change various aspects of their study after reviewing the data, but before submitting their paper for publication and peer review.4 As Chambers defined it, p-hacking is “exploiting researcher degrees of freedom to generate statistical significance.” A “key feature” of researchers’ decisions “is that they are hidden and never published.”5 P-hacking occurs, as a group assessing its impact put it, “when researchers collect or select data or statistical analyses until nonsignificant results become significant.”6
Some ways that researchers can p-hack data include (1) conducting analyses midway through experiments to decide whether to continue collecting data (“peeking” at data), and stopping the collection of data if an analysis yields a statistically significant p-value; (2) recording many response variables and deciding which to report after the fact; (3) deciding after the fact whether to include or remove outliers; (4) excluding, combining, or splitting treatment groups after the fact; and (5) continuing to collect data past the planned stop point if significant comparisons are not found.7 Because social and behavioral science researchers have the hidden flexibility to change definitions and methods without anyone else knowing, as Chambers noted they are able to decide when to stop counting participants (subjects), and to redefine the condition or characteristic they are studying. This enables researchers to “navigate either deliberately or unconsciously in order to generate statistically significant effects.”8 Surveys suggest that “questionable research practices” are common in psychology, and occur in part because there are many built-in incentives and pressures in academic research to p-hack, but few safeguards in place to prevent it.
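The first strategy, “peeking” at data (optional stopping), can be demonstrated with a short simulation. This is a minimal sketch with randomly generated data: both groups are drawn from the same distribution, so the null hypothesis is true by construction, and an honest fixed-sample test would reach p < .05 only about 5% of the time.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments, n_start, n_max, step = 2000, 10, 100, 5

false_positives = 0
for _ in range(n_experiments):
    # Both groups are pure noise: any "significant" result is false
    a = list(rng.normal(size=n_start))
    b = list(rng.normal(size=n_start))
    while True:
        _, p = stats.ttest_ind(a, b)
        if p < 0.05:             # "significant" -- stop and report
            false_positives += 1
            break
        if len(a) >= n_max:      # maximum sample reached -- give up
            break
        a.extend(rng.normal(size=step))  # otherwise, collect more data
        b.extend(rng.normal(size=step))  # and test again

print(f"False-positive rate with peeking: {false_positives / n_experiments:.1%}")
```

With these settings the false-positive rate comes out well above the nominal 5%, even though no real difference exists.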
HARKing. The term “HARKing” was introduced by psychologist Norbert Kerr in 1998, and stands for “hypothesizing after the results are known.”9 Kerr defined HARKing “as presenting a post hoc hypothesis in the introduction of a research report as if it were an a priori hypothesis.”10 In other words, HARKing occurs when researchers, after inspecting their data, create a new hypothesis which they claim or imply was formulated beforehand. In Chambers’ words, “HARKing is a form of academic deception in which the experimental hypothesis (H1) of a study is altered after analyzing the data in order to pretend that the authors predicted results that, in reality, were unexpected.” This method produces the “clean and confirmatory papers that psychology journals prefer while also maintaining the illusion that the research is hypothesis driven and thus consistent with the H-D method.” Chambers concluded that “deliberate HARKing . . . lie[s] on the same continuum of malpractice as research fraud.”11 Again, there are few safeguards in place to prevent HARKing. The peer-review process in science, which usually takes place after a paper is submitted for publication, is not equipped to detect HARKing or p-hacking, even if peer reviewers wish to do so.
Data Dredging. Another unsound research practice is “data dredging” (also known as a “fishing expedition”), which involves investigators searching through data in an attempt to find statistically significant trends or differences, without testing a prior hypothesis. Identifying correlations and potential factors can be useful to help arrive at a hypothesis, but that hypothesis must then be tested on a different set of data. As the authors of a medical textbook emphasized, a hypothesis cannot be developed and tested in the same study. If this happens, data dredging has occurred:
“The scientific process requires that hypothesis development and hypothesis testing be based on different data sets. One data set is used to develop the hypothesis or model, which is used to make predictions, which are then tested on a new data set.”12 [italics in original]
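The quoted rule can be illustrated with a minimal sketch using pure-noise data invented for the example: dredging 100 null predictors for the strongest correlation usually turns up a “significant” result, which then fails when the dredged hypothesis is tested on an independent sample.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_subjects, n_predictors = 50, 100

# Step 1: dredge. Outcome and all predictors are pure noise, so any
# "finding" here is a false positive by construction.
outcome = rng.normal(size=n_subjects)
predictors = rng.normal(size=(n_subjects, n_predictors))
p_values = [stats.pearsonr(predictors[:, j], outcome)[1]
            for j in range(n_predictors)]
best = int(np.argmin(p_values))
print(f"Dredged predictor #{best}: p = {p_values[best]:.4f}")

# Step 2: honest test of the dredged hypothesis on an independent
# sample. Everything is still noise, so the "effect" should vanish.
new_outcome = rng.normal(size=n_subjects)
new_predictor = rng.normal(size=n_subjects)
r, p = stats.pearsonr(new_predictor, new_outcome)
print(f"Same hypothesis, new data: p = {p:.4f}")
```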
Data dredging is related to the “Texas sharpshooter’s fallacy,” which describes a sharpshooter who fires his gun at the side of a barn, and later draws targets around a cluster of bullet holes. Although people viewing the barn might think that he hit his targets, the sharpshooter drew these targets after he fired his gun. According to Wikipedia, this “fallacy is characterized by a lack of a specific hypothesis prior to the gathering of data, or the formulation of a hypothesis only after data have already been gathered and examined.” It is a fallacy in part because, in a large dataset based on multiple comparisons, we would expect to find statistically significant correlations by chance alone.
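That expectation is easy to check with a minimal sketch on simulated null data: run 100 independent comparisons where no real difference exists, and roughly five will reach p < .05 by chance alone.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_tests, n_per_group = 100, 30

hits = 0
for _ in range(n_tests):
    a = rng.normal(size=n_per_group)  # both groups come from the same
    b = rng.normal(size=n_per_group)  # distribution: no real difference
    _, p = stats.ttest_ind(a, b)
    hits += p < 0.05

print(f"{hits} of {n_tests} null comparisons reached p < .05 by chance")
```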
Data dredging is a form of p-hacking, but it differs in that researchers can select statistically significant results or comparisons after the fact without manipulating their data. Data dredging also differs from HARKing because, although researchers are pointing to comparisons that they did not plan to make or highlight, they are not necessarily claiming that they are testing a prior hypothesis.
The Urgent Need for the Preregistration of Research in the Social and Behavioral Sciences
P-hacking, HARKing, and data dredging are methods that some researchers use to achieve statistically significant results even though the null hypothesis may in fact be true, thereby misleading science and the public. There are a number of possible motivations for doing this. Scientific researchers are under pressure to produce statistically significant findings in order to get their studies published in prestigious journals, which might tempt them to use their “degrees of freedom” to produce results that these journals will publish. Other possible motivations include financial gain, the desire for career advancement and prestige, the need to obtain research funding (grants), the wish to defend their field against critics, helping the companies they work for increase profits, and ideological commitments. Genetic (biological) determinism is an ideology, although its adherents usually deny this and claim that their beliefs are based on nothing more than objective scientific evidence.
Building on calls by previous authors going back to the 1960s, including my own 2000 proposal co-authored with the late psychologist Steve Baldwin, Chambers called for the establishment of psychology research “preregistration,” where investigators would be required to submit an introduction and their proposed methods, definitions, and analyses before they collect their data.13 As Chambers described it:
“The essence of preregistration is that the study rationale, hypotheses, experimental methods, and analysis plan are stated publicly in advance of collecting data. . . . Since authors will have stated their hypotheses in advance, preregistration prevents HARKing and ensures adherence to the H-D [normal] model of the scientific method. . . . Preregistration also prevents researchers from cherry-picking results that they believe generate a desirable narrative.”14
The preregistration of research would greatly reduce p-hacking, HARKing, data dredging, and other deceptive methods. Fortunately, a movement is now underway to make preregistration the norm in the social and behavioral sciences. Although “we may never be able to eliminate bias altogether from human nature,” Chambers wrote, a “sure way to immunize ourselves against its consequences . . . is peer-reviewed study preregistration.”15 And yet, it is likely that people and institutions with a vested interest in maintaining the current system will oppose research preregistration and, if it is implemented, might attempt to work around it.
In Schizophrenia and Genetics I reviewed the widely cited Danish-American adoption studies in depth, and examined how the researchers arrived at their conclusions. In addition to the major problems found in psychiatric adoption studies in general,16 I pointed to several instances where the Danish-American researchers clearly or likely resorted to p-hacking, HARKing, or data dredging in order to arrive at conclusions in favor of genetics. When false results produced by p-hacked research have social, scientific, and political importance, and affect or harm the lives of millions of people while entire fields look on, it constitutes a scientific scandal.
* * *
My 20 years of analyzing genetic research in the social and behavioral sciences lead me to conclude that the practices described by Chambers and others are common, and have contributed to the acceptance of false conclusions about genetic influences on psychiatric disorders and other behavioral characteristics (IQ, personality, criminality, and so on). These practices may also be occurring in psychiatric drug trials. Chris Chambers has performed a valuable service to science and society by helping us better understand, explain, uncover, and reduce biased, deceptive, and fraudulent methods in scientific research. The Seven Deadly Sins of Psychology is a must-read for consumers of scientific research, and also for current and future debunkers of pseudoscience.