More than a year on from the release of DSM-5, a Medscape survey found that just under half of clinicians had switched to using the new manual (Stetka & Ghaemi, 2014). Most non-users cited practical reasons, typically explaining that the health care system where they work had not yet changed over to the DSM-5. Many, however, said that concerns about the reliability of the DSM at least partially accounted for their non-use. Throughout the controversies that surrounded the development and launch of the DSM-5, reliability has been a contested issue: the APA has insisted that the DSM-5 is very reliable, while others have expressed doubts. Here I reconsider the issues: What is reliability? Does it matter? What did the DSM-5 field trials show?
The basic idea behind reliability tests is that the diagnosis a patient receives should depend on their symptoms, rather than on who does the diagnosing. Suppose I go and see a clinical social worker in the United States and am judged to have schizophrenia. If a reliable classification system is used, then it should enable, say, a psychiatrist in Kenya to arrive at the same diagnosis.
When the DSM-III was published in 1980, it was presented as solving the problem of ensuring diagnostic reliability (A.P.A., 1980, pp.467-472). The story told was that while in the dark days of psychoanalytic dominance a patient judged neurotic by one therapist might well appear psychotic or normal to another, with the employment of the DSM-III patients could expect to be given the same diagnosis by all clinicians. Proof of improvement was provided by a statistical measure, Cohen’s kappa, which quantifies the degree to which two clinicians agree on a diagnostic label beyond what chance alone would produce. As the DSM-III and its successors demonstrated “acceptable” values of kappa, the reliability problem was widely taken to have been solved.
But then, with the reliability tests in the field trials of DSM-5 diagnostic criteria, something odd happened. In reports of the DSM-5 field trials, results with kappas at values that for thirty-five years would have been judged “poor” or “unacceptable” suddenly became “good.” Commentators with long memories pointed out the inconsistency (for example, 1 Boring Old Man, 2012; Frances, 2012; Spitzer, Williams & Endicott, 2012; Vanheule et al., 2014).
What had happened? Is the reliability of the DSM-5 really no better than that of classifications fifty years ago? What is truly a good value for kappa? And how much does the reliability of psychiatric diagnosis matter anyway? Let’s go back and look at the debates in more detail to answer these questions.
The reliability of psychiatric diagnosis started to become a matter of some concern in the 1960s and 1970s, and a number of studies sought to investigate it. Comparing the results of the different studies was difficult, as they employed different statistics and it was unclear what level of agreement one might reasonably expect (for a review of the debates see Kirk & Kutchins, 1992). Those who produced these early studies were unsure what to make of their results, but Robert Spitzer, who would later chair the DSM-III task force, thought he knew how to understand the problem of reliability and, once he had demonstrated a “crisis”, how to fix it. The statistical measure Cohen’s kappa was key to Spitzer’s argument (Spitzer, Cohen, Fleiss & Endicott, 1967).
Cohen’s kappa provides a measure of agreement that seeks to take into account that some level of agreement could be expected by chance. Cohen’s kappa is defined as (po – pc) / (1 – pc), where po is the observed proportion of agreement and pc the proportion expected by chance. A value of 0 indicates chance agreement; 1 indicates perfect agreement. At this point many readers’ eyes will have glazed over. This glazing, Kirk and Kutchins (1992) point out in their history of the DSM-III, is important to understanding the evolution of debates about reliability in psychiatry. Cohen’s kappa is a statistical innovation, but its use complicated discussion of reliability to the point where lay people and average clinicians could no longer contribute. While everyone may have a view as to whether it seems acceptable that a patient judged schizophrenic by one clinician should have only a fifty per cent chance of being similarly diagnosed on a second opinion, who knows whether a kappa of 0.6 is acceptable?
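For readers who want to see the arithmetic worked through, here is a minimal sketch, in Python, of how Cohen’s kappa can be computed for two raters. The raters, patients, and diagnostic labels are invented purely for illustration; they are not drawn from any of the studies discussed here.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Return Cohen's kappa for two equal-length lists of category labels."""
    n = len(rater_a)
    # Observed proportion of agreement (po).
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance (pc): for each category, multiply the two
    # raters' marginal proportions, then sum over categories.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    categories = set(rater_a) | set(rater_b)
    p_c = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)
    return (p_o - p_c) / (1 - p_c)

# Ten hypothetical patients diagnosed by two hypothetical raters.
# The raters agree on seven of the ten cases.
rater_1 = ["schizophrenia", "depression", "depression", "anxiety", "schizophrenia",
           "depression", "anxiety", "anxiety", "schizophrenia", "depression"]
rater_2 = ["schizophrenia", "depression", "anxiety", "anxiety", "depression",
           "depression", "anxiety", "depression", "schizophrenia", "depression"]

print(cohens_kappa(rater_1, rater_2))  # observed agreement 0.7, chance agreement 0.35, kappa ~ 0.54
```

In this toy example the raters agree on seven of the ten patients, yet because a good deal of that agreement would be expected by chance, kappa comes out at roughly 0.54. If observed agreement falls below the level expected by chance, kappa is negative – a point worth bearing in mind when the DSM-5 field trial results are discussed below.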
Having introduced Cohen’s kappa to psychiatrists, Spitzer (with co-author Joseph Fleiss, 1974) used it to reanalyse the existing reliability studies and to argue that the agreement achieved by clinicians using DSM-I and DSM-II was unacceptable. In their meta-analysis, Spitzer and Fleiss judged a kappa of over 0.7 to be “only satisfactory” (a level achieved only by diagnoses of mental deficiency, organic brain syndrome, and alcoholism), and condemned as “poor” the kappas of less than 0.5 achieved by many of the diagnoses studied. They concluded that “The reliability of psychiatric diagnosis as it has been practised since at least the late 1950s is not good” (Spitzer & Fleiss, 1974, p.345). This judgment was echoed by later commentators. A key point for us is that in this paper Spitzer and Fleiss judged only values of Cohen’s kappa greater than 0.7 to be satisfactory. Where had this threshold come from? No reference for it is provided in the paper. Kappa had not previously been employed in psychiatry and no conventional values for an acceptable kappa had been established. Spitzer and Fleiss were free to pick a threshold at their discretion.
When Spitzer became the chairman of the task force that developed DSM-III he continued to be concerned about the reliability of diagnosis. The field trials for the DSM-III included reliability tests, in which a Cohen’s kappa of 0.7 continued to serve as the threshold for “good agreement” (A.P.A., 1980, p.468). For the most common diagnoses in adults – substance use disorders, schizophrenic disorders, and affective disorders – kappas of 0.8 or more were reported. Spitzer and his colleagues were pleased, and concluded that “For adult patients, the reliability for most of the classes … is quite good, and in general higher than that previously achieved with DSM-I and DSM-II” (A.P.A., 1980, p.468). Kirk and Kutchins provide a critique of the DSM-III field trials and a more modest assessment of their achievements. For us, however, the key point isn’t whether the DSM-III truly was reliable, but that it was claimed to be – with reported kappas of 0.7 or more taken to be the proof.
When it came to the DSM-5, however, the goal posts seemed to shift. Before the results were available, members of the DSM-5 task force declared that a kappa of over 0.8 would “be almost miraculous,” a kappa between 0.6 and 0.8 would be “cause for celebration,” values between 0.4 and 0.6 were a “realistic goal”, and those between 0.2 and 0.4 would be acceptable (Kraemer, Kupfer, Clarke, Narrow, & Regier, 2012a). Data from a motley assortment of other reliability studies in medicine were cited to support the claim that such thresholds would be reasonable. These benchmarks were much lower than those employed in the DSM-III trials, and many commentators viewed them as an attempt to soften up readers prior to the announcement of reliability results that, by historical standards, appeared shockingly poor (1 Boring Old Man, 2012; Frances, 2012; Spitzer, Williams & Endicott, 2012). Schizophrenia, which achieved a kappa of 0.81 in the DSM-III trial, had a kappa of 0.46 in the DSM-5 trial (Regier et al., 2013). Major affective disorders had a kappa of 0.8 with the DSM-III and 0.28 with the DSM-5. Mixed anxiety-depressive disorder achieved a negative kappa – meaning that in this case clinicians would have been better off putting their diagnostic criteria in the bin and simply guessing. Of the twenty diagnoses studied in the DSM-5 field trial only three obtained kappas of over 0.6. Although commentators found the DSM-5 reliability results distinctly unimpressive, the DSM-5 task force, applying its new thresholds for an acceptable kappa, looked at the same results and found that “most diagnoses adequately tested had good to very good reliability” (Regier et al., 2013, p.59).
What should one make of these field trials? Were the results appalling, or good? Why were lower kappa scores obtained in the DSM-5 trials than in the DSM-III trials? And what threshold should one adopt for “acceptable” reliability?
First, we can note that the methodology of the reliability studies had shifted, such that seeking to directly compare the DSM-III and DSM-5 studies is unfair. Many of the diagnoses studied in the DSM-5 field trial were new, and were generally pitched at a “finer-grained level of resolution” than the diagnoses studied in the DSM-III trial – for example, while the DSM-III study examined the reliability of “eating disorder”, the DSM-5 trial looked at “binge eating.” In the DSM-5 study, clinicians interviewed patients independently, at intervals that ranged from four hours to two weeks. In the DSM-III trial, clinicians either interviewed patients jointly (but recorded their diagnoses separately) or interviewed them separately but as close together in time as possible. Such differences might partly account for the differing results.
Now for the shifting thresholds: While many psychiatrists have become used to thinking of Spitzer’s threshold of 0.7 as the cut-off point for a “good” kappa, there are precedents for employing lower benchmarks in the statistical literature. Influentially, Landis and Koch (1977) count 0.21-0.4 as fair, 0.41-0.6 as moderate, 0.61-0.8 as substantial, and 0.81-1.0 as almost perfect. Altman (1991) condemns as poor only kappas of less than 0.2, and considers anything above 0.61 good. Fleiss, Levin and Cho Paik (2003) count kappas below 0.4 poor, those between 0.4 and 0.75 fair to good, and those above 0.75 excellent. Clearly there are no universally agreed standards for what counts as a “good” Cohen’s kappa.
In any case, I suggest that seeking some threshold for “acceptable” reliability to be applied across all contexts and all diagnoses is a mistake. Sometimes it is important for diagnosis to be very reliable; sometimes disagreements can be tolerated. In a research setting, it may matter a very great deal that the subject groups employed in different studies should be comparable; for research that depends on all subjects having the same disorder, high values of kappa should be sought. Sometimes the diagnosis that a patient receives is important because it makes a difference to the treatment that will be given.
In many contexts, however, exacting standards of reliability are not required. Suppose I am a marriage counsellor. My clients receive a DSM diagnosis which I place on their insurance forms, but I don’t prescribe drugs; all my clients, regardless of diagnosis, receive exactly the same sort of talk-based therapy. In such a context, what does it matter if I diagnose a client as having a major depressive disorder while my colleague would have diagnosed them with an anxiety disorder? Even in drug-based therapy the link between diagnosis and drug type may not be tight. Many psychoactive medications are approved for the treatment of broad swathes of disorders; in such cases, so long as a “wrong” diagnosis makes no difference to treatment, little harm will be done.
The importance of achieving reliability varies with the diagnosis in question and with the context of use. When it makes little difference whether a patient receives a particular diagnosis or one of those it is likely to be confused with, “acceptable” kappas may be quite low. When there is a real risk that unreliable diagnosis will lead to harm, standards must be higher – either a higher value of kappa should be demanded, or, if diagnostic criteria can’t themselves be made reliable, then mechanisms for dealing with uncertainty in practice may need to be employed (e.g., the routine use of second, or even third, opinions).
As we conclude, we are left with a puzzle: the point of the reliability tests was to demonstrate that the diagnostic criteria are reliable, but now that the results are in, it remains unclear whether the levels of reliability achieved are acceptable. This is because there are no generally accepted standards for what counts as reliable enough against which the DSM criteria can be judged.
With a different trial design, it might at least have been possible to show that progress had been made, and that the DSM-5 revisions produced criteria that could be applied more reliably than those in the DSM-IV. However, the shifts in methodology and statistics mean that the results of the DSM-5 field trial could never be directly compared with those of field trials for earlier DSMs. Changes in trial design were defended on the basis that methodology had improved; the DSM-III trials used the bad old ways, while the DSM-5 studies would use the new, good ways (e.g., Clarke et al., 2013; Kraemer, Kupfer, Clarke, Narrow, & Regier, 2012a). Fair enough. But then why was no head-to-head comparison of DSM-5 and DSM-IV criteria incorporated into the tests (a possibility discussed by Ledford, 2012)? The task force said head-to-head comparisons would make the trials too cumbersome, but in the absence of such tests, now that the DSM-5 field trial results are in, it is unclear whether or not the new system is more reliable than its predecessors.
* * * * *
1 Boring Old Man (2012). To take us seriously. Posted 22 May 2012. [Last accessed 28 August 2014].
Altman, D. (1991). Practical Statistics for Medical Research. London: Chapman and Hall.
American Psychiatric Association (1980). Diagnostic and Statistical Manual of Mental Disorders (3rd edition). Washington, DC: American Psychiatric Association.
Clarke, D., Narrow, W., Regier, D., Kuramoto, S., Kupfer, D., Kuhl, E., Greiner, L. & Kraemer, H. (2013). DSM-5 field trials in the United States and in Canada, Part I: Study design, sampling strategy, implementation, and analytic approaches. American Journal of Psychiatry, 170: 43-58.
Fleiss, J., Levin, B. & Cho Paik, M. (2003). Statistical Methods for Rates and Proportions (3rd edition). New York: John Wiley.
Frances, A. (2012). DSM-5 field trials discredit the American Psychiatric Association. Huffington Post Science: The Blog. Posted 31 October 2012. [Last accessed 28 August 2014].
Kirk, S. & Kutchins, H. (1992). The Selling of DSM: The Rhetoric of Science in Psychiatry. New York: Aldine de Gruyter.
Kraemer, H., Kupfer, D., Clarke, D., Narrow, W. & Regier, D. (2012a). DSM-5: How reliable is reliable enough? American Journal of Psychiatry, 169: 13-15.
Landis, J. & Koch, G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33: 159-174.
Ledford, H. (2012). DSM field trials inflame debate over psychiatric testing. Nature News Blog. Posted 5 November 2012. [Last accessed 28 August 2014].
Regier, D., Narrow, W., Clarke, D., Kraemer, H., Kuramoto, S., Kuhl, E. & Kupfer, D. (2013). DSM-5 field trials in the United States and Canada, Part II: Test-retest reliability of selected categorical diagnoses. American Journal of Psychiatry, 170: 59-70.
Spitzer, R., Cohen, J., Fleiss, J. & Endicott, J. (1967). Quantification of agreement in psychiatric diagnosis. Archives of General Psychiatry, 17: 83-87.
Spitzer, R. & Fleiss, J. (1974). A re-analysis of the reliability of psychiatric diagnosis. British Journal of Psychiatry, 125: 341-347.
Spitzer, R., Williams, J. & Endicott, J. (2012). Standards for DSM-5 reliability. Letters to the editor. American Journal of Psychiatry, 169: 537.
Stetka, B. & Ghaemi, N. (2014). DSM-5 a year later: Clinicians speak up. Medscape. [Last accessed 28 August 2014].
Vanheule, S., Desmet, M., Meganck, R., Inslegers, R., Willemsen, J., De Schryver, M. & Devisch, I. (2014). Reliability in psychiatric diagnosis with the DSM: Old wine in new barrels. Psychotherapy and Psychosomatics, 83: 313-314.
* * * * *
This is an edited extract from Diagnosing the Diagnostic and Statistical Manual of Mental Disorders, by Rachel Cooper (published by Karnac Books in 2014), and is reprinted with kind permission of Karnac Books.