Screening Instruments Do Not Reflect Individual Experiences of Depression

Researchers detect discrepancies between the language used to describe lived experiences of mental health and the language used in modern screening tools.

Hannah Emerson

A new study investigates the relationship between how people discuss their mental health and the language used to describe, label, and categorize it. The researchers use a mixed-methods analysis of 698 interviews on emotional health, alongside a review of depression screening instruments, to explore the disconnect between lived experiences of depression and computational measures of it.

“Categories for mental health risk being so articulated and abstracted that they lose touch with the diversity of illness experiences,” write the authors, Arseniev-Koehler, Mozgai, and Scherer.

“This paper re-examines the detection of depression from language and revisits old and current debates in mental health classification. Along the way, we highlight strengths and weaknesses of modeling approaches and propose several strategies for more reflexive modeling.”


While the prospect of valid tools to detect mental health disorders has inspired a vast amount of research over the years, this study calls attention to the discrepancy between these tools and descriptions of individuals’ personal experiences.

Arseniev-Koehler and colleagues explain that nearly a century of research has produced modern screening instruments for detecting depression, but they note, “particularly in the realm of mental health, we can’t take labels at face-value.” They argue that mental health labels are too often understood as “objective truth,” but “unlike a ‘broken bone,’ or a ‘sprained wrist,’ mental health is a gray area. Mental health is largely defined by our conceptions of what is ‘normal’ and what is ‘disordered’— conceptions which can change across culture and time.” Thus, while psychiatric diagnoses are designed to maximize reliability, they remain weak on validity.

The authors reviewed efforts to detect depression from written text and transcribed speech by examining peer-reviewed research that identifies and predicts depression from text data. Additional quantitative and qualitative evidence is drawn from 698 interviews in the Distress Analysis Interview Corpus (DAIC), obtained from two populations living in Los Angeles: the general public and veterans of the U.S. armed forces. The interviews are conducted by Ellie, a virtual interviewer, in a way that simulates a mental health screening. The 8-item version of the Patient Health Questionnaire (PHQ-8) is included in these interviews.

For their qualitative data, the authors open-coded a subset of interviews in which participants talked about mental health and emotions, then searched for a lexicon of terms relevant to depression (e.g., depressed, sad, blue, happy, content), and finally open-coded the interview sections containing this lexicon and compared them to the interviewees’ PHQ-8 scores.

Participants in this study averaged a PHQ-8 score of 6 (a score of 10 or higher is considered indicative of depression), and 25% scored as currently meeting criteria for depression according to the scale. Additionally, individuals with higher PHQ-8 scores used more words expressing negative emotions, as well as more first-person singular pronouns (“I”) rather than first-person plural pronouns (“we”), in accordance with extant research on individuals experiencing depression.
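The kind of word counting described above can be sketched in a few lines. This is an illustrative example, not the authors’ code: the word lists below are tiny hypothetical stand-ins for a full psycholinguistic lexicon, and the tokenization is deliberately simple.

```python
# Illustrative sketch: counting first-person-singular pronouns and
# negative-emotion words in a transcript. The word sets are hypothetical
# stand-ins for a full lexicon; real studies use validated word lists.

FIRST_PERSON_SINGULAR = {"i", "me", "my", "mine", "myself"}
NEGATIVE_EMOTION = {"sad", "depressed", "hopeless", "tired", "lonely", "blue"}

def linguistic_features(transcript: str) -> dict:
    """Return simple per-transcript counts of the kind used in depression-language studies."""
    tokens = [t.strip(".,!?;:'\"").lower() for t in transcript.split()]
    return {
        "n_tokens": len(tokens),
        "first_person_singular": sum(t in FIRST_PERSON_SINGULAR for t in tokens),
        "negative_emotion": sum(t in NEGATIVE_EMOTION for t in tokens),
    }

features = linguistic_features("I feel sad and tired. I think my life is hard.")
# features counts 3 first-person-singular tokens and 2 negative-emotion words
```

Counts like these are typically normalized by transcript length before being compared across interviewees.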

The qualitative and quantitative data for this study were often “mismatched.” For example, some participants rated as low risk of depression according to the PHQ described struggling with depression in their interviews. Arseniev-Koehler admits that what we label as depression remains “enigmatic in medicine and psychiatry.” The researchers write:

“In modern psychiatry, diagnoses are descriptive, co-occurring clusters of symptoms. They do not reference underlying mechanisms or causes, and categories provide little information on treatment responses.”

The data also points to inconsistent understandings of terms like depression, happiness, contentment, and other states of mood. For example, one participant asked for clarification when asked the last time they were happy, inquiring, “What type of happiness are you looking for?” Another stated that they are seeking contentment rather than happiness, an attempt to put into words the meaning of personal experiences and feelings.

Concerning the PHQ and other self-report diagnostic scales, the researchers write, “implicitly, these scales are proxies for psychiatric ratings from structured interviews. Of course, self-report diagnostic scales are an imperfect proxy.” An algorithm used to predict PHQ scores from language “likely has a wide margin for errors for detecting depression when compared to a mental health professional rather than the proxy measure on which it is trained.”

The authors suggest the following modifications to current diagnostic approaches:

  • Underlying models should depict depression as continuous and more dimensional, including duration of the depressive episode, depression history, and level of impairment to livelihood.
  • The focus should be on detecting symptoms of depression rather than detecting depression itself.
  • In developing models, it may be necessary to accept low specificity (the proportion of those without depression who are correctly detected as not having it) and low precision in order to achieve greater sensitivity (accurately identifying those who do have depression).
  • For valid constructs of mental health, incorporate multiple clinicians’ ratings, along with other clinical/non-clinical measures.
  • Consider how to develop predictive models that include “the uncertainty in our understanding of depression and other cultural idioms of distress.”
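The trade-off named in the list above can be made concrete with a small calculation. The counts below are invented for illustration; they are not from the study.

```python
# Illustrative sketch: sensitivity, specificity, and precision from a
# confusion matrix. The counts are hypothetical, chosen to show how a
# screener tuned for high sensitivity can have low specificity and precision.

def screening_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    return {
        "sensitivity": tp / (tp + fn),  # of those with depression, how many are flagged
        "specificity": tn / (tn + fp),  # of those without, how many are correctly cleared
        "precision": tp / (tp + fp),    # of positive screens, how many are true cases
    }

# Hypothetical screener over 100 people (25 with depression, 75 without):
# it catches 24 of 25 cases but also flags 30 of the 75 non-cases.
m = screening_metrics(tp=24, fp=30, tn=45, fn=1)
# sensitivity = 0.96, specificity = 0.60, precision ≈ 0.44
```

Here 96% of true cases are caught, but over half of the positive screens are false alarms, which is the kind of trade-off the authors suggest may be acceptable for a screening (rather than diagnostic) tool.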

Ultimately, this study joins others in urging researchers and clinicians to engage reflexively with the culturally constructed labels used in mental health. Arseniev-Koehler and colleagues conclude:

“While research in this area has recently focused on the production of high-performing models, it seems likely that literature will soon reach saturation in the number of published models. Now, models will need to be reflexively tuned, borrowing additional insights from areas such as medicine and social sciences.”



Arseniev-Koehler, A., Mozgai, S., & Scherer, S. (2018). What type of happiness are you looking for? – A closer look at detecting mental health from language. Proceedings of the Fifth Workshop on Computational Linguistics and Clinical Psychology: From Keyboard to Clinic (pp. 1–12). New Orleans, LA.


  1. Another sad example of how these professionals assume they have power to diagnose and determine some are mentally inferior. Note the language used in the quoted passages, the use of “we” when describing those who diagnose.

    “We,” therefore, does not refer to the entire human population, but a group of elites who claim they know better than the rest of humanity. Apparently they hold so much power that they are the authorities on who is suffering and who isn’t. Baloney.

    • Apparently, these professionals didn’t bother themselves with the 50-year-old Hoffer/Osmond (HOD) Diagnostic, which is quantitative and inquires directly about experiences. Maybe this is because it’s inconsistent with initial diagnoses, but more consistent with final diagnoses than the initial diagnoses are, which is humiliating to the professional diagnostician, who prides himself on his diagnostic skill. That, and the fact that both Hoffer and Osmond were advocates of megavitamin B3 as a primary treatment for schizophrenia in lieu of psychiatric drugs, puts their test into the world of crystal gazing and evil sorcery, instead of modern organized psychiatry.

    • It seems the proper conclusion would be, “We (the professionals) really suck at predicting anything to do with ‘depression’ and should give up on our ridiculous tests and just ASK people what’s going on, since that appears to give much more accurate and useful results.” Your point about pronouns is very well taken, as well – why does “we” not include the client “we” are supposed to be helping? Perhaps this is the center of “our” difficulty in predicting “depression?” Perhaps “we” need to give up on the idea that “depression” is a thing to be measured in the first place?

    • Agreed Julie. Researchers and clinicians validate scales before they are used. They often become the exclusive “we.” However, I see it as super messy because not only might the “we” be blind to the patient’s perspective, but in general, the different ideas and experiences that any person is exposed to will shape what elements are seen as problematic and therefore worthy of measurement by a scale.

      Let me illustrate this with some of my problems with scales. The first time that I was introduced to evaluation of my mood was when I was younger and my well-meaning parents said something like “we are worried that you might have depression”; however, prior to that, I had often thought of my unhappiness as being because “my life sucks”. If I were to have made a scale as a teen it might have involved elements for my dislike of various things. My perspective changed with involvement in the system, and I then learned to think more in terms of symptoms. I had assessments, PHQ-9s, and treatment for depression which followed me into adulthood. After many years of treatment I was even enrolled in something where I had to take a form of the BDI almost every day. Overall, this approach generated by other people harmed me.

      I have since departed from thinking like that and I now have a similar but more mature view than when I was a teen. I’m now working to improve both the circumstances of my life and my acceptance of circumstances that I cannot change. (As well as other things.)

      Would I want to take a scale now? No, because at this point in my life, I actively resist thinking about my life in terms of “symptoms” which seems to be serving me well. Who would I trust to create a useful scale to be used for everyone? I don’t know because every person has a different perspective. Could some people or computers somewhere make a scale that could be useful for some people? I think so.

      • I think you make a great point – not only the scales themselves, but the decisions of what to “measure” are very much culturally bound, which prevents them from ever really being “scientific” in the sense of truly objective. And I also have found, for me and for others, that thinking in terms of what I don’t like and want to change, and what I do like (learned this one a LOT later in life) and want to preserve and appreciate, is much more helpful than thinking of “what is wrong with me?”

    • Glad to hear you got away Julie :). I believed their evaluations of me for waaay too long, I think in part because it all seemed so official. At one place I had to go where they made me do questionnaires, the doc brought me into his office at one month of treatment. He said that my results indicated a 30% improvement. They were very confident and professional about their evaluations. They appeared as though they were “helping” and measuring my progress. I now see that clinic as just another place that harmed me bigtime. I’m still struggling to make sense of my time in the system, still coming off the drugs too, it’s like waking up from a really bad dream.