A new study investigates the relationship between how people discuss their mental health and the language used to describe, label, and categorize it. Using a mixed-methods analysis of 698 interviews on emotional health and depression, together with depression screening instruments, the researchers explore the disconnect between lived experiences of depression and computational measures of it.
The authors, Arseniev-Koehler, Mozgai, and Scherer, warn of “categories for mental health risk being so articulated and abstracted that they lose touch with the diversity of illness experiences.”
“This paper re-examines the detection of depression from language and revisits old and current debates in mental health classification. Along the way, we highlight strengths and weaknesses of modeling approaches and propose several strategies for more reflexive modeling.”

While the prospect of valid tools for detecting mental health disorders has inspired a vast amount of research over the years, this study calls attention to the discrepancy between these tools and individuals’ descriptions of their personal experiences.
Arseniev-Koehler and colleagues explain that nearly a century of research has produced modern screening instruments for detecting depression, but they note that, “particularly in the realm of mental health, we can’t take labels at face-value.” They argue that mental health labels are too often understood as “objective truth,” even though, “unlike a ‘broken bone,’ or a ‘sprained wrist,’ mental health is a gray area. Mental health is largely defined by our conceptions of what is ‘normal’ and what is ‘disordered’— conceptions which can change across culture and time.” Thus, while psychiatric diagnoses are designed to maximize reliability, their validity remains weak.
The authors reviewed efforts to detect depression from written and transcribed verbal data by examining peer-reviewed research that identifies and predicts depression from text. Additional quantitative and qualitative evidence is drawn from 698 interviews in the Distress Analysis Interview Corpus (DAIC), collected from two populations living in Los Angeles: the general public and veterans of the U.S. armed forces. Ellie, an avatar, conducts the interviews in a way that simulates a mental health screening. The 8-item version of the Patient Health Questionnaire (PHQ-8) is included in these interviews.
For their qualitative data, the authors open-coded a subset of interviews in which participants talked about mental health and emotions, then searched for a lexicon of terms relevant to depression (e.g., depressed, sad, blue, happy, content), and finally open-coded the interview sections containing this lexicon and compared them to the interviewees’ PHQ-8 scores.
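The lexicon-matching step can be pictured with a short sketch. The word list and data structures below are illustrative assumptions, not the authors’ actual coding tools; the point is simply to show how interview passages containing depression-related terms can be pulled out and paired with each interviewee’s PHQ-8 score for closer reading.

```python
import re

# Illustrative lexicon of depression-related terms (assumed, not the authors' full list)
LEXICON = {"depressed", "depression", "sad", "blue", "happy", "content"}

def flag_segments(transcript, phq8_score):
    """Return transcript segments that mention a lexicon term, paired with the PHQ-8 score."""
    pattern = re.compile(r"\b(" + "|".join(LEXICON) + r")\b", re.IGNORECASE)
    flagged = []
    for segment in transcript:  # transcript: list of utterance strings
        if pattern.search(segment):
            flagged.append({"text": segment, "phq8": phq8_score})
    return flagged

# Hypothetical usage with a made-up interview excerpt and score
print(flag_segments(["I guess I have been feeling pretty sad lately.",
                     "Work has been fine."], phq8_score=12))
```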
Participants in this study averaged a PHQ-8 score of 6 (a score of 10 or higher is considered indicative of depression), and 25% currently met criteria for depression according to the scale. Additionally, individuals with higher PHQ-8 scores used more words expressing negative emotions and more first-person singular pronouns (“I”) than first-person plural pronouns (“we”), in line with existing research on individuals experiencing depression.
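As a rough illustration of the word-count evidence behind this kind of finding, the sketch below tallies negative-emotion words and first-person singular versus plural pronouns, and applies the PHQ-8 cutoff of 10. The tiny word lists are stand-ins for a validated lexicon (such as LIWC) and are not drawn from the paper.

```python
# Minimal sketch of LIWC-style word counts; the word lists are illustrative stand-ins
NEGATIVE_EMOTION = {"sad", "hopeless", "tired", "worthless", "empty"}
FIRST_SINGULAR = {"i", "me", "my", "mine"}
FIRST_PLURAL = {"we", "us", "our", "ours"}

def text_features(transcript_text):
    """Rates of negative-emotion words and first-person pronouns per token."""
    tokens = transcript_text.lower().split()
    total = max(len(tokens), 1)
    return {
        "neg_emotion_rate": sum(t in NEGATIVE_EMOTION for t in tokens) / total,
        "first_singular_rate": sum(t in FIRST_SINGULAR for t in tokens) / total,
        "first_plural_rate": sum(t in FIRST_PLURAL for t in tokens) / total,
    }

def meets_phq8_cutoff(score):
    """PHQ-8 scores of 10 or higher are treated as screening positive for depression."""
    return score >= 10

print(text_features("i feel so tired and sad most days"))
print(meets_phq8_cutoff(6), meets_phq8_cutoff(12))  # False, True
```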
The qualitative and quantitative data for this study were often “mismatched.” For example, some participants rated as low risk for depression according to the PHQ described struggling with depression in their interviews. Arseniev-Koehler and colleagues acknowledge that what we label as depression remains “enigmatic in medicine and psychiatry.” The researchers write:
“In modern psychiatry, diagnoses are descriptive, co-occurring clusters of symptoms. They do not reference underlying mechanisms or causes, and categories provide little information on treatment responses.”
The data also point to inconsistent understandings of terms like depression, happiness, contentment, and other mood states. For example, one participant asked for clarification when asked the last time they were happy, inquiring, “What type of happiness are you looking for?” Another stated that they were seeking contentment rather than happiness, an attempt to put into words the meaning of their personal experiences and feelings.
Concerning the PHQ and other self-report diagnostic scales, the researchers write, “implicitly, these scales are proxies for psychiatric ratings from structured interviews. Of course, self-report diagnostic scales are an imperfect proxy.” An algorithm used to predict PHQ scores from language “likely has a wide margin for errors for detecting depression when compared to a mental health professional rather than the proxy measure on which it is trained.”
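To make the “proxy measure” point concrete, here is a minimal sketch of the usual setup: a text model is fit to self-reported PHQ-8 scores rather than to clinician judgments, so any errors are measured against the proxy, not the underlying construct. The data and model choice (a bag-of-words ridge regression via scikit-learn) are assumptions made for illustration, not the pipeline used in the paper.

```python
# Sketch: predicting PHQ-8 (a self-report proxy) from interview text.
# Requires scikit-learn; the transcripts and scores below are made-up placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

transcripts = [
    "i have been feeling sad and tired all the time",
    "things are going well we went hiking last weekend",
    "i cannot sleep and nothing feels worth doing",
    "work is busy but i am doing okay",
]
phq8_scores = [14, 2, 17, 5]  # proxy labels: self-reported, not clinician ratings

model = make_pipeline(CountVectorizer(), Ridge())
model.fit(transcripts, phq8_scores)

# Any evaluation here is agreement with the PHQ-8 proxy, not with a clinician's
# assessment, which caps what the model's accuracy can tell us about validity.
print(model.predict(["i feel exhausted and sad most days"]))
```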
The authors suggest the following modifications to current diagnostic approaches:
- Underlying models should depict depression as continuous and more dimensional, including the duration of the depressive episode, depression history, and degree of impairment in daily functioning.
- The focus should be on detecting symptoms of depression rather than detecting depression itself.
- In developing models, it may be necessary to accept lower specificity (the proportion of people without depression who are correctly identified as not having it) and lower precision in order to achieve greater sensitivity (correctly identifying those who do have depression); see the worked example after this list.
- For valid constructs of mental health, incorporate ratings from multiple clinicians along with other clinical and non-clinical measures.
- Consider how to develop predictive models that include “the uncertainty in our understanding of depression and other cultural idioms of distress.”
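The trade-off among sensitivity, specificity, and precision noted above can be made concrete with a small worked example on hypothetical screening counts (the numbers are invented for illustration, not taken from the study).

```python
# Hypothetical confusion-matrix counts for a depression screen (invented numbers)
tp = 40   # screened positive, actually depressed
fn = 10   # screened negative, actually depressed (missed cases)
fp = 60   # screened positive, not depressed (false alarms)
tn = 190  # screened negative, not depressed

sensitivity = tp / (tp + fn)   # share of depressed people the screen catches
specificity = tn / (tn + fp)   # share of non-depressed people correctly cleared
precision   = tp / (tp + fp)   # share of positive screens that are truly depressed

print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}, precision={precision:.2f}")
# sensitivity=0.80, specificity=0.76, precision=0.40
# Flagging more people raises sensitivity but tends to push specificity and precision down.
```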
Ultimately, this study joins others in urging researchers and clinicians to engage reflexively with the culturally constructed labels used in mental health. Arseniev-Koehler and colleagues conclude:
“While research in this area has recently focused on the production of high-performing models, it seems likely that literature will soon reach saturation in the number of published models. Now, models will need to be reflexively tuned, borrowing additional insights from areas such as medicine and social sciences.”
****
Arseniev-Koehler, A., Mozgai, S., & Scherer, S. (2018). What type of happiness are you looking for? – A closer look at detecting mental health from language. Proceedings of the Fifth Workshop on Computational Linguistics and Clinical Psychology: From Keyboard to Clinic (pp. 1-12). New Orleans, LA. (Link)