The P-Value Problem in Psychiatry

Stanford researcher writes that readers should check the effect size of results instead of looking at the p-value.


The importance of the P-value is often misunderstood and has been at the center of statistics debates for 20 years, writes Stanford researcher Helena Chmura Kraemer. Writing in JAMA Psychiatry, she notes that there have been calls to ban the statistic for being misleading. She has a simple solution, though:

“Whether P values are banned matters little. All readers (reviewers, patients, clinicians, policymakers, and researchers) can just ignore P values and focus on the quality of research studies and effect sizes to guide decision-making.”


The P-value is a statistic, usually reported against a threshold (e.g., p < 0.05). It measures the probability of obtaining results at least as extreme as those observed by chance alone (in this case, less than 5%), assuming that the null hypothesis is true. That is, it doesn’t measure whether that hypothesis is true; it only indicates how surprising the data would be if it were.

According to Kraemer, “The P-value does not give the probability that the hypothesis being tested is false (i.e., that the null hypothesis is true); nor is it true that the smaller the P-value, the stronger the hypothesis being tested.”

This leads to problems when the P-value is misinterpreted. For instance, an article in Aeon describes some scenarios in which the P-value is easily misunderstood.

Kraemer provides a clear example: imagine a hypothetical scenario in which researchers could identify people “at-risk” for developing an illness within ten years. In the general population, the risk of developing the illness is 10%. But the researchers hypothesize that in their “at-risk” sample, the risk will be larger than 10%. They conduct a study to find out—they use their measure to find “at-risk” people and then follow them to see if they develop the illness at a rate higher than 10%.

Let’s say their “at-risk” sample has a rate of 11%. That would be a pretty meaningless increase—one additional person out of every 100 would be identified by this “at-risk” designation. However, it might be statistically significant because there is a real difference between the two groups.

As the sample size of the study increases, the P-value becomes smaller and smaller. If there are 100 people in the study, P is approximately .37, a non-significant number. In that case, the study fails to find the effect, even though it is real. If there were 2,500 people, though, P is about .05, so the research is just on the verge of confirming the effect. Increase the sample to 50,000 people, and the P-value becomes tiny (about 5 × 10⁻¹⁴), making the result statistically significant.
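The three P-values in this example can be roughly reproduced with a short calculation. The sketch below assumes a one-sided z-test (normal approximation) comparing an observed 11% rate against the known 10% population rate. The article does not say which test Kraemer used, so the test choice and the function name `one_sided_p` are assumptions made for illustration.

```python
import math

def one_sided_p(observed_rate, null_rate, n):
    """Upper-tail p-value for a one-sample proportion z-test.

    Assumption: normal approximation, with the standard error
    computed under the null hypothesis (rate == null_rate).
    """
    se = math.sqrt(null_rate * (1 - null_rate) / n)
    z = (observed_rate - null_rate) / se
    # Upper-tail area of the standard normal distribution beyond z
    return 0.5 * math.erfc(z / math.sqrt(2))

# Same observed effect (11% vs. 10%), three different sample sizes
for n in (100, 2500, 50000):
    print(f"n = {n:>6}: p = {one_sided_p(0.11, 0.10, n):.2g}")
```

Under these assumptions the results land close to the quoted figures: roughly .37, .05, and 5 × 10⁻¹⁴. Only `n` changes between the three calls; the observed and expected rates are identical each time.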

The true effect doesn’t change here at all—the prediction is marginally better for the at-risk group (11% instead of 10%). However, changes to the sample size drastically alter whether the result will be statistically significant.

No matter what, though, the “risk difference” is tiny at 0.01 (one percentage point). That statistic, essentially an effect size, tells the reader how much of a difference there is between groups. That’s more important information than just “is there any difference between groups.” Whenever you’re comparing two groups, there’s almost always going to be a tiny difference, because groups are rarely exactly the same, even when randomized. So it’s far more critical to know how different the groups are.
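To see how an effect size behaves as the sample grows, the risk difference (11% minus 10%) can be computed alongside a 95% confidence interval. This is a minimal sketch, assuming a simple Wald interval for the observed rate and treating the 10% population rate as a known constant. The helper `risk_difference_ci` is a hypothetical name, not something from Kraemer's paper.

```python
import math

def risk_difference_ci(observed_rate, baseline_rate, n, z=1.96):
    """Risk difference with an approximate 95% Wald confidence interval.

    Assumption: baseline_rate is a fixed, known population value,
    so only the observed rate contributes sampling error.
    """
    diff = observed_rate - baseline_rate
    se = math.sqrt(observed_rate * (1 - observed_rate) / n)
    return diff, diff - z * se, diff + z * se

for n in (100, 2500, 50000):
    diff, low, high = risk_difference_ci(0.11, 0.10, n)
    print(f"n = {n:>6}: difference = {diff:.3f}, 95% CI = ({low:.4f}, {high:.4f})")
```

The point estimate stays at 0.01 for every sample size. What changes is the width of the interval, which is the "estimation accuracy" a reader needs in order to judge whether so small a difference matters.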

Kraemer notes that there are two main types of research study, hypothesis testing studies (HTS) and hypothesis-generating studies (HGS). The P-value, she writes, is a reasonable metric (but only if accompanied by effect size data as well) for HTS. However, she writes that HGS should not use the P-value at all. HGS should be understood as not providing any conclusions, but rather as creating hypotheses that can later be tested using HTS. When HGS are reported as finding statistically significant effects, they may be criticized as “fishing expeditions”— studies that test more and more ideas until they come up with a statistically significant result by chance, then present it as if it were their hypothesis all along.
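The “fishing expedition” problem can be illustrated with simple arithmetic. The sketch below assumes that every test is independent, that no real effect exists, and that each test uses a 0.05 threshold. These are simplifying assumptions for illustration rather than anything stated in Kraemer's paper, and `chance_of_false_positive` is a hypothetical helper name.

```python
def chance_of_false_positive(num_tests, alpha=0.05):
    """Probability that at least one of num_tests independent tests
    comes out 'significant' when no real effect exists."""
    return 1 - (1 - alpha) ** num_tests

for k in (1, 5, 20, 100):
    print(f"{k:>3} tests: {chance_of_false_positive(k):.0%} chance of a spurious finding")
```

Under these assumptions, a study that runs 20 unrelated tests has well over a 50% chance of turning up at least one “significant” result by chance alone.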

HGS are vital for creating hypotheses for further testing. Still, it’s misleading when they’re presented in the media as finding a positive result because of P-values, as if they meant something in this context. According to Kraemer, “There should be no P values in an HGS and few in an HTS.”

Kraemer provides this heuristic when making sense of a research paper:

“Any P-value reported alone should also be ignored. Every valid P value reported, whether significant or not, should be accompanied by descriptive statistics (eg, tables, graphs) and an interpretable effect size with an indication of its estimation accuracy (eg, 95% CIs) to allow each reader to judge its potential importance.”



Kraemer, H. C. (2019). Is It Time to Ban the P-Value? JAMA Psychiatry, 76(12), 1219-1220. DOI:10.1001/jamapsychiatry.2019.1965


  1. Even science and scans and neuropsychiatry are fishing expeditions.
    Once they drop the idea that humans are “sick in the head”, that somehow that sickness means that people are
    no longer worthy of the same laws, same protection, same respect,
    that somehow the shrink owns “normal”
    and the “sick in the head” is not normal.
    Once THEY, the authorities on sick, drop their crazy, discriminatory actions, once a few are not allowed to dictate what normal looks like, only then can the shrinks hold validity as being a caring subspecialty.
    Psychiatry at best keeps the world entertained as a system.
    It would be wise to have anti-psychiatry education in schools, because every harmful system should be taught about in school.
    We have to teach our kids to think critically. To examine systems from the bottom up. To hear pros and cons of a system.
    If psychiatry is secure, why would this be an issue?
    Ohh, because it might prevent people from seeking ‘help’?
    And that those kids will then run and kill themselves?
    Nothing like fear mongering from systems and cults that are not based on anything solid.
    It might be wise to point out to kids who in their midst will most likely become a shrink, since I believe they can most likely be identified early on.
    These kids need help, early on, so that they don’t go on to become so dastardly discriminatory and breed their human hatred among the vulnerable.
    A shrink never picked on the more powerful at school, and it becomes a habit for him to pick on vulnerable ones that still have very healthy abilities.
    He seeks to destroy the bits of healthy he sees, through his own pathology, yet fails to see this.


  2. Hi Peter. You did a good job explaining why relying just on p-values is a problem, but you did not apply this problem specifically to psychiatry as your title suggests. Would you mind sharing some specific examples (i.e., published manuscripts that misinterpreted or over-emphasized p-values) of how p-values are a problem in psychiatry? I believe what you say is true – I just think your article would have been a lot more powerful if you had included specific examples, not just hypothetical situations. Thank you. -David


  3. But people who get psychiatric “help”, and who take psych drugs, are far more likely to then go get a gun and shoot and kill a bunch of people. This is PROVEN by the fact that MOST mass-casualty school shooters were on psych drugs from a psychiatrist…. From Sandy Hook to Columbine, psych drugs pulled the triggers, and psychiatrists loaded the bullets….


  4. What exactly do they call “effectiveness” in psychiatric treatment? Is this regarded from the patient’s point of view or from the hospital’s and family’s point of view?

    Psychiatric drugs, like lobotomy too, have been invented in order to reduce government expenses or to make private psychiatric establishments profitable. They are sedatives, or tranquilizers, aimed at keeping personnel numbers in these hospitals lower. They cure nothing; healing is not why they were created.

    Most of those who take dopamine-inhibiting drugs perceive this as horrendous, sometimes even as the most horrible experience of their lives.

    Check out the comments here (scroll down a little):


    • It is clear from the focus of the “studies” that the definition of effectiveness is “reduction in symptoms.” This may or may not be of interest to the client specifically, but it certainly makes it obvious that resolving the actual issues that created the “symptoms” is never the goal. It’s like spending a ton of money on topical rash treatments without bothering to figure out if you have poison ivy, the measles, prickly heat, or syphilis. But it certainly is “effective” for creating lifetime patients and blockbuster drug sales!


      • That’s quite a double standard and dehumanizing view. For physical illnesses, drugs or other kinds of treatment seek to improve the patient’s quality of life.

        In psychiatry, the symptoms are treated as if the patient were an object, not a subject. They just seem to think that once declared “mentally ill”, you’re no more a person, what you wish or feel doesn’t matter, maybe because it is thought you lack judgement.


        • “They just seem to think that once declared “mentally ill”, you’re no more a person, what you wish or feel doesn’t matter …”

          This is very true, I was appalled when I read my medical records, and realized my psychiatrist didn’t refer to me by name, or even as a fellow human being. But rather, he referenced me by the stigmatization he had given me. People are not DSM disorders, they’re people.

          But, of course, when a psychiatrist doesn’t bother to ever listen to anything you say, he proceeds to fill his medical records with so many provable untruths about that client that he looks like the insanely delusional person in the end, because he is.

          My psychiatrist was so insanely paranoid of a malpractice suit when I was walking out the door on my last appointment with him that he had his receptionists attempt to get me to sign a sheet full of clear stickers that said, “I declare this is true” on them.

          I offered to go through my medical records with them and sign the stickers as placed, if I agreed that what they wanted confirmed was, in fact, the truth. They declined, in embarrassment.

          But it is amazing how easy it is to turn a psychiatrist into a “paranoid schizophrenic.” And I didn’t even need the neuroleptic drugs to turn my psychiatrist into a “paranoid schizophrenic,” I did it with the truth alone.


  5. Of course this is only one piece of the puzzle, but a very important one. The first thing I usually look at if I’m doing my back-of-the-envelope literature review is, who is in the sample? If something about that seems weird or contorted, and the explanation for the selection seems byzantine or contains weasel words, it often means the sample was manipulated to get a better P value. If the sample is weird, and the P value (and I can only really wrap my brain around what that means on a good day) is unimpressive, and no effect size or confidence interval is listed, I usually don’t bother going past the “Results” section. I’m done, the study is probably junk.

    By this crude criterion, basically, most research on drugs published after 1980-something is… kind of… junk.

    Another issue is that we’ve become so used to huge, complicated RCTs that focus on P-values that we’ve started accepting them even when the results fly in the face of clinical practice. And the longer I practice, the more this bothers me. If this research did not exist at all, and we had to evaluate efficacy only by a completely non-‘scientific’ hunch about what clinicians witness in the world, the conclusion would be obvious: Many young people experience serious depressive symptoms, but the ones who have the most debilitating symptoms, and the ones who actually kill themselves or others, are usually the ones who have been exposed to SSRIs, usually after they’ve started or stopped or changed dosage. It seems like every third or fourth client I have who is persuaded to try SSRIs (not by me, certainly) also has some horrible physical reaction: something that looks like a convulsion, or they pass out during exercise, or have some other bizarre episode that’s unlike any symptom they’ve experienced previously.

    Thus, clinicians are gradually trained to ignore what’s actually going on with people they care for… they work the checklist and not the case. The best short-term antidote I’ve stumbled across is simply that when I think someone is at risk, I consult with more experienced clinicians first, rather than doing a literature review. When I do a literature review, it’s (usually) not during a crisis, and I keep my expectations very low, and look for smaller samples that are defined by criteria that make sense.


  6. Effect size is not helpful either, because there are approximately 30 legal ways to distort the results of a randomized controlled trial. None of them is outright fraud, but taken together, these tricks can turn the results upside down. The powers that be can force science to show whatever they want to “prove” to the masses. All these tricks are documented in Appendix I of a free ebook, “Fight Cancer” by Charles Spender.
