The importance of the P-value is often misunderstood and has been at the center of statistics debates for 20 years, writes Stanford researcher Helena Chmura Kraemer. Writing in JAMA Psychiatry, she notes that there have been calls to ban the statistic for being misleading. She has a simple solution, though:
“Whether P values are banned matters little. All readers (reviewers, patients, clinicians, policymakers, and researchers) can just ignore P values and focus on the quality of research studies and effect sizes to guide decision-making.”
The P-value is a statistic, usually reported against a threshold (e.g., p < 0.05). It measures the probability of obtaining results at least as extreme as those observed by chance alone (in this case, less than 5%), assuming that the null hypothesis is true. That is, it doesn’t measure whether that hypothesis is true; it only indicates how surprising the data would be if it were.
According to Kraemer, “The P-value does not give the probability that the hypothesis being tested is false (i.e., that the null hypothesis is true); nor is it true that the smaller the P-value, the stronger the hypothesis being tested.”
Kraemer provides a clear example: imagine a hypothetical scenario in which researchers could identify people “at-risk” for developing an illness within ten years. In the general population, the risk of developing the illness is 10%. But the researchers hypothesize that in their “at-risk” sample, the risk will be larger than 10%. They conduct a study to find out—they use their measure to find “at-risk” people and then follow them to see if they develop the illness at a rate higher than 10%.
Let’s say their “at-risk” sample has a rate of 11%. That would be a pretty meaningless increase: only one additional person out of every 100 would develop the illness compared with the general population. However, it might still be statistically significant, because there is a real (if tiny) difference between the two groups.
As the sample size of the study increases, the P-value becomes smaller and smaller: if there are 100 people in the study, P is approximately .37, a non-significant number. In that case, the study fails to find the effect, even though it is real. If there were 2,500 people, though, P is about .05, so the research is just on the verge of confirming the effect. Increase the sample to 50,000 people, and the P-value becomes tiny (about 5 × 10⁻¹⁴), making the result highly statistically significant.
The true effect doesn’t change here at all—the prediction is marginally better for the at-risk group (11% instead of 10%). However, changes to the sample size drastically alter whether the result will be statistically significant.
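The arithmetic behind those three P-values can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the article does not say which test Kraemer used, so the sketch assumes a one-sided, one-sample z-test of a proportion with a normal approximation, and the function name `one_sided_p_value` is hypothetical.

```python
import math


def one_sided_p_value(observed_rate, null_rate, n):
    """Upper-tail P-value for a one-sample z-test of a proportion.

    Assumption: normal approximation, with the standard error
    computed under the null hypothesis (rate == null_rate).
    """
    standard_error = math.sqrt(null_rate * (1 - null_rate) / n)
    z = (observed_rate - null_rate) / standard_error
    # Upper-tail area of the standard normal distribution.
    return 0.5 * math.erfc(z / math.sqrt(2))


# Same observed effect (11% vs. 10%), three different sample sizes.
for n in (100, 2500, 50_000):
    p = one_sided_p_value(0.11, 0.10, n)
    print(f"n = {n:>6}: P = {p:.2g}")
```

Under these assumptions the output should land close to the figures quoted above. The observed rates never change between runs; only `n` does.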
No matter what, though, the “risk difference” is tiny at 0.01 (one percentage point: 11% minus 10%). That statistic, essentially an effect size, tells the reader how much of a difference there is between groups. That’s more important information than just “is there any difference between groups.” Whenever you’re comparing two groups, there’s almost always going to be a tiny difference, since groups are rarely exactly the same, even when randomized. So it’s far more critical to know how different the groups are.
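To see how an effect size behaves differently from a P-value, here is a rough sketch that computes the risk difference for the same example along with a 95% confidence interval. It rests on assumptions not found in the article: the 10% population rate is treated as a known constant, the interval is a simple Wald (normal-approximation) interval, and the function name `risk_difference_with_ci` is hypothetical.

```python
import math


def risk_difference_with_ci(observed_rate, baseline_rate, n, z_crit=1.96):
    """Risk difference versus a fixed baseline, with a Wald 95% interval.

    Assumption: the baseline rate is known exactly, so all sampling
    uncertainty comes from the observed proportion.
    """
    diff = observed_rate - baseline_rate
    # Standard error of the observed proportion.
    se = math.sqrt(observed_rate * (1 - observed_rate) / n)
    return diff, diff - z_crit * se, diff + z_crit * se


for n in (100, 2500, 50_000):
    diff, low, high = risk_difference_with_ci(0.11, 0.10, n)
    print(f"n = {n:>6}: risk difference = {diff:.3f}, "
          f"95% CI = ({low:+.3f}, {high:+.3f})")
```

The point estimate is the same at every sample size; a larger sample only narrows the interval around it. That is the sense in which the effect size, unlike the P-value, describes the size of the difference rather than the size of the study.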
Kraemer notes that there are two main types of research study: hypothesis-testing studies (HTS) and hypothesis-generating studies (HGS). The P-value, she writes, is a reasonable metric for HTS, but only if accompanied by effect size data as well. However, she writes that HGS should not use the P-value at all. HGS should be understood as not providing any conclusions, but rather as creating hypotheses that can later be tested using HTS. When HGS are reported as finding statistically significant effects, they may be criticized as “fishing expeditions”: studies that test more and more ideas until they come up with a statistically significant result by chance, then present it as if it were their hypothesis all along.
HGS are vital for creating hypotheses for further testing. Still, it’s misleading when they’re presented in the media as finding a positive result because of P-values, as if they meant something in this context. According to Kraemer, “There should be no P values in an HGS and few in an HTS.”
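The “fishing expedition” problem can be made concrete with a little arithmetic. The sketch below is an illustration, not something taken from Kraemer's commentary: it assumes every test is independent, every null hypothesis is actually true, and the significance threshold is 0.05. The function name `chance_of_false_positive` is hypothetical.

```python
def chance_of_false_positive(num_tests, alpha=0.05):
    """Probability that at least one of `num_tests` independent tests
    comes out "significant" when no real effect exists in any of them.
    """
    # Each test avoids a false positive with probability (1 - alpha).
    return 1 - (1 - alpha) ** num_tests


for num_tests in (1, 5, 20, 100):
    chance = chance_of_false_positive(num_tests)
    print(f"{num_tests:>3} tests: {chance:.0%} chance of at least one false positive")
```

Under these assumptions, a study that tries enough ideas is more likely than not to turn up something “significant” by chance alone, which is why a P-value from an exploratory study says little on its own.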
Kraemer provides this heuristic when making sense of a research paper:
“Any P-value reported alone should also be ignored. Every valid P value reported, whether significant or not, should be accompanied by descriptive statistics (eg, tables, graphs) and an interpretable effect size with an indication of its estimation accuracy (eg, 95% CIs) to allow each reader to judge its potential importance.”
Kraemer, H. C. (2019). Is It Time to Ban the P-Value? JAMA Psychiatry, 76(12), 1219-1220. DOI:10.1001/jamapsychiatry.2019.1965