Taking – or leaving – a P: lifting a leg, or getting a leg-up on metrics of statistical rigor

Is it the end of the road for the p-value as we know it? Christina Cantrell and James Giordano, Editor in Chief of Philosophy, Ethics, and Humanities in Medicine, talk us through the recent discourse over the value of the p-value.

Recently, Ioannidis et al. have proposed that a p-value ≤ 0.05 may be an inadequate threshold to establish the statistical rigor needed to demonstrate outcomes’ validity. There may be something to this position. Tightening the p-value threshold from ≤ 0.05 to ≤ 0.005 or ≤ 0.001 would certainly enable a more granular approach to – and criteria for – statistical significance.

But rarely are things of benefit without suffering some burden, if not issue or problem(s). Thus, a valuable next step would be to appropriately identify and weigh relative pros and cons that such a shifting standard could or would incur. This task should conjoin as many of the stake- and share-holders in the research enterprise as possible, as it will be vital to assess and gain representation of the benefits or burdens that shifting standards and norms of statistical stringency would bring to these respective communities.

To start, it will be important (or necessary) to gauge broad view(s) – if not consensus – within the scientific community (and perhaps key publics) of which disciplines and domains of biomedicine (and other fields) might be best served by adopting the use of more rigorous p-values. For example, some fields, such as those in which studies may be hampered by low numbers of viable research subjects, and/or relatively small differences in outcomes and effects (e.g., emerging techniques and technologies), might be better served by using p-values ≤ 0.005, ≤ 0.001, etc. [1,2]. The same may be true for research in certain aspects of complementary medicine, in which previous studies were either not well-conducted, or outcomes not well-received [1,2,3]. In both such instances, greater stringency and scrutiny of (positive) outcomes might obtain greater acceptance of the findings reported.

It might also be that the use of more stringent p-values reveals a greater incidence of placebo effect. This could be viewed as either pro or con, in large measure depending upon the definition – and implication – of placebo applied. Certainly, seeing placebo responses as merely a lack of effect would be viewed in a negative light. But are such responses truly a “lack of effect”, or rather, do they reflect a physiological effect induced by some factor other than the drug or technology being evaluated [1,2,3]?

If we are calling for a more current approach to research by adopting more rigorous p-values, should we not also be equally current in an understanding of placebo responses as multi-componential processes and mechanisms that induce changes in biological functions?

The use of more stringent p-values in research addressing substrates of placebo (and other) responses may yield information about what works, what doesn’t, in whom, and under what conditions. The cost-benefit of such findings, as viewed through a more granular lens (e.g., of p ≤ 0.005 to 0.001), might be significant, indeed. As well, the move to adopt p-values of greater stringency may prompt and justify (and provide newly gauged tools for) re-examining and re-evaluating prior research.

Toward this goal, we suggest starting with those studies that have investigated mechanisms (of health, disease and injury), tools and techniques that offer the greatest potential for benefit or harm, so as to re-assess the relative good and/or detriment that the outcomes yielded could incur.

Or, should p-values be left aside altogether, as some have suggested [1,2]? Is there no longer value in p-values? We beg to differ, and argue that there is still merit to the p-value as a useful construct, at least in part. Taking p-values together with additional queries might provide more accurate metrics for the quality, meaning and worth of certain types of outcomes. For example, what kind of effect was produced by the intervention being tested? Are such findings meaningful to clinicians and patients? What is the confidence interval? Could a different form of statistics (e.g., Bayesian methods) be employed to evaluate what is being tested?
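To make these queries concrete, here is a minimal Python sketch of reporting an effect size and confidence interval alongside the p-value, and of how one and the same result reads under a ≤ 0.05 versus a ≤ 0.005 threshold. The function name `summarize`, the data, and the use of a large-sample normal approximation are all assumptions made for illustration; they are not drawn from any study discussed here, and a t-based or Bayesian analysis would be more appropriate for samples this small.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def summarize(treated, control, alpha=0.05):
    """Report difference in means, Cohen's d, a confidence interval and a p-value.

    Hypothetical helper: uses a large-sample normal (z) approximation for
    brevity, which is an assumption and is optimistic for small samples.
    """
    n_t, n_c = len(treated), len(control)
    var_t, var_c = stdev(treated) ** 2, stdev(control) ** 2

    diff = mean(treated) - mean(control)           # raw effect
    se = sqrt(var_t / n_t + var_c / n_c)           # standard error of the difference
    pooled_sd = sqrt(((n_t - 1) * var_t + (n_c - 1) * var_c) / (n_t + n_c - 2))

    z = diff / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # two-sided p-value
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # critical value for the (1 - alpha) interval

    return {
        "difference": diff,
        "cohens_d": diff / pooled_sd,              # standardized effect size
        "ci": (diff - z_crit * se, diff + z_crit * se),
        "p_value": p_value,
        "significant": p_value <= alpha,
    }


# Illustrative, made-up outcome scores for two small groups.
control = [10, 12, 11, 13, 9, 11, 12, 10]
treated = [11.5, 13.5, 12.5, 14.5, 10.5, 12.5, 13.5, 11.5]

conventional = summarize(treated, control, alpha=0.05)
stringent = summarize(treated, control, alpha=0.005)

for label, result in (("alpha=0.05", conventional), ("alpha=0.005", stringent)):
    low, high = result["ci"]
    print(f"{label}: diff={result['difference']:.2f}, d={result['cohens_d']:.2f}, "
          f"CI=({low:.2f}, {high:.2f}), p={result['p_value']:.4f}, "
          f"significant={result['significant']}")
```

In this made-up example the effect size is identical under both thresholds; only the verdict of "significance" and the width of the interval change, which is one reason to read the p-value together with the other quantities rather than alone.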

In this light, we propose the creation of criteria that establish the relative acceptability and requirements of given p-value(s) in particular types and contexts of research. To be sure, this could afford a useful toolkit. So while the old adage, “…if all one has is a hammer, then everything is a nail”, may (rightly) lift a leg on the dated territoriality of employing a single tool, perhaps it also poses a challenge and opportunity to develop instruments and metrics anew. We think that the idea of “…having hammers – of varying sizes and weights – that can be used in concert with other tools” de-limits both the adage and its application, and lends a leg-up on developing research methods that may be better suited for the current – and coming – work at hand.
