Questioning the Value of the P-Value

The p-value expresses the probability of obtaining a test statistic at least as extreme as the one actually observed, assuming that the null hypothesis is true. The scientific method attempts to disprove the null hypothesis. The null hypothesis is rejected when the p-value falls below a chosen significance level, conventionally 0.05. Loosely speaking, the p-value is the chance of a random false positive.
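To make the "chance of a random false positive" reading concrete, here is a minimal simulation sketch (the z-test framing and function name are my own assumptions, not from the article): when the null hypothesis is true, a p-value below 0.05 turns up in roughly 5% of tests.

```python
import math
import random

def two_sided_p(z):
    """Two-sided p-value for a standard-normal test statistic z."""
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)).
    return math.erfc(abs(z) / math.sqrt(2))

random.seed(42)
n_tests = 10_000
false_positives = 0
for _ in range(n_tests):
    z = random.gauss(0, 1)        # test statistic drawn from the null, N(0, 1)
    if two_sided_p(z) < 0.05:
        false_positives += 1

print(false_positives / n_tests)  # close to 0.05
```

In other words, the 0.05 threshold does exactly what it promises for a single test; the trouble, as the article goes on to argue, starts when many tests are run.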

Not only does modern science rest on hypothesis testing and the use of the p-value to assign statistical significance to test results, but the p-value is also a familiar number to practitioners of Six Sigma. We could say that the p-value is a standard for testing the outcomes of our experiments and for validating the results of our continuous improvement efforts.

An article in the August 2011 Scientific American argues that the p-value is in fact an arbitrary standard and not always a trustworthy one. The author states:

Many scientific papers make 20 or 40 or even hundreds of comparisons. In such cases, researchers who do not adjust the standard p-value threshold of 0.05 are virtually guaranteed to find “statistical significance” in what are actually statistically meaningless results.
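The arithmetic behind this warning can be sketched in a few lines (alpha = 0.05 and the comparison counts echo the quote; the Bonferroni remedy mentioned in the comment is my addition, not the article's). With m independent tests of true null hypotheses, the chance of at least one spurious "significant" result is 1 - (1 - alpha)^m:

```python
# Probability of at least one false positive across m independent tests,
# each run at significance level alpha, when every null hypothesis is true.
alpha = 0.05
for m in (1, 20, 40, 100):
    p_any_false_positive = 1 - (1 - alpha) ** m
    print(f"{m:>3} comparisons: {p_any_false_positive:.3f}")

# One common (and conservative) remedy is the Bonferroni correction:
# run each of the m comparisons at alpha / m instead of alpha.
```

At 20 comparisons the probability of at least one false positive already exceeds 60%, which is why unadjusted thresholds are "virtually guaranteed" to produce significance somewhere.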

The father of modern statistics, Ronald A. Fisher, invented the p-value as an informal measure of evidence against the null hypothesis. Although this is often overlooked, Fisher called on scientists to weigh other types of evidence, such as the a priori plausibility of the hypothesis and the relative strengths of results from previous studies, in combination with the p-value.

Can a dead salmon read your mind? According to statistically significant results backed by the p-value, it can. With the application of skepticism based on past experience and plausibility, the researchers in the Scientific American example recognized that the hypothesis and their method of testing it were questionable, a p-value of 0.01 notwithstanding.

Inasmuch as lean thinking and kaizen rely on the scientific method, we need to follow Fisher’s advice and apply statistical (if arbitrary) standards in combination with intuition based on common sense and experience. At the same time, the truths within lean systems run deeply counter to our intuitions and even our (mis)perceptions of experience. One at a time is faster than many at a time. A lot of safety stock creates unsafe conditions. Taking time to clean up actually saves time.

Years ago I asked a Six Sigma MBB how we could possibly test the hypothesis that the p-value was a valid standard. He looked at me as if I were an idiot and said, “We live in a universe governed by statistics.” The history of science is littered with universal laws that became outdated as our understanding of the world advanced. All standards are temporary and subject to improvement. If we continue to use the p-value as a standard, we are required to kaizen this standard.

2 Comments

  1. Rick Kennedy

    July 26, 2011 - 8:42 am

    So if you do 20 hypothesis tests using .05 as your p-value cut-off, you should not be surprised to find, on average, one false positive. This is elementary. That’s why a good practitioner (or a good researcher) will not take a single hypothesis test as proof, but will support a positive test with some confirming piece of evidence. A follow-up test with new data, confirmation of an additional predicted effect, and a pilot or test of change are all good methods.

  2. John Santomer

    July 30, 2011 - 12:14 am

    Dear Jon,
    Why do we place realities into binary results of true or false? We often get non-conforming figures when trying to measure realities within finite formulas. Even so, we also accept “gray” results such as false positives or true negatives. The null hypothesis can never be proven; a data set can only reject or fail to reject a null hypothesis. What if the measurement involved a critical undetermined value, unquantifiable and incorporeal, say a soul? Animism teaches that all living things have a soul (spirit). In the spirit of learning, let us subject this to the null hypothesis: “My pet dog, my best friend, has a soul, has an essence.” What confirming evidence can we consider that would reject or simply fail to reject this hypothesis without losing my best friend and buddy? I cannot test on other dogs because I am only concerned with my best friend and buddy.