By Ron Pereira Updated on April 30th, 2020

As more and more medical studies related to potential Covid-19 treatments are released in the coming days, weeks, and months, you’re sure to hear terms like “significance” and “P-values” mentioned many times. And as citizens of this world we should all have a basic understanding of what these terms mean. So, in this article, I’ve adapted a module from Gemba Academy’s School of Six Sigma into text to shed some light. Let’s get started.

A hypothesis test is a method that helps us make decisions and conclusions about the overall population using sample data. For example, doctors may conduct a study with 1,000 randomly selected patients (a sample) in order to understand how a particular treatment may work across the world (the population).

We sometimes speak about the results of a hypothesis test as being *statistically significant*…meaning the result is unlikely to have occurred by chance alone.

Let’s imagine we work in a cookie making factory and we’ve been tasked with verifying whether the cookies on production line 2 are on target with respect to their weight. Specifically, these particular cookies are supposed to weigh, on average, 15 ounces.

Once we collect a random sample of around 50 cookies and weigh them we could use a specific type of hypothesis test called a 1 Sample t-Test to help us determine whether our process is on target or not.

And please don’t worry if you’re not familiar with a 1 Sample t-Test. We simply want you to understand how a hypothesis test can be leveraged.

With this said, no matter what type of test we’re working with, there are certain characteristics that never change. One of these characteristics is that every test will have a null hypothesis (Ho) and an alternate hypothesis (Ha). Let’s explore these terms.

First, the *null hypothesis*, often referred to as Ho, *is the statement of equality or no change*. The null hypothesis is always assumed true until proven otherwise, much like a defendant in the U.S. legal system is presumed innocent until proven guilty. In our 1 Sample t-Test cookie example, the null hypothesis (Ho) would be that cookie weight *equals* 15 ounces.

Next, the *alternate hypothesis (Ha) is the statement of inequality or change*. Put another way, the alternate hypothesis (Ha) is what we’re trying to prove. In our 1 Sample t-Test cookie example our alternate hypothesis (Ha) would be that the cookie weight *does not* equal 15 ounces.
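For readers who want to see what sits underneath a 1 Sample t-Test, here is a minimal Python sketch of the test statistic for the cookie example. The sample weights are invented purely for illustration, and the function name `one_sample_t` is a hypothetical helper, not something taken from any statistics package.

```python
import math
import statistics

def one_sample_t(sample, target):
    """Return (t statistic, degrees of freedom) for Ho: mean == target."""
    n = len(sample)
    sample_mean = statistics.mean(sample)
    sample_sd = statistics.stdev(sample)       # sample standard deviation (n - 1)
    standard_error = sample_sd / math.sqrt(n)  # spread of the sample mean
    t_stat = (sample_mean - target) / standard_error
    return t_stat, n - 1

# Hypothetical cookie weights in ounces (made up for this sketch).
weights = [15.2, 14.9, 15.4, 15.1, 15.3, 14.8, 15.5, 15.2, 15.0, 15.3]
t_stat, df = one_sample_t(weights, target=15.0)
print(f"t = {t_stat:.2f} with {df} degrees of freedom")
```

A t statistic far from zero means the sample mean sits many standard errors away from the 15 ounce target, which is evidence against Ho.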

Since we’re attempting to characterize an entire population using sample statistics, there’s always the chance errors can be made. Let’s see what these errors look like using the diagram below.

Along the top we see the “true or actual” condition. In one situation, Ho is true and in another, Ho is false. Then, along the side, we note the decisions we can make once the hypothesis test is complete. We can either reject the null hypothesis or fail to reject the null hypothesis.

If Ho is in fact true and we fail to reject it, we make the correct decision. And if Ho is actually false and we reject it we also make the correct decision. But, if Ho is true and we reject it, we commit what is referred to as a *Type I Error*. And if Ho is false and we fail to reject it, we make a *Type II Error*.
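The four outcomes just described can be sketched as a small lookup. This is only an illustration of the decision grid, and the function name `classify_outcome` is my own.

```python
def classify_outcome(ho_is_true: bool, rejected_ho: bool) -> str:
    """Name the outcome for one combination of truth and decision."""
    if ho_is_true and rejected_ho:
        return "Type I Error"       # rejected a true null hypothesis
    if not ho_is_true and not rejected_ho:
        return "Type II Error"      # failed to reject a false null hypothesis
    return "Correct decision"

# Print the full grid: true condition versus our decision.
for ho_is_true in (True, False):
    for rejected_ho in (True, False):
        truth = "Ho true" if ho_is_true else "Ho false"
        decision = "reject Ho" if rejected_ho else "fail to reject Ho"
        print(f"{truth:9} | {decision:17} | {classify_outcome(ho_is_true, rejected_ho)}")
```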

From a statistical perspective, *the probability of rejecting a null hypothesis that’s true is denoted as Alpha* and *the probability of committing a Type II error is denoted as Beta*. Alpha, also referred to as the level of significance, is set by us.

The most common alpha value is 0.05 which means we’re willing to accept a 5% chance that we could incorrectly reject the null hypothesis. Of course, if you’re dealing with extremely critical processes, you may decide to reduce alpha to something like 0.01 which would mean you’re only willing to accept a 1% chance of incorrectly rejecting the null hypothesis.
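One way to get a feel for what alpha means is to simulate it. The sketch below draws many samples of 50 cookies from a process that really is on target (so Ho is true) and counts how often a two-sided t-test would still reject Ho. The process standard deviation of 0.5 ounces is an assumption made up for the example, and 2.0096 is the two-sided 5% cutoff of the t distribution with 49 degrees of freedom.

```python
import math
import random
import statistics

def type_one_error_rate(trials=2000, n=50, target=15.0, sigma=0.5,
                        t_critical=2.0096, seed=1):
    """Share of samples from an on-target process where Ho is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        # Ho is true here: every cookie comes from a process centered on target.
        sample = [rng.gauss(target, sigma) for _ in range(n)]
        standard_error = statistics.stdev(sample) / math.sqrt(n)
        t_stat = (statistics.mean(sample) - target) / standard_error
        if abs(t_stat) > t_critical:
            rejections += 1  # a Type I error
    return rejections / trials

print(f"Share of true-Ho samples rejected: {type_one_error_rate():.3f}")
```

The printed share should land close to 0.05, which is exactly the risk we agreed to accept when we set alpha.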

As an aside, one of the most common mistakes people make when working with hypothesis tests occurs when they talk about “accepting” the null hypothesis. And the reason this is a mistake is that *we never accept a hypothesis when working with sample statistics* since there’s always a chance we could be proven wrong once more data is collected.

This is similar to the U.S. justice system, where a defendant is never proven innocent. Instead, a defendant is either found guilty beyond a reasonable doubt or found not guilty. In the same way, instead of accepting the null hypothesis, we either reject it or fail to reject it.

Obviously, our goal when working with hypothesis tests is to correctly reject, or fail to reject, the null hypothesis. The way we improve our chance of success is by increasing the power of the test. Defined, *power is the likelihood that we’ll identify a significant effect when one really exists*. Put in terms of the errors above, power equals 1 minus Beta.

There are several factors that influence the power of a test. The first factor, and the one we have the most control of, is sample size. As our sample size increases, so does the power of the test. This means our chance of correctly rejecting the null hypothesis also increases.

Next, with hypothesis tests, we’re often attempting to determine if there are differences between two populations. For example, we may want to compare how cutting speed impacts the diameter of a drilled hole. If the difference between the actual diameters is large, the power of our test increases. But, if the differences between the actual diameters are small, our power decreases.

Finally, the variability in the overall population also impacts the power of our hypothesis test. Specifically, as the variability in our process increases power decreases, but as the variability in our process decreases our power increases.
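The three factors above (sample size, size of the true difference, and variability) can be explored with a rough calculation. The sketch below uses a large-sample normal approximation for a two-sided, one-sample test, so treat the numbers as ballpark figures rather than what statistical software would report. The function name `approx_power` and all of the input values are assumptions for illustration.

```python
import math
from statistics import NormalDist

def approx_power(n, true_shift, sigma, alpha=0.05):
    """Approximate power of a two-sided one-sample test (normal approximation)."""
    z = NormalDist()
    z_critical = z.inv_cdf(1 - alpha / 2)
    # How many standard errors the true mean sits away from the target.
    shift_in_se = abs(true_shift) * math.sqrt(n) / sigma
    return z.cdf(shift_in_se - z_critical) + z.cdf(-shift_in_se - z_critical)

# Hypothetical baseline: 50 cookies, true mean off by 0.1 oz, sigma of 0.5 oz.
print(f"baseline:           {approx_power(50, 0.1, 0.5):.2f}")
print(f"larger sample:      {approx_power(200, 0.1, 0.5):.2f}")
print(f"larger difference:  {approx_power(50, 0.3, 0.5):.2f}")
print(f"more variability:   {approx_power(50, 0.1, 1.0):.2f}")
```

Compared with the baseline, the larger sample and the larger difference raise power, while the extra variability lowers it.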

When we run a hypothesis test, the results will contain an extremely important statistic called a P‐value, which stands for probability value. It’s the probability of seeing a result at least as extreme as the one in our sample if the null hypothesis were actually true. We use the P‐value to help us determine whether we should reject, or fail to reject, the null hypothesis based on the alpha value we set.
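Statistical software handles this calculation for us, but for the curious, here is a minimal sketch of how a two-sided P‐value could be computed for a t statistic using only Python's standard library. It numerically integrates the t distribution's density with Simpson's rule. The function names and the example t statistic are hypothetical, and this is not a description of how any particular package does it.

```python
import math

def t_pdf(x, df):
    """Density of the Student t distribution with df degrees of freedom."""
    log_coef = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_coef - (df + 1) / 2 * math.log1p(x * x / df))

def two_sided_p_value(t_stat, df, steps=2000):
    """Probability of a t statistic at least this far from zero if Ho is true."""
    upper = abs(t_stat)
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = 2 * (h / 3) * total  # area between -|t| and +|t|
    return max(0.0, 1.0 - central_area)

# Hypothetical t statistic from a sample of 50 cookies (49 degrees of freedom).
print(f"P-value = {two_sided_p_value(2.68, 49):.3f}")
```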

To explain this further, let’s go back to our cookie weight example. Let’s assume we set our alpha value to 0.05, meaning we’re willing to accept a 5% chance that we could reject Ho when, in actuality, we should have failed to reject it.

As a reminder for this hypothesis test, our null hypothesis, or Ho, is that the cookie weight mean equals 15 ounces and the alternate hypothesis, or Ha, is that the overall population cookie weight mean is not equal to 15 ounces. We then use statistical software like Minitab or SigmaXL to analyze the results of the 1 Sample t Test. And when we do we learn that the P‐value is 0.01 which is obviously less than our alpha value of 0.05.

When our P‐value is less than our alpha value, we reject the null hypothesis. Let me say that again because this is extremely important. *When our P‐value is less than our alpha value, we reject the null hypothesis*. And when our P‐value is greater than our alpha value, we fail to reject the null hypothesis.
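The rule is simple enough to write down in a few lines. Here is a minimal sketch, with `decide` as a hypothetical helper name. How to treat a P‐value exactly equal to alpha is a convention this article doesn't address, so the sketch only rejects when the P‐value is strictly less than alpha.

```python
def decide(p_value, alpha=0.05):
    """Apply the decision rule: reject Ho only when the P-value is below alpha."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("a P-value must be between 0 and 1")
    # Assumption: a tie (p_value == alpha) counts as "fail to reject".
    return "reject Ho" if p_value < alpha else "fail to reject Ho"

print(decide(0.01))              # cookie example: P-value 0.01, alpha 0.05
print(decide(0.20))              # a high P-value
print(decide(0.03, alpha=0.01))  # a stricter alpha changes the decision
```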

Again, in this example, our P‐value ended up being less than 0.05, which means we reject the null hypothesis and state that there is statistically significant evidence, at our 0.05 level of significance, that the mean cookie weight is not 15 ounces (the observed P‐value was 0.01 in our cookie example).

Now, if this is the first time you’ve ever been exposed to this topic this may be a little confusing. The reason this concept can be challenging is because of all the double negatives.

I mean who fails to reject something, right? Well, let me share a little rhyme that I’ve personally used for at least 20 years of my professional life to help me remember how to deal with P‐values.

*“If P is low, Ho must go. If P is high, Ho’s the guy.”*

In other words, if our P‐value is low, meaning less than our alpha value, we reject Ho which means it must go. If P is high, we fail to reject Ho since Ho’s the guy everyone wants to hang out with so why would we ever reject him? Go ahead and write this little rhyme down, I promise you’ll find it helpful down the road.

Now, if statistics aren’t your thing your head may hurt a little right about now. Don’t stress. In the end, the most important thing to remember is that hypothesis tests let us study a subgroup (sample) in order to judge whether an effect in the entire group (population) is statistically significant.

So, when you read about things like the most recent remdesivir study just know they tested a small group of people (sample) in order to better understand how the drug may impact (or not impact) the entire world (population). And when they do the test they will use things like P-values to determine whether they should reject, or fail to reject, the null hypothesis.

## Jason Stokes

May 1, 2020 - 7:12 am

I still use that rhyme 14 years later. Thanks for your teaching and coaching, Ron. Great article that makes it easy to understand (which is tough to do!)

## Ron Pereira

May 1, 2020 - 8:48 am

Thanks, Jason. Good to hear from you! I hope you’re well.

## Christian Pezo

May 2, 2020 - 9:54 am

Excellent, thank you very much for the lesson!