revised 30 Mar 2013

Hypothesis Tests: Six Steps (Plus One)

Copyright © 2010–2013 by Stan Brown, Oak Road Systems

See also: Inferential Statistics: Basic Cases and Top 10 Mistakes of Hypothesis Tests

Advice: Always number your steps. That helps others find the key features of your test, and it helps you to make sure you don’t forget any steps.

The name I give to each step is just a convenient reference. Write down the step numbers, not these step names, when doing your hypothesis test.

1. Hypotheses

Always state them in symbols. When it’s helpful, particularly for two-population tests, state them in English as well.

H0 and H1, the null hypothesis and the alternative hypothesis, are always identical except for the relational symbol. Case 0 through 3 hypotheses will always look like one of these three possibilities:

H0: (parameter) = (number)
H1: (parameter) < (number)
or
H0: (parameter) = (number)
H1: (parameter) ≠ (number)
or
H0: (parameter) = (number)
H1: (parameter) > (number)

(parameter) is μ or μd for numeric data, p for binomial data. Use the right one! (For Case 3, the parameter is μd and you must define d before writing the hypotheses.)

The ≠ test is called a two-tailed test because it tests for a difference in either direction, above or below; the others are one-tailed tests. See One-Tailed or Two-Tailed Hypothesis Test? for help deciding whether you need a one-tailed or two-tailed test.

The (number) is what you are testing for — the claim. Never use sample data here!

Example: Suppose management claims that the average deposit is $200 with standard deviation $45, and your random sample of 100 deposits has a mean of $189.56. Here are your hypotheses:

(1) H0: μ = 200, management’s claim is correct
H1: μ ≠ 200, management’s claim is wrong

(This example will continue through the other steps.)

Notice that the words add something; they’ll be helpful when you come to write your conclusion in Step 6. If all you said in words was “the mean is not 200”, that wouldn’t be helpful enough to be worth the ink it takes.

For Cases 4 and 5, you compare parameters of two populations:

Case 4 (numeric data): Define Population 1 and Population 2, then
H0: μ1 = μ2
H1: μ1 < μ2
or
H0: μ1 = μ2
H1: μ1 ≠ μ2
or
H0: μ1 = μ2
H1: μ1 > μ2

Case 5 (binomial data): Define Population 1 and Population 2, then
H0: p1 = p2
H1: p1 < p2
or
H0: p1 = p2
H1: p1 ≠ p2
or
H0: p1 = p2
H1: p1 > p2

For Cases 6 and 7, there are no parameters and you give H0 and H1 in words. Case 6 hypotheses are something like

H0: The (state the model) model is good

H1: The model is bad

For Case 7, which tests the statistical significance of a two-way table, the hypotheses vary more, but one common formulation is

H0: (row variable) is independent of (column variable)

H1: (row variable) is not independent of (column variable)

However it’s phrased, H0 is always “nothing special going on here” and H1 is always “something is happening.”

2. Significance Level

State α here. Usually it will be given to you in the problem, but if you have to pick an α yourself then you pick lower α when the consequences of a Type I error are more severe, and you pick higher α when the consequences of a Type I error are less severe.

α = 0.05 is the most common choice, especially in a business context. α = 0.01 and α = 0.001 are less common; one application is research involving human medicine.

Continuing with the example:

(2) α = 0.05

Always write α as a decimal, never as a percentage.

RC. Requirements Check

This step isn’t numbered, because it can occur at different places in the sequence. For Cases 0 through 4, you can check requirements at this point; for Cases 5 through 7, it’s easier to check requirements after calculating the p-value in Steps 3–4.

Continuing with the example:

(RC) n = 100 > 30. Therefore the sampling distribution is normal.

For the specific requirements by data type, see Inferential Statistics: Basic Cases. Caution: test requirements for Case 2 by using po, the number specified in the hypotheses; for Case 5, use p̂, the blended proportion on your TI-83/84 output screen.

3–4. Test Statistic and p-Value

This is the heart of a hypothesis test. You assume that the null hypothesis is true, and then use what you know about the sampling distribution to ask: How likely is this sample, given that null hypothesis?

Definition: A test statistic is a standardized measure of the discrepancy between your null hypothesis and your sample.

Definition: The p-value is the probability of getting your sample, or a sample even further from H0, if H0 is true. (For more on this, see What Does the p-Value Mean?)

In our example, the sample mean is $189.56 and the null hypothesis says that the population mean is $200.00. The difference is $10.44, but is that too large to believe? That depends on the sampling distribution of the mean. The standard error of the mean (SEM) is

σx̄ = σ/√n = 45/√100 = $4.50

Now, how many standard errors is the mean of our sample above or below the mean of the population? This is a good old z-score:

z = (x̄ − μo) / (σ/√n) = (189.56 − 200) / (45/√100) = −2.32

When you know the standard deviation of the population, z is the test statistic. In this case, z = −2.32, so the sample mean is 2.32 standard errors below the population mean.
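As a quick check of the arithmetic above, here is a minimal Python sketch that recomputes the standard error and the z-score from the example's numbers. Python is not part of this page's TI-83/84 workflow, and the variable names are just illustrative.

```python
import math

# Values from the running example
mu_0 = 200.0     # population mean claimed in H0
sigma = 45.0     # population standard deviation (given by management)
x_bar = 189.56   # sample mean
n = 100          # sample size

# Standard error of the mean: sigma / sqrt(n)
sem = sigma / math.sqrt(n)

# Test statistic: how many standard errors x_bar lies from mu_0
z = (x_bar - mu_0) / sem

print(f"SEM = {sem:.2f}")  # SEM = 4.50
print(f"z = {z:.2f}")      # z = -2.32
```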

Next you ask: How likely is a sample mean of $189.56 (or one even more different from $200) in that sampling distribution? You already know how to compute the probability, which we call a p-value. But your calculator has a custom menu for this hypothesis test: press [STAT] [◄] [1] for the Z-Test screen. Here are the inputs and outputs, with a sketch of the sampling distribution:

[Figure: Z-Test input screen: Stats, μo=200, σ=45, x̄=189.56, n=100, μ≠μo. Z-Test output screen: z=−2.32, p=.020340828. Sketch of the sampling distribution of x̄ with mean 200: the left tail is bounded at 189.56, with an equal and opposite right tail.]

What is this telling you?

You don’t actually use the z-score, but I want you to understand something about what a test statistic is. Every case we study will have a different test statistic, and in fact choosing a test statistic is the main difference between cases.
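If you want to see what the Z-Test screen is doing behind the scenes, the following Python sketch reproduces its two outputs. It is only an illustration under stated assumptions: the function name z_test and its tail argument are made up for this example, and the normal areas come from Python's standard statistics.NormalDist rather than from the calculator.

```python
from statistics import NormalDist

def z_test(mu_0, sigma, x_bar, n, tail="two"):
    """One-sample z-test with known sigma (hypothetical helper).

    tail: "two" for H1: mu != mu_0, "left" for H1: mu < mu_0,
          "right" for H1: mu > mu_0.
    Returns the test statistic z and the p-value.
    """
    sem = sigma / n ** 0.5           # standard error of the mean
    z = (x_bar - mu_0) / sem         # test statistic
    left_area = NormalDist().cdf(z)  # area to the left of z
    if tail == "left":
        p = left_area
    elif tail == "right":
        p = 1.0 - left_area
    elif tail == "two":
        # Double the smaller tail to count both directions
        p = 2.0 * min(left_area, 1.0 - left_area)
    else:
        raise ValueError("tail must be 'two', 'left', or 'right'")
    return z, p

z, p = z_test(200, 45, 189.56, 100)
print(f"z={z:.2f}, p={p:.4f}")  # z=-2.32, p=0.0203
```

The two-tailed branch doubles the smaller tail area, which matches the "equal and opposite right tail" in the sketch of the sampling distribution.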

There were some deep ideas in this section, but they boil down to writing two simple lines. To show your work, write down the screen name, all of the inputs including the hypothesis on the next-to-last line of the screen, and all outputs that don’t duplicate inputs.

Continuing with the example:

(3–4) Z-Test: μo=200, σ=45, x̄=189.56, n=100, μ≠μo

outputs: z=−2.32, p=0.0203

By convention, we always round the test statistic to two decimal places and the p-value to four decimal places. Caution! Watch for powers of 10 (E minus whatever) and never write something daft like “p-value = 5.6212”.
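The powers-of-10 caution is easy to see in a short Python sketch. The readout 5.6212E-4 below is a hypothetical calculator display invented for illustration; the point is that "E-4" means "times 10 to the −4", so the p-value is tiny, not 5.6212.

```python
# Hypothetical calculator readout in scientific notation
readout = "5.6212E-4"

p = float(readout)      # Python reads "E-4" as "times 10 to the -4"
p_text = f"{p:.4f}"     # p-value rounded to four decimal places
z_text = f"{-2.319999:.2f}"  # test statistic rounded to two decimal places

print(p_text)  # 0.0006
print(z_text)  # -2.32
```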

5. Decision Rule

There are two and only two correct things you can say here:

p < α. Reject H0 and accept H1

or

p > α. Fail to reject H0.

These are standard language, so don’t get creative. You can add the numbers, if you like — p < α (0.0203<0.05) — but the symbols are required.

Continuing with the example:

(5) p < α. Reject H0 and accept H1.

Since p < α, this sample is too unusual; it’s further away from H0 than we can expect from random chance; the data cast too much doubt on H0. We say that the result is statistically significant.
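The decision rule is mechanical enough to write as a tiny function. This is just a sketch: the name decide is made up, and treating the (practically nonexistent) case p = α as "fail to reject" is an assumption, not something this page specifies.

```python
def decide(p, alpha):
    """Return the standard Step 5 wording for a p-value and alpha."""
    if p < alpha:
        return "p < α. Reject H0 and accept H1."
    # Assumption: p exactly equal to alpha is lumped in here.
    return "p > α. Fail to reject H0."

print(decide(0.0203, 0.05))  # p < α. Reject H0 and accept H1.
print(decide(0.1009, 0.05))  # p > α. Fail to reject H0.
```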

6. Conclusion

Write your conclusion in English, mentioning the significance level. If you failed to reject H0, write your conclusion in neutral language. Please see Proper Conclusions to Your Hypothesis Tests.

When you reject H0 and accept a two-tailed H1, you can draw a further conclusion than just “different from”. See p < α in Two-Tailed Test: What Does It Tell You?.

Continuing with the example:

(6) At the 0.05 level of significance, the true mean of all deposits is different from $200 and management’s figures are incorrect. In fact, the true mean of all deposits is less than $200.
Or,
(6) The true mean of all deposits is different from $200 (p = 0.0203), and management’s figures are incorrect. In fact, the true mean of all deposits is less than $200.

Your conclusion must include either the significance level or the p-value. p-values give more information, but books generally include significance levels.

What Can Go Wrong?

Even if you do everything right, sample variability means you are never certain of your conclusion, and sometimes you will reach a wrong conclusion without knowing it.

You should understand Type I and Type II errors. A Type I error is rejecting H0 when it’s actually true, and a Type II error is failing to reject H0 when it’s actually false. If H0 is “the defendant is innocent”, then a Type I error would convict an innocent person and a Type II error would let a guilty person go free.

These are not errors in the sense of mistakes, but they do represent incorrect results. If your significance level (Step 2) is 0.05, and you do everything right, still in the long run you’ll reject a true H0 about one time in twenty (0.05 = 1/20). The problem, of course, is that you don’t know which one out of twenty.
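You can watch the "one time in twenty" happen with a small simulation. The sketch below is illustrative only: it assumes deposits are normally distributed with mean exactly $200 (so H0 really is true) and standard deviation $45, runs 2000 two-tailed z-tests, and counts how often H0 gets rejected anyway. The seed and the number of trials are arbitrary choices.

```python
import random
from statistics import NormalDist, fmean

random.seed(1)  # fixed seed so the run is repeatable

mu_0, sigma, n, alpha = 200.0, 45.0, 100, 0.05
trials = 2000
sem = sigma / n ** 0.5

rejections = 0
for _ in range(trials):
    # Draw a sample from a population where H0 really is true
    sample = [random.gauss(mu_0, sigma) for _ in range(n)]
    z = (fmean(sample) - mu_0) / sem
    p = 2.0 * NormalDist().cdf(-abs(z))  # two-tailed p-value
    if p < alpha:
        rejections += 1                  # a Type I error

rate = rejections / trials
print(f"Rejected a true H0 in {rate:.1%} of {trials} tests")
```

The printed rate should land near 5%, though any single run will wobble a little around that figure.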

You can never eliminate the possibility of a Type I or Type II error, but a larger sample lets you make both less likely. The significance level α is the probability of a Type I error when H0 is true, so if you can’t live with a 5% chance of a Type I error then you choose a lower α. There’s no free lunch, though: if you lower α at a given sample size, you make a Type II error more likely. So you have to weigh the seriousness of a Type I error against that of a Type II error.
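To make the trade-off concrete, here is a hedged Python sketch that computes the probability of a Type II error for the two-tailed z-test in the deposit example. The true mean of $190 is purely hypothetical (in real life you never know it), and the function name is invented for this illustration.

```python
from statistics import NormalDist

def type_ii_probability(mu_0, mu_true, sigma, n, alpha):
    """Chance of failing to reject H0: mu = mu_0 in a two-tailed z-test
    when the population mean is really mu_true (hypothetical helper)."""
    std_normal = NormalDist()
    sem = sigma / n ** 0.5
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)  # reject when |z| > z_crit
    shift = (mu_true - mu_0) / sem                  # where z is centered in reality
    # Probability that z falls between -z_crit and +z_crit
    return std_normal.cdf(z_crit - shift) - std_normal.cdf(-z_crit - shift)

beta_05 = type_ii_probability(200, 190, 45, 100, alpha=0.05)
beta_01 = type_ii_probability(200, 190, 45, 100, alpha=0.01)
beta_big_n = type_ii_probability(200, 190, 45, 400, alpha=0.05)

print(f"n=100, alpha=0.05: P(Type II) = {beta_05:.2f}")
print(f"n=100, alpha=0.01: P(Type II) = {beta_01:.2f}")
print(f"n=400, alpha=0.05: P(Type II) = {beta_big_n:.2f}")
```

Under these assumptions, lowering α from 0.05 to 0.01 raises the Type II probability, while quadrupling the sample size drives it down sharply.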

Should You Also Do a Confidence Interval?

In homework problems and on quizzes and exams, not unless the problem specifically asks for it.

In real life, including the ESP Lab and Field Project, a CI is often a good idea. When you reject H0 and accept H1, the CI tells you the size of the effect, which can help to gauge whether the statistically significant result is also practically significant. On the other hand, when you fail to reject H0, you can’t draw a conclusion, but at least with a CI you can get some idea of the maximum size of the effect.

If you do a CI, what’s the appropriate confidence level? Unless you have a specific reason to choose a different level, it would be 1 minus the two-tailed α, expressed as a percent. If you did a one-tailed test (> or <) at the .01 significance level, the two-tailed α would be .02 and the corresponding confidence level would be 1−.02 = 98%.
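The confidence-level rule can be sketched in Python, along with an interval for the deposit example. Two assumptions here: the helper names are made up, and the interval is a z-interval (x̄ plus or minus a critical z times σ/√n), chosen because the example gives the population standard deviation.

```python
from statistics import NormalDist

def confidence_level(alpha, tails):
    """1 minus the two-tailed alpha; tails is 1 or 2 (hypothetical helper)."""
    two_tailed_alpha = alpha if tails == 2 else 2.0 * alpha
    return 1.0 - two_tailed_alpha

def z_interval(x_bar, sigma, n, level):
    """Confidence interval for a mean when sigma is known."""
    z_star = NormalDist().inv_cdf(1.0 - (1.0 - level) / 2.0)
    margin = z_star * sigma / n ** 0.5
    return x_bar - margin, x_bar + margin

# One-tailed test at the .01 level corresponds to a 98% confidence level
print(f"{confidence_level(0.01, tails=1):.0%}")  # 98%

# Deposit example: two-tailed test at alpha = 0.05
level = confidence_level(0.05, tails=2)
low, high = z_interval(189.56, 45, 100, level)
print(f"{level:.0%} CI: ${low:.2f} to ${high:.2f}")  # 95% CI: $180.74 to $198.38
```

The interval lies entirely below $200, which agrees with rejecting H0 in the example and gives a sense of how far below $200 the true mean might be.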

Failing to Reach a Conclusion

In the earlier example, the p-value allowed you to reach a conclusion. Let’s do a similar example where you can’t reach a conclusion.

Suppose management claims that the average deposit is $200 with standard deviation $45, and your random sample of 50 deposits has a mean of $189.56. Test management’s claim at the 0.05 significance level.

This is the preceding example, but with a smaller sample. Here’s how it plays out:

(1) H0: μ = 200, management’s claim is correct
H1: μ ≠ 200, management’s claim is wrong
(2) α = 0.05
(RC) n = 50 > 30, and therefore the sampling distribution is normal.
(3–4) Z-Test: μo=200, σ=45, x̄=189.56, n=50, μ≠μo
results: z = −1.64, p = 0.1009
(5) p > α. Fail to reject H0.
(6) At the 0.05 significance level, it’s impossible to say whether management’s claim, that the average deposit is $200.00, is correct or not.
Or,
p = 0.1009, and it’s impossible to say whether management’s claim, that the average deposit is $200.00, is correct or not.
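Here is a side-by-side sketch of the two examples in Python, showing that the same sample mean leads to different decisions at n = 100 and n = 50. The function name is illustrative, and Python stands in for the calculator's Z-Test screen.

```python
from statistics import NormalDist

def two_tailed_z_test(mu_0, sigma, x_bar, n):
    """Return z and the two-tailed p-value for a known-sigma test."""
    z = (x_bar - mu_0) / (sigma / n ** 0.5)
    p = 2.0 * NormalDist().cdf(-abs(z))
    return z, p

alpha = 0.05
results = {}
for n in (100, 50):
    z, p = two_tailed_z_test(200, 45, 189.56, n)
    verdict = "reject H0" if p < alpha else "fail to reject H0"
    results[n] = (round(z, 2), round(p, 4), verdict)
    print(f"n={n}: z={z:.2f}, p={p:.4f}, {verdict}")
# n=100: z=-2.32, p=0.0203, reject H0
# n=50: z=-1.64, p=0.1009, fail to reject H0
```

The smaller sample has a larger standard error, so the same $10.44 gap is fewer standard errors away from $200 and no longer counts as unusual.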

Important: Just because you fail to disprove management’s claim, that doesn’t mean the claim is true. When p is greater than α, you fail to reach a conclusion: “It’s impossible to say whether [insert H1 in English] or not.” Don’t fall into the trap of implying that H1 is false or H0 is true; you must use neutral language.

This page is used in instruction at Tompkins Cortland Community College in Dryden, New York; it’s not an official statement of the College. Please visit www.tc3.edu/instruct/sbrown/ to report errors or ask to copy it.

For updates and new info, go to http://www.tc3.edu/instruct/sbrown/stat/