# Inferences about Linear Correlation

Copyright © 2002–2015 by Stan Brown, Oak Road Systems

**Summary:**
From the sample correlation it is possible to
**estimate the population correlation**.
This page previews the techniques of
**inferential statistics** that can estimate the correlation
coefficient of the population.

**See also:**
A downloadable Excel
workbook does all these calculations.

The downloadable TI-83/84 program
MATH200B Program part 6 computes confidence intervals and hypothesis tests.

After you have done a correlation analysis on your sample of points, there are three questions you might ask:

- **Is there any correlation in the population**, or is the sample correlation just the luck of the draw?
- What is the estimated **correlation coefficient of the population**?
- Is there any **causal relationship** between the two variables?

The first two questions are matters of inferential statistics, and this page will explore them. (Some of the details may not make much sense till you’ve studied inference in the second half of the course.)

The third question is not a statistical
one. If you find a correlation, that suggests that a cause-and-effect
relationship may be worth looking for. But the mere fact that two
variables march more or less in step **does not let you conclude** that
one causes the other. For example, consider the variables “number of
murders per year” and “number of books in the public library”. If you
gather those values for many US cities, you will find a good
correlation. But one certainly does not cause the other; instead, they
are both explained by city population. Remember,
**“correlation is not causation”**.

Spiegel & Stephens, Theory and Problems of Statistics 3/e (McGraw-Hill, 1999). See “Sampling Theory of Correlation” on page 317, and the solved problems on pages 336–337.

Some textbooks use r^{*} for the sample correlation
coefficient and r for the population correlation coefficient; others
use r for the sample and ρ (rho, pronounced “roe”) for
the population.

I’ll use r for the sample and ρ for the population, consistent with the convention “Roman letters for sample statistics, Greek letters for population parameters”.

Why does this page exist?

Until summer 2007, the textbook for TC3’s MATH200 Statistics class was Dabes & Janik’s Statistics Manual (1999). That book presented decision points based on the sample size and correlation coefficient. If | r | was greater than the decision point for the particular sample size, then you could say that there was correlation in the population. The decision points were critical values at α = 0.05 for a two-tailed test of the hypothesis “population correlation coefficient ρ is nonzero.”

I began to wonder where these numbers came from, and couldn’t find any answers on the Web, so I created this page for students who might also wonder where the numbers came from. I programmed the accompanying Excel workbook at first to check my numbers against Dabes & Janik’s, but then expanded it to compute hypothesis tests and confidence intervals for ρ.

You have some correlation coefficient r in your
sample, and you wonder whether there’s any correlation in the
population, or if the population has no correlation and this sample r
is just normal sample variability.
This is a **hypothesis test**. It asks,
**is there some correlation in the population?** In other words,
**is there a linear relationship between the two variables?** In
symbols, you want to know:
**is ρ ≠ 0?**

As always, to perform a hypothesis test you begin with the null
hypothesis. **H_{o} is ρ = 0**, meaning
that there is no
linear correlation in the population, no linear relationship between X
and Y.

Remember, any hypothesis test asks, “is my sample far
enough from H_{o} that I can rule out random chance?” You get a
handle on “far enough” by computing a p-value: the chance of
getting a sample this far from H_{o} if H_{o} is actually true.
To do that, you have to know what the
distribution of sample statistics looks like. In this case
you’re concerned with the **distribution of sample r**.
If the true population
correlation ρ is 0 (which you assume it is, since that’s
H_{o}), then the correlations of samples generate a
statistic that follows a Student’s t distribution, namely

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}}, \qquad df = n-2 \tag{1}$$

From the t statistic you can compute a p-value, which tells
you how likely it is that you would get a sample r at least as far from 0
as the one you did, just by chance, if there is no correlation at all in the
population. If that p-value is smaller than your significance level α
(usually 0.05), you reject H_{o} and conclude that the sample
correlation is *not* due to chance alone and the population does have
some correlation.

(The first block of the accompanying Excel workbook will do these calculations for you. You can also use
MATH200B Program part 6,
or `LinRegTTest` on your TI-83/84 or TI-89.)

Before you can make any inference (hypothesis test or confidence interval) about correlation or regression in the population, check these requirements:

- The data are a **simple random sample**.
- The **plot of residuals versus x is featureless**: no bending, no thickening or thinning trend from left to right, and no outliers.
- The **residuals are normally distributed**. You can check this with a normal probability plot, available in most statistics packages and in MATH200A part 4. Since the test statistic is a t, and the t test is robust, moderate departures from normality are okay.
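To check the second and third requirements you need the residuals themselves. Here is a minimal sketch of computing them from the least-squares line, so they can be plotted against x or fed to a normal probability plot. The function name `residuals` and the data values are hypothetical, made up for illustration only.

```python
def residuals(xs, ys):
    """Residuals y - y_hat from the least-squares line through (xs, ys)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sums of squares and cross-products about the means
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


# Hypothetical data, not taken from any real study.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
for x, e in zip(xs, residuals(xs, ys)):
    print(x, round(e, 3))  # one (x, residual) pair per line, ready to plot
```

With real data you would look at these pairs in a scatterplot. A curve, a funnel shape, or an isolated point far from the rest suggests that a requirement is not met.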

You measure x,y pairs from 20 individuals and find a correlation coefficient of 0.49. Can you conclude that a correlation exists in the population, at the 0.05 level of significance?

(To keep things simple, I’m just giving you the sample size and r and skipping over the requirements check.)

**Solution**: Begin by writing down the hypotheses:

H_{o}: ρ = 0,
there is no linear correlation between the variables in the population.

H_{a}:
ρ ≠ 0, there is some linear correlation in the population.

This is a two-tailed test, and α = 0.05.

With n = 20 and r = 0.49,
use equation 1 to compute the test statistic
t = 2.38 with df = 18.
From a table or a calculator find two-tailed p = 0.0283.
This is < α, so you reject H_{o} and accept H_{a},
concluding that ρ ≠ 0 in the population.
Furthermore, since the sample correlation r is
> 0 you can conclude that the population correlation ρ
is also > 0.
(See p < α in Two-Tailed Test: What Does It Tell You?)
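If you have neither the workbook nor a calculator handy, the same numbers can be sketched in plain Python. This is only an illustration, not how the Excel workbook or the TI programs work: the function names are hypothetical, the t statistic assumes the usual form t = r√(n−2)/√(1−r²), and the p-value is found by numerically integrating the Student's t density with Simpson's rule rather than by calling a statistics library.

```python
import math


def t_statistic(r, n):
    """t statistic for testing rho = 0, with df = n - 2."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


def t_pdf(x, df):
    """Density of Student's t with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value: 1 minus the area between -|t| and +|t|.

    The area from 0 to |t| is approximated with Simpson's rule
    (steps must be even), then doubled by symmetry.
    """
    b = abs(t)
    if b == 0:
        return 1.0
    h = b / steps
    total = t_pdf(0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1 - 2 * (total * h / 3)


t = t_statistic(0.49, 20)
p = two_tailed_p(t, 20 - 2)
print(round(t, 2), round(p, 4))  # reproduces t = 2.38 and p = 0.0283
```

Because p is below α = 0.05, the sketch leads to the same decision as the solution above: reject H_{o}.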

**Caution**: From this hypothesis test you
don’t know how large the population’s linear correlation
is, only that it’s greater than 0. You also don’t know
whether one variable drives the other, some third factor drives both,
or it’s just a remarkable coincidence.

You can precompute a decision point (also known as a critical value) for any sample size n and any two-tailed significance level α. If the absolute value of the sample r is greater than that decision point, you can conclude at that significance level that there is some nonzero correlation in the population.

In fact, back in lesson 4 I presented Decision Points for Correlation Coefficient with no explanation of where they came from.

The decision points are found by working the previous problem backward. First solve equation 1 for r:

$$r = \frac{t}{\sqrt{t^{2} + n - 2}} \tag{2}$$

Then from the sample size n, use MATH200B Program part 3 to find the critical t for df = n−2 and α/2. Plug n and critical t into the equation to find the critical r.

(The second block of the accompanying Excel workbook will do these calculations for you.)
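The algebra is not spelled out on this page. One way to get from equation 1 to equation 2, assuming equation 1 has the usual form t = r√(n−2)/√(1−r²), is to square both sides, collect the r² terms, and take the square root:

$$
\begin{aligned}
t &= \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}} \\
t^{2}\,(1-r^{2}) &= r^{2}\,(n-2) \\
t^{2} &= r^{2}\,(t^{2}+n-2) \\
r &= \frac{t}{\sqrt{t^{2}+n-2}}
\end{aligned}
$$

The last step keeps the sign of t, because r and t always have the same sign in equation 1.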

For a sample size of n = 30, what is the decision point at the 0.05 level of significance?

**Solution**: If n = 30 then df = 28.
Compute α/2 = 0.05/2 = 0.025 for the one-tailed significance
level.
Then the critical t is t(28,0.025) = 2.048.
(You can find this using MATH200B Program part 3 in your TI-83/84, with the
Stats/List editor application on your TI-89,
with Excel, or with tables.)
Substituting n and t in
equation 2 yields critical r or a decision point of 0.361.

Conclusion: In a sample of 30 points, at the 0.05 significance level, if r > 0.361 then the linear correlation coefficient of the population, ρ, is positive. If r < −0.361 then ρ is negative. If r is between −0.361 and +0.361, you can’t tell anything about ρ.
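The decision-point calculation can be sketched in a few lines of Python. The function names are hypothetical, and the critical t is passed in as a number you have already looked up, since the standard library has no inverse t distribution.

```python
import math


def critical_r(t_crit, n):
    """Decision point for |r|, given the critical t for df = n - 2."""
    return t_crit / math.sqrt(t_crit ** 2 + n - 2)


def conclusion(r, t_crit, n):
    """What the sample r lets you say about rho at the chosen alpha."""
    cutoff = critical_r(t_crit, n)
    if r > cutoff:
        return "rho is positive"
    if r < -cutoff:
        return "rho is negative"
    return "no conclusion about rho"


# n = 30 and critical t(28, 0.025) = 2.048, as in the example above
print(round(critical_r(2.048, 30), 3))   # 0.361
print(conclusion(0.25, 2.048, 30))       # no conclusion about rho
```

If SciPy happens to be installed, `scipy.stats.t.ppf(1 - alpha / 2, n - 2)` is one way to obtain the critical t instead of using a table.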

This business of “different from zero” is all very well theoretically, but wouldn’t it be more useful to estimate the linear correlation coefficient of the population (ρ) from the linear correlation coefficient of the sample (r) in a confidence interval?

Indeed it would, but there’s a catch. The t distribution from equation 1 worked only when testing for a population correlation coefficient of 0, which would give a symmetrical sampling distribution of r. But when computing a confidence interval for ρ, you can’t also assume that ρ = 0 and therefore you can’t use that t statistic.

To resolve this paradox, use Fisher’s Z transformation, which is defined like this:

$$Z = \frac{1}{2}\ln\!\left(\frac{1+r}{1-r}\right) \tag{3}$$

where “ln” is the natural or base-e logarithm. (Your calculator has a key for it, and you use LN( ) in Excel.) Fisher’s Z is a bit nasty to compute, but it is approximately normally distributed no matter what the population ρ might be. Its standard deviation is 1/√(n−3).

To compute a confidence interval for ρ, transform r to Z and compute the confidence interval of Z as you would for any normal distribution with σ = 1/√(n−3). To transform the Z interval to an interval about ρ, you need to solve equation 3 for r, like this:

$$r = \frac{e^{2Z}-1}{e^{2Z}+1} \tag{4}$$

Plug the low Z into equation 4 to compute the lower limit on ρ, then plug in the high Z to compute the higher limit on ρ.

(The third block of the accompanying Excel workbook will do these calculations for you. You can also use MATH200B Program part 3 on your TI-83 or TI-84.)
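For readers who want to see the algebra, here is one way to invert Fisher's transformation, assuming equation 3 is the standard definition Z = ½ ln((1+r)/(1−r)):

$$
\begin{aligned}
Z &= \tfrac{1}{2}\ln\frac{1+r}{1-r} \\
e^{2Z} &= \frac{1+r}{1-r} \\
e^{2Z}\,(1-r) &= 1+r \\
e^{2Z}-1 &= r\,(e^{2Z}+1) \\
r &= \frac{e^{2Z}-1}{e^{2Z}+1}
\end{aligned}
$$

Equivalently, Z is the inverse hyperbolic tangent of r, and r is the hyperbolic tangent of Z, so a calculator or spreadsheet with those functions gives the same results.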

A sample of 25 points shows a linear correlation coefficient of
0.84. What is the 95% confidence interval for the correlation
coefficient in the population? (Again,
to keep things simple I’m giving you the sample statistics
instead of the raw data, and we’ll assume that the requirements are met. But in real life,
**always check the requirements** before computing a confidence
interval.)

The solution is a wild ride; hang on! (The Excel spreadsheet shows the intermediate steps.)

(a) From 1−α = 0.95, find α/2 = 0.025. Use a table, use TI-83/84/89 invNorm, or use Excel NORMSINV( ) to find that the 95% confidence interval is bounded by z = ±1.96.

(b) That critical z of ±1.96 bounds the confidence interval in the
*standard*
normal distribution with σ=1; for *this* one you must
multiply by the standard deviation of the Fisher Z, which is
1/√(n−3).
For n = 25 points that is
σ = 1/√(25−3) = 0.213.
Multiplying by the 1.96 from (a) gives
E = 0.418. E is the error of
the estimate, which is half the width of the confidence
interval for Fisher’s transformed Z.

(c) Now use equation 3 and r = 0.84 to compute Z = 1.221. This is the Fisher Z for this particular sample. Using the result from (b), the confidence interval for the transformed Z is 1.221 ± 0.418, which is 0.803 to 1.639.

(d) Plug those Fisher-Z endpoints into equation 4. Z = 0.803 yields ρ = 0.666, and Z = 1.639 yields ρ = 0.927.

Conclusion: If 25 points have a linear correlation coefficient
of 0.84, then
**you’re 95% confident that the population’s linear correlation coefficient is between 0.666 and 0.927.**

Remark: The sample statistic 0.84 is not at the middle of the confidence interval, because the sample r values have a skewed distribution around the population correlation coefficient ρ.
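Steps (a) through (d) can be sketched in Python using only the standard library. This is an illustration, not the workbook's method: the function name `fisher_ci` is hypothetical, and the two transformations are written out assuming the standard Fisher formulas Z = ½ ln((1+r)/(1−r)) and r = (e^{2Z}−1)/(e^{2Z}+1).

```python
import math
from statistics import NormalDist


def fisher_ci(r, n, confidence=0.95):
    """Confidence interval for rho from sample r and sample size n."""
    alpha = 1 - confidence
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # (a) critical z, about 1.96
    sigma = 1 / math.sqrt(n - 3)                   # (b) std. dev. of Fisher's Z
    margin = z_crit * sigma                        #     E, half the interval width
    z = 0.5 * math.log((1 + r) / (1 - r))          # (c) transform r to Z
    low_z, high_z = z - margin, z + margin

    def back(zv):                                  # (d) transform Z back to r
        return (math.exp(2 * zv) - 1) / (math.exp(2 * zv) + 1)

    return back(low_z), back(high_z)


low, high = fisher_ci(0.84, 25)
print(round(low, 3), round(high, 3))  # 0.666 0.927
```

Note that 0.84 − 0.666 is larger than 0.927 − 0.84, which is the asymmetry described in the remark above.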

- **25 Dec 2009**: rewrite the introductory section for the hypothesis test and add a requirements section; clarify the one-tailed interpretation of the two-tailed test; add a reference to the requirements in the CI example
- **4 Apr 2008**: change notation, formerly r^{*} for sample and r for population; add references to the new document “Inferences about Linear Correlation on TI-83/84/89” (since merged into MATH200B Program part 3); make a host of other small edits
- (intervening changes suppressed)
- **15 Jun 2002**: new document

This page is used in instruction at Tompkins Cortland Community College in Dryden, New York; it’s not an official statement of the College. Please visit www.tc3.edu/instruct/sbrown/ to report errors or ask to copy it.

For updates and new info, go to http://www.tc3.edu/instruct/sbrown/stat/