revised Apr 4, 2008

Inferences about Linear Correlation

Copyright © 2002–2008 by Stan Brown, Oak Road Systems

Summary:  From the sample correlation it is possible to draw inferences about the population correlation. This page previews the techniques of inferential statistics that test whether the population has any linear correlation and that estimate the correlation coefficient of the population.

See also:  A downloadable Excel workbook does all these calculations.
A downloadable TI-83/84 program computes confidence intervals and hypothesis tests.

Introduction

After you have done a correlation analysis on your sample of points, there are three questions you might ask:

  1. Is there any correlation in the population, or is the sample correlation just the luck of the draw?
  2. What is the estimated correlation coefficient of the population?
  3. Is there any causal relationship between the two variables?

The first two questions are matters of inferential statistics, and this page will explore them. (Some of the details may not make much sense till you’ve studied inference in the second half of the course.)

The third question is not a statistical one. If you find a correlation, that suggests that a cause-and-effect relationship may be worth looking for. But the mere fact that two variables march more or less in step does not let you conclude that one causes the other. For example, consider the variables “number of murders per year” and “number of books in the public library”. If you gather those values for many U.S. cities, you will find a good correlation. But one certainly does not cause the other; instead, they are both explained by city population. Remember, “correlation is not causation”.

Reference

Spiegel & Stephens, Theory and Problems of Statistics 3/e (McGraw-Hill, 1999). See “Sampling Theory of Correlation” on page 317, and the solved problems on pages 336–337.

Notation

Some textbooks use r* for the sample correlation coefficient and r for the population correlation coefficient; others use r for the sample and ρ (rho, pronounced “roe”) for the population.

We use r for the sample and ρ for the population, consistent with the convention “Roman letters for sample statistics, Greek letters for population parameters”.

Historical Note

Why does this page exist?

Until summer 2007, the textbook for TC3’s MATH200 Statistics class was Dabes & Janik’s Statistics Manual (1999). That book presented decision points based on the sample size and correlation coefficient. If |r| was greater than the decision point for the particular sample size, then we could say that there was correlation in the population. The decision points were critical values at α = 0.05 for a two-tailed test of the hypothesis “population correlation coefficient ρ is nonzero.”

I began to wonder where these numbers came from, and couldn’t find any answers on the Web, so I created this page for students who might also wonder where the numbers came from. I programmed the accompanying Excel workbook at first to check my numbers against Dabes & Janik’s, but then expanded it to compute hypothesis tests and confidence intervals for ρ.


Is There Correlation in the Population?

Of the two questions that inferential statistics can answer, the confidence interval estimate turns out to be fairly tough, but the hypothesis test is easier: Is there some correlation in the population? In other words, is ρ ≠ 0? The same question can also be phrased as “Is there a linear relationship between the two variables?” Answering it calls for a hypothesis test.

As always, to perform a hypothesis test we begin with the null hypothesis. Ho is ρ = 0. The null hypothesis is that there is no linear correlation in the population, no linear relationship between X and Y. What is our test statistic, and what is its distribution? If you have n points and the linear correlation coefficient is r, this is a sample statistic because you can look at this particular set of n points (n pairs x,y) as a sample from the population. Obviously different samples of n points could be drawn from the same population, and the correlation coefficient r could be calculated for those samples. If you could measure x,y pairs for the entire population, the population correlation coefficient ρ would probably be different from the r of any sample.

If the true population correlation ρ is 0, then the correlations of samples generate a statistic that follows a Student’s t distribution, namely

(1)  $t = r\sqrt{\dfrac{n-2}{1-r^2}}$, with $n-2$ degrees of freedom

Thinking backward, as we normally do in statistics, we can then compute a p-value, which tells us how likely it is that we would get a sample r at least as far from 0 as the one we did, just by chance, if there is no correlation at all in the population. If that p-value is smaller than the chosen significance level (usually α = 0.05), we reject Ho and conclude that the sample correlation is not due to chance and the population does have some correlation.

(The first block of the accompanying Excel workbook will do these calculations for you. You can also use a downloadable TI-83/84 program, or LinRegTTest on your TI-83/84 or TI-89.)

Example

You measure x,y pairs from 20 individuals and find a correlation coefficient of 0.49. Can you conclude that a correlation exists in the population, at the 0.05 level of significance?

Solution: Begin by writing down the hypotheses:

     Ho: ρ = 0, there is no linear correlation between the variables in the population.

     Ha: ρ ≠ 0, there is some linear correlation in the population.

This is a two-tailed test, and α = 0.05.

With n = 20 and r = 0.49, use equation 1 to compute the test statistic t = 2.38 with df = 18. From a table or a calculator find two-tailed p = 0.0283. Since p < α, you reject Ho and conclude that ρ ≠ 0 in the population. (Caution: You don’t know how large the population’s linear correlation is, only that it is greater than 0. You also don’t know whether one variable drives the other, some third factor drives both, or it’s just a remarkable coincidence.)
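
If you would rather check the arithmetic in code than with Excel or a TI calculator, here is a minimal Python sketch of equation 1 and the two-tailed p-value. It is only an illustration and not part of the accompanying workbook: the function names (t_pdf, t_upper_tail, correlation_t_test) are made up for this sketch, it assumes |r| < 1, and it gets the tail area by integrating the t density numerically with Simpson’s rule so that it needs nothing beyond the standard library.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_upper_tail(t, df, steps=2000):
    """P(T > t) for t >= 0, using Simpson's rule on [0, t] (steps is even)."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2          # odd interior points get weight 4
        total += weight * t_pdf(i * h, df)
    return 0.5 - total * h / 3              # half the area lies above 0

def correlation_t_test(r, n):
    """Return (t, df, two-tailed p) for Ho: rho = 0, following equation 1."""
    df = n - 2
    t = r * math.sqrt(df / (1 - r * r))
    p = 2 * t_upper_tail(abs(t), df)
    return t, df, p

# Values from the example above: 20 points, r = 0.49
t, df, p = correlation_t_test(0.49, 20)
print(f"t = {t:.2f}, df = {df}, two-tailed p = {p:.4f}")
```

Run with the numbers from the example (r = 0.49, n = 20), this should agree with the t and p found there, apart from rounding in the last digit.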

Decision Points

You can precompute a decision point (also known as a critical value) for any sample size n and any two-tailed significance level α. If the absolute value of sample r is greater than that decision point, you can conclude that there is some nonzero correlation in the population.

In fact, back in lesson 4 we presented Decision Points for Correlation Coefficient with no explanation of where they came from.

The decision points are found by working the previous problem backward. First solve equation 1 for r:

(2)  $r = \dfrac{t}{\sqrt{n-2+t^2}}$

Then from the sample size n, find the critical t for df = n−2 and α/2. Plug n and critical t into the equation to find the critical r.

(The second block of the accompanying Excel workbook will do these calculations for you.)
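
For anyone who wants to see the algebra that takes equation 1 to equation 2, the steps can be written out as follows. Nothing is assumed beyond equation 1 itself.

```latex
\begin{align*}
t &= r\sqrt{\frac{n-2}{1-r^2}} \\
t^2\,(1-r^2) &= r^2\,(n-2) \\
t^2 &= r^2\,(n-2+t^2) \\
r &= \frac{t}{\sqrt{n-2+t^2}}
\end{align*}
```

Squaring loses the sign, but equation 1 shows that r and t always have the same sign, so the last line takes the square root with the sign of t.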

Decision Points Example

For a sample size of n = 30, what is the decision point at the 0.05 level of significance?

Solution: If n = 30 then df = 28. Divide α by 2 to get 0.025, the one-tailed significance level. Then the critical t is t(28,0.025) = 2.048. (You can find this with your TI-83/84/89, with Excel, or with tables.) Substituting n and t in equation 2 yields the critical r, or decision point, of 0.361.

Conclusion: In a sample of 30 points, at the 0.05 significance level, if r > 0.361 then the linear correlation coefficient of the population, ρ, is positive. If r < −0.361 then ρ is negative. If r is between −0.361 and +0.361, you can’t tell anything about ρ.
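
The same recipe can be sketched in Python. This is a rough illustration under stated assumptions, not the method the Excel workbook uses: the function names are invented for the sketch, the critical t is found by bisection on a numerically integrated tail area (Simpson’s rule), and the upper search limit of 50 is an assumption that is ample for α = 0.05 but could be too small for a very small α combined with a tiny sample.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_upper_tail(t, df, steps=2000):
    """P(T > t) for t >= 0, using Simpson's rule on [0, t] (steps is even)."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

def critical_t(df, tail_area, lo=0.0, hi=50.0):
    """Find t with P(T > t) = tail_area by bisection on [lo, hi]."""
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_upper_tail(mid, df) > tail_area:
            lo = mid      # tail still too big, so critical t is larger
        else:
            hi = mid
    return (lo + hi) / 2

def decision_point(n, alpha=0.05):
    """Critical |r| for a two-tailed test of rho = 0, following equation 2."""
    df = n - 2
    t = critical_t(df, alpha / 2)
    return t / math.sqrt(df + t * t)

# Sample size from the example above
print(f"n = 30: critical r = {decision_point(30):.3f}")
```

Calling decision_point for a range of n values would build a table of decision points like the one mentioned above.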


Confidence Interval for Correlation Coefficient

This business of “different from zero” is all very well theoretically, but wouldn’t it be more useful to estimate the linear correlation coefficient of the population (ρ) from the linear correlation coefficient of the sample (r) in a confidence interval?

Indeed it would, but there’s a catch. The t distribution from equation 1 worked only when testing for a population correlation coefficient of 0, which would give symmetrical distributions of r. But when computing a confidence interval for ρ, we can’t also assume that ρ = 0 and therefore we can’t use that t statistic.

To resolve this paradox, use Fisher’s Z transformation, which is defined like this:

(3)  $Z = \dfrac{1}{2}\ln\dfrac{1+r}{1-r}$

where “ln” is the natural or base-e logarithm. (Your calculator has a key for it, and you use LN( ) in Excel.) Fisher’s Z is a bit nasty to compute, but it is approximately normally distributed no matter what the population ρ might be. Its standard deviation is 1/√(n−3).

To compute a confidence interval for ρ, transform r to Z and compute the confidence interval of Z as you would for any normal distribution with σ = 1/√(n−3). To transform the Z interval to an interval about ρ, you need to solve equation 3 for r, like this:

(4)  $r = \dfrac{e^{2Z}-1}{e^{2Z}+1}$

Plug the low Z into equation 4 to compute the lower limit on ρ, then plug in the high Z to compute the higher limit on ρ.

(The third block of the accompanying Excel workbook will do these calculations for you. You can also use a downloadable program on your TI-83 or TI-84.)
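
Equation 4 follows from equation 3 by a few lines of algebra, spelled out here for anyone who wants to see the steps. Nothing new is assumed; the last line just notes that the result is the hyperbolic tangent, so equation 3 is the inverse hyperbolic tangent of r.

```latex
\begin{align*}
Z &= \tfrac{1}{2}\ln\frac{1+r}{1-r} \\
e^{2Z} &= \frac{1+r}{1-r} \\
e^{2Z}\,(1-r) &= 1+r \\
e^{2Z}-1 &= r\,(e^{2Z}+1) \\
r &= \frac{e^{2Z}-1}{e^{2Z}+1} = \tanh Z
\end{align*}
```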

Example

A sample of 25 points shows a linear correlation coefficient of 0.84. What is the 95% confidence interval for the correlation coefficient in the population?

The solution is a wild ride; hang on! (The Excel spreadsheet shows the intermediate steps.)

(a) From 1−α = 0.95, find α/2 = 0.025. Use a table, use TI-83/84/89 invNorm, or use Excel NORMSINV( ) to find that the 95% confidence interval is bounded by z = ±1.96.

(b) That critical z of ±1.96 bounds the confidence interval in the standard normal distribution with σ=1; for this one you must multiply by the standard deviation of the Fisher Z, which is 1/√(n−3). For n = 25 points that is σ = 1/√(25−3) = 0.213. Multiplying by the 1.96 from (a) gives E = 0.418. E is the error of the estimate, which is half the width of the confidence interval for Fisher’s transformed Z.

(c) Now use equation 3 and r = 0.84 to compute Z = 1.221. Using the result from (b), the confidence interval for the transformed Z is 1.221 ± 0.418, which is 0.803 to 1.639.

(d) Plug those Fisher-Z endpoints into equation 4. Z = 0.803 yields ρ ≥ 0.666, and Z = 1.639 yields ρ ≤ 0.927.

Conclusion: If 25 points have a linear correlation coefficient of 0.84, then with 95% confidence the population has a linear correlation coefficient between 0.666 and 0.927.

Remark: The sample statistic 0.84 is not at the middle of the confidence interval, because the sample r values have a skewed distribution around the population correlation coefficient ρ.
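
Steps (a) through (d) can be sketched in a few lines of Python. This is a minimal illustration, not the workbook’s implementation: the function name correlation_confidence_interval is made up for the sketch, it assumes |r| < 1 and n > 3, and it uses statistics.NormalDist from the standard library (Python 3.8 or later) in place of a normal table.

```python
import math
from statistics import NormalDist

def correlation_confidence_interval(r, n, confidence=0.95):
    """Confidence interval for rho using Fisher's Z (equations 3 and 4)."""
    # (a) critical z for the requested confidence level
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # (b) margin of error in the transformed scale, sigma = 1/sqrt(n-3)
    margin = z_crit / math.sqrt(n - 3)
    # (c) transform r to Fisher's Z with equation 3
    z = 0.5 * math.log((1 + r) / (1 - r))

    def back_transform(zv):
        # (d) equation 4: convert a Z endpoint back to a correlation
        e = math.exp(2 * zv)
        return (e - 1) / (e + 1)

    return back_transform(z - margin), back_transform(z + margin)

# Values from the example above: 25 points, r = 0.84
low, high = correlation_confidence_interval(0.84, 25)
print(f"95% confidence interval for rho: {low:.3f} to {high:.3f}")
```

Because the code carries full precision through every step instead of rounding at (a), (b), and (c), its endpoints may differ from a hand calculation in the last digit. The interval is symmetric around Z but not around r, which is the skewness mentioned in the remark above.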


This page is used in instruction at Tompkins Cortland Community College in Dryden, New York; it’s not an official statement of the College. Please visit www.tc3.edu/instruct/sbrown/ to report errors or ask to copy it.

For updates and new info, go to http://www.tc3.edu/instruct/sbrown/stat/