Inferences about Linear Correlation
Copyright © 2002–2013 by Stan Brown, Oak Road Systems
Summary: From the sample correlation it is possible to estimate the population correlation. This page previews the techniques of inferential statistics that can estimate the correlation coefficient of the population.
A downloadable Excel workbook does all these calculations.
The downloadable TI-83/84 program MATH200B Program part 6 computes confidence intervals and hypothesis tests.
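After you have done a correlation analysis on your sample of points, there are three questions you might ask: (1) Is there any linear correlation in the population? (2) How strong is the linear correlation in the population? (3) Does one variable cause the other?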
The first two questions are matters of inferential statistics, and this page will explore them. (Some of the details may not make much sense till you’ve studied inference in the second half of the course.)
The third question is not a statistical one. If you find a correlation, that suggests that a cause-and-effect relationship may be worth looking for. But the mere fact that two variables march more or less in step does not let you conclude that one causes the other. For example, consider the variables “number of murders per year” and “number of books in the public library”. If you gather those values for many U.S. cities, you will find a good correlation. But one certainly does not cause the other; instead, they are both explained by city population. Remember, “correlation is not causation”.
Spiegel & Stephens, Theory and Problems of Statistics 3/e (McGraw-Hill, 1999). See “Sampling Theory of Correlation” on page 317, and the solved problems on pages 336–337.
Some textbooks use r* for the sample correlation coefficient and r for the population correlation coefficient; others use r for the sample and ρ (rho, pronounced “roe”) for the population.
I’ll use r for the sample and ρ for the population, consistent with the convention “Roman letters for sample statistics, Greek letters for population parameters”.
Why does this page exist?
Until summer 2007, the textbook for TC3’s MATH200 Statistics class was Dabes & Janik’s Statistics Manual (1999). That book presented decision points based on the sample size and correlation coefficient. If | r | was greater than the decision point for the particular sample size, then you could say that there was correlation in the population. The decision points were critical values at α = 0.05 for a two-tailed test of the hypothesis “population correlation coefficient ρ is nonzero.”
I began to wonder where these numbers came from, and couldn’t find any answers on the Web, so I created this page for students who might also wonder where the numbers came from. I programmed the accompanying Excel workbook at first to check my numbers against Dabes & Janik’s, but then expanded it to compute hypothesis tests and confidence intervals for ρ.
You have some correlation coefficient r in your sample, and you wonder whether there’s any correlation in the population, or if the population has no correlation and this sample r is just normal sample variability. This is a hypothesis test. It asks, is there some correlation in the population? In other words, is there a linear relationship between the two variables? In symbols, you want to know: is ρ ≠ 0?
As always, to perform a hypothesis test you begin with the null hypothesis. Ho is ρ = 0, meaning that there is no linear correlation in the population, no linear relationship between X and Y. If Ho is true, then the correlation coefficient r of your sample is within normal sample variability for a sample of this size drawn from a population where ρ = 0. If Ho is false, then the alternative H1 is true, and there is a linear relationship between variables X and Y in the population: H1 is ρ ≠ 0.
Remember, any hypothesis test asks, “is my sample far enough from Ho that I can rule out random chance?” You get a handle on “far enough” by computing a p-value: the chance of getting a sample this far from Ho if Ho is actually true. To do that, you have to know what the distribution of sample statistics looks like. In this case you’re concerned with the distribution of sample r. If the true population correlation ρ is 0 (which you assume it is, since that’s Ho), then a statistic computed from the sample r follows a Student’s t distribution with n−2 degrees of freedom, namely
Equation 1: t = r·√(n−2) / √(1−r²), with df = n−2
From the t statistic you can compute a p-value, which tells you how likely it is that you would get a sample r at least as far from 0 as yours, just by chance, if there is no correlation at all in the population. If that p-value is less than your significance level α (usually 0.05), you reject Ho and conclude that the sample correlation is not due to chance alone and the population does have some correlation.
(The first block of the accompanying Excel workbook will do these calculations for you. You can also use MATH200B Program part 6, or LinRegTTest on your TI-83/84 or TI-89.)
Before you can make any inference (hypothesis test or confidence interval) about correlation or regression in the population, check these requirements:
You measure x,y pairs from 20 individuals and find a correlation coefficient of 0.49. Can you conclude that a correlation exists in the population, at the 0.05 level of significance?
(To keep things simple, I’m just giving you the sample size and r and skipping over the requirements check.)
Solution: Begin by writing down the hypotheses:
Ho: ρ = 0, there is no linear correlation between the variables in the population.
H1: ρ ≠ 0, there is some linear correlation in the population.
This is a two-tailed test, and α = 0.05.
With n = 20 and r = 0.49, use equation 1 to compute the test statistic t = 2.38 with df = 18. From a table or a calculator find two-tailed p = 0.0283. This is < α, so you reject Ho and accept H1, concluding that ρ ≠ 0 in the population. Furthermore, since the sample correlation r is > 0 you can conclude that the population correlation ρ is also > 0. (See p < α in Two-Tailed Test: What Does It Tell You?)
Caution: From this hypothesis test you don’t know how large the population’s linear correlation is, only that it’s greater than 0. You also don’t know whether one variable drives the other, some third factor drives both, or it’s just a remarkable coincidence.
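If you would rather check this test in code than on a calculator, here is a minimal Python sketch. It assumes only the t statistic from the worked example (t = r·√(n−2)/√(1−r²), df = n−2). The function names are my own, not part of the workbook or the TI program, and the p-value comes from a simple numerical integration (Simpson's rule) of the t density rather than from a statistics library.

```python
import math

def t_statistic(r, n):
    """t = r*sqrt(n-2)/sqrt(1-r^2), to be used with df = n-2."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value: 1 minus the area between -|t| and +|t|.

    The area from 0 to |t| is approximated with Simpson's rule
    (steps must be even) and doubled, because the density is symmetric.
    """
    a = abs(t)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_0_to_t = total * h / 3
    return 1 - 2 * area_0_to_t

# Worked example from the text: n = 20 pairs, r = 0.49, alpha = 0.05
n, r, alpha = 20, 0.49, 0.05
t = t_statistic(r, n)
p = two_tailed_p(t, n - 2)
print(f"t = {t:.2f}, df = {n - 2}, two-tailed p = {p:.4f}")
print("reject Ho" if p < alpha else "fail to reject Ho")
```

The result should agree with the t and p quoted in the example, up to rounding. One limitation of this sketch: math.gamma overflows for very large df (a few hundred), so a statistics library is the better tool for big samples.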
You can precompute a decision point (also known as critical value) for any sample size n and any two-tailed significance level α. If the absolute value of sample r is greater than that decision point, you can conclude that there is some nonzero correlation in the population.
In fact, back in lesson 4 I presented Decision Points for Correlation Coefficient with no explanation of where they came from.
The decision points are found by working the previous problem backward. First solve equation 1 for r:
Equation 2: r = t / √(t² + n − 2)
Then from the sample size n, use MATH200B Program part 3 to find the critical t for df = n−2 and α/2. Plug n and critical t into the equation to find the critical r.
(The second block of the accompanying Excel workbook will do these calculations for you.)
For a sample size of n = 30, what is the decision point at the 0.05 level of significance?
Solution: If n = 30 then df = 28. Divide α by 2 to get 0.025, the area in each tail. Then the critical t is t(28,0.025) = 2.048. (You can find this using MATH200B Program part 3 in your TI-83/84, with the Stats/List editor application on your TI-89, with Excel, or with tables.) Substituting n and t in equation 2 yields a critical r, or decision point, of 0.361.
Conclusion: In a sample of 30 points, at the 0.05 significance level, if r > 0.361 then the linear correlation coefficient of the population, ρ, is positive. If r < −0.361 then ρ is negative. If r is between −0.361 and +0.361, you can’t tell anything about ρ.
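The decision point in this example can be checked with a few lines of Python. This is only a sketch: it takes the critical t = 2.048 from the solution above as given instead of computing it, uses the relation r = t/√(t² + n − 2), and the names critical_r and verdict are hypothetical helpers of mine.

```python
import math

def critical_r(t_crit, n):
    """Decision point for r: r = t / sqrt(t^2 + n - 2)."""
    return t_crit / math.sqrt(t_crit ** 2 + n - 2)

def verdict(r, r_crit):
    """Compare a sample r against the decision point."""
    if r > r_crit:
        return "rho is positive"
    if r < -r_crit:
        return "rho is negative"
    return "no conclusion about rho"

n = 30
t_crit = 2.048  # critical t for df = 28 with 0.025 in each tail (from the text)
r_crit = critical_r(t_crit, n)
print(f"decision point for n = {n}: {r_crit:.3f}")  # prints 0.361

# Hypothetical sample correlations, just to exercise the comparison
for r in (0.40, -0.50, 0.20):
    print(r, "->", verdict(r, r_crit))
```

To tabulate decision points for other sample sizes, call critical_r with the matching critical t for each df = n−2.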
This business of “different from zero” is all very well theoretically, but wouldn’t it be more useful to estimate the linear correlation coefficient of the population (ρ) from the linear correlation coefficient of the sample (r) in a confidence interval?
Indeed it would, but there’s a catch. The t distribution from equation 1 worked only when testing for a population correlation coefficient of 0, which would give a symmetrical sampling distribution of r. But when computing a confidence interval for ρ, you can’t also assume that ρ = 0 and therefore you can’t use that t statistic.
To resolve this paradox, use Fisher’s Z transformation, which is defined like this:
Equation 3: Z = ½ ln((1+r)/(1−r))
where “ln” is the natural or base-e logarithm. (Your calculator has a key for it, and you use LN( ) in Excel.) Fisher’s Z is a bit nasty to compute, but it is approximately normally distributed no matter what the population ρ might be. Its standard deviation is 1/√(n−3).
To compute a confidence interval for ρ, transform r to Z and compute the confidence interval of Z as you would for any normal distribution with σ = 1/√(n−3). To transform the Z interval to an interval about ρ, you need to solve equation 3 for r, like this:
Equation 4: r = (e^(2Z) − 1) / (e^(2Z) + 1)
Plug the low Z into equation 4 to compute the lower limit on ρ, then plug in the high Z to compute the higher limit on ρ.
(The third block of the accompanying Excel workbook will do these calculations for you. You can also use MATH200B Program part 3 on your TI-83 or TI-84.)
A sample of 25 points shows a linear correlation coefficient of 0.84. What is the 95% confidence interval for the correlation coefficient in the population? (Again, to keep things simple I’m giving you the sample statistics instead of the raw data, and we’ll assume that the requirements are met. But in real life, always check the requirements before computing a confidence interval.)
The solution is a wild ride; hang on! (The Excel spreadsheet shows the intermediate steps.)
(a) From 1−α = 0.95, find α/2 = 0.025. Use a table, use TI-83/84/89 invNorm, or use Excel NORMSINV( ) to find that the 95% confidence interval is bounded by z = ±1.96.
(b) That critical z of ±1.96 bounds the confidence interval in the standard normal distribution with σ=1; for this one you must multiply by the standard deviation of the Fisher Z, which is 1/√(n−3). For n = 25 points that is σ = 1/√(25−3) = 0.213. Multiplying by the 1.96 from (a) gives E = 0.418. E is the error of the estimate, which is half the width of the confidence interval for Fisher’s transformed Z.
(c) Now use equation 3 and r = 0.84 to compute Z = 1.221. This is the Fisher Z for this particular sample. Using the result from (b), the confidence interval for the transformed Z is 1.221 ± 0.418, which is 0.803 to 1.639.
(d) Plug those Fisher-Z endpoints into equation 4. Z = 0.803 yields ρ = 0.666, and Z = 1.639 yields ρ = 0.927.
Conclusion: If 25 points have a linear correlation coefficient of 0.84, then you’re 95% confident that the population’s linear correlation coefficient is between 0.666 and 0.927.
Remark: The sample statistic 0.84 is not at the middle of the confidence interval, because the sample r values have a skewed distribution around the population correlation coefficient ρ.
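Steps (a) through (d) can be sketched in Python as follows. This is a hedged illustration, not the workbook's code: the function names are my own, Fisher's Z is written out as Z = ½ ln((1+r)/(1−r)) with its inverse r = (e^(2Z) − 1)/(e^(2Z) + 1), and the critical z comes from NormalDist in Python's standard statistics module (Python 3.8 or later).

```python
import math
from statistics import NormalDist

def fisher_z(r):
    """Fisher's transformation: Z = 0.5 * ln((1 + r) / (1 - r))."""
    return 0.5 * math.log((1 + r) / (1 - r))

def inverse_fisher_z(z):
    """Back-transformation: r = (e^(2Z) - 1) / (e^(2Z) + 1)."""
    e2z = math.exp(2 * z)
    return (e2z - 1) / (e2z + 1)

def rho_confidence_interval(r, n, confidence=0.95):
    """Confidence interval for the population correlation rho."""
    alpha = 1 - confidence
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # step (a): about 1.96 for 95%
    margin = z_crit / math.sqrt(n - 3)            # step (b): E = z * 1/sqrt(n-3)
    z = fisher_z(r)                               # step (c): Fisher Z of the sample
    # step (d): transform both endpoints back to the r scale
    return inverse_fisher_z(z - margin), inverse_fisher_z(z + margin)

low, high = rho_confidence_interval(0.84, 25)
print(f"95% CI for rho: {low:.3f} to {high:.3f}")  # prints 0.666 to 0.927
```

The two helper functions are the same as math.atanh and math.tanh; they are spelled out here only so that the code lines up with the logarithm and exponential forms.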