Scatter Plot, Correlation, and Regression on TI-89
Copyright © 2007–2010 by Stan Brown, Oak Road Systems
Summary: When you have a set of (x,y) data points and want to find the best equation to describe them, you are performing a regression. This page shows you how to determine the strength of the association between your two variables (correlation coefficient), and how to find the line of best fit (least squares regression line).
See also: a separate version of these instructions for the TI-83/84
Contents:
Step 0. Setup
Step 1. Make the Scatter Plot
Step 2. Perform the Regression
Step 3. Display the Regression Line
Step 4. Display the Residuals
| Set floating point mode, if you haven’t already. | [MODE] [▼] [▼] [►] [ALPHA ÷ makes E] [ENTER] |
The calculator will remember this setting when you turn it off: next time you can start with Step 1.
Before you even run a regression, you should first plot the points and see whether they seem to lie along a straight line. If the distribution is obviously not a straight line, don’t do a linear regression. (Some other form of regression might still be appropriate, but that is outside the scope of this course.)
Let's use this example from Sullivan, Michael, Fundamentals of Statistics (Pearson Prentice Hall, 2008), page 178: the distance a golf ball travels versus the speed with which the club head hit it.
| Club-head speed, mph (x) | 100 | 102 | 103 | 101 | 105 | 100 | 99 | 105 |
|---|---|---|---|---|---|---|---|---|
| Distance, yards (y) | 257 | 264 | 274 | 266 | 277 | 263 | 258 | 275 |
| Turn off other plots. | [◆] [APPS] and select Stats/List Editor.
[F2] [3] [F2] [4] turns off all plots and functions. |
| Enter the numbers in two statistics lists. | You will use two named lists for the x’s and
y’s. Any names are possible, but I’ll use
lx and ly because they’re short.
If those lists already exist, highlight the lx name
and press [CLEAR] [ENTER] to erase previous entries.
If lx isn’t there yet, move to an empty list
heading and press [L] [X]. (L is above the 4 key. When you
press 4 while naming a list, it will change to L automatically.)
Enter the x numbers, then clear list ly (or
create it) and enter the y numbers.
Note: You can hide an unwanted list by cursoring to the list name and pressing [◆ ← makes DEL]. The list
remains in memory until you use [2nd − makes VARLINK] to delete it. |
| Set up the scatter plot. | [F2] [1] [F1] opens a dialog box. You want these settings:
Press [ENTER] to complete the definition. |
| Plot the points. | [F5] automatically adjusts the window
frame to fit the data. |
| (Optional) You can adjust the grid to look better. | [◆ F2 makes WINDOW], set Xscl=1
and Yscl=5, then [◆ F3 makes GRAPH] to redisplay it.
Appropriate values of Xscl and Yscl may be different for other problems. Pick the values that make the
graph look best to you. |
| Check your data entry by tracing the points. | [F3] shows you the first (x,y) pair, and then
[►] shows you the others. They’re shown
in the order you entered them, not necessarily from left to right. |
If the data points don’t seem to follow a straight line reasonably well, STOP! Your calculator will obey you if you tell it to perform a linear regression, but if the points don’t actually fit a straight line then it’s a case of “garbage in, garbage out.”
For instance, consider this example from De Veaux, Velleman, and Bock, Intro Stats (Pearson Addison Wesley, 2009), page 179. This is a table of recommended f/stops for various shutter speeds for a digital camera:
| Shutter speed (x) | 1/1000 | 1/500 | 1/250 | 1/125 | 1/60 | 1/30 | 1/15 | 1/8 |
|---|---|---|---|---|---|---|---|---|
| f/stop (y) | 2.8 | 4 | 5.6 | 8 | 11 | 16 | 22 | 32 |
If you try plotting these numbers yourself, enter the shutter speeds
as fractions for accuracy: don’t convert them to decimals
yourself. The calculator will show you only a few decimal places, but
it maintains much greater precision internally.
You can see from the plot at right that these data don’t fit a straight line. There is a distinct bend near the left. When you have anything with a curve or bend, linear regression is wrong. You can try other forms of regression in your calculator’s menu, or you can transform the data as described in De Veaux, Velleman, and Bock, Intro Stats (Pearson Addison Wesley, 2009) Chapter 10 and other textbooks.
| Set up to calculate statistics. | [◆] [APPS] and select Stats/List Editor.
[F4] [3] [2] brings up the LinReg(ax+b) dialog box. You want these settings:
Press [ENTER] to perform the regression and paste
the regression equation into Y1. |
Write down a (slope), b (y intercept), r (correlation coefficient).
(See below for conventions on rounding.)
a = 3.1661, b = −55.8
R² = 0.88, r = 0.94
Look first at r, the coefficient of linear correlation. We usually round it to two decimal places unless it’s very close to ±1. As discussed in class, a positive correlation means that y tends to increase as x increases, and a negative correlation means that y tends to decrease as x increases.
For real-world data, 0.94 is a pretty strong correlation. But you might wonder whether there’s actually an association between club-head speed and distance traveled, as opposed to just an apparent correlation in this sample. The Web page Decision Points for Correlation Coefficient shows you how to answer that question.
Write the equation of the line using ŷ, not y, to indicate that this is a prediction. b is the y intercept, and we round it to one decimal place more than the data. a is the slope, and it’s harder to give a rule for rounding it, but generally four decimal places is a safe choice. You would write the equation of the line as
ŷ = 3.1661x − 55.8
(Don’t write 3.1661x + −55.8.)
These numbers can be interpreted pretty easily. Business majors will recognize them as intercept = fixed cost and slope = variable cost, but you can interpret them in non-business contexts just as well.
The slope, a, tells how much ŷ changes for a one-unit change in x. In this case, the ball travels about an extra 3.17 yards when the club speed is 1 mph greater. The sign of a is always the same as the sign of r.
The intercept, b, says where the regression line crosses the y axis: it’s the value of ŷ when x is 0. Be careful! That may not be meaningful. In this case, a club-head speed of zero is not meaningful. In general, when the measured x values don’t include 0 or don’t at least come pretty close to it, you can’t assign a real-world interpretation to the intercept.
The last number we look at (third on the screen) is R², the coefficient of determination. (The calculator displays r², but the capital letter is standard notation.) R² measures the quality of the regression line as a means of predicting y from x: the closer R² is to 1, the better the line.
Another way to look at it is that R² measures how much of the total variation in y is predicted by the line. (I’ll have more to say about that under residuals, below.) In this case R² is about 88%, so 88% of the variation in y is associated with the variation in x.
Statisticians say that R² tells you how much of the variation in y is “explained” by variation in x, but if you use that word remember that it means a numerical association, not necessarily a cause-and-effect explanation.
Only linear regression will have a correlation coefficient r, but any type of regression will have a coefficient of determination R² that tells you how well the regression equation predicts y from the independent variable(s).
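If you want to cross-check the calculator’s a, b, r, and R² for the golf data, the least-squares formulas are short enough to run by hand. This is only a sketch in Python, not what the TI-89 does internally; the function name linreg and the variable names are my own, chosen for illustration.

```python
from math import sqrt

# Golf data from the table in Step 1
speed = [100, 102, 103, 101, 105, 100, 99, 105]        # x, club-head speed in mph
distance = [257, 264, 274, 266, 277, 263, 258, 275]    # y, distance in yards

def linreg(xs, ys):
    """Return (a, b, r) for the least-squares line y-hat = a*x + b.

    Hypothetical helper for illustration, using the textbook formulas.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)                        # spread of x
    syy = sum((y - mean_y) ** 2 for y in ys)                        # spread of y
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))  # co-variation
    a = sxy / sxx                 # slope
    b = mean_y - a * mean_x       # y intercept
    r = sxy / sqrt(sxx * syy)     # correlation coefficient
    return a, b, r

a, b, r = linreg(speed, distance)
print(f"a = {a:.4f}, b = {b:.1f}")        # a = 3.1661, b = -55.8
print(f"r = {r:.2f}, R^2 = {r * r:.2f}")  # r = 0.94, R^2 = 0.88
```

The rounded results agree with the values written down from the calculator screen.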
| Show line with original data points. | [◆ F3 makes GRAPH] |
See also: Once you have the regression line, you can use the calculator to predict the y value for any x in the model.
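As a rough illustration of what such a prediction involves, here is a sketch in Python that plugs an x value into the rounded equation from Step 2. The function name predict_distance and the range check are my own assumptions: the check simply refuses club-head speeds outside the 99 to 105 mph that were actually measured, since the line has not been tested beyond the data.

```python
# Rounded coefficients from Step 2: y-hat = 3.1661x - 55.8
A = 3.1661
B = -55.8

# Smallest and largest club-head speeds in the sample (mph)
X_MIN, X_MAX = 99, 105

def predict_distance(speed_mph):
    """Predicted distance in yards (hypothetical helper, for illustration)."""
    if not X_MIN <= speed_mph <= X_MAX:
        raise ValueError("speed is outside the range of the measured data")
    return A * speed_mph + B

print(round(predict_distance(104), 1))   # 273.5
```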
See also: Do you wonder what sort of calculations the calculator does to find the best line? Least Squares, Down and Dirty explains what is meant by the “best” line and how to find it. Traditionally this is a calculus topic, but all that’s really necessary is some algebra.
“No regression analysis is complete without a display of the residuals to check that the linear model is reasonable.”
De Veaux, Velleman, and Bock, Intro Stats (Pearson Addison Wesley, 2009), page 227
The residuals are automatically calculated during the
regression, and stored in a resid list in your Stats/List Editor.
All you have to do is plot them on the y axis against your existing x
data.
| Return to the editor; notice that a resid list has appeared and contains the residuals. | [◆] [APPS] and select Stats/List Editor. |
| Turn off other plots. | [F2] [3] [F2] [4] |
| Set up the plot of residuals against the x data. | [F2] [1] [▼] [F1] selects Plot 2 and opens a dialog box. You want these settings:
Press [ENTER] to complete the definition. |
| Display the plot. | [F5] displays the plot. |
You want the plot of residuals versus x to be “the most boring scatterplot you’ve ever seen”, in De Veaux’s words (page 203). “It shouldn’t have any interesting features, like a direction or shape. It should stretch horizontally, with about the same amount of scatter throughout. It should show no bends, and it should have no outliers. If you see any of these features, find out what the regression model missed.”
Don’t worry about the magnitude of the residuals,
because [F5] adjusts the vertical scale so that the points
take up the full screen.
If the residuals are more or less evenly distributed above and below the axis and show no particular trend, you were probably right to choose linear regression. But if there is a trend, you have probably forced a linear regression on non-linear data. If your data points looked like they fit a straight line but the residuals show a trend, it probably means that you took data along a small part of a curve.
Here there is no bend and there are no outliers. The scatter is pretty consistent from left to right, so you conclude that distance traveled versus club-head speed really does fit the straight-line model.
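For readers who want to see the numbers behind the residual plot, here is a small Python sketch that recomputes the residuals (measured y minus ŷ) for the golf data. The variable names are mine, and the code only mirrors the standard least-squares formulas; it is not a copy of the calculator’s resid list.

```python
# Golf data from the table in Step 1
speed = [100, 102, 103, 101, 105, 100, 99, 105]        # x, mph
distance = [257, 264, 274, 266, 277, 263, 258, 275]    # y, yards

n = len(speed)
mean_x = sum(speed) / n
mean_y = sum(distance) / n

# Least-squares slope and intercept
a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(speed, distance))
     / sum((x - mean_x) ** 2 for x in speed))
b = mean_y - a * mean_x

# Residual = measured y minus predicted y-hat
residuals = [y - (a * x + b) for x, y in zip(speed, distance)]

for x, e in sorted(zip(speed, residuals)):
    print(f"x = {x:3d}   residual = {e:+.2f}")

# For any least-squares line the residuals sum to zero (up to rounding)
print(f"sum of residuals = {sum(residuals):.6f}")
```

Listing the residuals in order of x makes it easier to spot a trend or a bend; here the signs are mixed and no point stands far from the rest.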
Refer back to the scatter plot of f/stop
against shutter speed.
I said then that it was not a straight
line, so you could not do a linear regression. If you missed the bend
in the scatterplot and did a regression anyway, you’d get a correlation
coefficient of r = 0.98, which would encourage you to rely on the
bad regression. But plotting the residuals (at right) makes it
crystal clear that linear regression is the wrong type for this data
set.
This is a textbook case (which is why it was in a textbook): there’s a clear curve with a bend, variation on both sides of the x axis is not consistent, and there’s even a likely outlier.
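The same point can be checked numerically. The sketch below, in Python with variable names of my own choosing, fits a line to the f/stop data and prints r along with the sign of each residual. Shutter speeds are entered as exact fractions, in the spirit of the advice in Step 1.

```python
from fractions import Fraction
from math import sqrt

# Shutter speeds entered as exact fractions, then converted once to floats
shutter = [float(Fraction(1, d)) for d in (1000, 500, 250, 125, 60, 30, 15, 8)]
fstop = [2.8, 4, 5.6, 8, 11, 16, 22, 32]

n = len(shutter)
mean_x = sum(shutter) / n
mean_y = sum(fstop) / n
sxx = sum((x - mean_x) ** 2 for x in shutter)
syy = sum((y - mean_y) ** 2 for y in fstop)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(shutter, fstop))

a = sxy / sxx                 # slope of the (inappropriate) straight line
b = mean_y - a * mean_x       # y intercept
r = sxy / sqrt(sxx * syy)     # correlation coefficient

residuals = [y - (a * x + b) for x, y in zip(shutter, fstop)]
signs = "".join("+" if e > 0 else "-" for e in residuals)

print(f"r = {r:.2f}")                                              # r = 0.98
print("residual signs, fastest to slowest shutter:", signs)        # ---++++-
```

A run of negative residuals, then a run of positive ones, then negative again is the numerical signature of a curve that a straight line cuts across, despite the high r.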
I said in Step 2 that the coefficient of determination measures the variation in the measured y associated with the measured x. Now that we have the residuals, we can make that statement more precise and perhaps a little easier to understand.
The set of measured y values has a spread, which can be measured by the standard deviation or the variance. It turns out to be useful to consider the variation in y’s as their variance. (You remember that the variance is the square of the standard deviation.)
The total variance of the measured y’s has two components: the so-called “explained” variation, which is the variation along the regression line, and the “unexplained” variation, which is the variation away from the regression line. The “explained” variation is simply the variance of the ŷ’s, computing ŷ for every x, and the “unexplained” variation is the variance of the residuals. Those two must add up to the total variance of the measured y’s, which means that if we express them as percentages of the variation in y then the percentages must add to 100%. So R² is the percent of “explained” variation in the regression, and 100%−R² is the percent of “unexplained” variation.
R² = s²ŷ / s²y and 1 − R² = s²e / s²y
Now we can restate what we learned in Step 2. R² is 88% because 88% of the variance in y is associated with the regression line, and the other 12% must therefore be the variance in the residuals. This isn’t hard to verify: do a 1-VarStats on the list of measured y’s and square the standard deviation to get the total variance in y, s²y = 59.93. Then do 1-VarStats on the residuals list and square the standard deviation to get the “unexplained” variance, s²e = 7.12. The ratio of those is 7.12/59.93 = 0.12, which is 1−R². Expressing it as a percentage gives 100%−R² = 12% so 12% of the variation in measured y’s is “unexplained” (due to lurking variables, measurement error, etc.).
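The same verification can be sketched in Python with the standard statistics module, whose variance function uses the n−1 divisor, as a sample standard deviation does. The variable names here are my own, and the fitted line is recomputed from the least-squares formulas rather than taken from the calculator.

```python
from statistics import mean, variance

# Golf data from the table in Step 1
speed = [100, 102, 103, 101, 105, 100, 99, 105]        # x, mph
distance = [257, 264, 274, 266, 277, 263, 258, 275]    # y, yards

mean_x, mean_y = mean(speed), mean(distance)
a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(speed, distance))
     / sum((x - mean_x) ** 2 for x in speed))
b = mean_y - a * mean_x

predicted = [a * x + b for x in speed]                       # y-hat for every x
residuals = [y - p for y, p in zip(distance, predicted)]     # y minus y-hat

var_total = variance(distance)         # total variance of measured y
var_explained = variance(predicted)    # variance along the regression line
var_unexplained = variance(residuals)  # variance away from the line

print(f"total       = {var_total:.2f}")        # 59.93
print(f"unexplained = {var_unexplained:.2f}")  # 7.12
print(f"1 - R^2     = {var_unexplained / var_total:.2f}")  # 0.12
print(f"R^2         = {var_explained / var_total:.2f}")    # 0.88
```

The explained and unexplained variances add up to the total, which is why the two percentages must sum to 100%.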
This page is used in instruction at Tompkins Cortland Community College in Dryden, New York; it’s not an official statement of the College. Please visit www.tc3.edu/instruct/sbrown/ to report errors or ask to copy it.
For updates and new info, go to http://www.tc3.edu/instruct/sbrown/ti83/