STAB22 section 2.2

2.29 A change in price leads to a change in the amount of deforestation, so price is the explanatory variable and deforestation is the response. There are no difficulties in producing a plot; mine is in Figure 1. There is a pretty clear upward trend (positive association), with a higher price associated with more deforestation, so we'd expect the correlation to be positive.

Figure 1: Plot of deforestation vs. price

We can also use Minitab to get the mean and SD for each variable. These are shown in Figure 2.

Descriptive Statistics

Variable      N     Mean   Median   TrMean    StDev  SE Mean
price         5    50.00    54.00    50.00    16.32     7.30
deforest      5    1.738    1.690    1.738    0.928    0.415

Variable    Minimum  Maximum       Q1       Q3
price         29.00    72.00    34.50    63.50
deforest      0.490    3.100    1.040    2.460

Figure 2: Descriptives for price and deforestation

To calculate the correlation, standardize each data value, using the correct mean and SD. For example, for the first price, 29, the standardized value is (29 - 50)/16.32 = -1.29; the first deforestation value, 0.49, standardizes to (0.49 - 1.738)/0.928 = -1.35. Continuing in this way gives:

   z_x     z_y   z_x z_y
 -1.29   -1.35      1.73
 -0.61   -0.15      0.10
  0.25   -0.05     -0.01
  0.31    0.09      0.03
  1.35    1.47      1.98

As you see, multiply the two standardized values together to get the last column. Most of the pairs of standardized values have the same sign, so the number in the last column is usually positive. (The x-values and y-values tend to be above or below the mean together.) Finally, add up the numbers in the last column and divide by n - 1 to get r = 3.82/4 = 0.955. (The sum 3.82 comes from unrounded values; the rounded entries shown add up to 3.83.) We saw in the scatterplot that the relationship was upward and strong, so we'd expect to get a correlation that is positive and close to 1, as we did.

Minitab should produce about the same value (to within rounding error), and it does. Select Stat, Basic Statistics and Correlation. Select the two columns (the order doesn't matter) and click OK. See Figure 3.
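If you want to check the hand calculation for 2.29 outside Minitab, here is a minimal Python sketch of the same steps: standardize, multiply, add up, divide by n - 1. It is an illustration only. The five data pairs are not listed in this handout, so the values below are an assumption, reconstructed to agree with the summary statistics in Figure 2 and the standardized values in the table. The function names are made up for this sketch.

```python
from math import sqrt

# ASSUMED data: reconstructed from the Figure 2 summaries and the z-score
# table, not copied from the textbook data file.
price = [29, 40, 54, 55, 72]
deforest = [0.49, 1.59, 1.69, 1.82, 3.10]


def mean(values):
    return sum(values) / len(values)


def sample_sd(values):
    # Standard deviation with the n - 1 divisor, as Minitab reports it.
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def standardize(values):
    # Subtract the mean and divide by the SD for each value.
    m, s = mean(values), sample_sd(values)
    return [(v - m) / s for v in values]


zx = standardize(price)
zy = standardize(deforest)
products = [a * b for a, b in zip(zx, zy)]

# Correlation: sum of the products of standardized values over n - 1.
r = sum(products) / (len(price) - 1)

for a, b, p in zip(zx, zy, products):
    print(f"{a:6.2f} {b:6.2f} {p:6.2f}")
print(f"r = {r:.3f}")
```

With these assumed values the result agrees with the 0.955 found by hand, to within rounding.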
Correlations (Pearson)

Correlation of price and deforest = 0.955, P-Value = 0.011

Figure 3: Correlation of price and deforestation

You can see that calculating the correlation by hand is a slow process, so don't expect to be asked to do it on an exam.

2.30 Here and in the next exercise, there's no value in calculating the correlation by hand, so use Minitab to do it: Stat, Basic Statistics, Correlation. Select the two variables (it doesn't matter which one is first) to get a correlation of -0.221. This is a weak correlation, and we saw that the relationship in Exercise 2.6 was weak: it was hard to predict final exam score from the first test score. Indeed, the correlation here is negative, which even suggests that students who do better on the first test do worse on the final exam!

2.31 Same deal as in the previous exercise: read the data into Minitab, and find the correlation. Here it is 0.519, which is positive and stronger than the previous one (if not itself very strong). This means that higher final exam scores go with higher second test scores, though the correspondence is far from perfect. This matches the message we got from the scatterplot in 2.7.

2.34 This is a moderately strong correlation. A correlation of 0.9 would show the points being much closer to the line, while a correlation of 0.1 would show much less pattern. So the correlation must be closest to 0.6. The actual correlation (from Minitab) is 0.6821.

2.35 This is again closest to 0.6, for the same reasons as in 2.34. (The correlation looks weaker to me, but it is often surprising how strong a correlation can actually be.)

2.37 The correlation would not change; correlation is a number without units, and it doesn't change if the units of the variables change.

2.38 If you didn't do 2.28, draw a scatterplot here. Mine is in Figure 4. It makes more sense to think of 2003 as the response and 2002 as the explanatory variable. The correlation for all 23 funds is 0.623. (This was found using Minitab: select Stat, Basic Statistics, Correlation, and then select the two variables in either order.)

Figure 4: Scatterplot of sector funds

The Gold Fund is in row 20 of the data set; click on the row label 20, right-click on the selected cells, then select Delete Cells. The original row 20 disappears. Now you can re-calculate the correlation: it is now 0.872.
Taking out just one pair of returns has changed the correlation quite dramatically, so the correlation is not resistant. (This is no surprise, given that the correlation is calculated using means and SDs, and they are not resistant.) The Gold Fund is the point on the far right of the scatterplot: an outlier, because it does not lie on the general trend. Taking it out should make the relationship stronger and thus make the correlation closer to 1.

2.40 The correlation for the day length data is 0.280, and for the icicle lengths it is 0.876. (I found both by reading the data sets into Minitab and getting it to find the correlations.) This question illustrates that you can't really ask "how high a correlation is high?", because the answer is always going to be "it depends". What it depends on is the field of study and the kind of study: a lab experiment will usually give a stronger correlation than field data, and so what counts as an interesting correlation will be different in the two cases.
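Two properties used above can be seen in a small sketch: correlation does not depend on units (2.37), and it is not resistant to a single off-trend point (2.38). The numbers below are invented for illustration; they are not the sector fund returns or any other textbook data, and `pearson_r` is a name made up for this sketch.

```python
from math import sqrt


def pearson_r(x, y):
    """Sum of products of standardized values, divided by n - 1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sd_x = sqrt(sum((v - mean_x) ** 2 for v in x) / (n - 1))
    sd_y = sqrt(sum((v - mean_y) ** 2 for v in y) / (n - 1))
    return sum(((a - mean_x) / sd_x) * ((b - mean_y) / sd_y)
               for a, b in zip(x, y)) / (n - 1)


# INVENTED data: five points on an upward trend, plus one point far to the
# right that does not follow the trend.
x = [1, 2, 3, 4, 5, 12]
y = [2, 3, 3, 5, 6, 3]

r_all = pearson_r(x, y)

# Change of units: multiply by a positive constant and/or add a constant.
x_rescaled = [100 * v for v in x]
y_rescaled = [2.54 * v + 10 for v in y]
r_rescaled = pearson_r(x_rescaled, y_rescaled)

# Drop the off-trend point (the last one) and re-calculate.
r_trimmed = pearson_r(x[:-1], y[:-1])

print(f"all points:          r = {r_all:.3f}")
print(f"after rescaling:     r = {r_rescaled:.3f}")
print(f"without the outlier: r = {r_trimmed:.3f}")
```

The rescaled correlation matches the original one, while dropping the single off-trend point raises the correlation substantially.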

2.42 The data for this one don't appear to be on the disk, so if you want to use Minitab, you'll have to type the data in yourself. I used Minitab for this exercise; my plot is in Figure 5.

Figure 5: Scatterplot of Ex. 2.30 data

The correlation (calculated by hand, by calculator, or by Minitab) is 0.481, which is about 0.5, and not especially high. On the scatterplot, you see a nice trend through five of the points, but an outlier at the bottom right (the point with x = 10 and y = 1). This is bringing the correlation down. If you take out this one point, the correlation shoots up to 0.993. So this shows that the correlation (because it's based on means and SDs) can be badly affected by outliers.

2.44 See Figure 6 for the scatterplot. The correlation is -0.172, close to zero; this is because the trend shown on the scatterplot is a curve rather than a straight line. Since there isn't a straight-line relationship, there isn't a very big correlation. You can often guess the sign of the correlation for a curved relationship. This one goes down sharply and then up moderately. Since the overall trend is more down than up, the correlation comes out negative. But it doesn't reflect the fact that if you used a curve to predict fuel consumption from speed, you would be able to get accurate answers.

Figure 6: Fuel used against speed

2.45 My plot is shown in Figure 7. (I thought of highway gas mileage as the response; you could do it the other way around.) The Insight is the car at the far top right. It seems to fit the linear pattern made by the other cars. Using Minitab, the correlation using all the cars is 0.990. Taking out the row containing the Insight and re-calculating gives a correlation of 0.973. Both correlations are high, but the Insight increases the strength of the relationship by virtue of being so far away from the other cars.

2.47 Invent some plausible ages for women getting married, say 20, 22, 24, 26, 28. The men they marry have ages 22, 24, 26, 28, 30.
You can draw a scatterplot by hand, or enter the numbers into a Minitab worksheet and draw the scatterplot that way. My plot is in Figure 8; yours will be different according to the ages you chose.
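For the ages invented in 2.47, where each husband is exactly two years older than his wife, the correlation can also be checked by calculation. Here is a minimal Python sketch of that check; the ages are the ones chosen above, and `pearson_r` is a helper name made up for this illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Sum of products of standardized values, divided by n - 1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sd_x = sqrt(sum((v - mean_x) ** 2 for v in x) / (n - 1))
    sd_y = sqrt(sum((v - mean_y) ** 2 for v in y) / (n - 1))
    return sum(((a - mean_x) / sd_x) * ((b - mean_y) / sd_y)
               for a, b in zip(x, y)) / (n - 1)


wife = [20, 22, 24, 26, 28]
husband = [age + 2 for age in wife]  # 22, 24, 26, 28, 30

r = pearson_r(wife, husband)
print(f"r = {r:.3f}")
```

Adding 2 to every age shifts the points but does not move them off a straight line, so the standardized values for husband and wife are identical and the correlation is 1 (up to floating-point rounding).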

The points form a perfect straight line (with equation y = x + 2), so the correlation should be 1. You can check this by calculation if you want.

Figure 7: Scatterplot of city and highway gas mileages

2.50 The correlation is a number without units calculated from two quantitative variables. Knowing this helps you see the blunders. (a) Gender is categorical, not quantitative. (b) A correlation of 1.09 is impossible: the correlation must be between -1 and 1. (c) The correlation doesn't have units (such as bushels). You could calculate a correlation in (a) by turning the categorical variable into numbers, say 1 for males and 0 for females. The result is called a point-biserial correlation coefficient. In the case of (a), if it came out positive it would mean that males tend to have higher incomes (because males were given the value 1).

2.51 "Examine the relationships" means draw some scatterplots and, if you think it is reasonable, calculate some correlations. For GPA and IQ, the scatterplot is in Figure 9. The relationship, on the whole, is moderately linear, but with a few outliers at the bottom end (the two students with the lowest GPAs, and maybe the student with the lowest IQ also). The correlation is 0.634. For GPA and self-concept, the plot is in Figure 10. There seems to be an upward trend, but a weaker one than for GPA and IQ (the correlation here is 0.542). The student with the lowest GPA again appears to be an outlier. In both cases, the relationship looks roughly linear, and so calculating a correlation is sensible. Taking out the outliers would make the correlations a little bigger, but not very much bigger, as there is so much data.

Figure 8: Plot of husband's and wife's ages
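The 0/1 coding described in 2.50(a) can be sketched in a few lines of Python. The incomes below are hypothetical numbers invented for illustration (they are not from the textbook), and `pearson_r` is a helper name made up for this sketch. The point is only that the sign of the result depends on which group is coded as 1.

```python
from math import sqrt


def pearson_r(x, y):
    """Sum of products of standardized values, divided by n - 1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sd_x = sqrt(sum((v - mean_x) ** 2 for v in x) / (n - 1))
    sd_y = sqrt(sum((v - mean_y) ** 2 for v in y) / (n - 1))
    return sum(((a - mean_x) / sd_x) * ((b - mean_y) / sd_y)
               for a, b in zip(x, y)) / (n - 1)


# HYPOTHETICAL data: gender coded 1 = male, 0 = female, with made-up
# incomes in thousands of dollars.
is_male = [1, 1, 1, 0, 0, 0]
income = [52, 48, 60, 45, 50, 41]

# Point-biserial correlation: an ordinary correlation with a 0/1 variable.
r = pearson_r(is_male, income)

# Reversing the coding (1 = female) flips the sign but not the size.
is_female = [1 - g for g in is_male]
r_flipped = pearson_r(is_female, income)

print(f"coded 1 = male:   r = {r:.3f}")
print(f"coded 1 = female: r = {r_flipped:.3f}")
```

In this made-up data the males have the higher average income, so the correlation is positive when males are coded 1 and negative, with the same size, when the coding is reversed.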

Figure 9: Scatterplot of GPA and IQ

Figure 10: Scatterplot of GPA vs. self-concept