Chapter 18: The Correlational Procedures

Introduction:

In this chapter we look at two kinds of relationship between variables: the positive relationship and the negative relationship.

Positive Relationship
Two values, votes and campaign spending for example, might have a positive relationship: higher levels of spending lead to a greater number of votes.

Negative Relationship
We might collect data about the amount of time people exercise on a daily basis and their blood pressure. In this case, we might find a negative relationship: when one value goes up (i.e., the amount of exercise), the other value goes down (i.e., blood pressure).

In this chapter we expand on that idea by learning how to numerically describe the degree of relationship by computing a correlation coefficient. We continue by showing how to use scatterplots and correlation coefficients together to investigate hypotheses.

You Must Remember: We are going to INVESTIGATE our hypothesis; we are not going to TEST our hypothesis.

Lesson 18.1: Pearson Product-Moment Correlation Coefficient (A Parametric Correlational Procedure)

If we look at the relationship between two sets of data, one of three things can happen:
(1) as the values in one dataset get larger, so do the values in the other dataset;
(2) as the values in one dataset get larger, the values in the other dataset get smaller;
(3) as the values in one dataset change, there is no apparent pattern in the change of values in the other dataset.

Note: These three possibilities describe the correlation between two variables and are best explained in Table 18.1.

Table 18.1 Understanding Correlation

Example 18.1

We will investigate whether the amount of tuition reimbursement spent by a company in a given year's time was correlated with the number of resignations during that same time period. As a reminder, the management of the company was concerned that more and more employees were taking advantage of that benefit, becoming better educated, and moving to another company for a higher salary. Using the following data, the company's manager decided to investigate whether such a relationship really existed. In Table 18.2, each row represents a given year, the amount of tuition for that year, and the number of resignations for that same year.

Table 18.2 Tuition and Resignations by Year

Our Goal and Our Possible Results:
As we just said, management is worried that the more money they spend, the greater the number of resignations. But based on the definition of correlation between two variables, this is only one of three possibilities:
(1) if they are right, as the amount of tuition reimbursement goes up, so will the number of resignations;
(2) a second possibility is that, when the amount of tuition money goes up, the number of resignations goes down; and
(3) the final option is that there will be no obvious pattern between the number of resignations and the amount of tuition money.

Question: In this kind of situation, what is the best statistical tool?

In a situation like this, we can investigate these relationships with any of several correlational statistics. Which correlational tool to use depends on the type of data we have collected.

Question: In order for us to identify the correct correlational statistic for our example, what type of data do our two values represent?

In this case, both resignations and tuition spent are quantitative (i.e., interval or ratio data). Because of that, we will use the Pearson product-moment correlation coefficient (most people call it Pearson's r).

Pearson Product-Moment Correlation Coefficient (Pearson's r)
Karl Pearson was interested in relationships of this type and developed a formula that allows us to calculate a correlation coefficient that numerically describes the relationship between two variables.

Formula for Computing the Pearson's r:

$$ r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}} $$

Modified Formula for Computing the Pearson's r:

$$ r = \frac{n(E) - (A)(C)}{\sqrt{\left[n(B) - (A)^2\right]\left[n(D) - (C)^2\right]}} $$

where:
A is the sum of all values in dataset x
B is the sum of the squares of all values in dataset x
C is the sum of all values in dataset y
D is the sum of the squares of all values in dataset y
E is the sum of the products of each paired value in dataset x and dataset y
n is the number of pairs of data in the equation
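As a rough illustration, the modified formula translates almost line for line into Python. This sketch is not part of the original lesson: the function name `pearson_r` and the variable names `tuition` and `resignations` are assumptions chosen for readability, and the ten pairs are the x and y values listed in Table 18.4 (assuming x is tuition and y is resignations).

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's r from the modified formula, using the totals A through E."""
    n = len(x)                             # number of (x, y) pairs
    a = sum(x)                             # A: sum of the x values
    b = sum(v * v for v in x)              # B: sum of the squared x values
    c = sum(y)                             # C: sum of the y values
    d = sum(v * v for v in y)              # D: sum of the squared y values
    e = sum(p * q for p, q in zip(x, y))   # E: sum of the x*y products
    return (n * e - a * c) / sqrt((n * b - a ** 2) * (n * d - c ** 2))


# Paired values from Table 18.4 (assumed: x = tuition, y = resignations).
tuition = [32, 38, 41, 42, 50, 60, 57, 70, 80, 100]
resignations = [20, 25, 28, 25, 30, 32, 40, 45, 45, 90]

r = pearson_r(tuition, resignations)
print(f"r = {r:.3f}")  # prints "r = 0.926"
```

With these pairs the totals are A = 570, B = 36,562, C = 380, D = 18,108, and E = 25,238, and the function reproduces the r = 0.926 reported for this example.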

Table 18.3 Values Needed for Computing the Pearson Correlation Coefficient

Table 18.4 Values Needed for Computing the Pearson Correlation Coefficient

x        x²       y        y²       xy
32                20
38                25
41                28
42                25
50                30
60                32
57                40
70                45
80                45
100               90
Σx =     Σx² =    Σy =     Σy² =    Σxy =
(A)      (B)      (C)      (D)      (E)

Task 18.1
Using either the original or the modified formula, determine the computed value of Pearson's r.

The computed value of Pearson's r is r = 0.926.

Lesson 18.2: Interpreting Pearson's r

The value of Pearson's r can range from -1 to +1. When you're computing a correlation, three things can happen:

(1) If both values go up, or both values go down, it is a positive correlation. The greater the positive correlation, the closer r is to +1.

(2) If one value goes up and the other goes down, they are negatively correlated. The greater the negative correlation, the closer r moves to -1.

(3) If r is close to zero, it indicates little or no correlation; there is no clear relationship between the values.

A Word of Caution

Positive Correlation
A positive correlation exists when the values of two variables both go up or both go down.

Negative Correlation
Only when one variable goes up and the other goes down do we say there is a negative correlation.

You Must Remember: Do not make the mistake of labeling a case in which both variables go down as a negative correlation.

Table 18.5 Relationship between Positive and Negative Correlations

Something to think about...
In our preceding case, we have a very high r value of 0.926, but what does that really mean? How high does the r value have to be for us to consider the relationship between the variables to be meaningful? The answer is that the interpretation of the r value depends on what the researcher is looking for.

Interpretations of Pearson's r

(1) an r value of +0.90 or greater would be excellent for a positive correlation;

(2) an r value of -0.90 or lower would be excellent for a negative correlation;

(3) in both of these instances (+0.90 or greater, or -0.90 or lower), we could clearly see that a relationship exists between the variables being considered.

(4) we might consider an r value in the 0.60s and 0.70s (or the -0.60s and -0.70s) to be sufficient; it would just depend on the type of relationship we were looking for.

(5) if our correlation coefficient is smaller than this, especially anything between zero and +0.50 for a positive correlation or between -0.50 and zero for a negative correlation, a strong relationship apparently does not exist between the two variables.

Task 18.2
What will be our general conclusion with our Pearson's r of .926?

An Even More Important Word of Caution

We just saw that a strong relationship existed between the amount of tuition money spent and the number of resignations. Many beginning statisticians make the mistake of thinking that a positive correlation means one thing caused the other to happen.

Note: Correlation doesn't mean causation. While you might be able to make an inference, you cannot be absolutely sure of a cause-and-effect relationship; the correlation might be due to chance or to some other factor.

What do I mean by that? For example, did you know there is a large correlation between the number of churches in a city and the number of homicides in the same city? While you may be thinking "blasphemy," it's true: a large number of churches is highly correlated with a large number of homicides.

Question: Think carefully about what I just said, though. Should people insist that churches be torn down in order to lower the homicide rate?

Of course not! While the values are correlated, the relationship certainly isn't causal. The real reason underlying the correlation is the size of the town or city. Larger cities have a greater number of both churches and homicides; smaller towns have a smaller number of each. They are related to each other but not in any causal way.

Do you know?
This is a perfect example of being a good consumer of statistics. You can see quite a few of these noncausal correlations being presented in various newspapers, magazines, and other sources; in a lot of cases, people are trying to use them to bolster their point or get their way about something. What's the lesson? Simple: always look at the big picture.
Example 18.2

Let's use the same dataset but create Table 18.6 by changing the resignation numbers just a bit. Again, just by looking, we can see that an apparent relationship exists. In this case, it appears that the number of resignations goes down as the amount of tuition money spent goes up.

Interpretation: Since our computed Pearson's r is -.419, which is between -.5 and zero, it is a rather small negative correlation. While there is a correlation between the money spent and the resignations, it is not very strong.

Table 18.6 Tuition and Resignation Data

Table 18.7 Negative Pearson Correlation Coefficients

Note: We can see that the r value is negative. This indicates an inverse relationship: when one of the values goes up, the other value goes down. In this case, as the amount of tuition money spent gets larger, there are fewer resignations. Remember, although a correlation exists here, there may or may NOT be a causal relationship; be careful how you interpret these values.

Example 18.3

Now look at Table 18.7; do you see a pattern emerging? Does there seem to be a logical correlation between the two sets of values?

Table 18.7 Tuition and Resignation

If you look closely, you'll see there does not appear to be. Sometimes a value in the left column is paired with a much greater value in the right column (e.g., 41 and 58), and sometimes it is paired with a much lower one (e.g., 32 and 20).

Table 18.8 Pearson Correlation Coefficients Showing a Weak Positive Correlation

Question: Based on our Pearson's r, what will be our general conclusion?

Interpretation: Because of the lack of pattern, Pearson's r is only .013; this supports our observation that there does not seem to be a meaningful correlation between the two sets of values. Remember, though, this might be purely coincidental; you cannot use these results to infer cause and effect.

Lesson 18.3: Spearman Rank-Difference Correlation (A Nonparametric Correlational Procedure)

Spearman Rank-Difference Correlation (Spearman's Rho)
It stands to reason that, if we can compute a correlation coefficient for quantitative data, we should be able to do the same for nonparametric data. For beginning statisticians, the only nonparametric correlation we generally use is the Spearman Rank-Difference Correlation (most often called Spearman's rho). As its name implies, it only works with rank-level (i.e., ordinal) data.

Example 18.4

Let's imagine we're members of the board of directors of a company where several issues have arisen that seem to be affecting employee morale. We've decided that our first step is to ask both the management team and the employees to rank these issues in terms of importance. We then want to determine if there is agreement between the two groups in terms of the importance of the issues, labeled A to J.

Formula for Computing the Spearman's rho:

$$ r_s = 1 - \frac{6\sum d^2}{n\left(n^2 - 1\right)} $$

where:
r_s is the symbol for Spearman's rho
1 and 6 are exactly that: the constant values 1 and 6
n is the number of objects we are going to rank
Σd² is the sum of the squared differences between the two rankings of each object

Table 18.9 Employee and Management Rank of Issues

Table 18.9 Necessary Values for Spearman's Rho

Issue    Employee Rank    Management Rank    d        d²
A        3                3
B        4                7
C        7                9
D        8                1
E        1                5
F        5                2
G        2                6
H        6                4
I        10               10
J        9                8
                                             Σd =     Σd² =

Task: Using the formula, determine the Spearman's rho value.

Our Spearman's rho value is .345.
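The rank-difference calculation for this example can be sketched in a few lines of Python. This is an illustration only: the function name `spearman_rho` and the list names are assumptions, the rankings are the ones listed for issues A through J, and the formula as written assumes there are no tied ranks.

```python
def spearman_rho(ranks_a, ranks_b):
    """Spearman's rho from the rank-difference formula (assumes no tied ranks)."""
    n = len(ranks_a)
    # d is the difference between the two ranks given to the same item.
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - (6 * sum_d_squared) / (n * (n ** 2 - 1))


# Rankings of issues A through J from Table 18.9.
employee = [3, 4, 7, 8, 1, 5, 2, 6, 10, 9]
management = [3, 7, 9, 1, 5, 2, 6, 4, 10, 8]

rho = spearman_rho(employee, management)
print(f"rho = {rho:.3f}")  # prints "rho = 0.345"
```

Here the squared differences sum to 108, so rho = 1 - 648/990, which agrees with the .345 reported above.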

Interpretation of Spearman's rho

Interpreting Spearman's rho is exactly the same as interpreting Pearson's r. The value of rho can range from -1 to +1: values approaching +1 indicate strong agreement in the rankings, and values approaching -1 indicate that the rankings run in opposite directions.

Question: What does our Spearman's rho of .345 tell us?

Interpretation: We have a rho value of .345, indicating there is very little agreement in the ranking of the issues between employees and management.

Lesson 18.4: The p Value of a Correlation

Table 18.10 Spearman's Rho Correlation Coefficient Showing a Moderate Positive Correlation

Note: Pay attention to the p value shown in the table. Sometimes this confuses beginning statisticians because they want to use it to reject or fail to reject the hypothesis stated in the second step of the six-step process; that's not what it is for. Remember, the research hypothesis stated in Step 2 for a correlation is simply looking for a meaningful relationship between the two variables under consideration; we use our subjective interpretation of the r value and the scatterplot to help us make that decision.

Do you know?
The p value shown as part of the output for a correlation tests the hypothesis, "the correlation coefficient in the population is significantly different from zero." By using this hypothesis, statisticians are able to make decisions about a population r value by using sample data. This is very, very rarely done by beginning statisticians, so we won't go into any further detail.

Pearson's r: Six-Step Research Model (A Case Study)

The Case of the Absent Student

It seems we have talked a lot about students missing class, haven't we? One would guess I think going to class is important! Perhaps, if I think that way, then I would enjoy looking at the correlation between the number of times students are absent during a semester and their score on the final exam.

Step 1: Identify the Problem

In this case, it's apparent that I'm concerned about the number of absences and their relationship to the final grade.

Statement of the Problem: "Research has shown a relationship between a student's number of absences and class achievement. This study will investigate whether there is such a relationship between grades and the number of absences in my classes."

Question: Before we move forward, look at my statement of the problem. Is this really a problem?

No, it may not be; since the correlation measures relationships that aren't necessarily causal, we're simply trying to determine if a problem may exist.

Step 2: State a Hypothesis

Question: How should I look at the two variables I have? What relationship should I expect between the number of absences and the grade on the final exam?

My hypothesis will be that a negative relationship exists between the number of absences and the grade on the final exam. In other words, the fewer days a student misses, the higher their final exam grade will be.

Research Hypothesis: "There will be a negative correlation between the number of times a student is absent and their grade on the final exam."

Question: What do you notice that is different from our usual hypothesis?

Right off the bat, you can see something different about this hypothesis: it does not include the word "significant."

Three Reasons Why We Do Not Include "Significant" in Our Hypothesis for Correlational Procedures

(1) The correlational procedures are really descriptive statistics. When we learned to calculate the correlations earlier, we saw that they are used to tell us whether or not a linear relationship exists between two variables.

(2) Despite the fact that they are descriptive in nature, many people will state a hypothesis and then use the correlation to investigate rather than test it.

(3) Because we are not involved in testing a null hypothesis, we are going to subjectively evaluate our research hypothesis.

Step 3: Identify the Independent Variable

Note: Consider the fact that, for the first time, we are not looking at a scenario where we are trying to determine if differences exist between groups of data. Rather, we are trying to decide if a relationship exists between two sets of data. Here, we are trying to determine if it is true that absences from class are directly related to scores on the final exam.

Note: When we look at relationships of this type, we are looking at the correlation between the two sets of values.

Note: Since, in cases like this, we are not looking at cause and effect, we have to get away from the idea of having an independent variable and a dependent variable. Because we are trying to determine if a relationship exists between two variables, we use the terms predictor variable and criterion variable instead.

Predictor Variable
It is what we are using to help us make a prediction.

Criterion Variable
It is what we are trying to predict.

Question: So what is our predictor variable?

In this case, our predictor variable is the number of absences for each student.

Step 4: Identify and Describe the Dependent Variable

Note: Since we technically do not have a dependent variable, our criterion variable will be each student's grade on the final examination.

Table 18.11 Absences and Score Data

Step 5: Choose the Right Statistical Test

Since we have two sets of quantitative data and we want to determine if a relationship exists between the two, we will use Pearson's r to investigate our hypothesis.

Step 6: Computation and Data Analysis to Investigate the Hypothesis

Task 1: Determine the Pearson's r.

Table 18.12 Values Needed for Computing the Pearson Correlation Coefficient

x        x²       y        y²       xy
1                 100
2                 95
4                 90
5                 90
6                 80
4                 85
6                 82
7                 80
2                 97
7                 77
7                 70
9                 90
11                55
2                 94
1                 90
12                70
6                 80
7                 80
2                 93
3                 88
Σx =     Σx² =    Σy =     Σy² =    Σxy =
(A)      (B)      (C)      (D)      (E)

The Pearson's r is -.836.

Task 2: Interpret the result.

We have a Pearson's r value of -.836, which indicates a strong negative correlation; it appears that the number of absences is inversely related to a student's final exam score.
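The reported value can be checked by hand. The totals below were computed for this check from the 20 pairs listed in Table 18.12 (they are not given in the original table, so they are worth verifying): A = 104, B = 734, C = 1,686, D = 144,366, E = 8,218, with n = 20. Substituting them into the modified formula from Lesson 18.1 gives:

```latex
\begin{aligned}
r &= \frac{n(E) - (A)(C)}{\sqrt{\left[n(B) - (A)^2\right]\left[n(D) - (C)^2\right]}} \\
  &= \frac{20(8218) - (104)(1686)}{\sqrt{\left[20(734) - 104^2\right]\left[20(144366) - 1686^2\right]}} \\
  &= \frac{164360 - 175344}{\sqrt{(3864)(44724)}} \\
  &\approx \frac{-10984}{13145.86} \approx -0.836
\end{aligned}
```

The numerator is negative because the products of the paired values are smaller than they would be if absences and scores rose together, which is what produces the negative sign of r.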

Task 3: Construct a scatterplot to investigate the correlation.

Lesson 18.5: Scatterplots

Scatterplot
A scatterplot is used to show the relationship between paired values from two datasets.

Steps in Constructing a Scatterplot

Step 1: Determine the variable for the x-axis. Usually it is the predictor variable. Here, the number of absences of each student will be plotted on the x-axis.

Step 2: Determine the variable for the y-axis. Usually it is the criterion variable. Here, each student's grade on the final exam will be plotted on the y-axis.

Step 3: Make an ordered pair (i.e., [x, y]) by pairing each value of the predictor variable with its corresponding value of the criterion variable. Afterwards, plot all the pairs on the scatterplot.

Step 4: Although it is not necessary, we can plot a line called the "line of best fit" through the data points.

Line of Best Fit
It shows the trend of the relationship between the plotted variables.

Types of Line of Best Fit

Line of Best Fit for a Positive Relationship
This plot represents a positive relationship; as one value goes up, the other goes up.

Figure 18.1 Positive Relationship between Resignations and Tuition Paid

Line of Best Fit for a Negative Relationship
This plot represents a negative relationship; as one value goes up, the other tends to go down.

Figure 18.2 Negative Relationship between Resignations and Tuition Paid

Line of Best Fit for a Slight Relationship

Slightly Positive Relationship
When the line of best fit goes slightly up from left to right, we have a somewhat positive relationship.

Slightly Negative Relationship
When the line of best fit goes slightly down from left to right, we have a somewhat negative relationship.

Figure 18.3 A Slightly Positive Relationship Between Resignations and Tuition Paid

Do you know?
If the line of best fit were at exactly 45 degrees, we would know that the relationship between the two variables is perfect in that, if we knew one value, we could accurately predict the other. This does not occur in most instances, however; we will see lines that look nearly perfect, and, in other instances, the line will be nearly flat.

Task 3: Construct a scatterplot to investigate the correlation and interpret the result.

Figure 18.4 Scatterplot Showing the Negative Correlation Between Absences and Exam Scores

Interpretation: We plot a line of best fit, and it rises at nearly a 45-degree angle toward the upper left. When we see a line such as this, we immediately know the relationship we have plotted is negative; when one value goes up, the other goes down. In this case, since the line is nearly at a 45-degree angle, we know the relationship is fairly strong. This verifies what we saw with Pearson's r.

Note: Always keep in mind, however, that while both the computed value of r and the scatterplot indicate a strong relationship, you should not be fooled. Correlations are just descriptive measures of the relationship; they do not necessarily mean that one variable caused the other to change.
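The lesson does not say how the line of best fit is obtained. A common choice, assumed here, is the least-squares line. The sketch below computes its slope and intercept for the absences and exam scores, with absences on the x-axis as in Table 18.12; the function name `best_fit_line` and the list names are hypothetical.

```python
def best_fit_line(x, y):
    """Least-squares slope and intercept for predicting y from x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xx = sum(v * v for v in x)
    sum_xy = sum(p * q for p, q in zip(x, y))
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    intercept = (sum_y - slope * sum_x) / n
    return slope, intercept


# Absences (x) and final exam scores (y) from Table 18.12.
absences = [1, 2, 4, 5, 6, 4, 6, 7, 2, 7, 7, 9, 11, 2, 1, 12, 6, 7, 2, 3]
scores = [100, 95, 90, 90, 80, 85, 82, 80, 97, 77,
          70, 90, 55, 94, 90, 70, 80, 80, 93, 88]

slope, intercept = best_fit_line(absences, scores)
print(f"score = {intercept:.2f} + ({slope:.2f}) * absences")
# prints "score = 99.08 + (-2.84) * absences"
```

The negative slope agrees with the negative Pearson's r: under this assumption, each additional absence goes with an exam score roughly 2.8 points lower on average. As with r itself, this describes the pattern in the data and says nothing about cause.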

Spearman's rho: Six-Step Research Model (A Case Study)

The Case of Different Tastes

Recently I visited the Coca-Cola museum in Laguna. During the tour, I was amazed that the formula for Coke differs throughout the world; the museum had dispensers that allowed us to taste different varieties of their product from different countries. I tried some and I was astonished at how bad some of them tasted. "How," I asked, "could anyone enjoy drinking this? Wouldn't it be better if everyone drank the same brand of Coke that we enjoy in the Philippines?"

Of course, my statistical mind went immediately to work. I thought, "If I could only get two groups of people, some from the Philippines and some from foreign countries, I could get them to rank samples from different parts of the world. I will bet, if everyone were exposed to OUR formula, they would agree that it is best!"

Step 1: Identify the Problem

Here, again, there's really not a problem from my perspective. I'm just interested in knowing if people from throughout the world rank the different varieties of Coke in the same way. It might, however, be in the best interests of Coca-Cola to investigate my idea.

Statement of the Problem: "Developing, manufacturing, and distributing soft drinks throughout the world is a time-consuming and expensive proposition. If there is no difference in the rankings of different products, it is possible that manufacturing and advertising costs could be lessened by focusing on a smaller number of products."

Step 2: State a Hypothesis

Research Hypothesis: "There will be no difference in the preference rankings for persons from the Philippines and persons from outside the Philippines."

Step 3: Identify the Independent Variable

In this case, our predictor variable is going to be a person's country of residence. Either they are from the Philippines or they are from a foreign country.

Step 4: Identify and Describe the Dependent Variable

Let's suppose I asked each group to rank their preferences for 10 different formulas of Coke. Our dependent variable is the rankings for each of the groups.

Step 5: Choose the Right Statistical Test

Because we are using ordinal data (i.e., the rankings from the two groups) and we are interested in computing a correlation between the two groups, we will be using the Spearman's rho formula.

Step 6: Computation and Data Analysis to Investigate the Hypothesis

Task 1: Determine the Spearman's rho value.

Table 11.15 Necessary Values for Spearman's Rho

Variety  Philippines      Foreign Countries  d        d²
A        5                10
B        3                5
C        1                6
D        1                3
E        6                7
F        7                9
G        9                1
H        10               2
I        4                8
J        8                4
                                             Σd =     Σd² =

The rho value is -.316.

Task 2: Interpret the result.

Our value of rho (i.e., -.316) indicates that a small negative correlation exists; people from foreign countries have a slight tendency to rank the sodas in the opposite order from those in the Philippines.

Task 3: Construct a scatterplot and interpret the graph.

Interpretation: The line is going slightly down from left to right. If the line instead sloped steeply downward, it would indicate a great deal of disagreement in the rankings (e.g., one group ranked an item tenth and the other first, one ranked an item second and the other ninth, etc.).
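One detail is worth checking. In the table as printed, the Philippines column contains rank 1 twice (items C and D) and no rank 2. The rank-difference formula assumes no ties, and applying it directly to the printed ranks does not reproduce the reported value. A common way of handling ties, assumed here rather than taken from the text, is to give tied items the average of the positions they occupy and then compute Pearson's r on the two sets of ranks. The sketch below compares both approaches; all function and variable names are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from smallest (1) upward; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover every item tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson_r(x, y):
    """Pearson's r from the sums of x, y, their squares, and their products."""
    n = len(x)
    a, c = sum(x), sum(y)
    b = sum(v * v for v in x)
    d = sum(v * v for v in y)
    e = sum(p * q for p, q in zip(x, y))
    return (n * e - a * c) / sqrt((n * b - a ** 2) * (n * d - c ** 2))


# Rankings of items A through J as printed in the table above.
philippines = [5, 3, 1, 1, 6, 7, 9, 10, 4, 8]
foreign = [10, 5, 6, 3, 7, 9, 1, 2, 8, 4]

# Approach 1: rank-difference formula applied to the printed ranks.
n = len(philippines)
sum_d_squared = sum((p - f) ** 2 for p, f in zip(philippines, foreign))
rho_simple = 1 - (6 * sum_d_squared) / (n * (n ** 2 - 1))

# Approach 2 (assumed tie handling): Pearson's r on tie-averaged ranks.
rho_tied = pearson_r(average_ranks(philippines), average_ranks(foreign))

print(f"rank-difference formula: {rho_simple:.3f}")
print(f"tie-averaged ranks:      {rho_tied:.3f}")
```

With the printed ranks, the rank-difference formula gives about -0.352, while the tie-averaged calculation gives about -0.316, which matches the reported value. This suggests, though it does not prove, that the reported rho was computed with the tied ranks averaged.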

Figure 18.5 Scatterplot Showing a Negative Correlation in Rankings between the Philippines and Foreign Countries

Let's Practice:

Direction: Use the six-step research model to investigate the case below.

The Case of Height versus Weight

Let's look at a case that's dear to many of us: the relationship between height and weight. We all know that, generally speaking, taller people weigh more than shorter people, but that doesn't always hold true. In this case, we're interested in looking at something that is usually pretty obvious: tall people generally weigh more than short people. While this is not really a problem per se, it is the type of thing that correlations are often used for.

Table 18. Height and Weight Data