3.1 Measures of Central Tendency


Bonnie Cornelia Franklin
5 years ago

Summation Notation

The symbol $\sum_{i=1}^{n} x_i$ (or simply $\sum x$) means: sum the observations on the variable that appears to the right of the summation symbol.

Example 1. Suppose the variable $x_i$ represents the number of bathrooms in a given residence. Five residential properties are examined, and the value of $x_i$ is recorded for each. The observations are $x_1 = 2$, $x_2 = 1$, $x_3 = 3$, $x_4 = 2$, and $x_5 = 3$.
(a) Find $\sum_{i=1}^{5} x_i$ and $\sum_{i=1}^{5} x_i^2$.
(b) Find $\sum_{i=1}^{5} (x_i - 3)$ and $\sum_{i=1}^{5} (x_i - 3)^2$.

Solution:
(a) The symbol $\sum_{i=1}^{5} x_i$ tells us to sum the $x_i$ values in the data set. Therefore,
$$\sum_{i=1}^{5} x_i = x_1 + x_2 + x_3 + x_4 + x_5 = 2 + 1 + 3 + 2 + 3 = 11.$$
The symbol $\sum_{i=1}^{5} x_i^2$ tells us to square each $x_i$ value and then sum:
$$\sum_{i=1}^{5} x_i^2 = x_1^2 + x_2^2 + x_3^2 + x_4^2 + x_5^2 = 4 + 1 + 9 + 4 + 9 = 27.$$
(b) The symbol $\sum_{i=1}^{5} (x_i - 3)$ tells us to subtract 3 from each $x_i$ value and then sum. Therefore,
$$\sum_{i=1}^{5} (x_i - 3) = (2 - 3) + (1 - 3) + (3 - 3) + (2 - 3) + (3 - 3) = (-1) + (-2) + 0 + (-1) + 0 = -4,$$
$$\sum_{i=1}^{5} (x_i - 3)^2 = (-1)^2 + (-2)^2 + (0)^2 + (-1)^2 + (0)^2 = 6.$$
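The four sums in Example 1 can be checked with a few lines of Python. This is only a sketch for verification; the variable names are illustrative and not part of the notes.

```python
# Observations from Example 1: number of bathrooms in five residences.
x = [2, 1, 3, 2, 3]

sum_x = sum(x)                                    # sum of x_i
sum_x_squared = sum(v ** 2 for v in x)            # sum of x_i^2
sum_dev = sum(v - 3 for v in x)                   # sum of (x_i - 3)
sum_dev_squared = sum((v - 3) ** 2 for v in x)    # sum of (x_i - 3)^2

print(sum_x, sum_x_squared, sum_dev, sum_dev_squared)
```

The printed values are 11, 27, -4 and 6. Note that the sum of the squares (27) is not the square of the sum (121).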

Example 2. Find $\sum xy$ if the variables $x_i, y_i$ are given by the following table.

[Table of $i$, $x_i$, $y_i$ for $i = 1, \dots, 5$: values not recoverable.]

Solution:
$$\sum xy = \sum_{i=1}^{5} x_i y_i = x_1 y_1 + x_2 y_2 + x_3 y_3 + x_4 y_4 + x_5 y_5.$$

Mean

Definition 1. The mean of a set of measurements is defined to be the sum of the measurements divided by the total number of measurements.

Mean for ungrouped data

The mean for ungrouped data is obtained by dividing the sum of all values by the number of values in the data set. Thus,
(a) Mean for population data:
$$\mu = \frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{\sum_{i=1}^{N} x_i}{N}$$
(b) Mean for sample data:
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$
where $\sum x_i$ is the sum of all values, $N$ is the population size, $n$ is the sample size, $\mu$ is the population mean, and $\bar{x}$ is the sample mean.

Example 3. The monthly starting salaries for a sample of 12 business school graduates are as follows:

[Table of Graduate and Monthly Salary for the 12 graduates: values not recoverable.]

Solution: The sample mean is
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{25{,}680}{12} = 2140.$$

Mean for grouped data

(a) Mean for population data:
$$\mu = \frac{\sum f_i m_i}{N}$$
(b) Mean for sample data:
$$\bar{x} = \frac{\sum f_i m_i}{n}$$
where $f_i$ denotes the frequency of class $i$ and $m_i$ denotes the midpoint of class $i$.

Example 4. Recall the frequency distribution in Example 2 (Chapter 2). Find the mean.

[Table with columns Audit Time (Days), Frequency $f_i$, Class Midpoint $m_i$, and $f_i m_i$: row values not recoverable. Column totals: $\sum f_i = 20$ and $\sum f_i m_i = 380$.]

Hence, the sample mean is
$$\bar{x} = \frac{\sum f_i m_i}{\sum f_i} = \frac{380}{20} = 19.$$

Weighted Mean

Let $X_1, X_2, \dots, X_N$ be a set of $N$ values, and let $w_1, w_2, \dots, w_N$ be the weights assigned to them. The weighted mean is found by dividing the sum of the products of the values and their weights by the sum of the weights:
$$\mu_{\text{Weighted}} = \frac{w_1 X_1 + w_2 X_2 + \cdots + w_N X_N}{w_1 + w_2 + \cdots + w_N} = \frac{\sum w_i X_i}{\sum w_i}$$

Example 5. It is decided that six observations, 12, 20, 17, 5, 9 and 22, should be given the weights 10, 4, 6, 18, 16 and 3 respectively. What is the weighted mean?

Solution: The weighted mean is calculated as
$$\frac{\sum w_i X_i}{\sum w_i} = \frac{(10 \times 12) + (4 \times 20) + (6 \times 17) + (18 \times 5) + (16 \times 9) + (3 \times 22)}{10 + 4 + 6 + 18 + 16 + 3} = \frac{602}{57} \approx 10.6.$$

Example 6. A mathematics class is divided into two sections, both of which are given the same test. Section 1 (41 students) has a mean score of 62, and Section 2 (52 students) has a mean score of 68. Find the mean of the whole class correct to two decimal places.

Solution: Since $w_1 = 41$, $X_1 = 62$, $w_2 = 52$, $X_2 = 68$, we find
$$\bar{x} = \frac{w_1 X_1 + w_2 X_2}{w_1 + w_2} = \frac{(41)(62) + (52)(68)}{41 + 52} = \frac{6078}{93} \approx 65.35.$$

Example 7. The examination results of an AD student are listed as follows:

Subject   Credit   Grade   Grade Point
I         3        B+      3.5
II        3        B       3
III       1        A+      4.5
IV        4        D+      1.5
V         5        D       1

Hence, the GPA of the student is calculated as follows:
$$\text{GPA} = \frac{(3)(3.5) + (3)(3) + (1)(4.5) + (4)(1.5) + (5)(1)}{3 + 3 + 1 + 4 + 5} = \frac{35}{16} \approx 2.19.$$

Geometric Mean

The geometric mean of a set of values is defined as the $n$th root of the product of the $n$ values. The geometric mean of $X_1, X_2, \dots, X_n$ is given by
$$\bar{X}_G = \sqrt[n]{X_1 X_2 \cdots X_n} = (X_1 X_2 \cdots X_n)^{1/n}$$
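The weighted-mean calculations of Examples 5 to 7 all follow the same pattern, so they can be checked with one small helper. The function name `weighted_mean` is an illustrative choice, not something defined in the notes.

```python
def weighted_mean(values, weights):
    """Return sum(w_i * X_i) / sum(w_i) for paired values and weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for w, x in zip(weights, values)) / total_weight


# Example 5: six observations with their assigned weights.
ex5 = weighted_mean([12, 20, 17, 5, 9, 22], [10, 4, 6, 18, 16, 3])

# Example 6: section means weighted by the number of students.
ex6 = weighted_mean([62, 68], [41, 52])

# Example 7: grade points weighted by credits.
gpa = weighted_mean([3.5, 3, 4.5, 1.5, 1], [3, 3, 1, 4, 5])

print(round(ex5, 1), round(ex6, 2), round(gpa, 2))
```

Example 6 shows why the weights matter: the unweighted average of 62 and 68 is 65, but the larger section pulls the class mean up to about 65.35.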

Example 8. Find the geometric mean for the following set of data: 8, 18, 24, 36, 64.

Solution: By the formula, the geometric mean is
$$\bar{X}_G = \sqrt[5]{8 \times 18 \times 24 \times 36 \times 64} = \sqrt[5]{7{,}962{,}624} = 24.$$

Application: Geometric Mean Rate of Return

$$\bar{R}_G = [(1 + R_1)(1 + R_2) \cdots (1 + R_n)]^{1/n} - 1$$
where $R_i$ is the rate of return in time period $i$.

Example 9. The price of a stock has risen by 6%, 13%, 11% and 15% in each of 4 successive years. Find the average percentage rise in the price of the stock.

Solution: The geometric mean rate of return is given by
$$[(1 + 0.06)(1 + 0.13)(1 + 0.11)(1 + 0.15)]^{1/4} - 1 = (1.5290)^{1/4} - 1 \approx 1.112 - 1 = 0.112.$$
The average rise is 0.112 = 11.2%. This value of 11.2% can be interpreted as the constant yearly increase needed to produce the final-year price, given the starting price.

Median

Definition 2. The median of a set of measurements is defined to be the middle value when the measurements are arranged from lowest to highest.

Median for ungrouped data

If there is an odd number of items, the median is the value of the middle item when all items are arranged in ascending order. If there is an even number of items, the median is the average of the two middle items when all items are arranged in ascending order.
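The geometric mean and the geometric mean rate of return from Examples 8 and 9 can be reproduced with the sketch below. The helper names are illustrative; `math.prod` requires Python 3.8 or later.

```python
import math


def geometric_mean(values):
    """n-th root of the product of n positive values."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("values must be a non-empty list of positive numbers")
    return math.prod(values) ** (1 / len(values))


def geometric_mean_return(rates):
    """Constant per-period rate equivalent to the given periodic rates."""
    growth_factors = [1 + r for r in rates]
    return geometric_mean(growth_factors) - 1


# Example 8: geometric mean of 8, 18, 24, 36, 64.
gm = geometric_mean([8, 18, 24, 36, 64])

# Example 9: yearly rises of 6%, 13%, 11% and 15%.
avg_rise = geometric_mean_return([0.06, 0.13, 0.11, 0.15])

print(round(gm, 6), round(avg_rise, 3))
```

Because of floating-point rounding, `gm` may differ from 24 in the last few digits, which is why the result is rounded before printing.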

Example 10. Find the median of the given data.

[Data values not recoverable.]

Solution: Arrange the data in ascending order. The median is the middle (3rd) term, 46.

Example 11. Compute the median for Example 3.

Solution: Arranging the 12 items in ascending order, the two middle items are 2090 and 2120. Hence,
$$\text{Median} = \frac{2090 + 2120}{2} = 2105.$$

Median for grouped data

For grouped data, the median can be found by first identifying the class containing the median and then applying the following formula:
$$\text{median} = l_1 + \left(\frac{n/2 - C}{f_m}\right)(l_2 - l_1)$$
where:
  $l_1$ is the lower class boundary of the median class;
  $n$ is the total frequency;
  $C$ is the cumulative frequency just before the median class;
  $f_m$ is the frequency of the median class;
  $l_2$ is the upper class boundary of the median class.

The median depends on the total number of data values but is not affected by extreme values. However, if the data are ungrouped and numerous, finding the median is tedious. Note that the median may be applied to qualitative data if they can be ranked.

Mode

Definition 3. The mode of a set of measurements is defined to be the measurement that occurs most often (with the highest frequency).

The mode may not exist, and even if it does exist it may not be unique. A distribution having only one mode is called unimodal.
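The grouped-median formula above can be sketched as a short function. The frequency table used here is hypothetical, chosen only to illustrate the interpolation; it is not taken from the notes, and the function name is likewise an assumption.

```python
def grouped_median(classes):
    """Median for grouped data.

    classes: list of (lower_boundary, upper_boundary, frequency) tuples,
    listed in ascending order of the class boundaries.
    """
    n = sum(freq for _, _, freq in classes)
    cumulative = 0  # C: cumulative frequency before the current class
    for lower, upper, freq in classes:
        # The median class is the first class whose cumulative frequency reaches n/2.
        if freq > 0 and cumulative + freq >= n / 2:
            return lower + (n / 2 - cumulative) / freq * (upper - lower)
        cumulative += freq
    raise ValueError("classes must contain at least one observation")


# Hypothetical frequency table: (l1, l2, f) for four classes, n = 20.
hypothetical_classes = [(0, 10, 3), (10, 20, 5), (20, 30, 8), (30, 40, 4)]
median_value = grouped_median(hypothetical_classes)
print(median_value)
```

For this table $n/2 = 10$, the median class is 20 to 30 with $C = 8$ and $f_m = 8$, so the formula gives $20 + (2/8)(10) = 22.5$.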

Example 12. Refer to Example 5 (Chapter 2). Find the mode.

Automobile            Purchase frequency
Chevrolet Cavalier    9
Ford Escort           14
Toyota Echo           8
Honda Accord          11
Hyundai Excel         8
Total                 50

The mode is the Ford Escort.

For grouped data, the mode can be found by first identifying the class with the largest frequency, called the modal class, and then applying the following formula to the modal class:
$$\text{mode} = l_1 + \left(\frac{f_a}{f_a + f_b}\right)(l_2 - l_1)$$
where:
  $l_1$ is the lower class boundary of the modal class;
  $f_a$ is the difference between the frequency of the modal class and that of the previous class, and is always positive;
  $f_b$ is the difference between the frequency of the modal class and that of the following class, and is always positive;
  $l_2$ is the upper class boundary of the modal class.

Characteristics of each measure of Central Tendency

Mode
1. It is the most frequent or probable measurement in the data set.
2. There can be more than one mode for a data set.
3. It is not influenced by extreme measurements.
4. Modes of subsets cannot be combined to determine the mode of the complete data set.
5. For grouped data, its value can change depending on the classes used.
6. It is applicable to both qualitative and quantitative data.
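As a rough illustration, the mode of the frequency table in Example 12 and the modal-class formula can be coded as follows. The function name `grouped_mode` and the numbers passed to it are hypothetical; only the automobile frequencies come from the notes.

```python
# Example 12: purchase frequencies by automobile.
purchases = {
    "Chevrolet Cavalier": 9,
    "Ford Escort": 14,
    "Toyota Echo": 8,
    "Honda Accord": 11,
    "Hyundai Excel": 8,
}

# Collect every category that attains the highest frequency,
# since the mode need not be unique.
highest = max(purchases.values())
modes = [name for name, freq in purchases.items() if freq == highest]


def grouped_mode(l1, l2, f_prev, f_modal, f_next):
    """Mode for grouped data from the modal class and its two neighbours."""
    f_a = f_modal - f_prev  # difference with the previous class
    f_b = f_modal - f_next  # difference with the following class
    if f_a + f_b == 0:
        raise ValueError("modal class must exceed at least one neighbour")
    return l1 + f_a / (f_a + f_b) * (l2 - l1)


# Hypothetical modal class 20 to 30 with frequency 8, neighbours 5 and 4.
hypothetical_mode = grouped_mode(20, 30, 5, 8, 4)
print(modes, round(hypothetical_mode, 2))
```

Returning a list of modes rather than a single value reflects the remark that a data set can have more than one mode.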

Median
1. It is the central value: 50% of the measurements lie above it and 50% fall below it.
2. There is only one median for a data set.
3. Medians of subsets cannot be combined to determine the median of the complete data set.
4. For grouped data, its value is rather stable even when the data are organized into different classes.
5. It is applicable to quantitative data, and to qualitative data that can be ranked.

Mean
1. It is the average of the measurements in a data set.
2. There is only one mean for a data set.
3. Its value is influenced by extreme measurements; trimming can help to reduce the degree of influence.
4. Means of subsets can be combined to determine the mean of the complete data set.
5. It is applicable to quantitative data only.

Percentiles

Definition 4. The $p$th percentile is a value such that at least $p$ percent of the items take on this value or less and at least $(100 - p)$ percent of the items take on this value or more.

Calculating the $p$th Percentile

Step 1. Arrange the data in ascending order (rank order from smallest value to largest value).
Step 2. Compute an index $i$ as follows:
$$i = \left(\frac{p}{100}\right) n$$
where $p$ is the percentile of interest and $n$ is the number of items.

Step 3. (a) If $i$ is not an integer, round up. The next integer greater than $i$ denotes the position of the $p$th percentile.
(b) If $i$ is an integer, the $p$th percentile is the average of the data values in positions $i$ and $i + 1$.

Example 13. Determine the 85th percentile for Example 3.

Solution:
Step 1. Arrange the 12 data values in ascending order.
Step 2. $i = \left(\frac{p}{100}\right) n = \left(\frac{85}{100}\right) 12 = 10.2$
Step 3. Since $i$ is not an integer, round up. The position of the 85th percentile is the next integer greater than 10.2, the 11th position.

Quartiles

Definition 5. The 25th, 50th, and 75th percentiles of the data are referred to as the first quartile, the second quartile, and the third quartile, respectively. The quartiles can be used to divide the data into four parts, with each part containing approximately 25% of the data.

  $Q_1$ = first quartile
  $Q_2$ = second quartile (Median)
  $Q_3$ = third quartile

Example 14. Find the quartiles for the given set of data.
6.1, 2.8, 1.2, 0.7, 4.3, 5.5, 5.9, 6.5, 7.6, 8.3, 9.6, 9.8, 12.9, 13.1, 18.5

Solution: Arranged in ascending order, the data are
0.7, 1.2, 2.8, 4.3, 5.5, 5.9, 6.1, 6.5, 7.6, 8.3, 9.6, 9.8, 12.9, 13.1, 18.5.
For $Q_1$,
$$i = \left(\frac{25}{100}\right) n = \frac{1}{4}(15) = 3.75, \text{ rounded up to } 4.$$
Hence, $Q_1$ = observation at the 4th position = 4.3.
For $Q_2$,
$$i = \left(\frac{50}{100}\right) n = \frac{1}{2}(15) = 7.5, \text{ rounded up to } 8.$$
Hence, $Q_2$ = observation at the 8th position = 6.5.
For $Q_3$,
$$i = \left(\frac{75}{100}\right) n = \frac{3}{4}(15) = 11.25, \text{ rounded up to } 12.$$
Hence, $Q_3$ = observation at the 12th position = 9.8.

Example 15. Find $Q_1$ and $Q_3$ for Example 3. The ordered data are given as follows:

[Ordered data values not recoverable.]

Solution: For $Q_1$,
$$i = \left(\frac{25}{100}\right) n = \frac{1}{4}(12) = 3, \text{ an integer.}$$
Hence, $Q_1$ = average of the observations at the 3rd and 4th positions [values not recoverable].
For $Q_3$,
$$i = \left(\frac{75}{100}\right) n = \frac{3}{4}(12) = 9, \text{ an integer.}$$
Hence, $Q_3$ = average of the observations at the 9th and 10th positions [values not recoverable].

Measures of Variation

Range

Definition 6. The range is the difference between the largest and smallest data values.

Interquartile Range

Definition 7. The interquartile range (IQR) is the difference between the third and first quartiles in a set of data:
$$\text{IQR} = Q_3 - Q_1$$

Example 16. Refer to Example 15. The range is 615, and the IQR is $Q_3 - Q_1$ [value not recoverable].

Variance and Standard Deviation

Definition 8. The variance is the average of the squared differences between each of the observations in a set of data and the mean. The standard deviation is the positive square root of the variance.

Formulas are given for ungrouped data first and then for grouped data.
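The three-step percentile procedure, together with the range and IQR definitions, can be sketched in Python using the data of Example 14. The function name `percentile` is illustrative, and the sketch assumes $0 < p < 100$ so that the computed position always exists.

```python
import math


def percentile(data, p):
    """p-th percentile by the index rule i = (p / 100) * n, for 0 < p < 100."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(data)          # Step 1: arrange in ascending order
    n = len(ordered)
    i = p * n / 100                 # Step 2: compute the index
    if not i.is_integer():          # Step 3(a): round up to the next position
        return ordered[math.ceil(i) - 1]
    i = int(i)                      # Step 3(b): average positions i and i + 1
    return (ordered[i - 1] + ordered[i]) / 2


# Data from Example 14 (n = 15), in the order given.
data = [6.1, 2.8, 1.2, 0.7, 4.3, 5.5, 5.9, 6.5, 7.6, 8.3,
        9.6, 9.8, 12.9, 13.1, 18.5]

q1 = percentile(data, 25)
q2 = percentile(data, 50)
q3 = percentile(data, 75)
data_range = max(data) - min(data)
iqr = q3 - q1

print(q1, q2, q3, round(data_range, 1), round(iqr, 1))
```

Sorting inside the function matters: with the data sorted as Step 1 requires, the fourth ordered value is 4.3, so $Q_1 = 4.3$, $Q_2 = 6.5$ and $Q_3 = 9.8$. Positions are 1-based in the procedure, hence the `- 1` when indexing the Python list.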

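(a) Population:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}, \qquad \sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$
(b) Sample:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}, \qquad s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

For grouped data

(a) Population:
$$\sigma^2 = \frac{\sum f_i (m_i - \mu)^2}{\sum f_i}, \qquad \sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum f_i (m_i - \mu)^2}{\sum f_i}}$$
(b) Sample:
$$s^2 = \frac{\sum f_i (m_i - \bar{x})^2}{\sum f_i - 1}, \qquad s = \sqrt{s^2} = \sqrt{\frac{\sum f_i (m_i - \bar{x})^2}{\sum f_i - 1}}$$

Alternative Formula:

(a) Ungrouped data:
$$\sigma^2 = \frac{\sum_{i=1}^{N} x_i^2 - N \mu^2}{N} \quad \text{and} \quad s^2 = \frac{\sum_{i=1}^{n} x_i^2 - n \bar{x}^2}{n - 1}$$
(b) Grouped data:
$$\sigma^2 = \frac{\sum f_i m_i^2 - N \mu^2}{N} \quad \text{and} \quad s^2 = \frac{\sum f_i m_i^2 - n \bar{x}^2}{n - 1}$$

Example 17. Find the variance and standard deviation for the data given in Example 3.

Solution: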

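[Table with columns Monthly Salary ($x_i$), Sample Mean ($\bar{x}$), Deviation About the Mean ($x_i - \bar{x}$), and Squared Deviation About the Mean ($(x_i - \bar{x})^2$): row values not recoverable. Column totals: $\sum (x_i - \bar{x}) = 0$; the total $\sum (x_i - \bar{x})^2$ is not recoverable.]

Thus, the variance is
$$s^2 = \frac{\sum_{i=1}^{12} (x_i - \bar{x})^2}{n - 1}$$
and the standard deviation is
$$s = \sqrt{s^2}$$
[numerical values not recoverable].

Example 18. Redo the last example by using the alternative formula.

Solution:
$$\sum x_i^2 = (2050)^2 + (2150)^2 + \cdots + (2080)^2$$
$$\bar{x} = \frac{\sum_{i=1}^{12} x_i}{n} = 2140$$
$$s^2 = \frac{\sum_{i=1}^{12} x_i^2 - n(\bar{x})^2}{n - 1} = \frac{\sum x_i^2 - 12(2140)^2}{12 - 1}$$
$$s = \sqrt{s^2}$$
[The numerical totals are not recoverable.]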

[Table of Monthly Salary ($x_i$) and $x_i^2$ with column totals $\sum x_i$ and $\sum x_i^2$: values not recoverable.]

Example 19. Find the variance and standard deviation for the data given in Example 4.

Solution: Recall that the mean is $\bar{x} = 19$ (see Example 4).

[Table with columns Audit Time (Days), Frequency $f_i$, Class Midpoint $m_i$, Deviation $(m_i - \bar{x})$, Squared Deviation $(m_i - \bar{x})^2$, $f_i (m_i - \bar{x})$, and $f_i (m_i - \bar{x})^2$: values not recoverable.]

The variance is
$$s^2 = \frac{\sum f_i (m_i - \bar{x})^2}{\sum f_i - 1} = \frac{570}{19} = 30$$
and the standard deviation is
$$s = \sqrt{s^2} = \sqrt{30} \approx 5.48.$$

Example 20. Redo the last example by using the alternative formula.

Solution: Recall that the mean is $\bar{x} = 19$ (see Example 4).

[Table with columns Audit Time (Days), Frequency $f_i$, Class Midpoint $m_i$, $m_i^2$, and $f_i m_i^2$: row values not recoverable. Column total: $\sum f_i m_i^2 = 7790$.]

The variance is
$$s^2 = \frac{\sum f_i m_i^2 - n(\bar{x})^2}{\sum f_i - 1} = \frac{7790 - (20)(19^2)}{20 - 1} = \frac{570}{19} = 30$$
and the standard deviation is
$$s = \sqrt{s^2} = \sqrt{30} \approx 5.48.$$

Coefficient of Variation

The coefficient of variation is defined as follows:
$$\text{CV} = \frac{\text{Standard Deviation}}{\text{Mean}} \times 100\%$$

When to use the CV:
1. The data are in different units.
2. The data are in the same units, but the means are far apart.

Example 21. A study of the test scores for an in-plant course in management principles and the years of service of the employees enrolled in the course resulted in these statistics: the mean score was 200 and the standard deviation was 40; the mean number of years of service was 20 years and the standard deviation was 2 years. Compare the relative dispersion in the two distributions using the coefficient of variation.

Solution: The distributions are in different units (test scores and years of service). Therefore, they are converted to coefficients of variation.

For the test scores:
$$\text{CV} = \frac{s}{\bar{x}}(100) = \frac{40}{200}(100) = 20 \text{ percent}$$
For years of service:
$$\text{CV} = \frac{s}{\bar{x}}(100) = \frac{2}{20}(100) = 10 \text{ percent}$$

Interpreting, there is more dispersion relative to the mean in the distribution of test scores than in the distribution of years of service (because 20 percent > 10 percent).

The same procedure is used when the data are in the same units but the means are far apart. (See the following example.)

Example 22. The variation in the annual incomes of executives is to be compared with the variation in incomes of unskilled employees. For a sample of executives, $\bar{x}$ = \$500,000 and $s$ = \$50,000. For a sample of unskilled employees, $\bar{x}$ = \$22,000 and $s$ = \$2,200. We are tempted to say that there is more dispersion in the annual incomes of the executives because \$50,000 > \$2,200. The means are so far apart, however, that we need to convert the statistics to coefficients of variation to make a meaningful comparison of the variation in annual incomes.

Solution:
For the executives:
$$\text{CV} = \frac{s}{\bar{x}}(100) = \frac{50{,}000}{500{,}000}(100) = 10 \text{ percent}$$
For the unskilled employees:
$$\text{CV} = \frac{s}{\bar{x}}(100) = \frac{2{,}200}{22{,}000}(100) = 10 \text{ percent}$$

There is no difference in the relative dispersion of the two groups.

3.4 Shape

An important property of a set of data is its shape: the manner in which the data are distributed. The distribution of the data is either symmetrical or not. If the distribution of data is not symmetrical, it is called asymmetrical or skewed.

  Mean > Median: positive or right-skewness
  Mean = Median: symmetry or zero-skewness
  Mean < Median: negative or left-skewness
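The coefficient-of-variation comparisons in Examples 21 and 22 reduce to one division each, as the following sketch shows. The function name is an illustrative choice.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation as a percentage of the mean."""
    if mean == 0:
        raise ValueError("the mean must be non-zero")
    return 100 * std_dev / mean


# Example 21: different units (test scores vs. years of service).
cv_scores = coefficient_of_variation(40, 200)
cv_service = coefficient_of_variation(2, 20)

# Example 22: same units (dollars) but means far apart.
cv_executives = coefficient_of_variation(50_000, 500_000)
cv_unskilled = coefficient_of_variation(2_200, 22_000)

print(cv_scores, cv_service, cv_executives, cv_unskilled)
```

Because the CV is a ratio, the units cancel, which is what makes the comparison between scores and years meaningful.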

Skewness is an abstract quantity that describes how the data are piled up. A number of measures have been suggested to determine the skewness of a given distribution. One of the simplest is known as Pearson's measure of skewness:
$$\text{Skewness} = \frac{\text{Mean} - \text{Mode}}{\text{Standard Deviation}} \approx \frac{3(\text{Mean} - \text{Median})}{\text{Standard Deviation}}$$

Positive skewness arises when the mean is increased by some unusually high values; negative skewness occurs when the mean is reduced by some extremely low values. Data are symmetrical when there are no really extreme values in a particular direction, so that low and high values balance each other out.

For data sets that are extremely skewed, be wary of using the mean as a measure of the center of the distribution. In this situation, a more meaningful measure of central tendency may be the median, which is more resistant to the influence of extreme measurements.

3.5 Box-and-Whisker Plot (Box Plot)

A box-and-whisker plot is a plot that shows the center, spread, and skewness of a data set. It is constructed by drawing a box and two whiskers that use the median, the first quartile, the third quartile, and the smallest and the largest values in the data set.
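As a small illustration of the median form of Pearson's measure, the sketch below applies it to the five observations of Example 1 (2, 1, 3, 2, 3). Reusing that data set here is an assumption made for illustration; the notes do not compute a skewness for it.

```python
import statistics


def pearson_skewness(data):
    """Pearson's measure: 3 * (mean - median) / sample standard deviation."""
    mean = statistics.mean(data)
    median = statistics.median(data)
    std_dev = statistics.stdev(data)  # sample standard deviation (n - 1)
    if std_dev == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return 3 * (mean - median) / std_dev


skewness = pearson_skewness([2, 1, 3, 2, 3])
print(round(skewness, 3))
```

Here the mean (2.2) exceeds the median (2), so the measure is positive, in line with the rule that Mean > Median indicates right-skewness.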

Example 23. The following data give the incomes (in thousands of dollars) for a sample of 13 households.

[Data values not recoverable.]

Construct a box-and-whisker plot for these data.

Solution: The following steps are performed to construct a box-and-whisker plot.

Step 1. First, rank the data in increasing order and calculate the values of the median, the first quartile, the third quartile, and the interquartile range. The ranked data are:

[Ranked data values not recoverable.]

Step 2. Determine the median, the quartiles, and the smallest and the largest values in the given data set. These five values for our example are as follows.
  Median: $i = \frac{13}{2} = 6.5$, so the median is the 7th ordered value = 32
  First quartile $Q_1$: $i = \frac{13}{4} = 3.25$, so $Q_1$ is the 4th ordered value = 23
  Third quartile $Q_3$: $i = \frac{3(13)}{4} = 9.75$, so $Q_3$ is the 10th ordered value = 46
  Smallest value = 17
  Largest value = 92

Step 3. Draw a horizontal line and mark the income levels on it such that all the values in the given data set are covered. Above the horizontal line, draw a box with its left side at the position of the first quartile and its right side at the position of the third quartile. Inside the box, draw a vertical line at the position of the median.

Step 4. By drawing two lines, join the points of the smallest and the largest values to the box. These values are 17 and 60 in this example as listed in Step 2. The two lines that join the box to these two values are called whiskers.

3.6 Uses of Standard Deviation

The Empirical Rule

For data having a bell-shaped distribution,

Approximately 68% of the items will be within one standard deviation of the mean.
Approximately 95% of the items will be within two standard deviations of the mean.
Almost all (99.7%) of the items will be within three standard deviations of the mean.

Example 24. The age distribution of a sample of 5000 persons is bell-shaped with a mean of 40 years and a standard deviation of 12 years. Determine the approximate percentage of people who are 16 to 64 years old.

Solution: We will use the empirical rule to find the required percentage because the distribution of ages follows a bell-shaped curve. From the given information, for this distribution,
$$\bar{x} = 40 \text{ years and } s = 12 \text{ years.}$$


Each of the two points, 16 and 64, is 24 units away from the mean. Dividing 24 by 12, we convert the distance between each of the two points and the mean into standard deviations. Thus, the distance between 16 and 40 and between 40 and 64 is each equal to $2s$. Because the area within two standard deviations of the mean is approximately 95% for a bell-shaped curve, approximately 95% of the people in the sample are 16 to 64 years old.

Chebyshev's Theorem

At least $(1 - 1/k^2)$ of the items in any data set must be within $k$ standard deviations of the mean, where $k$ is any value greater than 1.

Example 25. For a statistics class, the mean for the midterm scores is 75 and the standard deviation is 8. Using Chebyshev's theorem, find the percentage of students who scored between 59 and 91.

Solution: Let $\mu$ and $\sigma$ be the mean and the standard deviation, respectively, of the midterm scores. Then, from the given information,
$$\mu = 75 \text{ and } \sigma = 8.$$
To find the percentage of students who scored between 59 and 91, the first step is to determine $k$. Each of the two points, 59 and 91, is 16 units away from the mean.

The value of $k$ is obtained by dividing the distance between the mean and each point by the standard deviation. Thus,
$$k = 16/8 = 2$$
$$1 - \frac{1}{k^2} = 1 - \frac{1}{(2)^2} = \frac{3}{4} = 0.75$$
Hence, according to Chebyshev's theorem, at least 75% of the students scored between 59 and 91.

z-score

A z-score measures how many standard deviations an observation is above or below the mean:
$$z_i = \frac{x_i - \bar{x}}{s}$$
where
  $z_i$ = the z-score for item $i$,
  $\bar{x}$ = the sample mean,
  $s$ = the sample standard deviation.

Example 26. Different typing skills are required for secretaries depending on whether one is working in a law office, an accounting firm, or for a research mathematical group at a major university. In order to evaluate candidates for these positions, an employment agency administers three distinct standardized typing samples. A time penalty has been incorporated into the scoring of each sample based on the number of typing errors. The mean and standard deviation for each test, together with the score achieved by a recent applicant, are given as follows.

Sample       Applicant's score   Mean      Standard deviation
Law          141 sec             180 sec   30 sec
Accounting   7 min               10 min    2 min
Scientific   33 min              26 min    5 min

For what type of position does this applicant seem to be best suited?

Solution: First we compute the z-score for each sample.
  Law: $z = \frac{141 - 180}{30} = -1.3$
  Accounting: $z = \frac{7 - 10}{2} = -1.5$
  Scientific: $z = \frac{33 - 26}{5} = 1.4$

Since speed is of primary importance, we are looking for the z-score that represents the greatest number of standard deviations to the left of the mean, and in our case that is $-1.5$. Therefore, this applicant ranks higher among typists in accounting firms than among typists in the other two areas, and consequently should be placed with an accounting firm.
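The calculations of Examples 25 and 26 can be sketched as follows: the same standardization gives both the $k$ needed for Chebyshev's bound and the z-scores of the applicant. The function names and the dictionary layout are illustrative assumptions.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std_dev


def chebyshev_lower_bound(k):
    """Minimum proportion of items within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k ** 2


# Example 25: scores between 59 and 91, with mean 75 and standard deviation 8.
k = z_score(91, 75, 8)
bound = chebyshev_lower_bound(k)

# Example 26: (applicant's score, mean, standard deviation) for each test.
tests = {
    "Law": (141, 180, 30),
    "Accounting": (7, 10, 2),
    "Scientific": (33, 26, 5),
}
z_scores = {name: z_score(*values) for name, values in tests.items()}

# Lower times are better, so the best fit has the most negative z-score.
best_fit = min(z_scores, key=z_scores.get)

print(k, bound, z_scores, best_fit)
```

Chebyshev's bound of 75% holds for any data set, whereas the empirical rule's 95% for $k = 2$ applies only to bell-shaped distributions, so the two figures do not conflict.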

3.7 Use Scientific Calculator to find the mean and standard deviation

Example 27. Use the calculator to find the mean and standard deviation of the data set: 1, 2, 5, 6, 8, 9, 10, 12, 14, 18.

Step   Function Keys     Description
1      MODE 3            Change to the statistical mode
2      SHIFT AC          Clear all old data
3      1 M+ / RUN        Input the first data value
4      2 M+ / RUN        Input the second data value
5      ...               Continue to input the data
6      18 M+ / RUN       Input the final data value
7      SHIFT 1           $\mu$ or $\bar{x}$, population mean or sample mean
8      SHIFT 2           $\sigma$, population standard deviation
9      SHIFT 3           $s$, sample standard deviation
10     Kout 1            $\sum x^2$, sum of squares of all data
11     Kout 2            $\sum x$, sum of all data
12     Kout 3            $n$, population size or sample size

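  Kout 1 = $\sum x^2$ = 975
  Kout 2 = $\sum x$ = 85
  Kout 3 = $n$ = 10
  SHIFT 1 = $\bar{x}$ = 8.5
  SHIFT 2 = $x\sigma_n \approx 5.025$
  SHIFT 3 = $x\sigma_{n-1} \approx 5.297$

Example 28. Suppose that we want to find the mean and standard deviation of the following data:

[Table with columns Audit Time (Days), Frequency $f_i$, and Class Midpoint $m_i$: values not recoverable.]

Key in MODE 3 and SHIFT AC, then make one M+ / RUN entry for each of the five classes [the keyed values are not recoverable].

  Kout 1 = $\sum x^2$ = 7790
  Kout 2 = $\sum x$ = 380
  Kout 3 = $n$ = 20
  SHIFT 1 = $\bar{x}$ = 19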

  SHIFT 2 = $x\sigma_n \approx 5.339$
  SHIFT 3 = $x\sigma_{n-1} \approx 5.477$

Remarks: If your calculator has no MODE 3 function, then you may use the MODE 2 function instead. In that case, all the x values are stored as y values; e.g., $\sum x^2 = \sum y^2$ = Kout 4.
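The calculator outputs of Example 27 can be cross-checked in Python. The sketch below computes the same quantities (the three Kout sums, the mean, and both standard deviations) through the alternative formula and compares them with the standard library's `statistics` functions. The grouped totals of Example 28 are used only as summary sums, since the class table itself is not available here.

```python
import math
import statistics

# Example 27 data.
data = [1, 2, 5, 6, 8, 9, 10, 12, 14, 18]

n = len(data)                          # Kout 3
sum_x = sum(data)                      # Kout 2
sum_x2 = sum(v * v for v in data)      # Kout 1
mean = sum_x / n                       # SHIFT 1

# Alternative (shortcut) formula for the variance.
pop_var = (sum_x2 - n * mean ** 2) / n
sample_var = (sum_x2 - n * mean ** 2) / (n - 1)
pop_sd = math.sqrt(pop_var)            # SHIFT 2
sample_sd = math.sqrt(sample_var)      # SHIFT 3

# The definitional formulas give the same results.
assert math.isclose(pop_sd, statistics.pstdev(data))
assert math.isclose(sample_sd, statistics.stdev(data))

# Example 28 from its summary sums: sum f*m^2 = 7790, sum f*m = 380, n = 20.
g_n, g_sum, g_sum2 = 20, 380, 7790
g_mean = g_sum / g_n
g_sample_var = (g_sum2 - g_n * g_mean ** 2) / (g_n - 1)

print(sum_x2, sum_x, n, mean, round(pop_sd, 3), round(sample_sd, 3))
print(g_mean, g_sample_var, round(math.sqrt(g_sample_var), 3))
```

The shortcut formula needs only $n$, $\sum x$ and $\sum x^2$, which is why the calculator can report standard deviations from its three running totals.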
