Chapter 3
Section 3-2: Measures of Center

Notation
Suppose we are making a series of observations, n of them, to be exact. Then we write $x_1, x_2, x_3, \ldots, x_n$ as the values we observe. Thus n is the total number of data points, and $x_4$ (say) is the value of the fourth data point.

Example
Suppose we ask five people how many hours of television they watch in a week, and get the following data:

Observation    1    2    3    4    5
Data value     5    7    3    38   7

That is, $x_1 = 5$, $x_2 = 7$, $x_3 = 3$, $x_4 = 38$, $x_5 = 7$.
What is the center of these data, and what is the spread about that value?

Measures of Center and Spread
CENTER: Mean, Median, Mode, Midrange
SPREAD: Range, Inter-quartile range (IQR), Variance/Standard deviation

Mean
Sample mean $\bar{x}$ ("x-bar"): the average of a set of observations in a sample. If the n observations are $x_1, x_2, \ldots, x_n$, then
\[ \bar{x} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i \]
Population mean $\mu$ ("mu"): the average of a set of observations in the population. If the N observations are $x_1, x_2, \ldots, x_N$, then
\[ \mu = \frac{x_1 + x_2 + x_3 + \cdots + x_N}{N} = \frac{1}{N}\sum_{i=1}^{N} x_i \]
Example: Sample Mean
$x_1 = 5$, $x_2 = 7$, $x_3 = 3$, $x_4 = 38$, $x_5 = 7$
The mean is
\[ \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + x_3 + x_4 + x_5}{5} = \frac{5 + 7 + 3 + 38 + 7}{5} = 12 \]
The mean is the balance point of the distribution.

Median
The median M is the midpoint of the distribution (like the median strip in a road). It is the number such that half of the observations fall above it and half fall below it.

Example 1: Median
$x_1 = 5$, $x_2 = 7$, $x_3 = 3$, $x_4 = 38$, $x_5 = 7$
Step 1: order the data: 3 5 7 7 38
Median = 7 (the middle value)

Example 2: Median
If the data are $x_1 = 5$, $x_2 = 7$, $x_3 = 3$, $x_4 = 38$
Step 1: order the data: 3 5 7 38
Median = (5 + 7)/2 = 6 (n is even, so we average the two middle values)
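The two worked examples above can be checked with a few lines of Python. This is a minimal sketch, not part of the original slides; the function names `mean` and `median` and the variable `tv_hours` are chosen here for illustration.

```python
def mean(values):
    """Sample mean: the sum of the observations divided by n."""
    return sum(values) / len(values)


def median(values):
    """Middle value of the ordered data.

    When n is even, the two middle values are averaged.
    """
    ordered = sorted(values)          # Step 1: order the data
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


tv_hours = [5, 7, 3, 38, 7]           # the five observations x_1 .. x_5

print(mean(tv_hours))                 # 12.0
print(median(tv_hours))               # 7
print(median(tv_hours[:4]))           # 6.0 (first four observations only)
```

The median function sorts a copy of the data, so the original order of the observations is left untouched.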
Example
Data set A: 64 65 66 68 70 71 73     median is 68, mean is 68.1
Data set B: 64 65 66 68 70 71 730    median is still 68, mean is 162 (730 is an outlier)
The mean is very sensitive to outliers, while the median is resistant to outliers.

Comparing the mean and the median
Mean: describes the center as an average value, where the actual values of the data points play an important role. Sensitive to outliers.
Median: locates the middle value as the center, and the order of the data is the key to finding it. Not sensitive to outliers.

Skewness
Symmetric distributions with no outliers: Mean ≈ Median
Left-skewed distributions: Mean < Median
Right-skewed distributions: Median < Mean (in the pictured example the median is 68 and the mean lies above it)
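The effect of a single outlier on the mean and the median can be reproduced with Python's standard `statistics` module. The sketch below uses data sets A and B (64, 65, 66, 68, 70, 71 and then 73 or 730); the variable names are illustrative only.

```python
import statistics

data_a = [64, 65, 66, 68, 70, 71, 73]
data_b = [64, 65, 66, 68, 70, 71, 730]   # same data, last value replaced by an outlier

for name, data in (("A", data_a), ("B", data_b)):
    m = statistics.mean(data)
    med = statistics.median(data)
    # The mean moves from about 68.1 to 162; the median stays at 68.
    print(f"Data set {name}: mean = {m:.1f}, median = {med}")
```

Changing one value out of seven more than doubles the mean, while the median does not move at all.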
Which measure of center to use?
We will therefore use the mean as a measure of center for symmetric distributions with no outliers. Otherwise, the median will be a more appropriate measure of the center of our data.

Another measure of center: Mode
The mode is the most frequent value in the data set.
Data: 2, 4, 4, 4, 5, 5, 6, 7, 8, 10, 15       Mode = 4
Data: 2, 4, 4, 4, 5, 5, 5, 6, 7, 8, 10, 15    Mode = 4, 5
Data: 2, 4, 5, 6, 7, 8, 10, 15                No mode

And another measure of center: Midrange
The midrange is the value midway between the maximum and the minimum values in the data set.
\[ \text{midrange} = \frac{\max + \min}{2} \]

Summary: Measures of Center
The two main numerical measures for the center of a distribution are the mean $\bar{x}$ and the median M.
The mean is the average value, while the median is the middle value.
The mean is very sensitive to outliers, while the median is resistant to outliers.
The mean is an appropriate measure of center only for symmetric distributions with no outliers. In all other cases, the median should be used to describe the center of the distribution.
The mode is the most frequent value.
The midrange is the average of the max and the min values.

Measures of spread: Range, Interquartile range (IQR), Variance/Standard deviation
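The mode and midrange definitions can be sketched in Python as follows. The helper names `modes` and `midrange` are made up for this illustration, and returning an empty list for "no mode" is a convention chosen here to match the three mode examples (2, 4, 4, 4, 5, 5, 6, 7, 8, 10, 15 and its two variants); both helpers assume non-empty data.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s); an empty list means 'no mode'."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:                      # every value occurs once: no mode
        return []
    return sorted(v for v, c in counts.items() if c == top)


def midrange(values):
    """Value midway between the maximum and the minimum."""
    return (max(values) + min(values)) / 2


print(modes([2, 4, 4, 4, 5, 5, 6, 7, 8, 10, 15]))       # [4]
print(modes([2, 4, 4, 4, 5, 5, 5, 6, 7, 8, 10, 15]))    # [4, 5]
print(modes([2, 4, 5, 6, 7, 8, 10, 15]))                # []
print(midrange([2, 4, 4, 4, 5, 5, 6, 7, 8, 10, 15]))    # 8.5
```

Note that the standard library's `statistics.multimode` would return every value for the third data set, which is why a small custom helper is used here.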
Measures of Spread
Spread: how far from the center the data tend to range.
If all the data points are identical, there is no spread at all. Numerically, the spread is zero.
Ex.: when every data value is the same, the center is that common value and the spread is 0.

Range
Range = max. value − min. value
Example:
Data: 2, 4, 4, 4, 5, 5, 6, 7, 8, 10, 15
Range = max. − min. = 15 − 2 = 13

Inter-quartile range (IQR)
The IQR gives the range covered by the MIDDLE 50% of the data.

How to find the IQR?
Step 1: Arrange the data in increasing order.
Step 2: Find the median.
Step 3: Find the median of the lower 50% of the data. This is called the first quartile of the distribution and is denoted by Q1.
Step 4: Repeat this for the top 50% of the data: find the median of the top 50% of the data. This is called the third quartile of the distribution and is denoted by Q3.
The middle 50% of the data falls between Q1 and Q3, and therefore:
IQR = Q3 − Q1

IQR Example
Weights of 10 students: […], 118, 120, 136, 138, 149, 157, 157, 161, 180
M = (138 + 149)/2 = 143.5
Q1 = 120 and Q3 = 157 (the medians of the lower five and the upper five values)
IQR = 157 − 120 = 37

Note
The IQR should be used as a measure of spread of a distribution only when the median is used as a measure of center.
Center: Median. Spread: IQR.

Using the IQR to detect outliers
The 1.5(IQR) Criterion for Outliers
An observation is considered a suspected outlier if it is below Q1 − 1.5(IQR) or above Q3 + 1.5(IQR).
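Steps 1 to 4 for finding the IQR can be sketched in Python. This is an illustrative sketch, not the slides' own code: the function names are invented here, and it assumes the convention that the overall median is left out of both halves when n is odd. Other textbooks and software use slightly different quartile conventions and can give different values. The data are the 11 values 2, 4, 4, 4, 5, 5, 6, 7, 8, 10, 15.

```python
def median_of_sorted(ordered):
    """Median of data that is already in increasing order."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, M, Q3) using the 'median of each half' method."""
    ordered = sorted(values)                 # Step 1: increasing order
    n = len(ordered)
    m = median_of_sorted(ordered)            # Step 2: the median
    half = n // 2                            # assumption: middle value excluded when n is odd
    q1 = median_of_sorted(ordered[:half])    # Step 3: median of the lower 50%
    q3 = median_of_sorted(ordered[n - half:])  # Step 4: median of the top 50%
    return q1, m, q3


data = [2, 4, 4, 4, 5, 5, 6, 7, 8, 10, 15]
q1, m, q3 = quartiles(data)
iqr = q3 - q1
data_range = max(data) - min(data)

print(q1, m, q3)      # 4 5 8
print(iqr)            # 4
print(data_range)     # 13
```

The range (13) is driven entirely by the two extreme values, while the IQR (4) describes only the middle half of the data.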
Example 1
Weights of 10 students: […], 118, 120, 136, 138, 149, 157, 157, 161, […]
M = (138 + 149)/2 = 143.5
IQR = 157 − 120 = 37
Q3 + 1.5(IQR) = 157 + 1.5(37) = 212.5
Anything above 212.5? YES. The largest weight IS an outlier.

Example 2
Data: […], 8, 9, 12, 14, 19, 22, 23, 23, […], […]
M = 19, Q1 = 9, Q3 = 23
IQR = 23 − 9 = 14
1.5(IQR) = 1.5(14) = 21
Anything below Q1 − 1.5(IQR) = 9 − 21 = −12? YES! The smallest value is an outlier.
Anything above Q3 + 1.5(IQR) = 23 + 21 = 44? YES! The largest value is an outlier.

Five-number summary
To get a quick summary of both center and spread, we consider these five numbers:
Minimum value
Q1
Median
Q3
Maximum value

Boxplot
John Tukey invented another kind of display to show off the five-number summary. It's called a boxplot.

Example 1
Weights of 10 students: […], 118, 120, 136, 138, 149, 157, 157, 161, 180
[Figure: boxplot of the weights on an axis from 100 to 180, with Min., M = (138 + 149)/2 = 143.5, and Max. labeled]
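The 1.5(IQR) criterion and the five-number summary can be combined in one short Python sketch. The function names are hypothetical, and the quartile convention (overall median excluded from both halves when n is odd) is an assumption that matches the 11-value example above. The illustration data are 2, 4, 4, 4, 5, 5, 6, 7, 8, 10, 15.

```python
def median_of_sorted(ordered):
    """Median of data that is already in increasing order."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2                      # assumption: middle value excluded when n is odd
    q1 = median_of_sorted(ordered[:half])
    q3 = median_of_sorted(ordered[n - half:])
    return ordered[0], q1, median_of_sorted(ordered), q3, ordered[-1]


def suspected_outliers(values):
    """Values below Q1 - 1.5(IQR) or above Q3 + 1.5(IQR)."""
    _, q1, _, q3, _ = five_number_summary(values)
    step = 1.5 * (q3 - q1)
    low_fence, high_fence = q1 - step, q3 + step
    flagged = [v for v in values if v < low_fence or v > high_fence]
    return low_fence, high_fence, flagged


data = [2, 4, 4, 4, 5, 5, 6, 7, 8, 10, 15]
print(five_number_summary(data))       # (2, 4, 5, 8, 15)
print(suspected_outliers(data))        # (-2.0, 14.0, [15])
```

Here Q1 = 4 and Q3 = 8, so the fences are 4 − 6 = −2 and 8 + 6 = 14, and only the value 15 is flagged. These five numbers are exactly what a boxplot draws.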
Example 2 (outlier)
Weights of 10 students: […], 118, 120, 136, 138, 149, 157, 157, 161, […]
[Figure: boxplot of the weights with Min., M = (138 + 149)/2 = 143.5, and Max. labeled; the outlier is plotted separately and marked with *]

Comparing distributions
Boxplots are best used for side-by-side comparison of more than one distribution.

Variance and Standard Deviation
The standard deviation gives the average (or typical) distance between a data point and the mean, $\bar{x}$.
Variance:
\[ s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} \]
Standard deviation:
\[ s = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}} \]

Facts about the standard deviation (s)
s measures the spread about the mean and should be used only when the mean is chosen as the measure of center, that is, when the distribution of the data is roughly symmetric with no outliers.
Center: Mean. Spread: Standard deviation.
s is always greater than or equal to 0.
s = 0 only when there is no spread, i.e., the data values are identical.
s gets larger as the spread increases.
s has the same units of measurement as the original observations.
Like the mean, s is not resistant. It is very sensitive to outliers.

Standard Deviation
Sample s.d.: s
\[ s = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}} \]
Population s.d.: σ
\[ \sigma = \sqrt{\frac{(x_1 - \mu)^2 + (x_2 - \mu)^2 + \cdots + (x_N - \mu)^2}{N}} \]
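The variance and standard deviation formulas can be sketched directly in Python. The function names are illustrative, and the data are the five television-hours values 5, 7, 3, 38, 7 from the start of the chapter, reused here only as an example. Dropping the value 38 to show the effect of an outlier is an addition made for this sketch.

```python
import math


def sample_variance(values):
    """s^2: squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


def sample_sd(values):
    """s: square root of the sample variance."""
    return math.sqrt(sample_variance(values))


def population_sd(values):
    """sigma: like s, but divides by N and treats the data as the whole population."""
    n = len(values)
    mu = sum(values) / n
    return math.sqrt(sum((x - mu) ** 2 for x in values) / n)


tv_hours = [5, 7, 3, 38, 7]
print(sample_variance(tv_hours))              # 214.0
print(round(sample_sd(tv_hours), 2))          # 14.63
print(round(population_sd(tv_hours), 2))      # 13.08

# Without the outlier 38, s drops sharply: s is not resistant.
print(round(sample_sd([5, 7, 3, 7]), 2))      # 1.91
```

The squared deviations from the mean of 12 are 49, 25, 81, 676, and 25, so the single value 38 contributes 676 of the total 856.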
The 68-95-99.7 rule (Empirical Rule)
The Empirical Rule for Data with a Bell-Shaped Distribution: about 68% of the values fall within 1 standard deviation of the mean, about 95% within 2 standard deviations, and about 99.7% within 3 standard deviations.

Measures of Relative Standing
The Standard Normal Distribution is a Normal distribution with mean 0 and standard deviation 1. Notation: N(0, 1).
How to standardize? We need to change x to a z-score:
\[ z = \frac{x - \mu}{\sigma} \]

Ordinary/Unusual Values
A z-score measures the number of standard deviations that a data value x is from the mean µ.
When x is larger than the mean, z is positive.
When x is smaller than the mean, z is negative.
When x is equal to the mean, z is zero.
Ordinary values: −2 ≤ z-score ≤ 2
Unusual values: z-score < −2 or z-score > 2
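Standardizing and the ordinary/unusual cutoff can be sketched as follows. The population mean of 100 and standard deviation of 15 are hypothetical values picked for this illustration, not taken from the slides, and the function names are likewise invented.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean mu."""
    return (x - mu) / sigma


def is_unusual(z, cutoff=2.0):
    """Unusual values have a z-score below -2 or above 2."""
    return z < -cutoff or z > cutoff


mu, sigma = 100.0, 15.0            # hypothetical mean and standard deviation

for x in (85.0, 100.0, 135.0):
    z = z_score(x, mu, sigma)
    label = "unusual" if is_unusual(z) else "ordinary"
    print(f"x = {x}: z = {z:.2f} ({label})")
```

With these assumed parameters, 85 gives z = −1.00 (ordinary), 100 gives z = 0.00 (ordinary), and 135 gives z = 2.33 (unusual).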
Percentiles
Percentiles are measures of location. There are 99 percentiles, denoted $P_1, P_2, \ldots, P_{99}$, which divide a set of data into 100 groups with about 1% of the values in each group.
\[ \text{Percentile of value } x = \frac{\text{number of values less than } x}{\text{total number of values}} \cdot 100 \]

Summary
Measures of the center of distributions:
Mean
Median
Mode
Measures of spread of distributions:
Range
IQR
Using the IQR to detect outliers: the 1.5(IQR) rule
Boxplots
Variance/Standard deviation
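The percentile-of-a-value formula from this section can be sketched in a few lines of Python. The function name is hypothetical, and the data set (2, 4, 4, 4, 5, 5, 6, 7, 8, 10, 15) is reused from the mode and range examples purely for illustration.

```python
def percentile_of_value(values, x):
    """Percent of the data values that are strictly less than x."""
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)


data = [2, 4, 4, 4, 5, 5, 6, 7, 8, 10, 15]

# 8 of the 11 values are less than 8, so 8 sits at about the 73rd percentile.
print(round(percentile_of_value(data, 8), 1))    # 72.7
print(percentile_of_value(data, 2))              # 0.0 (no value is below the minimum)
```

Because the formula counts only values strictly less than x, the minimum of a data set always lands at the 0th percentile under this definition.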