Part 4: Measures of Spread: IQR and Deviation

In an earlier part we learned how the three measures of center offer different ways of providing us with a single representative value for a data set. However, consider the following situation:

Example 1: Below are the exam scores for two exams:
Exam X: 68, 64, 7, 66, 7, 76, 7, 74, 7
Exam Y: 5, 4, 7, 4, 7,, 7,, 9
For exam X, calculate the mean, median and mode.
For exam Y, calculate the mean, median and mode.

From Example 1 we see that we need some way of summarizing a data set other than finding its center. To that end, the range of a data set is the quantity (greatest value) - (least value).

Example 2: Calculate the range for each data set in Example 1.
For exam X, calculate the range and the midrange.
For exam Y, calculate the range and the midrange.

Note that the range only considers two values and tells us essentially nothing about the spread of the intermediate values. We could use the midrange as previously defined, but this would not tell us anything about how the data is distributed, since the midrange is greatly affected by only one data value. Moreover, the range is directly and drastically affected by what are called outliers (informally, a small number of values which differ greatly from the majority of values). So, although the range is a simple calculation of spread to perform, for these reasons it is often not a practical measure.

One way to maintain much of the simplicity of the range yet avoid the effects of outliers is to calculate the inter-quartile range or IQR, which is defined as $Q_3 - Q_1$. Note that this gives the range of the middle 50% of the data.

Example 3: Calculate the IQR for each data set in Example 1.
For exam X, calculate $Q_3 - Q_1 = \mathrm{IQR}$.
For exam Y, calculate $Q_3 - Q_1 = \mathrm{IQR}$.
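The range, midrange and IQR described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the scores are hypothetical (they are not the exam data from Example 1), and the quartiles are taken as the medians of the lower and upper halves of the sorted data, which may differ from the quartile convention used earlier in the course.

```python
from statistics import median


def data_range(values):
    """Greatest value minus least value."""
    return max(values) - min(values)


def midrange(values):
    """Average of the greatest and least values."""
    return (max(values) + min(values)) / 2


def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Assumption: when the number of values is odd, the overall median is
    left out of both halves. Needs at least two values.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[-half:]
    return median(lower), median(upper)


def iqr(values):
    """Inter-quartile range Q3 - Q1."""
    q1, q3 = quartiles(values)
    return q3 - q1


# Hypothetical scores with one unusually high value.
scores = [55, 60, 62, 65, 70, 72, 75, 78, 99]

print(data_range(scores))  # 44
print(midrange(scores))    # 77.0
print(iqr(scores))         # 15.5
```

Here the single score of 99 widens the range and pulls the midrange upward, while the IQR depends only on the middle half of the data.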
The IQR also allows us to give a formal definition of an outlier: any value which is more than (1.5) IQR away from the nearest quartile. Using this definition, box-and-whisker diagrams are usually drawn somewhat differently for samples that contain outliers. In this case, the whiskers extend to the furthest values which are not outliers, and outliers are then indicated with a dot. (Sometimes, so-called extreme outliers, values which are more than (3) IQR away from the nearest quartile, are indicated with an asterisk.) Because outliers can cause major difficulties in making statistical inferences, we will pay special attention to them later in the course.

Example 4: Redraw the box-and-whisker diagram from Example 6 in Part, this time accounting for outliers.
5, 6,,, 6, 7,,,,,, 4, 5, 8, 8, 4, 4, 66

Although the IQR is unaffected by outliers, it still ignores many values the way the range does. For the remaining measures of spread, our approach is as follows: Suppose the values $x_1, x_2, \ldots, x_n$ have mean $\bar{X}$. How good of a representative value is $\bar{X}$? That is, how much do the other values differ from it, on average?

Note: Literally, we might try calculating the quantity $\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{X})$. However, this quantity will always come out to zero, as you can check. Instead, we might try distances rather than differences:

Suppose the values $x_1, x_2, \ldots, x_n$ have mean $\bar{X}$. We define the mean deviation as $\frac{1}{n}\sum_{i=1}^{n} |x_i - \bar{X}|$.

Example 5: Calculate the mean deviation for each set in Example 1. Complete a table with one row per score and the columns: Set X, $|x - \bar{X}|$, Set Y, $|y - \bar{Y}|$.

A (perhaps non-obvious) drawback of the mean deviation is that absolute values don't work very well with the tools of calculus, which are very necessary for much of statistical analysis. But another way to make the differences $(x - \bar{X})$ non-negative is to square them. Since this has the side effects of yielding a quantity that is both large and in square units, we take the square root after finding the mean.
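As a rough sketch of the two ideas on this page, the code below flags outliers with the 1.5 IQR rule and computes the mean deviation. The data set and the quartile values passed in are hypothetical, and the function names are chosen only for this illustration. The quartiles are supplied by the caller so that whichever quartile convention the course uses can be plugged in.

```python
def find_outliers(values, q1, q3, multiplier=1.5):
    """Values more than multiplier * IQR beyond the nearest quartile."""
    spread = q3 - q1
    low_fence = q1 - multiplier * spread
    high_fence = q3 + multiplier * spread
    return [v for v in values if v < low_fence or v > high_fence]


def mean_of_differences(values):
    """Average of (x - mean); zero up to floating-point rounding."""
    mean = sum(values) / len(values)
    return sum(v - mean for v in values) / len(values)


def mean_deviation(values):
    """Average of the distances |x - mean|."""
    mean = sum(values) / len(values)
    return sum(abs(v - mean) for v in values) / len(values)


# Hypothetical data; Q1 and Q3 are assumed to be the medians of the
# lower and upper halves of the sorted values.
data = [2, 4, 4, 5, 6, 7, 8, 20]
q1, q3 = 4, 7.5

print(find_outliers(data, q1, q3))                # [20]
print(find_outliers(data, q1, q3, multiplier=3))  # [20]
print(mean_of_differences(data))                  # 0.0
print(mean_deviation(data))                       # 3.5
```

With these numbers the value 20 lies beyond both the 1.5 IQR fence (12.75) and the 3 IQR fence (18), so it would count as an extreme outlier.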
And finally, it will now matter whether we're dealing with a sample or a population.

For a given sample of size n, say a sample $x_1, x_2, \ldots, x_n$ with sample mean $\bar{X}$, the sample standard deviation is
$s = \sqrt{\dfrac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n-1}}$.
The sample variance is
$s^2 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n-1}$.

Given a population $x_1, x_2, \ldots, x_N$ with population mean $\mu$, the population standard deviation is
$\sigma = \sqrt{\dfrac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$.
The population variance is
$\sigma^2 = \dfrac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$.

A natural question regarding the definition of s is: Why $n - 1$? Recall that the purpose of a statistic is to estimate its corresponding parameter. So, the definition is chosen freely (and reasonably) for $\sigma$, but once this is done, we must choose a definition for s so as to make it a good estimator for $\sigma$. Although it seems strange, this necessitates the term $n - 1$ in its denominator and not n.

Note also that s is not equal to the mean deviation. However, these values will be reasonably close to each other, and this gives us a way to determine approximately what the values of s should be for a given data set.

Example 6: Calculate the sample standard deviation and sample variance for each set in Example 1. Complete a table with one row per score and the columns: Set X, $(x - \bar{X})^2$, Set Y, $(y - \bar{Y})^2$.

So, which measure of spread should we use? The answer is analogous to deciding between measures of center. s is more useful mathematically, so we'll usually use it. However, if there are outliers they can strongly affect s, and so the IQR should probably be used instead. As before, some statisticians advocate this even for skewed distributions.
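To see the difference between the two denominators, here is a minimal sketch that computes both standard deviations directly from the definitions and compares them with Python's standard `statistics` module. The data set is hypothetical and was picked only because its mean is a whole number.

```python
import math
import statistics


def sample_std(values):
    """Sample standard deviation: squared deviations divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    total = sum((v - mean) ** 2 for v in values)
    return math.sqrt(total / (n - 1))


def population_std(values):
    """Population standard deviation: squared deviations divided by N."""
    n = len(values)
    mean = sum(values) / n
    total = sum((v - mean) ** 2 for v in values)
    return math.sqrt(total / n)


# Hypothetical data with mean 5 and squared deviations summing to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]

print(population_std(data))     # 2.0
print(sample_std(data))         # about 2.14, the square root of 32/7
print(statistics.pstdev(data))  # library value for the population formula
print(statistics.stdev(data))   # library value for the sample formula
```

For this data the mean deviation is 1.5, so s (about 2.14) is in the same neighborhood but not equal to it.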
As with the mean, we'll consider a few properties of the standard deviation. Suppose $x_1, x_2, \ldots, x_n$ have sample standard deviation s.
(1) For any constant k, the values $x_1 + k, x_2 + k, \ldots, x_n + k$ have sample standard deviation s.
(2) For any constant k, the values $kx_1, kx_2, \ldots, kx_n$ have sample standard deviation $|k|\,s$.
(Note: The same properties hold true for the population standard deviation $\sigma$.)

Example 7: Suppose the Exam scores for this class have a standard deviation of points. Find the resulting standard deviation and variance if I
(a) ... give everyone an additional 5 points.
(b) ... multiply everyone's score by /.

Using some algebra, we can rearrange the formula for s:
$s = \sqrt{\dfrac{n\sum x_i^2 - \left(\sum x_i\right)^2}{n(n-1)}}$.
The trade-off here is that we don't have to calculate $\bar{X}$ or the values $(x_i - \bar{X})$, but the values $x_i^2$ are often much larger than the values $(x_i - \bar{X})^2$.

As we did in Part, for a frequency distribution

outcome: $x_1, x_2, \ldots, x_k$
frequency: $f_1, f_2, \ldots, f_k$

where $n = \sum_{i=1}^{k} f_i$ is the total number of observations, we can deduce
$s = \sqrt{\dfrac{\sum f_i (x_i - \bar{X})^2}{n-1}} = \sqrt{\dfrac{n\sum f_i x_i^2 - \left(\sum f_i x_i\right)^2}{n(n-1)}}$.

Recall, for grouped data, we approximate the values in each class using the class mark (midpoint).
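The properties and the rearranged formulas above can be checked numerically. The sketch below is only a check on hypothetical numbers, not a proof. It compares the definitional formula with the shortcut formula and the frequency-distribution version, and then confirms that adding a constant leaves s unchanged while multiplying by k scales it by |k|.

```python
import math


def sample_std(values):
    """Definition: sqrt( sum (x - mean)^2 / (n - 1) )."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))


def sample_std_shortcut(values):
    """Rearranged form: sqrt( (n*sum x^2 - (sum x)^2) / (n(n - 1)) )."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(v * v for v in values)
    return math.sqrt((n * sum_x2 - sum_x ** 2) / (n * (n - 1)))


def sample_std_from_frequencies(outcomes, frequencies):
    """Same shortcut for a frequency distribution.

    For grouped data, pass the class marks (midpoints) as the outcomes.
    """
    n = sum(frequencies)
    sum_fx = sum(f * x for x, f in zip(outcomes, frequencies))
    sum_fx2 = sum(f * x * x for x, f in zip(outcomes, frequencies))
    return math.sqrt((n * sum_fx2 - sum_fx ** 2) / (n * (n - 1)))


# Hypothetical data and the matching frequency distribution.
data = [2, 4, 4, 4, 5, 5, 7, 9]
outcomes = [2, 4, 5, 7, 9]
frequencies = [1, 3, 2, 1, 1]

s = sample_std(data)
print(s, sample_std_shortcut(data))
print(sample_std_from_frequencies(outcomes, frequencies))

# Property (1): shifting every value by a constant leaves s unchanged.
print(sample_std([v + 10 for v in data]))
# Property (2): multiplying every value by k gives |k| * s.
print(sample_std([-3 * v for v in data]), 3 * s)
```

All three formulas agree for this data (about 2.14), the shifted data has the same s, and the data multiplied by -3 has three times the original s.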
Example 8: Below is a frequency distribution for a survey of U.S. households determining the number of people per household. Use this to compute the sample standard deviation for the number of people per household.

Number of people    Number of households
7 9 4 5 5 4 6 7

Example 9: The scores of a quiz are recorded in the distribution below. Use this to estimate the sample standard deviation for the quiz scores.

Score       Frequency
[8, 10)
[10, 12)
[12, 14)    8
[14, 16)
[16, 18)    7
[18, 20)    4

What is important to understand about the standard deviation or the average deviation? We should understand that the standard deviation is a measure of how tightly packed the data is about the mean. In other words, the smaller the standard deviation (see cvar), the closer the data is to the mean, or average. The data below will help us visually grasp how this works. There are 5 data sets listed below, each with a mean of 50. As the data values begin to congregate about the mean, notice that the standard deviation becomes smaller. You can see the related histograms displayed on the next page. I have used the average deviation here. As an exercise, you can verify using the standard deviation.
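The claim that tightly packed data has a smaller deviation can be tried out on a small scale. The two data sets below are hypothetical stand-ins (they are not the five sets used in these notes); both have a mean of 50, and the sketch prints the average deviation and the sample standard deviation for each.

```python
import statistics


def average_deviation(values):
    """Mean of the distances |x - mean|."""
    mean = sum(values) / len(values)
    return sum(abs(v - mean) for v in values) / len(values)


# Two hypothetical data sets, both with mean 50.
spread_out = [10, 40, 50, 60, 90]
packed = [45, 48, 50, 52, 55]

for name, values in [("spread out", spread_out), ("packed", packed)]:
    print(
        name,
        "mean:", statistics.mean(values),
        "average deviation:", average_deviation(values),
        "s:", round(statistics.stdev(values), 2),
    )
```

The spread-out set has an average deviation of 20, while the packed set has an average deviation of about 2.8; the sample standard deviations (roughly 29.15 and 3.81) tell the same story.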
[Table: data values for Sets 1 to 5, each listed with its mean and average deviation (Dev).]
[Figure: histograms of Sets 1 to 5, each labeled with its average deviation (Dev).]

Notice what happens to the graph of the distribution as the mean, median and mode become equal and the standard deviation becomes small: the curve takes on a roughly normal shape. Remember, in the example above, we used the average deviation for s. In practice, we will use the standard deviation for s.

When using the standard deviation as a measure of how tightly packed the data is about its mean, it also helps to compute the coefficient of variance (more commonly called the coefficient of variation). The coefficient of variance is the ratio of the standard deviation to the mean. We abbreviate the term as cvar:

cvar $= \dfrac{s}{\bar{X}}$
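As a small illustration of why the ratio matters, the sketch below computes cvar for two hypothetical data sets that share the same standard deviation but have very different means. The function name `cvar` simply mirrors the abbreviation used in these notes.

```python
import statistics


def cvar(values):
    """Ratio of the sample standard deviation to the mean."""
    return statistics.stdev(values) / statistics.mean(values)


# Hypothetical data: both sets have s = 2, but the means differ.
small_mean = [8, 10, 12]
large_mean = [98, 100, 102]

print(statistics.stdev(small_mean), cvar(small_mean))
print(statistics.stdev(large_mean), cvar(large_mean))
```

The first set has a cvar of about 0.2 and the second about 0.02, even though s is 2 in both cases.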
If we use only the value of s to measure variation in the data we can be misled. For example, if the mean salary of a company is $5K with the st. dev. $K, then cvar is 0.4. But, if the mean salary is $5K and the st. dev. is $K, then cvar is 0.04. In the first case, the st. dev. is 40% of the mean, very large in comparison; whereas in the second case, the st. dev. is 4% of the mean, very small in comparison. I think you would be much happier with your salary deviating 4% rather than 40%. So, we want to know both the standard deviation and the coefficient of variance. Usually, cvar will be fairly intuitive, as in the comparison above.

Example 10: Now look at our comparison above using Set 1 to Set 5, and compute the cvar for each sample set using the true standard deviation s and not the average deviation.
Set 1: s = ____ ; cvar = ____
Set 2: s = ____ ; cvar = ____
Set 3: s = ____ ; cvar = ____
Set 4: s = ____ ; cvar = ____
Set 5: s = ____ ; cvar = ____

Finally, as one last note on standard deviations, we'll mention what is known as Chebyshev's Rule. While we know that the IQR tells us the range of the middle 50% of a data set, something similar can be said about standard deviations. If a distribution is symmetric and approximately normal, as in our Set 5 above, then
1) approximately 68% of the data lies within one standard deviation of the mean, namely, using interval notation, 68% of the data lies within $(\bar{X} - s, \bar{X} + s)$ or $(\mu - \sigma, \mu + \sigma)$;
2) approximately 95% of the data lies within two standard deviations of the mean, namely, in interval notation, 95% of the data lies within $(\bar{X} - 2s, \bar{X} + 2s)$ or $(\mu - 2\sigma, \mu + 2\sigma)$;
3) approximately 99% of the data lies within three standard deviations of the mean, i.e. within $(\bar{X} - 3s, \bar{X} + 3s)$ or $(\mu - 3\sigma, \mu + 3\sigma)$.

For a distribution in general that might not be normal or symmetric, Chebyshev's Rule states that the percent of the population that is within K standard deviations of the mean is at least $1 - \dfrac{1}{K^2}$.
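Chebyshev's Rule can be compared against actual data with a short sketch. The data set below is hypothetical and deliberately skewed, the population formulas are used for the mean and standard deviation, and the values K = 1.5 and K = 2.5 are arbitrary choices for the illustration.

```python
import statistics


def chebyshev_lower_bound(k):
    """Guaranteed minimum fraction within k standard deviations.

    The bound 1 - 1/k^2 is only informative for k > 1.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return 1 - 1 / k ** 2


def fraction_within(values, k):
    """Observed fraction of values strictly within k standard deviations."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    inside = [v for v in values if abs(v - mean) < k * sd]
    return len(inside) / len(values)


# Hypothetical skewed data (mean 5, population variance 30).
data = [1, 1, 2, 2, 3, 3, 4, 5, 9, 20]

for k in (1.5, 2.5):
    print(k, chebyshev_lower_bound(k), fraction_within(data, k))
```

In both cases 90% of this data lies within K standard deviations of the mean, which is above the guaranteed minimum (about 56% for K = 1.5 and 84% for K = 2.5). The rule gives a floor that holds for any distribution, so the observed fraction is often well above it.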
According to Chebyshev's Rule then, what would be the percentages for parts 1) to 3)?
1) approximately ____% of the data lies within one standard deviation of the mean, namely, using interval notation, ____% of the data lies within $(\bar{X} - s, \bar{X} + s)$ or $(\mu - \sigma, \mu + \sigma)$;
2) approximately ____% of the data lies within two standard deviations of the mean, namely, in interval notation, ____% of the data lies within $(\bar{X} - 2s, \bar{X} + 2s)$ or $(\mu - 2\sigma, \mu + 2\sigma)$;
3) approximately ____% of the data lies within three standard deviations of the mean, i.e. within $(\bar{X} - 3s, \bar{X} + 3s)$ or $(\mu - 3\sigma, \mu + 3\sigma)$.

We have talked about what it means to be normal. A population or sample is normally distributed if it is symmetric, is bell-shaped, and adheres to the percentages of 68, 95 and 99 as stated above. You may ask, "What do I normally have for breakfast?" Well, I usually eat fruit for breakfast. What does usually mean? I eat fruit 6 out of 7 days a week? Wa Ch and Allen were usually late to class last semester. How often is usually? In statistics, we define an event as usual if it falls within 2 standard deviations of the mean. That means that, for a normally distributed population, 95% of the time an event will occur that is considered usual and 5% of the time an event will occur that is considered unusual.

Example 11: Suppose that from a large sample the mean age of a female at the time of divorce from her partner is 37 with a standard deviation of 8 years. Is a woman who is divorced at age 6 considered unusual? The question here is as follows: Is 6 within 2 standard deviations of the mean? Well, we then create the interval $(37 - 2 \cdot 8, 37 + 2 \cdot 8) = (21, 53)$. Since 6 is not within the interval from 21 to 53, we would consider 6 to be an unusual age for a woman to divorce.

Example 12: Suppose exam scores are normally distributed with a mean of and a standard deviation of 5 points. If I wanted 84% of the class to pass the exam, what would be the lowest passing score? To answer this question, let us start by drawing a general normal curve, labelling the mean and the regions according to their percentages.
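The "usual" test and the 84% question can both be sketched in Python. The mean of 70 and standard deviation of 10 below are hypothetical values chosen for illustration; they are not the numbers from the examples above. The 84% cutoff uses the 68% rule: 50% of the scores lie above the mean and about 34% lie within one standard deviation below it, so the cutoff sits about one standard deviation below the mean. `NormalDist` from the standard library is used only to compare that rule of thumb with the exact normal curve.

```python
from statistics import NormalDist


def usual_interval(mean, std_dev, k=2):
    """Interval of values within k standard deviations of the mean."""
    return (mean - k * std_dev, mean + k * std_dev)


def is_usual(value, mean, std_dev, k=2):
    """True if value lies strictly inside the usual interval."""
    low, high = usual_interval(mean, std_dev, k)
    return low < value < high


# Hypothetical exam with mean 70 and standard deviation 10.
mean, sd = 70, 10

print(usual_interval(mean, sd))  # (50, 90)
print(is_usual(95, mean, sd))    # False
print(is_usual(62, mean, sd))    # True

# Lowest passing score if about 84% of the class should pass.
cutoff_by_rule = mean - sd                            # 60, from 50% + 34%
cutoff_exact = NormalDist(mean, sd).inv_cdf(1 - 0.84)
print(cutoff_by_rule, round(cutoff_exact, 2))
```

The exact cutoff differs from the rule-of-thumb value of 60 by less than a tenth of a point, which is why the 68-95-99 percentages are good enough for this kind of question.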