Solutions for practice questions: Chapter 9, Statistics

If you find any errors, please let me know at mailto:msfrisbie@pfrisbie.com.

1. We know that µ is the mean of 30 values of y, $\sum_{i=1}^{30} y_i = 360$, and $\sum_{i=1}^{30} (y_i - \mu)^2 = 925$.

a) The mean can be calculated by adding up all the values and dividing by the number of values. Conveniently, we are given both the fact that there are 30 values (that's what numbering them from 1 to 30 does for you) and that the sum of the values is 360 (that's the first summation). Therefore $\mu = \frac{360}{30} = 12$.

b) The standard deviation is probably not a formula you can just recite in the way you can the mean. However, there is a way to get it without the raw data. Back on p. 294, you'll find two different formulas which work out to the very same thing. One is called the sample standard deviation, and the other is population standard deviation, but in the world of IB [1] they're computed identically: $\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}}$. Notice that the numerator of that fraction is the other given sum. So $\sigma = \sqrt{\frac{925}{30}} \approx 5.55$.

[1] But not in the world of AP Statistics, so if you live in the world of AP Stats, too, you need to know the difference. In normal-not-IB-stats, the sample standard deviation has n − 1 in the denominator.

2. So we know the mean of the data set, but not the number in the group that took 40 minutes to get to school in the morning. The mean is the sum of the times divided by the total number of students. I'll call the missing value m, for missing.

$\frac{10 \cdot 1 + 20 \cdot 2 + 30 \cdot 5 + 40 \cdot m + 50 \cdot 3}{1 + 2 + 5 + m + 3} = 34$

$\frac{350 + 40m}{11 + m} = 34$

$350 + 40m = 34(11 + m) = 374 + 34m$

$6m = 24$, and m = 4 students.

3. Fifty measurements. Great.

a) The question specifies that the first interval should start at 1.6, and that the intervals should be 0.5 units wide. (Shouldn't there be units on time? Whatever. [2]) Because I write my intervals both carefully and correctly, I know that I should use inequalities, like 1.6 ≤ x < 2.1, and so on. You know that, right?
I am using x to represent the time. [3]

Time Taken        Frequency
1.6 ≤ x < 2.1     2
2.1 ≤ x < 2.6     5
2.6 ≤ x < 3.1     5
3.1 ≤ x < 3.6     5
3.6 ≤ x < 4.1     14
4.1 ≤ x < 4.6     6
4.6 ≤ x < 5.1     6
5.1 ≤ x < 5.6     2
5.6 ≤ x < 6.1     3
6.1 ≤ x < 6.6     2

[2] But while I'm complaining, why are we starting at 1.6 and not 1.5? The graph would make more sense to me if we had.

[3] Not only did BoB not write the intervals correctly, his counts are off. These are correct.

Rather than drawing the histogram by hand, I am using technology (for what I am assuming are obvious reasons related to how you are reading this). You should be able to set this up on graph paper, with a regular, ordinary x-axis, and using a straightedge for the axes and the bars.

b) All but seven of the values are less than 5.1, so that would be a fraction of $\frac{43}{50}$ less than 5.1.

c) Based on the squares I can count, there are 19 values to the right of the 3.6 ≤ x < 4.1 class, and 17 to its left. So the median should be somewhere near the middle of that class. To be more precise than "somewhere near the middle": of the 14 pieces of data in the class, eight must be less than the median and 6 greater than the median. The median is between the eighth and ninth values. So we need a number which is $\frac{8.5}{14}$ of the way up the interval, from the bottom. The intervals are 0.5 wide, so this gives $3.6 + \frac{8.5}{14} \cdot 0.5 \approx 3.90$.

d) The standard deviation is certainly something you would be expected to find with a calculator, so there's no reason not to use it to find the mean, too. Notice that the numbers in the time column are the midpoints of each interval. I used the one-variable statistics command with one list and frequencies in the list I've titled freq. The mean is the 3.93 that's highlighted; the standard deviation we need is σ, not s. So σ ≈ 1.11.

e) A cumulative frequency graph starts with a cumulative frequency table.

Time Taken        Freq    Cum. Freq
1.6 ≤ x < 2.1     2       2
2.1 ≤ x < 2.6     5       7
2.6 ≤ x < 3.1     5       12
3.1 ≤ x < 3.6     5       17
3.6 ≤ x < 4.1     14      31
4.1 ≤ x < 4.6     6       37
4.6 ≤ x < 5.1     6       43
5.1 ≤ x < 5.6     2       45
5.6 ≤ x < 6.1     3       48
6.1 ≤ x < 6.6     2       50

Then we plot the right end of each interval with the cumulative frequency values as points, along with a 0 at the left end of the first interval, and connect with a curve.
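If you would rather check parts (c) and (d) of question 3 without a calculator's one-variable statistics screen, here is a minimal Python sketch. It assumes every value sits at its class midpoint (the same approximation the calculator list uses) and locates the median by linear interpolation inside its class. The function names are mine, not anything from the book.

```python
from math import sqrt

# Class lower boundaries and frequencies from the table in 3(a).
lower_bounds = [1.6, 2.1, 2.6, 3.1, 3.6, 4.1, 4.6, 5.1, 5.6, 6.1]
freqs = [2, 5, 5, 5, 14, 6, 6, 2, 3, 2]
WIDTH = 0.5


def grouped_mean_sd(lowers, freqs, width):
    """Mean and population standard deviation, treating each value as its class midpoint."""
    mids = [lo + width / 2 for lo in lowers]
    n = sum(freqs)
    mean = sum(m * f for m, f in zip(mids, freqs)) / n
    variance = sum(f * (m - mean) ** 2 for m, f in zip(mids, freqs)) / n
    return mean, sqrt(variance)


def interpolated_median(lowers, freqs, width):
    """Linear interpolation inside the median class, at position (n + 1) / 2."""
    n = sum(freqs)
    target = (n + 1) / 2  # 25.5 when n = 50: halfway between the 25th and 26th values
    below = 0
    for lo, f in zip(lowers, freqs):
        if below + f >= target:
            return lo + (target - below) / f * width
        below += f
    raise ValueError("empty frequency table")


mean, sd = grouped_mean_sd(lower_bounds, freqs, WIDTH)
median = interpolated_median(lower_bounds, freqs, WIDTH)
print(round(mean, 2), round(sd, 2), round(median, 2))
```

The same two functions should work on any of the grouped tables in this chapter, as long as the classes all have the same width.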

f) The five-number summary uses the minimum, maximum, median, and quartiles. The minimum is 1.6 and the maximum is 6.6. For the others, we use fractions of the total frequency of 50. The median will be at 25, the lower quartile at 12.5, and the upper quartile at 37.5. Draw across from the frequency and down to the time. Here's another picture. [4]

Judging from those vertical segments, I'd posit the first quartile to be 3.1, the median to be 3.8, and the third quartile to be 4.7.

[4] While in general, graphs are easier using a computer, this was a huge pain to make; I had to draw the curve in another program and import it as a graphic to GeoGebra, because the drawing tools in GeoGebra are just not meant for this. I should have just done it on graph paper and scanned it in.

4. The data are from cities chosen at random. Did you know that 74.3% of all statistics are made up on the spot?

a) This one sounds like purely opinion, but of mean, median, and mode, mean seems clearly to be the worst choice; the data are clearly skewed positively. Either the median or the mode seems fine to me, but because there are two modes, I'd probably go with median. Similarly, standard deviation and range would be affected a lot by the outlier(s) on the right side, so interquartile range seems like a better measure of spread.

b) The mean and standard deviation will come from a calculator. The number of parking spaces will be the values, the number of sites make up the frequencies. I get a mean of 783 spaces, and a standard deviation of 536 spaces (both to 3 s. f.). [5]

[5] BoB's answer for the mean just doesn't make sense. The graph clearly shows that the first bar is centered around 200, not 100. I stand by my answer, but if you can explain why I would be wrong, please do.

c) Hmmm. Another cumulative frequency graph. I think this time I'll look for another piece of software. In the meantime, here's the cumulative frequency table.

Parking Spaces       Freq     CF
150 ≤ x < 250        20       20
250 ≤ x < 350        30       50
350 ≤ x < 450        80       130
450 ≤ x < 550        80       210
550 ≤ x < 650        50       260
650 ≤ x < 750        30       290
750 ≤ x < 850        20       310
850 ≤ x < 950        40       350
950 ≤ x < 1050       30       380
1050 ≤ x < 1150      0 [6]    380
1150 ≤ x < 1250      0        380
1250 ≤ x < 1350      30       410
1350 ≤ x < 1450      0        410
1450 ≤ x < 1550      0        410
1550 ≤ x < 1650      0        410
1650 ≤ x < 1750      20       430
1750 ≤ x < 1850      0        430
1850 ≤ x < 1950      0        430
1950 ≤ x < 2050      10       440
2050 ≤ x < 2150      0        440
2150 ≤ x < 2250      10       450
2250 ≤ x < 2350      0        450
2350 ≤ x < 2450      0        450
2450 ≤ x < 2550      0        450
2550 ≤ x < 2650      0        450
2650 ≤ x < 2750      10       460

I went with Excel.

[6] Yes, I considered leaving out the rows with the zeros, but elected not to, as I thought it would raise more questions than it was worth.

d) The quartiles and median come from drawing on the graph. There are 460 sites, so the median should be at 230, and the quartiles at 115 and 345. I added some grid lines to make this easier to read. I get the median ≈ 580, Q1 ≈ 425, and Q3 = 925. And therefore the interquartile range is about 925 − 425 = 500.

e) Outliers are defined as being more than 1.5 times the interquartile range away from the nearer quartile. Q1 − 1.5(IQR) = −325, which is clearly not a possibility. Q3 + 1.5(IQR) = 1675. This value is barely in the 1650 ≤ x < 1750 bar. So the highest 30 values are outliers, as are some of the 20 in that interval. As 1675 is ¼ of the way through that interval, I'll take 15 of those, for a total of 45 outliers.

f) Well, as I previously said, it's skewed positively and is bimodal (two of the bars are the same height). It has about 45 outliers on the high end, and none on the low. And if I count this one as the third sentence, then I've now written "a few sentences" in the description.

5.
a) The Spanish wines include both the most expensive and the least expensive case.

b) Red wines are generally more expensive in France, where the top half of the prices are more expensive than the vast majority of the wines in the other two countries.

c) Spanish wines have the greatest range, but the least expensive median; if you are going to put a red wine on the table every evening, Spanish wines would be a good choice. French wines are the most expensive overall, with 75% of the cases above the median of prices for Italian and Spanish wines. The Spanish wines could be described as holding the middle ground.

6. I'm only on question 6. There are 26 of these. Does anyone read this stuff?

a) Using a calculator: µ ≈ 52.6 s, σ ≈ 7.60 s.

b) Still with the calculator: median = 51.34 s (exactly); IQR = Q3 − Q1 = 52.58 − 49.93 = 2.65 s.

c) Because one of the times was about 50 seconds greater than the next closer time, both the mean and the standard deviation were much larger than if that one value were left out. [7]

[7] This sounded familiar to me, so I looked it up. To read about Eric the Eel, try this link: http://www.swimmingworldmagazine.com/lane9/news/1807.asp

7. Interesting question: we have the summary statistics, and we're asked about features of the distribution.

a) If the distribution is symmetric, then the distance from each quartile to the corresponding extreme value should be approximately equal. The maximum minus the third quartile is 86 − 75.75 = 10.25, and the first quartile minus the minimum is 60.25 − 42 = 18.25. As the second is nearly twice the first, I would say that the distribution is not symmetric.

b) Outliers are more than 1.5 times the interquartile range away from the nearer of Q1 and Q3. The IQR is 75.75 − 60.25 = 15.5; 1.5 × 15.5 = 23.25. Subtracting, Q1 − 23.25 = 37; the minimum is not less than 37. Then Q3 + 23.25 = 99; the maximum is not greater than 99. There are no outliers.

c) I'm going to do the box plot on the TI-nspire. I will just enter the data as the five-number summary. I was initially surprised to discover that I had to enter each of the quartiles twice to get them to graph correctly, but after thinking about it a bit, this made sense. Try it yourself if you don't see why this was necessary.

d) The data is skewed somewhat to the left. The middle half of the data is fairly compact, but even so the ends are not so spread as for there to be outliers.

8. a) The median will happen at the 50th percentile. Since the y-axis is given as percentages, this is easy. The median is 225 mg/dl.

b) The quartiles are at 25% and 75%. I'll do those in blue and the 90th and 10th in red.

First quartile: 205
Third quartile: 253
10th percentile: 190
90th percentile: 298

No wonder these people are heart patients. Their cholesterol levels are crazy.

c) The IQR would be 253 − 205 = 48. Since this is the middle 50%, and there are 2000 people, it's 1000 patients. [8]

d) For the box plot, it's back to the TI-nspire. The minimum and maximum values are 100 and 400 mg/dl respectively. (The blue line appears to go all the way back to 100 on the horizontal axis. I had to look closely.)

e) The distribution is skewed somewhat to the right, with a relatively small interquartile range. The maximum and minimum are both outliers.

[8] BoB answered some other question?

9. a) I counted by fives. If you use something else, your answers may differ from mine. [9]

Speed           Frequency
25 ≤ x < 30     6
30 ≤ x < 35     20
35 ≤ x < 40     38
40 ≤ x < 45     20
45 ≤ x < 50     14
50 ≤ x < 55     2

b) Histogram from the calculator: [histogram]

[9] BoB chose different boundaries than I did, too, so our answers aren't going to be quite the same. And of course he set up the intervals badly. Speed is continuous data; it needs inequalities.
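The 1.5 × IQR rule from questions 7(b) and 8(e) is mechanical enough that a small Python sketch can check it. This is only my own illustration (the function names are invented), and it tests just the minimum and maximum against the fences, which is all those two questions need.

```python
def outlier_fences(q1, q3):
    """Return the (lower, upper) boundaries beyond which a value counts as an outlier."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def extremes_are_outliers(minimum, maximum, q1, q3):
    """Report whether the minimum and the maximum fall outside the fences."""
    low, high = outlier_fences(q1, q3)
    return minimum < low, maximum > high


# Question 7: Q1 = 60.25, Q3 = 75.75, minimum 42, maximum 86.
print(outlier_fences(60.25, 75.75))
print(extremes_are_outliers(42, 86, 60.25, 75.75))

# Question 8: Q1 = 205, Q3 = 253, minimum 100, maximum 400 (all in mg/dl).
print(outlier_fences(205, 253))
print(extremes_are_outliers(100, 400, 205, 253))
```

For question 7 both checks come back False (no outliers); for question 8 both come back True, which matches part (e).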

c) I am assuming that "showing all work" means that I am to do this with the frequency distribution. To show all work for 100 individual data points is just absurd. (In fact, showing all work in general on this is not my idea of useful activity. Feel free not to do so yourself.) The mean, at least, is easy. I need the sum of the values divided by the number of values.

$\bar{x} \approx \frac{27.5 \cdot 6 + 32.5 \cdot 20 + 37.5 \cdot 38 + 42.5 \cdot 20 + 47.5 \cdot 14 + 52.5 \cdot 2}{100} = 38.6$ km/h.

Note that this is really an approximation rather than an exact value (see the ≈?) because the midpoints of the intervals are approximating the true values. Standard deviation is far more annoying. I am modeling my table on the one on p. 300.

Midpt m    Freq f(m)    (m − x̄)²    (m − x̄)² × freq
27.5       6            123.21       739.26
32.5       20           37.21        744.2
37.5       38           1.21         45.98
42.5       20           15.21        304.2
47.5       14           79.21        1108.94
52.5       2            193.21       386.42

Sum of f(m): 100
Sum of squared differences times freq: 3329
Standard deviation (the square root of the second sum over the first): 5.769749

Ugh. So σ ≈ 5.77 km/h.

d) Cumulative frequency table:

Speed           Freq    C. F.
25 ≤ x < 30     6       6
30 ≤ x < 35     20      26
35 ≤ x < 40     38      64
40 ≤ x < 45     20      84
45 ≤ x < 50     14      98
50 ≤ x < 55     2       100

e) So the median and quartiles come from a graph. This is some serious overkill.

Median = 38
Q1 = 35
Q3 = 42.3
IQR = 42.3 − 35 = 7.3

f) 1.5 × IQR = 10.95. 35 − 10.95 = 24.05 for the lower boundary for outliers; 42.3 + 10.95 = 53.25 for the upper boundary. So apparently there is one outlier (54) on the high side, and none on the low. My calculator made you a box plot (using the entire data set, because that's how hard I intend to work here).

10. a) I'll be doing these on the calculator, as any sane person would.

mean = 1846.74 L
median = 1898.5 L
standard deviation ≈ 231.5 L
Q1 = 1714 L
Q3 = 2024 L
IQR = 2024 − 1714 = 310 L [10]

b) Q1 − 1.5 IQR = 1249 and Q3 + 1.5 IQR = 2489. The maximum value is less than the upper boundary, so there are no outliers on the right. However, the minimum value is less than the lower boundary; as it turns out, there are two values that qualify as outliers on the low end, 1123 and 1239.

c) From the Nspire: [box plot]

d) µ − σ ≈ 1615 L and µ + σ ≈ 2078 L. So consumption levels between 1615 and 2078 L are within one standard deviation of the mean.

e) Germany's 2758 L would be a significant outlier on the high end. It will not significantly affect the median, quartiles, or IQR, but the mean and standard deviation will both be larger.

[10] BoB's answers are a little different from mine. I get the impression (from his median and quartile values) that he may have made a frequency distribution and used that; I can't tell for sure.

11. a) Ninety students had a total of 4460 minutes of travel time, so on average, they spent $\frac{4460}{90} \approx 49.6$ minutes each.

b) With the new times, the total is now 4460 + 35 + 39 + 28 + 32 = 4594, and there are now 94 students; the new mean is $\frac{4594}{94} \approx 48.9$ min.

12. a) I'm going to make the table vertical so it fits in my two-column format better. The "1-10" notation that the book uses here is actually not terrible, because the scores are discrete. They do make the mid-interval values more ugly, though, with decimals. I'm going to go with regular intervals, because I know that's best. Notice that the intervals I use do include the discrete values in the original table, with the "equals" on the right side.

Marks            Freq    C. F.
0 < x ≤ 10       30      30
10 < x ≤ 20      100     130
20 < x ≤ 30      200     330
30 < x ≤ 40      340     670
40 < x ≤ 50      520     1190
50 < x ≤ 60      440     1630
60 < x ≤ 70      180     1810
70 < x ≤ 80      90      1900
80 < x ≤ 90      60      1960
90 < x ≤ 100     40      2000

b) The scale that is requested seems too big to put here. What I'm going to do is make it on graph paper at the required size, then scan it and shrink it to fit in this document.

c) To answer these, I have drawn on the graph from (b), at right. Look there to see where the numbers came from.

i) Median is at the middle student, number 1000. [11] The value is 47 marks.

ii) About 455 students have to do the retake.

iii) Fifteen percent of 2000 is 300; the top 300 students are above 1700. It looks like a mark of 64 or higher gets honors (aka "distinction"). [12]

[11] Yeah, I know, it really ought to be between student 1000 and 1001. Can you tell the difference on this scale? Neither can anyone else. Just use 1000.

[12] When you draw a graph like this, there will always be some variation between one person's result and another's. In a markscheme, the reader will use your graph to check your numbers.

13. Weighted averages! The 72 men have a mean height of 1.79 m, so their total height is 72 × 1.79 = 128.88 m; for the women, the total is 28 × 1.62 = 45.36 m. So if you stacked them all on top of each other, you'd have 128.88 + 45.36 = 174.24 m of mathematicians. As there are 100 of them, they average 1.7424 ≈ 1.74 m tall each.

14. a) The mean is the sum of all the values divided by the number of values. The sum of the values is given as 300, and we can tell from the subscripts that there are 25 values. So $m = \frac{300}{25} = 12$.

b) To get the standard deviation from this information, you need the formula $\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$ (which you will find on p. 293). That formula (with lower-case n rather than N, I think) used to be in the formula booklet. However, in the newest syllabus, its use is no longer tested, so you wouldn't have to look it up or memorize it for the IB exam.

Note that this doesn't mean this question is bad, just that someone should provide you with the formula before asking you to do it. Since m is the mean, 625 is the numerator of that fraction. The value of n (or N) is still 25. So $\sigma = \sqrt{\frac{625}{25}} = \sqrt{25} = 5$.

15. a) The "appropriate approximation" they're referring to is the mid-interval value. That would be 15, 45, and so on. The mean is 97.2 seconds.

b) Oh, I feel more graph paper coming.

Wait time         Freq    C.F.
0 ≤ x < 30        5       5
30 ≤ x < 60       15      20
60 ≤ x < 90       33      53
90 ≤ x < 120      21      74
120 ≤ x < 150     11      85
150 ≤ x < 180     7       92
180 ≤ x < 210     5       97
210 ≤ x < 240     3       100

c) I just counted how much more graph paper I'm going to need. Ugh. As it turns out, in recent years the exams have shown a trend away from making you construct graphs like these and toward having you read them. That makes sense; drawing them takes a lot of time, and they can test more material if they don't make you do all of that stuff. But I have told myself I'm going to work all of these problems, so here's another graph.

d) The median is at customer number 50; the quartiles are at 25 and 75. Nice numbers.

Median ≈ 88 s
Lower quartile ≈ 66 s
Upper quartile ≈ 122 s

16. a) i) Looks like 10 plants.

ii) 14 + 10 = 24 plants.

b) Again, the mid-interval values are used for the computations.

µ = 63 cm, σ ≈ 20.5 cm.

c) The graph shows that the distribution is skewed to the left; the left tail is longer than the right. When a distribution is not symmetric, the median and mean are different.

d) There are 80 plants (it was given at the top of the question). So for the median, we look at 40 plants. According to the table, that puts it greater than 60 cm and less than 70. In fact, 40 is exactly the mean of 32 and 48, which means the median can be estimated as the mean of 60 and 70 cm, or as 65 cm.

17. a) Once more with the one-variable statistics. σ ≈ 7.41 g. Notice that the units are grams, just like the weights.

b) I made it vertical.

Weight (W)         Freq    C.F.
80 < W ≤ 85        5       5
85 < W ≤ 90        10      15
90 < W ≤ 95        15      30
95 < W ≤ 100       26      56
100 < W ≤ 105      13      69
105 < W ≤ 110      7       76
110 < W ≤ 115      4       80

c) Hey, I don't have to draw the graph this time!

i) The median is at 40 packets; median ≈ 97 g.

ii) The upper quartile is at 60 packets; upper quartile ≈ 101 g.

d) At first this question may seem impossible to answer. If you don't know any of the individual weights [13], how can you know this? But watch the algebraic magic:

$(W_1 - \bar{W}) + (W_2 - \bar{W}) + \cdots + (W_{79} - \bar{W}) + (W_{80} - \bar{W}) = (W_1 + W_2 + \cdots + W_{80}) - \underbrace{(\bar{W} + \cdots + \bar{W})}_{80 \text{ terms}} = \sum_{i=1}^{80} W_i - 80\bar{W}$

Notice that the mean is calculated by adding the values of $W_i$ and dividing by 80. Making that substitution, we have

$\sum_{i=1}^{80} W_i - 80 \cdot \frac{1}{80} \sum_{i=1}^{80} W_i = \sum_{i=1}^{80} W_i - \sum_{i=1}^{80} W_i = 0.$

Brilliant.

[13] I know, grams mean mass. This is how the question was written on the exam, too.

e) This part of the question assumes you've already studied probability. We haven't gotten there yet, but it's not that hard. We know that W is between 85 and 110. That's all the packets except the 5 in the lightest group and the 4 in the heaviest, which is 71 packets. Of those, there are 13 + 7 = 20 over 100 g. So the probability is $\frac{20}{71}$.

18. At first, the top and bottom rows of the table seem to be different widths, but the frequencies there are 0, so they don't introduce any complications.

a) Using the calculator: x̄ ≈ 98.2 km/h.

b) i) a = 95 + 70 = 165 and b = 236 + 39 = 275.

ii) Back to graph paper. See below.

c) i) As seen in red on the graph, 200 cars travel at speeds of 105 km/h or less, so 100 travel at greater than that speed. That's one-third of the total, or 33.3% (to three significant figures).

ii) Fifteen percent of 300 is 45 cars. Forty-five from the top would be at the cumulative frequency of 255. As you can see in purple on the cumulative frequency curve, this corresponds to about 114 km/h.
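For question 17(d) and (e), here is a small Python sketch you can use to convince yourself of both results. One loud assumption: the individual weights are unknown, so I stand in the class midpoint for every packet. The identity in (d) holds for any list of numbers, so the stand-in values only make the check concrete. Exact fractions are used so that the sum comes out as exactly 0 rather than a tiny floating-point remainder.

```python
from fractions import Fraction

# Frequency table from 17(b): (upper class boundary in grams, frequency).
table = [(85, 5), (90, 10), (95, 15), (100, 26), (105, 13), (110, 7), (115, 4)]

# (d) Stand-in weights (an assumption): each packet sits at its class midpoint,
# which is the upper boundary minus 2.5 g.
weights = [Fraction(2 * upper - 5, 2) for upper, freq in table for _ in range(freq)]
mean = sum(weights) / len(weights)
deviation_sum = sum(w - mean for w in weights)

# (e) Packets known to satisfy 85 < W <= 110, and those among them over 100 g.
in_range = sum(freq for upper, freq in table if 85 < upper <= 110)
over_100 = sum(freq for upper, freq in table if 100 < upper <= 110)
probability = Fraction(over_100, in_range)

print(deviation_sum, in_range, over_100, probability)
```

Swap in any other 80 numbers for `weights` and `deviation_sum` should still be 0, which is the whole point of part (d).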

19. a) Here's the graph with markings.

i) The median, in red, is at 100 cabs, and is $25.

ii) The fare is $35 or less for 156 cabs, as seen in purple.

b) The 40% of the cabs that travel the shortest distances also have the lowest fares. Forty percent of 200 is 80 cabs. As seen in green, this gives fares of $22 or less. Since the cabs charge 55¢ per km, that gives a distance of $22 ÷ $0.55 per km = 40 km.

c) Ninety km at $0.55/km = $49.50. On the graph in blue, $49.50 corresponds to a cumulative frequency of 185 cabs. This means that 15 cabs travel more than this distance. And 15 out of 200 is 7.5%.

20. If the median of the three numbers is 11, then b = 11. A mean of 9 gives (a + 11 + c)/3 = 9, so a + c = 16. A range of 10 means c − a = 10. Adding the two equations gives 2c = 26, so c = 13 and a = 3.
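The algebra in question 20 follows the same steps for any three numbers, so here is a short Python sketch of it. The function name and argument names are my own; it assumes the three values are labelled a ≤ b ≤ c, as in the question.

```python
def three_numbers(median, mean, data_range):
    """Recover (a, b, c), with a <= b <= c, from the median, mean and range."""
    b = median                     # the middle value is the median
    a_plus_c = 3 * mean - b        # from (a + b + c) / 3 = mean
    c = (a_plus_c + data_range) / 2  # add a + c and c - a, then halve
    a = a_plus_c - c
    return a, b, c


print(three_numbers(median=11, mean=9, data_range=10))
```

With the values from the question this gives a = 3, b = 11, c = 13, the same as the working above.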

21. a) I think this is the last one I have to put on graph paper. Since no houses are sold for less than $0, that will be the lower limit of the first interval.

b) The quartiles happen at 25 and 75 houses, which gives Q1 = $138,000 and Q3 = $245,000; the interquartile range is therefore $245,000 − $138,000 = $107,000. [14]

c) a = 94 − 87 = 7 and b = 100 − 94 = 6.

d) Once again, to the calculator. In this case, a mean of 199 means $199,000.

e) i) $350,000 is marked on the graph in purple. There appear to be 90 houses selling at less than $350K, so there are 10 houses that can be classified as DeLuxe.

ii) This is once again a probability question a little past where we are in the study of the syllabus, so no worries if you don't know how to do it yet. Of the 10 DeLuxe houses, we know from the table that 6 sold for more than $400,000. So the chance that the first one selected was that expensive is 6/10. For the second one, there are only 9 DeLuxes left, and only 5 selling for over $400K. So now the likelihood is 5/9. Multiplying those gives the answer: 6/10 × 5/9 = 1/3.

[14] If you bother to check, you'll probably see that BoB's and my answers differ somewhat, and so do the shapes of our curves. I'm drawing mine by hand; he's clearly not. That makes a difference. His technology is making the sections between the dots straight; in fact, there's no reason to believe that they will be, and curved is probably more realistic, although without the raw data, it's impossible to know if either shape really reflects the data accurately. Don't worry about it. I'm not.
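The multiplication in 21(e)(ii) is a "two draws without replacement" calculation. As a rough Python sketch (the function name is mine), using exact fractions so the answer comes out as 1/3 rather than 0.333...:

```python
from fractions import Fraction


def both_from_group(group_size, total):
    """Probability that two items drawn without replacement both come from the group."""
    first = Fraction(group_size, total)           # 6 of the 10 DeLuxe houses
    second = Fraction(group_size - 1, total - 1)  # then 5 of the remaining 9
    return first * second


print(both_from_group(6, 10))
```

The second factor drops both the numerator and the denominator by one, which is exactly the "only 9 left, only 5 over $400K" step in the working.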

22. a) More drawing on a graph. The median is at 40 shells (notice that even though the y-axis goes up to 90, we are told that there are 80 shells, and the highest point on the curve has a y-value of 80), and the upper quartile is at 60. I would estimate the median as 20 mm and the upper quartile as 24 mm.

b) IQR = 24 − 14 = 10 mm.

23. a) To find the values in the table, I'll have to find the cumulative frequencies at 40, 60, and 80, and then subtract the ones that preceded them. For instance, the number in the 40-to-60 range is the number less than 60 minus the number less than 40.

76 − 22 = 54
142 − 76 = 66
180 − 142 = 38

Those are the numbers in the table. I choose not to reproduce the whole thing.

b) Forty percent [15] of 200 is 80 students. Looks like that's a 42.

[15] FORTY PERCENT FAIL! Wow.
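The subtraction in 23(a) is the reverse of building a cumulative frequency table: each class frequency is one cumulative value minus the one before it. A minimal Python sketch, using only the four cumulative frequencies quoted above (the function name is my own):

```python
def class_frequencies(cumulative):
    """Turn consecutive cumulative frequencies back into class frequencies."""
    return [later - earlier for earlier, later in zip(cumulative, cumulative[1:])]


# Cumulative frequencies read from the graph in 23(a).
print(class_frequencies([22, 76, 142, 180]))
```

Running a cumulative sum over the result (starting from 22) should give back the original list, which is a quick way to catch a misread point on the curve.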

24. Yep, drawing on another curve. Some of these problems are starting to feel like overkill.

a) Forty marks matches with 100 candidates who scored that many marks or fewer, shown in red.

b) The middle 50% is exactly the interquartile range. Therefore a must be Q1 and b is Q3. With 800 students, a will occur at a cumulative frequency of 200, and b at 600. Those are in blue. a = 55; b = 75.

25. So a, b, c, and d are in order from smallest to largest, with c = d. If the mode is 11, then c = d = 11. If the range is 8, then d − a = 11 − a = 8, and a = 3. Finally, if the mean is 8, then (3 + b + 11 + 11)/4 = 8. That gives 25 + b = 32, and b = 7.

26. And a final graph.

a) The median height will be at the 60th player; it is about 183 cm.

b) The first quartile, at the 30th player, is 175 cm. The third quartile, at the 90th, is 189 cm. Therefore the interquartile range is 189 − 175 = 14 cm.
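Nearly every cumulative frequency question in this chapter starts the same way: work out which cumulative frequencies to read across from. Here is a tiny Python sketch of that first step, following the convention used in these solutions of reading at n/4, n/2 and 3n/4. The function name is mine, and the total of 120 players for question 26 is inferred from the median being at the 60th player, not stated above.

```python
def quartile_positions(n):
    """Cumulative frequencies at which to read Q1, the median and Q3 from the curve."""
    return n / 4, n / 2, 3 * n / 4


print(quartile_positions(800))  # question 24: a is read at 200, b at 600
print(quartile_positions(120))  # question 26, assuming 120 players
```

After that, the rest is drawing across and down on the graph, which no amount of code will do for you.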