STAB22 section 1.3 and Chapter 1 exercises

Go up and down two times the standard deviation from the mean. So 95% of scores will be between 572 - (2)(51) = 470 and 572 + (2)(51) = 674.

Same idea as the previous exercise, but go up and down 3 times the SD: from 572 - (3)(51) = 419 to 572 + (3)(51) = 725. In a normal distribution, going up and down 3 times the SD includes almost all the data; this range has length 3 + 3 = 6 times the SD (which is the origin of the name "six sigma" in statistical process control).

z = (510 - 572)/51 = -1.22. A z-score says how your given value, here 510, compares to the mean. Here, the z-score is negative because the value is below the mean (it doesn't matter that the mean is positive). If you want another example, consider a city where the mean temperature in January is -10, with an SD of 6. A temperature of -5 has a z-score of z = ((-5) - (-10))/6 = 0.83 (positive because it is above the mean), and a temperature of -19 has a z-score of z = ((-19) - (-10))/6 = -1.5, negative because -19 is below the mean (and not just because -19 itself is negative).

Figure out z and look it up in Table A. z = (620 - 572)/51 = 0.94; a proportion 0.8264 of the normal curve is less than this. To find how much is greater, do the same calculation but subtract the answer from 1: 1 - 0.8264 = 0.1736. (The table always gives you "less than".)

Turn your two values (620 and 660) into z-scores, look them both up in the table, and subtract. As we found in the previous question, 620 has a z-score of 0.94, which corresponds to 0.8264 in the table. 660 has z = (660 - 572)/51 = 1.73, which goes with proportion 0.9582. The proportion between 620 and 660 is 0.9582 - 0.8264 = 0.1318. Another way to see this is: the proportion of students scoring less than 620 is 0.8264; the proportion scoring more than 660 is 1 - 0.9582 = 0.0418; everyone else scores between, so the proportion is 1 - 0.8264 - 0.0418 = 0.1318. This way is perhaps easier to understand, but the first way is easier to do.

If 25% of students are going to score bigger than this score (whatever it is), 75% will score less than this score.
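
The calculations in this group of exercises (go up and down k SDs, standardize, then read a proportion "less than") can be cross-checked in software. The following is only a sketch, not part of the original solutions: it assumes Python's standard-library `statistics.NormalDist`, with the mean 572 and SD 51 used above, and the helper names `z_score` and `band` are made up for illustration.

```python
from statistics import NormalDist

# Mean 572 and SD 51 are the values used in the exercises above.
scores = NormalDist(mu=572, sigma=51)


def z_score(x, dist):
    """Standardize x: how many SDs it lies above (+) or below (-) the mean."""
    return (x - dist.mean) / dist.stdev


def band(dist, k):
    """Interval reaching k standard deviations either side of the mean."""
    return (dist.mean - k * dist.stdev, dist.mean + k * dist.stdev)


band_95 = band(scores, 2)    # (470.0, 674.0)
band_997 = band(scores, 3)   # (419.0, 725.0)

z_510 = z_score(510, scores)  # negative: 510 is below the mean
z_620 = z_score(620, scores)
z_660 = z_score(660, scores)

# cdf() plays the role of Table A: it always gives the proportion "less than".
below_620 = scores.cdf(620)
above_620 = 1 - below_620
between_620_660 = scores.cdf(660) - scores.cdf(620)

print(band_95, band_997)
print(round(z_510, 2), round(z_620, 2), round(z_660, 2))
print(round(below_620, 4), round(above_620, 4), round(between_620_660, 4))
```

Table A works from z rounded to two decimals, so the table answers can differ from the software answers in the third or fourth decimal place.
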
To find that score, look up 0.75 in the body of Table A; it's between 0.7486 and 0.7517, so z is between 0.67 and 0.68, slightly closer to 0.67. Then unstandardize this z value using the formula on page 64 (just above the question for 1.107) and the given mean and SD to get a score of 572 + (0.67)(51) = 606. (If you are going to be in the top 25%, you'll need a score a bit bigger than the mean.)

It's a density curve, so the area under it has to be 1. The area of the shape shown is its width times its height; for the area to be 1, the height must be 1 as well. For (b), draw the (vertical) line x = 0.35 on the picture; the piece on the left is what you want. The width is 0.35 and the height is 1, so the area is (0.35)(1) = 0.35. (c) is the same idea: draw the area that has x between 0.35 and 0.65; the width is 0.65 - 0.35 = 0.3 and the height is 1, so the area is 0.3. You might guess that the proportion of the uniform distribution between a and b is b - a, and you would be correct. Use a = 0 or b = 1 if you don't have a lower or upper limit, as in (b).

This density curve is also a rectangle, so the area, width times height, has to be 1. Since the width is 4, the height must be 1/4. In (b), the width is 1 - 0 = 1, so the proportion is (1)(1/4) = 1/4. (If you draw a picture, the area you want is obviously a quarter of the rectangle.) For (c), the width is 2, so the proportion is (2)(1/4) = 1/2.

For the median and quartiles, you want the values that cut off area 0.5, 0.25 and 0.75. These are 0.5 (median), 0.25 (Q1) and 0.75 (Q3). For the mean, note that the density curve has a symmetric shape, so the mean and median must be equal: the mean is also 0.5.

For a skewed density curve, like a skewed histogram, the mean is pulled farthest into the tail, and the median lies between the peak and the mean. Thus for (a), C is the mean and B the median. For (c), A is the mean and B the median. For a symmetric density curve with one peak, the mean and median are at that peak, so in (b), mean and median are both A.

It's easiest to draw the bell curve first and then put the numbers below it. The mean is at the peak, and the shoulders of the curve are one standard deviation above and below the mean (that is, where the curve stops curving downwards and starts curving outwards).
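
The rectangle areas in the two uniform-density exercises above all come from the same rule: width times height, with the height fixed so that the total area is 1. A minimal sketch of that rule follows. The function name `uniform_proportion` and its parameters are hypothetical, and the second rectangle is assumed to run from 0 to 4 (the solution only states that its width is 4).

```python
def uniform_proportion(lo, hi, start=0.0, end=1.0):
    """Area under a uniform density on [start, end] between lo and hi."""
    if end <= start:
        raise ValueError("end must be greater than start")
    height = 1.0 / (end - start)       # total area must equal 1
    lo = max(lo, start)                # clip the request to the rectangle
    hi = min(hi, end)
    width = max(hi - lo, 0.0)
    return width * height


# Uniform on 0 to 1: the proportion between a and b is just b - a.
left_of_035 = uniform_proportion(0.0, 0.35)
between_035_065 = uniform_proportion(0.35, 0.65)

# Rectangle of width 4 (assumed here to start at 0): the height is 1/4.
first_quarter = uniform_proportion(0.0, 1.0, start=0.0, end=4.0)

print(left_of_035, between_035_065, first_quarter)
```

Clipping `lo` and `hi` to the ends of the rectangle mirrors the advice to use a = 0 or b = 1 when a lower or upper limit is not given.
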
My rough sketch of the bell curve is in Figure 1.

For (a), go up and down 2 standard deviations from the mean, that is from 266 - (2)(16) = 234 to 266 + (2)(16) = 298. Because the normal distribution is symmetric, these values cut off half of 5%, that is, 2.5%, on each end. So the shortest 2.5% of pregnancies last 234 days or less and the longest 2.5% last 298 days or more.

For (a), go up and down 3 times the SD: between 336 - (3)(3) = 327 and 336 + (3)(3) = 345 days. (The SD for horses is less than for humans, so we can make more precise statements about pregnancy lengths for horses as compared to humans.) (b) is tricky: the middle 68% are between 333 and 339 days (1 SD), so of the other 32%, half (16%) are below 333 days and half (16%) are above 339 days.

[Figure 1: Normal density curve]

It's easiest to get the data into software first (on the disk, look for the data acidrain). Calling for the Descriptive Statistics gets you the mean and the SD; the SD is 0.5379. The normal probability plot was fairly straight, indicating that a normal distribution is a good fit to these data. So the 68-95-99.7 rule should be fairly accurate. The 68% limits are the mean minus 0.5379 and the mean plus 0.5379. Then go to the data in the Minitab worksheet and count how many of the values are within these two limits. This is made easier by the fact that the data values are sorted in order. The values in rows 18 to 88 inclusive are between these values; there are 88 - 18 + 1 = 71 of them, which is 71/105 × 100% = 67.6%. This is very close to 68%. For the others, go up and down 2 (and 3) times the SD from the mean, and count how many data values fall between those limits. The 95% limits are the mean minus (2)(0.5379) and the mean plus (2)(0.5379). The values in rows 2 to 101 are between these limits; these are 100/105 = 95.2% of the total, again very close to 95%. The 99.7% limits are the mean minus (3)(0.5379) and the mean plus (3)(0.5379). All 105 values, 100%, fall between these limits. This is again very close to 99.7%. A reminder that the rules don't work exactly for actual data, but if the data are close to normal, the rules will be close to correct.

The qualifying score goes with z = 2 (it is 2 SDs, each of 15, above the mean). Thus those more than 2 SDs above the mean would qualify, which should be the same proportion as those further than 2 SDs below the mean. So, 2.5%, or more accurately, 0.0228 from Table A.

Calculate z values to compare the results fairly, using the mean and SD for the test that was taken. Tonya's z is (1820 - 1509)/321 = 0.97, while Jermaine's is (29 - 21.5)/5.4 = 1.39. Both scored above average, but Jermaine's score is higher in standardized units. Sometimes you will see these scores expressed as percentile ranks, which is the percentage of all people who would score less than the given score. You can get these by looking up the z's in Table A: Tonya is at the 83rd percentile (table gives 0.8340) while Jermaine is at the 92nd percentile (table gives 0.9177). It is perhaps clearer this way that Jermaine's performance is better. See also the percentile exercises below.

Jacob scored z = (16 - 21.5)/5.4 = -1.02, and Emily's z, her score minus 1509 and then divided by 321, is lower still. Both scored below average, but Jacob did better relative to the mean on his test. (Jacob is at the 15th percentile and Emily is at the 6th, using the same kind of calculation as in 1.132.)

Jose's score standardizes to z = (2080 - 1509)/321 = 1.78. To find out the equivalent ACT score, x, figure out how you would standardize x and put that equal to 1.78: (x - 21.5)/5.4 = 1.78. Then solve for x to get 31.1, or 31 rounded off (since ACT scores are given as whole numbers).
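
Moving a score from one test's scale to the other is a standardize step followed by an unstandardize step. The sketch below illustrates that idea and is not the solution's own method: the helper names are made up, the SAT and ACT means and SDs are the ones used in these exercises, and the starting point is the z of 1.78 quoted above.

```python
SAT_MEAN, SAT_SD = 1509, 321
ACT_MEAN, ACT_SD = 21.5, 5.4


def standardize(x, mean, sd):
    """z = (x - mean) / sd"""
    return (x - mean) / sd


def unstandardize(z, mean, sd):
    """x = mean + sd * z, the inverse of standardize()."""
    return mean + sd * z


# An SAT z of 1.78 expressed on the ACT scale.
act_equivalent = unstandardize(1.78, ACT_MEAN, ACT_SD)

# Trial and error over whole-number ACT scores: which z lands closest to 1.78?
candidates = {x: standardize(x, ACT_MEAN, ACT_SD) for x in (30, 31, 32)}
best = min(candidates, key=lambda x: abs(candidates[x] - 1.78))

print(round(act_equivalent, 1), best)
```

Both routes agree on 31 as the whole-number ACT score.
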
Or, if you prefer, do some trial and error to see what ACT score standardizes to 1.78: 31 is a little too low, 32 is too high by more, so 31 is best.

Same idea: Maria's z is z = (30 - 21.5)/5.4 = 1.57, and the equivalent SAT score, say x, has to standardize to the same thing: (x - 1509)/321 = 1.57, so x = 1509 + (321)(1.57) = 2013, which would presumably be rounded off to 2010. Or (in the previous two exercises) you can use the rule for unstandardizing z-values given at the top of page 68: x = µ + σz, where µ is the mean and σ the SD. Thus you'd get x = 21.5 + (5.4)(1.78) = 31.1 and x = 1509 + (321)(1.57) = 2013.

This is the same Tonya as in 1.132, with z = 0.97. Look in Table A to find that the proportion less than this is 0.8340, so the percentile is 83 (rounded off).

This is the Jacob of 1.133, with z = -1.02. The proportion less than this is 0.1539, so Jacob is at the 15th percentile (rounded off).

First, ask yourself what z-value cuts off the top 10% of the standard normal distribution. This same value cuts off the bottom 90%, and thus (Table A) is about z = 1.28. Then unstandardize according to the mean and SD of SAT scores: those SAT scores above 1509 + (321)(1.28) = 1920 make up the top 10%.

First find the quartiles of the standard normal distribution. These are z = -0.67 and z = 0.67 (look up 0.25 and 0.75 in the body of Table A). Then unstandardize them according to the mean and SD of ACT scores: Q1 = 21.5 + (5.4)(-0.67) = 17.9 and Q3 = 21.5 + (5.4)(0.67) = 25.1. (These don't necessarily need to be rounded off, since a quartile of whole numbers doesn't itself have to be a whole number.)

Same idea as the previous exercise, though it looks a bit more scary. Find the quintiles of the standard normal distribution by looking up 0.2, 0.4, 0.6 and 0.8 in the body of Table A: this gives z = -0.84, -0.25, 0.25, 0.84. (Note the symmetry: 0.2 below is the mirror image of 0.8 below, which is 0.2 above.) Then unstandardize onto the SAT scale: 1509 + (321)(-0.84) = 1239, 1509 + (321)(-0.25) = 1429, 1509 + (321)(0.25) = 1589, 1509 + (321)(0.84) = 1779. (You can keep one decimal place here, by the same argument as in the previous exercise.) Note that the quintiles are equally spaced on the proportion scale but not the score scale: 1239 and 1429 are farther apart than 1429 and 1589, for instance.

(a) Standardize 40 (using the correct mean and SD), and look it up in Table A. z = (40 - 55)/15.5 = -0.97, so a proportion 0.1660 of young women have cholesterol level lower than 40. (b) This implies "60 or higher": since HDL can take any value, it has no chance of being exactly 60.
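
The "backwards" lookups in the exercises above (top 10%, quartiles, quintiles) all follow one recipe: find the z that cuts off the required proportion below it, then unstandardize with x = µ + σz. Here is a hedged sketch of that recipe. It uses `NormalDist.inv_cdf` from Python's standard library in place of reading the body of Table A; the helper name `cutoffs` is made up for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist()   # mean 0, SD 1


def cutoffs(proportions_below, mean, sd):
    """Scores with the given proportions of the distribution below them."""
    return [mean + sd * standard_normal.inv_cdf(p) for p in proportions_below]


# Top 10% of SAT scores: the same cutoff has 90% below it.
z_top10 = standard_normal.inv_cdf(0.90)
sat_top10 = cutoffs([0.90], 1509, 321)[0]

# Quartiles of ACT scores and quintiles of SAT scores.
act_q1, act_q3 = cutoffs([0.25, 0.75], 21.5, 5.4)
sat_quintiles = cutoffs([0.2, 0.4, 0.6, 0.8], 1509, 321)

print(round(z_top10, 2), round(sat_top10))
print(round(act_q1, 1), round(act_q3, 1))
print([round(q) for q in sat_quintiles])
```

Because the software does not round z to two decimals, some of these results differ by a point or so from the Table A answers. The unequal spacing of the quintiles on the score scale shows up either way.
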
For (b), turn 60 into a z-score first: z = (60 - 55)/15.5 = 0.32. Table A gives proportion 0.6255 for this, which is the proportion less. So the proportion of women with HDL 60 or higher is 1 - 0.6255 = 0.3745, or 37%. (c) Everyone between is neither low (part (a)) nor high (part (b)), so the answer is 1 - 0.1660 - 0.3745 = 0.4595. If you didn't recognize that you could use parts (a) and (b), you can churn through the whole thing: turn 60 into a z-score (0.32), turn 40 into a z-score (-0.97), look them both up in Table A (0.6255 and 0.1660) and subtract (0.6255 - 0.1660 = 0.4595).

Same kind of calculations as the previous exercise, but now using the different means and SDs. For 40, z = (40 - 46)/13.6 = -0.44, proportion less is 0.3300. For 60, z = (60 - 46)/13.6 = 1.03, proportion less is 0.8485, proportion more is 1 - 0.8485 = 0.1515. Proportion between 40 and 60 is 0.8485 - 0.3300 = 0.5185. Compared to the women of 1.142, the men have a lower HDL on average (with a similar spread), so more of them have low HDL and fewer of them have high HDL.

(a) z = (240 - 266)/16 = -1.63, so 0.0516, or about 5%. (b) For 270 days, z = (270 - 266)/16 = 0.25, so proportion 0.5987 are shorter than this. Proportion between is 0.5987 - 0.0516 = 0.5471. (c) The z for 0.2 longer is that for 0.8 shorter, which is z = 0.84 (look for 0.8 in the body of Table A). Unstandardize to get 266 + (16)(0.84) = 279.4 days.

As found in 1.140: z = -0.67 and z = 0.67. For (b), unstandardize these to get Q1 = µ - 0.67σ and Q3 = µ + 0.67σ. (Never mind that these are formulas; the unstandardizing works the same way.) For (c), put in the values 266 for µ and 16 for σ to get Q1 = 266 - (0.67)(16) = 255.3 and Q3 = 266 + (0.67)(16) = 276.7.

(a) is the difference of the two numbers found in 1.146: 0.67 - (-0.67) = 1.34. The easiest way to do (b) is to look back at 1.146(b) and take the difference between Q3 and Q1 (using the formulas), which is 1.34σ. The µs cancel out, which makes sense: the IQR, as a measure of spread, depends only on the standard deviation, which is another measure of spread. This way, the answer has to be c = 1.34, no matter what µ and σ are: such is the power of mathematics.

This one is clearly not normal, because the plot is not close to a straight line.
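
Going back to the quartile exercises just above: the claim that the IQR of a normal distribution is the same multiple of σ whatever µ and σ are can be checked numerically. This is only an illustrative sketch using Python's standard-library `NormalDist`; the function name `iqr_over_sigma` is made up, and 266 and 16 are the pregnancy-length mean and SD from the exercise.

```python
from statistics import NormalDist


def iqr_over_sigma(mu, sigma):
    """(Q3 - Q1) / sigma for a normal distribution with the given mean and SD."""
    dist = NormalDist(mu, sigma)
    q1 = dist.inv_cdf(0.25)
    q3 = dist.inv_cdf(0.75)
    return (q3 - q1) / sigma


# Very different means and SDs give the same multiple of sigma.
ratios = [iqr_over_sigma(mu, sigma)
          for mu, sigma in [(0, 1), (266, 16), (1509, 321), (21.5, 5.4)]]

print([round(r, 3) for r in ratios])
```

The software value is about 1.349 rather than 1.34, because Table A rounds the quartile z-value to 0.67.
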
If you think about how the plot fails to be a straight line: there are too many values with emissions close to 0 (the plot is too flat on the left compared to the middle of the plot), and too many values with high emissions (because the plot is too steep on the right compared to the middle). Thus the distribution is skewed to the right. (Or you can memorize which shape of curve goes with which kind of skew, but you would do better to be able to work it out from scratch.) The three countries at the top right look like outliers: if you were to make a straight line through the middle of the data, the line would definitely go below these three points.

Bar graphs for population and for open space would show only that the cities vary in both population and in how much open space they have. To investigate further, type the data into a Minitab worksheet and create a fourth column (using Calculator) with the rate of open space per thousand residents. To make a bar chart, select Graph and Bar Chart. Select "Bars represent values from a table". Select the Simple option. Select OK. In the next dialog, select your rates into the Graph Variables box, and select City as your categorical variable. Select Bar Chart Options, and order the bars by Increasing Y (for (e)). My bar chart is shown in Figure 2. I prefer a chart with the heights of the bars ordered, because it makes comparison easier. Here, you see that Miami and Chicago have a noticeably small amount of open space per inhabitant, and Washington DC and Minneapolis have a noticeably large amount, with the other cities being very similar. New York, however, has a lot of its open space concentrated in Central Park, so you might imagine that the rest of the city doesn't have much.

[Figure 2: Bar chart of open space per resident by city]

The statement given is true on average, but a statement that "women score higher than men" hides the fact that there is a lot of variability; the majority of both men and women score between 500 and 650, but it's hard to be more precise. The men's scores have a lower mean and a larger SD than the women's, which means that the lowest scores are overwhelmingly likely to be men. Surprisingly (because the men's scores have higher variability), the extremely high scores might be either men or women.
(Notice that the density curves for scores 700 and higher are about the same height.) If you want to do some calculations, find the proportion of women who will score less than 450 and compare it with the proportion of men; then do the same for the proportions scoring more than 700. (This is the usual thing: figure out z's and look them up in the table.) The results justify the statement that the lowest scores are mostly men and the highest could be either men or women. The scores in the middle are all mixed up.

Two things you can do here: compare the mean and the median, and see whether the median is closer to Q1 or Q3 (or about the same distance from both). Here, the mean is quite a bit bigger than the median, and the median is closer to Q1 and further away from Q3. These both suggest that the distribution is right-skewed; there are a few women with large weights compared to the others.

(b) "How many hours" is quantitative, so a histogram or stemplot (or boxplot). For the second part of (b), you have to figure out how you're going to measure this. One way is to have set times during the semester, such as week 1, before midterm, after midterm, week 10 and before final, and get the number of study hours in those weeks for your students. Then you could do side-by-side boxplots to compare the study times (since "time during the semester" is categorical). (c) Favourite radio station is a categorical variable, so a bar chart (or maybe a pie chart, if you want to be able to say that "33% of students prefer to listen to radio station XXXX, which is the highest percentage of any radio station"). (d) To assess whether a normal distribution applies to a collection of measurements, you need a normal probability plot, which you then assess for straightness.

(a) Make is categorical, so a bar chart (or maybe a pie chart, if you are thinking of "out of all the students, what fraction drive cars of a certain make?"). For how old, a quantitative variable, a histogram is the thing (or a stemplot, or even a boxplot).
