Elementary Statistics

We are now ready to begin our exploration of how we make estimates of the population mean, $\mu$. Before we get started, I want to emphasize the importance of having collected a representative sample, i.e. one that is a simple random sample. Without that, our estimates are useless.

The best estimate of the mean that is available to us is the mean of our sample, $\bar{x}$. However, we do not expect $\bar{x}$ to equal $\mu$. Therefore this single estimate, while a good start, is somewhat useless because we do not know how far off we are from $\mu$. What we need is a Lower Bound and an Upper Bound such that we can have some confidence that $\mu$ falls between these two limits. In other words, we would like to find some value E such that we are 95% confident that, given the average $\bar{x}$ of any sample, $\mu$ lies somewhere in the interval

$$\bar{x} - E < \mu < \bar{x} + E$$

Even this definition is a bit vague, because what do we mean by confident? To sharpen things up a bit, suppose you were to repeatedly take samples of the same size from the population. For each sample you would get an average, $\bar{x}_i$ for the $i$-th sample. Now, we do not expect any of the $\bar{x}_i$ to equal each other, but we want a single value for E such that 95% of the $\bar{x}_i$ will have the following property: $\mu$ will lie in the interval

$$\bar{x}_i - E < \mu < \bar{x}_i + E$$

So if 95% of the samples we can take have this property, we can be 95% confident that the sample we did take has this property, i.e. $\mu$ will lie in the interval

$$\bar{x} - E < \mu < \bar{x} + E$$

First note that we are working with averages, $\bar{x}$ and $\mu$. That means that the probability distribution we will be working with is the sampling distribution of the mean. The mean of this distribution is $\mu_{\bar{x}}$ and the standard deviation is $\sigma_{\bar{x}}$. According to the Central Limit Theorem, $\mu_{\bar{x}} = \mu$ and $\sigma_{\bar{x}} = \sigma/\sqrt{n}$, and so we will be working with these values. Now picture the sampling distribution with $\mu$ at its center.
All possible $\bar{x}$ are in the sampling distribution somewhere, and so if we find a value E such that the interval $\mu - E < \bar{x} < \mu + E$, which is centered on $\mu$, captures 95% of the area under the curve, it will also capture 95% of all the possible $\bar{x}$.

Take a look at the chart below. It is a chart of the Standard Normal Curve, and hence its center is 0.
There's a lot going on here, so let's take things one step at a time. The area of 0.95 is centered under the curve. The critical value, $z_{\alpha/2}$, is the boundary between the 0.95 area and the red zone to the right of it. Since we are looking at the graph of a Standard Normal Distribution, that value of $z_{\alpha/2}$ equals 1.96. $\alpha$ is called the significance, and it simply equals 1.0 − Confidence Level (expressed as a decimal). Hence, in this case, $\alpha = 1 - 0.95 = 0.05$ and $\alpha/2 = 0.025$. In other words, the area of each red zone is 0.025 and together they sum to 0.05.

How did we find that $z_{\alpha/2} = 1.96$ for $\alpha = 0.05$? Look at the chart above. Notice that the total area to the left of $z_{\alpha/2}$ is 0.95 + 0.025 (the area of the red zone on the left). Hence the total area to the left of $z_{\alpha/2}$ is 0.975, and NORM.S.INV(0.975) = 1.96.

Question #1
Now, you try one. Find the value of $z_{\alpha/2}$ for an 80% confidence interval. First find $\alpha$. Then divide it by two. Add that value, $\alpha/2$, to 0.80 and use NORM.S.INV to find $z_{\alpha/2}$ such that the area (called Probability in the dialog) to the left of $z_{\alpha/2}$ is $0.80 + \alpha/2$. Write your answer on the Answer Sheet.

Now, recall the formula for translating from the real world to the z-world, i.e. the axis of the Standard Normal Distribution:

$$z = \frac{x - \mu}{\sigma}$$

First, recognizing that we are working with the sampling distribution, we rework the formula to reflect this:

$$z = \frac{\bar{x} - \mu_{\bar{x}}}{\sigma_{\bar{x}}}$$

If we then use the Central Limit Theorem, we have that $\mu_{\bar{x}} = \mu$ and $\sigma_{\bar{x}} = \sigma/\sqrt{n}$, and so we get

$$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$$

Question #2
Let's try one. Let […] and write your answer on the Answer Sheet.

The $z$ is the translated value of $\bar{x}$. Since 95% of the area under the Standard Normal Curve lies between -1.96 and +1.96, 95% of the translated $\bar{x}$ values must lie within this range as well. In other words, there is a 95% chance that the z-value of the average of any given sample will lie between -1.96 and +1.96. Now we're getting somewhere, because we have just objectively stated what we mean by 95% confidence. All that remains to do is to translate z = 1.96 back into the real world, and we'll have our upper and lower bound on $\mu$. Remember, our goal is to find an E such that

$$\bar{x} - E < \mu < \bar{x} + E$$
We start with the fact that there's a 95% chance that, given any sample, we'll have

$$-1.96 < z < 1.96$$

Using the formula above, we translate and get

$$-1.96 < \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} < 1.96$$

A little math and we have

$$-1.96\,\frac{\sigma}{\sqrt{n}} < \bar{x} - \mu < 1.96\,\frac{\sigma}{\sqrt{n}}$$

and then

$$-\bar{x} - 1.96\,\frac{\sigma}{\sqrt{n}} < -\mu < -\bar{x} + 1.96\,\frac{\sigma}{\sqrt{n}}$$

Multiplying through by $-1$ (which reverses the inequalities) and writing it in standard form, we get

$$\bar{x} - 1.96\,\frac{\sigma}{\sqrt{n}} < \mu < \bar{x} + 1.96\,\frac{\sigma}{\sqrt{n}}$$

We have found our E, and in general, for any critical value $z_{\alpha/2}$, i.e. any confidence interval, we have

$$E = z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}$$

This E is called the margin of error.

Question #3
Find $z_{\alpha/2}$ for a 99% confidence interval, and then, assuming that […], find E.

Unfortunately, we are not much better off than before we started, because the value of E that we derived depends on knowing $\sigma$, and if we don't know the value of $\mu$ (that is, after all, what we are trying to estimate) then why would we know $\sigma$? This problem wasn't solved until around the turn of the 20th century, when William Gosset, working for the Guinness Brewery company, worked out a probability distribution that could be used to perform quality control tests using small samples. He published it under the pseudonym "Student", which is why it is called the Student t distribution. The value of using this distribution in place of the Standard Normal Distribution used in the derivation above is that now we can use s, the sample standard deviation, which we do know, instead of $\sigma$. This was really a big deal.
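Before moving on to the t distribution, the margin of error $E = z_{\alpha/2}\,\sigma/\sqrt{n}$ and the repeated-sampling idea behind "95% confident" can be checked with a short Python sketch. This is only an illustration under stated assumptions: the population values (mean 50, standard deviation 10, sample size 25) and all function names are made up for the example and do not come from these notes. `NormalDist().inv_cdf` from the standard library plays the role of NORM.S.INV.

```python
import math
import random
from statistics import NormalDist


def z_critical(confidence):
    """Return z_{alpha/2}: the z-value with area 1 - alpha/2 to its left."""
    alpha = 1.0 - confidence
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


def margin_of_error_z(confidence, sigma, n):
    """E = z_{alpha/2} * sigma / sqrt(n). Needs the population sigma."""
    return z_critical(confidence) * sigma / math.sqrt(n)


def coverage(mu, sigma, n, confidence, trials, seed=0):
    """Fraction of simulated samples whose interval x_bar +/- E contains mu."""
    rng = random.Random(seed)
    e = margin_of_error_z(confidence, sigma, n)
    hits = 0
    for _ in range(trials):
        x_bar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        if x_bar - e < mu < x_bar + e:
            hits += 1
    return hits / trials


# Hypothetical population, chosen only for illustration.
MU, SIGMA, N = 50.0, 10.0, 25

z = z_critical(0.95)
E = margin_of_error_z(0.95, SIGMA, N)
frac = coverage(MU, SIGMA, N, 0.95, trials=4000)
print(f"z = {z:.2f}, E = {E:.2f}, coverage = {frac:.3f}")
```

The printed coverage should come out close to 0.95, though it will not be exactly 0.95 because it is itself an estimate from a finite number of simulated samples. Note that the sketch has to be told `SIGMA`, which is exactly the practical problem raised above.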
This Student t distribution is very similar in shape to the Standard Normal distribution, except that it is wider, i.e. it has a larger standard deviation. Furthermore, the size of the standard deviation, and hence the width of the shape, depends on the sample size. The smaller the sample, the larger the standard deviation. Take a look at the following figure, which shows different shapes for the t-distribution as a function of sample size and compares them to the Standard Normal distribution.

Here are a few more rules for working with the t-distribution. If you know for a fact, or you strongly suspect (because you carefully examined the histogram of your sample), that the underlying population is normally distributed, then the sample size is not that important other than its effect on the shape of the t-curve. However, if you suspect that the underlying population is not all that normally shaped, then your sample size should be a minimum of 30. Otherwise, your results will not be reliable.

Take a look at the Excel dialog for T.INV.2T. The entry for Probability is the significance, $\alpha$ (another bad label). The dialog uses a value of 0.05 for $\alpha$ because we are trying to find the 95% Confidence Interval. If we wanted to find the 99% Confidence
Interval we would use a value of 0.01 ($\alpha = 1 - 0.99$). Notice how this differs significantly from NORM.S.INV. Instead of inputting $1 - \alpha/2$ for Probability, we just input $\alpha$.

Now, note that there's something brand new here called the degrees of freedom. No, this has nothing to do with the Tea Party. The degrees of freedom is simply one less than the sample size,

$$\text{Deg\_freedom} = n - 1$$

So if the sample size, n, is 20 then the degrees of freedom (df) is 19.

We use the T.INV.2T function to find the critical value $t_{\alpha/2}$ that we use in place of the value 1.96 in the formula for E above, and now we can use s instead of $\sigma$:

$$E = t_{\alpha/2}\,\frac{s}{\sqrt{n}}$$

And the confidence interval is

$$\bar{x} - E < \mu < \bar{x} + E$$

Remarkably, the formula above uses just information from our sample: the size n, the mean $\bar{x}$, and the standard deviation s.

Worked Example
We receive a batch of 50,000 washers, and we wish to estimate the average inside diameter of the washers. We carefully select a simple random sample of size 20 and find that the average inside diameter is 24.78 mm with a standard deviation of 1.62 mm. We want to calculate a 95% confidence interval for our estimate of the batch mean.

First we calculate $t_{\alpha/2}$ using T.INV.2T with Probability = 0.05 and Deg_freedom = 19. We see that $t_{\alpha/2} = 2.093$, and we proceed to calculate E:

$$E = t_{\alpha/2}\,\frac{s}{\sqrt{n}} = 2.093\left(\frac{1.62}{\sqrt{20}}\right) = 0.758$$

Finally, the confidence interval is

$$24.78 - 0.758 < \mu < 24.78 + 0.758, \quad \text{i.e.} \quad 24.02 < \mu < 25.54 \ \text{(mm)}$$

Below is the Excel spreadsheet that you can use to calculate these values. If you double click on the table, you will bring up a copy of Excel. Then if you select any of the cells, such as the value for t, you will see the Excel formula in the formula bar, toward the top. Also, try clicking on the value for E to see its Excel formula.
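For readers working outside Excel, the washer example can be reproduced with a short Python sketch. This is an illustration, not the method used in the spreadsheet: the function names are invented for the example, and the critical value is found numerically (Simpson's rule for the area under the t curve, then bisection) so that only the standard library is needed. A statistics library routine such as `scipy.stats.t.ppf` would be the more usual choice if SciPy is available.

```python
import math


def t_pdf(x, df):
    """Density of the Student t distribution with df degrees of freedom."""
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    coef = math.exp(log_coef) / math.sqrt(df * math.pi)
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def t_cdf(x, df, steps=1000):
    """P(T <= x) for x >= 0: 0.5 plus Simpson's rule on [0, x] (steps is even)."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3


def t_critical_two_tailed(alpha, df):
    """Find t with area alpha/2 in each tail, the quantity T.INV.2T returns.

    Bisection on [0, 100]; assumes the answer is below 100, which holds
    for the sample sizes and confidence levels used in these notes.
    """
    target = 1 - alpha / 2
    lo, hi = 0.0, 100.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Values from the worked example: 20 washers, mean 24.78 mm, s = 1.62 mm.
x_bar, s, n, alpha = 24.78, 1.62, 20, 0.05

t = t_critical_two_tailed(alpha, n - 1)   # degrees of freedom = n - 1
E = t * s / math.sqrt(n)                  # margin of error
lower, upper = x_bar - E, x_bar + E
print(f"t = {t:.6f}, E = {E:.6f}, interval = ({lower:.5f}, {upper:.5f})")
```

The printed values should agree with the spreadsheet below to the digits shown there. Changing `alpha` to 0.01 gives the 99% interval for the same sample.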
Finding a confidence interval for a mean, σ unknown.

x̄       s      n    α      t          E          x̄ − E      x̄ + E
24.78   1.62   20   0.05   2.093024   0.758183   24.02182   25.53818

The cells in blue can't be changed by you. That's so you can't accidentally break the formulas. Also, I've noticed that after you have been working with these embedded Excel sheets for a while, things can start to misbehave. I believe that's a problem with the operating system. If you suspect that something weird is going on, just close the unit notes and download a fresh copy.

Question #4
One last note. Suppose that the manufacturer of the washers had claimed that the average inside diameter was 25.00 mm. On the basis of this sample, could you refute the claim?

Question #5
Now it's your turn to have some fun. Use the embedded Excel sheet above. Assume the population is normally distributed. For a sample size of 61, the average weight loss was 4.0 kg with a standard deviation of 6.4 kg. Find a 99% confidence interval for the mean of the population and enter your answer on the Answer Sheet.

This is the end of Unit 14. Now turn to Unit 14 homework in your MyMathLab to get more practice with these concepts.
Answer Sheet

Name

1.
2.
3.
4.
5.