Sampling Distributions and the Central Limit Theorem February 18
Data distributions and sampling distributions

So far, we have discussed the distribution of data (i.e., of random variables in our sample, such as smoking status or height). For the rest of the course, we will be more interested in the distribution of statistics. For example: we go out, collect a sample of 10 people, measure their heights, and then take the average. Is this average a random variable?
Sources of variability

Yes, for several reasons:
- There is bound to be measurement error
- A person's height is not perfectly constant: people are (slightly) taller in the morning than in the evening
- Most importantly, the average height depends on the specific ten people that comprise our sample
For all of these reasons, every time we repeat this procedure, we will get a different answer.
Sampling distributions

Therefore, our statistic is a random variable, and like any random variable, it has a distribution. To reflect the fact that this distribution depends on the random sample, the distribution of a statistic is called a sampling distribution. In practice, no one obtains sampling distributions directly. Investigators do not collect 10 samples of 10 individuals and report 10 different means; if they can afford to sample 100 people, they collect a single sample of 100 people and report a single mean.
What's the point?

So why do we study sampling distributions? The reason is to understand how, in theory, our statistic would be distributed. The ability to reproduce research is a cornerstone of the scientific method, and sampling distributions provide a description of the likely and unlikely outcomes of these replications. The variability of this distribution is the key to answering the question: how accurate is my generalization to the population likely to be?
Seeing sampling distributions

As I said before, investigators do not collect multiple samples in order to look at their sampling distributions. Since we cannot observe sampling distributions in practice, we need to do one of the following:
- Argue or prove that the statistic will have a certain distribution, such as the normal or binomial distribution
- Conduct computer simulations in which we artificially create multiple samples
We will spend a large portion of our time in the second half of this course doing each of the above.
Outline: The law of averages; The expected value of the mean; The standard error of the mean; The distribution of the mean; The central limit theorem

Kerrich's experiment

A South African mathematician named John Kerrich was visiting Copenhagen in 1940 when Germany invaded Denmark. Kerrich spent the next five years in an internment camp. To pass the time, he carried out a series of experiments in probability theory; one of them involved flipping a coin 10,000 times.
The law of averages

We all know that a coin lands heads with probability 50%. Thus, after many tosses, the law of averages says that the number of heads should be about the same as the number of tails... or does it?
Kerrich's results

Tosses (n)   Heads   Heads − 0.5 × Tosses
    10           4        −1
   100          44        −6
   500         255         5
 1,000         502         2
 2,000       1,013        13
 3,000       1,510        10
 4,000       2,029        29
 5,000       2,533        33
 6,000       3,009         9
 7,000       3,516        16
 8,000       4,034        34
 9,000       4,538        38
10,000       5,067        67
Kerrich's results plotted

[Figure: number of heads minus half the number of tosses, plotted against the number of tosses (10 to 10,000, log scale)]

Instead of getting closer, the numbers of heads and tails are getting farther apart.
Repeating the experiment 50 times

This is no fluke:

[Figure: 50 replications of the experiment; number of heads minus half the number of tosses vs. number of tosses (log scale)]
Where's the law of averages?

So where's the law of averages? As it turns out, the law of averages does not say that as n increases the number of heads will be close to the number of tails. What it says instead is that, as n increases, the average number of heads will get closer and closer to the population average of 0.5. The technical term for this is that the sample average, which is a statistic, converges to the population average, which is a parameter.
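This distinction is easy to see in a short simulation. The sketch below (my own illustration, not Kerrich's actual procedure; the seed and function name are arbitrary) flips a fair coin and reports both the excess of heads over n/2 and the proportion of heads:

```python
import random

random.seed(0)  # illustrative seed, for reproducibility

def flip_summary(n_flips):
    """Flip a fair coin n_flips times; return (heads - n/2, proportion of heads)."""
    heads = sum(random.random() < 0.5 for _ in range(n_flips))
    return heads - n_flips / 2, heads / n_flips

for n in (100, 10_000, 1_000_000):
    excess, prop = flip_summary(n)
    print(f"n = {n:>9,}: heads - n/2 = {excess:+9.1f}, proportion = {prop:.4f}")
```

Typically the absolute excess of heads grows with n, while the proportion of heads settles down toward 0.5, exactly as the law of averages actually claims.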
Repeating the experiment 50 times, Part II

[Figure: percentage of heads vs. number of tosses (log scale) for the 50 replications]
Trends in Kerrich's experiment

There are three very important trends going on in this experiment. These trends can be observed visually from the computer simulations or proven via the binomial distribution. We'll work with both approaches so that you can get a sense of how they work and reinforce each other. Before we do so, I'll introduce two additional, important facts about the binomial distribution: its mean and standard deviation.
The mean and standard deviation of the binomial distribution

The mean of a binomial distribution is np. This makes intuitive sense: if the average number of times the event occurs on each trial is p, then it stands to reason that the average number of times it occurs over n trials is np. When talking about the mean of a distribution, the term expected value is sometimes used, to avoid confusion with sample means.

The standard deviation of a binomial distribution is √(np(1 − p)). This also makes sense: when p is close to 0 or 1, there is less variability than when p is close to 0.5, and as we have seen, variability goes up with n.
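Both formulas can be checked by simulation. A minimal sketch, assuming an arbitrary choice of n = 100 and p = 0.3 (not from the lecture) and comparing the theoretical mean np and SD √(np(1 − p)) against 20,000 simulated binomial draws:

```python
import math
import random

random.seed(1)  # illustrative seed
n, p = 100, 0.3  # hypothetical binomial parameters for the check
reps = 20_000

# Each draw counts successes in n independent trials with success probability p
draws = [sum(random.random() < p for _ in range(n)) for _ in range(reps)]
sim_mean = sum(draws) / reps
sim_sd = math.sqrt(sum((x - sim_mean) ** 2 for x in draws) / reps)

print(f"theory:     mean = {n * p:.1f}, SD = {math.sqrt(n * p * (1 - p)):.3f}")
print(f"simulation: mean = {sim_mean:.3f}, SD = {sim_sd:.3f}")
```

The simulated values should land close to np = 30 and √(np(1 − p)) ≈ 4.58.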
The expected value of the mean

Getting back to Kerrich's experiment, the number of heads that we will have after flipping a coin n times follows a binomial distribution with n trials and probability 0.5 of the event occurring on any given trial. Thus, the expected number of heads after n flips is 0.5n. Furthermore, the expected value of the average is 0.5n/n = 0.5.
The expected value of the mean (cont'd)

Thus, for all n, the expected value of the average is equal to the population parameter p = 0.5. Putting it another way, we can always expect the distribution of the sample average to be centered around the population average. Such statistics are called unbiased estimators.
The standard error of the mean

To avoid confusion with the sample standard deviation, the standard deviation of the sampling distribution of a statistic is called its standard error. In Kerrich's experiment, the standard deviation of the number of heads is √(np(1 − p)) = √(n(0.5)(0.5)) = 0.5√n. The standard error (the standard deviation of the average number of heads) is therefore √(np(1 − p))/n = 0.5√n/n = 0.5/√n.
Variability in the population, the sum, and the mean

Note that, in the original population, the standard deviation was √(p(1 − p)) = 0.5. Denoting this standard deviation as SD, the variability (standard deviation or standard error) of the original population, the sum, and the mean are related as follows:

Population   Sum       Mean
SD           SD·√n     SD/√n

So, for example, as we double n, it is true that the variability of the sum will increase. However, the variability doesn't double; it only goes up by a factor of √2. Thus, when we divide by n to obtain the average, we are left with an expression, SD/√n, that goes to 0 as n increases.
The square root law

The relationship between variability in the population and variability in the mean is a very important relationship, sometimes called the square root law:

SE = SD/√n

We will see variations on this theme many times in the second half of this course. Once again, we see this phenomenon visually in our simulation results.
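The square root law can also be seen numerically. This sketch (my own check, with an arbitrary seed and arbitrary sample sizes) simulates coin-flip averages at several n and compares the simulated standard error with SD/√n for SD = 0.5:

```python
import math
import random

random.seed(2)  # illustrative seed
SD = 0.5  # population SD for a fair coin: sqrt(p * (1 - p)) with p = 0.5
reps = 5_000

for n in (25, 100, 400):
    # reps simulated sample averages, each based on n fair coin flips
    means = [sum(random.random() < 0.5 for _ in range(n)) / n for _ in range(reps)]
    m = sum(means) / reps
    se_sim = math.sqrt(sum((x - m) ** 2 for x in means) / reps)
    print(f"n = {n:>3}: SD/sqrt(n) = {SD / math.sqrt(n):.4f}, simulated SE = {se_sim:.4f}")
```

Each quadrupling of n should cut the simulated standard error roughly in half.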
The distribution of the mean

Finally, let's look at the distribution of the mean by creating histograms of the mean in our simulation.

[Figure: histograms of the sample mean for 2 flips, 7 flips, and 25 flips]
The central limit theorem

In summary, there are three very important phenomena going on here concerning the sampling distribution of the sample average:
#1 The expected value is always equal to the population average
#2 The standard error is always equal to the population standard deviation divided by the square root of n
#3 As n gets larger, the sampling distribution looks more and more like the normal distribution
Furthermore, it can be proven that these three properties of the sampling distribution of the sample average hold for any distribution in the underlying population.
The central limit theorem (cont'd)

This result is called the central limit theorem, and it is one of the most important, remarkable, and powerful results in all of statistics. In the real world, we rarely know the distribution of our data, but the central limit theorem says: we don't have to.
The central limit theorem (cont'd)

Furthermore, as we have seen, knowing the mean and standard deviation of a distribution that is approximately normal allows us to calculate anything we wish to know with tremendous accuracy, and the sampling distribution of the mean is always approximately normal. The only caveats:
- Observations must be independently drawn from, and representative of, the population
- The central limit theorem applies to the sampling distribution of the mean, not necessarily to the sampling distribution of other statistics
How large does n have to be before the distribution becomes close enough in shape to the normal distribution?
Rules of thumb

It is frequently recommended that n = 20 or n = 30 is large enough to be sure that the central limit theorem is working. There is some truth to such rules, but in reality, whether n is large enough for the central limit theorem to provide an accurate approximation to the true sampling distribution depends on how close to normal the population distribution is. If the original distribution is close to normal, n = 2 might be enough; if the underlying distribution is highly skewed or strange in some other way, n = 50 might not be enough.
Example #1

[Figure: a population distribution (left) and the distribution of sample means for n = 10 (right)]
Example #2

For example, imagine an urn containing the numbers 1, 2, and 9:

[Figure: histograms of the sample mean for n = 20 and n = 50]
Example #2 (cont'd)

[Figure: histogram of the sample mean for n = 100]
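The urn example can be checked directly. A minimal simulation sketch (seed and replication counts are my own choices): the population {1, 2, 9} has mean 4 and SD √(38/3) ≈ 3.56, so for samples of size 50 the CLT predicts sample averages centered at 4 with standard error ≈ 0.50:

```python
import math
import random

random.seed(3)  # illustrative seed
urn = [1, 2, 9]
pop_mean = sum(urn) / len(urn)                                        # 4.0
pop_sd = math.sqrt(sum((x - pop_mean) ** 2 for x in urn) / len(urn))  # ~3.56

n, reps = 50, 10_000
# Each replication draws n balls with replacement and records the average
means = [sum(random.choices(urn, k=n)) / n for _ in range(reps)]
sim_mean = sum(means) / reps
sim_se = math.sqrt(sum((x - sim_mean) ** 2 for x in means) / reps)

print(f"expected value: theory {pop_mean:.3f}, simulation {sim_mean:.3f}")
print(f"standard error: theory {pop_sd / math.sqrt(n):.3f}, simulation {sim_se:.3f}")
```

Despite the severely non-normal population, a histogram of these 10,000 means would already look quite bell-shaped.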
Example #3

Weight tends to be skewed to the right (far more people are overweight than underweight). Let's perform an experiment in which the NHANES sample of adult men is the population. I am going to randomly draw twenty-person samples from this population (i.e., I am re-sampling the original sample).
Example #3 (cont'd)

[Figure: histogram of the sample mean weight for n = 20]
Sampling distribution of serum cholesterol

According to the National Center for Health Statistics, the distribution of serum cholesterol levels for 20- to 74-year-old males living in the United States has a mean of 211 mg/dl and a standard deviation of 46 mg/dl. We are planning to collect a sample of 25 individuals and measure their cholesterol levels. What is the probability that our sample average will be above 230?
Procedure: Probabilities using the central limit theorem

Calculating probabilities using the central limit theorem is quite similar to calculating them from the normal distribution, with one extra step:
#1 Calculate the standard error: SE = SD/√n, where SD is the population standard deviation
#2 Draw a picture of the normal approximation to the sampling distribution and shade in the appropriate probability
#3 Convert to standard units: z = (x − µ)/SE, where µ is the population mean
#4 Determine the area under the normal curve using a table or computer
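These steps can be sketched in a few lines of code. The function names here are my own, and the error function stands in for a normal table:

```python
import math

def norm_cdf(z):
    """Standard normal CDF, computed from the error function (replaces a z-table)."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def prob_mean_above(x, mu, sd, n):
    """P(sample average > x) under the CLT's normal approximation."""
    se = sd / math.sqrt(n)   # step 1: standard error
    z = (x - mu) / se        # step 3: convert to standard units
    return 1 - norm_cdf(z)   # step 4: upper-tail area under the normal curve

# The cholesterol setting: mu = 211, SD = 46, n = 25, threshold 230
print(f"P(mean > 230) = {prob_mean_above(230, 211, 46, 25):.1%}")
```

Step 2 (drawing a picture) has no code analogue, but it is what tells you whether you want the upper or lower tail.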
Example #1: Solution

We begin by calculating the standard error:

SE = SD/√n = 46/√25 = 9.2

Note that it is smaller than the standard deviation by a factor of √n.
Example #1: Solution (cont'd)

After drawing a picture, we would determine how many standard errors away from the mean 230 is:

(230 − 211)/9.2 = 2.07

What is the probability that a normally distributed random variable is more than 2.07 standard deviations above the mean? 1 − 0.981 = 1.9%
Comparison with the population

Note that this is a very different number than the percent of the population that has a cholesterol level above 230. That number is 34.0% (230 is 0.41 standard deviations above the mean). The mean of a group is much less variable than individuals. As Sherlock Holmes says in The Sign of the Four: "While the individual man is an insoluble puzzle, in the aggregate he becomes a mathematical certainty. You can, for example, never foretell what any one man will do, but you can say with precision what an average number will be up to. Individuals vary, but percentages remain constant. So says the statistician."
Procedure: Central limit theorem percentiles

We can also use the central limit theorem to approximate percentiles of the sampling distribution:
#1 Calculate the standard error: SE = SD/√n
#2 Draw a picture of the normal curve and shade in the appropriate area under the curve
#3 Determine the percentiles of the normal curve corresponding to the shaded region using a table or computer
#4 Convert from standard units back to the original units: x = µ + z(SE)
Percentiles

We can use that procedure to answer the question: 95% of our sample averages will fall between what two numbers? Note that the standard error is the same as it was before: 9.2. What two values of the normal distribution contain 95% of the data? The 2.5th percentile of the normal distribution is −1.96; thus, a normally distributed random variable will lie within 1.96 standard deviations of its mean 95% of the time.
Example #2: Solution

Which numbers are 1.96 standard errors away from the expected value of the sampling distribution?

211 − 1.96(9.2) = 193.0
211 + 1.96(9.2) = 229.0

Therefore, 95% of our sample averages will fall between 193 mg/dl and 229 mg/dl.
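The percentile procedure reduces to a two-line formula. A minimal sketch (the function name and default z are my own choices; z = 1.96 gives the central 95%):

```python
import math

def clt_interval(mu, sd, n, z=1.96):
    """Central interval mu ± z * SE for the sample average (z = 1.96 covers ~95%)."""
    se = sd / math.sqrt(n)
    return mu - z * se, mu + z * se

lo, hi = clt_interval(211, 46, 25)
print(f"({lo:.1f}, {hi:.1f})")  # the cholesterol interval for n = 25
```

Swapping in other sample sizes or z values (e.g., 1.645 for 90%) reuses the same computation.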
Example #3

What if we had only collected samples of size 10? Now the standard error is

SE = 46/√10 = 14.5

Now what is the probability that our sample average will be above 230?
Example #3: Solution

Now 230 is only

(230 − 211)/14.5 = 1.31

standard deviations away from the expected value. The probability of being more than 1.31 standard deviations above the mean is 9.6%. This is almost 5 times higher than the 1.9% we calculated earlier for the larger sample size.
Example #4

What about the values that would contain 95% of our sample averages? The values 1.96 standard errors away from the expected value are now

211 − 1.96(14.5) = 182.5
211 + 1.96(14.5) = 239.5

Note how much wider this interval is than the interval (193, 229) for the larger sample size.
Example #5

What if we'd increased the sample size to 50? Now the standard error is 6.5, and the values

211 − 1.96(6.5) = 198.2
211 + 1.96(6.5) = 223.8

contain 95% of the sample averages.
Summary

n     SE     Interval           Width of interval
10    14.5   (182.5, 239.5)     57.0
25     9.2   (193.0, 229.0)     36.0
50     6.5   (198.2, 223.8)     25.6

The width of the interval is going down by what factor?
Example #6

Finally, we ask a slightly harder question: how large would the sample size need to be in order to ensure a 95% probability that the sample average will be within 5 mg/dl of the population mean? As we saw earlier, 95% of observations fall within 1.96 standard deviations of the mean. Thus, we need the standard error to satisfy 1.96(SE) = 5, i.e.

SE = 5/1.96
Example #6: Solution

The standard error is equal to the standard deviation over the square root of n, so

5/1.96 = SD/√n
√n = 1.96(SD)/5 = 1.96(46)/5
n = 325.2

In the real world, we of course cannot sample 325.2 people, so we would sample 326 to be safe.
Example #7

How large would the sample size need to be in order to ensure a 90% probability that the sample average will be within 10 mg/dl of the population mean? There is a 90% probability that a normally distributed random variable will fall within 1.645 standard deviations of the mean. Thus, we want 1.645(SE) = 10, so

10/1.645 = 46/√n
√n = 1.645(46)/10
n = 57.3

Thus, we would sample 58 people.
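Both sample-size calculations follow one pattern: solve z·SD/√n = margin for n and round up. A sketch of that pattern (the function name is my own):

```python
import math

def required_n(sd, margin, z=1.96):
    """Smallest n with z * SD/sqrt(n) <= margin, rounding up to a whole person."""
    return math.ceil((z * sd / margin) ** 2)

print(required_n(46, 5))            # Example #6: 95% within 5 mg/dl -> 326
print(required_n(46, 10, z=1.645))  # Example #7: 90% within 10 mg/dl -> 58
```

Rounding up with math.ceil reflects the "to be safe" step: any fractional n is bumped to the next whole person so the margin is still met.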