Lecture 8 - Sampling Distributions and the CLT

Lecture 8 - Sampling Distributions and the CLT Statistics 102 Kenneth K. Lopiano September 18, 2013

Outline
1 Basics
    Improvements
2 Variability of Estimates
    Activity
    Sampling distributions - via simulation
    Sampling distributions - via CLT

Basics: Histograms of number of successes

Hollow histograms of samples from the binomial model where p = 0.10 and n = 10, 30, 100, and 300. What happens as n increases?

[Figure: four hollow histograms of the number of successes, one panel each for n = 10, n = 30, n = 100, and n = 300.]

[Figure: a second set of four panels for the same samples (n = 10, 30, 100, and 300), each with a horizontal axis running from -2 to 2.]

Basics: How large is large enough?

The sample size is considered large enough if the expected numbers of successes and failures are both at least 10:

$$np \ge 10 \quad \text{and} \quad n(1 - p) \ge 10$$
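The rule of thumb can be checked mechanically. The sketch below is in Python rather than the R used in class, and the function name `large_enough` and its `threshold` argument are illustrative assumptions; the default of 10 follows the "at least 10" wording of the slide.

```python
def large_enough(n: int, p: float, threshold: float = 10.0) -> bool:
    """Return True when expected successes and failures both reach the threshold."""
    expected_successes = n * p
    expected_failures = n * (1 - p)
    return expected_successes >= threshold and expected_failures >= threshold


# The four sample sizes from the histogram slide, all with p = 0.10.
for n in (10, 30, 100, 300):
    print(n, large_enough(n, 0.10))
```

Passing a different `threshold` lets you see how sensitive the verdict is to the cutoff you adopt.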

Basics: An analysis of Facebook users

A recent study found that Facebook users get more than they give. For example:
- 40% of Facebook users in our sample made a friend request, but 63% received at least one request.
- Users in our sample pressed the like button next to friends' content an average of 14 times, but had their content liked an average of 20 times.
- Users sent 9 personal messages, but received 12.
- 12% of users tagged a friend in a photo, but 35% were themselves tagged in a photo.

Any guesses for how this pattern can be explained?
Power users: they add much more content than the typical user.

Source: http://www.pewinternet.org/Reports/2012/Facebook-users/Summary.aspx

Basics: Facebook cont.

This study found that approximately 25% of Facebook users are considered power users. The same study found that the average Facebook user has 245 friends. What is the probability that the average Facebook user with 245 friends has 70 or more friends who would be considered power users?

We are given that n = 245, p = 0.25, and we are asked for the probability $P(X \ge 70)$.

$$P(X \ge 70) = P(X = 70 \text{ or } X = 71 \text{ or } X = 72 \text{ or } \cdots \text{ or } X = 245) = P(X = 70) + P(X = 71) + P(X = 72) + \cdots + P(X = 245)$$

This seems like an awful lot of work...
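By hand the sum is tedious, but a computer can add the 176 terms directly. A minimal sketch in Python (standard library only); the helper names `binom_pmf` and `binom_upper_tail` are illustrative assumptions, not part of the lecture.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Bin(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_upper_tail(k_min: int, n: int, p: float) -> float:
    """P(X >= k_min), summing P(X = k) for k = k_min, ..., n."""
    return sum(binom_pmf(k, n, p) for k in range(k_min, n + 1))


prob = binom_upper_tail(70, 245, 0.25)
print(f"P(X >= 70) = {prob:.4f}")
```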

Basics: Normal approximation to the binomial

When the sample size is large enough, the binomial distribution with parameters n and p can be approximated by the normal model with parameters $\mu = np$ and $\sigma = \sqrt{np(1 - p)}$.

In the case of the Facebook power users, n = 245 and p = 0.25:

$$\mu = 245 \times 0.25 = 61.25, \qquad \sigma = \sqrt{245 \times 0.25 \times 0.75} = 6.78$$

$$\text{Bin}(n = 245, p = 0.25) \approx N(\mu = 61.25, \sigma = 6.78)$$

[Figure: Bin(245, 0.25) and N(61.25, 6.78) plotted together; horizontal axis k from 20 to 100, vertical axis from 0.00 to 0.06.]

Basics: Facebook cont.

What is the probability that the average Facebook user with 245 friends has 70 or more friends who would be considered power users?

$$Z = \frac{\text{obs} - \text{mean}}{SD} = \frac{70 - 61.25}{6.78} = 1.29$$

$$P(X \ge 70) \approx P(Z \ge 1.29) = 1 - 0.9015 = 0.0985$$

For comparison, the binomial model itself gives $P(X \ge 70) = 0.1128$.

Excerpt from the normal probability table:

         Second decimal place of Z
  Z      0.05    0.06    0.07    0.08    0.09
  1.0    0.8531  0.8554  0.8577  0.8599  0.8621
  1.1    0.8749  0.8770  0.8790  0.8810  0.8830
  1.2    0.8944  0.8962  0.8980  0.8997  0.9015

[Figure: normal curve with 61.25 and 70 marked on the horizontal axis.]
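The table lookup can be reproduced in code. This is a sketch in Python using `statistics.NormalDist` from the standard library; it keeps full precision for $\sigma$ and $Z$ instead of rounding to two decimals, so the last digit may differ slightly from the table-based answer.

```python
from math import sqrt
from statistics import NormalDist

n, p = 245, 0.25
mu = n * p                          # mean of the approximating normal
sigma = sqrt(n * p * (1 - p))       # SD of the approximating normal
z = (70 - mu) / sigma               # Z score of the cutoff 70
upper_tail = 1 - NormalDist().cdf(z)  # P(Z >= z) under the standard normal

print(f"mu = {mu}, sigma = {sigma:.2f}, Z = {z:.2f}, P(Z >= Z_obs) = {upper_tail:.4f}")
```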

Improvements: Improving the approximation

Take, for example, a binomial distribution where n = 20 and p = 0.5. We should be able to approximate the distribution of X using $N(\mu = 10, \sigma = \sqrt{5})$.

[Figure: distribution of X; horizontal axis from 2 to 18.]

It is clear that our approximation is missing about 1/2 of P(X = 7) and 1/2 of P(X = 13); as $n \to \infty$ this error becomes very small. In this case P(X = 7) = P(X = 13) = 0.074, so our approximation is off by about 7%.

Improvements: Improving the approximation, cont.

Binomial probability:

$$P(7 \le X \le 13) = \sum_{k=7}^{13} \binom{20}{k} 0.5^k (1 - 0.5)^{20-k} = 0.88468$$

Naive approximation:

$$P(7 \le X \le 13) \approx P\left(Z \le \frac{13 - 10}{\sqrt{5}}\right) - P\left(Z \le \frac{7 - 10}{\sqrt{5}}\right) = 0.82029$$

Continuity corrected approximation:

$$P(7 \le X \le 13) \approx P\left(Z \le \frac{13 + 1/2 - 10}{\sqrt{5}}\right) - P\left(Z \le \frac{7 - 1/2 - 10}{\sqrt{5}}\right) = 0.88248$$
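The three numbers can be reproduced side by side. A minimal sketch in Python (standard library only); the variable names are illustrative, and the continuity correction simply widens the interval by 1/2 on each side before applying the normal model.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 20, 0.5
lo, hi = 7, 13
normal = NormalDist(mu=n * p, sigma=sqrt(n * p * (1 - p)))

# Exact binomial probability P(7 <= X <= 13).
exact = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(lo, hi + 1))

# Normal approximation without and with the continuity correction.
naive = normal.cdf(hi) - normal.cdf(lo)
corrected = normal.cdf(hi + 0.5) - normal.cdf(lo - 0.5)

print(f"exact = {exact:.5f}, naive = {naive:.5f}, corrected = {corrected:.5f}")
```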

Improvements: Improving the approximation, cont.

This correction also lets us do moderately useless things, like calculate the probability for a particular value of k. For example, what is the chance of 50 heads in 100 tosses of a slightly unfair coin (p = 0.55)?

Binomial probability:

$$P(X = 50) = \binom{100}{50} 0.55^{50} (1 - 0.55)^{50} = 0.04815$$

Naive approximation:

$$P(X = 50) \approx P\left(Z \le \frac{50 - 55}{4.97}\right) - P\left(Z \le \frac{50 - 55}{4.97}\right) = 0$$

Continuity corrected approximation:

$$P(X = 50) \approx P\left(Z \le \frac{50 + 1/2 - 55}{4.97}\right) - P\left(Z \le \frac{50 - 1/2 - 55}{4.97}\right) = 0.04839$$
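The single-value case works the same way in code. The sketch below (Python, standard library only) treats X = 50 as the interval from 49.5 to 50.5 under the normal model; it uses the unrounded SD rather than 4.97, so the result may differ in the last digit.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p, k = 100, 0.55, 50
normal = NormalDist(mu=n * p, sigma=sqrt(n * p * (1 - p)))

exact = comb(n, k) * p**k * (1 - p) ** (n - k)     # binomial P(X = 50)
naive = normal.cdf(k) - normal.cdf(k)              # zero-width interval
corrected = normal.cdf(k + 0.5) - normal.cdf(k - 0.5)

print(f"exact = {exact:.5f}, naive = {naive}, corrected = {corrected:.5f}")
```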

Improvements: Example - Rolling lots of dice

Roll a fair die 500 times. What's the probability of rolling at least 100 ones?

$$P(X \ge 100) = \sum_{k=100}^{500} \binom{500}{k} (1/6)^k (5/6)^{500-k} = 1 - \sum_{k=0}^{99} \binom{500}{k} (1/6)^k (5/6)^{500-k}$$

= 1 - pbinom(99, 500, 1/6) = 1 - 0.9717129 = 0.0282871

Improvements: Example - Rolling lots of dice (normal approximation)

Roll a fair die 500 times. What's the probability of rolling at least 100 ones?

Since n is large, X is approximately normal with mean $\mu = np = 500/6 = 83.33$ and SD $\sigma = \sqrt{npq} = \sqrt{2500/36} = 8.333$, where $q = 1 - p$.

$$P(X \ge 100) \approx P\left(Z \ge \frac{100 - 1/2 - \mu}{\sigma}\right) = P\left(Z \ge \frac{100 - 1/2 - 83.33}{8.333}\right) = 1 - P(Z \le 1.94) = 1 - 0.9738 = 0.0262$$
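Both routes to the dice answer fit in a few lines. This is a Python sketch (standard library only) rather than the R call shown on the slide; the explicit sum plays the role of `pbinom(99, 500, 1/6)`.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 500, 1 / 6

# Exact: 1 - P(X <= 99), summing the binomial probabilities for k = 0, ..., 99.
cdf_99 = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(100))
exact = 1 - cdf_99

# Normal approximation with continuity correction: P(X >= 100) ~ P(Y >= 99.5).
mu = n * p
sigma = sqrt(n * p * (1 - p))
approx = 1 - NormalDist(mu, sigma).cdf(100 - 0.5)

print(f"exact = {exact:.7f}, normal approximation = {approx:.4f}")
```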

Variability of Estimates: Parameter estimation

- We are often interested in population parameters.
- Since complete populations are difficult (or impossible) to collect data on, we use sample statistics as point estimates for the unknown population parameters of interest.
- Sample statistics vary from sample to sample.
- Quantifying how sample statistics vary provides a way to estimate the margin of error associated with our point estimate.
- But before we get to quantifying the variability among samples, let's try to understand how and why point estimates vary from sample to sample.

Suppose we randomly sample 1,000 adults from each state in the US. Would you expect the sample means to be the same, somewhat different, or very different?

Variability of Estimates, Activity: Estimate the avg. # of drinks it takes to get drunk

We would like to estimate the average (self-reported) number of drinks it takes a person to get drunk. We assume that we have the population data:

[Figure: histogram of the population data; horizontal axis: number of drinks to get drunk (0 to 10); vertical axis from 0 to 25.]

Variability of Estimates, Activity: Estimate the avg. # of drinks it takes to get drunk (cont.)

- Sample, without replacement, ten respondents and record the number of drinks it takes them to get drunk.
- Use RStudio to generate 10 random numbers between 1 and 146:
      sample(1:146, size = 10, replace = FALSE)
- If you don't have a computer, ask a neighbor to generate a sample for you.
- Find the sample mean, round it to 1 decimal place, and record it.
- Time permitting, obtain another sample.

If we randomly select observations from this data set, which values are most likely to be selected, and which are least likely?

Variability of Estimates, Activity: Estimate the avg. # of drinks it takes to get drunk (cont.)

    sample(1:146, size = 10, replace = FALSE)
    ## [1] 59 121 88 46 58 72 82 81 5 10

Looking up these ten respondents in the population data gives the sample mean:

(8 + 6 + 10 + 4 + 5 + 3 + 5 + 6 + 6 + 6)/10 = 5.9
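For readers working outside RStudio, the same two steps (draw ten respondent numbers, then average their answers) can be sketched in Python. The dictionary below holds only the ten respondents drawn on the slide, with the values used in the slide's sum; `random.sample` draws without repeats, which is the assumed Python counterpart of `replace = FALSE`.

```python
import random

# Respondent number -> number of drinks, for the ten respondents drawn on the slide.
drinks_by_id = {59: 8, 121: 6, 88: 10, 46: 4, 58: 5, 72: 3, 82: 5, 81: 6, 5: 6, 10: 6}
slide_ids = [59, 121, 88, 46, 58, 72, 82, 81, 5, 10]

# Sample mean rounded to 1 decimal place, as the activity asks.
sample_mean = round(sum(drinks_by_id[i] for i in slide_ids) / len(slide_ids), 1)
print("sample mean:", sample_mean)

# A fresh draw of ten distinct respondent numbers between 1 and 146.
new_ids = random.sample(range(1, 147), k=10)
print("new sample of respondents:", new_ids)
```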

Variability of Estimates, Activity: population data (respondent ID and number of drinks)

 ID Drinks   ID Drinks   ID Drinks   ID Drinks   ID Drinks   ID Drinks   ID Drinks   ID Drinks
  1      7   21      6   41      6   61     10   81      6  101      4  121      6  141      4
  2      5   22      2   42     10   62      7   82      5  102      7  122      5  142      6
  3      4   23      6   43      3   63      4   83      6  103      6  123      3  143      6
  4      4   24      7   44      6   64      5   84      8  104      8  124      2  144      4
  5      6   25      3   45     10   65      6   85      4  105      3  125      2  145      5
  6      2   26      6   46      4   66      6   86     10  106      6  126      5  146      5
  7      3   27      5   47      3   67      6   87      5  107      2  127     10
  8      5   28      8   48      3   68      7   88     10  108      5  128      4
  9      5   29      0   49      6   69      7   89      8  109      1  129      1
 10      6   30      8   50      8   70      5   90      5  110      5  130      4
 11      1   31      5   51      8   71     10   91      4  111      5  131     10
 12     10   32      9   52      8   72      3   92    0.5  112      4  132      8
 13      4   33      7   53      2   73    5.5   93      3  113      4  133     10
 14      4   34      5   54      4   74      7   94      3  114      9  134      6
 15      6   35      5   55      8   75     10   95      5  115      4  135      6
 16      3   36      7   56      3   76      6   96      6  116      3  136      6
 17     10   37      4   57      5   77      6   97      4  117      3  137      7
 18      8   38      0   58      5   78      5   98      4  118      4  138      3
 19      5   39      4   59      8   79      4   99      2  119      4  139     10
 20     10   40      3   60      4   80      5  100      5  120      8  140      4

Variability of Estimates, Activity: Sampling distribution

What we just constructed is called a sampling distribution.

What are the shape and center of this distribution? Based on this distribution, what do you think is the true population average?

For the population data, µ = 5.39 and σ = 2.37.
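The classroom activity can be repeated many times in code to see the sampling distribution take shape. The sketch below is in Python (standard library only); the 146 values are transcribed from the population table in respondent order, and the seed and the 5,000 repetitions are arbitrary choices made for illustration.

```python
import random
from statistics import mean, stdev

# Number of drinks for respondents 1..146, in order.
population = [
    7, 5, 4, 4, 6, 2, 3, 5, 5, 6, 1, 10, 4, 4, 6, 3, 10, 8, 5, 10,
    6, 2, 6, 7, 3, 6, 5, 8, 0, 8, 5, 9, 7, 5, 5, 7, 4, 0, 4, 3,
    6, 10, 3, 6, 10, 4, 3, 3, 6, 8, 8, 8, 2, 4, 8, 3, 5, 5, 8, 4,
    10, 7, 4, 5, 6, 6, 6, 7, 7, 5, 10, 3, 5.5, 7, 10, 6, 6, 5, 4, 5,
    6, 5, 6, 8, 4, 10, 5, 10, 8, 5, 4, 0.5, 3, 3, 5, 6, 4, 4, 2, 5,
    4, 7, 6, 8, 3, 6, 2, 5, 1, 5, 5, 4, 4, 9, 4, 3, 3, 4, 4, 8,
    6, 5, 3, 2, 2, 5, 10, 4, 1, 4, 10, 8, 10, 6, 6, 6, 7, 3, 10, 4,
    4, 6, 6, 4, 5, 5,
]

rng = random.Random(102)  # arbitrary seed so the run is repeatable

# Each entry is the mean of one sample of 10 respondents drawn without repeats.
sample_means = [mean(rng.sample(population, k=10)) for _ in range(5000)]

print(f"population:   mean = {mean(population):.2f}, SD = {stdev(population):.2f}")
print(f"sample means: mean = {mean(sample_means):.2f}, SD = {stdev(sample_means):.2f}")
```

The sample means should center near the population mean while varying noticeably less than the individual responses.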

Variability of Estimates, Sampling distributions - via simulation: Average number of Duke games attended

Next let's look at the population data for the number of Duke basketball games attended:

[Figure: histogram of the population; horizontal axis: number of Duke games attended (0 to 70); vertical axis: frequency (0 to 150).]

Variability of Estimates, Sampling distributions - via simulation: Average number of Duke games attended (cont.)

Sampling distribution, n = 10:

[Figure: histogram of sample means from samples of n = 10; horizontal axis from 0 to 20; vertical axis: frequency (0 to 2000).]

What does each observation in this distribution represent?
Sample mean, $\bar{x}$, of samples of size n = 10.

Is the variability of the sampling distribution smaller or larger than the variability of the population distribution? Why?
Smaller; sample means will vary less than individual observations.

Variability of Estimates, Sampling distributions - via simulation: Average number of Duke games attended (cont.)

Sampling distribution, n = 30:

[Figure: histogram of sample means from samples of n = 30; horizontal axis from 2 to 10; vertical axis: frequency (0 to 800).]

How did the shape, center, and spread of the sampling distribution change going from n = 10 to n = 30?
Shape is more symmetric, center is about the same, spread is smaller.

Variability of Estimates, Sampling distributions - via simulation: Average number of Duke games attended (cont.)

Sampling distribution, n = 70:

[Figure: histogram of sample means from samples of n = 70; horizontal axis from 3 to 9; vertical axis: frequency (0 to 1200).]

Variability of Estimates, Sampling distributions - via CLT: Central Limit Theorem

Central limit theorem: The distribution of the sample mean is well approximated by a normal model:

$$\bar{x} \sim N\left(\text{mean} = \mu, \; SE = \frac{\sigma}{\sqrt{n}}\right)$$

If σ is unknown, use s.

- So it wasn't a coincidence that the sampling distributions we saw earlier were symmetric.
- We won't go into proving why $SE = \sigma / \sqrt{n}$, but note that as n increases SE decreases.
- As the sample size increases we would expect samples to yield more consistent sample means, hence the variability among the sample means would be lower.
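The relationship $SE = \sigma / \sqrt{n}$ can be checked by simulation without proving it. The sketch below (Python, standard library only) is an illustration under stated assumptions: the right-skewed exponential population with $\sigma = 1$, the seed, and the 4,000 repetitions are all arbitrary choices, not taken from the lecture. The sample sizes 10, 30, and 70 match the ones used in the simulation slides.

```python
import random
from math import sqrt
from statistics import pstdev

rng = random.Random(8)   # arbitrary seed so the run is repeatable
sigma = 1.0              # SD of an exponential population with rate 1
reps = 4000              # number of simulated samples per sample size

results = {}
for n in (10, 30, 70):
    # Mean of each simulated sample of size n.
    means = [sum(rng.expovariate(1.0) for _ in range(n)) / n for _ in range(reps)]
    observed_se = pstdev(means)        # spread of the simulated sample means
    theoretical_se = sigma / sqrt(n)   # value predicted by the CLT formula
    results[n] = (observed_se, theoretical_se)
    print(f"n = {n:2d}: simulated SE = {observed_se:.3f}, sigma/sqrt(n) = {theoretical_se:.3f}")
```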

Variability of Estimates, Sampling distributions - via CLT: CLT - Conditions

Certain conditions must be met for the CLT to apply:

1. Independence: Sampled observations must be independent. This is difficult to verify, but is more likely if
   - random sampling/assignment is used, and
   - n < 10% of the population.
2. Sample size/skew: Either the population distribution must be nearly normal, or n > 30 and the population distribution is not extremely skewed. This is also difficult to verify for the population, but we can check it using the sample data, and assume that the sample mirrors the population.
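The parts of these conditions that reduce to arithmetic can be written as a small checklist. This is only a sketch in Python: the function name `clt_conditions` and its arguments are hypothetical, and whether the data are "nearly normal" or "extremely skewed" remains a judgment call that the caller supplies (for example, after looking at a histogram of the sample). Random sampling itself cannot be checked from the numbers.

```python
def clt_conditions(n: int, population_size: int,
                   nearly_normal: bool, extremely_skewed: bool) -> dict:
    """Report the numeric CLT conditions from the slide as True/False."""
    return {
        # Independence is more plausible when the sample is under 10% of the population.
        "n < 10% of population": n < 0.10 * population_size,
        # Nearly normal population, or n > 30 without extreme skew.
        "sample size/skew": nearly_normal or (n > 30 and not extremely_skewed),
    }


# Example: samples of 10 from the 146 respondents in the drinks activity,
# assuming (as a judgment call) that the population is not nearly normal.
checks = clt_conditions(n=10, population_size=146,
                        nearly_normal=False, extremely_skewed=False)
print(checks)
```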

Variability of Estimates, Sampling distributions - via CLT: CLT - sample size/skew condition - simulations (1) to (3)

[Figures: three screenshots of the sampling distribution simulation.]

http://onlinestatbook.com/stat_sim/sampling_dist/index.html