Module 3: Sampling Distributions and the CLT
Statistics (OA3102)
Professor Ron Fricker
Naval Postgraduate School, Monterey, California
Reading assignment: WM&S chapters 7.1-7.3 and 7.5
Revision: 1-12
Goals for this Module
- Statistics and their distributions
- Deriving a sampling distribution
  - Analytically
  - Using simulation
- Sampling distributions
  - Distribution of the sample mean
  - Distributions related to the normal
- Central Limit Theorem
  - Normal approximation to the binomial
Definition: Statistic
A statistic is a function of observable random variables in a sample and known constants.
Statistics and Their Distributions (1)
- Remember, we denote random variables with upper-case Roman letters
  - E.g., Y1, Y2, Y3, ...
  - These represent placeholders for the actual values once we observe them
- We use lower-case Roman letters to denote the observed values: y1, y2, y3, ...
- Thus:
  - Y1, Y2, Y3, ... are random quantities and thus are described by probability distributions
  - y1, y2, y3, ... are just numbers
Statistics and Their Distributions (2)
- Since Y1, Y2, Y3, ... are random variables, so is any function of them
  - E.g., the sample mean
      Ȳ = (1/n) Σ_{i=1}^{n} Yi
    is a random variable: it's the mean of n random variables before we observe their values
- Thus, statistics of random variables are random variables themselves
  - So they have their own probability distribution
  - It's called the sampling distribution
Definition: Sampling Distribution
A sampling distribution is the probability distribution of a statistic.
Illustrating Random Statistics
- Consider drawing samples from a Weibull distribution with a = 2 and b = 5 (so that μ = E(X) = 4.43, the median is 4.16, and σ = 2.32)
- Six samples of size n = 10 are drawn from this Weibull distribution
- Note that the sample means, medians, and standard deviations are all different: randomness!
* Figure and table from Probability and Statistics for Engineering and the Sciences, 7th ed., Duxbury Press, 2008.
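This randomness is easy to reproduce in code. A minimal sketch (Python's standard library here rather than the deck's R; note `random.weibullvariate` takes the scale first and the shape second) drawing six samples of size n = 10 from the same Weibull(a = 2, b = 5) population:

```python
import random
import statistics

random.seed(1)  # any seed works; the point is only that the statistics vary

# Six samples of size n = 10 from a Weibull with shape 2 and scale 5
means = []
for _ in range(6):
    sample = [random.weibullvariate(5, 2) for _ in range(10)]  # (scale, shape)
    means.append(round(statistics.mean(sample), 2))

print(means)  # six different sample means, scattered around E(X) = 4.43
```

The sample medians and standard deviations vary across the six samples in exactly the same way.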
Demonstrating Randomness
This is a demonstration showing that statistics (i.e., functions of random variables) are random variables too.
Applets created by Prof. Gary McClelland, University of Colorado, Boulder. You can access them at www.thomsonedu.com/statistics/book_content/0495110817_wackerly/applets/seeingstats/index.html
Simple Random Sampling (1)
- The sampling distribution of a statistic depends on the:
  - Population distribution
  - Sample size
  - Method of sampling
- For this class, we will always assume simple random sampling (SRS)
  - Each X (or Y) in the sample comes from the same distribution and is independent of the other Xs
  - Shorthand: they're independent and identically distributed (iid)
Simple Random Sampling (2)
- In this class, we will be thinking of iid random variables from a probability distribution
  - It's an idealized model of the real world
  - Implies that the population is infinite in size
- In the real world, populations are often finite
  - If we sample with replacement, then SRS still holds
  - If we sample without replacement, but sample less than 5 percent of the population, SRS is a close-enough approximation
Example (Review)
A balanced (i.e., "fair") die is tossed three times. Let Y1, Y2, and Y3 be the outcomes, and denote the average of the three outcomes by Ȳ ("Y-bar").
Find the mean and standard deviation of Ȳ; that is, find μ_Ȳ and σ_Ȳ.
Example (Review), continued
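The answer can be checked by brute force: with only 6³ = 216 equally likely outcomes, we can enumerate the full sampling distribution of Ȳ (a Python sketch; the exact answers are μ_Ȳ = 3.5 and σ²_Ȳ = (35/12)/3 = 35/36):

```python
import itertools

# All 216 equally likely outcomes of three fair-die rolls
ybars = [sum(rolls) / 3 for rolls in itertools.product(range(1, 7), repeat=3)]

mu = sum(ybars) / len(ybars)                          # E(Y-bar)
var = sum((y - mu) ** 2 for y in ybars) / len(ybars)  # Var(Y-bar)

print(mu, var)  # 3.5 and 35/36 = 0.9722...
```

The variance of a single roll is 35/12, and dividing by n = 3 gives 35/36, so σ_Ȳ = √(35/36) ≈ 0.986.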
Analytically Deriving a Sampling Distribution
Consider the following problem:
- The NEX automobile service center charges $40, $45, or $50 for a tune-up on 4-, 6-, and 8-cylinder cars, respectively
- The pmf of revenue for a random car, X, is
      x      40    45    50
      p(x)   0.2   0.3   0.5
  (probabilities recovered from the stated moments), so μ = 46.5 and σ² = 15.25
- What's the distribution of the average revenue from two tune-ups, (X1 + X2)/2, assuming they are independent?
Analytically Deriving a Sampling Distribution, cont'd
Tabulating all outcomes, their associated probabilities, and the resulting statistics gives the sampling distribution of X̄ = (X1 + X2)/2. Thus, we calculate:
μ_X̄ = 46.5 = μ and σ²_X̄ = 7.625 = σ²/2
* Table from Probability and Statistics for Engineering and the Sciences, 7th ed., Duxbury Press, 2008.
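The tabulation can be reproduced directly (a Python sketch; the pmf values p(40) = 0.2, p(45) = 0.3, p(50) = 0.5 are the ones consistent with the stated μ = 46.5 and σ² = 15.25):

```python
from itertools import product

pmf = {40: 0.2, 45: 0.3, 50: 0.5}  # consistent with mu = 46.5, sigma^2 = 15.25

# Sampling distribution of (X1 + X2)/2 over all 9 ordered pairs
dist = {}
for (x1, p1), (x2, p2) in product(pmf.items(), repeat=2):
    xbar = (x1 + x2) / 2
    dist[xbar] = dist.get(xbar, 0) + p1 * p2

mean = sum(x * p for x, p in dist.items())               # = mu = 46.5
var = sum((x - mean) ** 2 * p for x, p in dist.items())  # = sigma^2/2 = 7.625

print(sorted(dist.items()), mean, var)
```

Notice the mean of the sampling distribution equals μ while its variance is σ²/2, previewing the general result μ_X̄ = μ and σ²_X̄ = σ²/n.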
Picturing the Sampling Distribution
The two distributions — the distribution of X and the sampling distribution of (X1 + X2)/2 — look like this. Note that:
- The means of the two distributions look to be the same
- The variability of the sampling distribution looks smaller
- This is not an accident
* Figures from Probability and Statistics for Engineering and the Sciences, 7th ed., Duxbury Press, 2008.
Another Sampling Distribution
Consider the same service center, but now calculate the sampling distribution of the average revenue from four (independent) tune-ups:
X̄ = (1/4) Σ_{i=1}^{4} Xi
The sampling distribution looks like this:
* Figure from Probability and Statistics for Engineering and the Sciences, 7th ed., Duxbury Press, 2008.
Back to the Die Example
- We could do the same thing to derive the sampling distribution for the mean of three rolls of the die
- E.g., we know:
  - The outcomes range from Ȳ = 1 (roll three ones) to Ȳ = 6 (roll three sixes)
  - There are 6³ = 216 possible outcomes of the three rolls, but not all translate into unique values of Ȳ
  - The specific values the sampling distribution can take on are 3/3, 4/3, 5/3, 6/3, 7/3, ..., 17/3, 18/3
Example: Analytically Calculating the Sampling Distribution
Calculate Pr(Ȳ = 1): the only way to get Ȳ = 1 is to roll three ones, so
Pr(Ȳ = 1) = (1/6)³ = 1/216
Now calculate Pr(Ȳ = 4/3): the rolls must sum to 4, i.e., the outcome (1, 1, 2) in any of its 3 orderings, so
Pr(Ȳ = 4/3) = 3(1/6)³ = 3/216
Example: Analytically Calculating the Sampling Distribution
And now calculate Pr(Ȳ = 5/3): the rolls must sum to 5, i.e., (1, 1, 3) in any of its 3 orderings or (1, 2, 2) in any of its 3 orderings, so
Pr(Ȳ = 5/3) = 6(1/6)³ = 6/216
Etc.
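These tail probabilities can be verified by enumeration (a Python sketch using exact fractions, so there is no rounding):

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Count how many of the 216 equally likely outcomes give each value of the sum
counts = Counter(sum(rolls) for rolls in product(range(1, 7), repeat=3))

# P(Ybar = s/3) = (# outcomes summing to s) / 216
p = {s: Fraction(c, 216) for s, c in counts.items()}
print(p[3], p[4], p[5])  # P(Ybar = 1), P(Ybar = 4/3), P(Ybar = 5/3)
```

Continuing the loop over all sums from 3 to 18 produces the complete sampling distribution of Ȳ.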
Using Simulation to Approximate the Sampling Distribution
- These calculations are tedious
- Use R to simulate for approximate results
Now, Fancier
- The previous plot shows frequencies using a histogram of the approximate sampling distribution
- Let's do some more calculations and clean things up
- Check against the exact answer
So, Here's a Nicer Plot
Simulation Experiments
- As we've seen, we can use simulation to empirically estimate sampling distributions
  - Can be useful when analytic results are hard or impossible
- Need to specify:
  - Statistic of interest
  - Population distribution
  - Sample size
  - Number of replications
Example
- Statistic: sample mean
- Population distribution: N(8.25, 0.75²)
- Sample size: (a) n=5, (b) n=10, (c) n=20, (d) n=30
- Number of replications: 500 each
* Figures from Probability and Statistics for Engineering and the Sciences, 7th ed., Duxbury Press, 2008.
Another Example
- Statistic: sample mean
- Population distribution: LN(3, 0.16)
- Sample size: (a) n=5, (b) n=10, (c) n=20, (d) n=30
- Number of replications: 500 each
* Figures from Probability and Statistics for Engineering and the Sciences, 7th ed., Duxbury Press, 2008.
Sampling Distributions Related to the Normal
- Distribution of the sample mean (when the population is normally distributed)
- Chi-squared (χ²) distribution
  - Sums of squared normally distributed r.v.s
- t distribution
  - Ratio of a standard normal r.v. to a function of a chi-squared random variable
- F distribution
  - Ratio of (functions of) chi-squared r.v.s
Why Should We Care?
- Eventually we will be doing hypothesis tests and constructing confidence intervals
- Important statistics that we will want to test have these sampling distributions
- So, it may seem pretty esoteric here, but all of these distributions will play important roles in practical, real-world problems
Remember Linear Combinations of Random Variables (see Theorem 5.12)
Given a collection of n random variables Y1, Y2, ..., Yn and n numerical constants a1, a2, ..., an, the random variable
X = a1Y1 + a2Y2 + ... + anYn
is called a linear combination of the Yi's.
- Note that we get the total, X = T0, if a1 = a2 = ... = an = 1
- We get the sample mean, X = Ȳ, if a1 = a2 = ... = an = 1/n
- But also note the Yi's are not necessarily iid
Some Useful Facts (1)
Let Y1, Y2, ..., Yn have mean values μ1, μ2, ..., μn, respectively, and variances σ1², σ2², ..., σn², respectively.
1. Whether or not the Yi's are independent,
E(a1Y1 + a2Y2 + ... + anYn) = a1E(Y1) + a2E(Y2) + ... + anE(Yn)
                            = a1μ1 + a2μ2 + ... + anμn = Σ_{i=1}^{n} aiμi
Some Useful Facts (2)
2. If Y1, Y2, ..., Yn are independent,
Var(a1Y1 + a2Y2 + ... + anYn) = a1²Var(Y1) + a2²Var(Y2) + ... + an²Var(Yn)
                              = a1²σ1² + a2²σ2² + ... + an²σn²
So, σ²(a1Y1 + a2Y2 + ... + anYn) = Σ_{i=1}^{n} ai²σi².
3. For any Y1, Y2, ..., Yn,
Var(a1Y1 + a2Y2 + ... + anYn) = Σ_{i=1}^{n} Σ_{j=1}^{n} ai aj Cov(Yi, Yj)
Sampling Distribution of the Sample Mean (Population Normally Dist'd)
Theorem 7.1: Let Y1, Y2, ..., Yn be a random sample of size n from a normal distribution with mean μ_Y and standard deviation σ_Y. Then
Ȳ = (1/n) Σ_{i=1}^{n} Yi ~ N(μ_Y, σ²_Y/n)
In particular, note that:
- The sample mean of normally distributed random variables is normally distributed
- μ_Ȳ = μ_Y and σ²_Ȳ = σ²_Y/n
- This is true for any sample size n
Proof (sketch, via moment-generating functions)
By independence,
M_Ȳ(t) = E[e^{tȲ}] = Π_{i=1}^{n} M_{Yi}(t/n)
Using the normal MGF, M_{Yi}(s) = exp(μ_Y s + σ²_Y s²/2), this becomes
M_Ȳ(t) = [exp(μ_Y t/n + σ²_Y t²/(2n²))]^n = exp(μ_Y t + (σ²_Y/n) t²/2)
which is the MGF of a N(μ_Y, σ²_Y/n) random variable. By uniqueness of MGFs, Ȳ ~ N(μ_Y, σ²_Y/n).
Example 7.2
The amount dispensed (in ounces) by a beer bottling machine is normally distributed with σ² = 1.0. For a sample of size n = 9, find the probability that the sample mean is within 0.3 ounces of the true mean μ.
Solution:
Example 7.2 (continued)
Since Ȳ ~ N(μ, σ²/n) with σ²/n = 1/9, we have σ_Ȳ = 1/3, and
Pr(|Ȳ − μ| ≤ 0.3) = Pr(−0.9 ≤ Z ≤ 0.9) = 2Φ(0.9) − 1 ≈ 0.6318
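A quick numerical check (a Python sketch; Φ is computed from the standard library's error function, so no normal table is needed):

```python
from math import erf, sqrt

def Phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

sigma_ybar = 1.0 / sqrt(9)  # sigma / sqrt(n) = 1/3
# P(|Ybar - mu| <= 0.3) = 2*Phi(0.9) - 1
p = Phi(0.3 / sigma_ybar) - Phi(-0.3 / sigma_ybar)
print(round(p, 4))
```

In R the same computation is pnorm(0.9) - pnorm(-0.9).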
Table 4, Appendix 3
Example 7.3
In Example 7.2, how big of a sample size do we need if we want the sample mean to be within 0.3 ounces of μ with probability 0.95?
Solution:
Example 7.3 (continued)
We need Pr(|Ȳ − μ| ≤ 0.3) = 2Φ(0.3√n) − 1 = 0.95, i.e., 0.3√n = z_{0.025} = 1.96.
So √n = 1.96/0.3 ≈ 6.53 and n = (1.96/0.3)² ≈ 42.68; rounding up, n = 43.
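The sample-size arithmetic, with a sanity check that n = 43 actually achieves the target probability (a Python sketch):

```python
from math import ceil, erf, sqrt

def Phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

z = 1.96                  # z_{0.025}
n = ceil((z / 0.3) ** 2)  # smallest n with 0.3*sqrt(n) >= 1.96
print(n)                  # 43

# verify that n works but n - 1 does not
p = 2 * Phi(0.3 * sqrt(n)) - 1
```

The check confirms the familiar pattern: solve for the exact (fractional) n, then round up.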
Sampling Distribution of the Sum of Squared Standard Normal R.V.s
Theorem 7.2: Let Y1, Y2, ..., Yn be defined as in Theorem 7.1. Then the
Zi = (Yi − μ_Y)/σ_Y
are iid standard normal r.v.s and
Σ_{i=1}^{n} Zi² = Σ_{i=1}^{n} (Yi − μ_Y)²/σ²_Y ~ χ²_n
where χ²_n denotes a chi-square distribution with n degrees of freedom.
The proof is based on a theorem from Chapter 6, so we'll skip it.
The Chi-squared Distribution
- The chi-squared distribution has one parameter, ν, which can take on the values 1, 2, 3, ...
- The distribution is very skewed for lower values of ν
- f(x; ν) is positive only for values of x > 0
- Graphs of three χ² density functions:
* Figure from Probability and Statistics for Engineering and the Sciences, 7th ed., Duxbury Press, 2008.
Looking Up Chi-squared Quantiles
- Can look up in WM&S Table 6
- Note that, because the distribution is not symmetric, must look up each tail separately
- The table gives the probability in the right tail
* Figures from Probability and Statistics for Engineering and the Sciences, 7th ed., Duxbury Press, 2008.
WM&S Table 6
Example 7.4
Let Z1, Z2, ..., Z6 be a random sample from the standard normal distribution. Find the number b such that
Pr(Σ_{i=1}^{6} Zi² ≤ b) = 0.95
Solution: By Theorem 7.2, Σ_{i=1}^{6} Zi² ~ χ²_6, so b is the 0.95 quantile of a chi-square distribution with 6 df. From Table 6 (right-tail area 0.05), b ≈ 12.59.
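The same lookup in code (a sketch assuming scipy is available; the deck's R equivalent would be qchisq(0.95, 6)):

```python
from scipy.stats import chi2

# b is the 0.95 quantile of a chi-square distribution with 6 degrees of freedom
b = chi2.ppf(0.95, df=6)
print(round(b, 4))  # about 12.59, matching Table 6
```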
Sampling Distribution: Ratio of Sample Variance to Population Variance
Theorem 7.3: Let Y1, Y2, ..., Yn be an iid sample from a normal distribution with mean μ_Y and standard deviation σ_Y. Then
(n − 1)S²/σ²_Y = (1/σ²_Y) Σ_{i=1}^{n} (Yi − Ȳ)² ~ χ²_{n−1}
where χ²_{n−1} denotes a chi-square distribution with n − 1 degrees of freedom.
Also, Ȳ and S² are independent random variables.
Proof (for n = 2, sketch)
For n = 2, Ȳ = (Y1 + Y2)/2 and
S² = Σ_{i=1}^{2} (Yi − Ȳ)²/(2 − 1) = (Y1 − Y2)²/2
Then
(n − 1)S²/σ² = [(Y1 − Y2)/(σ√2)]²
and since (Y1 − Y2)/(σ√2) ~ N(0, 1), Theorem 7.2 gives (n − 1)S²/σ² ~ χ²_1.
For independence: Y1 + Y2 and Y1 − Y2 are jointly normal with covariance Var(Y1) − Var(Y2) = 0, hence independent; Ȳ is a function of the first and S² is a function of the second, so Ȳ and S² are independent.
Example 7.5
In Example 7.2, the amount dispensed (in ounces) is normally distributed with σ² = 1.0. For a sample of size n = 10, find b1 and b2 such that
Pr(b1 ≤ S² ≤ b2) = 0.90
Solution:
Example 7.5 (continued)
By Theorem 7.3, (n − 1)S²/σ² = 9S² ~ χ²_9. Splitting the 0.10 equally between the tails, Table 6 gives χ²_{0.95,9} = 3.325 and χ²_{0.05,9} = 16.919, so
b1 = 3.325/9 ≈ 0.369 and b2 = 16.919/9 ≈ 1.880
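The interval endpoints in code (a sketch assuming scipy is available; in R: qchisq(c(0.05, 0.95), 9)/9):

```python
from scipy.stats import chi2

n, sigma2 = 10, 1.0
# (n-1)*S^2/sigma^2 ~ chi-square with 9 df; split the 0.10 between the tails
b1 = sigma2 * chi2.ppf(0.05, df=n - 1) / (n - 1)
b2 = sigma2 * chi2.ppf(0.95, df=n - 1) / (n - 1)
print(round(b1, 3), round(b2, 3))  # about 0.369 and 1.880
```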
Sampling Distribution: Sample Mean (Popul'n Normally Dist'd, σ Unknown)
Definition 7.2: Let Z be a standard normal r.v. and let W be a chi-square distributed r.v. with ν df. Then if Z and W are independent,
T = Z/√(W/ν) ~ t_ν
where t_ν is the t distribution with ν dfs.
In particular, note that with Z = (Ȳ − μ)/(σ/√n) and W = (n − 1)S²/σ² ~ χ²_{n−1},
T = (Ȳ − μ)/(S/√n) ~ t_{n−1}
Illustrating the t Distribution
[Plot of density functions f(x) for −4 ≤ x ≤ 4: the standard normal and the t distributions with 3, 10, and 100 df. The t densities are lower at the peak and heavier in the tails, approaching the normal as the df grow.]
WM&S Table (Inside Front Cover)
Example 7.6
- The tensile strength of a type of wire is normally distributed with unknown mean μ and variance σ²
- Six pieces are randomly selected from a large roll
  - Tensile strength will be measured (Y1, ..., Y6)
- We usually use Ȳ to estimate μ and S² to estimate σ², so it's reasonable to estimate σ with S
- So, find the probability that Ȳ will be within 2S/√n of the true population mean μ
Example 7.6 Solution
We want Pr(|Ȳ − μ| ≤ 2S/√n). Dividing through by S/√n,
Pr(|Ȳ − μ| ≤ 2S/√n) = Pr(−2 ≤ (Ȳ − μ)/(S/√n) ≤ 2) = Pr(−2 ≤ T ≤ 2)
where T has a t distribution with n − 1 = 5 df. From the t table, t_{0.05,5} = 2.015 ≈ 2, so the probability is approximately 0.90.
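The exact value of that probability (a sketch assuming scipy is available; in R: pt(2, 5) - pt(-2, 5)):

```python
from scipy.stats import t

# P(|T| <= 2) for T ~ t with n - 1 = 5 degrees of freedom
p = t.cdf(2, df=5) - t.cdf(-2, df=5)
print(round(p, 3))  # roughly 0.90
```

The table-based answer of 0.90 is a very close approximation to the exact 0.898.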
Sampling Distribution: Ratio of Chi-Squared RVs (and Their DFs)
Definition 7.3: Let W1 and W2 be independent chi-square distributed r.v.s with ν1 and ν2 dfs, respectively. Then
F = (W1/ν1)/(W2/ν2) ~ F_{ν1,ν2}
where F_{ν1,ν2} is the F distribution with ν1 and ν2 dfs.
In particular, note that since (n1 − 1)S1²/σ1² ~ χ²_{n1−1} and (n2 − 1)S2²/σ2² ~ χ²_{n2−1},
(S1²/σ1²)/(S2²/σ2²) ~ F_{n1−1, n2−1}
The F Distribution
- The F distribution is specified by its two degrees of freedom, ν1 and ν2
- We will often be interested in right-tail probabilities
  - Notation: F_{α,ν1,ν2}
  - That's how WM&S Table 7 is set up (next slide)
- For left-tail probabilities, must use F_{1−α,ν1,ν2} = 1/F_{α,ν2,ν1}
* Figure from Probability and Statistics for Engineering and the Sciences, 7th ed., Duxbury Press, 2008.
WM&S Table 7
Example 7.7
If we take independent samples of size n1 = 6 and n2 = 10 from two normal populations with equal population variances, find the number b such that
Pr(S1²/S2² ≤ b) = 0.95
Solution: With equal variances, S1²/S2² ~ F with (n1 − 1, n2 − 1) = (5, 9) df, so b = F_{0.05,5,9} ≈ 3.48 from Table 7.
Example 7.7: Table 7 Excerpt
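The table lookup can be verified in code (a sketch assuming scipy is available; in R: qf(0.95, 5, 9)):

```python
from scipy.stats import f

# S1^2/S2^2 ~ F with (n1 - 1, n2 - 1) = (5, 9) df when the variances are equal
b = f.ppf(0.95, dfn=5, dfd=9)
print(round(b, 2))  # about 3.48, matching Table 7
```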
Finding Probabilities and Quantiles Using R
- R functions: pnorm/qnorm, pchisq/qchisq, pt/qt, pf/qf
- Note: the functions are based on cumulative probabilities (i.e., the left tails), not the right tails
- To do calculations like in the tables, either:
  - Use the lower.tail=FALSE option (so p = α), or
  - Use the function as is, but remember p = 1 − α
Back to the Examples
- Example 7.2:
- Example 7.3:
- Example 7.4:
- Example 7.6:
- Example 7.7:
The Central Limit Theorem (CLT)
- The Central Limit Theorem says that, for sufficiently large n [1], sums of iid r.v.s are approximately normally distributed
  - As n gets bigger, the approximation gets better
- More precisely, as n → ∞, the distribution of
  Z = (Ȳ − μ)/(σ/√n)
  converges to a standard normal distribution, where E(Y) = μ and Var(Y) = σ²
[1] A generally conservative rule of thumb is n > 30
CLT (continued)
- So, let Y1, Y2, ..., Yn be a random sample from any distribution with mean μ_Y and standard deviation σ_Y
- Then if n is sufficiently large, Ȳ has an approximate normal distribution with
  μ_Ȳ = μ_Y and σ²_Ȳ = σ²_Y/n
- Similarly, if n is sufficiently large, then
  Z = (Ȳ − μ_Y)/(σ_Y/√n)
  has an approximate standard normal distribution, with mean 0 and variance 1
Example: Sums of Dice Rolls
[Histograms of the outcome of a single die roll and of the sums of 2, 5, and 10 dice: the single roll is uniform, and the distribution of the sum looks increasingly bell-shaped as the number of dice grows.]
Demonstrating the CLT
This is a simulation demonstrating the Central Limit Theorem.
Applets created by Prof. Gary McClelland, University of Colorado, Boulder. You can access them at www.thomsonedu.com/statistics/book_content/0495110817_wackerly/applets/seeingstats/index.html
Illustrating the CLT in R
> m <- matrix(data=runif(10000*100), nrow=10000, ncol=100)
> avg1col <- m[,1]
> avg2col <- apply(m[,1:2], 1, mean)
> avg3col <- apply(m[,1:3], 1, mean)
> avg4col <- apply(m[,1:4], 1, mean)
> avg5col <- apply(m[,1:5], 1, mean)
> avg10col <- apply(m[,1:10], 1, mean)
> avg20col <- apply(m[,1:20], 1, mean)
> avg50col <- apply(m[,1:50], 1, mean)
> avg100col <- apply(m[,1:100], 1, mean)
> par(mfrow=c(3,3))
> hist(avg1col, prob=TRUE, xlim=c(0,1))
> curve(dnorm(x, .5, sqrt(1/12)), lwd=2, col="red", add=TRUE)
> hist(avg2col, prob=TRUE, xlim=c(0,1))
> curve(dnorm(x, .5, sqrt(1/(12*2))), lwd=2, col="red", add=TRUE)
> hist(avg3col, prob=TRUE, xlim=c(0,1))
> curve(dnorm(x, .5, sqrt(1/(12*3))), lwd=2, col="red", add=TRUE)
> hist(avg4col, prob=TRUE, xlim=c(0,1))
> curve(dnorm(x, .5, sqrt(1/(12*4))), lwd=2, col="red", add=TRUE)
> hist(avg5col, prob=TRUE, xlim=c(0,1))
> curve(dnorm(x, .5, sqrt(1/(12*5))), lwd=2, col="red", add=TRUE)
> hist(avg10col, prob=TRUE, xlim=c(0,1))
> curve(dnorm(x, .5, sqrt(1/(12*10))), lwd=2, col="red", add=TRUE)
> hist(avg20col, prob=TRUE, xlim=c(0,1))
> curve(dnorm(x, .5, sqrt(1/(12*20))), lwd=2, col="red", add=TRUE)
> hist(avg50col, prob=TRUE, xlim=c(0,1))
> curve(dnorm(x, .5, sqrt(1/(12*50))), lwd=2, col="red", add=TRUE)
> hist(avg100col, prob=TRUE, xlim=c(0,1))
> curve(dnorm(x, .5, sqrt(1/(12*100))), lwd=2, col="red", add=TRUE)
The CLT More Formally
Theorem 7.4: Let Y1, Y2, ..., Yn be iid r.v.s with mean E(Yi) = μ and variance Var(Yi) = σ². Define
U_n = (Ȳ − μ)/(σ/√n) = (Σ_{i=1}^{n} Yi − nμ)/(σ√n)
Then as n → ∞ the distribution function of U_n converges to the standard normal distribution:
lim_{n→∞} Pr(U_n ≤ u) = ∫_{−∞}^{u} (1/√(2π)) e^{−t²/2} dt for all u
Example 7.8
For the whole population, achievement scores on a certain test have mean μ_Y = 60 and σ_Y = 8. For a random sample of n = 100 scores from students at one school, the average score is 58. Is there evidence to suggest this school is inferior? That is, what's the probability of seeing an average score of 58 or lower if the true school average matches the population?
Example 7.8
Solution: By the CLT, Ȳ is approximately N(60, 8²/100), so σ_Ȳ = 0.8 and
Pr(Ȳ ≤ 58) ≈ Pr(Z ≤ (58 − 60)/0.8) = Pr(Z ≤ −2.5) ≈ 0.0062
That's a small probability, so the data do suggest this school's average is below the population's.
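The CLT calculation in code (a Python sketch; Φ is built from the standard library's error function):

```python
from math import erf, sqrt

def Phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

mu, sigma, n = 60, 8, 100
z = (58 - mu) / (sigma / sqrt(n))  # = -2.5
p = Phi(z)                         # approximate P(Ybar <= 58) under the CLT
print(round(p, 4))  # about 0.0062
```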
Example 7.9
The service times (in minutes) for customers coming through a Navy Exchange checkout counter are iid with μ_Y = 1.5 and σ_Y = 1.0. Approximate the probability that n = 100 customers can be served in less than 2 hours.
Solution:
Example 7.9 (continued)
The total service time is T = Σ_{i=1}^{100} Yi with E(T) = 100(1.5) = 150 minutes and σ_T = 1.0·√100 = 10 minutes. Two hours is 120 minutes, so by the CLT
Pr(T ≤ 120) ≈ Pr(Z ≤ (120 − 150)/10) = Pr(Z ≤ −3) ≈ 0.0013
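The same computation numerically (a Python sketch; the CLT is applied to the total rather than the mean, which only rescales both sides):

```python
from math import erf, sqrt

def Phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

n, mu, sigma = 100, 1.5, 1.0
total_mean = n * mu         # 150 minutes
total_sd = sigma * sqrt(n)  # 10 minutes
p = Phi((120 - total_mean) / total_sd)  # approximate P(total < 120) = Phi(-3)
print(round(p, 4))
```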
Checking the Solutions in R
- Example 7.8
- Example 7.9
Normal Approximation to the Binomial
- A r.v. Y ~ Bin(n, p) is the number of successes out of n independent trials with probability of success p for each trial
- Define indicator variables X1, X2, ..., Xn as
  Xi = 1 if the ith trial is a success, and Xi = 0 if the ith trial is a failure
- So X1, X2, ..., Xn are iid Bernoulli r.v.s and we have Y = X1 + X2 + ... + Xn
- That is, Y is a sum of iid random variables, so for large enough n the CLT applies
Exercise 7.10
Candidate A believes she can win an election if she can get 55% of the votes in precinct 1. Assuming 50% of the precinct 1 voters favor her and n=100 random voters show up, what is the (approximate) probability she will receive at least 55% of their votes?
Solution:
Exercise 7.10 (continued)
The fraction of votes for her is Y/n, where Y ~ Bin(100, 0.5). By the CLT, Y/n is approximately normal with mean p = 0.5 and standard deviation √(pq/n) = √(0.25/100) = 0.05, so
Pr(Y/n ≥ 0.55) ≈ Pr(Z ≥ (0.55 − 0.50)/0.05) = Pr(Z ≥ 1) ≈ 0.1587
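The approximation can be compared against the exact binomial tail (a Python sketch; the exact sum uses `math.comb`, so no statistics library is needed):

```python
from math import comb, erf, sqrt

def Phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

n = 100

# CLT approximation: Y/n is approximately N(0.5, 0.0025), so sd = 0.05
approx = 1 - Phi((0.55 - 0.50) / 0.05)  # P(Z >= 1)

# Exact binomial tail P(Y >= 55) for Y ~ Bin(100, 0.5), for comparison
exact = sum(comb(n, k) for k in range(55, n + 1)) / 2 ** n

print(round(approx, 4), round(exact, 4))
```

The gap between the two is what the continuity correction (coming up shortly) is designed to shrink.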
When to Use the Approximation?
- Y and Y/n have an approximate normal distribution for large enough n, but "large enough" depends on p
- Rule of thumb: the approximation works well when p ± 3√(pq/n) lies in the interval (0, 1)
- An equivalent criterion is n > 9 · max(p, q)/min(p, q)
- See extra credit Exercise 7.70
Exercise 7.11
Suppose Y has a binomial distribution with n = 25 and p = 0.4. Find the exact probabilities that Y ≤ 8 and Y = 8, and compare these with the corresponding values from the normal approximation.
Exact solutions: Table 1 in Appendix 3 gives
Pr(Y ≤ 8) = 0.274 and
Pr(Y = 8) = Pr(Y ≤ 8) − Pr(Y ≤ 7) = 0.274 − 0.154 = 0.120
In R: pbinom(8, 25, 0.4) and dbinom(8, 25, 0.4)
Exercise 7.11 (continued)
Solution: Y is approximately normal with μ = np = 10 and σ = √(npq) = √6 ≈ 2.449. Without a continuity correction,
Pr(Y ≤ 8) ≈ Pr(Z ≤ (8 − 10)/2.449) = Pr(Z ≤ −0.82) ≈ 0.206
which is noticeably below the exact 0.274.
The Continuity Correction
- The issue is we are approximating a discrete distribution with a continuous one
- So, to improve the approximation, rather than use the discrete value itself, use the value of the continuous distribution halfway between the two adjacent discrete values
- In other words:
  - Add 0.5 to the value we're approximating for Pr(Y ≤ y) calculations
  - Subtract 0.5 from the value we're approximating for Pr(Y ≥ y) calculations
Exercise 7.11 (continued)
Solution with continuity correction:
Pr(Y ≤ 8) ≈ Pr(Z ≤ (8.5 − 10)/2.449) = Pr(Z ≤ −0.61) ≈ 0.270, much closer to the exact 0.274
Pr(Y = 8) ≈ Pr((7.5 − 10)/2.449 ≤ Z ≤ (8.5 − 10)/2.449) = Φ(−0.61) − Φ(−1.02) ≈ 0.270 − 0.154 = 0.116, close to the exact 0.120
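Putting the exact values and the corrected approximation side by side (a Python sketch using only the standard library):

```python
from math import comb, erf, sqrt

def Phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

n, p = 25, 0.4
mu, sd = n * p, sqrt(n * p * (1 - p))  # 10 and sqrt(6)

# Exact binomial values
exact_le8 = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(9))
exact_eq8 = comb(n, 8) * p**8 * (1 - p)**17

# Normal approximation with the continuity correction
approx_le8 = Phi((8.5 - mu) / sd)
approx_eq8 = Phi((8.5 - mu) / sd) - Phi((7.5 - mu) / sd)

print(round(exact_le8, 3), round(approx_le8, 3))  # 0.274 vs about 0.270
print(round(exact_eq8, 3), round(approx_eq8, 3))  # 0.120 vs about 0.116
```

Without the correction, the approximation to Pr(Y ≤ 8) drops to about 0.21, so the half-unit shift buys a substantial improvement here.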
What We Covered in this Module
- Statistics and their distributions
- Deriving a sampling distribution
  - Analytically
  - Using simulation
- Sampling distributions
  - Distribution of the sample mean
  - Distributions related to the normal
- Central Limit Theorem
  - Normal approximation to the binomial
Homework
WM&S chapter 7
- Required exercises: 1, 2, 9, 25, 31a-c, 48, 49, 72, 73
- Extra credit: 15a&b, 70
Useful hints:
- Problem 7.1: Get to the applet more directly at www.thomsonedu.com/statistics/book_content/0495110817_wackerly/applets/seeingstats/index.html. Click on 7. Distributions to the Normal > DiceSample
- Problem 7.25 part b: Use R, not the applet. The relevant R function is qt(p, df, lower.tail=FALSE)
- Problem 7.31: Solutions in the back of the book are wrong.