Data Analysis and Statistical Methods
Statistics 651

http://www.stat.tamu.edu/~suhasini/teaching.html

Suhasini Subba Rao

Review of previous lecture

The main idea in the previous lecture is that the sample average has a distribution and that, when the sample size is large, the sample average is close to being normally distributed.

We first note that if an estimator is, in general, close to the population mean, then the variance of the estimator is small. To understand this, look at the density plots (look again at "variance explanation lecture4.pdf" in Lecture 4, and "large variance small variance.pdf").

Remember that the population variance is defined as

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2.$$

When this is small, the outcomes are in general close to the mean $\mu$. When it is large, the outcomes are spread out.

Reasoning that the sample average is random and has a distribution

Suppose a population contains 10 individuals (a very small population). You can make a histogram of the heights of these individuals. Call this Histogram A. You can also evaluate the population mean and the population variance. The probability that a randomly selected person's height lies in a certain interval is determined by Histogram A.

Suppose you collect all samples of size two from this 10-individual population (that is, all subsets containing 2 people from these 10). There are 45 such subsets. For each subset (sample) you calculate the average (this is the sample average). There will be 45 averages. These 45 averages can also be treated as a population, and we can make a histogram of them. Call this Histogram B. The mean of this population is the same as the mean of the population of 10 people, but its variance is different.

Suppose you randomly select a subset of size two and take the average of these two individuals. The average will be one of the 45 values. The probability that the sample average lies in any given interval is determined by Histogram B.

Now collect all samples (subsets) of size 3 out of 10. There are 120 such subsets. For each subset you calculate the average (the sample average). These 120 averages can also be considered as a population. You can make a histogram of these averages. Call this Histogram C. The average of any random sample of size 3 must be one of these 120 numbers. The probability that the average lies in any given interval is determined by Histogram C.

Reasoning that the variance of the sample mean gets smaller as the sample size gets larger

Suppose you draw one individual from a population of people and measure their height. Their height could be close to the mean height, but it could also be extremely small or extremely large.

Suppose you draw three individuals from a population of people and measure all three heights. One height may be an extreme value (extremely far from the mean), but it is highly unlikely that all three heights will be extreme (the probability of this happening is very small). Therefore the average of the three heights smooths out the extreme behaviour and is likely to be closer to the mean height than an individual height.

Taking this argument to the extreme: suppose that I have a population of 1000 people and I take as my sample 999 people. This sample is random. The average of these 999 people will be very close to the average of the 1000 people (the height of the poor excluded individual hardly counts). Since the sample average is likely to be close to the population mean, the variance of the sample average will be small. Therefore the variance of these averages involving 999 individuals will be very, very small.

But note that the variance is not small because the sample is large relative to the size of the population. It is small because the sample is large, regardless of the size of the population.

I often get the comment: "If the population size is a billion and the sample size is 500, then the sample mean is bad. But if the population size is 100 and the sample size is 500, then the sample mean is good." This is not correct. The sampling is always done with replacement (you can draw the same individual again), so the proportion of the sample size to the population size does not matter. What matters is the sample size and the population variance. The variance of the sample mean, $\sigma^2/n$, depends only on these two factors, not on the population size (to statisticians the population is usually infinite!).
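The Histogram A, B and C construction described earlier can be checked by brute force. The sketch below is only an illustration: the ten heights are made-up values (the lecture gives none), and `sample_averages` is a hypothetical helper name. Because these subsets are drawn without replacement from a tiny population, the printed variances need not equal $\sigma^2/n$ exactly; the point is only that the means agree and the variances shrink.

```python
from itertools import combinations
from statistics import mean, pvariance

# Hypothetical heights (cm) for the 10-person population; not from the lecture.
heights = [150, 155, 160, 162, 165, 168, 170, 175, 180, 185]

def sample_averages(population, n):
    """Average of every subset of size n (each subset is one possible sample)."""
    return [sum(subset) / n for subset in combinations(population, n)]

averages_2 = sample_averages(heights, 2)   # the population behind Histogram B
averages_3 = sample_averages(heights, 3)   # the population behind Histogram C

# Number of possible samples of size 2 and size 3.
print(len(averages_2), len(averages_3))

# The three populations share the same mean ...
print(mean(heights), mean(averages_2), mean(averages_3))

# ... but the variance gets smaller as the sample size grows.
print(pvariance(heights), pvariance(averages_2), pvariance(averages_3))
```

The first line of output should show 45 and 120, matching the counts quoted for samples of size two and three.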
In summary, the average of three heights is likely to be closer to the mean than one height on its own. This means the variance of an average involving three individuals will be smaller than the variance of just one individual.

The variance of the sample mean is $\sigma^2/n$

Suppose the variance of one person is $\sigma^2$. The variance of the average of three will be $\sigma^2/3$. Notice that $\sigma^2$ is still there. This is because the variance of the population will always affect the variance of the average. But the sample size also has an effect.

In the above we have reasoned that the sample average does have a distribution and that the variance of this distribution decreases as the sample size grows. It can be shown that not only does the variance of the distribution decrease with the sample size, but the distribution also becomes more bell shaped. It becomes a normal distribution.

The larger the population variance $\sigma^2$, the larger the variance of the average, $\sigma^2/n$.

The larger the sample size $n$, the smaller the variance of the average, $\sigma^2/n$.

The average is approximately normal with mean $\mu$ and variance $\sigma^2/n$.
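One way to see the $\sigma^2/n$ rule in action is a small simulation. This is a rough sketch under stated assumptions: the population of heights is invented for the illustration, the draws are made with replacement, and `variance_of_sample_mean` is a hypothetical helper, not something from the course.

```python
import random
from statistics import pvariance

random.seed(651)  # fixed seed so that a rerun gives the same numbers

# Hypothetical population of heights (cm); the lecture gives no data here.
population = [150, 155, 160, 162, 165, 168, 170, 175, 180, 185]
sigma2 = pvariance(population)   # population variance sigma^2

def variance_of_sample_mean(n, repetitions=20_000):
    """Estimate Var(sample mean) by drawing many samples of size n with replacement."""
    averages = [sum(random.choices(population, k=n)) / n for _ in range(repetitions)]
    return pvariance(averages)

for n in (1, 3, 30):
    # Simulated variance of the average next to the theoretical sigma^2 / n.
    print(n, round(variance_of_sample_mean(n), 2), round(sigma2 / n, 2))
```

For each $n$ the two printed numbers should be close to each other, and both should shrink as $n$ grows.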
A game

We choose 5 people in the class. This is our population (it is fixed). Suppose their ages are 22, 24, 23, 25, 27. The mean of this population is 24.2 and the variance is 2.96. Since these 5 students form our population, neither the mean nor the variance is random. They are fixed and cannot be changed (unless we change the population).

Suppose we take a sample of size two. The sample can be any one of the following possibilities:

Sample   Values    Sample average
1        22, 22    22
2        24, 24    24
3        23, 23    23
4        25, 25    25
5        27, 27    27
6        22, 24    23
7        24, 22    23
8        22, 25    23.5
9        25, 22    23.5
10       22, 27    24.5
11       27, 22    24.5
12       24, 23    23.5
13       23, 24    23.5
14       24, 25    24.5
15       25, 24    24.5
16       24, 27    25.5
17       27, 24    25.5
18       23, 25    24
19       25, 23    24
20       25, 27    26
21       27, 25    26
22       23, 27    25
23       27, 23    25
24       22, 23    22.5
25       23, 22    22.5

Associated with each sample is the sample average. We see that these averages can also be considered as a population: the population of all averages of samples of size two. It also has a mean, which is 24.2 (the same as the mean of the population 22, 24, 23, 25, 27), and a variance, which is 1.48. The mean and variance of this population of sample averages are fixed, even though the sample average is random and can be any one of the possibilities given in the table above.

Exercise: calculate the mean and variance of this population of averages yourself.

What do you notice?

The sample average is random; it can be any one of the 25 possibilities given above.

The mean of the population of sample averages is the same as the mean of the original population 22, 24, 23, 25, 27.

The variance of the population of sample averages is 1.48, which is half the variance of the original population, which is 2.96.

In other words, the variance of the sample average is $\sigma^2/2 = 2.96/2$ (since $\sigma^2 = 2.96$ and $n = 2$).
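The exercise can be checked by listing all 25 ordered samples by machine. The following is a minimal sketch that uses the five ages from the game; the variable names are arbitrary.

```python
from itertools import product
from statistics import mean, pvariance

ages = [22, 24, 23, 25, 27]   # the fixed population from the game

# All ordered samples of size two, drawn with replacement (5 * 5 = 25 of them).
samples = list(product(ages, repeat=2))
averages = [(a + b) / 2 for a, b in samples]

print(len(samples))                          # number of possible samples
print(mean(ages), pvariance(ages))           # population mean and variance
print(mean(averages), pvariance(averages))   # mean and variance of the 25 averages
print(pvariance(ages) / 2)                   # sigma^2 / n with n = 2
```

The last two lines should agree on the variance: the variance of the 25 averages equals the population variance divided by the sample size, 2.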
How these results help us

We have shown that the sample mean has a distribution which is close to normal with mean $\mu$ and variance $\sigma^2/n$ ($\sigma^2$ is the variance of the population, that is, the variance of one randomly chosen person).

In the game above you see how the variance of the average is $\sigma^2/n$ when the sample size is $n$.

The sample average can always be used as an estimator of the mean.

We want to construct confidence intervals (CIs) for the mean. The CI is where the true mean is most likely to be. You will recall from Lecture 11 that these intervals are constructed under the assumption of normality.

Constructing confidence intervals for the mean

Suppose $X_i$ has mean $\mu$ and variance $\sigma^2$. We know that the average $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ is close to normal, approximately $N(\mu, \sigma^2/n)$, so we can construct a confidence interval for the mean.

We shall assume (for now) that the variance $\sigma^2$ is known (is this reasonable?).

If $n$ is sufficiently large (the case $n = 2$ is not enough), then we can assume that the distribution of the average is approximately normal. We can use this information to construct CIs.

By the CLT we have $\bar{X} \approx N\left(\mu, \frac{\sigma^2}{n}\right)$. Therefore

$$P\left(\bar{X} - 1.96\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}\right) \approx 0.95.$$

Given the sample mean $\bar{X}$, the 95% CI is

$$\left[\bar{X} - 1.96\frac{\sigma}{\sqrt{n}},\ \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}\right].$$

This means that out of every 100 intervals constructed in this way, about 95 would contain the mean $\mu$.

Example 1

A forester wishes to estimate the average number of trees per acre over a 2000-acre plantation. She can use this information to determine the total timber volume in the plantation. The standard deviation for the distribution of the number of trees in an acre is 12.1.

A random sample of $n = 50$ 1-acre plots is selected and examined. It is found that the average number of trees per acre (based on this sample) is 27.3.

Use this information to construct a 95% CI for the mean number of trees per acre.
Solution 1

The sample average is $\bar{X} = \frac{1}{50}\sum_{i=1}^{50} X_i$. The variance for one acre is known to be $12.1^2$, hence the variance of the sample mean is $12.1^2/50$.

The sample size is $n = 50$, which is large enough to assume normality. Therefore the 95% CI is

$$\left[27.3 - 1.96\frac{12.1}{\sqrt{50}},\ 27.3 + 1.96\frac{12.1}{\sqrt{50}}\right].$$

The length of the interval is $2 \times 1.96 \times \frac{12.1}{\sqrt{50}}$.

The 99% CI is

$$\left[27.3 - 2.56\frac{12.1}{\sqrt{50}},\ 27.3 + 2.56\frac{12.1}{\sqrt{50}}\right].$$

The length of the interval is $2 \times 2.56 \times \frac{12.1}{\sqrt{50}}$.

Question: construct the CI when the sample mean is again 27.3, but the sample size is now 150.

A summary

$X_1, \ldots, X_n$ is a random sample (say, the heights or weights of $n$ randomly selected individuals). The distribution of this variable (i.e. height or weight) has mean $\mu$ and variance $\sigma^2$.

The original distribution does not have to be normal (i.e. the histogram of heights can differ from the normal curve).

The sample average $\bar{X}$ is random too, and has a distribution.

But if $n$ is large enough, then regardless of the original distribution, the distribution of the sample average will be close to normal, with mean $\mu$ and variance $\sigma^2/n$.

However, the closer the original distribution is to normal (we may know this from previous experiments etc.), the smaller $n$ needs to be for the approximation to be good.

In other words, if the original distribution is far from normal, we will need a large sample size (say at least $n = 40$) for the normality result to hold.

Example 2

A social worker is interested in estimating the average time a first-time offender spends outside prison before they re-offend.

A random sample of $n = 150$ first-time offenders is considered. Based on these data it is found that the average time they spend away from prison is 3.2 years. The sample standard deviation is 1.1 years.

Stating all assumptions, construct a 99% CI for the true average $\mu$.

Solution 2

The sample mean is $\bar{X} = 3.2$. The sample standard deviation is 1.1.

The sample size is large, $n = 150$, hence we can assume normality of the sample mean $\bar{X}$. Moreover, since we have estimated the standard deviation $s = 1.1$ using 150 observations (a relatively large sample), we can assume it is a good estimator of the true population standard deviation $\sigma$.

Hence in our calculations we will use $s = 1.1$ in place of the true standard deviation $\sigma$. Therefore the estimated variance of the sample mean $\bar{X}$ is $1.1^2/150$.

The 99% CI is

$$\left[3.2 - 2.56\frac{1.1}{\sqrt{150}},\ 3.2 + 2.56\frac{1.1}{\sqrt{150}}\right].$$

The length is $2 \times 2.56 \times \frac{1.1}{\sqrt{150}}$.
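The intervals in Solutions 1 and 2 can be evaluated numerically. The sketch below is an illustration only: `z_interval` is a hypothetical helper name, and the example values are read as 27.3, 12.1, 3.2 and 1.1. The multiplier 2.56 is the value used in these notes for the 99% level; a more precise normal critical value is about 2.576.

```python
from math import sqrt

def z_interval(sample_mean, sd, n, z=1.96):
    """Return (lower, upper) for sample_mean +/- z * sd / sqrt(n)."""
    half_width = z * sd / sqrt(n)
    return sample_mean - half_width, sample_mean + half_width

# Example 1: sample mean 27.3, known sigma = 12.1, n = 50.
print(z_interval(27.3, 12.1, 50, z=1.96))   # 95% interval
print(z_interval(27.3, 12.1, 50, z=2.56))   # 99% interval with the notes' multiplier

# Example 2: sample mean 3.2, s = 1.1 used in place of sigma, n = 150.
print(z_interval(3.2, 1.1, 150, z=2.56))
```

Rounded to two decimals, these come out at roughly [23.95, 30.65], [22.92, 31.68] and [2.97, 3.43]. Calling `z_interval(27.3, 12.1, 150)` answers the question about the sample size of 150.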
Sample size and the confidence interval

Example: $\bar{X} = 10.38$, $\sigma^2 = 33$. We compare the 95% confidence intervals for $n = 9$ and $n = 25$:

n = 9:  $\left[\bar{X} - 1.96\frac{\sigma}{\sqrt{9}},\ \bar{X} + 1.96\frac{\sigma}{\sqrt{9}}\right]$
n = 25: $\left[\bar{X} - 1.96\frac{\sigma}{\sqrt{25}},\ \bar{X} + 1.96\frac{\sigma}{\sqrt{25}}\right]$

What are the lengths of the above intervals? For $n = 9$ it is $2 \times 1.96 \times \frac{\sigma}{\sqrt{9}}$. For $n = 25$ it is $2 \times 1.96 \times \frac{\sigma}{\sqrt{25}}$. Observe that the length does not depend on $\bar{X}$; it is the same length regardless of the value of $\bar{X}$.

n = 9:  $\left[10.38 - 1.96\frac{\sqrt{33}}{\sqrt{9}},\ 10.38 + 1.96\frac{\sqrt{33}}{\sqrt{9}}\right] = [6.63, 14.13]$
n = 25: $\left[10.38 - 1.96\frac{\sqrt{33}}{\sqrt{25}},\ 10.38 + 1.96\frac{\sqrt{33}}{\sqrt{25}}\right] = [8.13, 12.63]$

We see that the second interval is shorter than the first. When the sample size is large we have more information about the population. The larger the sample size, the shorter the confidence interval, because the estimator is in general better.

The population variance $\sigma^2$ and the confidence interval

We see that the variance $\sigma^2$ of one observation also has an effect on the length of the confidence interval.

Example: suppose $\bar{X} = 10.38$ and $n = 9$.

$\sigma^2 = 33$:  $\left[10.38 - 1.96\frac{\sqrt{33}}{\sqrt{9}},\ 10.38 + 1.96\frac{\sqrt{33}}{\sqrt{9}}\right] = [6.63, 14.13]$
$\sigma^2 = 100$: $\left[10.38 - 1.96\frac{\sqrt{100}}{\sqrt{9}},\ 10.38 + 1.96\frac{\sqrt{100}}{\sqrt{9}}\right] = [3.85, 16.91]$

The larger the variance of the random variable $X_i$, the more variability there is in the sample mean, hence it is unlikely that a short interval will capture the true mean.

Choosing the sample size for estimating $\mu$

How can one determine the number of observations to be included in a sample?

A very large sample size would be nice, but often it is too costly.

A sample size which is too small can contain inadequate information.

We need to develop a compromise between the desired accuracy and the cost of obtaining this accuracy.

How do we choose the sample size $n$? This can be a complex issue that often depends on the knowledge of the researcher.
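The effect of $n$ and $\sigma^2$ on the interval length can be tabulated in a few lines. This is a small sketch: `ci_length` is a hypothetical helper, and $\sigma^2 = 33$ is an assumption, namely the value implied by the intervals printed for $\bar{X} = 10.38$.

```python
from math import sqrt

def ci_length(sigma2, n, z=1.96):
    """Length of the CI, 2 * z * sigma / sqrt(n); the sample mean plays no part."""
    return 2 * z * sqrt(sigma2) / sqrt(n)

# sigma^2 = 33 is inferred from the printed intervals; 100 is the comparison value.
for sigma2 in (33, 100):
    for n in (9, 25):
        print(sigma2, n, round(ci_length(sigma2, n), 2))
```

Reading down the output, the length falls when $n$ goes from 9 to 25 and rises when $\sigma^2$ goes from 33 to 100.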
Tolerable error

Often a researcher will choose the sample size $n$ based on the length of the confidence interval. This means she is willing to accept a CI with some preset length $2E$ ($E$ is called the tolerable error). Hence the following interval should have length $2E$:

$$\left[\bar{X} - 1.96\frac{\sigma}{\sqrt{n}},\ \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}\right].$$

Choosing n

For example, suppose we know that $\sigma^2 = 3$ and $2E = 0.5$. Then

$$\left[\bar{X} - 1.96\frac{\sqrt{3}}{\sqrt{n}},\ \bar{X} + 1.96\frac{\sqrt{3}}{\sqrt{n}}\right]$$

should have length 0.5, and we need to choose $n$ such that this is satisfied. The length of the interval is

$$\left(\bar{X} + 1.96\frac{\sqrt{3}}{\sqrt{n}}\right) - \left(\bar{X} - 1.96\frac{\sqrt{3}}{\sqrt{n}}\right) = 2 \times 1.96 \times \frac{\sqrt{3}}{\sqrt{n}},$$

therefore we solve

$$0.5 = 2 \times 1.96 \times \frac{\sqrt{3}}{\sqrt{n}}.$$

This gives the sample size

$$n = \left(\frac{2 \times 1.96 \times \sqrt{3}}{0.5}\right)^2 = 184.4.$$

Hence the smallest value of $n$ for which the length of the confidence interval is at most 0.5 (that is, the tolerable error is $E = 0.25$) is 185.

In general: length of a confidence interval

At the 95% level the interval is

$$\left[\bar{X} - 1.96\frac{\sigma}{\sqrt{n}},\ \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}\right],$$

where $1.96\,\sigma/\sqrt{n}$ is the confidence factor. The length of the confidence interval is

$$\left(\bar{X} + 1.96\frac{\sigma}{\sqrt{n}}\right) - \left(\bar{X} - 1.96\frac{\sigma}{\sqrt{n}}\right) = 2 \times 1.96\frac{\sigma}{\sqrt{n}} = 2 \times \text{confidence factor} = 2E.$$

Therefore we need to choose $n$ such that $2E = 2 \times 1.96\frac{\sigma}{\sqrt{n}}$, that is,

$$n = \frac{(1.96)^2\sigma^2}{E^2}.$$

The above was for the 95% CI.

General CIs and tolerable error

What is the length of a 99% CI, and how should we choose $n$ in this case?

In general, suppose we go for the $(1-\alpha) \times 100\%$ CI, where $\alpha$ is pre-selected (so that we know $z_{\alpha/2} = 1.96, 2.56$ etc.). Then we need to solve

$$2E = 2 z_{\alpha/2}\frac{\sigma}{\sqrt{n}}, \quad\text{that is,}\quad n = \frac{z_{\alpha/2}^2\,\sigma^2}{E^2}.$$

We need to pre-select the value of $E$. The smaller $E$ is, the larger our sample size $n$ will have to be. The researcher must decide how much precision s/he requires.
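The sample size rule $n = z_{\alpha/2}^2\sigma^2/E^2$ translates directly into code. The sketch below assumes $\sigma^2$ is known and rounds up to the next whole observation; `required_sample_size` is a hypothetical helper name.

```python
from math import ceil

def required_sample_size(sigma2, E, z=1.96):
    """Smallest integer n with z * sigma / sqrt(n) <= E, i.e. n >= z^2 * sigma^2 / E^2."""
    return ceil(z ** 2 * sigma2 / E ** 2)

# Example from the notes: sigma^2 = 3 and interval length 2E = 0.5, so E = 0.25.
print(required_sample_size(3, 0.25))           # 95% level
print(required_sample_size(3, 0.25, z=2.56))   # 99% level with the notes' multiplier

# Halving the tolerable error roughly quadruples the required sample size.
print(required_sample_size(3, 0.125))
```

The first call returns 185, in agreement with the worked example. Rounding up (rather than to the nearest integer) guarantees that the interval is no longer than $2E$.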
Some more practice on CIs

From a sample $X_1, X_2, X_3, X_4, X_5$ we compute the sample mean $\bar{X}$ and then the interval $\left[\bar{X} - 1.96\frac{\sigma}{\sqrt{5}},\ \bar{X} + 1.96\frac{\sigma}{\sqrt{5}}\right]$.

The sample mean $\bar{X}$ is known as a point estimator.

The interval $\left[\bar{X} - 1.96\frac{\sigma}{\sqrt{n}},\ \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}\right]$ is known as a 95% confidence interval.

Like the sample mean, the confidence interval is also random. See Figure 5.4 in Ott and Longnecker (page 197) for a good illustration.

Below we construct 95% CIs for the mean using the average in each sample.

Sample   Observations                                Average   95% CI
1         4.755    6.174   14.092   16.166   10.741   10.38    [5.350, 15.421]
2        16.376   14.995   11.078   11.334   19.817   14.71    [9.684, 19.755]
3        17.338    3.094    6.174    7.015   16.782   10.08    [5.045, 15.116]
4        14.212    7.333    0.669    3.700    9.077    7.00    [1.963, 12.033]
5        17.845    6.084   10.556    0.693   11.194    9.27    [4.239, 14.310]
6         7.422    2.743    2.489   19.446   18.505   10.12    [5.086, 15.156]
7         6.642   10.475   15.404    2.279   17.077   10.37    [5.340, 15.411]
8        15.380   14.440   13.421    9.433   16.322   13.80    [8.764, 18.835]
9         1.000   10.057   19.754   19.675    5.031   11.10    [6.068, 16.139]
10       15.019   10.182   11.708   11.117   15.131   12.63    [7.596, 17.667]

The population variance is $\mathrm{var}(X_i) = \sigma^2 = 33$, hence $\sigma/\sqrt{5} = \sqrt{33/5}$. The true mean is 10. How many intervals contain it?

An illustration

For each sample average we plot the interval:

[Figure: the confidence interval for each sample, with a green line marking the population mean.]

The green line is where the population mean is. In reality it is unknown. If we did this plot for 100 different samples, about 95 of the intervals would intersect the population mean.

Aside: Confidence intervals in the general case

Up until now we have looked at 95% (or 99% or 90%) CIs, but it is easy to construct any $100(1-\alpha)\%$ CI.

If we want a $100(1-\alpha)\%$ confidence interval, find the $z_{\alpha/2}$ such that $P(Z > z_{\alpha/2}) = \alpha/2$ (recall the plot). For example, if we want a 95% interval, then $P(Z > 1.96) = 0.025$.

Using the arguments in Lecture 11, this implies

$$P\left(\bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha$$

and leads to the $100(1-\alpha)\%$ confidence interval

$$\left[\bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right].$$
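The general recipe, and the claim that about 95 out of 100 intervals cover the mean, can both be tried out in a short simulation. This is a sketch under stated assumptions: the data are simulated from a normal distribution with mean 10 and variance 33 (values taken to match the practice table, and an assumption as far as the true population goes), and `z_critical` and `confidence_interval` are hypothetical helper names.

```python
import random
from math import sqrt
from statistics import NormalDist, mean

def z_critical(alpha):
    """Return z with P(Z > z) = alpha / 2 for a standard normal Z."""
    return NormalDist().inv_cdf(1 - alpha / 2)

def confidence_interval(sample, sigma, alpha=0.05):
    """100(1 - alpha)% interval for the mean when sigma is known."""
    half_width = z_critical(alpha) * sigma / sqrt(len(sample))
    centre = mean(sample)
    return centre - half_width, centre + half_width

random.seed(651)                      # fixed seed so the run is repeatable
mu, sigma, n = 10, sqrt(33), 5        # assumed population mean, sd and sample size
trials = 10_000
hits = 0
for _ in range(trials):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    lower, upper = confidence_interval(sample, sigma)
    hits += lower <= mu <= upper      # count the intervals that cover mu

print(round(z_critical(0.05), 2))     # the familiar 1.96
print(hits / trials)                  # proportion of intervals covering mu
```

The printed proportion should be close to 0.95. Changing `alpha` to 0.01 or 0.10 gives the 99% and 90% versions of the same experiment.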