Data Analysis and Statistical Methods Statistics 651


Suhasini Subba Rao

Review of previous lecture: Why confidence intervals?

Suppose you want to know the mean height of people at A&M. You take as your sample all the people in this class and compute the average height of the class (this is the sample mean). The sample mean based on this class is 5.5 feet. What does 5.5 feet tell you about the mean height in the university? It is highly unlikely that the population mean is exactly 5.5 feet.

More informative than the average alone is an interval in which, with a stated degree of confidence, the mean should lie. This is known as a confidence interval (CI). The width of the interval tells us how accurate the estimate 5.5 may be.

However, without any assumptions on the distribution of the estimator, we are unable to construct an interval. We show today that it is often possible to assume normality.

Example I

Suppose we have an estimator of the mean, which we call $X$. This estimator is a random variable which, for now, we assume is normally distributed, $X \sim N(\mu, \sigma^2)$. Suppose we observe $X = 3$. On its own this does not tell us much about the location of the true mean $\mu$. But:

(i) $[3 - 1.96\sigma,\ 3 + 1.96\sigma]$ tells us with 95% confidence that the unknown mean $\mu$ lies in this interval.
(ii) $[3 - 1.64\sigma,\ 3 + 1.64\sigma]$ tells us with 90% confidence that the mean $\mu$ lies in this interval.
(iii) $[3 - 2.58\sigma,\ 3 + 2.58\sigma]$ tells us with 99% confidence that the mean $\mu$ lies in this interval.

The lower the confidence level (95% is lower than 99%), the narrower the interval. Conversely, the more confidence we require, the wider the interval needs to be.

Hence, as Example I illustrates, there is a trade-off between pinpointing the location of the mean and how much confidence we want in the interval. If we want to pinpoint the mean, the interval should be narrower, but then the confidence we have in that interval will be less.
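
The three intervals in Example I can be reproduced numerically. The following is a minimal sketch, assuming a normally distributed estimator as in the example; the helper name `normal_ci` and the value `sigma = 1.0` are illustrative choices, not part of the lecture. It uses the standard normal quantile from Python's `statistics.NormalDist`, which gives multipliers of about 1.645, 1.960 and 2.576 for 90%, 95% and 99% confidence.

```python
from statistics import NormalDist


def normal_ci(estimate, sigma, confidence):
    """Two-sided interval estimate +/- z * sigma for a normal estimator."""
    # z is the standard normal quantile leaving (1 - confidence) / 2 in each tail.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return estimate - z * sigma, estimate + z * sigma


if __name__ == "__main__":
    sigma = 1.0  # illustrative value; Example I leaves sigma general
    for confidence in (0.90, 0.95, 0.99):
        lower, upper = normal_ci(3.0, sigma, confidence)
        print(f"{confidence:.0%} CI: [{lower:.3f}, {upper:.3f}]")
```

Running it shows the interval widening as the confidence level rises, which is the trade-off described in the text.
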
If we want more confidence that the mean lies in the interval, then the interval should be wider. But a wider interval is not very informative about the location of the mean. An extreme example is an interval that runs from minus infinity to plus infinity. The mean is definitely inside this interval (100% confidence), but it is not at all informative about the location of the true mean!

Example II

Consider the example above.

(i) If the standard deviation is $\sigma = 1$, then the 95% CI is $[3 - 1.96,\ 3 + 1.96]$.
(ii) If the standard deviation is $\sigma = 100$, then the 95% CI is $[3 - 196,\ 3 + 196]$.

The larger the variance, the wider the CI. When the variance is large we need a wider interval to ensure that it includes the unknown mean $\mu$.

Note that the interval $[X - 1.96\sigma,\ X + 1.96\sigma]$ tells us that for every 100 draws of the random variable $X$, the mean $\mu$ should lie in this interval approximately 95 times.

Elections on planet Frog

In the local newspaper I found this: "A poll of likely voters put candidate Smith in second place with 21%..." Below the article I found this: "The telephone survey of 500 likely voters was conducted by Ramissen Reports. The margin of error of sampling for the survey is +/- 4.5 percentage points at the midpoint with 95% confidence."

What is the survey saying? What does +/- 4.5% tell us?

Interpretation of poll report

The proportion of people in the population who will vote for Smith is unknown; let us call it $\pi$. We use the sample to estimate it.

The figure of 21% is calculated by dividing the number of people who said they will vote for Smith by the total number of people interviewed. It is used as an estimate of the population proportion $\pi$. Of course the population proportion need not equal 21% exactly, but the true proportion may lie in some interval about 21%. The newspaper says that with 95% confidence the proportion lies in the interval $[21 - 4.5,\ 21 + 4.5] = [16.5,\ 25.5]$.

This interval is calculated by assuming that the sample proportion is approximately normally distributed. It can be shown that this proportion is close to normal.

In the telephone poll, each person was asked whether they would vote for Smith. The outcome is either yes (coded as one) or no (coded as zero). Therefore the response of person $i$ is a binary random variable $X_i$.
The probability of a yes is the true population proportion $\pi$ (the number of people in the population who will vote for Smith divided by the total number of people in the population), which we are trying to estimate. Remember that $P(X_i = 1) = \pi$. The number of people among the 500 sampled in the telephone survey who say they will vote for Smith is $S_{500} = X_1 + X_2 + \dots + X_{500}$; this is a binomial random variable, $S_{500} \sim \mathrm{Bin}(500, \pi)$.
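
The reported margin of error can be roughly checked with the usual normal-approximation formula for a proportion, $z\sqrt{p(1-p)/n}$. This is a hedged sketch: the lecture does not spell out the formula the pollster used, the function name `proportion_margin` is illustrative, and reading "at the midpoint" as $p = 0.5$ is an assumption.

```python
from math import sqrt
from statistics import NormalDist


def proportion_margin(p, n, confidence=0.95):
    """Half-width of the normal-approximation interval for a proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * sqrt(p * (1 - p) / n)


if __name__ == "__main__":
    n = 500  # number of likely voters in the poll
    # Assumption: "at the midpoint" means the margin is evaluated at p = 0.5.
    print(f"margin at p = 0.50: {100 * proportion_margin(0.5, n):.1f} points")
    # Margin evaluated at the observed proportion of 21% instead.
    print(f"margin at p = 0.21: {100 * proportion_margin(0.21, n):.1f} points")
```

Under these assumptions the margin at $p = 0.5$ comes out at about 4.4 percentage points, close to the 4.5 quoted by the newspaper, while the margin at the observed 21% is smaller, about 3.6 points.
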

Therefore the estimator of $\pi$ is the sample proportion $\hat{\pi} = S_{500}/500$.

Remember that we showed in lecture 7 that under certain conditions $S_n$ (the number of people who said they would vote for Smith in the telephone poll) is close to normally distributed. Hence the proportion $\hat{\pi} = S_n/n$ is close to normal, which means that it is reasonable to assume normality of the proportion estimator, and the CI makes sense.

This argument holds for any average, not just the proportion. We now show why.

The sample mean

In statistics it is often of interest to estimate the population mean. But we usually only have a sample from the population. We can evaluate the sample mean. This raises several questions:

How close is the sample mean to the population mean?
How good an estimator is the sample mean?

The confidence interval gives us information about the location of the population mean and the accuracy of the estimator.

The sample mean is random too

The observations $X_1, \dots, X_n$ are random variables with mean $\mu$ and variance $\sigma^2$. The sample mean $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ is also a random variable.

The sample mean is an estimator of the population mean $\mu$. In general it will not be exactly equal to the population mean, but an interval constructed about the sample mean may contain the population mean.

Every time I draw a sample it will be different, so each time I calculate the sample mean I get a different value. Since $X_1, \dots, X_n$ are random variables with a distribution, the sample mean is also a random variable with a distribution. This raises several questions:

What is the standard deviation of this estimator (this is the standard error)?
How does the standard error of the estimator relate to the standard deviation of the original distribution?
What is the distribution of the sample mean (i.e. if I drew all possible samples of size $n$ from a population and made a density plot of their sample means, what would it look like)?

The variance of the estimator is $\sigma^2/n$, and the standard error is $\sigma/\sqrt{n}$ (note how these relate to the variance $\sigma^2$ of the original distribution). In fact, if $n$ is large, it can be shown that the distribution of the sample mean is approximately normal (this is the central limit theorem). Armed with these two facts we can construct confidence intervals.
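
The variance formula can be checked in a few lines. The derivation below assumes that the observations $X_1, \dots, X_n$ are independent with common variance $\sigma^2$; independence is not stated explicitly here, but it is the usual meaning of a random sample.

```latex
\begin{align*}
\operatorname{Var}(\bar{X})
  &= \operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
   = \frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}(X_i)
   = \frac{n\sigma^2}{n^2}
   = \frac{\sigma^2}{n}, \\
\operatorname{s.e.}(\bar{X})
  &= \sqrt{\operatorname{Var}(\bar{X})}
   = \frac{\sigma}{\sqrt{n}}.
\end{align*}
```

The second equality is where independence is used: without it, covariance terms between the $X_i$ would also appear.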

The central limit theorem

Suppose $X_1, \dots, X_n$ is a sample from a population with mean $\mu$ and variance $\sigma^2$. If the sample size $n$ is large, then the sample mean $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ (approximately) has the distribution $\bar{X} \sim N(\mu, \sigma^2/n)$.

This means that if the random variable $X_i$ has mean $\mu$ and variance $\sigma^2$ (for example, the height of a randomly chosen person has mean $\mu$ and variance $\sigma^2$), then the average taken over a sample of $n$ individuals, which we call $\bar{X}$, has mean $\mu$ and variance $\sigma^2/n$. Observe how the variance goes from $\sigma^2$ (the variance of an individual) to $\sigma^2/n$ (the variance of the average of $n$ people).

Look at CLT lecture12.pdf pages 1-3.

Look at sim/sampling dist/index.htm and see how the distribution of the sample mean becomes more normal as we increase the sample size, regardless of whether the underlying population distribution is normal or not.

How large is large?

How large "large" needs to be is a difficult question, and the answer varies from data set to data set. A rule of thumb is that a sample size of about 30 is enough. Also notice how the standard deviation (the spread of the histogram) gets smaller as you increase the sample size from 5 to 25. If the data is highly non-normal (you can check this by making a QQ plot), more observations are required for the sample mean to be close to normal.

Look at sim/sampling dist/index.htm. Select your underlying population distribution: it can be normal, or highly non-normal such as skewed or uniform. Choose a sample size from 5 to 25 (this is the number of observations in each sample). Select 1000 samples, and plot the histogram of the sample means. You will see that for sample size 25 the histograms of the sample means all look quite normal. But for smaller sample sizes, the histograms of the sample means look less normal.
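
The experiment described for the applet can be imitated in a few lines of code. This is only a stand-in sketch, not the applet itself: it assumes an Exponential(1) population (mean 1, standard deviation 1, strongly skewed), and the function name `simulate_sample_means` and the fixed seed are illustrative choices. It draws 1000 samples for each sample size and compares the spread of the sample means with $\sigma/\sqrt{n}$.

```python
import random
from math import sqrt
from statistics import mean, pstdev


def simulate_sample_means(sample_size, num_samples=1000, seed=0):
    """Draw num_samples samples from a skewed population; return their means."""
    rng = random.Random(seed)
    # Exponential(1) population: mean 1, standard deviation 1, strongly skewed.
    return [
        mean(rng.expovariate(1.0) for _ in range(sample_size))
        for _ in range(num_samples)
    ]


if __name__ == "__main__":
    for n in (5, 25):
        means = simulate_sample_means(n)
        print(
            f"n={n:2d}: mean of sample means = {mean(means):.3f}, "
            f"sd of sample means = {pstdev(means):.3f}, "
            f"sigma/sqrt(n) = {1 / sqrt(n):.3f}"
        )
```

The standard deviation of the simulated sample means should sit close to $\sigma/\sqrt{n}$ and shrink as $n$ goes from 5 to 25. Passing the returned list to any histogram routine gives the plots the applet produces.
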

The larger the sample, the smaller the variance

To recap, suppose $X_i$ is a random variable with mean $\mu$ and variance $\sigma^2$. We have a sample $X_1, \dots, X_n$. The sample mean $\bar{X}$ is a random variable. If the sample size $n$ is large we have, approximately, $\bar{X} \sim N(\mu, \sigma^2/n)$.

What else do we notice about the variance? As the sample size $n$ gets larger the variance gets smaller. This is what we would expect: the larger the sample size, the more accurate the sample mean is as an estimator of the population mean.

Calculate the standard errors

Suppose that $X_1, \dots, X_n$ is a random sample from a population with mean $\mu$ and variance 6.

What is a suitable estimator of $\mu$?
What is the variance of the estimator when $n = 5$?
What is the variance of the estimator when $n = 25$?
What is the variance of the estimator when $n = 60$?
What is the standard error of the estimator when $n = 5$?
What is the standard error of the estimator when $n = 25$?
What is the standard error of the estimator when $n = 60$?

Can we construct confidence intervals for the mean?

We have the random sample $X_1, X_2, \dots, X_n$. We do not know the distribution it was drawn from. We do know that the variance is $\sigma^2 = 4$, but the mean $\mu$ is unknown.

What is a suitable estimator of $\mu$?
Is it possible to construct a 95% CI for $\mu$?

Look at CLT lecture12.pdf pages 4-5.

Example

Suppose that we observe the random sample $X_1, X_2, \dots, X_n$ and the sample mean is $\bar{X} = 6$. We know that the population variance is 4. Under what assumptions can we construct a 95% CI for the population mean $\mu$? If these assumptions are satisfied, construct a 95% CI.
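
One way to check answers to the questions above is to code the two formulas $\sigma^2/n$ and $\sigma/\sqrt{n}$ directly. The sketch below is not an official solution: the function names are illustrative, the interval assumes $n$ is large enough for the central limit theorem to apply, and because the example does not state $n$, the value `n=100` is a hypothetical choice made only to produce concrete numbers.

```python
from math import sqrt
from statistics import NormalDist


def variance_of_mean(sigma2, n):
    """Variance of the sample mean of n observations with variance sigma2."""
    return sigma2 / n


def standard_error(sigma2, n):
    """Standard error of the sample mean: sigma / sqrt(n)."""
    return sqrt(sigma2 / n)


def mean_ci(xbar, sigma2, n, confidence=0.95):
    """CLT-based interval xbar +/- z * sigma / sqrt(n), with sigma2 known."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * standard_error(sigma2, n)
    return xbar - half_width, xbar + half_width


if __name__ == "__main__":
    for n in (5, 25, 60):
        print(f"n={n}: variance={variance_of_mean(6, n):.3f}, "
              f"standard error={standard_error(6, n):.3f}")
    # The example does not state n; n = 100 is a hypothetical value.
    lower, upper = mean_ci(xbar=6, sigma2=4, n=100)
    print(f"95% CI with n=100: [{lower:.3f}, {upper:.3f}]")
```

Changing `n` in the last call shows how the interval for $\mu$ narrows as the sample grows, in line with the heading "The larger the sample, the smaller the variance".
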
