Data Analysis and Statistical Methods Statistics 651

http://www.stat.tamu.edu/~suhasini/teaching.html

Suhasini Subba Rao

Review of previous lecture

The main idea in the previous lecture is that the sample average has a distribution, and when the sample size is large, the sample average (obtained from this sample) is close to being normally distributed.

We first note that if, in general, an estimator is close to the population mean, then the variance of the estimator will be small. To understand this look at the density plots (look again at variance explanation lecture4.pdf in Lecture 4, and large variance small variance.pdf).

Remember that the population variance is defined as

$$\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2.$$

When this is small the outcomes are in general close to the mean $\mu$. When it is large the outcomes are widely spread.

Reasoning that the sample average is random and has a distribution

Suppose a population contains 10 individuals (a very small population). You can make a histogram of the heights of these individuals. Call this Histogram A. You can also evaluate the population mean and the population variance. The probability that a randomly selected person's height lies in a certain interval is determined by Histogram A.

Suppose you collect all samples of size two from this 10-individual population (that is, all subsets containing 2 people from these 10). There are 45 such subsets. For each subset (sample) you calculate the average (this is the sample average). There will be 45 averages. These 45 averages can also be treated as a population. We can make a histogram of these averages. Call this Histogram B. The mean of this population is the same as the mean of the population with 10 people, but the variance of this population will be different.

Suppose you randomly select a subset of size two and take the average of these two individuals. The average will be one of the 45 values. The probability that the sample average lies in any given interval is determined by Histogram B.

Now collect all samples (subsets) of size 3 out of 10. There are 120 such subsets. For each subset you calculate the average (the sample average). These 120 averages can also be considered as a population. You can make a histogram of these averages. Call this Histogram C. The average of any random sample of size 3 must be one of these 120 numbers. The probability that the average lies in any given interval is determined by Histogram C.

Reasoning that the variance of the sample mean gets smaller as the sample size gets larger

Suppose you draw one individual from a population of people and measure their height. Their height could be close to the mean height, but it could also be extremely small or extremely large.

Suppose you draw three individuals from a population of people and measure all three heights. One height may be an extreme value (extremely far from the mean), but it is highly unlikely that all three heights will be extreme (the probability of this happening is very small). Therefore the average of the three heights will smooth out the extreme behaviour and is likely to be closer to the mean height than an individual height.

Taking this argument to the extreme: suppose that I have a population of 1000 people and I take as my sample 999 people. This sample is random. The average of these 999 people will be very close to the average of the 1000 people (the height of the poor excluded individual hardly counts). Since the sample average is likely to be close to the population mean, the variance of the sample average will be small. Therefore the variance of averages involving 999 individuals will be very, very small.

But note that the variance is not small because the sample is large relative to the size of the population; it is small because the sample is large, regardless of the size of the population.

I often get the comment: "If the population size is a billion and the sample size is 500, then the sample mean is bad. But if the population size is 100 and the sample size is 500, then the sample mean is good." This is not correct. The sampling is always done with replacement (you can draw the same individual again), so the proportion of the sample size relative to the population does not matter. What matters is the sample size and the population variance. The variance of the sample mean, $\sigma^2/n$, depends only on these two factors, not on the population size (to statisticians the population is usually infinite!).
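The claim that only the sample size and the population variance matter can be checked by brute force. The sketch below is illustrative only: the heights are made up, and the two populations are hypothetical ones built to share the same variance while differing in size. It enumerates every ordered sample of size 2 drawn with replacement and computes the exact variance of the resulting averages. It also confirms the subset counts 45 and 120 quoted earlier.

```python
from itertools import product
from math import comb
from statistics import pvariance


def variance_of_sample_mean(population, n):
    """Exact variance of the sample average over all ordered samples
    of size n drawn WITH replacement from the population."""
    averages = [sum(sample) / n for sample in product(population, repeat=n)]
    return pvariance(averages)


# Hypothetical heights (cm); not data from the lecture.
small_population = [150.0, 160.0, 170.0, 180.0]   # 4 individuals
large_population = small_population * 25          # 100 individuals, same variance

sigma_sq = pvariance(small_population)            # population variance
var_small = variance_of_sample_mean(small_population, 2)
var_large = variance_of_sample_mean(large_population, 2)

# Number of subsets of size 2 and of size 3 from 10 individuals.
subset_counts = (comb(10, 2), comb(10, 3))

print(sigma_sq, var_small, var_large, subset_counts)
```

Both populations give the same variance for the average, namely the population variance divided by 2, even though one population is 25 times larger than the other.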
In summary, the average of three heights is likely to be closer to the mean than just one height on its own. This means the variance of an average involving three individuals will be smaller than the variance of just one individual.

The variance of the sample mean is $\sigma^2/n$

Suppose the variance of one person is $\sigma^2$. The variance of the average of three will be $\sigma^2/3$. Notice that $\sigma^2$ is still there. This is because the variance of the population will always affect the variance of the average. But the sample size also has an effect.

In the above we have reasoned that the sample average does have a distribution and that the variance of this distribution decreases as the sample size grows. It can be shown that not only does the variance of the distribution decrease with sample size, but the distribution also becomes more bell shaped. It becomes a normal distribution.

The larger the population variance $\sigma^2$, the larger the variance of the average $\sigma^2/n$.

The larger the sample size $n$, the smaller the variance of the average $\sigma^2/n$.

The average is approximately normal with mean $\mu$ and variance $\sigma^2/n$.
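For readers who want the step behind $\sigma^2/n$, here is a short derivation. It assumes the observations $X_1, \dots, X_n$ are independent (as they are when sampling with replacement) and each has variance $\sigma^2$; the notation $\operatorname{Var}$ is introduced here for convenience.

```latex
\begin{align*}
\operatorname{Var}(\bar{X})
  &= \operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
   = \frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}(X_i)
   = \frac{n\,\sigma^2}{n^2}
   = \frac{\sigma^2}{n}.
\end{align*}
```

The factor $1/n^2$ comes from pulling the constant $1/n$ out of the variance, and the sum of the $n$ individual variances contributes $n\sigma^2$. With $n = 3$ this gives the $\sigma^2/3$ mentioned above.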

A game

We choose 5 people in the class. This is our population (it is fixed). Suppose their ages are 22, 24, 23, 25, 27. The mean of this population is 24.2 and the variance is 2.96. Since these 5 students form our population, neither the mean nor the variance is random. They are fixed and cannot be changed (unless we change the population).

Suppose we take a sample of size two. The sample can be any one of the following possibilities:

No.  Sample   Sample average     No.  Sample   Sample average
1    22, 22   22                 2    24, 24   24
3    23, 23   23                 4    25, 25   25
5    27, 27   27                 6    22, 24   23
7    24, 22   23                 8    22, 25   23.5
9    25, 22   23.5               10   22, 27   24.5
11   27, 22   24.5               12   24, 23   23.5
13   23, 24   23.5               14   24, 25   24.5
15   25, 24   24.5               16   24, 27   25.5
17   27, 24   25.5               18   23, 25   24
19   25, 23   24                 20   25, 27   26
21   27, 25   26                 22   23, 27   25
23   27, 23   25                 24   22, 23   22.5
25   23, 22   22.5

Associated with each sample is the sample average. We see that these averages can also be considered as a population: the population of all averages of samples of size two. It also has a mean, which is 24.2 (the same as the mean of the population 22, 24, 23, 25, 27), and a variance, which is 1.48. The mean and variance of the sample average population are fixed, even though the sample average is random and can be any one of the possibilities given in the table above.

Exercise: calculate the mean and variance of this population yourself. What do you notice?

The sample average is random; it can be any one of the 25 possibilities given above.

The mean of the population of sample averages is the same as the mean of the original population 22, 24, 23, 25, 27.

The variance of the population of sample averages is 1.48, which is half the variance of the original population, which is 2.96. In other words the variance of the sample average is $\sigma^2/2 = 2.96/2$ (since $\sigma^2 = 2.96$ and $n = 2$).
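The exercise in the game can be checked with a few lines of code. This is only a sketch: it takes the five ages as 22, 24, 23, 25, 27, enumerates all 25 ordered samples of size two drawn with replacement, and uses the population variance (dividing by the number of values), which is the definition used in these notes. The variable names are illustrative.

```python
from itertools import product
from statistics import mean, pvariance

# The five ages that form the (fixed) population in the game.
ages = [22.0, 24.0, 23.0, 25.0, 27.0]

# All ordered samples of size two, drawn with replacement: 5 * 5 = 25 of them.
samples = list(product(ages, repeat=2))
averages = [(first + second) / 2 for first, second in samples]

population_mean = mean(ages)
population_variance = pvariance(ages)      # divides by N, not N - 1
mean_of_averages = mean(averages)
variance_of_averages = pvariance(averages)

print(len(samples))
print(round(population_mean, 2), round(population_variance, 2))
print(round(mean_of_averages, 2), round(variance_of_averages, 2))
```

The mean of the 25 averages agrees with the population mean, and their variance is half the population variance, as $\sigma^2/n$ with $n = 2$ predicts.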

How these results help us

We have shown that the sample mean has a distribution which is close to normal with mean $\mu$ and variance $\sigma^2/n$ ($\sigma^2$ is the variance of the population, that is, the variance of one randomly chosen person). In the game above you see how the variance of the average is $\sigma^2/n$ when the sample size is $n$.

The sample average can always be used as an estimator of the mean.

We want to construct confidence intervals (CIs) for the mean. Inside the CI is where the true mean is most likely to be. You recall from Lecture 11 that these intervals are constructed under the assumption of normality.

Constructing confidence intervals for the mean

Suppose $X_i$ has mean $\mu$ and variance $\sigma^2$. We know that the average $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ is close to normal, approximately $N(\mu, \sigma^2/n)$, so we can construct a confidence interval for the mean.

We shall assume (for now) that the variance $\sigma^2$ is known (is this reasonable?).

If $n$ is sufficiently large (the case $n = 2$ is not enough), then we can assume that the distribution of the average is approximately normal. We can use this information to construct CIs.

By the CLT we have $\bar{X} \approx N(\mu, \sigma^2/n)$. Therefore

$$P\left(\bar{X} - 1.96\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}\right) \approx 0.95.$$

Given the sample mean $\bar{X}$, the 95% CI is

$$\left[\bar{X} - 1.96\frac{\sigma}{\sqrt{n}},\; \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}\right].$$

This means that for every 100 intervals constructed, about 95 would contain the mean $\mu$.

Example 1

A forester wishes to estimate the average number of trees per acre over a 2000-acre plantation. She can use this information to determine the total timber volume in the plantation. The standard deviation of the number of trees in an acre is 12.1. A random sample of $n = 50$ 1-acre plots is selected and examined. It is found that the average number of trees per acre (based on this sample) is 27.3. Use this information to construct a 95% CI for the mean number of trees per acre.

Solution 1

The sample average is $\bar{X} = \frac{1}{50}\sum_{i=1}^{50} X_i$.

The variance for one acre is known to be $12.1^2$, hence the variance of the sample mean is $12.1^2/50$.

The sample size $n = 50$ is large enough to assume normality. Therefore the 95% CI is

$$\left[27.3 - 1.96\frac{12.1}{\sqrt{50}},\; 27.3 + 1.96\frac{12.1}{\sqrt{50}}\right].$$

The length of the interval is $2 \times 1.96 \times \frac{12.1}{\sqrt{50}}$.

The 99% CI is

$$\left[27.3 - 2.56\frac{12.1}{\sqrt{50}},\; 27.3 + 2.56\frac{12.1}{\sqrt{50}}\right].$$

The length of the interval is $2 \times 2.56 \times \frac{12.1}{\sqrt{50}}$.

Question: Construct the CI when the sample mean is again 27.3, but the sample size is now 150.

A summary

$X_1, \dots, X_n$ is a random sample (say heights or weights of $n$ randomly selected individuals).

The distribution of this variable (i.e. height or weight) has mean $\mu$ and variance $\sigma^2$. The original distribution does not have to be normal (i.e. the histogram of heights can differ from the normal curve).

The sample average $\bar{X}$ is random too, and has a distribution. If $n$ is large enough then, regardless of the original distribution, the distribution of the sample average will be close to normal, with mean $\mu$ and variance $\sigma^2/n$.

However, the closer the original distribution is to normal (we may know this from previous experiments etc.), the smaller $n$ needs to be for the approximation to be good. In other words, if the original distribution is far from normal we will need a large sample size (say at least $n = 40$) for the normality result to hold.

Example 2

A social worker is interested in estimating the average time a first-time offender spends outside prison before they re-offend. A random sample of $n = 150$ first-time offenders is considered. Based on this data it is found that the average time they spend away from prison is 3.2 years. The sample standard deviation is 1.1 years. Stating all assumptions, construct a 99% CI for the true average $\mu$.

Solution 2

The sample mean is $\bar{X} = 3.2$. The sample standard deviation is 1.1. The sample size is large, $n = 150$, hence we can assume normality of the sample mean $\bar{X}$.

Moreover, since we have estimated the standard deviation $s = 1.1$ using 150 observations (a relatively large sample), we can assume it is a good estimator of the true population standard deviation $\sigma$. Hence in our calculations we will use $s = 1.1$ in place of the true standard deviation $\sigma$.

Therefore the estimated variance of the sample mean $\bar{X}$ is $1.1^2/150$.

The 99% CI is

$$\left[3.2 - 2.56\frac{1.1}{\sqrt{150}},\; 3.2 + 2.56\frac{1.1}{\sqrt{150}}\right].$$

The length is $2 \times 2.56 \times \frac{1.1}{\sqrt{150}}$.
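The intervals in Solutions 1 and 2 are left as formulas, so a small sketch for evaluating them may be useful. It takes the figures as 27.3 trees per acre with standard deviation 12.1 and n = 50, and 3.2 years with standard deviation 1.1 and n = 150. The function name `z_interval` is made up for this illustration, and the multipliers 1.96 and 2.56 are the ones quoted in the lecture.

```python
from math import sqrt


def z_interval(sample_mean, sd, n, z):
    """Normal-based CI: sample_mean +/- z * sd / sqrt(n).

    z is the multiplier used in the lecture (1.96 for 95%, 2.56 for 99%).
    The exact 99% normal quantile is closer to 2.576; the lecture's value
    is kept here so the numbers match the slides.
    """
    half_width = z * sd / sqrt(n)
    return sample_mean - half_width, sample_mean + half_width


# Example 1: trees per acre, known standard deviation.
trees_95 = z_interval(27.3, 12.1, 50, 1.96)
trees_99 = z_interval(27.3, 12.1, 50, 2.56)

# Example 2: time outside prison, s used in place of sigma.
prison_99 = z_interval(3.2, 1.1, 150, 2.56)

for name, (low, high) in [("trees 95%", trees_95),
                          ("trees 99%", trees_99),
                          ("prison 99%", prison_99)]:
    print(f"{name}: [{low:.3f}, {high:.3f}]")
```

The question about n = 150 in Example 1 can be answered by changing the third argument. Under these inputs the 95% interval for the trees comes out at roughly 23.95 to 30.65.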

Sample size and the confidence interval

Example: $\bar{X} = 10.38$, with the variance $\sigma^2$ known. We compare the 95% confidence intervals for $n = 9$ and $n = 25$:

n = 9:  $\left[\bar{X} - 1.96\frac{\sigma}{\sqrt{9}},\; \bar{X} + 1.96\frac{\sigma}{\sqrt{9}}\right]$

n = 25: $\left[\bar{X} - 1.96\frac{\sigma}{\sqrt{25}},\; \bar{X} + 1.96\frac{\sigma}{\sqrt{25}}\right]$

What are the lengths of the above intervals? For $n = 9$ it is $2 \times 1.96 \times \frac{\sigma}{\sqrt{9}}$. For $n = 25$ it is $2 \times 1.96 \times \frac{\sigma}{\sqrt{25}}$. Observe that the length does not depend on $\bar{X}$; it is the same length regardless of the value of $\bar{X}$.

n = 9:  $\left[10.38 - 1.96\frac{\sigma}{\sqrt{9}},\; 10.38 + 1.96\frac{\sigma}{\sqrt{9}}\right] = [6.63, 14.13]$

n = 25: $\left[10.38 - 1.96\frac{\sigma}{\sqrt{25}},\; 10.38 + 1.96\frac{\sigma}{\sqrt{25}}\right] = [8.12, 12.63]$

We see that the second interval is shorter than the first interval. When the sample size is large we have more information about the population. The larger the sample size, the shorter the confidence interval, because the estimator is in general better.

The population variance $\sigma^2$ and the confidence interval

We see that the variance $\sigma^2$ of one observation also has an effect on the length of the confidence interval.

Example: Suppose $\bar{X} = 10.38$ and $n = 9$.

With $\sigma^2$ as in the previous example: $\left[10.38 - 1.96\frac{\sigma}{\sqrt{9}},\; 10.38 + 1.96\frac{\sigma}{\sqrt{9}}\right] = [6.63, 14.13]$

With $\sigma^2 = 100$: $\left[10.38 - 1.96\frac{\sqrt{100}}{\sqrt{9}},\; 10.38 + 1.96\frac{\sqrt{100}}{\sqrt{9}}\right] = [3.85, 16.91]$

The larger the variance of the random variable $X_i$, the more variability in the sample mean, hence it is unlikely that a short interval will capture the true mean.

Choosing the sample size for estimating $\mu$

How can one determine the number of observations to be included in a sample?

A very large sample size would be nice, but it can often be too costly. A sample size which is too small can contain inadequate information.

We need to develop a compromise between the desired accuracy and the cost of obtaining this accuracy.

How to choose the sample size $n$? This can be a complex issue that often depends on the knowledge of the researcher.
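The two effects described above, shorter intervals for larger $n$ and longer intervals for larger $\sigma$, can be tabulated directly from the length formula $2 \times 1.96 \times \sigma/\sqrt{n}$. The sketch below uses $\sigma = 10$ (that is, $\sigma^2 = 100$, the one variance stated explicitly in the example); the other values of $n$ and $\sigma$ are illustrative choices, and `ci_length` is a made-up helper name.

```python
from math import sqrt


def ci_length(sigma, n, z=1.96):
    """Length of the normal-based CI: 2 * z * sigma / sqrt(n).

    The length does not involve the sample mean at all.
    """
    return 2 * z * sigma / sqrt(n)


# Fix sigma = 10 (sigma^2 = 100) and vary the sample size.
lengths_by_n = {n: ci_length(10.0, n) for n in (9, 25, 100)}

# Fix n = 9 and vary sigma (5 and 20 are illustrative values).
lengths_by_sigma = {sigma: ci_length(sigma, 9) for sigma in (5.0, 10.0, 20.0)}

for n, length in lengths_by_n.items():
    print(f"sigma = 10, n = {n:3d}: length = {length:.2f}")
for sigma, length in lengths_by_sigma.items():
    print(f"n = 9, sigma = {sigma:4.1f}: length = {length:.2f}")
```

Doubling $\sigma$ doubles the length, whereas the sample size has to be quadrupled to halve it, because $n$ enters through its square root.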

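Tolerable error

Often a researcher will choose the sample size $n$ based on the length of the confidence interval. This means she is willing to accept a CI of some preset length $2E$, where $E$ is called the tolerable error. Hence the following interval should have length $2E$:

$$\left[\bar{X} - 1.96\frac{\sigma}{\sqrt{n}},\; \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}\right].$$

For example, suppose we know that $\sigma^2 = 3$ and $2E = 0.5$. Then

$$\left[\bar{X} - 1.96\frac{\sqrt{3}}{\sqrt{n}},\; \bar{X} + 1.96\frac{\sqrt{3}}{\sqrt{n}}\right]$$

should have length 0.5. Hence we need to choose $n$ such that this is satisfied.

Choosing n

The length of the interval is

$$\left(\bar{X} + 1.96\frac{\sqrt{3}}{\sqrt{n}}\right) - \left(\bar{X} - 1.96\frac{\sqrt{3}}{\sqrt{n}}\right) = 2 \times 1.96\frac{\sqrt{3}}{\sqrt{n}}.$$

This should equal 0.5, therefore we solve

$$0.5 = 2 \times 1.96\frac{\sqrt{3}}{\sqrt{n}}.$$

This gives the sample size

$$n = \left(\frac{2 \times 1.96 \times \sqrt{3}}{0.5}\right)^2 = 184.4.$$

Hence the smallest value of $n$ for which the length of the confidence interval is at most 0.5 (equivalently, the tolerable error is $E = 0.25$) is 185.

In general: Length of a confidence interval

At the 95% level the interval is

$$\left[\bar{X} - 1.96\frac{\sigma}{\sqrt{n}},\; \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}\right].$$

The length of the confidence interval is

$$\Big(\bar{X} + \underbrace{1.96\frac{\sigma}{\sqrt{n}}}_{\text{confidence factor}}\Big) - \Big(\bar{X} - \underbrace{1.96\frac{\sigma}{\sqrt{n}}}_{\text{confidence factor}}\Big) = \underbrace{2 \times 1.96\frac{\sigma}{\sqrt{n}}}_{2 \times \text{confidence factor}} = 2E.$$

Therefore we need to choose $n$ such that $2E = 2 \times 1.96\frac{\sigma}{\sqrt{n}}$, that is

$$n = \frac{(1.96)^2\sigma^2}{E^2}.$$

The above was for the 95% CI.

General CIs and tolerable error

What is the length of a 99% CI, and how should we choose $n$ in this case?

In general, if we go for the $(1 - \alpha) \times 100\%$ CI, where $\alpha$ is pre-selected (so that we know $z_{\alpha/2} = 1.96, 2.56$ etc.), then we need to solve

$$2E = 2 z_{\alpha/2}\frac{\sigma}{\sqrt{n}}, \quad\text{that is}\quad n = \frac{z_{\alpha/2}^2\,\sigma^2}{E^2}.$$

We need to pre-select the value of $E$. The smaller $E$ is, the larger our sample size $n$ will have to be. The researcher must decide how much precision s/he requires.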
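The sample-size rule can be wrapped in a small helper. This is a sketch only: `required_sample_size` is a hypothetical name, the result is rounded up because $n$ must be a whole number, and the worked example is read as $\sigma^2 = 3$ with interval length $2E = 0.5$ (so $E = 0.25$).

```python
from math import ceil, sqrt


def required_sample_size(sigma, tolerable_error, z=1.96):
    """Smallest whole n with z * sigma / sqrt(n) <= tolerable_error,
    i.e. n >= (z * sigma / E)^2, rounded up."""
    return ceil((z * sigma / tolerable_error) ** 2)


# Worked example: sigma^2 = 3 and interval length 2E = 0.5, so E = 0.25.
n_95 = required_sample_size(sqrt(3), 0.25)            # 95% level, z = 1.96
n_99 = required_sample_size(sqrt(3), 0.25, z=2.56)    # lecture's 99% multiplier

# Halving the tolerable error roughly quadruples the required sample size.
n_95_half_error = required_sample_size(sqrt(3), 0.125)

print(n_95, n_99, n_95_half_error)
```

The first call reproduces the sample size of 185 obtained above. The last call illustrates why high precision is expensive: $E$ enters the formula squared.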

Some more practice on CIs

The sample mean $\bar{X}$ is known as a point estimator.

The interval $\left[\bar{X} - 1.96\frac{\sigma}{\sqrt{n}},\; \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}\right]$ is known as a 95% confidence interval.

Like the sample mean, the confidence interval is also random. See Figure 5.4 in Ott and Longnecker (page 197) for a good illustration.

Below we construct 95% CIs for the mean using the average of each sample of size 5. The last column is $\left[\bar{X} - 1.96\frac{\sigma}{\sqrt{5}},\; \bar{X} + 1.96\frac{\sigma}{\sqrt{5}}\right]$.

Sample  X_1     X_2     X_3     X_4     X_5     Mean    95% CI
1       4.755   6.174   14.092  16.166  10.741  10.38   [5.350, 15.421]
2       16.376  14.995  11.078  11.4    19.817  14.71   [9.684, 19.755]
3       17.8    3.094   6.174   7.015   16.782  10.08   [5.045, 15.116]
4       14.212  7.3     0.669   3.700   9.077   7.00    [1.963, 12.0]
5       17.845  6.084   10.556  0.693   11.194  9.27    [4.239, 14.310]
6       7.422   2.743   2.489   19.446  18.505  10.12   [5.086, 15.156]
7       6.642   10.475  15.404  2.279   17.077  10.37   [5.340, 15.411]
8       15.380  14.440  13.421  9.4     16.322  13.80   [8.764, 18.835]
9       1.000   10.057  19.754  19.675  5.031   11.10   [6.068, 16.139]
10      15.019  10.182  11.708  11.117  15.131  12.63   [7.596, 17.667]

The population variance is $\mathrm{var}(X_i) = \sigma^2$, hence $\sigma/\sqrt{5} = \sqrt{\sigma^2/5}$.

The true mean is 10. How many intervals contain it?

An illustration

For each sample average we plot the interval:

[Figure: the ten 95% confidence intervals, with a green line marking the population mean]

The green line is where the population mean is. In reality it is unknown.

If we did this plot for 100 different samples, about 95 of the intervals would intersect the population mean.

Aside: Confidence intervals in the general case

Up until now we have looked at 95% (or 99% or 90%) CIs. But it is easy to construct any $100(1 - \alpha)\%$ CI.

If we want a $100(1 - \alpha)\%$ confidence interval, find the $z_{\alpha/2}$ such that $P(Z \ge z_{\alpha/2}) = \alpha/2$ (recall the plot). For example, if we want a 95% interval, then $P(Z \ge 1.96) = 0.025$.

Using the arguments in Lecture 11, this implies

$$P\left(\bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right) \approx 1 - \alpha$$

and leads to the $100(1 - \alpha)\%$ confidence interval

$$\left[\bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\; \bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right].$$
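The general recipe needs the multiplier for an arbitrary $\alpha$, which Python's standard library can supply through `statistics.NormalDist`. The sketch below computes it and then counts how many of the ten intervals from the practice table contain the true mean 10. The ten sample means are read from the table with their decimal points restored, and the common half-width is taken from the first row's interval, roughly [5.350, 15.421]. Both readings are assumptions about the garbled source, and the function name `z_critical` is made up.

```python
from statistics import NormalDist


def z_critical(alpha):
    """Value z with P(Z >= z) = alpha / 2 for a standard normal Z."""
    return NormalDist().inv_cdf(1 - alpha / 2)


z_95 = z_critical(0.05)   # close to the 1.96 used in the lecture
z_99 = z_critical(0.01)   # about 2.576 (the lecture rounds this to 2.56)

# Sample means of the ten samples, as read from the table (assumed decimals).
sample_means = [10.38, 14.71, 10.08, 7.00, 9.27, 10.12, 10.37, 13.80, 11.10, 12.63]

# All ten intervals share one half-width; take it from the first row.
half_width = (15.421 - 5.350) / 2

true_mean = 10.0
covered = sum(abs(m - true_mean) <= half_width for m in sample_means)

print(round(z_95, 3), round(z_99, 3), covered)
```

With only ten intervals it is quite possible for all of them, or for nine of them, to cover the true mean. The 95% figure describes the long-run proportion over many repeated samples.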