STAT 503 Sampling Distribution and Statistical Estimation

Simple Random Sampling

Slide 1

Simple random sampling selects with equal chance from the (available) members of a population. The resulting sample is a simple random sample.

Consider an urn containing N balls with numbers x_i written on them, and draw n balls from the urn.
Equal chance for each of the C(N,n) possible samples w/o replacement: finite population.
Equal chance for each of the N^n possible samples w/ replacement: infinite population.
The x_i's need not all be different. One can let N → ∞, but then one cannot sample w/o replacement.

A sample from an infinite population consists of independent, identically distributed (i.i.d.) observations.

Sampling Distribution

Slide 2

The sampling distribution describes the behavior of sample statistics such as X̄ and s².

In a certain human population, 30% of the individuals have superior distance vision (20/15 or better). Consider the sample proportion with superior vision, p̂ = X/n, where X is the number of people in the sample with superior vision. Find the sampling distribution of p̂ for n = 20.

Clearly X = 20p̂ ∼ Bin(20, .3), so the possible values of p̂ are {0, .05, .10, ..., .95, 1}, with probabilities

  P(p̂ = x) = C(20, 20x) (.3)^{20x} (.7)^{20(1−x)}.

For example, P(p̂ = .3) = C(20, 6)(.3)^6(.7)^14 = .192. In R, use dbinom(0:20,20,.3).
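The binomial calculation above can be reproduced in R; this is a sketch restating the slide's computation, with variable names of our choosing:

```r
## Sampling distribution of p-hat when n = 20, p = .3:
## p-hat takes values 0, .05, ..., 1 with Bin(20,.3) probabilities.
p.hat <- (0:20)/20
probs  <- dbinom(0:20, 20, .3)
round(rbind(p.hat, probs), 3)     ## the whole sampling distribution
dbinom(6, 20, .3)                 ## P(p-hat = .3) = .192
```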
Sampling Distribution, Sampling

Slide 3

The sampling distribution of the total of 2 rolls of a fair die:

  x   2     3     4     5     6     7     8     9     10    11    12
  p   1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36

The sampling distribution of the total of 4 rolls of a fair die:

x <- outer(1:6,1:6,"+")                      ## 6x6 matrix of 2-roll totals
dist2 <- table(x); dist2/sum(dist2)          ## total of 2 rolls
xx <- outer(x,x,"+")                         ## 4-way array of 4-roll totals
dist4 <- table(xx); dist4/sum(dist4)         ## total of 4 rolls

Sampling from a finite collection of numbers:

sample(1:200,10,replace=FALSE)               ## w/o replacement, the default
sample(1:6,33,replace=TRUE)                  ## with replacement
sample(0:3,30,replace=TRUE,prob=c(1,3,3,1))  ## same as rbinom(30,3,.5)

Simulating Sampling Distributions

Slide 4

When analytical derivation is cumbersome or infeasible, one may use simulation to obtain the sampling distribution.

Example: the sample median of 17 r.v.'s from N(0,1).

x <- matrix(rnorm(170000),17,10000)          ## 10000 samples of size 17
md <- apply(x,2,median)
hist(md,nclass=50); plot(density(md))

Example: the largest of 5 Poisson counts from Poisson(3.3).

x <- matrix(rpois(50000,3.3),5,10000)        ## 10000 samples of size 5
mx <- apply(x,2,max)
table(mx); table(mx)/10000
ppois(1:13,3.3)^5-ppois(0:12,3.3)^5          ## exact P(max = k), k = 1,...,13
Sampling Distribution of X̄

Slide 5

Use upper case X̄ to denote the sample mean as a r.v. For an infinite population with mean µ and standard deviation σ,

  µ_X̄ = µ  and  σ_X̄ = σ/√n.

The heights of male students on a large university campus have µ = 68 and σ = 5. Consider X̄ with sample size n = 25:

  µ_X̄ = µ = 68,  σ_X̄ = σ/√n = 5/√25 = 1.

X̄ is more concentrated around µ. To double the accuracy of X̄, one needs to quadruple the sample size n. For a finite population, σ_X̄ = (σ/√n) √((N−n)/(N−1)).

Central Limit Theorem

Slide 6

Consider an infinite population with mean µ and standard deviation σ. For n large,

  P((X̄ − µ_X̄)/σ_X̄ ≤ z) ≈ Φ(z).

The shape of the sampling distribution approaches normal as n → ∞. Usually n ≥ 30 is sufficiently large for the CLT to kick in. For a normal population, X̄ is always normal, regardless of n.

Given µ = 68 and σ = 5, find the probability for the average height of n = 25 students to exceed 70:

  P(X̄ > 70) = P((X̄ − µ_X̄)/σ_X̄ > (70 − 68)/1) ≈ 1 − Φ(2) = .0228.
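The CLT probability above is easy to check in R; the simulation half of this sketch (seed and replication count are our choices) simply confirms the normal calculation:

```r
## P(Xbar > 70) for mu = 68, sigma = 5, n = 25 via the CLT.
mu <- 68; sigma <- 5; n <- 25
se <- sigma/sqrt(n)                  ## sigma of Xbar = 1
1 - pnorm((70 - mu)/se)              ## 1 - Phi(2) = .0228
## cross-check by simulating 10000 sample means
set.seed(503)
xbar <- colMeans(matrix(rnorm(n*10000, mu, sigma), n, 10000))
mean(xbar > 70)                      ## should be close to .0228
```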
Effects of Sample Size

Slide 7

[Figure: sampling distributions for n = 1, 4, 16; the spread shrinks as n grows.]

Normal Approximation of Binomial

Slide 8

Recall that if X ∼ Bin(n,p), then X = Σ_{i=1}^n X_i, where X_i ∼ Bin(1,p). By the Central Limit Theorem,

  P((X − np)/√(np(1−p)) ≤ z) = P((X/n − p)/√(p(1−p)/n) ≤ z) ≈ Φ(z).

Consider X ∼ Bin(25, .3), with np = 7.5 and √(np(1−p)) = 2.291.

  P(X ≤ 8) = .6769;  approximation: P(X ≤ 8.5) ≈ Φ((8.5 − 7.5)/2.291) = .6687.
  P(X = 8) = .1651;  approximation: P(7.5 ≤ X ≤ 8.5) ≈ Φ((8.5 − 7.5)/2.291) − Φ(0) = .1687.

When a continuous distribution is used to approximate a discrete one, continuity correction is needed to preserve accuracy. For np, n(1−p) ≥ 5, the approximation is reasonably accurate.
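The exact and approximate values above can be reproduced in R; a sketch restating the slide's arithmetic:

```r
## Exact Bin(25,.3) probabilities vs the normal approximation
## with continuity correction.
n <- 25; p <- .3
m <- n*p; s <- sqrt(n*p*(1 - p))          ## 7.5 and 2.291
pbinom(8, n, p)                           ## exact: .6769
pnorm((8.5 - m)/s)                        ## approx: .6687
dbinom(8, n, p)                           ## exact: .1651
pnorm((8.5 - m)/s) - pnorm((7.5 - m)/s)   ## approx: .1687
```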
Basic Structure of Inference

Slide 9

Statistical inference makes educated guesses about the population based on information from the sample. All guesses are prone to error, and the quantification of imprecision is an important part of statistical inference.

1. Estimation estimates the state of the population, which is typically characterized by some parameter, say θ.
2. Hypothesis testing chooses from among postulated states of the population, such as H_0: θ = θ_0 versus H_a: θ ≠ θ_0, where θ_0 is a known number.

Examples of Estimation and Testing

Slide 10

A plant physiologist grew 13 soybean seedlings of the type Wells II. She measured the total stem length (cm) for each plant after 16 days of growth, and got x̄ = 21.34 and s = 1.22. She may estimate the average stem length by a point estimate, µ ≈ 21.34, or by an interval estimate, 18.68 < µ < 24.00.

As reported by the AMA, 16 out of every 100 doctors in any given year are subject to malpractice claims. A hospital of 300 physicians received claims against 58 of their doctors in one year. Was the hospital simply unlucky? Or does the number possibly indicate some systematic wrongdoing at the hospital? The number 58/300 is within chance variation of θ_0 = .16.
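The slide only states the conclusion for the malpractice example; one rough way to see it (a sketch of the chance-variation calculation, not the method developed later in the course) is to compute the binomial tail probability and the z-score of the observed proportion:

```r
## Is 58 claims among 300 doctors unusual when the base rate is .16?
n <- 300; p0 <- .16
1 - pbinom(57, n, p0)                 ## P(X >= 58) under the AMA rate; not small
(58/n - p0)/sqrt(p0*(1 - p0)/n)       ## z-score, about 1.57
```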
Estimating Population Mean

Slide 11

Observing X_1, ..., X_n from a population with mean µ and variance σ², one is to estimate µ. The procedure (or formula) one uses is called an estimator, which yields an estimate after the data are plugged in.

Observing X_1, ..., X_5, one may use one of the following point estimators for µ; observing 5.1, 5.1, 5.3, 5.2, 5.2, they yield the corresponding point estimates:

  µ̂_1 = X̄                    x̄ = 5.18
  µ̂_2 = X_1                   x_1 = 5.1
  µ̂_3 = (X_1 + X_3)/2         (x_1 + x_3)/2 = 5.2
  µ̂_4 = X̃ (sample median)     x̃ = 5.2
  µ̂_5 = µ_0 (a fixed number)   5

Properties of Point Estimators

Slide 12

To choose among all possible estimators, one compares properties of the estimators.
Unbiasedness: µ_θ̂ = θ.
Small SD: σ_θ̂.

µ̂_1, µ̂_2, and µ̂_3 are all unbiased: µ_X̄ = µ_X_1 = µ_(X_1+X_3)/2 = µ. Their variances differ: σ²_X̄ = σ²/5, σ²_X_1 = σ², σ²_(X_1+X_3)/2 = σ²/2.

A better estimator yields better estimates on average. A better estimator may not always yield a better estimate.
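The "better on average" point can be illustrated by simulation; this sketch uses a made-up population N(5.2, .1²) and our own variable names:

```r
## The unbiased estimators all center at mu, but Xbar has the smallest spread.
set.seed(503)
x <- matrix(rnorm(5*10000, mean = 5.2, sd = .1), 5, 10000)  ## 10000 samples of size 5
mu1 <- colMeans(x)                 ## Xbar
mu2 <- x[1, ]                      ## X1
mu3 <- (x[1, ] + x[3, ])/2         ## (X1 + X3)/2
round(c(mean(mu1), mean(mu2), mean(mu3)), 2)  ## all near 5.2: unbiased
round(c(sd(mu1), sd(mu2), sd(mu3)), 3)        ## near .1/sqrt(5), .1, .1/sqrt(2)
```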
Sample Mean as Estimator of Population Mean

Slide 13

One usually uses the sample mean x̄ to estimate the population mean µ, as X̄ has the smallest standard deviation among all unbiased estimators of µ.

To quantify the imprecision of the estimation of µ by x̄, one estimates σ_X̄ = σ/√n by s/√n, the standard error of the sample mean. Do not confuse σ_X̄ with σ_X.

Soybean stem length: n = 13, x̄ = 21.34, and s = 1.22, so

  σ̂_X̄ = s/√n = 1.22/√13 = .338.

For X̄ nearly normal, X̄ lies within ±2σ/√n of µ about 95% of the time.

Confidence Intervals

Slide 14

A point estimate will almost surely miss the target, although its standard error indicates by how far the miss is likely to be. An interval estimate provides a range for the parameter estimate.

Soybean stem length: Assume normality with σ = 1.2 known. One has X̄ ∼ N(µ, (1.2)²/13), so

  P(|X̄ − µ|/(1.2/√13) ≤ 1.96) = .95.

Solving for µ, one obtains

  X̄ − 1.96(1.2/√13) ≤ µ ≤ X̄ + 1.96(1.2/√13).

For X_i ∼ N(µ, σ²), i = 1, ..., n, with σ² known, X̄ ± z_{1−α/2} σ/√n provides an interval estimator that covers µ with probability (1 − α). It yields a (1 − α)100% confidence interval for µ, with a confidence coefficient (1 − α)100%.
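The z-interval for the soybean data can be computed directly; a sketch restating the slide's formula:

```r
## 95% z-interval for soybean stem length, assuming sigma = 1.2 known.
n <- 13; xbar <- 21.34; sigma <- 1.2
xbar + c(-1, 1)*qnorm(.975)*sigma/sqrt(n)   ## about (20.69, 21.99)
```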
Coverage, Large Sample CIs

Slide 15

As an estimator, a CI is a moving bracket chasing a fixed target. As an estimate, a CI may or may not cover the truth.

With a large sample from an arbitrary distribution and σ unknown, a confidence interval for µ with an approximate confidence coefficient (1 − α)100% is given by X̄ ± z_{1−α/2} s/√n. Normality comes from the CLT; the unknown σ is estimated by s. Replace s by σ if known.

Small Sample CIs Based on t-Distribution

Slide 16

For a small sample with σ unknown, one has to assume normality.

Consider Z_i ∼ N(0,1), i = 1, ..., n. The distribution of Z̄/(s/√n), with s the sample standard deviation of the Z_i's, is called a t-distribution with a degree of freedom (df) ν = n − 1. A t-distribution with ν = ∞ reduces to N(0,1), and t_{1−α,ν} ↓ as ν ↑.

[Figure: t densities for df = 1, 10, 100, approaching the N(0,1) density.]

For X_i ∼ N(µ, σ²), i = 1, ..., n,

  P(|X̄ − µ|/(s/√n) ≤ t_{1−α/2,n−1}) = 1 − α,

so X̄ ± t_{1−α/2,n−1} s/√n provides a (1 − α)100% CI for µ. For σ known, use z_{1−α/2} and σ. Table C.4 lists t_{1−α,ν}, but the notation in the text drops (α, ν).
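The convergence of t quantiles to the normal quantile, and the resulting t-interval for the soybean data, can be seen in R; a sketch:

```r
## t quantiles shrink toward the normal quantile as df grows.
round(qt(.975, c(1, 10, 100)), 3)   ## 12.706, 2.228, 1.984
round(qnorm(.975), 3)               ## 1.96
## t-interval for the soybean data (n = 13, xbar = 21.34, s = 1.22):
21.34 + c(-1, 1)*qt(.975, 12)*1.22/sqrt(13)
```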
Confidence Intervals for µ: Summary

Slide 17

An agronomist measured stem diameter (mm) in 8 plants of a variety of wheat, and calculated x̄ = 2.275 and s = .2375. Assuming normality, a 95% CI for µ is given by 2.275 ± 2.365(.2375)/√8, or (2.076, 2.474), where t_{.975,7} = 2.365. If one further knows that σ = .25, then one can use 2.275 ± 1.96(.25)/√8, or (2.102, 2.448).

In the ideal situation with normality and known σ, always use X̄ ± z_{1−α/2} σ/√n. With a small normal sample but unknown σ, estimate σ by s and replace z_{1−α/2} by t_{1−α/2,n−1} to allow for the extra uncertainty. When n is large, the CLT grants normality of X̄, s estimates σ reliably, and z_{1−α/2} ≈ t_{1−α/2,n−1}.

Coverage versus Precision

Slide 18

To cover the truth more often, one needs a higher confidence coefficient, but at the expense of wider intervals. The interval (−∞, ∞) has 100% coverage but is useless; a point estimate is the most precise but always misses.

Given sample size n, X̄ ± z_{1−α/2} σ/√n is the shortest interval estimate for µ among all that have a confidence coefficient (1 − α)100%. To achieve both coverage and precision, one has to take a large enough sample.
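Both wheat intervals from Slide 17 can be checked in R; a sketch restating the arithmetic:

```r
## Wheat stem diameter: n = 8, xbar = 2.275, s = .2375.
n <- 8; xbar <- 2.275; s <- .2375
qt(.975, n - 1)                            ## 2.365
xbar + c(-1, 1)*qt(.975, n - 1)*s/sqrt(n)  ## (2.076, 2.474)
## with sigma = .25 known:
xbar + c(-1, 1)*qnorm(.975)*.25/sqrt(n)    ## (2.102, 2.448)
```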
Planning Sample Size

Slide 19

The agronomist is planning a new study of wheat stem diameter, and wants a 95% CI for µ no wider than .2 mm. From experience and a pilot study, he believes that σ = .25 is about right. The half-width of the CI is z_{.975} σ/√n = 1.96(.25)/√n. Solving 1.96(.25)/√n ≤ .1 for n, one gets n ≥ 24.01, so n = 25 suffices.

Let h be the desired half-width for a (1 − α)100% CI. Solving z_{1−α/2} σ/√n ≤ h for n, one has

  n ≥ (z_{1−α/2} σ/h)².

One needs a conservative estimate of σ, and z_{1−α/2} ≈ t_{1−α/2,n−1} for large n. To cut the width by half, one needs to quadruple the sample size n.

CI for Population Proportion

Slide 20

Consider X_i ∼ Bin(1, p), i = 1, ..., n, independent. One has X = Σ_i X_i ∼ Bin(n, p), and for n large, by the CLT,

  P((X/n − p)/√(p(1−p)/n) ≤ z) ≈ Φ(z).

The sample proportion p̂ = X/n is actually an X̄. As an estimate of σ_p̂ = √(p(1−p)/n) one may use √(p̂(1−p̂)/n). A (1 − α)100% CI for p is thus p̂ ± z_{1−α/2} √(p̂(1−p̂)/n).

Example: 123 adult female deer were captured and 97 found to be pregnant. Construct a 95% CI for the pregnant proportion in the population. Since p̂ = 97/123 = .7886 and σ̂_p̂ = √(.7886(1−.7886)/123) = .0368, the 95% CI is given by .7886 ± 1.96(.0368), or (.7165, .8607).

Since σ = √(p(1−p)) ≤ 0.5, for a 95% CI with half-width h ≤ 3% it is safe to have n ≥ (1.96(0.5)/0.03)² = 1067.1.
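The sample-size and proportion calculations above fit in a few lines of R; a sketch, with ceiling() rounding the required n up to a whole observation:

```r
## Sample-size planning and the deer proportion interval.
ceiling((qnorm(.975)*.25/.1)^2)        ## 25: smallest n for half-width .1
p.hat <- 97/123                        ## .7886
se <- sqrt(p.hat*(1 - p.hat)/123)      ## .0368
p.hat + c(-1, 1)*qnorm(.975)*se        ## about (.7165, .8608)
(qnorm(.975)*0.5/0.03)^2               ## 1067.1, conservative n for h = 3%
```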
Confidence Interval for σ²

Slide 21

For X_1, ..., X_n from a population with variance σ², the sample variance s² = Σ_i (X_i − X̄)²/(n − 1) is an unbiased estimate of σ².

With Z_i ∼ N(0,1), i = 1, ..., n, Σ_i (Z_i − Z̄)² follows a χ²-distribution with a degree of freedom (df) ν = n − 1. For X_i ∼ N(µ, σ²), i = 1, ..., n, (n − 1)s²/σ² follows χ²_{n−1}.

[Figure: χ²_ν densities for df = 5, 25.]

For X_i ∼ N(µ, σ²), i = 1, ..., n,

  P(χ²_{.025,n−1} < (n − 1)s²/σ² < χ²_{.975,n−1}) = 0.95.

Solving for σ², a 95% CI is

  (n − 1)s²/χ²_{.975,n−1} < σ² < (n − 1)s²/χ²_{.025,n−1}.

For n = 13 and s = .2375, χ²_{.025,12} = 4.4038 and χ²_{.975,12} = 23.337, so a 95% CI for σ² is

  12(.2375)²/23.337 < σ² < 12(.2375)²/4.4038,

or, taking square roots, 0.1703 < σ < 0.3920.

Simulations of Coverage, Robustness

Slide 22

## generate data and set parameters
## n <- 10; x <- matrix(rnorm(10000*n),ncol=10000)
## mu <- 0; sig <- 1                     ## N(0,1)
n <- 30; x <- matrix(runif(10000*n),ncol=10000)
mu <- .5; sig <- sqrt(1/12)              ## U(0,1)
## calculate CIs and coverage
mn <- apply(x,2,mean); v <- apply(x,2,var)
hwd <- qnorm(.975)*sig/sqrt(n); lcl<-mn-hwd; ucl<-mn+hwd
mean((lcl<mu)&(ucl>mu))                  ## z-interval for mu
hwd <- qt(.975,n-1)*sqrt(v/n); lcl<-mn-hwd; ucl<-mn+hwd
mean((lcl<mu)&(ucl>mu))                  ## t-interval for mu
lcl <- sqrt(v*(n-1)/qchisq(.975,n-1))
ucl <- sqrt(v*(n-1)/qchisq(.025,n-1))
mean((lcl<sig)&(ucl>sig))                ## chisq-interval for sig
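The χ² interval from Slide 21 can also be checked numerically; a sketch restating that computation:

```r
## 95% chi-square interval for sigma, with n = 13, s = .2375.
n <- 13; s <- .2375
qchisq(c(.025, .975), n - 1)                     ## 4.4038 and 23.337
lims <- (n - 1)*s^2/qchisq(c(.975, .025), n - 1) ## CI for sigma^2
sqrt(lims)                                       ## about (.1703, .3920)
```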