Sampling and sampling distribution

Sampling and sampling distribution September 12, 2017 STAT 101 Class 5 Slide 1

Outline of Topics
1. Sampling
2. Sampling distribution of a mean
3. Sampling distribution of a proportion

Statistical Inference
Many economic and social decisions are based on figures from the entire population, e.g., how many homeless people are there? What is the household income?
A census, in which every unit in the population is studied, is the gold standard but very costly.
Statisticians use a representative portion of the population, a sample (a pseudo-population), to solve these problems.
The method of using a sample to study a population is called statistical inference.

Population and sample
Population: the set of all units of interest.
  Finite: the population size $N$ is enumerable.
  Infinite: $N$ is not finite (note that infinite is not the same as continuous as in the definition of random variables).
Sample: any subset of a population. The sample size $n$ can be as small as one unit of the population.
A finite population can be analysed as an infinite population if
(1) $N$ is very big,
(2) $n/N < 0.05$, or
(3) $N$ is small but sampling is carried out with replacement.
We assume an infinite population or a finite population satisfying (1), (2) or (3).

Parameters and statistics
Every problem about a population can be characterised by some summaries called parameters, e.g.,
  the proportion of homeless people
  the mean income
A statistic is the equivalent of a parameter calculated from a sample, e.g.,
  the proportion of homeless people in the sample
  the sample mean income
Parameters are usually unknown whereas statistics are known.
Inferential statistics uses a statistic to infer about a parameter.

Common population quantities and sample counterparts

Parameter                                     Statistic
Probability distribution                      Histogram
(Population) mean, $\mu$                      (Sample) mean, $\bar{X}$
(Population) variance, $\sigma^2$             (Sample) variance, $s^2$
(Population) standard deviation, $\sigma$     (Sample) standard deviation, $s$
(Population) proportion, $p$                  (Sample) proportion, $\hat{p}$

Simple random sample
A simple random sample (SRS) is chosen in such a way that every member of the population has the same probability of being selected.
An SRS is a pseudo-population that mimics the true population; hence we can use the SRS statistic(s) to answer questions about the unknown population parameter(s).
We assume members of our sample are independently drawn from the population, so that each unit in the sample contributes a separate piece of information about the parameter of interest.
There are other sampling schemes, but we focus on SRS here.
Hereafter, we refer to an SRS of independent observations as a sample.

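Sampling from a population
Population: $X_1, X_2, \dots, X_N$
$$\mu = \frac{\sum_{i=1}^{N} X_i}{N}, \qquad \sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$$
Sample: $X_1, X_2, \dots, X_n$
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}, \qquad s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n}$$
Intuition tells us $\bar{X}$ is similar to $\mu$ and $s^2$ is similar to $\sigma^2$.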

Sampling error
Example: sampling with replacement from a finite population

        Population                                            Sample
Units   $X_1, \dots, X_7 = 1, 2, 3, 4, 5, 6, 7$               $X_1, \dots, X_5 = 3, 6, 5, 1, 6$
Size    $N = 7$                                               $n = 5$
Mean    $\mu = \frac{X_1+\dots+X_N}{N} = \frac{1+\dots+7}{7} = 4$    $\bar{X} = \frac{X_1+\dots+X_n}{n} = \frac{3+6+5+1+6}{5} = 4.2$

$\bar{X} - \mu = 4.2 - 4 = 0.2$ is called a sampling error.
Every sample of size $n$ is subject to sampling error because only a subset of the population is used to infer about the whole.
In practice, $\mu$ is unknown and hence the sampling error is also unknown and cannot be estimated.
In the sample, $X_1, \dots, X_5$ are generic symbols for five units randomly selected with replacement from the population; they are not necessarily the first five units in the population.
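
The arithmetic in this example, together with the variance formulas from the previous slide, can be checked with a short Python sketch. The helper names `mean` and `variance` are illustrative choices, not part of the slides, and the variance divides by the number of values because that is how the slides write both $\sigma^2$ and $s^2$.

```python
from fractions import Fraction

population = [1, 2, 3, 4, 5, 6, 7]  # X_1, ..., X_N from the example
sample = [3, 6, 5, 1, 6]            # the five units drawn with replacement

def mean(values):
    # Exact arithmetic avoids floating-point noise in this small example.
    return Fraction(sum(values), len(values))

def variance(values):
    # Divides by the number of values, matching the formulas in these slides.
    m = mean(values)
    return sum((Fraction(v) - m) ** 2 for v in values) / len(values)

mu = mean(population)          # population mean
sigma2 = variance(population)  # population variance
x_bar = mean(sample)           # sample mean
s2 = variance(sample)          # sample variance
sampling_error = x_bar - mu    # computable here only because mu is known

print("mu =", float(mu), "sigma^2 =", float(sigma2))
print("x_bar =", float(x_bar), "s^2 =", float(s2))
print("sampling error =", float(sampling_error))
```

In a real study only `x_bar` and `s2` would be available, since the full population list is not observed.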

Sampling distribution
Every SRS is randomly drawn from the population; hence $\bar{X}$ and its sampling error $\bar{X} - \mu$ are both random, and we cannot make definitive statements about anything random (cf. class 1 slide 10).
We look for the probabilities of different values of $\bar{X}$, i.e., we want to find its distribution.
The distribution of $\bar{X}$ is sometimes called a sampling distribution.
The distribution of $\bar{X}$ also tells us the distribution of its sampling error $\bar{X} - \mu$, since $\mu$ is just a constant even though it is unknown.
The distribution of the sampling error tells us its likely values.

Sampling distribution (2)
Population: 1, 2, 3, 4, 5, 6, 7

Sample   Units            $\bar{X}$   Sampling error
1        3, 6, 5, 1, 6    4.2         $4.2 - \mu$
2        1, 4, 6, 2, 2    3           $3 - \mu$
...
k        4, 5, 6, 1, 7    4.6         $4.6 - \mu$

Finding sampling distribution: Method 1
Different SRSs give an empirical sampling distribution of $\bar{X}$.
Few samples have $\bar{X}$ near 1 or 7: such values appear only if the SRS gives nearly all 1s or all 7s, a rare event.
The highest frequencies of $\bar{X}$ are near the population mean $\mu = 4$: very often $\bar{X}$ is similar to $\mu$ and the sampling error is small.
The distributions of $\bar{X}$ and of the sampling error are identical except that the values are shifted by $\mu$.
This method is not feasible because:
1. Drawing many SRSs is time consuming.
2. Usually the entire population, such as 1, 2, 3, 4, 5, 6, 7, and hence $\mu$, are unknown.
[Figure: Distribution of $\bar{X}$ and of the sampling error. Upper axis: $\bar{X}$, centred at $\mu$; lower axis: sampling error, centred at 0.]
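
Although Method 1 is not feasible in practice, it is easy to imitate on a computer when the population is known. The sketch below is a simulation under stated assumptions: the population 1 to 7 and $n = 5$ come from the slides, while the number of repeated samples (`num_samples`) and the fixed seed are arbitrary choices made only for illustration.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed, only so that the run is repeatable
population = [1, 2, 3, 4, 5, 6, 7]
n = 5
num_samples = 10_000  # hypothetical number of repeated SRSs

means = []
for _ in range(num_samples):
    sample = random.choices(population, k=n)  # sampling with replacement
    means.append(sum(sample) / n)

# Group the sample means by nearest integer and print a rough text histogram.
counts = Counter(round(m) for m in means)
for value in range(1, 8):
    bar = "#" * (counts[value] // 100)
    print(f"x_bar near {value}: {counts[value]:5d} {bar}")
```

The printed counts should pile up around 4 and thin out towards 1 and 7, which is the shape described on the slide.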

Finding sampling distribution: Method 2
Possible sampling errors = (possible values of $\bar{X}$) $- \mu$.
[Figure: possible values of $\bar{X}$ around $\mu$, and the corresponding sampling errors around 0.]
By assuming a sufficiently large sample of $n$ observations, the Central Limit Theorem (CLT; P.-S. Laplace, 1810, 1811) shows that, when using $\bar{X}$ to estimate $\mu$ of any population, the sampling distribution of $\bar{X}$ (and of its sampling error) is approximately normal:
$$\bar{X} \sim \text{Normal}(\mu, \operatorname{var}(\bar{X})) \quad \text{and} \quad \bar{X} - \mu \sim \text{Normal}(0, \operatorname{var}(\bar{X} - \mu))$$
Here $\bar{X}$ is the statistic and $\bar{X} - \mu$ is the sampling error; each Normal is the corresponding sampling distribution, and each variance is the corresponding sampling variation.
We do not know exactly where our sampling error lies among the possible values marked in red. However, using the empirical rule, we can be about 95% certain that it lies within $0 \pm 2\sqrt{\operatorname{var}(\bar{X} - \mu)}$.
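
The 95% statement can be illustrated by simulation. The sketch below assumes the population 1 to 7 from the earlier slides, a sample size of $n = 30$ (an assumed stand-in for "sufficiently large", not a value from the slides) and an arbitrary number of repeated samples. It estimates the standard deviation of the sampling error from the simulated errors themselves and counts how often an error falls within two standard deviations of 0.

```python
import random
import statistics

random.seed(0)  # fixed seed, only so that the run is repeatable
population = [1, 2, 3, 4, 5, 6, 7]
mu = sum(population) / len(population)  # known here only because we simulate
n = 30                # assumed "sufficiently large" sample size
num_samples = 20_000  # hypothetical number of repeated samples

errors = []
for _ in range(num_samples):
    sample = random.choices(population, k=n)  # sampling with replacement
    errors.append(sum(sample) / n - mu)       # sampling error of this sample

# Square root of the sampling variation, estimated from the simulated errors.
sd_error = statistics.pstdev(errors)

within = sum(1 for e in errors if abs(e) <= 2 * sd_error)
coverage = within / num_samples
print(f"sd of sampling error: {sd_error:.3f}")
print(f"share of errors within 0 +/- 2 sd: {coverage:.3f}")
```

The share should come out close to 0.95, though not exactly, since the normal distribution is only an approximation here.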

Sampling variation

Sample   Units                            $\bar{X}$                     Sampling error
1        3, 6, 5, 1, 6                    4.2                           $4.2 - \mu$
2        1, 4, 6, 2, 2                    3                             $3 - \mu$
...
k        4, 5, 6, 1, 7                    4.6                           $4.6 - \mu$
Any      $X_1, X_2, X_3, X_4, X_5$        $\frac{X_1+\dots+X_5}{5}$     $\frac{X_1+\dots+X_5}{5} - \mu$

1. Sampling variation, $\operatorname{var}(\bar{X})$, tells us that if we wish to estimate $\mu$ using $\bar{X}$, different samples may give different estimates and different sampling errors.
2. $\bar{X} = \frac{X_1+\dots+X_5}{5}$, so $\operatorname{var}(\bar{X})$ is due to $\operatorname{var}(X_1), \dots, \operatorname{var}(X_5)$.
3. Since $X_1, X_2, \dots, X_5$ are randomly drawn from the population, they must have the same behaviour as any $X$ randomly drawn from the population, i.e., $\operatorname{var}(X_1) = \operatorname{var}(X_2) = \dots = \operatorname{var}(X_5) = \operatorname{var}(X)$.

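Sampling variation (2)
$$\begin{aligned}
\operatorname{var}(\bar{X}) &= \operatorname{var}\left(\frac{X_1 + \dots + X_n}{n}\right) \\
&= \frac{1}{n^2} \operatorname{var}(X_1 + \dots + X_n) \\
&= \frac{1}{n^2} \left[\operatorname{var}(X_1) + \dots + \operatorname{var}(X_n)\right] && \text{($X_1, \dots, X_n$ are independent)} \\
&= \frac{1}{n^2} \, n \operatorname{var}(X) && \text{($\operatorname{var}(X_1) = \dots = \operatorname{var}(X_n) = \operatorname{var}(X)$)} \\
&= \frac{\operatorname{var}(X)}{n} && \text{(depends on $\operatorname{var}(X)$ and $n$)}
\end{aligned}$$
$$\operatorname{var}(\text{sampling error}) = \operatorname{var}(\bar{X} - \mu) = \operatorname{var}(\bar{X}) \qquad \text{($\mu$ is a constant)}$$
Sampling variation depends on
(1) $\operatorname{var}(X)$, how different the values of $X$ are in the population, and
(2) $n$, the sample size.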
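
For the small population used in these slides, the result $\operatorname{var}(\bar{X}) = \operatorname{var}(X)/n$ can be verified exactly rather than by simulation. The sketch below lists all $7^5 = 16807$ equally likely ordered samples of size 5 drawn with replacement from 1 to 7 and computes the variance of their means with exact fractions. The variable names are illustrative only.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3, 4, 5, 6, 7]
N = len(population)
n = 5

mu = Fraction(sum(population), N)
var_x = sum((x - mu) ** 2 for x in population) / N  # var(X) in the population

# Every ordered sample of size n drawn with replacement is equally likely.
sample_means = [Fraction(sum(s), n) for s in product(population, repeat=n)]

mean_of_means = sum(sample_means) / len(sample_means)
var_x_bar = sum((m - mu) ** 2 for m in sample_means) / len(sample_means)

print("var(X)      =", var_x)
print("var(X_bar)  =", var_x_bar)
print("var(X) / n  =", var_x / n)
```

The last two printed values agree, and `mean_of_means` equals `mu`, so the sampling error has mean 0 and the same variance as $\bar{X}$.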

Why does sampling variation matter?
[Figure: two sampling distributions of the sampling error, each centred at 0.]
Large sampling variation: our sampling error is among the marked values, which are widely spread, and so may be large.
Small sampling variation: our sampling error is among the marked values, which are close to 0, and so is unlikely to be large.

What is a proportion?
Example
We wish to estimate the proportion, $p$, of homeless people in a population of $N$ individuals. Let $X$ indicate whether someone is homeless:
$$X = \begin{cases} 1 & \text{homeless} \\ 0 & \text{not homeless} \end{cases}$$
Suppose the values of $X$ in the population are $X_1 = 1$ (homeless), $X_2 = 0$ (not homeless), $X_3 = 0$, ..., $X_N = 1$, which is a collection of 1s and 0s. Then
$$p = \frac{\#\,1\text{s}}{N} = \frac{1 + 0 + 0 + \dots + 1}{N} = \frac{X_1 + X_2 + X_3 + \dots + X_N}{N} = \mu$$
Hence a proportion is a special case of $\mu$ with only 1s and 0s.

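Sampling to estimate a proportion
Example (cont'd)
We take a sample $X_1, \dots, X_n$ and estimate $p = \mu$ using
$$\bar{X} = \hat{p} = \frac{X_1 + \dots + X_n}{n}$$
Each of $X_1, \dots, X_n$ is
$$X_i = \begin{cases} 1 & \text{with probability } p \\ 0 & \text{with probability } 1 - p \end{cases}$$
We use the CLT for $\bar{X}$, i.e., $\bar{X} \sim N\left(\mu, \frac{\operatorname{var}(X)}{n}\right)$, where $\frac{\operatorname{var}(X)}{n} = \operatorname{var}(\bar{X})$.
$$\operatorname{var}(X) = E(X^2) - E(X)^2 = (1)^2 p + (0)^2 (1 - p) - p^2 = p - p^2 = p(1 - p)$$
Here $E(X)^2 = \mu^2 = p^2$.
Hence the CLT for $\hat{p}$ is $\hat{p} \sim N\left(p, \frac{p(1 - p)}{n}\right)$.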
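
A simulation can illustrate the mean and variance in this result. The sketch below is not from the slides: the true proportion `p`, the sample size `n` and the number of repetitions `reps` are hypothetical values chosen for illustration. It draws many samples of 0/1 indicators, computes $\hat{p}$ for each, and compares the spread of the $\hat{p}$ values with $p(1-p)/n$.

```python
import random
import statistics

random.seed(1)  # fixed seed, only so that the run is repeatable
p = 0.1         # hypothetical true proportion
n = 200         # hypothetical sample size
reps = 5_000    # hypothetical number of repeated samples

p_hats = []
for _ in range(reps):
    # Each unit is 1 with probability p and 0 with probability 1 - p.
    sample = [1 if random.random() < p else 0 for _ in range(n)]
    p_hats.append(sum(sample) / n)  # p_hat is the sample mean of the 0/1 values

mean_p_hat = statistics.mean(p_hats)
var_p_hat = statistics.pvariance(p_hats)
clt_var = p * (1 - p) / n  # sampling variation given by the CLT result

print(f"mean of p_hat:     {mean_p_hat:.4f} (p = {p})")
print(f"variance of p_hat: {var_p_hat:.6f} (p(1-p)/n = {clt_var:.6f})")
```

The simulated mean and variance should be close to $p$ and $p(1-p)/n$; they will not match exactly because only a finite number of samples is drawn.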