Principles of Biostatistics
Chapter 7: Sampling Distributions (continued)
John Koval
Department of Epidemiology and Biostatistics
University of Western Ontario

Next we want to look at histograms of sample statistics (sample mean, median, sample variance, sample standard deviation) to see what their distributions look like.

Sample mean of Bernoullis
Consider a sample of 10 observations from a Bernoulli distribution, that is, the 10 responses to the question "Do you smoke?", where Yes is valued as 1 and No is valued as 0.
What are we interested in?

Random variables - some math
Let us call $X_1$ a random variable which measures the response (0 or 1) of the first person, $X_2$ the response of the second person, and so on, up to $X_{10}$, the response of the 10th person.
Let $Y$ be the sum of the responses of all ten subjects.
Then $P$, the sample proportion, is the average (sample mean) of all ten responses, that is,
$P = \frac{Y}{n} = \frac{\sum_{i=1}^{10} X_i}{n} = \frac{0+1+1+\dots+0}{10}$
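The calculation of the sample proportion can be sketched in a few lines of Python. The ten responses below are made up for illustration; only the coding (1 = Yes, 0 = No) comes from the slide.

```python
# Hypothetical responses to "Do you smoke?" (1 = Yes, 0 = No); values are made up.
responses = [0, 1, 1, 0, 0, 0, 0, 0, 0, 0]

n = len(responses)   # sample size (10)
y = sum(responses)   # Y: number of "Yes" answers
p = y / n            # P: sample proportion = sample mean of the X_i

print(f"Y = {y}, P = {p}")
```

Because every response is 0 or 1, the sum counts the smokers and the mean is the proportion of smokers in the sample.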

Distribution of a sample mean of Bernoullis
Remember that Y is the sum of 10 Bernoullis. What is the distribution of Y (which can be thought of as the number of successes in a sample of size 10)?
Binomial(10,0.2), where π = 0.2 is the population proportion of smokers, or the probability of picking a smoker at random.
Hence the distribution of the sample proportion is that of a multiple of the binomial distribution; that is, its plot has the same bars as the binomial, except that the x-axis is marked in proportions rather than integers.

Binomial Distribution B(10,0.2)

  x    Pr(X=x)
  0    0.10737
  1    0.26844
  2    0.30199
  3    0.20133
  4    0.08808
  5    0.02642
  6    0.00551
  7    0.00079
  8    0.00007
  9    0.00000
 10    0.00000

[Figure: bar chart of the Bin(10,0.2) distribution; y-axis: Probability (0.00 to 0.30); x-axis: 0 to 10]

Distribution of proportion

  p     Pr(P=p)
  0.0   0.10737
  0.1   0.26844
  0.2   0.30199
  0.3   0.20133
  0.4   0.08808
  0.5   0.02642
  0.6   0.00551
  0.7   0.00079
  0.8   0.00007
  0.9   0.00000
  1.0   0.00000
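The probabilities for the count and for the proportion can be reproduced with a short Python sketch. The function name `binomial_pmf` is an illustrative choice, not something from the slides; the values n = 10 and π = 0.2 are those used above.

```python
from math import comb

def binomial_pmf(k, n, pi):
    """Pr(Y = k) for Y ~ Binomial(n, pi)."""
    return comb(n, k) * pi**k * (1 - pi)**(n - k)

n, pi = 10, 0.2

# Each count k corresponds to the proportion k/n and carries the same probability.
table = [(k, k / n, binomial_pmf(k, n, pi)) for k in range(n + 1)]

for k, prop, prob in table:
    print(f"{k:2d}  {prop:.1f}  {prob:.5f}")
```

The first column is the number of smokers, the second the corresponding sample proportion, and the third the probability shared by both.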

[Figure: bar chart of the distribution of the proportion of 10 Bern(0.2)s; y-axis: Probability (0.00 to 0.30); x-axis: 0.0 to 1.0]

Distribution of proportions
If the proportion is the average of a number of Bernoulli random variables, its distribution is exactly a multiple of a Binomial.
Hence we can always plot its distribution and calculate probabilities.
From a previous lecture, we know that for large sample size, n, and nπ > 5, the binomial distribution can be approximated by a Normal distribution.
Similarly, for large sample size, n, and nπ > 5, the distribution of the proportion can be approximated by a multiple of a Normal distribution.
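As a rough check of the Normal approximation, the sketch below compares an exact binomial probability with its Normal counterpart. The values n = 100 and the cut-off of 25 smokers are illustrative assumptions (chosen so that nπ = 20 > 5), and the 0.5 continuity correction is an addition that the slides do not mention.

```python
from math import comb, sqrt
from statistics import NormalDist

def exact_cdf(k, n, pi):
    """Exact Pr(Y <= k) = Pr(P <= k/n) for Y ~ Binomial(n, pi)."""
    return sum(comb(n, j) * pi**j * (1 - pi)**(n - j) for j in range(k + 1))

def normal_cdf_approx(k, n, pi):
    """Normal approximation to Pr(P <= k/n), with P roughly N(pi, pi(1-pi)/n)."""
    approx = NormalDist(mu=pi, sigma=sqrt(pi * (1 - pi) / n))
    # The +0.5 continuity correction is an assumption, not part of the slides.
    return approx.cdf((k + 0.5) / n)

n, pi, k = 100, 0.2, 25   # illustrative values with n * pi = 20 > 5
exact = exact_cdf(k, n, pi)
approx = normal_cdf_approx(k, n, pi)
print(f"exact = {exact:.4f}, normal approximation = {approx:.4f}")
```

Rerunning with n = 10 (so that nπ = 2) gives a feel for how the approximation behaves when the condition nπ > 5 is not met.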

Sample means from other distributions
The easy stuff ends here.
If the data of which we are calculating sample means come from more complicated distributions, we cannot get the distribution of the sample mean as easily as for the proportion.
However, for large samples, the distribution can be approximated.

Sampling from a Binomial
Consider taking a random sample of 10 people to whom you have administered the Stress Scale described earlier.
We assume that the distribution of the Stress Scale is Binomial(10,0.2).
From what we have just done, we know that if we simulate taking such a sample many times, we can plot the resulting statistic and see its distribution, in this case that of the sample mean.

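Distribution of sample mean - 1000 simulations

    title 'distribution of sample mean';
    options ps=24 ls=64;
    data samples;
      seed = 25487;      /* random number seed */
      nsim = 1000;       /* number of simulated samples */
      nsam = 10;         /* size of each sample */
      nquest = 10;       /* binomial n (number of questions) */
      pi = 0.2;          /* binomial probability */
      do nrun = 1 to nsim;
        sumx = 0;
        do i = 1 to nsam;
          x = ranbin(seed, nquest, pi);   /* one Binomial(nquest, pi) score */
          sumx = sumx + x;
        end;
        xbar = sumx / nsam;               /* sample mean for this run */
        output;
      end;
    run;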

Distribution of sample mean (continued)
This is a default plot.

    proc means;
      var xbar;
      title 'sampling distribution of sample means';
    run;
    proc chart;
      vbar xbar / type=pct space=0;
    run;
    proc gchart;
      vbar xbar / type=pct space=0;
    run;

Statistics
Sample statistics

nsam   Mean        Std Dev     Minimum     Maximum
---------------------------------------------------
 10    1.9980000   0.3982510   0.6000000   3.7000000
 30    1.9983867   0.2340997   1.1666667   2.9000000
100    1.9984980   0.1279179   1.5600000   2.5600000
---------------------------------------------------

As the sample size increases
1. the standard deviation gets smaller
2. the range gets smaller, and more symmetric
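For readers without SAS, the same simulation can be sketched in Python. The names `nsim`, `nsam`, `nquest`, `pi` and the seed mirror the SAS program; the function name and the use of Python's `random` module are assumptions of this sketch. The generator differs from SAS's `ranbin`, so the numbers will not match the table exactly, only its pattern.

```python
import random
from statistics import mean, stdev

def simulate_sample_means(nsam, nsim=1000, nquest=10, pi=0.2, seed=25487):
    """Return nsim sample means, each from nsam Binomial(nquest, pi) scores."""
    rng = random.Random(seed)
    xbars = []
    for _ in range(nsim):
        # A Binomial(nquest, pi) score is the sum of nquest Bernoulli(pi) trials.
        scores = [sum(rng.random() < pi for _ in range(nquest))
                  for _ in range(nsam)]
        xbars.append(sum(scores) / nsam)
    return xbars

results = {}
for nsam in (10, 30, 100):
    xbars = simulate_sample_means(nsam)
    results[nsam] = (mean(xbars), stdev(xbars), min(xbars), max(xbars))
    m, s, lo, hi = results[nsam]
    print(f"{nsam:4d}  mean={m:.4f}  sd={s:.4f}  min={lo:.4f}  max={hi:.4f}")
```

The mean should stay close to 2.0 for every sample size, while the standard deviation and the range shrink as `nsam` grows.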

CHART output for sample size 10
Graphical representation of changes with sample size
[Figure: PROC CHART text bar chart; vertical axis: Percentage (2 to 10); horizontal axis: xbar midpoints 1.1 to 3.1]

CHART output for sample size 30
[Figure: PROC CHART text bar chart; vertical axis: Percentage (2 to 12); horizontal axis: xbar midpoints 1.5 to 2.9]

CHART output for sample size 100
[Figure: PROC CHART text bar chart; vertical axis: Percentage (2 to 10); horizontal axis: xbar midpoints 1.7 to 2.5]

Fancier graphs
[Figure: sample size 10 - default plot]

[Figure: sample size 30 - default plot]

[Figure: sample size 100 - default plot]

Distribution of sample mean (continued again)
This is a plot with a defined range, so that we can compare the output for sample sizes 10, 30 and 100.

    proc gchart;
      vbar xbar / type=pct space=0 midpoints=0.6 to 3.4 by 0.2;
    run;

[Figure: sample size 10 - plot with defined range]

[Figure: sample size 30 - plot with defined range]

[Figure: sample size 100 - plot with defined range]
We can see that the plots centre around the population mean (2.0).

Conclusions
1. as sample size gets larger, variance decreases
2. as sample size gets larger, curve looks more symmetric

Distribution of sample mean (more)
Alternatively, use the HISTOGRAM statement of PROC UNIVARIATE to get both the histogram and the approximating Normal curve.

    proc univariate;
      var xbar;
      histogram / normal(mu=2.0 sigma=0.4);
    run;

Here sigma = 0.4 is for nsam = 10; use sigma = 0.2309 for nsam = 30 and sigma = 0.1265 for nsam = 100.
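The mu and sigma values passed to the Normal curve follow from the assumed Binomial(10,0.2) distribution of a single score. A worked version, writing $n_q = 10$ for the binomial n (the program's nquest) and $n_{sam}$ for the sample size (this notation is introduced here for the derivation only):

```latex
\[
\mu = n_q \pi = 10 \times 0.2 = 2.0, \qquad
\sigma^2 = n_q \pi (1 - \pi) = 10 \times 0.2 \times 0.8 = 1.6
\]
\[
\sigma_{\bar{X}} = \sqrt{\frac{\sigma^2}{n_{sam}}} = \sqrt{\frac{1.6}{n_{sam}}}
\;\Rightarrow\;
\sqrt{\tfrac{1.6}{10}} = 0.4, \quad
\sqrt{\tfrac{1.6}{30}} \approx 0.2309, \quad
\sqrt{\tfrac{1.6}{100}} \approx 0.1265
\]
```

These theoretical values sit close to the simulated standard deviations reported earlier (0.398, 0.234 and 0.128).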

[Figure: sample size 10 - histogram and theoretical distribution]

[Figure: sample size 30 - histogram and theoretical distribution]

[Figure: sample size 100 - histogram and theoretical distribution]

Conclusions
1. as sample size gets larger, curve looks more Normal

Sampling from other distributions
1. Normal - perfect
   distribution of sample mean is Normal regardless of sample size
2. symmetric, e.g., Uniform
   distribution of sample mean is symmetric (for Uniform, tails may be truncated)
   for smallish samples, distribution is approximately Normal
3. asymmetric - continuous counterpart of Binomial
   like Binomial
   3.1 for large sample size, distribution is approximately Normal
   3.2 for small sample size, approximation to Normal is poor
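A small simulation can illustrate cases 2 and 3. The sketch below uses the Uniform(0,1) distribution as the symmetric example and, as an assumption of this sketch, the Exponential(1) distribution as a stand-in for an asymmetric continuous distribution (the slides do not name one). It measures the skewness of the simulated sample means, which is near 0 for a symmetric distribution.

```python
import random

def skewness(values):
    """Moment-based sample skewness; 0 for a perfectly symmetric sample."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def sample_mean_skewness(draw, nsam, nsim=2000, seed=1):
    """Skewness of nsim sample means, each computed from nsam draws."""
    rng = random.Random(seed)
    xbars = [sum(draw(rng) for _ in range(nsam)) / nsam for _ in range(nsim)]
    return skewness(xbars)

def uniform(rng):
    return rng.random()            # symmetric parent distribution

def exponential(rng):
    return rng.expovariate(1.0)    # asymmetric parent (assumed example)

for name, draw in (("uniform", uniform), ("exponential", exponential)):
    for nsam in (5, 100):
        skew = sample_mean_skewness(draw, nsam)
        print(f"{name:12s} nsam={nsam:3d}  skewness={skew:.3f}")
```

For the uniform parent the skewness of the sample mean should stay near 0 even for small samples; for the exponential parent it should be clearly positive at nsam = 5 and much closer to 0 at nsam = 100.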

The Central Limit Theorem
Take a sample of size nsam.
For nsam large enough, the distribution of the sample mean will be approximately Normal.

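The Central Limit Theorem (statistically)
Take a sample of size nsam from a distribution with mean $\mu$ and variance $\sigma^2$.
For nsam large enough, approximately
$\bar{X} \sim N(\mu, \sigma^2 / \mathit{nsam})$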