Sampling Marc H. Mehlman marcmehlman@yahoo.com University of New Haven (University of New Haven) Sampling 1 / 20
Table of Contents
1 Sampling Distributions
2 Central Limit Theorem
3 Binomial Distribution

1 Sampling Distributions
Parameters and Statistics

As we begin to use sample data to draw conclusions about a wider population, we must be clear about whether a number describes a sample or a population.

A parameter is a number that describes some characteristic of the population. In statistical practice, the value of a parameter is not known because we cannot examine the entire population.

A statistic is a number that describes some characteristic of a sample. The value of a statistic can be computed directly from the sample data. We often use a statistic to estimate an unknown parameter.

Remember s and p: statistics come from samples and parameters come from populations.

We write µ (the Greek letter mu) for the population mean and σ for the population standard deviation. We write $\bar{x}$ (x-bar) for the sample mean and s for the sample standard deviation.
Statistical Estimation

The process of statistical inference involves using information from a sample to draw conclusions about a wider population.

Different random samples yield different statistics. We need to be able to describe the sampling distribution of possible statistic values in order to perform statistical inference.

We can think of a statistic as a random variable because it takes numerical values that describe the outcomes of the random sampling process.

[Figure: Population and Sample. Collect data from a representative sample, then make an inference about the population.]
Sampling Variability

Different random samples yield different statistics. This basic fact is called sampling variability: the value of a statistic varies in repeated random sampling.

To make sense of sampling variability, we ask, "What would happen if we took many samples?"

[Figure: one population with many samples drawn from it.]
Sampling Distributions

The law of large numbers assures us that if we measure enough subjects, the statistic $\bar{x}$ will eventually get very close to the unknown parameter µ.

If we took every one of the possible samples of a certain size, calculated the sample mean for each, and graphed all of those values, we'd have a sampling distribution.

The population distribution of a variable is the distribution of values of the variable among all individuals in the population.

The sampling distribution of a statistic is the distribution of values taken by the statistic in all possible samples of the same size from the same population.
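The idea of taking "every one of the possible samples of a certain size" can be tried directly on a tiny population. The sketch below is only an illustration: the five population values and the sample size are hypothetical choices, not data from these slides, and the samples are drawn without replacement.

```r
# Hypothetical toy population of five values (not from the slides).
population <- c(2, 4, 6, 8, 10)
n <- 2  # sample size (illustrative choice)

# combn() enumerates every size-n subset and applies mean() to each one.
sample_means <- combn(population, n, FUN = mean)

# The sampling distribution of the sample mean, as a frequency table.
sampling_distribution <- table(sample_means)
print(sampling_distribution)

# The center of the sampling distribution matches the population mean.
mean(sample_means)
mean(population)
```

With only ten possible samples the whole sampling distribution fits in one small table. For realistic population sizes the enumeration is infeasible, which is why the slides turn to general facts about its mean and spread.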
Mean and Standard Deviation of a Sample Mean

Mean of a sampling distribution of a sample mean
There is no tendency for a sample mean to fall systematically above or below µ, even if the distribution of the raw data is skewed. Thus the mean of the sampling distribution equals µ, and the sample mean is an unbiased estimate of the population mean µ.

Standard deviation of a sampling distribution of a sample mean
The standard deviation of the sampling distribution measures how much the sample statistic varies from sample to sample. It is smaller than the standard deviation of the population by a factor of $\sqrt{n}$. Averages are less variable than individual observations.
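The two claims above follow from a short calculation. The sketch below assumes the observations $x_1, \ldots, x_n$ are independent, each with mean $\mu$ and standard deviation $\sigma$. That assumption is an idealization of an SRS from a large population, and it is stated here rather than taken from the slide.

```latex
\begin{align*}
E[\bar{x}] &= E\!\left[\frac{1}{n}\sum_{i=1}^{n} x_i\right]
            = \frac{1}{n}\sum_{i=1}^{n} E[x_i]
            = \frac{1}{n}\, n\mu = \mu, \\
\operatorname{Var}(\bar{x}) &= \frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}(x_i)
            = \frac{1}{n^2}\, n\sigma^2 = \frac{\sigma^2}{n}, \\
\sigma_{\bar{x}} &= \sqrt{\operatorname{Var}(\bar{x})} = \frac{\sigma}{\sqrt{n}}.
\end{align*}
```

Independence is used only in the variance line, where it allows the variance of the sum to be written as the sum of the variances.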
The Sampling Distribution of a Sample Mean

When we choose many SRSs from a population, the sampling distribution of the sample mean is centered at the population mean µ and is less spread out than the population distribution. Here are the facts.

The Sampling Distribution of Sample Means
Suppose that $\bar{x}$ is the mean of an SRS of size n drawn from a large population with mean µ and standard deviation σ. Then:
The mean of the sampling distribution of $\bar{x}$ is $\mu_{\bar{x}} = \mu$.
The standard deviation of the sampling distribution of $\bar{x}$ is $\sigma_{\bar{x}} = \sigma/\sqrt{n}$.
Note: These facts about the mean and standard deviation of $\bar{x}$ are true no matter what shape the population distribution has.

If individual observations have the N(µ, σ) distribution, then the sample mean of an SRS of size n has the $N(\mu, \sigma/\sqrt{n})$ distribution regardless of the sample size n.
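A quick simulation can check both facts for a skewed population. This is a minimal sketch under stated assumptions: the Exponential(1) population (for which µ = 1 and σ = 1), the sample size, the number of replicates, and the seed are all illustrative choices, not part of the slides.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

n    <- 25      # sample size (illustrative choice)
reps <- 10000   # number of simulated samples (illustrative choice)

# Exponential(rate = 1) population: right-skewed, with mu = 1 and sigma = 1.
mu    <- 1
sigma <- 1

# Each replicate draws one sample of size n and records its mean.
sample_means <- replicate(reps, mean(rexp(n, rate = 1)))

simulated_mean <- mean(sample_means)  # should be close to mu
simulated_sd   <- sd(sample_means)    # should be close to sigma / sqrt(n)
theoretical_sd <- sigma / sqrt(n)     # equals 0.2 for these choices

c(simulated_mean = simulated_mean,
  simulated_sd   = simulated_sd,
  theoretical_sd = theoretical_sd)
```

The simulated values will not match the theoretical ones exactly, because only finitely many samples are drawn, but they should agree to about two decimal places.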
2 Central Limit Theorem
Central Limit Theorem

"I know of scarcely anything so apt to impress the imagination as the wonderful form of cosmic order expressed by the law of frequency of error [the normal distribution]. The law would have been personified by the Greeks and deified, if they had known of it. It reigns with serenity and in complete self-effacement amidst the wildest confusion. The huger the mob, and the greater the anarchy, the more perfect is its sway. It is the supreme law of Unreason." Francis Galton

In the previous slide, the sampling distribution of $\bar{X}$ is depicted as:
1 with mean µ, i.e. unbiased.
2 with standard deviation $\sigma/\sqrt{n}$.
3 with normal distribution.

The first two depictions are always true, regardless of sample size or population distribution. The Central Limit Theorem (below) says the third depiction is approximately true, regardless of population distribution, for large sample sizes n. As Francis Galton said, the averaged effects of random acts from a large mob form a familiar pattern.

Theorem (Central Limit Theorem, CLT)
Consider a random sample of size n from a population with mean µ and standard deviation σ. For large n, the sampling distribution of $\bar{X}$ is approximately $N(\mu, \sigma/\sqrt{n})$.
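One way to see the theorem at work is to watch the skewness of the sample mean shrink as n grows, since a normal distribution has skewness 0. The sketch below is an illustration under assumptions: the Exponential(1) population, the two sample sizes, the replicate count, and the helper names `skewness` and `simulate_means` are all hypothetical choices made here.

```r
set.seed(2)  # arbitrary seed, for reproducibility only

# Sample skewness: third central moment divided by the cubed standard deviation.
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

# Simulate `reps` sample means, each computed from n Exponential(1) draws.
simulate_means <- function(n, reps = 20000) {
  replicate(reps, mean(rexp(n, rate = 1)))
}

skew_small <- skewness(simulate_means(2))   # small sample size: still clearly skewed
skew_large <- skewness(simulate_means(70))  # large sample size: much closer to 0

c(n_2 = skew_small, n_70 = skew_large)
```

Plotting `hist(simulate_means(70))` gives a visual version of the same comparison: the histogram looks roughly bell-shaped even though the population is strongly right-skewed.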
Example

Based on service records from the past year, the time (in hours) that a technician requires to complete preventative maintenance on an air conditioner follows a distribution that is strongly right-skewed and whose most likely outcomes are close to 0. The mean time is µ = 1 hour and the standard deviation is σ = 1 hour.

Your company will service an SRS of 70 air conditioners. You have budgeted 1.1 hours per unit. Will this be enough?

The sampling distribution of the mean time spent working on the 70 units has
$\mu_{\bar{x}} = \mu = 1$ and $\sigma_{\bar{x}} = \sigma/\sqrt{n} = 1/\sqrt{70} \approx 0.12$.

By the central limit theorem, the sampling distribution of the mean time spent working is approximately N(1, 0.12) because n = 70 ≥ 30.

$z = \frac{1.1 - 1}{0.12} = 0.83$
$P(\bar{x} > 1.1) = P(Z > 0.83) = 1 - 0.7967 = 0.2033$

If you budget 1.1 hours per unit, there is about a 20% chance the technicians will not complete the work within the budgeted time.
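The same probability can be computed in R without a table. This is a minimal sketch; the variable names are chosen here for illustration. It keeps the unrounded standard deviation $1/\sqrt{70}$, so the result differs slightly from the table-based value, which rounds to 0.12 and z = 0.83.

```r
mu     <- 1    # mean service time in hours (from the example)
sigma  <- 1    # standard deviation in hours (from the example)
n      <- 70   # number of units serviced
budget <- 1.1  # budgeted hours per unit

se <- sigma / sqrt(n)  # standard deviation of the sample mean, about 0.12

# Upper-tail probability P(x-bar > budget) under the N(mu, se) approximation.
p_over <- pnorm(budget, mean = mu, sd = se, lower.tail = FALSE)
p_over
```

The result is roughly 0.20, in line with the 20% figure in the example.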
A Few More Facts

Any linear combination of independent Normal random variables is also Normally distributed.

More generally, the central limit theorem notes that the distribution of a sum or average of many small random quantities is close to Normal.

Finally, the central limit theorem also applies to discrete random variables.
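The first fact can be written out explicitly. The statement below is a standard result rather than something taken from the slide; the symbols $a_i$, $\mu_i$, $\sigma_i$ are notation introduced here, and the second argument of $N(\cdot,\cdot)$ is a standard deviation, as elsewhere in these slides.

```latex
\[
X_i \sim N(\mu_i, \sigma_i) \text{ independent},\quad a_1, \ldots, a_k \text{ constants}
\;\Longrightarrow\;
\sum_{i=1}^{k} a_i X_i \sim
N\!\left(\sum_{i=1}^{k} a_i \mu_i,\ \sqrt{\sum_{i=1}^{k} a_i^2 \sigma_i^2}\right).
\]
```

Taking $a_i = 1/n$ with a common $\mu$ and $\sigma$ recovers the earlier statement that the mean of n Normal observations is exactly $N(\mu, \sigma/\sqrt{n})$.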
3 Binomial Distribution
Definition (Bernoulli Distribution, X ~ BIN(1, p))
Model: X = # heads after tossing a coin once, where the probability of heads is p.

Definition (Binomial Distribution, X ~ BIN(n, p))
Model: X = # heads after tossing a coin n times, where the probability of heads on each toss is p.

Theorem
If X ~ BIN(n, p) and j is an integer with 0 ≤ j ≤ n, then
$P(X = j) = \binom{n}{j} p^j (1-p)^{n-j}.$
Furthermore $\mu_X = np$, $\sigma_X^2 = np(1-p)$ and $\sigma_X = \sqrt{np(1-p)}$.
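The theorem can be checked numerically against R's built-in binomial functions. The values n = 10 and p = 0.3 below are illustrative choices, not taken from the slides.

```r
n <- 10   # number of tosses (illustrative choice)
p <- 0.3  # probability of heads on each toss (illustrative choice)
j <- 0:n  # every possible number of heads

# The formula from the theorem: choose(n, j) * p^j * (1 - p)^(n - j).
pmf_formula <- choose(n, j) * p^j * (1 - p)^(n - j)

# R's built-in binomial probability mass function.
pmf_builtin <- dbinom(j, size = n, prob = p)

# The two agree up to floating-point error.
all.equal(pmf_formula, pmf_builtin)

# Mean and variance computed directly from the pmf.
mean_X <- sum(j * pmf_formula)               # compare with n * p
var_X  <- sum((j - mean_X)^2 * pmf_formula)  # compare with n * p * (1 - p)

c(mean_X = mean_X, np = n * p, var_X = var_X, npq = n * p * (1 - p))
```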
Let $Y_1, Y_2, \ldots, Y_n$ be a random sample from BIN(1, p). Then
1 $X \overset{\text{def}}{=} \sum_{j=1}^{n} Y_j \sim \text{BIN}(n, p)$.
2 $\hat{p} \overset{\text{def}}{=} \bar{Y} = \frac{\#\text{ of heads}}{\#\text{ of tosses}}$ is an unbiased estimator of p.
3 For large n, the distribution of $\hat{p} = \bar{Y}$ is approximately $N\left(p, \sqrt{p(1-p)/n}\right)$ by the Central Limit Theorem.

Since $X = n\bar{Y}$, one has

Theorem (Normal Approximation for Binomial Distribution)
For large n, X ~ BIN(n, p) is approximately distributed as $N\left(np, \sqrt{np(1-p)}\right)$.

How large must n be for the above approximation to be good?

Convention
When $np \ge 10$ and $n(1-p) \ge 10$.
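The convention and the two standard deviations above translate into a few one-line helpers. This is a sketch only: the function names `normal_approx_ok`, `sd_phat` and `sd_count` are hypothetical names introduced here, not part of R or of the slides.

```r
# Rule of thumb from the slide: both np and n(1 - p) should be at least 10.
normal_approx_ok <- function(n, p) {
  n * p >= 10 && n * (1 - p) >= 10
}

# Standard deviation of the sample proportion p-hat.
sd_phat <- function(n, p) sqrt(p * (1 - p) / n)

# Standard deviation of the count X = n * p-hat.
sd_count <- function(n, p) sqrt(n * p * (1 - p))

normal_approx_ok(100, 0.3)  # TRUE:  np = 30 and n(1 - p) = 70
normal_approx_ok(20, 0.3)   # FALSE: np = 6 is below 10
sd_phat(100, 0.3)           # about 0.0458
sd_count(100, 0.3)          # about 4.58
```

Because X = n p-hat, `sd_count(n, p)` is always n times `sd_phat(n, p)`.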
When dealing with discrete random variables such as the binomial distribution, a continuity correction can greatly improve accuracy. For instance, consider the example:

Example (Exact)
Joe always runs red lights. The probability of an accident for each red light run is 0.3. Of the last 100 red lights run, what is the probability that there were 25 or fewer accidents?

Solution: Let X ~ BIN(100, 0.3) be the number of accidents. The exact answer is
$\sum_{j=0}^{25} P(X = j) = \sum_{j=0}^{25} \binom{100}{j} (0.3)^j (0.7)^{100-j} = 0.1631$
(obtained with Mathematica). Or using R,
> pbinom(25,100,0.3)
[1] 0.1631301
The exact answer can't easily be obtained without a computer.
Example (Normal approximation without continuity correction)
Joe always runs red lights. The probability of an accident for each red light run is 0.3. Of the last 100 red lights run, what is the probability, approximately, that there were 25 or fewer accidents?

Solution: Let X ~ BIN(100, 0.3). Since $100(0.3) \ge 10$ and $100(1 - 0.3) \ge 10$, X has approximately the same distribution as $Y \sim N\left(30, \sqrt{100(0.3)(1 - 0.3)}\right) = N(30, 4.582576)$. Thus
$P[X \le 25] \approx P[Y \le 25] = P\left[\frac{Y - 30}{4.582576} \le \frac{25 - 30}{4.582576}\right] = P[Z \le -1.091089] \approx 0.1379$
using the Table. Instead of using a table, one can get more accuracy using R for the normal approximation without continuity correction:
> pnorm(25,30,sqrt(100*0.3*(1-0.3)))
[1] 0.1376168
The approximation is unsatisfactory: the exact answer is 0.1631.
Continuity Correction
Let X ~ BIN(n, p) and let j, k be integers such that $0 \le j \le k \le n$. Then it is common practice to use the following approximation when $np \ge 10$ and $n(1-p) \ge 10$:
$P[j \le X \le k] \approx P[j - 0.5 \le Y \le k + 0.5]$
where $Y \sim N\left(np, \sqrt{np(1-p)}\right)$.
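The correction is easy to wrap in a small function. The sketch below is one possible way to write it; `binom_normal_cc` is a hypothetical name introduced here, and the function does not check the $np \ge 10$, $n(1-p) \ge 10$ convention.

```r
# Normal approximation to P[j <= X <= k] for X ~ BIN(n, p),
# with the continuity correction of +/- 0.5 applied to the endpoints.
binom_normal_cc <- function(j, k, n, p) {
  stopifnot(0 <= j, j <= k, k <= n)
  mu    <- n * p
  sigma <- sqrt(n * p * (1 - p))
  pnorm(k + 0.5, mean = mu, sd = sigma) - pnorm(j - 0.5, mean = mu, sd = sigma)
}

# P[0 <= X <= 25] for X ~ BIN(100, 0.3), approximate and exact.
approx_cc <- binom_normal_cc(0, 25, n = 100, p = 0.3)
exact     <- pbinom(25, size = 100, prob = 0.3)

c(approx_cc = approx_cc, exact = exact)
```

For a one-sided probability such as P[X ≤ k], the lower term `pnorm(j - 0.5, ...)` with j = 0 is negligible here, so the result is essentially `pnorm(k + 0.5, ...)`.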
Example (Normal approximation with continuity correction)
Joe always runs red lights. The probability of an accident for each red light run is 0.3. Of the last 100 red lights run, what is the probability, approximately, that there were 25 or fewer accidents?

Since $100(0.3) \ge 10$ and $100(0.7) \ge 10$, the above convention says, letting $Y \sim N\left(30, \sqrt{100(0.3)(1 - 0.3)}\right) = N(30, 4.582576)$,
$P(X \le 25) \approx P(Y \le 25.5) = P\left(\frac{Y - 30}{4.582576} \le \frac{25.5 - 30}{4.582576}\right) = P(Z \le -0.9819805) \approx 0.1635$
using the Table. Instead of using a table, one can get more accuracy using R for the normal approximation with continuity correction:
> pnorm(25.5,30,sqrt(100*0.3*(1-0.3)))
[1] 0.1630547
This approximation is much, much better than the normal approximation without continuity correction: the exact answer is 0.1631.
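The improvement is not special to the cutoff 25. The sketch below compares both approximations with the exact value over a range of cutoffs; the range 20 to 40 is an illustrative choice made here, not part of the slides.

```r
n <- 100
p <- 0.3
k <- 20:40  # cutoffs near the mean np = 30 (illustrative range)

mu    <- n * p
sigma <- sqrt(n * p * (1 - p))

exact     <- pbinom(k, size = n, prob = p)           # exact P(X <= k)
plain     <- pnorm(k, mean = mu, sd = sigma)         # no continuity correction
corrected <- pnorm(k + 0.5, mean = mu, sd = sigma)   # with continuity correction

# Largest absolute error of each approximation over the chosen cutoffs.
max_err_plain     <- max(abs(plain - exact))
max_err_corrected <- max(abs(corrected - exact))

c(plain = max_err_plain, corrected = max_err_corrected)
```

Over this range the corrected approximation has the smaller worst-case error, consistent with the single-cutoff comparison in the example.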