AP Statistics Ch. 7 Notes

Sampling Distributions

A major field of statistics is statistical inference: using information from a sample to draw conclusions about a wider population.

Parameter: A number that describes some characteristic of the population. In practice, the value of a parameter is usually unknown because it is often impossible to examine the entire population.
   µ (population mean)
   σ² (population variance) and σ (population standard deviation)
   p (population proportion: the proportion of individuals in the population with a certain characteristic)

Statistic: A number that describes some characteristic of a sample. The value of a statistic can be computed directly from the sample data. We often use a statistic to estimate an unknown parameter.
   $\bar{x}$ (sample mean)
   s² (sample variance) and s (sample standard deviation)
   $\hat{p}$ (sample proportion: the proportion of individuals in the sample with a certain characteristic)

Remember: statistics come from samples and parameters come from populations.

Example: Identify the population, the parameter, the sample, and the statistic in each of the following settings.
a) A pediatrician wants to know the 75th percentile for the distribution of heights of 10-year-old boys, so she takes a sample of 50 patients and calculates Q3 = 56 inches.
b) A Pew Research Center poll asked 1102 12- to 17-year-olds in the United States if they have a cell phone. Of the respondents, 71% said yes.

Sampling Variability: The value of a statistic varies from sample to sample in repeated samples from the same population.

In order to answer a question about a parameter based on a sample, we need to know exactly how the value of a statistic varies from sample to sample. We ask ourselves, "What would happen if we took many samples of the same size from this population?"
1. Take a large number of samples of the same size from the same population.
2. Calculate the statistic (like $\bar{x}$ or $\hat{p}$) for each sample.
3. Make a graph of the different values of the statistic.
4. Examine the distribution displayed in the graph for shape, center, and spread, as well as for outliers or other deviations.
Sampling Distribution: The distribution of values taken by the statistic in all possible samples of the same size from the same population.

Examples: The aces and face cards are removed from a deck of cards so that only cards 2 through 10 remain. The deck is thoroughly shuffled and a sample of 5 cards is selected. The median value of the five cards is recorded. This process is repeated 25 times and the following values of the sample median are observed.

[Figure: dotplot of the 25 sample medians; horizontal axis "Sample Median", 2 to 10.]

Describe what you see: shape, center, spread, and any unusual values.

A computer was used to simulate choosing 500 SRSs of size 5 from the deck of cards described above. The graph below shows the distribution of the sample median for these 500 samples.

[Figure: dotplot of the 500 simulated sample medians; horizontal axis "SampleMedian", 2 to 10.]

Is this the sampling distribution of the sample median? Why or why not? Describe the distribution.

Suppose that another student prepared a different deck of cards and claimed that it was exactly the same as the one used previously. When you took an SRS of size 5, the median was 4. Does this provide convincing evidence that the student's deck is different?
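The card simulation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the deck is taken to have four suits (36 cards), and the seed, function name, and text dotplot are choices made here, not part of the original activity.

```python
import random
import statistics
from collections import Counter

# Assumption: four suits, so each value 2 through 10 appears four times (36 cards).
deck = [value for value in range(2, 11) for _ in range(4)]

def simulate_sample_medians(num_samples=500, sample_size=5, seed=1):
    """Draw repeated SRSs (without replacement) and record each sample median."""
    rng = random.Random(seed)
    return [statistics.median(rng.sample(deck, sample_size))
            for _ in range(num_samples)]

medians = simulate_sample_medians()
counts = Counter(medians)

# Rough text dotplot: one row per possible median, one * per 5 samples.
for value in range(2, 11):
    print(f"{value:>2} | {'*' * (counts[value] // 5)}")

# Share of simulated medians at or below 4, useful for the "median was 4" question.
share_low = sum(m <= 4 for m in medians) / len(medians)
print(f"Proportion of medians <= 4: {share_low:.3f}")
```

Because only 500 of the possible samples are drawn, the result approximates the sampling distribution rather than reproducing it exactly.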
There are three distributions involved when we sample repeatedly, and it is very important to be clear which one we are talking about.
   Population distribution: Gives the values of the variable of interest for all individuals in the population.
   Distribution of sample data: Gives the values of the variable of interest for the individuals in the sample.
   Sampling distribution: Gives the values of the statistic for all possible samples of the same size taken from the population.

The population distribution and the distribution of sample data describe individuals. A sampling distribution describes how a statistic varies in many samples from the population.

When we calculate the value of a statistic, we usually want to use it to estimate the value of a population parameter. To determine how reliable an estimate based on a statistic is, we consider the shape, center, and spread of the sampling distribution of the statistic.

Unbiased estimator: A statistic used to estimate a parameter is an unbiased estimator if the mean of its sampling distribution is equal to the true value of the parameter being estimated.

Unbiased doesn't mean perfect. An unbiased estimator will almost always provide an estimate that is not exactly equal to the true value of the population parameter. It is called unbiased because in repeated samples, the estimates won't be consistently too high or consistently too low.

When we talk about biased and unbiased estimators, we are assuming that the sampling process we are using has no bias. We are assuming that there are no sampling or non-sampling errors present, just sampling variability. If the sampling process is flawed, there can be bias even if we are using what is otherwise considered an unbiased estimator.
Example: A teacher thoroughly mixes identically-sized slips of paper numbered 1 through 342 in a bag. Students draw out repeated samples of four numbers each and work to develop a formula to estimate the total number of slips, N, that are in the bag. The graph below shows the estimates produced by the following five different methods:
(1) Partition = sample maximum × (5/4)
(2) Max = sample maximum
(3) MeanMedian = sample mean + sample median
(4) SumQuartiles = Q1 + Q3 (both sample values)
(5) TwiceIQR = 2 × sample IQR

[Figure: dotplots of the estimates from Partition, Max, MeanMedian, SumQuartiles, and TwiceIQR on a common horizontal axis from 0 to 700.]

The thick line through the graph marks the true value of N = 342.
a) Which of these statistics appear to be biased estimators? Explain.
b) Of the unbiased estimators, which is best? Why?
c) Explain why a biased estimator might be preferred over an unbiased estimator.

Variability of a statistic: How much the value of the statistic varies from sample to sample. It is described by the spread of its sampling distribution. This spread is determined primarily by the size of the random sample. Larger samples give smaller variability in the values of a statistic.

Larger samples will reduce the variability of a statistic, but they don't eliminate bias!

Example: Suppose that the heights of adult males are approximately Normally distributed with a mean of 70 inches and a standard deviation of 3 inches. To see why sample size matters, we took 1000 SRSs of size 100 and calculated the sample mean height, and then took 1000 SRSs of size 1500 and calculated the sample mean height. Here are the results, graphed on the same scale for easy comparison:

[Figure: histograms of SampleMean100 and SampleMean1500, each on a horizontal axis from 69.0 to 71.0 with counts from 100 to 500.]

Compare the shape, center, and spread of the distributions. What does this tell you about the relationship between sample size and sample mean?
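The slips-of-paper example can be explored with a simulation like the following sketch. It assumes that, for a sample of four, Q1 and Q3 are the medians of the lower and upper halves (other quartile conventions would change the SumQuartiles and TwiceIQR results); the function names, seed, and number of repetitions are illustrative choices.

```python
import random
import statistics

TRUE_N = 342       # number of slips in the bag
SAMPLE_SIZE = 4

def estimates_from_sample(sample):
    """Return the five estimates of N computed from one sample of four slips."""
    x = sorted(sample)
    mean = statistics.mean(x)
    median = statistics.median(x)
    # Assumption: Q1 and Q3 are the medians of the lower and upper halves.
    q1 = (x[0] + x[1]) / 2
    q3 = (x[2] + x[3]) / 2
    return {
        "Partition": x[-1] * 5 / 4,
        "Max": x[-1],
        "MeanMedian": mean + median,
        "SumQuartiles": q1 + q3,
        "TwiceIQR": 2 * (q3 - q1),
    }

def simulate(num_samples=10_000, seed=1):
    """Collect each method's estimates over many samples drawn without replacement."""
    rng = random.Random(seed)
    slips = range(1, TRUE_N + 1)
    results = {}
    for _ in range(num_samples):
        sample = rng.sample(slips, SAMPLE_SIZE)
        for name, value in estimates_from_sample(sample).items():
            results.setdefault(name, []).append(value)
    return results

results = simulate()
for name, values in results.items():
    print(f"{name:<13} mean = {statistics.mean(values):7.1f}"
          f"   SD = {statistics.stdev(values):6.1f}")
```

Comparing each method's mean with 342 suggests which estimators are biased, and comparing the standard deviations suggests which of the remaining ones varies least.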
We can represent the true value of the parameter we are trying to estimate by the bulls-eye of a target. The values of the statistic from sample to sample are represented by an arrow that is repeatedly shot at the target. Bias means our aim is off and we consistently miss the true value of the parameter. High variability means that the repeated shots (the values of the statistic from sample to sample) are widely scattered. When we select which statistic we want to use to estimate the value of a parameter, we want to choose a statistic that is accurate (unbiased) and precise (low variability).
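To see variability shrink while the center stays on target, the height example (Normal population, mean 70 inches, standard deviation 3 inches) can be simulated as sketched below. As a simplifying assumption, each observation is an independent Normal draw rather than a member of an SRS from a finite population; the function name and seed are illustrative.

```python
import random
import statistics

def sample_means(sample_size, num_samples=1000, mu=70, sigma=3, seed=1):
    """Return the sample means of repeated samples of Normal(mu, sigma) heights."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.gauss(mu, sigma) for _ in range(sample_size))
        for _ in range(num_samples)
    ]

summary = {}
for n in (100, 1500):
    means = sample_means(n)
    # Center and spread of the simulated sampling distribution.
    summary[n] = (statistics.fmean(means), statistics.stdev(means))
    print(f"n = {n:>4}: mean of sample means = {summary[n][0]:.3f}, "
          f"SD of sample means = {summary[n][1]:.3f}")
```

Both sets of sample means should center near 70 (low bias), while the spread for n = 1500 should be much smaller than for n = 100 (higher precision).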
Sample Proportions

How good is the statistic $\hat{p}$ (the proportion of individuals in a sample with a given characteristic) as an estimate of the parameter p (the proportion of individuals in the population with the given characteristic)? To answer, we ask, "What would happen if we took many samples?"

Example: In a population of N = 616 pennies, the proportion that were minted after 2005 is p = 0.175. Five hundred samples each of sizes 5, 10, 20, and 50 are taken. The distributions of $\hat{p}$ for the 500 samples of each size are shown below. Compare the shape, center, and spread of the distributions.

[Figure: dotplots of $\hat{p}$ for n = 5, 10, 20, and 50; horizontal axis "Sample Proportions", 0.0 to 0.8. Each symbol represents up to 18 observations.]

Sampling Distribution of a Sample Proportion

Choose an SRS of size n from a population of size N with proportion p of successes and proportion q = 1 − p of failures. Let $\hat{p}$ be the sample proportion of successes. Then:
   The mean of the sampling distribution of $\hat{p}$ is $\mu_{\hat{p}} = p$.
   The standard deviation of the sampling distribution of $\hat{p}$ is $\sigma_{\hat{p}} = \sqrt{\frac{pq}{n}}$ as long as the observations are independent or the 10% condition is satisfied: $n \le \frac{1}{10}N$ or, equivalently, $N \ge 10n$.
   As n increases, the sampling distribution of $\hat{p}$ becomes approximately Normal. Before you use Normal calculations, check that the Normal condition is satisfied: $np \ge 10$ and $nq \ge 10$.

Since larger random samples give better information, it sometimes makes sense to sample more than 10% of the population. In this case, there's a more accurate formula for calculating the standard deviation $\sigma_{\hat{p}}$. It uses something called a finite population correction (FPC). The formula without the FPC will always give a larger (more conservative) estimate of the standard deviation than the actual standard deviation. In case you are dying to know, the formula is $\sigma_{\hat{p}} = \sqrt{\frac{pq}{n}}\sqrt{1 - \frac{n}{N}}$.
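The facts above can be collected into a small helper, sketched here for the penny example (N = 616, p = 0.175). The function name and dictionary keys are illustrative, and the FPC line uses the form of the correction given in these notes.

```python
import math

def p_hat_distribution(n, p, population_size=None):
    """Summarize the sampling distribution of p-hat for an SRS of size n."""
    q = 1 - p
    info = {
        "mean": p,                                   # mu_p-hat = p
        "sd": math.sqrt(p * q / n),                  # sigma_p-hat = sqrt(pq/n)
        "normal_condition": n * p >= 10 and n * q >= 10,
    }
    if population_size is not None:
        # 10% condition: n <= N/10
        info["ten_percent_condition"] = n <= population_size / 10
        # FPC form from these notes: sqrt(pq/n) * sqrt(1 - n/N)
        info["sd_with_fpc"] = info["sd"] * math.sqrt(1 - n / population_size)
    return info

for n in (5, 10, 20, 50):
    info = p_hat_distribution(n, p=0.175, population_size=616)
    print(f"n = {n:>2}: sd = {info['sd']:.4f}, "
          f"sd with FPC = {info['sd_with_fpc']:.4f}, "
          f"10% ok = {info['ten_percent_condition']}, "
          f"Normal ok = {info['normal_condition']}")
```

For these sample sizes np stays below 10, so the Normal condition is not met even at n = 50, which is consistent with skewed dotplots for small n.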
When solving problems involving sample proportions:
1. Justify using the Normal distribution by checking $np \ge 10$ and $nq \ge 10$.
2. Find $\mu_{\hat{p}}$ and $\sigma_{\hat{p}}$. Check the 10% condition to justify using the formula for $\sigma_{\hat{p}}$.
3. Write a probability statement and draw and shade a Normal curve.
4. Perform Normal calculations either by using z-scores, $z = \frac{\hat{p} - \mu_{\hat{p}}}{\sigma_{\hat{p}}}$, and a table or by using normalcdf on your calculator.
5. Write your answer in context.

Examples: About 75% of young adult Internet users (ages 18 to 29) watch online video. Suppose that a sample survey contacts an SRS of 1000 young adult Internet users and calculates the proportion $\hat{p}$ in this sample who watch online video.
(a) What is the mean of the sampling distribution of $\hat{p}$? Explain the meaning of $\mu_{\hat{p}}$.
(b) Find the standard deviation of the sampling distribution of $\hat{p}$. Check that the 10% condition is satisfied. Then explain the meaning of $\sigma_{\hat{p}}$.
(c) Is the sampling distribution of $\hat{p}$ approximately Normal? Check that the Normal condition is met.
(d) If the sample size were 9000 rather than 1000, how would this change the sampling distribution of $\hat{p}$?

Example: The superintendent of a large school district wants to know what proportion of middle school students in her district are planning to attend a four-year college or university. Suppose that 80% of all middle school students in her district are planning to attend a four-year college or university. What is the probability that an SRS of size 125 will give an estimate of this proportion that is within 7 percentage points of the true value?
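As one way to check the Normal calculation in the superintendent example, the steps above can be sketched with Python's standard library in place of a table or normalcdf. The variable names are illustrative, and the 10% condition is assumed to hold, which requires at least 1250 middle school students in the district.

```python
import math
from statistics import NormalDist

p, n, margin = 0.80, 125, 0.07
q = 1 - p

# Step 1: Normal condition, np >= 10 and nq >= 10.
normal_ok = n * p >= 10 and n * q >= 10

# Step 2: mean and standard deviation of the sampling distribution of p-hat.
# Assumption: the district has at least 10 * 125 = 1250 middle school students.
mu_p_hat = p
sigma_p_hat = math.sqrt(p * q / n)

# Steps 3 and 4: P(0.73 <= p-hat <= 0.87) under the Normal approximation.
dist = NormalDist(mu=mu_p_hat, sigma=sigma_p_hat)
prob = dist.cdf(p + margin) - dist.cdf(p - margin)

print(f"Normal condition met: {normal_ok}")
print(f"sigma_p-hat = {sigma_p_hat:.4f}")
print(f"P(within 7 percentage points) = {prob:.4f}")
```

The probability comes out to roughly 0.95, so about 95% of all SRSs of size 125 would give an estimate within 7 percentage points of the true proportion.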
Sample Means

Example: The histogram below shows the distribution of mint dates on a population of N = 616 pennies.

[Figure: histogram of mint year for the population; horizontal axis "Year", 1960 to 2008; vertical axis "Proportion", 0.00 to 0.07.]

500 samples of size n = 5 and 500 samples of size n = 25 were taken from this population. The sample mean for each sample was calculated. The distributions of sample means for each sample size are shown below.

[Figure: histograms of the sample means; panels "Sample Means, n=5" and "Sample Means, n=25", each with horizontal axis 1980 to 2008 and vertical axis "Proportion", 0.00 to 0.20.]

Compare the population distribution to the two distributions of sample means.

Mean and Standard Deviation of the Sampling Distribution of $\bar{x}$

Suppose that $\bar{x}$ is the mean of an SRS of size n drawn from a large population with mean µ and standard deviation σ. Then:
   The mean of the sampling distribution of $\bar{x}$ is $\mu_{\bar{x}} = \mu$.
   The standard deviation of the sampling distribution of $\bar{x}$ is $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$ as long as the observations are independent or the 10% condition is satisfied: $n \le \frac{1}{10}N$ or, equivalently, $N \ge 10n$.

These facts about the mean and standard deviation of $\bar{x}$ are true no matter what shape the population distribution has.

If the sample is larger than 10% of the population, the finite population correction factor (FPC) is used and the formula for the standard deviation of $\bar{x}$ is $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}\sqrt{1 - \frac{n}{N}}$.
Example: Suppose that the number of movies viewed in the last year by high school students has an average of 19.3 with a standard deviation of 15.8. Suppose we take an SRS of 100 high school students and calculate the mean number of movies viewed by the members of the sample.
(a) What is the mean of the sampling distribution of $\bar{x}$? Explain the meaning of $\mu_{\bar{x}}$.
(b) What is the standard deviation of the sampling distribution of $\bar{x}$? Check that the 10% condition is satisfied. Explain the meaning of $\sigma_{\bar{x}}$.

Sampling Distribution of a Sample Mean from a Normal Population

Suppose that a population is Normally distributed with mean µ and standard deviation σ. Then the sampling distribution of $\bar{x}$ has the Normal distribution with mean µ and standard deviation $\frac{\sigma}{\sqrt{n}}$, provided that the 10% condition is met. This is true no matter what the sample size is.

Example: At the P. Nutty Peanut Company, dry-roasted, shelled peanuts are placed in jars by a machine. The distribution of weights in the jars is approximately Normal, with a mean of 16.1 ounces and a standard deviation of 0.15 ounces.
(a) Without doing any calculations, explain which outcome is more likely: randomly selecting a single jar and finding that the contents weigh less than 16 ounces, or randomly selecting 10 jars and finding that the average contents weigh less than 16 ounces.
(b) Find the probability of each event described above.

The fact that averages of several observations are less variable than individual observations is important in many settings. It is common practice to repeat a measurement several times and report the average of the results. Think of the results of n repeated measurements as an SRS from the population of outcomes we would get if we repeated the measurement forever. The average of the n results is less variable than a single measurement.
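The two examples above can be checked numerically with a short sketch like this one. It uses only the values stated in the examples; the variable names are illustrative, and the 10% condition is assumed to hold in both settings.

```python
import math
from statistics import NormalDist

# Movies example: mean and standard deviation of the sampling distribution of x-bar.
mu_movies, sigma_movies, n_movies = 19.3, 15.8, 100
mu_xbar_movies = mu_movies
sd_xbar_movies = sigma_movies / math.sqrt(n_movies)
print(f"Movies: mu_x-bar = {mu_xbar_movies}, sigma_x-bar = {sd_xbar_movies:.2f}")

# Peanut example: the population is approximately Normal(16.1, 0.15).
mu_jar, sigma_jar = 16.1, 0.15

# One jar: use the population distribution itself.
p_single = NormalDist(mu_jar, sigma_jar).cdf(16)

# Mean of 10 jars: same center, standard deviation sigma / sqrt(n).
p_mean10 = NormalDist(mu_jar, sigma_jar / math.sqrt(10)).cdf(16)

print(f"P(single jar < 16 oz)      = {p_single:.4f}")
print(f"P(mean of 10 jars < 16 oz) = {p_mean10:.4f}")
```

The single-jar probability is far larger than the probability for the average of 10 jars, which matches the point that averages are less variable than individual observations.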
Most population distributions are not Normal, so we need to figure out what shape the sampling distribution of $\bar{x}$ has for a non-Normal population or a population of unknown shape.

The Central Limit Theorem (CLT)

If a random sample of n observations is selected from any population and the sample size is sufficiently large ($n \ge 30$), then the sampling distribution of $\bar{x}$ is approximately Normal.

Normal Condition for Sample Means
   If the population distribution is Normal, then so is the sampling distribution of $\bar{x}$. This is true no matter what the sample size n is.
   If the population distribution is not Normal, the CLT tells us that the sampling distribution of $\bar{x}$ will be approximately Normal in most cases if $n \ge 30$.

When solving problems involving sample means:
1. Justify using the Normal distribution using the conditions above.
2. Find $\mu_{\bar{x}}$ and $\sigma_{\bar{x}}$. Check the 10% condition to justify using the formula for $\sigma_{\bar{x}}$.
3. Write a probability statement and draw and shade a Normal curve.
4. Perform Normal calculations either by using z-scores, $z = \frac{\bar{x} - \mu_{\bar{x}}}{\sigma_{\bar{x}}}$, and a table or by using normalcdf on your calculator.
5. Write your answer in context.

Examples: Suppose that the number of texts sent during a typical day by a randomly selected high school student follows a right-skewed distribution with a mean of 15 and a standard deviation of 35. Assuming that students at your school are typical texters, how likely is it that a random sample of 50 students will have sent more than a total of 1000 texts in the last 24 hours?
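One way to set up the texting example is to restate the total as a mean: a total of more than 1000 texts from 50 students is the same event as a sample mean above 20. The sketch below follows the five steps with Python's standard library; the variable names are illustrative, and the 10% condition is assumed (the school would need at least 500 students).

```python
import math
from statistics import NormalDist

mu, sigma, n = 15, 35, 50
total_threshold = 1000

# A total above 1000 texts is equivalent to a sample mean above 1000 / 50 = 20.
mean_threshold = total_threshold / n

# Step 1: the population is right-skewed, so rely on the CLT (n >= 30).
clt_ok = n >= 30

# Step 2: mean and standard deviation of the sampling distribution of x-bar.
mu_xbar = mu
sigma_xbar = sigma / math.sqrt(n)

# Steps 3 and 4: P(x-bar > 20) under the Normal approximation.
z = (mean_threshold - mu_xbar) / sigma_xbar
prob = 1 - NormalDist().cdf(z)

print(f"CLT condition met: {clt_ok}")
print(f"sigma_x-bar = {sigma_xbar:.3f}, z = {z:.2f}")
print(f"P(total > 1000 texts) = {prob:.4f}")
```

Because the population is strongly skewed, the Normal approximation at n = 50 is only approximate, so the result is best reported as "about 16%".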