Lecture 6: Confidence Intervals

Taeyong Park
Washington University in St. Louis
U25 PS323 Intro to Quantitative Methods
February 22, 2017

Today...
- Review of sampling distributions
- Answer key for problem set 1 on Blackboard
- Confidence intervals
- Lab: central limit theorem; confidence intervals

Roadmap
The ultimate goal of this course:
- Conduct linear regression analysis using real-world data
- Interpret the results
- Present the results effectively using plots
- Confidence intervals and hypothesis testing -> interpretations
- Sampling distributions; standard error; central limit theorem -> confidence intervals and hypothesis testing

Three types of distributions
- Population distribution
- Sample data distribution
- Sampling distribution

Sampling distribution
A sampling distribution is the distribution of a statistic under repeated sampling.

Sampling distribution: Example
- Population: American voters
- A multitude of polls (samples)
- Statistic: e.g., the proportion of respondents who voted for Trump; the mean age

Sampling distribution
Standard error: the standard deviation of the sampling distribution.

Central limit theorem
For random sampling with a large sample size n, the sampling distribution of the sample mean ȳ is approximately normal.
- The mean of the distribution is equal to the population mean µ.
- The standard deviation of the distribution is equal to $\sigma/\sqrt{n}$.
$\bar{y} \sim N(\mu, \sigma/\sqrt{n})$
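A minimal simulation sketch of the theorem. The exponential population (mean 1, so µ = 1 and σ = 1), the sample size n = 50, the number of repetitions, and the helper name `simulate_sample_means` are illustrative assumptions, not part of the lecture.

```python
import random
import statistics

def simulate_sample_means(n, reps, seed=0):
    """Draw `reps` random samples of size n from a skewed population
    (exponential with mean 1, so mu = 1 and sigma = 1) and return the
    sample mean of each one."""
    rng = random.Random(seed)
    means = []
    for _ in range(reps):
        sample = [rng.expovariate(1.0) for _ in range(n)]
        means.append(statistics.fmean(sample))
    return means

n = 50
means = simulate_sample_means(n, reps=5000)

# The CLT predicts a center near mu = 1 and a spread near sigma / sqrt(n).
print("mean of sample means:", round(statistics.fmean(means), 3))
print("sd of sample means:  ", round(statistics.stdev(means), 3))
print("sigma / sqrt(n):     ", round(1 / n ** 0.5, 3))
```

Even though the population is strongly skewed, a histogram of `means` should look roughly bell-shaped at this sample size.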

Central limit theorem: Some notes
- If Y is normal, the CLT applies for all n.
- Otherwise, you need a large enough sample.
- Usually n = 30 is good enough, but it will depend on the distribution.
- As n -> ∞, the standard error gets smaller and smaller.
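To make the last note concrete, here is a small sketch of how the standard error σ/√n shrinks as n grows. The value σ = 10 and the function name `standard_error` are hypothetical choices for illustration.

```python
import math

def standard_error(sigma, n):
    """Standard error of the sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

sigma = 10.0  # hypothetical population standard deviation
for n in (25, 100, 400, 1600):
    print(f"n = {n:>4}: standard error = {standard_error(sigma, n)}")
```

Because of the square root, the sample size has to quadruple to cut the standard error in half.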

What is next?
- We are hunting population parameters [µ, σ].
  - What percentage of Americans approve of President Trump?
  - What is the average age of Missourians?
- We sample from the population and calculate sample statistics [ȳ, S].
- Today we are going to learn how to use sample statistics to estimate population parameters.
- How? Probability theory and sampling distributions: $\bar{y} \sim N(\mu, \sigma/\sqrt{n})$
- This will be our first true statistical inference.

Estimation basics
Point estimate: a sample statistic that gives a good guess about a population parameter.
- Example: point estimate for the population mean ($\hat{\mu}$): $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$
- Example: point estimate for the population standard deviation ($\hat{\sigma}$): $S = \sqrt{\frac{\sum_{i=1}^{n}(y_i - \bar{y})^2}{n-1}}$
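The two formulas can be sketched directly in code. The sample values below are made up for illustration; the standard-library functions `statistics.mean` and `statistics.stdev` are used only as a cross-check.

```python
import math
import statistics

# Hypothetical sample of ages; the values are made up for illustration.
sample = [23, 35, 41, 29, 52, 47, 38, 31]

n = len(sample)
y_bar = sum(sample) / n                                          # sample mean
s = math.sqrt(sum((y - y_bar) ** 2 for y in sample) / (n - 1))   # sample sd

# The standard library computes the same two quantities.
assert math.isclose(y_bar, statistics.mean(sample))
assert math.isclose(s, statistics.stdev(sample))

print("y_bar =", y_bar)
print("S     =", round(s, 3))
```

Note the divisor n - 1 in S, matching the formula on the slide.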

Estimation basics
How do we choose among multiple possible estimators?
- Sample mean ȳ? Sample median? 1st quartile? Maximum?
We want our estimators to be:
- Unbiased (i.e., accurate): $E(\hat{\mu}) = \mu$ with repeated sampling
- Efficient (i.e., precise): $\sigma_{\hat{\mu}}$ is small(er)
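A rough simulation sketch of these two criteria, comparing the sample mean and the sample median as estimators of µ. The normal population (µ = 50, σ = 10), the sample size, and the function name `compare_estimators` are assumptions made for illustration.

```python
import random
import statistics

def compare_estimators(n=25, reps=4000, mu=50.0, sigma=10.0, seed=1):
    """Repeatedly sample from a normal population and record two
    candidate estimators of mu: the sample mean and the sample median."""
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        means.append(statistics.fmean(sample))
        medians.append(statistics.median(sample))
    return means, medians

means, medians = compare_estimators()

# Bias: both averages should sit close to mu = 50.
print("average of means:  ", round(statistics.fmean(means), 2))
print("average of medians:", round(statistics.fmean(medians), 2))
# Efficiency: the estimator with the smaller spread is more precise.
print("sd of means:  ", round(statistics.stdev(means), 2))
print("sd of medians:", round(statistics.stdev(medians), 2))
```

Under this setup both estimators center on µ, but the sample mean varies less from sample to sample, which is what "efficient" means here.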

Point estimates
The point estimates for the population parameters µ and σ are:
- denoted $\hat{\mu}$ and $\hat{\sigma}$
- best estimated by ȳ and S.
They are best in terms of bias and efficiency.

Point and interval estimates
You are the campaign manager for a candidate who is deciding whether or not to publish a new deficit reduction proposal. You commission a poll of voters in the district to find out whether they approve or disapprove of this proposal. Which of the following statements would you find most useful from your pollster?
1. We can be 25% confident that between 54 and 55 percent of voters approve of the plan.
2. We can be 95% confident that between 48.5 and 59.5 percent of voters approve of the plan.
3. We can be 99% confident that between 45.75 and 62.25 percent of voters approve of the plan.
4. We can be 100% confident that between 0 and 100 percent of voters approve of the plan.

Confidence intervals
A point estimate is OK, but it is not very useful without knowing how much confidence to have in it. Solution: interval estimation.
Confidence interval: a range of numbers within which a population parameter is believed to fall.
Confidence level: the probability that an interval would contain the parameter with repeated sampling.
Examples:
- 0.95 -> 95% confidence interval
- 0.70 -> 70% confidence interval

Confidence intervals
Confidence interval: Point estimate ± Margin of error

Confidence interval for population means (large samples)
We can use the sampling distribution of ȳ (assuming a large sample) to calculate a confidence interval for the population mean.
- Parameter: µ. We want to estimate µ with $\hat{\mu}$.
- We use a sampling distribution and the CLT: $\bar{y} \sim N(\mu, \sigma/\sqrt{n})$
- An unbiased and efficient point estimate ($\hat{\mu}$) is the sample mean ȳ.
- Then, how do we calculate the margin of error in Point estimate ± Margin of error?
- Margin of error = z-score × standard error.
- The z-score depends on the confidence level: 95% level -> 1.96; 99% level -> 2.58.
- Standard error: $\sigma_{\bar{y}} = \sigma/\sqrt{n}$ by the CLT, estimated by $\hat{\sigma}_{\bar{y}} = S/\sqrt{n}$.
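As a sketch of where values like 1.96 and 2.58 come from, the z-score for a given confidence level can be read off the standard normal distribution. The helper name `z_for_level` is hypothetical; `statistics.NormalDist` is part of the Python standard library (3.8+).

```python
from statistics import NormalDist

def z_for_level(level):
    """z-score that leaves (1 - level) / 2 of the standard normal
    distribution in each tail."""
    return NormalDist().inv_cdf(1 - (1 - level) / 2)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} level -> z = {z_for_level(level):.2f}")
```

A higher confidence level pushes the cutoff further into the tail, so the z-score and the margin of error both grow.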

Confidence interval for population means (large samples)
- We plug in the estimate of σ, the sample standard deviation S, to get $\hat{\sigma}_{\bar{y}}$.
- We use ȳ to estimate µ, which is sometimes denoted $\hat{\mu}$.
- Now we have an estimated sampling distribution, $N(\bar{y}, \hat{\sigma}_{\bar{y}})$.
- We use our knowledge of the normal distribution to find a CI.
- E.g., for a 95% CI we want 2.5% of the probability to be outside of our interval on each side.
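One way to see what the confidence level means under repeated sampling is a small simulation sketch. It assumes a normal population with µ = 50 and σ = 10, samples of size n = 100, and the large-sample interval ȳ ± z·S/√n; all of these, along with the function name `coverage`, are illustrative choices.

```python
import random
import statistics

def coverage(level=0.95, n=100, reps=2000, mu=50.0, sigma=10.0, seed=2):
    """Fraction of repeated-sample intervals y_bar +/- z * S / sqrt(n)
    that contain the true population mean mu."""
    rng = random.Random(seed)
    z = statistics.NormalDist().inv_cdf(1 - (1 - level) / 2)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        y_bar = statistics.fmean(sample)
        se = statistics.stdev(sample) / n ** 0.5
        if y_bar - z * se <= mu <= y_bar + z * se:
            hits += 1
    return hits / reps

print("95% intervals covering mu:", coverage(0.95))
print("70% intervals covering mu:", coverage(0.70))
```

Each individual interval either contains µ or it does not; the confidence level describes the long-run share of intervals that do.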

Steps
1. Calculate ȳ.
2. Calculate S and then $\hat{\sigma}_{\bar{y}} = S/\sqrt{n}$.
3. How much area do we need under the curve to the right? (1 - Confidence Coefficient)/2
4. Find the z-score associated with that number.
5. Use these values to calculate $\bar{y} \pm Z\hat{\sigma}_{\bar{y}}$.
Exercise: If ȳ = 9.6, n = 100, and S = 4, what is the 99% confidence interval for µ?

Example
If we know ȳ = 9.6, S = 4, n = 100, how can we find a 95% confidence interval for the population mean µ?
- Find values for L and R on the standard normal distribution such that $\Pr(L \le \mu \le R) = 0.95$.
- Plug in our estimates, and see that $\bar{y} \sim N(\mu, \sigma_{\bar{y}}) \approx N(\bar{y}, S/\sqrt{n})$.
- $L = \bar{y} - Z\hat{\sigma}_{\bar{y}}$, $R = \bar{y} + Z\hat{\sigma}_{\bar{y}}$
- Look for (1 - 0.95)/2 = 0.025 on the z-table.
- Answer: $\bar{y} \pm 1.96\,\hat{\sigma}_{\bar{y}} = 9.6 \pm 1.96 \times \frac{4}{10} = [8.816, 10.384]$
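The calculation in this example can be sketched as a short function that works from the summary statistics. The function name `mean_confidence_interval` is hypothetical, and the z-score is taken from `statistics.NormalDist` rather than a printed z-table.

```python
import math
from statistics import NormalDist

def mean_confidence_interval(y_bar, s, n, level=0.95):
    """Large-sample confidence interval for mu from summary statistics."""
    se = s / math.sqrt(n)                 # estimated standard error
    tail = (1 - level) / 2                # area needed in the right tail
    z = NormalDist().inv_cdf(1 - tail)    # z-score for that tail area
    return y_bar - z * se, y_bar + z * se

low, high = mean_confidence_interval(9.6, 4, 100, level=0.95)
print(f"[{low:.3f}, {high:.3f}]")
```

With these inputs the result agrees with the interval above to three decimals; passing a different `level` reuses the same steps for other confidence levels.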

When to use z-score?
- The sample size is large, or
- We know σ and the population is normally distributed.

When the sample size is small
To compute a CI for a random sample of any size, we need to assume that the population is normally distributed:
$CI = \bar{y} \pm t_{n-1}\left(\frac{s}{\sqrt{n}}\right)$
- We use the t-distribution (t-score) instead of the normal distribution (z-score), because the error produced by estimating $\sigma/\sqrt{n}$ using $s/\sqrt{n}$ is large due to the small sample size.
- A t-score is larger than a z-score -> a wider CI. This accounts for the increased error.
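A small sketch comparing the t-based and z-based intervals for one small sample. The sample values are made up, and the critical values in `T_95` are copied from a standard two-sided 95% t-table rather than computed, since the Python standard library has no t-distribution.

```python
import math
import statistics

# Two-sided 95% critical values copied from a standard t-table, keyed by df.
T_95 = {4: 2.776, 9: 2.262, 14: 2.145, 19: 2.093, 29: 2.045}
Z_95 = 1.96

# Hypothetical small sample (n = 10); the values are made up.
sample = [12.1, 9.8, 11.4, 10.6, 13.0, 8.9, 10.2, 11.7, 9.5, 12.3]

n = len(sample)
y_bar = statistics.fmean(sample)
se = statistics.stdev(sample) / math.sqrt(n)

t = T_95[n - 1]  # degrees of freedom = n - 1
t_interval = (y_bar - t * se, y_bar + t * se)
z_interval = (y_bar - Z_95 * se, y_bar + Z_95 * se)

print("t interval:", tuple(round(x, 2) for x in t_interval))
print("z interval:", tuple(round(x, 2) for x in z_interval))
```

The point estimate and standard error are identical in both intervals; only the multiplier changes, so the t interval is the wider of the two.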

The t-distribution and the normal
[Figure: the t-distribution compared with the normal distribution]

Notes on the t-distribution
- It has thicker tails than the normal distribution.
- Symmetric and bell-shaped.
- Dispersion depends on degrees of freedom, sometimes listed as df or DOF.
- As df -> ∞, the t-distribution becomes essentially the normal distribution.
NOTE: The use of the t-distribution is not related to the CLT. We are assuming the data is normally distributed.
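The thicker tails can be checked with a rough simulation sketch: draw samples from a standard normal population, compute (ȳ - µ)/(S/√n) for each, and count how often it lands beyond ±1.96. The sample sizes, repetition count, and function name `tail_fraction` are illustrative assumptions.

```python
import math
import random

def tail_fraction(n, reps=5000, cutoff=1.96, seed=3):
    """Share of simulated statistics (y_bar - mu) / (S / sqrt(n)) falling
    beyond +/- cutoff when sampling from a standard normal (mu = 0)."""
    rng = random.Random(seed)
    beyond = 0
    for _ in range(reps):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        y_bar = sum(sample) / n
        s = math.sqrt(sum((y - y_bar) ** 2 for y in sample) / (n - 1))
        if abs(y_bar / (s / math.sqrt(n))) > cutoff:
            beyond += 1
    return beyond / reps

for n in (5, 30, 100):
    print(f"n = {n:>3} (df = {n - 1}): share beyond 1.96 = {tail_fraction(n):.3f}")
```

A standard normal puts 5% of its probability beyond ±1.96. With few degrees of freedom the share is noticeably larger, and it moves toward 5% as df grows.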

What about for categorical variables?
Useful for any categorical variable:
- Citizens who plan to vote
- Nations with low tariffs
- Congressmen who support a balanced budget amendment
- Students with blue eyes
To summarize categorical data, we record the proportions of observations in the categories.
Some new notation:
- Population parameter: $0 \le \pi \le 1$
- Estimator: $\hat{\pi}$ = sample proportion
- Population parameter: $\sigma = \sqrt{\pi(1-\pi)}$
- Estimator: $\hat{\sigma} = \sqrt{\hat{\pi}(1-\hat{\pi})}$
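A short derivation sketch of why the standard deviation of a 0/1 variable depends only on π. It uses the standard identity Var(y) = E(y²) - [E(y)]², which is not stated on the slide, with y = 1 occurring with probability π.

```latex
\begin{align*}
E(y)     &= 1\cdot\pi + 0\cdot(1-\pi) = \pi, \\
E(y^2)   &= 1^2\cdot\pi + 0^2\cdot(1-\pi) = \pi, \\
\sigma^2 &= E(y^2) - [E(y)]^2 = \pi - \pi^2 = \pi(1-\pi), \\
\sigma   &= \sqrt{\pi(1-\pi)}.
\end{align*}
```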

Confidence intervals for proportions: A how-to guide
$y_i = 0$ or $y_i = 1$ for all i. You either have blue eyes or you don't, etc.
So how to calculate a confidence interval? Calculate
- an estimator ($\hat{\pi}$): $\hat{\pi} = \frac{1}{n}\sum_{i=1}^{n} y_i$ (psst... this is ȳ)
- a standard error for the estimator ($\sigma_{\hat{\pi}}$): $\hat{\sigma}_{\hat{\pi}} = \frac{\hat{\sigma}}{\sqrt{n}} = \frac{\sqrt{\hat{\pi}(1-\hat{\pi})}}{\sqrt{n}} = \sqrt{\frac{\hat{\pi}(1-\hat{\pi})}{n}}$
- and find the right Z for the confidence coefficient.
The confidence interval is $\hat{\pi} \pm Z\hat{\sigma}_{\hat{\pi}}$.

Take-away message
- The major conceptual difference is that this works for all categorical data.
- The major difference here is in the calculation of $\hat{\sigma}$:
  - We can calculate it using just $\hat{\pi}$.
  - The formula is quite different from the one for S.
- This is for large-sample confidence intervals (n > 30).

Examples: Confidence intervals with proportions
- Source: Gallup poll
- Sample size: n = 1785
- Population: U.S. adults
- Question: Did you, yourself, attend church or synagogue in the past 7 days?
- Sample data: 750 said yes. How many said no? 1035
Find a 95% confidence interval for the proportion who went to church or synagogue.

Examples: Confidence intervals with proportions
- Sample statistic: $\hat{\pi} = 0.420$, $\hat{\sigma}_{\hat{\pi}} = \sqrt{\hat{\pi}(1-\hat{\pi})/n}$
- Confidence interval: $0.420 \pm 1.96\sqrt{0.420(1-0.420)/1785}$
- Final answer: 0.420 ± 0.023 = [0.397, 0.443]
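The same calculation can be sketched in code from the raw counts (750 yes out of 1785). The function name `proportion_confidence_interval` is hypothetical, and the z-score comes from `statistics.NormalDist` instead of a z-table. Small differences in the third decimal can arise from rounding intermediate values.

```python
import math
from statistics import NormalDist

def proportion_confidence_interval(successes, n, level=0.95):
    """Large-sample confidence interval for a population proportion.
    Returns the point estimate and the margin of error."""
    pi_hat = successes / n                          # point estimate
    se = math.sqrt(pi_hat * (1 - pi_hat) / n)       # estimated standard error
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)   # z-score for the level
    return pi_hat, z * se

pi_hat, margin = proportion_confidence_interval(750, 1785)
print(f"pi_hat = {pi_hat:.3f}, margin of error = {margin:.4f}")
print(f"interval = [{pi_hat - margin:.3f}, {pi_hat + margin:.3f}]")
```

Only the count of "yes" answers and the sample size are needed, because the standard error is computed from the estimated proportion itself.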