μ: ESTIMATES, CONFIDENCE INTERVALS, AND TESTS Business Statistics

CONTENTS
  Estimating parameters
  The sampling distribution
  Confidence intervals for μ
  Hypothesis tests for μ
  The t-distribution
  Comparison of z and t
  Old exam question
  Further study

ESTIMATING PARAMETERS
  Central task in inferential statistics
  Estimation: estimating a parameter (population value) from a sample
  Example: what proportion of cars in Amsterdam is electric?
    population value: π
    a sample of size n = 200 cars yields 26 electric cars
    so $p = 26/200 = 0.13$
    this suggests $\pi \approx 0.13$

ESTIMATING PARAMETERS
  Terminology
  Parameter: a characteristic descriptive of the population
    e.g., μ, π, σ (or $\sigma^2$)
  Estimator: a statistic derived from a sample to infer the value of a population parameter
    e.g., $\bar{X}$, $P$, $S$ (or $S^2$)
  Estimate: the value of the estimator in a particular sample
    e.g., $\bar{x}$, $p$, $s$ (or $s^2$)


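ESTIMATING PARAMETERS
                       Mean                                           Standard deviation                                                  Proportion
  Estimator            $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$      $S = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2}$          $P = X/n$
  Estimate             $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$      $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$          $p = x/n$
  Population parameter μ                                              σ                                                                   π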
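The estimate formulas for the mean, standard deviation, and proportion can be sketched in a few lines of Python. The price data below are hypothetical and only serve to illustrate the arithmetic; the proportion reuses the 26 electric cars out of 200 from the earlier example.

```python
import math

def sample_mean(xs):
    # x-bar = (1/n) * sum of x_i
    return sum(xs) / len(xs)

def sample_sd(xs):
    # s = sqrt( 1/(n-1) * sum of (x_i - x-bar)^2 )
    m = sample_mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))

def sample_proportion(successes, n):
    # p = x / n
    return successes / n

prices = [1.90, 2.10, 2.00, 2.20, 2.05]  # hypothetical sample, not from the slides
x_bar = sample_mean(prices)
s = sample_sd(prices)
p = sample_proportion(26, 200)           # electric cars example

print(f"x-bar = {x_bar:.3f}, s = {s:.4f}, p = {p:.2f}")
```

Note the divisor n - 1 in the standard deviation, which is what distinguishes the sample estimate s from a population formula with divisor N.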

ESTIMATING PARAMETERS
  Another example (Amsterdam, 2015): what is the mean price of a glass of beer?
    population value: μ
    a sample of size n = 64 glasses of beer yields $\bar{x} = 2.06$
    this suggests that $\mu \approx 2.06$
  But suppose we had taken a different sample
    again with sample size n = 64
    but now perhaps yielding $\bar{x} = 2.13$
    then we would estimate $\mu \approx 2.13$
  Obviously there is sampling variation
    so there is a distribution of $\bar{x}$-values (the sampling distribution of $\bar{X}$)
  Solution: point estimates and confidence intervals

THE SAMPLING DISTRIBUTION
  Example
  Consider a discrete uniform population consisting of the integers {0, 1, 2, 3}
  The population parameters are:
    μ = 1.5
    σ = 1.118

THE SAMPLING DISTRIBUTION
  Sample n = 2 values and calculate $\bar{x}$
  Do this for all possible samples of size n = 2
  You will get a distribution of $\bar{x}$-values: the distribution of $\bar{X}$
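A minimal sketch of this enumeration for the population {0, 1, 2, 3}. It assumes sampling with replacement and treats ordered pairs as distinct samples (16 in total); the slide does not state the sampling scheme, so that is an assumption of this example.

```python
from collections import Counter
from fractions import Fraction
from itertools import product
import math

population = [0, 1, 2, 3]
n = 2

# Assumption: sampling with replacement, so there are 4**2 = 16 ordered samples.
samples = list(product(population, repeat=n))
means = [Fraction(sum(sample), n) for sample in samples]

# Probability of each possible value of the sample mean.
counts = Counter(means)
probs = {m: Fraction(c, len(samples)) for m, c in sorted(counts.items())}

mean_of_means = sum(m * p for m, p in probs.items())
var_of_means = sum((m - mean_of_means) ** 2 * p for m, p in probs.items())
sd_of_means = math.sqrt(var_of_means)

for m, p in probs.items():
    print(f"x-bar = {float(m):.1f}  probability = {p}")
print("mean of x-bar:", float(mean_of_means))
print("sd of x-bar:  ", round(sd_of_means, 4))
```

Under this assumption the mean of the sample means equals the population mean 1.5, and their standard deviation equals 1.118/√2, which is smaller than the population standard deviation.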

THE SAMPLING DISTRIBUTION
  We will study the variance of the estimate of a population parameter from a sample statistic
  We will do so by studying how the sample statistic varies when you draw a different sample
  Example: GMAT score of MBA students
    N = 2637
    μ = 520.78
    σ = 86.60

THE SAMPLING DISTRIBUTION
  Consider eight random samples, each of size n = 5
    the sample means ($\bar{x}_1 = 504.0$, $\bar{x}_2 = 576.0$, ..., $\bar{x}_8 = 582$) tend to be close to the population mean (μ = 520.78)
    sometimes a bit lower, sometimes a bit higher

THE SAMPLING DISTRIBUTION
  The dot plots show that the sample means ($\bar{x}_1, \ldots, \bar{x}_8$) have much less variation than the individual data points ($x_1, \ldots, x_{2637}$)

THE SAMPLING DISTRIBUTION
  An estimator is a random variable since samples vary
    so we write it as a capital letter, e.g., $X$, $\bar{X}$, $S$, etc.
  The sampling distribution of an estimator is the probability distribution of all possible values the statistic may assume when a random sample of (a fixed) size n is taken
    so we write $X \sim N(\mu, \sigma^2)$, etc.

THE SAMPLING DISTRIBUTION
  The sampling distribution of $\bar{X}$ for a population with $\mu = \mu_X$ and $\sigma^2 = \sigma_X^2$
  If the CLT holds: $\bar{X} \sim N\left(\mu_X, \frac{\sigma_X^2}{n}\right)$
  So, the statistic $\bar{X}$ (3 things: shape, mean, dispersion)
    is normally distributed
    has mean $\mu_X$
    and has standard deviation $\sigma_X/\sqrt{n}$
  Fortunately, the CLT holds pretty often

THE SAMPLING DISTRIBUTION
  The standard deviation of the distribution of sample means $\bar{X}$
    is given by $\sigma_{\bar{X}} = \sigma_X/\sqrt{n}$
    has a special name: standard error of the mean
    is often abbreviated as the standard error (SE)
      that's a bit confusing, because we will meet more standard errors later on
    decreases with increasing sample size
    but only according to the law of diminishing returns ($1/\sqrt{n}$)
    is often calculated by software (SPSS, etc.)
    is the basis for confidence intervals and hypothesis tests (see later)
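The diminishing returns of a larger sample can be made concrete with a short sketch. It uses σ = 86.60 from the GMAT example; the sample sizes are chosen for illustration only.

```python
import math

def standard_error(sigma, n):
    # Standard error of the mean: sigma / sqrt(n)
    return sigma / math.sqrt(n)

sigma = 86.60  # population standard deviation from the GMAT example

# Illustrative sample sizes: each step quadruples n.
for n in (5, 20, 80, 320):
    print(f"n = {n:3d}  SE = {standard_error(sigma, n):6.2f}")
```

Each fourfold increase in n only halves the standard error, which is the $1/\sqrt{n}$ behaviour described above.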

EXERCISE 1 What is the meaning of the standard error?

CONFIDENCE INTERVALS FOR μ
  A sample mean $\bar{x}$ is a point estimate of the population mean μ
    it is the best possible estimate of μ
    but it will probably not be completely right
  A confidence interval (CI) for the mean is a range of possible values for μ: $\mu_{lower} \leq \mu \leq \mu_{upper}$
    such that the interval $CI_{\mu} = [\mu_{lower}, \mu_{upper}]$ contains the true value (μ) with a certain probability (e.g., 95%)
  To simplify notation, we will drop the X from $\mu_X$ now, and write just μ

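CONFIDENCE INTERVALS FOR μ
  From the CLT it follows that under certain conditions:
    the distribution of $\bar{X}$ is normal
    the best estimate of μ is $\bar{x}$
    the standard deviation of $\bar{X}$ is $\sigma/\sqrt{n}$
  This implies that:
    with probability 2.5%, $\bar{X} < \mu - 1.96\frac{\sigma}{\sqrt{n}} \Leftrightarrow \mu > \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}$
    with probability 2.5%, $\bar{X} > \mu + 1.96\frac{\sigma}{\sqrt{n}} \Leftrightarrow \mu < \bar{X} - 1.96\frac{\sigma}{\sqrt{n}}$
    so with probability 95%, $\bar{X} - 1.96\frac{\sigma}{\sqrt{n}} \leq \mu \leq \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}$
  So, if we find a sample mean $\bar{x}$, we can construct the following 95% confidence interval for μ:
    $CI_{\mu,0.95} = \left[\bar{x} - 1.96\frac{\sigma}{\sqrt{n}},\ \bar{x} + 1.96\frac{\sigma}{\sqrt{n}}\right]$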

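CONFIDENCE INTERVALS FOR μ
  Three notations for a confidence interval for μ
    $\left[\bar{x} - 1.96\frac{\sigma}{\sqrt{n}},\ \bar{x} + 1.96\frac{\sigma}{\sqrt{n}}\right]$
    $\bar{x} - 1.96\frac{\sigma}{\sqrt{n}} \leq \mu \leq \bar{x} + 1.96\frac{\sigma}{\sqrt{n}}$
    $\bar{x} \pm 1.96\frac{\sigma}{\sqrt{n}}$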

CONFIDENCE INTERVALS FOR μ
  Example
  Population
    μ = 520.78 (unknown)
    σ = 86.60 (known)
    normally distributed (assumed)
  Sample
    n = 5 (chosen)
    $\bar{x} = 504.0$ (estimated)
  Calculation
    standard error of mean: $86.60/\sqrt{5} = 38.73$
    $1.96 \times 38.73 = 75.91$
    $CI_{\mu,0.95} = [428.09, 579.91]$
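The calculation in this example can be reproduced with a short sketch. The function name `z_confidence_interval` is made up for illustration; the critical value comes from Python's standard-library `statistics.NormalDist`, so it is about 1.95996 rather than the rounded 1.96.

```python
import math
from statistics import NormalDist

def z_confidence_interval(x_bar, sigma, n, confidence=0.95):
    """Confidence interval for the mean when sigma is known (hypothetical helper)."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # upper alpha/2 critical value
    margin = z * sigma / math.sqrt(n)         # z times the standard error
    return x_bar - margin, x_bar + margin

# Values from the GMAT example: x-bar = 504.0, sigma = 86.60, n = 5.
lower, upper = z_confidence_interval(504.0, 86.60, 5)
print(f"CI = [{lower:.2f}, {upper:.2f}]")
```

Rounded to two decimals this gives the same interval as the hand calculation.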

EXERCISE 2 Write the confidence interval [428.09, 579.91] in two alternative ways.

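CONFIDENCE INTERVALS FOR μ
  The factor 1.96 is of course related to the 95% probability
  Other confidence levels: replace 1.96 by $z_{\alpha/2}$
    where $z_{\alpha/2}$ is such that $P(|Z| \geq z_{\alpha/2}) = \alpha$ if $Z$ is drawn from a Z-distribution
  General form of a $(1-\alpha) \cdot 100\%$ confidence interval of the mean:
    $CI_{\mu,1-\alpha} = \left[\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right]$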
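As an illustration, the factor $z_{\alpha/2}$ for a few common confidence levels can be looked up with the standard library instead of a table. The helper name `z_critical` and the choice of levels are assumptions of this sketch.

```python
from statistics import NormalDist

def z_critical(confidence):
    """Two-sided critical value z_{alpha/2} for a given confidence level (hypothetical helper)."""
    alpha = 1 - confidence
    # P(Z >= z) = alpha/2 is the same as P(Z <= z) = 1 - alpha/2
    return NormalDist().inv_cdf(1 - alpha / 2)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: z = {z_critical(level):.3f}")
```

A higher confidence level gives a larger factor and therefore a wider interval.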


CONFIDENCE INTERVALS FOR μ
  Trade-off
    narrow CI ↔ low confidence level
    wide CI ↔ high confidence level
  Choice of confidence level depends on the application
    more precision required for a refinery than for a dairy farm

CONFIDENCE INTERVALS FOR μ
  A confidence interval either does or does not contain μ
  The confidence level quantifies the risk
  Out of 100 confidence intervals at the 95% level, approximately 95 will contain μ, while approximately 5 will not
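The long-run interpretation can be checked with a small simulation. This is a sketch under stated assumptions: the population is taken to be normal with the GMAT parameters, σ is treated as known, and the number of trials and the seed are arbitrary choices.

```python
import math
import random

def coverage(mu, sigma, n, trials, z=1.96, seed=0):
    """Fraction of simulated 95% z-intervals that contain the true mean."""
    rng = random.Random(seed)
    margin = z * sigma / math.sqrt(n)   # same margin for every sample (sigma known)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        x_bar = sum(sample) / n
        if x_bar - margin <= mu <= x_bar + margin:
            hits += 1
    return hits / trials

# Assumption: normal population with mu = 520.78 and sigma = 86.60.
rate = coverage(mu=520.78, sigma=86.60, n=5, trials=10_000)
print(f"share of intervals containing mu: {rate:.3f}")
```

The share should come out close to 0.95, but any single interval either contains μ or it does not.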

HYPOTHESIS TESTS FOR μ
  We can use the standard error to perform a hypothesis test
    recall that $CI_{\mu,0.95} = [428.09, 579.91]$
  Suppose we hypothesize μ = 550
  The value 550 is inside the 95% confidence interval for μ
    therefore the sample statistic and its confidence interval do not suggest that the hypothesis (μ = 550) is wrong
    and we will not reject the hypothesis
    notice that we didn't say that μ = 550; we only said that we can't reject it (at a 5% significance level)

HYPOTHESIS TESTS FOR μ
  Another example: suppose we hypothesize that μ = 600
  The value 600 is outside the confidence interval for μ
    finding a confidence interval not containing μ happens only in 5% of the cases
    therefore the sample statistic and its confidence interval suggest that the hypothesis (μ = 600) is wrong
    and we will reject the hypothesis
    so we conclude that $\mu \neq 600$ (at a 5% significance level)
  Much more on hypothesis tests later on!
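The decision rule used in these two examples amounts to checking whether the hypothesized value lies inside the interval. A minimal sketch, with a made-up helper name `reject_hypothesis` and the interval [428.09, 579.91] from the z-based example:

```python
def reject_hypothesis(mu_0, ci):
    """Reject the hypothesis mu = mu_0 when mu_0 lies outside the confidence interval."""
    lower, upper = ci
    return not (lower <= mu_0 <= upper)

ci_95 = (428.09, 579.91)  # 95% confidence interval from the z-based example

for mu_0 in (550, 600):
    decision = "reject" if reject_hypothesis(mu_0, ci_95) else "do not reject"
    print(f"hypothesis mu = {mu_0}: {decision} (5% significance level)")
```

Using a 95% interval corresponds to a 5% significance level; a different confidence level would change both the interval and the significance level.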

THE t-distribution
  A closer look at $CI_{\mu,0.95} = \left[\bar{x} - 1.96\frac{\sigma}{\sqrt{n}},\ \bar{x} + 1.96\frac{\sigma}{\sqrt{n}}\right]$
  Given a sample mean $\bar{x}$, you can find a 95% confidence interval for the population mean μ
  Sounds great when you don't know μ ...
    ... but it assumes you do know σ!
  There are many situations in which you don't know μ and you also don't know σ
  So what to do?

THE t-distribution
  A simple strategy
  If the population standard deviation σ is unknown, we can estimate it with the sample standard deviation s
  Then we use $\pm 1.96\frac{s}{\sqrt{n}}$ instead of $\pm 1.96\frac{\sigma}{\sqrt{n}}$
  But we pay a price for that
    the reason is that s is itself an estimate of σ, and therefore uncertain
    the price we pay is that the factor 1.96 must be somewhat larger

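THE t-distribution
  Recall that the CLT yields that
    $Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0,1)$
    where $Z$ is the standard normal distribution
  Likewise, it can be shown that
    $\frac{\bar{X} - \mu}{s/\sqrt{n}} \sim t$
    where $t$ is the t-distribution (or Student's t-distribution)
  which has an even more complicated formula than the normal distribution
    $f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}z^2}$ vs. $f(t;\nu) = \frac{\Gamma\left(\frac{1}{2}(\nu+1)\right)}{\sqrt{\nu\pi}\;\Gamma\left(\frac{1}{2}\nu\right)} \left(1 + \frac{t^2}{\nu}\right)^{-\frac{1}{2}(\nu+1)}$
  Arrrgh: forget quickly!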

THE t-distribution
  The confidence interval for μ with unknown σ is
    $CI_{\mu,1-\alpha} = \left[\bar{x} - t_{\alpha/2}\frac{s}{\sqrt{n}},\ \bar{x} + t_{\alpha/2}\frac{s}{\sqrt{n}}\right]$
    where $t_{\alpha/2}$ is such that $P(|T| \geq t_{\alpha/2}) = \alpha$ if $T$ is drawn from a t-distribution
  What is the t-distribution?
    quite similar to the Z-distribution (μ = 0, continuous, symmetric, bell-shaped, infinite range, ...)
    but with slightly fatter tails
    it has 1 parameter, usually denoted with df or ν, and called degrees of freedom

THE t-distribution
  Graph of pdf of t-distribution
  [Figure: pdf $f(x)$ against $x$ for the Z (standard normal) distribution and for t-distributions with df = 5, df = 13, and df = 1000]

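THE t-distribution
  Different notations
    $t_{13}$, $t_{df=13}$, etc.
  And likewise
    $t_{13;\alpha/2}$, $t_{13}(\alpha/2)$, etc.
  So altogether for the confidence interval
    $CI_{\mu,1-\alpha} = \left[\bar{x} - t_{n-1;\alpha/2}\frac{s}{\sqrt{n}},\ \bar{x} + t_{n-1;\alpha/2}\frac{s}{\sqrt{n}}\right]$
  Compare to
    $\left[\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right]$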


THE t-distribution
  How to choose the parameter df?
    it is a parameter based on the sample size that is used to determine the value of the t-statistic
    it tells how many observations are used to estimate σ, less the number of intermediate estimates used in the calculation
    the df for the t-distribution in the case of a confidence interval for μ when σ is unknown is $df = n - 1$
    but in other cases, it may be different
  Properties of t
    as n increases, the t-distribution approaches the shape of the normal distribution
    for a given confidence level $1 - \alpha$, t is always larger than z, so for the same standard error a confidence interval based on t is always wider than if z were used

THE t-distribution
  Reading the table of critical t-values
    e.g., for α/2 = 0.025 and df = 9: $t_{9;0.025} = 2.262$
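For readers without a table at hand, a critical t-value can also be approximated numerically. The sketch below is one possible approach, not the method behind the printed table: it integrates the t-density with Simpson's rule and finds the critical value by bisection, using only the standard library. All function names are made up for illustration.

```python
import math

def t_pdf(t, df):
    # Density of the t-distribution; lgamma avoids overflow for large df.
    log_coef = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_coef - (df + 1) / 2 * math.log1p(t * t / df))

def t_upper_tail(t, df, steps=2000):
    """P(T >= t) for t >= 0: 0.5 minus the Simpson integral of the pdf over [0, t]."""
    h = t / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

def t_critical(tail_prob, df):
    """Value t with P(T >= t) = tail_prob, found by bisection on [0, 1000]."""
    lo, hi = 0.0, 1000.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_upper_tail(mid, df) > tail_prob:
            lo = mid                   # tail still too large: move right
        else:
            hi = mid
    return (lo + hi) / 2

print(f"t(df=9, alpha/2=0.025) = {t_critical(0.025, 9):.3f}")
```

For α/2 = 0.025 and df = 9 the result agrees with the table value 2.262 to three decimals. Statistical software offers the same lookup directly; this sketch only shows what the table encodes.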

THE t-distribution
  Look carefully at tables for z and t:
    z usually runs from left to right: $P(X \leq z) = \int_{-\infty}^{z} f(x)\,dx$
    t usually runs from right to left: $P(X \geq t) = \int_{t}^{\infty} f(x)\,dx$

THE t-distribution
  Background of t
    developed by William Gosset in 1908
    while working at the Guinness Brewery, Dublin
    published under the pen name "Student"

THE t-distribution
  Example for confidence interval
  Population
    μ = 520.78 (unknown)
    σ = 86.60 (unknown): now we have a situation in which σ is not known to us
    normally distributed (assumed)
  Sample
    n = 5 (chosen), so df = 4
    $\bar{x} = 504.0$ (estimated)
    $s = 73.01$ (estimated)
  Calculation
    standard error of mean: $s/\sqrt{n} = 32.65$
    $2.776 \times 32.65 \approx 90.65$ (using unrounded factors)
    $CI_{\mu,0.95} = [413.35, 594.65]$
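A sketch of this calculation in Python. The helper name `t_confidence_interval` is hypothetical, and the critical value is passed in by hand because the standard library has no t-quantile function. The value 2.7764 is $t_{4;0.025}$ to four decimals; the table value 2.776 is the same number rounded to three.

```python
import math

def t_confidence_interval(x_bar, s, n, t_crit):
    """Confidence interval for the mean when sigma is estimated by s (hypothetical helper)."""
    se = s / math.sqrt(n)        # estimated standard error of the mean
    margin = t_crit * se
    return x_bar - margin, x_bar + margin

# Example values: x-bar = 504.0, s = 73.01, n = 5, so df = 4.
T_CRIT_DF4 = 2.7764              # t_{4;0.025}, four decimals (table: 2.776)
lower, upper = t_confidence_interval(504.0, 73.01, 5, T_CRIT_DF4)
print(f"CI = [{lower:.2f}, {upper:.2f}]")
```

With the three-decimal table value 2.776 the margin rounds to 90.64 instead of 90.65, so the last digit of the limits can differ by 0.01 depending on rounding.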

THE t-distribution
  Repeat the hypothesis test for this case
    now $CI_{\mu,0.95} = [413.35, 594.65]$
  So we will reject the hypothesis μ = 600
    while we will not reject the hypothesis μ = 550
  Exactly the same reasoning as with the z-test, but with (slightly) different numbers

COMPARISON OF z AND t
  When to use which?
    for a confidence interval for μ if $\sigma^2$ is known: use z
    for a confidence interval for μ if $\sigma^2$ is unknown: use t, and estimate $\sigma^2$ by $s^2$
  How to find?
    from a table with z-values: given α, look up z
    from a table with t-values: given α and df, look up t
  What is the difference?
    confidence intervals with t are a bit wider than with z
    the difference is small for $n \geq 30$ and negligible for $n \geq 100$

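COMPARISON OF z AND t
  Example: 50 confidence intervals with z and t
  [Figure: 50 samples of size n = 10, simulated from an N(2, 9) distribution, with the confidence intervals plotted against sample number i. One panel is based on $\sigma^2$: $\bar{X} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$. The other panel is based on $s^2$: $\bar{X} \pm t_{\alpha/2;n-1}\frac{S}{\sqrt{n}}$.]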
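A comparable experiment can be sketched in a few lines. The assumptions here are not from the slide: N(2, 9) is read as mean 2 and variance 9 (so σ = 3), the seed is arbitrary, and the critical values are the rounded 95% values 1.96 (z) and 2.262 (t with df = 9).

```python
import math
import random
import statistics

MU, SIGMA = 2.0, 3.0      # assumption: N(2, 9) means variance 9, so sigma = 3
N, SAMPLES = 10, 50
Z_CRIT = 1.96             # z_{0.025}
T_CRIT = 2.262            # t_{9;0.025}, since df = n - 1 = 9

def simulate(seed=1):
    """Return (x_bar, z half-width, t half-width) for each simulated sample."""
    rng = random.Random(seed)
    rows = []
    for _ in range(SAMPLES):
        sample = [rng.gauss(MU, SIGMA) for _ in range(N)]
        x_bar = statistics.fmean(sample)
        s = statistics.stdev(sample)                 # divisor n - 1
        z_half = Z_CRIT * SIGMA / math.sqrt(N)       # uses the known sigma
        t_half = T_CRIT * s / math.sqrt(N)           # uses the estimate s
        rows.append((x_bar, z_half, t_half))
    return rows

rows = simulate()
z_hits = sum(abs(x_bar - MU) <= z_half for x_bar, z_half, _ in rows)
t_hits = sum(abs(x_bar - MU) <= t_half for x_bar, _, t_half in rows)
print(f"z intervals containing mu: {z_hits} of {SAMPLES}")
print(f"t intervals containing mu: {t_hits} of {SAMPLES}")
```

All z-based intervals have the same width because σ is fixed, whereas the t-based widths change from sample to sample with s. An individual t-interval can therefore be narrower than the z-interval when s happens to fall well below σ.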

OLD EXAM QUESTION 23 March 2015, Q1l

FURTHER STUDY
  Doane & Seward 5/E 8.4-8.5, 10.4
  Tutorial exercises week 2
    point estimate confidence interval, z test for mean t test for mean z versus t