Point Estimation. Edwin Leuven


1 Point Estimation Edwin Leuven

2 Introduction

Last time we reviewed statistical inference. We saw that while in probability we ask:

given a data generating process, what are the properties of the outcomes?

in statistics the question is the reverse:

given the outcomes, what can we say about the process that generated the data?

Statistical inference consists of

1. Estimation (point, interval)
2. Inference (quantifying sampling error, hypothesis testing)

3 Introduction

Today we take a closer look at point estimation. We will go over three desirable properties of an estimator:

1. Unbiasedness
2. Consistency
3. Efficiency

and how to quantify the trade-off between location and variance using the Mean Squared Error (MSE).

4 Random sampling

Statistical inference starts with an assumption about how our data came about (the "data generating process").

We introduced the notion of sampling, where we consider the observations in our data $X_1, \ldots, X_n$ as draws from a population or, more generally, from an unknown probability distribution $f(X)$.

Simple Random Sample: we call a sample $X_1, \ldots, X_n$ random if the $X_i$ are independent and identically distributed (i.i.d.) random variables.

Random samples arise if we draw each unit in the population into our sample with equal probability.

5 Random sampling

We will assume throughout that our samples are random!

The aim is to use our data $X_1, \ldots, X_n$ to learn something about the unknown probability distribution $f(X)$ that the data came from.

We typically focus on $E[X]$, the mean of $X$, to explain things, but we can ask many different questions:

What is the variance of $X$?
What is the 10th percentile of $X$?
What fraction of $X$ lies below 100,000?
etc.

Very often we are interested in comparing measurements across populations:

What is the difference in earnings between men and women?

6 Bias

Consider

1. the estimand $E[X]$, and
2. an estimator $\hat{X}$

What properties do we want our estimator $\hat{X}$ to have? One desirable property is that $\hat{X}$ is correct on average. We call such estimators unbiased.

Bias: $\mathrm{Bias} = E[\hat{X}] - E[X]$

Unbiased: $E[\hat{X}] = E[X]$

7 Bias

The estimand, in our example the population mean $E[X]$, is a number.

For a given sample $\hat{X}$ is also a number; we call this the estimate.

Bias is not the difference between the estimate and the estimand; this is the estimation error.

Bias is the average estimation error across (infinitely) many random samples!
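The difference between the estimation error in one sample and the bias can be illustrated with a short simulation. This is a minimal sketch, assuming draws from the N(5, 3) population used on p.10 and the sample average as the estimator; the seed and the variable names are illustrative choices, not part of the original slides.

```r
set.seed(1)                      # arbitrary seed, for reproducibility
mu = 5; n = 20; nrep = 10^4      # assumed population mean, sample size, replications

# One sample: the estimate is a number, and so is its estimation error
x = rnorm(n, mu, sqrt(3))
error_one_sample = mean(x) - mu

# Many samples: the bias is the average estimation error across samples
errors = replicate(nrep, mean(rnorm(n, mu, sqrt(3))) - mu)
bias_hat = mean(errors)

error_one_sample                 # generally not zero
bias_hat                         # close to zero: the sample average is unbiased
```

The single-sample error varies from sample to sample, while the average error settles near zero as the number of replications grows.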

8 Estimating the Mean of X

The sample average is an unbiased estimator of the mean:

$$E[\bar{X}_n] = \frac{1}{n} \sum_{i=1}^{n} E[X_i] = E[X]$$

but we can think of different unbiased estimators; for example, $X_1$ is also an unbiased estimator of $E[X]$.

If $X$ has a symmetric distribution then both $\mathrm{median}(x)$ and $(\min(x) + \max(x))/2$ are unbiased.
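The unbiasedness of these four estimators can be checked numerically. The following is a sketch under assumed settings (a symmetric N(5, 3) population, n = 25, an arbitrary seed); the names `est`, `average`, `first`, `median` and `midrange` are illustrative.

```r
set.seed(2)                      # arbitrary seed
n = 25; nrep = 10^4              # assumed sample size and number of replications

# Each replication returns the four estimates computed on one sample
est = replicate(nrep, {
  x = rnorm(n, 5, sqrt(3))
  c(average = mean(x), first = x[1], median = median(x),
    midrange = (min(x) + max(x)) / 2)
})

rowMeans(est)                    # all four are close to E[X] = 5
apply(est, 1, var)               # but their sampling variances differ
```

All four estimators are centered on the truth, so unbiasedness alone does not tell us which one to use.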

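9 Estimating the Variance of X

The estimator of the variance:

$$\widehat{\mathrm{Var}}(X) = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2$$

Why divide by $n - 1$ and not $n$?

$$
\begin{aligned}
E\left[\frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2\right]
&= \frac{1}{n} \sum_{i=1}^{n} E[(X_i - \bar{X}_n)^2] \\
&= \frac{1}{n} \sum_{i=1}^{n} E[X_i^2 - 2 X_i \bar{X}_n + \bar{X}_n^2] \\
&= E[X_i^2] - 2 E[X_i \bar{X}_n] + E[\bar{X}_n^2] \\
&= \frac{n-1}{n} \left(E[X_i^2] - E[X_i]^2\right) = \frac{n-1}{n} \mathrm{Var}(X_i)
\end{aligned}
$$

where the last line follows since

$$E[\bar{X}_n^2] = E[X_i \bar{X}_n] = \frac{1}{n} E[X_i^2] + \frac{n-1}{n} E[X_i]^2$$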

10 Variance Estimation

We can verify this through numerical simulation:

    n = 20; nrep = 10^5
    varhat1 = rep(0, nrep); varhat2 = rep(0, nrep)
    for(i in 1:nrep) {
      x = rnorm(n, 5, sqrt(3))      # true variance is 3
      sx = sum((x - mean(x))^2)     # sum of squared deviations
      varhat1[i] = sx / (n - 1); varhat2[i] = sx / n
    }
    mean(varhat1)   # dividing by n - 1: close to 3
    mean(varhat2)   # dividing by n: close to 3 * (n - 1) / n = 2.85

11 How to choose between two unbiased estimators?

Since both are centered around the truth: pick the one that tends to be closest!

One measure of "close" is $\mathrm{Var}(\hat{X})$, the sampling variance of $\hat{X}$.

    # nrep as defined on p.10
    x1 = rep(0, nrep); x2 = rep(0, nrep)
    for(i in 1:nrep) {
      x = rnorm(100, 0, 1)
      x1[i] = mean(x); x2[i] = (min(x) + max(x)) / 2
    }
    var(x1)   # sampling variance of the sample average
    var(x2)   # sampling variance of the midrange

12 How to choose between two unbiased estimators?

Since both are centered around the truth: pick the one that tends to be closest!

One measure of "close" is $\mathrm{Var}(\hat{X})$, the sampling variance of $\hat{X}$.

    # nrep as defined on p.10
    y1 = rep(0, nrep); y2 = rep(0, nrep)
    for(i in 1:nrep) {
      x = runif(100, 0, 1)
      y1[i] = mean(x); y2[i] = (min(x) + max(x)) / 2
    }
    var(y1)   # sampling variance of the sample average
    var(y2)   # sampling variance of the midrange

13 How to choose between two unbiased estimators?

[Figure: Normal(0,1) distribution; horizontal axis x1, vertical axis Density]

14 How to choose between two unbiased estimators?

[Figure: Uniform[0,1] distribution; horizontal axis y1, vertical axis Density]

15 How to choose between two unbiased estimators?

The sampling distribution of our estimator depends on the underlying distribution of $X_i$ in the population!

$X_i \sim$ Normal: the sample average outperforms the midrange
$X_i \sim$ Uniform: the midrange outperforms the sample average

However, the sample average is an attractive default because it often

1. has a sampling distribution that is well understood
2. is more efficient (has a smaller sampling variance) than alternative estimators

We will say more about this in the context of the WLLN and the CLT.

16 The Standard Error

Above we compared the average and the midrange estimators using the sampling variance

$$\mathrm{Var}(\hat{X}) = E[(\hat{X} - E[\hat{X}])^2] = E[\hat{X}^2] - E[\hat{X}]^2$$

It is, however, common to use the square root of the sampling variance of our estimators. This is called the standard error:

$$\text{Standard Error of } \hat{X} = \sqrt{\mathrm{Var}(\hat{X})}$$

17 The Standard Error of the Sample Proportion

Consider a Bernoulli random variable $X$ where

$$X = \begin{cases} 1 & \text{with probability } p \\ 0 & \text{with probability } 1 - p \end{cases}$$

The sample proportion is

$$\bar{X}_n = \frac{1}{n} \sum_i X_i$$

with variance

$$\mathrm{Var}(\bar{X}_n) = \frac{1}{n^2} \sum_i \mathrm{Var}(X_i) = \frac{n \mathrm{Var}(X)}{n^2} = \frac{p(1-p)}{n}$$

but this depends on $p$, which is unknown!

We have an unbiased estimator of $p$, namely $\bar{X}_n$, and we can therefore estimate the variance as follows:

$$\widehat{\mathrm{Var}}(\bar{X}_n) = \bar{X}_n (1 - \bar{X}_n) / n$$

18 The Standard Error of the Sample Mean

When the distribution of $X$ is unknown but the $X_i$ are i.i.d., we can also derive the variance of the sample mean more generally:

$$\mathrm{Var}(\bar{X}_n) = \frac{1}{n^2} \sum_i \mathrm{Var}(X_i) = \frac{\mathrm{Var}(X)}{n}$$

This again depends on an unknown parameter, $\mathrm{Var}(X)$, but we also have an estimator of it,

$$\widehat{\mathrm{Var}}(X) = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2$$

so that

$$\widehat{\mathrm{Var}}(\bar{X}_n) = \widehat{\mathrm{Var}}(X) / n$$

and we get the standard error by taking the square root.

19 Calculating Standard Errors

    phat = mean(rbinom(100, 1, .54))
    sqrt(phat * (1 - phat) / 100)        # estimate
    sqrt(.54 * (1 - .54) / 100)          # theoretical s.e., about 0.0498

    sqrt(var(rnorm(100, 1, 2)) / 100)    # estimate
    sqrt(2^2 / 100)                      # theoretical s.e., 0.2
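The standard error is the standard deviation of the estimator's sampling distribution, which can be checked by simulating that distribution directly. A sketch, assuming the Bernoulli(0.54) example with n = 100; the seed and the names `phats`, `sd_sim` and `se_theory` are illustrative.

```r
set.seed(3)                                      # arbitrary seed
n = 100; p = 0.54; nrep = 10^4

# Sampling distribution of the sample proportion across many random samples
phats = replicate(nrep, mean(rbinom(n, 1, p)))

sd_sim = sd(phats)                               # simulated standard deviation of phat
se_theory = sqrt(p * (1 - p) / n)                # theoretical standard error
c(simulated = sd_sim, theoretical = se_theory)
```

The two numbers should agree up to simulation noise, which shrinks as `nrep` increases.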

20 Bias vs Variance

Suppose we have

1. an unbiased estimator with a large sampling variance
2. a biased estimator with a small sampling variance

Should we choose our best estimator based on bias, or on variance?

21 Bias vs Variance

[Figure: four panels labelled Low Variance / High Variance and High Bias / Low Bias]

22-23 Bias vs Variance

[Figures: densities of xhat with E[X]=0; horizontal axis xhat, vertical axis Density]

24 Mean Squared Error

We may need to choose between two estimators, one of which is unbiased.

Consider the biased estimator: is the sampling variance (or the standard error) still a good measure?

$$
\begin{aligned}
\mathrm{Var}(\hat{X}) &= E[(\hat{X} - E[\hat{X}])^2] \\
&= E[(\hat{X} - (E[\hat{X}] - E[X]) - E[X])^2] \\
&= E[(\hat{X} - E[X] - \mathrm{Bias})^2]
\end{aligned}
$$

Suppose $\mathrm{Var}(\hat{X}_{\text{biased}}) < \mathrm{Var}(\hat{X}_{\text{unbiased}})$: what would you conclude?

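25 Mean Squared Error

We are interested in the spread relative to the truth! This is called the Mean Squared Error (MSE):

$$\mathrm{MSE} = E[(\hat{X} - E[X])^2]$$

We can show that

$$
\begin{aligned}
\mathrm{MSE} &= E[(\hat{X} - E[X])^2] \\
&= E[(\hat{X} - E[\hat{X}] + E[\hat{X}] - E[X])^2] \\
&= \underbrace{E[(\hat{X} - E[\hat{X}])^2]}_{\mathrm{Var}(\hat{X})} + \underbrace{(E[\hat{X}] - E[X])^2}_{\mathrm{Bias}^2}
\end{aligned}
$$

The cross term $2 (E[\hat{X}] - E[X]) \, E[\hat{X} - E[\hat{X}]]$ drops out because $E[\hat{X} - E[\hat{X}]] = 0$.

There is a potential trade-off between Bias and Variance.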

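26 Mean Squared Error

Consider again the following two estimators of the variance:

1. $\widehat{\mathrm{Var}}(X) = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2$
2. $\widehat{\mathrm{Var}}(X) = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2$

We saw that 1. is unbiased while 2. is not. How about the MSE? Consider the example on p.10:

[Table: columns bias2, var, mse; one row per estimator (vhat); numeric entries missing]

Here $X \sim N(5, 3)$ and $n = 20$; try $X \sim \chi^2(1)$ and vary $n$.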
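The comparison can be reproduced by extending the simulation from p.10. This is a sketch, not the code behind the original table: the helper `summarize`, the row labels `vhat1`/`vhat2` and the seed are assumptions introduced here for illustration.

```r
set.seed(4)                                   # arbitrary seed
n = 20; nrep = 10^5; sigma2 = 3               # X ~ N(5, 3) as on p.10
varhat1 = rep(0, nrep); varhat2 = rep(0, nrep)
for(i in 1:nrep) {
  x = rnorm(n, 5, sqrt(sigma2))
  sx = sum((x - mean(x))^2)
  varhat1[i] = sx / (n - 1); varhat2[i] = sx / n
}

# Hypothetical helper: squared bias, sampling variance and MSE of simulated estimates
summarize = function(est, truth) {
  c(bias2 = (mean(est) - truth)^2,
    var   = mean((est - mean(est))^2),        # divides by nrep so that mse = var + bias2 exactly
    mse   = mean((est - truth)^2))
}

res = rbind(vhat1 = summarize(varhat1, sigma2),
            vhat2 = summarize(varhat2, sigma2))
res
```

Under normality the theoretical values are $2\sigma^4/(n-1) \approx 0.947$ for the MSE of the unbiased estimator and $(2n-1)\sigma^4/n^2 \approx 0.878$ for the estimator that divides by $n$, so in this setting the biased estimator has the smaller MSE. Replacing `rnorm(n, 5, sqrt(sigma2))` with `rchisq(n, 1)` and setting `sigma2 = 2` gives the $\chi^2(1)$ variant suggested above.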

27 Consistency

We mentioned unbiasedness as an attractive property of an estimator. But unbiasedness is

a finite sample property
silent on how close the estimate is to the truth

and a nonlinear function of an unbiased estimator is typically not unbiased.

We will now consider consistency, which is a large sample property:

consistent estimators converge to the truth as sample sizes grow large
a nonlinear function of a consistent estimator is typically consistent

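28 Consistency

Consistency: let $\hat{\theta}_n$ be an estimator of $\theta$ based on a sample of size $n$. We call $\hat{\theta}_n$ consistent if it gets closer and closer to $\theta$ as data accumulates, and write:

$$\hat{\theta}_n \xrightarrow{p} \theta$$

The precise definition is:

$$\lim_{n \to \infty} \Pr(|\hat{\theta}_n - \theta| > \epsilon) = 0 \quad \forall \epsilon > 0$$

Weak law of large numbers: if $X_i$ are i.i.d. random variables with $E[|X_i|] < \infty$, then

$$\frac{1}{n} \sum_i X_i \xrightarrow{p} E[X_i]$$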

29 Consistency

Consider sampling from a population of voters where

$$X_i = \begin{cases} 1 & \text{if person } i \text{ supports the right} \\ 0 & \text{if person } i \text{ supports the left} \end{cases}$$

and $\Pr(X_i = 1) = 0.54$.

Denote our data by $x_1, \ldots, x_n$.

We estimate $p$ by $\hat{p} = (x_1 + \ldots + x_n)/n$.
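The behaviour of the estimate as the sample grows can be traced with a running proportion. A minimal sketch, assuming one simulated sequence of 10,000 voters; the seed and the names `N` and `phat` are illustrative.

```r
set.seed(5)                      # arbitrary seed
p = 0.54; N = 10^4               # true share and largest sample size considered

x = rbinom(N, 1, p)              # 1 = supports the right, 0 = supports the left
phat = cumsum(x) / (1:N)         # phat(n) for n = 1, ..., N

phat[c(10, 100, 1000, 10000)]    # estimates at increasing sample sizes
```

Plotting `phat` against `1:N` (for example with `plot(1:N, phat, type = "l")`) shows a path that fluctuates for small n and settles near 0.54.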

30-32 Consistency

[Figures: phat(n) plotted against n]

33 Biased and Consistent

Consider $U \sim \text{Uniform}[0, \theta]$. Then

$$\hat{\theta} = \max(u_1, \ldots, u_n)$$

is a biased estimator since

$$E[\hat{\theta}] = \frac{n}{n+1} \theta$$

but it is consistent.
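Both properties of the sample maximum show up in a simulation. This is a sketch under assumed values: `theta = 2`, the sample sizes in `sizes` and the seed are arbitrary illustrative choices.

```r
set.seed(6)                           # arbitrary seed
theta = 2; nrep = 10^4                # assumed true upper bound
sizes = c(5, 50, 500)                 # assumed sample sizes

# Average of the sample maximum across nrep samples, for each sample size
mean_thetahat = sapply(sizes, function(n)
  mean(replicate(nrep, max(runif(n, 0, theta)))))

rbind(n = sizes,
      simulated = mean_thetahat,
      theory = sizes / (sizes + 1) * theta)
```

The simulated averages stay below `theta` for every n (bias), track the formula for the expectation of the maximum, and move towards `theta` as n grows (consistency).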

34 Biased and Consistent

[Figure: phat(n) plotted against n]

35 Biased vs Consistent

Estimates of the mean:

unbiased and consistent: $\bar{X}$
unbiased and inconsistent: $X_1$
biased and consistent: $\bar{X} + 1/n$
biased and inconsistent: can you think of one?

36 Consistent Estimators

Finding unbiased estimators is not so easy because even if $E[\hat{\theta}] = \theta$, in general

$$E[g(\hat{\theta})] \neq g(\theta)$$

for nonlinear $g$. For example, if we know that $E[\hat{\theta}] = \sigma^2$ then

$$E[\sqrt{\hat{\theta}}] \neq \sigma$$

Finding consistent estimators is much easier because of the WLLN and because functions and combinations of consistent estimators are often again consistent.
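The square-root example can be checked numerically. A sketch, assuming a small sample (n = 5) from N(5, 3) so that the effect is visible; the seed and the name `s2` are illustrative.

```r
set.seed(7)                                       # arbitrary seed
n = 5; nrep = 10^5; sigma = sqrt(3)

# Unbiased estimates of sigma^2 (var() divides by n - 1)
s2 = replicate(nrep, var(rnorm(n, 5, sigma)))

mean(s2)          # close to sigma^2 = 3
mean(sqrt(s2))    # below sigma: the square root of an unbiased estimator is biased
sigma
```

The gap between `mean(sqrt(s2))` and `sigma` shrinks as n increases, which is consistent with the square root being a consistent estimator of $\sigma$ even though it is biased.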

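37 Consistent Estimators

Continuous Mapping Theorem (CMT): if $g(\cdot)$ is a continuous function and $\hat{\theta}$ a consistent estimator of $\theta$, then

$$g(\hat{\theta}) \xrightarrow{p} g(\theta)$$

This means that if $\hat{\theta} \xrightarrow{p} \sigma^2$ then $\sqrt{\hat{\theta}} \xrightarrow{p} \sigma$.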

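38 Consistent Estimators

Suppose you want a consistent estimator of the variance of $X$:

$$\mathrm{Var}(X) = E[X^2] - E[X]^2$$

By the WLLN you know that

$$\frac{1}{n} \sum_i X_i \xrightarrow{p} E[X], \quad \text{and} \quad \frac{1}{n} \sum_i X_i^2 \xrightarrow{p} E[X^2]$$

and by the CMT

$$\left(\frac{1}{n} \sum_i X_i\right)^2 \xrightarrow{p} E[X]^2$$

and therefore that

$$\frac{1}{n} \sum_i X_i^2 - \left(\frac{1}{n} \sum_i X_i\right)^2 \xrightarrow{p} \mathrm{Var}(X)$$

This is an application of the Method of Moments.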
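The estimator built from the first two sample moments is short to write down. A sketch, assuming data from N(5, 3); the function name `mom_var`, the sample sizes and the seed are illustrative.

```r
set.seed(8)                                       # arbitrary seed

# Sample analogue of E[X^2] - E[X]^2
mom_var = function(x) mean(x^2) - mean(x)^2

sizes = c(10, 1000, 100000)                       # assumed sample sizes
estimates = sapply(sizes, function(n) mom_var(rnorm(n, 5, sqrt(3))))

rbind(n = sizes, estimate = estimates)            # approaches Var(X) = 3 as n grows
```

Algebraically this equals the estimator that divides the sum of squared deviations by n, so it is biased in small samples but consistent.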

39 Summary

With point estimation the objective is to estimate (compute a "best guess" of) a population parameter $\theta$ using our data.

Parameters are things like: means, percentiles, minima, maxima, differences in means between groups, etc.

Estimates differ across samples, and estimators are therefore random variables.

Estimators have a distribution.

40 Summary

To characterize an estimator we focussed on two key properties of its sampling distribution:

1. location (unbiasedness, consistency)
2. spread (variance, MSE)

Unbiasedness, $E[\hat{\theta}] = \theta$, means that the expectation of our estimator equals the population parameter it intends to estimate.

The expectation here is across infinitely many random samples, and unbiasedness means that we are correct on average.

Unbiasedness is a finite sample property because it is true for samples of any size.

41 Summary

While being on target on average (location) is important, we never have this average estimate but a single one.

We would therefore prefer to be close to the target in a given sample. This is more likely to happen if the spread of our estimator is small.

A natural measure of spread is the variance:

$$\mathrm{Var}(\hat{\theta}) = E[(\hat{\theta} - E[\hat{\theta}])^2]$$

But for a biased estimator it measures the spread around the wrong location, since then $E[\hat{\theta}] = \theta + \mathrm{Bias}$.

42 Summary

This is why we turned to the Mean Squared Error (MSE)

$$\mathrm{MSE}(\hat{\theta}) = E[(\hat{\theta} - \theta)^2]$$

which measures the spread of the estimator $\hat{\theta}$ around the true parameter value $\theta$.

We saw that

$$\mathrm{MSE} = \mathrm{Variance} + \mathrm{Bias}^2$$

and that a trade-off between bias and variance can make us prefer a biased estimator over an unbiased one.

43 Summary

We often use consistent estimators because unbiased estimators are difficult to find or may not exist.

Consistent estimators can be biased in small samples, but converge to the population parameter as more data become available:

$$\hat{\theta} \xrightarrow{p} \theta$$

The Weak Law of Large Numbers says that with random sampling, sample averages are consistent estimators of the corresponding population averages.

We can often combine consistent estimators to construct new consistent estimators.

Consistency is a large sample property.
