Chapter 5. Statistical Inference for Parametric Models
Outline Overview Parameter estimation Method of moments How good are method of moments estimates? Interval estimation
Statistical Inference for Parametric Models Parametric statistical inference refers to the process of estimating the parameters of a chosen distributional model from a sample; quantifying how accurate these estimates are; and dealing formally with the uncertainty that exists in the data.
Example: Diseased trees. Recall: the variable of interest is the run length of diseased trees; we have assumed a Geometric(θ) distribution can be used to model this variable; the parameter θ is the probability of a tree being diseased, so θ ∈ Θ = [0, 1]. Previously we experimented graphically with different values of θ:
Example: Diseased trees. Figure: p.m.f. of the Geometric(θ) model for two different values of θ, plotted against run lengths 0–5.
Estimation: Using methods to be introduced in Section 5.2, we find a best guess of the parameter value: θ̂ = 0.324 for the partial data and θ̂ = 0.343 for the full data. The estimate hasn't changed very much with the addition of more data. The benefit of the extra data is the increased reliability of the best estimate for the full data set.
Estimation: Quantifying reliability by reflecting uncertainty: we construct a set of values of θ ∈ Θ which are the most plausible given the observed data; specifically, we estimate an interval of values for θ; if we have more data, we have more information about θ and therefore the interval is tighter; we must also decide how confident we want to be that θ lies in the interval.
Confidence intervals for θ: a 50% interval has an even chance of containing the true value of θ: (0.29, 0.35) for the partial data and (0.33, 0.36) for the full data. A 95% interval has a good chance of containing the true value of θ: (0.23, 0.43) for the partial data and (0.29, 0.41) for the full data. Q. What are the lengths of these intervals?
50% confidence interval: 0.06 for the partial data and 0.03 for the full data. 95% confidence interval: 0.2 for the partial data and 0.12 for the full data.
Prediction under the fitted model: We can now estimate the p.m.f. by replacing the unknown parameter by its estimate: p(x; θ̂) = θ̂ˣ(1 − θ̂) for x = 0, 1, .... Q. This can now be used to estimate the probability of longer runs of diseased trees than we saw in the data. Assuming θ̂ = 0.343, what is the probability of a run of 6 or more diseased trees? For a Geometric(θ) random variable X and a positive integer x, from MATH 104, P(X ≥ x; θ) = θˣ, so the estimated probability of a run of 6 or more diseased trees is P(X ≥ 6; θ̂) = θ̂⁶ = (0.343)⁶ ≈ 0.00163. This would not be possible without a model and an inference method for the model.
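The tail calculation above can be sketched in a few lines of Python; this is a minimal illustration, and the function name `geom_tail` is ours, not part of any library.

```python
# Estimated probability of a run of 6 or more diseased trees under the
# fitted Geometric model, using P(X >= x; theta) = theta^x.
theta_hat = 0.343  # method of moments estimate from the full data

def geom_tail(theta, x):
    """P(X >= x) for a Geometric(theta) run-length variable, x = 0, 1, ..."""
    return theta ** x

p6 = geom_tail(theta_hat, 6)
print(round(p6, 6))  # about 0.00163
```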
Assessment of the fitted model: Figure: counts of run lengths 0–5 for assessing the fitted model.
Assessment of the fitted model: Conclusions: the estimated parameter is a reasonable value for the data; the underlying Geometric model seems to be a good choice. So we learn not only about the probability of disease but also about the random mechanism by which disease spreads.
Statistical Inference for Parametric Models We have now completed the final stages of the statistical analysis of the trees data. Recall the whole procedure: 1. selection of a parametric model, Geometric(θ); 2. estimation of the unknown model parameters; 3. assessment of the validity of the model choice; 4. use of the model to predict aspects of the variable of interest. These stages are followed in all statistical inference: here the model fitted well; if the model doesn't fit then we need to cycle round with a new model choice; improvements to the model are prompted by observed weaknesses and strengths of earlier analyses.
Parameter estimation In Chapter 4, while looking at various parametric models, we learned that, within the same parametric family, the description of probabilities can change dramatically with the choice of parameters. We now introduce a systematic approach to choosing parameters using data.
Method of moments In a smoking ban survey, if 75 out of 100 surveyed agree with the ban, a reasonable estimate for θ, the population proportion who agree with the law, would be 0.75, the sample proportion. This simple idea of using a sample quantity in place of a population quantity is the basis of the method of moments, as well as of the summary statistics used in exploratory data analysis. A more elaborate approach using likelihood will be discussed in MATH 235.
Sample mean Often the population mean itself is of primary interest, and the sample mean is one of the most popular summary statistics. Theorem Let X₁, …, Xₙ be identically distributed random variables with expectation µ = E[X]; then E[X̄] = µ. This theorem suggests that taking an empirical average might be a good idea when only a finite sample is available.
First we look at the case where n = 2. Then we have E[X̄] = E[(X₁ + X₂)/2] = E[X₁ + X₂]/2 = (E[X₁] + E[X₂])/2 = (µ + µ)/2 = µ. We can generalise to any n: E[X̄] = E[(X₁ + ⋯ + Xₙ)/n] = E[X₁ + ⋯ + Xₙ]/n = (E[X₁] + ⋯ + E[Xₙ])/n = (µ + ⋯ + µ)/n = nµ/n = µ. Note that X̄ is a random variable and the expectation takes into account all possible values of x̄ that can arise in any particular observations.
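The property E[X̄] = µ can be checked empirically: average many sample means and compare with the population mean. A minimal sketch, with Exponential data and µ = 2 as an arbitrary illustrative choice:

```python
import random

# Empirical check that E[X-bar] = mu: average 20000 sample means of
# n = 10 i.i.d. Exponential draws with population mean mu = 2.
random.seed(1)
mu = 2.0
n, n_reps = 10, 20000
mean_of_means = sum(
    sum(random.expovariate(1 / mu) for _ in range(n)) / n
    for _ in range(n_reps)
) / n_reps
print(mean_of_means)  # close to mu = 2
```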
Exercise 5.2.1 Poisson. Suppose X₁, X₂ are i.i.d. Poisson(θ) random variables and X̄ = (X₁ + X₂)/2. What is E[X̄]? E[X̄] = θ.
Sample proportion Exercise 5.2.3 Bernoulli. Bernoulli random variables take values in {0, 1}. For example, 5 responses from a survey on the smoking ban might look like: 0, 1, 1, 0, 1 (0 for disagree and 1 for agree). If we take the average, 3/5 is the proportion of responses that agree with the law. If we take another sample of 5, the proportion may change. In general, when the random variables X₁, …, Xₙ only take the values 0 and 1, X̄ is called the sample proportion.
Sample proportion and Binomial distribution Recall that each Xᵢ is called a Bernoulli trial. So if they are i.i.d. (what does that mean?), then the sample proportion is indeed the sample mean of Bernoulli random variables. Moreover, we know that the sum of the random variables follows a Binomial distribution: Y = Σᵢ₌₁ⁿ Xᵢ ~ Binomial(n, θ), with µ = nθ and σ² = nθ(1 − θ). In this case, the sample proportion is simply Y/n, where Y ~ Binomial(n, θ).
In this case, the sample proportion is simply Y/n, where Y ~ Binomial(n, θ). In particular, E[Σᵢ₌₁ⁿ Xᵢ] = E[Y] = nθ, so E[X̄] = E[Y/n] = θ; and Var[Σᵢ₌₁ⁿ Xᵢ] = Var[Y] = nθ(1 − θ), so Var[X̄] = Var[Y/n] = Var[Y]/n² = θ(1 − θ)/n.
Exercise 5.2.4 Binomial. Give an interpretation of Y and Y /n for the survey example. This will be used later to quantify sampling variation of the sample proportion.
Y represents the total number of responses that agree with the law, and Y/n represents the proportion of responses that agree with the law. This will be used later to quantify sampling variation of the sample proportion.
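The identity "sample proportion = sample mean of 0/1 values = Y/n" can be illustrated directly; this sketch simulates a hypothetical survey with θ = 0.75 and n = 100, values chosen only for illustration.

```python
import random

# Sample proportion for i.i.d. Bernoulli(theta) responses: the mean of
# the 0/1 values equals Y/n, where Y is the number of 1s (agreements).
random.seed(2)
theta, n = 0.75, 100
responses = [1 if random.random() < theta else 0 for _ in range(n)]
y = sum(responses)   # Y ~ Binomial(n, theta)
prop = y / n         # sample proportion = sample mean of the 0/1 values
print(y, prop)
```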
The method of moments For a random variable X: E[X] is the first moment; E[X²] is the second moment; E[Xᵏ] is the kth moment, k = 1, 2, …. The sample moments are then calculated by µ̂ₖ = (1/n) Σᵢ₌₁ⁿ Xᵢᵏ (with random variables: the estimator) and µ̂ₖ = (1/n) Σᵢ₌₁ⁿ xᵢᵏ (with observations: the estimate). If the unknown parameter θ is expressed through the population moments, or some function of them, it can be estimated by replacing the population quantities by the corresponding sample quantities.
Exercise 5.2.5 Poisson distribution. The first moment of the Poisson distribution is the parameter µ = E[X]. The first sample moment is X̄ = (1/n) Σᵢ₌₁ⁿ Xᵢ, which is the method of moments estimator of µ. Since θ = µ, this is the estimator of the rate parameter θ.
Exercise 5.2.6 Exponential distribution. The first moment of the Exponential distribution is µ = E[X] and the method of moments estimator of µ is X̄. Since θ = 1/E[X], θ can be estimated by θ̂ = 1/X̄.
Exercise 5.2.7 Normal distribution. The first and second moments of the normal distribution are µ₁ = E[X] = µ and µ₂ = E[X²] = µ² + σ². Thus µ = µ₁ and σ² = µ₂ − µ₁². What are the method of moments estimators of these parameters?
µ̂ = X̄ and σ̂² = (1/n) Σᵢ₌₁ⁿ Xᵢ² − X̄² = (1/n) Σᵢ₌₁ⁿ (Xᵢ − X̄)².
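The Normal method of moments estimators can be sketched as follows; the function `mom_normal` and the choice µ = 5, σ = 2 for the simulated data are illustrative assumptions, not part of the notes.

```python
import random

# Method of moments for Normal(mu, sigma^2): mu_hat is the first sample
# moment, sigma2_hat is the second sample moment minus mu_hat squared.
def mom_normal(xs):
    n = len(xs)
    m1 = sum(xs) / n
    m2 = sum(x * x for x in xs) / n
    return m1, m2 - m1 ** 2

random.seed(3)
sample = [random.gauss(5.0, 2.0) for _ in range(5000)]
mu_hat, sigma2_hat = mom_normal(sample)
print(mu_hat, sigma2_hat)  # near mu = 5 and sigma^2 = 4
```

Note that m2 − m1² equals the mean squared deviation (1/n) Σ (xᵢ − x̄)², matching the second form of σ̂² on the slide.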
Sampling distribution of sample mean: symmetric case Figure: Sampling distribution of θ̂ with increasing sample size (n = 10, 20, 50, 100).
Figure: Sampling distribution of θ̂ with increasing sample size (n = 100, 1000, 5000, 10000).
Effect of sample size Exercise 5.2.8 Sample size. Summarise your findings about the estimates θ̂ and comment on the effect of sample size. The estimates, on average, agree with the true value of θ = (0, 0.3, 0.5, 0.7, 1); however, there is considerable variability depending on the sample size. In general, the larger the sample size, the (larger, smaller) the variability in the estimates and thus the (more accurate, less accurate) the estimates become.
Effect of sample size The estimates, on average, agree with the true value of θ = 0.5; however, there is considerable variability depending on the sample size. In general, the larger the sample size, the smaller the variability in the estimates and thus the more accurate the estimates become.
Sampling distribution of sample mean: asymmetric case The same phenomenon is expected if the results came from a survey instead of a coin-tossing experiment, except that the centre of the distribution will change accordingly. The only limitation with the survey is that we would not be able to run the same survey 1000 times to see the effect! Suppose that the population proportion agreeing with the law is θ = 0.8 and a sample of size n was taken, where n = 10, 20, 50, 100.
Figure: Sampling distribution of θ̂ with increasing sample size (n = 10, 20, 50, 100; θ = 0.8).
Exercise 5.2.9 What is the underlying distributional model used in Figure 3? X̄₁ = X₁/10, where X₁ ~ Binomial(10, 0.8); X̄₂ = X₂/20, where X₂ ~ Binomial(20, 0.8); X̄₃ = X₃/50, where X₃ ~ Binomial(50, 0.8); X̄₄ = X₄/100, where X₄ ~ Binomial(100, 0.8). How is the shape of the distribution of θ̂ affected by the sample size? The shape of the distribution of the estimates becomes more symmetric as the sample size increases. Explain why a larger sample size would be preferable in practice. Smaller variability and a less skewed distribution for the sample mean.
Figure: Sampling distribution of θ̂ with increasing sample size (n = 100, 1000, 5000, 10000; θ = 0.8).
Variability of sample mean A mathematical explanation of the behaviour of the sample mean estimates comes from the following property. Theorem If X₁, …, Xₙ are i.i.d. random variables with expectation E[Xᵢ] = µ and Var[Xᵢ] = σ², then Var[X̄] = σ²/n.
Consider the case where n = 2: Var[X̄] = Var[(X₁ + X₂)/2] = Var[X₁ + X₂]/2² = (σ² + σ²)/4 = σ²/2. We can generalise to any n: Var[X̄] = Var[(X₁ + ⋯ + Xₙ)/n] = Var[X₁ + ⋯ + Xₙ]/n² = (σ² + ⋯ + σ²)/n² = nσ²/n² = σ²/n.
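For a finite discrete distribution the identity Var[X̄] = σ²/n can even be checked exactly, with no simulation. A small sketch using a fair die (our choice of example; σ² = 35/12) and n = 2, enumerating all 36 equally likely outcomes:

```python
from itertools import product

# Exact check of Var[X-bar] = sigma^2 / n for a fair die and n = 2,
# by enumerating all 36 equally likely pairs of faces.
faces = range(1, 7)
mu = sum(faces) / 6
sigma2 = sum((x - mu) ** 2 for x in faces) / 6        # 35/12
means = [(a + b) / 2 for a, b in product(faces, repeat=2)]
m = sum(means) / len(means)
var_xbar = sum((x - m) ** 2 for x in means) / len(means)
print(sigma2, var_xbar)  # var_xbar equals sigma2 / 2
```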
How variable the estimates are If more and more samples were available, the estimates would converge to the true parameter because: the variance decreases (converges to zero) as the sample size increases, Var[X̄] = σ²/n → 0 as n → ∞; and E[X̄] = θ for all sample sizes, so there is no systematic bias. Therefore the estimate converges to the true parameter as the sample size increases. Note that these properties of the sample mean do not depend on any particular distributional model and thus are not limited to the parametric models that we are considering here.
Standard Error We have seen that it is important to take sampling variability into account in estimation. As a measure of precision, the standard error is defined as the square root of the variance of the estimator. Estimator: θ̂; StdError(θ̂) = √Var(θ̂).
If X₁, …, Xₙ are an i.i.d. sample from X with E[X] = µ and Var[X] = σ², then the method of moments estimator of µ is µ̂ = X̄ and the standard error is StdError(µ̂) = σ/√n. In practice, when σ is unknown, it is replaced by its estimate.
Exercise 5.2.11 Poisson distribution. If X ~ Poisson(θ), then we know µ = θ and σ² = θ. Thus the method of moments estimator of θ is θ̂ = X̄ and its standard error is StdError(θ̂) = √(θ̂/n).
Exercise 5.2.12 Binomial distribution. If X ~ Binomial(n, θ), then we know µ = nθ and σ² = nθ(1 − θ). Since θ = µ/n, the method of moments estimator of θ is θ̂ = X/n, the sample proportion, and the standard error of the estimator is
StdError(θ̂) = √(θ̂(1 − θ̂)/n).
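Applied to the smoking-ban survey from earlier (75 agreements out of 100), this formula gives a concrete standard error; the helper name `proportion_se` is ours.

```python
import math

# Standard error of the sample proportion: theta_hat = 75/100 = 0.75,
# StdError = sqrt(theta_hat * (1 - theta_hat) / n).
def proportion_se(successes, n):
    theta_hat = successes / n
    return theta_hat, math.sqrt(theta_hat * (1 - theta_hat) / n)

theta_hat, se = proportion_se(75, 100)
print(theta_hat, round(se, 4))  # 0.75 0.0433
```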
Hospitals example We consider the hospitals example. Hospital 2: 5 out of 10 operations classified as a success. What does this tell us about the probabilities of successful operations now? Two possible answers depending on assumptions: Independence: this assumption seems reasonable from the context of the problem. Identically distributed: it is not clear from the context whether the probability of success is the same at each hospital. We will look at what happens when we assume the successes at the two hospitals are identically distributed; NOT identically distributed.
We denote by X₁ the random variable "number of successful operations at the first hospital" and by X₂ the random variable "number of successful operations at the second hospital". We assume that X₁ and X₂ are independent, with Xᵢ ~ Binomial(10, θᵢ) for i = 1, 2. We observe the variable X = (X₁, X₂) with value x = (x₁, x₂) = (9, 5).
Non-identically distributed (i.e. θ = (θ₁, θ₂)): The method of moments estimates are θ̂₁ = 9/10 and θ̂₂ = 5/10, and the variances are Var[θ̂₁] = θ̂₁(1 − θ̂₁)/10 = 0.009 and Var[θ̂₂] = θ̂₂(1 − θ̂₂)/10 = 0.025. So the standard errors are StdError(θ̂₁) = √0.009 = 0.095 and StdError(θ̂₂) = √0.025 = 0.158. There is no correlation between the estimates. This is reasonable, as the data from each hospital tell us only about the probability of a successful operation for that hospital.
Identically distributed (i.e. θ₁ = θ₂ = θ): Then we can aggregate the information: the total number of trials is 10 + 10 = 20 and the number of successes is 9 + 5 = 14. So the method of moments estimate is θ̂ = (9 + 5)/(10 + 10) = 0.7 and the variance is Var(θ̂) = θ̂(1 − θ̂)/(2 × 10) = 0.0105. So the standard error is StdError(θ̂) = √0.0105 = 0.1025.
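The two hospital analyses above amount to a few arithmetic steps, which can be sketched side by side (variable names are ours):

```python
import math

# Hospitals example: separate estimates (theta_1, theta_2 distinct)
# versus the pooled estimate (theta_1 = theta_2 = theta).
x1, n1 = 9, 10   # hospital 1: successes, operations
x2, n2 = 5, 10   # hospital 2

# Non-identically distributed: estimate each hospital separately.
t1, t2 = x1 / n1, x2 / n2
se1 = math.sqrt(t1 * (1 - t1) / n1)
se2 = math.sqrt(t2 * (1 - t2) / n2)

# Identically distributed: pool trials and successes.
t_pool = (x1 + x2) / (n1 + n2)
se_pool = math.sqrt(t_pool * (1 - t_pool) / (n1 + n2))

print(t1, t2, round(se1, 3), round(se2, 3))  # 0.9 0.5 0.095 0.158
print(t_pool, round(se_pool, 4))             # 0.7 0.1025
```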
Standard error of functions of sample mean We have seen that X̄ plays an important role in estimating the population mean, and we were able to quantify the variability of the sample mean. In other cases, such as the Exponential distribution, the estimator of θ is 1/X̄; but what is Var[1/X̄]? Certainly Var[2/(X₁ + X₂)] ≠ Var[2/X₁] + Var[2/X₂].
Standard error of functions of sample mean Taylor approximation: if g is differentiable, then g(X̄) ≈ g(µ) + g′(µ)(X̄ − µ). Write Var[1/X̄] = Var[g(X̄)], where g(x) = 1/x. So we can compute the variance using the approximation: Var[g(X̄)] ≈ Var[g(µ) + g′(µ)(X̄ − µ)] = Var[g′(µ)(X̄ − µ)] = g′(µ)² Var[X̄ − µ] = g′(µ)² Var[X̄]. Verify each step! Note that g′ should be evaluated at µ, which is a function of θ.
Assume X₁, …, Xₙ are an i.i.d. sample from X with E[X] = µ and Var[X] = σ², and let X̄ = (1/n) Σᵢ₌₁ⁿ Xᵢ. The standard error of g(X̄) is StdError[g(X̄)] = |g′(µ)| σ/√n.
For the Exponential(θ), we have µ = 1/θ and σ² = 1/θ², so StdError(X̄) = σ/√n = 1/(θ√n). Now θ = 1/µ and θ̂ = 1/X̄: if g(x) = 1/x, then g′(x) = −1/x² and g′(µ) = −1/µ² = −θ², so the standard error is StdError(1/X̄) = |−θ²| × 1/(θ√n) = θ/√n. Since θ is unknown, we use StdError(1/X̄) = θ̂/√n.
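The delta-method answer θ/√n can be checked against simulation; this sketch uses θ = 2, n = 200 and 5000 replications, all arbitrary illustrative choices.

```python
import math
import random

# Delta-method standard error theta / sqrt(n) for theta_hat = 1/X-bar
# with Exponential(theta) data, compared with the simulated spread.
random.seed(5)
theta, n, reps = 2.0, 200, 5000
ests = []
for _ in range(reps):
    xs = [random.expovariate(theta) for _ in range(n)]
    ests.append(1 / (sum(xs) / n))   # theta_hat = 1 / sample mean
mean_est = sum(ests) / reps
sd_est = math.sqrt(sum((e - mean_est) ** 2 for e in ests) / reps)
print(sd_est, theta / math.sqrt(n))  # both near 0.141
```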
Diseased Trees example Consider the diseased trees example: for the partial and full data, n = 50 and n = 109 respectively. First let us use general notation for the observations x₁, …, xₙ so that we can do the mathematics once for both cases. Here we are assuming the data come from i.i.d. random variables with p.m.f. p(x; θ) = θˣ(1 − θ), so that θ = θ, Θ = [0, 1], µ = E[X] = θ/(1 − θ) and σ² = Var[X] = θ/(1 − θ)². The method of moments estimate of θ is θ̂ = x̄/(1 + x̄).
For the partial data, Σᵢ₌₁⁵⁰ xᵢ = 0×31 + 1×16 + 2×2 + 3×0 + 4×1 + 5×0 = 24, so θ̂ = (24/50)/(1 + 24/50) = 0.3243. For the full data, Σᵢ₌₁¹⁰⁹ xᵢ = 0×71 + 1×28 + 2×5 + 3×2 + 4×2 + 5×1 = 57, so θ̂ = (57/109)/(1 + 57/109) = 0.3434.
For the standard error of θ̂: if g(x) = x/(1 + x), then g′(x) = 1/(1 + x)², so g′(µ) = 1/(1 + µ)² = (1 − θ)², and σ = √θ/(1 − θ). So the standard error of θ̂ is StdError(θ̂) = |g′(µ)| σ/√n = (1 − θ̂)√(θ̂/n). For the partial data, StdError(θ̂) = (1 − 0.3243)√(0.3243/50) = 0.054. For the full data, StdError(θ̂) = (1 − 0.3434)√(0.3434/109) = 0.037.
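Both fits can be reproduced directly from the tabulated run-length counts; the function `fit_geometric` is our own wrapper around the slide's formulas θ̂ = x̄/(1 + x̄) and StdError(θ̂) = (1 − θ̂)√(θ̂/n).

```python
import math

# Diseased trees: method of moments estimate and delta-method standard
# error from the tabulated run-length counts.
def fit_geometric(counts):
    """counts[k] = number of observed runs of length k."""
    n = sum(counts.values())
    xbar = sum(k * c for k, c in counts.items()) / n
    theta_hat = xbar / (1 + xbar)
    se = (1 - theta_hat) * math.sqrt(theta_hat / n)
    return theta_hat, se

partial = {0: 31, 1: 16, 2: 2, 3: 0, 4: 1, 5: 0}   # n = 50
full = {0: 71, 1: 28, 2: 5, 3: 2, 4: 2, 5: 1}      # n = 109
print(fit_geometric(partial))  # about (0.3243, 0.054)
print(fit_geometric(full))     # about (0.3434, 0.037)
```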
Confidence region Instead of picking out one single value, we choose a set of parameter values which are consistent with the observed data. We term such a set a confidence region for θ. The estimator of the confidence region is a random region, which has probability 1 − α of containing the true value θ₀. If the parameter of interest is 1-dimensional, the region is called a confidence interval. Based on θ̂, how do we choose a confidence region to ensure the required probability?
Probability of an interval We first look at the case where we know the underlying distribution is Normal. Write z(p) for the upper-p quantile of the Normal(0, 1) distribution, i.e. P(Z > z(p)) = p. If X ~ Normal(µ, σ²), then Pr(µ − z(α/2)σ ≤ X ≤ µ + z(α/2)σ) = 1 − α. In other words, Pr(µ ∈ [X − z(α/2)σ, X + z(α/2)σ]) = 1 − α.
Probabilities of Normal distribution P(µ − σ < X < µ + σ) = 0.683; P(µ − 2σ < X < µ + 2σ) = 0.954; P(µ − 3σ < X < µ + 3σ) = 0.997. Figure: Illustration of coverage probability of the Normal(µ, σ²) distribution.
If X₁, …, Xₙ are i.i.d. random variables from the Normal(µ, σ²) distribution and X̄ = (1/n) Σᵢ₌₁ⁿ Xᵢ, then X̄ ~ Normal(µ, σ²/n). The proof of this result is given in MATH 230. Therefore, we can choose an interval that has the required probability for the estimator X̄: Pr(µ ∈ [X̄ − z(α/2) σ/√n, X̄ + z(α/2) σ/√n]) = 1 − α. Hence [X̄ − z(α/2) σ/√n, X̄ + z(α/2) σ/√n] is a 100(1 − α)% confidence interval for µ.
Central Limit Theorem If X₁, …, Xₙ are i.i.d. random variables from an unknown distribution with mean µ and variance σ², and X̄ = (1/n) Σᵢ₌₁ⁿ Xᵢ, then X̄ is approximately Normal(µ, σ²/n).
No matter what distribution the original data come from, the sample mean approximately follows a Normal distribution if you have a large enough sample. This is one of the most significant results in statistics. Again, the formal proof will be given in MATH 230. Here we use our intuition and the informal justification shown in the figures above. This result allows us to construct (approximate) confidence intervals as we did for the Normal data.
Confidence interval Generally, an approximate 100(1 − α)% confidence interval can be constructed from the standard error of an estimator. The standard error is the factor which determines the width of confidence intervals for θ. The approximate 100(1 − α)% confidence interval for θ is (θ̂ − z(α/2) StdError(θ̂), θ̂ + z(α/2) StdError(θ̂)), where z(p) is the upper-p quantile of the Normal(0, 1) distribution.
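The interval formula is one line of code; this sketch hard-codes z = 1.96 for a 95% interval (the quantile used in the slides), and the function name `wald_ci` is ours.

```python
# Approximate 100(1 - alpha)% confidence interval: theta_hat plus or
# minus z * StdError, with z = 1.96 for the 95% level.
def wald_ci(theta_hat, std_error, z=1.96):
    return theta_hat - z * std_error, theta_hat + z * std_error

lo, hi = wald_ci(0.9, 0.095)      # hospital 1, non-identical case
print(round(lo, 4), round(hi, 4))  # 0.7138 1.0862
```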
Exercise 5.2.13 Hospitals. Find the 95% confidence interval for θ₁ under the two assumptions considered earlier. Use z(0.025) = 1.96. Non-identically distributed: (θ̂₁ − 1.96 StdError(θ̂₁), θ̂₁ + 1.96 StdError(θ̂₁)) = (0.9 − 1.96 × 0.095, 0.9 + 1.96 × 0.095) = (0.7138, 1.0862). Identically distributed: (θ̂ − 1.96 StdError(θ̂), θ̂ + 1.96 StdError(θ̂)) = (0.7 − 1.96 × 0.1025, 0.7 + 1.96 × 0.1025) = (0.4992, 0.9008).
Interpretation of confidence interval Exercise 5.2.14 Suppose that all our MATH 105 students take a sample of the same size from UG for a smoking ban survey, and each student, based on his/her own sample, constructs a 95% confidence interval for the UG proportion of students who agree with the law. Would all the confidence intervals be the same? Would the length of all the confidence intervals be the same? Would your confidence interval contain the true value of the population proportion? Would exactly 95 out of 100 intervals contain the true value of the population proportion?
Interpretation of confidence interval Would all the confidence intervals be the same? Probably not. Would the length of all the confidence intervals be the same? No, because the standard error is also estimated. Would your confidence interval contain the true value of the population proportion? We do not know whether each confidence interval contains the true value; however, we can expect that approximately 95% of those intervals would contain the true value. Would exactly 95 out of 100 intervals contain the true value of the population proportion? No: it is possible that all intervals contain the true value by chance, or that more than 5% of intervals do not. Our intervals are only a sample of all possible confidence intervals!
Sampling distribution Figure: Illustration of a 95% confidence interval, comparing the interval θ̂ ± 1.96 × StdErr around the estimate with the interval θ ± 1.96 × StdErr around the true value.