Estimation after Model Selection
1 Estimation after Model Selection
Vanja M. Dukić, Department of Health Studies, University of Chicago
Edsel A. Peña*, Department of Statistics, University of South Carolina
ENAR 2003 Talk, March 31, 2003, Tampa Bay, FL
Research support from NSF
2 Motivating Situations
Suppose you have a random sample $X = (X_1, X_2, \ldots, X_n)$ (possibly censored) from an unknown distribution F which belongs to either the Weibull class or the gamma class. What is the best way to estimate F(t) or some other parameter of interest?
Suppose it is known that the unknown df F belongs to one of p models $M_1, M_2, \ldots, M_p$, which are possibly nested. What is the best way of estimating a parameter common to all of these models?
3 Intuitive Strategies
Strategy I: Utilize estimators developed under the larger model M, or implement a fully nonparametric approach.
Strategy II (Classical): [Step 1 (Model Selection):] Choose the most plausible model using the data, possibly via information measures. [Step 2 (Inference):] Use estimators from the chosen sub-model, with these estimators still using the same data X.
Strategy III (Bayesian): Determine adaptively (i.e., using X) the plausibility of each of the sub-models, and form a weighted combination of the sub-model estimators or tests. Also referred to as model averaging.
4 Relevance and Issues
What are the consequences of first selecting a sub-model and then performing inference, such as estimation or hypothesis testing, with these two steps utilizing the same sample data (i.e., double-dipping)?
Is it always better to do model averaging, that is, to adopt a Bayesian framework? Equivalently, under what circumstances is model averaging preferable to a classical two-step approach?
When the number of possible models increases, would it be better to simply utilize a wider, possibly nonparametric, model?
5 A Concrete Gaussian Model
Data: $X = (X_1, X_2, \ldots, X_n)$ IID $F \in M = \{ N(\mu, \sigma^2) : \mu \in \mathbb{R}, \sigma^2 > 0 \}$.
The uniformly minimum variance unbiased (UMVU) estimator of $\sigma^2$ is the sample variance
$\hat\sigma^2_{UMVU} = S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar X)^2.$
Decision-theoretic framework with loss function
$L_1(\hat\sigma^2, (\mu, \sigma^2)) = \left(\frac{\hat\sigma^2 - \sigma^2}{\sigma^2}\right)^2.$
6 Risk function: For the quadratic loss $L_1$,
$\mathrm{Risk}(\hat\sigma^2) = \mathrm{Var}\left(\frac{\hat\sigma^2}{\sigma^2}\right) + \left[\mathrm{Bias}\left(\frac{\hat\sigma^2}{\sigma^2}\right)\right]^2.$
$S^2$ is not the best: it is dominated by the ML and the minimum risk equivariant (MRE) estimators,
$\hat\sigma^2_{MLE} = \frac{1}{n}\sum_{i=1}^n (X_i - \bar X)^2, \qquad \hat\sigma^2_{MRE} = \left(\frac{n}{n+1}\right)\hat\sigma^2_{MLE}.$
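The dominance claim above is easy to check numerically. A minimal Monte Carlo sketch (our own code, not from the talk; the estimator and loss definitions follow the slides, all function names are ours) comparing the risks of $S^2$, the MLE, and the MRE under the $L_1$ loss:

```python
import math
import random

def risk(estimator, mu=0.0, sigma2=1.0, n=10, reps=20000, seed=1):
    """Monte Carlo risk under L1(s2, (mu, sigma2)) = ((s2 - sigma2)/sigma2)^2."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        x = [rng.gauss(mu, math.sqrt(sigma2)) for _ in range(n)]
        xbar = sum(x) / n
        ss = sum((xi - xbar) ** 2 for xi in x)  # sum of squared deviations
        total += ((estimator(ss, n) - sigma2) / sigma2) ** 2
    return total / reps

# The three competing estimators, each a function of the deviation sum and n.
umvu = lambda ss, n: ss / (n - 1)  # S^2
mle = lambda ss, n: ss / n
mre = lambda ss, n: ss / (n + 1)

n = 10
print(risk(umvu, n=n), 2 / (n - 1))  # simulated vs. theoretical 2/(n-1)
print(risk(mre, n=n), 2 / (n + 1))   # simulated vs. theoretical 2/(n+1)
```

With n = 10, the simulated risks land close to the theoretical values 2/9 and 2/11, and the MRE risk is the smallest of the three.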
7 Model $M_p$: Our Test Model
Suppose we do not know the exact value of $\mu$, but we do know it is one of p possible values. This leads to the model
$M_p = \{ N(\mu, \sigma^2) : \mu \in \{\mu_1, \ldots, \mu_p\}, \sigma^2 > 0 \},$
where $\mu_1, \mu_2, \ldots, \mu_p$ are known constants.
Under $M_p$, how should we estimate $\sigma^2$? What are the consequences of using the estimators developed under M? Can we exploit the structure of $M_p$ to obtain better estimators of $\sigma^2$?
8 Classical Estimators Under $M_p$
Sub-model MLEs and MREs:
$\hat\sigma^2_i = \frac{1}{n}\sum_{j=1}^n (X_j - \mu_i)^2; \qquad \hat\sigma^2_{MRE,i} = \frac{1}{n+2}\sum_{j=1}^n (X_j - \mu_i)^2.$
Model selector: $\hat M = \hat M(X)$ with
$\hat M = \arg\min_{1 \le i \le p} \hat\sigma^2_i = \arg\min_{1 \le i \le p} |\bar X - \mu_i|.$
$\hat M$ chooses the sub-model leading to the smallest estimate of $\sigma^2$, equivalently the sub-model whose mean is closest to the sample mean.
9 MLE of $\sigma^2$ under $M_p$ (a two-step adaptive estimator):
$\hat\sigma^2_{p,MLE} = \hat\sigma^2_{\hat M} = \sum_{i=1}^p I\{\hat M = i\}\,\hat\sigma^2_i.$
An alternative estimator: use the sub-model's MRE to obtain
$\hat\sigma^2_{p,MRE} = \hat\sigma^2_{MRE,\hat M} = \sum_{i=1}^p I\{\hat M = i\}\,\hat\sigma^2_{MRE,i}.$
Properties of adaptive estimators are not easily obtainable due to the interplay between the model selector $\hat M$ and the sub-model estimators.
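The two-step (select-then-estimate) procedure can be sketched in a few lines; this is our own illustration (function names are ours), with the selector and estimator definitions taken from slides 8-9:

```python
import random

def two_step_sigma2(x, mus, kind="mle"):
    """Two-step adaptive estimate of sigma^2 under M_p: first select the
    sub-model whose mean is closest to the sample mean (the selector M-hat),
    then apply that sub-model's MLE (divide by n) or MRE (divide by n + 2)."""
    n = len(x)
    xbar = sum(x) / n
    i_hat = min(range(len(mus)), key=lambda i: abs(xbar - mus[i]))
    ss = sum((xj - mus[i_hat]) ** 2 for xj in x)
    return ss / n if kind == "mle" else ss / (n + 2)

rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(200)]  # true mean 0, sigma^2 = 1
print(two_step_sigma2(x, mus=[-1.0, 0.0, 1.0]))            # p = 3, MLE version
print(two_step_sigma2(x, mus=[-1.0, 0.0, 1.0], kind="mre"))
```

Note that both the selection and the estimation use the same sample x: this is exactly the double-dipping whose consequences the talk analyzes.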
10 Bayes Estimators Under $M_p$
Joint prior for $(\mu, \sigma^2)$: independent priors.
Prior for $\mu$: Multinomial$(1, \theta)$.
Prior for $\sigma^2$: Inverted Gamma$(\kappa, \beta)$.
Posterior probabilities of the sub-models:
$\theta_i(x) = \frac{\theta_i \left( n\hat\sigma^2_i/2 + \beta \right)^{-(n/2+\kappa-1)}}{\sum_{j=1}^p \theta_j \left( n\hat\sigma^2_j/2 + \beta \right)^{-(n/2+\kappa-1)}}.$
11 Posterior density of $\sigma^2$:
$\pi(\sigma^2 \mid x) = C \sum_{i=1}^p \theta_i \left(\frac{1}{\sigma^2}\right)^{\kappa+n/2} \exp\left[-\frac{1}{\sigma^2}\left(n\hat\sigma^2_i/2 + \beta\right)\right].$
Bayes (weighted) estimator of $\sigma^2$:
$\hat\sigma^2_{p,Bayes}(X) = \sum_{i=1}^p \theta_i(X) \left\{ \left(\frac{n}{n+2(\kappa-2)}\right)\hat\sigma^2_i + \left(\frac{2(\kappa-2)}{n+2(\kappa-2)}\right)\left(\frac{\beta}{\kappa-2}\right) \right\}.$
Non-informative priors: uniform prior for the sub-models, $\theta_i = 1/p$, $i = 1, 2, \ldots, p$; $\beta \to 0$.
12 One particular limiting Bayes estimator is
$\hat\sigma^2_{p,LB1} = \sum_{i=1}^p \frac{(\hat\sigma^2_i)^{-n/2}}{\sum_{j=1}^p (\hat\sigma^2_j)^{-n/2}}\,\hat\sigma^2_i,$
an adaptively weighted estimator formed from the sub-model estimators. But, based on the simulation studies, a better one is formed from the sub-model MREs:
$\hat\sigma^2_{p,PLB1} = \left(\frac{n}{n+2}\right)\hat\sigma^2_{p,LB1}.$
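As a sketch (our own code and naming, not from the talk), the adaptively weighted estimator and its MRE-corrected version can be written as:

```python
def submodel_mles(x, mus):
    """Sub-model MLEs of sigma^2: (1/n) * sum (x_j - mu_i)^2 for each candidate mean."""
    n = len(x)
    return [sum((xj - mu) ** 2 for xj in x) / n for mu in mus]

def sigma2_lb1(sub_estimates, n):
    """Limiting Bayes estimator: weights proportional to (sigma_i^2)^(-n/2),
    so sub-models with smaller estimated variance get larger weight."""
    w = [s ** (-n / 2) for s in sub_estimates]
    tot = sum(w)
    return sum((wi / tot) * si for wi, si in zip(w, sub_estimates))

def sigma2_plb1(sub_estimates, n):
    """MRE-corrected version: (n / (n + 2)) * sigma2_lb1."""
    return n / (n + 2) * sigma2_lb1(sub_estimates, n)
```

Since the weights form a convex combination, $\hat\sigma^2_{p,LB1}$ always lies between the smallest and largest sub-model estimate, and as n grows the weight concentrates on the smallest one.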
13 Comparing the Estimators
$R(\hat\sigma^2_{UMVU}, (\mu, \sigma^2)) = \frac{2}{n-1}, \qquad R(\hat\sigma^2_{MRE}, (\mu, \sigma^2)) = \frac{2}{n+1}.$
Efficiency measure relative to $\hat\sigma^2_{UMVU}$:
$\mathrm{Eff}(\hat\sigma^2 : \hat\sigma^2_{UMVU}) = \frac{R(\hat\sigma^2_{UMVU}, (\mu, \sigma^2))}{R(\hat\sigma^2, (\mu, \sigma^2))}.$
$\mathrm{Eff}(\hat\sigma^2_{MRE} : \hat\sigma^2_{UMVU}) = \frac{n+1}{n-1} > 1.$
14 Properties of $M_p$-Based Estimators
Notation: Let $Z \sim N(0, 1)$ and, with $\mu_{i_0}$ the true mean, define
$\Delta = \frac{\mu - \mu_{i_0}\mathbf{1}}{\sigma}.$
Proposition: Under $M_p$,
$\frac{\hat\sigma^2_i}{\sigma^2} \stackrel{d}{=} \frac{1}{n}\left(W + V_i^2\right), \quad i = 1, 2, \ldots, p;$
with W and V independent, $W \sim \chi^2_{n-1}$, and $V = Z\mathbf{1} - \sqrt{n}\,\Delta \sim N_p(-\sqrt{n}\,\Delta,\, J = \mathbf{1}\mathbf{1}')$.
15 Notation: Given $\Delta$, let $\Delta_{(1)} < \Delta_{(2)} < \ldots < \Delta_{(p)}$ be the ordered values. $\Delta$ always has a zero component.
Theorem: Under $M_p$,
$\frac{\hat\sigma^2_{p,MLE}}{\sigma^2} \stackrel{d}{=} \frac{1}{n}\left\{ W + \sum_{i=1}^p I\{L(\Delta_{(i)}, \Delta) < Z \le U(\Delta_{(i)}, \Delta)\}\,(Z - \sqrt{n}\,\Delta_{(i)})^2 \right\},$
with
$L(\Delta_{(i)}, \Delta) = \frac{\sqrt{n}\,[\Delta_{(i)} + \Delta_{(i-1)}]}{2}; \qquad U(\Delta_{(i)}, \Delta) = \frac{\sqrt{n}\,[\Delta_{(i)} + \Delta_{(i+1)}]}{2}.$
16 Mean: $\mathrm{EpMLE}(\Delta) \equiv E\left[\hat\sigma^2_{p,MLE}/\sigma^2\right]$:
$\mathrm{EpMLE}(\Delta) = 1 - \frac{2}{\sqrt{n}} \sum_{i=1}^p \Delta_{(i)}\left[\phi(L(\Delta_{(i)}, \Delta)) - \phi(U(\Delta_{(i)}, \Delta))\right] + \sum_{i=1}^p \Delta^2_{(i)}\left[\Phi(U(\Delta_{(i)}, \Delta)) - \Phi(L(\Delta_{(i)}, \Delta))\right].$
Case of p = 2:
$\mathrm{EpMLE}(\Delta) = 1 - \left\{ \frac{2\Delta}{\sqrt{n}}\,\phi\!\left(\frac{\sqrt{n}\,\Delta}{2}\right) - \Delta^2\left[1 - \Phi\!\left(\frac{\sqrt{n}\,\Delta}{2}\right)\right] \right\}.$
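The p = 2 mean formula is easy to check by simulation. A sketch under our own conventions (true mean 0, second candidate mean $\Delta$, $\sigma^2 = 1$; all function names are ours):

```python
import math
import random

def phi(z):  # standard normal pdf
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

def Phi(z):  # standard normal cdf
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def epmle_p2(delta, n):
    """Closed-form E[sigma2_pMLE / sigma^2] for p = 2 (slide 16),
    with candidate means 0 (true) and delta, in sigma units."""
    a = math.sqrt(n) * delta / 2
    return 1 - (2 * delta / math.sqrt(n)) * phi(a) + delta ** 2 * (1 - Phi(a))

def epmle_p2_mc(delta, n, reps=40000, seed=7):
    """Monte Carlo check: select the closer mean, then average the two-step MLE."""
    rng = random.Random(seed)
    mus, total = [0.0, delta], 0.0
    for _ in range(reps):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        xbar = sum(x) / n
        mu_hat = min(mus, key=lambda m: abs(xbar - m))
        total += sum((xi - mu_hat) ** 2 for xi in x) / n
    return total / reps
```

For moderate $\Delta$ both routines return a value below 1, illustrating the negative bias noted on the next slide.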
17 [Figure: EpMLE plotted against $\sqrt{n}\,\Delta/2$.]
$\hat\sigma^2_{p,MLE}$ is negatively biased for $\sigma^2$ (even though each sub-model estimator is unbiased): an effect of double-dipping.
18 Variance: $\mathrm{VpMLE}(\Delta) \equiv \mathrm{Var}\left[\hat\sigma^2_{p,MLE}/\sigma^2\right]$:
$\mathrm{VpMLE}(\Delta) = \frac{2}{n}\left(1 - \frac{1}{n}\right) + \frac{1}{n^2}\left[ \sum_{i=1}^p \zeta_{(i)}(4) - \left(\sum_{i=1}^p \zeta_{(i)}(2)\right)^2 \right];$
$\zeta_{(i)}(m) \equiv E\left\{ I\{L(\Delta_{(i)}, \Delta) < Z \le U(\Delta_{(i)}, \Delta)\}\,(Z - \sqrt{n}\,\Delta_{(i)})^m \right\}.$
These formulas enable computation of the theoretical risk functions of the classical $M_p$-based estimators.
19 An Iterative Estimator
Consider the class $C = \{ \tilde\sigma^2(c) \equiv c\,\hat\sigma^2_{p,MLE} : c \ge 0 \}$.
The risk function of $\tilde\sigma^2(c)$, which is quadratic in c, can be minimized with respect to c. The minimizing value is
$c^*(\Delta) = \mathrm{EpMLE}(\Delta) \,/\, \{\mathrm{VpMLE}(\Delta) + [\mathrm{EpMLE}(\Delta)]^2\}.$
Given a $c^*$, the quantity $\Delta = (\mu - \mu_{i_0}\mathbf{1}_p)/\sigma$ can be estimated via
$\hat\Delta = (\mu - \mu_{\hat M}\mathbf{1}_p)/\tilde\sigma(c^*).$
This in turn can be used to obtain a new estimate of $c^*(\Delta)$.
20 Algorithm for $\hat\sigma^2_{p,ITER}$
Step 0 (Initialization): Set a tolerance tol (say, tol = $10^{-8}$) and set $c_{old} = 1$.
Step 1: Define $\tilde\sigma^2 = c_{old}\,\hat\sigma^2_{p,MLE}$.
Step 2: Compute $\hat\Delta = (\mu - \mu_{\hat M}\mathbf{1}_p)/\tilde\sigma$.
Step 3: Compute $c_{new} = \mathrm{EpMLE}(\hat\Delta) \,/\, \{\mathrm{VpMLE}(\hat\Delta) + [\mathrm{EpMLE}(\hat\Delta)]^2\}$.
Step 4: If $|c_{old} - c_{new}| < \mathrm{tol}$, set $\hat\sigma^2_{p,ITER} = \tilde\sigma^2$ and stop; else set $c_{old} = c_{new}$ and go back to Step 1.
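The loop can be sketched directly from the steps above. EpMLE and VpMLE are passed in as user-supplied functions (their closed forms are on slides 16 and 18); everything else here, including the function names, is our own illustration:

```python
import math

def iterative_sigma2(x, mus, epmle, vpmle, tol=1e-8, max_iter=100):
    """Sketch of the iterative estimator. epmle(delta) and vpmle(delta)
    must return EpMLE(Delta) and VpMLE(Delta) for a standardized
    mean-offset vector Delta; they are assumed supplied by the caller."""
    n = len(x)
    xbar = sum(x) / n
    i_hat = min(range(len(mus)), key=lambda i: abs(xbar - mus[i]))  # model selector
    sigma2_pmle = sum((xj - mus[i_hat]) ** 2 for xj in x) / n       # two-step MLE
    c_old = 1.0
    for _ in range(max_iter):
        sigma2 = c_old * sigma2_pmle                                 # Step 1
        delta = [(m - mus[i_hat]) / math.sqrt(sigma2) for m in mus]  # Step 2
        e, v = epmle(delta), vpmle(delta)
        c_new = e / (v + e * e)                                      # Step 3
        if abs(c_old - c_new) < tol:                                 # Step 4
            return c_new * sigma2_pmle
        c_old = c_new
    return c_old * sigma2_pmle
```

As a sanity check, plugging in the constant functions epmle = 1 and vpmle = 2/n makes $c^* = n/(n+2)$, so the loop converges immediately to $(n/(n+2))\,\hat\sigma^2_{p,MLE}$.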
21 Impact of Number of Sub-Models
Theorem: With n > 1 fixed, if as $p \to \infty$, $\max_{2 \le i \le p} |\Delta_{(i)} - \Delta_{(i-1)}| \to 0$, $\Delta_{(1)} \to -\infty$, and $\Delta_{(p)} \to \infty$, then
$\mathrm{Eff}\left(\hat\sigma^2_{p,MRE} : \hat\sigma^2_{MRE}\right) \to \frac{2(n+2)^2}{(n+1)(2n+7)} < 1.$
Therefore, the advantage of exploiting the structure of $M_p$ could be lost forever when p increases!
22 Representation: Weighted Estimators
Umbrella estimator: For $\alpha > 0$, define
$\hat\sigma^2_{p,LB}(\alpha) = \sum_{i=1}^p \frac{(\hat\sigma^2_i)^{-\alpha}}{\sum_{j=1}^p (\hat\sigma^2_j)^{-\alpha}}\,\hat\sigma^2_i.$
Theorem: Under $M_p$,
$\frac{\hat\sigma^2_{p,LB}(\alpha)}{\sigma^2} \stackrel{d}{=} \frac{W}{n}\{1 + H(T; \alpha)\}; \qquad T = (T_1, T_2, \ldots, T_p) = V/\sqrt{W};$
23 $H(T; \alpha) = \sum_{i=1}^p \theta_i(T; \alpha)\,T_i^2; \qquad \theta_i(T; \alpha) = \frac{(1 + T_i^2)^{-\alpha}}{\sum_{j=1}^p (1 + T_j^2)^{-\alpha}}.$
Even with this representation, it is still difficult to obtain exact expressions for the mean and variance. We developed second-order approximations, but they were not satisfactory for small n (n ≤ 15). In the comparisons, we resorted to simulations to approximate the risk functions of the weighted estimators.
24 Some Simulation Results
Figures 1 and 2: Simulated and theoretical risk curves for n = 3 and n = 10 (based on replications per $\Delta$).
25 [Figure: Theoretical and/or simulated relative (to UMVU) efficiency curves versus $\Delta$. Legend: pmle simulated, pmle theoretical, pmre simulated, pmre theoretical, pplb1 simulated, piter simulated.]
26 [Figure: Theoretical and/or simulated relative (to UMVU) efficiency curves versus $\Delta$, second case. Same legend as the previous figure.]
27 Table: Relative efficiency (wrt UMVU) for symmetric and increasing p with limits [−1, 1] and n = 3, 10, 30, using 1000 replications. Except for the first set, denoted by (*), where the mean vector is {0, 1}, the mean vectors are of the form [−1 : 2^{−k} : 1], for which p = 2^{k+1} + 1. A final letter of 't' on a label means theoretical, whereas 's' means simulated.
28 [Table of relative efficiencies with columns n, k, p, pmles, pmlet, pmres, pmret, pplb1s, piters; the numerical entries, including the (*) rows for n = 3, were not preserved in the transcription.]
29 Concluding Remarks
In models with sub-models, where the interest is to infer about a common parameter, possible approaches are:
Approach I: Use a wider model subsuming the sub-models, possibly a fully nonparametric model. Possibly inefficient, though its properties might be easier to ascertain.
Approach II: A two-step approach: select a sub-model using the data; then use the procedure for the chosen sub-model, again using the same data.
30 Approach III: Utilize a Bayesian framework. Assign a prior to the sub-models, and (conditional) priors on the parameters within the sub-models. This leads to model averaging.
Approaches (II) and (III) are generally preferable to approach (I); but when the number of sub-models is large, approach (I) may provide better estimators and a simpler determination of their properties.
Approach (II) appears preferable if the sub-models are quite different, so that the model selector can easily choose the correct model, or if the sub-models are so similar that an erroneous choice by the selector will not matter much.
31 In the in-between situation, approach (III) seems preferable.
For the specific Gaussian model considered, the iterative estimator actually performed in a robust fashion.
To conclude: observe caution when doing inference after model selection, especially when double-dipping on the data!
The Multinomial Logit Model Revisited: A Semiparametric Approach in Discrete Choice Analysis Dr. Baibing Li, Loughborough University Wednesday, 02 February 2011-16:00 Location: Room 610, Skempton (Civil
More informationMonetary Economics Final Exam
316-466 Monetary Economics Final Exam 1. Flexible-price monetary economics (90 marks). Consider a stochastic flexibleprice money in the utility function model. Time is discrete and denoted t =0, 1,...
More informationQuantitative Risk Management
Quantitative Risk Management Asset Allocation and Risk Management Martin B. Haugh Department of Industrial Engineering and Operations Research Columbia University Outline Review of Mean-Variance Analysis
More informationMachine Learning for Quantitative Finance
Machine Learning for Quantitative Finance Fast derivative pricing Sofie Reyners Joint work with Jan De Spiegeleer, Dilip Madan and Wim Schoutens Derivative pricing is time-consuming... Vanilla option pricing
More informationParameter uncertainty for integrated risk capital calculations based on normally distributed subrisks
Parameter uncertainty for integrated risk capital calculations based on normally distributed subrisks Andreas Fröhlich and Annegret Weng March 7, 017 Abstract In this contribution we consider the overall
More informationAn Improved Skewness Measure
An Improved Skewness Measure Richard A. Groeneveld Professor Emeritus, Department of Statistics Iowa State University ragroeneveld@valley.net Glen Meeden School of Statistics University of Minnesota Minneapolis,
More informationConjugate Models. Patrick Lam
Conjugate Models Patrick Lam Outline Conjugate Models What is Conjugacy? The Beta-Binomial Model The Normal Model Normal Model with Unknown Mean, Known Variance Normal Model with Known Mean, Unknown Variance
More informationChapter 4 Variability
Chapter 4 Variability PowerPoint Lecture Slides Essentials of Statistics for the Behavioral Sciences Seventh Edition by Frederick J Gravetter and Larry B. Wallnau Chapter 4 Learning Outcomes 1 2 3 4 5
More informationCalibration of Interest Rates
WDS'12 Proceedings of Contributed Papers, Part I, 25 30, 2012. ISBN 978-80-7378-224-5 MATFYZPRESS Calibration of Interest Rates J. Černý Charles University, Faculty of Mathematics and Physics, Prague,
More informationSTAT 111 Recitation 4
STAT 111 Recitation 4 Linjun Zhang http://stat.wharton.upenn.edu/~linjunz/ September 29, 2017 Misc. Mid-term exam time: 6-8 pm, Wednesday, Oct. 11 The mid-term break is Oct. 5-8 The next recitation class
More informationGPD-POT and GEV block maxima
Chapter 3 GPD-POT and GEV block maxima This chapter is devoted to the relation between POT models and Block Maxima (BM). We only consider the classical frameworks where POT excesses are assumed to be GPD,
More informationSampling Distribution
MAT 2379 (Spring 2012) Sampling Distribution Definition : Let X 1,..., X n be a collection of random variables. We say that they are identically distributed if they have a common distribution. Definition
More informationBack to estimators...
Back to estimators... So far, we have: Identified estimators for common parameters Discussed the sampling distributions of estimators Introduced ways to judge the goodness of an estimator (bias, MSE, etc.)
More informationWorst-Case Value-at-Risk of Non-Linear Portfolios
Worst-Case Value-at-Risk of Non-Linear Portfolios Steve Zymler Daniel Kuhn Berç Rustem Department of Computing Imperial College London Portfolio Optimization Consider a market consisting of m assets. Optimal
More informationHomework Problems Stat 479
Chapter 10 91. * A random sample, X1, X2,, Xn, is drawn from a distribution with a mean of 2/3 and a variance of 1/18. ˆ = (X1 + X2 + + Xn)/(n-1) is the estimator of the distribution mean θ. Find MSE(
More informationStrategies for Improving the Efficiency of Monte-Carlo Methods
Strategies for Improving the Efficiency of Monte-Carlo Methods Paul J. Atzberger General comments or corrections should be sent to: paulatz@cims.nyu.edu Introduction The Monte-Carlo method is a useful
More information