Estimation after Model Selection


Estimation after Model Selection Vanja M. Dukić Department of Health Studies University of Chicago E-Mail: vanja@uchicago.edu Edsel A. Peña* Department of Statistics University of South Carolina E-Mail: pena@stat.sc.edu ENAR 2003 Talk March 31, 2003 Tampa Bay, FL Research support from NSF

Motivating Situations

Suppose you have a random sample $X = (X_1, X_2, \ldots, X_n)$ (possibly censored) from an unknown distribution $F$ which belongs to either the Weibull class or the gamma class. What is the best way to estimate $F(t)$ or some other parameter of interest?

Suppose it is known that the unknown df $F$ belongs to one of $p$ models $M_1, M_2, \ldots, M_p$, which are possibly nested. What is the best way of estimating a parameter common to all of these models?

Intuitive Strategies

Strategy I: Utilize estimators developed under a larger model $M$, or implement a fully nonparametric approach.

Strategy II (Classical): [Step 1 (Model Selection):] Choose the most plausible model using the data, possibly via information measures. [Step 2 (Inference):] Use the estimators for the chosen sub-model, with these estimators still using the same data $X$.

Strategy III (Bayesian): Determine adaptively (i.e., using $X$) the plausibility of each of the sub-models, and form a weighted combination of the sub-model estimators or tests. Also referred to as model averaging.

Relevance and Issues

What are the consequences of first selecting a sub-model and then performing inference, such as estimation or hypothesis testing, with both steps utilizing the same sample data (i.e., double-dipping)?

Is it always better to do model averaging, that is, to adopt a Bayesian framework? Equivalently, under what circumstances is model averaging preferable to the classical two-step approach?

As the number of possible models increases, would it be better to simply utilize a wider, possibly nonparametric, model?

A Concrete Gaussian Model

Data: $X = (X_1, X_2, \ldots, X_n)$ IID $F \in M = \{ N(\mu, \sigma^2) : \mu \in \mathbb{R}, \sigma^2 > 0 \}$.

The uniformly minimum variance unbiased (UMVU) estimator of $\sigma^2$ is the sample variance
$$\hat\sigma^2_{UMVU} = S^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar X)^2.$$

Decision-theoretic framework with loss function
$$L_1(\hat\sigma^2, (\mu, \sigma^2)) = \left( \frac{\hat\sigma^2 - \sigma^2}{\sigma^2} \right)^2.$$

Risk function: For the quadratic loss $L_1$,
$$\mathrm{Risk}(\hat\sigma^2) = \mathrm{Var}\left( \frac{\hat\sigma^2}{\sigma^2} \right) + \left[ \mathrm{Bias}\left( \frac{\hat\sigma^2}{\sigma^2} \right) \right]^2.$$

$S^2$ is not the best: it is dominated by the ML and the minimum risk equivariant (MRE) estimators
$$\hat\sigma^2_{MLE} = \frac{1}{n} \sum_{i=1}^n (X_i - \bar X)^2, \qquad \hat\sigma^2_{MRE} = \left( \frac{n}{n+1} \right) \hat\sigma^2_{MLE}.$$
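The three estimators differ only in their divisors. A minimal sketch (not from the talk; the sample below is purely illustrative) comparing them on simulated data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=25)   # true sigma^2 = 9
n = len(x)

ss = np.sum((x - x.mean()) ** 2)   # sum of squared deviations about the mean

s2_umvu = ss / (n - 1)   # sample variance S^2 (unbiased)
s2_mle  = ss / n         # maximum likelihood estimator
s2_mre  = ss / (n + 1)   # minimum risk equivariant: (n/(n+1)) * MLE

print(s2_umvu, s2_mle, s2_mre)   # MRE shrinks hardest, trading bias for lower risk
```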

Model $M_p$: Our Test Model

Suppose we do not know the exact value of $\mu$, but we do know it is one of $p$ possible values. This leads to the model
$$M_p = \{ N(\mu, \sigma^2) : \mu \in \{\mu_1, \ldots, \mu_p\}, \sigma^2 > 0 \}$$
where $\mu_1, \mu_2, \ldots, \mu_p$ are known constants.

Under $M_p$, how should we estimate $\sigma^2$? What are the consequences of using the estimators developed under $M$? Can we exploit the structure of $M_p$ to obtain better estimators of $\sigma^2$?

Classical Estimators Under $M_p$

Sub-model MLEs and MREs:
$$\hat\sigma_i^2 = \frac{1}{n} \sum_{j=1}^n (X_j - \mu_i)^2; \qquad \hat\sigma^2_{MRE,i} = \frac{1}{n+2} \sum_{j=1}^n (X_j - \mu_i)^2.$$

Model selector: $\hat M = \hat M(X)$, with
$$\hat M = \arg\min_{1 \le i \le p} \hat\sigma_i^2 = \arg\min_{1 \le i \le p} |\bar X - \mu_i|.$$

$\hat M$ chooses the sub-model leading to the smallest estimate of $\sigma^2$; equivalently, the sub-model whose mean is closest to the sample mean.

MLE of $\sigma^2$ under $M_p$ (a two-step adaptive estimator):
$$\hat\sigma^2_{p,MLE} = \hat\sigma^2_{\hat M} = \sum_{i=1}^p I\{\hat M = i\} \hat\sigma_i^2.$$

An alternative estimator uses the selected sub-model's MRE:
$$\hat\sigma^2_{p,MRE} = \hat\sigma^2_{MRE,\hat M} = \sum_{i=1}^p I\{\hat M = i\} \hat\sigma^2_{MRE,i}.$$

Properties of these adaptive estimators are not easily obtained because of the interplay between the model selector $\hat M$ and the sub-model estimators.
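A hedged sketch of the two-step procedure (the function name is mine, not from the talk): select the sub-model whose mean is closest to $\bar X$, then apply that sub-model's MLE or MRE to the same data.

```python
import numpy as np

def two_step_estimators(x, mus):
    """Return (sigma2_pMLE, sigma2_pMRE) under M_p with candidate means `mus`."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # sigma_i^2-hat: mean squared deviation from each candidate mean mu_i
    s2 = np.array([np.mean((x - mu) ** 2) for mu in mus])
    i_hat = int(np.argmin(s2))             # model selector M-hat
    sigma2_pmle = s2[i_hat]                # selected sub-model's MLE
    sigma2_pmre = n * s2[i_hat] / (n + 2)  # selected sub-model's MRE
    return sigma2_pmle, sigma2_pmre

rng = np.random.default_rng(1)
x = rng.normal(loc=0.0, scale=2.0, size=20)   # true mean 0, sigma^2 = 4
print(two_step_estimators(x, mus=[-1.0, 0.0, 1.0]))
```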

Bayes Estimators Under $M_p$

Joint prior for $(\mu, \sigma^2)$: independent priors, with

Prior for $\mu$: Multinomial$(1, \theta)$ over $\{\mu_1, \ldots, \mu_p\}$;
Prior for $\sigma^2$: Inverted Gamma$(\kappa, \beta)$.

Posterior probabilities of the sub-models:
$$\theta_i(x) = \frac{\theta_i \left( n\hat\sigma_i^2/2 + \beta \right)^{-(n/2 + \kappa - 1)}}{\sum_{j=1}^p \theta_j \left( n\hat\sigma_j^2/2 + \beta \right)^{-(n/2 + \kappa - 1)}}.$$

Posterior density of $\sigma^2$:
$$\pi(\sigma^2 \mid x) = C \sum_{i=1}^p \theta_i \left( \frac{1}{\sigma^2} \right)^{\kappa + n/2} \exp\left[ -\frac{1}{\sigma^2} \left( n\hat\sigma_i^2/2 + \beta \right) \right].$$

Bayes (weighted) estimator of $\sigma^2$:
$$\hat\sigma^2_{p,Bayes}(X) = \sum_{i=1}^p \theta_i(X) \left\{ \left( \frac{n}{n + 2(\kappa - 2)} \right) \hat\sigma_i^2 + \left( \frac{2(\kappa - 2)}{n + 2(\kappa - 2)} \right) \left( \frac{\beta}{\kappa - 2} \right) \right\}.$$

Non-informative priors: uniform prior over the sub-models, $\theta_i = 1/p$, $i = 1, 2, \ldots, p$; and $\beta \to 0$.
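A hedged sketch of these two formulas (assuming the reconstructions above; `bayes_estimator` and its arguments are my naming): each sub-model estimate is shrunk toward the prior mean $\beta/(\kappa - 2)$ and then averaged under the posterior weights. $\kappa > 2$ is assumed so the prior mean exists.

```python
import numpy as np

def bayes_estimator(x, mus, theta, kappa, beta):
    """Bayes weighted estimator of sigma^2 under M_p (kappa > 2 assumed)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    s2 = np.array([np.mean((x - mu) ** 2) for mu in mus])   # sigma_i^2-hat
    # posterior sub-model probabilities, computed on the log scale for stability
    logw = np.log(theta) - (n / 2.0 + kappa - 1.0) * np.log(n * s2 / 2.0 + beta)
    w = np.exp(logw - logw.max())
    w /= w.sum()
    # convex combination of each sub-model estimate and the prior mean
    a = n / (n + 2.0 * (kappa - 2.0))
    prior_mean = beta / (kappa - 2.0)
    return float(np.sum(w * (a * s2 + (1.0 - a) * prior_mean)))

rng = np.random.default_rng(2)
x = rng.normal(1.0, 1.5, size=15)
print(bayes_estimator(x, mus=[0.0, 1.0], theta=[0.5, 0.5], kappa=3.0, beta=1.0))
```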

One particular limiting Bayes estimator is
$$\hat\sigma^2_{p,LB1} = \sum_{i=1}^p \frac{(\hat\sigma_i^2)^{-n/2}}{\sum_{j=1}^p (\hat\sigma_j^2)^{-n/2}} \, \hat\sigma_i^2,$$
an adaptively weighted estimator formed from the sub-model estimators. But, based on the simulation studies, a better one is formed from the sub-model MREs:
$$\hat\sigma^2_{p,PLB1} = \left( \frac{n}{n+2} \right) \hat\sigma^2_{p,LB1}.$$
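In this limit the prior-mean term drops out and only the adaptive weights remain. A minimal sketch (my naming), again keeping the weights on the log scale for numerical stability:

```python
import numpy as np

def limiting_bayes(x, mus):
    """sigma^2_{p,LB1} and its MRE-adjusted variant sigma^2_{p,PLB1}."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    s2 = np.array([np.mean((x - mu) ** 2) for mu in mus])
    logw = -(n / 2.0) * np.log(s2)        # weights proportional to (s2_i)^(-n/2)
    w = np.exp(logw - logw.max())
    w /= w.sum()
    s2_lb1 = float(np.sum(w * s2))
    return s2_lb1, n * s2_lb1 / (n + 2)   # (LB1, PLB1)
```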

Comparing the Estimators

$$R(\hat\sigma^2_{UMVU}, (\mu, \sigma^2)) = \frac{2}{n-1}, \qquad R(\hat\sigma^2_{MRE}, (\mu, \sigma^2)) = \frac{2}{n+1}.$$

Efficiency measure relative to $\hat\sigma^2_{UMVU}$:
$$\mathrm{Eff}(\hat\sigma^2 : \hat\sigma^2_{UMVU}) = \frac{R(\hat\sigma^2_{UMVU}, (\mu, \sigma^2))}{R(\hat\sigma^2, (\mu, \sigma^2))}.$$

$$\mathrm{Eff}(\hat\sigma^2_{MRE} : \hat\sigma^2_{UMVU}) = \frac{n+1}{n-1} = 1 + \frac{2}{n-1}.$$

Properties of $M_p$-Based Estimators

Notation: Let $Z \sim N(0,1)$ and, with $\mu_{i_0}$ the true mean, define
$$\Delta = \frac{\mu - \mu_{i_0} 1_p}{\sigma},$$
where $\mu = (\mu_1, \ldots, \mu_p)'$.

Proposition: Under $M_p$,
$$\frac{\hat\sigma_i^2}{\sigma^2} \overset{d}{=} \frac{1}{n} (W + V_i^2), \quad i = 1, 2, \ldots, p;$$
with $W$ and $V$ independent, $W \sim \chi^2_{n-1}$, and $V = Z 1_p - \sqrt{n}\,\Delta \sim N_p(-\sqrt{n}\,\Delta, J = 1_p 1_p')$.

Notation: Given $\Delta$, let $\delta_{(1)} < \delta_{(2)} < \ldots < \delta_{(p)}$ be its ordered values ($\Delta$ always has a zero component).

Theorem: Under $M_p$,
$$\frac{\hat\sigma^2_{p,MLE}}{\sigma^2} \overset{d}{=} \frac{1}{n} \left\{ W + \sum_{i=1}^p I\{ L(\delta_{(i)}, \Delta) < Z < U(\delta_{(i)}, \Delta) \} \, (Z - \sqrt{n}\,\delta_{(i)})^2 \right\}$$
with
$$L(\delta_{(i)}, \Delta) = \frac{\sqrt{n}}{2} \left[ \delta_{(i)} + \delta_{(i-1)} \right]; \qquad U(\delta_{(i)}, \Delta) = \frac{\sqrt{n}}{2} \left[ \delta_{(i)} + \delta_{(i+1)} \right]$$
(taking $\delta_{(0)} = -\infty$ and $\delta_{(p+1)} = +\infty$).

Mean:
$$\mathrm{EpMLE}(\Delta) \equiv E\left[ \frac{\hat\sigma^2_{p,MLE}}{\sigma^2} \right] = 1 - \frac{2}{\sqrt{n}} \sum_{i=1}^p \delta_{(i)} \left[ \phi(L(\delta_{(i)}, \Delta)) - \phi(U(\delta_{(i)}, \Delta)) \right] + \sum_{i=1}^p \delta_{(i)}^2 \left[ \Phi(U(\delta_{(i)}, \Delta)) - \Phi(L(\delta_{(i)}, \Delta)) \right].$$

Case of $p = 2$ (with $\Delta$ now denoting the nonzero component; the expression is symmetric in $|\Delta|$):
$$\mathrm{EpMLE}(\Delta) = 1 - \frac{1}{n} \left\{ 2\sqrt{n}\,\Delta\, \phi\!\left( \frac{\sqrt{n}\,\Delta}{2} \right) - (\sqrt{n}\,\Delta)^2 \left[ 1 - \Phi\!\left( \frac{\sqrt{n}\,\Delta}{2} \right) \right] \right\}.$$

[Figure: $\mathrm{EpMLE}$ plotted against $\sqrt{n}\,\Delta/2$; the curve dips below 1, ranging from about 0.90 to 1.00.]

$\hat\sigma^2_{p,MLE}$ is negatively biased for $\sigma^2$ (even though each sub-model estimator is unbiased). Effect of double-dipping.
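The dip can be checked numerically. A hedged sketch (assuming the $p = 2$ reconstruction above is right; `epmle_p2` is my naming):

```python
import numpy as np
from scipy.stats import norm

def epmle_p2(delta, n):
    """EpMLE(Delta) for p = 2; symmetric in |Delta|."""
    a = np.sqrt(n) * abs(delta) / 2.0
    return 1.0 - (2.0 * np.sqrt(n) * abs(delta) * norm.pdf(a)
                  - n * delta ** 2 * norm.sf(a)) / n

# bias is zero at Delta = 0, worst at moderate Delta, and vanishes again
# as |Delta| grows (the selector then almost always picks the right model)
for d in [0.0, 0.25, 0.5, 1.0, 2.0]:
    print(d, round(epmle_p2(d, n=10), 4))
```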

Variance:
$$\mathrm{VpMLE}(\Delta) \equiv \mathrm{Var}\left[ \frac{\hat\sigma^2_{p,MLE}}{\sigma^2} \right] = \frac{2}{n} \left( 1 - \frac{1}{n} \right) + \frac{1}{n^2} \left\{ \sum_{i=1}^p \zeta_{(i)}(4) - \left[ \sum_{i=1}^p \zeta_{(i)}(2) \right]^2 \right\};$$

$$\zeta_{(i)}(m) \equiv E\left\{ I\{ L(\delta_{(i)}, \Delta) < Z \le U(\delta_{(i)}, \Delta) \} \, (Z - \sqrt{n}\,\delta_{(i)})^m \right\}.$$

These formulas enable computation of the theoretical risk functions of the classical $M_p$-based estimators.

An Iterative Estimator

Consider the class $C = \{ \tilde\sigma^2(c) \equiv c\, \hat\sigma^2_{p,MLE} : c \ge 0 \}$.

The risk function of $\tilde\sigma^2(c)$, which is quadratic in $c$, can be minimized with respect to $c$. The minimizing value is
$$c^*(\Delta) = \frac{\mathrm{EpMLE}(\Delta)}{\mathrm{VpMLE}(\Delta) + [\mathrm{EpMLE}(\Delta)]^2}.$$

Given a $c^*$, $\Delta = (\mu - \mu_{i_0} 1_p)/\sigma$ can be estimated via
$$\hat\Delta = \frac{\mu - \mu_{\hat M} 1_p}{\tilde\sigma(c^*)}.$$

This in turn can be used to obtain a new estimate of $c^*(\Delta)$, suggesting the following iteration.

Algorithm for $\hat\sigma^2_{p,iter}$

Step 0 (Initialization): Set a tolerance tol (say, tol $= 10^{-8}$) and set $c_{old} = 1$.
Step 1: Define $\tilde\sigma^2 = c_{old}\, \hat\sigma^2_{p,MLE}$.
Step 2: Compute $\hat\Delta = (\mu - \mu_{\hat M} 1_p)/\tilde\sigma$.
Step 3: Compute $c_{new} = \mathrm{EpMLE}(\hat\Delta) / \{ \mathrm{VpMLE}(\hat\Delta) + [\mathrm{EpMLE}(\hat\Delta)]^2 \}$.
Step 4: If $|c_{old} - c_{new}| <$ tol, set $\hat\sigma^2_{p,iter} = \tilde\sigma^2$ and stop; else set $c_{old} = c_{new}$ and return to Step 1.
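A hedged implementation sketch. Since the $\zeta_{(i)}(m)$ integrals are awkward to evaluate exactly, EpMLE and VpMLE are approximated here by Monte Carlo over $Z$ (my substitution, not the talk's method); the intervals $(L, U)$ partition the line at midpoints between consecutive $\sqrt{n}\,\delta_{(i)}$, so each draw of $Z$ simply falls in the cell of the nearest $\sqrt{n}\,\delta_{(i)}$. Function names, the looser tolerance, and the iteration cap are mine.

```python
import numpy as np

def epmle_vpmle(delta, n, num_mc=100_000):
    """Monte Carlo approximation of EpMLE(Delta) and VpMLE(Delta)."""
    z = np.random.default_rng(0).standard_normal(num_mc)  # fixed seed: deterministic map
    d = np.sqrt(n) * np.asarray(delta, dtype=float)
    nearest = np.argmin(np.abs(z[:, None] - d[None, :]), axis=1)
    q = (z - d[nearest]) ** 2            # the indicator-sum term, cell by cell
    e = ((n - 1) + q.mean()) / n         # E[(W + Q)/n],  W ~ chi^2_{n-1}
    v = (2 * (n - 1) + q.var()) / n**2   # Var[(W + Q)/n], W independent of Q
    return e, v

def sigma2_p_iter(x, mus, tol=1e-6, max_iter=100):
    """Iterative estimator sigma^2_{p,iter} following the algorithm above."""
    x, mus = np.asarray(x, dtype=float), np.asarray(mus, dtype=float)
    n = len(x)
    s2 = np.array([np.mean((x - mu) ** 2) for mu in mus])
    i_hat = int(np.argmin(s2))           # model selector M-hat
    s2_pmle = s2[i_hat]
    c_old = 1.0                          # Step 0
    for _ in range(max_iter):
        s2_tilde = c_old * s2_pmle                           # Step 1
        delta_hat = (mus - mus[i_hat]) / np.sqrt(s2_tilde)   # Step 2
        e, v = epmle_vpmle(delta_hat, n)                     # Step 3
        c_new = e / (v + e ** 2)
        if abs(c_old - c_new) < tol:                         # Step 4
            return c_new * s2_pmle
        c_old = c_new
    return c_old * s2_pmle
```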

Impact of Number of Sub-Models

Theorem: With $n > 1$ fixed, if as $p \to \infty$, $\max_{2 \le i \le p} |\delta_{(i)} - \delta_{(i-1)}| \to 0$, $\delta_{(1)} \to -\infty$, and $\delta_{(p)} \to +\infty$, then
$$\mathrm{Eff}(\hat\sigma^2_{p,MRE} : \hat\sigma^2_{MRE}) \to \frac{2(n+2)^2}{(n+1)(2n+7)} < 1.$$

Therefore, the advantage of exploiting the structure of $M_p$ can be lost entirely as $p$ increases!
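For a sense of the magnitude: at $n = 10$ the limit is $2(12)^2 / (11 \cdot 27) = 288/297 \approx 0.97$, so in this dense-sub-model limit the $M_p$-based estimator ends up slightly worse than simply using $\hat\sigma^2_{MRE}$ from the unrestricted model $M$.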

Representation: Weighted Estimators

Umbrella estimator: For $\alpha > 0$, define
$$\hat\sigma^2_{p,LB}(\alpha) = \sum_{i=1}^p \frac{(\hat\sigma_i^2)^{-\alpha}}{\sum_{j=1}^p (\hat\sigma_j^2)^{-\alpha}} \, \hat\sigma_i^2.$$

Theorem: Under $M_p$,
$$\frac{\hat\sigma^2_{p,LB}(\alpha)}{\sigma^2} \overset{d}{=} \frac{W}{n} \{ 1 + H(T; \alpha) \}; \qquad T = (T_1, T_2, \ldots, T_p)' = V / \sqrt{W};$$

$$H(T; \alpha) = \sum_{i=1}^p \theta_i(T; \alpha) T_i^2; \qquad \theta_i(T; \alpha) = \frac{(1 + T_i^2)^{-\alpha}}{\sum_{j=1}^p (1 + T_j^2)^{-\alpha}}.$$

Even with this representation, it is still difficult to obtain exact expressions for the mean and variance. We developed 2nd-order approximations, but they were not satisfactory when $n \le 15$. In the comparisons, we therefore resorted to simulation to approximate the risk functions of the weighted estimators.
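A hedged sketch of such a risk simulation (my naming; the loss is the talk's $L_1$), which accepts any estimator of $\sigma^2$ as a function of the data and the candidate means:

```python
import numpy as np

def simulated_risk(estimator, mus, true_mu, sigma2, n, reps=10_000, seed=0):
    """Approximate E[((sigma2_hat - sigma2)/sigma2)^2] by Monte Carlo."""
    rng = np.random.default_rng(seed)
    losses = np.empty(reps)
    for r in range(reps):
        x = rng.normal(true_mu, np.sqrt(sigma2), size=n)
        losses[r] = ((estimator(x, mus) - sigma2) / sigma2) ** 2
    return losses.mean()

# example: risk of the two-step pMLE at the true model (mu = 0, sigma^2 = 1)
pmle = lambda x, mus: min(np.mean((np.asarray(x) - mu) ** 2) for mu in mus)
print(simulated_risk(pmle, mus=[-1.0, 0.0, 1.0], true_mu=0.0, sigma2=1.0, n=10))
```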

Some Simulation Results

Figures 1 and 2: Simulated and theoretical risk curves for $n = 3$ and $n = 10$ (based on 10,000 replications per $\Delta$).

[Figure 1: Theoretical and/or simulated efficiency (relative to UMVU) versus $\Delta$ for pMLE, pMRE, pPLB1, and pIter; efficiencies roughly 160-260 ($n = 3$).]

[Figure 2: Theoretical and/or simulated efficiency (relative to UMVU) versus $\Delta$ for pMLE, pMRE, pPLB1, and pIter; efficiencies roughly 105-135 ($n = 10$).]

Table: Relative efficiency (with respect to UMVU) for symmetric, increasing $p$ with limits $[-1, 1]$ and $n = 3, 10, 30$, using 1000 replications. Except for the first set, denoted by (*), where the mean vector is $\{0, 1\}$, the mean vectors are of the form $[-1 : 2^{-k} : 1]$, for which $p = 2^{k+1} + 1$. A label ending in 't' means theoretical, whereas one ending in 's' means simulated.

 n   k   p   pmles  pmlet  pmres  pmret  pplb1s  piters
 3   *   2    171    170    238    232    247     238
10   *   2    118    115    139    134    133     135
30   *   2    101    104    109    111    108     109
 3   0   3    208    195    219    216    260     224
10   0   3    116    120    136    134    127     129
30   0   3    111    104    115    111    114     114
 3   1   5    185    185    203    199    248     212
10   1   5    114    119    119    124    120     118
30   1   5    111    106    115    110    112     113
 3   2   9    188    182    198    195    243     209
10   2   9    117    118    120    120    127     123
30   2   9    102    106    104    107    103     103
 3   3  17    183    181    190    194    235     200
10   3  17    111    117    118    119    123     119
30   3  17    113    105    115    106    115     115
 3   4  33    184    181    193    194    239     204
10   4  33    117    117    116    119    125     121
30   4  33    102    105    105    105    105     105
 3   5  65    159    181    194    194    226     199
10   5  65    124    117    120    119    132     127
30   5  65    106    105    105    105    107     107

Concluding Remarks

When a model contains sub-models and the interest is in inference about a common parameter, possible approaches are:

Approach I: Use a wider model subsuming the sub-models, possibly a fully nonparametric model. Possibly inefficient, though its properties may be easier to ascertain.

Approach II: A two-step approach: select a sub-model using the data; then apply the procedure for the chosen sub-model, again using the same data.

Approach III: Utilize a Bayesian framework. Assign a prior to the sub-models, and (conditional) priors to the parameters within the sub-models. This leads to model averaging.

Approaches (II) and (III) are generally preferable to approach (I); but when the number of sub-models is large, approach (I) may provide better estimators and a simpler determination of their properties.

If the sub-models are quite different, so that the model selector can easily choose the correct one, or if they are so similar that an erroneous choice hardly matters, approach (II) appears preferable. In the in-between situations, approach (III) seems preferable.

For the specific Gaussian model considered, the iterative estimator performed in a robust fashion.

To conclude: exercise caution when doing inference after model selection, especially when double-dipping on the data!