5 Decision Theory: Basic Concepts


Hannah Fleming
5 years ago

Point estimation of an unknown parameter is generally considered the most basic inference problem. Speaking generically, if $\theta$ is some unknown parameter taking values in a suitable parameter space $\Theta$, then a point estimate is an educated guess at the true value of $\theta$. Of course, we do not guess the true value of $\theta$ without some information. Information comes from data; some information may also come from expert opinion separate from data. In any case, a point estimate is a function of the available sample data. To start with, we allow any function of the data as a possible point estimate. The theory of inference is used to separate the good estimates from the not so good or bad ones. As usual, we start with an example to help us understand the general definitions.

Example 5.1. Suppose $X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} \mathrm{Poi}(\lambda)$, $\lambda > 0$, and suppose we want to estimate the parameter $\lambda$. Now, $\lambda = E_\lambda(X_1)$, i.e., $\lambda$ is the population mean. So, just instinctively, we may want to estimate $\lambda$ by the sample mean $\bar X = \frac{X_1 + \cdots + X_n}{n}$; and, indeed, $\bar X$ is a possible point estimator of $\lambda$. While $\lambda$ takes values in $\Theta = (0, \infty)$, the estimator $\bar X$ takes values in $A = [0, \infty)$; $\bar X$ can be equal to zero! So, the set of possible values of the parameter and the set of possible values of an estimator need not be identical. We must allow them to be different sets, in general.

Now, $\bar X$ is certainly not the only possible estimator of $\lambda$. We can use essentially any function of the sample observations $X_1, \ldots, X_n$ to estimate $\lambda$: for example, just $X_1$, or $X_1 + X_2 - X_3$, or even seemingly poor estimators like $X_1^4$. Any estimator is allowed to begin with; theory will separate the good ones from the bad ones.

Next, suppose $X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} N(\mu, \sigma^2)$, where $\mu, \sigma$ are unknown parameters. So now we have a two dimensional parameter vector, $\theta = (\mu, \sigma)$. Suppose we want to estimate $\mu$. Once again, a possible estimator is $\bar X$; a few other possible estimators are the sample median, $M_n = \mathrm{median}\{X_1, \ldots, X_n\}$, or $\frac{X_2 + X_4}{2}$, or even seemingly poor estimators, like $100\bar X$.
Suppose it was known to us that $\mu$ must be nonnegative. Then, the set of possible values of $\mu$ is $\Theta = [0, \infty)$. However, the instinctive point estimator $\bar X$ can take any real value; it takes values in the set $A = (-\infty, \infty)$. You will notice again that $A$ is not the same as $\Theta$ in this case. In general, $A$ and $\Theta$ can be different sets.

If we want instead to estimate $\sigma^2$, the population variance, a first thought is to use the sample variance $s^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar X)^2$ (dividing by $n-1$ rather than $n$ seems a little odd at first glance, but has a mathematical reason, which will be clear soon). In this example, the parameter $\sigma^2$ and the estimator $s^2$ both take values in $(0, \infty)$. But if we knew that $\sigma^2 \leq 100$, say, then once again $\Theta$ and $A$ would be different. And, as always, there are many other possible estimators of $\sigma^2$, for example, $\frac{1}{n-1}\sum_{i=1}^n (X_i - M_n)^2$, where $M_n$ is the sample median.

Here is a formal definition of a point estimator.

Definition 5.1. Let the vector of sample observations $X^{(n)} = (X_1, \ldots, X_n)$ have a joint distribution $P = P_n$ and let $\theta = h(P)$, taking values in the parameter space $\Theta \subseteq \mathbb{R}^p$, be a parameter of the distribution $P$. Let $T(X_1, \ldots, X_n)$, taking values in a specified set $A \subseteq \mathbb{R}^p$, be a general statistic. Then, any such $T(X_1, \ldots, X_n)$ is called a point estimator of $\theta$. The set $\Theta$ is called the parameter space, and the set $A$ is called the statistician's action space.

If specific observed sample data $X_1 = x_1, \ldots, X_n = x_n$ are available, then the particular value $T(x_1, \ldots, x_n)$ is called an estimate of $\theta$. Thus, the word estimator applies to the general function $T(X_1, \ldots, X_n)$, and the word estimate applies to the value $T(x_1, \ldots, x_n)$ for specific data. In this text, we use estimator and estimate synonymously. A standard general notation for a generic estimate of a parameter $\theta$ is $\hat\theta = \hat\theta(X_1, \ldots, X_n)$.

5.0.1 Evaluating an Estimator and MSE

Except in rare cases, the estimate $\hat\theta$ will not be exactly equal to the true value of the unknown parameter $\theta$. It seems reasonable that we like an estimate $\hat\theta$ which generally comes quite close to the true value of $\theta$, and dislike an estimate $\hat\theta$ which generally misses the true value of $\theta$ by a large amount. How are we going to make this precise and quantifiable? The general approach to this question involves the specification of a loss and a risk function, which we will introduce in a later section. For now, we describe a very common and hugely popular criterion for evaluating a point estimator, the mean squared error.

Definition 5.2. Let $\theta$ be a real valued parameter, and $\hat\theta$ an estimate of $\theta$. The mean squared error (MSE) of $\hat\theta$ is defined as

$$\mathrm{MSE} = \mathrm{MSE}(\theta, \hat\theta) = E_\theta[(\hat\theta - \theta)^2], \quad \theta \in \Theta.$$

If the parameter $\theta$ is $p$-dimensional, $\theta = (\theta_1, \ldots, \theta_p)$, and $\hat\theta = (\hat\theta_1, \ldots, \hat\theta_p)$, then the mean squared error of $\hat\theta$ is defined as

$$\mathrm{MSE} = \mathrm{MSE}(\theta, \hat\theta) = E_\theta[\|\hat\theta - \theta\|^2] = \sum_{i=1}^p E_\theta[(\hat\theta_i - \theta_i)^2], \quad \theta \in \Theta.$$

5.0.2 Bias and Variance

It turns out that the mean squared error of an estimator has one component to do with the systematic error of the estimator and a second component to do with the random error of the estimator.
If an estimator $\hat\theta$ routinely overestimated the parameter $\theta$, then usually we would have $\hat\theta - \theta > 0$; we think of this as the estimator making a systematic error. Systematic errors can also be made by routinely underestimating $\theta$. We quantify the systematic error of an estimator by looking at $E_\theta(\hat\theta - \theta) = E_\theta(\hat\theta) - \theta$; this is called the bias of the estimator, and we denote it as $b(\theta)$. If the bias of an estimator $\hat\theta$ is always zero, $b(\theta) = 0$ for all $\theta$, then the estimator $\hat\theta$ is called unbiased.

On the other hand, an estimator $\hat\theta$ may not make much systematic error, but can still be unreliable because, from one dataset to another, its accuracy may differ wildly. This is called random or fluctuation error, and we often quantify it by looking at the variance of the estimator, $\mathrm{Var}_\theta(\hat\theta)$. A pleasant property of the MSE of an estimator is that it always breaks neatly into two components, one involving the bias and the other involving the variance. You have to try to keep both of them small; large biases and large variances are both red signals. Here is a bias-variance decomposition result.

Theorem 5.1. Let $\theta$ be a real valued parameter and $\hat\theta$ an estimator with a finite variance under all $\theta$. Then,

$$\mathrm{MSE}(\theta, \hat\theta) = \mathrm{Var}_\theta(\hat\theta) + b^2(\theta), \quad \theta \in \Theta.$$

Proof: To prove this simple theorem, we recall the elementary probability fact that for any random variable $U$ with a finite variance, $E(U^2) = \mathrm{Var}(U) + [E(U)]^2$. Identifying $U$ with $\hat\theta - \theta$, and noting that subtracting the constant $\theta$ does not change the variance,

$$\mathrm{MSE}(\theta, \hat\theta) = \mathrm{Var}_\theta(\hat\theta - \theta) + [E_\theta(\hat\theta - \theta)]^2 = \mathrm{Var}_\theta(\hat\theta) + b^2(\theta).$$

5.0.3 Computing and Graphing MSE

We will now see one introductory example.

Example 5.2. (Estimating a Normal Mean and Variance) Suppose we have sample observations $X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} N(\mu, \sigma^2)$, where $-\infty < \mu < \infty$, $\sigma > 0$ are unknown parameters. The parameter is two dimensional, $\theta = (\mu, \sigma)$. First consider estimation of $\mu$, and as an example, consider these two estimates: $\bar X$ and $\frac{n}{n+1}\bar X$. We will calculate the MSE of each estimate, and make some comments.

Since $E(\bar X) = \frac{1}{n}\sum_{i=1}^n E(X_i) = \frac{1}{n}(n\mu) = \mu$ for any $\mu$ and $\sigma$, the bias of $\bar X$ for estimating $\mu$ is zero: $E_\theta(\bar X) - \mu = \mu - \mu = 0$. In other words, $\bar X$ is an unbiased estimate of $\mu$. Therefore, the MSE of $\bar X$ is just its variance,

$$E[(\bar X - \mu)^2] = \mathrm{Var}(\bar X) = \frac{\mathrm{Var}(X_1)}{n} = \frac{\sigma^2}{n}.$$

Notice that the MSE of $\bar X$ does not depend on $\mu$; it only depends on $\sigma^2$.
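As a numerical sanity check on Theorem 5.1, the Python sketch below simulates normal samples and compares the empirical MSE of an estimator with the sum of its empirical variance and squared bias. The parameter values, the number of replications, and the shrunken estimator $0.9\bar X$ are illustrative assumptions and do not come from the text.

```python
import random
import statistics


def simulate_estimates(estimator, mu, sigma, n, reps, seed=0):
    """Apply `estimator` to `reps` independent samples of size n from N(mu, sigma^2)."""
    rng = random.Random(seed)
    return [estimator([rng.gauss(mu, sigma) for _ in range(n)]) for _ in range(reps)]


def empirical_mse_parts(estimates, theta):
    """Return (mse, variance, bias) computed from simulated estimates of theta."""
    mse = statistics.fmean([(t - theta) ** 2 for t in estimates])
    variance = statistics.pvariance(estimates)  # divides by reps, not reps - 1
    bias = statistics.fmean(estimates) - theta
    return mse, variance, bias


def sample_mean(xs):
    return statistics.fmean(xs)


def shrunk_mean(xs):
    # Hypothetical biased estimator, used only to make the bias term visible.
    return 0.9 * statistics.fmean(xs)


if __name__ == "__main__":
    mu, sigma, n = 2.0, 1.0, 10  # illustrative values (assumptions)
    for name, est in [("sample mean", sample_mean), ("0.9 * sample mean", shrunk_mean)]:
        draws = simulate_estimates(est, mu, sigma, n, reps=5000)
        mse, var, bias = empirical_mse_parts(draws, mu)
        print(f"{name}: MSE={mse:.4f}  Var+bias^2={var + bias ** 2:.4f}  bias={bias:.4f}")
```

Because the variance here is computed with divisor equal to the number of replications, the empirical MSE and the empirical variance plus squared bias agree up to rounding, mirroring the algebra in the proof; only the individual values of bias and variance are subject to simulation error.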
Now, we will find the MSE of the other estimate, $\frac{n}{n+1}\bar X$. By Theorem 5.1, the MSE of $\frac{n}{n+1}\bar X$ is

$$E\left[\left(\frac{n}{n+1}\bar X - \mu\right)^2\right] = \mathrm{Var}\left(\frac{n}{n+1}\bar X\right) + \left[E\left(\frac{n}{n+1}\bar X - \mu\right)\right]^2$$

$$= \left(\frac{n}{n+1}\right)^2 \mathrm{Var}(\bar X) + \left[\frac{n}{n+1}\mu - \mu\right]^2 = \left(\frac{n}{n+1}\right)^2 \frac{\sigma^2}{n} + \left(\frac{1}{n+1}\right)^2 \mu^2 = \frac{\mu^2}{(n+1)^2} + \frac{n\sigma^2}{(n+1)^2}.$$

Notice that the MSE of this estimate depends on both $\mu$ and $\sigma^2$; $\frac{\mu^2}{(n+1)^2}$ is the contribution of the bias component to the MSE, and $\frac{n\sigma^2}{(n+1)^2}$ is the contribution of the variance component. For purposes of comparison, we plot the MSE of both estimates, taking $n = 10$ and $\sigma = 1$; the MSE of $\bar X$ is constant in $\mu$, and the MSE of $\frac{n}{n+1}\bar X$ is a quadratic in $\mu$. For $\mu$ near zero, $\frac{n}{n+1}\bar X$ has a smaller MSE, but otherwise, $\bar X$ has a smaller MSE. The graphs of the two mean squared errors cross. This is quite often the case in point estimation.

[Figure: Comparison of MSE of Estimates in Example 5.2, plotted against $\mu$.]

Next, we consider estimation of $\sigma^2$. We use as our example the estimates $s^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar X)^2$ and $\frac{1}{n}\sum_{i=1}^n (X_i - \bar X)^2$. First, we will prove that $s^2$, i.e., the estimator that divides by $n-1$, is an unbiased estimator of $\sigma^2$. Here is the proof. First, note the algebraic identity

$$\sum_{i=1}^n (X_i - \bar X)^2 = \sum_{i=1}^n (X_i^2 - 2\bar X X_i + \bar X^2) = \sum_{i=1}^n X_i^2 - 2\bar X \sum_{i=1}^n X_i + n\bar X^2 = \sum_{i=1}^n X_i^2 - 2n\bar X^2 + n\bar X^2 = \sum_{i=1}^n X_i^2 - n\bar X^2.$$

Therefore,

$$E\left[\sum_{i=1}^n (X_i - \bar X)^2\right] = nE(X_1^2) - nE[\bar X^2] = n(\sigma^2 + \mu^2) - n\left(\frac{\sigma^2}{n} + \mu^2\right) = (n-1)\sigma^2,$$

which gives $E(s^2) = \frac{1}{n-1}(n-1)\sigma^2 = \sigma^2$. Therefore, the MSE of $s^2$, by Theorem 5.1, is the same as its variance,

$$E[(s^2 - \sigma^2)^2] = \mathrm{Var}(s^2) = \mathrm{Var}\left(\frac{1}{n-1}\sum_{i=1}^n (X_i - \bar X)^2\right) = \frac{\sigma^4}{(n-1)^2}\mathrm{Var}\left(\frac{\sum_{i=1}^n (X_i - \bar X)^2}{\sigma^2}\right) = \frac{\sigma^4}{(n-1)^2}\, 2(n-1) = \frac{2\sigma^4}{n-1},$$

by using the fact that if $X_1, \ldots, X_n$ are iid $N(\mu, \sigma^2)$, then $\frac{\sum_{i=1}^n (X_i - \bar X)^2}{\sigma^2} \sim \chi^2(n-1)$, and that the variance of a $\chi^2$ distribution is twice its degrees of freedom.

Next, we will find the MSE of the second estimate $\frac{1}{n}\sum_{i=1}^n (X_i - \bar X)^2$. This estimate does have a bias! Its bias is

$$b(\theta) = E\left[\frac{1}{n}\sum_{i=1}^n (X_i - \bar X)^2\right] - \sigma^2 = \frac{n-1}{n}\sigma^2 - \sigma^2 = -\frac{\sigma^2}{n}.$$

Also, its variance is

$$\mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^n (X_i - \bar X)^2\right) = \mathrm{Var}\left(\frac{n-1}{n}\cdot\frac{1}{n-1}\sum_{i=1}^n (X_i - \bar X)^2\right) = \frac{(n-1)^2}{n^2}\cdot\frac{2\sigma^4}{n-1} = \frac{2(n-1)\sigma^4}{n^2}.$$

Therefore, by Theorem 5.1, the MSE of our second estimate $\frac{1}{n}\sum_{i=1}^n (X_i - \bar X)^2$ is

$$E\left[\left(\frac{1}{n}\sum_{i=1}^n (X_i - \bar X)^2 - \sigma^2\right)^2\right] = \frac{2(n-1)\sigma^4}{n^2} + \frac{\sigma^4}{n^2} = \frac{(2n-1)\sigma^4}{n^2}.$$

You may verify that this is always smaller than $\frac{2\sigma^4}{n-1}$, the MSE of $s^2$. So, in this example, the biased estimator $\frac{1}{n}\sum_{i=1}^n (X_i - \bar X)^2$ always has a smaller MSE than the unbiased estimator $s^2$. This is a special example; do not expect this to hold automatically in another example.

5.0.4 Loss and Risk Functions

Use of MSE is a special case of a more general formulation of estimation as a decision theory problem. In this formulation, we view the point estimation problem as a two person game between player I, who chooses the true value of $\theta$ from $\Theta$, and player II, who chooses a point estimate, possibly after observing some sample data. Player I is called nature, and player II the statistician. In accordance with our previous notation, we allow the statistician to choose the value of his estimate from a well defined set $A$, called the action space. There is another component necessary to complete the formulation. After player I chooses $\theta$ from $\Theta$, and player II chooses a specific estimate for $\theta$, say $\hat\theta(x_1, \ldots, x_n) = a$, player II has to pay player I a penalty for making a possibly incorrect guess at the true value of $\theta$. This penalty is called the loss function, and is denoted as $L(\theta, a)$, $\theta \in \Theta$, $a \in A$. In principle, the loss function can be almost any function satisfying some natural constraints; but in the practice of statistics, we tend to use just one or two most of the time, because they are easier to work with.

We note that the realized value of an estimate $\hat\theta(x_1, \ldots, x_n)$ depends on the particular data $x_1, \ldots, x_n$ actually obtained. In one sampling experiment, our data may be such that the estimate $\hat\theta(x_1, \ldots, x_n)$ comes quite close to the true value of $\theta$, while in another independent sampling experiment, the data may be such that the estimate is not so good. We try to look at the average performance of $\hat\theta(X_1, \ldots, X_n)$ as an estimating procedure, or decision rule. This motivates the definition of a risk.

One comment is now in order. Since many other inference problems besides point estimation can be put under the general formulation of decision theory, we prefer to use a single notation for the statistician's decision procedure. To be consistent with much of the statistical literature, we use the notation $\delta(X_1, \ldots, X_n)$. In the point estimation context, $\delta(X_1, \ldots, X_n)$ will mean an estimator; in another context, it may mean something else, like a test.

Definition 5.3. Let $\Theta$ denote nature's parameter space, $A$ the statistician's action space, and $L(\theta, a)$ a specific loss function. Then, the risk function of a decision procedure $\delta(X_1, \ldots, X_n)$ is the average loss incurred by $\delta$:

$$R(\theta, \delta) = E_\theta[L(\theta, \delta(X_1, \ldots, X_n))].$$

In the above, the expectation $E_\theta$ means expectation with respect to the joint distribution of $(X_1, \ldots, X_n)$ under $\theta$.

Example 5.3. (Some Possible Loss Functions) We usually impose the condition that $L(\theta, a) = 0$ if $a = \theta$ (no penalty for perfect work). We also often take $L(\theta, a)$ to be a monotone nondecreasing function of the physical distance between $\theta$ and $a$.
So, if $\theta$ and $a$ are real valued, we often let $L(\theta, a) = W(|\theta - a|)$ for some function $W$ with $W(0) = 0$ and $W(x)$ monotone nondecreasing on $[0, \infty)$. Some common choices are:

Squared error loss: $L(\theta, a) = (a - \theta)^2$;
Absolute error loss: $L(\theta, a) = |a - \theta|$;
Zero-K loss: $L(\theta, a) = K I_{|a - \theta| > c}$;
Asymmetric loss: $L(\theta, a) = K_1$ if $a < \theta$, $= 0$ if $a = \theta$, $= K_2$ if $a > \theta$;
Power loss: $L(\theta, a) = |a - \theta|^\alpha$, $\alpha > 0$;
Weighted squared error loss: $L(\theta, a) = w(\theta)(a - \theta)^2$.

In a practical problem, writing down a loss function that correctly reflects the consequences of various decisions under every possible circumstance is difficult, and may even be impossible. You should treat decision theory as a frequently useful general guide to the choice of your procedure. By far, in point estimation, squared error loss is the most common, because it is the easiest to work with, and gives a reasonable general guide as to which procedures are good and which are bad. Note the important fact that the risk function under squared error loss is the same as the MSE.

5.0.5 Optimality and Principle of Low Risk

It seems only natural that once we have specified a loss function, we should prefer decision procedures $\delta$ that have low risk. Lower risk is considered such a good property that procedures whose risk functions can always be beaten by some other alternative procedure are called inadmissible. In fact, it would be best if we could find one decision procedure $\delta_0$ which has the smallest risk among all possible decision procedures at every possible value of $\theta$. Unfortunately, this is rarely possible; in general, there are no uniformly best decision procedures. Risk functions of decent decision procedures tend to cross; sometimes one is better, and at other times the other procedure is better. In the next few sections, we will give an elementary introduction to popular methods for choosing among decision procedures whose risk functions cross. But, right now, let us see one more concrete risk calculation.

Example 5.4. (Estimation of Binomial p). Suppose $n$ observations $X_1, \ldots, X_n$ are taken from a Bernoulli distribution with an unknown parameter $p$, $0 < p < 1$. Suppose the statistician's action space is the closed interval $[0, 1]$, and that we use the squared error loss function $(a - p)^2$. Let $T = \sum_{i=1}^n X_i$ be the total number of successes.
We consider three decision procedures (estimators):

$$\delta_1(X_1, \ldots, X_n) = \frac{T}{n}; \quad \delta_2(X_1, \ldots, X_n) = \frac{T+1}{n+2}; \quad \delta_3(X_1, \ldots, X_n) = \frac{1}{2}.$$

[Figure: Risk Functions of Three Estimators in Example 5.4, plotted against $p$.]

Note that all three estimators are of the form $a + bT$ for suitable $a, b$. The bias of a general estimator of this form is $b(p) = a + bnp - p = a + (nb - 1)p$ (here $b(p)$ denotes the bias function, not the coefficient $b$), and so the risk function of a general estimator of this form is

$$R(p, a + bT) = E[(a + bT - p)^2] = [b(p)]^2 + \mathrm{Var}(a + bT) = (a + (nb - 1)p)^2 + nb^2 p(1-p) = a^2 + (nb^2 + 2nab - 2a)p + (n^2b^2 - nb^2 - 2nb + 1)p^2. \quad (1)$$

$\delta_1$ corresponds to $a = 0$, $b = \frac{1}{n}$; $\delta_2$ to $a = \frac{1}{n+2}$, $b = \frac{1}{n+2}$; and $\delta_3$ to $a = \frac{1}{2}$, $b = 0$. Substitution into (1) yields:

$$R(p, \delta_1) = \frac{p(1-p)}{n}; \quad R(p, \delta_2) = \frac{1 + (n-4)p(1-p)}{(n+2)^2}; \quad R(p, \delta_3) = \frac{1}{4} - p(1-p).$$

These three risk functions are plotted above. Evidently, none of them is uniformly the best among the three estimators. Near $p = 0, 1$, $\delta_1$ is the best; near $p = \frac{1}{2}$, $\delta_3$ is the best; and in two subintervals away from $0, 1$ and away from $\frac{1}{2}$, $\delta_2$ is the best. We must somehow collapse each of these risk functions into a single representative number, so that we can compare numbers rather than functions. We present some traditional ideas on how to collapse risk functions to a single representative number in our next two sections.
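The crossing of the three risk functions can be reproduced with a few lines of Python. The sketch below evaluates the risk of a general estimator $a + bT$ as squared bias plus variance, and applies it to $\delta_1 = T/n$, $\delta_2 = (T+1)/(n+2)$ and $\delta_3 = 1/2$. The sample size $n = 10$ and the three evaluation points are illustrative assumptions; the function names are made up for this sketch.

```python
def risk_linear(p, n, a, b):
    """Squared error risk of a + b*T when T ~ Binomial(n, p): bias^2 + variance."""
    bias = a + (n * b - 1.0) * p
    variance = b * b * n * p * (1.0 - p)
    return bias ** 2 + variance


def risks(p, n):
    """Risks of delta_1 = T/n, delta_2 = (T+1)/(n+2), delta_3 = 1/2 at a given p."""
    return (
        risk_linear(p, n, 0.0, 1.0 / n),
        risk_linear(p, n, 1.0 / (n + 2), 1.0 / (n + 2)),
        risk_linear(p, n, 0.5, 0.0),
    )


def best(p, n):
    """Index (1, 2 or 3) of the estimator with the smallest risk at p."""
    r = risks(p, n)
    return min(range(3), key=lambda i: r[i]) + 1


if __name__ == "__main__":
    n = 10  # illustrative sample size (assumption)
    for p in (0.05, 0.25, 0.5):
        r1, r2, r3 = risks(p, n)
        print(f"p={p:.2f}  R1={r1:.5f}  R2={r2:.5f}  R3={r3:.5f}  smallest: delta_{best(p, n)}")
```

With these choices, a different estimator has the smallest risk at each of the three values of $p$, which is exactly the situation in which comparing whole risk functions gives no clear winner.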

5.0.6 Prior Distributions and Bayes Procedures: First Examples

An intuitively attractive method to collapse a risk function into a single number is to take an average of the risk function with respect to a weight function, say $\pi(\theta)$. For example, in our binomial example above, the parameter is $p$, and we average the risk function of any procedure $\delta$ by calculating $\int_0^1 R(p, \delta)\pi(p)\,dp$ for a specific weight function $\pi(p)$. Usually, we choose $\pi(p)$ to be a probability density function on $[0, 1]$, the parameter space for $p$. Such weight functions are called prior distributions for the parameter, and the average risk $r(\pi, \delta) = \int_0^1 R(p, \delta)\pi(p)\,dp$ is called the Bayes risk of the procedure $\delta$ with respect to the prior $\pi$. The idea is to prefer a procedure $\delta_2$ to $\delta_1$ if $r(\pi, \delta_2) < r(\pi, \delta_1)$. Since we are no longer comparing two functions, but only two numbers, it is usually the case that one of the two numbers will be strictly smaller than the other, and that will tell us which procedure is to be preferred, provided that we are sold on the specific prior distribution $\pi$ that we chose. This approach is known as Bayesian decision theory and is a very important component of statistical inference. The Bayes approach will often give you useful ideas about which procedures may be reasonable in a given problem. We now give the formal definitions.

Definition 5.4. Let $G$ be a specific probability distribution on nature's parameter space $\Theta$; $G$ is called a prior distribution, or simply a prior, for the parameter $\theta$.

Definition 5.5. Let $R(\theta, \delta)$ denote the risk function of a general decision procedure $\delta$ under a specific loss function $L(\theta, a)$. The Bayes risk of $\delta$ with respect to the prior $G$ is defined as

$$r(G, \delta) = E_G[R(\theta, \delta)],$$

where the notation $E_G$ means an expectation obtained by treating the parameter $\theta$ as a random variable with the distribution $G$. If $\theta$ is a continuous parameter and the prior $G$ has a density function $\pi$, we have

$$r(\pi, \delta) = \int_\Theta R(\theta, \delta)\pi(\theta)\,d\theta.$$

Remark: Although we do not look kindly upon such procedures $\delta$, the Bayes risk $r(G, \delta)$ can be infinite.
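When the integral defining the Bayes risk is awkward to do by hand, it can be approximated numerically. The Python sketch below uses a simple midpoint rule on $[0, 1]$ and applies it to the three binomial risk functions $R(p, \delta_1) = p(1-p)/n$, $R(p, \delta_2) = (1 + (n-4)p(1-p))/(n+2)^2$ and $R(p, \delta_3) = 1/4 - p(1-p)$ under a uniform prior. The value $n = 10$, the grid size, and all function names are assumptions made for illustration.

```python
def bayes_risk(risk, prior_density, grid=20000):
    """Midpoint-rule approximation of the integral of risk(p) * prior_density(p) over [0, 1]."""
    h = 1.0 / grid
    total = 0.0
    for k in range(grid):
        p = (k + 0.5) * h
        total += risk(p) * prior_density(p)
    return h * total


N = 10  # illustrative sample size (assumption)


def risk_d1(p):
    return p * (1.0 - p) / N


def risk_d2(p):
    return (1.0 + (N - 4) * p * (1.0 - p)) / (N + 2) ** 2


def risk_d3(p):
    return 0.25 - p * (1.0 - p)


def uniform_prior(p):
    return 1.0  # density of U[0, 1]


if __name__ == "__main__":
    for name, risk in [("delta_1", risk_d1), ("delta_2", risk_d2), ("delta_3", risk_d3)]:
        print(name, round(bayes_risk(risk, uniform_prior), 6))
```

Any other density on $[0, 1]$ can be passed in place of `uniform_prior`; the ranking of the procedures is specific to the prior that is supplied.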
Since lower risks are preferred in general, we would also prefer lower Bayes risk. This raises an immediate natural question: which particular $\delta$ makes $r(G, \delta)$ the smallest possible among all decision procedures? This motivates an important definition.

Definition 5.6. Given a loss function $L$ and a prior distribution $G$, any procedure $\delta_G$ such that

$$r(G, \delta_G) = \inf_\delta r(G, \delta)$$

is called a Bayes procedure under $L$ and $G$.

Remark: A Bayes procedure need not always exist, and when it exists, it need not in general be unique. But, under commonly used loss functions, a Bayes procedure does exist, and it is even unique. This is detailed later.

Let us now return to the binomial example to illustrate these new concepts.

Example 5.5. (Estimation of Binomial p). As an example, take the specific prior distribution $G = U[0, 1]$ with $\pi(p) = I_{0 \leq p \leq 1}$. Then, the Bayes risks of the three procedures $\delta_1, \delta_2, \delta_3$ are

$$r(\pi, \delta_1) = \int_0^1 \frac{p(1-p)}{n}\,dp = \frac{1}{6n};$$

$$r(\pi, \delta_2) = \int_0^1 \frac{1 + (n-4)p(1-p)}{(n+2)^2}\,dp = \frac{1}{6(n+2)};$$

$$r(\pi, \delta_3) = \int_0^1 \left[\frac{1}{4} - p(1-p)\right]dp = \frac{1}{12}.$$

We now see, on easy algebra, that if $n > 2$, then an ordering among the three procedures $\delta_1, \delta_2, \delta_3$ has emerged:

$$r(\pi, \delta_2) < r(\pi, \delta_1) < r(\pi, \delta_3).$$

Therefore, among these three procedures, we should prefer $\delta_2$ over the other two if $n > 2$ and if we use $G = U[0, 1]$ as the prior. Compare this neat conclusion with the difficulty of reaching a conclusion by comparing their risk functions, which cross.

Remark: It may be shown that $\delta_2$ is in fact the unique Bayes procedure in the binomial $p$ example, if loss is squared error and $G = U[0, 1]$. It will not be proved in this chapter. However, if you change the choice of the prior $G$ to some other distribution on $[0, 1]$, then the Bayes procedure will no longer be $\delta_2$. It will be some other procedure, depending on exactly what the prior distribution $G$ is; Bayes procedures are treated in detail in a later chapter.

5.0.7 Maximum Risk and Minimaxity

A second common method to collapse a risk function $R(\theta, \delta)$ to a single number is to look at the maximum value of the risk function over all possible values of the parameter $\theta$. Precisely, consider the number

$$\bar R(\delta) = \max_{\theta \in \Theta} R(\theta, \delta).$$

In this approach, you will prefer $\delta_2$ to $\delta_1$ if $\bar R(\delta_2) < \bar R(\delta_1)$. According to the principle of minimaxity, we should ultimately use that procedure $\delta_0$ that results in the minimum possible value of $\bar R(\delta)$ over all possible procedures $\delta$. That is, you minimize over $\delta$ the maximum value of $R(\theta, \delta)$; hence the name minimax.

Consider a layman's example to understand the concept of minimaxity. Suppose you have to travel from a city A to a small town B which is about 500 miles away. You can either fly or drive. However, only single engine small planes fly to town B, and you are worried about their safety. On the other hand, driving to B will take you 10 hours, and it will cost you valuable time, maybe some income and opportunities for doing other pleasurable things, and it would be rather tiring. But you do not think that driving to B can cause your death. The worst possible risk of driving to B is lower than the worst possible risk of flying to B, and according to the minimaxity principle, you should drive.

If we present minimaxity in such a light, it comes across as the philosophy of the timid. But, in many problems of statistical inference, the principle of minimaxity has ultimately resulted in a procedure that is reasonable, or that you would probably have used anyway for other reasons. It can also serve as an idealistic benchmark against which you can evaluate other procedures. Moreover, although it is not obvious, the approach of minimaxity and the Bayes approach are sometimes connected. A minimax procedure is sometimes a Bayes procedure with respect to a suitable prior $G$, and this sort of connection gives some additional credibility to minimax procedures. Brown (1994, 2000) are two very lucidly written articles that explain the role of minimaxity in the evolution of statistical inference.

Let us now revisit our binomial $p$ example for illustration.

Example 5.6. (Estimation of Binomial p). The plot of the risk functions of the three procedures $\delta_1, \delta_2, \delta_3$ reveals what their maximum risks are. The maximum risk of $\delta_1$ and $\delta_2$ is attained at $p = \frac{1}{2}$ (for $\delta_2$, this requires $n > 4$, so that the coefficient $n - 4$ of $p(1-p)$ is positive), and the maximum risk of $\delta_3$ is at the boundary points $p = 0, 1$ (strictly speaking, the maximum is a supremum, and the supremum is not attained at any $p$ in the open interval $(0, 1)$).
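Precisely,

$$\bar R(\delta_1) = R\left(\tfrac{1}{2}, \delta_1\right) = \frac{1}{4n}; \quad \bar R(\delta_2) = R\left(\tfrac{1}{2}, \delta_2\right) = \frac{n}{4(n+2)^2}; \quad \bar R(\delta_3) = \lim_{p \to 0, 1} R(p, \delta_3) = \frac{1}{4}.$$

We have the ordering

$$\bar R(\delta_2) < \bar R(\delta_1) < \bar R(\delta_3),$$

and therefore, among the three procedures $\delta_1, \delta_2, \delta_3$, according to the principle of minimizing the maximum risk, we should prefer $\delta_2$.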
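The maximum risks can also be approximated by brute force, which is a useful fallback when a risk function has no obvious maximizer. The Python sketch below scans an evenly spaced grid on the closed interval $[0, 1]$ for the three binomial risk functions $p(1-p)/n$, $(1 + (n-4)p(1-p))/(n+2)^2$ and $1/4 - p(1-p)$. The value $n = 10$ (chosen larger than 4 so that the second risk peaks at $p = 1/2$), the grid size, and the function names are assumptions made for illustration.

```python
def max_risk(risk, grid=10001):
    """Largest value of risk(p) over an evenly spaced grid on [0, 1], endpoints included."""
    return max(risk(k / (grid - 1)) for k in range(grid))


N = 10  # illustrative sample size (assumption); N > 4 so delta_2's risk peaks at p = 1/2


def risk_d1(p):
    return p * (1.0 - p) / N


def risk_d2(p):
    return (1.0 + (N - 4) * p * (1.0 - p)) / (N + 2) ** 2


def risk_d3(p):
    return 0.25 - p * (1.0 - p)


if __name__ == "__main__":
    for name, risk in [("delta_1", risk_d1), ("delta_2", risk_d2), ("delta_3", risk_d3)]:
        print(name, round(max_risk(risk), 6))
```

The grid includes the endpoints $p = 0$ and $p = 1$, so for the constant estimator the code reports the boundary value $1/4$, which on the open parameter space is a supremum rather than an attained maximum.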

Remark: It may be shown that the overall minimax procedure among all possible procedures is a rather obscure one; it is not $\delta_2$. You will see it in Chapter 7.

Note that the principle of minimaxity, in its formulation, does not call for specification of a prior. Some statisticians consider this a plus. However, as we remarked earlier, in the end a minimax procedure may turn out to be a Bayes procedure with respect to some prior. Very often, this prior is difficult or impossible to guess. Minimax procedures can be extremely difficult to find. It is also important to remember that although minimaxity does not involve the assumption of a specific prior, it absolutely does require the assumption of a specific loss function. Just as an example, if in our binomial $p$ problem we use absolute error loss $|p - a|$ instead of squared error loss $(p - a)^2$, the minimax procedure will change, and even worse, no one has ever worked it out!

5.0.8 Assumptions Matter

Statistical inference always requires some assumptions. You have to assume something. How much we assume depends on the problem, on the approach, and perhaps on some other things, like how much data we have to test our assumptions. For example, for finding a minimax procedure, we have to make assumptions about the model, meaning the distribution of the underlying random variable (normal or Cauchy, for instance), and we have to make an assumption of a specific loss function. In the Bayes approach, too, we will need to make both of these assumptions and an extra assumption of a specific prior $G$. It is important that we at least know what assumptions we have made and, when possible, make some attempt to validate those assumptions by using the available data. Model validation is not an easy task, especially when we have made assumptions which are hard to verify without a lot of data. Statisticians do not agree on whether prior distributions on parameters can, or even should, be validated from one's data. There are very few formal and widely accepted methods for verifying a chosen loss function.
The original idea was to elicit a problem specific loss function by having a personal meeting with the client. Such efforts have generally been regarded as impractical, or have not resulted in clear success. In addition, if loss functions are always problem specific, then we eliminate the premise of a structured theory that will apply simultaneously to many problems. So, when it comes to the assumption of a loss function, we have generally sided with convenience; it is the primary reason that squared error loss and MSE are so dear to us as a profession. We recommend Dawid (1982), Seidenfeld (1985), and Kadane and Wolfson (1998) for thoughtful exposition of these issues.

Robustness as a Theme

In statistical inference, the word robust applies to the somewhat nebulous property of insensitivity of a particular procedure to the assumptions that were made. The idea of robust inferences seems to have been first mentioned in those words in Box (1953). Here is a simple example. You will see in Chapters 6 and 8 that for estimating the location parameter $\mu$ of a univariate normal distribution, the sample mean $\bar X$ is almost universally regarded as the estimator that one should use. Now, we have made some assumptions here, namely, that our sample observations $X_1, \ldots, X_n$ are iid observations from some normal distribution, $N(\mu, \sigma^2)$. Suppose now that we did our modelling a little too carelessly. Suppose the observations are iid, but from a distribution with a much heavier tail than the normal; e.g., suppose a much better model would have been that $X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} C(\mu, \sigma)$, a Cauchy distribution, for some $\mu, \sigma$. Then, as it turns out, the sample mean $\bar X$ is an extremely poor estimate of $\mu$. You can therefore argue that $\bar X$ is not robust to the assumption of normality, when it comes to our assumption about the tail. We can similarly ask and investigate whether $\bar X$ is robust to the assumption of independence of our sample observations $X_1, \ldots, X_n$.

Can we do anything about it? The answer depends on our ambition. No one can build a procedure which is agreeably robust to every assumption that one has made. However, if we isolate a few assumptions, perhaps only those that we are most unsure about, then it may be possible to construct procedures which are robust against those specific assumptions. For example, if we assume symmetry, but want to be robust against the tail of our distribution, then we can do something about it. As an illustration, if all we wish is to be robust against the tail, then to estimate the location parameter $\mu$ of a density on the real line, we would be much better off using the sample median as an estimator of $\mu$, instead of the sample mean. But if, in addition to the tail, we want to be robust against the assumption that our density is symmetric around the parameter $\mu$, then we do not have any reasonable robust solutions to that problem. We just cannot be simultaneously robust against all our assumptions.
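The contrast between the sample mean and the sample median under heavy tails is easy to see in a small simulation. The Python sketch below draws standard Cauchy samples by the inverse CDF method and summarizes each estimator by its median absolute error over many replications; median absolute error is used, as an assumption of this sketch, because the sample mean of Cauchy data has no finite mean squared error. The sample size $n = 25$, the number of replications, and the function names are illustrative choices.

```python
import math
import random
import statistics


def cauchy_sample(rng, n, mu=0.0, sigma=1.0):
    """n draws from C(mu, sigma) via the inverse CDF: mu + sigma * tan(pi * (U - 1/2))."""
    return [mu + sigma * math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]


def typical_error(estimator, mu, n, reps, seed=0):
    """Median absolute error of `estimator` over `reps` simulated Cauchy samples."""
    rng = random.Random(seed)
    errors = [abs(estimator(cauchy_sample(rng, n, mu)) - mu) for _ in range(reps)]
    return statistics.median(errors)


if __name__ == "__main__":
    mu, n, reps = 0.0, 25, 2000  # illustrative settings (assumptions)
    print("sample mean  :", round(typical_error(statistics.fmean, mu, n, reps), 3))
    print("sample median:", round(typical_error(statistics.median, mu, n, reps), 3))
```

In runs of this kind the typical error of the sample mean does not shrink as $n$ grows, since the mean of iid Cauchy observations is itself Cauchy with the same scale, while the typical error of the sample median does shrink.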
It is also important to understand that robustness is a desirable goal, but not necessarily the first goal. In setting robustness as our first goal, we may end up selecting a procedure which is so focused on caution that it does only a mediocre job in all situations, and not an excellent job in any situation. Robustness is inherently a murky and elusive concept. It is sometimes obtainable to a limited extent. But, still, we must always pay extremely close attention to all of our assumptions. In doing honest inference, it is important that we know and understand all the assumptions we have made. For example, in a point estimation problem, we should ask ourselves:

1. Have we assumed that our sample observations are iid from some $F$?
2. Have we assumed that $F$ belongs to some specific parametric family, such as Poisson or normal?
3. What is our assumed loss function?
4. What is our assumed prior distribution?
5. Which of these assumptions are seriously questionable?
6. Do we know how to examine if our proposed estimator is reasonably robust to those assumptions?
7. If we conclude that it is not reasonably robust, do we know what to do next?

We will treat robust estimation to some extent in Chapter 7. Some additional references on robust inference are Huber (1981), Hampel et al. (1986), Portnoy and He (2000), and Stigler (2010).

5.1 Exercises

Exercise 5.1. (Skills). Suppose $X_1, \ldots, X_5$ are five iid observations from a Bernoulli distribution with parameter $p$. Let $T = \sum_{i=1}^5 X_i$.
(a) Calculate in closed form the MSE of the following three estimators:
$\delta_1(X_1, \ldots, X_5) = \frac{T}{5}$;
$\delta_2(X_1, \ldots, X_5) = \frac{T+2}{9}$;
$\delta_3(X_1, \ldots, X_5) = \frac{1}{2}$ if $T = 2, 3$; $= \frac{T}{5}$ if $T = 0, 1, 4, 5$.
(b) Plot the MSE of each estimator on a single plot, and comment.

Exercise 5.2. (Partly Conceptual). Suppose $X, Y, Z$ are three iid observations from a normal distribution with mean $\mu$ and known variance $\sigma^2$. Let $U, V, W$ denote the smallest, the median, and the largest among the three observations $X, Y, Z$.
(a) Which of the following estimators of $\mu$ are unbiased, i.e., the bias is zero for all $\mu$: $V$; $W$; $\frac{U+W}{2}$; $.2U + .6V + .2W$.
(b) Consider the following strange sounding estimator. Toss a fair coin. If it lands on heads, estimate $\mu$ as $V$; if it lands on tails, estimate $\mu$ as $\frac{X+Y+Z}{3}$. Is this estimator unbiased?
(c) Do you know how to calculate the MSE of the estimator in part (b)?

Exercise 5.3. Suppose that we have two observations on an unknown parameter $\mu$. The first one is $X \sim N(\mu, 1)$, and the second one is $Y \sim U[\mu - 1, \mu + 1]$, where $X, Y$ are independent.

(a) Find the MSE of an estimator of $\mu$ of the general form $aX + bY$.
(b) When is such an estimator unbiased?
(c) Between the three estimators $X$, $Y$, $\frac{X+Y}{2}$, do you have a preference for one of them? Justify your preference.

Exercise 5.4. (Thresholding Estimate). Suppose $X \sim N(\mu, 1)$. Consider the thresholding estimator of $\mu$ that equals zero if $|X| \leq 1$ and equals $X$ if $|X| > 1$.
(a) Find the expectation and the second moment of this estimator.
(b) Hence find the bias and the variance of this estimator.
(c) Plot the MSE of this estimator and the MSE of the estimator $X$. Do they cross?
(d) Intuitively, when would you prefer to use the thresholding estimate over $X$?

Exercise 5.5. (Absolute Error Loss). Suppose $X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} N(\mu, \sigma^2)$. If the loss function is the absolute error loss $L(\mu, a) = |\mu - a|$, find the risk function of $\bar X$.

Exercise 5.6. (Absolute Error Loss). Suppose a single observation $X \sim \mathrm{Poi}(\lambda)$. If the loss function is the absolute error loss $L(\lambda, a) = |\lambda - a|$,
(a) Find the risk of the estimator $X$ when $\lambda = 1$.
(b) Find the risk of the estimator $X$ when $\lambda = 1.4$.

Exercise 5.7. Suppose a single observation $X \sim \mathrm{Poi}(\lambda)$. Suggest an estimator of $\lambda^2$ and then criticize your own choice.

Exercise 5.8. (Skills). Suppose $X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} N(\mu, \sigma^2)$, where $\sigma^2$ is considered known. Suppose loss is squared error, and consider estimators of $\mu$ of the general form $a + b\bar X$.
(a) For what choices of $a, b$ is the risk function uniformly bounded as a function of $\mu$?
(b) For such choices of $a, b$, find the maximum risk.
(c) Identify the particular choice of $a, b$ that gives the smallest value for the maximum risk. What is the corresponding estimator?

Exercise 5.9. (Cost of Sampling). Suppose a certain Bernoulli parameter is definitely known to be between .46 and .54. It costs some $c$ dollars to take one Bernoulli observation. Would you care to sample, or estimate $p$ to be .5 without bothering to take any data?

Exercise 5.10. (Signal plus Background). Suppose a latent variable $Y \sim \mathrm{Poi}(\lambda)$, $\lambda > 0$; $Y$ cannot be directly measured. Instead we measure $X = Y + B$, where $B \sim \mathrm{Poi}(2)$, and $Y, B$ are independent.
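In a physical experiment, your observed value of $X$ turned out to be zero. What would be your estimate of $\lambda$?

Exercise 5.11. (Bayes Risks). Suppose $X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} \mathrm{Poi}(\lambda)$, and suppose $\lambda$ has a standard exponential prior distribution. Take a squared error loss function.
(a) Calculate the Bayes risk of each of the following estimators of $\lambda$: the constant estimate 1; $\bar X$; and a third estimator involving $\sum_{i=1}^n X_i$ [remainder of the formula is lost in the source].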

(b) Which of the three estimators has the smallest Bayes risk? The largest Bayes risk?

Exercise 5.12. (Bayes Risks). Suppose $X \sim N(\mu, 1)$ and $Y \sim U[\mu - 1, \mu + 1]$, and that $X, Y$ are independent. Suppose $\mu$ has a standard normal prior distribution. Take a squared error loss function.
(a) Calculate the Bayes risk of each of the following estimators of $\mu$: $X$, $Y$, $\frac{X+Y}{2}$.
(b) Which estimator has a smaller Bayes risk?

Exercise 5.13. (Bayes Risks). Suppose $X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} N(\mu, \sigma^2)$, where $\sigma^2$ is considered known. Suppose loss is squared error. Calculate the Bayes risk of the estimator $\frac{n}{n+1}\bar X$ for each of the following prior distributions $G$: $G = U[-3, 3]$; $N(0, 1)$; $C(0, 1)$.

Exercise 5.14. (Uniform Distributions). Suppose $X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} U[0, \theta]$, $\theta > 0$. Take a squared error loss function.
(a) Find the risk function of each of the following three estimators of $\theta$: $2\bar X$; $X_{(n)}$; $\frac{n+1}{n}X_{(n)}$, where $X_{(n)}$ is the sample maximum.
(b) Calculate the Bayes risk of each of these estimators if $\theta$ has a general gamma prior distribution, $\theta \sim G(\alpha, \lambda)$.
(c) Which estimator has the smallest Bayes risk? Does this depend on the values of $\alpha, \lambda$?

Exercise 5.15. (Maximum Risk). Suppose $X \sim N(\mu, 1)$, and it is known that $-1 \leq \mu \leq 1$. Suppose loss is squared error.
(a) Calculate the risk function of each of the following estimators: $X$; $\frac{X}{2}$; the constant estimator zero.
(b) Find the maximum risk of each estimator. Comment.
(c) Consider estimators of the general form $a + bX$, $-\infty < a < \infty$, $b \geq 0$. For which choice of $a, b$ is the maximum risk minimized?

Exercise 5.16. (Maximum Risk). Suppose $X_1, \ldots, X_{16}$ are sixteen iid observations from a Bernoulli distribution with parameter $p$, $0 < p < 1$. Let $T = \sum_{i=1}^{16} X_i$.
(a) Calculate the risk function of each of the following estimators: $\frac{T}{16}$; and a second estimator based on $T$ [formula is lost in the source].
(b) Find the maximum risk of each estimator. Comment.

5.2 References

Berger, J. (2010). Statistical Decision Theory and Bayesian Analysis, Springer, New York.
Bickel, P. and Doksum, K. (2001). Mathematical Statistics, Basic Ideas and Selected Topics, Vol. I, Prentice Hall, NJ.
Box, G.E.P. (1953). Non-normality and tests on variances, Biometrika, 40.
Brown, L. (1994). Minimaxity, more or less, in Stat. Dec. Theory and Rel. Topics, S.S. Gupta and J. Berger Eds., Springer, New York.
Brown, L. (2000). An essay on statistical decision theory, JASA, 95.
DasGupta, A. (2008). Asymptotic Theory of Statistics and Probability, Springer, New York.
Dawid, A.P. (1982). The well calibrated Bayesian, with discussion, JASA, 77.
Hampel, F., Ronchetti, E., Rousseeuw, P., and Stahel, W. (1986). Robust Statistics: The Approach Based on Influence Functions, Wiley, New York.
Huber, P. (1981). Robust Statistics, Wiley, New York.
Johnstone, I. (2012). Function Estimation, Gaussian Sequence Models, and Wavelet Shrinkage, Cambridge Univ. Press, Cambridge, forthcoming.
Kadane, J. and Wolfson, L. (1998). Experiences in elicitation, The Statistician, 47, 1.
Lehmann, E. and Casella, G. Theory of Point Estimation, Springer, New York.
Portnoy, S. and He, X. (2000). A robust journey to the new millennium, JASA, 95.
Seidenfeld, T. (1985). Calibration, coherence, and scoring rules, Phil. Sci., 52.
Young, A. and Smith, R. (2010). Essentials of Statistical Inference, Cambridge Univ. Press, Cambridge.
Stigler, S. (2010). The changing history of robustness, Amer. Statist., 64.
