Parametric Density Estimation: Maximum Likelihood Estimation
Today: introduction to density estimation; Maximum Likelihood Estimation.
Introduction
Bayesian Decision Theory from previous lectures tells us how to design an optimal classifier if we knew:
- $P(c_i)$ (priors)
- $p(x \mid c_i)$ (class-conditional densities)
Unfortunately, we rarely have this complete information!
Probability Density Estimation Methods
- Parametric methods: assume we know the shape of the distribution, but not the parameters. Two types of parameter estimation: Maximum Likelihood Estimation and Bayesian Estimation.
- Nonparametric methods: the form of the density is determined entirely by the data, without any model.
Independence Across Classes
We have training data for each class (e.g., salmon samples and sea bass samples). When estimating parameters for one class, we will use only the data collected for that class. This is a reasonable assumption: data from class $c_i$ gives no information about the distribution of class $c_j$. So we estimate the parameters of the salmon distribution from the salmon samples, and the parameters of the sea bass distribution from the sea bass samples.
Independence Across Classes
For each class $c_i$ we have a proposed density $p_i(x \mid c_i)$ with unknown parameters $\theta_i$ which we need to estimate. Since we assumed independence of data across the classes, estimation is an identical procedure for all classes. To simplify notation, we drop the sub-indexes and say that we need to estimate parameters $\theta$ for density $p(x)$; the fact that we need to do so for each class, on the training data that came from that class, is implied.
Maximum Likelihood Parameter Estimation
Parameters $\theta$ are unknown but fixed (i.e., not random variables). Given the training data, choose the parameter value $\theta$ that makes the data most probable (i.e., maximizes the probability of obtaining the sample that has actually been observed).
Maximum Likelihood Parameter Estimation
We have a density $p(x)$ which is completely specified by parameters $\theta = [\theta_1, \dots, \theta_k]$. If $p(x)$ is $N(\mu, \sigma^2)$ then $\theta = [\mu, \sigma^2]$. To highlight that $p(x)$ depends on the parameters $\theta$ we will write $p(x \mid \theta)$. Note the overloaded notation: $p(x \mid \theta)$ is not a conditional density. Let $D = \{x_1, x_2, \dots, x_n\}$ be the $n$ independent training samples in our data. If $p(x)$ is $N(\mu, \sigma^2)$ then $x_1, x_2, \dots, x_n$ are i.i.d. samples from $N(\mu, \sigma^2)$.
Maximum Likelihood Parameter Estimation
Consider the following function, called the likelihood of $\theta$ with respect to the set of samples $D$:
$$p(D \mid \theta) = \prod_{k=1}^{n} p(x_k \mid \theta) = F(\theta)$$
The maximum likelihood estimate (abbreviated MLE) of $\theta$ is the value of $\theta$ that maximizes the likelihood function $p(D \mid \theta)$:
$$\hat{\theta} = \arg\max_{\theta} \, p(D \mid \theta)$$
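As a concrete illustration (not part of the original slides), the likelihood can be evaluated directly as a product of densities and maximized by brute force over a grid. The toy data, the choice of a 1-D Gaussian with known $\sigma$, and the grid range are all assumptions made for this sketch.

```python
# Minimal sketch: evaluate p(D | theta) = prod_k p(x_k | theta) on a grid and
# pick the maximizer.  Toy data and the Gaussian-with-known-sigma model are
# illustrative assumptions, not from the lecture.
import numpy as np

D = np.array([2.1, 1.9, 2.5, 2.2])      # observed i.i.d. samples x_1 .. x_n
sigma = 1.0                             # assumed known

def gaussian_pdf(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

def likelihood(mu):
    """p(D | mu): product of the individual sample densities."""
    return np.prod(gaussian_pdf(D, mu, sigma))

grid = np.linspace(0.0, 4.0, 4001)      # candidate values of theta = mu
mu_hat = grid[np.argmax([likelihood(mu) for mu in grid])]
print(mu_hat, D.mean())                 # the maximizer sits at the sample mean
```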
ML Parameter Estimation vs. ML Classifier
Recall the ML classifier: for fixed data $x$, decide the class $c_i$ that maximizes $p(x \mid c_i)$. Compare with ML parameter estimation: for fixed data $D$, choose the $\theta$ that maximizes $p(D \mid \theta)$. The ML classifier and ML parameter estimation use the same principle, applied to different problems.
Maximum Likelihood Estimation (MLE)
Instead of maximizing $p(D \mid \theta)$, it is usually easier to maximize $\ln p(D \mid \theta)$. Since the log is monotonic,
$$\hat{\theta} = \arg\max_{\theta} \, p(D \mid \theta) = \arg\max_{\theta} \, \ln p(D \mid \theta)$$
To simplify notation, write $L(\theta) = \ln p(D \mid \theta)$. Then
$$\hat{\theta} = \arg\max_{\theta} L(\theta) = \arg\max_{\theta} \ln \prod_{k=1}^{n} p(x_k \mid \theta) = \arg\max_{\theta} \sum_{k=1}^{n} \ln p(x_k \mid \theta)$$
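The same grid search as before can be run on the log-likelihood; a sketch under the same toy assumptions (1-D Gaussian with known $\sigma$, illustrative data). Working with a sum of logs also avoids the numerical underflow that the raw product suffers for large $n$.

```python
# Sketch: maximize L(theta) = sum_k ln p(x_k | theta) instead of the product.
# Same toy data/model assumptions as the previous sketch.
import numpy as np

D = np.array([2.1, 1.9, 2.5, 2.2])
sigma = 1.0

def log_likelihood(mu):
    # ln p(x_k | mu) for a Gaussian with known sigma, summed over the samples
    return np.sum(-0.5 * np.log(2 * np.pi * sigma ** 2)
                  - (D - mu) ** 2 / (2 * sigma ** 2))

grid = np.linspace(0.0, 4.0, 4001)
mu_hat = grid[np.argmax([log_likelihood(mu) for mu in grid])]
print(mu_hat)            # same maximizer as before: the log is monotonic
```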
MLE: Maximization Methods
Maximizing $L(\theta)$ can be done with standard methods from calculus. Let $\theta = (\theta_1, \theta_2, \dots, \theta_p)^t$ and let $\nabla_\theta$ be the gradient operator
$$\nabla_\theta = \left[ \frac{\partial}{\partial \theta_1}, \frac{\partial}{\partial \theta_2}, \dots, \frac{\partial}{\partial \theta_p} \right]^t$$
The set of necessary conditions for an optimum is
$$\nabla_\theta L = 0$$
We also have to check that a $\theta$ satisfying this condition is a maximum, not a minimum or a saddle point, and we should check the boundary of the range of $\theta$.
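In practice the maximization is often handed to a numerical optimizer applied to the negative log-likelihood. A hedged sketch using scipy.optimize.minimize; the data, model, and starting point are assumptions, and the saddle-point and boundary checks mentioned above are not automated here.

```python
# Sketch: maximize L(theta) by minimizing -L(theta) with a standard optimizer.
# Toy 1-D Gaussian with known sigma; data and initial guess are illustrative.
import numpy as np
from scipy.optimize import minimize

D = np.array([2.1, 1.9, 2.5, 2.2])
sigma = 1.0

def neg_log_likelihood(theta):
    mu = theta[0]
    return -np.sum(-0.5 * np.log(2 * np.pi * sigma ** 2)
                   - (D - mu) ** 2 / (2 * sigma ** 2))

result = minimize(neg_log_likelihood, x0=np.array([0.0]))
print(result.x[0], D.mean())    # numerical optimum agrees with the closed form
```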
MLE Example: Gaussian with Unknown μ
Fortunately for us, the ML estimates for most densities we would care about have already been computed. Let's go through an example anyway. Let $p(x \mid \mu)$ be $N(\mu, \sigma^2)$; that is, $\sigma^2$ is known but $\mu$ is unknown and needs to be estimated, so $\theta = \mu$.
$$\hat{\mu} = \arg\max_{\mu} L(\mu) = \arg\max_{\mu} \sum_{k=1}^{n} \ln p(x_k \mid \mu)
= \arg\max_{\mu} \sum_{k=1}^{n} \ln\!\left[ \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{(x_k - \mu)^2}{2\sigma^2} \right) \right]
= \arg\max_{\mu} \sum_{k=1}^{n} \left[ -\ln\!\left(\sqrt{2\pi}\,\sigma\right) - \frac{(x_k - \mu)^2}{2\sigma^2} \right]$$
MLE Example: Gaussian with Unknown μ
$$\hat{\mu} = \arg\max_{\mu} L(\mu) = \arg\max_{\mu} \sum_{k=1}^{n} \left[ -\ln\!\left(\sqrt{2\pi}\,\sigma\right) - \frac{(x_k - \mu)^2}{2\sigma^2} \right]$$
Setting the derivative to zero:
$$\frac{d}{d\mu} L(\mu) = \sum_{k=1}^{n} \frac{x_k - \mu}{\sigma^2} = 0 \;\Rightarrow\; \sum_{k=1}^{n} x_k - n\mu = 0 \;\Rightarrow\; \hat{\mu} = \frac{1}{n} \sum_{k=1}^{n} x_k$$
Thus the ML estimate of the mean is just the average value of the training data, a very intuitive result! The average of the training data would be our guess for the mean even if we didn't know about ML estimates.
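The hand derivation can also be replicated symbolically: differentiate the log-likelihood with respect to μ, set it to zero, and solve. A small sketch with sympy; the choice of four symbolic samples is an arbitrary illustrative assumption.

```python
# Sketch: symbolic check that solving dL/dmu = 0 yields the sample average.
import sympy as sp

xs = sp.symbols('x1:5')                       # four symbolic samples x1 .. x4
mu, sigma = sp.symbols('mu sigma', positive=True)

# L(mu) = sum_k [ -ln(sqrt(2*pi)*sigma) - (x_k - mu)^2 / (2*sigma^2) ]
L = sum(-sp.log(sp.sqrt(2 * sp.pi) * sigma) - (x - mu) ** 2 / (2 * sigma ** 2)
        for x in xs)

print(sp.solve(sp.Eq(sp.diff(L, mu), 0), mu))   # [(x1 + x2 + x3 + x4)/4]
```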
MLE for Gaussian with Unknown μ, σ²
Similarly, it can be shown that if $p(x \mid \mu, \sigma^2)$ is $N(\mu, \sigma^2)$, that is, both the mean and the variance are unknown, then (again a very intuitive result)
$$\hat{\mu} = \frac{1}{n} \sum_{k=1}^{n} x_k, \qquad \hat{\sigma}^2 = \frac{1}{n} \sum_{k=1}^{n} (x_k - \hat{\mu})^2$$
Similarly, if $p(x \mid \mu, \Sigma)$ is $N(\mu, \Sigma)$, a multivariate Gaussian with both the mean and the covariance matrix unknown, then
$$\hat{\mu} = \frac{1}{n} \sum_{k=1}^{n} x_k, \qquad \hat{\Sigma} = \frac{1}{n} \sum_{k=1}^{n} (x_k - \hat{\mu})(x_k - \hat{\mu})^t$$
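In code, the multivariate estimates are just the sample mean and the $1/n$-normalized sample covariance. A small numpy sketch with synthetic data; the true mean, covariance, and sample size below are illustrative assumptions.

```python
# Sketch: ML estimates for a multivariate Gaussian with unknown mean and covariance.
import numpy as np

rng = np.random.default_rng(0)
true_mu = np.array([1.0, -2.0])
true_Sigma = np.array([[2.0, 0.5],
                       [0.5, 1.0]])
X = rng.multivariate_normal(true_mu, true_Sigma, size=500)   # rows are samples x_k

n = X.shape[0]
mu_hat = X.mean(axis=0)                     # (1/n) sum_k x_k
diff = X - mu_hat
Sigma_hat = diff.T @ diff / n               # (1/n) sum_k (x_k - mu_hat)(x_k - mu_hat)^t

print(mu_hat)
print(Sigma_hat)                            # close to true_Sigma for large n
# np.cov(X, rowvar=False, bias=True) gives the same 1/n-normalized estimate
```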
How to Measure the Performance of MLE?
How good is the ML estimate, or indeed any other estimate $\hat{\theta}$ of a parameter $\theta$? The natural measure of error would be $|\hat{\theta} - \theta|$. But $\hat{\theta} - \theta$ is random; we cannot compute it before we carry out the experiment. We want to say something meaningful about our estimate as a function of $\theta$. A way around this difficulty is to average the error, i.e., compute the mean absolute error
$$E\left[\,|\hat{\theta} - \theta|\,\right] = \int |\hat{\theta}(x_1, \dots, x_n) - \theta| \; p(x_1, \dots, x_n) \, dx_1 \cdots dx_n$$
How to Measure the Performance of MLE?
It is usually much easier to compute an almost equivalent measure of performance, the mean squared error $E\left[(\hat{\theta} - \theta)^2\right]$. Doing a little algebra and using $\mathrm{Var}(X) = E(X^2) - (E(X))^2$,
$$E\left[(\hat{\theta} - \theta)^2\right] = \mathrm{Var}(\hat{\theta}) + \left( E(\hat{\theta}) - \theta \right)^2$$
The first term is the variance: the estimator should have low variance. The second term is the (squared) bias: the expectation of the estimator should be close to the true $\theta$.
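The decomposition can be checked by simulation: draw many training sets, compute an estimator on each, and compare the Monte Carlo mean squared error against variance plus squared bias. The sketch below uses the ML variance estimate $\hat{\sigma}^2$ from the previous slide (which turns out to be biased, as a later slide shows); the true parameters, sample size, and trial count are illustrative assumptions.

```python
# Sketch: Monte Carlo check that E[(theta_hat - theta)^2] = Var(theta_hat) + bias^2.
import numpy as np

rng = np.random.default_rng(0)
true_mu, true_sigma2 = 0.0, 4.0
n, trials = 10, 200_000

# theta_hat = ML variance estimate (1/n) sum_k (x_k - mu_hat)^2 on each training set
X = rng.normal(true_mu, np.sqrt(true_sigma2), size=(trials, n))
theta_hat = X.var(axis=1)           # numpy's default ddof=0 is the 1/n estimate

mse = np.mean((theta_hat - true_sigma2) ** 2)
var = theta_hat.var()
bias2 = (theta_hat.mean() - true_sigma2) ** 2
print(mse, var + bias2)             # the two numbers agree (up to Monte Carlo noise)
```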
How to Measure the Performance of MLE?
$$E\left[(\hat{\theta} - \theta)^2\right] = \mathrm{Var}(\hat{\theta}) + \left( E(\hat{\theta}) - \theta \right)^2 \quad (\text{variance} + \text{bias}^2)$$
[Figure: three sampling distributions $p(\hat{\theta})$. Ideal case: no bias, low variance. Bad case: large bias, low variance. Bad case: no bias, high variance.]
Bias and Variance for MLE of the Mean
Let's compute the bias of the ML estimate of the mean:
$$E[\hat{\mu}] = E\left[ \frac{1}{n} \sum_{k=1}^{n} x_k \right] = \frac{1}{n} \sum_{k=1}^{n} E[x_k] = \mu$$
Thus this estimate is unbiased! How about the variance of the ML estimate of the mean?
$$\mathrm{Var}(\hat{\mu}) = E\left[(\hat{\mu} - \mu)^2\right] = \frac{1}{n^2} \sum_{i} \sum_{j} E\left[(x_i - \mu)(x_j - \mu)\right] = \frac{1}{n^2} \sum_{i} E\left[(x_i - \mu)^2\right] = \frac{\sigma^2}{n}$$
(the cross terms vanish because the samples are independent). Thus the variance is very small for a large number of samples: the more samples, the smaller the variance. The MLE of the mean is a very good estimator.
Bias and Variance for MLE of the Mean
Suppose someone claims they have a great new estimator for the mean: just take the first sample! $\hat{\mu} = x_1$. This estimator is unbiased: $E[\hat{\mu}] = E[x_1] = \mu$. However, its variance is
$$E\left[(\hat{\mu} - \mu)^2\right] = E\left[(x_1 - \mu)^2\right] = \sigma^2$$
Thus the variance can be very large and does not improve as we increase the number of samples (no bias, high variance).
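A quick simulation makes the comparison concrete: both estimators are unbiased, but the variance of the sample mean shrinks like $\sigma^2/n$ while the first-sample estimator stays at $\sigma^2$. The distribution, $n$, and trial count below are illustrative assumptions.

```python
# Sketch: compare the sample-mean estimator with the "just take x_1" estimator.
import numpy as np

rng = np.random.default_rng(1)
true_mu, sigma = 5.0, 2.0
n, trials = 50, 100_000

X = rng.normal(true_mu, sigma, size=(trials, n))
mu_hat_mean = X.mean(axis=1)        # ML estimate: average of the n samples
mu_hat_first = X[:, 0]              # rival estimate: first sample only

print(mu_hat_mean.mean(), mu_hat_first.mean())    # both approx true_mu (unbiased)
print(mu_hat_mean.var(), sigma ** 2 / n)          # variance approx sigma^2 / n
print(mu_hat_first.var(), sigma ** 2)             # variance stays at sigma^2
```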
MLE Bias for Mean and Variance
How about the ML estimate of the variance, $\hat{\sigma}^2 = \frac{1}{n} \sum_{k=1}^{n} (x_k - \hat{\mu})^2$? One can show that
$$E[\hat{\sigma}^2] = \frac{n-1}{n} \sigma^2 \neq \sigma^2$$
Thus this estimate is biased! This is because we used $\hat{\mu}$ instead of the true $\mu$. The bias goes to 0 as $n$ goes to infinity, so the estimate is asymptotically unbiased. An unbiased estimate is
$$\hat{\sigma}^2_{\text{unbiased}} = \frac{1}{n-1} \sum_{k=1}^{n} (x_k - \hat{\mu})^2$$
The variance of the ML estimate of the variance can be shown to go to 0 as $n$ goes to infinity.
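The $(n-1)/n$ factor is easy to see numerically: averaged over many training sets, the $1/n$ estimate falls short of $\sigma^2$, while the $1/(n-1)$ version matches it. A sketch with illustrative parameters; numpy's ddof argument switches between the two normalizations.

```python
# Sketch: bias of the ML variance estimate vs. the unbiased 1/(n-1) version.
import numpy as np

rng = np.random.default_rng(2)
true_sigma2 = 9.0
n, trials = 5, 200_000

X = rng.normal(0.0, np.sqrt(true_sigma2), size=(trials, n))
var_ml = X.var(axis=1, ddof=0)         # (1/n)     * sum (x_k - mu_hat)^2
var_unbiased = X.var(axis=1, ddof=1)   # (1/(n-1)) * sum (x_k - mu_hat)^2

print(var_ml.mean(), (n - 1) / n * true_sigma2)   # approx ((n-1)/n) * sigma^2 = 7.2
print(var_unbiased.mean(), true_sigma2)           # approx sigma^2 = 9.0
```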
MLE for the Uniform Distribution U[0, θ]
$X$ is $U[0, \theta]$ if its density is $1/\theta$ inside $[0, \theta]$ and 0 otherwise.
[Figure: the density $p(x \mid \theta) = 1/\theta$ on $[0, \theta]$, and the likelihood $F(\theta)$.]
The likelihood is
$$F(\theta) = \prod_{k=1}^{n} p(x_k \mid \theta) = \begin{cases} 1/\theta^n & \text{if } \theta \ge \max\{x_1, \dots, x_n\} \\ 0 & \text{if } \theta < \max\{x_1, \dots, x_n\} \end{cases}$$
Since $1/\theta^n$ decreases as $\theta$ grows, the likelihood is maximized at the smallest admissible value:
$$\hat{\theta} = \arg\max_{\theta} \prod_{k=1}^{n} p(x_k \mid \theta) = \max\{x_1, \dots, x_n\}$$
This is not very pleasing, since the true $\theta$ should surely be larger than any observed $x$!
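Both facts, that the MLE is the sample maximum and that it systematically undershoots the true $\theta$, can be seen in a short simulation; the true $\theta$, sample size, and trial count are illustrative assumptions.

```python
# Sketch: the ML estimate for U[0, theta] is the sample maximum, which always
# undershoots the true theta.
import numpy as np

rng = np.random.default_rng(3)
true_theta = 10.0
n, trials = 20, 100_000

X = rng.uniform(0.0, true_theta, size=(trials, n))
theta_hat = X.max(axis=1)                  # MLE = max{x_1, ..., x_n}

print((theta_hat < true_theta).mean())     # 1.0: the estimate is always below theta
print(theta_hat.mean())                    # approx theta * n/(n+1), i.e. biased low
```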