Cogsci 118B. Virginia de Sa. Maximum Likelihood estimation, Bayesian Parameter estimation


1 Cogsci 118B 1 Virginia de Sa Maximum Likelihood estimation, Bayesian Parameter estimation

2 Density Estimation 2 Consider a classification task. If we know the densities of each class, it is easy to pick a decision boundary. [Figure: hypothetical class-conditional probability density functions show the probability density of measuring a particular feature value x given a pattern in category ω_i. From Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification, copyright 2001 by John Wiley & Sons.] But how do we learn the densities? One choice is to do some sort of histogram method. These methods are called non-parametric. Another choice is to assume that the density is of a certain form (e.g. Gaussian) and find the best-fitting parameters. These methods are called parametric.

3 How do we fit the parameters? 3 We will consider two different schools of thought. Maximum likelihood estimation: there is a fixed but unknown parameter vector, and our best estimate (the maximum likelihood estimate) of the unknown parameter vector is the one that has the highest probability of generating the data. Bayesian estimation: treat the parameter vector as a random variable. We have a prior distribution for the parameters, and then after looking at the data, compute a posterior density over the parameters. We do not pick a most likely parameter vector but a density over parameter vectors. The whole density is used to make classification and other inference decisions.

4 Review of Bayes Theorem 4 So far we have talked about the probability of a class given the data:
P(ω_j | x) = p(x | ω_j) P(ω_j) / p(x)
P(ω_j) = prior probability of ω_j
p(x) = evidence
P(ω_j | x) = posterior probability of ω_j
p(x | ω_j) = likelihood of ω_j with respect to x
Here we talk about the probability of a parameter vector given the data:
p(θ | D) = p(D | θ) p(θ) / p(D)
p(θ) = prior probability of θ

5 p(D) = evidence 5
p(θ | D) = posterior probability of θ
p(D | θ) = likelihood of θ with respect to D

6 Parametric Estimation 6 Assume we know the form of a probability density but not some or all of the parameters of the functional form. We estimate the parameter vector θ from samples of the data. The maximum likelihood estimate of θ is the θ̂ that maximizes p(data | θ). Usually we assume
p(D | θ) = Π_{k=1}^n p(x^(k) | θ) = p(x^(1) | θ) p(x^(2) | θ) ... p(x^(n) | θ)
by assuming independence of the samples x^(k).
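To make the factorization concrete, here is a minimal numerical sketch (my own illustration, not from the slides, assuming NumPy/SciPy and a Gaussian form for p(x | θ); the data values are made up). Independence of the samples turns the product over k into a sum of per-sample log densities:

import numpy as np
from scipy.stats import norm

def log_likelihood(theta, data):
    # theta = (mu, sigma); sum of log densities = log of the product
    mu, sigma = theta
    return np.sum(norm.logpdf(data, loc=mu, scale=sigma))

data = np.array([1.2, 0.7, 1.9, 1.1])   # made-up samples
print(log_likelihood((1.0, 0.5), data))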

8 Maximum Likelihood Estimation 7 [Figure: the top graph shows several training points in one dimension, known or assumed to be drawn from a Gaussian of a particular variance but unknown mean; four of the infinite number of candidate source distributions are shown in dashed lines. The middle figure shows the likelihood p(D | θ) as a function of the mean; the bottom figure shows the log-likelihood, which is maximized at the same value of θ. From Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification, copyright 2001 by John Wiley & Sons.]

9 Maximum Likelihood Estimation 8 The log-likelihood function is l(θ) ≡ ln p(D | θ), and θ̂ = argmax_θ l(θ). To find θ̂, solve ∇_θ l = 0, where
∇_θ l = Σ_{k=1}^n ∇_θ ln p(x^(k) | θ)
The log-likelihood is the logarithm of the probability density function, but it is interpreted as a function of θ for given data, whereas the probability density function is thought of as a function over the sample space for a given parameter θ. [Hofmann class notes]
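Where no closed form is available, θ̂ can be found numerically by minimizing the negative log-likelihood. A hedged sketch using scipy.optimize (illustrative data and starting values; in the Gaussian case derived below the maximizer also has a closed form, which this should match):

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

data = np.array([1.2, 0.7, 1.9, 1.1])   # made-up samples

def neg_log_likelihood(theta):
    mu, log_sigma = theta                # optimize log sigma so sigma stays positive
    return -np.sum(norm.logpdf(data, loc=mu, scale=np.exp(log_sigma)))

result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)                 # matches the closed-form estimates below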

10 Maximum Likelihood Estimation: Example 9 Bernoulli random variable with two possible outcomes 0, 1:
P(x = 1) = ρ, P(x = 0) = 1 − ρ
p(x = s) = ρ^s (1 − ρ)^(1−s)
l(ρ) = Σ_{k=1}^n [ s^(k) ln ρ + (1 − s^(k)) ln(1 − ρ) ]
∂l/∂ρ = Σ_{k=1}^n [ s^(k)/ρ − (1 − s^(k))/(1 − ρ) ]
Setting ∂l/∂ρ = 0 gives ρ̂ = (Σ_{k=1}^n s^(k))/n
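A quick sketch (my own, assuming NumPy; the sample values are made up) checking the closed-form estimate against a brute-force grid search over l(ρ):

import numpy as np

s = np.array([1, 0, 1, 1, 0, 1])        # made-up Bernoulli samples
rho_hat = s.mean()                       # closed-form MLE: sum(s)/n

# sanity check against a grid search over the log-likelihood
grid = np.linspace(0.01, 0.99, 999)
ll = s.sum() * np.log(grid) + (len(s) - s.sum()) * np.log(1 - grid)
print(rho_hat, grid[np.argmax(ll)])      # both about 0.667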

16 Maximum Likelihood Estimation: Example 10 Gaussian with unknown µ and σ²:
l(µ, σ) = Σ_{k=1}^n [ −(1/2) ln 2π − ln σ − (x^(k) − µ)²/(2σ²) ]
∂l/∂µ = Σ_{k=1}^n (1/σ²)(x^(k) − µ)
∂l/∂σ = Σ_{k=1}^n [ −1/σ + (x^(k) − µ)²/σ³ ]
Setting these to zero gives
µ̂ = (Σ_{k=1}^n x^(k))/n
σ̂² = (1/n) Σ_{k=1}^n (x^(k) − µ̂)²
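In code (a sketch assuming NumPy; note the maximum likelihood variance divides by n, not the unbiased n − 1):

import numpy as np

x = np.random.default_rng(0).normal(loc=5.0, scale=2.0, size=1000)  # synthetic data
mu_hat = x.mean()             # sum(x)/n
sigma2_hat = x.var(ddof=0)    # (1/n) * sum((x - mu_hat)**2): 1/n, not 1/(n-1)
print(mu_hat, sigma2_hat)     # close to the true 5.0 and 4.0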

21 Bayesian Parameter Estimation 11 Assume the form of the density p(x | θ) is known but θ is not known exactly. Initial knowledge about θ is represented as a prior density p(θ). We have n samples drawn independently from the unknown true probability density p(x).
p(x | D) = ∫ p(x, θ | D) dθ = ∫ p(x | θ) p(θ | D) dθ
p(θ | D) = p(D | θ) p(θ) / ∫ p(D | θ) p(θ) dθ
p(D | θ) = Π_{k=1}^n p(x^(k) | θ)
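For a one-dimensional θ these integrals can be approximated on a grid. A sketch (my own illustration, assuming a Gaussian likelihood with known σ and a Gaussian prior on the mean; all numbers are made up):

import numpy as np
from scipy.stats import norm

sigma = 1.0                                   # known
data = np.array([0.8, 1.3, 0.9])              # made-up samples
theta = np.linspace(-5, 5, 2001)              # grid over the unknown mean
dtheta = theta[1] - theta[0]
prior = norm.pdf(theta, loc=0.0, scale=2.0)   # p(theta)

# p(D|theta) = product over k of p(x^(k)|theta), evaluated on the grid
lik = np.prod(norm.pdf(data[:, None], loc=theta[None, :], scale=sigma), axis=0)

post = lik * prior
post /= post.sum() * dtheta                   # normalize: divide by the evidence

# predictive density p(x|D) = integral of p(x|theta) p(theta|D) dtheta, at x = 0.5
x = 0.5
print(np.sum(norm.pdf(x, loc=theta, scale=sigma) * post) * dtheta)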

23 Context/Aside 12 What will we do with p(x | D, ω_i)?
P(ω_i | x, D) = p(x | ω_i, D) P(ω_i | D) / Σ_{j=1}^c p(x | ω_j, D) P(ω_j | D)
= p(x | ω_i, D) P(ω_i) / Σ_{j=1}^c p(x | ω_j, D) P(ω_j)

24 Example of Bayesian learning 13 MAP estimate of the parameter; full Bayesian inference.

25 Bayesian Learning 14 A conjugate prior is a prior density for which the posterior density has the same functional form as the prior.
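The slides develop the Gaussian conjugate example next; an even simpler illustration of conjugacy (my own, not from the slides) is the Beta prior with a Bernoulli likelihood, where the posterior is again a Beta density:

import numpy as np
from scipy.stats import beta

a, b = 2.0, 2.0                  # Beta(a, b) prior on the Bernoulli parameter rho
s = np.array([1, 0, 1, 1, 0, 1]) # made-up Bernoulli samples

# posterior is Beta(a + #ones, b + #zeros): same family as the prior
a_post = a + s.sum()
b_post = b + len(s) - s.sum()
print(beta(a_post, b_post).mean())   # posterior mean of rho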

26 Bayesian Learning Example Gaussian density 15 Assume p(x | µ) ~ N(µ, σ²) where σ² is known, and p(µ) ~ N(µ₀, σ₀²).
p(µ | D) = p(D | µ) p(µ) / p(D)
= α p(D | µ) p(µ)
= α Π_{k=1}^n p(x^(k) | µ) p(µ)
= α Π_{k=1}^n (1/(√(2π) σ)) e^(−.5((x^(k) − µ)/σ)²) (1/(√(2π) σ₀)) e^(−.5((µ − µ₀)/σ₀)²)
= α′ e^(−.5 [ Σ_{k=1}^n ((µ − x^(k))/σ)² + ((µ − µ₀)/σ₀)² ])
= α″ e^(−.5 [ (n/σ² + 1/σ₀²) µ² − 2 ((1/σ²) Σ_{k=1}^n x^(k) + µ₀/σ₀²) µ ])

32 p(µ | D) = (1/(√(2π) σₙ)) e^(−.5((µ − µₙ)/σₙ)²) 16
where 1/σₙ² = n/σ² + 1/σ₀² and µₙ = σₙ² ((n/σ²) µ̂ₙ + µ₀/σ₀²), with µ̂ₙ the sample mean. This gives
µₙ = (n σ₀² µ̂ₙ + σ² µ₀)/(n σ₀² + σ²) and σₙ² = (σ₀² σ²)/(n σ₀² + σ²)
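A sketch computing µₙ and σₙ² directly from these formulas (assuming NumPy; the prior and data values are made up):

import numpy as np

sigma = 1.0                      # known data standard deviation
mu0, sigma0 = 0.0, 2.0           # prior N(mu0, sigma0^2)
x = np.array([0.8, 1.3, 0.9])    # made-up samples
n, xbar = len(x), x.mean()       # xbar plays the role of mu-hat_n on the slide

sigma_n2 = (sigma0**2 * sigma**2) / (n * sigma0**2 + sigma**2)
mu_n = (n * sigma0**2 * xbar + sigma**2 * mu0) / (n * sigma0**2 + sigma**2)
print(mu_n, sigma_n2)            # about 0.92 and 0.31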

33 Bayesian Learning 17 [Figure: Bayesian learning of the mean of normal distributions in one and two dimensions; the posterior distribution estimates are labeled by the number of training samples used in the estimation. From Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification, copyright 2001 by John Wiley & Sons.]

34 Computing the Class-conditional Density 18
p(x | D) = ∫ p(x | µ) p(µ | D) dµ
= ∫ (1/(√(2π) σ)) e^(−.5((x − µ)/σ)²) (1/(√(2π) σₙ)) e^(−.5((µ − µₙ)/σₙ)²) dµ
= (1/(2π σ σₙ)) e^(−.5 (x − µₙ)²/(σ² + σₙ²)) f(σ, σₙ)
so p(x | D) is normal with mean µₙ and variance σ² + σₙ².
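A quick Monte Carlo check of this result: sampling µ from the posterior and then x from p(x | µ) should give draws whose variance is about σ² + σₙ² (the posterior values below are of the kind computed in the earlier sketch, purely illustrative):

import numpy as np

rng = np.random.default_rng(0)
sigma, mu_n, sigma_n2 = 1.0, 0.92, 0.31   # illustrative posterior values

# sample mu from the posterior, then x from p(x|mu): draws from p(x|D)
mu_draws = rng.normal(mu_n, np.sqrt(sigma_n2), size=200000)
x_draws = rng.normal(mu_draws, sigma)

print(x_draws.mean(), x_draws.var())      # about mu_n and sigma^2 + sigma_n^2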

35 Bayesian Method vs Maximum Likelihood 19 Maximum likelihood: compute θ̂ = argmax_θ p(data | θ), then p(x | data) = p(x | θ̂). Bayesian estimation: compute p(θ | data), then p(x | data) = ∫ p(θ | data) p(x | θ) dθ. The maximum likelihood approach is usually simpler, less computationally expensive, and gives a more easily understood solution. Maximum likelihood returns a conditional density of the assumed parametric form. Bayesian methods use more of the available information (and are likely to be more useful when training data is very sparse).
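A small sketch contrasting the two predictive densities when data is sparse (my own illustration, with n = 1 and the same Gaussian assumptions as above): the plug-in predictive ignores the uncertainty in µ, while the Bayesian predictive is wider and shrunk toward the prior mean:

import numpy as np

sigma, mu0, sigma0 = 1.0, 0.0, 2.0
x = np.array([0.8])                  # very sparse, made-up data: n = 1
n, xbar = len(x), x.mean()

sigma_n2 = (sigma0**2 * sigma**2) / (n * sigma0**2 + sigma**2)
mu_n = (n * sigma0**2 * xbar + sigma**2 * mu0) / (n * sigma0**2 + sigma**2)

# ML plug-in predictive N(xbar, sigma^2) vs Bayesian predictive N(mu_n, sigma^2 + sigma_n^2)
print(xbar, sigma**2)                # plug-in mean and variance
print(mu_n, sigma**2 + sigma_n2)     # Bayesian predictive mean and (larger) variance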

36 Sources of Classification Error 20 Bayes error: due to overlap between the classes in the input space. Model error: due to not picking the correct class of models. Estimation error: due to insufficient training data.

37 21 Assume: You are an intellectual snob. You have a child. [This slide and the following ones are from Andrew W. Moore's Gaussians slides 50 to 61, copyright 2001, Andrew W. Moore.]

38 22 Intellectual snobs with children are obsessed with IQ. In the world as a whole, IQs are drawn from a Gaussian N(100, 15²).

39 23 IQ tests: If you take an IQ test you'll get a score that, on average (over many tests), will be your IQ. But because of noise on any one test, the score will often be a few points lower or higher than your true IQ: SCORE | IQ ~ N(IQ, 10²).

40 24 Assume: You drag your kid off to get tested. She gets a score of 130. "Yippee!" you screech, and start deciding how to casually refer to her membership of the top 2% of IQs in your Christmas newsletter.
P(X < 130 | µ = 100, σ² = 15²) = P(X < 2 | µ = 0, σ² = 1) = Φ(2) ≈ 0.977
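This probability can be checked with the standard normal CDF (a one-liner assuming SciPy):

from scipy.stats import norm
# P(X < 130 | mu=100, sigma=15) = P(Z < 2) for a standard normal Z
print(norm.cdf(130, loc=100, scale=15))   # about 0.977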

41 Assume: You drag your kid off to get tested. She gets a score of 130. You are thinking: "Well, sure, the test isn't accurate, so she might have an IQ of 120 or she might have an IQ of 140, but the most likely IQ given the evidence score = 130 is, of course, 130."
P(X < 130 | µ = 100, σ² = 15²) = P(X < 2 | µ = 0, σ² = 1) = Φ(2) ≈ 0.977
Can we trust this reasoning?

42 26 Maximum Likelihood IQ: IQ ~ N(100, 15²), S | IQ ~ N(IQ, 10²), S = 130. The MLE is the value of the hidden parameter that makes the observed data most likely: IQ_mle = argmax_iq p(s = 130 | iq). In this case, IQ_mle = 130.

43 27 IQ ~ N(100, 15²), S | IQ ~ N(IQ, 10²), S = 130. BUT... The MLE is the value of the hidden parameter that makes the observed data most likely: IQ_mle = argmax_iq p(s = 130 | iq) = 130. This is not the same as the most likely value of the parameter given the observed data.

44 28 What we really want: IQ ~ N(100, 15²), S | IQ ~ N(IQ, 10²), S = 130. Question: what is IQ | (S = 130)? This is called the posterior distribution of IQ.

45 29 IQ ~ N(100, 15²), S | IQ ~ N(IQ, 10²), S = 130. Question: what is IQ | (S = 130)? Which tool or tools? [Diagram of the available Gaussian tools: Chain Rule, Conditionalize, Marginalize, and Matrix Multiply.]

46 30 IQ ~ N(100, 15²), S | IQ ~ N(IQ, 10²), S = 130. Question: what is IQ | (S = 130)? Plan: combine S | IQ with IQ using the chain rule to get the joint (S, IQ); swap to (IQ, S); conditionalize to get IQ | S.

47 Working 31 IQ ~ N(100, 15²), S | IQ ~ N(IQ, 10²), S = 130. Question: what is IQ | (S = 130)?
IF V ~ N(µ_v, Σ_vv) and U | V ~ N(AV, Σ_{u|v}), THEN (U, V) is jointly Gaussian with mean (A µ_v, µ_v) and covariance blocks A Σ_vv Aᵀ + Σ_{u|v} (for U), A Σ_vv (cross term), and Σ_vv (for V).
IF (U, V) ~ N((µ_u, µ_v), [[Σ_uu, Σ_uv], [Σ_uvᵀ, Σ_vv]]), THEN U | V ~ N(µ_{u|v}, Σ_{u|v}) with µ_{u|v} = µ_u + Σ_uv Σ_vv⁻¹ (V − µ_v).

48 32 Your pride and joy's posterior IQ: If you did the working, you now have p(iq | S = 130). If you have to give the most likely IQ given the score, you should give IQ_map = argmax_iq p(iq | s = 130), where MAP means Maximum A Posteriori.
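Closing the loop numerically: for this Gaussian prior and Gaussian noise model the posterior is Gaussian, so the MAP estimate equals the posterior mean. A sketch (assuming NumPy) using the precision-weighted combination from the working above:

import numpy as np

mu0, sigma0 = 100.0, 15.0    # prior: IQ ~ N(100, 15^2)
sigma = 10.0                 # test noise: S | IQ ~ N(IQ, 10^2)
s = 130.0

# precision-weighted combination of prior mean and observed score
post_var = 1.0 / (1.0 / sigma0**2 + 1.0 / sigma**2)
post_mean = post_var * (mu0 / sigma0**2 + s / sigma**2)
print(post_mean, np.sqrt(post_var))   # about 120.8 and 8.3

# for a Gaussian posterior the MAP estimate equals the posterior mean,
# so IQ_map is about 120.8, noticeably below the MLE of 130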
