EE64 Digital Image Processing II: Purdue University VISE - October 9, 2004

The EM Algorithm

1. Sufficient Statistics and Exponential Distributions

Let p(y|θ) be a family of density functions parameterized by θ ∈ Ω, and let Y be a random object with a density function from this family.

Definition: A statistic is any function T(Y) of the data Y.

Definition: We say that a statistic T(Y) is a sufficient statistic for θ if there exist functions g(·,·) and h(·) such that for all y ∈ IR^N and θ ∈ Ω,

    p(y|θ) = h(y) g(T(y), θ) .                                                (1)

If T(Y) is a sufficient statistic for θ, where θ parameterizes the distribution of Y, then the ML estimator of θ must be a function of T(Y). To see this, notice that

    θ̂_ML = arg max_θ p(y|θ)
          = arg max_θ log p(y|θ)
          = arg max_θ { log h(y) + log g(T(y), θ) }
          = arg max_θ log g(T(y), θ)
          = f(T(y))

for some function f(·).

Example 1: Let {Y_n}_{n=1}^{N} be i.i.d. random variables with distribution N(µ, 1). Define the following statistic corresponding to the sample mean of the random variables:

    t = Σ_{n=1}^{N} y_n .

By writing the density function for the sequence Y as

    p(y|µ) = Π_{n=1}^{N} (1/√(2π)) exp{ -(1/2)(y_n - µ)² }
           = (2π)^{-N/2} exp{ -(1/2) Σ_{n=1}^{N} (y_n² - 2 y_n µ + µ²) }
           = (2π)^{-N/2} exp{ -(1/2) Σ_{n=1}^{N} y_n² + t µ - (N/2) µ² }
           = (2π)^{-N/2} exp{ -(1/2) Σ_{n=1}^{N} y_n² + t²/(2N) } exp{ -(N/2)(t/N - µ)² } ,

we can see that it has the form of equation (1). Therefore, t is a sufficient statistic for the parameter µ. Computing the ML estimate yields the following:

    µ̂_ML = arg max_µ log p(y|µ)
          = arg max_µ { -(N/2)(t/N - µ)² }
          = arg min_µ (t/N - µ)²
          = t/N .

Many commonly used distributions, such as the Gaussian, exponential, Poisson, Bernoulli, and binomial, have a structure which makes them particularly useful. These distributions are known as exponential families and have the following special property.

Definition: A family of density functions p(y|θ) for y ∈ IR^N and θ ∈ Ω is said to be a k-parameter exponential family if there exist functions g(θ) ∈ IR^k, s(y), d(θ), and a statistic T(y) ∈ IR^k such that

    p(y|θ) = exp{ <g(θ), T(y)> + d(θ) + s(y) }                                (2)

for all y ∈ IR^N and θ ∈ Ω, where <·,·> denotes the inner product. We refer to T(y) as the natural sufficient statistic, or natural statistic, for the exponential distribution.

Example 2: Let {Y_n}_{n=1}^{N} be i.i.d. random variables with distribution N(µ, σ²). Define the following statistics corresponding to the sample mean
and variance of the random variables:

    t₁ = Σ_{n=1}^{N} y_n
    t₂ = Σ_{n=1}^{N} y_n² .

Then we may write the density function for Y in the following form:

    p(y|µ, σ²) = Π_{n=1}^{N} (1/√(2πσ²)) exp{ -(1/(2σ²))(y_n - µ)² }
               = (2πσ²)^{-N/2} exp{ -(1/(2σ²)) Σ_{n=1}^{N} (y_n² - 2 y_n µ + µ²) }
               = (2πσ²)^{-N/2} exp{ (µ/σ²) t₁ - (1/(2σ²)) t₂ - (N µ²)/(2σ²) }
               = exp{ < [µ/σ², -1/(2σ²)], [t₁, t₂] > - (N µ²)/(2σ²) - (N/2) log(2πσ²) } .

Using the following definitions

    g(θ) = [ µ/σ², -1/(2σ²) ]
    T(y) = [ t₁, t₂ ]
    d(θ) = -(N µ²)/(2σ²) - (N/2) log(2πσ²)
    s(y) = 0 ,

we can see that p(y|µ, σ²) has the form of equation (2) with natural sufficient statistic T(y). With some calculation, it may easily be shown that the ML estimates of µ and σ² are given by

    µ̂_ML  = t₁/N
    σ̂²_ML = t₂/N - (t₁/N)² .
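The ML formulas of Example 2 are easy to check numerically. The sketch below is an illustration and not part of the original notes (the variable names are my own): it accumulates the natural statistics t₁ and t₂ for a small sample and verifies that t₁/N and t₂/N - (t₁/N)² agree with a direct computation of the sample mean and the biased sample variance.

```python
# ML estimates of a Gaussian computed from the natural statistics
# t1 = sum(y_n) and t2 = sum(y_n^2) alone.
y = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
N = len(y)

t1 = sum(y)                  # natural statistic: sum of samples
t2 = sum(v * v for v in y)   # natural statistic: sum of squares

mu_ml = t1 / N                        # mu_hat_ML = t1 / N
sigma2_ml = t2 / N - (t1 / N) ** 2    # sigma2_hat_ML = t2/N - (t1/N)^2

# Direct computation of the mean and the biased sample variance.
mean = sum(y) / N
var = sum((v - mean) ** 2 for v in y) / N

print(mu_ml, sigma2_ml)   # 5.0 4.0
assert abs(mu_ml - mean) < 1e-12
assert abs(sigma2_ml - var) < 1e-12
```

Note that the estimates never look at the individual samples after t₁ and t₂ have been computed, which is exactly what sufficiency promises.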
2. General Formulation of EM Update

One reason that the EM algorithm is so useful is that, in many practical situations, the distributions are exponential, and in this case the EM updates have a particularly simple form. Let Y be the observed or incomplete data, let X be the unobserved data, and assume that the joint density of (Y, X) is from an exponential family with parameter vector θ. Then we know that

    p(y, x|θ) = exp{ <g(θ), T(y, x)> + d(θ) + s(y, x) }

for some sufficient statistic T(y, x). Assuming the ML estimate of θ exists, it is given by

    θ̂_ML = arg max_θ { <g(θ), T(y, x)> + d(θ) }                              (3)
          = f(T(y, x))                                                        (4)

where f(·) is some function of the k-dimensional sufficient statistic for the exponential density. Recalling the form of the Q function, we have

    Q(θ'; θ) = E[ log p(y, X|θ') | Y = y, θ ]

where Y is the observed data and X is the unknown data. Since our objective is to maximize Q with respect to θ', we only need to know the function Q to within a constant that does not depend on θ'. Therefore, we have

    Q(θ'; θ) = E[ log p(y, X|θ') | Y = y, θ ]
             = E[ <g(θ'), T(y, X)> + d(θ') + s(y, X) | Y = y, θ ]
             = <g(θ'), T̄> + d(θ') + constant ,

where T̄ = E[ T(y, X) | Y = y, θ ] is the conditional expectation of the sufficient statistic T(y, x). A single update of the EM algorithm is then given by the recursion

    θ' = arg max_{θ' ∈ Ω} Q(θ'; θ)                                            (5)
       = arg max_{θ' ∈ Ω} { <g(θ'), T̄> + d(θ') }
       = f(T̄) .
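Equation (4) says the complete-data ML estimate depends on the data only through the sufficient statistic. One way to make this concrete (a hedged illustration of my own, not from the notes) is to construct two different Gaussian samples that are engineered to share the same natural statistics (t₁, t₂) and check that they produce identical ML estimates, even though the raw samples differ.

```python
import math

def natural_stats(y):
    """Return the natural statistics (t1, t2) = (sum y, sum y^2)."""
    return sum(y), sum(v * v for v in y)

def ml_estimates(y):
    """Gaussian ML estimates written purely as a function f(t1, t2)."""
    t1, t2 = natural_stats(y)
    n = len(y)
    return t1 / n, t2 / n - (t1 / n) ** 2

# Two different samples sharing t1 = 6 and t2 = 14: the second is
# built by fixing one point at 1.5 and solving b + c = 4.5,
# b^2 + c^2 = 11.75 for the other two.
y_a = [1.0, 2.0, 3.0]
b = (4.5 + math.sqrt(3.25)) / 2
c = (4.5 - math.sqrt(3.25)) / 2
y_b = [1.5, b, c]

assert sorted(y_a) != sorted(y_b)   # genuinely different samples
for p, q in zip(natural_stats(y_a), natural_stats(y_b)):
    assert abs(p - q) < 1e-9        # ... with identical (t1, t2)

# Same sufficient statistic implies the same ML estimate.
for p, q in zip(ml_estimates(y_a), ml_estimates(y_b)):
    assert abs(p - q) < 1e-9
```

This is the mechanism the EM update exploits: since the M-step only needs f(·) applied to a statistic, it can be fed the expected statistic T̄ instead of the unobservable complete-data statistic.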
Intuitively, we see that the EM update has the same form as the computation of the ML estimate, but with the expected value of the statistic replacing the actual statistic.

Example 3: Let {X_n}_{n=1}^{N} be i.i.d. random variables with P{X_n = 0} = π₀ and P{X_n = 1} = π₁ = 1 - π₀. Let {Y_n}_{n=1}^{N} be conditionally i.i.d. random variables given X, and let the conditional distribution of Y_n given X_n be Gaussian N(µ_{X_n}, σ²_{X_n}), where µ₀, µ₁, σ₀², and σ₁² are parameters of the distribution. Then the complete set of parameters for the density of (Y, X) is given by θ = [µ₀, µ₁, σ₀², σ₁², π₀]. Define the statistics

    N_k  = Σ_{n=1}^{N} δ(X_n - k)
    t₁,k = Σ_{n=1}^{N} y_n δ(X_n - k)
    t₂,k = Σ_{n=1}^{N} y_n² δ(X_n - k)

where k ∈ {0, 1} and δ(·) is a Kronecker delta function. We know that if both Y and X are known, then the ML estimates are given by

    µ̂_k  = t₁,k / N_k                                                        (6)
    σ̂²_k = t₂,k / N_k - (t₁,k / N_k)²                                        (7)
    π̂_k  = N_k / N .                                                         (8)

We can express the density function p(y|x, θ) by starting with the expressions derived in Example 2 for each of the two classes corresponding to X_n = 0 and X_n = 1:

    p(y|x, θ) = Π_{k=0}^{1} exp{ < [µ_k/σ_k², -1/(2σ_k²)], [t₁,k, t₂,k] > - (N_k µ_k²)/(2σ_k²) - (N_k/2) log(2πσ_k²) }
              = exp{ Σ_{k=0}^{1} ( < [µ_k/σ_k², -1/(2σ_k²)], [t₁,k, t₂,k] > - (N_k µ_k²)/(2σ_k²) - (N_k/2) log(2πσ_k²) ) } .

The distribution for X also has exponential form, with

    p(x|θ) = Π_{k=0}^{1} π_k^{N_k} = exp{ Σ_{k=0}^{1} N_k log π_k } .

This yields a joint density for (Y, X) with the following form:

    p(y, x|θ) = p(y|x, θ) p(x|θ)
              = exp{ Σ_{k=0}^{1} ( < [µ_k/σ_k², -1/(2σ_k²)], [t₁,k, t₂,k] > - (N_k µ_k²)/(2σ_k²) - (N_k/2) log(2πσ_k²) + N_k log π_k ) } .

Therefore, we can see that (Y, X) has an exponential density function. Using the result of equation (5), we know that the EM update must have the form of equations (6), (7), and (8), where the statistic T(Y, X) is replaced with its conditional expectation:

    µ̂_k  = t̄₁,k / N̄_k                                                      (9)
    σ̂²_k = t̄₂,k / N̄_k - (t̄₁,k / N̄_k)²                                   (10)
    π̂_k  = N̄_k / N                                                         (11)

where

    N̄_k  = Σ_{n=1}^{N} P{X_n = k | Y = y, θ}
    t̄₁,k = Σ_{n=1}^{N} y_n P{X_n = k | Y = y, θ}
    t̄₂,k = Σ_{n=1}^{N} y_n² P{X_n = k | Y = y, θ} .
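Example 3 translates almost line for line into code. The sketch below is my own illustrative implementation, not part of the original notes (function names, the initialization strategy, and the fixed iteration count are all assumptions): the E-step accumulates the expected statistics N̄_k, t̄₁,k, t̄₂,k from the posterior probabilities P{X_n = k | Y = y, θ}, and the M-step applies the complete-data ML formulas (6)-(8) to them, exactly as equations (9)-(11) prescribe.

```python
import math
import random

def gauss(y, mu, var):
    """Gaussian density N(mu, var) evaluated at y."""
    return math.exp(-(y - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_gaussians(y, mu, var, pi0, n_iter=50):
    """EM for a two-component Gaussian mixture via expected natural statistics."""
    N = len(y)
    pi = [pi0, 1.0 - pi0]
    for _ in range(n_iter):
        # E-step: posteriors P{X_n = k | Y = y, theta}, accumulated
        # into the expected statistics Nbar_k, t1bar_k, t2bar_k.
        Nbar, t1bar, t2bar = [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]
        for yn in y:
            w = [pi[k] * gauss(yn, mu[k], var[k]) for k in range(2)]
            s = w[0] + w[1]
            for k in range(2):
                p = w[k] / s
                Nbar[k] += p
                t1bar[k] += yn * p
                t2bar[k] += yn * yn * p
        # M-step: complete-data ML formulas applied to the expected
        # statistics -- equations (9), (10), and (11).
        mu = [t1bar[k] / Nbar[k] for k in range(2)]
        var = [t2bar[k] / Nbar[k] - mu[k] ** 2 for k in range(2)]
        pi = [Nbar[k] / N for k in range(2)]
    return mu, var, pi

# Synthetic data: half the samples from N(0, 1), half from N(5, 1).
random.seed(0)
y = [random.gauss(0.0, 1.0) for _ in range(400)] + \
    [random.gauss(5.0, 1.0) for _ in range(400)]

# Initialize the means at the extremes of the data so the two
# components start on opposite sides of the sample.
mu, var, pi = em_two_gaussians(y, mu=[min(y), max(y)], var=[1.0, 1.0], pi0=0.5)
print(mu, var, pi)
```

Run on this well-separated synthetic sample, the recursion recovers means near 0 and 5, variances near 1, and mixture weights near 0.5. The only difference from the complete-data estimator is that hard class counts have been replaced by soft posterior weights.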