The EM Algorithm for HMMs
Michael Collins
February 22, 2012
Maximum-Likelihood Estimation for Fully Observed Data (Recap from earlier)

We have fully observed data, $x_{i,1} \ldots x_{i,m}, s_{i,1} \ldots s_{i,m}$ for $i = 1 \ldots n$. The likelihood function is

$$L(\theta) = \sum_{i=1}^{n} \log p(x_{i,1} \ldots x_{i,m}, s_{i,1} \ldots s_{i,m}; \theta)$$

Maximum-likelihood estimates of transition probabilities are

$$t(s' \mid s) = \frac{\sum_{i=1}^{n} \mathrm{count}(i, s \rightarrow s')}{\sum_{i=1}^{n} \sum_{s'} \mathrm{count}(i, s \rightarrow s')}$$

Maximum-likelihood estimates of emission probabilities are

$$e(x \mid s) = \frac{\sum_{i=1}^{n} \mathrm{count}(i, s \leadsto x)}{\sum_{i=1}^{n} \sum_{x} \mathrm{count}(i, s \leadsto x)}$$

where $\mathrm{count}(i, s \rightarrow s')$ is the number of times the transition $s \rightarrow s'$ is seen in the $i$'th training example, and $\mathrm{count}(i, s \leadsto x)$ is the number of times state $s$ is paired with emission $x$.
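As a concrete sketch of these count-and-normalize estimates, here is a minimal Python implementation. The encoding is an assumption of this sketch, not part of the notes: states and emission symbols are integer ids, and each training example is a pair of equal-length lists `(xs, ss)`.

```python
import numpy as np

def mle_fully_observed(data, k, V):
    """Count-and-normalize MLE for an HMM from fully observed data.

    data: list of (xs, ss) pairs, where xs are emission ids in {0..V-1}
    and ss are state ids in {0..k-1} (a hypothetical encoding).
    Returns (pi, T, E): initial, transition, and emission probabilities.
    """
    pi = np.zeros(k)          # count of s seen as the initial state
    T = np.zeros((k, k))      # count(i, s -> s')
    E = np.zeros((k, V))      # count(i, s ~> x)
    for xs, ss in data:
        pi[ss[0]] += 1
        for j in range(len(ss) - 1):
            T[ss[j], ss[j + 1]] += 1
        for x, s in zip(xs, ss):
            E[s, x] += 1
    # Normalize each table over its second argument, matching the
    # denominators above: sum over s' for transitions, over x for emissions.
    return (pi / pi.sum(),
            T / T.sum(axis=1, keepdims=True),
            E / E.sum(axis=1, keepdims=True))
```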
Maximum-Likelihood Estimation for Partially Observed Data

We have partially observed data, $x_{i,1} \ldots x_{i,m}$ for $i = 1 \ldots n$. Note that we do not have the state sequences. The likelihood function is

$$L(\theta) = \sum_{i=1}^{n} \log \sum_{s_1 \ldots s_m} p(x_{i,1} \ldots x_{i,m}, s_1 \ldots s_m; \theta)$$

We can maximize this function using EM (the algorithm will converge to a local maximum of the likelihood function).
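To make the marginalization over hidden state sequences explicit, here is a brute-force sketch that enumerates all $k^m$ state sequences. It is feasible only for tiny examples; the forward algorithm defined below computes the same quantity efficiently. The arrays `pi`, `T`, and `E` (for $t(s)$, $t(s' \mid s)$, and $e(x \mid s)$) are an assumed representation.

```python
import itertools
import numpy as np

def log_likelihood_brute_force(sequences, pi, T, E):
    """L(theta) = sum_i log sum_{s_1..s_m} p(x_{i,1}..x_{i,m}, s_1..s_m; theta),
    computed by explicit enumeration of every state sequence."""
    k = len(pi)
    total = 0.0
    for xs in sequences:
        p_x = 0.0
        for ss in itertools.product(range(k), repeat=len(xs)):
            p = pi[ss[0]] * E[ss[0], xs[0]]      # t(s_1) e(x_1 | s_1)
            for j in range(1, len(xs)):
                p *= T[ss[j - 1], ss[j]] * E[ss[j], xs[j]]
            p_x += p                             # sum over state sequences
        total += np.log(p_x)                     # log of the marginal
    return total
```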
An Example

Suppose we have an HMM with two states ($k = 2$) and four possible emissions ($a$, $b$, $x$, $y$), and our (partially observed) training data consists of the following counts of four different sequences (no other sequences are seen):

a x (100 times)
a y (100 times)
b x (100 times)
b y (100 times)

What are the maximum-likelihood estimates for the HMM?
Forward and Backward Probabilities

Define $\alpha[j, s]$ to be the sum of probabilities of all paths ending in state $s$ at position $j$ in the sequence, for $j = 1 \ldots m$ and $s \in \{1 \ldots k\}$. More formally:

$$\alpha[j, s] = \sum_{s_1 \ldots s_{j-1}} \left[ t(s_1)\, e(x_1 \mid s_1) \left( \prod_{l=2}^{j-1} t(s_l \mid s_{l-1})\, e(x_l \mid s_l) \right) t(s \mid s_{j-1})\, e(x_j \mid s) \right]$$

Define $\beta[j, s]$ for $s \in \{1 \ldots k\}$ and $j \in \{1 \ldots (m-1)\}$ to be the sum of probabilities of all paths starting with state $s$ at position $j$ and going to the end of the sequence. More formally:

$$\beta[j, s] = \sum_{s_{j+1} \ldots s_m} t(s_{j+1} \mid s)\, e(x_{j+1} \mid s_{j+1}) \prod_{l=j+2}^{m} t(s_l \mid s_{l-1})\, e(x_l \mid s_l)$$
Recursive Definitions of the Forward Probabilities

Initialization: for $s = 1 \ldots k$,

$$\alpha[1, s] = t(s)\, e(x_1 \mid s)$$

For $j = 2 \ldots m$:

$$\alpha[j, s] = \sum_{s' \in \{1 \ldots k\}} \alpha[j-1, s'] \times t(s \mid s') \times e(x_j \mid s)$$
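A direct transcription of this recursion, as a sketch (same assumed `pi`, `T`, `E` arrays as in the earlier snippet, with positions 0-indexed rather than 1-indexed):

```python
import numpy as np

def forward(xs, pi, T, E):
    """alpha[j, s] = sum of probabilities of all paths ending in state s
    at position j (0-indexed here, versus 1-indexed on the slides)."""
    m, k = len(xs), len(pi)
    alpha = np.zeros((m, k))
    alpha[0] = pi * E[:, xs[0]]                  # alpha[1, s] = t(s) e(x_1 | s)
    for j in range(1, m):
        # alpha[j, s] = sum_{s'} alpha[j-1, s'] * t(s | s') * e(x_j | s)
        alpha[j] = (alpha[j - 1] @ T) * E[:, xs[j]]
    return alpha
```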
Recursive Definitions of the Backward Probabilities

Initialization: for $s = 1 \ldots k$,

$$\beta[m, s] = 1$$

For $j = (m-1) \ldots 1$:

$$\beta[j, s] = \sum_{s' \in \{1 \ldots k\}} \beta[j+1, s'] \times t(s' \mid s) \times e(x_{j+1} \mid s')$$
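The mirror-image sketch for the backward recursion, under the same assumptions:

```python
import numpy as np

def backward(xs, T, E):
    """beta[j, s] = sum of probabilities of all paths starting in state s
    at position j and running to the end (0-indexed; beta[m-1, s] = 1)."""
    m, k = len(xs), T.shape[0]
    beta = np.zeros((m, k))
    beta[m - 1] = 1.0                            # beta[m, s] = 1 for all s
    for j in range(m - 2, -1, -1):
        # beta[j, s] = sum_{s'} beta[j+1, s'] * t(s' | s) * e(x_{j+1} | s')
        beta[j] = T @ (beta[j + 1] * E[:, xs[j + 1]])
    return beta
```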
The Forward-Backward Algorithm

Given these definitions:

$$p(x_1 \ldots x_m, S_j = s; \theta) = \sum_{s_1 \ldots s_m : s_j = s} p(x_1 \ldots x_m, s_1 \ldots s_m; \theta) = \alpha[j, s] \times \beta[j, s]$$

Note: we'll assume the special definition that $\beta[m, s] = 1$ for all $s$.
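A consequence of this identity is a standard debugging check: $\sum_s \alpha[j, s]\, \beta[j, s]$ should equal $p(x_1 \ldots x_m; \theta)$ at every position $j$. A sketch, reusing the `forward` and `backward` functions from the previous snippets:

```python
import numpy as np

def check_forward_backward(xs, pi, T, E):
    """Verify sum_s alpha[j, s] * beta[j, s] = p(x_1 .. x_m; theta) for all j."""
    alpha, beta = forward(xs, pi, T, E), backward(xs, T, E)
    p_x = alpha[-1].sum()                        # p(x) = sum_s alpha[m, s]
    assert np.allclose((alpha * beta).sum(axis=1), p_x)
    return p_x
```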
The Forward-Backward Algorithm

Given these definitions:

$$p(x_1 \ldots x_m, S_j = s, S_{j+1} = s'; \theta) = \sum_{s_1 \ldots s_m : s_j = s,\, s_{j+1} = s'} p(x_1 \ldots x_m, s_1 \ldots s_m; \theta) = \alpha[j, s] \times t(s' \mid s) \times e(x_{j+1} \mid s') \times \beta[j+1, s']$$

Note: we'll assume the special definition that $\beta[m, s] = 1$ for all $s$.
Things we can Compute Using Forward-Backward Probabilities

The probability of any sequence:

$$p(x_1 \ldots x_m; \theta) = \sum_{s_1 \ldots s_m} p(x_1 \ldots x_m, s_1 \ldots s_m; \theta) = \sum_{s} \alpha[m, s]$$

The probability of any state transition:

$$p(x_1 \ldots x_m, S_j = s, S_{j+1} = s'; \theta) = \sum_{s_1 \ldots s_m : s_j = s,\, s_{j+1} = s'} p(x_1 \ldots x_m, s_1 \ldots s_m; \theta) = \alpha[j, s] \times t(s' \mid s) \times e(x_{j+1} \mid s') \times \beta[j+1, s']$$
Things we can Compute Using Forward-Backward Probabilities (continued)

The conditional probability of any state transition:

$$p(S_j = s, S_{j+1} = s' \mid x_1 \ldots x_m; \theta) = \frac{\alpha[j, s] \times t(s' \mid s) \times e(x_{j+1} \mid s') \times \beta[j+1, s']}{\sum_{s} \alpha[m, s]}$$

The conditional probability of any state at any position:

$$p(S_j = s \mid x_1 \ldots x_m; \theta) = \frac{\alpha[j, s] \times \beta[j, s]}{\sum_{s} \alpha[m, s]}$$
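These two conditionals are the heart of the E-step. A sketch computing both from the $\alpha$ and $\beta$ tables (0-indexed, reusing `forward` and `backward` from above); the names `gamma` and `xi` are conventional but are not used in these notes:

```python
import numpy as np

def posteriors(xs, pi, T, E):
    """Returns (gamma, xi) where
    gamma[j, s]  = p(S_j = s | x; theta)
    xi[j, s, s'] = p(S_j = s, S_{j+1} = s' | x; theta), for j = 0..m-2."""
    alpha, beta = forward(xs, pi, T, E), backward(xs, T, E)
    p_x = alpha[-1].sum()                        # denominator: sum_s alpha[m, s]
    gamma = alpha * beta / p_x
    m = len(xs)
    xi = np.zeros((m - 1,) + T.shape)
    for j in range(m - 1):
        # alpha[j, s] * t(s' | s) * e(x_{j+1} | s') * beta[j+1, s'] / p(x)
        xi[j] = alpha[j][:, None] * T * (E[:, xs[j + 1]] * beta[j + 1])[None, :] / p_x
    return gamma, xi
```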
Things we can Compute Using Forward-Backward Probabilities (continued)

Define $\mathrm{count}(i, s \rightarrow s'; \theta)$ to be the expected number of times the transition $s \rightarrow s'$ is seen in the training example $x_{i,1}, x_{i,2}, \ldots, x_{i,m}$, for parameters $\theta$. Then

$$\mathrm{count}(i, s \rightarrow s'; \theta) = \sum_{j=1}^{m-1} p(S_j = s, S_{j+1} = s' \mid x_{i,1} \ldots x_{i,m}; \theta)$$

(We can compute $p(S_j = s, S_{j+1} = s' \mid x_{i,1} \ldots x_{i,m}; \theta)$ using the forward-backward probabilities; see the previous slide.)
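Given the `posteriors` sketch above, the expected transition counts are a sum of the pairwise conditionals over positions:

```python
def expected_transition_counts(xs, pi, T, E):
    """count(i, s -> s'; theta) = sum_{j=1}^{m-1} p(S_j = s, S_{j+1} = s' | x; theta)."""
    _, xi = posteriors(xs, pi, T, E)
    return xi.sum(axis=0)                        # shape (k, k): entry [s, s']
```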
Things we can Compute Using Forward-Backward Probabilities (continued)

For completeness, a formal definition of $\mathrm{count}(i, s \rightarrow s'; \theta)$:

$$\mathrm{count}(i, s \rightarrow s'; \theta) = \sum_{s_1 \ldots s_m} p(s_1 \ldots s_m \mid x_{i,1} \ldots x_{i,m}; \theta) \times \mathrm{count}(s \rightarrow s', s_1 \ldots s_m)$$

where $\mathrm{count}(s \rightarrow s', s_1 \ldots s_m)$ is the number of times the transition $s \rightarrow s'$ is seen in the sequence $s_1 \ldots s_m$.
Things we can Compute Using Forward-Backward Probabilities (continued)

Define $\mathrm{count}(i, s \leadsto z; \theta)$ to be the expected number of times the state $s$ is paired with the emission $z$ in the training sequence $x_{i,1}, x_{i,2}, \ldots, x_{i,m}$, for parameters $\theta$. Then

$$\mathrm{count}(i, s \leadsto z; \theta) = \sum_{j=1}^{m} p(S_j = s \mid x_{i,1} \ldots x_{i,m}; \theta)\, [[x_{i,j} = z]]$$

where $[[x_{i,j} = z]]$ is 1 if $x_{i,j} = z$, 0 otherwise. (We can compute $p(S_j = s \mid x_{i,1} \ldots x_{i,m}; \theta)$ using the forward-backward probabilities; see the previous slides.)
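And the matching sketch for expected emission counts, where the indicator $[[x_{i,j} = z]]$ becomes an indexed accumulation:

```python
import numpy as np

def expected_emission_counts(xs, pi, T, E, V):
    """count(i, s ~> z; theta) = sum_{j=1}^{m} p(S_j = s | x; theta) [[x_{i,j} = z]]."""
    gamma, _ = posteriors(xs, pi, T, E)
    counts = np.zeros((len(pi), V))
    for j, x in enumerate(xs):
        counts[:, x] += gamma[j]                 # indicator picks out column x_{i,j}
    return counts
```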
The EM Algorithm for HMMs

Initialization: set initial parameters $\theta^0$ to some value.

For $t = 1 \ldots T$:

- Use the forward-backward algorithm to compute all expected counts of the form $\mathrm{count}(i, s \rightarrow s'; \theta^{t-1})$ or $\mathrm{count}(i, s \leadsto z; \theta^{t-1})$
- Update the parameters based on the expected counts:

$$t^t(s' \mid s) = \frac{\sum_{i=1}^{n} \mathrm{count}(i, s \rightarrow s'; \theta^{t-1})}{\sum_{i=1}^{n} \sum_{s'} \mathrm{count}(i, s \rightarrow s'; \theta^{t-1})}$$

$$e^t(x \mid s) = \frac{\sum_{i=1}^{n} \mathrm{count}(i, s \leadsto x; \theta^{t-1})}{\sum_{i=1}^{n} \sum_{x} \mathrm{count}(i, s \leadsto x; \theta^{t-1})}$$
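Putting the pieces together, a sketch of the full loop, reusing the helper functions above (in the code, `T` is the transition matrix rather than the slide's iteration count, and the random initialization is one arbitrary choice of $\theta^0$). For simplicity this version holds the initial-state parameters fixed; their update appears on the next slide.

```python
import numpy as np

def em(sequences, k, V, iters=50, seed=0):
    """EM for HMMs: E-step computes expected counts with forward-backward,
    M-step renormalizes them. Transitions and emissions are updated here;
    the initial-state update t(s) is given on the next slide (held fixed
    in this sketch)."""
    rng = np.random.default_rng(seed)
    pi = np.full(k, 1.0 / k)
    T = rng.random((k, k)); T /= T.sum(axis=1, keepdims=True)
    E = rng.random((k, V)); E /= E.sum(axis=1, keepdims=True)
    for _ in range(iters):
        trans = np.zeros((k, k))   # accumulates sum_i count(i, s -> s'; theta)
        emit = np.zeros((k, V))    # accumulates sum_i count(i, s ~> x; theta)
        for xs in sequences:
            trans += expected_transition_counts(xs, pi, T, E)
            emit += expected_emission_counts(xs, pi, T, E, V)
        T = trans / trans.sum(axis=1, keepdims=True)   # t^t(s' | s)
        E = emit / emit.sum(axis=1, keepdims=True)     # e^t(x | s)
    return pi, T, E
```

The toy dataset from the example slide could be fed to this as `sequences = [[0, 2]] * 100 + [[0, 3]] * 100 + [[1, 2]] * 100 + [[1, 3]] * 100` with `V = 4`, under the hypothetical encoding a=0, b=1, x=2, y=3.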
The Initial State Probabilities

For simplicity I've omitted the estimates for the initial state parameters $t(s)$, but these are simple to derive in a similar way to the transition and emission parameters.

For completeness, the expected counts are

$$\mathrm{count}(i, s; \theta^{t-1}) = \frac{\alpha[1, s] \times \beta[1, s]}{\sum_{s} \alpha[m, s]}$$

(the expected number of times state $s$ is seen as the initial state). The parameter updates are then

$$t^t(s) = \frac{\sum_{i=1}^{n} \mathrm{count}(i, s; \theta^{t-1})}{n}$$
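A matching sketch of this update: the expected count is exactly the state posterior at the first position ($\gamma$ at position 1, i.e. `gamma[0]` in the 0-indexed code above).

```python
import numpy as np

def update_initial_probs(sequences, pi, T, E):
    """t^t(s) = (1/n) sum_i count(i, s; theta^{t-1}), where
    count(i, s; theta) = alpha[1, s] * beta[1, s] / sum_s alpha[m, s]."""
    counts = np.zeros(len(pi))
    for xs in sequences:
        gamma, _ = posteriors(xs, pi, T, E)
        counts += gamma[0]                       # expected initial-state count
    return counts / len(sequences)               # divide by n
```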