The IBM Translation Models
Michael Collins, Columbia University
Recap: The Noisy Channel Model

Goal: a translation system from French to English.

Have a model p(e | f) which estimates the conditional probability of any English sentence e given the French sentence f. Use the training corpus to set the parameters.

A Noisy Channel Model has two components:

  p(e)      the language model
  p(f | e)  the translation model

Giving:

  p(e | f) = p(e, f) / p(f) = p(e) p(f | e) / Σ_e p(e) p(f | e)

and

  argmax_e p(e | f) = argmax_e p(e) p(f | e)
Roadmap for the Next Few Lectures

  IBM Models 1 and 2
  Phrase-based models
Overview

  IBM Model 1
  IBM Model 2
  EM Training of Models 1 and 2
IBM Model 1: Alignments

How do we model p(f | e)?

English sentence e has l words e_1 ... e_l, French sentence f has m words f_1 ... f_m.

An alignment a identifies which English word each French word originated from.

Formally, an alignment a is {a_1, ..., a_m}, where each a_i ∈ {0 ... l}.

There are (l + 1)^m possible alignments.
IBM Model 1: Alignments

e.g., l = 6, m = 7

  e = And the program has been implemented
  f = Le programme a ete mis en application

One alignment is {2, 3, 4, 5, 6, 6, 6}

Another (bad!) alignment is {1, 1, 1, 1, 1, 1, 1}
Alignments in the IBM Models

We'll define models for p(a | e, m) and p(f | a, e, m), giving

  p(f, a | e, m) = p(a | e, m) p(f | a, e, m)

Also,

  p(f | e, m) = Σ_{a ∈ A} p(a | e, m) p(f | a, e, m)

where A is the set of all possible alignments
A By-Product: Most Likely Alignments

Once we have a model p(f, a | e, m) = p(a | e, m) p(f | a, e, m), we can also calculate, for any alignment a,

  p(a | f, e, m) = p(f, a | e, m) / Σ_{a' ∈ A} p(f, a' | e, m)

For a given (f, e) pair, we can also compute the most likely alignment,

  a* = argmax_a p(a | f, e, m)

Nowadays, the original IBM models are rarely (if ever) used for translation, but they are used for recovering alignments
An Example Alignment

French: le conseil a rendu son avis, et nous devons à présent adopter un nouvel avis sur la base de la première position.

English: the council has stated its position, and now, on the basis of the first position, we again have to give our opinion.

Alignment: the/le council/conseil has/à stated/rendu its/son position/avis ,/, and/et now/présent ,/null on/sur the/le basis/base of/de the/la first/première position/position ,/null we/nous again/null have/devons to/a give/adopter our/nouvel opinion/avis ./.
IBM Model 1: Alignments

In IBM Model 1 all alignments a are equally likely:

  p(a | e, m) = 1 / (l + 1)^m

This is a major simplifying assumption, but it gets things started...
IBM Model 1: Translation Probabilities

Next step: come up with an estimate for p(f | a, e, m)

In Model 1, this is:

  p(f | a, e, m) = ∏_{i=1}^m t(f_i | e_{a_i})
e.g., l = 6, m = 7

  e = And the program has been implemented
  f = Le programme a ete mis en application
  a = {2, 3, 4, 5, 6, 6, 6}

p(f | a, e) = t(le | the) × t(programme | program) × t(a | has) × t(ete | been) × t(mis | implemented) × t(en | implemented) × t(application | implemented)
IBM Model 1: The Generative Process

To generate a French string f from an English string e:

  Step 1: Pick an alignment a with probability 1 / (l + 1)^m

  Step 2: Pick the French words with probability

    p(f | a, e, m) = ∏_{i=1}^m t(f_i | e_{a_i})

The final result:

  p(f, a | e, m) = p(a | e, m) p(f | a, e, m) = (1 / (l + 1)^m) ∏_{i=1}^m t(f_i | e_{a_i})
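The joint probability above can be sketched in a few lines of Python. This is a minimal sketch: the function name and dictionary layout are illustrative choices, and the translation table `t` in the usage example holds hypothetical toy values, not trained parameters.

```python
def model1_joint(f_words, e_words, alignment, t):
    """p(f, a | e, m) = 1/(l+1)^m * product over i of t(f_i | e_{a_i}).

    e_words[0] is the special NULL word; alignment[i] = 0 aligns the
    (i+1)-th French word to NULL.  t maps (french, english) -> probability.
    """
    l = len(e_words) - 1              # English length, excluding NULL
    m = len(f_words)
    prob = 1.0 / (l + 1) ** m         # all (l+1)^m alignments are equally likely
    for i, j in enumerate(alignment):
        prob *= t.get((f_words[i], e_words[j]), 0.0)
    return prob

# Toy example (hypothetical t values): with both lexical probabilities at 1,
# the joint probability is just the uniform alignment term 1/(l+1)^m = 1/9.
t = {("le", "the"): 1.0, ("chien", "dog"): 1.0}
p = model1_joint(["le", "chien"], ["NULL", "the", "dog"], [1, 2], t)
```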
An Example Lexical Entry

  English    French      Probability
  position   position    0.756715
  position   situation   0.0547918
  position   mesure      0.0281663
  position   vue         0.0169303
  position   point       0.0124795
  position   attitude    0.0108907

  ... de la situation au niveau des négociations de l'ompi ...
  ... of the current position in the wipo negotiations ...

  nous ne sommes pas en mesure de décider, ...
  we are not in a position to decide, ...

  ... le point de vue de la commission face à ce problème complexe.
  ... the commission's position on this complex problem.
Overview

  IBM Model 1
  IBM Model 2
  EM Training of Models 1 and 2
IBM Model 2

Only difference: we now introduce alignment or distortion parameters

  q(j | i, l, m) = probability that the i-th French word is connected to the j-th English word, given that the sentence lengths of e and f are l and m respectively

Define

  p(a | e, m) = ∏_{i=1}^m q(a_i | i, l, m)

where a = {a_1, ..., a_m}

Gives

  p(f, a | e, m) = ∏_{i=1}^m q(a_i | i, l, m) t(f_i | e_{a_i})
An Example

l = 6, m = 7

  e = And the program has been implemented
  f = Le programme a ete mis en application
  a = {2, 3, 4, 5, 6, 6, 6}

p(a | e, 7) = q(2 | 1, 6, 7) × q(3 | 2, 6, 7) × q(4 | 3, 6, 7) × q(5 | 4, 6, 7) × q(6 | 5, 6, 7) × q(6 | 6, 6, 7) × q(6 | 7, 6, 7)
An Example

l = 6, m = 7

  e = And the program has been implemented
  f = Le programme a ete mis en application
  a = {2, 3, 4, 5, 6, 6, 6}

p(f | a, e, 7) = t(le | the) × t(programme | program) × t(a | has) × t(ete | been) × t(mis | implemented) × t(en | implemented) × t(application | implemented)
IBM Model 2: The Generative Process

To generate a French string f from an English string e:

  Step 1: Pick an alignment a = {a_1, a_2, ..., a_m} with probability

    ∏_{i=1}^m q(a_i | i, l, m)

  Step 2: Pick the French words with probability

    p(f | a, e, m) = ∏_{i=1}^m t(f_i | e_{a_i})

The final result:

  p(f, a | e, m) = p(a | e, m) p(f | a, e, m) = ∏_{i=1}^m q(a_i | i, l, m) t(f_i | e_{a_i})
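A minimal sketch of Model 2's joint probability, assuming (as an illustrative storage choice) that t is a dictionary keyed by (french, english) and q by (j, i, l, m):

```python
def model2_joint(f_words, e_words, alignment, t, q):
    """p(f, a | e, m) = product over i of q(a_i | i, l, m) * t(f_i | e_{a_i}).

    e_words[0] is the special NULL word; French positions i run from 1 to m,
    English positions j from 0 to l.
    """
    l = len(e_words) - 1
    m = len(f_words)
    prob = 1.0
    for i, j in enumerate(alignment, start=1):
        prob *= q.get((j, i, l, m), 0.0) * t.get((f_words[i - 1], e_words[j]), 0.0)
    return prob
```

With uniform distortion parameters, q(j | i, l, m) = 1/(l+1) for every j, this reduces exactly to Model 1.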
Recovering Alignments

If we have parameters q and t, we can easily recover the most likely alignment for any sentence pair.

Given a sentence pair e_1, e_2, ..., e_l, f_1, f_2, ..., f_m, define for i = 1 ... m

  a_i = argmax_{j ∈ {0...l}} q(j | i, l, m) × t(f_i | e_j)

  e = And the program has been implemented
  f = Le programme a ete mis en application
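Because p(f, a | e, m) factorizes across French positions, the argmax over alignments decomposes into an independent argmax per position. A sketch, assuming t keyed by (french, english) and q by (j, i, l, m):

```python
def best_alignment(f_words, e_words, t, q):
    """a_i = argmax over j in {0..l} of q(j | i, l, m) * t(f_i | e_j).

    The joint model factorizes over French positions, so each a_i can be
    chosen independently.  e_words[0] is NULL; returns the list a_1 ... a_m.
    """
    l = len(e_words) - 1
    m = len(f_words)
    a = []
    for i in range(1, m + 1):
        a.append(max(range(l + 1),
                     key=lambda j: q.get((j, i, l, m), 0.0)
                                   * t.get((f_words[i - 1], e_words[j]), 0.0)))
    return a
```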
Overview

  IBM Model 1
  IBM Model 2
  EM Training of Models 1 and 2
The Parameter Estimation Problem

Input to the parameter estimation algorithm: (e^(k), f^(k)) for k = 1 ... n. Each e^(k) is an English sentence, each f^(k) is a French sentence.

Output: parameters t(f | e) and q(j | i, l, m)

A key challenge: we do not have alignments on our training examples, e.g.,

  e^(100) = And the program has been implemented
  f^(100) = Le programme a ete mis en application
Parameter Estimation if the Alignments are Observed

First: the case where alignments are observed in the training data. E.g.,

  e^(100) = And the program has been implemented
  f^(100) = Le programme a ete mis en application
  a^(100) = {2, 3, 4, 5, 6, 6, 6}

Training data is (e^(k), f^(k), a^(k)) for k = 1 ... n. Each e^(k) is an English sentence, each f^(k) is a French sentence, each a^(k) is an alignment.

Maximum-likelihood parameter estimates in this case are trivial:

  t_ML(f | e) = Count(e, f) / Count(e)

  q_ML(j | i, l, m) = Count(j, i, l, m) / Count(i, l, m)
Input: A tranng corpus (f (k), e (k), a (k) ) for k = 1... n, where f (k) = f (k) 1... f m (k) k, e (k) = e (k) 1... e (k), a (k) = a (k) 1... a (k) m k. Algorthm: Set all counts c(...) = 0 For k = 1... n For = 1... mk, For = 0... l k, l k c(e (k), f (k) ) c(e (k), f (k) ) + δ(k,, ) c(e (k) ) c(e (k) ) + δ(k,, ) c(, l, m) c(, l, m) + δ(k,, ) c(, l, m) c(, l, m) + δ(k,, ) where δ(k,, ) = 1 f a (k) =, 0 otherwse. Output: t ML (f e) = c(e,f) c(e), q ML(, l, m) = c(,l,m) c(,l,m)
Parameter Estimation with the EM Algorithm

Training examples are (e^(k), f^(k)) for k = 1 ... n. Each e^(k) is an English sentence, each f^(k) is a French sentence.

The algorithm is related to the algorithm used when alignments are observed, but there are two key differences:

1. The algorithm is iterative. We start with some initial (e.g., random) choice for the q and t parameters. At each iteration we compute some counts based on the data together with our current parameter estimates. We then re-estimate our parameters with these counts, and iterate.

2. We use the following definition for δ(k, i, j) at each iteration:

  δ(k, i, j) = q(j | i, l_k, m_k) t(f^(k)_i | e^(k)_j) / Σ_{j'=0}^{l_k} q(j' | i, l_k, m_k) t(f^(k)_i | e^(k)_{j'})
Input: A tranng corpus (f (k), e (k) ) for k = 1... n, where f (k) = f (k) 1... f m (k) k, e (k) = e (k) 1... e (k) l k. Intalzaton: Intalze t(f e) and q(, l, m) parameters (e.g., to random values).
For s = 1 ... S

  Set all counts c(...) = 0

  For k = 1 ... n
    For i = 1 ... m_k, For j = 0 ... l_k
      c(e^(k)_j, f^(k)_i) ← c(e^(k)_j, f^(k)_i) + δ(k, i, j)
      c(e^(k)_j) ← c(e^(k)_j) + δ(k, i, j)
      c(j, i, l, m) ← c(j, i, l, m) + δ(k, i, j)
      c(i, l, m) ← c(i, l, m) + δ(k, i, j)

    where

      δ(k, i, j) = q(j | i, l_k, m_k) t(f^(k)_i | e^(k)_j) / Σ_{j'=0}^{l_k} q(j' | i, l_k, m_k) t(f^(k)_i | e^(k)_{j'})

  Recalculate the parameters:

    t(f | e) = c(e, f) / c(e)

    q(j | i, l, m) = c(j, i, l, m) / c(i, l, m)
The EM Algorithm for IBM Model 1

For s = 1 ... S

  Set all counts c(...) = 0

  For k = 1 ... n
    For i = 1 ... m_k, For j = 0 ... l_k
      c(e^(k)_j, f^(k)_i) ← c(e^(k)_j, f^(k)_i) + δ(k, i, j)
      c(e^(k)_j) ← c(e^(k)_j) + δ(k, i, j)

    where

      δ(k, i, j) = (1/(1 + l_k)) t(f^(k)_i | e^(k)_j) / Σ_{j'=0}^{l_k} (1/(1 + l_k)) t(f^(k)_i | e^(k)_{j'})
                 = t(f^(k)_i | e^(k)_j) / Σ_{j'=0}^{l_k} t(f^(k)_i | e^(k)_{j'})

  Recalculate the parameters:

    t(f | e) = c(e, f) / c(e)
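The Model 1 updates above can be sketched directly: the uniform 1/(1 + l_k) alignment terms cancel in δ, so only t(f | e) needs to be estimated. A minimal sketch, assuming each corpus entry is a (f_words, e_words) pair with e_words[0] the NULL word:

```python
from collections import defaultdict

def em_model1(corpus, iterations=10):
    """EM training of t(f | e) for IBM Model 1."""
    # Initialization: any positive values work; here t starts constant,
    # which makes the first delta uniform over the English words.
    t = defaultdict(float)
    for f_words, e_words in corpus:
        for f in f_words:
            for e in e_words:
                t[(f, e)] = 1.0

    for _ in range(iterations):
        c_ef = defaultdict(float)    # expected count c(e, f)
        c_e = defaultdict(float)     # expected count c(e)
        for f_words, e_words in corpus:
            for f in f_words:
                # delta(k, i, j) = t(f_i | e_j) / sum_j' t(f_i | e_j')
                norm = sum(t[(f, e)] for e in e_words)
                for e in e_words:
                    delta = t[(f, e)] / norm
                    c_ef[(e, f)] += delta
                    c_e[e] += delta
        # Recalculate the parameters: t(f | e) = c(e, f) / c(e)
        for (e, f) in c_ef:
            t[(f, e)] = c_ef[(e, f)] / c_e[e]
    return t
```

On a tiny corpus such as {le chien / the dog, le chat / the cat}, the expected counts quickly concentrate t mass on le given the, since "le" co-occurs with "the" in every pair.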
δ(k, i, j) = q(j | i, l_k, m_k) t(f^(k)_i | e^(k)_j) / Σ_{j'=0}^{l_k} q(j' | i, l_k, m_k) t(f^(k)_i | e^(k)_{j'})

  e^(100) = And the program has been implemented
  f^(100) = Le programme a ete mis en application
Justification for the Algorithm

Training examples are (e^(k), f^(k)) for k = 1 ... n. Each e^(k) is an English sentence, each f^(k) is a French sentence.

The log-likelihood function:

  L(t, q) = Σ_{k=1}^n log p(f^(k) | e^(k)) = Σ_{k=1}^n log Σ_a p(f^(k), a | e^(k))

The maximum-likelihood estimates are

  argmax_{t,q} L(t, q)

The EM algorithm will converge to a local maximum of the log-likelihood function
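For Model 1 the sum over alignments inside the log is tractable, because the model factorizes: Σ_a ∏_i t(f_i | e_{a_i}) = ∏_i Σ_j t(f_i | e_j), so p(f | e, m) = (1/(l+1)^m) ∏_i Σ_j t(f_i | e_j). A sketch of the log-likelihood under that factorization (Model 1 only; assumes t assigns positive mass to each French word in the corpus):

```python
import math

def log_likelihood_model1(corpus, t):
    """L(t) = sum_k log p(f^(k) | e^(k)) under Model 1.

    Uses p(f | e, m) = 1/(l+1)^m * prod_i sum_j t(f_i | e_j), where
    e_words[0] is NULL and t maps (french, english) -> probability.
    """
    total = 0.0
    for f_words, e_words in corpus:
        l, m = len(e_words) - 1, len(f_words)
        total += -m * math.log(l + 1)
        for f in f_words:
            total += math.log(sum(t.get((f, e), 0.0) for e in e_words))
    return total
```

Evaluating this once per EM iteration is a standard sanity check: the value should never decrease.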
Summary

Key ideas in the IBM translation models:

  Alignment variables

  Translation parameters, e.g., t(chien | dog)

  Distortion parameters, e.g., q(2 | 1, 6, 7)

The EM algorithm: an iterative algorithm for training the q and t parameters

Once the parameters are trained, we can recover the most likely alignments on our training examples

  e = And the program has been implemented
  f = Le programme a ete ms en applcaton