The IBM Translation Models. Michael Collins, Columbia University


Recap: The Noisy Channel Model
Goal: a translation system from French to English.
Have a model p(e | f) which estimates the conditional probability of any English sentence e given the French sentence f. Use the training corpus to set the parameters.
A Noisy Channel Model has two components:
p(e)      the language model
p(f | e)  the translation model
Giving
p(e | f) = p(e, f) / p(f) = p(e) p(f | e) / Σ_e p(e) p(f | e)
and
argmax_e p(e | f) = argmax_e p(e) p(f | e)

Roadmap for the Next Few Lectures
IBM Models 1 and 2
Phrase-based models

Overview
IBM Model 1
IBM Model 2
EM Training of Models 1 and 2

IBM Model 1: Alignments
How do we model p(f | e)?
The English sentence e has l words e_1 ... e_l, the French sentence f has m words f_1 ... f_m.
An alignment a identifies which English word each French word originated from.
Formally, an alignment a is {a_1, ..., a_m}, where each a_i ∈ {0, 1, ..., l}.
There are (l + 1)^m possible alignments.

IBM Model 1: Alignments
e.g., l = 6, m = 7
e = And the program has been implemented
f = Le programme a ete mis en application
One alignment is {2, 3, 4, 5, 6, 6, 6}
Another (bad!) alignment is {1, 1, 1, 1, 1, 1, 1}

Alignments in the IBM Models
We'll define models for p(a | e, m) and p(f | a, e, m), giving
p(f, a | e, m) = p(a | e, m) p(f | a, e, m)
Also,
p(f | e, m) = Σ_{a ∈ A} p(a | e, m) p(f | a, e, m)
where A is the set of all possible alignments.

A By-Product: Most Likely Alignments
Once we have a model p(f, a | e, m) = p(a | e, m) p(f | a, e, m), we can also calculate, for any alignment a,
p(a | f, e, m) = p(f, a | e, m) / Σ_{a' ∈ A} p(f, a' | e, m)
For a given (f, e) pair, we can also compute the most likely alignment,
a* = argmax_a p(a | f, e, m)
Nowadays, the original IBM models are rarely (if ever) used for translation, but they are used for recovering alignments.

An Example Alignment
French: le conseil a rendu son avis, et nous devons à présent adopter un nouvel avis sur la base de la première position.
English: the council has stated its position, and now, on the basis of the first position, we again have to give our opinion.
Alignment: the/le council/conseil has/à stated/rendu its/son position/avis ,/, and/et now/présent ,/NULL on/sur the/le basis/base of/de the/la first/première position/position ,/NULL we/nous again/NULL have/devons to/a give/adopter our/nouvel opinion/avis ./.

IBM Model 1: Alignments
In IBM Model 1 all alignments a are equally likely:
p(a | e, m) = 1 / (l + 1)^m
This is a major simplifying assumption, but it gets things started...

IBM Model 1: Translation Probabilities
Next step: come up with an estimate for p(f | a, e, m).
In Model 1, this is:
p(f | a, e, m) = ∏_{i=1}^{m} t(f_i | e_{a_i})

e.g., l = 6, m = 7
e = And the program has been implemented
f = Le programme a ete mis en application
a = {2, 3, 4, 5, 6, 6, 6}
p(f | a, e) = t(le | the) × t(programme | program) × t(a | has) × t(ete | been) × t(mis | implemented) × t(en | implemented) × t(application | implemented)

IBM Model 1: The Generative Process
To generate a French string f from an English string e:
Step 1: Pick an alignment a with probability 1 / (l + 1)^m
Step 2: Pick the French words with probability
p(f | a, e, m) = ∏_{i=1}^{m} t(f_i | e_{a_i})
The final result:
p(f, a | e, m) = p(a | e, m) p(f | a, e, m) = (1 / (l + 1)^m) ∏_{i=1}^{m} t(f_i | e_{a_i})
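To make the formula concrete, here is a minimal Python sketch of the Model 1 joint probability (an illustration, not code from the lecture). The table t is a hypothetical dict mapping (French word, English word) pairs to t(f | e), and "NULL" stands for the special English word e_0:

    def model1_joint(f, e, a, t):
        # p(f, a | e, m) = 1/(l+1)^m * prod_i t(f_i | e_{a_i})
        l, m = len(e), len(f)
        e_padded = ["NULL"] + e          # e_0 is the NULL word
        prob = 1.0 / (l + 1) ** m        # uniform alignment probability
        for i in range(m):               # a[i] is 1-based into e; 0 = NULL
            prob *= t[(f[i], e_padded[a[i]])]
        return prob

    e = "And the program has been implemented".split()
    f = "Le programme a ete mis en application".split()
    a = [2, 3, 4, 5, 6, 6, 6]
    # model1_joint(f, e, a, t) gives the joint probability once t is estimated.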

An Example Lexical Entry
English    French      Probability
position   position    0.756715
position   situation   0.0547918
position   mesure      0.0281663
position   vue         0.0169303
position   point       0.0124795
position   attitude    0.0108907
... de la situation au niveau des négociations de l'ompi ...
... of the current position in the wipo negotiations ...
nous ne sommes pas en mesure de décider, ...
we are not in a position to decide, ...
... le point de vue de la commission face à ce problème complexe.
... the commission's position on this complex problem.

Overview
IBM Model 1
IBM Model 2
EM Training of Models 1 and 2

IBM Model 2
Only difference: we now introduce alignment or distortion parameters
q(j | i, l, m) = probability that the i-th French word is connected to the j-th English word, given sentence lengths of e and f are l and m respectively
Define
p(a | e, m) = ∏_{i=1}^{m} q(a_i | i, l, m)
where a = {a_1, ..., a_m}
Gives
p(f, a | e, m) = ∏_{i=1}^{m} q(a_i | i, l, m) t(f_i | e_{a_i})

An Example
l = 6, m = 7
e = And the program has been implemented
f = Le programme a ete mis en application
a = {2, 3, 4, 5, 6, 6, 6}
p(a | e, 7) = q(2 | 1, 6, 7) × q(3 | 2, 6, 7) × q(4 | 3, 6, 7) × q(5 | 4, 6, 7) × q(6 | 5, 6, 7) × q(6 | 6, 6, 7) × q(6 | 7, 6, 7)

An Example
l = 6, m = 7
e = And the program has been implemented
f = Le programme a ete mis en application
a = {2, 3, 4, 5, 6, 6, 6}
p(f | a, e, 7) = t(le | the) × t(programme | program) × t(a | has) × t(ete | been) × t(mis | implemented) × t(en | implemented) × t(application | implemented)

IBM Model 2: The Generative Process
To generate a French string f from an English string e:
Step 1: Pick an alignment a = {a_1, a_2, ..., a_m} with probability
∏_{i=1}^{m} q(a_i | i, l, m)
Step 2: Pick the French words with probability
p(f | a, e, m) = ∏_{i=1}^{m} t(f_i | e_{a_i})
The final result:
p(f, a | e, m) = p(a | e, m) p(f | a, e, m) = ∏_{i=1}^{m} q(a_i | i, l, m) t(f_i | e_{a_i})
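The two-step generative story can be run directly. The following sketch samples a French string under Model 2, assuming hypothetical parameter tables q[(j, i, l, m)] and t[(f, e)] (dicts whose entries are normalized over the first argument) and a list french_vocab; these names are illustrative, not from the lecture:

    import random

    def generate_french(e, q, t, m, french_vocab):
        l = len(e)
        e_padded = ["NULL"] + e
        # Step 1: pick each a_i from q(. | i, l, m)
        a = [random.choices(range(l + 1),
                            weights=[q[(j, i, l, m)] for j in range(l + 1)])[0]
             for i in range(1, m + 1)]
        # Step 2: pick each French word f_i from t(. | e_{a_i})
        f = [random.choices(french_vocab,
                            weights=[t[(w, e_padded[a_i])] for w in french_vocab])[0]
             for a_i in a]
        return f, a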

Recovering Alignments
If we have parameters q and t, we can easily recover the most likely alignment for any sentence pair.
Given a sentence pair e_1, e_2, ..., e_l, f_1, f_2, ..., f_m, define for i = 1 ... m
a_i = argmax_{j ∈ {0 ... l}} q(j | i, l, m) × t(f_i | e_j)
e = And the program has been implemented
f = Le programme a ete mis en application
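Because the alignment variables are independent of one another given the parameters, each a_i can be maximized separately, so recovery is a simple loop. A sketch, reusing the hypothetical q and t dicts from the sketches above:

    def best_alignment(f, e, q, t):
        l, m = len(e), len(f)
        e_padded = ["NULL"] + e
        a = []
        for i in range(1, m + 1):
            # a_i = argmax over j in {0..l} of q(j | i, l, m) * t(f_i | e_j)
            scores = [q[(j, i, l, m)] * t[(f[i - 1], e_padded[j])]
                      for j in range(l + 1)]
            a.append(scores.index(max(scores)))
        return a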

Overview
IBM Model 1
IBM Model 2
EM Training of Models 1 and 2

The Parameter Estimation Problem
Input to the parameter estimation algorithm: (e^(k), f^(k)) for k = 1 ... n. Each e^(k) is an English sentence, each f^(k) is a French sentence.
Output: parameters t(f | e) and q(j | i, l, m).
A key challenge: we do not have alignments on our training examples, e.g.,
e^(100) = And the program has been implemented
f^(100) = Le programme a ete mis en application

Parameter Estimation if the Alignments are Observed
First: consider the case where alignments are observed in the training data. E.g.,
e^(100) = And the program has been implemented
f^(100) = Le programme a ete mis en application
a^(100) = 2, 3, 4, 5, 6, 6, 6
The training data is (e^(k), f^(k), a^(k)) for k = 1 ... n. Each e^(k) is an English sentence, each f^(k) is a French sentence, each a^(k) is an alignment.
Maximum-likelihood parameter estimates in this case are trivial:
t_ML(f | e) = Count(e, f) / Count(e)
q_ML(j | i, l, m) = Count(j | i, l, m) / Count(i, l, m)

Input: A tranng corpus (f (k), e (k), a (k) ) for k = 1... n, where f (k) = f (k) 1... f m (k) k, e (k) = e (k) 1... e (k), a (k) = a (k) 1... a (k) m k. Algorthm: Set all counts c(...) = 0 For k = 1... n For = 1... mk, For = 0... l k, l k c(e (k), f (k) ) c(e (k), f (k) ) + δ(k,, ) c(e (k) ) c(e (k) ) + δ(k,, ) c(, l, m) c(, l, m) + δ(k,, ) c(, l, m) c(, l, m) + δ(k,, ) where δ(k,, ) = 1 f a (k) =, 0 otherwse. Output: t ML (f e) = c(e,f) c(e), q ML(, l, m) = c(,l,m) c(,l,m)

Parameter Estimation with the EM Algorithm
Training examples are (e^(k), f^(k)) for k = 1 ... n. Each e^(k) is an English sentence, each f^(k) is a French sentence.
The algorithm is related to the algorithm used when alignments are observed, but there are two key differences:
1. The algorithm is iterative. We start with some initial (e.g., random) choice for the q and t parameters. At each iteration we compute some counts based on the data together with our current parameter estimates. We then re-estimate our parameters with these counts, and iterate.
2. We use the following definition for δ(k, i, j) at each iteration:
δ(k, i, j) = q(j | i, l_k, m_k) t(f_i^(k) | e_j^(k)) / Σ_{j'=0}^{l_k} q(j' | i, l_k, m_k) t(f_i^(k) | e_{j'}^(k))

Input: A tranng corpus (f (k), e (k) ) for k = 1... n, where f (k) = f (k) 1... f m (k) k, e (k) = e (k) 1... e (k) l k. Intalzaton: Intalze t(f e) and q(, l, m) parameters (e.g., to random values).

For s = 1 ... S
  Set all counts c(...) = 0
  For k = 1 ... n
    For i = 1 ... m_k, for j = 0 ... l_k:
      c(e_j^(k), f_i^(k)) ← c(e_j^(k), f_i^(k)) + δ(k, i, j)
      c(e_j^(k)) ← c(e_j^(k)) + δ(k, i, j)
      c(j | i, l, m) ← c(j | i, l, m) + δ(k, i, j)
      c(i, l, m) ← c(i, l, m) + δ(k, i, j)
    where
      δ(k, i, j) = q(j | i, l_k, m_k) t(f_i^(k) | e_j^(k)) / Σ_{j'=0}^{l_k} q(j' | i, l_k, m_k) t(f_i^(k) | e_{j'}^(k))
  Recalculate the parameters:
    t(f | e) = c(e, f) / c(e)
    q(j | i, l, m) = c(j | i, l, m) / c(i, l, m)

The EM Algorithm for IBM Model 1
For s = 1 ... S
  Set all counts c(...) = 0
  For k = 1 ... n
    For i = 1 ... m_k, for j = 0 ... l_k:
      c(e_j^(k), f_i^(k)) ← c(e_j^(k), f_i^(k)) + δ(k, i, j)
      c(e_j^(k)) ← c(e_j^(k)) + δ(k, i, j)
      c(j | i, l, m) ← c(j | i, l, m) + δ(k, i, j)
      c(i, l, m) ← c(i, l, m) + δ(k, i, j)
    where
      δ(k, i, j) = [1/(1 + l_k)] t(f_i^(k) | e_j^(k)) / Σ_{j'=0}^{l_k} [1/(1 + l_k)] t(f_i^(k) | e_{j'}^(k)) = t(f_i^(k) | e_j^(k)) / Σ_{j'=0}^{l_k} t(f_i^(k) | e_{j'}^(k))
  Recalculate the parameters:
    t(f | e) = c(e, f) / c(e)
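Putting the pieces together, here is a compact sketch of the Model 1 EM loop (an illustration under the same assumed data layout as the earlier sketches; t must be initialized to positive values for every co-occurring (f, e) pair, e.g. uniformly):

    from collections import defaultdict

    def em_model1(corpus, t, S=10):
        # corpus: list of (f, e) pairs; t: dict (f_word, e_word) -> t(f|e)
        for _ in range(S):
            c_ef = defaultdict(float)
            c_e = defaultdict(float)
            for f, e in corpus:
                e_padded = ["NULL"] + e
                for f_i in f:
                    # delta(k, i, j): the 1/(1+l_k) factors cancel in the ratio
                    norm = sum(t[(f_i, e_j)] for e_j in e_padded)
                    for e_j in e_padded:
                        delta = t[(f_i, e_j)] / norm
                        c_ef[(e_j, f_i)] += delta
                        c_e[e_j] += delta
            # recalculate: t(f|e) = c(e, f) / c(e)
            t = {(f_w, e_w): c / c_e[e_w] for (e_w, f_w), c in c_ef.items()}
        return t

On a toy corpus, a few iterations are enough to see probability mass concentrate on consistently co-occurring word pairs.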

δ(k, i, j) = q(j | i, l_k, m_k) t(f_i^(k) | e_j^(k)) / Σ_{j'=0}^{l_k} q(j' | i, l_k, m_k) t(f_i^(k) | e_{j'}^(k))
e^(100) = And the program has been implemented
f^(100) = Le programme a ete mis en application

Justification for the Algorithm
Training examples are (e^(k), f^(k)) for k = 1 ... n. Each e^(k) is an English sentence, each f^(k) is a French sentence.
The log-likelihood function:
L(t, q) = Σ_{k=1}^{n} log p(f^(k) | e^(k)) = Σ_{k=1}^{n} log Σ_a p(f^(k), a | e^(k))
The maximum-likelihood estimates are
argmax_{t,q} L(t, q)
The EM algorithm will converge to a local maximum of the log-likelihood function.
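Note that although the sum over alignments has (l + 1)^m terms, it factorizes because the a_i are independent: p(f | e, m) = ∏_{i=1}^{m} Σ_{j=0}^{l} q(j | i, l, m) t(f_i | e_j), so the log-likelihood is cheap to evaluate and can be used to monitor EM's progress. A sketch with the same hypothetical q and t tables as above:

    import math

    def log_likelihood(corpus, q, t):
        L = 0.0
        for f, e in corpus:
            l, m = len(e), len(f)
            e_padded = ["NULL"] + e
            for i in range(1, m + 1):
                # inner sum over j = 0 ... l for French position i
                L += math.log(sum(q[(j, i, l, m)] * t[(f[i - 1], e_padded[j])]
                                  for j in range(l + 1)))
        return L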

Summary
Key ideas in the IBM translation models:
Alignment variables
Translation parameters, e.g., t(chien | dog)
Distortion parameters, e.g., q(2 | 1, 6, 7)
The EM algorithm: an iterative algorithm for training the q and t parameters
Once the parameters are trained, we can recover the most likely alignments on our training examples:
e = And the program has been implemented
f = Le programme a ete mis en application