Computer Vision Group Prof. Daniel Cremers. 7. Sequential Data
1 Computer Vision Group, Prof. Daniel Cremers: 7. Sequential Data
2 Bayes Filter (Rep.)
We can describe the overall process using a Dynamic Bayes Network. This incorporates the following Markov assumptions:
$p(z_t \mid x_{0:t}, z_{1:t-1}, u_{1:t}) = p(z_t \mid x_t)$ (measurement)
$p(x_t \mid x_{0:t-1}, z_{1:t-1}, u_{1:t}) = p(x_t \mid x_{t-1}, u_t)$ (state)
3 Bayes Filter Without Actions
Removing the action variables we obtain (notation differs from Bishop; all variables are discrete). This incorporates the following Markov assumptions:
$p(z_t \mid x_{0:t}, z_{1:t-1}) = p(z_t \mid x_t)$ (measurement)
$p(x_t \mid x_{0:t-1}, z_{1:t-1}) = p(x_t \mid x_{t-1})$ (state)
4 A Model for Sequential Data
Observations in sequential data should not be modeled as independent variables $z_1, z_2, z_3, z_4, z_5$ (unconnected nodes).
Examples: weather forecast, speech, handwritten text, etc.
The observation at time $t$ depends on the observation(s) of (an) earlier time step(s): $z_1 \to z_2 \to z_3 \to z_4 \to z_5$ (a chain).
5 A Model for Sequential Data
For the first-order chain $z_1 \to z_2 \to \ldots \to z_5$, the joint distribution is therefore (by d-separation)
$$p(z_1, \ldots, z_n) = p(z_1) \prod_{i=2}^{n} p(z_i \mid z_{i-1})$$
However: often the data depends on several earlier observations (not just one), e.g. a chain where each $z_i$ has links from both $z_{i-1}$ and $z_{i-2}$.
6 A Model for Sequential Data
For such a second-order chain:
$$p(z_1, \ldots, z_n) = p(z_1)\, p(z_2 \mid z_1) \prod_{i=3}^{n} p(z_i \mid z_{i-1}, z_{i-2})$$
Problem: the number of stored parameters grows exponentially with the order of the Markov chain.
Question: can we model the dependency on all previous observations with a limited number of parameters?
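As a quick sanity check of this growth (a worked example, not from the slides): a discrete chain with $K$ states and Markov order $m$ needs one conditional distribution per configuration of the $m$ predecessors, i.e. $K^m (K-1)$ free parameters. For $K = 10$ this gives 90 parameters for order 1, 900 for order 2 and 9000 for order 3, so the count indeed grows exponentially with the order.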
7 A Model for Sequential Data
Idea: introduce hidden (unobserved) variables $x_1 \to x_2 \to x_3 \to x_4 \to x_5$ forming a Markov chain, with each observation $z_i$ depending only on $x_i$.
8 A Model for Sequential Data
Idea: introduce hidden (unobserved) variables $x_1, \ldots, x_5$ as above.
Now we have $\mathrm{dsep}(x_n, \{x_1, \ldots, x_{n-2}\} \mid x_{n-1})$, so $p(x_n \mid x_1, \ldots, x_{n-2}, x_{n-1}) = p(x_n \mid x_{n-1})$.
But $\mathrm{dsep}(z_n, \{z_1, \ldots, z_{n-2}\} \mid z_{n-1})$ does not hold, so in general $p(z_n \mid z_1, \ldots, z_{n-2}, z_{n-1}) \neq p(z_n \mid z_{n-1})$.
And: the number of parameters is $nK(K-1) + \text{const.}$
9 Example
Place recognition for mobile robots, with 3 different states: corridor, room, doorway.
Problem: misclassifications. Idea: use information from the previous time step.
10 General Formulation of an HMM
1. Discrete random variables
Observation variables: $\{z_n\}$, $n = 1, \ldots, N$
Discrete state variables (unobservable): $\{x_n\}$, $n = 1, \ldots, N$
Number of states $K$: $x_n \in \{1, \ldots, K\}$
2. Transition model $p(x_i \mid x_{i-1})$ with model parameters $\theta$
Markov assumption ($x_i$ only depends on $x_{i-1}$)
Represented as a $K \times K$ transition matrix $A$
Initial probability: $p(x_0)$, represented as $\pi_1, \pi_2, \pi_3$
3. Observation model $p(z_i \mid x_i)$ with parameters $\varphi$
The observation only depends on the current state
Example: output of a local place classifier (a concrete parameterization is sketched below)
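To make this parameterization concrete, here is a minimal Python sketch (not from the lecture) of the place-recognition HMM with $K = 3$ states; all numerical values of $\pi$, $A$ and $\varphi$ are invented for illustration.

```python
import numpy as np

# Hypothetical place-recognition HMM with K = 3 states:
# 0 = corridor, 1 = room, 2 = doorway. All numbers are made up.
pi = np.array([0.5, 0.3, 0.2])          # initial state probabilities

# Transition matrix A, A[j, k] = p(x_n = k | x_{n-1} = j); rows sum to 1.
A = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.3, 0.3, 0.4]])

# Observation model phi, phi[k, m] = p(z_n = m | x_n = k), e.g. the confusion
# behaviour of a local place classifier with 3 possible output labels.
phi = np.array([[0.7, 0.2, 0.1],
                [0.2, 0.7, 0.1],
                [0.2, 0.2, 0.6]])
```

The same three arrays are reused in the algorithm sketches further below.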
11 The Trellis Representation
Trellis diagram: the $K$ states are unrolled over time; columns correspond to time steps $n-2$, $n-1$, $n$, and the edges between columns carry the transition probabilities $A_{jk}$ (e.g. $A_{11}$, $A_{33}$ for the self-transitions).
12 Application Example (1)
Given an observation sequence $z_1, z_2, z_3, \ldots$
Assume that the model parameters $\theta = (A, \pi, \varphi)$ are known.
What is the probability that the given observation sequence is actually observed under this model, i.e. the data likelihood $p(z \mid \theta)$?
If we are given several different models, we can choose the one with the highest probability.
Expressed as a supervised learning problem, this can be interpreted as the inference step (classification step).
13 Application Example (2)
Based on the data likelihood we can solve two different kinds of problems:
Filtering: computes $p(x_n \mid z_{1:n})$, i.e. the state probability based only on previous observations.
Smoothing: computes $p(x_n \mid z_{1:N})$, the state probability based on all observations (including those from the future).
Figure: filtered vs. smoothed estimates of $p(\text{loaded})$ plotted over the roll number.
14 Application Example (3)
Given an observation sequence $z_1, z_2, z_3, \ldots$
Assume that the model parameters $\theta = (A, \pi, \varphi)$ are known.
What is the state sequence $x_1, x_2, x_3, \ldots$ that best explains the given observation sequence?
In the case of place recognition: which sequence of truly visited places best explains the sequence of obtained place labels (classifications)?
15 Application Example (4)
Given an observation sequence $z_1, z_2, z_3, \ldots$
What are the optimal model parameters $\theta = (A, \pi, \varphi)$?
This can be interpreted as the training step. It is in general the most difficult problem.
16 Summary: 4 Operations on HMMs
1. Compute the data likelihood $p(z \mid \theta)$ from a known model: computed with the forward algorithm.
2. Filtering or smoothing of the state probability: filtering with the forward algorithm, smoothing with the forward-backward algorithm.
3. Compute the optimal state sequence with a known model: computed with the Viterbi algorithm.
4. Learn the model parameters for an observation sequence: computed using Expectation-Maximization (Baum-Welch).
17 The Forward Algorithm
Goal: compute $p(z \mid \theta)$ (we drop $\theta$ in the following).
$$p(z_1, \ldots, z_n) = \sum_{x_n} p(z_1, \ldots, z_n, x_n) =: \sum_{x_n} \alpha(x_n)$$

18 The Forward Algorithm
We can calculate $\alpha$ recursively:
$$\alpha(x_n) = p(z_n \mid x_n) \sum_{x_{n-1}} \alpha(x_{n-1})\, p(x_n \mid x_{n-1})$$

19 The Forward Algorithm
This is (almost) the same recursive formula as we had in the first lecture!

20 The Forward Algorithm
Filtering:
$$p(x_n \mid z_1, \ldots, z_n) = \frac{p(z_1, \ldots, z_n, x_n)}{p(z_1, \ldots, z_n)} = \frac{\alpha(x_n)}{\sum_{x_n} \alpha(x_n)}$$
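A compact sketch of this forward recursion and the resulting filter distribution, assuming the `pi`, `A`, `phi` arrays from the earlier sketch; it follows the formulas above but is not code from the lecture.

```python
import numpy as np

def forward(z, pi, A, phi):
    """Forward pass: alpha[n, k] = p(z_1, ..., z_n, x_n = k)."""
    N, K = len(z), len(pi)
    alpha = np.zeros((N, K))
    alpha[0] = pi * phi[:, z[0]]                      # alpha(x_1) = p(x_1) p(z_1 | x_1)
    for n in range(1, N):
        # alpha(x_n) = p(z_n | x_n) * sum_{x_{n-1}} alpha(x_{n-1}) p(x_n | x_{n-1})
        alpha[n] = phi[:, z[n]] * (alpha[n - 1] @ A)
    return alpha

z = [0, 0, 2, 1, 1]                                   # example observation sequence
alpha = forward(z, pi, A, phi)
likelihood = alpha[-1].sum()                          # p(z | theta)
filtered = alpha / alpha.sum(axis=1, keepdims=True)   # p(x_n | z_1, ..., z_n)
```

In practice the alpha values are rescaled at every step to avoid numerical underflow; this is the "appropriate scaling" mentioned in the summary slide.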
21 The Forward-Backward Algorithm
As before we set $\alpha(x_n) = p(z_1, \ldots, z_n, x_n)$. We also define
$$\beta(x_n) = p(z_{n+1}, \ldots, z_N \mid x_n)$$
Figure: the hidden chain $x_1, \ldots, x_N$ with observations $z_1, \ldots, z_N$; for e.g. $n = 5$, $\alpha$ covers the observations up to $z_n$ and $\beta$ the observations from $z_{n+1}$ on.

22 The Forward-Backward Algorithm
$\beta$ can be computed recursively (backwards):
$$\begin{aligned}
\beta(x_{n-1}) &= p(z_n, \ldots, z_N \mid x_{n-1}) = \sum_{x_n} p(x_n, z_n, \ldots, z_N \mid x_{n-1}) \\
&= \sum_{x_n} p(z_{n+1}, \ldots, z_N \mid x_n, z_n, x_{n-1})\, p(x_n, z_n \mid x_{n-1}) \\
&= \sum_{x_n} p(z_{n+1}, \ldots, z_N \mid x_n)\, p(z_n \mid x_n, x_{n-1})\, p(x_n \mid x_{n-1}) \\
&= \sum_{x_n} \beta(x_n)\, p(z_n \mid x_n)\, p(x_n \mid x_{n-1})
\end{aligned}$$

23 The Forward-Backward Algorithm
Shifting indices, the backward recursion reads
$$\beta(x_n) = \sum_{x_{n+1}} \beta(x_{n+1})\, p(z_{n+1} \mid x_{n+1})\, p(x_{n+1} \mid x_n)$$
This is also known as the message-passing ("sum-product") algorithm: forward messages $\alpha_n$ (a vector of length $K$) and backward messages $\beta_n$ (a vector of length $K$).
24 Smoothing with Forward-Backward
First we compute $p(x_n, z_1, \ldots, z_N)$:
$$\begin{aligned}
p(x_n, z_1, \ldots, z_N) &= p(z_1, \ldots, z_N \mid x_n)\, p(x_n) \\
&= p(z_1, \ldots, z_n \mid x_n)\, p(z_{n+1}, \ldots, z_N \mid x_n)\, p(x_n) \\
&= p(z_1, \ldots, z_n, x_n)\, p(z_{n+1}, \ldots, z_N \mid x_n) \\
&= \alpha(x_n)\, \beta(x_n)
\end{aligned}$$

25 Smoothing with Forward-Backward
With that we can compute $p(z_1, \ldots, z_N)$:
$$p(z_1, \ldots, z_N) = \sum_{x_n} p(x_n, z_1, \ldots, z_N) = \sum_{x_n} \alpha(x_n)\, \beta(x_n)$$

26 Smoothing with Forward-Backward
And finally:
$$p(x_n \mid z_1, \ldots, z_N) = \frac{p(x_n, z_1, \ldots, z_N)}{p(z_1, \ldots, z_N)} = \frac{\alpha(x_n)\, \beta(x_n)}{\sum_{x_n} \alpha(x_n)\, \beta(x_n)}$$
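A matching sketch of the backward recursion and the smoothed posterior, using the same conventions and the forward function from the sketch above.

```python
import numpy as np

def backward(z, A, phi):
    """Backward pass: beta[n, k] = p(z_{n+1}, ..., z_N | x_n = k)."""
    N, K = len(z), A.shape[0]
    beta = np.zeros((N, K))
    beta[-1] = 1.0                                    # beta(x_N) = 1 by convention
    for n in range(N - 2, -1, -1):
        # beta(x_n) = sum_{x_{n+1}} beta(x_{n+1}) p(z_{n+1} | x_{n+1}) p(x_{n+1} | x_n)
        beta[n] = A @ (phi[:, z[n + 1]] * beta[n + 1])
    return beta

def smooth(z, pi, A, phi):
    """Smoothed posterior gamma[n, k] = p(x_n = k | z_1, ..., z_N)."""
    alpha, beta = forward(z, pi, A, phi), backward(z, A, phi)
    gamma = alpha * beta                              # proportional to p(x_n, z_1, ..., z_N)
    return gamma / gamma.sum(axis=1, keepdims=True)   # normalize by p(z_1, ..., z_N)
```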
27 2. Computing the Most Likely States
Goal: find a state sequence $x_1, x_2, x_3, \ldots$ that maximizes the probability $p(x, z \mid \theta)$.
Define
$$\delta(x_n) = \max_{x_1, \ldots, x_{n-1}} p(x_1, \ldots, x_n, z_1, \ldots, z_n)$$
This is the probability of reaching state $x_n$ by taking the most probable path.
Figure: the chain around $x_{n-1}, x_n, x_{n+1}$ with observations $z_{n-1}, z_n, z_{n+1}$.

28 2. Computing the Most Likely States
This can be computed recursively:
$$\delta(x_n) = \max_{x_{n-1}} \delta(x_{n-1})\, p(x_n \mid x_{n-1})\, p(z_n \mid x_n)$$
We also have to compute the argmax:
$$\psi(x_n) = \arg\max_{x_{n-1}} \delta(x_{n-1})\, p(x_n \mid x_{n-1})\, p(z_n \mid x_n)$$
29 The Viterbi Algorithm
Initialize: $\delta(x_0) = p(x_0)\, p(z_0 \mid x_0)$, $\psi(x_0) = 0$.
Compute recursively for $n = 1, \ldots, N$:
$$\delta(x_n) = p(z_n \mid x_n) \max_{x_{n-1}} \left[ \delta(x_{n-1})\, p(x_n \mid x_{n-1}) \right], \qquad
\psi(x_n) = \arg\max_{x_{n-1}} \left[ \delta(x_{n-1})\, p(x_n \mid x_{n-1}) \right]$$
On termination: $p^{\ast} = \max_{x_N} \delta(x_N)$ (the probability of the best path under the model $\theta$), and $x_N^{\ast} = \arg\max_{x_N} \delta(x_N)$.
Backtracking: $x_n^{\ast} = \psi(x_{n+1}^{\ast})$.
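The following sketch implements the Viterbi recursion with backtracking; it works in log space to avoid underflow, which is a common practical variant rather than the literal slide formulation, and again assumes the `pi`, `A`, `phi` arrays from the earlier sketch.

```python
import numpy as np

def viterbi(z, pi, A, phi):
    """Most likely state sequence arg max_x p(x, z | theta), computed in log space."""
    N, K = len(z), len(pi)
    log_A, log_phi = np.log(A), np.log(phi)
    delta = np.zeros((N, K))           # delta[n, k]: best log-probability of a path ending in state k
    psi = np.zeros((N, K), dtype=int)  # psi[n, k]: best predecessor of state k at step n
    delta[0] = np.log(pi) + log_phi[:, z[0]]
    for n in range(1, N):
        scores = delta[n - 1][:, None] + log_A        # scores[j, k] = delta(j) + log p(x_n=k | x_{n-1}=j)
        psi[n] = scores.argmax(axis=0)
        delta[n] = scores.max(axis=0) + log_phi[:, z[n]]
    # Termination and backtracking
    x = np.zeros(N, dtype=int)
    x[-1] = delta[-1].argmax()
    for n in range(N - 2, -1, -1):
        x[n] = psi[n + 1, x[n + 1]]
    return x
```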
30 3. Learning the Model Parameters
Given an observation sequence $z_1, z_2, z_3, \ldots$
Find the optimal model parameters $\theta = (\pi, A, \varphi)$.
We need to maximize the likelihood $p(z \mid \theta)$. This cannot be solved in closed form; instead an iterative algorithm is used: "Baum-Welch", a special case of the Expectation-Maximization (EM) algorithm.
31 3. Learning the Model Parameters
Idea: instead of maximizing the likelihood
$$p(z_1, \ldots, z_N \mid \theta) = \sum_{x_1, \ldots, x_N} p(z_1, \ldots, z_N, x_1, \ldots, x_N \mid \theta)$$
directly, we maximize the expected log-likelihood
$$\sum_{x_1, \ldots, x_N} p(x_1, \ldots, x_N \mid z_1, \ldots, z_N, \theta^{\text{old}})\, \log p(z_1, \ldots, z_N, x_1, \ldots, x_N \mid \theta)$$
It can be shown that this is a lower bound of the actual log-likelihood $\log p(z \mid \theta)$. This is the general idea of the Expectation-Maximization (EM) algorithm.
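The lower-bound claim can be made explicit with Jensen's inequality; this short derivation is not spelled out on the slide. Writing $X = (x_1, \ldots, x_N)$, $Z = (z_1, \ldots, z_N)$ and $q(X) = p(X \mid Z, \theta^{\text{old}})$:
$$\log p(Z \mid \theta) = \log \sum_X q(X)\, \frac{p(Z, X \mid \theta)}{q(X)} \;\ge\; \sum_X q(X) \log p(Z, X \mid \theta) \;-\; \sum_X q(X) \log q(X)$$
The second term does not depend on $\theta$, so maximizing the expected complete-data log-likelihood maximizes a lower bound on $\log p(Z \mid \theta)$.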
32 The Baum-Welch Algorithm
E-step (assuming we know $\pi, A, \varphi$, i.e. $\theta^{\text{old}}$): define the posterior probability of the state at step $n$,
$$\gamma(x_n) = p(x_n \mid Z), \qquad Z = (z_1, \ldots, z_N)$$

33 The Baum-Welch Algorithm
It follows that
$$\gamma(x_n) = \frac{\alpha(x_n)\, \beta(x_n)}{p(Z)}$$

34 The Baum-Welch Algorithm
Similarly, define the posterior probability of each state transition,
$$\xi(x_{n-1}, x_n) = p(x_{n-1}, x_n \mid Z)$$
It follows that
$$\xi(x_{n-1}, x_n) = \frac{\alpha(x_{n-1})\, p(z_n \mid x_n)\, p(x_n \mid x_{n-1})\, \beta(x_n)}{p(Z)}$$
35 The Baum-Welch Algorithm
Note: $\gamma(x_n)$ is a vector of length $K$; each entry $\gamma_k(x_n)$ represents the probability that the state at time $n$ is equal to $k \in \{1, \ldots, K\}$.
Thus, the expected number of transitions from state $k$ in the sequence is
$$\sum_{i=1}^{N} \gamma_k(x_i)$$

36 The Baum-Welch Algorithm
Similarly, the expected number of transitions from state $j$ to state $k$ in the sequence is
$$\sum_{i=1}^{N-1} \xi_{j,k}(x_i, x_{i+1})$$
37 The Baum-Welch Algorithm
With that we can compute new values for $\pi, A, \varphi$:
$$\pi_k = \gamma_k(x_1), \qquad
A_{j,k} = \frac{\sum_{i=1}^{N-1} \xi_{j,k}(x_i, x_{i+1})}{\sum_{i=1}^{N} \gamma_j(x_i)}, \qquad
\varphi_{j,k} = \frac{\sum_{i=1}^{N} \gamma_j(x_i)\, \delta_{k, z_i}}{\sum_{i=1}^{N} \gamma_j(x_i)}$$
(where $\delta_{k, z_i} = 1$ if $z_i = k$ and $0$ otherwise). For these updates we need both the forward and the backward step!
This is repeated until the likelihood does not increase anymore (convergence).
38 The Baum-Welch Algorithm - Summary
Start with an initial estimate of $\theta = (\pi, A, \varphi)$, e.g. uniformly, and k-means for $\varphi$.
Compute the messages (E-step).
Compute a new $\theta = (\pi, A, \varphi)$ (M-step).
Iterate E and M until convergence.
In each iteration one full application of the forward-backward algorithm is performed.
The result is a local optimum; for other local optima, the algorithm needs to be started again with a new initialization.
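A minimal sketch of one Baum-Welch iteration for the discrete observation model, reusing the forward and backward functions from the sketches above; it assumes a single observation sequence and omits the rescaling needed for long sequences.

```python
import numpy as np

def baum_welch_step(z, pi, A, phi):
    """One EM iteration: E-step computes gamma and xi, M-step re-estimates (pi, A, phi)."""
    z = np.asarray(z)
    N, K, M = len(z), A.shape[0], phi.shape[1]
    alpha, beta = forward(z, pi, A, phi), backward(z, A, phi)
    evidence = alpha[-1].sum()                        # p(z | theta_old)

    # E-step: gamma[n, k] = p(x_n = k | z), xi[n, j, k] = p(x_n = j, x_{n+1} = k | z)
    gamma = alpha * beta / evidence
    xi = (alpha[:-1, :, None] * A[None, :, :]
          * phi[:, z[1:]].T[:, None, :] * beta[1:, None, :]) / evidence

    # M-step: re-estimate the parameters from the expected counts
    pi_new = gamma[0]
    # Transitions j -> k divided by expected time in state j (over n = 1..N-1)
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    # Expected count of observing symbol m while in state k, normalized per state
    phi_new = np.stack([gamma[z == m].sum(axis=0) for m in range(M)], axis=1)
    phi_new /= gamma.sum(axis=0)[:, None]
    return pi_new, A_new, phi_new
```

Iterating this step (and monitoring the evidence) until the likelihood no longer increases yields the local optimum described above.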
39 Summary
HMMs are a way to model sequential data. They assume discrete states.
Three possible operations can be performed with HMMs:
Data likelihood, given a model and an observation
Most likely state sequence, given a model and an observation
Optimal model parameters, given an observation
Appropriate scaling solves numerical problems.
HMMs are widely used, e.g. in speech recognition.