Introduction to Sequential Monte Carlo Methods
Arnaud Doucet
NCSU, October 2008
Preliminary Remarks

Sequential Monte Carlo (SMC) methods are a set of techniques allowing us to approximate virtually any sequence of probability distributions. SMC methods are very popular in physics, where they are used to compute eigenvalues of positive operators, solve PDEs/integral equations, or simulate polymers. We focus here on applications of SMC to Hidden Markov Models (HMMs) for pedagogical reasons. In the HMM framework, SMC methods are also widely known as Particle Filtering/Smoothing methods.
Markov Models

We model the stochastic process of interest as a discrete-time Markov process {X_k}_{k >= 1}, characterized by its initial density and its transition density:

X_1 ~ \mu(\cdot),   X_k | (X_{k-1} = x_{k-1}) ~ f(\cdot | x_{k-1}).

We introduce the notation x_{i:j} = (x_i, x_{i+1}, ..., x_j) for i <= j. We have by definition

p(x_{1:n}) = p(x_1) \prod_{k=2}^{n} p(x_k | x_{1:k-1}) = \mu(x_1) \prod_{k=2}^{n} f(x_k | x_{k-1}).
Observation Model

We do not observe {X_k}_{k >= 1}; the process is hidden. We only have access to another related process {Y_k}_{k >= 1}. We assume that, conditional on {X_k}_{k >= 1}, the observations {Y_k}_{k >= 1} are independent and marginally distributed according to

Y_k | (X_k = x_k) ~ g(\cdot | x_k).

Formally this means that

p(y_{1:n} | x_{1:n}) = \prod_{k=1}^{n} g(y_k | x_k).
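The generative structure above can be simulated directly: first propagate the hidden chain through f, then draw each observation through g. A minimal sketch, assuming (purely for illustration, since the slides leave the densities abstract) a scalar AR(1) transition and additive standard-normal observation noise:

```python
import numpy as np

def simulate_hmm(n, rng):
    """Simulate (x_{1:n}, y_{1:n}) from an illustrative HMM with
    X_1 ~ N(0, 1), X_k = 0.9 X_{k-1} + V_k and Y_k = X_k + W_k,
    where V_k, W_k are i.i.d. standard normal (assumed densities)."""
    x = np.empty(n)
    x[0] = rng.normal()                       # X_1 ~ mu
    for k in range(1, n):
        x[k] = 0.9 * x[k - 1] + rng.normal()  # X_k ~ f(. | x_{k-1})
    y = x + rng.normal(size=n)                # Y_k ~ g(. | x_k), independent given x
    return x, y

rng = np.random.default_rng(0)
x, y = simulate_hmm(100, rng)
```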
Figure: Graphical model representation of an HMM.
Tracking Example

Assume you want to track a target in the XY plane; then you can consider the 4-dimensional state X_k = (X_{k,1}, V_{k,1}, X_{k,2}, V_{k,2})^T. The so-called constant velocity model states that

X_k = A X_{k-1} + W_k,   W_k ~ i.i.d. N(0, \Sigma),

A = \begin{pmatrix} A_{CV} & 0 \\ 0 & A_{CV} \end{pmatrix},   A_{CV} = \begin{pmatrix} 1 & T \\ 0 & 1 \end{pmatrix},

\Sigma = \sigma^2 \begin{pmatrix} \Sigma_{CV} & 0 \\ 0 & \Sigma_{CV} \end{pmatrix},   \Sigma_{CV} = \begin{pmatrix} T^3/3 & T^2/2 \\ T^2/2 & T \end{pmatrix}.

We obtain that f(x_k | x_{k-1}) = N(x_k; A x_{k-1}, \Sigma).
Tracking Example (cont.)

The observation equation depends on the sensor.

Simple case:

Y_k = C X_k + E_k,   E_k ~ i.i.d. N(0, \Sigma_e)

so g(y_k | x_k) = N(y_k; C x_k, \Sigma_e).

Complex realistic case (bearings-only tracking):

Y_k = \tan^{-1}(X_{k,2} / X_{k,1}) + E_k,   E_k ~ i.i.d. N(0, \sigma^2)

so g(y_k | x_k) = N(y_k; \tan^{-1}(x_{k,2} / x_{k,1}), \sigma^2).
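The constant-velocity matrices above have a block-diagonal structure, one (position, velocity) block per coordinate, which is convenient to build with a Kronecker product. A sketch, where the sampling period T and noise intensity sigma2 are illustrative values, not taken from the slides:

```python
import numpy as np

# Constant-velocity model matrices for the 2-D tracking example.
T, sigma2 = 1.0, 0.1          # illustrative sampling period and noise intensity

A_cv = np.array([[1.0, T],
                 [0.0, 1.0]])
Sigma_cv = np.array([[T**3 / 3, T**2 / 2],
                     [T**2 / 2, T]])

# Block-diagonal: one (position, velocity) block per coordinate.
A = np.kron(np.eye(2), A_cv)
Sigma = sigma2 * np.kron(np.eye(2), Sigma_cv)

# Propagate a trajectory through X_k = A X_{k-1} + W_k.
rng = np.random.default_rng(1)
x = np.zeros(4)
for k in range(50):
    x = A @ x + rng.multivariate_normal(np.zeros(4), Sigma)
```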
Stochastic Volatility

We have the following standard model

X_k = \phi X_{k-1} + V_k,   V_k ~ i.i.d. N(0, \sigma^2)

so that f(x_k | x_{k-1}) = N(x_k; \phi x_{k-1}, \sigma^2). We observe

Y_k = \beta \exp(X_k / 2) W_k,   W_k ~ i.i.d. N(0, 1)

so that g(y_k | x_k) = N(y_k; 0, \beta^2 \exp(x_k)).
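A short simulation sketch of this model; the parameter values and the stationary initialization of X_1 are illustrative assumptions, not specified on the slides:

```python
import numpy as np

def simulate_sv(n, phi, sigma, beta, rng):
    """Simulate the stochastic volatility model
    X_k = phi X_{k-1} + V_k,  Y_k = beta exp(X_k / 2) W_k."""
    x = np.empty(n)
    # Start from the stationary distribution of the AR(1) (assumption).
    x[0] = rng.normal(scale=sigma / np.sqrt(1 - phi**2))
    for k in range(1, n):
        x[k] = phi * x[k - 1] + sigma * rng.normal()
    y = beta * np.exp(x / 2) * rng.normal(size=n)
    return x, y

rng = np.random.default_rng(2)
x, y = simulate_sv(200, phi=0.95, sigma=0.3, beta=0.7, rng=rng)
```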
Inference in HMM

Given a realization of the observations Y_{1:n} = y_{1:n}, we are interested in inferring the states X_{1:n}. We are in a Bayesian framework where

Prior: p(x_{1:n}) = \mu(x_1) \prod_{k=2}^{n} f(x_k | x_{k-1}),

Likelihood: p(y_{1:n} | x_{1:n}) = \prod_{k=1}^{n} g(y_k | x_k).

Using Bayes' rule, we obtain

p(x_{1:n} | y_{1:n}) = p(y_{1:n} | x_{1:n}) p(x_{1:n}) / p(y_{1:n})

where the marginal likelihood is given by

p(y_{1:n}) = \int p(y_{1:n} | x_{1:n}) p(x_{1:n}) dx_{1:n}.
Sequential Inference in HMM

In particular, we focus here on the sequential estimation of p(x_{1:n} | y_{1:n}) and p(y_{1:n}); that is, at each time n we want to update our knowledge of the hidden process in light of y_n. There is a simple recursion relating p(x_{1:n-1} | y_{1:n-1}) to p(x_{1:n} | y_{1:n}) given by

p(x_{1:n} | y_{1:n}) = p(x_{1:n-1} | y_{1:n-1}) f(x_n | x_{n-1}) g(y_n | x_n) / p(y_n | y_{1:n-1})

where

p(y_n | y_{1:n-1}) = \int\int g(y_n | x_n) f(x_n | x_{n-1}) p(x_{n-1} | y_{1:n-1}) dx_{n-1:n}.

We will also simply write

p(x_{1:n} | y_{1:n}) \propto p(x_{1:n-1} | y_{1:n-1}) f(x_n | x_{n-1}) g(y_n | x_n).
In many papers/books in the literature, you will find the following two-step prediction-updating recursion for the marginal so-called filtering distributions p(x_n | y_{1:n}), which is a direct consequence.

Prediction step:

p(x_n | y_{1:n-1}) = \int p(x_{n-1:n} | y_{1:n-1}) dx_{n-1}
                   = \int p(x_n | x_{n-1}, y_{1:n-1}) p(x_{n-1} | y_{1:n-1}) dx_{n-1}
                   = \int f(x_n | x_{n-1}) p(x_{n-1} | y_{1:n-1}) dx_{n-1}.

Updating step:

p(x_n | y_{1:n}) = g(y_n | x_n) p(x_n | y_{1:n-1}) / p(y_n | y_{1:n-1}).
(Marginal) Likelihood Evaluation

We have seen that

p(y_{1:n}) = \int p(y_{1:n} | x_{1:n}) p(x_{1:n}) dx_{1:n}.

We also have the following decomposition

p(y_{1:n}) = p(y_1) \prod_{k=2}^{n} p(y_k | y_{1:k-1})

where

p(y_k | y_{1:k-1}) = \int p(y_k, x_k | y_{1:k-1}) dx_k
                   = \int g(y_k | x_k) p(x_k | y_{1:k-1}) dx_k
                   = \int\int g(y_k | x_k) f(x_k | x_{k-1}) p(x_{k-1} | y_{1:k-1}) dx_{k-1:k}.

We have "broken" a high-dimensional integral into a product of lower-dimensional integrals.
Closed-form Inference in HMM

We have closed-form solutions for:

Finite state-space HMMs, i.e. E = {e_1, ..., e_p}, as all integrals become finite sums.

Linear Gaussian models, where all the posterior distributions are Gaussian; e.g. the celebrated Kalman filter.

A whole "reverse engineering" literature exists for closed-form solutions in alternative cases... In many cases of interest, it is impossible to compute the solution in closed form and we need approximations.
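For the finite state-space case, the prediction-update recursion becomes matrix-vector arithmetic (the classical forward algorithm). A sketch on a hypothetical two-state chain with two observation symbols; the transition matrix F, emission matrix G, and observed sequence are made-up illustrative values:

```python
import numpy as np

def forward_filter(mu, F, G, ys):
    """Exact filtering for a finite state-space HMM.
    mu: initial probabilities; F[i, j] = P(X_k = j | X_{k-1} = i);
    G[j, y] = P(Y_k = y | X_k = j); ys: observed symbols.
    Returns the filtering distributions p(x_n | y_{1:n}) and log p(y_{1:n})."""
    alpha = mu * G[:, ys[0]]                # unnormalized p(x_1, y_1)
    loglik = np.log(alpha.sum())            # log p(y_1)
    alpha /= alpha.sum()
    filters = [alpha]
    for y in ys[1:]:
        alpha = (alpha @ F) * G[:, y]       # predict through F, update with g
        loglik += np.log(alpha.sum())       # accumulates log p(y_k | y_{1:k-1})
        alpha /= alpha.sum()
        filters.append(alpha)
    return np.array(filters), loglik

mu = np.array([0.5, 0.5])
F = np.array([[0.9, 0.1], [0.2, 0.8]])
G = np.array([[0.8, 0.2], [0.3, 0.7]])
filt, ll = forward_filter(mu, F, G, [0, 0, 1, 0])
```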
Standard Approximations for Filtering Distributions

Gaussian approximations: Extended Kalman filter, Unscented Kalman filter.

Gaussian sum approximations.

Projection filters, variational approximations.

Simple discretization of the state-space.

Analytical methods work in simple cases but are not reliable, and it is difficult to diagnose when they fail. Standard discretization of the space is expensive and difficult to implement in high-dimensional scenarios.
Breakthrough

At the beginning of the 90s, the optimal filtering area was considered virtually dead; there had not been any significant progress for years. Then:

Gordon, N.J., Salmond, D.J. and Smith, A.F.M., "Novel approach to nonlinear/non-Gaussian Bayesian state estimation", IEE Proceedings F: Radar and Signal Processing, vol. 140, no. 2, pp. 107-113, 1993.

This article introduces a simple method which relies neither on a functional approximation nor on a deterministic grid. This paper was ignored by most researchers for a few years...
Monte Carlo Sampling.
Importance Sampling.
Sequential Importance Sampling.
Sequential Importance Sampling with Resampling.
Monte Carlo Sampling

Assume for the time being that you are interested in estimating the high-dimensional probability density

p(x_{1:n} | y_{1:n}) = p(x_{1:n}, y_{1:n}) / p(y_{1:n}) \propto p(x_{1:n}, y_{1:n})

where n is fixed. A Monte Carlo approximation consists of sampling a large number N of i.i.d. random variables X_{1:n}^{(i)} ~ p(x_{1:n} | y_{1:n}) and building the approximation

\hat{p}(x_{1:n} | y_{1:n}) = (1/N) \sum_{i=1}^{N} \delta_{X_{1:n}^{(i)}}(x_{1:n})

where \delta_a(x_{1:n}) is the Dirac delta mass, which is such that

\int_A \delta_a(x_{1:n}) dx_{1:n} = 1 if a \in A, 0 otherwise.
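As a minimal concrete instance of the idea, with a standard normal target standing in for the posterior (an illustrative assumption; the point is only the empirical average over i.i.d. draws):

```python
import numpy as np

# Plain Monte Carlo: approximate E[phi(X)] by an average over i.i.d. draws
# from the target. Here the target is N(0, 1) and phi(x) = x^2, so the
# exact answer is E[X^2] = 1.
rng = np.random.default_rng(3)
N = 100_000
samples = rng.normal(size=N)        # X^{(i)} ~ p, i.i.d.
estimate = np.mean(samples**2)      # Monte Carlo estimate of E[X^2]
```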
Issues with Standard Monte Carlo Sampling

There are standard methods to sample from classical distributions such as Beta, Gamma, Normal, Poisson, etc. We will not detail them here although we will rely on them.

Problem 1: For most problems of interest, we cannot sample from p(x_{1:n} | y_{1:n}).

Problem 2: Even if we could sample exactly from p(x_{1:n} | y_{1:n}), the computational complexity of the algorithm would most likely increase with n; we want here an algorithm of fixed computational complexity at each time step.

To summarize, we cannot use standard MC sampling in our case and, even if we could, this would not solve our problem...
Importance Sampling

Importance Sampling (IS). We have

p(x_{1:n} | y_{1:n}) = p(y_{1:n} | x_{1:n}) p(x_{1:n}) / p(y_{1:n}),
p(y_{1:n}) = \int p(y_{1:n} | x_{1:n}) p(x_{1:n}) dx_{1:n}.

Generally speaking, for a so-called importance distribution q(x_{1:n} | y_{1:n}) selected such that

p(x_{1:n} | y_{1:n}) > 0 \Rightarrow q(x_{1:n} | y_{1:n}) > 0,

we have

p(x_{1:n} | y_{1:n}) = w(x_{1:n}, y_{1:n}) q(x_{1:n} | y_{1:n}) / p(y_{1:n}),
p(y_{1:n}) = \int w(x_{1:n}, y_{1:n}) q(x_{1:n} | y_{1:n}) dx_{1:n}

where the unnormalized importance weight is

w(x_{1:n}, y_{1:n}) = p(x_{1:n}, y_{1:n}) / q(x_{1:n} | y_{1:n}) \propto p(x_{1:n} | y_{1:n}) / q(x_{1:n} | y_{1:n}).
Monte Carlo IS Estimates

It is easy to sample from p(x_{1:n}), thus we can build the standard MC approximation

\hat{p}(x_{1:n}) = (1/N) \sum_{i=1}^{N} \delta_{X_{1:n}^{(i)}}(x_{1:n})   where X_{1:n}^{(i)} ~ i.i.d. p(x_{1:n}).

We plug these approximations into the IS identities to obtain

p(y_{1:n}) = \int p(y_{1:n} | x_{1:n}) p(x_{1:n}) dx_{1:n}   =>   \hat{p}(y_{1:n}) = (1/N) \sum_{i=1}^{N} p(y_{1:n} | X_{1:n}^{(i)}).

\hat{p}(y_{1:n}) is an unbiased estimate of p(y_{1:n}), with relative variance

Var[ \hat{p}(y_{1:n}) / p(y_{1:n}) ] = (1/N) ( \int p^2(y_{1:n} | x_{1:n}) p(x_{1:n}) dx_{1:n} / p^2(y_{1:n}) - 1 ).
We also get an approximation of the posterior using

p(x_{1:n} | y_{1:n}) = p(y_{1:n} | x_{1:n}) p(x_{1:n}) / \int p(y_{1:n} | x_{1:n}) p(x_{1:n}) dx_{1:n},

\hat{p}(x_{1:n} | y_{1:n}) = \sum_{i=1}^{N} W_n^{(i)} \delta_{X_{1:n}^{(i)}}(x_{1:n})

where the normalized importance weights are

W_n^{(i)} = p(y_{1:n} | X_{1:n}^{(i)}) / \sum_{j=1}^{N} p(y_{1:n} | X_{1:n}^{(j)}).
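The two IS estimates above (unbiased evidence estimate, self-normalized posterior estimate) can be sketched on a toy one-step model where everything is known in closed form. The model here, X ~ N(0, 1) observed through Y | X = x ~ N(x, 1) with y = 1.5, is an illustrative assumption: its posterior is Gaussian with mean y/2 and its evidence is N(y; 0, 2).

```python
import numpy as np

rng = np.random.default_rng(4)
N, y = 50_000, 1.5

x = rng.normal(size=N)                               # X^{(i)} ~ prior p(x)
w = np.exp(-0.5 * (y - x)**2) / np.sqrt(2 * np.pi)   # w^{(i)} = p(y | X^{(i)})

evidence = w.mean()        # unbiased estimate of p(y) = N(y; 0, 2)
W = w / w.sum()            # normalized weights W^{(i)}
post_mean = np.sum(W * x)  # self-normalized IS estimate of E[X | y] = y/2
```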
Assume we are interested in computing E_{p(x_{1:n} | y_{1:n})}(\varphi); then we can use the estimate

E_{\hat{p}(x_{1:n} | y_{1:n})}(\varphi) = \sum_{i=1}^{N} W_n^{(i)} \varphi(X_{1:n}^{(i)}).

This estimate is biased for finite N but asymptotically consistent, with

lim_{N -> \infty} N ( E[ E_{\hat{p}(x_{1:n} | y_{1:n})}(\varphi) ] - E_{p(x_{1:n} | y_{1:n})}(\varphi) ) = - \int ( p^2(x_{1:n} | y_{1:n}) / p(x_{1:n}) ) ( \varphi(x_{1:n}) - E_{p(x_{1:n} | y_{1:n})}(\varphi) ) dx_{1:n}

and

lim_{N -> \infty} N Var( E_{\hat{p}(x_{1:n} | y_{1:n})}(\varphi) ) = \int ( p^2(x_{1:n} | y_{1:n}) / p(x_{1:n}) ) ( \varphi(x_{1:n}) - E_{p(x_{1:n} | y_{1:n})}(\varphi) )^2 dx_{1:n}.

Hence MSE = bias^2 [O(N^{-2})] + variance [O(N^{-1})], so the asymptotic bias is irrelevant.
Summary of Our Progress

Problem 1: For most problems of interest, we cannot sample from p(x_{1:n} | y_{1:n}).
Problem 1 solved: We use an IS approximation of p(x_{1:n} | y_{1:n}) that relies on the prior p(x_{1:n}) as importance distribution.

Problem 2: Even if we could sample exactly from p(x_{1:n} | y_{1:n}), the computational complexity of the algorithm would most likely increase with n; we want here an algorithm of fixed computational complexity at each time step.
Problem 2 not solved yet: If at each time step n we need to obtain new samples from p(x_{1:n}), the algorithm's computational complexity will increase at each time step.
Sequential Importance Sampling (SIS)

To avoid having computational effort increase over time, we use the fact that

p(x_{1:n}) [IS at time n] = p(x_{1:n-1}) [IS at time n-1] * f(x_n | x_{n-1}) [new sampled component] = \mu(x_1) \prod_{k=2}^{n} f(x_k | x_{k-1}).

In practical terms, this means that at time n-1 we have already sampled X_{1:n-1}^{(i)} ~ p(x_{1:n-1}), and to obtain at time n samples/particles X_{1:n}^{(i)} ~ p(x_{1:n}) we just need to sample

X_n^{(i)} ~ f(x_n | X_{n-1}^{(i)})

and set

X_{1:n}^{(i)} = ( X_{1:n-1}^{(i)} [previously sampled path], X_n^{(i)} [new sampled component] ).
Now, whatever n is, we have only one component X_n to sample! However, can we compute our IS estimates of p(y_{1:n}) and the target p(x_{1:n} | y_{1:n}) recursively? Remember that

\hat{p}(y_{1:n}) = (1/N) \sum_{i=1}^{N} p(y_{1:n} | X_{1:n}^{(i)}),

\hat{p}(x_{1:n} | y_{1:n}) = \sum_{i=1}^{N} W_n^{(i)} \delta_{X_{1:n}^{(i)}}(x_{1:n}),   W_n^{(i)} \propto p(y_{1:n} | X_{1:n}^{(i)}),   \sum_{i=1}^{N} W_n^{(i)} = 1.

We have

p(y_{1:n} | x_{1:n}) = p(y_{1:n-1} | x_{1:n-1}) g(y_n | x_n).
Sequential Importance Sampling Algorithm

At time 1:
Sample N particles X_1^{(i)} ~ \mu(x_1) and compute W_1^{(i)} \propto g(y_1 | X_1^{(i)}).

At time n, n >= 2:
Sample N particles X_n^{(i)} ~ f(x_n | X_{n-1}^{(i)}) and compute W_n^{(i)} \propto W_{n-1}^{(i)} g(y_n | X_n^{(i)}).
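The two steps above can be sketched in a few lines. The concrete densities below, a scalar AR(1) transition with standard-normal observation noise, are illustrative assumptions (the algorithm itself is model-agnostic); weights are kept in log-space to limit numerical underflow:

```python
import numpy as np

def sis(ys, N, rng):
    """Sequential Importance Sampling with the prior as proposal, on the
    illustrative model X_1 ~ N(0,1), X_k = 0.9 X_{k-1} + V_k, Y_k = X_k + W_k
    with standard-normal noises (assumed for this demo).
    Returns final particles X_n^{(i)} and normalized weights W_n^{(i)}."""
    x = rng.normal(size=N)                    # X_1^{(i)} ~ mu
    logw = -0.5 * (ys[0] - x)**2              # log g(y_1 | x), up to a constant
    for y in ys[1:]:
        x = 0.9 * x + rng.normal(size=N)      # X_n^{(i)} ~ f(. | X_{n-1}^{(i)})
        logw += -0.5 * (y - x)**2             # W_n \propto W_{n-1} g(y_n | x_n)
    w = np.exp(logw - logw.max())             # stabilize before normalizing
    return x, w / w.sum()

rng = np.random.default_rng(5)
ys = rng.normal(size=20)
x, W = sis(ys, N=1000, rng=rng)
```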
Practical Issues

The algorithm can be easily parallelized. The computational complexity does not increase over time. It is not necessary to store the paths {X_{1:n}^{(i)}} if we are only interested in approximating p(x_n | y_{1:n}), as the weights only depend on {X_n^{(i)}}!
Example of Applications

Consider the following model

X_k = 0.5 X_{k-1} + 25 X_{k-1} / (1 + X_{k-1}^2) + 8 \cos(1.2 k) + V_k = \varphi(X_{k-1}) + V_k,

Y_k = X_k^2 / 20 + W_k,

where X_1 ~ N(0, 1), V_k ~ i.i.d. N(0, 2.5^2) and W_k ~ i.i.d. N(0, 1).
Figure: Histogram of log p(y_{1:100} | X_{1:100}^{(i)}). The approximation is dominated by one single particle.
Summary

SIS is an attractive idea: it is sequential and parallelizable, and it reduces the design of a high-dimensional proposal to the design of a sequence of low-dimensional proposals. However, SIS can only work for moderate-size problems. Is there a way to partially fix this problem?
Resampling

Problem: As n increases, the variance of {p(y_{1:n} | X_{1:n}^{(i)})} increases and all the mass concentrates on a few random samples/particles, as W_n^{(i_0)} \approx 1 and W_n^{(i)} \approx 0 for i != i_0:

\hat{p}(x_{1:n} | y_{1:n}) = \sum_{i=1}^{N} W_n^{(i)} \delta_{X_{1:n}^{(i)}}(x_{1:n}) \approx \delta_{X_{1:n}^{(i_0)}}(x_{1:n}).

Intuitive KEY idea: Kill in a principled way the particles with low weights W_n^{(i)} (relative to 1/N) and multiply the particles with high weights W_n^{(i)} (relative to 1/N).

Rationale: If a particle at time n has a low weight then typically it will still have a low weight at time n+1 (though I can easily give you a counterexample), and you want to focus your computational efforts on the promising parts of the space.
At time n, IS provides the following approximation of p(x_{1:n} | y_{1:n}):

\hat{p}(x_{1:n} | y_{1:n}) = \sum_{i=1}^{N} W_n^{(i)} \delta_{X_{1:n}^{(i)}}(x_{1:n}).

The simplest resampling scheme consists of sampling N times \tilde{X}_{1:n}^{(i)} ~ \hat{p}(x_{1:n} | y_{1:n}) to build the new approximation

\tilde{p}(x_{1:n} | y_{1:n}) = (1/N) \sum_{i=1}^{N} \delta_{\tilde{X}_{1:n}^{(i)}}(x_{1:n}).

The new resampled particles {\tilde{X}_{1:n}^{(i)}} are approximately distributed according to p(x_{1:n} | y_{1:n}) but are statistically dependent: theoretically much more difficult to study.
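This simplest scheme is multinomial resampling: draw N indices with probabilities given by the weights. A sketch, on made-up particles and weights:

```python
import numpy as np

def multinomial_resample(x, W, rng):
    """Multinomial resampling: draw N times from the weighted empirical
    measure sum_i W^{(i)} delta_{x^{(i)}}; the returned particles implicitly
    carry the uniform weight 1/N afterwards."""
    N = len(x)
    idx = rng.choice(N, size=N, p=W)
    return x[idx]

rng = np.random.default_rng(6)
x = np.array([0.0, 1.0, 2.0, 3.0])
W = np.array([0.7, 0.1, 0.1, 0.1])
resampled = multinomial_resample(x, W, rng)
```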
Sequential Importance Sampling Resampling Algorithm

At time 1:
Sample N particles X_1^{(i)} ~ \mu(x_1) and compute W_1^{(i)} \propto g(y_1 | X_1^{(i)}).
Resample {X_1^{(i)}, W_1^{(i)}} to obtain new particles, also denoted {X_1^{(i)}}.

At time n, n >= 2:
Sample N particles X_n^{(i)} ~ f(x_n | X_{n-1}^{(i)}) and compute W_n^{(i)} \propto g(y_n | X_n^{(i)}).
Resample {X_{1:n}^{(i)}, W_n^{(i)}} to obtain new particles, also denoted {X_{1:n}^{(i)}}.
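Putting the pieces together gives the bootstrap particle filter of Gordon, Salmond and Smith, sketched here on the benchmark model of the earlier example. The synthetic observation sequence is an illustrative assumption, and the log-likelihood accumulation uses the estimate of p(y_k | y_{1:k-1}) from the next slide:

```python
import numpy as np

def bootstrap_filter(ys, N, rng):
    """Bootstrap particle filter (SIR) sketch for the benchmark model
    X_k = 0.5 X_{k-1} + 25 X_{k-1}/(1 + X_{k-1}^2) + 8 cos(1.2 k) + V_k,
    Y_k = X_k^2 / 20 + W_k, X_1 ~ N(0,1), V_k ~ N(0, 2.5^2), W_k ~ N(0,1).
    Returns filtering means and the log marginal likelihood estimate."""
    x = rng.normal(size=N)                       # X_1^{(i)} ~ mu
    means, loglik = [], 0.0
    for k, y in enumerate(ys, start=1):
        if k > 1:                                # propagate through f
            x = (0.5 * x + 25 * x / (1 + x**2) + 8 * np.cos(1.2 * k)
                 + 2.5 * rng.normal(size=N))
        logw = -0.5 * (y - x**2 / 20)**2         # log g(y_k | x_k), up to a constant
        w = np.exp(logw - logw.max())            # stabilized unnormalized weights
        # log of (1/N) sum_i g(y_k | x_k^{(i)}), restoring the Gaussian constant:
        loglik += np.log(w.mean()) + logw.max() - 0.5 * np.log(2 * np.pi)
        W = w / w.sum()
        means.append(np.sum(W * x))              # estimate of E[X_k | y_{1:k}]
        x = x[rng.choice(N, size=N, p=W)]        # multinomial resampling
    return np.array(means), loglik

rng = np.random.default_rng(7)
ys = rng.normal(size=30)                         # synthetic observations (assumption)
means, ll = bootstrap_filter(ys, N=500, rng=rng)
```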
We also have

p(y_n | y_{1:n-1}) = \int\int g(y_n | x_n) f(x_n | x_{n-1}) p(x_{n-1} | y_{1:n-1}) dx_{n-1:n}

so

\hat{p}(y_n | y_{1:n-1}) = (1/N) \sum_{i=1}^{N} g(y_n | X_n^{(i)}).

Perhaps surprisingly, it can be shown that if we define

\hat{p}(y_{1:n}) = \hat{p}(y_1) \prod_{k=2}^{n} \hat{p}(y_k | y_{1:k-1})

then E[\hat{p}(y_{1:n})] = p(y_{1:n}).
Example (cont.)

Consider again the following model

X_k = 0.5 X_{k-1} + 25 X_{k-1} / (1 + X_{k-1}^2) + 8 \cos(1.2 k) + V_k,

Y_k = X_k^2 / 20 + W_k,

where X_1 ~ N(0, 1), V_k ~ i.i.d. N(0, 2.5^2) and W_k ~ i.i.d. N(0, 1).
Advanced SMC Methods

I have presented the most basic algorithm. In practice, practitioners often select an IS distribution q(x_n | y_n, x_{n-1}) != f(x_n | x_{n-1}). In such cases, we have

W_n^{(i)} \propto f(X_n^{(i)} | X_{n-1}^{(i)}) g(y_n | X_n^{(i)}) / q(X_n^{(i)} | y_n, X_{n-1}^{(i)}).

Better resampling steps have been developed. Variance reduction techniques can also be developed. SMC methods can be used to sample from virtually any sequence of distributions.