
Adaptive Monte Carlo Integration

by

James Neufeld

A thesis submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Department of Computing Science
University of Alberta

© James Neufeld, 2016

Abstract

Monte Carlo methods are a simple, effective, and widely deployed way of approximating integrals that prove too challenging for deterministic approaches. This thesis presents a number of contributions to the field of adaptive Monte Carlo methods, that is, approaches that automatically adjust the behaviour of the sampling algorithm to better suit the targeted integrand. The first such contribution is the introduction of a new method, antithetic Markov chain sampling, which improves sampling efficiency through the use of simulated Markov chains. These chains effectively guide the sampling toward more influential regions of the integrand (modes). We demonstrate that this approach leads to unbiased estimators and offers significant improvements over standard approaches on challenging multi-modal integrands. We next consider the complementary task of efficiently allocating computation between a set of unbiased samplers through observations of their past performance. Here, we show that this problem is equivalent to the well known stochastic multi-armed bandit problem and, as a result, existing algorithms and theoretical guarantees transfer immediately, which gives rise to new results for the adaptive Monte Carlo setting. We then extend this framework to cover an important practical condition, where each individual sampler (bandit arm) may take a random amount of computation time to produce a sample. Here, we again show that existing bandit algorithms can be applied through the use of a simple sampling trick, and prove new results which bound the regret for any such algorithm from above. Lastly, we consider the task of combining a set of unbiased Monte Carlo estimators, with unique variances and sample sizes, into a single estimator. We show that upper-confidence approaches similar to those used in the multi-armed bandit literature lead to estimators that improve on existing solutions both theoretically and in practice. Interestingly, each of these contributions may be applied in parallel and in complement to one another to produce any number of highly adaptable, robust, and practical Monte Carlo integration algorithms.

Acknowledgements

Naturally, a PhD thesis is only ever attributed to a single author. While this tradition celebrates the hard work and perseverance of the student, it unfortunately does not reflect the collaborative nature of the work. As such, in this document the first person plural is used as an acknowledgement of the many individuals who made this work possible. In particular, I want to thank my supervisors Dale Schuurmans and Michael Bowling for their technical know-how, guidance, and invaluable advice throughout my graduate studies. Also, I am exceedingly grateful to my collaborators Csaba Szepesvári and András György, whose expertise, optimism, and insightful suggestions were instrumental in translating many high level ideas into concrete contributions. Thank you as well to my many friends and colleagues at the University of Alberta who made my graduate studies such a treasured experience. Of course, thank you to my wonderful parents and sisters who helped in far too many ways to count, and my loving wife Thea and three amazing children Paige, Breanna, and Oliver who made this whole effort worthwhile.

Table of Contents

1 Introduction
  1.1 Contributions
2 Background
  2.1 Variance Reduction
    2.1.1 Importance Sampling
    2.1.2 Antithetic Variates
    2.1.3 Stratified Sampling
  2.2 Adaptive Importance Sampling
    2.2.1 Population Monte Carlo
    2.2.2 Discussion
  2.3 Markov Chain Monte Carlo
  2.4 Sequential Monte Carlo
    2.4.1 Sequential Monte Carlo Samplers
    2.4.2 Adaptive SMCS
  2.5 Adaptive Stratified Sampling
  2.6 Summary
3 Variance Reduction via Antithetic Markov Chains
  3.1 Approach
  3.2 Unbiasedness
  3.3 Variance Analysis
  3.4 Parameterization
  3.5 Experimental Evaluation
    3.5.1 Sin Function
    3.5.2 Bayesian k-Mixture Model
    3.5.3 Problem 3: Robot Localization
  3.6 Discussion
4 Adaptive Monte Carlo via Bandit Allocation
  4.1 Background on Bandit Problems
  4.2 Adaptive Monte Carlo Setup
  4.3 Reduction to Stochastic Bandits
  4.4 Implementational Considerations
  4.5 Experimental Evaluation
    4.5.1 Arm Synthetic Experiments
    4.5.2 Option Pricing
  4.6 Discussion
5 Adaptive Monte Carlo with Non-Uniform Costs
  5.1 Non-Uniform Cost Formulation
  5.2 Bounding the MSE-Regret
    5.2.1 Discussion
  5.3 Experimental Evaluation
    5.3.1 Adaptive Antithetic Markov Chain Sampling
    5.3.2 Tuning Adaptive SMCS
    5.3.3 Adaptively Tuning Annealed Importance Sampling
    5.3.4 Discussion of Empirical Findings
  5.4 Discussion
6 Weighted Estimation of a Common Mean
  6.1 Weighted Estimator Formulation
  6.2 Bounding MSE-Regret
  6.3 Non-Deterministic (Bandit) Formulation
  6.4 Experimental Evaluation
    6.4.1 Arm Fixed Allocation Problem
    6.4.2 Bandit Experiments
  6.5 Discussion
7 Concluding Remarks
A AMCS
  A.1 Proof of Lemma
  A.2 Proof of Lemma
  A.3 Proof of Lemma
  A.4 Proof of Lemma
B Monte Carlo Bandits
  B.1 Proof of Lemma
C Monte Carlo Bandits With Costs
  C.1 Proof for Lemma
  C.2 Proof for Lemma
  C.3 Proof of Lemma
  C.4 KL-Based Confidence Bound on Variance
D Weighted Estimation of a Common Mean
  D.1 Proof of Theorem
  D.2 Proof of Theorem
  D.3 Proof of Theorem
  D.4 Concentration Inequalities
    D.4.1 The Hoeffding-Azuma Inequality
    D.4.2 Concentration of the Sample Variance

List of Figures

3.1 Log-likelihood function of position given sensor readings in a Bayesian robot localization problem, 2 of 3 dimensions shown.

3.2 Graphical model outlining the dependencies between the sampled variables; here the positive chain (X^(1), ..., X^(M)) is shown on the right of X^(0) while the negative chain (X^(-1), ..., X^(-N)) is shown on the left. Any variables corresponding to indices greater than M or less than -N are not sampled by the algorithm.

3.3 Cost-adjusted variance (log-scale) for the various methods on the sin(x)^999 function. GIS refers to the original greedy importance sampling approach and GIS-A the extended version using the threshold acceptance function.

3.4 Cost-adjusted variance (log scale) for the different approaches on the Bayesian k-means task. Missing data points are due to the fact that trials where the final estimate (empirical mean) is incorrect by a factor of 2 or greater are automatically removed. From left to right the three plots indicate performance on the same problem but with an increasing number of observed training samples: 15, 35, and 70 respectively.

3.5 Left, the map used for the robot simulator with 6 different robot poses and corresponding laser measurements (for n = 12). Right, a 2d image where %blue is proportional to the log-likelihood function using the observations shown at position A; here pixel locations correspond to robot (x, y) position while the orientation remains fixed.

3.6 Relative cost-adjusted variance for the different approaches on the robot localization task for 6 different positions (p) and 3 different laser configurations (n = #laser readings).

4.1 The average regret for the different bandit approaches (averaged over 2000 runs) for uniform, truncated normal, and scaled Bernoulli payout distributions. Error bars give the 99% empirical percentiles.

4.2 Top left: tile plot indicating which approach achieved lowest regret (averaged over 2000 runs) at time step 5000 in the 2-arm scaled Bernoulli setting. X-axis is the variance of the first distribution, and Y-axis is the additional variance of the second distribution. Top right: plot illustrating the expected number of suboptimal selections for the highlighted case (dashed red circle in top left plot). Error bars indicate 99% empirical percentiles; Y-axis is log scale. Bottom left: corresponding tile plot taken at time step …. Bottom right: corresponding plot for a time horizon of 10^6, and for a different parameter setting (dashed red circle in bottom left plot). Note that the X and Y axes are log scale.

4.3 Top: regret curves for the different adaptive strategies when estimating the price of a European caplet option at different strike prices (s). Error bars are 99% empirical percentiles. The dashed line is an exception and gives the MSE-Regret of the Rao-Blackwellized PMC estimator with error bars delineating 2 standard errors. Bottom: bar graphs showing the variance of each sampler (each θ-value) normalized by the variance of the best sampler in the set: y = V(µ̂_θi) / min_j V(µ̂_θj).

5.1 Tile plots showing the relative cost-adjusted variance (RCAV), σ²δ, for different parameterizations of AMCS on the various kidnapped robot scenarios used in Section …. The measures are relative to the best performing parameterization on that problem, which is indicated by the small black circle (here, RCAV = 1.0). The red triangle indicates the constant arm used in the bandit experiments in Fig. 5.2.

5.2 Regret curves for the different stochastic MAB algorithms for adaptive-AMCS on the kidnapped robot task. These curves show the performance for a fixed position (#6) as the number of laser sensors is varied (12, 18, 24). For reference we plot the regret that would be achieved for a fixed parameter choice (marked by the red triangles in Fig. 5.1), which is labelled as Arm #8. Error bars represent 99% empirical density regions.

5.3 Tile plots illustrating the cost-adjusted variance for each parameterization of the adaptive SMCS approach. Values are relative to the best performing parameterization on that problem, which is indicated by the small black circle (here, RCAV = 1.0). The red triangle indicates the constant arm used in the bandit experiments in Fig. 5.4.

5.4 Regret curves for the various bandit methods for the adaptive SMCS setting. The regret obtained by pulling only a single arm, with population size 500 and 16 MCMC steps, is labeled Arm #10. X-axis is measured in CPU time as reported by the Java VM and error bars give 99% empirical density regions.

5.5 Average number of suboptimal arm pulls for the various bandit methods for the adaptive SMCS setting. X-axis is measured in CPU time as reported by the Java VM and error bars give 99% empirical density regions. Results are collected over 100 independent simulations.

5.6 Tile plots illustrating the cost-adjusted variance for each parameterization of the AIS approach. Values are relative to the best performing parameterization on that problem, which is indicated by the small black circle (here, RCAV = 1.0). The red triangle indicates the constant arm used in the bandit experiments in Fig. 5.7.

5.7 Regret curves for the various bandit methods for the AIS setting. The regret obtained by pulling only a single arm, with 100 annealing steps and 2 MCMC steps, is labeled Arm #12. X-axis is measured in CPU time as reported by the Java VM and error bars give 99% empirical density regions. Results are collected over 100 independent simulations.

5.8 Normalized variance (log scale) of the deterministic allocation scheme for different allocation parameters γ and arm 2 variance/cost parameter D.

6.1 Average regret (y-axis, log-scale), and 99% empirical density regions, plotted for each weighted estimator. Values are computed over … simulations.

6.2 Results for the Cox-Ingersoll-Ross model problem setting with a strike price (s) of …. The leftmost plot gives the regret for the uniformly weighted estimator and is identical to the rightmost plot in Fig. 4.3 with the exception of the log-scale. The regret for the optimal inverse-variance weighted estimator is next, followed by the UCB-W estimator and the Graybill-Deal (GD) estimator. Results are averaged over … simulations and error bars delineate the 99% empirical percentiles. The 2 curves for the PMC method give the performance for the Rao-Blackwellized (dashed) estimator as well as the weighted estimator in question (solid), that is, using PMC only to allocate samples as a bandit method.

6.3 Regret curves for the weighted estimators for the adaptive AMCS setting detailed in Section 5.3.1, specifically the 24 laser, position #6, setting. The leftmost plot shows the regret for the original (uniform) estimator and corresponds exactly to the rightmost plot in Fig. 5.2. From left to right, the regret for the optimally weighted estimator and the UCB-W estimator are next, followed finally by the Graybill-Deal estimator. This rightmost plot is in log-scale.

6.4 Regret curves for the bandit approaches using the different weighted estimators for the adaptive SMCS setting for the 4-dimensional Gaussian mixture as described in Section 5.3.2. The leftmost plot gives the regret for the uniform estimator and corresponds to the middle plot in Fig. 5.4, in log scale. The regret when using the optimally weighted estimator, the UCB-W estimator, and the Graybill-Deal estimator are also shown.

6.5 Regret curves for the bandit approaches using the different weighted estimators for tuning AIS for a logistic regression model with T=10 training examples, as described in Section 5.3.3. The leftmost plot gives the regret for the uniform estimator and corresponds to the middle plot in Fig. 5.7, in log scale. The regret when using the optimally weighted estimator, the UCB-W estimator, and the Graybill-Deal estimator are also shown.

Table of Notation

Random variable: X, Y (uppercase)
Constant: n, k, m (lowercase)
Sample space: 𝒳, 𝒴 (uppercase script)
Probability triple: (𝒳, P, ℬ) (sample space, probability distribution, and σ-algebra)
Distribution function (cdf): Π(x), Π(x | θ), G(x), Π_θ(x)
Density function (pdf): π(x), π(x | θ), g(x), π_θ(x)
Parameter: µ, θ, ζ, µ(P) (lowercase Greek letters; arguments omitted when clear)
Parameter space: Θ, Ω (uppercase Greek letters)
Estimator: θ̂, µ̂, µ̂(X_1, ..., X_n), Ẑ (symbol corresponds to the parameter being estimated; arguments omitted when clear)
Indicator: I{x > y}, I{x ∈ A} (evaluates to 1 if the predicate is true, 0 otherwise)
Expectation: E[X], E[h(X)], E_π[X]
Variance: V(X), V(h(X)), V_π(X)
Probability: P(X > y), P_π(X > y)
Definition: f(x) := x²
Sequence: (X_1, ..., X_n), X_{1:n}

Chapter 1

Introduction

"Anyone who considers arithmetical methods of producing random digits is, of course, in a state of sin."
John von Neumann

In this thesis we explore the computation of expected values, in particular through the use of Monte Carlo methods. Specifically, given a random variable X ∈ 𝒳, distributed according to a probability measure admitting the density function π, our interest surrounds expectations expressed as:

$$\mathbb{E}_\pi[h(X)] := \int_{\mathcal{X}} h(x)\pi(x)\,dx, \qquad (1.1)$$

where h : 𝒳 → ℝ is a bounded measurable function assumed only to be evaluable at any point in the domain. Although this formulation is mathematically concise and widespread in its application, efficient computation of the solution remains a notoriously challenging task. Indeed, the above expectation appears in a number of unique and interesting settings across many disciplines in science and engineering. For instance, h might represent the price of a financial option and π our uncertainty about the underlying asset. Or h might define the payoff function for a high-stakes game of chance while π describes the likelihood of individual game states. Alternatively, h may represent the local concentration of a particular dissolved compound while π reflects the fluid dynamics of the entire solution. For any of these settings efficient computation of the integral in Eq. (1.1) is likely a necessary step in making consequential, and possibly time-sensitive, decisions.

What typically makes the task of solving this integral so challenging is the fact that the exact analytical forms of h and π are often not available. As a result, the symbolic integration approaches from one's calculus textbook cannot be applied. Ultimately, for most

practical scenarios, approximating the solution numerically is the only viable approach. Numerical integration techniques generally fall into one of two categories, which we will refer to as quadrature methods and Monte Carlo methods. Importantly, we find the critical distinction between these classes is not the use of randomization, but whether some form of interpolation is attempted between evaluated points on the integrand. The use of interpolation is a double-edged sword, as it can offer enormous advantages for problems with smoothly varying h and π in lower dimensions, but is challenging to apply to non-smooth or higher dimensional integrands. At the very least, an interpolation approach must evaluate the integrand at a minimum of 2^d locations, where d is the dimensionality of 𝒳. This exponential sample complexity limits the applicability of the approach when it comes to larger and more impactful problems.

Monte Carlo integration approaches, on the other hand, approximate integrals by querying the integrand at random, or quasi-random, locations and return a weighted empirical sum (average) of these points. For instance, if we are able to efficiently collect n samples, X_1, ..., X_n, distributed according to π, the integral in Eq. (1.1) may then be approximated by the estimator

$$\hat\mu_n^{MC} := \frac{1}{n}\sum_{i=1}^{n} h(X_i).$$

Letting µ := E[h(X)], we observe that the above estimator is unbiased, E[µ̂_n^MC] = µ, and has a mean squared error (MSE) given by

$$\mathbb{E}\big[(\hat\mu_n^{MC} - \mu)^2\big] = \frac{1}{n}\mathbb{V}(h(X)).$$

This straightforward analysis establishes a convergence rate of O(n⁻¹) which, critically, is independent of the dimensionality of the integral. In this way, the Monte Carlo integration method can be seen to break the infamous curse of dimensionality that plagues quadrature methods; at least in theory.

Despite this powerful theoretical result, naive Monte Carlo implementations remain highly inefficient for moderately complex problems. Further, there is no single Monte Carlo approach that can be applied effectively across disciplines and problems. For this reason, a large variety of unique approaches have been introduced in the literature, each exploiting specific characteristics of the problem at hand. In this thesis we explore Monte Carlo methods that, in some form or another, alter their sampling behaviour automatically depending on the properties of the supplied integral with the intention of improving the accuracy of the final estimate. In this way, we lessen the requirement that practitioners understand, implement, and experiment with a large number of specialized techniques for any new problem.
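To make the dimension-independent O(n⁻¹) MSE rate above concrete, the following minimal Python sketch estimates E_π[h(X)] for a standard normal π and h(x) = x², where the true value is 1; the target, integrand, and sample sizes here are illustrative assumptions, not examples taken from this thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_estimate(h, sampler, n):
    """Vanilla Monte Carlo: average h over n i.i.d. draws from pi."""
    x = sampler(n)
    return np.mean(h(x))

# Illustrative choices: pi = N(0, 1) and h(x) = x^2, so mu = E[X^2] = 1.
h = lambda x: x ** 2
sampler = lambda n: rng.standard_normal(n)

for n in (10**2, 10**4, 10**6):
    est = mc_estimate(h, sampler, n)
    # absolute error shrinks roughly like n**-0.5 (MSE like 1/n),
    # regardless of the dimensionality of the sample space
    print(n, est, abs(est - 1.0))
```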

1.1 Contributions

This thesis includes a number of interrelated and novel contributions to the field of adaptive Monte Carlo methods, organized into four main chapters summarized below.

Formulation of the Antithetic Markov Chain Method

In Chapter 3 we present an extended treatment of the antithetic Markov chain sampling (AMCS) method originally presented in Neufeld et al. (2015). This approach is a combination of sequential Monte Carlo sampling methods and the method of antithetic variates. More simply, AMCS reduces approximation error by effectively searching out regions of the integrand that exhibit large changes in magnitude (peaks) through the use of simulated Markov chains. By averaging within these regions, much of the variability of the integrand can be removed, resulting in a reduced approximation error for the final estimate. The AMCS estimator is shown to be unbiased (Theorem 3.1) and expressions for the variance and sample error, considering the computational footprint, are derived. Finally, we provide explicit parameterizations for AMCS (Markov kernels and stopping rules) and empirically demonstrate their utility on non-trivial machine learning tasks that challenge existing methods.

Formulation of the Learning to Select Monte Carlo Samplers Framework

Chapter 4 introduces a new adaptive Monte Carlo framework in which a decision making agent is presented with a set of K unbiased Monte Carlo samplers for the same statistic (µ) and tasked with choosing which samplers to draw samples from in order to compute µ most efficiently; this work was originally published in Neufeld et al. (2014). We formulate an expression for the MSE of any given allocation policy for this sequential allocation task and define a regret formula expressing this error in relation to the optimal policy. Under this notion of regret, we prove this problem is equivalent to the classic stochastic multi-armed bandit problem (Theorem 4.1 and Theorem 4.2). As a direct result of this reduction, we are able to show that the regret upper bounds for many of the standard bandit approaches apply immediately to this setting (Corollary 4.1), in addition to the classic Lai-Robbins lower bound (Theorem 4.3). Lastly, we demonstrate that the existing bandit algorithms can significantly outperform existing population-based adaptive Monte Carlo methods on standard adaptive importance sampling tasks.

Extensions to Monte Carlo Samplers With Non-Uniform Costs

In the adaptive Monte Carlo framework detailed in Chapter 4 it is assumed that each of the underlying samplers requires the same amount of computation to produce a single sample. However, for many of the more sophisticated Monte Carlo techniques (such as AMCS) the per-sample computational cost can be unknown, stochastic, or vary with the parameterization of the method. In order to apply the bandit-based adaptive sampling routines to these more sophisticated samplers, we extend the previous formulation to account for non-uniform (stochastic) costs in Chapter 5. Ultimately, under mild technical assumptions, we show that through a straightforward sampling trick an Õ(√t) regret is achievable by standard bandit approaches in this setting (Theorem 5.1). We go on to show that these techniques can be used to develop an adaptive-AMCS variant that can outperform any fixed AMCS variant on the same suite of problems used in Chapter 3. We do, however, uncover an interesting negative result (linear regret) which can occur when there is more than one optimal sampler; we show that this problem will regularly surface when selecting between standard sequential Monte Carlo sampling algorithms. We go on to show that this poor performance stems from the simplifying decision to use the empirical (unweighted) average of the sampled values as the final estimate.

An Upper-Confidence Approach to the Weighted Estimation of a Common Mean

In Chapter 6 we consider the general task of constructing a weighting for a convex combination of unbiased estimators in order to produce a single estimate that minimizes the MSE. This formulation is applicable to a number of practical settings but is uniquely useful in addressing the issues uncovered in the non-uniform cost setting mentioned above. We first demonstrate that by weighting each estimator inversely proportional to its variance, one recovers the unique minimum variance unbiased estimator as well as a minimax estimator (Theorem 6.1). Using this approach as an optimal comparison, we construct a regret formulation for this task and introduce the UCB-W estimator, which uses an upper-confidence estimate of the sample variance to construct a weighting. We go on to show that this estimator achieves an Õ(√t) regret in the case where samples are selected deterministically (Theorem 6.2) or according to a random stopping rule (Theorem 6.3), thus generalizing the regret bounds proved in Chapter 5 to cover the edge-cases that resulted in linear regret. We evaluate the proposed UCB-W estimator and show that it offers, frankly, massive savings in both the uniform cost and non-uniform cost bandit settings and far outperforms the existing Graybill-Deal estimator, which weights each estimator using its sample variance.

Chapter 2

Background

"Monte Carlo is an extremely bad method; it should be used only when all alternative methods are worse."
Alan D. Sokal

Monte Carlo integration approaches represent a uniquely powerful class of algorithms yet, at the same time, the algorithms are typically straightforward to understand, implement, and extend. Indeed, the ease with which problem-specific extensions are constructed has resulted in a number of unique Monte Carlo algorithms represented in the literature. In this chapter, we provide a high level overview of the more popular approaches on which many of the more sophisticated methods are based.

An important precursor to understanding the tradeoffs between different Monte Carlo integration approaches is to identify a means by which to assess the quality of a given estimator. For any distribution Π ∈ P belonging to the probability triple (𝒳, P, ℬ), parameter space Θ, and a parameter of interest µ : P → Θ, we define an estimator as a mapping µ̂ : 𝒳ⁿ → Θ. Note that for the purposes of this thesis we will consider the sample space to be 𝒳 = ℝ^d, ℬ to be the Borel σ-algebra, and P to be the Lebesgue measure, unless otherwise specified. Additionally, in the sequel we make use of the shorthand µ̂_n := µ̂_n(X_1, ..., X_n), with (X_1, ..., X_n) ∼ Π, and µ := µ(Π) when the arguments are clear from the available context. In this notation, we can evaluate the quality of an estimator using a loss function L(µ̂_n, µ) and define the risk of the estimator as the expected loss: R(µ̂_n, µ) := E[L(µ̂_n, µ)]. While there are a variety of different loss functions and parameter spaces considered in the literature, in this thesis we focus on the most common setting, where Θ = ℝ and L is the L2 loss function: L(x, y) = (x − y)². Here, the risk reduces to the well known mean squared error (MSE) measure, denoted as MSE(µ̂_n, µ) := E[(µ̂_n − µ)²].

An important property of the MSE is the bias-variance decomposition

$$\mathrm{MSE}(\hat\mu_n) = \big(\mathbb{E}[\hat\mu_n - \mu]\big)^2 + \mathbb{E}\big[(\hat\mu_n - \mathbb{E}[\hat\mu_n])^2\big] =: \mathrm{bias}(\hat\mu_n, \mu)^2 + \mathbb{V}(\hat\mu_n),$$

which gives rise to the common convergence measures used in the evaluation of any Monte Carlo method: unbiasedness, when E[µ̂_n] = µ, and consistency, when lim_{n→∞} P(|µ̂_n − µ| ≥ ε) = 0 for any ε > 0. In the context of Monte Carlo samplers, unbiasedness is generally considered the stronger of the two conditions as it often implies the latter. For instance, in the usual setting, where (X_1, ..., X_n) are i.i.d., the basic MC estimator µ̂_n^MC is unbiased and has MSE given by (1/n)V(X) = O(n⁻¹), which can be used (together with Chebyshev's inequality) to establish consistency.

2.1 Variance Reduction

Many of the more popular Monte Carlo procedures use a sequence of i.i.d. samples and lead to unbiased estimators. Therefore, techniques for reducing the approximation error typically involve reducing the variance of individual samples and, as a result, many of the classic computation-saving Monte Carlo techniques are referred to as variance reduction methods. In this section we review some of these techniques, as they are the foundation for many of the more powerful approaches we will later consider.

2.1.1 Importance Sampling

In many cases approximation of the integral in Eq. (1.1) by directly sampling from the target distribution (π) is computationally infeasible. This occurs either because generating samples is too expensive or because π is poorly matched to h and individual samples are highly variable. The latter case typically results when π assigns low probability to regions where the target function h has large magnitude. Importance sampling is one of the standard ways of addressing these issues. The approach is straightforward: one generates samples from a proposal distribution π₀ that is inexpensive to simulate and, ideally, assigns high likelihood to the high magnitude, or important, regions of h. The sampling bias introduced by this change of measure is removed through the application of an importance weight. Specifically, given a sequence of n i.i.d. random variables (X_1, ..., X_n) distributed according to π₀, the importance sampling estimator is given by

$$\hat\mu_n^{IS} := \frac{1}{n}\sum_{i=1}^{n} w(X_i)\,h(X_i), \qquad (2.1)$$

where w(x) = π(x)/π₀(x) defines the importance weighting function.
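As a concrete illustration of Eq. (2.1), the short Python sketch below estimates a small tail probability under a standard normal target using a shifted proposal; the particular target, proposal, and integrand are illustrative assumptions, not examples taken from this thesis.

```python
import numpy as np

rng = np.random.default_rng(1)

def norm_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2), written out to keep the sketch self-contained."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Assumed setup: target pi = N(0, 1), h the indicator of the right tail,
# and proposal pi0 = N(3, 1) shifted toward the "important" region of h.
h = lambda x: (x > 2.0).astype(float)               # mu = P(X > 2) under pi
n = 100_000
x = rng.normal(3.0, 1.0, n)                          # draws from the proposal pi0
w = norm_pdf(x, 0.0, 1.0) / norm_pdf(x, 3.0, 1.0)    # weights w(x) = pi(x) / pi0(x)

mu_is = np.mean(w * h(x))                            # the IS estimator (2.1)
mu_mc = np.mean(h(rng.normal(0.0, 1.0, n)))          # vanilla MC for comparison
print(mu_is, mu_mc)                                  # true value is roughly 0.0228
```

Because nearly every proposal draw lands where h is nonzero, the IS estimate concentrates much faster than the vanilla estimate, which wastes most of its samples where h(x) = 0.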

The importance sampling estimator (2.1) is unbiased and consistent provided that supp(π) ⊆ supp(π₀) and V_{π₀}(w(X)) < ∞. Additionally, the method will result in a lower MSE whenever V_{π₀}(w(X)h(X)) < V_π(h(X)); indeed, for non-negative h, when the proposal density is proportional to the integrand,

$$\pi_0 = \pi^* := \frac{h\pi}{\int h(x)\pi(x)\,dx},$$

we have V_{π₀}(w(X)h(X)) = 0. Unfortunately, using this proposal is not possible in practice since the necessary normalizing constant for π* is equal to the original unknown (µ). This identity does, however, give the practitioner an indication as to how to select an effective proposal.

In some cases, the target distribution π can be evaluated only up to an unknown normalizing constant ζ, which often must be approximated in order to get an accurate estimate of µ. Also, in some settings the quantity ζ is of independent interest, for example in the classic task of Bayesian model comparison (see Robert (2010)). An interesting advantage of importance sampling is that it can be used to approximate both µ and ζ from the same set of samples. Specifically, if we denote the un-normalized density as π̂ := πζ, we can use the unbiased estimator

$$\hat\zeta_n^{IS} := \frac{1}{n}\sum_{i=1}^{n} w(X_i), \qquad (2.2)$$

with w(x) = π̂(x)/π₀(x). Using this estimate we can then approximate µ with the weighted importance sampling estimator

$$\hat\mu_n^{WIS} := \frac{\frac{1}{n}\sum_{i=1}^{n} w(X_i)\,h(X_i)}{\hat\zeta_n^{IS}}. \qquad (2.3)$$

As a result of the division by the random quantity ζ̂, this estimator cannot be said to be unbiased. However, the bias can be shown to decrease at a rate of O(n⁻¹) (Powell and Swann, 1966). Also, even in cases where the normalizing constant is known, the weighted importance sampling estimator will often outperform the unbiased version.

A primary challenge in deploying importance sampling is constructing proposals that assign sufficient density to the tail regions of π. If, for example, the tails of the proposal converge to zero faster than those of the target distribution, then the importance weights will diverge and V(w(X)) will not be finite. Moreover, because these problematic regions are never sampled in practice, it is effectively impossible to diagnose the problem numerically; the unsuspecting practitioner will observe consistent, steady convergence to the same incorrect estimate even between multiple runs. To address this difficulty, it is common to employ a so-called defensive sampling strategy where a heavy-tailed distribution, such as a multivariate Student density, is added to π₀ with a small mixing coefficient (see Robert and Casella (2005)). However, while this approach generally ensures the variance will be finite,

in practice it often does little to ensure the variance will not be prohibitively high. Consequently, engineering proposal densities and numerically diagnosing the convergence of an IS estimate, especially in high dimensions, remains a serious practical concern. It is worth noting also that this problem is often exacerbated by attempts to adapt the proposal to the integrand using previously sampled points, as we explain in Section 2.2.

2.1.2 Antithetic Variates

One of the most straightforward variance reduction techniques is the method of antithetic variates. The approach works in the following way: suppose we have two sets of correlated random variables, (X_1, ..., X_n) and (Y_1, ..., Y_n), such that E[X] = E[Y] = µ, with X_i ⊥ X_j, Y_i ⊥ Y_j, and X_i ⊥ Y_j for i ≠ j. Then the estimator

$$\hat\mu_n^{AV} := \frac{1}{n}\sum_{i=1}^{n} \frac{X_i + Y_i}{2}$$

is unbiased and has variance given by

$$\mathbb{V}(\hat\mu_n^{AV}) = \frac{1}{4n}\mathbb{V}(X + Y) = \frac{1}{4n}\big(\mathbb{V}(X) + \mathbb{V}(Y) + 2\,\mathrm{Cov}(X, Y)\big),$$

where it is understood that X ∼ X_i and Y ∼ Y_i for any i. This implies that the estimator µ̂_n^AV will offer a reduction over the vanilla Monte Carlo estimator (using 2n samples) whenever X and Y are negatively correlated. For example, suppose that X ∼ Uniform(a, b) and that h is a monotonically increasing or decreasing function. If we define the antithetic variate as Y = b + a − X, it is clear that E[h(Y)] = E[h(X)] = µ and, since h is a monotonic function, it is straightforward to establish Cov(h(X), h(Y)) < 0. However, as with importance sampling, practical scenarios where the practitioner has enough a priori knowledge to design variance-reducing antithetic transforms are not as common as one might like.
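The uniform example just given is easy to test numerically. The sketch below compares a vanilla estimator using 2n draws against the antithetic estimator using n pairs; the choice h(x) = exp(x) on (0, 1) is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed example: X ~ Uniform(0, 1), monotone h(x) = exp(x), antithetic
# variate Y = 1 - X. h(X) and h(Y) are negatively correlated, so averaging
# each pair reduces the variance of the final estimate.
h = np.exp
n = 50_000

plain = h(rng.uniform(0.0, 1.0, 2 * n))        # vanilla MC with 2n samples
x = rng.uniform(0.0, 1.0, n)
pairs = 0.5 * (h(x) + h(1.0 - x))              # antithetic pairs, n samples

print(plain.mean(), pairs.mean(), np.e - 1.0)  # both near the true value e - 1
print(plain.var() / (2 * n), pairs.var() / n)  # variance of each final estimator
```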

2.1.3 Stratified Sampling

Stratified sampling is a variance reduction technique that is used frequently in classic statistics settings such as opinion polling or population estimation. In such cases, the sample population is often endowed with a natural partition based on characteristics such as age, sex, ethnicity, etc. The general idea of this approach is to approximate an average for each of the subsets separately, paying more attention to the difficult ones, and combine these estimates in proportion to their respective sizes.

Suppose our sample space 𝒳 decomposes into a partition (𝒳_1, ..., 𝒳_k) that permits separate Monte Carlo approximations of the integral on each domain; that is,

$$\int_{\mathcal{X}} h(x)\pi(x)\,dx = \sum_{i=1}^{k} \lambda_i \int_{\mathcal{X}_i} h(x)\pi_i(x)\,dx = \sum_{i=1}^{k} \lambda_i \mu_i,$$

where λ_i gives the probability of each region (stratum) under π and π_i denotes the density proportional to π within this region. As with π, it is required that each π_i permit efficient point-wise evaluation. If we let µ̂_{i,n} denote an unbiased Monte Carlo estimator for the stratum mean µ_i at time[1] n, then the stratified sampling estimator is given by

$$\hat\mu_n^{SS} = \sum_{i=1}^{k} \lambda_i\, \hat\mu_{i,n_i}.$$

This estimator is unbiased and, assuming each estimator uses n_i independent samples (chosen deterministically with n = Σ_{i=1}^k n_i) with variance σ_i², has a variance given by

$$\mathbb{V}(\hat\mu_n^{SS}) = \sum_{i=1}^{k} \frac{\lambda_i^2 \sigma_i^2}{n_i}.$$

Assuming the variances are known a priori, we can minimize the error by selecting the sample sizes for each stratum so that they are each approximated uniformly well. In particular, the variance is minimized with the parameterization

$$n_i := \frac{n\,\lambda_i \sigma_i}{\sum_{j=1}^{k} \lambda_j \sigma_j},$$

which gives

$$\mathbb{V}(\hat\mu_n^{SS}) = \frac{\big(\sum_{i=1}^{k} \lambda_i \sigma_i\big)^2}{n}.$$

If all the variances are equal, the above estimator has the same variance as the vanilla Monte Carlo estimate; however, if the variances across the strata are unequal, the estimator can be considerably more efficient. One practical challenge, however, is that an obvious partitioning (stratification) of the sample space is not always apparent, or the practitioner may not have a good understanding of the variability within each stratum, which would permit effective sample allocations. Though, in some cases it may be worthwhile to tune these parameters automatically while sampling; in Section 2.5 we review adaptive stratified sampling approaches which do just that.

[1] Here, time refers to the cumulative number of samples drawn for all strata.
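The following minimal sketch applies these formulas under assumed choices: a uniform target on (0, 1) split into equal-width strata, h(x) = x², and per-stratum standard deviations estimated from a small pilot run rather than known a priori.

```python
import numpy as np

rng = np.random.default_rng(3)

h = lambda x: x ** 2
k, n = 10, 10_000
edges = np.linspace(0.0, 1.0, k + 1)       # equal-width strata on (0, 1)
lam = np.full(k, 1.0 / k)                  # lambda_i: stratum probabilities

# Pilot draws stand in for the unknown sigma_i; then the optimal allocation
# n_i proportional to lambda_i * sigma_i from the text is applied.
sig = np.array([h(rng.uniform(edges[i], edges[i + 1], 200)).std()
                for i in range(k)])
alloc = np.maximum(1, np.round(n * lam * sig / np.sum(lam * sig)).astype(int))

est = sum(lam[i] * h(rng.uniform(edges[i], edges[i + 1], alloc[i])).mean()
          for i in range(k))
print(est, 1.0 / 3.0)                      # stratified estimate vs true integral
```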

2.2 Adaptive Importance Sampling

In this section we review ways in which an importance sampling proposal can be configured automatically from sampled data, as opposed to being fixed by the practitioner ahead of time.

In settings where the target distribution is unimodal and twice differentiable, a straightforward approach is to fit the proposal density using moment matching around the global maximum of the target density: x* := arg max_x log(π(x)). Specifically, by defining the proposal density with a convenient parametric form, such as a multivariate normal or Student density, it can be fit to a single point by setting the mean equal to x* and the covariance equal to the inverse of the Hessian matrix of log(π(x)) at x* (alternatively known as the observed Fisher information matrix I). In the case where π is well represented by the chosen parametric form, this method produces, very quickly, a near optimal importance sampling distribution. Though this approach is obviously applicable to a restricted class of integration problems, for some big data Bayesian inference tasks it can be motivated by the Bayesian central limit theorem. This theorem essentially states that the posterior distribution converges to N(x*, I) as the number of observations increases (see Sec. 4, Ghosh et al. (2006)). Coupled with data efficient optimization routines, such as stochastic gradient descent, this approach may come in useful in a number of contexts.
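A minimal sketch of this moment-matching recipe is given below, assuming a one-dimensional unnormalized target; the density, the finite-difference Hessian, and all tolerances are illustrative assumptions. It fits a Gaussian at the mode and then reuses the estimators (2.2) and (2.3).

```python
import numpy as np
from scipy import optimize

rng = np.random.default_rng(4)

# Assumed unimodal, unnormalized target density (illustrative only).
log_pi = lambda x: -0.5 * (x - 2.0) ** 4 - x ** 2 / 8.0

# Fit the proposal: mean at the mode of log pi, covariance from the
# inverse of the (finite-difference) Hessian of log pi at the mode.
x_star = optimize.minimize_scalar(lambda x: -log_pi(x)).x
eps = 1e-4
hess = (log_pi(x_star + eps) - 2 * log_pi(x_star) + log_pi(x_star - eps)) / eps ** 2
sigma = np.sqrt(-1.0 / hess)

x = rng.normal(x_star, sigma, 100_000)
log_q = -0.5 * ((x - x_star) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))
w = np.exp(log_pi(x) - log_q)              # unnormalized importance weights

print(w.mean())                            # normalizing constant estimate (2.2)
print(np.sum(w * x) / np.sum(w))           # weighted IS estimate of the mean (2.3)
```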

For more complex distributions, particularly those with multiple local modes, extending this approach raises the question of how much computational effort should be spent searching out various modes versus actual sampling. A common way to address this tradeoff is to continually re-optimize the proposal density after drawing a new sample, as is done in the parametric adaptive importance sampling (PAIS) scheme of Oh and Berger (1993). In particular, this PAIS approach uses a mixture of k multivariate Student densities with fixed degree of freedom ν and parameters λ = {(c_i, µ_i, Σ_i)}_{i=1}^k for a proposal density, that is,

$$\pi_\lambda(x) := \sum_{i=1}^{k} c_i\, t_\nu(x; \mu_i, \Sigma_i),$$

where c_i ≥ 0 and Σ_i c_i = 1. The objective is to determine the parameterization (λ) that minimizes the variance of the importance weights, V_{π_λ}(w(X)), or equivalently E_{π_λ}[w(X)²], on past data. This is done in incremental fashion where at each time-step t the algorithm draws a new sample, X_t ∼ π_{λ_t}, then solves the optimization

$$\lambda_{t+1} = \arg\min_\lambda \sum_{j=0}^{t} \left(\frac{\pi(X_j)}{\pi_\lambda(X_j)}\right)^2 \frac{\pi_\lambda(X_j)}{\pi_{\lambda_j}(X_j)}. \qquad (2.4)$$

The final approximation, then, is given by (2.1) or (2.2) with w(x) = π(x)/π_{λ_t}(x). This objective function, however, can make it difficult to formulate efficient optimization routines (even in the case k = 1). The efficient importance sampling (EIS) approach of Richard and Zhang (2007) addresses this concern through the use of an alternative heuristic, specifically the pseudo-divergence

$$d(\pi, \pi_\lambda; \alpha) := \int \big(\log(\pi(x)) - \alpha - \log(\pi_\lambda(x))\big)^2\, \pi(x)\,dx.$$

As with the previous approach, d can be approximated with an IS estimate using previous samples. This results in the optimization

$$\lambda_{t+1} = \arg\min_\lambda \min_\alpha \sum_{j=0}^{t} \big(\log(\pi(X_j)) - \alpha - \log(\pi_\lambda(X_j))\big)^2\, \frac{\pi(X_j)}{\pi_{\lambda_j}(X_j)}.$$

This objective can be significantly easier to work with for some parameterizations of π_λ. In particular, if π_λ is a k = 1 mixture of exponential family distributions, the optimization reduces to a least squares problem. In this way, the EIS approach is similar to the method of variational Bayes: a deterministic approximation approach where one attempts to minimize the Kullback-Leibler divergence between the target and a parametric density (see Bishop et al. (2006)).

An important limitation inherent to PAIS approaches is that the target density is rarely well approximated by a single exponential family distribution. As a result, the use of a mixture distribution (setting k > 1) is common. However, in this setting fitting the proposal is similar to solving the (NP-hard) k-means clustering problem at each time-step. Even if this optimization could be solved efficiently, it is not always obvious a priori what setting of k, or what underlying parametric distributions, might lead to a suitable approximation of the integrand.

2.2.1 Population Monte Carlo

The population Monte Carlo (PMC) algorithm (Cappé et al., 2004) is a clever PAIS variant that partially addresses both the issue of solving a non-convex optimization and that of specifying the number of modes (k) manually. These problems are side-stepped through the use of kernel density estimation on the target distribution, as opposed to directly optimizing a parametric form. Specifically, a fixed Markov transition kernel k_δ(x, x′) (defined below) parameterized by a bandwidth parameter δ is used.

Definition 2.1 (Markov kernel). A Markov kernel k on the probability triple (𝒳, P, ℬ) is a function k : 𝒳 × ℬ → ℝ having the following properties:
(i) for each fixed A ∈ ℬ, the function x ↦ k(x, A) is Borel measurable;
(ii) for each fixed x ∈ 𝒳, the function A ↦ k(x, A) is a probability measure.

Additionally, as is common in the Monte Carlo literature, for any x, x′ ∈ 𝒳 we let k(x, x′) denote the conditional density of the transition from x to x′; that is, for any A ∈ ℬ we have P(X′ ∈ A | x) = ∫_A k(x, x′)dx′.

The PMC approach proceeds as follows: given an initial population of samples (X_1^(0), ..., X_n^(0)) drawn i.i.d. from π₀(·), the PMC proposal density at time-step t + 1 is parameterized as a mixture density:

$$\pi_{t+1}(x) := \frac{1}{Z}\left(\beta \sum_{i=1}^{n} w_t(X_i^{(t)})\, k_\delta(x, X_i^{(t)}) + (1 - \beta)\, t_\nu(x; \lambda, \Sigma)\right). \qquad (2.5)$$

Here w_t(x) = π(x)/π_t(x), t_ν is a defensive sampling Student distribution with fixed parameters (ν, λ, Σ) and mixing coefficient β ∈ [0, 1], and Z is a known normalizing constant. The PMC procedure is outlined in Algorithm 1; after executing it and collecting the set of samples {X_{1:n}^(0:m−1)}, one may use the PMC estimator given by

$$\hat\mu_{n,m}^{PMC} := \frac{1}{nm}\sum_{t=0}^{m-1}\sum_{i=1}^{n} w_t(X_i^{(t)})\, h(X_i^{(t)}).$$

Algorithm 1 Population Monte Carlo (PMC)
1: for t ∈ {0, ..., m − 1}
2:   for i ∈ {1, ..., n}
3:     Sample X_i^(t) ∼ π_t(·);
4:     Compute w_t(X_i^(t)) = π(X_i^(t)) / π_t(X_i^(t));
5:   end for
6: end for

Despite the fact that the samples drawn at each successive iteration are correlated, the PMC estimator can be shown to be unbiased through repeated applications of the tower rule. Critically, however, this statistical dependence prevents one from achieving consistency as m → ∞. Instead, in order to provide any theoretical guarantees one must rely on the fact that this estimate converges at a O(n⁻¹) rate. That is, the size of the population, which must be held in memory, must tend toward infinity to achieve convergence, which obviously presents practical challenges.

The PMC algorithm may also employ a resampling procedure where at each time-step t the population, (X_1^(t), ..., X_n^(t)), is resampled in proportion to the weights, (w_t(X_1^(t)), ..., w_t(X_n^(t))). Specifically, given a set of resampled points, denoted as (X̃_1^(t), ..., X̃_n^(t)), the proposal is defined as

$$\pi_{t+1}(x) := \frac{1}{Z}\left(\beta \sum_{i=1}^{n} k_\delta(x, \tilde X_i^{(t)}) + (1 - \beta)\, t_\nu(x; \lambda, \Sigma)\right). \qquad (2.6)$$

In this light, the PMC algorithm is very similar to sequential Monte Carlo (SMC) approaches (see Doucet et al. (2001)), which often benefit tremendously from resampling.
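To make the recursion concrete, the sketch below implements a stripped-down version of Algorithm 1 with the implicit-resampling proposal (2.6); the bimodal target, the Gaussian kernels standing in for k_δ, and the wide Gaussian standing in for the defensive Student component are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)

def norm_pdf(x, mu, s):
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))

# Assumed 1-d bimodal target (normalized, so the PMC estimator applies directly).
pi = lambda x: 0.5 * norm_pdf(x, -3.0, 1.0) + 0.5 * norm_pdf(x, 3.0, 1.0)

n, m, delta, beta = 500, 5, 0.5, 0.9
defense = (0.0, 10.0)                         # wide Gaussian as defensive component
x = rng.normal(*defense, n)                   # initial population from pi_0
samples, weights = [x], [pi(x) / norm_pdf(x, *defense)]

for t in range(m - 1):
    w_bar = weights[-1] / weights[-1].sum()
    centers = rng.choice(x, size=n, p=w_bar)  # implicit resampling step
    use_kern = rng.random(n) < beta           # kernel vs defensive component
    x = np.where(use_kern, rng.normal(centers, delta), rng.normal(*defense, n))
    # mixture proposal density (2.6), averaging the kernels over all centers
    q = beta * np.array([norm_pdf(xi, centers, delta).mean() for xi in x]) \
        + (1 - beta) * norm_pdf(x, *defense)
    samples.append(x)
    weights.append(pi(x) / q)

w, s = np.concatenate(weights), np.concatenate(samples)
print(np.sum(w * s ** 2) / (n * m))           # PMC estimate of E[X^2] (about 10)
```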

However similar, PMC and SMC differ in one critical aspect: the importance weight in an SMC implementation at time t is given as a product of previous importance weights,

$$w_t(X^{(t)}) = \prod_{k=0}^{t} \frac{\pi(X^{(k)})}{\pi_k(X^{(k)})},$$

[2] while the weights in the PMC sampler do not involve products. Consequently, the resampling procedure offers far less practical advantage in the PMC setting than in the SMC setting. In fact, one can observe that sampling from the proposal (2.5) is procedurally identical to sampling from (2.6); that is to say, the resampling step is already implicit.

The sequential nature of the PMC algorithm permits an additional form of adaptation where the kernel bandwidth δ can be tuned alongside the proposal. This is the approach taken by the d-kernel population Monte Carlo algorithm (Douc et al., 2007a). This adaptation is achieved by favouring the kernels, out of a discrete set, which have historically led to higher importance weights. Given a set of kernels, (k_1, ..., k_l), (having different bandwidth parameters, for example), we redefine the kernel density function used in (2.5) as a mixture of densities

$$k^{(t)}(x, x') := \sum_{j=1}^{l} \alpha_j^{(t)}\, k_j(x, x'),$$

where the mixture coefficients (α_1^(t), ..., α_l^(t)) are initialized uniformly and subsequently set proportional to the sum of all past sampling weights, that is:

$$\alpha_j^{(t+1)} := \alpha_j^{(t)} + \sum_{i=1}^{n} w_t(X_i^{(t)})\, \mathbb{I}\{k_i^{(t)} = k_j\},$$

where the indicator I{k_i^(t) = k_j} evaluates to 1 if density k_j was used to generate sample X_i^(t). While the d-kernel PMC method considers only a discrete set of kernels, these ideas can be combined with the optimization routines used in PAIS in order to adapt both the mixture coefficients as well as the parameters for each kernel (bandwidth), as is done in the PMC-E algorithm of Cappé et al. (2008). Both of these adaptive approaches can be shown to asymptotically converge to the proposal density (as n → ∞) within the parametric class that minimizes the KL-divergence with the target distribution.

It is important to note that when choosing the value for m the practitioner must be cautious of the fact that the accuracy of the kernel density estimate will begin to diminish as the number of time steps increases, regardless of whether resampling is used. This degradation is different from, but not entirely unlike, the particle degeneracy problem often observed in sequential Monte Carlo (SMC) settings. Consequently, the authors recommend setting this parameter to a rather small value, somewhere between 5 and 10, although this does somewhat limit the adaptability of the algorithm.

[2] We review the SMC framework in Section 2.4.

In either form, the PMC algorithm is an effective and elegant way of fitting the proposal density without requiring a non-convex optimization. That said, in practice, PMC improves on standard PAIS only to the extent that kernel density estimation is easier than k-means clustering. There are certainly domains where this is the case, but in general both of these tasks are computationally intractable and, consequently, both methods suffer from the various effects of local minima.

2.2.2 Discussion

An interesting aspect of the adaptive importance sampling setting is the fact that jointly optimizing the proposal, and integrating the function, immediately gives rise to an exploration-exploitation tradeoff. In particular, if a given algorithm converges too quickly on any mode of π, it will no longer sample other parts of the domain. It will therefore not be able to correct for the improper fit. This, coupled with the fact that very-high variance IS samplers are difficult to detect numerically, can result in highly unstable behaviour. The standard approach to address this is to mix the proposal with a defensive sampling distribution, though this heavy-handed approach ultimately results in unnecessary exploration and prevents asymptotic convergence to the optimal proposal. Developing methods for addressing this tradeoff remains an active area of research, and the methods presented in Chapter 4 can be seen, in part, as early steps in this area.

Perhaps a more fundamental observation to be made here is that adaptive importance sampling strategies can be highly advantageous for simple integration tasks (i.e. a small number of modes/dimensions) but often fail dramatically when presented with more complex problems. The unfortunate reality is that the core ideas of such global fitting approaches are somewhat at odds with the initial motivations for Monte Carlo integration. That is, Monte Carlo methods are often deployed as a last resort in complex domains where alternative approaches do not perform well. It is no coincidence that these problems are, overwhelmingly, those where the integrand is not easily approximated with convenient parametric, or semi-parametric, forms.

The main alternative to these global fitting approaches is what we refer to as local move methods, which exploit local structure through the use of simulated Markov chains. At a high level this strategy forms the basis for both the popular Markov chain Monte Carlo approach and the sequential Monte Carlo approach, which we review in the following sections.

2.3 Markov Chain Monte Carlo

The Markov chain Monte Carlo (MCMC) approach is possibly the most recognized and widely used Monte Carlo approach. As a result of its fundamentally different sampling mechanics, the method has its own rich body of literature which includes a number of unique strategies having a wide variety of applications. Most of this work falls outside our scope, and in this section we will only touch on some key aspects of MCMC as they relate to the methods we examine in this thesis. For a broader survey of the MCMC approach we refer to the seminal review given by Neal (1993) as well as the textbook treatments of Liu (2001) and Robert and Casella (2005). We note that some of the more cutting edge developments in MCMC include methods for exploiting gradient information on the target density, such as Langevin-adjusted and Hamiltonian MCMC as summarized by Neal (2011). Notable extensions to this work include methods exploiting Riemannian geometry (Girolami and Calderhead, 2011) and mini-batch methods for big data applications (Ahn et al., 2012; Bardenet et al., 2014). Additionally, adaptive MCMC methods, where the parameters of the simulation are tuned as a function of past samples, are now fairly well understood and have led to a number of practical advancements (Andrieu et al., 2006; Roberts and Rosenthal, 2009).

In general, MCMC methods can be used to generate samples from any distribution, even one known only up to a normalizing constant, through the simulation of a Markov chain having a stationary distribution equal to this target. Much of the popularity of this approach can be attributed to the ease with which such a Markov chain may be constructed. In particular, one must ensure that the chain satisfies the condition of detailed balance:

$$T(x, x')\,\pi(x) = T(x', x)\,\pi(x'),$$

where T is a Markov kernel that defines the simulated chain. Additionally, it must be ensured that the chain is ergodic, which implies that the chain will always result in the same stationary distribution regardless of starting point; in a way, this condition is analogous to the support conditions in importance sampling. If these two conditions are met, then samples generated from this simulation are guaranteed to be distributed (asymptotically) according to π.

The most basic MCMC construction is the Metropolis-Hastings (MH) sampler, which uses the transition kernel T(x, x′) = k(x, x′)α(x, x′), where k(x, x′) is a fixed Markov kernel (the proposal) and α(x, x′) is an acceptance function which gives the probability of

accepting or rejecting a given move so as to satisfy detailed balance, that is,

$$\alpha(x, x') := \min\left(1, \frac{\pi(x')\,k(x', x)}{\pi(x)\,k(x, x')}\right).$$

The algorithm proceeds as follows: given the current point X_t, we propose a point X′_t ∼ k(X_t, ·) and sample Y_t ∼ Uniform(0, 1); we then let

$$X_{t+1} = \mathbb{I}\{Y_t < \alpha(X_t, X'_t)\}\, X'_t + \mathbb{I}\{Y_t \ge \alpha(X_t, X'_t)\}\, X_t.$$

Note that the proposal k is unlike an importance sampling proposal in that it does not need full support over π; instead, as with PMC, a common starting point is a Gaussian kernel of specified width. Approximations of (1.1) may then be computed by simulating the above Markov chain, starting from an arbitrary start point X_0, for some specified number of time-steps, n, and evaluating the empirical sum

$$\hat\mu_n^{MCMC} := \frac{1}{n - n_B}\sum_{i=n_B}^{n} h(X_i). \qquad (2.7)$$

Here n_B is the number of samples required to surpass the burn-in period and should be large enough to ensure that X_{n_B} is independent of X_0.

As one can imagine, in settings where the target density is concentrated in a small region of the sample space, or along a lower-dimensional manifold, the MH algorithm will conduct a random walk over the relevant areas while largely ignoring the irrelevant parts. In high dimensional problems this integrand structure is common and, as a result, the performance of MCMC is simply unrivalled for such tasks.

It is possible to derive convergence guarantees for MCMC approaches, generally through conditions on the mixing rate of the Markov chain; that is, the rate at which the dependency between samples X_i and X_{i+d} drops off as d increases. For instance, if two samples were guaranteed to be independent after some finite number of steps (d), it is clear that the MSE of (2.7) decreases at a rate of O(d/n). In practice, however, it can be challenging to engineer rapidly mixing Markov chains, particularly in cases where the target density is comprised of multiple modes separated by regions of low probability. Also, poor mixing is often difficult to diagnose numerically, as there is no good way of knowing whether all of the modes have been visited, and in the correct proportions. Poor mixing rates may be one reason to favour an importance sampling approach over MCMC.
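The following sketch implements this random-walk Metropolis-Hastings loop for an assumed one-dimensional unnormalized target; with a symmetric Gaussian proposal the k terms in the acceptance function cancel. The target, step size, and burn-in length are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)

log_pi_hat = lambda x: -0.5 * x ** 2 + np.log1p(np.cos(x) ** 2)  # assumed target
h = lambda x: x ** 2

n, n_burn, step = 100_000, 5_000, 1.0
chain = np.empty(n)
x = 0.0
for t in range(n):
    prop = x + step * rng.standard_normal()
    # alpha = min(1, pi(x') / pi(x)); the symmetric proposal cancels out
    if np.log(rng.random()) < log_pi_hat(prop) - log_pi_hat(x):
        x = prop
    chain[t] = x

print(np.mean(h(chain[n_burn:])))    # the MCMC estimator of Eq. (2.7)
```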

Additionally, importance sampling approaches are often favoured for approximating the normalization constant of π̂, as this task can be quite challenging with MCMC approaches. For instance, the most straightforward strategy, which is likely the first thing any unsuspecting practitioner might try, is the so-called harmonic mean (HM) method (Newton and Raftery, 1994). Here, one first collects samples X_i ∼ π using any MCMC approach, then uses these samples in conjunction with the weighted importance sampling estimator to approximate the normalizing constant. Specifically, one deploys the estimator

$$\hat\zeta_n^{HM} := \left(\frac{1}{n - n_B}\sum_{i=n_B}^{n} \frac{1}{\hat\pi(X_i)}\right)^{-1}.$$

Although intuitive and mathematically elegant, in practice this estimate is rarely useful as it will typically exhibit extremely high or infinite variance. There are numerous other ways of computing normalizing constants with MCMC (see Neal (2005)), though none can be said to be as straightforward or generally applicable as their counterparts for approximating µ.

2.4 Sequential Monte Carlo

Another sophisticated and widely applicable class of Monte Carlo algorithms are sequential Monte Carlo (SMC) methods. In the SMC setting the distribution of interest, π, is assumed to be decomposable into a sequence of known conditional densities, that is,

$$\pi(x^{(0:m)}) = \pi_0(x^{(0)})\,\pi_1(x^{(1)} \mid x^{(0)})\cdots\pi_m(x^{(m)} \mid x^{(m-1)}).$$

The integral in (1.1) can then be written as

$$\mu = \int h(x^{(0:m)})\,\pi(x^{(0:m)})\,dx^{(0:m)} = \int h(x^{(0:m)})\,\pi_0(x^{(0)})\prod_{i=1}^{m}\pi_i(x^{(i)} \mid x^{(i-1)})\,dx^{(0:m)}.$$

Additionally, it is often the case that h may be factored similarly to the above equation or is independent of most variables. For instance, it is common that h(x^(0:m)) = h(x^(m)). The SMC formulation arises naturally in numerous practical settings, from protein formation and robot dynamics to financial option pricing (see Doucet et al. (2001)).

Similar to the general integration problem, it is often the case that direct simulation from π does not result in efficient estimators. In such cases the sequential importance sampling (SIS) approach may be used, where one uses a similarly factored proposal density:

$$g(x^{(0:m)}) := g_0(x^{(0)})\,g_1(x^{(1)} \mid x^{(0)})\cdots g_m(x^{(m)} \mid x^{(m-1)}).$$

Given samples (X_1^(0:m), ..., X_n^(0:m)) drawn i.i.d. from g(·), we arrive at the following estimator:

$$\hat\mu_n^{SIS} := \frac{1}{n}\sum_{i=1}^{n} w(X_i^{(0:m)})\,h(X_i^{(0:m)}),$$

where

$$w(x^{(0:m)}) = \left(\frac{\pi_0(x^{(0)})}{g_0(x^{(0)})}\right)\left(\frac{\pi_1(x^{(1)} \mid x^{(0)})}{g_1(x^{(1)} \mid x^{(0)})}\right)\cdots\left(\frac{\pi_m(x^{(m)} \mid x^{(m-1)})}{g_m(x^{(m)} \mid x^{(m-1)})}\right).$$

This approach is often desirable either because conditional proposal densities are easier to engineer or because population-based resampling approaches can be used, as is done by the sequential importance resampling (SIR) algorithm of Gordon et al. (1993). The insight behind these population-based techniques can be explained as follows. Suppose we have samples (X_1^(j−1), ..., X_n^(j−1)) drawn i.i.d. from π_{j−1}(· | X^(0:j−2)) as well as samples {X_i^(j) ∼ g_j(· | X_i^(j−1))}_{i=1}^n. Given this, one can approximate the distribution π_j with the empirical distribution

$$\pi_j(x \mid x^{(0:j-1)}) \approx \sum_{i=1}^{n} \bar w_j(X_i^{(j)})\,\delta_{X_i^{(j)}}(x),$$

where

$$\bar w_j(X_i^{(j)}) := \frac{w_j(X_i^{(j)})}{\sum_{k=1}^{n} w_j(X_k^{(j)})}, \qquad w_j(X_i^{(j)}) = \frac{\pi_j(X_i^{(j)} \mid X_i^{(j-1)})}{g_j(X_i^{(j)} \mid X_i^{(j-1)})},$$

and δ_x denotes the Dirac δ-function centered at x. This empirical distribution may be used as-is to simulate the next step of the recursion. However, over time, the variance of the importance weights will tend to increase, which results in a poor approximation. This problem is referred to as particle degeneracy and is routinely measured by the effective sample size (ESS) (Liu and Chen, 1998), given by

$$\mathrm{ESS}(X_{1:n}) := \frac{\big(\sum_{i=1}^{n} w(X_i)\big)^2}{\sum_{i=1}^{n} w(X_i)^2}. \qquad (2.8)$$

The ESS takes on values in [1, n] and can be used to monitor the health of the population. If the value drops below some fixed threshold, ε ∈ (0, n), one may attempt to improve the approximation by resampling a new set of particles from the empirical distribution. The most straightforward resampling approach is to draw particles with replacement from a multinomial distribution, though there are a number of more efficient resampling procedures such as residual (Liu and Chen, 1998), systematic (Carpenter et al., 1999), and stratified resampling (Kitagawa, 1996). A basic SIR implementation, which uses simple multinomial resampling at every step, is given in Algorithm 2. Following normal execution of Algorithm 2 the SIR estimator for µ is given as

$$\hat\mu_n^{SIR} := \frac{1}{n}\sum_{i=1}^{n} h(X_i^{(m)})\,w_m(X_i^{(m)}),$$

as well as the (log) normalizing constant (see Del Moral and Doucet (2002)),

$$\hat\zeta_n^{SIR} := \sum_{j=0}^{m} \log\left(\sum_{i=1}^{n} w_j(X_i^{(j)})\right) - m \log n.$$

Equivalent estimators for the case where resampling stages are executed depending on the effective sample size are given in Del Moral et al. (2006).
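The ESS test and the multinomial resampling step it triggers can be written in a few lines; the sketch below, which complements Algorithm 2, uses particles, weights, and a threshold that are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

def ess(w):
    """Effective sample size of a weighted population, Eq. (2.8)."""
    return w.sum() ** 2 / np.sum(w ** 2)

def multinomial_resample(particles, w, rng):
    """Draw n particles with replacement in proportion to their weights."""
    idx = rng.choice(len(particles), size=len(particles), p=w / w.sum())
    return particles[idx]

# Assumed weighted population: prior draws weighted by a Gaussian likelihood.
particles = rng.standard_normal(1_000)
w = np.exp(-0.5 * (particles - 1.0) ** 2)

print(ess(w))                            # health of the population, in [1, n]
if ess(w) < 0.5 * len(particles):        # fixed threshold epsilon
    particles = multinomial_resample(particles, w, rng)
    w = np.ones(len(particles))          # weights are uniform after resampling
```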

Algorithm 2 Sequential Importance Resampling (SIR)
1: for i ∈ {1, ..., n}
2:   Sample X_i^(0) ∼ g_0(·);
3:   Compute w_0(X_i^(0)) = π_0(X_i^(0)) / g_0(X_i^(0));
4: end for
5: for j ∈ {1, ..., m}
6:   Resample X̃_{1:n}^(j−1) ∼ Multinomial(X_{1:n}^(j−1), w̄(X_{1:n}^(j−1)), n);
7:   for i ∈ {1, ..., n}
8:     Sample X_i^(j) ∼ g_j(· | X̃_i^(j−1));
9:     Compute w_j(X_i^(j)) = π_j(X_i^(j) | X̃_i^(j−1)) / g_j(X_i^(j) | X̃_i^(j−1));
10:   end for
11: end for

In general, the use of resampling in SMC methods can lead to considerably improved estimates for some problem domains, in particular sequential state tracking tasks. However, the approach is not guaranteed to offer a reduction in variance and may actually introduce additional approximation error due to resampling variance and particle degeneracy. Additionally, as with the population Monte Carlo approach, theoretical convergence results, typically in the form of a central limit theorem, require that the population size grow indefinitely.

2.4.1 Sequential Monte Carlo Samplers

Sequential Monte Carlo samplers (SMCS) (Del Moral et al., 2006) are a class of importance sampling algorithms that define their proposal distributions as a sequence of local Markov transitions. As a result of this construction, SMCS methods are able to combine many of the advantages of MCMC methods with those of SMC methods. SMCS algorithms operate under the same assumptions as regular importance sampling in that they do not require that the target distribution factor into a product of conditionals. Instead, the algorithm exploits the fact that a sequence of Markov transitions may be factored in much the same way. Specifically, the SMCS proposal density is defined using an initial proposal, π_0, as well as a sequence of forward Markov kernels, f_{1:m}, which are assumed to be evaluable pointwise and simulable. In particular, we define the proposal density as

$$g(x^{(0:m)}) := \pi_0(x^{(0)})\prod_{j=1}^{m} f_j(x^{(j-1)}, x^{(j)}). \qquad (2.9)$$

In order to permit convenient cancellations in later steps, the target distribution, π, is also augmented with a sequence of backward Markov transition kernels, b_{1:m}, which are

required only to be efficiently evaluable point-wise; that is, we define

$$\tilde\pi(x^{(0:m)}) := \pi(x^{(m)})\prod_{j=1}^{m} b_j(x^{(j)}, x^{(j-1)}). \qquad (2.10)$$

Since each b_j(x, ·) is a probability distribution, we have ∫ b_j(x, x′)dx′ = 1; it then follows that ∫ π̃(x^(0:m))dx^(0:m−1) = π(x^(m)), which implies that ∫ h(x^(m)) π̃(x^(0:m))dx^(0:m) = µ. From here, we can approximate this expanded integral through importance sampling methods, which can be verified by observing

$$\int h(x^{(m)})\,\frac{\tilde\pi(x^{(0:m)})}{g(x^{(0:m)})}\,g(x^{(0:m)})\,dx^{(0:m)} = \int h(x^{(m)})\,\tilde\pi(x^{(0:m)})\,dx^{(0:m)} = \int h(x^{(m)})\,\pi(x^{(m)})\,dx^{(m)}.$$

A critical detail is that π̃ and g are easily factored, which will later permit the application of SMC methods (i.e. resampling). In order to build in additional flexibility, the formulation allows for the use of local moves in conjunction with annealing distributions. Specifically, a sequence of unnormalized auxiliary distributions, π_{0:m}, which are generally expected to blend smoothly between the initial proposal distribution (π_0) and the target distribution (π_m := π), is used. A common choice is the tempered version

$$\pi_j = \pi^{(1-\beta_j)}\, \pi_0^{\beta_j}$$

for some fixed annealing schedule 1 = β_0 > β_1 > ... > β_m = 0 (Neal, 2001; Gelman and Meng, 1997). Additionally, in Bayesian settings the sequence of posterior distributions, each with incrementally more data, may be useful, that is, π_i(x) = π(x | z_1, ..., z_{(1−β_i)l}) π_0(x), where (z_1, ..., z_l) denotes the set of observed data (Chopin, 2002). Of course, the use of annealed distributions is not strictly required and one may elect to use the homogeneous parameterization, where π_1 = ... = π_m = π.

Using these auxiliary distributions, the target distribution is then re-written as

$$\hat\pi(x^{(0:m)}) := \pi_0(x^{(0)})\prod_{j=1}^{m} \frac{\pi_j(x^{(j)})\,b_j(x^{(j)}, x^{(j-1)})}{\pi_{j-1}(x^{(j-1)})},$$

where one can observe that these auxiliary distributions telescope to give π̂(x^(0:m)) = π̃(x^(0:m)). Temporarily ignoring these potential cancellations, the stepwise importance weighting function can be defined as

$$r_j(x^{(j-1)}, x^{(j)}) := \frac{\pi_j(x^{(j)})\,b_j(x^{(j)}, x^{(j-1)})}{\pi_{j-1}(x^{(j-1)})\,f_j(x^{(j-1)}, x^{(j)})}. \qquad (2.11)$$

Using these formulas, the basic SMCS procedure for simulating from g and recursively computing the appropriate weighting is given in Algorithm 3. As with standard SMC, it is straightforward to add a resampling step to this procedure.

Algorithm 3 Sequential Monte Carlo Sampling (SMCS)
1: for i ∈ {1, ..., n}
2:   Sample X_i^(0) ∼ π_0(·);
3:   Compute W_i^(0) = π_1(X_i^(0)) / π_0(X_i^(0));
4: end for
5: for j ∈ {1, ..., m}
6:   for i ∈ {1, ..., n}
7:     Sample X_i^(j) ∼ f_j(X_i^(j−1), ·);
8:     Compute W_i^(j) = W_i^(j−1) r_j(X_i^(j−1), X_i^(j));
9:   end for
10: end for

After generating samples {X_{1:n}^(0:m)} and corresponding weights {W_{1:n}^(0:m)} with this procedure, the (unbiased) SMCS estimator is given by

$$\hat\mu_n^{SMCS} := \frac{1}{n}\sum_{i=1}^{n} W_i^{(m)}\,h(X_i^{(m)});$$

additionally, as with standard importance sampling, the normalization constant for π may be estimated as

$$\hat\zeta_n^{SMCS} := \frac{1}{n}\sum_{i=1}^{n} W_i^{(m)}.$$

In the case where homogeneous auxiliary distributions are used, the samples at each time-step may be used (Del Moral and Doucet, 2002), resulting in the following (unbiased) estimator,

$$\hat\mu_n^{SMCS} := \frac{1}{n(m+1)}\sum_{i=1}^{n}\sum_{j=0}^{m} W_i^{(j)}\,h(X_i^{(j)}),$$

with ζ̂_n^SMCS defined similarly.

The most widely deployed instantiation of SMCS is the earlier method of annealed importance sampling (AIS) (Neal, 2001), which can be recovered by parameterizing the backward kernel as

$$b_j(x, x') = \frac{\pi_j(x')}{\pi_j(x)}\,f_j(x', x),$$

where f_j is any valid MCMC transition for π_j; that is, f_j(x, x′)π_j(x) = f_j(x′, x)π_j(x′). This choice leads to cancellations in the importance weighting function, yielding the simpler weighting function r_j(x, x′) = r_j(x) = π_j(x)/π_{j−1}(x).[3] This particular parameterization is powerful because it opens the door for the application of a vast number of existing MCMC strategies.

[3] Multiple MCMC transitions are typically executed on the same annealing step, as the composition still satisfies detailed balance.
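A minimal runnable AIS sketch is given below, assuming tempered targets π_j = π^(1−β_j) π_0^(β_j), one random-walk Metropolis move per annealing step, and the simplified stepwise weight r_j(x) = π_j(x)/π_{j−1}(x); the specific bimodal target and schedule are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(8)

log_pi0 = lambda x: -0.5 * (x / 5.0) ** 2                   # broad initial proposal
log_pi = lambda x: -0.5 * ((np.abs(x) - 3.0) / 0.5) ** 2    # assumed bimodal target

n, m, step = 2_000, 50, 0.8
betas = np.linspace(1.0, 0.0, m + 1)        # 1 = beta_0 > ... > beta_m = 0
log_pi_j = lambda x, b: (1.0 - b) * log_pi(x) + b * log_pi0(x)

x = rng.normal(0.0, 5.0, n)                 # X^(0) ~ pi_0
logw = np.zeros(n)
for j in range(1, m + 1):
    # accumulate log r_j(x) = log pi_j(x) - log pi_{j-1}(x)
    logw += log_pi_j(x, betas[j]) - log_pi_j(x, betas[j - 1])
    # one Metropolis transition leaving pi_j invariant
    prop = x + step * rng.standard_normal(n)
    accept = np.log(rng.random(n)) < log_pi_j(prop, betas[j]) - log_pi_j(x, betas[j])
    x = np.where(accept, prop, x)

w = np.exp(logw - logw.max())               # stabilized importance weights
print(np.sum(w * x ** 2) / np.sum(w))       # weighted estimate of E_pi[X^2]
```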

Moreover, AIS is able to sidestep the drawbacks of MCMC associated with mixing, since we have not required that the Markov chain simulate for some burn-in period or approach stationarity in order to ensure unbiasedness. Unfortunately, however, the MCMC chain must still produce samples roughly distributed according to the stationary distribution in order for the approach to offer a meaningful variance reduction (see Neal (2001)). It is worth noting that in cases where $\pi$ does not permit efficient mixing, AIS may be advantageous since the annealed distributions $\pi_i$ may be significantly easier to simulate with MCMC. In fact, this annealing is the key intuition behind the widely deployed MCMC method of parallel tempering (Neal, 1996; Earl and Deem, 2005).

2.4.2 Adaptive SMCS

SMCS algorithms can be considerably more effective than simple importance sampling methods since the proposal is altered automatically by the observed values of the integrand. Despite this dependence, SMCS methods are generally not considered adaptive methods since the parameters of the algorithm (i.e. the MCMC moves, annealing rate, and initial proposal) remain fixed. However, as with previous approaches, there have been a handful of adaptive SMCS extensions proposed in the literature, that is, procedures for automatically tuning the parameters of an SMCS algorithm using data obtained in past simulations (Chopin, 2002; Schäfer and Chopin, 2013; Jasra et al., 2011). As it currently stands, there is no general approach for incorporating such adaptive behaviour into SMCS; the few existing methods have little in common and each relies on specific properties of the underlying SMCS method. That said, provided that the adaptation occurs under mild technical conditions, it is at least straightforward to obtain asymptotic convergence results, in the form of consistency proofs for the resulting estimators, for any adaptive SMCS approach (Beskos et al., 2013). We note also that though all of the existing approaches make use of particle representations and resampling as the primary engine driving this adaptation, this is not strictly necessary.

One popular adaptive SMCS approach is given by Schäfer and Chopin (2013), who proposed a method for sequentially tuning the proposal for the Metropolis-Hastings MCMC sampler used by SMCS. In particular, it was observed that the ideal MH proposal at time $t$ is given by the distribution $\pi_t$. Unfortunately, this distribution cannot be sampled from efficiently, otherwise we would not be using MCMC in the first place. In order to bypass this difficulty, the proposed approach fits a parametric distribution to the current particle representation of $\pi_t$ through straightforward optimization routines. This fitted proposal is then used

in the MCMC transitions for the next update to the particle representation, and the process is repeated. Fitting a parametric density to a set of particles is a non-convex optimization in general, so some form of kernel density estimation may be advantageous.

Perhaps the most effective, and widely applicable, adaptive SMCS strategy is to tune the set of auxiliary distributions using the particle representation. The rate at which these distributions converge to the target density has a significant effect on the performance of the SMC sampler. If the distributions converge too slowly, the sampler will achieve less variance reduction per unit of computation. On the other hand, if the distributions converge too quickly, poor MCMC mixing or high-variance importance weights (which lead to particle degeneracy) often result. With these considerations in mind, one approach, given by Schäfer and Chopin (2013), is to tune the annealing parameters at each step so that a fixed effective sample size (Eq. (2.8)) is maintained, for example $ESS^* = 0.8n$. In particular, supposing that tempered distributions are used with $\beta_j$ defined as $\beta_{j+1} = \beta_j + \alpha$ for some fixed $\alpha \in (0, 1)$, we have that

$w_j(x, x') = r_j(x) = \frac{\pi_j(x)}{\pi_{j-1}(x)} = \left( \frac{\pi(x)}{\pi_0(x)} \right)^{\alpha}$.

The ESS can then be expressed as a function of $\alpha$ and the population $X_{1:n}^{(j)}$, and we can solve for $\alpha$ s.t. $ESS(\alpha, X_{1:n}^{(j)}) = ESS^*$. As with the adaptive approach above, this procedure can be shown to offer improved empirical performance over fixed schedules while still providing asymptotically consistent estimators (Beskos et al., 2013).
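A minimal sketch of this schedule-tuning step (ours; log_ratio[i] is assumed to hold $\log \pi(X_i^{(j)}) - \log \pi_0(X_i^{(j)})$ for each particle) solves for $\alpha$ by bisection, relying on the ESS of the incremental weights typically shrinking as $\alpha$ grows:

    import numpy as np

    def ess(log_w):
        # effective sample size of a set of log-weights
        w = np.exp(log_w - log_w.max())
        return w.sum() ** 2 / (w ** 2).sum()

    def next_alpha(log_ratio, ess_target, tol=1e-6):
        # incremental weights are (pi/pi_0)^alpha, so log-weights are linear in alpha
        lo, hi = 0.0, 1.0
        if ess(hi * log_ratio) >= ess_target:
            return hi  # a full step already keeps the ESS above the target
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            if ess(mid * log_ratio) >= ess_target:
                lo = mid
            else:
                hi = mid
        return lo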

2.5 Adaptive Stratified Sampling

As mentioned previously, the method of stratified sampling can yield substantial reductions in variance if the stratification and sample allocations are well tuned. However, in order for the allocation of samples to be efficient, the standard deviation ($\sigma_i$) of individual samples within a given stratum must be known beforehand, which is rarely the case. A natural strategy, then, is to approximate these standard deviations from previous samples and adjust future allocations accordingly. One such adaptive stratified sampling strategy is proposed by Étoré and Jourdain (2010) and operates similarly to the sequential optimization procedures for adaptive importance sampling detailed previously. In particular, this approach proceeds in a series of stages where the number of samples drawn in each stage ($l$) is specified beforehand. The first step in each stage is to approximate the sample standard deviations for each stratum as (reusing the notation from Section 2.1.3):

$\hat\sigma_{i,n} := \sqrt{ \frac{1}{n_i} \sum_{j=1}^{n_i} X_{i,j}^2 - \hat\mu_{i,n}^2 }$,

where $i \in \{1, \dots, k\}$ denotes the stratum index, $n_i$ the number of samples allocated to stratum $i$ thus far, $n = \sum_i n_i$, $X_{i,j} \sim \pi_i(\cdot)$, and $\hat\mu_{i,n} := \frac{1}{n_i} \sum_{j=1}^{n_i} X_{i,j}$. The next step is the exploration phase, where a single draw is allocated to each stratum, which ensures the asymptotic convergence of $\hat\sigma_{i,n}$ to $\sigma_i$. The remaining $l - k$ samples are then allocated in an exploitative manner, that is, in proportion to the optimal value assuming the $\hat\sigma_{i,n}$ approximations are accurate ($n_i \propto \lambda_i \hat\sigma_{i,n}$). Étoré and Jourdain provide asymptotic analysis for this sampling scheme and show that each stratum is sampled in the optimal proportion so long as the exploration phase ensures each stratum is sampled infinitely often and yet takes up a negligible percentage of the samples in relation to the exploitation phase.
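One stage of this estimate-then-allocate loop might look as follows (a Python sketch under our own naming; sample_stratum(i) is a hypothetical routine drawing one $X_{i,j}$ from stratum $i$, and counts, sums, sums_sq are float arrays of running statistics):

    import numpy as np

    def allocate_stage(sample_stratum, lam, counts, sums, sums_sq, l):
        # exploration: one forced draw per stratum keeps sigma_hat consistent
        k = len(lam)
        for i in range(k):
            x = sample_stratum(i)
            counts[i] += 1
            sums[i] += x
            sums_sq[i] += x * x
        mu_hat = sums / counts
        sigma_hat = np.sqrt(np.maximum(sums_sq / counts - mu_hat ** 2, 0.0))
        # exploitation: remaining draws go to the stratum most under-sampled
        # relative to the plug-in optimal proportions lam_i * sigma_hat_i
        w = lam * sigma_hat
        props = w / w.sum() if w.sum() > 0 else np.full(k, 1.0 / k)
        for _ in range(l - k):
            i = int(np.argmax(props - counts / counts.sum()))
            x = sample_stratum(i)
            counts[i] += 1
            sums[i] += x
            sums_sq[i] += x * x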

More recently, there have been a number of developments in adaptive stratified sampling in the field of machine learning which have borrowed techniques for balancing exploration and exploitation from the multi-armed bandit literature (Grover, 2009; Carpentier and Munos, 2011; Carpentier, 2012). Algorithmically, these approaches are not entirely unlike the approach of Étoré and Jourdain, though they balance this tradeoff more carefully. As a result, these approaches can be shown to achieve improved empirical performance as well as permit more robust finite-time guarantees. The problem is formalized as follows: consider a sequential allocation algorithm $A$ to be a procedure for selecting, at each time-step $t$, which stratum should receive the next sample. Denote this choice as $I_t \in \{1, \dots, k\}$. One can write the loss (MSE) of the stratified sampling estimator for $A$ as

$L_n(A) := E[(\hat\mu_n - \mu)^2] = E\left[ \left( \sum_{i=1}^{k} \lambda_i (\hat\mu_{i,n} - \mu_i) \right)^2 \right]$.

Letting the random variable $T_{i,n} := \sum_{t=1}^{n} I\{I_t = i\}$ denote the number of samples for stratum $i$ up to time $n$, one observes that the estimate $\hat\mu_{i,n} := \frac{1}{T_{i,n}} \sum_{t=1}^{T_{i,n}} X_{i,t}$ is not necessarily unbiased, since $T_{i,n}$ is allowed to depend on $X_{1:n-1}$. This dependence is the key frustration in analyzing these algorithms (indeed, most bandit algorithms) and motivates the use of the simplified weighted MSE loss $\tilde L_n(A)$ defined as the first term in the expanded loss:

$L_n(A) = \underbrace{E\left[ \sum_{i=1}^{k} \lambda_i^2 (\hat\mu_{i,n} - \mu_i)^2 \right]}_{\tilde L_n(A)} + E\left[ 2 \sum_{i < j} \lambda_i \lambda_j (\hat\mu_{i,n} - \mu_i)(\hat\mu_{j,n} - \mu_j) \right]$.

Additionally, since we are often interested in the performance of our allocation algorithm with respect to the loss under the optimal allocation, $L_n(A^*) = \frac{1}{n} \left( \sum_{i=1}^{k} \lambda_i \sigma_i \right)^2$ (Section 2.1.3), we define the regret for algorithm $A$ as

$R_n(A) = L_n(A) - L_n(A^*)$, and also $\tilde R_n(A) = \tilde L_n(A) - L_n(A^*)$.

The first algorithm in this space was the GAFS-WL algorithm proposed by Grover (2009), which allocates samples in the optimal proportion $\lambda_i \hat\sigma_i$ while managing exploration by ensuring each stratum is sampled at least $\sqrt{t}$ times. This simple algorithm was shown to have regret bounded as $\tilde R_n(A) = \tilde O(n^{-3/2})$ for any $n$, where the notation $\tilde O$ hides logarithmic factors. An alternative approach to managing exploration and exploitation in bandit problems is to construct confidence bounds around the approximation $\hat\sigma_i$ and allocate samples according to the worst-case (highest variance) value. Using this strategy one arrives at the MC-UCB algorithm of Carpentier and Munos (2011), which can also be shown to have regret bounded as $\tilde R_n(A) = \tilde O(n^{-3/2})$, though with slightly different constant factors than GAFS-WL. This result was later improved upon, and it can be shown that the true regret is bounded as $R_n(A) = \mathrm{poly}(\beta_{\min}^{-1})\, \tilde O(n^{-3/2})$, where $\beta_{\min}$ is a problem-dependent constant. Additionally, for the problem-independent (minimax) case the regret can be bounded as $R_n(A) = \tilde O(n^{-4/3})$, which has a matching lower bound (up to log factors); that is, $R_n(A) = \Omega(n^{-4/3})$ for any algorithm (Carpentier et al., 2014). The strong finite-time theoretical guarantees paired with these algorithms are in many ways the main contribution of this line of work, since this form of analysis is rarely seen in the area of adaptive Monte Carlo.

2.6 Summary

In this chapter we have reviewed a number of the more powerful Monte Carlo approaches and variance reduction techniques, as well as adaptive extensions to many of these approaches. In the following chapters we present novel approaches extending many of the methods presented here.

Chapter 3

Variance Reduction via Antithetic Markov Chains

"There are no routine statistical questions, only questionable statistical routines."
D.R. Cox

In this chapter we introduce a novel approach, dubbed antithetic Markov chain sampling (AMCS) (Neufeld et al., 2015), which improves on a fixed proposal density through the addition of local Markov chain simulations. In this respect the approach is similar to the sequential Monte Carlo sampling approach of Del Moral et al. (2006) and, at a high level, both approaches are essentially exploiting the same observation: integrands are often relatively smooth and have the majority of their mass concentrated in local regions or modes. The approaches differ primarily in how they parameterize the Markov chains which, as we will see, ultimately affects the types of integrands for which either method is most suitable.

Specifically, from each sampled point the AMCS sampler simulates two separate, short-lived Markov chains that are designed to head in opposite directions. The objective of these chains is to quickly acquire a set of points for which the integrand takes on a large range of values. Ultimately, by averaging over these points, much of the local variability in the integrand can be removed, lowering the variance of the final estimate. This technique allows for substantial savings over comparable Monte Carlo methods on highly peaked and multimodal integrands. An example of such an integrand, the log-likelihood function for a Bayesian robot localization task which we consider in Section 3.5, is shown in Fig. 3.1. This likelihood function has thousands of local modes separated by regions of low probability, which makes it challenging to engineer efficient proposal densities as well as MCMC chains that can explore this space efficiently. However, there is still a great deal of local structure, in the form of local modes, which can be exploited by an AMCS sampler.

Figure 3.1: Log-likelihood function of position given sensor readings in a Bayesian robot localization problem; 2 of 3 dimensions shown.

In the remainder of the chapter we outline the mechanics of the AMCS approach and provide some specific parameterizations and motivations for the Markov chains used by the method. Additionally, we establish the unbiasedness of this approach and analyze the potential for variance reduction. Lastly, the effectiveness of the approach, in contrast to similar algorithms, is evaluated empirically on non-trivial machine learning tasks.

3.1 Approach

In a nutshell, a single iteration of the AMCS algorithm proceeds as follows: the algorithm draws a single sample from the proposal $\pi_0$, simulates two independent Markov chains to produce a set of points, evaluates the target function on each, then returns the resulting average. After executing $n$ (independent) runs of this procedure, the empirical average of all these returns is used as the final estimate. In what follows we refer to the two Markov chains as the positive and negative chain and denote the Markov transition kernels (Definition 2.1) used by each as $k^+$ and $k^-$ respectively. Additionally, these chains terminate according to corresponding probabilistic stopping rules, referred to as (positive and negative) acceptance functions, $\alpha^+$ and $\alpha^-$, which specify the probability of accepting a move in the respective direction. The kernels must be efficiently evaluable and simulable, and must also satisfy a joint symmetry property together with the acceptance functions, as described in Definition 3.1. Additionally, for simplicity, throughout this chapter we assume that all functions ($\alpha^{+/-}$, $h$) and densities are measurable as needed.

Definition 3.1. The Markov kernels and acceptance functions specified by $(k^+, \alpha^+)$ and

$(k^-, \alpha^-)$ are said to be jointly symmetric iff for any $x, x' \in \mathbb{R}^d$ the following holds:

$k^+(x, x')\, \alpha^+(x, x') = k^-(x', x)\, \alpha^-(x', x)$.

Using these components the AMCS procedure can be described as follows. For each of the $n$ independent runs of the algorithm, a starting point $X^{(0)} \sim \pi_0(\cdot)$ is first sampled from the given proposal density. Outward from this starting point both a positive and a negative Markov chain are then simulated. For the positive chain, points are generated as $X^{(j)} \sim k^+(X^{(j-1)}, \cdot)$ for $j \geq 1$, and each move is accepted with probability $\alpha^+(X^{(j-1)}, X^{(j)})$, the chain terminating at the first rejected move. The stopping time for this chain can be written using the acceptance variables $A^{(j)} \in \{0, 1\}$, where $A^{(j)} \sim \alpha^+(X^{(j-1)}, X^{(j)})$,[1] which gives $M := 1 + \sum_{j \geq 1} \prod_{l=1}^{j} A^{(l)}$. The negative chain is denoted using indices $j \leq -1$, where we similarly define $X^{(j)} \sim k^-(X^{(j+1)}, \cdot)$, $A^{(j)} \sim \alpha^-(X^{(j+1)}, X^{(j)})$, and stopping time $N := 1 + \sum_{j \leq -1} \prod_{l=j}^{-1} A^{(l)}$. The full algorithm is outlined in Algorithm 4 and the graphical model representing the dependencies between these variables is shown in Fig. 3.2 below.

Figure 3.2: Graphical model outlining the dependencies between the sampled variables; here the positive chain $(X^{(1)}, \dots, X^{(M)})$ is shown on the right of $X^{(0)}$ while the negative chain $(X^{(-1)}, \dots, X^{(-N)})$ is shown on the left. Any variables corresponding to indices greater than $M$ or less than $-N$ are not sampled by the algorithm.

After $n$ independent trajectories $\{(X_i^{(-N_i)}, \dots, X_i^{(M_i)})\}_{1 \leq i \leq n}$ have been collected, $\mu$ and $\zeta$ may be approximated with the estimators

$\hat\mu_n^{AMCS} := \frac{1}{n} \sum_{i=1}^{n} \frac{ \sum_{j=1-N_i}^{M_i-1} h(X_i^{(j)})\, \pi(X_i^{(j)}) }{ (M_i + N_i - 1)\, \pi_0(X_i^{(0)}) }$,   (3.1)

$\hat\zeta_n^{AMCS} := \frac{1}{n} \sum_{i=1}^{n} \frac{ \sum_{j=1-N_i}^{M_i-1} \hat\pi(X_i^{(j)}) }{ (M_i + N_i - 1)\, \pi_0(X_i^{(0)}) }$.   (3.2)

Note that the two endpoints $X^{(M)}$ and $X^{(-N)}$ are not used in these estimators; we refer to all the other points in a trajectory, $(X^{(1-N)}, \dots, X^{(M-1)})$, as the accepted points.

[1] Here $A^{(j)} \sim \alpha^+(X^{(j-1)}, X^{(j)})$ implies $P(A^{(j)} = 1) = \alpha^+(X^{(j-1)}, X^{(j)})$.

Algorithm 4 AMCS Procedure
1: for $i \in \{1, \dots, n\}$
2:   Sample $X_i^{(0)} \sim \pi_0(\cdot)$;
3:   for $j = 1, 2, \dots$
4:     Sample $X_i^{(j)} \sim k^+(X_i^{(j-1)}, \cdot)$;
5:     Sample $A_i^{(j)} \sim \alpha^+(X_i^{(j-1)}, X_i^{(j)})$;
6:     If ($A_i^{(j)} = 0$) break loop and set $M_i = j$;
7:   end for
8:   for $j = -1, -2, \dots$
9:     Sample $X_i^{(j)} \sim k^-(X_i^{(j+1)}, \cdot)$;
10:    Sample $A_i^{(j)} \sim \alpha^-(X_i^{(j+1)}, X_i^{(j)})$;
11:    If ($A_i^{(j)} = 0$) break loop and set $N_i = -j$;
12:  end for
13: end for
14: return estimate from Eq. (3.1) or Eq. (3.2).

For instance, in the case that the first move in both directions is rejected, there is only one accepted point: $X^{(0)}$. This generic formulation is not particularly useful without some clues for how to choose $k^{+/-}$ and $\alpha^{+/-}$. Before providing explicit parameterizations we first establish the unbiasedness of these estimators and analyze what influence $k^{+/-}$ and $\alpha^{+/-}$ may have on the variance of the AMCS estimators.
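To summarize the mechanics of Algorithm 4, the following Python sketch (ours, not the thesis's) simulates one trajectory and returns the corresponding term of Eq. (3.1); k_plus/k_minus sample the kernels, and alpha_plus/alpha_minus return acceptance probabilities:

    import numpy as np

    def amcs_term(sample_pi0, pi0, pi, h, k_plus, k_minus,
                  alpha_plus, alpha_minus, rng):
        # one independent run: one term of the empirical average in Eq. (3.1)
        x0 = sample_pi0()
        accepted = [x0]
        for kernel, alpha in ((k_plus, alpha_plus), (k_minus, alpha_minus)):
            x = x0
            while True:
                x_new = kernel(x, rng)
                if rng.random() >= alpha(x, x_new):
                    break  # A = 0: stop; the endpoint x_new is discarded
                accepted.append(x_new)
                x = x_new
        total = sum(h(y) * pi(y) for y in accepted)
        # len(accepted) = M + N - 1, matching the denominator of Eq. (3.1)
        return total / (len(accepted) * pi0(x0))

    # mu_hat is then the mean of amcs_term(...) over n independent runs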

3.2 Unbiasedness

In order to analyze the AMCS estimators given by Eq. (3.1) and Eq. (3.2), we will introduce an additional random index $J \sim \mathrm{Uniform}(\{1-N, \dots, M-1\})$, which will be used to select a single point out of the sampled trajectory uniformly at random. Specifically, we define $X^{(J)} := \sum_{j=1-N}^{M-1} I\{J = j\} X^{(j)}$. This random index is useful in establishing the unbiasedness of the AMCS estimator; in particular, we will show that $E\left[ \frac{h(X^{(J)})\, \pi(X^{(J)})}{\pi_0(X^{(0)})} \right] = E[\hat\mu_{AMCS}]$ and that $E\left[ \frac{h(X^{(J)})\, \pi(X^{(J)})}{\pi_0(X^{(0)})} \right] = \mu$. We begin by introducing some lemmas which will aid in proving the latter expression; in what follows we will make use of Assumption 3.1 below. In practice this assumption can be satisfied in a number of ways, for example by forcing $\alpha^{+/-}$ to terminate the chain at any step with a fixed probability.

Assumption 3.1. The Markov chains generated by $k^{+/-}$ and $\alpha^{+/-}$ are assumed to terminate eventually; i.e. $M < \infty$ and $N < \infty$ almost surely.

In order to prove that $E\left[ \frac{h(X^{(J)})\, \pi(X^{(J)})}{\pi_0(X^{(0)})} \right] = \mu$, we will show that the procedure for sampling $X^{(J)}$ is identical to sampling from a symmetric Markov kernel $k$, that is a kernel where $k(x, x') = k(x', x)$, and that any such procedure results in an unbiased estimator, as described in the following lemma.

Lemma 3.1. Suppose $Y \sim \pi(\cdot)$, $X \sim \pi_0(\cdot)$, and $X' \sim k(X, \cdot)$ for a symmetric Markov kernel $k$, that is, $k(x, x') = k(x', x)$. Provided that $\mathrm{supp}(\pi) \subseteq \mathrm{supp}(k * \pi_0)$, it follows that

$E\left[ \frac{h(X')\, \pi(X')}{\pi_0(X)} \right] = E[h(Y)] = \mu$.

(The statement follows from Fubini's theorem; the full proof is given in Appendix A.1.)

In order to show that $X^{(J)}$ is indeed sampled according to a symmetric Markov kernel, we express the conditional density of $(X^{(-N)}, \dots, X^{(M)})$ given $X^{(0)}$ in terms of $k^{+/-}$ and $\alpha^{+/-}$, that is,

$\gamma(x^{(-n)}, \dots, x^{(m)}, n, m \mid x^{(0)}) := (1 - \alpha^+(x^{(m-1)}, x^{(m)}))\, k^+(x^{(m-1)}, x^{(m)}) \prod_{j=1}^{m-1} \alpha^+(x^{(j-1)}, x^{(j)})\, k^+(x^{(j-1)}, x^{(j)}) \times (1 - \alpha^-(x^{(1-n)}, x^{(-n)}))\, k^-(x^{(1-n)}, x^{(-n)}) \prod_{j=1-n}^{-1} \alpha^-(x^{(j+1)}, x^{(j)})\, k^-(x^{(j+1)}, x^{(j)})$.   (3.3)

Here we are able to integrate out the acceptance variables $a^{(j)}$, since the conditional density is zero whenever we have an invalid configuration, i.e. $a^{(-n)} \neq 0$, $a^{(m)} \neq 0$, or $a^{(i)} \neq 1$ (for $1-n \leq i \leq m-1$). We now observe that this is a valid conditional density and that the density of a given set of points evaluates to the same value regardless of which of these points is chosen as the starting point $x^{(0)}$, as formalized in the following lemma.

Lemma 3.2. Given jointly symmetric $(k^+, \alpha^+)$ and $(k^-, \alpha^-)$, and sequences $(x^{(-n)}, \dots, x^{(m)})$ and $(x^{(-n')}, \dots, x^{(m')})$, the density $\gamma$ defined in Eq. (3.3) satisfies the following:

(i) $\sum_{n,m \geq 1} \int \gamma(x^{(-n)}, \dots, x^{(m)}, n, m \mid x^{(0)})\, dx^{(-n:m)}_{-0} = 1$;

(ii) $\gamma(x^{(-n)}, \dots, x^{(m)}, n, m \mid x^{(0)}) = \gamma(x^{(-n')}, \dots, x^{(m')}, n', m' \mid x^{(0')})$ whenever $x^{(-n)} = x^{(-n')}$, $x^{(1-n)} = x^{(1-n')}$, ..., $x^{(m)} = x^{(m')}$, and $m + n = m' + n'$, where $m, n, m', n' \geq 1$;

here $x^{(-n:m)}_{-0}$ denotes all variables $x^{(-n:m)}$ but $x^{(0)}$.

(Here (i) follows from Assumption 3.1 and (ii) from the definition of joint symmetry; the full proof is given in Appendix A.2.)

One remaining detail is to express the conditional density of only a single point ($X^{(J)}$) given $X^{(0)}$. For this we will make use of the following lemma:

Lemma 3.3. Given random variables $(X_0, \dots, X_L, L)$, where $X_i \in \mathbb{R}^d$, $L \in \mathbb{N}$, and $L \geq 2$, distributed according to some joint density $\gamma$, and a random variable $J \sim \mathrm{Uniform}(\{1, \dots, L-1\})$, the variable $X_J = \sum_{j=0}^{L} I\{J = j\} X_j$ has p.d.f.

$p(x) = \sum_{l \geq 2} \frac{1}{l-1} \sum_{j=1}^{l-1} \gamma_j(x, l)$,

where $\gamma_j$ is the $j$-th marginal density of $\gamma$, given by $\gamma_j(x_j, l) := \int \gamma(x_1, \dots, x_l, l)\, dx_{-j}$, the integral being taken over all variables $x_{1:l}$ except $x_j$. (Proof given in Appendix A.3.)

With these components we are now able to state the first main result of this section, which establishes the symmetry of the conditional density of $X^{(J)}$ given $X^{(0)}$. Before doing so, however, we require some additional notation. Specifically, by $x(n, m, x, j)$, for $n, m \geq 1$ and $-n \leq j \leq m$, we denote the trajectory $(x^{(-n)}, \dots, x^{(-1)}, x^{(1)}, \dots, x^{(m)})$ where $x^{(j)} = x$. Using Lemma 3.3, the conditional density $\gamma(x' \mid x)$ may be written as

$\gamma(x' \mid x) = \sum_{l \geq 2} \frac{1}{l-1} \sum_{m=1}^{l-1} \sum_{j=m-l+1}^{m-1} \int \gamma(x(l-m, m, x', j), l-m, m \mid x^{(0)} = x)\, dx^{(m-l:m)}_{-\{j,0\}}$,   (3.4)

where $\gamma(x(l-m, m, x', 0), l-m, m \mid x^{(0)} = x)$ is defined to be zero whenever $x' \neq x$. Thanks ultimately to the symmetry properties of $\gamma(x(l-m, m, x', j), l-m, m \mid x^{(0)} = x)$, this conditional density also satisfies a critical symmetry property, formalized in the following lemma.

Lemma 3.4. Provided the density function $\gamma$ defined in Eq. (3.3) satisfies the conditions in Lemma 3.2, the conditional density $\gamma(x' \mid x)$ defined in Eq. (3.4) satisfies $\gamma(x' \mid x) = \gamma(x \mid x')$.

(The lemma follows from reordering the sums in $\gamma$ and deploying Lemma 3.2; the full proof is given in Appendix A.4.)

Using these three lemmas we now formally establish the unbiasedness of AMCS in the following theorem.

Theorem 3.1. Provided the transition kernels and acceptance functions satisfy the conditions of Lemma 3.2 and Lemma 3.4, for any $n > 0$ the AMCS procedure achieves

$E[\hat\mu_n^{AMCS}] = \mu$,   $E[\hat\zeta_n^{AMCS}] = \zeta$.

Proof. Because $\hat\mu_n^{AMCS}$ is an empirical average of independent samples, we need only show that each individual sample has the desired expectation. Recalling that $X^{(J)}$ is a single point sampled uniformly at random from a given trajectory $(X^{(-N)}, \dots, X^{(M)})$, we first observe that

$E\left[ \frac{h(X^{(J)})\, \pi(X^{(J)})}{\pi_0(X^{(0)})} \right] = E\left[ E\left[ \frac{h(X^{(J)})\, \pi(X^{(J)})}{\pi_0(X^{(0)})} \,\middle|\, (X^{(-N)}, \dots, X^{(M)}) \right] \right] = E\left[ \frac{\sum_{j=1-N}^{M-1} h(X^{(j)})\, \pi(X^{(j)})}{(M + N - 1)\, \pi_0(X^{(0)})} \right] = E[\hat\mu_{AMCS}]$.

It remains only to show that $E\left[ \frac{h(X^{(J)})\, \pi(X^{(J)})}{\pi_0(X^{(0)})} \right] = \mu$. By defining our Markov kernel as $k(x, x') = \gamma(x' \mid x)$ from Eq. (3.4), where $X^{(J)} \sim k(X^{(0)}, \cdot)$, it follows from Lemma 3.4 that $k(x, x') = k(x', x)$. As a result of this symmetry we may invoke Lemma 3.1, which gives the desired result. Noting that identical steps hold for the estimator $\hat\zeta_n^{AMCS}$ as well concludes the proof.

3.3 Variance Analysis

Since the AMCS estimator is unbiased for any choice of jointly symmetric $k^{+/-}$ and $\alpha^{+/-}$, we now consider how these choices affect the MSE of the estimator. In the following developments we make use of the uniformly distributed index $J$ as defined in the previous section. In particular, we observe

$v_{AMCS} := V\left( \frac{\sum_{j=1-N}^{M-1} h(X^{(j)})\, \pi(X^{(j)})}{(M + N - 1)\, \pi_0(X^{(0)})} \right) = V\left( E\left[ \frac{h(X^{(J)})\, \pi(X^{(J)})}{\pi_0(X^{(0)})} \,\middle|\, X^{(-N)}, \dots, X^{(M)} \right] \right)$,

where the inner expectation is taken w.r.t. $J$. We are interested in the discrepancy between the above variance expression and that of vanilla importance sampling, given by $v_{IS} := V\left( \frac{h(X)\, \pi(X)}{\pi_0(X)} \right)$ for $X \sim \pi_0(\cdot)$. To relate these quantities we will first make the simplifying assumption that $\pi_0$ is uniform, so that the effects of the supplied proposal are negligible.[2]

[2] In practice $\pi_0$ need not be uniform, only such that the density does not change significantly across any given trajectory.

Using this assumption and letting $k(x, x') = \gamma(x' \mid x)$, we observe

$V\left( \frac{h(X^{(J)})\, \pi(X^{(J)})}{\pi_0(X^{(0)})} \right) = \int \int \left( \frac{h(x')\, \pi(x')}{\pi_0(x)} \right)^2 k(x, x')\, \pi_0(x)\, dx'\, dx - \mu^2 = \int \left( \frac{h(x')\, \pi(x')}{\pi_0(x')} \right)^2 \pi_0(x')\, dx' \int k(x', x)\, dx - \mu^2 = V\left( \frac{h(X)\, \pi(X)}{\pi_0(X)} \right) = v_{IS}$,

where we have used the symmetry properties of $k$ (Lemma 3.4). This expression essentially states that if one were to actually use a uniformly drawn sample from each trajectory to estimate $\mu$ (as opposed to the average over all the points in a trajectory), the variance of the resulting estimator would be equal to $v_{IS}$. Next, using the law of total variance, we see that

$v_{IS} = V\left( \frac{h(X^{(J)})\, \pi(X^{(J)})}{\pi_0(X^{(0)})} \right) = E\left[ V\left( \frac{h(X^{(J)})\, \pi(X^{(J)})}{\pi_0(X^{(0)})} \,\middle|\, (X^{(-N)}, \dots, X^{(M)}) \right) \right] + V\left( E\left[ \frac{h(X^{(J)})\, \pi(X^{(J)})}{\pi_0(X^{(0)})} \,\middle|\, (X^{(-N)}, \dots, X^{(M)}) \right] \right) = E\left[ V\left( \frac{h(X^{(J)})\, \pi(X^{(J)})}{\pi_0(X^{(0)})} \,\middle|\, (X^{(-N)}, \dots, X^{(M)}) \right) \right] + v_{AMCS}$.

From this expression we derive the key result of this section, which we refer to as the variance capture identity:

$v_{AMCS} = v_{IS} - E\left[ V\left( \frac{h(X^{(J)})\, \pi(X^{(J)})}{\pi_0(X^{(0)})} \,\middle|\, X^{(-N)}, \dots, X^{(M)} \right) \right]$.   (3.5)

This identity shows that the variance of the AMCS estimator cannot be higher than that of the vanilla importance sampling estimator given the same number of samples ($n$), i.e. ignoring the additional computational costs of AMCS. Additionally, the reduction in variance is determined entirely by the expected variance of the points inside a given trajectory under the uniform distribution. This observation motivates the use of antithetic Markov chains, in which the transition kernels $k^{+/-}$ are configured to head in opposite directions in the hope of capturing more variability.

However, in order to get a fair understanding of the tradeoffs between AMCS and other approaches, it is important to consider the increased computational costs incurred by simulating the Markov chains.

If we consider any Monte Carlo estimator which takes the empirical average of an arbitrary sequence of i.i.d. random variables, say $X_1, \dots, X_n$, then we know its variance is given by $V(X)/n$. If we also assume that this sampling procedure has a stochastic cost associated with each sample, denoted by the random variables $D_1, \dots, D_n$, where $\delta := E[D]$ and $D \perp X$, then by fixing a computational budget $C \gg \delta$, standard arguments for renewal reward processes indicate the estimator will have a variance of approximately $\frac{V(X)}{C/\delta} = \frac{\delta V(X)}{C}$. Said simply, if technique A requires, on average, a factor of $\delta$ more computation per sample than technique B, then it must reduce the variance by a factor of at least $1/\delta$ to be competitive. Plugging this formula into the identity in Eq. (3.5), we can approximate that AMCS will offer a variance reduction whenever

$E\left[ V\left( \frac{h(X^{(J)})\, \pi(X^{(J)})}{\pi_0(X^{(0)})} \,\middle|\, (X^{(-N)}, \dots, X^{(M)}) \right) \right] > \frac{\delta - 1}{\delta} v_{IS}$,

where $\delta = E[M + N + 1]$ gives the expected computational cost measured in terms of evaluations of the functions $\pi$ and $h$. We say this value is approximate since $M$ and $N$ are not technically independent of $(X^{(-N)}, \dots, X^{(M)})$, but for large sample sizes the effects of this dependence will be negligible. That is, if the AMCS sampler requires 10 function evaluations per sample, it will need to capture 90% of the variance of $v_{IS}$ inside each trajectory, on average. It is clear from this expression that the potential for savings drops off quickly as the per-sample computational costs increase. Interestingly, the above steps can be used to analyze a Monte Carlo estimator using antithetic variates as well, i.e. when $\delta = 2$, and they also extend to the multiple-variable setting. Also, an alternative analysis (for deterministic $N$, $M$) shows that the method offers a variance reduction whenever $\sum_{i \neq j} \mathrm{Cov}(Z^{(i)}, Z^{(j)}) < 0$. In the next section we explore explicit parameterizations for the Markov transitions and stopping rules that can be defined to capture variability while simultaneously keeping these costs in check.
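The budget-adjusted variance approximation is easy to check numerically; the following short simulation (ours, purely illustrative) draws samples with random costs under a fixed budget and compares the empirical variance of the estimator to $\delta V(X)/C$:

    import numpy as np

    rng = np.random.default_rng(0)
    C, delta, trials = 10_000.0, 5.0, 2000
    ests = []
    for _ in range(trials):
        total, vals = 0.0, []
        while True:
            d = rng.exponential(delta)   # random per-sample cost, independent of X
            if total + d > C:
                break
            total += d
            vals.append(rng.normal())    # X with V(X) = 1
        ests.append(np.mean(vals))
    print(np.var(ests), delta * 1.0 / C)  # both should be close to 5e-4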

3.4 Parameterization

In order to formulate concrete parameterizations for AMCS, it is first necessary to set out what properties of the integrand we are hoping to exploit. In this section, keeping with the original motivations for AMCS, we consider parameterizations that are targeted toward multi-modal and peaked integrands. In these settings, perhaps the most useful observation that can be made is that if the integrand is very near zero at a given point, it is not likely to take large values in its immediate vicinity. As a result, simulating Markov chains in these areas is not likely to be worthwhile in terms of overall variance reduction. Conversely, if the integrand has some magnitude, it has a much higher chance of being near a local mode. This observation motivates the threshold acceptance function

$\alpha^{+/-}(x, x') = \begin{cases} 1, & \text{if } f(x) > \varepsilon \text{ and } f(x') > \varepsilon \\ 0, & \text{otherwise}, \end{cases}$   (3.6)

where $\varepsilon > 0$ and $f(x)$ is some function of interest, likely the integrand ($h(x)\pi(x)$). An important aspect of this acceptance function is that if the first point sampled from $\pi_0$ is below threshold, the AMCS procedure can return immediately without evaluating the integrand at any neighboring points, therefore incurring no additional computational cost. That is to say, the sampler will perform exactly as well as a vanilla importance sampler in these regions, and any variance reductions will therefore depend entirely on its behaviour in regions where the integrand is above this threshold.

In regards to transition kernels, a natural first choice is the linear Markov kernel with densities

$k^+(x, \cdot) = \mathcal{N}(x + v, \sigma^2 I)$,   $k^-(x, \cdot) = \mathcal{N}(x - v, \sigma^2 I)$,

where $\mathcal{N}(\mu, \Sigma)$ denotes a multivariate normal distribution with mean $\mu$ and covariance $\Sigma$, $v \in \mathbb{R}^d$ is some fixed vector, and $\sigma^2$ a fixed variance parameter. In general these transitions should be parameterized so that $\sigma^2 \ll \|v\|^2$, so that the resulting Markov chain will make consistent progress in one direction and typically experience more variability than, say, a normal random walk given the same number of steps. For continuously differentiable integrands we can use the gradient to set the direction vector automatically, giving rise to the Langevin Markov kernels

$k^+(x, \cdot) = \mathcal{N}(x + \varepsilon \nabla f(x), \sigma^2 I)$,   $k^-(x, \cdot) = \mathcal{N}(x - \varepsilon \nabla f(x), \sigma^2 I)$,   (3.7)

where $\varepsilon > 0$ is a fixed step-size parameter. Since the gradient points in the direction of steepest ascent, this choice seems ideal for capturing variability within a trajectory. Interestingly, this Markov kernel has been used extensively in the MCMC literature and, in fact, can be shown to produce a Markov chain with stationary distribution $f$ when $\sigma^2 = 2\varepsilon$ and $\varepsilon \to 0$ (Neal, 2011). However, when used in AMCS these moves can (and should) be parameterized so that $\varepsilon \gg \sigma^2$; as a result, the Markov chains look more like those generated by a steepest descent optimization algorithm than an MCMC diffusion.

One important concern with the Langevin kernel is that for nonlinear functions the transitions are not exactly symmetric. While this issue can be partially addressed by ensuring the gradient vector is normalized to length 1, exact joint symmetry (Definition 3.1) is best attained through the use of the symmetrizing acceptance functions

$\alpha^+(x, x') = \min\left( \frac{k^-(x', x)}{k^+(x, x')}, 1 \right)$,   $\alpha^-(x', x) = \min\left( \frac{k^+(x, x')}{k^-(x', x)}, 1 \right)$.

Note that multiple acceptance functions can be combined into a single function by taking their product. Lastly, when taking gradient steps in either direction one can expect to eventually settle into a local mode or plateau. In these regions continuing the chain will not capture any additional variation, and it is therefore beneficial to terminate it. This can be accomplished through the use of the monotonic acceptance functions

$\alpha^+(x, x') = \begin{cases} 1, & \text{if } f(x) + \varepsilon < f(x') \\ 0, & \text{otherwise}, \end{cases}$   $\alpha^-(x, x') = \begin{cases} 1, & \text{if } f(x) - \varepsilon > f(x') \\ 0, & \text{otherwise}, \end{cases}$

where $\varepsilon \geq 0$ is some fixed threshold. These acceptance functions ensure that the chains make monotonic progress in either direction.
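Putting the pieces of this section together, the following Python sketch (ours; it assumes f and its gradient grad_f are supplied, and uses SciPy only for the Gaussian density) implements the Langevin kernels with the threshold, monotonic, and symmetrizing acceptance functions:

    import numpy as np
    from scipy.stats import multivariate_normal

    def langevin_kernels(grad_f, eps, sigma2):
        # k^+ / k^- step along +/- the gradient with isotropic Gaussian noise
        def k_plus(x, rng):
            return rng.normal(x + eps * grad_f(x), np.sqrt(sigma2))
        def k_minus(x, rng):
            return rng.normal(x - eps * grad_f(x), np.sqrt(sigma2))
        def dens_plus(x, x_new):   # density of k^+(x, x_new)
            return multivariate_normal.pdf(x_new, x + eps * grad_f(x),
                                           sigma2 * np.eye(len(x)))
        def dens_minus(x, x_new):  # density of k^-(x, x_new)
            return multivariate_normal.pdf(x_new, x - eps * grad_f(x),
                                           sigma2 * np.eye(len(x)))
        return k_plus, k_minus, dens_plus, dens_minus

    def threshold_alpha(f, thresh):
        # Eq. (3.6): keep extending the chain only while both points are above threshold
        return lambda x, x_new: 1.0 if (f(x) > thresh and f(x_new) > thresh) else 0.0

    def monotonic_alpha(f, sign, eps=0.0):
        # sign = +1 for the positive chain, -1 for the negative chain
        return lambda x, x_new: 1.0 if sign * (f(x_new) - f(x)) > eps else 0.0

    def symmetrizing_alpha_plus(dens_plus, dens_minus):
        # alpha^+(x, x') = min(k^-(x', x) / k^+(x, x'), 1); the alpha^- analogue
        # swaps the two densities, restoring joint symmetry (Definition 3.1)
        return lambda x, x_new: min(dens_minus(x_new, x) / dens_plus(x, x_new), 1.0)

As noted above, the product of several acceptance functions is again a valid acceptance function, so these rules can be combined freely.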

3.5 Experimental Evaluation

We now consider a number of empirical tests designed to evaluate our previous claims that the AMCS approach can reduce the statistical error of a vanilla importance sampling approach. Additionally, we aim to uncover whether these reductions are comparable to those offered by alternative state-of-the-art Monte Carlo approaches. Specifically, we contrast the behaviour of an AMCS sampler with that of vanilla importance sampling (IS), annealed importance sampling (AIS), and greedy importance sampling (GIS) (Southey et al., 2002). The previously unmentioned GIS approach is somewhat similar to the AMCS approach in terms of underlying mechanics, since it uses a sequence of deterministic, axis-aligned, steepest ascent moves to augment a fixed proposal. In fact, the AMCS approach ultimately spawned from earlier attempts to extend the GIS approach to exploit continuous gradient information, as opposed to expensive finite difference approximations. Interestingly, in the course of this work we observed that the AMCS acceptance functions could also be used with the GIS approach with minimal effort. Indeed, the threshold acceptance function resulted in considerable improvements in early testing. As a result, we added this improved GIS approach, dubbed GIS-A, to our suite of comparison approaches.

In contrasting these different approaches, careful consideration of the additional computational effort per sample is necessary. To account for these additional costs we measured the expected number of integrand evaluations per sample (denoted $\delta_M$ for method M) and compared methods using the cost-adjusted variance, defined as $\tilde v_M := \delta_M\, n\, V(\hat\mu_n^M)$. Additionally, to ensure a meaningful comparison across experiments, we normalized this value by taking its ratio against the variance of the vanilla importance sampling approach, to give the relative cost-adjusted variance $\tilde v_M / \tilde v_{IS}$. Here, a value of 0.5 indicates a 2x reduction in the number of integrand evaluations needed to attain the same error as an importance sampling estimator.[3] For our comparisons we considered three different problem scenarios: first, a synthetic problem having numerous modes and taking on negative values; second, a Bayesian k-mixture posterior; and finally, a Bayesian posterior for a robot localization task.

3.5.1 Sin Function

We first consider a synthetic integration problem defined using the d-dimensional function $h(x) = \prod_{i=1}^{d} \sin(x_i)^{999}$ and density $\pi(x) = U(x; 0, 3\pi)$, where $U(x; a, b)$ denotes the multivariate uniform density on the interval $(a, b)^d$. At first glance it may seem that the exponent in $h$ is a rather extreme choice; however, to give some perspective, we note that a mode of this function is roughly the same size and shape as a normal density with $\sigma \approx 0.03$. Unsurprisingly, this integral poses challenges for numerical integration methods due to both its peaked landscape and its large number ($3^d$) of separated modes. However, as it turns out, the most challenging aspect of this problem is the fact that the integrand takes on both positive and negative values. As a direct consequence, the most effective SMCS approaches, where one targets a sampler toward the un-normalized distribution $h\pi$, cannot be deployed; and since $\pi$ is a simple uniform distribution, it is obviously not necessary, or worthwhile, to target a complex sampler toward this distribution. One alternative approach for such problems is the so-called harmonic mean (HM) method, which involves first simulating a sequence of points $X^{(1)}, \dots, X^{(n)}$ from $|h|\pi$ using an MCMC approach, then approximating $I$ using the weighted importance sampling

[3] Note that in our analysis we do not apply additional costs for gradient evaluations since, in most settings, computations of $h(x)$ and $\nabla h(x)$ typically share the same sub-computations, which can be cached and reused. Gradients with respect to mini-batches may also be used.

estimator (2.3) (see Newton and Raftery (1994)). The resulting estimator is consistent but often has extremely high (or infinite) variance, even when the MCMC chain is able to mix efficiently. Despite extensive experimentation we were not able to find a parameter setting for which the HM estimator could produce even remotely accurate estimates for $d > 2$. Ultimately, there are very few Monte Carlo approaches that can offer a fair comparison on integrands having both positive and negative values and, as a result, we consider AMCS alongside the GIS and IS approaches only.

For these comparisons the AMCS approach was parameterized with a proposal density $\pi_0 = \pi$, Langevin Markov kernels with $f = h$, step-size $\varepsilon = 0.03$ and $\sigma^2 = 3\mathrm{E}{-5}$, and symmetrizing and monotonic acceptance functions (with threshold $\varepsilon = 0$). Additionally, we used the threshold acceptance functions with the squared 2-norm of the gradient vector, that is $f = \|\nabla h\|^2$, and a threshold value of $\varepsilon = 1\mathrm{E}{-15}$. For the GIS approach we used the same step-size ($\varepsilon = 0.03$) and, for the GIS-A variant, the same acceptance function.

Figure 3.3: Cost-adjusted variance (log scale) for the various methods as the dimension (d) increases, on the $\sin(x)^{999}$ function. GIS refers to the original greedy importance sampling approach and GIS-A the extended version using the threshold acceptance function.

The relative performance of each of these methods, as the dimensionality of the problem is increased, is plotted in Fig. 3.3. Again, we consider the cost-adjusted variance relative to the vanilla importance sampling variance, which uses direct Monte Carlo sampling from $\pi$ (calculated analytically). The points plotted under the IS heading are those computed through simulation; they confirm the accuracy of the calculation and give some indication of the statistical error of this simulation. As mentioned previously, the simulations for the harmonic mean method were not informative as they were either off the chart or too noisy to permit empirical evaluation. As for the GIS method, the results clearly indicate that incorporating the threshold acceptance function (GIS-A) can lead to a significant increase in performance and, in this example, yields up to a 150-fold improvement over the original method. This is a welcome improvement since the original method performed

much worse than even simple importance sampling, especially as the dimensionality of the integrand was increased. Ultimately, however, the AMCS approach is clearly the most effective approach for this task, as it consistently outperformed all other approaches, offering up to an 8x improvement over direct sampling. Moreover, the relative performance of the AMCS approach seemed to improve as the dimensionality of the integrand was increased.

3.5.2 Bayesian k-Mixture Model

In our next experiment we consider the task of approximating the normalization constant ($\zeta$), or model evidence, for a Bayesian k-mixture model. Specifically, we define a generative model with $k$ uniformly weighted multivariate normal distributions in $\mathbb{R}^d$ with fixed diagonal covariance matrices $\Sigma_i = 20 I$ for $i \in \{1, \dots, k\}$. The unobserved latent variables for this model are the means of each component, $\mu_i \in \mathbb{R}^d$, which are assumed to be drawn from a multivariate normal prior with mean zero and identity covariance. Given $n$ samples, $y_1, \dots, y_n$, from the true underlying model, the model evidence is defined as the integral of the un-normalized posterior

$\zeta = \int \prod_{i=1}^{n} L(\mu_1, \dots, \mu_k \mid y_i)\, p(\mu_1, \dots, \mu_k)\, d\mu_1 \dots d\mu_k$,

where the likelihood function is given by $L(\mu_1, \dots, \mu_k \mid y_i) = \sum_{j=1}^{k} \frac{1}{k} \mathcal{N}(y_i; \mu_j, \Sigma_j)$ and the prior density is the standard normal $p(\mu_1, \dots, \mu_k) = \mathcal{N}([\mu_1, \dots, \mu_k]; 0, I)$, where $[\mu_1, \dots, \mu_k]$ denotes a dk-dimensional vector of stacked $\mu_i$ vectors. To use the same notation as in previous sections, we may write $\hat\pi(x) = \prod_{i=1}^{n} L(x \mid y_i)\, p(x)$, where $x = [\mu_1, \dots, \mu_k]$.

For this task standard SMCS approaches can be applied in a straightforward manner, though they do require some preliminary parameter tuning. After some experimentation we found that the AIS approach performed well with 150 annealing distributions set using the "power of 4" heuristic suggested by Kuss and Rasmussen (2005), i.e. $\beta_i = ((150 - i)/150)^4$. Each annealing stage used 3 MCMC transitions; here we evaluate both slice sampling (Neal, 2003) and Hamiltonian transitions (Neal, 2011). The Hamiltonian moves were tuned to achieve an accept/reject rate of about 80%, which resulted in a step-size parameter of … and 5 leapfrog steps. Additionally, for AIS and the remaining methods we use the prior as the proposal density, $\pi_0 = p$, and the posterior as the target. For AMCS we used Langevin local moves with monotonic, symmetrizing, and threshold acceptance functions. For the Langevin moves we used a step-size parameter $\varepsilon = …$ and $\sigma^2 = 3\mathrm{E}{-5}$ which, again, were set through some manual tuning. The threshold acceptance functions were configured using a preliminary sampling approach.

In particular, letting $f = \hat\pi$, we set the threshold parameter to a value that accepted roughly 1.5% of the data points on a small sub-sample (2000 points). These points were not used in the final estimate, though in practice they can be incorporated without adverse effects. For the GIS approach we used step-size parameter $\varepsilon = 0.015$; we also experimented with the modified version (GIS-A), which used the same threshold acceptance function as the AMCS approach.

The results for this problem are shown in Fig. 3.4 as the number of training points ($n$), the dimensionality of these points ($d$), and the number of mixture components ($k$) are altered. For each of these different settings the parameters of the sampling approaches remain fixed. Simulations were run for a period of 8 hours for each method and each setting of $d$, $n$, and $k$, giving a total running time of 106 CPU days on a cluster with 2.66GHz processors. However, even in this time many of the methods were not able to return a meaningful estimate after execution; these results are therefore omitted from the figure.

Looking at the relative variances plotted in Fig. 3.4, it is immediately clear from these simulations that GIS (both variants) and AIS with Hamiltonian moves (AIS-HAM) are simply not effective for this task, as they perform several orders of magnitude worse than even vanilla IS. The AIS approach with slice sampling moves (AIS-SS) and the AMCS approach, however, exhibit much more interesting behaviour. In particular, the experiments indicate that AIS-SS can offer tremendous savings over vanilla IS (and sometimes AMCS) for higher-dimensional problems and problems with more training samples. However, this advantage seems to come at a price, as the method performed up to 10-20x worse than vanilla IS in other cases, essentially where the posterior distribution was not so challenging for IS. AMCS, on the other hand, was considerably more robust to changes in the target, since in each setting it performed at least as well as vanilla importance sampling. Additionally, in the more challenging settings it offered a considerable advantage over IS. In summary, depending on the problem at hand, and the practitioner's appetite for risk, the most appropriate approach for this particular problem is either AMCS or AIS-SS. In many cases, however, the practitioner may be interested in a large set of potential problem settings where it is not possible to determine which method, and which parameter settings, are most appropriate for each case. In such scenarios it may be worthwhile to consider an adaptive algorithm to select from a set of fixed approaches automatically. In the remaining chapters of this thesis we consider adaptive strategies of precisely this form.

Figure 3.4: Cost-adjusted variance (log scale) for the different approaches on the Bayesian k-mixture task, across settings of the number of components ($k$) and dimensions ($d$). Missing data points are due to the fact that trials where the final estimate (empirical mean) is incorrect by a factor of 2 or greater are automatically removed. From left to right the three plots indicate performance on the same problem but with an increasing number of observed training samples: 15, 35, and 70 respectively.

3.5.3 Problem 3: Robot Localization

Our final simulations are centered around approximating the normalization constant of a Bayesian posterior for the (simulated) kidnapped robot problem (Thrun et al., 2005). In this setting an autonomous robot is placed at an unknown location and must recover its position using relative sensors, such as a laser range finder, and a known map. This posterior distribution is notoriously difficult to work with when the sensors are highly accurate, since this produces a highly peaked distribution; a phenomenon referred to as the curse of accurate sensors. Here, we assume the prior distribution over the robot's (x, y) position and orientation, denoted $x \in \mathbb{R}^3$, is a uniform distribution. In our simulations the robot's observations are akin to those produced by a laser range finder returning distance measurements at $n$ positions spaced evenly in a 360° field of view (see Fig. 3.5). The sensor model for each individual sensor, that is, the likelihood of observing measurement $y_i$ given the true ray-traced distance $d_i(x)$ from position $x$, is given by the mixture

$L(y_i \mid d_i(x)) = 0.95\, \mathcal{N}(y_i; d_i(x), \sigma^2) + 0.05\, U(y_i; 0, M)$,

where $\sigma^2 = 4\mathrm{cm}$ and the maximum ray length $M = 25\mathrm{m}$.[4] This sensor model is used commonly in the literature (see Thrun et al. (2005)) and is meant to capture the noise inherent in laser measurements (the normal component) as well as moving obstacles or failed measurements (the uniform component). Given a set of observed measurements $y_1, \dots, y_n$, we then have the un-normalized posterior distribution $\hat\pi(x) = \prod_{i=1}^{n} L(y_i \mid d_i(x))\, p(x)$, where $p$ denotes the density of the uniform prior.

The posterior distribution for a fixed observation and orientation is illustrated in Fig. 3.5: the true robot position and laser measurements on the left, and the log-likelihood on the right; a similar 3d plot was shown earlier in Fig. 3.1. These plots help to illustrate the highly multimodal and peaked landscape, which poses challenges for standard integration approaches. Additionally, the fact that integrand values require an expensive ray-tracing procedure to compute underscores the importance of efficient sampling routines. Also, due to the sharp map edges and the properties of the observation model, the posterior distribution is highly non-continuous and non-differentiable. This prevents the use of gradient-based Markov transitions (for AMCS and AIS) and severely limits the effectiveness of annealing the target density. Again, through some manual tuning we found that a suitable parameterization for AIS was to use 100 annealing distributions, each featuring 3 Metropolis-Hastings MCMC steps

[4] Measurements assume that the map (Fig. 3.5) is 10x10 meters.

Figure 3.5: Left, the map used for the robot simulator with 6 different robot poses and corresponding laser measurements (for $n = 12$). Right, a 2d image where the percentage of blue is proportional to the log-likelihood function using the observations shown at position A; here pixel locations correspond to the robot's (x, y) position while the orientation remains fixed.

with proposal $q(x, x') = \mathcal{N}(x'; x, \sigma^2 I)$ with $\sigma^2 = 4\mathrm{cm}$. For AMCS, we used the prior as the proposal density, linear Markov kernels with $v = [2\mathrm{cm}, 2\mathrm{cm}, 0.2\mathrm{cm}]$ and $\sigma^2 = 2\mathrm{E}{-3}\,\mathrm{cm}$, and a threshold acceptance function with the threshold set to be larger than 4% of points on a 2000-point sub-sample. For GIS, we used the same proposal, step-sizes, and (for GIS-A) the same threshold acceptance function as AMCS.

The relative error rates of the different sampling approaches for 6 different positions (see Fig. 3.5) and 3 different laser configurations, $n = 12, 18, 24$, are shown in Fig. 3.6. Unlike the previous task, the results here are very straightforward and indicate that AMCS consistently offers an 8-10x improvement over vanilla importance sampling. Again, the GIS approach can be significantly improved through the addition of threshold acceptance functions but, even so, it is only marginally better than vanilla IS. Lastly, it is clear from these results that AIS is simply not an effective approach for this task, as it is roughly 10x less efficient than simple IS and 100x less efficient than AMCS. This is primarily due to the fact that the unmodified proposal density has some reasonable chance of landing in or near an integrand peak. Consequently, taking a large number of MCMC transitions is simply not a cost-effective means of improving the proposal, a detail exacerbated by the landscape of the posterior distribution, which inhibits efficient MCMC mixing.

3.6 Discussion

In this chapter we detailed the antithetic Markov chain sampling approach, which can be seen as a unique way to extend the importance sampling (or antithetic variates) approach through


More information

Measures of Spread IQR and Deviation. For exam X, calculate the mean, median and mode. For exam Y, calculate the mean, median and mode.

Measures of Spread IQR and Deviation. For exam X, calculate the mean, median and mode. For exam Y, calculate the mean, median and mode. Part 4 Measures of Spread IQR and Devaton In Part we learned how the three measures of center offer dfferent ways of provdng us wth a sngle representatve value for a data set. However, consder the followng

More information

Multifactor Term Structure Models

Multifactor Term Structure Models 1 Multfactor Term Structure Models A. Lmtatons of One-Factor Models 1. Returns on bonds of all maturtes are perfectly correlated. 2. Term structure (and prces of every other dervatves) are unquely determned

More information

Clearing Notice SIX x-clear Ltd

Clearing Notice SIX x-clear Ltd Clearng Notce SIX x-clear Ltd 1.0 Overvew Changes to margn and default fund model arrangements SIX x-clear ( x-clear ) s closely montorng the CCP envronment n Europe as well as the needs of ts Members.

More information

Creating a zero coupon curve by bootstrapping with cubic splines.

Creating a zero coupon curve by bootstrapping with cubic splines. MMA 708 Analytcal Fnance II Creatng a zero coupon curve by bootstrappng wth cubc splnes. erg Gryshkevych Professor: Jan R. M. Röman 0.2.200 Dvson of Appled Mathematcs chool of Educaton, Culture and Communcaton

More information

Monte Carlo Rendering

Monte Carlo Rendering Last Tme? Monte Carlo Renderng Monte-Carlo Integraton Probabltes and Varance Analyss of Monte-Carlo Integraton Monte-Carlo n Graphcs Stratfed Samplng Importance Samplng Advanced Monte-Carlo Renderng Monte-Carlo

More information

Chapter 3 Student Lecture Notes 3-1

Chapter 3 Student Lecture Notes 3-1 Chapter 3 Student Lecture otes 3-1 Busness Statstcs: A Decson-Makng Approach 6 th Edton Chapter 3 Descrbng Data Usng umercal Measures 005 Prentce-Hall, Inc. Chap 3-1 Chapter Goals After completng ths chapter,

More information

TCOM501 Networking: Theory & Fundamentals Final Examination Professor Yannis A. Korilis April 26, 2002

TCOM501 Networking: Theory & Fundamentals Final Examination Professor Yannis A. Korilis April 26, 2002 TO5 Networng: Theory & undamentals nal xamnaton Professor Yanns. orls prl, Problem [ ponts]: onsder a rng networ wth nodes,,,. In ths networ, a customer that completes servce at node exts the networ wth

More information

Equilibrium in Prediction Markets with Buyers and Sellers

Equilibrium in Prediction Markets with Buyers and Sellers Equlbrum n Predcton Markets wth Buyers and Sellers Shpra Agrawal Nmrod Megddo Benamn Armbruster Abstract Predcton markets wth buyers and sellers of contracts on multple outcomes are shown to have unque

More information

Understanding price volatility in electricity markets

Understanding price volatility in electricity markets Proceedngs of the 33rd Hawa Internatonal Conference on System Scences - 2 Understandng prce volatlty n electrcty markets Fernando L. Alvarado, The Unversty of Wsconsn Rajesh Rajaraman, Chrstensen Assocates

More information

An Application of Alternative Weighting Matrix Collapsing Approaches for Improving Sample Estimates

An Application of Alternative Weighting Matrix Collapsing Approaches for Improving Sample Estimates Secton on Survey Research Methods An Applcaton of Alternatve Weghtng Matrx Collapsng Approaches for Improvng Sample Estmates Lnda Tompkns 1, Jay J. Km 2 1 Centers for Dsease Control and Preventon, atonal

More information

Introduction. Chapter 7 - An Introduction to Portfolio Management

Introduction. Chapter 7 - An Introduction to Portfolio Management Introducton In the next three chapters, we wll examne dfferent aspects of captal market theory, ncludng: Brngng rsk and return nto the pcture of nvestment management Markowtz optmzaton Modelng rsk and

More information

- contrast so-called first-best outcome of Lindahl equilibrium with case of private provision through voluntary contributions of households

- contrast so-called first-best outcome of Lindahl equilibrium with case of private provision through voluntary contributions of households Prvate Provson - contrast so-called frst-best outcome of Lndahl equlbrum wth case of prvate provson through voluntary contrbutons of households - need to make an assumpton about how each household expects

More information

Chapter 5 Student Lecture Notes 5-1

Chapter 5 Student Lecture Notes 5-1 Chapter 5 Student Lecture Notes 5-1 Basc Busness Statstcs (9 th Edton) Chapter 5 Some Important Dscrete Probablty Dstrbutons 004 Prentce-Hall, Inc. Chap 5-1 Chapter Topcs The Probablty Dstrbuton of a Dscrete

More information

ISyE 512 Chapter 9. CUSUM and EWMA Control Charts. Instructor: Prof. Kaibo Liu. Department of Industrial and Systems Engineering UW-Madison

ISyE 512 Chapter 9. CUSUM and EWMA Control Charts. Instructor: Prof. Kaibo Liu. Department of Industrial and Systems Engineering UW-Madison ISyE 512 hapter 9 USUM and EWMA ontrol harts Instructor: Prof. Kabo Lu Department of Industral and Systems Engneerng UW-Madson Emal: klu8@wsc.edu Offce: Room 317 (Mechancal Engneerng Buldng) ISyE 512 Instructor:

More information

Tree-based and GA tools for optimal sampling design

Tree-based and GA tools for optimal sampling design Tree-based and GA tools for optmal samplng desgn The R User Conference 2008 August 2-4, Technsche Unverstät Dortmund, Germany Marco Balln, Gulo Barcarol Isttuto Nazonale d Statstca (ISTAT) Defnton of the

More information

Chapter 3 Descriptive Statistics: Numerical Measures Part B

Chapter 3 Descriptive Statistics: Numerical Measures Part B Sldes Prepared by JOHN S. LOUCKS St. Edward s Unversty Slde 1 Chapter 3 Descrptve Statstcs: Numercal Measures Part B Measures of Dstrbuton Shape, Relatve Locaton, and Detectng Outlers Eploratory Data Analyss

More information

Lecture 7. We now use Brouwer s fixed point theorem to prove Nash s theorem.

Lecture 7. We now use Brouwer s fixed point theorem to prove Nash s theorem. Topcs on the Border of Economcs and Computaton December 11, 2005 Lecturer: Noam Nsan Lecture 7 Scrbe: Yoram Bachrach 1 Nash s Theorem We begn by provng Nash s Theorem about the exstance of a mxed strategy

More information

Notes on experimental uncertainties and their propagation

Notes on experimental uncertainties and their propagation Ed Eyler 003 otes on epermental uncertantes and ther propagaton These notes are not ntended as a complete set of lecture notes, but nstead as an enumeraton of some of the key statstcal deas needed to obtan

More information

Global sensitivity analysis of credit risk portfolios

Global sensitivity analysis of credit risk portfolios Global senstvty analyss of credt rsk portfolos D. Baur, J. Carbon & F. Campolongo European Commsson, Jont Research Centre, Italy Abstract Ths paper proposes the use of global senstvty analyss to evaluate

More information

Spurious Seasonal Patterns and Excess Smoothness in the BLS Local Area Unemployment Statistics

Spurious Seasonal Patterns and Excess Smoothness in the BLS Local Area Unemployment Statistics Spurous Seasonal Patterns and Excess Smoothness n the BLS Local Area Unemployment Statstcs Keth R. Phllps and Janguo Wang Federal Reserve Bank of Dallas Research Department Workng Paper 1305 September

More information

Appendix for Solving Asset Pricing Models when the Price-Dividend Function is Analytic

Appendix for Solving Asset Pricing Models when the Price-Dividend Function is Analytic Appendx for Solvng Asset Prcng Models when the Prce-Dvdend Functon s Analytc Ovdu L. Caln Yu Chen Thomas F. Cosmano and Alex A. Hmonas January 3, 5 Ths appendx provdes proofs of some results stated n our

More information

Global Optimization in Multi-Agent Models

Global Optimization in Multi-Agent Models Global Optmzaton n Mult-Agent Models John R. Brge R.R. McCormck School of Engneerng and Appled Scence Northwestern Unversty Jont work wth Chonawee Supatgat, Enron, and Rachel Zhang, Cornell 11/19/2004

More information

Raising Food Prices and Welfare Change: A Simple Calibration. Xiaohua Yu

Raising Food Prices and Welfare Change: A Simple Calibration. Xiaohua Yu Rasng Food Prces and Welfare Change: A Smple Calbraton Xaohua Yu Professor of Agrcultural Economcs Courant Research Centre Poverty, Equty and Growth Unversty of Göttngen CRC-PEG, Wlhelm-weber-Str. 2 3773

More information

Simulation Budget Allocation for Further Enhancing the Efficiency of Ordinal Optimization

Simulation Budget Allocation for Further Enhancing the Efficiency of Ordinal Optimization Dscrete Event Dynamc Systems: Theory and Applcatons, 10, 51 70, 000. c 000 Kluwer Academc Publshers, Boston. Manufactured n The Netherlands. Smulaton Budget Allocaton for Further Enhancng the Effcency

More information

Information Flow and Recovering the. Estimating the Moments of. Normality of Asset Returns

Information Flow and Recovering the. Estimating the Moments of. Normality of Asset Returns Estmatng the Moments of Informaton Flow and Recoverng the Normalty of Asset Returns Ané and Geman (Journal of Fnance, 2000) Revsted Anthony Murphy, Nuffeld College, Oxford Marwan Izzeldn, Unversty of Lecester

More information

Discounted Cash Flow (DCF) Analysis: What s Wrong With It And How To Fix It

Discounted Cash Flow (DCF) Analysis: What s Wrong With It And How To Fix It Dscounted Cash Flow (DCF Analyss: What s Wrong Wth It And How To Fx It Arturo Cfuentes (* CREM Facultad de Economa y Negocos Unversdad de Chle June 2014 (* Jont effort wth Francsco Hawas; Depto. de Ingenera

More information

Least Cost Strategies for Complying with New NOx Emissions Limits

Least Cost Strategies for Complying with New NOx Emissions Limits Least Cost Strateges for Complyng wth New NOx Emssons Lmts Internatonal Assocaton for Energy Economcs New England Chapter Presented by Assef A. Zoban Tabors Caramans & Assocates Cambrdge, MA 02138 January

More information

Solution of periodic review inventory model with general constrains

Solution of periodic review inventory model with general constrains Soluton of perodc revew nventory model wth general constrans Soluton of perodc revew nventory model wth general constrans Prof Dr J Benkő SZIU Gödöllő Summary Reasons for presence of nventory (stock of

More information

Robust Stochastic Lot-Sizing by Means of Histograms

Robust Stochastic Lot-Sizing by Means of Histograms Robust Stochastc Lot-Szng by Means of Hstograms Abstract Tradtonal approaches n nventory control frst estmate the demand dstrbuton among a predefned famly of dstrbutons based on data fttng of hstorcal

More information

UNIVERSITY OF NOTTINGHAM

UNIVERSITY OF NOTTINGHAM UNIVERSITY OF NOTTINGHAM SCHOOL OF ECONOMICS DISCUSSION PAPER 99/28 Welfare Analyss n a Cournot Game wth a Publc Good by Indraneel Dasgupta School of Economcs, Unversty of Nottngham, Nottngham NG7 2RD,

More information

ECE 586GT: Problem Set 2: Problems and Solutions Uniqueness of Nash equilibria, zero sum games, evolutionary dynamics

ECE 586GT: Problem Set 2: Problems and Solutions Uniqueness of Nash equilibria, zero sum games, evolutionary dynamics Unversty of Illnos Fall 08 ECE 586GT: Problem Set : Problems and Solutons Unqueness of Nash equlbra, zero sum games, evolutonary dynamcs Due: Tuesday, Sept. 5, at begnnng of class Readng: Course notes,

More information

Games and Decisions. Part I: Basic Theorems. Contents. 1 Introduction. Jane Yuxin Wang. 1 Introduction 1. 2 Two-player Games 2

Games and Decisions. Part I: Basic Theorems. Contents. 1 Introduction. Jane Yuxin Wang. 1 Introduction 1. 2 Two-player Games 2 Games and Decsons Part I: Basc Theorems Jane Yuxn Wang Contents 1 Introducton 1 2 Two-player Games 2 2.1 Zero-sum Games................................ 3 2.1.1 Pure Strateges.............................

More information

The Mack-Method and Analysis of Variability. Erasmus Gerigk

The Mack-Method and Analysis of Variability. Erasmus Gerigk The Mac-Method and Analyss of Varablty Erasmus Gerg ontents/outlne Introducton Revew of two reservng recpes: Incremental Loss-Rato Method han-ladder Method Mac s model assumptons and estmatng varablty

More information

Capability Analysis. Chapter 255. Introduction. Capability Analysis

Capability Analysis. Chapter 255. Introduction. Capability Analysis Chapter 55 Introducton Ths procedure summarzes the performance of a process based on user-specfed specfcaton lmts. The observed performance as well as the performance relatve to the Normal dstrbuton are

More information

Centre for International Capital Markets

Centre for International Capital Markets Centre for Internatonal Captal Markets Dscusson Papers ISSN 1749-3412 Valung Amercan Style Dervatves by Least Squares Methods Maro Cerrato No 2007-13 Valung Amercan Style Dervatves by Least Squares Methods

More information

Online Appendix for Merger Review for Markets with Buyer Power

Online Appendix for Merger Review for Markets with Buyer Power Onlne Appendx for Merger Revew for Markets wth Buyer Power Smon Loertscher Lesle M. Marx July 23, 2018 Introducton In ths appendx we extend the framework of Loertscher and Marx (forthcomng) to allow two

More information

Bayesian belief networks

Bayesian belief networks CS 2750 achne Learnng Lecture 12 ayesan belef networks los Hauskrecht mlos@cs.ptt.edu 5329 Sennott Square CS 2750 achne Learnng Densty estmaton Data: D { D1 D2.. Dn} D x a vector of attrbute values ttrbutes:

More information

Data Mining Linear and Logistic Regression

Data Mining Linear and Logistic Regression 07/02/207 Data Mnng Lnear and Logstc Regresson Mchael L of 26 Regresson In statstcal modellng, regresson analyss s a statstcal process for estmatng the relatonshps among varables. Regresson models are

More information

Evaluating Performance

Evaluating Performance 5 Chapter Evaluatng Performance In Ths Chapter Dollar-Weghted Rate of Return Tme-Weghted Rate of Return Income Rate of Return Prncpal Rate of Return Daly Returns MPT Statstcs 5- Measurng Rates of Return

More information

New Distance Measures on Dual Hesitant Fuzzy Sets and Their Application in Pattern Recognition

New Distance Measures on Dual Hesitant Fuzzy Sets and Their Application in Pattern Recognition Journal of Artfcal Intellgence Practce (206) : 8-3 Clausus Scentfc Press, Canada New Dstance Measures on Dual Hestant Fuzzy Sets and Ther Applcaton n Pattern Recognton L Xn a, Zhang Xaohong* b College

More information

Skewness and kurtosis unbiased by Gaussian uncertainties

Skewness and kurtosis unbiased by Gaussian uncertainties Skewness and kurtoss unbased by Gaussan uncertantes Lorenzo Rmoldn Observatore astronomque de l Unversté de Genève, chemn des Mallettes 5, CH-9 Versox, Swtzerland ISDC Data Centre for Astrophyscs, Unversté

More information

Economics 1410 Fall Section 7 Notes 1. Define the tax in a flexible way using T (z), where z is the income reported by the agent.

Economics 1410 Fall Section 7 Notes 1. Define the tax in a flexible way using T (z), where z is the income reported by the agent. Economcs 1410 Fall 2017 Harvard Unversty Yaan Al-Karableh Secton 7 Notes 1 I. The ncome taxaton problem Defne the tax n a flexble way usng T (), where s the ncome reported by the agent. Retenton functon:

More information

2) In the medium-run/long-run, a decrease in the budget deficit will produce:

2) In the medium-run/long-run, a decrease in the budget deficit will produce: 4.02 Quz 2 Solutons Fall 2004 Multple-Choce Questons ) Consder the wage-settng and prce-settng equatons we studed n class. Suppose the markup, µ, equals 0.25, and F(u,z) = -u. What s the natural rate of

More information

Optimising a general repair kit problem with a service constraint

Optimising a general repair kit problem with a service constraint Optmsng a general repar kt problem wth a servce constrant Marco Bjvank 1, Ger Koole Department of Mathematcs, VU Unversty Amsterdam, De Boelelaan 1081a, 1081 HV Amsterdam, The Netherlands Irs F.A. Vs Department

More information

Spatial Variations in Covariates on Marriage and Marital Fertility: Geographically Weighted Regression Analyses in Japan

Spatial Variations in Covariates on Marriage and Marital Fertility: Geographically Weighted Regression Analyses in Japan Spatal Varatons n Covarates on Marrage and Martal Fertlty: Geographcally Weghted Regresson Analyses n Japan Kenj Kamata (Natonal Insttute of Populaton and Socal Securty Research) Abstract (134) To understand

More information

FORD MOTOR CREDIT COMPANY SUGGESTED ANSWERS. Richard M. Levich. New York University Stern School of Business. Revised, February 1999

FORD MOTOR CREDIT COMPANY SUGGESTED ANSWERS. Richard M. Levich. New York University Stern School of Business. Revised, February 1999 FORD MOTOR CREDIT COMPANY SUGGESTED ANSWERS by Rchard M. Levch New York Unversty Stern School of Busness Revsed, February 1999 1 SETTING UP THE PROBLEM The bond s beng sold to Swss nvestors for a prce

More information

Single-Item Auctions. CS 234r: Markets for Networks and Crowds Lecture 4 Auctions, Mechanisms, and Welfare Maximization

Single-Item Auctions. CS 234r: Markets for Networks and Crowds Lecture 4 Auctions, Mechanisms, and Welfare Maximization CS 234r: Markets for Networks and Crowds Lecture 4 Auctons, Mechansms, and Welfare Maxmzaton Sngle-Item Auctons Suppose we have one or more tems to sell and a pool of potental buyers. How should we decde

More information

Pivot Points for CQG - Overview

Pivot Points for CQG - Overview Pvot Ponts for CQG - Overvew By Bran Bell Introducton Pvot ponts are a well-known technque used by floor traders to calculate ntraday support and resstance levels. Ths technque has been around for decades,

More information

Financial mathematics

Financial mathematics Fnancal mathematcs Jean-Luc Bouchot jean-luc.bouchot@drexel.edu February 19, 2013 Warnng Ths s a work n progress. I can not ensure t to be mstake free at the moment. It s also lackng some nformaton. But

More information

Final Exam. 7. (10 points) Please state whether each of the following statements is true or false. No explanation needed.

Final Exam. 7. (10 points) Please state whether each of the following statements is true or false. No explanation needed. Fnal Exam Fall 4 Econ 8-67 Closed Book. Formula Sheet Provded. Calculators OK. Tme Allowed: hours Please wrte your answers on the page below each queston. (5 ponts) Assume that the rsk-free nterest rate

More information

2.1 Rademacher Calculus... 3

2.1 Rademacher Calculus... 3 COS 598E: Unsupervsed Learnng Week 2 Lecturer: Elad Hazan Scrbe: Kran Vodrahall Contents 1 Introducton 1 2 Non-generatve pproach 1 2.1 Rademacher Calculus............................... 3 3 Spectral utoencoders

More information

Efficient calculation of expected shortfall contributions in large credit portfolios

Efficient calculation of expected shortfall contributions in large credit portfolios Effcent calculaton of expected shortfall contrbutons n large credt portfolos Mchael Kalkbrener 1, Anna Kennedy 1, Monka Popp 2 November 26, 2007 Abstract In the framework of a standard structural credt

More information

Finance 402: Problem Set 1 Solutions

Finance 402: Problem Set 1 Solutions Fnance 402: Problem Set 1 Solutons Note: Where approprate, the fnal answer for each problem s gven n bold talcs for those not nterested n the dscusson of the soluton. 1. The annual coupon rate s 6%. A

More information

A FRAMEWORK FOR PRIORITY CONTACT OF NON RESPONDENTS

A FRAMEWORK FOR PRIORITY CONTACT OF NON RESPONDENTS A FRAMEWORK FOR PRIORITY CONTACT OF NON RESPONDENTS Rchard McKenze, Australan Bureau of Statstcs. 12p36 Exchange Plaza, GPO Box K881, Perth, WA 6001. rchard.mckenze@abs.gov.au ABSTRACT Busnesses whch have

More information

Comparison of Singular Spectrum Analysis and ARIMA

Comparison of Singular Spectrum Analysis and ARIMA Int. Statstcal Inst.: Proc. 58th World Statstcal Congress, 0, Dubln (Sesson CPS009) p.99 Comparson of Sngular Spectrum Analss and ARIMA Models Zokae, Mohammad Shahd Behesht Unverst, Department of Statstcs

More information

Parallel Prefix addition

Parallel Prefix addition Marcelo Kryger Sudent ID 015629850 Parallel Prefx addton The parallel prefx adder presented next, performs the addton of two bnary numbers n tme of complexty O(log n) and lnear cost O(n). Lets notce the

More information

Stochastic ALM models - General Methodology

Stochastic ALM models - General Methodology Stochastc ALM models - General Methodology Stochastc ALM models are generally mplemented wthn separate modules: A stochastc scenaros generator (ESG) A cash-flow projecton tool (or ALM projecton) For projectng

More information