The Option-Critic Architecture


Pierre-Luc Bacon, Jean Harb and Doina Precup
Reasoning and Learning Lab, School of Computer Science, McGill University
{pbacon, jharb, ...}
arXiv v2 [cs.AI], 3 Dec 2016

Abstract

Temporal abstraction is key to scaling up learning and planning in reinforcement learning. While planning with temporally extended actions is well understood, creating such abstractions autonomously from data has remained challenging. We tackle this problem in the framework of options [Sutton, Precup & Singh, 1999; Precup, 2000]. We derive policy gradient theorems for options and propose a new option-critic architecture capable of learning both the internal policies and the termination conditions of options, in tandem with the policy over options, and without the need to provide any additional rewards or subgoals. Experimental results in both discrete and continuous environments showcase the flexibility and efficiency of the framework.

Introduction

Temporal abstraction allows representing knowledge about courses of action that take place at different time scales. In reinforcement learning, options (Sutton, Precup, and Singh 1999; Precup 2000) provide a framework for defining such courses of action and for seamlessly learning and planning with them. Discovering temporal abstractions autonomously has been the subject of extensive research efforts in the last 15 years (McGovern and Barto 2001; Stolle and Precup 2002; Menache, Mannor, and Shimkin 2002; Şimşek and Barto 2009; Silver and Ciosek 2012), but approaches that can be used naturally with continuous state and/or action spaces have only recently started to become feasible (Konidaris et al. 2011; Niekum 2013; Mann, Mannor, and Precup 2015; Mankowitz, Mann, and Mannor 2016; Kulkarni et al. 2016; Vezhnevets et al. 2016; Daniel et al. 2016).

The majority of the existing work has focused on finding subgoals (useful states that an agent should reach) and subsequently learning policies to achieve them. This idea has led to interesting methods, but ones which are also difficult to scale up given their combinatorial flavor. Additionally, learning policies associated with subgoals can be expensive in terms of data and computation time; in the worst case, it can be as expensive as solving the entire task.

We present an alternative view, which blurs the line between the problem of discovering options and that of learning options. Based on the policy gradient theorem (Sutton et al. 2000), we derive new results which enable a gradual learning process of the intra-option policies and termination functions, simultaneously with the policy over them. This approach works naturally with both linear and non-linear function approximators, under discrete or continuous state and action spaces. Existing methods for learning options are considerably slower when learning from a single task: much of the benefit comes from re-using the learned options in similar tasks. In contrast, we show that our approach is capable of successfully learning options within a single task without incurring any slowdown and while still providing benefits for transfer learning.

We start by reviewing background related to the two main ingredients of our work: policy gradient methods and options. We then describe the core ideas of our approach: the intra-option policy and termination gradient theorems. Additional technical details are included in the appendix. We present experimental results showing that our approach learns meaningful temporally extended behaviors in an effective manner. As opposed to other methods, we only need to specify the number of desired options; it is not necessary to have subgoals, extra rewards, demonstrations, multiple problems or any other special accommodations (however, the approach can take advantage of pseudo-reward functions if desired). To our knowledge, this is the first end-to-end approach for learning options that scales to very large domains at comparable efficiency.

Preliminaries and Notation

A Markov Decision Process consists of a set of states S, a set of actions A, a transition function P : S × A → (S → [0, 1]) and a reward function r : S × A → ℝ. For convenience, we develop our ideas assuming discrete state and action sets. However, our results extend to continuous spaces using the usual measure-theoretic assumptions (some of our empirical results are in continuous tasks). A (Markovian stationary) policy is a probability distribution over actions conditioned on states, π : S × A → [0, 1]. In discounted problems, the value function of a policy π is defined as the expected return $V^\pi(s) = \mathbb{E}_\pi\big[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_0 = s\big]$ and its action-value function as $Q^\pi(s, a) = \mathbb{E}_\pi\big[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_0 = s, a_0 = a\big]$, where γ ∈ [0, 1) is the discount factor. A policy π is greedy with respect to a given action-value function Q if π(s, a) > 0 iff a = argmax_{a'} Q(s, a'). In a discrete MDP, there is at least one optimal policy which is greedy with respect to its own action-value function.
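
To make the preliminaries concrete, here is a minimal sketch (not from the paper) of estimating the value function defined above by Monte Carlo rollouts; the env interface (reset/step) and the policy sampler are hypothetical stand-ins introduced only for illustration.

    # Minimal sketch: Monte Carlo estimate of V^pi(s_0) = E_pi[sum_t gamma^t r_{t+1} | s_0].
    # `env` and `policy` are assumed interfaces, not part of the paper.
    import numpy as np

    def estimate_value(env, policy, gamma=0.99, episodes=1000, max_steps=500):
        returns = []
        for _ in range(episodes):
            s = env.reset()
            g, discount = 0.0, 1.0
            for _ in range(max_steps):
                a = policy(s)                     # sample a ~ pi(.|s)
                s, r, done = env.step(a)          # assumed to return (next state, reward, done)
                g += discount * r
                discount *= gamma
                if done:
                    break
            returns.append(g)
        return np.mean(returns)                   # estimate of V^pi(s_0)
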

Policy gradient methods (Sutton et al. 2000; Konda and Tsitsiklis 2000) address the problem of finding a good policy by performing stochastic gradient descent to optimize a performance objective over a given family of parametrized stochastic policies, π_θ. The policy gradient theorem (Sutton et al. 2000) provides expressions for the gradient of the average reward and discounted reward objectives with respect to θ. In the discounted setting, the objective is defined with respect to a designated start state (or distribution) s_0: $\rho(\theta, s_0) = \mathbb{E}_{\pi_\theta}\big[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_0\big]$. The policy gradient theorem shows that:

$$\frac{\partial \rho(\theta, s_0)}{\partial \theta} = \sum_s \mu_{\pi_\theta}(s \mid s_0) \sum_a \frac{\partial \pi_\theta(a \mid s)}{\partial \theta}\, Q^{\pi_\theta}(s, a),$$

where $\mu_{\pi_\theta}(s \mid s_0) = \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid s_0)$ is a discounted weighting of the states along the trajectories starting from s_0. In practice, the policy gradient is estimated from samples along the on-policy stationary distribution. (Thomas 2014) showed that neglecting the discount factor in this stationary distribution makes the usual policy gradient estimator biased. However, correcting for this discrepancy also reduces data efficiency. For simplicity, we build on the framework of (Sutton et al. 2000) and discuss how to extend our results according to (Thomas 2014).

The options framework (Sutton, Precup, and Singh 1999; Precup 2000) formalizes the idea of temporally extended actions. A Markovian option ω ∈ Ω is a triple (I_ω, π_ω, β_ω) in which I_ω ⊆ S is an initiation set, π_ω is an intra-option policy, and β_ω : S → [0, 1] is a termination function. We also assume that ∀s ∈ S, ∀ω ∈ Ω : s ∈ I_ω (i.e., all options are available everywhere), an assumption made in the majority of option discovery algorithms. We will discuss how to dispense with this assumption in the final section. (Sutton, Precup, and Singh 1999; Precup 2000) show that an MDP endowed with a set of options becomes a Semi-Markov Decision Process (Puterman 1994, chapter 11), which has a corresponding optimal value function over options V_Ω(s) and option-value function Q_Ω(s, ω). Learning and planning algorithms for MDPs have their counterparts in this setting. However, the existence of the underlying MDP offers the possibility of learning about many different options in parallel: this is the idea of intra-option learning, which we leverage in our work.

Learning Options

We adopt a continual perspective on the problem of learning options. At any time, we would like to distill all of the available experience into every component of our system: value function and policy over options, intra-option policies and termination functions. To achieve this goal, we focus on learning option policies and termination functions, assuming they are represented using differentiable parameterized function approximators.

We consider the call-and-return option execution model, in which an agent picks option ω according to its policy over options π_Ω, then follows the intra-option policy π_ω until termination (as dictated by β_ω), at which point this procedure is repeated. Let π_{ω,θ} denote the intra-option policy of option ω parametrized by θ and β_{ω,ϑ} the termination function of ω parametrized by ϑ. We present two new results for learning options, obtained using as a blueprint the policy gradient theorem (Sutton et al. 2000). Both results are derived under the assumption that the goal is to learn options that maximize the expected return in the current task. However, if one wanted to add extra information to the objective function, this could readily be done so long as it comes in the form of an additive differentiable function.
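
The call-and-return execution model described above can be sketched as follows; the environment interface and the dictionaries of intra-option policies and termination functions are illustrative assumptions, not the authors' code.

    # Minimal sketch of call-and-return execution: pick an option with pi_Omega, follow its
    # intra-option policy until its termination function fires, then repeat.
    import numpy as np

    def run_call_and_return(env, pi_Omega, intra_policies, terminations, max_steps=1000):
        rng = np.random.default_rng()
        s = env.reset()
        w = pi_Omega(s)                                  # choose an option
        trajectory = []
        for _ in range(max_steps):
            a = intra_policies[w](s)                     # a ~ pi_{w,theta}(.|s)
            s_next, r, done = env.step(a)
            trajectory.append((s, w, a, r))
            if done:
                break
            if rng.random() < terminations[w](s_next):   # terminate with prob. beta_{w,vartheta}(s')
                w = pi_Omega(s_next)                     # only then re-query the policy over options
            s = s_next
        return trajectory
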
Suppose we aim to optimize directly the discounted return, expected over all the trajectories starting at a designated state s_0 and option ω_0; the objective is then $\rho(\Omega, \theta, \vartheta, s_0, \omega_0) = \mathbb{E}_{\Omega,\theta,\vartheta}\big[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_0, \omega_0\big]$. Note that this return depends on the policy over options, as well as the parameters of the option policies and termination functions. We will take gradients of this objective with respect to θ and ϑ. In order to do this, we will manipulate equations similar to those used in intra-option learning (Sutton, Precup, and Singh 1999, section 8). Specifically, the definition of the option-value function can be written as:

$$Q_\Omega(s, \omega) = \sum_a \pi_{\omega,\theta}(a \mid s)\, Q_U(s, \omega, a), \qquad (1)$$

where Q_U : S × Ω × A → ℝ is the value of executing an action in the context of a state-option pair:

$$Q_U(s, \omega, a) = r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, U(\omega, s'). \qquad (2)$$

Note that the (s, ω) pairs lead to an augmented state space, cf. (Levy and Shimkin 2011). However, we will not work explicitly with this space; it is used only to simplify the derivation. The function U : Ω × S → ℝ is called the option-value function upon arrival (Sutton, Precup, and Singh 1999, equation 20). The value of executing ω upon entering a state s' is given by:

$$U(\omega, s') = (1 - \beta_{\omega,\vartheta}(s'))\, Q_\Omega(s', \omega) + \beta_{\omega,\vartheta}(s')\, V_\Omega(s'). \qquad (3)$$

Note that Q_U and U both depend on θ and ϑ, but we do not include these in the notation for clarity.

The last ingredient required to derive policy gradients is the Markov chain along which the performance measure is estimated. The natural approach is to consider the chain defined in the augmented state space, because state-option pairs now play the role of regular states in a usual Markov chain. If option ω_t has been initiated or is executing at time t in state s_t, then the probability of transitioning to (s_{t+1}, ω_{t+1}) in one step is:

$$P(s_{t+1}, \omega_{t+1} \mid s_t, \omega_t) = \sum_a \pi_{\omega_t,\theta}(a \mid s_t)\, P(s_{t+1} \mid s_t, a) \Big( (1 - \beta_{\omega_t,\vartheta}(s_{t+1}))\, \mathbf{1}_{\omega_{t+1}=\omega_t} + \beta_{\omega_t,\vartheta}(s_{t+1})\, \pi_\Omega(\omega_{t+1} \mid s_{t+1}) \Big). \qquad (4)$$

Clearly, the process given by (4) is homogeneous. Under mild conditions, and with options available everywhere, it is in fact ergodic, and a unique stationary distribution over state-option pairs exists.
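
A minimal tabular rendering of equations (1)-(4), assuming dense NumPy arrays for the policies, terminations, transition model and rewards (the array names and shapes are illustrative assumptions, not the authors' code):

    # Assumed shapes: pi[w, s, a] intra-option policies, beta[w, s] terminations,
    # pi_O[s, w] policy over options, P[s, a, s'] transitions, r[s, a] rewards,
    # Q_Omega[s, w] option values.
    import numpy as np

    def upon_arrival_U(Q_Omega, beta, pi_O):
        V_Omega = (pi_O * Q_Omega).sum(axis=1)                      # V_Omega(s') under pi_Omega
        return (1.0 - beta) * Q_Omega.T + beta * V_Omega[None, :]   # eq. (3): U[w, s']

    def action_value_QU(P, r, U, gamma):
        # eq. (2): Q_U(s, w, a) = r(s, a) + gamma * sum_s' P(s'|s,a) U(w, s')
        return r[:, None, :] + gamma * np.einsum('sap,wp->swa', P, U)

    def option_value_QOmega(pi, Q_U):
        # eq. (1): Q_Omega(s, w) = sum_a pi_{w,theta}(a|s) Q_U(s, w, a)
        return np.einsum('wsa,swa->sw', pi, Q_U)

    def augmented_step(pi, P, beta, pi_O, s, w):
        # eq. (4): P(s', w' | s, w), returned as an array of shape [n_states, n_options]
        p_next_s = pi[w, s] @ P[s]                                   # sum_a pi(a|s,w) P(s'|s,a)
        out = p_next_s[:, None] * beta[w][:, None] * pi_O            # option terminates, switch to w'
        out[:, w] += p_next_s * (1.0 - beta[w])                      # option w continues
        return out
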

We will now compute the gradient of the expected discounted return with respect to the parameters θ of the intra-option policies, assuming that they are stochastic and differentiable. From (1, 2), it follows that:

$$\frac{\partial Q_\Omega(s, \omega)}{\partial \theta} = \sum_a \left( \frac{\partial \pi_{\omega,\theta}(a \mid s)}{\partial \theta}\, Q_U(s, \omega, a) + \pi_{\omega,\theta}(a \mid s)\, \gamma \sum_{s'} P(s' \mid s, a)\, \frac{\partial U(\omega, s')}{\partial \theta} \right).$$

We can further expand the right hand side using (3) and (4), which yields the following theorem:

Theorem 1 (Intra-Option Policy Gradient Theorem). Given a set of Markov options with stochastic intra-option policies differentiable in their parameters θ, the gradient of the expected discounted return with respect to θ and initial condition (s_0, ω_0) is:

$$\sum_{s,\omega} \mu_\Omega(s, \omega \mid s_0, \omega_0) \sum_a \frac{\partial \pi_{\omega,\theta}(a \mid s)}{\partial \theta}\, Q_U(s, \omega, a),$$

where $\mu_\Omega(s, \omega \mid s_0, \omega_0) = \sum_{t=0}^{\infty} \gamma^t P(s_t = s, \omega_t = \omega \mid s_0, \omega_0)$ is a discounted weighting of state-option pairs along trajectories starting from (s_0, ω_0).

The proof is in the appendix. This gradient describes the effect of a local change at the primitive level on the global expected discounted return. In contrast, subgoal or pseudo-reward methods assume the objective of an option is simply to optimize its own reward function, ignoring how a proposed change would propagate in the overall objective.

We now turn our attention to computing gradients for the termination functions, assumed this time to be stochastic and differentiable in ϑ. From (1, 2, 3), we have:

$$\frac{\partial Q_\Omega(s, \omega)}{\partial \vartheta} = \sum_a \pi_{\omega,\theta}(a \mid s)\, \gamma \sum_{s'} P(s' \mid s, a)\, \frac{\partial U(\omega, s')}{\partial \vartheta}.$$

Hence, the key quantity is the gradient of U. This is a natural consequence of the call-and-return execution, in which the goodness of termination functions can only be evaluated upon entering the next state. The relevant gradient can be further expanded as:

$$\frac{\partial U(\omega, s')}{\partial \vartheta} = -\frac{\partial \beta_{\omega,\vartheta}(s')}{\partial \vartheta}\, A_\Omega(s', \omega) + \gamma \sum_{\omega'} \sum_{s''} P(s'', \omega' \mid s', \omega)\, \frac{\partial U(\omega', s'')}{\partial \vartheta}, \qquad (5)$$

where A_Ω is the advantage function (Baird 1993) over options, A_Ω(s', ω) = Q_Ω(s', ω) − V_Ω(s'). Expanding ∂U(ω, s')/∂ϑ recursively leads to a similar form as in Theorem 1, but where the weighting of state-option pairs is now according to a Markov chain shifted by one time step: μ_Ω(s_{t+1}, ω_t | s_t, ω_{t−1}) (details are in the appendix).

Theorem 2 (Termination Gradient Theorem). Given a set of Markov options with stochastic termination functions differentiable in their parameters ϑ, the gradient of the expected discounted return objective with respect to ϑ and the initial condition (s_1, ω_0) is:

$$-\sum_{s',\omega} \mu_\Omega(s', \omega \mid s_1, \omega_0)\, \frac{\partial \beta_{\omega,\vartheta}(s')}{\partial \vartheta}\, A_\Omega(s', \omega),$$

where $\mu_\Omega(s', \omega \mid s_1, \omega_0) = \sum_{t=0}^{\infty} \gamma^t P(s_{t+1} = s', \omega_t = \omega \mid s_1, \omega_0)$ is a discounted weighting of state-option pairs from (s_1, ω_0).

The advantage function often appears in policy gradient methods (Sutton et al. 2000) when forming a baseline to reduce the variance in the gradient estimates. Its presence in that context has to do mostly with algorithm design. It is interesting that in our case, it follows as a direct consequence of the derivation and gives the theorem an intuitive interpretation: when the option choice is suboptimal with respect to the expected value over all options, the advantage function is negative and it drives the gradient corrections up, which increases the odds of terminating. After termination, the agent has the opportunity to pick a better option using π_Ω. A similar idea also underlies the interrupting execution model of options (Sutton, Precup, and Singh 1999), in which termination is forced whenever the value of Q_Ω(s', ω) for the current option ω is less than V_Ω(s'). (Mann, Mankowitz, and Mannor 2014) recently studied interrupting options through the lens of an interrupting Bellman operator in a value-iteration setting. The termination gradient theorem can be interpreted as providing a gradient-based interrupting Bellman operator.

Algorithms and Architecture

Figure 1: Diagram of the option-critic architecture (policy over options π_Ω; options with intra-option policies π_ω and terminations β_ω; critic maintaining Q_U, A_Ω and the TD error; gradients flowing from critic to options; the agent exchanging a_t, s_t, r_t with the environment). The option execution model is depicted by a switch over the contacts. A new option is selected according to π_Ω only when the current option terminates.

Based on Theorems 1 and 2, we can now design a stochastic gradient descent algorithm for learning options. Using a two-timescale framework (Konda and Tsitsiklis 2000), we propose to learn the values at a fast timescale while updating the intra-option policies and termination functions at a slower rate.
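
The sign of the advantage in Theorem 2 is what drives terminations. A small numerical sketch (assuming a linear-sigmoid termination function, as used later in the experiments, and hypothetical features and step size) shows that an update in the direction prescribed by the theorem raises the termination probability when the advantage is negative:

    # Theorem 2 suggests vartheta <- vartheta - alpha * grad beta(s') * A_Omega(s', w).
    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def termination_update(vartheta, phi_s_next, adv, alpha=0.25):
        beta = sigmoid(vartheta @ phi_s_next)
        grad_beta = beta * (1.0 - beta) * phi_s_next        # gradient of the sigmoid termination
        return vartheta - alpha * grad_beta * adv

    vartheta = np.zeros(3)
    phi = np.array([1.0, 0.5, -0.2])                         # hypothetical features of s'
    before = sigmoid(vartheta @ phi)
    after = sigmoid(termination_update(vartheta, phi, adv=-1.0) @ phi)
    print(before, after)   # a negative advantage pushes the termination probability up
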

We refer to the resulting system as an option-critic architecture, in reference to the actor-critic architectures (Sutton 1984). The intra-option policies, termination functions and policy over options belong to the actor part of the system, while the critic consists of Q_U and A_Ω. The option-critic architecture does not prescribe how to obtain π_Ω, since a variety of existing approaches would apply: using policy gradient methods at the SMDP level, with a planner over the options models, or using temporal difference updates. If π_Ω is the greedy policy over options, it follows from (2) that the corresponding one-step off-policy update target g_t^{(1)} is:

$$g_t^{(1)} = r_{t+1} + \gamma \Big( (1 - \beta_{\omega_t,\vartheta}(s_{t+1})) \sum_a \pi_{\omega_t,\theta}(a \mid s_{t+1})\, Q_U(s_{t+1}, \omega_t, a) + \beta_{\omega_t,\vartheta}(s_{t+1}) \max_{\omega} \sum_a \pi_{\omega,\theta}(a \mid s_{t+1})\, Q_U(s_{t+1}, \omega, a) \Big),$$

which is also the update target of the intra-option Q-learning algorithm of (Sutton, Precup, and Singh 1999). A prototypical implementation of option-critic which uses intra-option Q-learning is shown in Algorithm 1. The tabular setting is assumed only for clarity of presentation. We write α, α_θ and α_ϑ for the learning rates of the critic, intra-option policies and termination functions respectively.

Algorithm 1: Option-critic with tabular intra-option Q-learning

    s ← s_0
    Choose ω according to an ε-soft policy over options π_Ω(s)
    repeat
        Choose a according to π_{ω,θ}(a | s)
        Take action a in s, observe s', r
        1. Options evaluation:
            δ ← r − Q_U(s, ω, a)
            if s' is non-terminal then
                δ ← δ + γ (1 − β_{ω,ϑ}(s')) Q_Ω(s', ω) + γ β_{ω,ϑ}(s') max_{ω'} Q_Ω(s', ω')
            end
            Q_U(s, ω, a) ← Q_U(s, ω, a) + α δ
        2. Options improvement:
            θ ← θ + α_θ (∂ log π_{ω,θ}(a | s) / ∂θ) Q_U(s, ω, a)
            ϑ ← ϑ − α_ϑ (∂ β_{ω,ϑ}(s') / ∂ϑ) (Q_Ω(s', ω) − V_Ω(s'))
        if β_{ω,ϑ} terminates in s' then
            choose new ω according to ε-soft(π_Ω(s'))
        s ← s'
    until s' is terminal

Learning Q_U in addition to Q_Ω is computationally wasteful, both in terms of the number of parameters and of samples. A practical solution is to only learn Q_Ω and derive an estimate of Q_U from it. Because Q_U is an expectation over next states, $Q_U(s, \omega, a) = \mathbb{E}_{s' \sim P}\big[r(s, a) + \gamma U(\omega, s') \mid s, \omega, a\big]$, it follows that g_t^{(1)} is an appropriate estimator. We chose this approach for our experiment with deep neural networks in the Arcade Learning Environment.

Experiments

We first consider a navigation task in the four-rooms domain (Sutton, Precup, and Singh 1999). Our goal is to evaluate the ability of a set of options learned fully autonomously to recover from a sudden change in the environment. (Sutton, Precup, and Singh 1999) presented a similar experiment for a set of pre-specified options; the options in our results have not been specified a priori. Initially the goal is located in the east doorway and the initial state is drawn uniformly from all the other cells. After 1000 episodes, the goal moves to a random location in the lower right room. Primitive movements can fail with probability 1/3, in which case the agent transitions randomly to one of the empty adjacent cells. The discount factor was 0.99, and the reward was +1 at the goal and 0 otherwise. We chose to parametrize the intra-option policies with Boltzmann distributions and the terminations with sigmoid functions. The policy over options was learned using intra-option Q-learning. We also implemented primitive actor-critic (denoted AC-PG) using a Boltzmann policy. We also compared option-critic to a primitive SARSA agent using Boltzmann exploration and no eligibility traces. For all Boltzmann policies, we set the temperature parameter to 0.001. All the weights were initialized to zero.

Figure 2 (legend: SARSA(0), AC-PG, OC 4 options, OC 8 options; axes: episodes vs. steps): After 1000 episodes, the goal location in the four-rooms domain is moved randomly. Option-critic (OC) recovers faster than the primitive actor-critic (AC-PG) and SARSA(0). Each line is averaged over 350 runs.

As can be seen in Figure 2, when the goal suddenly changes, the option-critic agent recovers faster. Furthermore, the initial set of options is learned from scratch at a rate comparable to primitive methods. Despite the simplicity of the domain, we are not aware of other methods which could have solved this task without incurring a cost much larger than when using primitive actions alone (McGovern and Barto 2001; Şimşek and Barto 2009).
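
The following is a minimal sketch of Algorithm 1, assuming a discrete environment exposing reset()/step(a) that returns (s', r, done); it is an illustration of the tabular procedure written for this transcription, not the authors' implementation. Q_Ω is re-derived from Q_U via equation (1), V_Ω is approximated by the greedy max over options, and the hyperparameter defaults are placeholders.

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def option_critic_tabular(env, n_states, n_actions, n_options, episodes=1000,
                              gamma=0.99, alpha=0.5, alpha_theta=0.25, alpha_vartheta=0.25,
                              epsilon=0.1, rng=np.random.default_rng(0)):
        Q_U = np.zeros((n_states, n_options, n_actions))
        Q_Omega = np.zeros((n_states, n_options))
        theta = np.zeros((n_options, n_states, n_actions))      # softmax intra-option policies
        vartheta = np.zeros((n_options, n_states))               # sigmoid terminations

        def pick_option(s):                                      # epsilon-soft policy over options
            if rng.random() < epsilon:
                return int(rng.integers(n_options))
            return int(Q_Omega[s].argmax())

        for _ in range(episodes):
            s = env.reset()
            w = pick_option(s)
            done = False
            while not done:
                pi_ws = softmax(theta[w, s])
                a = int(rng.choice(n_actions, p=pi_ws))
                s_next, r, done = env.step(a)

                # 1. Options evaluation (critic, fast timescale)
                beta = sigmoid(vartheta[w, s_next])
                target = r
                if not done:
                    target += gamma * ((1 - beta) * Q_Omega[s_next, w]
                                       + beta * Q_Omega[s_next].max())
                Q_U[s, w, a] += alpha * (target - Q_U[s, w, a])
                Q_Omega[s, w] = pi_ws @ Q_U[s, w]                 # eq. (1)

                # 2. Options improvement (actor, slow timescale)
                grad_log_pi = -pi_ws.copy()
                grad_log_pi[a] += 1.0                             # d log softmax / d theta[w, s, :]
                theta[w, s] += alpha_theta * grad_log_pi * Q_U[s, w, a]
                adv = Q_Omega[s_next, w] - Q_Omega[s_next].max()  # greedy V_Omega
                vartheta[w, s_next] -= alpha_vartheta * beta * (1 - beta) * adv

                if not done and rng.random() < beta:              # termination: re-pick an option
                    w = pick_option(s_next)
                s = s_next
        return Q_Omega, theta, vartheta
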

Figure 3: Termination probabilities for the option-critic agent learning with 4 options. The darkest color represents the walls in the environment while lighter colors encode higher termination probabilities.

In the two temporally extended settings, with 4 options and 8 options, termination events are more likely to occur near the doorways (Figure 3), agreeing with the intuition that they would be good subgoals. As opposed to (Sutton, Precup, and Singh 1999), we did not encode this knowledge ourselves but simply let the agents find options that would maximize the expected discounted return.

Pinball Domain

Figure 4: Pinball: sample trajectory of the solution found after 250 episodes of training using 4 options. All options (color-coded) are used by the policy over options in successful trajectories. The initial state is in the top left corner and the goal is in the bottom right one (red circle).

In the Pinball domain (Konidaris and Barto 2009), a ball must be guided through a maze of arbitrarily shaped polygons to a designated target location. The state space is continuous over the position and velocity of the ball in the x-y plane. At every step, the agent must choose among five discrete primitive actions: move the ball faster or slower, in the vertical or horizontal direction, or take the null action. Collisions with obstacles are elastic and can be used to the advantage of the agent. In this domain, a drag coefficient of 0.995 effectively stops ball movements after a finite number of steps when the null action is chosen repeatedly. Each thrust action incurs a penalty of 5 while taking no action costs 1. The episode terminates with a +10,000 reward when the agent reaches the target. We interrupted any episode taking more than 10,000 steps and set the discount factor to 0.99. We used intra-option Q-learning in the critic with linear function approximation over Fourier bases (Konidaris et al. 2011) of order 3. We experimented with 2, 3 or 4 options. We used Boltzmann policies for the intra-option policies and linear-sigmoid functions for the termination functions. The learning rates were set to 0.01 for the critic and 0.001 for both the intra and termination gradients. We used an epsilon-greedy policy over options.

Figure 5 (axes: episodes vs. undiscounted return; legend: 2, 3 and 4 options): Learning curves in the Pinball domain.

In (Konidaris and Barto 2009), an option can only be used and updated after a gestation period of 10 episodes. As learning is fully integrated in option-critic, by 40 episodes a near optimal set of options had already been learned in all settings. From a qualitative point of view, the options exhibit temporal extension and specialization (Fig. 4). We also observed that across many successful trajectories the red option would consistently be used in the vicinity of the goal.

Arcade Learning Environment

We applied the option-critic architecture in the Arcade Learning Environment (ALE) (Bellemare et al. 2013), using a deep neural network to approximate the critic and represent the intra-option policies and termination functions. We used the same configuration as (Mnih et al. 2013) for the first 3 convolutional layers of the network. We used 32 convolutional filters of size 8 × 8 and a stride of 4 in the first layer, 64 filters of size 4 × 4 with a stride of 2 in the second, and 64 filters of size 3 × 3 with a stride of 1 in the third layer. We then fed the output of the third layer into a dense shared layer of 512 neurons, as depicted in Figure 6. We fixed the learning rate for the intra-option policies and termination gradient to 0.00025 and used RMSProp for the critic.

Figure 6 (output heads: π_Ω(·|s), {β_ω(s)}, {π_ω(·|s)}): Deep neural network architecture. A concatenation of the last 4 images is fed through the convolutional layers, producing a dense representation shared across intra-option policies, termination functions and the policy over options.

We represented the intra-option policies as linear-softmax functions of the fourth (dense) layer, so as to output a probability distribution over actions conditioned on the current observation.
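
A PyTorch-style sketch of the network in Figure 6 (an assumption made for illustration, not the authors' released code): a shared DQN-style convolutional trunk and 512-unit dense layer feeding three heads, one for the critic Q_Ω, one for the sigmoid terminations, and one for the linear-softmax intra-option policies.

    import torch
    import torch.nn as nn

    class OptionCriticNet(nn.Module):
        def __init__(self, n_actions, n_options, in_channels=4):
            super().__init__()
            self.trunk = nn.Sequential(
                nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
                nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
                nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
                nn.Flatten(),
                nn.Linear(64 * 7 * 7, 512), nn.ReLU(),      # 84x84 inputs give a 7x7x64 map
            )
            self.q_omega = nn.Linear(512, n_options)         # critic head: Q_Omega(s, .)
            self.terminations = nn.Linear(512, n_options)    # sigmoid heads: beta_w(s)
            self.policy_logits = nn.Linear(512, n_options * n_actions)
            self.n_options, self.n_actions = n_options, n_actions

        def forward(self, frames):                            # frames: [batch, 4, 84, 84]
            h = self.trunk(frames)
            q = self.q_omega(h)
            beta = torch.sigmoid(self.terminations(h))
            logits = self.policy_logits(h).view(-1, self.n_options, self.n_actions)
            pi = torch.softmax(logits, dim=-1)                # linear-softmax intra-option policies
            return q, beta, pi
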

The termination functions were similarly defined using sigmoid functions, with one output neuron per termination. The critic network was trained using intra-option Q-learning with experience replay. Option policies and terminations were updated on-line. We used an ε-greedy policy over options with ε = 0.05 during the test phase (Mnih et al. 2013).

As a consequence of optimizing for the return, the termination gradient tends to shrink options over time. This is expected, since in theory primitive actions are sufficient for solving any MDP. We tackled this issue by adding a small ξ = 0.01 term to the advantage function used by the termination gradient: A_Ω(s, ω) + ξ = Q_Ω(s, ω) − V_Ω(s) + ξ. This term has a regularization effect, by imposing a ξ-margin between the value estimate of an option and that of the "optimal" one reflected in V_Ω. This makes the advantage function positive if the value of an option is near the optimal one, thereby stretching it. A similar regularizer was proposed in (Mann, Mankowitz, and Mannor 2014). As in (Mnih et al. 2016), we observed that the intra-option policies would quickly become deterministic. This problem seems to pertain to the use of policy gradient methods with deep neural networks in general, and not to option-critic itself. We applied the regularizer prescribed by (Mnih et al. 2016), by penalizing for low-entropy intra-option policies.

Figure 7 (columns grouped as: primitive actions; options, no baseline; 8 options, baseline; 2 options, baseline): Seaquest: Using a baseline in the gradient estimators improves the distribution over actions in the intra-option policies, making them less deterministic. Each column represents one of the options learned in Seaquest. The vertical axis spans the 18 primitive actions of ALE. The empirical action frequencies are coded by intensity.

Finally, the baseline Q_Ω was added to the intra-option policy gradient estimator to reduce its variance. This change provided substantial improvements (Harb 2016) in the quality of the intra-option policy distributions and the overall agent performance, as explained in Figure 7.

We evaluated option-critic in Asterix, Ms. Pacman, Seaquest and Zaxxon. For comparison, we allowed the system to learn for the same number of episodes as (Mnih et al. 2013) and fixed the parameters to the same values in all four domains. Despite having more parameters to learn, option-critic was capable of learning options that would achieve the goal in all games, from the ground up, within the same number of episodes (Figure 8). In Asterix, Seaquest and Zaxxon, option-critic surpassed the performance of the original DQN architecture based on primitive actions. The eight options learned in each game are learned fully end-to-end, in tandem with the feature representation, with no prior specification of a subgoal or pseudo-reward structure.

Figure 8 (panels: (a) Asterix, (b) Ms. Pacman, (c) Seaquest, (d) Zaxxon; axes: epochs vs. average score, compared against DQN): Learning curves in the Arcade Learning Environment. The same set of parameters was used across all four games: 8 options, 0.01 termination regularization, 0.01 entropy regularization, and a baseline for the intra-option policy gradients.

The solution found by option-critic was easy to interpret in the game of Seaquest when learning with only two options. We found that each option specialized in a behavior sequence which would include either the up or the down button. Figure 9 shows a typical transition from one option to the other, first going upward with option 0 then switching to option 1 downward. Options with a similar structure were also found in this game by (Krishnamurthy et al. 2016) using an option discovery algorithm based on graph partitioning.

Figure 9: Up/down specialization in the solution found by option-critic when learning with 2 options in Seaquest. The top bar shows a trajectory in the game, with white representing a segment during which option 1 was active and black for option 2.
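
The two regularizers used in the ALE experiments, the ξ-margin on the advantage inside the termination gradient and the entropy penalty on the intra-option policies, can be sketched as loss terms as follows; the loss assembly is an assumed rendering for illustration rather than the authors' code, with the constants matching the values reported above.

    import torch

    def termination_loss(beta_s_next, q_omega_s_next_w, v_omega_s_next, xi=0.01):
        adv = q_omega_s_next_w - v_omega_s_next + xi            # A_Omega(s', w) + xi
        return (beta_s_next * adv.detach()).mean()               # descending this follows Theorem 2

    def policy_loss(log_pi_a, q_u_estimate, baseline, pi_all, entropy_coef=0.01):
        pg = -(log_pi_a * (q_u_estimate - baseline).detach()).mean()   # Q_Omega baseline cuts variance
        entropy = -(pi_all * torch.log(pi_all + 1e-8)).sum(-1).mean()  # entropy bonus keeps pi stochastic
        return pg - entropy_coef * entropy
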
Related Work

As option discovery has received a lot of attention recently, we now discuss in more detail the place of our approach with respect to others. (Comanici and Precup 2010) used a gradient-based approach for improving only the termination function of semi-Markov options; termination was modeled by a logistic distribution over a cumulative measure of the features observed since initiation. (Levy and Shimkin 2011) also built on policy gradient methods by constructing explicitly the augmented state space and treating stopping events as additional control actions. In contrast, we do not need to construct this (very large) space directly. (Silver and Ciosek 2012) dynamically chained options into longer temporal sequences by relying on compositionality properties. Earlier work on linear options (Sorg and Singh 2010) also used compositionality to plan using linear expectation models for options. Our approach also relies on the Bellman equations and compositionality, but in conjunction with policy gradient methods.

Several very recent papers also attempt to formulate option discovery as an optimization problem with solutions that are compatible with function approximation. (Daniel et al. 2016) learn return-optimizing options by treating the termination functions as hidden variables, and using EM to learn them. (Vezhnevets et al. 2016) consider the problem of learning options that have open-loop intra-option policies, also called macro-actions. As in classical planning, action sequences that are more frequent are cached. A mapping from states to action sequences is learned along with a commitment module, which triggers re-planning when necessary. In contrast, we use closed-loop policies throughout, which are reactive to state information and can provide better solutions. (Mankowitz, Mann, and Mannor 2016) propose a gradient-based option learning algorithm, assuming a particular structure for the initiation sets and termination functions. Under this framework, exactly one option is active in any partition of the state space.

(Kulkarni et al. 2016) use the DQN framework to implement a gradient-based option learner, which uses intrinsic rewards to learn the internal policies of options, and extrinsic rewards to learn the policy over options. As opposed to our framework, descriptions of the subgoals are given as inputs to the option learners. Option-critic is conceptually general and does not require intrinsic motivation for learning the options.

Discussion

We developed a general gradient-based approach for learning simultaneously the intra-option policies and termination functions, as well as the policy over options, in order to optimize a performance objective for the task at hand. Our ALE experiments demonstrate successful end-to-end learning of options in the presence of nonlinear function approximation. As noted, our approach only requires specifying the number of options. However, if one wanted to use additional pseudo-rewards, the option-critic framework would easily accommodate it. In this case, the internal policies and termination function gradients would simply need to be taken with respect to the pseudo-rewards instead of the task reward. A simple instance of this idea, which we used in some of the experiments, is to use additional rewards to encourage options that are indeed temporally extended, by adding a penalty whenever a switching event occurs. Our approach can work seamlessly with any other heuristic for biasing the set of options towards some desirable property (e.g. compositionality or sparsity), as long as it can be expressed as an additive reward structure. However, as seen in the results, such biasing is not necessary to produce good results.

The option-critic architecture relies on the policy gradient theorem and, as discussed in (Thomas 2014), the gradient estimators can be biased in the discounted case. By introducing factors of the form $\gamma^t \prod_{i=1}^{t}(1 - \beta_i)$ in our updates (Thomas 2014, eq. (3)), it would be possible to obtain unbiased estimates. However, we do not recommend this approach, since the sample complexity of the unbiased estimators is generally too high and the biased estimators performed well in our experiments.

Perhaps the biggest remaining limitation of our work is the assumption that all options apply everywhere. In the case of function approximation, a natural extension to initiation sets is to use a classifier over features, or some other form of function approximation. As a result, determining which options are allowed may have a similar cost to evaluating a policy over options (unlike in the tabular setting, where options with sparse initiation sets lead to faster decisions). This is akin to eligibility traces, which are more expensive than using no trace in the tabular case, but have the same complexity with function approximation. If initiation sets are to be learned, the main constraint that needs to be added is that the options and the policy over them lead to an ergodic chain in the augmented state-option space. This can be expressed as a flow condition that links initiation sets with terminations.
The precise description of this condition, as well as sparsity regularization for initiation sets, is left for future work.

Acknowledgements

The authors gratefully acknowledge financial support for this work by the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Fonds de recherche du Québec - Nature et Technologies (FRQNT).

Appendix

Augmented Process

If ω_t has been initiated or is executing at time t, then the discounted probability of transitioning to (s_{t+1}, ω_{t+1}) is:

$$P_\gamma^{(1)}(s_{t+1}, \omega_{t+1} \mid s_t, \omega_t) = \sum_a \pi_{\omega_t}(a \mid s_t)\, \gamma\, P(s_{t+1} \mid s_t, a) \Big( (1 - \beta_{\omega_t}(s_{t+1}))\, \mathbf{1}_{\omega_{t+1}=\omega_t} + \beta_{\omega_t}(s_{t+1})\, \pi_\Omega(\omega_{t+1} \mid s_{t+1}) \Big).$$

When conditioning the process on (s_t, ω_{t−1}), the discounted probability of transitioning to (s_{t+1}, ω_t) is:

$$P_\gamma^{(1)}(s_{t+1}, \omega_t \mid s_t, \omega_{t-1}) = \Big( (1 - \beta_{\omega_{t-1}}(s_t))\, \mathbf{1}_{\omega_t=\omega_{t-1}} + \beta_{\omega_{t-1}}(s_t)\, \pi_\Omega(\omega_t \mid s_t) \Big) \sum_a \pi_{\omega_t}(a \mid s_t)\, \gamma\, P(s_{t+1} \mid s_t, a).$$

More generally, the k-step discounted probabilities can be expressed recursively as follows:

$$P_\gamma^{(k)}(s_{t+k}, \omega_{t+k} \mid s_t, \omega_t) = \sum_{s_{t+1}} \sum_{\omega_{t+1}} P_\gamma^{(1)}(s_{t+1}, \omega_{t+1} \mid s_t, \omega_t)\, P_\gamma^{(k-1)}(s_{t+k}, \omega_{t+k} \mid s_{t+1}, \omega_{t+1}),$$

$$P_\gamma^{(k)}(s_{t+k}, \omega_{t+k-1} \mid s_t, \omega_{t-1}) = \sum_{s_{t+1}} \sum_{\omega_t} P_\gamma^{(1)}(s_{t+1}, \omega_t \mid s_t, \omega_{t-1})\, P_\gamma^{(k-1)}(s_{t+k}, \omega_{t+k-1} \mid s_{t+1}, \omega_t).$$

Proof of the Intra-Option Policy Gradient Theorem

Taking the gradient of the option-value function:

$$\frac{\partial Q_\Omega(s, \omega)}{\partial \theta} = \frac{\partial}{\partial \theta} \sum_a \pi_{\omega,\theta}(a \mid s)\, Q_U(s, \omega, a) = \sum_a \left( \frac{\partial \pi_{\omega,\theta}(a \mid s)}{\partial \theta}\, Q_U(s, \omega, a) + \pi_{\omega,\theta}(a \mid s)\, \gamma \sum_{s'} P(s' \mid s, a)\, \frac{\partial U(\omega, s')}{\partial \theta} \right), \qquad (6)$$

$$\frac{\partial U(\omega, s')}{\partial \theta} = (1 - \beta_{\omega,\vartheta}(s'))\, \frac{\partial Q_\Omega(s', \omega)}{\partial \theta} + \beta_{\omega,\vartheta}(s') \sum_{\omega'} \pi_\Omega(\omega' \mid s')\, \frac{\partial Q_\Omega(s', \omega')}{\partial \theta} = \sum_{\omega'} \Big( (1 - \beta_{\omega,\vartheta}(s'))\, \mathbf{1}_{\omega'=\omega} + \beta_{\omega,\vartheta}(s')\, \pi_\Omega(\omega' \mid s') \Big) \frac{\partial Q_\Omega(s', \omega')}{\partial \theta}, \qquad (7)$$

where (7) follows from the assumption that θ only appears in the intra-option policies. Substituting (7) into (6) yields a recursion which, using the previous remarks about the augmented process, can be transformed into:

$$\frac{\partial Q_\Omega(s, \omega)}{\partial \theta} = \sum_a \frac{\partial \pi_{\omega,\theta}(a \mid s)}{\partial \theta}\, Q_U(s, \omega, a) + \sum_a \pi_{\omega,\theta}(a \mid s)\, \gamma \sum_{s'} P(s' \mid s, a) \sum_{\omega'} \Big( \beta_{\omega,\vartheta}(s')\, \pi_\Omega(\omega' \mid s') + (1 - \beta_{\omega,\vartheta}(s'))\, \mathbf{1}_{\omega'=\omega} \Big) \frac{\partial Q_\Omega(s', \omega')}{\partial \theta}$$
$$= \sum_a \frac{\partial \pi_{\omega,\theta}(a \mid s)}{\partial \theta}\, Q_U(s, \omega, a) + \sum_{s'} \sum_{\omega'} P_\gamma^{(1)}(s', \omega' \mid s, \omega)\, \frac{\partial Q_\Omega(s', \omega')}{\partial \theta} = \sum_{k=0}^{\infty} \sum_{s',\omega'} P_\gamma^{(k)}(s', \omega' \mid s, \omega) \sum_a \frac{\partial \pi_{\omega',\theta}(a \mid s')}{\partial \theta}\, Q_U(s', \omega', a).$$

The gradient of the expected discounted return with respect to θ is then:

$$\frac{\partial Q_\Omega(s_0, \omega_0)}{\partial \theta} = \sum_{s,\omega} \sum_{k=0}^{\infty} P_\gamma^{(k)}(s, \omega \mid s_0, \omega_0) \sum_a \frac{\partial \pi_{\omega,\theta}(a \mid s)}{\partial \theta}\, Q_U(s, \omega, a) = \sum_{s,\omega} \mu_\Omega(s, \omega \mid s_0, \omega_0) \sum_a \frac{\partial \pi_{\omega,\theta}(a \mid s)}{\partial \theta}\, Q_U(s, \omega, a).$$

Proof of the Termination Gradient Theorem

The expected sum of discounted rewards starting from (s_1, ω_0) is given by:

$$U(\omega_0, s_1) = \mathbb{E}\Big[ \sum_{t=1}^{\infty} \gamma^{t-1} r_t \,\Big|\, s_1, \omega_0 \Big].$$

We start by expanding U as follows:

$$U(\omega, s') = (1 - \beta_{\omega,\vartheta}(s'))\, Q_\Omega(s', \omega) + \beta_{\omega,\vartheta}(s')\, V_\Omega(s')$$
$$= (1 - \beta_{\omega,\vartheta}(s')) \sum_a \pi_{\omega,\theta}(a \mid s') \Big( r(s', a) + \gamma \sum_{s''} P(s'' \mid s', a)\, U(\omega, s'') \Big) + \beta_{\omega,\vartheta}(s') \sum_{\omega'} \pi_\Omega(\omega' \mid s') \sum_a \pi_{\omega',\theta}(a \mid s') \Big( r(s', a) + \gamma \sum_{s''} P(s'' \mid s', a)\, U(\omega', s'') \Big).$$

The gradient of U is then:

$$\frac{\partial U(\omega, s')}{\partial \vartheta} = \underbrace{\frac{\partial \beta_{\omega,\vartheta}(s')}{\partial \vartheta} \big( V_\Omega(s') - Q_\Omega(s', \omega) \big)}_{-\,\partial \beta_{\omega,\vartheta}(s')/\partial \vartheta \cdot A_\Omega(s',\omega)} + \sum_{\omega'} \Big( (1 - \beta_{\omega,\vartheta}(s'))\, \mathbf{1}_{\omega'=\omega} + \beta_{\omega,\vartheta}(s')\, \pi_\Omega(\omega' \mid s') \Big) \sum_a \pi_{\omega',\theta}(a \mid s')\, \gamma \sum_{s''} P(s'' \mid s', a)\, \frac{\partial U(\omega', s'')}{\partial \vartheta}.$$
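
The Intra-Option Policy Gradient Theorem can be checked numerically on a small random MDP: solve equations (1)-(3) exactly for Q_Ω as a function of θ, build μ_Ω from the augmented chain of equation (4), and compare the analytic gradient with finite differences. This verification script is an illustration written for this transcription, not part of the paper.

    import numpy as np

    rng = np.random.default_rng(0)
    nS, nA, nO, gamma = 3, 2, 2, 0.9
    P = rng.dirichlet(np.ones(nS), size=(nS, nA))           # P[s, a, s']
    r = rng.normal(size=(nS, nA))                            # r[s, a]
    beta = rng.uniform(0.2, 0.8, size=(nO, nS))              # fixed terminations
    pi_O = rng.dirichlet(np.ones(nO), size=nS)               # fixed policy over options
    theta = rng.normal(size=(nO, nS, nA))                    # softmax intra-option parameters

    def pi(th):
        e = np.exp(th - th.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def solve_Q_Omega(th):
        """Solve the linear fixed point implied by equations (1)-(3) for Q_Omega[s, w]."""
        p = pi(th)
        A = np.eye(nS * nO)
        b = np.zeros(nS * nO)
        for s in range(nS):
            for w in range(nO):
                i = s * nO + w
                b[i] = p[w, s] @ r[s]
                for a in range(nA):
                    for s2 in range(nS):
                        c = gamma * p[w, s, a] * P[s, a, s2]
                        A[i, s2 * nO + w] -= c * (1 - beta[w, s2])
                        for w2 in range(nO):
                            A[i, s2 * nO + w2] -= c * beta[w, s2] * pi_O[s2, w2]
        return np.linalg.solve(A, b).reshape(nS, nO)

    def analytic_grad(th, s0, w0):
        """Gradient from Theorem 1: sum_{s,w} mu(s,w|s0,w0) sum_a dpi/dtheta * Q_U."""
        p, Q = pi(th), solve_Q_Omega(th)
        V = (pi_O * Q).sum(axis=1)
        U = (1 - beta) * Q.T + beta * V[None, :]             # U[w, s']
        Q_U = r[:, None, :] + gamma * np.einsum('sap,wp->swa', P, U)
        Pg = np.zeros((nS * nO, nS * nO))                     # discounted augmented chain, eq. (4)
        for s in range(nS):
            for w in range(nO):
                for a in range(nA):
                    for s2 in range(nS):
                        c = gamma * p[w, s, a] * P[s, a, s2]
                        Pg[s * nO + w, s2 * nO + w] += c * (1 - beta[w, s2])
                        for w2 in range(nO):
                            Pg[s * nO + w, s2 * nO + w2] += c * beta[w, s2] * pi_O[s2, w2]
        mu = np.linalg.inv(np.eye(nS * nO) - Pg)[s0 * nO + w0].reshape(nS, nO)
        grad = np.zeros_like(th)
        for s in range(nS):
            for w in range(nO):
                for a in range(nA):
                    # softmax: sum_a' dpi(a'|s,w)/dtheta[w,s,a] * Q_U = pi(a)(Q_U(a) - Q_Omega)
                    grad[w, s, a] = mu[s, w] * p[w, s, a] * (Q_U[s, w, a] - p[w, s] @ Q_U[s, w])
        return grad

    s0, w0, eps = 0, 0, 1e-5
    g = analytic_grad(theta, s0, w0)
    num = np.zeros_like(theta)
    for idx in np.ndindex(theta.shape):
        d = np.zeros_like(theta)
        d[idx] = eps
        num[idx] = (solve_Q_Omega(theta + d)[s0, w0] - solve_Q_Omega(theta - d)[s0, w0]) / (2 * eps)
    print(np.max(np.abs(g - num)))                            # should be vanishingly small
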

Using the structure of the augmented process:

$$\frac{\partial U(\omega, s')}{\partial \vartheta} = -\frac{\partial \beta_{\omega,\vartheta}(s')}{\partial \vartheta}\, A_\Omega(s', \omega) + \sum_{\omega'} \sum_{s''} P_\gamma^{(1)}(s'', \omega' \mid s', \omega)\, \frac{\partial U(\omega', s'')}{\partial \vartheta} = -\sum_{k=0}^{\infty} \sum_{\omega', s''} P_\gamma^{(k)}(s'', \omega' \mid s', \omega)\, \frac{\partial \beta_{\omega',\vartheta}(s'')}{\partial \vartheta}\, A_\Omega(s'', \omega').$$

We finally obtain:

$$\frac{\partial U(\omega_0, s_1)}{\partial \vartheta} = -\sum_{\omega,s'} \sum_{k=0}^{\infty} P_\gamma^{(k)}(s', \omega \mid s_1, \omega_0)\, \frac{\partial \beta_{\omega,\vartheta}(s')}{\partial \vartheta}\, A_\Omega(s', \omega) = -\sum_{\omega,s'} \mu_\Omega(s', \omega \mid s_1, \omega_0)\, \frac{\partial \beta_{\omega,\vartheta}(s')}{\partial \vartheta}\, A_\Omega(s', \omega).$$

References

Baird, L. C. 1993. Advantage updating. Technical Report, Wright Laboratory.
Bellemare, M. G.; Naddaf, Y.; Veness, J.; and Bowling, M. 2013. The Arcade Learning Environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research 47.
Comanici, G., and Precup, D. 2010. Optimal policy switching algorithms for reinforcement learning. In AAMAS.
Şimşek, Ö., and Barto, A. G. 2009. Skill characterization based on betweenness. In NIPS 21.
Daniel, C.; van Hoof, H.; Peters, J.; and Neumann, G. 2016. Probabilistic inference for determining options in reinforcement learning. Machine Learning, Special Issue 104(2).
Harb, J. 2016. Learning options in deep reinforcement learning. Master's thesis, McGill University.
Konda, V. R., and Tsitsiklis, J. N. 2000. Actor-critic algorithms. In NIPS 12.
Konidaris, G., and Barto, A. 2009. Skill discovery in continuous reinforcement learning domains using skill chaining. In NIPS 22.
Konidaris, G.; Kuindersma, S.; Grupen, R. A.; and Barto, A. G. 2011. Autonomous skill acquisition on a mobile manipulator. In AAAI.
Krishnamurthy, R.; Lakshminarayanan, A. S.; Kumar, P.; and Ravindran, B. 2016. Hierarchical reinforcement learning using spatio-temporal abstractions and deep neural networks. CoRR.
Kulkarni, T.; Narasimhan, K.; Saeedi, A.; and Tenenbaum, J. 2016. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In NIPS 29.
Levy, K. Y., and Shimkin, N. 2011. Unified inter and intra options learning using policy gradient methods. In EWRL.
Mankowitz, D. J.; Mann, T. A.; and Mannor, S. 2016. Adaptive skills, adaptive partitions (ASAP). In NIPS 29.
Mann, T. A.; Mankowitz, D. J.; and Mannor, S. 2014. Time-regularized interrupting options (TRIO). In ICML.
Mann, T. A.; Mannor, S.; and Precup, D. 2015. Approximate value iteration with temporally extended actions. Journal of Artificial Intelligence Research 53.
McGovern, A., and Barto, A. G. 2001. Automatic discovery of subgoals in reinforcement learning using diverse density. In ICML.
Menache, I.; Mannor, S.; and Shimkin, N. 2002. Q-Cut - dynamic discovery of sub-goals in reinforcement learning. In ECML.
Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; and Riedmiller, M. A. 2013. Playing Atari with deep reinforcement learning. CoRR.
Mnih, V.; Badia, A. P.; Mirza, M.; Graves, A.; Lillicrap, T. P.; Harley, T.; Silver, D.; and Kavukcuoglu, K. 2016. Asynchronous methods for deep reinforcement learning. In ICML.
Niekum, S. 2013. Semantically Grounded Learning from Unstructured Demonstrations. Ph.D. Dissertation, University of Massachusetts, Amherst.
Precup, D. 2000. Temporal abstraction in reinforcement learning. Ph.D. Dissertation, University of Massachusetts, Amherst.
Puterman, M. L. 1994. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc.
Silver, D., and Ciosek, K. 2012. Compositional planning using optimal option models. In ICML.
Sorg, J., and Singh, S. P. 2010. Linear options. In AAMAS.
Stolle, M., and Precup, D. 2002. Learning options in reinforcement learning. In Abstraction, Reformulation and Approximation, 5th International Symposium, SARA, Proceedings.
Sutton, R. S.; McAllester, D. A.; Singh, S. P.; and Mansour, Y. 2000. Policy gradient methods for reinforcement learning with function approximation. In NIPS 12.
Sutton, R. S.; Precup, D.; and Singh, S. P. 1999. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112(1-2).
Sutton, R. S. 1984. Temporal Credit Assignment in Reinforcement Learning. Ph.D. Dissertation, University of Massachusetts, Amherst.
Thomas, P. 2014. Bias in natural actor-critic algorithms. In ICML.
Vezhnevets, A. S.; Mnih, V.; Agapiou, J.; Osindero, S.; Graves, A.; Vinyals, O.; and Kavukcuoglu, K. 2016. Strategic attentive writer for learning macro-actions. In NIPS 29.


More information

Outline. CSE 326: Data Structures. Priority Queues Leftist Heaps & Skew Heaps. Announcements. New Heap Operation: Merge

Outline. CSE 326: Data Structures. Priority Queues Leftist Heaps & Skew Heaps. Announcements. New Heap Operation: Merge CSE 26: Dt Structures Priority Queues Leftist Heps & Skew Heps Outline Announcements Leftist Heps & Skew Heps Reding: Weiss, Ch. 6 Hl Perkins Spring 2 Lectures 6 & 4//2 4//2 2 Announcements Written HW

More information

Characterizing Higher-Order Ross More Risk Aversion by Comparison of Risk Compensation

Characterizing Higher-Order Ross More Risk Aversion by Comparison of Risk Compensation Chrcterizing Higher-Order Ross More Risk Aversion by Comprison of Risk Compenstion Guoqing Tin Yougong Tin b,c Deprtment of Economics, Texs A&M University, College Sttion, TX77843, USA b School of Economics,

More information

Chapter 4. Profit and Bayesian Optimality

Chapter 4. Profit and Bayesian Optimality Chpter 4 Profit nd Byesin Optimlity In this chpter we consider the objective of profit. The objective of profit mximiztion dds significnt new chllenge over the previously considered objective of socil

More information

NORTH YORKSHIRE PENSION FUND GOVERNANCE COMPLIANCE STATEMENT

NORTH YORKSHIRE PENSION FUND GOVERNANCE COMPLIANCE STATEMENT NORTH YORKSHIRE PENSION FUND GOVERNANCE COMPLIANCE STATEMENT TABLE OF CONTENTS Section Pge 1 INTRODUCTION 2 2 GOVERNANCE ARRANGEMENTS 2 3 REPRESENTATION AND MEETINGS 4 4 OPERATIONAL PROCEDRES 5 5 KEY POLICY

More information

MATH 236 ELAC MATH DEPARTMENT FALL 2017 SHORT ANSWER. Write the word or phrase that best completes each statement or answers the question.

MATH 236 ELAC MATH DEPARTMENT FALL 2017 SHORT ANSWER. Write the word or phrase that best completes each statement or answers the question. MATH 236 ELAC MATH DEPARTMENT FALL 2017 TEST 1 REVIEW SHORT ANSWER. Write the word or phrse tht best completes ech sttement or nswers the question. 1) The supply nd demnd equtions for certin product re

More information

Measuring Search Trees

Measuring Search Trees Mesuring Serch Trees Christin Bessiere 1, Bruno Znuttini 2, nd Cèsr Fernández 3 1 LIRMM-CNRS, Montpellier, Frnce 2 GREYC, Cen, Frnce 3 Univ. de Lleid, Lleid, Spin Astrct. The SAT nd CSP communities mke

More information

Effects of Entry Restriction on Free Entry General Competitive Equilibrium. Mitsuo Takase

Effects of Entry Restriction on Free Entry General Competitive Equilibrium. Mitsuo Takase CAES Working Pper Series Effects of Entry Restriction on Free Entry Generl Competitive Euilirium Mitsuo Tkse Fculty of Economics Fukuok University WP-2018-006 Center for Advnced Economic Study Fukuok University

More information

Market uncertainty, macroeconomic expectations and the European sovereign bond spreads.

Market uncertainty, macroeconomic expectations and the European sovereign bond spreads. Mrket uncertinty, mcroeconomic expecttions nd the Europen sovereign bond spreds. Dimitris A. Georgoutsos Athens University of Economics & Business, Deprtment of Accounting & Finnce 76, Ptission str., 434,

More information

Rational Equity Bubbles

Rational Equity Bubbles ANNALS OF ECONOMICS AND FINANCE 14-2(A), 513 529 (2013) Rtionl Equity Bubbles Ge Zhou * College of Economics, Zhejing University Acdemy of Finncil Reserch, Zhejing University E-mil: flhszh@gmil.com This

More information

A comparison of quadratic discriminant function with discriminant function based on absolute deviation from the mean

A comparison of quadratic discriminant function with discriminant function based on absolute deviation from the mean A comprison of qudrtic discriminnt function with discriminnt function bsed on bsolute devition from the men S. Gneslingm 1, A. Nnthkumr Siv Gnesh 1, 1 Institute of Informtion Sciences nd Technology College

More information

MARKET POWER AND MISREPRESENTATION

MARKET POWER AND MISREPRESENTATION MARKET POWER AND MISREPRESENTATION MICROECONOMICS Principles nd Anlysis Frnk Cowell Note: the detil in slides mrked * cn only e seen if you run the slideshow July 2017 1 Introduction Presenttion concerns

More information

)''/?\Xck_

)''/?\Xck_ bcbsnc.com Deductible options: $250, $500, $1,000 or $2,500 Deductible options $500, $1,000, $2,500, $3,500 or $5,000 D or (100% coinsurnce is not vilble on the $2,500 deductible option) coinsurnce plns:

More information

Inequality and the GB2 income distribution

Inequality and the GB2 income distribution Working Pper Series Inequlity nd the GB2 income distribution Stephen P. Jenkins ECINEQ WP 2007 73 ECINEC 2007-73 July 2007 www.ecineq.org Inequlity nd the GB2 income distribution Stephen P. Jenkins* University

More information

Technical Report Global Leader Dry Bulk Derivatives. FIS Technical - Grains And Ferts. Highlights:

Technical Report Global Leader Dry Bulk Derivatives. FIS Technical - Grains And Ferts. Highlights: Technicl Report Technicl Anlyst FIS Technicl - Grins And Ferts Edwrd Hutn 44 20 7090 1120 Edwrdh@freightinvesr.com Highlights: SOY The weekly chrt is chowing lower high suggesting wekness going forwrd,

More information

Math F412: Homework 4 Solutions February 20, κ I = s α κ α

Math F412: Homework 4 Solutions February 20, κ I = s α κ α All prts of this homework to be completed in Mple should be done in single worksheet. You cn submit either the worksheet by emil or printout of it with your homework. 1. Opre 1.4.1 Let α be not-necessrily

More information

Controlling a population of identical MDP

Controlling a population of identical MDP Controlling popultion of identicl MDP Nthlie Bertrnd Inri Rennes ongoing work with Miheer Dewskr (CMI), Blise Genest (IRISA) nd Hugo Gimert (LBRI) Trends nd Chllenges in Quntittive Verifiction Mysore,

More information

The Option-Critic Architecture

The Option-Critic Architecture The Option-Critic Architecture Pierre-Luc Bacon, Jean Harb, Doina Precup Reasoning and Learning Lab McGill University, Montreal, Canada AAAI 2017 Intelligence: the ability to generalize and adapt efficiently

More information

Voluntary provision of threshold public goods with continuous contributions: experimental evidence

Voluntary provision of threshold public goods with continuous contributions: experimental evidence Journl of Public Economics 71 (1999) 53 73 Voluntry provision of threshold public goods with continuous contributions: experimentl evidence Chrles Brm Cdsby *, Elizbeth Mynes, b Deprtment of Economics,

More information

OPEN BUDGET QUESTIONNAIRE SOUTH AFRICA

OPEN BUDGET QUESTIONNAIRE SOUTH AFRICA Interntionl Budget Prtnership OPEN BUDGET QUESTIONNAIRE SOUTH AFRICA September 28, 2007 Interntionl Budget Prtnership Center on Budget nd Policy Priorities 820 First Street, NE Suite 510 Wshington, DC

More information

Problem Set 2 Suggested Solutions

Problem Set 2 Suggested Solutions 4.472 Prolem Set 2 Suggested Solutions Reecc Zrutskie Question : First find the chnge in the cpitl stock, k, tht will occur when the OLG economy moves to the new stedy stte fter the government imposes

More information

Rates of Return of the German PAYG System - How they can be measured and how they will develop

Rates of Return of the German PAYG System - How they can be measured and how they will develop Rtes of Return of the Germn PAYG System - How they cn be mesured nd how they will develop Christin Benit Wilke 97-2005 me Mnnheimer Forschungsinstitut Ökonomie und Demogrphischer Wndel Gebäude L 13, 17_D-68131

More information

Information Acquisition and Disclosure: the Case of Differentiated Goods Duopoly

Information Acquisition and Disclosure: the Case of Differentiated Goods Duopoly Informtion Acquisition nd Disclosure: the Cse of Differentited Goods Duopoly Snxi Li Jinye Yn Xundong Yin We thnk Dvid Mrtimort, Thoms Mriotti, Ptrick Rey, Wilfried Snd-Zntmn, Frnces Xu nd Yongsheng Xu

More information

POLICY BRIEF 11 POTENTIAL FINANCING OPTIONS FOR LARGE CITIES

POLICY BRIEF 11 POTENTIAL FINANCING OPTIONS FOR LARGE CITIES POTENTIAL FINANCING OPTIONS FOR LARGE CITIES EXECUTIVE SUMMARY In South Afric lrge cities fce myrid of chllenges including rpid urbnistion, poverty, inequlity, unemployment nd huge infrstructure needs.

More information

First version: September 1997 This version: October On the Relevance of Modeling Volatility for Pricing Purposes

First version: September 1997 This version: October On the Relevance of Modeling Volatility for Pricing Purposes First version: September 1997 This version: October 1999 On the Relevnce of Modeling Voltility for Pricing Purposes Abstrct: Mnuel Moreno 3 Deprtment of Economics nd Business Universitt Pompeu Fbr Crrer

More information

Static Fully Observable Stochastic What action next? Instantaneous Perfect

Static Fully Observable Stochastic What action next?  Instantaneous Perfect CS 188: Ar)ficil Intelligence Mrkov Deciion Procee K+1 Intructor: Dn Klein nd Pieter Abbeel - - - Univerity of Cliforni, Berkeley [Thee lide were creted by Dn Klein nd Pieter Abbeel for CS188 Intro to

More information

The Combinatorial Seller s Bid Double Auction: An Asymptotically Efficient Market Mechanism*

The Combinatorial Seller s Bid Double Auction: An Asymptotically Efficient Market Mechanism* The Combintoril Seller s Bid Double Auction: An Asymptoticlly Efficient Mret Mechnism* Rhul Jin IBM Wtson Reserch Hwthorne, NY rhul.jin@us.ibm.com Prvin Vriy EECS Deprtment University of Cliforni, Bereley

More information

Smart Investment Strategies

Smart Investment Strategies Smrt Investment Strtegies Risk-Rewrd Rewrd Strtegy Quntifying Greed How to mke good Portfolio? Entrnce-Exit Exit Strtegy: When to buy? When to sell? 2 Risk vs.. Rewrd Strtegy here is certin mount of risk

More information

OPEN BUDGET QUESTIONNAIRE MACEDONIA

OPEN BUDGET QUESTIONNAIRE MACEDONIA Interntionl Budget Prtnership OPEN BUDGET QUESTIONNAIRE MACEDONIA September 28, 2007 Interntionl Budget Prtnership Center on Budget nd Policy Priorities 820 First Street, NE Suite 510 Wshington, DC 20002

More information

Asset finance (US) Opportunity. Flexibility. Planning. Develop your capabilities using the latest equipment

Asset finance (US) Opportunity. Flexibility. Planning. Develop your capabilities using the latest equipment Asset finnce (US) Opportunity Develop your cpbilities using the ltest equipment Flexibility Mnge your cshflow nd ccess the technology you need Plnning Mnge your investment with predictble costs nd plnned

More information

ACHIEVING ALARP WITH SAFETY INSTRUMENTED SYSTEMS

ACHIEVING ALARP WITH SAFETY INSTRUMENTED SYSTEMS ACHIEVING ALARP WITH SAFETY INSTRUMENTED SYSTEMS C.R. Timms, MIEE, United Kingdom, Tel: + 44 (0) 9 88668, Emil: c.timms@ifb.co.uk Keywords: ALARP, hzrds, risk, sfety, SIS. Abstrct This pper sets out methodology

More information

Notes on the BENCHOP implementations for the COS method

Notes on the BENCHOP implementations for the COS method Notes on the BENCHOP implementtions for the COS method M. J. uijter C. W. Oosterlee Mrch 29, 2015 Abstrct This text describes the COS method nd its implementtion for the BENCHOP-project. 1 Fourier cosine

More information

MIXED OLIGOPOLIES AND THE PROVISION OF DURABLE GOODS. Baranovskyi Volodymyr. MA in Economic Analysis. Kyiv School of Economics

MIXED OLIGOPOLIES AND THE PROVISION OF DURABLE GOODS. Baranovskyi Volodymyr. MA in Economic Analysis. Kyiv School of Economics MIXED OLIGOPOLIES AND THE PROVISION OF DURABLE GOODS by Brnovskyi Volodymyr A thesis submitted in prtil fulfillment of the requirements for the degree of MA in Economic Anlysis Kyiv School of Economics

More information

Interacting with mathematics in Key Stage 3. Year 9 proportional reasoning: mini-pack

Interacting with mathematics in Key Stage 3. Year 9 proportional reasoning: mini-pack Intercting with mthemtics in Key Stge Yer 9 proportionl resoning: mini-pck Intercting with mthemtics Yer 9 proportionl resoning: mini-pck Crown copyright 00 Contents Yer 9 proportionl resoning: smple unit

More information

OPEN BUDGET QUESTIONNAIRE FRANCE

OPEN BUDGET QUESTIONNAIRE FRANCE Interntionl Budget Project OPEN BUDGET QUESTIONNAIRE FRANCE October 2005 Interntionl Budget Project Center on Budget nd Policy Priorities 820 First Street, NE Suite 510 Wshington, DC 20002 www.interntionlbudget.org

More information

PSAKUIJIR Vol. 4 No. 2 (July-December 2015)

PSAKUIJIR Vol. 4 No. 2 (July-December 2015) Resonble Concession Period for Build Operte Trnsfer Contrct Projects: A Cse Study of Theun-Hiboun Hydropower Dm Project nd Ntionl Rod No. 14 A Project Pnysith Vorsing * nd Dr.Sounthone Phommsone ** Abstrct

More information

This paper is not to be removed from the Examination Halls UNIVERSITY OF LONDON

This paper is not to be removed from the Examination Halls UNIVERSITY OF LONDON ~~FN3092 ZA 0 his pper is not to be remove from the Exmintion Hlls UNIESIY OF LONDON FN3092 ZA BSc egrees n Diploms for Grutes in Economics, Mngement, Finnce n the Socil Sciences, the Diploms in Economics

More information

Announcements. Maximizing Expected Utility. Preferences. Rational Preferences. Rational Preferences. Introduction to Artificial Intelligence

Announcements. Maximizing Expected Utility. Preferences. Rational Preferences. Rational Preferences. Introduction to Artificial Intelligence Introduction to Artificil Intelligence V22.0472-001 Fll 2009 Lecture 8: Utilitie Announcement Will hve Aignment 1 grded by Wed. Aignment 2 i up on webpge Due on Mon 19 th October (2 week) Rob Fergu Dept

More information