Multi-Step Reinforcement Learning: A Unifying Algorithm


Kristopher De Asis,1 J. Fernando Hernandez-Garcia,1 G. Zacharias Holland,1 Richard S. Sutton
Reinforcement Learning and Artificial Intelligence Laboratory, University of Alberta
{kldeasis,jfhernan,gholland,rsutton}@ualberta.ca

Abstract

Unifying seemingly disparate algorithmic ideas to produce better performing algorithms has been a longstanding goal in reinforcement learning. As a primary example, TD(λ) elegantly unifies one-step TD prediction with Monte Carlo methods through the use of eligibility traces and the trace-decay parameter λ. Currently, there are a multitude of algorithms that can be used to perform TD control, including Sarsa, Q-learning, and Expected Sarsa. These methods are often studied in the one-step case, but they can be extended across multiple time steps to achieve better performance. Each of these algorithms is seemingly distinct, and no one dominates the others for all problems. In this paper, we study a new multi-step action-value algorithm called Q(σ) that unifies and generalizes these existing algorithms, while subsuming them as special cases. A new parameter, σ, is introduced to allow the degree of sampling performed by the algorithm at each step during its backup to be continuously varied, with Sarsa existing at one extreme (full sampling), and Expected Sarsa existing at the other (pure expectation). Q(σ) is generally applicable to both on- and off-policy learning, but in this work we focus on experiments in the on-policy case. Our results show that an intermediate value of σ, which results in a mixture of the existing algorithms, performs better than either extreme. The mixture can also be varied dynamically, which can result in even greater performance.

The Landscape of TD Algorithms

Temporal-difference (TD) methods (Sutton and Barto 1998) are an important concept in reinforcement learning (RL) that combines ideas from Monte Carlo and dynamic programming methods. TD methods allow learning to occur directly from raw experience in the absence of a model of the environment's dynamics, like with Monte Carlo methods, while also allowing estimates to be updated based on other learned estimates without waiting for a final result, like with dynamic programming. The core concepts of TD methods provide a flexible framework for creating a variety of powerful algorithms that can be used for both prediction and control.

There are a number of TD control methods that have been proposed. Q-learning (Watkins 1989; Watkins and Dayan 1992) is arguably the most popular, and is considered an off-policy method because the policy generating the behaviour (the behaviour policy) and the policy that is being learned (the target policy) are different. Sarsa (Rummery and Niranjan 1994; Sutton 1996) is the classical on-policy control method, where the behaviour and target policies are the same. However, Sarsa can be extended to learn off-policy with the use of importance sampling (Precup, Sutton, and Singh 2000). Expected Sarsa is an extension of Sarsa that, instead of using the action-value of the next state to update the value of the current state, uses the expectation of all the subsequent action-values of the current state with respect to the target policy. Expected Sarsa has been studied as a strictly on-policy method (van Seijen et al. 2009), but in this paper we present a more general version that can be used for both on- and off-policy learning and that also subsumes Q-learning.

All of these methods are often described in the simple one-step case, but they can also be extended across multiple time steps. The TD(λ) algorithm unifies one-step TD learning with Monte Carlo methods (Sutton 1988). Through the use of eligibility traces and the trace-decay parameter λ ∈ [0, 1], a spectrum of algorithms is created. At one end, λ = 1, exist Monte Carlo methods, and at the other, λ = 0, exists one-step TD learning. In the middle of the spectrum are intermediate methods which can perform better than the methods at either extreme (Sutton and Barto 1998). The concept of eligibility traces can also be applied to TD control methods such as Sarsa and Q-learning, which can create more efficient learning and produce better performance (Rummery 1995).

Multi-step TD methods are usually thought of in terms of an average of many multi-step returns of differing lengths and are often associated with eligibility traces, as is the case with TD(λ). However, it is also natural to think of them in terms of individual n-step returns with their associated n-step backups (Sutton and Barto 1998). We refer to each of these individual backups as atomic backups, whereas the combination of several atomic backups of different lengths creates a compound backup.

In the existing literature, it is not clear how best to extend one-step Expected Sarsa to a multi-step algorithm. The Tree-backup algorithm was originally presented as a method to perform off-policy evaluation when the behaviour policy is non-Markov, non-stationary, or completely unknown (Precup, Sutton, and Singh 2000). In this paper, we re-present Tree-backup as a natural multi-step extension of Expected Sarsa.

1 Authors contributed equally, and are listed alphabetically.

Instead of performing the updates with entirely sampled transitions as with multi-step Sarsa, Tree-backup performs the update using the expected values of all the actions at each transition.

Q(σ) is an algorithm that was first proposed by Sutton and Barto (2018) which unifies and generalizes the existing multi-step TD control methods. The degree of sampling performed by the algorithm is controlled by the sampling parameter, σ. At one extreme (σ = 1) exists Sarsa (full sampling), and at the other (σ = 0) exists Tree-backup (pure expectation). Intermediate values of σ create algorithms with a mixture of sampling and expectation, and σ can be interpreted as a way to control the bias-variance trade-off inherent in multi-step TD algorithms. In this work, on problems with a tabular representation and a problem requiring function approximation, we show that an intermediate value of σ can outperform the algorithms that exist at either extreme. In addition, we show that σ can be varied dynamically to produce even greater performance. We limit our discussion of Q(σ) to the atomic multi-step case without eligibility traces, but a natural extension is to make use of compound backups, and this is an avenue for future research. Furthermore, Q(σ) is generally applicable to both on- and off-policy learning, but for our initial empirical study we examined only on-policy prediction and control problems.

MDPs and One-step Solution Methods

The sequential decision problem encountered in RL is often modeled as a Markov decision process (MDP). Under this framework, an agent and the environment interact over a sequence of discrete time steps t. At every time step, the agent receives information about the environment's state, S_t ∈ S, where S is the set of all possible states. The agent uses this information to select an action, A_t, from the set of all possible actions A. Based on the behavior of the agent and the state of the environment, the agent receives a reward, R_{t+1} ∈ R, and moves to another state, S_{t+1} ∈ S, with state-transition probability p(s'|s, a) = P(S_{t+1} = s' | S_t = s, A_t = a), for a ∈ A and s, s' ∈ S. The agent behaves according to a policy π(a|s), which is a probability distribution over the actions a ∈ A for each state s ∈ S. Through the process of policy iteration (Sutton and Barto 1998), the agent learns the optimal policy, π*, that maximizes the expected discounted return:

G_t = R_{t+1} + γR_{t+2} + γ^2 R_{t+3} + ... = Σ_{k=0}^{T−t−1} γ^k R_{t+1+k},   (1)

for a discount factor γ ∈ [0, 1) and T = ∞ in continuing tasks, or γ ∈ [0, 1] and T equal to the final time step in episodic tasks.

TD algorithms strive to maximize the expected return by computing value functions that estimate the expected future rewards in terms of the elements of the environment and the actions of the agent. The state-value function is the expected return when the agent is in state s and follows policy π, defined as v_π(s) = E_π[G_t | S_t = s]. For control, most of the time we focus on estimating the action-value function, which is the expected return when the agent takes an action a in state s while following policy π, and is defined as:

q_π(s, a) = E_π[G_t | S_t = s, A_t = a].   (2)

Equation 2 can be estimated iteratively by observing new rewards, bootstrapping on old estimates of q_π, and using the update rule:

Q(S_t, A_t) ← Q(S_t, A_t) + α[R_{t+1} + γQ(S_{t+1}, A_{t+1}) − Q(S_t, A_t)],   (3)

where α ∈ (0, 1] is the step size parameter. Update rules are also known as backup operations because they transfer information back from future states to the current one. A common way to visualize backup operations is by using backup diagrams such as the ones depicted in Figure 1.
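To make the update rule in (3) concrete, the following minimal sketch applies a one-step Sarsa backup to a tabular action-value array. The array sizes, step size, and discount factor below are placeholder assumptions for illustration, not values from the paper.

```python
import numpy as np

# Placeholder problem sizes and parameters (assumptions for illustration only).
num_states, num_actions = 10, 4
alpha, gamma = 0.1, 0.99
Q = np.zeros((num_states, num_actions))

def sarsa_update(Q, s, a, r, s_next, a_next):
    """One-step Sarsa backup as in Equation (3): bootstrap on the value of
    the action actually taken in the next state."""
    td_error = r + gamma * Q[s_next, a_next] - Q[s, a]   # the term in brackets
    Q[s, a] += alpha * td_error
    return Q
```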
For clarity, the algorithmic ideas in this paper are presented initially as tabular solution methods, but we also extend them to use function approximation, and thus they also serve as approximate solution methods.

The term in brackets in (3),

δ_t^S = R_{t+1} + γQ(S_{t+1}, A_{t+1}) − Q(S_t, A_t),   (4)

is also known as the TD error, denoted δ_t. TD control methods are characterized by their TD error; for example, the TD error in (4) corresponds to the classic on-policy method known as Sarsa.

Because learning requires a certain amount of exploration, behaving greedily with respect to the estimated optimal policy is often infeasible. Therefore, agents are often trained under ε-greedy policies, for which the agent only chooses the optimal action with probability (1 − ε) and behaves randomly with probability ε, for ε ∈ [0, 1]. Nevertheless, learning the optimal policy is possible if it is done off-policy. When the agent is learning off-policy, it behaves according to a behavior policy, µ, while learning a target policy, π. This can be achieved by using another TD control method, Expected Sarsa. In contrast with Sarsa, Expected Sarsa behaves according to the behavior policy, but updates its estimate by taking an expectation of Q(S_{t+1}, ·) over the actions at time t+1, according to the target policy (van Seijen et al. 2009). For convenience, let the expected action-value be defined as:

V_{t+1} = Σ_a π(a|S_{t+1}) Q(S_{t+1}, a).   (5)

Then, the TD error of Expected Sarsa can be written as:

δ_t^{ES} = R_{t+1} + γV_{t+1} − Q(S_t, A_t).   (6)

A special case of Expected Sarsa is Q-learning, where the estimate is updated according to the maximum of Q(S_{t+1}, ·) over the actions (Watkins 1989):

δ_t^{QL} = R_{t+1} + γ max_a Q(S_{t+1}, a) − Q(S_t, A_t).   (7)

Q-learning is the resulting algorithm when the target policy of Expected Sarsa is the greedy policy with respect to Q.
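The two TD errors above differ only in how the next state is evaluated. A minimal sketch, assuming a tabular Q and a target policy stored as a matrix pi of action probabilities (both hypothetical names, not from the paper):

```python
import numpy as np

gamma = 0.99  # assumed discount factor

def expected_sarsa_td_error(Q, pi, s, a, r, s_next):
    """Expected Sarsa TD error, Equation (6): bootstrap on the expectation of
    the next action-values under the target policy (Equation (5))."""
    v_next = np.dot(pi[s_next], Q[s_next])
    return r + gamma * v_next - Q[s, a]

def q_learning_td_error(Q, s, a, r, s_next):
    """Q-learning TD error, Equation (7): the special case in which the
    target policy is greedy with respect to Q."""
    return r + gamma * np.max(Q[s_next]) - Q[s, a]
```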

Figure 1: Backup diagrams for atomic 4-step Sarsa, Expected Sarsa, Tree-backup, and Q(σ). Here we can see that Q(σ) encompasses the other three algorithms based on the setting of σ.

Atomic Multi-Step Algorithms

The TD methods presented in the previous section can be generalized even further by bootstrapping over longer time intervals. This has been shown to decrease the bias of the update at the cost of increasing the variance (Jaakkola, Jordan, and Singh 1994). Nevertheless, in many cases it is possible to achieve better performance by choosing a value for the backup length parameter, n, greater than one (Sutton and Barto 1998). We refer to algorithms which make use of a multi-step atomic backup as atomic multi-step algorithms. Just as one-step methods are defined by their TD error, each atomic multi-step algorithm is characterized by its n-step return. For atomic multi-step Sarsa, the n-step return is:

G_{t:t+n} = R_{t+1} + γR_{t+2} + γ^2 R_{t+3} + ... + γ^{n−1} R_{t+n} + γ^n Q_{t+n−1}(S_{t+n}, A_{t+n})
          = Σ_{k=0}^{n−1} γ^k R_{t+k+1} + γ^n Q_{t+n−1}(S_{t+n}, A_{t+n}),   (8)

where Q_{t+n−1} is the estimate of q_π at time t + n − 1, and the subscript range, t : t + n, denotes the length of the backup. n-step Sarsa can be adapted for off-policy learning by introducing an importance sampling ratio term (Precup, Sutton, and Singh 2000):

ρ_t^{t+n} = Π_{k=t}^{τ} π(A_k|S_k) / µ(A_k|S_k),   (9)

and multiplying it with the TD error to get the following update rule:

Q_{t+n}(S_t, A_t) ← Q_{t+n−1}(S_t, A_t) + α ρ_{t+1}^{t+n} [G_{t:t+n} − Q_{t+n−1}(S_t, A_t)],   (10)

where τ = min(t + n − 1, T − 1) is the time step before the end of the update or before the end of the episode. In the update, the action-values for all other state-action pairs remain the same, i.e. Q_{t+n}(s, a) = Q_{t+n−1}(s, a) for all s ≠ S_t and a ≠ A_t. This update rule is not only applicable for off-policy n-step Sarsa, but is a generally useful form for other atomic multi-step algorithms as well. We present the algorithms in this work as general off-policy solution methods, but in the experiments section we evaluate them empirically on-policy only, which provides useful insight into their behaviour. We defer the empirical study and comparison of the algorithms in an off-policy setting to future work.

Expected Sarsa can also be generalized to a multi-step method by using the return:

G_{t:t+n} = R_{t+1} + γR_{t+2} + γ^2 R_{t+3} + ... + γ^{n−1} R_{t+n} + γ^n V_{t+n}.   (11)

The first n − 1 states and actions are sampled according to the behaviour policy, as with n-step Sarsa, but the last state is backed up according to the expected action-value under the target policy. To make n-step Expected Sarsa entirely off-policy, an importance sampling ratio term can also be introduced, but it needs to omit the last time step. The resulting update would be the same as in (10), but would use ρ_{t+1}^{t+n−1} and the n-step return for n-step Expected Sarsa from (11).

A drawback to using importance sampling to learn off-policy is that it can create high variance, which must be compensated for by using small step sizes; this can slow learning (Precup, Sutton, and Singh 2000). In the next section we present a method that is also a generalization of Expected Sarsa, but that can learn off-policy without importance sampling.
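A small sketch of the two quantities defined above, the n-step Sarsa return (8) and the importance sampling ratio (9), assuming the episode does not terminate within the n steps; the argument names are illustrative only:

```python
import numpy as np

def n_step_sarsa_return(rewards, bootstrap_q, gamma):
    """n-step Sarsa return, Equation (8). `rewards` holds R_{t+1}, ..., R_{t+n}
    and `bootstrap_q` is Q_{t+n-1}(S_{t+n}, A_{t+n})."""
    n = len(rewards)
    g = sum(gamma ** k * rewards[k] for k in range(n))
    return g + gamma ** n * bootstrap_q

def importance_sampling_ratio(pi_probs, mu_probs):
    """Product of per-step ratios pi(A_k|S_k) / mu(A_k|S_k), as in Equation (9),
    over the actions included in the correction."""
    return float(np.prod(np.asarray(pi_probs) / np.asarray(mu_probs)))
```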
Tree-backup

As shown in (11), the TD return of n-step Expected Sarsa is calculated by taking an expectation over the actions at the last step of the backup. However, it is possible to extend this idea to every time step of the backup by taking an expectation at every step (Precup, Sutton, and Singh 2000). The resulting algorithm is a multi-step generalization of Expected Sarsa that is known as Tree-backup because of its characteristic backup diagram (Figure 1).

Moreover, just like Expected Sarsa and Q-learning, this proposed generalization does not require importance sampling to be applied off-policy. Hence, it could be argued that it is a more appropriate generalization of Expected Sarsa to multi-step learning (Sutton and Barto 2018). Because Expected Sarsa subsumes Q-learning, Tree-backup can also be thought of as a multi-step generalization of Q-learning if the target policy is greedy with respect to the action-value function.

Tree-backup has several advantages over n-step Expected Sarsa. Tree-backup has the capacity for learning off-policy without the need for importance sampling, reducing the variance due to the importance sampling ratios. Additionally, because an importance sampling ratio does not need to be computed, the behavior policy does not need to be stationary, Markov, or even known (Precup, Sutton, and Singh 2000).

Each branch of the tree represents an action, while the main branch represents the action taken at time t. The value of each of the branches is the current action-value estimate of the corresponding action, whereas the value of each segment of the main branch is the reward at the corresponding time step.
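The weighting of branches described above can also be computed recursively: at each interior step, the actions not taken contribute their expected value under the target policy, and the action actually taken is followed one step deeper into the tree. A minimal sketch of this recursion (equivalent in spirit to the return given in the next section), assuming a tabular Q and a target-policy probability matrix pi, both hypothetical names:

```python
import numpy as np

def tree_backup_return(rewards, states, actions, Q, pi, gamma):
    """Recursive computation of the n-step Tree-backup return.
    rewards[k] = R_{t+k+1}, states[k] = S_{t+k+1}, actions[k] = A_{t+k+1};
    only the first n-1 actions are used, and a full expectation over all
    actions is taken at the final state of the backup."""
    # Base case: full expectation under the target policy at the last state.
    g = rewards[-1] + gamma * np.dot(pi[states[-1]], Q[states[-1]])
    # Interior steps, processed backwards: untaken actions contribute their
    # expected value, the taken action carries the deeper return.
    for k in range(len(rewards) - 2, -1, -1):
        s, a = states[k], actions[k]
        expected_untaken = np.dot(pi[s], Q[s]) - pi[s, a] * Q[s, a]
        g = rewards[k] + gamma * (expected_untaken + pi[s, a] * g)
    return g
```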

The n-step return is the sum of the values of each branch weighted by the product of the probabilities of the actions leading to the branch, and multiplied by the corresponding power of the discount factor. For clarity, it is easier to present the n-step return of the Tree-backup algorithm in terms of the TD error of Expected Sarsa from (6):

G_{t:t+n} = Q_{t−1}(S_t, A_t) + Σ_{k=t}^{τ} δ_k^{ES} Π_{i=t+1}^{k} γ π(A_i|S_i).   (12)

This atomic version of multi-step Tree-backup was first presented by Sutton and Barto (2018). As a result of the product term in (12), in addition to the discount factor γ, future rewards are further discounted by the probabilities of the actions taken. The Tree-backup algorithm therefore assigns less weight to the reward sequence received, and compensates by bootstrapping off of the values of actions not taken. Due to this, Tree-backup is more biased than Sarsa in the multi-step case with a stochastic policy, as Sarsa gives full weight to every reward received prior to bootstrapping. However, this increase in bias (towards the estimates in the value function) is traded off with decreased variance in the reward sequence from taking expectations.

The Q(σ) Algorithm

In the previous sections we have incrementally introduced several generalizations of the TD control methods Sarsa and Expected Sarsa, and in this section we present an algorithm that unifies them, called Q(σ). Sarsa can be generalized to an atomic multi-step algorithm by using an n-step return, and n-step Sarsa generalizes to an off-policy algorithm through the use of importance sampling. In contrast, Expected Sarsa can learn off-policy without the need for importance sampling, and generalizes to the atomic multi-step algorithms Tree-backup and n-step Expected Sarsa. All of the algorithms presented so far can be broadly categorized into two families: those that back up their actions as samples, like Sarsa; and those that consider an expectation over all actions in their backup, like Expected Sarsa and Tree-backup. In this section, we introduce a method to unify both families of algorithms by introducing a new parameter, σ. The possibility of unifying Sarsa and Tree-backup was first suggested by Precup et al. (2000), and the first formulation of Q(σ) was presented by Sutton and Barto (2018).

The intuition behind Q(σ) is based on the idea that we have a choice to update the estimate of q_π based on one action sampled from the set of possible future actions, or based on the expectation over the possible future actions. For example, with n-step Sarsa, a sample is taken at every step of the backup, whereas with the Tree-backup algorithm, an expectation is taken instead. However, the choice of sampling or expectation does not have to remain constant for every step of the backup. Furthermore, the backup at time step t could be based on a weighted average of both sampling and expectation. In order to implement this, the parameter σ_t ∈ [0, 1] is introduced to control the degree of sampling at each step of the backup. Thus, the TD error of Q(σ) can be represented in terms of a weighted sum of the TD errors of Sarsa and Expected Sarsa:

δ_t^σ = σ_{t+1} δ_t^S + (1 − σ_{t+1}) δ_t^{ES}
      = R_{t+1} + γ[σ_{t+1} Q_t(S_{t+1}, A_{t+1}) + (1 − σ_{t+1}) V_{t+1}] − Q_{t−1}(S_t, A_t).   (13)

The n-step return is then:

G_{t:t+n} = Q_{t−1}(S_t, A_t) + Σ_{k=t}^{τ} δ_k^σ Π_{i=t+1}^{k} γ[(1 − σ_i)π(A_i|S_i) + σ_i].   (14)

Moreover, the importance sampling ratio from (9) can be modified to include σ as follows:

ρ_{t+1}^{t+n} = Π_{k=t+1}^{τ} ( σ_k π(A_k|S_k)/µ(A_k|S_k) + 1 − σ_k ).   (15)

The update rule for Q(σ) can then be obtained by using G_{t:t+n} from (14) and ρ_{t+1}^{t+n} from (15), with the update rule from (10). Algorithm 1 shows the pseudocode for the complete off-policy n-step Q(σ) algorithm.

Algorithm 1: Off-policy n-step Q(σ) for estimating q_π

  Input: behaviour policy µ and target policy π
  Initialize S_0 ≠ terminal; select A_0 according to µ(·|S_0)
  Store S_0, A_0, and Q(S_0, A_0)
  for t = 0, 1, 2, ..., T + n − 1 do
      if t < T then
          Take action A_t; observe R and S_{t+1}; Store S_{t+1}
          if S_{t+1} is terminal then
              Store δ_t^σ = R − Q(S_t, A_t)
          else
              Select A_{t+1} according to µ(·|S_{t+1}) and Store it
              Store Q(S_{t+1}, A_{t+1}), σ_{t+1}, and π(A_{t+1}|S_{t+1})
              Store δ_t^σ = R + γ[σ_{t+1} Q(S_{t+1}, A_{t+1}) + (1 − σ_{t+1}) V_{t+1}] − Q(S_t, A_t)
              Store ρ_{t+1} = π(A_{t+1}|S_{t+1}) / µ(A_{t+1}|S_{t+1})
          end if
      end if
      τ ← t − n + 1
      if τ ≥ 0 then
          ρ ← 1; E ← 1; G ← Q(S_τ, A_τ)
          for k = τ, ..., min(τ + n − 1, T − 1) do
              G ← G + E δ_k^σ
              E ← γE[(1 − σ_{k+1}) π(A_{k+1}|S_{k+1}) + σ_{k+1}]
              ρ ← ρ(1 − σ_k + σ_k ρ_k)
          end for
          Q(S_τ, A_τ) ← Q(S_τ, A_τ) + αρ[G − Q(S_τ, A_τ)]
      end if
  end for

Additionally, a convergence proof for one-step Q(σ) is readily available by applying the results from Jaakkola et al. (1994), Singh et al. (2000), and van Seijen et al. (2009).
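Because the experiments in this paper are all on-policy, a compact way to exercise the algorithm is an on-policy version of Algorithm 1 (behaviour policy equal to the target policy, so every importance sampling ratio is 1). The sketch below follows Equations (13) and (14); the environment and policy interfaces (env.reset, env.step, policy, sigma_fn) are assumptions for illustration, not part of the paper.

```python
import numpy as np

def q_sigma_episode(env, Q, policy, n, alpha, gamma, sigma_fn, rng):
    """Runs one on-policy episode of n-step Q(sigma).
    Assumed interfaces: env.reset() -> state,
    env.step(a) -> (reward, next_state, done),
    policy(Q, s) -> probability vector over actions, sigma_fn(t) -> sigma_t."""
    s = env.reset()
    probs = policy(Q, s)
    a = rng.choice(len(probs), p=probs)
    states, actions = [s], [a]
    deltas = []                      # deltas[t] = delta_t^sigma, Equation (13)
    sigmas, pis = [None], [None]     # index 0 unused; sigma_0 is never needed
    T, t = float('inf'), 0
    while True:
        if t < T:
            r, s_next, done = env.step(actions[t])
            states.append(s_next)
            if done:
                T = t + 1
                deltas.append(r - Q[states[t], actions[t]])
            else:
                probs = policy(Q, s_next)
                a_next = rng.choice(len(probs), p=probs)
                actions.append(a_next)
                sigma = sigma_fn(t + 1)
                sigmas.append(sigma)
                pis.append(probs[a_next])
                v_next = np.dot(probs, Q[s_next])            # Equation (5)
                target = r + gamma * (sigma * Q[s_next, a_next]
                                      + (1.0 - sigma) * v_next)
                deltas.append(target - Q[states[t], actions[t]])
        tau = t - n + 1
        if tau >= 0:
            # Accumulate the n-step return of Equation (14) from stored deltas.
            e, g = 1.0, Q[states[tau], actions[tau]]
            for k in range(tau, int(min(tau + n - 1, T - 1)) + 1):
                g += e * deltas[k]
                if k + 1 < len(sigmas):   # weight for the next step, if stored
                    e *= gamma * ((1.0 - sigmas[k + 1]) * pis[k + 1] + sigmas[k + 1])
            Q[states[tau], actions[tau]] += alpha * (g - Q[states[tau], actions[tau]])
        if tau == T - 1:
            break
        t += 1
    return Q
```

Setting sigma_fn to return a constant 1 recovers on-policy n-step Sarsa, and a constant 0 recovers Tree-backup, matching the special cases discussed above.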

Theorem 1. The one-step Q(σ) estimate defined by

Q_{t+1}(S_t, A_t) = (1 − α_t) Q_t(S_t, A_t) + α_t [R_{t+1} + γ(σ_{t+1} Q_t(S_{t+1}, A_{t+1}) + (1 − σ_{t+1}) V_{t+1})],   (16)

converges to the optimal action-value function when the following conditions are satisfied:

1. The size of the set S × A is finite.
2. α_t = α_t(S_t, A_t) ∈ [0, 1], Σ_t α_t = ∞, Σ_t α_t^2 < ∞ w.p. 1, and α_t(s, a) = 0 for all (s, a) ≠ (S_t, A_t).
3. The policy is greedy in the limit with infinite exploration.
4. The reward function is bounded.

We defer the full details of the proof to the appendix; however, there are two important results from the proof that are worth emphasizing. First, just as with one-step Q-learning, Sarsa, and Expected Sarsa, one-step Q(σ) can be used to learn optimal action-value functions. Second, at each time step t it is possible to choose σ_t such that the contraction property of the Q(σ) update is less than or equal to the contraction induced by the Sarsa or Expected Sarsa updates. This implies that it is possible to choose σ_t at every time step in order to speed up convergence.

It is important to note that every TD control method presented thus far can be obtained with Q(σ) by varying the sampling parameter σ: when σ = 1, we obtain Sarsa; when σ = 0, we obtain Expected Sarsa and Tree-backup; and when σ = 1 for every step of the backup except for the last, where σ = 0, we obtain n-step Expected Sarsa. Thus, tuning the hyper-parameter σ is not strictly necessary, since it can be set to a fixed value in order to obtain one of the existing TD control algorithms. Nevertheless, intermediate values of σ between 0 and 1 create entirely new algorithms that exist somewhere between full sampling and pure expectation and that could result in better performance. Furthermore, σ does not need to remain constant throughout every episode, or even at every time step during an episode or continuing task. σ could be varied dynamically as a function of time, of the current state, or of some measure of the learning progress. In particular, σ could also be varied as a function of the episode number, which we investigate in our experiments. There are potentially a variety of effective schemes for choosing and varying σ, and these are a subject for further research.

Experiments

19-State Random Walk

The 19-state random walk, shown in Figure 2, is a 1-dimensional environment where an agent randomly transitions to one of two neighboring states. There is a terminal state at each end of the environment; transitioning into one of them gives a reward of -1, and transitioning into the other gives a reward of 1. To compare algorithms that involve taking an expectation based on the policy, the task is formulated such that each state has two actions. Each action deterministically transitions to one of the two neighboring states, and the agent learns on-policy under an equiprobable random behavior policy. This differs from typical random walk setups where each state has one action that will randomly transition to either neighboring state (Sutton and Barto 1998), but the resulting state values are identical.

Figure 2: The 19-state random walk MDP. The goal is to accurately estimate the value of each state under equiprobable random behavior.

This environment was treated as a prediction task where a learning algorithm is to estimate a value function under its behavior policy. We conducted an experiment comparing various Q(σ) algorithm instances, assessing different multi-step backup lengths, step sizes, and degrees of sampling. The root-mean-square (RMS) error between the estimated value function and the analytically computed values was measured after each episode. Each Q(σ) instance and parameter setting ran for 50 episodes, and the results are averaged across 100 runs.
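The analytically computed values referred to above can be obtained by solving the Bellman equations of this small MDP directly. A possible sketch, assuming the layout described in the text (19 non-terminal states, equiprobable behavior, undiscounted episodic task, rewards of -1 and +1 on the two terminal transitions):

```python
import numpy as np

n_states = 19
P = np.zeros((n_states, n_states))   # state-to-state transition probabilities
r = np.zeros(n_states)               # expected immediate reward from each state
for i in range(n_states):
    if i > 0:
        P[i, i - 1] = 0.5
    else:
        r[i] += 0.5 * (-1.0)         # transition into the left terminal state
    if i < n_states - 1:
        P[i, i + 1] = 0.5
    else:
        r[i] += 0.5 * (+1.0)         # transition into the right terminal state

# Solve v = r + P v (undiscounted, episodic), i.e. (I - P) v = r.
true_values = np.linalg.solve(np.eye(n_states) - P, r)
```

These values lie on a straight line from -0.9 to 0.9, and the RMS error after each episode is the root-mean-square difference between the learned estimates and true_values.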
Figure 3 shows the results with n = 3 and α = 0.4, which was found to be representative of the best parameter setting for each instance of Q(σ) on this task. Sarsa (full sampling) had better initial performance but poor asymptotic performance, Tree-backup (no sampling) had poor initial performance but better asymptotic performance, and intermediate degrees of sampling traded off between the initial and asymptotic performance. This motivated the idea of dynamically decreasing σ from 1 (full sampling) towards 0 (pure expectation) to take advantage of the initial performance of Sarsa and the asymptotic performance of Tree-backup. To accomplish this we decreased σ by a factor of 0.95 after each episode. Q(σ) with a dynamically varying σ outperformed all of the fixed degrees of sampling.

Figure 3: 19-state random walk results. The plot shows the performance of Q(σ) in terms of RMS error in the value function. The results are an average of 100 runs. Q(1) had the best initial performance, Q(0) had the best asymptotic performance, and dynamic σ outperformed all fixed values of σ.

Stochastic Windy Gridworld

The windy gridworld is a tabular navigation task in a standard gridworld, as described by Sutton and Barto (1998). There is a start state and a goal state, and there are four possible moves: right, left, up, and down. When the agent moves into one of the middle columns of the gridworld, it is affected by an upward wind which shifts the resultant next state upwards by a number of cells that varies from column to column. If the agent is at the edge of the world and selects a move that would cause it to leave the grid, or it would be pushed off the world by the wind, it is simply placed in the nearest state at the edge of the world. At each time step the agent receives a constant reward of -1 until the goal is reached.

A variation of the windy gridworld, called the stochastic windy gridworld, is one where the results of choosing an action are not deterministic. The layout, actions, and wind strengths are the same, but at each time step, with a probability of 10%, the next state that results from picking any action is determined at random from the 8 states currently surrounding the agent.

We conducted an experiment on the stochastic windy gridworld which consisted of 1000 runs of 100 episodes each to evaluate the performance of various instances of Q(σ) with different parameter combinations. All instances of the algorithms behaved and learned according to an ε-greedy policy, with ε = 0.1. As the performance measure, we compared the average return over the 100 episodes. The results are summarized in Figure 4. For all the values of σ that we tested, choosing n = 3 resulted in the greatest performance; higher and lower values of n decreased the performance. Overall, Q(σ) with dynamic σ performed the best, while σ = 0.5 was a close second.

Figure 4: Stochastic windy gridworld results. The plot shows the performance of Q(σ) in terms of the average return over 100 episodes as a function of the step size, α, for various values of σ. The results are for selected α values, which are then connected by straight lines, and are an average of 1000 runs. The standard errors are all less than 0.3, which is about a line width. 3-step algorithms performed better than their 1-step equivalents, and Q(σ) with dynamic σ performed the best overall.

Mountain Cliff

We implemented a variant of the classical episodic task mountain car, as described by Sutton and Barto (1998). For this implementation, the rewards, actions, and goal remained the same. However, if the agent ever ventured past the top of the leftmost mountain, it would fall off a cliff, be rewarded -100, and be returned to a random initial location in the valley between the two hills. We named this environment mountain cliff. Both environments were tested and showed the same trend in the results; however, the results obtained in mountain cliff were more pronounced and thus more suitable for demonstration purposes.

Because the state space is continuous, we approximated q_π using tile coding function approximation. Specifically, we used version 3 of Sutton's tile coding software (n.d.) with 8 tilings, an asymmetric offset by consecutive odd numbers, and each tile taking over a 1/8 fraction of the feature space, which gives a resolution of approximately 1.6%. For each algorithm, we conducted 500 independent runs of 500 episodes each. All training was done on-policy under an ε-greedy policy with ε = 0.1 and γ = 1. We optimized for the average return after 500 episodes over different values of the step size parameter, α, and the backup length, n. The results correspond to the best-performing parameter combination for each algorithm: α = 1/6 and n = 4 for Sarsa; α = 1/6 and n = 8 for Tree-backup; α = 1/4 and n = 4 for Q(0.5); and α = 1/7 and n = 8 for dynamic σ. We omit n-step Expected Sarsa from the results because its performance was not much different from n-step Sarsa's.

Figure 6 shows the return per episode averaged over 500 runs. To smooth the results, we computed a right-centered moving average with a window of 30 successive episodes. Additionally, we added the average return per episode in a lighter tone to show the variance of each algorithm. As can be observed, atomic multi-step Sarsa and Q(0.5) had fairly similar performance. Among the atomic multi-step methods with static σ, Tree-backup had the best performance. Nonetheless, Q(σ) with dynamic σ outperformed all the algorithms that were using static σ.
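The dynamic σ used in these experiments simply starts at 1 (full sampling) and is multiplied by 0.95 after every episode, as described in the random walk experiment. A minimal sketch, reusing the hypothetical q_sigma_episode routine from the earlier sketch (num_episodes, env, policy, and the remaining parameters are placeholders):

```python
sigma = 1.0
for episode in range(num_episodes):
    # Hold sigma fixed within the episode, then decay it towards pure expectation.
    Q = q_sigma_episode(env, Q, policy, n, alpha, gamma,
                        sigma_fn=lambda t, s=sigma: s, rng=rng)
    sigma *= 0.95
```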
In order to gain more insight into the nature of the results, we looked at the average return per episode after 50 episodes (initial performance) and after 500 episodes (asymptotic performance) for each algorithm. Additionally, a 95% confidence interval was computed in order to validate the results. After 50 episodes, Q(0.5) had the best average return per episode among the four algorithms, with dynamic σ a close second. On the other hand, after 500 episodes, dynamic σ managed to outperform all the other algorithms, followed by Q(0.5). Q(1) (Sarsa) had the lowest performance both after 50 episodes and after 500 episodes. These results contrast with Figure 6 because the average is taken over all the previous episodes instead of the preceding 30 episodes.

Figure 5: The mountain cliff environment. The goal of the agent is to drive past the flag without falling off the cliff. The agent receives a reward of -1 at every time step, and falling off the cliff returns it to a random initial location in the valley with a reward of -100.

Discussion

From our experiments, it is evident that there is merit in unifying the space of algorithms with Q(σ). In prediction tasks, such as the 19-state random walk, varying the degree of sampling results in a trade-off between initial and asymptotic performance. In control tasks, such as the stochastic windy gridworld, intermediate degrees of sampling are capable of achieving higher per-episode average return than either extreme, depending on the number of elapsed episodes. These findings also extend to tasks with continuous state spaces, such as the mountain cliff. Intermediate values of σ allow for higher initial performance, whereas small values of σ allow for better asymptotic performance. As shown in Figure 6, Q(σ) with dynamic σ is able to exploit these two benefits by adjusting σ over time.

Moreover, our experiments in the stochastic windy gridworld task demonstrated that it is possible to improve performance by choosing a higher value of the backup length parameter, n. Varying n controls a bias-variance trade-off by adjusting how many rewards are included in the backup before bootstrapping, similar to the parameter λ in the TD(λ) algorithm. The parameter σ also has a bias-variance trade-off interpretation, as the Tree-backup algorithm decays the weighting of future rewards based on the stochasticity in the policy (and is therefore more biased). The length parameter n controls the bias-variance trade-off in the direction of the trajectory taken, while the parameter σ manages it by controlling the bootstrapping in the direction of actions not taken.

A qualitative result that illustrates the bias-variance trade-off induced by the parameter σ can be observed in the 19-state random walk experiment. A large value of σ results in lower bias at the beginning of training, and lower RMS error as a consequence. However, as the bias of the return decreases in the asymptote, the low variance inherent to small values of σ results in more accurate estimates of the action-value function.

Figure 6: Mountain cliff results. The plot shows the performance of each atomic multi-step algorithm in terms of the average return per episode. The dark lines show the results smoothed using a right-centered moving average with a window of 30 successive episodes, while the light lines show the un-smoothed results. Q(σ) with dynamic σ had the best performance among all the algorithms.

Conclusions

In this paper we studied Q(σ), a unifying algorithm for multi-step TD control methods. Q(σ), through the use of the sampling parameter σ, allows for continuous variation between updating based on full sampling and updating based on pure expectation. Our results on prediction and control problems showed that an intermediate fixed degree of sampling can outperform the methods that exist at the extremes (Sarsa and Tree-backup). In addition, we presented a simple way of dynamically adjusting σ which outperformed any fixed degree of sampling.

Our presentation of Q(σ) was limited to the atomic multi-step case without eligibility traces, we only conducted experiments on on-policy problems, and we only investigated one simple method for dynamically varying σ. This leaves open several avenues for future research. First, Q(σ) could be extended to use eligibility traces and compound backups.
Second, the performance of Q(σ) could be evaluated on off-policy problems. Third, other schemes for dynamically varying σ could be investigated, perhaps as a function of the state, the recently observed rewards, or some measure of the learning progress.

Acknowledgments

The authors thank Vincent Zhang, Harm van Seijen, Doina Precup, and Pierre-Luc Bacon for insights and discussions contributing to the results presented in this paper, and the entire Reinforcement Learning and Artificial Intelligence research group for providing the environment to nurture and support this research. We gratefully acknowledge funding from Alberta Innovates Technology Futures, Google DeepMind, and the Natural Sciences and Engineering Research Council of Canada.

Appendix: Proof of Theorem 1

Let X = S × A, X_t = (S_t, A_t) ∈ X, R̄_t = E{R_t}, and Q* be the optimal action-value function, defined as

Q*(S_t, A_t) = R̄_{t+1} + γ E{max_a Q*(S_{t+1}, a)}.   (17)

We define a new stochastic process (α_t, Δ_t, F_t)_{t ≥ 0} by subtracting Q*(X_t) from both sides of equation (16),

Δ_{t+1}(X_t) = (1 − α_t(X_t)) Δ_t(X_t) + α_t(X_t) F_t(X_t),

and letting α_t ∈ (0, 1], Δ_t(X_t) = Q_t(X_t) − Q*(X_t), and F_t = R_{t+1} + γ[σ_{t+1} Q_t(X_{t+1}) + (1 − σ_{t+1}) V_{t+1}] − Q*(X_t). Additionally, let P_t be a sequence of increasing σ-fields representing the history, such that α_0 and Δ_0 are P_0-measurable and α_t, Δ_t, and F_{t−1} are P_t-measurable for t ≥ 1.

Proving that Δ_t converges to 0 as t → ∞ is equivalent to showing that Q_t converges to Q* as t → ∞. Consequently, the proof is equivalent to showing that the conditions of Lemma 1 from Singh et al. (2000) are satisfied for Δ_t. Conditions one, two, and three of the lemma are satisfied by the corresponding assumptions of the theorem. Hence, we only need to show that ||E{F_t | P_t}|| ≤ k||Δ_t|| + C_t, where ||·|| is the maximum norm, k ∈ [0, 1), and C_t goes to 0 with probability 1.

By adding and subtracting γ max_a Q_t(S_{t+1}, a), using the definition of Q* and the triangle inequality, we can show that

||E{F_t | P_t}|| ≤ ||E{R_{t+1} + γ max_a Q_t(S_{t+1}, a) − Q*(S_t, A_t)}||
                 + γ ||E{σ_{t+1} Q_t(S_{t+1}, A_{t+1}) + (1 − σ_{t+1}) V_{t+1} − max_a Q_t(S_{t+1}, a)}||
               = γ ||E{max_a Q_t(S_{t+1}, a) − max_b Q*(S_{t+1}, b)}|| + C_t
               ≤ γ max_s |max_a Q_t(s, a) − max_b Q*(s, b)| + C_t
               ≤ γ max_s max_a |Q_t(s, a) − Q*(s, a)| + C_t
               = γ ||Δ_t|| + C_t.

Note that if the policy is greedy and σ_{t+1} ∈ [0, 1], then σ_{t+1} Q_t(S_{t+1}, A_{t+1}) + (1 − σ_{t+1}) V_{t+1} = max_a Q_t(S_{t+1}, a). Therefore, C_t goes to 0 as the policy becomes greedy in the limit. Consequently, condition 3 of Lemma 1 from Singh et al. (2000) is satisfied. Therefore, Δ_t converges to 0 w.p. 1, which implies that Q_t converges to Q* w.p. 1.

References

Jaakkola, T.; Jordan, M. I.; and Singh, S. P. 1994. On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation 6(6).

Precup, D.; Sutton, R. S.; and Singh, S. 2000. Eligibility traces for off-policy policy evaluation. In Proceedings of the 17th International Conference on Machine Learning. Morgan Kaufmann.

Rummery, G. A., and Niranjan, M. 1994. On-line Q-learning using connectionist systems. Technical Report CUED/F-INFENG/TR 166, Engineering Department, Cambridge University.

Rummery, G. A. 1995. Problem Solving with Reinforcement Learning. Ph.D. Dissertation, Cambridge University.

Singh, S.; Jaakkola, T.; Littman, M. L.; and Szepesvári, C. 2000. Convergence results for single-step on-policy reinforcement-learning algorithms. Machine Learning 38(3).

Sutton, R. S., and Barto, A. G. 1998. Reinforcement Learning: An Introduction. Cambridge, Massachusetts: MIT Press.

Sutton, R. S., and Barto, A. G. 2018. Reinforcement Learning: An Introduction. 2nd edition. Manuscript in preparation.

Sutton, R. S. 1988. Learning to predict by the methods of temporal differences. Machine Learning 3(1):9–44.

Sutton, R. S. 1996. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Touretzky, D. S., and Hasselmo, M. E., eds., Advances in Neural Information Processing Systems 8. MIT Press.

van Seijen, H.; van Hasselt, H.; Whiteson, S.; and Wiering, M. 2009. A theoretical and empirical analysis of expected Sarsa. In Proceedings of the IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning.

Watkins, C. J. C. H., and Dayan, P. 1992. Q-learning. Machine Learning 8(3-4).

Watkins, C. J. C. H. 1989. Learning from Delayed Rewards. Ph.D. Dissertation, Cambridge University.


More information

PSAS: Government transfers what you need to know

PSAS: Government transfers what you need to know PSAS: Government trnsfers wht you need to know Ferury 2018 Overview This summry will provide users with n understnding of the significnt recognition, presenttion nd disclosure requirements of the stndrd.

More information

A Static Model for Voting on Social Security

A Static Model for Voting on Social Security A Sttic Model for Voting on Socil Security Henning Bohn Deprtment of Economics University of Cliforni t Snt Brbr Snt Brbr, CA 93106, USA; nd CESifo Phone: 1-805-893-4532; Fx: 1-805-893-8830. E-mil: bohn@econ.ucsb.edu

More information

ACHIEVING ALARP WITH SAFETY INSTRUMENTED SYSTEMS

ACHIEVING ALARP WITH SAFETY INSTRUMENTED SYSTEMS ACHIEVING ALARP WITH SAFETY INSTRUMENTED SYSTEMS C.R. Timms, MIEE, United Kingdom, Tel: + 44 (0) 9 88668, Emil: c.timms@ifb.co.uk Keywords: ALARP, hzrds, risk, sfety, SIS. Abstrct This pper sets out methodology

More information

Name Date. Find the LCM of the numbers using the two methods shown above.

Name Date. Find the LCM of the numbers using the two methods shown above. Lest Common Multiple Multiples tht re shred by two or more numbers re clled common multiples. The lest of the common multiples is clled the lest common multiple (LCM). There re severl different wys to

More information

FINANCIAL ANALYSIS I. INTRODUCTION AND METHODOLOGY

FINANCIAL ANALYSIS I. INTRODUCTION AND METHODOLOGY Dhk Wter Supply Network Improvement Project (RRP BAN 47254003) FINANCIAL ANALYSIS I. INTRODUCTION AND METHODOLOGY A. Introduction 1. The Asin Development Bnk (ADB) finncil nlysis of the proposed Dhk Wter

More information

OPEN BUDGET QUESTIONNAIRE UKRAINE

OPEN BUDGET QUESTIONNAIRE UKRAINE Interntionl Budget Prtnership OPEN BUDGET QUESTIONNAIRE UKRAINE September 28, 2007 Interntionl Budget Prtnership Center on Budget nd Policy Priorities 820 First Street, NE Suite 510 Wshington, DC 20002

More information

Market uncertainty, macroeconomic expectations and the European sovereign bond spreads.

Market uncertainty, macroeconomic expectations and the European sovereign bond spreads. Mrket uncertinty, mcroeconomic expecttions nd the Europen sovereign bond spreds. Dimitris A. Georgoutsos Athens University of Economics & Business, Deprtment of Accounting & Finnce 76, Ptission str., 434,

More information

Technical Report Global Leader Dry Bulk Derivatives. FIS Technical - Grains And Ferts. Highlights:

Technical Report Global Leader Dry Bulk Derivatives. FIS Technical - Grains And Ferts. Highlights: Technicl Report Technicl Anlyst FIS Technicl - Grins And Ferts Edwrd Hutn 44 20 7090 1120 Edwrdh@freightinvesr.com Highlights: SOY The weekly chrt is chowing lower high suggesting wekness going forwrd,

More information

MATH 236 ELAC MATH DEPARTMENT FALL 2017 SHORT ANSWER. Write the word or phrase that best completes each statement or answers the question.

MATH 236 ELAC MATH DEPARTMENT FALL 2017 SHORT ANSWER. Write the word or phrase that best completes each statement or answers the question. MATH 236 ELAC MATH DEPARTMENT FALL 2017 TEST 1 REVIEW SHORT ANSWER. Write the word or phrse tht best completes ech sttement or nswers the question. 1) The supply nd demnd equtions for certin product re

More information

Search-based Uncertainty-wise Requirements Prioritization

Search-based Uncertainty-wise Requirements Prioritization Simul Reserch Lbortory, Technicl Report, 2017-06 Jun, 2017 Serch-bsed Uncertinty-wise Requirements Prioritiztion Yn Li 1, Mn Zhng 2, To Yue 2,3, Shukt Ali 2 nd Li Zhng 1 1 Beihng University, 2 Simul Reserch

More information

Incentives from stock option grants: a behavioral approach

Incentives from stock option grants: a behavioral approach Incentives from stock option grnts: behviorl pproch Hmz Bhji To cite this version: Hmz Bhji. Incentives from stock option grnts: behviorl pproch. 6th Interntionl Finnce Conference (IFC)- Tunisi, Mr 2011,

More information

Characterizing Higher-Order Ross More Risk Aversion by Comparison of Risk Compensation

Characterizing Higher-Order Ross More Risk Aversion by Comparison of Risk Compensation Chrcterizing Higher-Order Ross More Risk Aversion by Comprison of Risk Compenstion Guoqing Tin Yougong Tin b,c Deprtment of Economics, Texs A&M University, College Sttion, TX77843, USA b School of Economics,

More information

Information Acquisition and Disclosure: the Case of Differentiated Goods Duopoly

Information Acquisition and Disclosure: the Case of Differentiated Goods Duopoly Informtion Acquisition nd Disclosure: the Cse of Differentited Goods Duopoly Snxi Li Jinye Yn Xundong Yin We thnk Dvid Mrtimort, Thoms Mriotti, Ptrick Rey, Wilfried Snd-Zntmn, Frnces Xu nd Yongsheng Xu

More information

Insurance: Mathematics and Economics

Insurance: Mathematics and Economics Insurnce: Mthemtics nd Economics 43 008) 303 315 Contents lists vilble t ScienceDirect Insurnce: Mthemtics nd Economics journl homepge: www.elsevier.com/locte/ime he design of equity-indexed nnuities Phelim

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning n-step bootstrapping Daniel Hennes 12.06.2017 University Stuttgart - IPVS - Machine Learning & Robotics 1 n-step bootstrapping Unifying Monte Carlo and TD n-step TD n-step Sarsa

More information

MARKET POWER AND MISREPRESENTATION

MARKET POWER AND MISREPRESENTATION MARKET POWER AND MISREPRESENTATION MICROECONOMICS Principles nd Anlysis Frnk Cowell Note: the detil in slides mrked * cn only e seen if you run the slideshow July 2017 1 Introduction Presenttion concerns

More information

Multi-step Bootstrapping

Multi-step Bootstrapping Multi-step Bootstrapping Jennifer She Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto February 7, 2017 J February 7, 2017 1 / 29 Multi-step Bootstrapping Generalization

More information

Preference Cloud Theory: Imprecise Preferences and Preference Reversals Oben Bayrak and John Hey

Preference Cloud Theory: Imprecise Preferences and Preference Reversals Oben Bayrak and John Hey Preference Cloud Theory: Imprecise Preferences nd Preference Reversls Oben Byrk nd John Hey This pper presents new theory, clled Preference Cloud Theory, of decision-mking under uncertinty. This new theory

More information

FIS Technical - Capesize

FIS Technical - Capesize Technicl Report Technicl Anlyst FIS Technicl - Cpesize Edwrd Hutn 442070901120 Edwrdh@freightinvesr.com Client Reltions Andrew Cullen 442070901120 Andrewc@freightinvesr.com Highlights: Cpesize Index- Holding

More information

A comparison of quadratic discriminant function with discriminant function based on absolute deviation from the mean

A comparison of quadratic discriminant function with discriminant function based on absolute deviation from the mean A comprison of qudrtic discriminnt function with discriminnt function bsed on bsolute devition from the men S. Gneslingm 1, A. Nnthkumr Siv Gnesh 1, 1 Institute of Informtion Sciences nd Technology College

More information

First version: September 1997 This version: October On the Relevance of Modeling Volatility for Pricing Purposes

First version: September 1997 This version: October On the Relevance of Modeling Volatility for Pricing Purposes First version: September 1997 This version: October 1999 On the Relevnce of Modeling Voltility for Pricing Purposes Abstrct: Mnuel Moreno 3 Deprtment of Economics nd Business Universitt Pompeu Fbr Crrer

More information

Rates of Return of the German PAYG System - How they can be measured and how they will develop

Rates of Return of the German PAYG System - How they can be measured and how they will develop Rtes of Return of the Germn PAYG System - How they cn be mesured nd how they will develop Christin Benit Wilke 97-2005 me Mnnheimer Forschungsinstitut Ökonomie und Demogrphischer Wndel Gebäude L 13, 17_D-68131

More information

"Multilateralism, Regionalism, and the Sustainability of 'Natural' Trading Blocs"

Multilateralism, Regionalism, and the Sustainability of 'Natural' Trading Blocs "Multilterlism, Regionlism, nd the Sustinility of 'Nturl' Trding Blocs" y Eric Bond Deprtment of Economics Penn Stte June, 1999 Astrct: This pper compres the mximum level of world welfre ttinle in n incentive

More information

Managerial Incentives and Financial Contagion

Managerial Incentives and Financial Contagion WP/04/199 ngeril Incentives nd Finncil Contgion Sujit Chkrvorti nd Subir Lll 004 Interntionl onetry Fund WP/04/199 IF Working Pper Policy Development nd Review Deprtment ngeril Incentives nd Finncil Contgion

More information

Trigonometry - Activity 21 General Triangle Solution: Given three sides.

Trigonometry - Activity 21 General Triangle Solution: Given three sides. Nme: lss: p 43 Mths Helper Plus Resoure Set. opyright 003 rue. Vughn, Tehers hoie Softwre Trigonometry - tivity 1 Generl Tringle Solution: Given three sides. When the three side lengths '', '' nd '' of

More information

Chapter 4. Profit and Bayesian Optimality

Chapter 4. Profit and Bayesian Optimality Chpter 4 Profit nd Byesin Optimlity In this chpter we consider the objective of profit. The objective of profit mximiztion dds significnt new chllenge over the previously considered objective of socil

More information

ECON 105 Homework 2 KEY Open Economy Macroeconomics Due November 29

ECON 105 Homework 2 KEY Open Economy Macroeconomics Due November 29 Instructions: ECON 105 Homework 2 KEY Open Economy Mcroeconomics Due Novemer 29 The purpose of this ssignment it to integrte the explntions found in chpter 16 ok Kennedy with the D-S model nd the Money

More information

Technical Report Global Leader Dry Bulk Derivatives

Technical Report Global Leader Dry Bulk Derivatives Soybens Mrch 17 - Weekly Soybens Mrch 17 - Dily Source Bloomberg Weekly Close US$ 1,026 7/8 RSI 56 MACD Bullish, the hisgrm is flt S1 US$ 1,032 ½ S2 US$ 1,001 R1 US$ 1,072 R2 US$ 1,080 Dily Close US$ 1,042

More information

Optimal Trading Strategies in a Limit Order Market with Imperfect Liquidity

Optimal Trading Strategies in a Limit Order Market with Imperfect Liquidity Optiml rding Strtegies in Limit Order Mrket with Imperfect Liquidity P. Kovlev,, G. Iori b City University, Deprtment of Economics, D Socil Sciences Bldg, Whiskin St., London ECR JD b City University,

More information

Does Population Aging Represent a Crisis for Rich Societies?

Does Population Aging Represent a Crisis for Rich Societies? First drft Does Popultion Aging Represent Crisis for Rich Societies? by Gry Burtless THE BROOKINGS INSTITUTION Jnury 2002 This pper ws prepred for session of the nnul meetings of the Americn Economic Assocition

More information

OPEN BUDGET QUESTIONNAIRE SOUTH AFRICA

OPEN BUDGET QUESTIONNAIRE SOUTH AFRICA Interntionl Budget Prtnership OPEN BUDGET QUESTIONNAIRE SOUTH AFRICA September 28, 2007 Interntionl Budget Prtnership Center on Budget nd Policy Priorities 820 First Street, NE Suite 510 Wshington, DC

More information

International Budget Partnership OPEN BUDGET QUESTIONNAIRE POLAND

International Budget Partnership OPEN BUDGET QUESTIONNAIRE POLAND Interntionl Budget Prtnership OPEN BUDGET QUESTIONNAIRE POLAND September 28, 2007 Interntionl Budget Prtnership Center on Budget nd Policy Priorities 820 First Street, NE Suite 510 Wshington, DC 20002

More information

Optimal incentive contracts under loss aversion and inequity aversion

Optimal incentive contracts under loss aversion and inequity aversion Fuzzy Optim Decis Mking https://doi.org/10.1007/s10700-018-9288-1 Optiml incentive contrcts under loss version nd inequity version Chi Zhou 1 Jin Peng 2 Zhibing Liu 2 Binwei Dong 3 Springer Science+Business

More information

NBER WORKING PAPER SERIES THE IMPACT OF TRADE ON INTRAINDUSTRY REALLOCATIONS AND AGGREGATE INDUSTRY PRODUCTIVITY: A COMMENT

NBER WORKING PAPER SERIES THE IMPACT OF TRADE ON INTRAINDUSTRY REALLOCATIONS AND AGGREGATE INDUSTRY PRODUCTIVITY: A COMMENT NBER WORKING PAPER SERIES THE IMPACT OF TRAE ON INTRAINUSTRY REALLOCATIONS AN AGGREGATE INUSTRY PROUCTIVITY: A COMMENT Richrd E. Bldwin Frederic Robert-Nicoud Working Pper 078 http://www.nber.org/ppers/w078

More information