Multi-step Bootstrapping
Jennifer She
Reinforcement Learning: An Introduction, by Richard S. Sutton and Andrew G. Barto
February 7, 2017
Multi-step Bootstrapping
A generalization of Monte Carlo methods and one-step TD methods that includes the methods lying between these two extremes.
Methods are based on sample episodes of states, actions and rewards.
The time intervals for making updates and for bootstrapping are no longer the same, which enables bootstrapping to occur over longer time intervals.
Prediction Problem (Policy Evaluation)
Given a fixed policy π, estimate the state-value function v_π.
Monte Carlo update:
    V(S_t) ← V(S_t) + α (G_t − V(S_t))
    G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ... + γ^{T−t−1} R_T
Updates of the state-value estimates happen at the end of each episode.
G_t is the complete return of the episode after S_t.
No bootstrapping is involved (the update does not use other estimates).
One-step TD update:
    V_{t+1}(S_t) ← V_t(S_t) + α (R_{t+1} + γ V_t(S_{t+1}) − V_t(S_t))
Updates happen one step later, bootstrapping on V_t(S_{t+1}).
R_{t+1} + γ V_t(S_{t+1}) approximates G_t.
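A minimal sketch (not from the slides) contrasting the two tabular updates, assuming a value table stored as a dict and an every-visit Monte Carlo variant; names and the trajectory-list layout are illustrative:

```python
def mc_episode_update(V, states, rewards, alpha, gamma):
    """Monte Carlo: after the episode ends, move every visited state toward its
    complete return G_t. rewards[t] holds R_{t+1}, the reward following states[t].
    No bootstrapping: the target uses no other value estimates."""
    G = 0.0
    for t in reversed(range(len(states))):
        G = rewards[t] + gamma * G            # G_t = R_{t+1} + gamma * G_{t+1}
        V[states[t]] += alpha * (G - V[states[t]])

def td0_update(V, s, r, s_next, done, alpha, gamma):
    """One-step TD: update immediately, bootstrapping on V(S_{t+1})."""
    target = r + (0.0 if done else gamma * V[s_next])
    V[s] += alpha * (target - V[s])
```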
n-step TD Prediction
Approximate G_t by looking ahead n steps, bootstrapping on V_{t+n−1}(S_{t+n}):
    G_t^{(n)} = R_{t+1} + γ R_{t+2} + ... + γ^{n−1} R_{t+n} + γ^n V_{t+n−1}(S_{t+n})   for 0 ≤ t < T − n
    G_t^{(n)} = G_t                                                                     for t + n ≥ T
G_t^{(n)} incorporates the discounted rewards up to R_{t+n} and is called the n-step return.
G_t^{(1)} is the one-step TD target; with t + n ≥ T the n-step return is the full Monte Carlo return.
n-step TD Prediction
For n > 1, V_{t+n−1}(S_{t+n}) involves rewards and value estimates that are not yet available at the transition from t to t + 1.
We must wait until time t + n to update V(S_t):
    V_{t+n}(S_t) ← V_{t+n−1}(S_t) + α (G_t^{(n)} − V_{t+n−1}(S_t))   for 0 ≤ t < T
No updates are made during the first n − 1 time steps; to compensate, n − 1 updates are made at the end of the episode using G_t.
These are still considered TD methods (for n < T): an earlier estimate is changed based on how it differs from a later estimate.
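A sketch of the tabular n-step TD prediction loop described above; the `env` interface (reset()/step() returning next state, reward, done) and the `policy` function are assumptions for illustration:

```python
def n_step_td_prediction(env, policy, V, n, alpha, gamma, num_episodes):
    """n-step TD for estimating V ~ v_pi. V maps state -> value."""
    for _ in range(num_episodes):
        states, rewards = [env.reset()], [0.0]   # rewards[k] holds R_k (index 0 unused)
        T = float('inf')
        t = 0
        while True:
            if t < T:
                s_next, r, done = env.step(policy(states[t]))
                states.append(s_next)
                rewards.append(r)
                if done:
                    T = t + 1
            tau = t - n + 1                      # time whose estimate is updated now
            if tau >= 0:
                # n-step return G_tau^{(n)}: discounted rewards, then a bootstrap term
                G = sum(gamma ** (k - tau - 1) * rewards[k]
                        for k in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:
                    G += gamma ** n * V[states[tau + n]]
                V[states[tau]] += alpha * (G - V[states[tau]])
            if tau == T - 1:
                break
            t += 1
    return V
```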
n-step TD Prediction
The expected n-step return is guaranteed to be a better estimate of v_π than V_{t+n−1}, in a worst-case sense:
    max_s | E[G_t^{(n)} | S_t = s] − v_π(s) | ≤ γ^n max_s | V_{t+n−1}(s) − v_π(s) |
Because of this error-reduction property, all n-step TD methods converge to the correct predictions under appropriate technical conditions.
Example
Random walk starting from state C.
Rewards are all 0 except when following the right arrow from state E, which gives reward 1.
The true state values of A through E are 1/6, 2/6, 3/6, 4/6, 5/6.
Initialize V(s) = 0.5 for all s.
Suppose the first episode goes from C to the right, through D and E.
At the end of the episode:
a one-step method increments only V(E) towards 1;
a two-step method increments both V(D) and V(E) towards 1;
for n ≥ 3, all of V(C), V(D) and V(E) are incremented towards 1.
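To make the example concrete, here is a sketch of the five-state random walk (states A–E as indices 0–4, start at C, reward 1 only on stepping off the right end); it plugs into the n-step prediction sketch above. The class name and interface are illustrative, not from the slides:

```python
import random

class RandomWalk:
    """States 0..4 correspond to A..E; the episode ends on stepping off either end."""
    def __init__(self, n_states=5):
        self.n_states = n_states

    def reset(self):
        self.s = self.n_states // 2              # start in the middle state (C)
        return self.s

    def step(self, action):                      # action: -1 (left) or +1 (right)
        self.s += action
        if self.s < 0:                           # fell off the left end: reward 0
            return self.s, 0.0, True
        if self.s >= self.n_states:              # stepped off the right end: reward 1
            return self.s, 1.0, True
        return self.s, 0.0, False

# Usage with the n-step TD sketch, equiprobable random policy:
# V = {s: 0.5 for s in range(5)}
# n_step_td_prediction(RandomWalk(), lambda s: random.choice([-1, 1]), V,
#                      n=2, alpha=0.1, gamma=1.0, num_episodes=100)
```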
Example
Empirical comparison on a similar problem: a random walk with 19 states, where all rewards are 0 except −1 upon exiting at the left-most state.
An intermediate value of n works best.
Control Problem (Policy Evaluation + Policy Improvement)
Find an optimal policy π*.
Alternate between estimating the action-value function q_π (evaluation) and updating the policy π (improvement).
Estimate q_π rather than v_π because this information is needed to decide on the next π.
Control Problem (On-Policy)
Evaluation step
Monte Carlo evaluation:
    Q(S_t, A_t) ← Q(S_t, A_t) + α (G_t − Q(S_t, A_t))
Sarsa (one-step on-policy TD) evaluation:
    Q_{t+1}(S_t, A_t) ← Q_t(S_t, A_t) + α (R_{t+1} + γ Q_t(S_{t+1}, A_{t+1}) − Q_t(S_t, A_t))
    R_{t+1} + γ Q_t(S_{t+1}, A_{t+1}) approximates G_t.
Improvement step
ε-greedy (or any other ε-soft policy) helps maintain exploration:
    A* ← argmax_a Q(S_t, a)
    π(a|S_t) = 1 − ε + ε/|A(S_t)|   if a = A*
    π(a|S_t) = ε/|A(S_t)|           for all other a ∈ A(S_t)
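A sketch of the ε-greedy improvement step: the greedy action gets probability 1 − ε + ε/|A|, every other action ε/|A|. Q is assumed to be a dict keyed by (state, action); names are illustrative:

```python
import random

def epsilon_greedy_probs(Q, state, actions, epsilon):
    """Return {a: pi(a|state)} for an epsilon-greedy policy derived from Q."""
    best = max(actions, key=lambda a: Q[(state, a)])
    probs = {a: epsilon / len(actions) for a in actions}
    probs[best] += 1.0 - epsilon
    return probs

def epsilon_greedy_action(Q, state, actions, epsilon):
    """Sample an action from the epsilon-greedy distribution."""
    probs = epsilon_greedy_probs(Q, state, actions, epsilon)
    return random.choices(list(probs), weights=list(probs.values()))[0]
```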
n-step Sarsa
Modification to the evaluation step: as in prediction, approximate G_t with
    G_t^{(n)} = R_{t+1} + γ R_{t+2} + ... + γ^{n−1} R_{t+n} + γ^n Q_{t+n−1}(S_{t+n}, A_{t+n})   for 0 ≤ t < T − n
    G_t^{(n)} = G_t                                                                              for t + n ≥ T
    Q_{t+n}(S_t, A_t) ← Q_{t+n−1}(S_t, A_t) + α (G_t^{(n)} − Q_{t+n−1}(S_t, A_t))   for 0 ≤ t < T
Expected Sarsa
Replace Q_{t+n−1}(S_{t+n}, A_{t+n}) with
    E[Q_{t+n−1}(S_{t+n}, A_{t+n}) | S_{t+n}] = Σ_a π(a|S_{t+n}) Q_{t+n−1}(S_{t+n}, a)
This moves deterministically in the same direction that Sarsa moves in expectation.
It requires more computation, but eliminates the variance due to sampling A_{t+n}.
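A sketch of one episode of the n-step Sarsa evaluation step; the control loop would keep `policy` ε-greedy with respect to the current Q. The `env` and `policy` interfaces are the same assumptions as in the earlier sketches:

```python
def n_step_sarsa_episode(env, policy, Q, n, alpha, gamma):
    """Run one episode of n-step Sarsa, updating the dict Q[(state, action)] in place.
    policy(state) samples an action (e.g. epsilon-greedy w.r.t. Q)."""
    s = env.reset()
    states, actions, rewards = [s], [policy(s)], [0.0]   # rewards[k] holds R_k
    T = float('inf')
    t = 0
    while True:
        if t < T:
            s_next, r, done = env.step(actions[t])
            states.append(s_next)
            rewards.append(r)
            if done:
                T = t + 1
            else:
                actions.append(policy(s_next))
        tau = t - n + 1
        if tau >= 0:
            # n-step return, bootstrapping on Q(S_{tau+n}, A_{tau+n}) if the episode continues
            G = sum(gamma ** (k - tau - 1) * rewards[k]
                    for k in range(tau + 1, min(tau + n, T) + 1))
            if tau + n < T:
                G += gamma ** n * Q[(states[tau + n], actions[tau + n])]
            sa = (states[tau], actions[tau])
            Q[sa] += alpha * (G - Q[sa])
        if tau == T - 1:
            break
        t += 1
```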
Example
Gridworld scenario where the rewards at all states are 0 except for a positive reward on square G.
Initialize V(s) = 0 for all s.
Suppose you take a path on the first episode and end at G.
At the end of the episode:
a one-step method strengthens only the last state–action pair of the path for the next policy;
an n-step method strengthens the last n state–action pairs of the path for the next policy.
Control Problem (Off-Policy)
Learn the value of one policy π while following another policy µ.
π is often greedy and µ exploratory (e.g. ε-greedy).
Requires coverage: π(a|s) > 0 implies µ(a|s) > 0.
Importance sampling (Monte Carlo)
The step size takes into account the difference between π and µ through the relative probability of all the subsequent actions:
    V(S_t) ← V(S_t) + α ρ_t^T (G_t − V(S_t))
ρ_t^T is the importance sampling ratio; the transition probabilities cancel:
    ρ_t^T = ∏_{k=t}^{T−1} [π(A_k|S_k) p(S_{k+1}|S_k, A_k)] / [µ(A_k|S_k) p(S_{k+1}|S_k, A_k)] = ∏_{k=t}^{T−1} π(A_k|S_k) / µ(A_k|S_k)
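A sketch of the weighted Monte Carlo update over one stored episode generated by µ; `pi_prob` and `mu_prob` are illustrative names for functions returning π(a|s) and µ(a|s), and the trajectory lists are indexed so that rewards[k] holds R_k:

```python
def off_policy_mc_update(V, states, actions, rewards, T, alpha, gamma, pi_prob, mu_prob):
    """Importance-sampled Monte Carlo update of V(S_t) for every t in an episode of length T."""
    for t in range(T):
        # rho_t^T: relative probability of the remaining actions under pi vs. mu
        rho = 1.0
        for k in range(t, T):
            rho *= pi_prob(actions[k], states[k]) / mu_prob(actions[k], states[k])
        # complete return G_t
        G = sum(gamma ** (k - t - 1) * rewards[k] for k in range(t + 1, T + 1))
        V[states[t]] += alpha * rho * (G - V[states[t]])
```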
Off-Policy Learning by Importance Sampling
In n-step methods, returns are constructed over n steps, so only the relative probability of those n actions is of interest.
Incorporate ρ_t^{t+n} (in place of ρ_t^T) into the TD update:
    ρ_t^{t+n} = ∏_{k=t}^{min(t+n, T−1)} π(A_k|S_k) / µ(A_k|S_k)
    V_{t+n}(S_t) ← V_{t+n−1}(S_t) + α ρ_t^{t+n} (G_t^{(n)} − V_{t+n−1}(S_t))   for 0 ≤ t < T
If any π(A_k|S_k) = 0, then ρ_t^{t+n} = 0 and the return is ignored entirely.
If any π(A_k|S_k) ≫ µ(A_k|S_k), then ρ_t^{t+n} increases the weight given to the return, which compensates for the action being rarely selected under µ.
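A sketch of one off-policy n-step TD state-value update from stored trajectory lists, under the same illustrative assumptions as the previous sketch (rewards[k] = R_k; `pi_prob`/`mu_prob` return action probabilities); the ratio is truncated at min(t+n, T−1) as in the formula above:

```python
def off_policy_n_step_td_update(V, states, actions, rewards, t, n, T,
                                alpha, gamma, pi_prob, mu_prob):
    """One update of V(S_t) using the n-step return and rho_t^{t+n}."""
    # rho_t^{t+n}: product of pi/mu over the actions actually taken
    rho = 1.0
    for k in range(t, min(t + n, T - 1) + 1):
        rho *= pi_prob(actions[k], states[k]) / mu_prob(actions[k], states[k])
    # n-step return G_t^{(n)}, bootstrapping on V(S_{t+n}) if the episode continues
    G = sum(gamma ** (k - t - 1) * rewards[k] for k in range(t + 1, min(t + n, T) + 1))
    if t + n < T:
        G += gamma ** n * V[states[t + n]]
    V[states[t]] += alpha * rho * (G - V[states[t]])
```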
Off-Policy Learning by Importance Sampling
For action values, the evaluation step replaces ρ_t^{t+n} with ρ_{t+1}^{t+n}: A_t is already determined, so it requires no further sampling correction.
    Q_{t+n}(S_t, A_t) ← Q_{t+n−1}(S_t, A_t) + α ρ_{t+1}^{t+n} (G_t^{(n)} − Q_{t+n−1}(S_t, A_t))   for 0 ≤ t < T
Expected Sarsa
ρ_{t+1}^{t+n−1} replaces ρ_{t+1}^{t+n}, because no sampling of A_{t+n} is needed: the expected value takes all actions on the (t + n)th step into account.
Off-Policy Learning by Importance Sampling
Importance sampling enables off-policy learning at the cost of increasing the variance of the updates.
It therefore requires smaller step sizes and is slower; some slowdown is inevitable because the data is less relevant to the target policy.
Improvements:
Autostep method (Mahmood et al., 2012)
Invariant updates (Karampatziakis and Langford, 2010)
Usage technique (Mahmood and Sutton, 2015)
Is off-policy learning possible without importance sampling?
Control Problem (Off-Policy)
Expected Sarsa (on-policy, one-step case):
    Q_{t+1}(S_t, A_t) ← Q_t(S_t, A_t) + α (R_{t+1} + γ E[Q_t(S_{t+1}, A_{t+1}) | S_{t+1}] − Q_t(S_t, A_t))
    Q_{t+1}(S_t, A_t) ← Q_t(S_t, A_t) + α (R_{t+1} + γ Σ_a π(a|S_{t+1}) Q_t(S_{t+1}, a) − Q_t(S_t, A_t))
Now use a different policy µ to generate behaviour.
The updated values are independent of µ(A_{t+1}|S_{t+1}).
If π is greedy, this is exactly the Q-learning method:
    π(a|S_{t+1}) = 1 if a = argmax_{a'} Q(S_{t+1}, a'), and 0 otherwise
    Q_{t+1}(S_t, A_t) ← Q_t(S_t, A_t) + α (R_{t+1} + γ max_a Q_t(S_{t+1}, a) − Q_t(S_t, A_t))
So it is possible to form off-policy methods without importance sampling.
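A sketch of the one-step expected update; when the target policy is greedy in Q, the expectation collapses to the max and the update is exactly Q-learning. Q is a dict keyed by (state, action); `pi_probs`, `greedy_target` and `action_space` are illustrative names:

```python
def expected_update(Q, s, a, r, s_next, pi_probs, alpha, gamma, done=False):
    """Q(s,a) <- Q(s,a) + alpha*(r + gamma * sum_a' pi(a'|s') Q(s',a') - Q(s,a)).
    pi_probs(s) returns {action: probability} under the target policy pi."""
    if done:
        target = r
    else:
        target = r + gamma * sum(p * Q[(s_next, b)] for b, p in pi_probs(s_next).items())
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def greedy_target(Q, action_space):
    """Greedy target policy: all probability on argmax_a Q(s, a).
    With this pi, the expectation reduces to max_a Q(s', a), i.e. Q-learning."""
    def pi_probs(s):
        best = max(action_space, key=lambda b: Q[(s, b)])
        return {best: 1.0}
    return pi_probs
```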
Off-Policy Learning Without Importance Sampling: The n-step Tree Backup Algorithm
Alternate the incorporation of expected values of future action-value estimates with corrections based on the steps actually taken, up to S_{t+n}:
    G_t^{(n)} = R_{t+1} + γ Σ_a π(a|S_{t+1}) Q_t(S_{t+1}, a)
              − γ π(A_{t+1}|S_{t+1}) Q_t(S_{t+1}, A_{t+1})
              + γ π(A_{t+1}|S_{t+1}) (R_{t+2} + γ Σ_a π(a|S_{t+2}) Q_{t+1}(S_{t+2}, a))
              − γ² π(A_{t+1}|S_{t+1}) π(A_{t+2}|S_{t+2}) Q_{t+1}(S_{t+2}, A_{t+2})
              + γ² π(A_{t+1}|S_{t+1}) π(A_{t+2}|S_{t+2}) (R_{t+3} + γ Σ_a π(a|S_{t+3}) Q_{t+2}(S_{t+3}, a))
              + ...
              + γ^{min(t+n,T)−t−1} (∏_{i=t+1}^{min(t+n,T)−1} π(A_i|S_i)) (R_{min(t+n,T)} + γ Σ_a π(a|S_{min(t+n,T)}) Q_{min(t+n,T)−1}(S_{min(t+n,T)}, a))
Off-Policy Learning Without Importance Sampling: The n-step Tree Backup Algorithm
Define the TD error δ_t to simplify notation:
    δ_t = R_{t+1} + γ Σ_a π(a|S_{t+1}) Q_t(S_{t+1}, a) − Q_{t−1}(S_t, A_t)
Then
    G_t^{(n)} = Q_{t−1}(S_t, A_t) + Σ_{k=t}^{min(t+n,T)−1} δ_k ∏_{i=t+1}^{k} γ π(A_i|S_i)
    Q_{t+n}(S_t, A_t) ← Q_{t+n−1}(S_t, A_t) + α (G_t^{(n)} − Q_{t+n−1}(S_t, A_t))
G_t^{(1)} is the target used by Expected Sarsa.
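A sketch of the n-step tree-backup return written with the TD errors δ_k, as in the summed form above, assuming a single Q table in place of the time-indexed Q_t and stored trajectory lists with rewards[k] = R_k; `pi_prob` and `action_space` are illustrative names:

```python
def tree_backup_return(Q, states, actions, rewards, t, n, T, gamma, pi_prob, action_space):
    """G_t^{(n)} = Q(S_t,A_t) + sum_{k=t}^{min(t+n,T)-1} delta_k * prod_{i=t+1}^{k} gamma*pi(A_i|S_i),
    with delta_k = R_{k+1} + gamma * sum_a pi(a|S_{k+1}) Q(S_{k+1}, a) - Q(S_k, A_k)."""
    def expected_q(s):
        return sum(pi_prob(a, s) * Q[(s, a)] for a in action_space)

    G = Q[(states[t], actions[t])]
    weight = 1.0                                  # running product of gamma * pi(A_i|S_i)
    for k in range(t, min(t + n, T)):
        bootstrap = expected_q(states[k + 1]) if k + 1 < T else 0.0
        delta = rewards[k + 1] + gamma * bootstrap - Q[(states[k], actions[k])]
        G += weight * delta
        if k + 1 < min(t + n, T):                 # extend the product for the next term
            weight *= gamma * pi_prob(actions[k + 1], states[k + 1])
    return G
```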
Off-Policy Learning Without Importance Sampling: The n-step Tree Backup Algorithm
The n-step Tree Backup algorithm is the natural extension of Q-learning to the multi-step case: like Q-learning, it requires no importance sampling.
However, if µ and π differ greatly, then π(A_{t+i}|S_{t+i}) may be small for some i, and bootstrapping may effectively span only a few steps even when n is large.
Conclusion
n-step bootstrapping looks ahead to the next n rewards, states and actions, generalizing Monte Carlo methods and one-step TD methods.
Advantages
An intermediate amount of bootstrapping often works better than either extreme.
Disadvantages
Requires a delay of n time steps before updating.
Requires more computation per time step.
Requires more memory to store variables from the last n time steps.
Methods covered: n-step TD policy evaluation; on-policy control with n-step Sarsa; off-policy control with importance sampling and with the n-step Tree Backup algorithm.