
1 Temporal-Difference Learning
Taras Kucherenko, Joonatan Manttari (KTH), March 7, 2017

2 Motivation: disadvantages of MC methods
MC does not work for scenarios without termination.
It updates only at the end of the episode (sometimes that is too late).
It cannot use bootstrapping.

3 Overview
1. Description of TD: introduction to TD, sample backup, TD error, advantages of TD.
2. Examples of TD algorithms: SARSA, Q-Learning, Expected SARSA, Double Q-Learning.
3. Extensions: disadvantages of TD, n-step TD and eligibility traces.

4 Introduction to TD
Temporal-Difference learning: learning after time steps, not entire episodes.
A central idea in RL.
A combination of Monte Carlo and DP:
no need for a model, learn from experience (MC);
updates based on estimates (DP).

5 Introduction to TD
Relation to other methods (figure).

6 TD Prediction
Consider the prediction problem: learn $v_\pi$.
For prediction, both TD and MC use experience: follow $\pi$ and use the experience to update the estimate $V$ of $v_\pi$.
MC methods update once the return is known (at the end of the episode).
TD methods update after one time step.

7 Introduction to TD
Remember the state-value function:
$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s] = \mathbb{E}_\pi\big[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \mid S_t = s\big] = \mathbb{E}_\pi\big[R_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^k R_{t+k+2} \mid S_t = s\big] = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]$
MC methods estimate $\mathbb{E}_\pi[G_t \mid S_t = s]$ using sampling.
DP methods estimate $\mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]$ by using the current estimate $V(S_{t+1})$ instead of the true value $v_\pi(S_{t+1})$.
TD methods use both ideas, sampling and updating from estimates, and so have the potential to combine the advantages of MC and DP.

8 Simple TD
A simple every-visit MC method:
$V(S_t) \leftarrow V(S_t) + \alpha\,[G_t - V(S_t)]$
where $G_t$ is the return following time $t$ and $\alpha$ is a step-size parameter.
Use the target and the current prediction for updates.
Update after the episode ends.
But what if we updated after each time step?

9 MC Refresher

10 Simple TD
A simple TD algorithm: TD(0)
$V(S_t) \leftarrow V(S_t) + \alpha\,[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]$
Update at each time step.
Bootstrapping like DP (uses estimates).
MC target: $G_t$. TD target: $R_{t+1} + \gamma V(S_{t+1})$.
TD(0) updates using only one time step.

11 TD Error
TD updates according to the target and the current estimate:
$V(S_t) \leftarrow V(S_t) + \alpha\,[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]$
$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is known as the TD error.
It is defined for time step $t$ but available only at $t+1$.
The error is the temporal difference between the estimates at $t$ and $t+1$.

12 Sample Backup
Backup diagram of TD(0): immediately updates the prediction.
Sample backup: used by MC and TD; uses $R_{t+1}$ and $V(s')$ of a single sampled successor state $s'$.
Full backup: uses the distribution over all possible successor states $s'$.

13 Tabular TD(0) algorithm for estimating $v_\pi$
Input: the policy $\pi$ to be evaluated.
Initialize $V(s)$ arbitrarily (e.g., $V(s) = 0$ for all $s \in \mathcal{S}^+$).
Repeat (for each episode):
  Initialize $S$
  Repeat (for each step of episode):
    $A \leftarrow$ action given by $\pi$ for $S$
    Take action $A$, observe $R$, $S'$
    $V(S) \leftarrow V(S) + \alpha\,[R + \gamma V(S') - V(S)]$
    $S \leftarrow S'$
  until $S$ is terminal
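
The pseudocode above translates almost line by line into Python. The sketch below is only an illustration, assuming a hypothetical tabular environment object env with reset() and step(action) returning (next_state, reward, done), and a policy function mapping states to actions; none of these names come from the slides.

    from collections import defaultdict

    def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=1.0):
        """Tabular TD(0) policy evaluation: estimate V close to v_pi for a fixed policy."""
        V = defaultdict(float)                     # V(s) = 0 for all states, including terminal
        for _ in range(num_episodes):
            s = env.reset()                        # Initialize S
            done = False
            while not done:                        # Repeat for each step of the episode
                a = policy(s)                      # A <- action given by pi for S
                s_next, r, done = env.step(a)      # Take action A, observe R, S'
                target = r + gamma * V[s_next] * (not done)   # value of a terminal state is 0
                V[s] += alpha * (target - V[s])    # TD(0) update
                s = s_next                         # S <- S'
        return V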

14 Example: Watching Netflix

15 Example: Watching Netflix
How many episodes of a Netflix series will you watch today?
State signal: time, day, plans, whether auto-play is activated.
Reward: number of episodes watched between states.
State value: expected episodes left to watch.

16 Example: Watching Netflix
Assume your day is as follows (for each state the slide lists the watched episodes, the predicted episodes to go, and the predicted total):
Start watching Sunday at 15:00. Plans: laundry and dinner.
Neighbor says the laundry machine is broken.
Friend calls with dinner plans.
Cliffhanger.

17 Example: Watching Netflix
Assume an every-visit MC method with $\alpha = 1$, $\gamma = 1$.
What would the update be for each state-value estimate?
Remember the update rule:
$V(S_t) \leftarrow V(S_t) + \alpha\,[G_t - V(S_t)]$

18 Example: Watching Netflix
Example of calculating the MC update for the broken laundry machine state:
$V(\text{broken}) \leftarrow V(\text{broken}) + \alpha\,[G_{\text{broken}} - V(\text{broken})]$, giving $V(\text{broken}) \leftarrow 4$.
New predicted total episodes to watch at the broken-laundry state:
$\text{Total}_{\text{broken}} = V(\text{broken}) + \text{episodes watched}(\text{broken}) = 4 + 2 = 6$

19 Example: Watching Netflix
Result of all the updates (figure).
Updates happen only after the day is complete.
But why not update at each new situation? (What if the internet goes down?)

20 Example: Watching Netflix
Comparison with TD(0): what would the update be for each state-value estimate?
Remember the update rule (with $\alpha = \gamma = 1$):
$V(S_t) \leftarrow V(S_t) + \alpha\,[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]$

21 Example: Watching Netflix
Example of calculating the TD(0) update for the broken laundry machine state:
$V(\text{broken}) \leftarrow V(\text{broken}) + \alpha\,[R_{t+1} + \gamma V(\text{dinner}) - V(\text{broken})]$, giving $V(\text{broken}) \leftarrow 2$.
New predicted total episodes to watch at the broken-laundry state:
$\text{Total}_{\text{broken}} = V(\text{broken}) + \text{episodes watched}(\text{broken}) = 2 + 2 = 4$

22 Example: Watching Netflix
Result of all the updates (figure).

23 Example: Watching Netflix
MC vs TD (figure): TD updates after each situation, and each update is proportional to the error with respect to the next time step.

24 Advantages of TD
Comparison with other methods.
Illustrative examples.
Convergence.

25 Comparison with other methods
Combines sampling (MC) with bootstrapping (DP).
More general than DP: it does not require a model of the environment.
Faster changes than in MC: no need to wait until the end of the episode.

26 Example 1: Random Walk

27 Example 1: Random Walk

28 Example 2: Life of a developer
Episodes (state, reward):
Code runs, 1
Code runs, 1
Code runs, 1
Code runs, 1
Code runs, -1
PC is on, 0; Code runs, -1

29 Convergence of TD
Definition: a sequence $X_n$ of random variables converges in probability to the random variable $X$ if for all $\epsilon > 0$: $\lim_{n \to \infty} \Pr(|X_n - X| \ge \epsilon) = 0$.
Theory: TD converges with probability 1 if the step size satisfies $\sum_{n=1}^{\infty} \alpha_n(a) = \infty$ and $\sum_{n=1}^{\infty} \alpha_n^2(a) < \infty$, where $\alpha_n(a)$ is the step size for action $a$, which we usually just call $\alpha$.
Practice: converges faster, and to a better solution, than MC.
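
As a concrete example (not from the slides), the common schedule $\alpha_n = 1/n$ satisfies both conditions, while a constant step size violates the second one:

    % Checking the two step-size conditions for two common schedules
    \alpha_n = \tfrac{1}{n}: \quad \sum_{n=1}^{\infty} \tfrac{1}{n} = \infty, \qquad \sum_{n=1}^{\infty} \tfrac{1}{n^{2}} = \tfrac{\pi^{2}}{6} < \infty
    \alpha_n = c \ \text{(constant)}: \quad \sum_{n=1}^{\infty} c = \infty, \qquad \sum_{n=1}^{\infty} c^{2} = \infty

A constant step size therefore falls outside the theorem, although it is often used in practice, especially for nonstationary problems.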

30 TD Methods for Control Problems
Use TD methods to find the optimal policy.
As before, exploration is achieved using on-policy or off-policy methods.
On-policy: SARSA. Off-policy: Q-learning.

31 On-policy & Off-policy
Refresher on on-policy and off-policy methods: for both, the goal is to update state-action values to find the optimal policy.
On-policy: update Q(s, a) from experience gathered while following the current policy.
Off-policy: update Q(s, a) from experience gathered while following another policy (the behavior policy).
Example: a new go-kart driver learning from other drivers.

32 SARSA
Uses the action-value function (not the state-value function).
Estimate $q_\pi(s, a)$ for all $(s, a)$ while following policy $\pi$.
Similar to TD(0): a Markov chain with a reward process.
Convergence is assured for $Q(S_t, A_t)$ by the same theorems.
Different transition scheme.

34 SARSA
Update rule for SARSA:
$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\,[R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)]$
Update after each transition.
The update depends on $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$.

35 SARSA
Update rule for SARSA:
$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\,[R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)]$
Update after each transition.
If $S_{t+1}$ is terminal, $Q(S_{t+1}, A_{t+1}) = 0$.
The update depends on $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$, hence the name SARSA.

36 SARSA
Backup diagram of SARSA (figure).

37 SARSA Algorithm
Initialize $Q(s, a)$ arbitrarily for all $s \in \mathcal{S}$, $a \in \mathcal{A}(s)$, and set $Q(\text{terminal-state}, \cdot) = 0$.
Repeat (for each episode):
  Initialize $S$
  Choose $A$ from $S$ using a policy derived from $Q$ (e.g., $\epsilon$-greedy)
  Repeat (for each step of episode):
    Take action $A$, observe $R$, $S'$
    Choose $A'$ from $S'$ using a policy derived from $Q$ (e.g., $\epsilon$-greedy)
    $Q(S, A) \leftarrow Q(S, A) + \alpha\,[R + \gamma Q(S', A') - Q(S, A)]$
    $S \leftarrow S'$, $A \leftarrow A'$
  until $S$ is terminal
On-policy, since we follow our own $\epsilon$-greedy policy to choose actions.
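
A minimal Python sketch of the SARSA algorithm above, under the same assumed environment interface as in the TD(0) sketch (a hypothetical env with reset() and step(), plus a finite list of actions); it is an illustration, not the authors' code.

    import random
    from collections import defaultdict

    def epsilon_greedy(Q, s, actions, eps):
        # With probability eps explore, otherwise act greedily w.r.t. the current Q
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    def sarsa(env, actions, num_episodes, alpha=0.5, gamma=1.0, eps=0.1):
        Q = defaultdict(float)                         # Q(s, a) = 0, including terminal states
        for _ in range(num_episodes):
            s = env.reset()
            a = epsilon_greedy(Q, s, actions, eps)     # Choose A from S (eps-greedy)
            done = False
            while not done:
                s2, r, done = env.step(a)              # Take action A, observe R, S'
                a2 = epsilon_greedy(Q, s2, actions, eps)   # Choose A' from S'
                target = r + gamma * Q[(s2, a2)] * (not done)
                Q[(s, a)] += alpha * (target - Q[(s, a)])  # SARSA update
                s, a = s2, a2                          # S <- S', A <- A'
        return Q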

38 SARSA
Convergence properties depend on how the policy $\pi$ is derived from $Q$.
SARSA converges to $\pi_*$ and $q_*$ with probability 1, given that:
all $(s, a)$ pairs are visited infinitely many times, and
the policy converges to the greedy policy (e.g., with $\epsilon = 1/t$).

39 SARSA Example: Windy Grid World
Wind shifts the position upward.
$\mathcal{A}$ = (up, down, left, right).
$R = -1$ for all $(s, a)$ until the goal is reached.
Apply $\epsilon$-greedy SARSA with $\epsilon = 0.1$, $\alpha = 0.5$, and initial $Q(s, a) = 0$ for all $(s, a)$.

40 SARSA Example: Windy Grid World
Results of applying $\epsilon$-greedy SARSA: the policy improves after each time step.
The optimal policy is reached after 6000 time steps.
MC methods could get stuck here; TD is more suitable because it learns during the episode.

41 Q-Learning (Off-Policy)
Update rule:
$Q(S, A) \leftarrow Q(S, A) + \alpha\,[R + \gamma \max_a Q(S', a) - Q(S, A)]$
Directly approximates $q_*$, independently of the policy being followed, because the update does not depend on that policy.
Convergence is guaranteed for any behavior policy, as long as:
all state-action pairs continue to be updated, and
the step size satisfies the usual stochastic-approximation conditions.

42 Q-Learning Algorithm
Initialize $Q(s, a)$ arbitrarily for all $s \in \mathcal{S}$, $a \in \mathcal{A}(s)$, and set $Q(\text{terminal-state}, \cdot) = 0$.
Repeat (for each episode):
  Initialize $S$
  Repeat (for each step of episode):
    Choose $A$ from $S$ using a policy derived from $Q$ (e.g., $\epsilon$-greedy)
    Take action $A$, observe $R$, $S'$
    $Q(S, A) \leftarrow Q(S, A) + \alpha\,[R + \gamma \max_a Q(S', a) - Q(S, A)]$
    $S \leftarrow S'$
  until $S$ is terminal
Off-policy, since we take the max for the update (the greedy policy) rather than the action actually chosen next.
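
The same sketch adapted for Q-learning: the only substantive change from the SARSA sketch is that the target uses $\max_a Q(S', a)$ instead of the action that will actually be taken (same assumed env/actions interface as before, illustrative only).

    import random
    from collections import defaultdict

    def q_learning(env, actions, num_episodes, alpha=0.5, gamma=1.0, eps=0.1):
        Q = defaultdict(float)                          # Q(s, a) = 0 for all pairs
        for _ in range(num_episodes):
            s = env.reset()
            done = False
            while not done:
                # Behavior policy: eps-greedy with respect to the current Q
                if random.random() < eps:
                    a = random.choice(actions)
                else:
                    a = max(actions, key=lambda act: Q[(s, act)])
                s2, r, done = env.step(a)
                best_next = max(Q[(s2, act)] for act in actions)   # max_a Q(S', a)
                target = r + gamma * best_next * (not done)
                Q[(s, a)] += alpha * (target - Q[(s, a)])          # off-policy update
                s = s2
        return Q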

43 Example: Cliff Walking

44 Expected SARSA
Update rule for SARSA:
$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\,[R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)]$
Update rule for Q-learning:
$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\,[R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)]$

45 Expected SARSA
Update rule for SARSA:
$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\,[R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)]$
Update rule for Q-learning:
$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\,[R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)]$
Update rule for Expected SARSA:
$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\,[R_{t+1} + \gamma\, \mathbb{E}[Q(S_{t+1}, A_{t+1}) \mid S_{t+1}] - Q(S_t, A_t)] = Q(S_t, A_t) + \alpha\,[R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1})\, Q(S_{t+1}, a) - Q(S_t, A_t)]$

46 Expected SARSA
Updates deterministically in the same direction as SARSA moves in expectation.
Eliminates the variance caused by randomly selecting $A_{t+1}$.
Computationally more expensive (a sum over $\mathcal{A}(S')$).
Performs better than SARSA, since it uses more information for its updates.
Performs well even with $\alpha = 1$ (a bad $\epsilon$-greedy draw is not a problem).
If applied off-policy with a greedy target policy $\pi$ and an exploratory behavior policy $\mu$, Expected SARSA becomes Q-learning, because then
$\sum_a \pi(a \mid S_{t+1})\, Q(S_{t+1}, a) = \max_a Q(S_{t+1}, a)$.
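
A small sketch of how the expected target can be computed for an $\epsilon$-greedy policy $\pi$ (an illustrative helper, using the same tabular Q/actions conventions as the earlier sketches); plugging it into the SARSA loop in place of the sampled $Q(S', A')$ target gives Expected SARSA.

    def expected_sarsa_target(Q, s_next, actions, r, gamma, eps, done):
        # Target: R + gamma * sum_a pi(a | S') Q(S', a), with pi the eps-greedy policy w.r.t. Q
        if done:
            return r                                   # terminal state has value 0
        q_vals = [Q[(s_next, a)] for a in actions]
        best = max(range(len(actions)), key=lambda i: q_vals[i])
        n = len(actions)
        probs = [eps / n + (1.0 - eps) * (i == best) for i in range(n)]   # eps-greedy probabilities
        expected_q = sum(p * q for p, q in zip(probs, q_vals))
        return r + gamma * expected_q

With eps = 0 the probabilities collapse onto the greedy action and the target reduces to the Q-learning target, matching the off-policy remark above.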

47 Expected SARSA vs Q-learning
Figure: backup diagrams.

48 Maximization bias
Reminder, the Q-learning update rule:
$Q(S, A) \leftarrow Q(S, A) + \alpha\,[R + \gamma \max_a Q(S', a) - Q(S, A)]$
We take a maximum over estimates, which can lead to very optimistic value estimates.
This problem is called maximization bias.
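
A tiny numerical illustration of the bias (not from the slides): if every action has true value 0 but we only have noisy estimates, the maximum of the estimates is systematically above 0.

    import random

    random.seed(0)
    n_actions, n_samples, n_runs = 10, 5, 10_000
    total = 0.0
    for _ in range(n_runs):
        # True value of every action is 0; each estimate is a noisy sample mean
        estimates = [sum(random.gauss(0.0, 1.0) for _ in range(n_samples)) / n_samples
                     for _ in range(n_actions)]
        total += max(estimates)                # the quantity Q-learning bootstraps from
    print("average of max estimate:", total / n_runs)   # clearly positive, not 0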

49 Example: making plans for the weekend

50 Maximization bias

51 Double Q-Learning
Learn two independent estimates, $Q_1(S, A)$ and $Q_2(S, A)$.
Use one to determine the maximizing action: $A^* = \arg\max_a Q_1(S, a)$.
Use the other to provide its value estimate: $Q_2(S, A^*) = Q_2(S, \arg\max_a Q_1(S, a))$.
The resulting algorithm learns an unbiased estimate: $\mathbb{E}[Q_2(S, A^*)] = q(S, A^*)$.
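
A sketch of a single Double Q-learning update step under the same tabular conventions as before (illustrative only); actions would typically be selected $\epsilon$-greedily with respect to $Q_1 + Q_2$.

    import random
    from collections import defaultdict

    def double_q_update(Q1, Q2, s, a, r, s2, actions, done, alpha=0.5, gamma=1.0):
        # With probability 0.5 update Q1 using Q2's value of Q1's greedy action, else the reverse
        if random.random() < 0.5:
            Q1, Q2 = Q2, Q1                     # swap roles; both tables are mutated in place
        if done:
            target = r
        else:
            a_star = max(actions, key=lambda act: Q1[(s2, act)])   # argmax from one estimate
            target = r + gamma * Q2[(s2, a_star)]                  # value from the other estimate
        Q1[(s, a)] += alpha * (target - Q1[(s, a)])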

52 Disadvantages of TD
Reminder, the Q-learning update rule:
$Q(S, A) \leftarrow Q(S, A) + \alpha\,[R + \gamma \max_a Q(S', a) - Q(S, A)]$
It suffers from maximization bias.
The learning is local: only the state immediately before the reward is updated.
With the usual initialization, $Q(S, A) = 0$ for all $S, A$, so the update reads
$Q(S, A) \leftarrow 0 + \alpha\,[R + 0 - 0]$,
and after the first episode only the last state has been updated.

53 n-step TD
TD and MC sit at opposite ends of the backup-depth spectrum.

54 n-step TD

55 n-step TD
Remember the 1-step TD update:
$V(S_t) \leftarrow V(S_t) + \alpha\,[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]$
What if we used the rewards from 2 transitions?
$R_{t+1} + \gamma R_{t+2} + \gamma^2 V(S_{t+2})$
Or the rewards from n transitions?
$R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})$
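
A small helper that computes that n-step target, as a sketch with illustrative argument names; the value update is then $V(S_t) \leftarrow V(S_t) + \alpha\,[G - V(S_t)]$ with this $G$ as the target.

    def n_step_return(rewards, V, s_n, n, gamma, terminal=False):
        """n-step target: R_{t+1} + gamma R_{t+2} + ... + gamma^(n-1) R_{t+n} + gamma^n V(S_{t+n}).

        rewards  -- the observed rewards [R_{t+1}, ..., R_{t+n}]
        s_n      -- the state S_{t+n} reached after n transitions
        terminal -- True if S_{t+n} is terminal, in which case its value counts as 0
        """
        g = 0.0
        for k, r in enumerate(rewards[:n]):
            g += (gamma ** k) * r                  # discounted sum of the first n rewards
        if not terminal:
            g += (gamma ** n) * V[s_n]             # bootstrap from the estimate at S_{t+n}
        return g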

56 n-step TD

57 n-step TD
Example of updates. Assume the following episode in a gridworld.
How would the simple 1-step TD(0) backup look after one episode?
What about the 3-step backup?

58 Backups after one episode: 1-step TD(0), 2-step TD(0), 4-step TD(0), MC (figure).

59 Eligibility Traces
n-step TD can update more values.
However, n transitions are required before an update can be made.
How can we use the return from n transitions without having to wait? Eligibility traces.

60 Eligibility Traces
Eligibility traces use a complex backup: an average of several n-step TD errors.
Example: a complex backup built from the 2-step and 4-step TD errors.

61 Eligibility Traces
Eligibility traces use a complex backup over n-step returns, weighted by $\lambda$ (the trace-decay parameter).
What is the result when $\lambda = 0$? What about $\lambda = 1$?
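
A sketch of that $\lambda$-weighted combination (the forward-view $\lambda$-return), assuming the n-step returns for the state have already been computed; with $\lambda = 0$ only the 1-step return survives, and with $\lambda = 1$ all the weight moves to the full return, which answers the question above.

    def lambda_return(n_step_returns, full_return, lam):
        """Forward-view lambda-return:
        (1 - lam) * sum_{n>=1} lam^(n-1) * G_t^(n)  +  lam^(T-t-1) * G_t.

        n_step_returns -- [G_t^(1), ..., G_t^(T-t-1)], the truncated n-step returns
        full_return    -- the complete return G_t (the Monte Carlo target)
        """
        g = 0.0
        for n, g_n in enumerate(n_step_returns, start=1):
            g += (1.0 - lam) * (lam ** (n - 1)) * g_n   # geometric weighting of n-step returns
        g += (lam ** len(n_step_returns)) * full_return # remaining weight on the full return
        return g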

62 Eligibility Traces
Second part: eligibility traces (figure).

65 Summary
TD works better than MC for prediction, because updates happen at every step rather than only at the end of the episode.
For control, TD is combined with ideas from DP: the common approach is Generalized Policy Iteration (GPI), an idea abstracted from Dynamic Programming.

67 References
Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. A Bradford Book.

68 TD Error
The Monte Carlo error can be written in terms of TD errors (exactly so when $V$ is held fixed during the episode):
$G_t - V(S_t) = R_{t+1} + \gamma G_{t+1} - V(S_t) + \gamma V(S_{t+1}) - \gamma V(S_{t+1})$
$\quad = \delta_t + \gamma\,(G_{t+1} - V(S_{t+1}))$
$\quad = \delta_t + \gamma \delta_{t+1} + \gamma^2\,(G_{t+2} - V(S_{t+2}))$
$\quad = \delta_t + \gamma \delta_{t+1} + \dots + \gamma^{T-t-1} \delta_{T-1} + \gamma^{T-t}\,(G_T - V(S_T))$
$\quad = \delta_t + \gamma \delta_{t+1} + \dots + \gamma^{T-t-1} \delta_{T-1} + \gamma^{T-t}\,(0 - 0)$
$\quad = \sum_{k=0}^{T-t-1} \gamma^k \delta_{t+k}$   (1)
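
A quick numerical check of identity (1) on an arbitrary made-up episode (values chosen only for illustration), keeping $V$ fixed throughout:

    gamma = 0.9
    rewards = [1.0, -2.0, 0.5, 3.0]            # R_{t+1}, ..., R_T for one episode
    values  = [0.2, -0.1, 0.4, 0.3, 0.0]       # V(S_t), ..., V(S_T); terminal value is 0

    # Left-hand side: the Monte Carlo error G_t - V(S_t) at t = 0
    G = sum(gamma ** k * r for k, r in enumerate(rewards))
    mc_error = G - values[0]

    # Right-hand side: discounted sum of TD errors delta_k = R_{k+1} + gamma V(S_{k+1}) - V(S_k)
    deltas = [rewards[k] + gamma * values[k + 1] - values[k] for k in range(len(rewards))]
    td_sum = sum(gamma ** k * d for k, d in enumerate(deltas))

    assert abs(mc_error - td_sum) < 1e-12      # the two sides agree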
