Chapter 6: Temporal Difference Learning


Chapter 6: Temporal Difference Learning
Objectives of this chapter:
- Introduce Temporal Difference (TD) learning
- Focus first on policy evaluation, or prediction, methods
- Then extend to control methods by following the idea of Generalized Policy Iteration (GPI)
(Slides: R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction)

Two pieces of motivation: I
The neurotransmitter dopamine seems to implement something very similar to the so-called temporal-difference error that gives TD methods their name.

Two pieces of motivation: II
[Figure: Extended Data Figure 1 from a Nature research letter (February 2015), a two-dimensional t-SNE embedding.]

6.1 TD Prediction
Policy evaluation (the prediction problem): for a given policy π, compute the state-value function v_π.
Recall the simple every-visit Monte Carlo method:
    V(S_t) ← V(S_t) + α [G_t − V(S_t)]        (target: the actual return after time t)
The simplest TD method, TD(0):
    V(S_t) ← V(S_t) + α [R_{t+1} + γ V(S_{t+1}) − V(S_t)]        (target: an estimate of the return)
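To make the contrast concrete, here is a minimal sketch of the two updates in Python (illustrative only; the dictionary V and the function names are assumptions, not part of the slides).

```python
# Sketch of the two tabular updates. V is a dict mapping states to value estimates.

def mc_update(V, s_t, G_t, alpha):
    """Constant-alpha every-visit Monte Carlo: move V(S_t) toward the actual return G_t."""
    V[s_t] += alpha * (G_t - V[s_t])

def td0_update(V, s_t, r_next, s_next, alpha, gamma):
    """TD(0): move V(S_t) toward the bootstrapped target R_{t+1} + gamma * V(S_{t+1})."""
    target = r_next + gamma * V[s_next]
    V[s_t] += alpha * (target - V[s_t])
```

The only difference is the target: MC waits for the actual return G_t, while TD(0) forms its target from the immediate reward and the current estimate of the next state's value.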

Simple Monte Carlo
    V(S_t) ← V(S_t) + α [G_t − V(S_t)]
where G_t is the actual return following state S_t.
[Backup diagram: S_t down to the terminal state.]

Simplest TD Method
    V(S_t) ← V(S_t) + α [R_{t+1} + γ V(S_{t+1}) − V(S_t)]
[Backup diagram: S_t, R_{t+1}, S_{t+1}.]

cf. Dynamic Programming
    V(S_t) ← E_π[ R_{t+1} + γ V(S_{t+1}) ]
[Backup diagram: S_t branching over all possible actions, rewards R_{t+1}, and successor states S_{t+1}.]

TD methods bootstrap and sample
Bootstrapping: the update involves an estimate
- MC does not bootstrap
- DP bootstraps
- TD bootstraps
Sampling: the update does not involve an expected value
- MC samples
- DP does not sample
- TD samples

    v_π(s) = E_π[ G_t | S_t = s ]
           = E_π[ Σ_{k=0}^∞ γ^k R_{t+k+1} | S_t = s ]
           = E_π[ R_{t+1} + γ Σ_{k=0}^∞ γ^k R_{t+k+2} | S_t = s ]
           = E_π[ R_{t+1} + γ v_π(S_{t+1}) | S_t = s ].
Monte Carlo uses an estimate of the first expression (the full return G_t) as its target; Dynamic Programming uses an estimate of the last expression, bootstrapping from the current estimate of v_π(S_{t+1}).

Tabular TD(0) for estimating v_π
Input: the policy π to be evaluated
Initialize V(s) arbitrarily (e.g., V(s) = 0, for all s ∈ S⁺)
Repeat (for each episode):
    Initialize S
    Repeat (for each step of episode):
        A ← action given by π for S
        Take action A, observe R, S′
        V(S) ← V(S) + α [R + γ V(S′) − V(S)]
        S ← S′
    until S is terminal
[TD(0) backup diagram.]
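A runnable sketch of the boxed algorithm follows, assuming a hypothetical episodic environment with reset() and step(action) methods returning (next_state, reward, done); these interface names are illustrative, not part of the pseudocode.

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=1.0):
    """Tabular TD(0) policy evaluation (sketch of the boxed algorithm above)."""
    V = defaultdict(float)                     # V(s) = 0 for all s, including terminals
    for _ in range(num_episodes):
        s = env.reset()                        # Initialize S
        done = False
        while not done:                        # Repeat for each step of the episode
            a = policy(s)                      # A <- action given by pi for S
            s_next, r, done = env.step(a)      # Take action A, observe R, S'
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (target - V[s])    # TD(0) update
            s = s_next                         # S <- S'
    return dict(V)
```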

Example: Driving Home

State                        Elapsed Time (minutes)   Predicted Time to Go   Predicted Total Time
leaving office, friday at 6            0                      30                     30
reach car, raining                     5                      35                     40
exiting highway                       20                      15                     35
2ndary road, behind truck             30                      10                     40
entering home street                  40                       3                     43
arrive home                           43                       0                     43

Driving Home
[Figure: predicted total travel time at each situation (leaving office, reach car, exiting highway, 2ndary road, home street, arrive home), with the actual outcome shown as a reference line; one panel shows changes recommended by Monte Carlo methods (α = 1), the other changes recommended by TD methods (α = 1).]
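The two sets of recommended changes can be read off the table directly. The sketch below (illustrative, using the predicted-total-time column from the table above) prints, for each state visited on the drive, the change recommended by a Monte Carlo update (toward the actual outcome) and by a TD update (toward the next prediction), both with α = 1 and γ = 1.

```python
# Changes recommended by MC vs. TD for the driving-home example (alpha = 1, gamma = 1).
# Predictions are expressed as predicted total travel time, as in the figure.

states = ["leaving office", "reach car", "exiting highway",
          "2ndary road", "home street", "arrive home"]
predicted_total = [30, 40, 35, 40, 43, 43]    # from the table above
actual_outcome = predicted_total[-1]          # total time actually taken

for i, s in enumerate(states[:-1]):
    mc_change = actual_outcome - predicted_total[i]          # shift toward the final outcome
    td_change = predicted_total[i + 1] - predicted_total[i]  # shift toward the next prediction
    print(f"{s:16}  MC change: {mc_change:+3d}   TD change: {td_change:+3d}")
```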

6.2 Advantages of TD Prediction Methods
- TD methods do not require a model of the environment, only experience
- TD, but not MC, methods can be fully incremental:
  - You can learn before knowing the final outcome (less memory, less peak computation)
  - You can learn without the final outcome (from incomplete sequences)
- Both MC and TD converge (under certain assumptions to be detailed later), but which is faster?

Random Walk Example
A five-state random walk (states A, B, C, D, E), starting in the center state; each episode terminates off either end, with reward +1 for terminating on the right and 0 otherwise.
[Figure: values learned by TD(0) after various numbers of episodes.]

TD and MC on the Random Walk
[Figure: learning curves for TD(0) and MC; data averaged over 100 sequences of episodes.]
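A small self-contained sketch of this comparison, assuming the five-state walk described above with values initialized to 0.5, γ = 1, and true values 1/6 through 5/6; the function names and the RMS-error measure over the five states are choices made here, not prescribed by the slide.

```python
import random

TRUE_V = [i / 6 for i in range(1, 6)]         # true values of states A..E

def run_episode():
    """One random-walk episode; returns a list of (state_index, reward) pairs."""
    s, traj = 2, []                           # start in the center state C
    while True:
        s_next = s + random.choice((-1, 1))
        if s_next == 5:                       # terminate off the right end: reward +1
            traj.append((s, 1.0)); return traj
        if s_next == -1:                      # terminate off the left end: reward 0
            traj.append((s, 0.0)); return traj
        traj.append((s, 0.0)); s = s_next

def rms_error(V):
    return (sum((v - t) ** 2 for v, t in zip(V, TRUE_V)) / len(V)) ** 0.5

def td0_run(episodes, alpha):
    """TD(0) on the random walk; returns the RMS error after each episode."""
    V, errs = [0.5] * 5, []
    for _ in range(episodes):
        traj = run_episode()
        for i, (s, r) in enumerate(traj):     # forward sweep through the episode = online TD(0)
            v_next = V[traj[i + 1][0]] if i + 1 < len(traj) else 0.0
            V[s] += alpha * (r + v_next - V[s])
        errs.append(rms_error(V))
    return errs

def mc_run(episodes, alpha):
    """Constant-alpha every-visit MC on the random walk; returns RMS error per episode."""
    V, errs = [0.5] * 5, []
    for _ in range(episodes):
        traj = run_episode()
        G = sum(r for _, r in traj)           # with gamma = 1 the return is the same from every step
        for s, _ in traj:
            V[s] += alpha * (G - V[s])
        errs.append(rms_error(V))
    return errs
```

Averaging td0_run and mc_run over many independent runs (the figure uses 100) and plotting the error curves for a few step sizes reproduces the comparison shown in the figure.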
