Reinforcement Learning


1 Reinforcement Learning: n-step bootstrapping
Daniel Hennes, University of Stuttgart, IPVS, Machine Learning & Robotics

2 n-step bootstrapping
Unifying Monte Carlo and TD
n-step TD
n-step Sarsa
Tree backup

3 n-step TD prediction

4 n-step returns
Monte Carlo: $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots + \gamma^{T-t-1} R_T$
TD (1-step return): $G_{t:t+1} = R_{t+1} + \gamma V_t(S_{t+1})$
2-step return: $G_{t:t+2} = R_{t+1} + \gamma R_{t+2} + \gamma^2 V_{t+1}(S_{t+2})$
n-step return: $G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n V_{t+n-1}(S_{t+n})$
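To make the definition concrete, here is a minimal Python sketch (not from the slides; the function name and data layout are illustrative assumptions) that computes $G_{t:t+n}$ from a list of observed rewards and a bootstrap value:

```python
def n_step_return(rewards, v_bootstrap, n, gamma):
    """Compute G_{t:t+n} = R_{t+1} + ... + gamma^{n-1} R_{t+n} + gamma^n V(S_{t+n}).

    rewards: [R_{t+1}, ..., R_{t+n}] (shorter if the episode terminated earlier)
    v_bootstrap: V_{t+n-1}(S_{t+n}); only used if the episode did not terminate
    """
    G = sum(gamma ** k * r for k, r in enumerate(rewards))
    if len(rewards) == n:           # S_{t+n} is not terminal: bootstrap from V
        G += gamma ** n * v_bootstrap
    return G
```

For example, n_step_return([0.0, 0.0, 1.0], v_bootstrap=0.0, n=3, gamma=0.9) gives 0.9^2 * 1.0 = 0.81; with only two rewards the episode has ended and no bootstrap term is added.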

5 Error-reduction property
The n-step return satisfies the error-reduction property:
$\max_s \big| \mathbb{E}_\pi[G_{t:t+n} \mid S_t = s] - v_\pi(s) \big| \le \gamma^n \max_s \big| V_{t+n-1}(s) - v_\pi(s) \big|$
(left side: maximum error using the n-step return; right side: maximum error using $V_{t+n-1}$),
where $G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n V_{t+n-1}(S_{t+n})$.
Using this property, one can show that n-step methods converge.
It generalizes the 1-step case:
$\max_s \big| \mathbb{E}_\pi[R_{t+1} + \gamma V(S_{t+1}) \mid S_t = s] - v_\pi(s) \big| \le \gamma \max_s \big| V(s) - v_\pi(s) \big|$
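Why the bound holds (a one-line sketch, not spelled out on the slide): expanding $v_\pi(s)$ for $n$ steps with the Bellman equation makes the reward terms cancel, leaving only the discounted value error at $S_{t+n}$:

```latex
\begin{aligned}
\mathbb{E}_\pi[G_{t:t+n} \mid S_t = s] - v_\pi(s)
  &= \gamma^n \,\mathbb{E}_\pi\!\left[ V_{t+n-1}(S_{t+n}) - v_\pi(S_{t+n}) \mid S_t = s \right],\\
\left| \mathbb{E}_\pi[G_{t:t+n} \mid S_t = s] - v_\pi(s) \right|
  &\le \gamma^n \max_{s'} \left| V_{t+n-1}(s') - v_\pi(s') \right|.
\end{aligned}
```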

6 n-step TD
n-step return: $G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n V_{t+n-1}(S_{t+n})$
It is not available until time $t + n$, so the natural algorithm is to wait until then.
n-step TD update:
$V_{t+n}(S_t) = V_{t+n-1}(S_t) + \alpha \left[ G_{t:t+n} - V_{t+n-1}(S_t) \right]$

7 n-step TD algorithm:
Initialize $V(s)$ arbitrarily, for all $s \in S$
for each episode do
    Initialize and store $S_0 \neq$ terminal; $T \leftarrow \infty$
    repeat for $t = 0, 1, 2, \dots$
        if $t < T$ then
            Take an action according to $\pi(\cdot \mid S_t)$
            Observe and store next reward $R_{t+1}$ and state $S_{t+1}$
            if $S_{t+1}$ is terminal then $T \leftarrow t + 1$
        end if
        $\tau \leftarrow t - n + 1$   ($\tau$ is the time whose state's estimate is updated)
        if $\tau \ge 0$ then
            $G \leftarrow \sum_{i=\tau+1}^{\min(\tau+n,\,T)} \gamma^{i-\tau-1} R_i$
            if $\tau + n < T$ then $G \leftarrow G + \gamma^n V(S_{\tau+n})$
            $V(S_\tau) \leftarrow V(S_\tau) + \alpha \left[ G - V(S_\tau) \right]$
        end if
    until $\tau = T - 1$
end for
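The pseudocode translates almost line for line into Python. Below is a runnable sketch under assumed interfaces, none of which are prescribed by the slides: discrete states are integer indices, env.reset() returns the start state, env.step(a) returns (next_state, reward, done), and pi(s) samples an action.

```python
import numpy as np

def n_step_td(env, pi, n, alpha, gamma, num_episodes, num_states):
    """n-step TD prediction of v_pi (env and pi are assumed interfaces)."""
    V = np.zeros(num_states)
    for _ in range(num_episodes):
        states, rewards = [env.reset()], [0.0]   # rewards[i] holds R_i
        T, t = float('inf'), 0
        while True:
            if t < T:
                s_next, r, done = env.step(pi(states[t]))
                states.append(s_next)
                rewards.append(r)
                if done:
                    T = t + 1
            tau = t - n + 1                      # time whose estimate is updated
            if tau >= 0:
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:                  # S_{tau+n} not terminal: bootstrap
                    G += gamma ** n * V[states[tau + n]]
                V[states[tau]] += alpha * (G - V[states[tau]])
            if tau == T - 1:
                break
            t += 1
    return V
```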

8 Random walk example
[Figure: five-state random walk with states A, B, C, D, E; episodes start in C; all rewards are 0 except a reward of 1 for exiting right past E.]
Suppose the first episode progressed directly from C to the right, through D and E.
How does 2-step TD work here? How about 3-step TD?
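To experiment with these questions, one can run the n_step_td sketch above on a minimal version of this walk (again a hypothetical environment, with actions -1 for left and +1 for right):

```python
import random

class RandomWalk:
    """States 0..4 (A..E); start in C=2; reward 1 on exiting right, else 0."""
    def reset(self):
        self.s = 2
        return self.s
    def step(self, a):                 # a = -1 (left) or +1 (right)
        self.s += a
        if self.s < 0:
            return 0, 0.0, True        # left exit; terminal index never bootstrapped
        if self.s > 4:
            return 4, 1.0, True        # right exit with reward 1
        return self.s, 0.0, False

pi = lambda s: random.choice([-1, 1]) # equiprobable random policy
V = n_step_td(RandomWalk(), pi, n=2, alpha=0.1, gamma=1.0,
              num_episodes=100, num_states=5)
```

With values initialized to zero, the first all-right episode leaves 1-step TD changing only V(E); 2-step TD also moves V(D), and 3-step TD additionally moves V(C) toward the reward.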

9 19-state random walk
[Figure: average RMS error over 19 states and the first 10 episodes, plotted against the step size α, with one curve per n = 1, 2, 4, 8, 16, 32, 64.]
An intermediate α is best.
An intermediate n is best.
Is there an optimal n? For every task?
For larger n, a smaller α seems best. Why?

10 n-step Sarsa
n-step return with action values:
$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n Q_{t+n-1}(S_{t+n}, A_{t+n})$
n-step Sarsa update:
$Q_{t+n}(S_t, A_t) = Q_{t+n-1}(S_t, A_t) + \alpha \left[ G_{t:t+n} - Q_{t+n-1}(S_t, A_t) \right]$
n-step Expected Sarsa: same update, slightly different n-step return:
$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n \sum_a \pi(a \mid S_{t+n}) Q_{t+n-1}(S_{t+n}, a)$
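A sketch of the corresponding return computation (assumed layout, not from the slides: Q is a 2-D array indexed [state, action], rewards holds R_{t+1}..R_{t+n}); the update itself is then Q[S_t, A_t] += alpha * (G - Q[S_t, A_t]):

```python
import numpy as np

def n_step_sarsa_return(rewards, Q, s_n, a_n, n, gamma, pi_probs=None):
    """G_{t:t+n} with an action-value bootstrap.

    Pass pi_probs (action probabilities at S_{t+n}) to get the
    Expected-Sarsa variant instead of sampling A_{t+n}."""
    G = sum(gamma ** k * r for k, r in enumerate(rewards))
    if len(rewards) == n:                        # episode did not end: bootstrap
        if pi_probs is None:
            G += gamma ** n * Q[s_n, a_n]        # Sarsa: sampled A_{t+n}
        else:
            G += gamma ** n * np.dot(pi_probs, Q[s_n])   # Expected Sarsa
    return G
```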

11 n-step Sarsa

12 n-step Sarsa example

13 n-step off-policy learning
Recall the importance sampling ratio:
$\rho_{t:h} = \prod_{k=t}^{\min(h,\,T-1)} \frac{\pi(A_k \mid S_k)}{\mu(A_k \mid S_k)}$
Off-policy methods weight updates by this ratio.
Off-policy n-step TD:
$V_{t+n}(S_t) = V_{t+n-1}(S_t) + \alpha \rho_{t:t+n-1} \left[ G_{t:t+n} - V_{t+n-1}(S_t) \right]$
Off-policy n-step Sarsa:
$Q_{t+n}(S_t, A_t) = Q_{t+n-1}(S_t, A_t) + \alpha \rho_{t+1:t+n-1} \left[ G_{t:t+n} - Q_{t+n-1}(S_t, A_t) \right]$
Off-policy n-step Expected Sarsa:
$Q_{t+n}(S_t, A_t) = Q_{t+n-1}(S_t, A_t) + \alpha \rho_{t+1:t+n-2} \left[ G_{t:t+n} - Q_{t+n-1}(S_t, A_t) \right]$,
with $G_{t:t+n} = R_{t+1} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n \sum_a \pi(a \mid S_{t+n}) Q_{t+n-1}(S_{t+n}, a)$
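A sketch of the ratio under assumed interfaces (pi and mu are callables returning action probabilities; trajectory arrays are indexed by absolute time); the off-policy update then just scales the TD error:

```python
def importance_ratio(pi, mu, states, actions, t, h, T):
    """rho_{t:h} = prod_{k=t}^{min(h, T-1)} pi(A_k|S_k) / mu(A_k|S_k)."""
    rho = 1.0
    for k in range(t, min(h, T - 1) + 1):
        rho *= pi(actions[k], states[k]) / mu(actions[k], states[k])
    return rho

# Off-policy n-step TD update for state S_tau, with G computed as before:
# V[S_tau] += alpha * importance_ratio(pi, mu, states, actions,
#                                      tau, tau + n - 1, T) * (G - V[S_tau])
```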

14 Tree Backup
Off-policy learning without importance sampling.
Q-learning and Expected Sarsa already do this in the one-step case.
Tree Backup: update from the estimated action values of the leaf nodes.

15 n-step Tree Backup algorithm
1-step tree backup (this is Expected Sarsa):
$G_{t:t+1} = R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1}) Q_t(S_{t+1}, a)$
2-step tree backup:
$G_{t:t+2} = R_{t+1} + \gamma \sum_{a \neq A_{t+1}} \pi(a \mid S_{t+1}) Q_{t+1}(S_{t+1}, a) + \gamma \pi(A_{t+1} \mid S_{t+1}) \big( R_{t+2} + \gamma \sum_a \pi(a \mid S_{t+2}) Q_{t+1}(S_{t+2}, a) \big)$
$\phantom{G_{t:t+2}} = R_{t+1} + \gamma \sum_{a \neq A_{t+1}} \pi(a \mid S_{t+1}) Q_{t+1}(S_{t+1}, a) + \gamma \pi(A_{t+1} \mid S_{t+1}) G_{t+1:t+2}$
n-step tree backup (recursive):
$G_{t:t+n} = R_{t+1} + \gamma \sum_{a \neq A_{t+1}} \pi(a \mid S_{t+1}) Q_{t+n-1}(S_{t+1}, a) + \gamma \pi(A_{t+1} \mid S_{t+1}) G_{t+1:t+n}$
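The recursive form maps directly to code. A sketch under the same assumed layout as before (trajectory arrays indexed by absolute time, Q a 2-D array, pi(s) returning the action-probability vector):

```python
def tree_backup_return(rewards, states, actions, Q, pi, t, n, T, gamma):
    """Recursive G_{t:t+n} for n-step tree backup.

    rewards[i] = R_i, states[i] = S_i, actions[i] = A_i (assumed layout)."""
    if t + 1 == T:                           # backup ends at the terminal step
        return rewards[t + 1]
    s1, a1 = states[t + 1], actions[t + 1]
    probs = pi(s1)
    expected = sum(probs[a] * Q[s1, a]       # leaves: actions not taken
                   for a in range(len(probs)) if a != a1)
    if n == 1:                               # 1-step case: full expectation
        expected += probs[a1] * Q[s1, a1]
        return rewards[t + 1] + gamma * expected
    return rewards[t + 1] + gamma * expected \
           + gamma * probs[a1] * tree_backup_return(rewards, states, actions,
                                                    Q, pi, t + 1, n - 1, T, gamma)
```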

16 A unifying algorithm: n-step Q(σ)
Choose, on a state-by-state basis, whether to sample ($\sigma_t = 1$) or take the expectation ($\sigma_t = 0$).
On the sampled transitions (those marked with $\rho$), importance sampling is required in the off-policy case.

17 Summary
n-step bootstrapping generalizes TD and MC learning methods:
  n = 1 is TD
  n = ∞ is MC
  an intermediate n is often better than either extreme
  it applies to both continuing and episodic domains
Additional cost in computation and memory:
  we need to remember the last n states
  learning is delayed by n steps
  per-step computation is small (like TD)
Everything generalizes nicely:
  error-reduction theory
  Sarsa, off-policy learning by importance sampling, Expected Sarsa, Tree Backup
  the very general n-step Q(σ) algorithm includes everything
