10703 Deep Reinforcement Learning and Control

Drusilla Bradley
6 years ago

1 10703 Deep Reinforcement Learning and Control Russ Salakhutdinov Machine Learning Department Temporal Difference Learning

2 Used Materials Disclaimer: Much of the material and slides for this lecture were borrowed from Rich Sutton's class and David Silver's class on Reinforcement Learning.

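3 MC and TD Learning
Goal: learn from episodes of experience under policy π
Incremental every-visit Monte-Carlo:
- Update value $V(S_t)$ toward actual return $G_t$:
  $V(S_t) \leftarrow V(S_t) + \alpha \left[ G_t - V(S_t) \right]$
Simplest Temporal-Difference learning algorithm: TD(0)
- Update value $V(S_t)$ toward estimated return $R_{t+1} + \gamma V(S_{t+1})$:
  $V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]$
- $R_{t+1} + \gamma V(S_{t+1})$ is called the TD target
- $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is called the TD error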
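The two updates can be sketched side by side in a few lines of Python. In the Monte-Carlo update, V(S_t) moves a fraction α toward the return G_t; in TD(0), it moves a fraction α toward R_{t+1} + γ V(S_{t+1}). This is a minimal tabular sketch, not code from the lecture: the dictionary-based value table, the function names, and the numbers are all assumptions chosen for illustration.

```python
def mc_update(V, state, G, alpha):
    """Incremental every-visit MC: move V[state] toward the observed return G."""
    V[state] += alpha * (G - V[state])


def td0_update(V, state, reward, next_state, alpha, gamma):
    """TD(0): move V[state] toward the TD target and return the TD error."""
    td_target = reward + gamma * V[next_state]
    td_error = td_target - V[state]
    V[state] += alpha * td_error
    return td_error


# Hypothetical two-state value table used only to exercise the functions.
V_mc = {"s": 0.0, "s_next": 2.0}
V_td = dict(V_mc)

mc_update(V_mc, "s", G=5.0, alpha=0.5)
delta = td0_update(V_td, "s", reward=1.0, next_state="s_next", alpha=0.5, gamma=0.9)
print(V_mc["s"], V_td["s"], delta)
```

The MC update needs the full return G, so it can only run once the episode has ended; the TD(0) update needs just one transition.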

4 DP vs. MC vs. TD Learning
Remember:
- MC: sample average return approximates expectation
- DP: the expected values are provided by a model, but we use a current estimate $V(S_{t+1})$ of the true $v_\pi(S_{t+1})$
- TD: combine both: sample expected values and use a current estimate $V(S_{t+1})$ of the true $v_\pi(S_{t+1})$

5 Dynamic Programming
$V(S_t) \leftarrow \mathbb{E}_\pi\left[ R_{t+1} + \gamma V(S_{t+1}) \right] = \sum_a \pi(a \mid S_t) \sum_{s', r} p(s', r \mid S_t, a) \left[ r + \gamma V(s') \right]$
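A small sketch may help show what the expected update computes: a policy-weighted sum over actions of a model-weighted sum over outcomes. The dictionary layout for the policy and the model, the function name `expected_backup`, and the toy numbers are assumptions made for this illustration only.

```python
def expected_backup(V, state, policy, model, gamma):
    """One DP expected update for a single state.

    policy[state] maps action -> probability pi(a | state).
    model[(state, action)] is a list of (probability, next_state, reward) triples.
    """
    total = 0.0
    for action, pi_a in policy[state].items():
        for prob, next_state, reward in model[(state, action)]:
            total += pi_a * prob * (reward + gamma * V[next_state])
    return total


# Hypothetical one-state problem with a terminal state "end".
policy = {"s": {"left": 0.5, "right": 0.5}}
model = {
    ("s", "left"): [(1.0, "end", 0.0)],
    ("s", "right"): [(0.8, "end", 1.0), (0.2, "s", 0.0)],
}
V = {"s": 0.0, "end": 0.0}

new_value = expected_backup(V, "s", policy, model, gamma=0.9)
print(new_value)
```

Unlike MC and TD, nothing here is sampled: every action and every outcome contributes in proportion to its probability, which is why the model p is required.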

6 Monte Carlo

7 Simplest TD(0) Method

8 TD Methods Bootstrap and Sample
Bootstrapping: update involves an estimate
- MC does not bootstrap
- DP bootstraps
- TD bootstraps
Sampling: update does not involve an expected value
- MC samples
- DP does not sample
- TD samples

9 TD Prediction
Policy Evaluation (the prediction problem):
- for a given policy π, compute the state-value function $v_\pi$
Remember: Simple every-visit Monte Carlo method:
  $V(S_t) \leftarrow V(S_t) + \alpha \left[ G_t - V(S_t) \right]$
  target: the actual return after time t
The simplest Temporal-Difference method TD(0):
  $V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]$
  target: an estimate of the return

10 Example: Driving Home
Table columns: State | Elapsed Time (minutes) | Predicted Time to Go | Predicted Total Time
States, in order:
- leaving office, friday at
- reach car, raining
- exiting highway
- secondary road, behind truck
- entering home street
- arrive home

11 Example: Driving Home Changes recommended by Monte Carlo methods (α=1) Changes recommended by TD methods (α=1)

12 Advantages of TD Learning
TD methods do not require a model of the environment, only experience
TD, but not MC, methods can be fully incremental
You can learn before knowing the final outcome
- Less memory
- Less computation
You can learn without the final outcome
- From incomplete sequences
Both MC and TD converge (under certain assumptions to be detailed later), but which is faster?

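13 Bias-Variance Trade-Off
Monte-Carlo: Update value $V(S_t)$ toward actual return $G_t$
- Return $G_t = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-t-1} R_T$ is an unbiased estimate of $v_\pi(S_t)$
TD: Update value $V(S_t)$ toward estimated return $R_{t+1} + \gamma V(S_{t+1})$
- True TD target $R_{t+1} + \gamma v_\pi(S_{t+1})$ is an unbiased estimate of $v_\pi(S_t)$
- TD target $R_{t+1} + \gamma V(S_{t+1})$ is a biased estimate of $v_\pi(S_t)$
TD target has much lower variance than the return:
- Return depends on many random actions, transitions, rewards
- TD target depends on one random action, transition, reward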
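The variance claim can be checked numerically on a toy problem. The sketch below assumes a hypothetical 10-step chain in which every reward is an independent standard Gaussian and γ = 1, so the true value of every state is 0. It compares the sample variance of the full return with that of the true TD target (the unbiased variant, which uses the true value of the next state). The chain, the sample size, and the seed are illustration choices, not part of the lecture.

```python
import random
import statistics


def sample_targets(n_steps, n_samples, seed=0):
    """Sample MC returns and true TD targets from the start state of the chain."""
    rng = random.Random(seed)
    true_value_next = 0.0  # rewards have zero mean, so v_pi is 0 everywhere
    returns, td_targets = [], []
    for _ in range(n_samples):
        rewards = [rng.gauss(0.0, 1.0) for _ in range(n_steps)]
        returns.append(sum(rewards))                      # G_t with gamma = 1
        td_targets.append(rewards[0] + true_value_next)   # R_{t+1} + v_pi(S_{t+1})
    return returns, td_targets


returns, td_targets = sample_targets(n_steps=10, n_samples=2000)
var_mc = statistics.pvariance(returns)
var_td = statistics.pvariance(td_targets)
print(f"variance of return: {var_mc:.2f}, variance of TD target: {var_td:.2f}")
```

The return sums ten independent rewards while the TD target contains only one, so its variance should come out roughly ten times smaller in this setup.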

14 Bias-Variance Trade-Off
MC has high variance, zero bias
- Good convergence properties
- Even with function approximation
- Not very sensitive to initial value
- Very simple to understand and use
TD has low variance, some bias
- Usually more efficient than MC
- TD(0) converges to $v_\pi(s)$
- More sensitive to initial value

15 Random Walk Example
Values learned by TD after various numbers of episodes
$V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]$
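A minimal sketch of TD(0) on a random walk follows. The slide does not spell out the environment, so the details are assumptions taken from the common textbook version of this example: five nonterminal states, episodes starting in the center, equal probability of stepping left or right, reward 1 only when terminating on the right, no discounting, and all values initialized to 0.5. Under those assumptions the true values are 1/6, 2/6, ..., 5/6.

```python
import random


def td0_random_walk(n_episodes, alpha, seed=0):
    """Run tabular TD(0) on a 5-state random walk and return the learned values."""
    rng = random.Random(seed)
    n_states = 5
    # Indices 0 and n_states + 1 are terminal states with value 0.
    V = [0.0] + [0.5] * n_states + [0.0]
    for _ in range(n_episodes):
        s = 3  # start in the center state
        while s not in (0, n_states + 1):
            s_next = s + rng.choice((-1, 1))
            reward = 1.0 if s_next == n_states + 1 else 0.0
            V[s] += alpha * (reward + V[s_next] - V[s])  # gamma = 1
            s = s_next
    return V[1:n_states + 1]


values = td0_random_walk(n_episodes=1000, alpha=0.05)
true_values = [i / 6 for i in range(1, 6)]
print([round(v, 2) for v in values])
print([round(v, 2) for v in true_values])
```

With a constant step size the estimates keep fluctuating around the true values rather than settling exactly, which is why the comparison is approximate.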

16 TD and MC on the Random Walk Data averaged over 100 sequences of episodes

17 Batch Updating in TD and MC methods
Batch Updating: train completely on a finite amount of data
- e.g., train repeatedly on 10 episodes until convergence.
Compute updates according to TD or MC, but only update estimates after each complete pass through the data.
For any finite Markov prediction task, under batch updating, TD converges for sufficiently small α.
Constant-α MC also converges under these conditions, but may converge to a different answer.

18 Random Walk under Batch Updating After each new episode, all previous episodes were treated as a batch, and the algorithm was trained until convergence. The whole experiment was repeated 100 times.

19 AB Example Suppose you observe the following 8 episodes: Assume Markov states, no discounting (γ = 1)

20 AB Example

21 AB Example
The prediction that best matches the training data is V(A) = 0
- This minimizes the mean-square-error on the training set
- This is what a batch Monte Carlo method gets
If we consider the sequentiality of the problem, then we would set V(A) = 0.75
- This is correct for the maximum likelihood estimate of a Markov model generating the data
- i.e., if we do a best fit Markov model, and assume it is exactly correct, and then compute what it predicts
- This is called the certainty-equivalence estimate
- This is what TD gets
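The two answers can be reproduced with a short sketch of batch MC and batch TD(0). The eight episodes are not listed in the text above, so the data below is an assumption: it is the episode set from the textbook version of this example (one episode A, 0, B, 0; six episodes B, 1; one episode B, 0). The step size and stopping tolerance are also illustration choices.

```python
from collections import defaultdict

# Each episode is a list of (state, reward) steps; it terminates after the last step.
# Assumed data: the textbook version of the AB example.
EPISODES = [[("A", 0.0), ("B", 0.0)]] + [[("B", 1.0)]] * 6 + [[("B", 0.0)]]


def batch_mc(episodes):
    """Average the undiscounted returns observed after each visit to a state."""
    returns = defaultdict(list)
    for episode in episodes:
        rewards = [r for _, r in episode]
        for t, (state, _) in enumerate(episode):
            returns[state].append(sum(rewards[t:]))
    return {s: sum(g) / len(g) for s, g in returns.items()}


def batch_td0(episodes, alpha=0.05, tol=1e-10):
    """Sweep the batch repeatedly, applying summed TD(0) increments after each pass."""
    V = defaultdict(float)
    while True:
        increments = defaultdict(float)
        for episode in episodes:
            for t, (state, reward) in enumerate(episode):
                # The value after the last step of an episode is 0 (terminal).
                next_value = V[episode[t + 1][0]] if t + 1 < len(episode) else 0.0
                increments[state] += alpha * (reward + next_value - V[state])
        for state, inc in increments.items():
            V[state] += inc
        if max(abs(inc) for inc in increments.values()) < tol:
            return dict(V)


V_mc = batch_mc(EPISODES)
V_td = batch_td0(EPISODES)
print("batch MC:", V_mc)
print("batch TD:", V_td)
```

Both methods agree on V(B) = 0.75. They differ on V(A) because MC only looks at the single return observed from A, while TD passes the estimate of V(B) back to A.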

22 Summary so far
- Introduced one-step tabular model-free TD methods
- These methods bootstrap and sample, combining aspects of DP and MC methods
- TD methods are computationally congenial
- If the world is truly Markov, then TD methods will learn faster than MC methods

23 Unified View
[Figure: methods arranged along two axes, width of backup and height (depth) of backup: temporal-difference learning, dynamic programming, Monte Carlo, exhaustive search.]

24 Learning An Action-Value Function
Estimate $q_\pi$ for the current policy π
Trajectory: $S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}, R_{t+2}, S_{t+2}, A_{t+2}, R_{t+3}, S_{t+3}, A_{t+3}, \dots$
After every transition from a nonterminal state $S_t$, do this:
  $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right]$
If $S_{t+1}$ is terminal, then define $Q(S_{t+1}, A_{t+1}) = 0$

25 Sarsa: On-Policy TD Control
Turn this into a control method by always updating the policy to be greedy with respect to the current estimate:
Initialize $Q(s, a)$, $\forall s \in \mathcal{S}, a \in \mathcal{A}(s)$, arbitrarily, and $Q(\text{terminal-state}, \cdot) = 0$
Repeat (for each episode):
    Initialize S
    Choose A from S using policy derived from Q (e.g., ε-greedy)
    Repeat (for each step of episode):
        Take action A, observe R, S'
        Choose A' from S' using policy derived from Q (e.g., ε-greedy)
        $Q(S, A) \leftarrow Q(S, A) + \alpha \left[ R + \gamma Q(S', A') - Q(S, A) \right]$
        $S \leftarrow S'$; $A \leftarrow A'$
    until S is terminal
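The Sarsa loop can be sketched in Python as follows. The environment is a hypothetical five-state corridor invented for this illustration (start at the left end, goal at the right end, reward -1 per step), not the gridworld from the lecture. The hyperparameters and helper names are assumptions as well.

```python
import random

N_STATES = 5        # hypothetical corridor: states 0..4, state 4 is the goal
ACTIONS = (-1, +1)  # step left or step right


def step(state, action):
    """Return (reward, next_state); every step costs -1, walls clamp the move."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    return -1.0, next_state


def epsilon_greedy(Q, state, epsilon, rng):
    """Pick a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])


def sarsa(n_episodes=2000, alpha=0.1, gamma=1.0, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    # Q for the terminal state is never updated, so it stays 0.
    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(n_episodes):
        s = 0
        a = epsilon_greedy(Q, s, epsilon, rng)
        while s != N_STATES - 1:
            r, s_next = step(s, a)
            a_next = epsilon_greedy(Q, s_next, epsilon, rng)
            # On-policy target: uses the action actually selected in s_next.
            Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
            s, a = s_next, a_next
    return Q


Q = sarsa()
greedy = [max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)]
print(greedy)
```

Because the target uses the next action chosen by the same ε-greedy policy, the learned values describe that exploring policy rather than the purely greedy one.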

26 Windy Gridworld undiscounted, episodic, reward = -1 until goal

27 Results of Sarsa on the Windy Gridworld

28 Q-Learning: Off-Policy TD Control
One-step Q-learning:
  $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \right]$
Initialize $Q(s, a)$, $\forall s \in \mathcal{S}, a \in \mathcal{A}(s)$, arbitrarily, and $Q(\text{terminal-state}, \cdot) = 0$
Repeat (for each episode):
    Initialize S
    Repeat (for each step of episode):
        Choose A from S using policy derived from Q (e.g., ε-greedy)
        Take action A, observe R, S'
        $Q(S, A) \leftarrow Q(S, A) + \alpha \left[ R + \gamma \max_a Q(S', a) - Q(S, A) \right]$
        $S \leftarrow S'$
    until S is terminal
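A matching sketch of one-step Q-learning is given below, again on a hypothetical five-state corridor made up for illustration (reward -1 per step, goal at the right end, γ = 1). Under these assumptions the optimal action values for stepping right are -4, -3, -2, -1 from left to right, which gives something concrete to compare against.

```python
import random

N_STATES = 5        # hypothetical corridor: states 0..4, state 4 is the goal
ACTIONS = (-1, +1)  # step left or step right


def step(state, action):
    """Return (reward, next_state); every step costs -1, walls clamp the move."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    return -1.0, next_state


def q_learning(n_episodes=2000, alpha=0.1, gamma=1.0, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    # Q for the terminal state is never updated, so it stays 0.
    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(n_episodes):
        s = 0
        while s != N_STATES - 1:
            # Behave epsilon-greedily with respect to the current Q.
            if rng.random() < epsilon:
                a = rng.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda b: Q[(s, b)])
            r, s_next = step(s, a)
            # Off-policy target: best next action, whatever is taken next.
            best_next = max(Q[(s_next, b)] for b in ACTIONS)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q


Q = q_learning()
greedy = [max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)]
print([round(Q[(s, 1)], 3) for s in range(N_STATES - 1)], greedy)
```

The only change from Sarsa is the target: the maximum over next actions replaces the value of the action the behavior policy actually selects.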

29 Cliffwalking

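30 Expected Sarsa
Instead of the sample value-of-next-state, use the expectation!
$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \, \mathbb{E}\left[ Q(S_{t+1}, A_{t+1}) \mid S_{t+1} \right] - Q(S_t, A_t) \right]$
$\phantom{Q(S_t, A_t)} \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1}) \, Q(S_{t+1}, a) - Q(S_t, A_t) \right]$
Expected Sarsa performs better than Sarsa (but costs more)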
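The expectation in the target is a policy-weighted average of the next state's action values. The sketch below computes it for an ε-greedy policy; the choice of ε-greedy, the function names, and the example numbers are assumptions for illustration.

```python
def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of an epsilon-greedy policy over the given Q-values."""
    n = len(q_values)
    greedy = max(range(n), key=lambda i: q_values[i])
    probs = [epsilon / n] * n
    probs[greedy] += 1.0 - epsilon
    return probs


def expected_sarsa_target(reward, next_q_values, gamma, epsilon):
    """R + gamma * sum_a pi(a | S') Q(S', a) for an epsilon-greedy pi."""
    probs = epsilon_greedy_probs(next_q_values, epsilon)
    expected_next = sum(p * q for p, q in zip(probs, next_q_values))
    return reward + gamma * expected_next


# Hypothetical next-state action values for two actions.
next_q = [1.0, 3.0]
target = expected_sarsa_target(reward=1.0, next_q_values=next_q, gamma=0.5, epsilon=0.2)
q_learning_target = 1.0 + 0.5 * max(next_q)
print(target, q_learning_target)
```

Setting ε to 0 makes the policy greedy, and the Expected Sarsa target then coincides with the Q-learning target. The extra cost mentioned above is the sum over all actions at every step.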

31 Performance on the Cliff-walking Task
[Figure: reward per episode for Sarsa, Q-learning, and Expected Sarsa, showing interim performance (after 100 episodes) and asymptotic performance. Legend: n = 100 and n = 1E5 for each of the three methods.]

32 Summary
- Introduced one-step tabular model-free TD methods
- These methods bootstrap and sample, combining aspects of DP and MC methods
- TD methods are computationally congenial
- If the world is truly Markov, then TD methods will learn faster than MC methods
- Extend prediction to control by employing some form of GPI
  - On-policy control: Sarsa, Expected Sarsa
  - Off-policy control: Q-learning
