Intro to Reinforcement Learning. Part 3: Core Theory

Gregory Webster
5 years ago

1 Intro to Reinforcement Learning Part 3: Core Theory

2 Interactive Example: You are the algorithm!

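3 Finite Markov decision processes (finite MDPs)
Experience: $S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, \ldots$ (state, action, reward, ...), generated over time by the policy $\pi$ and the dynamics $p$.
States $S_t \in \mathcal{S}$, actions $A_t \in \mathcal{A}$, rewards $R_t \in \mathcal{R}$.
Policy $\pi : \mathcal{A} \times \mathcal{S} \to [0, 1]$, with $\pi(a \mid s) = \Pr\{A_t = a \mid S_t = s\}$.
Dynamics $p : \mathcal{S} \times \mathcal{R} \times \mathcal{S} \times \mathcal{A} \to [0, 1]$, with $p(s', r \mid s, a) = \Pr\{S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a\}$.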
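The definitions above can be made concrete with a small sketch that samples an experience sequence from a tabular policy and dynamics. The two-state MDP, its state and action names, and the helper functions are all hypothetical, invented only for illustration.

```python
import random

# Hypothetical two-state MDP (not from the slides).
# dynamics[(s, a)] lists (probability, next_state, reward) triples,
# i.e. a table of p(s', r | s, a).
dynamics = {
    ("low", "wait"):    [(1.0, "low", 0.0)],
    ("low", "charge"):  [(1.0, "high", 0.0)],
    ("high", "wait"):   [(1.0, "high", 1.0)],
    ("high", "search"): [(0.7, "high", 2.0), (0.3, "low", 2.0)],
}

# policy[s][a] is pi(a | s).
policy = {
    "low":  {"wait": 0.5, "charge": 0.5},
    "high": {"wait": 0.2, "search": 0.8},
}

def sample_action(state, rng):
    """Draw A_t according to pi(. | S_t)."""
    actions = list(policy[state])
    weights = [policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights)[0]

def step(state, action, rng):
    """Draw (S_{t+1}, R_{t+1}) according to p(., . | S_t, A_t)."""
    outcomes = dynamics[(state, action)]
    weights = [prob for prob, _, _ in outcomes]
    _, next_state, reward = rng.choices(outcomes, weights=weights)[0]
    return next_state, reward

def generate_experience(start_state, num_steps, seed=0):
    """Return the flat sequence [S0, A0, R1, S1, A1, R2, S2, ...]."""
    rng = random.Random(seed)
    trajectory = [start_state]
    state = start_state
    for _ in range(num_steps):
        action = sample_action(state, rng)
        state, reward = step(state, action, rng)
        trajectory += [action, reward, state]
    return trajectory
```

Calling `generate_experience("low", 3)` yields a list of ten entries in the order state, action, reward, state, and so on, matching the experience sequence written above.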

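4 Rewards and returns
The objective in RL is to maximize long-term future reward. That is, to choose $A_t$ so as to maximize $R_{t+1}, R_{t+2}, R_{t+3}, \ldots$ But what exactly should be maximized?
The discounted return at time $t$: $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \cdots$, where $\gamma \in [0, 1)$ is the discount rate.
[Table: columns $\gamma$, Reward sequence, Return; only the first $\gamma$ entry, "0.5 (or any)", is recoverable.]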
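As a small worked illustration of the discounted return, the sketch below computes $G_t$ for a finite reward list. It relies on the standard identity $G_t = R_{t+1} + \gamma G_{t+1}$, which follows directly from the definition; the function name and the sample reward lists are assumptions, not values from the slide's table.

```python
def discounted_return(rewards, gamma):
    """Compute G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ...

    `rewards` is the finite list [R_{t+1}, R_{t+2}, ...]; rewards after
    the end of the list are treated as zero.
    """
    g = 0.0
    # Work backwards using G_t = R_{t+1} + gamma * G_{t+1}.
    for reward in reversed(rewards):
        g = reward + gamma * g
    return g
```

For example, `discounted_return([1, 1, 1], 0.9)` equals 1 + 0.9 + 0.81 = 2.71 up to floating-point rounding, and a single reward of 2 arriving two steps late, `discounted_return([0, 0, 2], 0.5)`, is worth 0.5.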

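5 Values are expected returns
The value of a state, given a policy: $v_\pi(s) = \mathbb{E}\{G_t \mid S_t = s, A_{t:\infty} \sim \pi\}$, with $v_\pi : \mathcal{S} \to \mathbb{R}$.
The value of a state-action pair, given a policy: $q_\pi(s, a) = \mathbb{E}\{G_t \mid S_t = s, A_t = a, A_{t+1:\infty} \sim \pi\}$, with $q_\pi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$.
The optimal value of a state: $v_*(s) = \max_\pi v_\pi(s)$, with $v_* : \mathcal{S} \to \mathbb{R}$.
The optimal value of a state-action pair: $q_*(s, a) = \max_\pi q_\pi(s, a)$, with $q_* : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$.
Optimal policy: $\pi_*$ is an optimal policy if and only if $\pi_*(a \mid s) > 0$ only where $q_*(s, a) = \max_b q_*(s, b)$, $\forall s \in \mathcal{S}$. In other words, $\pi_*$ is optimal iff it is greedy with respect to $q_*$.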
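The greediness condition above can be checked mechanically on a table of action values. The following is a minimal sketch, assuming action values are stored as a dict of dicts; the function names and the tie-splitting rule are illustrative choices, not something the slides prescribe.

```python
def greedy_policy(q, tolerance=1e-9):
    """Build a policy that puts probability only on maximizing actions.

    `q` maps each state to a dict {action: value}. Tied maximizing
    actions share the probability equally (an arbitrary choice).
    """
    policy = {}
    for state, action_values in q.items():
        best = max(action_values.values())
        best_actions = [a for a, v in action_values.items() if best - v <= tolerance]
        policy[state] = {
            a: (1.0 / len(best_actions) if a in best_actions else 0.0)
            for a in action_values
        }
    return policy

def is_greedy(policy, q, tolerance=1e-9):
    """Check pi(a|s) > 0 only where q(s, a) = max_b q(s, b), for every s."""
    for state, action_values in q.items():
        best = max(action_values.values())
        for action, prob in policy[state].items():
            if prob > 0 and best - action_values[action] > tolerance:
                return False
    return True
```

If the table passed in happens to be $q_*$, then `is_greedy` is exactly the optimality test stated above; with any other table it only says the policy is greedy with respect to that table.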

6 4 value functions
Prediction: state values $v_\pi$, action values $q_\pi$.
Control: state values $v_*$, action values $q_*$.
All theoretical objects, mathematical ideals (expected values). Distinct from their estimates: $V(s)$, $Q(s, a)$, $\hat{v}(s; \mathbf{w}_t)$, $\hat{q}(s, a; \mathbf{w}_t)$.

7 Gridworld
Actions: north, south, east, west; deterministic. If an action would take the agent off the grid: no move, but reward = -1. Other actions produce reward = 0, except actions that move the agent out of special states A and B as shown.
[Figure: state-value function for the equiprobable random policy; γ = 0.9.]
R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction
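The state-value function for the equiprobable random policy can be approximated numerically. The sketch below uses iterative policy evaluation, a standard technique that the slide itself does not name. The grid size, the positions of A and B, their destinations, and their rewards (+10 and +5) did not survive in the figure here; they are assumed from the Sutton and Barto textbook version of this example.

```python
GAMMA = 0.9
SIZE = 5  # assumed 5x5 grid, as in the textbook example

# Assumed special states (row, col): every action in A jumps to A_PRIME
# with reward +10, and every action in B jumps to B_PRIME with reward +5.
A, A_PRIME, A_REWARD = (0, 1), (4, 1), 10.0
B, B_PRIME, B_REWARD = (0, 3), (2, 3), 5.0

MOVES = [(-1, 0), (1, 0), (0, 1), (0, -1)]  # north, south, east, west

def step(state, move):
    """Deterministic dynamics: return (next_state, reward)."""
    if state == A:
        return A_PRIME, A_REWARD
    if state == B:
        return B_PRIME, B_REWARD
    row, col = state[0] + move[0], state[1] + move[1]
    if 0 <= row < SIZE and 0 <= col < SIZE:
        return (row, col), 0.0
    return state, -1.0  # off the grid: no move, reward -1

def evaluate_random_policy(theta=1e-6):
    """Sweep all states until the largest change in a sweep is below theta."""
    v = {(r, c): 0.0 for r in range(SIZE) for c in range(SIZE)}
    while True:
        delta = 0.0
        for state in v:
            # Expected one-step return under the equiprobable policy.
            new_value = 0.0
            for move in MOVES:
                next_state, reward = step(state, move)
                new_value += 0.25 * (reward + GAMMA * v[next_state])
            delta = max(delta, abs(new_value - v[state]))
            v[state] = new_value
        if delta < theta:
            return v
```

Printing `evaluate_random_policy()` row by row gives a grid that can be compared against the figure on the slide.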

8 Golf
State is ball location. Reward of -1 for each stroke until the ball is in the hole. Value of a state?
Actions: putt (use putter), driver (use driver). Putt succeeds anywhere on the green.
[Figure: contour plots of $v_{\text{putt}}$ (upper panel) and $q_*(s, \text{driver})$ (lower panel) over the course, with sand traps and the green marked.]

9 Why Optimal State-Value Functions are Useful
Any policy that is greedy with respect to $v_*$ is an optimal policy. Therefore, given $v_*$, one-step-ahead search produces the long-term optimal actions.
E.g., back to the gridworld: [Figure: (a) gridworld with special states A, B and their destinations A', B'; (b) $v_*$; (c) $\pi_*$.]
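One-step-ahead search with $v_*$ can be sketched as follows. The three-state corridor, its rewards, and the hand-computed optimal values are hypothetical and chosen only so the arithmetic is easy to verify; the search also assumes the dynamics are known.

```python
GAMMA = 0.9

# Hypothetical corridor: s0 - s1 - goal. Every step has reward -1 and
# "goal" is terminal. dynamics[(s, a)] lists (probability, next_state,
# reward) triples.
dynamics = {
    ("s0", "left"):  [(1.0, "s0", -1.0)],
    ("s0", "right"): [(1.0, "s1", -1.0)],
    ("s1", "left"):  [(1.0, "s0", -1.0)],
    ("s1", "right"): [(1.0, "goal", -1.0)],
}

# Optimal state values for this corridor, worked out by hand:
# v*(s1) = -1, v*(s0) = -1 + 0.9 * (-1) = -1.9, terminal value 0.
v_star = {"s0": -1.9, "s1": -1.0, "goal": 0.0}

def lookahead_value(state, action, v):
    """Expected reward plus discounted value of the successor state."""
    return sum(p * (r + GAMMA * v[s2]) for p, s2, r in dynamics[(state, action)])

def greedy_action(state, v):
    """Pick the action with the best one-step lookahead value."""
    actions = [a for (s, a) in dynamics if s == state]
    return max(actions, key=lambda a: lookahead_value(state, a, v))
```

In this corridor the search picks "right" in both states, because from s0 going left is worth -1 + 0.9 × (-1.9) = -2.71 while going right is worth -1.9. No deeper search is needed because $v_*$ already summarizes the long-term consequences.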

10 Optimal Value Function for Golf
We can hit the ball farther with the driver than with the putter, but with less accuracy. $q_*(s, \text{driver})$ gives the value of using the driver first, then using whichever actions are best.
[Figure: contour plot of $q_*(s, \text{driver})$ over the course, with sand traps and the green marked.]

12 Solving unknown MDPs
Tabular case: no function approximation; each state has dedicated memory, and the states are visible.
Given a policy $\pi$ and returns from following it, we can average the returns (for each state-action pair) to approximate its action-value function $q_\pi$.
Given $q_\pi$, we can form a new policy $\pi'$ that is greedy with respect to it: $\pi'(a \mid s) > 0$ only where $q_\pi(s, a) = \max_b q_\pi(s, b)$, $\forall s \in \mathcal{S}$.
The new policy is guaranteed to be an improvement: $v_{\pi'}(s) \ge v_\pi(s)$, $\forall s \in \mathcal{S}$, with equality only if both are optimal.
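Averaging returns and then greedifying can be sketched in a few lines. This version averages over every visit to a state-action pair, which is one possible reading of the slide (first-visit averaging is another), and the episode format and function names are assumptions made for illustration.

```python
from collections import defaultdict

def average_returns(episodes, gamma):
    """Estimate q_pi by averaging observed returns per state-action pair.

    Each episode is a list of (state, action, reward) triples, where
    `reward` is the reward received after taking `action` in `state`.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for episode in episodes:
        g = 0.0
        # Walk backwards so the return from each step is built incrementally.
        for state, action, reward in reversed(episode):
            g = reward + gamma * g
            totals[(state, action)] += g
            counts[(state, action)] += 1
    return {pair: totals[pair] / counts[pair] for pair in totals}

def greedify(q):
    """Return a deterministic policy {state: action} greedy with respect to q.

    Only state-action pairs that appear in `q` are considered.
    """
    best = {}
    for (state, action), value in q.items():
        if state not in best or value > q[(state, best[state])]:
            best[state] = action
    return best
```

With two hypothetical undiscounted episodes, `[("s0", "right", -1.0), ("s1", "right", -1.0)]` and `[("s0", "left", -1.0), ("s0", "right", -1.0), ("s1", "right", -1.0)]`, the averaged estimates are -2 for (s0, right), -3 for (s0, left), and -1 for (s1, right), so the greedy policy chooses "right" in both states.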

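13 This is called policy improvement. Schematically: $\pi \xrightarrow{\text{policy evaluation}} v_\pi \xrightarrow{\text{greedification}} \pi'$.
It follows then that repeated policy improvement, $\pi_1 \xrightarrow{\text{eval}} v_{\pi_1} \xrightarrow{\text{greedy}} \pi_2 \xrightarrow{\text{eval}} v_{\pi_2} \xrightarrow{\text{greedy}} \cdots \xrightarrow{\text{greedy}} \pi_* \xrightarrow{\text{eval}} v_*$, converges to an optimal policy in a finite number of iterations.
This is called policy iteration. It is the basis for almost all (control) solution methods.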
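The evaluate-then-greedify loop can be sketched on a tiny problem. This sketch assumes the dynamics are known and evaluates each policy by repeated sweeps, rather than by averaging sampled returns; the corridor MDP and all names are hypothetical.

```python
GAMMA = 0.9

# Hypothetical corridor: s0 - s1 - goal, reward -1 per step, goal terminal.
dynamics = {
    ("s0", "left"):  [(1.0, "s0", -1.0)],
    ("s0", "right"): [(1.0, "s1", -1.0)],
    ("s1", "left"):  [(1.0, "s0", -1.0)],
    ("s1", "right"): [(1.0, "goal", -1.0)],
}
STATES = ["s0", "s1"]  # non-terminal states
ACTIONS = ["left", "right"]

def q_value(state, action, v):
    """Expected reward plus discounted value of the successor state."""
    return sum(p * (r + GAMMA * v[s2]) for p, s2, r in dynamics[(state, action)])

def evaluate(policy, theta=1e-10):
    """Policy evaluation for a deterministic policy {state: action}."""
    v = {"s0": 0.0, "s1": 0.0, "goal": 0.0}
    while True:
        delta = 0.0
        for s in STATES:
            new_value = q_value(s, policy[s], v)
            delta = max(delta, abs(new_value - v[s]))
            v[s] = new_value
        if delta < theta:
            return v

def greedy(v):
    """Greedification: pick a maximizing action in every state."""
    return {s: max(ACTIONS, key=lambda a: q_value(s, a, v)) for s in STATES}

def policy_iteration(policy):
    """Alternate evaluation and greedification until the policy is stable."""
    while True:
        v = evaluate(policy)
        improved = greedy(v)
        if improved == policy:
            return policy, v
        policy = improved
```

Starting from the poor policy `{"s0": "left", "s1": "left"}`, the loop settles on "right" in both states after a few rounds, with values close to -1.9 for s0 and -1 for s1. Stopping when the greedy policy no longer changes is one simple termination rule among several.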

14 Generalized Policy Iteration
Generalized Policy Iteration (GPI): any interaction of policy evaluation and policy improvement, independent of their granularity.
[Diagram: evaluation drives $v \to v_\pi$; improvement drives $\pi \to \text{greedy}(v)$.]
A geometric metaphor for convergence of GPI: [Diagram: starting from $v_0, \pi_0$, the process alternates between the lines $v = v_\pi$ and $\pi = \text{greedy}(v)$, converging to $v_*, \pi_*$.]

15 Summary
Agent-environment interaction: states, actions, rewards.
Policy: stochastic rule for selecting actions.
Return: the function of future rewards the agent tries to maximize.
Markov decision process.
Value functions: state-value function for a policy, action-value function for a policy, optimal state-value function, optimal action-value function.
Optimal policies.
Generalized policy iteration.
Next up: A Monte Carlo Learning Example: Solving Blackjack
