Reinforcement Learning


1 Reinforcement Learning: Hierarchical Reinforcement Learning. Action hierarchies, hierarchical RL, semi-MDPs. Vien Ngo, Marc Toussaint, University of Stuttgart.

2 Outline. Hierarchical reinforcement learning; learning subgoals/hierarchies.

3 Accelerating reinforcement learning. Temporal abstraction; goal/state abstraction.

4 Accelerating reinforcement learning: abstraction. Temporal abstraction; goal/state abstraction.

5 Temporal Abstraction. Dealing with macro actions that span multiple time steps. Advantages: values are explored/computed only for interesting states (e.g. subgoals, ...); transfer learning across problems/regions.

6 Hierarchical reinforcement learning. Three approaches to HRL: Options, Sutton et al. (temporal + state abstraction); hierarchies of finite-state controllers (HAM), Parr & Russell (temporal abstraction); given an action hierarchy, MAXQ (temporal + state abstraction).

7 Semi-Markov decision process

8 Semi-Markov decision process. An SMDP is a tuple {S, A, T, R}: state space S; action space A; transition function T(s, a, s', t) = p(s', t | s, a); reward function R(s, a).

9 SMDP. Semi-Markov decision processes (SMDPs) generalize MDPs by: allowing the decision maker to choose actions whenever the system state changes; modeling the system evolution in continuous time; allowing the time spent in a particular state to follow an arbitrary probability distribution. The system state may change several times between decision epochs; only the state at a decision epoch is relevant to the decision maker.

10 Semi-Markov Decision Process (SMDP). T(s, a, s', t) = P(s', t | s, a) defines the joint probability of the next state s' and the termination time t.

11 Bellman equations for SMDP. Consider a discrete-time SMDP:
$$V^*(s) = \max_a \Big[ R(s,a) + \sum_{s',\tau} \gamma^{\tau}\, p(s',\tau \mid s,a)\, V^*(s') \Big]$$
$$Q^*(s,a) = R(s,a) + \sum_{s',\tau} \gamma^{\tau}\, p(s',\tau \mid s,a)\, \max_b Q^*(s',b)$$

12 Bellman equations for SMDP (cont.). Dynamic programming algorithms are correspondingly extended to SMDPs (Howard, 1971; Puterman, 1994).
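
These equations translate directly into a value-iteration sweep once a duration-aware model is available. Below is a minimal tabular sketch, assuming the model is stored as dictionaries R and P with the illustrative layout described in the docstring; none of these names come from the slides.

```python
def smdp_value_iteration(states, actions, R, P, gamma=0.95, iters=200):
    """Tabular value iteration for a discrete-time SMDP.

    R[(s, a)]: expected reward for executing a in s (until a terminates)
    P[(s, a)]: list of (s_next, tau, prob) outcomes, tau = transition duration
    Returns V and Q as dictionaries.
    """
    V = {s: 0.0 for s in states}
    Q = {}
    for _ in range(iters):
        for s in states:
            for a in actions:
                # Q(s,a) = R(s,a) + sum_{s',tau} gamma^tau p(s',tau|s,a) V(s')
                Q[(s, a)] = R[(s, a)] + sum(
                    prob * gamma ** tau * V[s_next]
                    for (s_next, tau, prob) in P[(s, a)]
                )
        for s in states:
            V[s] = max(Q[(s, a)] for a in actions)
    return V, Q
```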

13 Example: Taxi Problem. [Figure: taxi gridworld with landmarks R, G, Y, B, and the task hierarchy ROOT → {GET, PUT}; GET → {pickup, NAVIGATE(t)}; PUT → {putdown, NAVIGATE(t)}; NAVIGATE → {north, south, east, west}.]

14 Example 2: SMDP (Sutton, Precup & Singh, 1999). [Figure slides 14-16.]

17 Options. Sutton, Precup & Singh (1999).

18 Options. An option is a triple $o = \langle I, \pi, \beta \rangle$: $I \subseteq S$ is the initiation set; $\pi : S \times A \to [0,1]$ is the option's policy; $\beta : S \to [0,1]$ is the termination condition.
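
A minimal way to write this triple down in code. This is purely illustrative: the slide's π is stochastic over actions, while the sketch below keeps a deterministic policy for brevity, and all names are assumptions.

```python
from dataclasses import dataclass
from typing import Any, Callable, Set

State = Any
Action = Any

@dataclass
class Option:
    initiation_set: Set[State]            # I: states where the option may be started
    policy: Callable[[State], Action]     # pi: maps a state to a primitive action
    termination: Callable[[State], float] # beta: probability of terminating in a state

    def can_start(self, s: State) -> bool:
        return s in self.initiation_set
```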

19 Options. [Figure.]

20 Value Functions for Options. Each option's policy: π_i; global policy (over options): µ. Denote the reward part of an option
$$r(s,o) = E\big\{ r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots + \gamma^{k-1} r_{t+k} \mid o, s_t = s \big\}$$
and the state-prediction part
$$p(s' \mid s,o) = \sum_{k=1}^{\infty} p(s',k \mid s,o)\, \gamma^k .$$
The global policy's value function:
$$V^{\mu}(s) = E\big\{ r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots \mid \mu, s_t = s \big\}$$
$$= E\big\{ r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{k-1} r_{t+k} + \gamma^k V^{\mu}(s_{t+k}) \mid \mu, s_t = s \big\}$$
$$= E\Big[ r(s,o) + \sum_{s'} p(s' \mid s,o)\, V^{\mu}(s') \;\Big|\; \mu, s_t = s \Big]$$

21 Value Functions for Options (cont.).
$$Q^{\mu}(s,o) = E\big\{ r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots \mid o, \mu, s_t = s \big\}$$
$$= E\big\{ r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{k-1} r_{t+k} + \gamma^k V^{\mu}(s_{t+k}) \mid o, \mu, s_t = s \big\}$$
$$= r(s,o) + \sum_{s'} p(s' \mid s,o)\, \max_{o'} Q^{\mu}(s',o')$$

22 Options: Learning. SMDP Q-learning, given a set of predefined options: execute the currently selected option (chosen e.g. epsilon-greedily w.r.t. Q(s, o)) until it terminates; then compute r(s_t, o) and update Q(s_t, o) as in Q-learning/SARSA.

23 Options: Learning (cont.). Intra-option Q-learning: works with partially defined options; after each primitive action, update all options consistent with it (off-policy learning); converges to the correct values under the same assumptions as one-step Q-learning (Sutton).
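
A sketch of the SMDP Q-learning update just described, built on the Option sketch above. It assumes a hypothetical `env.step(action)` interface returning (next state, reward, done), a Q-table such as `collections.defaultdict(float)` keyed by (state, option index), and that at least one option is admissible in every state; the discounted option reward and the duration k are accumulated while the option runs.

```python
import random

def run_option(env, s, option, gamma):
    """Execute an option to termination; return (discounted reward, duration k, next state)."""
    r_total, k, discount = 0.0, 0, 1.0
    while True:
        a = option.policy(s)
        s, r, done = env.step(a)          # assumed interface: (next state, reward, episode end)
        r_total += discount * r
        discount *= gamma
        k += 1
        if done or random.random() < option.termination(s):
            return r_total, k, s

def smdp_q_step(Q, s, options, env, gamma=0.99, alpha=0.1, eps=0.1):
    """One SMDP Q-learning step: pick an option eps-greedily, run it, update Q[(s, i)]."""
    admissible = [i for i, o in enumerate(options) if o.can_start(s)]
    if random.random() < eps:
        i = random.choice(admissible)
    else:
        i = max(admissible, key=lambda j: Q[(s, j)])
    r, k, s_next = run_option(env, s, options[i], gamma)
    # For brevity the max ranges over all options; it could be restricted to those admissible in s_next.
    target = r + gamma ** k * max(Q[(s_next, j)] for j in range(len(options)))
    Q[(s, i)] += alpha * (target - Q[(s, i)])
    return s_next
```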

24 Hierarchies of Abstract Machines (HAM). Parr (1998) and Parr & Russell (1998).

25 HAM. A HAM is a program that constrains the actions the agent can take in each state. Each machine is defined by a set of machine states, a transition function, and a start function. Machine states m are of four types: Action, Call, Choice, Stop. The transition function is a stochastic function of the current machine state and some features of the resulting environment state, and determines the next machine state. The start function determines the initial state of the machine.
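
As a rough illustration only (hypothetical names, not the notation of Parr & Russell), a single abstract machine could be encoded as typed machine states plus the start and transition functions from the slide:

```python
from enum import Enum, auto

class NodeType(Enum):
    ACTION = auto()   # executes a primitive action in the environment
    CALL   = auto()   # transfers control to another machine
    CHOICE = auto()   # the learning algorithm picks the next machine state
    STOP   = auto()   # returns control to the calling machine

class Machine:
    """One abstract machine: a start function plus a (possibly stochastic) transition function."""
    def __init__(self, start, transition):
        self.start = start            # env_state -> initial machine state
        self.transition = transition  # (machine_state, env_features) -> next machine state
```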

26 HAM: Learning. [Figure.]

27 RL with HAM. Composing a HAM H with the MDP yields an SMDP (H ∘ MDP = SMDP). Given defined HAMs, we learn a policy a_t = π(s_t, m_t), where the actions are the choices made at the machines' choice points. Being given a HAM is analogous to being given well-defined options o and then learning π(s, o).

28 MAXQ. T. G. Dietterich (2000), Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition, JAIR.

29 MAXQ. The underlying MDP M is decomposed into a set of subtasks M_0, M_1, ..., M_n. M_0 is the root subtask (solving M_0 solves M). Each subtask may consist of either primitive actions or other subtasks. Example: the Taxi problem.
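
As a concrete illustration of such a decomposition, the Taxi hierarchy from the earlier slide could be written as a simple mapping from each composite subtask to its children, with primitive actions as leaves; the names are only illustrative.

```python
# Composite subtasks map to their child subtasks/actions; primitives are leaves.
TAXI_HIERARCHY = {
    "Root":     ["Get", "Put"],
    "Get":      ["pickup", "Navigate"],
    "Put":      ["putdown", "Navigate"],
    "Navigate": ["north", "south", "east", "west"],
}
PRIMITIVES = {"pickup", "putdown", "north", "south", "east", "west"}

def is_primitive(task: str) -> bool:
    return task in PRIMITIVES
```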

30 MAXQ: Value Decomposition. Consider a subtask M_i (or option M_i) and its descendant actions a:
$$V^{\mu}(i,s) = E\big\{ r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots \mid \mu, s_t = s \big\} \quad \text{(until } M_i \text{ terminates)}$$
$$= E\big\{ r_{t+1} + \dots + \gamma^{k-1} r_{t+k} + \gamma^k V^{\mu}(i, s_{t+k}) \mid \mu, s_t = s \big\}$$
$$= V^{\mu}(\pi_i(s), s) + \underbrace{\sum_{s',N} p(s',N \mid s, \pi_i(s))\, \gamma^N V^{\mu}(i,s')}_{C^{\mu}(i,s,\pi_i(s)) \text{ (completion term)}}$$
$$Q^{\mu}(i,s,a) = V^{\mu}(a,s) + C^{\mu}(i,s,a)$$
The reward term:
$$V^{\mu}(i,s) = \begin{cases} Q^{\mu}(i,s,\pi_i(s)) & \text{if } i \text{ is a macro action} \\ \sum_{s'} P(s' \mid s, i)\, R(s' \mid s, i) & \text{if } i \text{ is a primitive action} \end{cases}$$

31 The completion term C^µ(i, s, a) is the expected discounted cumulative reward of completing subtask M_i after the subroutine M_a, taken in state s, has finished.

32 This recursively decomposes V^µ(0, s) into value functions for M_1, M_2, ..., M_n.

33 In general,
$$V^{\mu}(0,s) = V^{\mu}(a_m,s) + C^{\mu}(a_{m-1},s,a_m) + \dots + C^{\mu}(a_1,s,a_2) + C^{\mu}(0,s,a_1)$$
where a_0, a_1, ..., a_m is the sequence of subtasks selected by the hierarchical policy on the path down from the root M_0. For learning, we only need to store the C functions for non-primitive actions and V for primitive actions.

34 Example of MAXQ value decomposition. [Figure.] r_1, r_2, ..., r_14 is the sequence of rewards of the primitive actions at times 1, 2, ..., 14.

35 MAXQ: Learning. The MAXQ-0 learning algorithm: given the action hierarchy; each subtask has zero pseudo terminal reward.

36 MAXQ-0 Learning Algorithm. Initialize V(i, s) (for all primitive i) and C(i, s, j) (for all non-primitive i and all descendants j of i) arbitrarily.

MAXQ-0(Node i, State s):
  if i is primitive then
    execute i, receive reward r and resulting state s'
    V_{t+1}(i, s) = (1 - α) V_t(i, s) + α r
    return 1
  else
    steps = 0
    while i has not terminated do
      choose a ~ π_i(s)  (e.g. a = arg max_b Q(i, s, b))
      N = MAXQ-0(a, s)   (recursive call)
      observe resulting state s'
      C_{t+1}(i, s, a) = (1 - α) C_t(i, s, a) + α γ^N V_t(i, s')
      steps = steps + N
      s = s'
    end while
    return steps
  end if

37 MAXQ-0 Learning Algorithm. How do we compute V_t(i, s') when i is non-primitive?

38 MAXQ-0 Learning Algorithm. How do we compute V_t(i, s') when i is non-primitive?
$$V_t(i,s) = \begin{cases} \max_a Q_t(i,s,a) & \text{if } i \text{ is a macro action} \\ V_t(i,s) & \text{if } i \text{ is a primitive action} \end{cases}$$
$$Q_t(i,s,a) = V_t(a,s) + C_t(i,s,a)$$
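
These two equations give a recursive evaluation procedure. A minimal sketch under the same assumptions as the earlier hierarchy encoding: V is a dictionary of learned primitive values, C a dictionary of learned completion terms, and `children`/`primitive` describe the hierarchy.

```python
def evaluate_v(task, s, V, C, children, primitive):
    """Recursively compute V_t(task, s) from primitive V-values and completion terms C.

    V[(task, s)]        : learned values of primitive actions
    C[(task, s, child)] : learned completion terms of composite tasks
    children[task]      : child subtasks of a composite task
    primitive(task)     : True iff task is a primitive action
    """
    if primitive(task):
        return V[(task, s)]
    # V(i, s) = max_a Q(i, s, a),  where Q(i, s, a) = V(a, s) + C(i, s, a)
    return max(
        evaluate_v(a, s, V, C, children, primitive) + C[(task, s, a)]
        for a in children[task]
    )
```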

39 MAXQ: Learning. The MAXQ-Q learning algorithm: given the action hierarchy, where each subtask may have an arbitrary non-zero pseudo-reward R̃_i. MAXQ-Q introduces one additional completion function per subtask, used only inside that subtask.

40 MAXQ-Q Learning Algorithm. Initialize V(i, s) (for all primitive i) and both C(i, s, j) and C̃(i, s, j) (for all non-primitive i and all descendants j of i) arbitrarily.

MAXQ-Q(Node i, State s):
  if i is primitive then
    execute i, receive reward r and resulting state s'
    V_{t+1}(i, s) = (1 - α) V_t(i, s) + α r
    return 1
  else
    steps = 0
    while i has not terminated do
      choose a ~ π_i(s)  (e.g. a = arg max_{a'} [ C̃(i, s, a') + V(a', s) ])
      N = MAXQ-Q(a, s)   (recursive call)
      observe resulting state s'
      a* = arg max_{a'} [ C̃_t(i, s', a') + V_t(a', s') ]
      C̃_{t+1}(i, s, a) = (1 - α) C̃_t(i, s, a) + α γ^N ( R̃_i(s') + C̃_t(i, s', a*) + V_t(a*, s') )
      C_{t+1}(i, s, a) = (1 - α) C_t(i, s, a) + α γ^N ( C_t(i, s', a*) + V_t(a*, s') )
      steps = steps + N
      s = s'
    end while
    return steps
  end if

41 Optimality in HRL?

42 Optimality in HRL? Hierarchically optimal vs. recursively optimal. Hierarchical optimality: the learnt policy is the best policy consistent with the given hierarchy; a task's policy depends not only on its children's policies but also on its context. Recursive optimality: the policy for a parent task is optimal given the learnt policies of its children (context-free task policies).

43 Optimality in HRL? (cont.) Context-free policies are better suited to state abstraction and transfer learning, since they provide macro actions that are reusable by many other tasks.

44 Optimality in HRL. [Figure.] (An example from a tutorial by Dietterich.)

45 Optimality: Options. If the action space contains both primitive actions and options, then SMDP learning converges to an optimal policy. Otherwise, SMDP learning with options is proved to converge to a hierarchically optimal policy. (An example from a tutorial by Dietterich.)

46 Optimality: HAM. Example: π(s_1) = South, π(s_2) ∈ {North, South}. HAM learning is proved to be hierarchically optimal. (An example from a tutorial by Dietterich.)

47 Optimality: MAXQ. [Figure: task hierarchy with Root, Exit, ToGoal over the primitives South, North, East.] (An example from a tutorial by Dietterich.)

48 Optimality: MAXQ (cont.). MAXQ is recursively optimal. (An example from a tutorial by Dietterich.)

49 Optimality in HRL: summary. Options/HAM learn a hierarchically optimal policy. MAXQ learns a recursively optimal policy. MAXQ can obtain a hierarchically optimal policy with a good design of the subtasks or pseudo-rewards.

50 Hierarchy/subgoal learning

51 Subgoal learning. Creating useful options randomly/heuristically, then adding them gradually.

52 Subgoal learning (cont.). Creating an option/subgoal at a bottleneck, i.e. a commonality across multiple paths to a solution.
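
One simple way to operationalize the bottleneck idea (an illustrative sketch, not the exact method of any of the cited papers) is to count how often each state appears across successful trajectories and propose the most frequent non-goal states as subgoal candidates:

```python
from collections import Counter

def bottleneck_candidates(successful_trajectories, goal_states, top_k=3):
    """Rank states by how often they occur across successful trajectories.

    successful_trajectories: iterable of state sequences that reached a goal
    """
    counts = Counter()
    for traj in successful_trajectories:
        counts.update(set(traj))   # count each state at most once per trajectory
    for g in goal_states:
        counts.pop(g, None)        # goal states themselves are not subgoals
    return [s for s, _ in counts.most_common(top_k)]
```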

53 Hierarchy/subgoal learning: related work. Amy McGovern & Andrew G. Barto; Barto et al. (2004, intrinsically motivated learning); Hengst (also uses bottlenecks); Neville Mehta et al. (using DBNs); etc.

54 Human hierarchical decision making. J. J. F. Ribas-Fernandes, A. Solway, C. Diuk, J. T. McGuire, A. G. Barto, Y. Niv & M. M. Botvinick (2011), A neural signature of hierarchical reinforcement learning, Neuron 71.
