The Option-Critic Architecture


Slide 1: The Option-Critic Architecture. Pierre-Luc Bacon, Jean Harb, Doina Precup. Reasoning and Learning Lab, McGill University, Montreal, Canada. AAAI 2017.

Slide 2: Intelligence: the ability to generalize and adapt efficiently to new and uncertain situations. Having good representations is key: "[...] solving a problem simply means representing it so as to make the solution transparent." (Simon)

Slide 3: Reinforcement Learning: a general framework for AI. Equipped with a good state representation, RL has led to impressive results: Tesauro's TD-Gammon (1995), Watson's Daily-Double wagering in Jeopardy! (2013), human-level play in the Atari video games (2013), AlphaGo (2016)... The ability to abstract knowledge temporally, over many different time scales, is still missing.

Slide 4: Temporal abstraction. Higher-level steps: choosing the type of coffee maker and the type of coffee beans. Medium-level steps: grind the beans, measure the right quantity of water, boil the water. Lower-level steps: the wrist and arm movements made while adding coffee to the filter, ...

Slide 5: Temporal abstraction in AI. A cornerstone of AI planning since the 1970s: Fikes et al. (1972), Newell (1972), Kuipers (1979), Korf (1985), Laird (1986), Iba (1989), Drescher (1991), etc. Temporal abstraction has been shown to: generate shorter plans, reduce the complexity of choosing actions, provide robustness against model misspecification, and improve exploration by taking shortcuts in the environment.

Slide 6: Temporal abstraction in RL. Options (Sutton, Singh, Precup 2000) can represent courses of action at variable time scales. [Figure: a trajectory over time, shown at the high level (options) and at the low level (primitive actions).]

Slide 7: Options framework. An option ω is a triple: (1) an initiation set I_ω, (2) an internal policy π_ω, and (3) a termination condition β_ω. Example (robot navigation): if there is no obstacle in front (I_ω), go forward (π_ω) until you get too close to another object (β_ω). We can then derive a policy over options π_Ω that maximizes the expected discounted sum of rewards:

    \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,\middle|\, s_0, \omega_0 \right]
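
To make the triple concrete, here is a minimal Python sketch (not the authors' code) of an option object and of executing it until its termination condition fires; the env.step interface, the helper names, and the discount value are assumptions for illustration.

    from dataclasses import dataclass
    from typing import Callable
    import random

    @dataclass
    class Option:
        can_initiate: Callable   # I_omega: True in states where the option may start
        policy: Callable         # pi_omega: maps a state to a primitive action
        termination: Callable    # beta_omega: probability of stopping in a given state

    def run_option(env, state, option, gamma=0.99):
        # Execute the option until beta_omega fires; report where it ended up.
        assert option.can_initiate(state)
        total_reward, discount, done = 0.0, 1.0, False
        while not done:
            action = option.policy(state)
            state, reward, done = env.step(action)   # assumed interface: (state, reward, done)
            total_reward += discount * reward
            discount *= gamma
            if random.random() < option.termination(state):
                break
        return state, total_reward, discount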

Slide 8: Contribution of this work. The problem of constructing or discovering good options has been a challenge for more than 15 years. Option-critic is a scalable solution to this problem: it is online, continual, and model-free (although models can be used if desired); it requires no a priori domain knowledge, decomposition, or human intervention; it learns within a single task, at least as fast as other methods that do not use temporal abstraction; and it applies to general continuous state and action spaces.

Slide 9: Actor-Critic Architecture (Sutton 1984). [Diagram: the actor (policy) selects action a_t in state s_t; the environment returns reward r_t; the critic (value function) computes a TD error that is fed back to the actor as a policy-gradient signal.] The policy (the actor) is decoupled from its value function. The critic provides feedback to improve the actor, and learning is fully online.
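
The decoupling can be illustrated with a generic one-step actor-critic update for a linear critic; this is a sketch under assumed names and step sizes, not Sutton's original algorithm nor the paper's code.

    import numpy as np

    def actor_critic_step(theta, w, phi_s, phi_s_next, action, reward, done,
                          grad_log_pi, gamma=0.99, alpha_actor=1e-3, alpha_critic=1e-2):
        # Critic: linear state value V(s) = w . phi(s) and its TD error.
        v_s = w @ phi_s
        v_next = 0.0 if done else w @ phi_s_next
        td_error = reward + gamma * v_next - v_s
        # Critic update: move V(s) toward the TD target.
        w = w + alpha_critic * td_error * phi_s
        # Actor update: policy-gradient step, with the TD error as the feedback signal.
        theta = theta + alpha_actor * td_error * grad_log_pi(theta, phi_s, action)
        return theta, w, td_error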

Slide 10: Option-Critic Architecture. [Diagram: the policy over options π_Ω selects the current option ω_t; the options (π_ω, β_ω) select action a_t in state s_t; the environment returns reward r_t; the critic (Q_U, A_Ω) sends TD errors and gradients back to the options.] The internal policies and termination conditions are parameterized; the policy over options is computed by a separate process.
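
To make the diagram concrete, the control loop looks roughly like the following sketch; the interfaces for the policy over options, the options, the critic, and the gradient learner are assumptions, not names from the paper's code.

    import random

    def option_critic_episode(env, pi_Omega, options, critic, learner):
        # pi_Omega: policy over options; options[w] exposes .policy and .termination;
        # the critic maintains Q_U / A_Omega; the learner applies the gradient updates
        # of the next slide to the intra-option policies and terminations.
        state = env.reset()
        omega = pi_Omega.sample(state)                 # omega_t chosen by the policy over options
        done = False
        while not done:
            action = options[omega].policy.sample(state)       # intra-option policy pi_omega
            next_state, reward, done = env.step(action)
            td_error = critic.update(state, omega, action, reward, next_state, done)
            learner.update(state, omega, action, next_state, td_error)
            if random.random() < options[omega].termination(next_state):
                omega = pi_Omega.sample(next_state)            # beta_omega fired: re-select an option
            state = next_state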

Slide 11: Main result: gradient updates. The gradient with respect to the internal policy parameters θ is

    \mathbb{E}\left[ \frac{\partial \log \pi_{\omega,\theta}(a \mid s)}{\partial \theta} \, Q_U(s, \omega, a) \right]

This has the usual interpretation: take better primitives more often inside the option. The gradient with respect to the termination parameters ν is

    -\,\mathbb{E}\left[ \frac{\partial \beta_{\omega,\nu}(s')}{\partial \nu} \, A_{\pi_\Omega}(s', \omega) \right]

where A_{\pi_\Omega} = Q_{\pi_\Omega} - V_{\pi_\Omega} is the advantage function. This means that we want to lengthen options that have a large advantage.
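
For intuition, here is a small sketch of both updates for a softmax intra-option policy and a sigmoid termination function over linear features; the parameterisations, names, and step sizes are assumptions for illustration, not the paper's implementation.

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def intra_option_policy_update(theta, phi_s, a, q_u, lr):
        # theta: (n_actions, n_features) weights of pi_omega; ascend E[grad log pi * Q_U].
        probs = softmax(theta @ phi_s)
        grad_log = -np.outer(probs, phi_s)
        grad_log[a] += phi_s
        return theta + lr * q_u * grad_log

    def termination_update(nu, phi_s_next, advantage, lr):
        # beta_omega(s') = sigmoid(nu . phi(s')); stepping against grad beta * A
        # lowers the termination probability where the advantage is large,
        # i.e. it lengthens good options.
        beta = 1.0 / (1.0 + np.exp(-nu @ phi_s_next))
        grad_beta = beta * (1.0 - beta) * phi_s_next
        return nu - lr * advantage * grad_beta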

Slide 12: Results: options transfer. [Figure: the navigation domain, showing hallways, walls, the initial goal, and the random goal used after 1000 episodes.]

Slide 13: Results: options transfer when the goal moves randomly. [Plot: steps per episode vs. episodes, comparing primitive-action learners (SARSA(0), AC-PG) with option-critic using 4 and 8 options.] Using the temporal abstractions discovered by option-critic, learning in the first task is no slower than with primitives, and learning once the goal is moved is faster with the options.

Slide 14: Results: learned options are intuitive. [Figure: probability of terminating in each state, shown separately for options 1 to 4.] Terminations are more likely near hallways (although no pseudo-rewards are provided).

Slide 15: Results: nonlinear function approximation. [Architecture diagram: the last 4 frames pass through convolutional layers into a shared representation, which feeds the policy over options, the termination functions, and the internal policies.] The first four layers use the same architecture as DQN (Mnih et al., 2013), hybridized with the options and the policy over them.
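
A rough PyTorch sketch of the kind of network described here: a shared DQN-style convolutional trunk feeding three heads (option values for the policy over options, termination probabilities, and intra-option policies). The layer sizes loosely follow Mnih et al. (2013) but are assumptions for illustration, not the architecture used in the paper.

    import torch
    import torch.nn as nn

    class OptionCriticNet(nn.Module):
        def __init__(self, n_options, n_actions):
            super().__init__()
            # Shared representation over the last 4 frames (assumed 4x84x84 input).
            self.trunk = nn.Sequential(
                nn.Conv2d(4, 16, kernel_size=8, stride=4), nn.ReLU(),
                nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
                nn.Flatten(),
                nn.Linear(32 * 9 * 9, 256), nn.ReLU(),
            )
            self.n_options, self.n_actions = n_options, n_actions
            self.q_omega = nn.Linear(256, n_options)                    # values used by the policy over options
            self.termination = nn.Linear(256, n_options)                # beta_omega, one sigmoid per option
            self.intra_option = nn.Linear(256, n_options * n_actions)   # pi_omega, one softmax per option

        def forward(self, frames):
            h = self.trunk(frames)
            q = self.q_omega(h)
            beta = torch.sigmoid(self.termination(h))
            pi = torch.softmax(
                self.intra_option(h).view(-1, self.n_options, self.n_actions), dim=-1)
            return q, beta, pi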

Slide 16: Performance matching or better than DQN. [Learning curves: average score vs. training epoch for Option-Critic and DQN on (a) Asterix, (b) Ms. Pacman, (c) Seaquest, and (d) Zaxxon.]

Slide 17: Interpretable and specialized options in Seaquest. [Figure: the action trajectory over time, colored white while option 1 is active and black while option 2 is active, with the transition from option 1 to option 2 marked.] Option 1 captures downward shooting sequences; option 2 captures upward shooting sequences.

Slide 18: Conclusion. Our results seem to be the first that are fully end-to-end, obtained within a single task, and achieved at a speed comparable to or better than methods that use only primitive actions. Using ideas from policy gradient methods, option-critic provides continual option construction, can be used with nonlinear function approximators, and can easily incorporate regularizers or pseudo-rewards.

Slide 19: Future work. Learn initiation sets, which would require a new notion of stochastic initiation functions. Gather more empirical results. Try our code:
