Deep Learning and Reinforcement Learning

1 Deep Learning and Reinforcement Learning. Razvan Pascanu (Google DeepMind), 17 August

2 Disclaimers: Slides based on David Silver's lecture notes. From a DL perspective. Not complete, but rather biased and focused. It is meant to make you want to learn this.

3 What is Reinforcement Learning? Supervised Unsupervised RL

4 What is Reinforcement Learning? Supervised Learning, Unsupervised Learning, Reinforcement Learning

5 Laundry list of differences for RL? Active learning. Moving target. Weak error signal. Long term dependencies? Exploration/Exploitation. [Diagram: a trajectory x_0, x_1, x_2, x_3 generated by actions a_1, a_2, a_3.]

6 RL problem. Reward: scalar feedback signal. Goal: pick the sequence of actions that maximizes the cumulative reward. Reward Hypothesis: all goals can be described by the maximization of expected cumulative reward.

7 RL problem. Agent and Environment. Source:

8 RL problem. History is the sequence of observations and actions. State: the information used to decide what happens next (MDP/POMDP). Markov property: P(S_t | S_{t-1}) = P(S_t | S_1, ..., S_{t-1}).

9 Inside the RL agents. An RL agent has one or more of these components: Policy: given a state, provide a distribution over the actions. Value function: given a state (or state/action pair), estimate the expected future reward. Model: the agent's representation of the world (planning).

10 RL agents taxonomy. [Diagram: Value function, Policy and Model, with Actor-Critic at the intersection of value and policy.]

11 Policy-based methods (digression). Effective in high-dimensional / continuous action spaces. Can learn stochastic policies. Better convergence properties. Noisy gradients!

12 REINFORCE. Directly maximize the cumulative reward! J(θ) = E_{π_θ}[r] = Σ_s d(s) Σ_a π_θ(s, a) R_{s,a}. Maximize J. Using the log trick we have: ∂J/∂θ = Σ_s d(s) Σ_a π_θ(s, a) ∂log π_θ(s, a)/∂θ · R_{s,a} = E_{π_θ}[∂log π_θ/∂θ · r].
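
A minimal NumPy sketch of the REINFORCE (score-function) update for a linear-softmax policy over discrete actions; the policy parameterization and names are illustrative assumptions, not the slide's implementation:

    import numpy as np

    def softmax(z):
        z = z - z.max()
        e = np.exp(z)
        return e / e.sum()

    def reinforce_update(theta, state_feats, action, reward, lr=0.01):
        # pi(a | s) = softmax(theta @ s); grad log pi(a|s) w.r.t. theta is
        # (one_hot(a) - pi(.|s)) outer s, so the update follows reward * grad log pi.
        probs = softmax(theta @ state_feats)
        grad_log_pi = -np.outer(probs, state_feats)
        grad_log_pi[action] += state_feats
        return theta + lr * reward * grad_log_pi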

13 Primer: Dynamic Programming. [Diagram: a weighted graph from START to END with edge costs; find the cheapest path.]

14 Primer: Dynamic Programming. [Diagram: the same START to END graph.]

15 Primer: Dynamic Programming. Bellman's Principle of Optimality: "An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy" (Bellman, 1957). V(x) = max_{a ∈ Γ(x)} { F(x, a) + β V(T(x, a)) }.
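
A minimal sketch of the Bellman recursion applied by backward induction on a small deterministic shortest-path graph (the min-cost form of the equation above, with β = 1); the graph layout and edge costs are made up for illustration:

    # V(x) = min over actions a of { cost(x, a) + V(next(x, a)) }
    edges = {                     # node -> {successor: edge cost}
        'START': {'A': 5, 'B': 9},
        'A': {'END': 10, 'B': 3},
        'B': {'END': 25},
    }

    V = {'END': 0.0}
    for node in ['B', 'A', 'START']:   # leaves first (reverse topological order)
        V[node] = min(cost + V[nxt] for nxt, cost in edges[node].items())

    print(V['START'])                  # cost of the cheapest START -> END path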

16 Q-Values. The Q-value Q(x, a) is the expected cumulative reward for picking action a in state x. We can act greedily or ε-greedily: π(a_i | x) = 1 - ε if Q(a_i, x) > Q(a_j, x) for all j, and ε otherwise.
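
A minimal sketch of ε-greedy action selection over a tabular Q function; the dictionary-based Q table and names are assumptions for illustration:

    import random

    def epsilon_greedy(Q, state, n_actions, epsilon=0.1):
        if random.random() < epsilon:
            return random.randrange(n_actions)                      # explore
        return max(range(n_actions), key=lambda a: Q[(state, a)])   # exploit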

17 Q-learning. Think of Q-values as the length of the path in the graph. Use dynamic programming (Bellman equation): Q̂_t(x_t, a_t) = r_{x_t, a_t} + β max_a Q_t(x_{t+1}, a), and Q_{t+1}(x_t, a_t) = Q_t(x_t, a_t) + γ (Q̂(x_t, a_t) - Q_t(x_t, a_t)), where γ is the learning rate and the bracketed difference is the derivative of the squared error (Q̂ - Q)^2: regress Q towards Q̂ using SGD.
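
A minimal tabular Q-learning sketch following the update above; the transition encoding and hyperparameters are assumptions for illustration:

    from collections import defaultdict

    Q = defaultdict(float)   # Q-values default to 0

    def q_learning_step(Q, x_t, a_t, reward, x_tp1, n_actions, beta=0.99, lr=0.1):
        # bootstrap target: r + beta * max_a Q(x_{t+1}, a)
        q_hat = reward + beta * max(Q[(x_tp1, a)] for a in range(n_actions))
        # move Q(x_t, a_t) a small step towards the target (regression by SGD)
        Q[(x_t, a_t)] += lr * (q_hat - Q[(x_t, a_t)])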

18 Finally, using deep learning for RL. What role does Deep Learning play in RL? It provides a compact form for Q (function approximator): θ_{t+1} = θ_t + γ (Q̂(x_t, a_t) - Q_{θ_t}(x_t, a_t)) ∂Q_θ/∂θ, i.e. a step along the derivative of the squared error (Q̂ - Q_θ)^2.

19 Q-learning in Theano? (Theano pseudocode)

    import theano
    import theano.tensor as TT

    # W, b, Wout, bout are Theano shared variables defined elsewhere.
    # Forward pass of a one-hidden-layer Q-network: x -> relu(W x + b) -> Q-values.
    x = TT.vector("x")
    q = TT.dot(Wout, TT.nnet.relu(TT.dot(W, x) + b)) + bout
    forward = theano.function([x], q)

20 Q-learning in Theano? (Theano pseudocode)

    x = TT.vector("x")
    q = TT.dot(Wout, TT.nnet.relu(TT.dot(W, x) + b)) + bout
    params = [W, b, Wout, bout]

    target_q = TT.scalar("target_q")
    action = TT.iscalar("action")
    lr = TT.scalar("lr")

    # Gradient of the squared TD error w.r.t. all parameters, and an SGD update
    # that regresses q[action] towards target_q.
    gparams = TT.grad((q[action] - target_q) ** 2, params)
    learn = theano.function(
        [x, action, target_q, lr], [],
        updates=[(p, p - lr * g) for p, g in zip(params, gparams)])

21 Q-learning in Theano? (Theano pseudocode)

    for _, (x, x_tp1, act, reward) in enumerate(memory):
        # bootstrap target: r + max_a Q(x_{t+1}, a)
        target_q = reward + forward(x_tp1).max()
        learn(x, act, target_q, 1e-3)

22 But learning can be tricky. [Diagram: consecutive observations x_t, x_{t+1}, x_{t+2} look nearly identical.] Correlated samples break learning.

23 Solution: replay buffer. [Diagram: transitions (x_t, a_t, r_t) are stored in a replay buffer and sampled for the Q-learning update of θ.]
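
A minimal replay-buffer sketch; the capacity, batch size, and field names are illustrative assumptions rather than the DQN reference implementation:

    import random
    from collections import deque

    class ReplayBuffer:
        def __init__(self, capacity=100000):
            self.buffer = deque(maxlen=capacity)    # oldest transitions are dropped

        def add(self, x_t, action, reward, x_tp1):
            self.buffer.append((x_t, action, reward, x_tp1))

        def sample(self, batch_size=32):
            # uniform sampling breaks the temporal correlation between consecutive samples
            return random.sample(self.buffer, batch_size)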

24 High variance and minibatches. Reinforcement Learning is inherently sequential. The replay buffer gives an elegant solution to employ minibatches. Minibatches mean reduced variance in the gradients.

25 Target network. Q̂ changes as fast as Q, so the regression target keeps moving. Fix Q̂ (target network) and update it periodically.
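
A minimal sketch of the target-network idea: keep a frozen copy of the online parameters for computing bootstrap targets and re-sync it only every so often; the sync period and parameter container are assumptions:

    import copy
    import numpy as np

    online_params = {"W": np.zeros((4, 4)), "b": np.zeros(4)}   # updated every step
    target_params = copy.deepcopy(online_params)                # frozen copy used for Q-hat

    SYNC_EVERY = 10000
    for step in range(100000):
        # ... compute targets with target_params, update online_params by SGD ...
        if step % SYNC_EVERY == 0:
            target_params = copy.deepcopy(online_params)        # periodic sync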

26 Other details. SGD can be slow, so rely on RMSprop (or any newer optimizer). Convolutional models are more efficient than MLPs. DQN uses an action repeat of 4. DQN receives 4 frames of the game at a time (grayscale). ε is annealed from 1 to 0.1. Training takes time (roughly days).
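
A minimal sketch of the linear annealing schedule for ε (1.0 down to 0.1) mentioned above; the number of annealing steps is an assumption:

    def epsilon_at(step, start=1.0, end=0.1, anneal_steps=1000000):
        frac = min(step / anneal_steps, 1.0)
        return start + frac * (end - start)

    # epsilon_at(0) == 1.0, epsilon_at(1000000) == 0.1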

27 Results. [Plots: average reward per episode vs. training epochs on Breakout and Seaquest.] Source: Mnih et al., Human-level control through deep reinforcement learning, Nature 2015.

28 Results. Source: Mnih et al., Human-level control through deep reinforcement learning, Nature 2015.

29 Results. Nature paper videos.

30 Parallelization: Gorila. Source: Nair et al., Massively Parallel Methods for Deep Reinforcement Learning, ICML DL workshop.

31 Results. [Plot: number of games beating the highest score vs. training time in days.] Source: Nair et al., Massively Parallel Methods for Deep Reinforcement Learning, ICML DL workshop.

32 Results. [Bar chart: per-game scores of Gorila and DQN across the Atari suite, normalized to the human score (0% to 5,000%), marking which games are at human level or above and which are below.] Source: Nair et al., Massively Parallel Methods for Deep Reinforcement Learning, ICML DL workshop.

33 Where this doesn't work (straightforwardly). Continuous control. Robotics (experience is very expensive). Sparse rewards (Montezuma!?). Long-term correlations (Montezuma!?). But this does not mean that RL+DL cannot be the solution!

34 DeepMind Research. Source:

35 And we are hiring

36 Thank you. Questions?

37 Possible exercise for the afternoon sessions I. Pick one or several tasks from the Deep Learning Tutorials: Logistic Regression, MLP, AutoEncoders / Denoising AutoEncoders, Stacked Denoising AutoEncoders.

38 Possible exercise for the afternoon sessions II. Compare different initializations for neural networks (MLPs and ConvNets) with rectifiers or tanh. In particular compare: the initialization proposed by Glorot et al.; sampling uniformly from [-1/√fan_in, 1/√fan_in]; setting all singular values to 1 (and biases to 0). How do different optimization algorithms help with these initializations? Extra kudos for interesting plots or analysis. Please make use of the Deep Learning Tutorials.
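
A NumPy sketch of the three initialization schemes listed above for a dense layer of shape (fan_in, fan_out); the exact constants are standard choices and may differ in detail from the ones intended by the exercise:

    import numpy as np

    rng = np.random.RandomState(0)

    def glorot_uniform(fan_in, fan_out):
        limit = np.sqrt(6.0 / (fan_in + fan_out))    # Glorot et al.
        return rng.uniform(-limit, limit, size=(fan_in, fan_out))

    def sqrt_fan_in_uniform(fan_in, fan_out):
        limit = 1.0 / np.sqrt(fan_in)                # older heuristic
        return rng.uniform(-limit, limit, size=(fan_in, fan_out))

    def unit_singular_values(fan_in, fan_out):
        W = rng.randn(fan_in, fan_out)
        U, _, Vt = np.linalg.svd(W, full_matrices=False)
        return U @ Vt                                # all singular values equal to 1 (biases set to 0 separately)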

39 Possible exercise for the afternoon sessions III. Requires convolutions. Re-implement the AutoEncoder tutorial using convolutions both in the encoder and the decoder. Extra kudos for allowing pooling (or strides) in the encoder.

40 Possible exercise for the afternoon sessions IV. Requires Reinforcement Learning. Attempt to solve the Catch game. [Images: not actual screenshots of the game.]
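
A minimal sketch of a Catch-style environment for this exercise (a ball falls one row per step and the agent moves a paddle at the bottom); the grid size, action encoding, and reward scheme are assumptions about the intended game rather than its specification:

    import random

    class Catch:
        def __init__(self, size=10):
            self.size = size
            self.reset()

        def reset(self):
            self.ball_row = 0
            self.ball_col = random.randrange(self.size)
            self.paddle = self.size // 2
            return (self.ball_row, self.ball_col, self.paddle)

        def step(self, action):            # action: 0 = left, 1 = stay, 2 = right
            self.paddle = min(max(self.paddle + action - 1, 0), self.size - 1)
            self.ball_row += 1
            done = self.ball_row == self.size - 1
            reward = 0
            if done:
                reward = 1 if self.ball_col == self.paddle else -1
            return (self.ball_row, self.ball_col, self.paddle), reward, done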
