Machine Learning for Physicists Lecture 10. Summer 2017 University of Erlangen-Nuremberg Florian Marquardt



2 Function/Image representation; Image classification [Handwriting recognition]; Convolutional nets; Autoencoders; Visualization by dimensional reduction; Recurrent networks; Word vectors; Reinforcement learning

3 For a more in-depth treatment, see David Silver's course on reinforcement learning (University College London).

4 The simplest RL example ever
A random walk, where the probability to go up is determined by the policy, and where the reward is given by the final position (ideal strategy: always go up!)
(Note: this policy does not even depend on the current state)
[Figure: random-walk trajectory, position versus time; reward = final position]

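5 The simplest RL example ever
A random walk, where the probability to go up is determined by the policy, and where the reward is given by the final position (ideal strategy: always go up!)
(Note: this policy does not even depend on the current state)
policy: $\pi_\theta(\mathrm{up}) = \frac{1}{1+e^{-\theta}}$
RL update: $\delta\theta = \eta \sum_t \left\langle R \, \frac{\partial \ln \pi_\theta(a_t)}{\partial \theta} \right\rangle$, with reward $R = x(T)$ and $a_t$ = up or down
$\frac{\partial \ln \pi_\theta(a_t)}{\partial \theta} = \pm e^{\mp\theta} \, \pi_\theta(a_t) = \pm (1 - \pi_\theta(a_t))$ (+ for up, - for down)
$= 1 - \pi_\theta(\mathrm{up})$ for up, $= -\pi_\theta(\mathrm{up})$ for down
$\sum_t \frac{\partial \ln \pi_\theta(a_t)}{\partial \theta} = N_{\mathrm{up}} - N \pi_\theta(\mathrm{up})$
$N_{\mathrm{up}}$ = number of up-steps, $N$ = number of time steps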

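6 The simplest RL example ever
reward: $R = x(T) = N_{\mathrm{up}} - N_{\mathrm{down}} = 2 N_{\mathrm{up}} - N$
RL update: $\delta\theta = \eta \sum_t \left\langle R \, \frac{\partial \ln \pi_\theta(a_t)}{\partial \theta} \right\rangle$, $a_t$ = up or down
$\left\langle R \sum_t \frac{\partial \ln \pi_\theta(a_t)}{\partial \theta} \right\rangle = 2 \left\langle \left(N_{\mathrm{up}} - \frac{N}{2}\right) \left(N_{\mathrm{up}} - \langle N_{\mathrm{up}} \rangle\right) \right\rangle$
(general analytical expression for average update, rare)
Initially, when $\pi_\theta(\mathrm{up}) = \frac{1}{2}$:
$\delta\theta = 2\eta \left\langle \left(N_{\mathrm{up}} - \frac{N}{2}\right)^2 \right\rangle = 2\eta \, \mathrm{Var}(N_{\mathrm{up}}) = \eta \frac{N}{2} > 0$ (binomial distribution!)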

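7 The simplest RL example ever
In general:
$\left\langle R \sum_t \frac{\partial \ln \pi_\theta(a_t)}{\partial \theta} \right\rangle = 2 \left\langle \left(N_{\mathrm{up}} - \frac{N}{2}\right) \left(N_{\mathrm{up}} - \langle N_{\mathrm{up}} \rangle\right) \right\rangle$
$= 2 \left\langle \left[ \left(N_{\mathrm{up}} - \langle N_{\mathrm{up}} \rangle\right) + \left(\langle N_{\mathrm{up}} \rangle - \frac{N}{2}\right) \right] \left(N_{\mathrm{up}} - \langle N_{\mathrm{up}} \rangle\right) \right\rangle$
$= 2 \, \mathrm{Var}\, N_{\mathrm{up}} + 2 \left(\langle N_{\mathrm{up}} \rangle - \frac{N}{2}\right) \left\langle N_{\mathrm{up}} - \langle N_{\mathrm{up}} \rangle \right\rangle$
$= 2 \, \mathrm{Var}\, N_{\mathrm{up}} = 2 N \pi_\theta(\mathrm{up}) (1 - \pi_\theta(\mathrm{up}))$
(general analytical expression for average update, fully simplified, extremely rare)
[Plot: $\pi_\theta(\mathrm{up}) (1 - \pi_\theta(\mathrm{up}))$ versus $\pi_\theta(\mathrm{up})$]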

8 The simplest RL example ever
[Plot: probability $\pi_\theta(\mathrm{up})$ versus trajectory (= training episode) for 3 learning attempts; strong fluctuations!]
(This plot is for N=100 time steps in a trajectory; eta=0.001)
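The learning attempts described above can be sketched in a few lines of Python. This is a minimal sketch, not the lecture's own code: it assumes that each update uses a single sampled trajectory (no averaging over a batch), draws N_up directly from the binomial distribution instead of simulating individual steps, and uses illustrative names (`pi_up`, `sample_n_up`, `learning_attempt`) and an arbitrary episode count of 400. Only N=100 and eta=0.001 are taken from the slide.

```python
import math
import random


def pi_up(theta):
    """Policy: probability of an 'up' step, pi_theta(up) = 1 / (1 + exp(-theta))."""
    return 1.0 / (1.0 + math.exp(-theta))


def sample_n_up(n_steps, p, rng):
    """Draw N_up from a binomial distribution as a sum of Bernoulli trials."""
    return sum(1 for _ in range(n_steps) if rng.random() < p)


def learning_attempt(n_steps=100, eta=0.001, n_episodes=400, theta0=0.0, seed=0):
    """Run one stochastic learning attempt; return pi(up) after each episode.

    n_episodes and theta0 are assumptions made for this sketch.
    """
    rng = random.Random(seed)
    theta = theta0
    history = []
    for _ in range(n_episodes):
        p = pi_up(theta)
        n_up = sample_n_up(n_steps, p, rng)
        reward = 2 * n_up - n_steps        # R = x(T) = N_up - N_down
        grad_log = n_up - n_steps * p      # sum_t d ln pi(a_t) / d theta
        theta += eta * reward * grad_log   # single-trajectory estimate of the update
        history.append(pi_up(theta))
    return history


if __name__ == "__main__":
    for seed in range(3):
        final = learning_attempt(seed=seed)[-1]
        print(f"attempt {seed}: final pi(up) = {final:.3f}")
```

Plotting `history` for a few seeds should give curves of the kind described on this slide; the exact shape depends on the random seed.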

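9 Spread of the update step
$Y = N_{\mathrm{up}} - \langle N_{\mathrm{up}} \rangle$, $c = \langle N_{\mathrm{up}} \rangle - N/2$
$X = (Y + c) Y$, X = update (except prefactor of 2)
(Note: to get Var X, we need central moments of binomial distribution up to 4th moment)
[Plot: $\sqrt{\mathrm{Var}(X)} \sim N^{3/2}$ and $\langle X \rangle \sim N^{1}$ versus $\pi_\theta(\mathrm{up})$]
(This plot for N=100)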

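10 Optimal baseline suppresses spread!
$Y = N_{\mathrm{up}} - \langle N_{\mathrm{up}} \rangle$, $c = \langle N_{\mathrm{up}} \rangle - N/2$
$X = (Y + c) Y$
with optimal baseline: $X' = (Y + c - b) Y$, $b = \frac{\langle Y^2 (Y + c) \rangle}{\langle Y^2 \rangle}$
[Plot: $\sqrt{\mathrm{Var}(X)}$ ($\sim N^{3/2}$) compared with $\sqrt{\mathrm{Var}(X')}$ and $\langle X \rangle$ ($\sim N^{1}$), versus $\pi_\theta(\mathrm{up})$]
(This plot for N=100)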
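The effect of the baseline can be checked numerically without any sampling, because all required expectation values are finite sums over the binomial distribution. The following is a sketch under that assumption; the function names and the choice pi(up)=0.7 are illustrative and not from the lecture.

```python
import math


def binomial_pmf(n, p):
    """Probabilities P(N_up = k) for k = 0..n."""
    return [math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]


def spread_with_and_without_baseline(n_steps=100, p=0.7):
    """Return (<X>, sqrt(Var X), sqrt(Var X'), b) for X = (Y + c) Y and X' = (Y + c - b) Y."""
    pmf = binomial_pmf(n_steps, p)
    mean_n_up = n_steps * p
    c = mean_n_up - n_steps / 2

    def expect(f):
        # Exact expectation of f(Y), with Y = N_up - <N_up>
        return sum(w * f(k - mean_n_up) for k, w in enumerate(pmf))

    # Optimal baseline b = <Y^2 (Y + c)> / <Y^2>
    b = expect(lambda y: y**2 * (y + c)) / expect(lambda y: y**2)

    mean_x = expect(lambda y: (y + c) * y)
    mean_xb = expect(lambda y: (y + c - b) * y)   # same mean, since <Y> = 0
    var_x = expect(lambda y: ((y + c) * y) ** 2) - mean_x**2
    var_xb = expect(lambda y: ((y + c - b) * y) ** 2) - mean_xb**2
    return mean_x, math.sqrt(var_x), math.sqrt(var_xb), b


if __name__ == "__main__":
    mean_x, std_x, std_xb, b = spread_with_and_without_baseline()
    print(f"<X> = {mean_x:.2f}, spread without baseline = {std_x:.2f}, "
          f"with baseline = {std_xb:.2f}, b = {b:.2f}")
```

The baseline leaves the average update unchanged because the extra term is proportional to the mean of Y, which vanishes; only the spread is reduced.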

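11 Note: Many update steps reduce relative spread
$M$ = number of update steps
$\Delta X = \sum_{j=1}^{M} X_j$
$\langle \Delta X \rangle = M \langle X \rangle$
$\sqrt{\mathrm{Var}\, \Delta X} = \sqrt{M} \sqrt{\mathrm{Var}\, X}$
relative spread: $\frac{\sqrt{\mathrm{Var}\, \Delta X}}{\langle \Delta X \rangle} \sim \frac{1}{\sqrt{M}}$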

12 Homework Implement the RL update including the optimal baseline and run some stochastic learning attempts. Can you observe the improvement over the no-baseline results shown here? Note: You do not need to simulate the individual random walk trajectories, just exploit the binomial distribution.

13 The second-simplest RL example
A walker and a target site; actions: move or stay; reward = number of time steps on target
[Figure: position versus time, showing walker and target site]
See code on website: SimpleRL_WalkerTarget

14 RL in keras: categorical cross-entropy trick
input = state s; output = action probabilities (softmax) $\pi_\theta(a|s)$ for a=0, a=1, a=2, ...
categorical cross-entropy: $C = -\sum_a P(a) \ln \pi_\theta(a|s)$, where $P(a)$ is the desired distribution and $\pi_\theta(a|s)$ is the distribution from the net
Set $P(a) = R$ for a = action that was taken, $P(a) = 0$ for all other actions a
This implements policy gradient.
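Why this trick reproduces the policy gradient can be verified with a small numerical check. The sketch below does not use keras; it is plain Python and assumes that the network output is a softmax over logits z, so that the gradient of the cross-entropy with respect to z should equal -R (delta_{a,k} - pi_k), which is minus R times the gradient of ln pi(a|s). All function names are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, logits):
    """Categorical cross-entropy C = -sum_a P(a) ln pi(a|s)."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs))


def policy_gradient_target(action, reward, n_actions):
    """'Desired distribution' of the trick: R for the taken action, 0 otherwise."""
    return [reward if a == action else 0.0 for a in range(n_actions)]


def numerical_grad(f, x, eps=1e-6):
    """Central finite-difference gradient of f at x."""
    grad = []
    for i in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[i] += eps
        x_minus[i] -= eps
        grad.append((f(x_plus) - f(x_minus)) / (2 * eps))
    return grad


if __name__ == "__main__":
    logits = [0.2, -1.0, 0.5]
    action, reward = 2, 3.0
    target = policy_gradient_target(action, reward, len(logits))
    probs = softmax(logits)
    grad_c = numerical_grad(lambda z: cross_entropy(target, z), logits)
    # -R * d ln pi(a|s) / d z_k for a softmax policy
    expected = [-reward * ((1.0 if k == action else 0.0) - probs[k])
                for k in range(len(logits))]
    print(grad_c)
    print(expected)
```

Gradient descent on C is therefore gradient ascent on R ln pi(a|s) for the action that was taken, which is the single-sample policy gradient update.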

15 alpha-go
Among the major board games, Go was the one not yet played at a superhuman level by any program (very large state space on a 19x19 board!)
alpha-go beat the world's best player in 2017

16 alpha-go
First: try to learn from human expert players
Silver et al., "Mastering the game of Go with deep neural networks and tree search" (Google DeepMind team), Nature, January 2016

17 alpha-go
Second: use policy gradient RL on games played against previous versions of the program
Silver et al., "Mastering the game of Go with deep neural networks and tree search" (Google DeepMind team), Nature, January 2016


20 Q-learning
An alternative to the policy gradient approach
Introduce a quality function Q that predicts the future reward for a given state s and a given action a. Deterministic policy: just select the action with the largest Q!

21 [Figure labels: player & possible actions; Q maximal]

22 Q-learning
Introduce a quality function Q that predicts the future reward for a given state s and a given action a. Deterministic policy: just select the action with the largest Q!
$Q(s_t, a_t) = E[R_t | s_t, a_t]$ (assuming future steps to follow the policy!)
Discounted future reward: $R_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$ (depends on state and action at time t)
Reward at time step t: $r_t$
Discount factor: $0 < \gamma \le 1$ (learning somewhat easier for smaller factor: short memory times)
How do we obtain Q?
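The discounted future reward for every time step of a finished episode can be computed in one backward pass, using R_t = r_t + gamma R_{t+1}. A minimal sketch, with an illustrative function name and a made-up reward list:

```python
def discounted_returns(rewards, gamma):
    """Return [R_0, ..., R_T] with R_t = sum_{t'>=t} gamma^(t'-t) * r_t'."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so that each step reuses the return of the following step
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


if __name__ == "__main__":
    print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```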

23 Q-learning: Update rule
Bellman equation (from optimal control theory):
$Q(s_t, a_t) = E[r_t + \gamma \max_a Q(s_{t+1}, a) | s_t, a_t]$
In practice, we do not know the Q function yet, so we cannot directly use the Bellman equation. However, the following update rule has the correct Q function as a fixed point:
$Q^{\mathrm{new}}(s_t, a_t) = Q^{\mathrm{old}}(s_t, a_t) + \alpha \left( r_t + \gamma \max_a Q^{\mathrm{old}}(s_{t+1}, a) - Q^{\mathrm{old}}(s_t, a_t) \right)$
$\alpha$: small (<1) update factor. The term in brackets will be zero, once we have converged to the correct Q.
If we use a neural network to calculate Q, it will be trained to yield the new value in each step.
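For a small discrete problem, the update rule can be applied directly to a table of Q values. The sketch below assumes such a tabular setting (a dictionary keyed by (state, action) pairs) rather than a neural network; the function name, the default values of alpha and gamma, and the omission of special handling for terminal states are all assumptions of this example.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """Apply one Q-learning update in place and return the bracketed term.

    q maps (state, action) -> value; missing entries count as 0.
    """
    best_next = max(q[(next_state, a)] for a in actions)
    # r_t + gamma * max_a Q_old(s_{t+1}, a) - Q_old(s_t, a_t)
    td_error = reward + gamma * best_next - q[(state, action)]
    q[(state, action)] += alpha * td_error
    return td_error


if __name__ == "__main__":
    q = defaultdict(float)
    actions = [0, 1]
    # One observed transition: state 0, action 1, reward 1.0, next state 1
    print(q_update(q, 0, 1, 1.0, 1, actions))
    print(q[(0, 1)])
```

Once the table satisfies the Bellman equation for a deterministic transition, the returned bracketed term is zero and the update leaves Q unchanged.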

24 Q-learning: Exploration
Initially, Q is arbitrary. It will be bad to follow this Q all the time. Therefore, introduce a probability $\epsilon$ of random action ("exploration")!
Follow Q: "exploitation"
Do something random (new): "exploration"
"$\epsilon$-greedy"
Reduce this randomness later!
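An epsilon-greedy action choice with a decreasing epsilon might look as follows. This is a sketch only: the linear decay schedule, its parameter values and both function names are assumptions made for illustration, not something prescribed by the lecture.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the action with largest Q."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                       # exploration
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploitation


def decayed_epsilon(step, start=1.0, end=0.05, decay_steps=1000):
    """Linearly reduce epsilon from start to end over decay_steps, then hold it."""
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)


if __name__ == "__main__":
    q_values = [0.1, 0.9, 0.3]
    for step in (0, 500, 1000):
        eps = decayed_epsilon(step)
        print(step, round(eps, 3), epsilon_greedy(q_values, eps))
```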
