Dynamic Programming and Reinforcement Learning


1 Dynamic Programming and Reinforcement Learning
Daniel Russo
Columbia Business School, Decision, Risk, and Operations Division
Fall 2017

2 Supervised Machine Learning
- Learning from datasets
- A passive paradigm
- Focus on pattern recognition

3 Reinforcement Learning
[Diagram: an agent takes actions in an environment and observes outcomes and rewards.]
Learning to attain a goal through interaction with a poorly understood environment.

4 Canonical (and toy) RL environments
- Cart Pole
- Mountain Car

5 Impressive new (and toy) RL environments
- Atari from pixels

6 Challenges in Reinforcement Learning
- Partial feedback: the data one gathers depends on the actions one takes.
- Delayed consequences: rather than maximizing the immediate benefit from the next interaction, one must consider the impact on future interactions.

7 Dream Application: Management of Chronic Diseases
Various researchers are working on mobile health interventions.

8 Dream Application: Intelligent Tutoring Systems
*Picture shamelessly lifted from a slide of Emma Brunskill's.

9 Dream Application: Beyond Myopia in E-Commerce
Online marketplaces and web services have repeated interactions with users, but are designed to optimize the next interaction. RL provides a framework for optimizing the cumulative value generated by such interactions. How useful will this turn out to be?

10-12 Deep Reinforcement Learning
RL where function approximation is performed using a deep neural network, instead of using linear models, kernel methods, shallow neural networks, etc.
Justified excitement:
- The hope is to enable direct training of control systems from complex sensory inputs (e.g. visual or auditory sensors).
- DeepMind's DQN learns to play Atari from pixels, without learning vision first.
There is also a lot of less justified hype.

13 Warning
1. This is an advanced PhD course.
2. It will be primarily theoretical. We will prove theorems when we can. The emphasis will be on a precise understanding of why methods work and why they may fail completely in simple cases.
3. There are tons of engineering tricks in Deep RL. I won't cover these.

14 My Goals
1. Encourage great students to do research in this area.
2. Provide a fun platform for introducing technical tools to operations PhD students: dynamic programming, stochastic approximation, exploration algorithms, and regret analysis.
3. Sharpen my own understanding.

15 Tentative Course Outline
1. Foundational material on MDPs
2. Estimating long-run value
3. Exploration algorithms
* Additional topics as time permits: policy gradients and actor-critic, rollout and Monte-Carlo tree search.

16-17 Markov Decision Processes: A warmup
On the white-board: shortest path in a directed graph.
Imagine that while traversing the shortest path, you discover one of the routes is closed. How should you adjust your path?
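
The white-board example is not transcribed, but the backward recursion it illustrates can be sketched in a few lines. The following is a minimal illustration (not from the slides): a shortest-path computation on a small directed acyclic graph, with hypothetical node names and edge costs.

```python
# Shortest path on a DAG by backward recursion (deterministic DP).
# The graph below is a made-up example; edges[n] maps successors to edge costs.
edges = {
    "A": {"B": 2, "C": 5},
    "B": {"C": 1, "D": 6},
    "C": {"D": 2},
    "D": {},  # destination
}

def shortest_paths(edges, goal="D"):
    order = ["D", "C", "B", "A"]  # reverse topological order (hard-coded here)
    J = {goal: 0.0}               # cost-to-go from each node
    policy = {goal: None}         # best successor from each node
    for node in order:
        if node == goal:
            continue
        # Pick the successor minimizing edge cost plus its cost-to-go.
        nxt = min(edges[node], key=lambda m: edges[node][m] + J[m])
        policy[node] = nxt
        J[node] = edges[node][nxt] + J[nxt]
    return J, policy

J, policy = shortest_paths(edges)
print(J["A"], policy["A"])  # 5.0, first move is to B
```

Re-planning after a route closes amounts to deleting the corresponding edge and re-running the recursion from the current node onward.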

18 Example: Inventory Control
- Stochastic demand
- Orders have lead time
- Non-perishable inventory
- Inventory holding costs
- Finite selling season

19-20 Example: Inventory Control
Periods k = 0, 1, 2, ..., N.
- x_k ∈ {0, ..., 1000}: current inventory
- u_k ∈ {0, ..., 1000 - x_k}: inventory order
- w_k ∈ {0, 1, 2, ...}: demand (i.i.d. with known distribution)
Transition dynamics: x_{k+1} = x_k - w_k + u_k.
Cost function:
g(x, u, w) = c_H x + c_L max(w - x, 0) + c_O(u),
the three terms being the holding cost, the lost-sales cost, and the order cost.
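
As a concrete anchor for the formulation, here is a minimal Python sketch of the transition and stage cost. The cost coefficients, the clipping of the state to {0, ..., 1000}, and the function names are assumptions made for illustration; the slides leave c_H, c_L, and c_O(·) unspecified.

```python
import numpy as np

# Hypothetical cost coefficients (not specified in the slides).
C_HOLD, C_LOST, C_ORDER = 1.0, 4.0, 2.0
MAX_INV = 1000

def step(x, u, w):
    """Transition x_{k+1} = x_k - w_k + u_k, clipped so the state stays in {0, ..., MAX_INV}."""
    return int(np.clip(x - w + u, 0, MAX_INV))

def stage_cost(x, u, w):
    """g(x, u, w) = holding cost + lost-sales cost + (linear) order cost."""
    return C_HOLD * x + C_LOST * max(w - x, 0) + C_ORDER * u
```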

21-24 Example: Inventory Control
Objective: minimize E[ sum_{k=0}^{N} g(x_k, u_k, w_k) ].
Minimize over what? Over fixed sequences of controls u_0, u_1, ...? No: over policies (adaptive ordering strategies).
This is sequential decision making under uncertainty, where
- decisions have delayed consequences, and
- relevant information is revealed during the decision process.

25 Further Examples
- Dynamic pricing (over a selling season)
- Trade execution (with market impact)
- Queuing admission control
- Consumption-savings models in economics
- Search models in economics
- Timing of maintenance and repairs

26-27 Finite Horizon MDPs: formulation
A discrete-time dynamic system
x_{k+1} = f_k(x_k, u_k, w_k),  k = 0, 1, ..., N,
where
- x_k ∈ X_k is the state,
- u_k ∈ U_k(x_k) is the control,
- w_k is a disturbance (i.i.d. with known distribution).
Assume the state and control spaces are finite.
The total cost incurred is sum_{k=0}^{N} g_k(x_k, u_k, w_k), where g_k(x_k, u_k, w_k) is the cost in period k.

28 Finite Horizon MDPs: formulation
A policy is a sequence π = (µ_0, µ_1, ..., µ_N), where µ_k maps each state x_k to a control u_k ∈ U_k(x_k). The expected cost of following π from state x_0 is
J_π(x_0) = E[ sum_{k=0}^{N} g_k(x_k, u_k, w_k) ],
where u_k = µ_k(x_k), x_{k+1} = f_k(x_k, u_k, w_k), and E[·] is over the w_k's.
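
J_π can also be estimated by simulation, which is a common way to sanity-check a policy numerically. Below is a minimal sketch (not from the slides); the interfaces policy, f, g, and sample_w are assumptions, standing in for the µ_k, f_k, g_k, and the distribution of w_k.

```python
import numpy as np

def estimate_cost(policy, f, g, sample_w, x0, N, num_rollouts=10_000, seed=0):
    """Monte Carlo estimate of J_pi(x0) = E[ sum_{k=0}^N g_k(x_k, u_k, w_k) ].

    policy[k](x) returns the control u_k = mu_k(x); f(k, x, u, w) and
    g(k, x, u, w) are the transition and stage cost; sample_w(k, rng) draws
    a disturbance. All of these interfaces are assumed for this sketch.
    """
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(num_rollouts):
        x, cost = x0, 0.0
        for k in range(N + 1):
            u = policy[k](x)
            w = sample_w(k, rng)
            cost += g(k, x, u, w)
            x = f(k, x, u, w)
        total += cost
    return total / num_rollouts
```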

29 Finite Horizon MDPs: formulation
The optimal expected cost-to-go from x_0 is
J*(x_0) = min_{π ∈ Π} J_π(x_0),
where Π consists of all feasible policies. We will see that the same policy π* is optimal for all initial states, so J*(x) = J_{π*}(x) for all x.

30 Minor differences with Bertsekas Vol. I
Bertsekas:
- uses a special terminal cost function g_N(x_N); one can always take g_N(x, u, w) to be independent of u and w.
- lets the distribution of w_k depend on k and x_k; this can be embedded in the functions f_k, g_k.

31-32 Principle of Optimality
Regardless of the consequences of initial decisions, an optimal policy should be optimal in the sub-problem beginning at the current state and time period.
- Sufficiency: such policies exist and minimize total expected cost from any initial state.
- Necessity: a policy that is optimal from some initial state must behave optimally in any subproblem that is reached with positive probability.

33 The Dynamic Programming Algorithm
Set
J_N(x) = min_{u ∈ U_N(x)} E[g_N(x, u, w)]  for all x ∈ X_N.
For k = N-1, N-2, ..., 0, set
J_k(x) = min_{u ∈ U_k(x)} E[ g_k(x, u, w) + J_{k+1}(f_k(x, u, w)) ]  for all x ∈ X_k.
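
The recursion is straightforward to implement when the state and control spaces are finite and the disturbance takes finitely many values. The sketch below is one way to do it; the interfaces (states, controls, f, g, w_dist) are assumptions for illustration, not something defined in the slides.

```python
def backward_induction(states, controls, f, g, w_dist, N):
    """Finite-horizon DP: returns cost-to-go tables J[k][x] and a policy mu[k][x].

    states[k]     : iterable of states X_k
    controls[k]   : dict mapping a state x to the feasible controls U_k(x)
    f(k, x, u, w) : next state; g(k, x, u, w): stage cost
    w_dist[k]     : list of (w, probability) pairs for the disturbance w_k
    """
    J = [dict() for _ in range(N + 2)]   # J[N+1] stays empty: no cost beyond the horizon
    mu = [dict() for _ in range(N + 1)]

    def q_value(k, x, u):
        # Expected stage cost plus cost-to-go of the successor state.
        return sum(p * (g(k, x, u, w) + J[k + 1].get(f(k, x, u, w), 0.0))
                   for w, p in w_dist[k])

    for k in range(N, -1, -1):           # k = N, N-1, ..., 0
        for x in states[k]:
            u_best = min(controls[k][x], key=lambda u: q_value(k, x, u))
            mu[k][x] = u_best
            J[k][x] = q_value(k, x, u_best)
    return J, mu
```

For the inventory example sketched earlier, states[k] would be range(1001), controls[k][x] would be range(1001 - x), f and g would wrap step and stage_cost, and w_dist[k] would be a (truncated) demand distribution.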

34 The Dynamic Programming Algorithm
Proposition. For all x ∈ X_0, J*(x) = J_0(x). The optimal cost-to-go is attained by a policy π* = (µ*_0, ..., µ*_N), where
µ*_N(x) ∈ argmin_{u ∈ U_N(x)} E[g_N(x, u, w)]  for all x ∈ X_N,
and for all k ∈ {0, ..., N-1} and x ∈ X_k,
µ*_k(x) ∈ argmin_{u ∈ U_k(x)} E[ g_k(x, u, w) + J_{k+1}(f_k(x, u, w)) ].

35 The Dynamic Programming Algorithm
Class exercise: argue this is true for a two-period problem (N = 1). Hint: recall the tower property of conditional expectation, E[Y] = E[E[Y | X]].

36 A Tedious Proof
For any policy π = (µ_0, µ_1) and initial state x_0,
E_π[g_0(x_0, µ_0(x_0), w_0) + g_1(x_1, µ_1(x_1), w_1)]
  = E_π[g_0(x_0, µ_0(x_0), w_0) + E[g_1(x_1, µ_1(x_1), w_1) | x_1]]
  ≥ E_π[g_0(x_0, µ_0(x_0), w_0) + min_{u ∈ U(x_1)} E[g_1(x_1, u, w_1) | x_1]]
  = E_π[g_0(x_0, µ_0(x_0), w_0) + J_1(x_1)]
  = E_π[g_0(x_0, µ_0(x_0), w_0) + J_1(f_0(x_0, µ_0(x_0), w_0))]
  ≥ min_{u ∈ U(x_0)} E[g_0(x_0, u, w_0) + J_1(f_0(x_0, u, w_0))]
  = J_0(x_0).
Under π*, every inequality is an equality.

37-38 Markov Property
Markov chain: a stochastic process (X_0, X_1, X_2, ...) is a Markov chain if for each n ∈ N, conditioned on X_{n-1}, X_n is independent of (X_0, ..., X_{n-2}). That is,
P(X_n = x | X_{n-1}) = P(X_n = x | X_0, ..., X_{n-1}).
Without loss of generality, we can view a Markov chain as the output of a stochastic recursion X_{n+1} = f_n(X_n, W_n) for an i.i.d. sequence of disturbances (W_0, W_1, ...).

39 Markov Property
Our problem is called a Markov decision process because
P(x_{n+1} = x | x_0, u_0, w_0, ..., x_n, u_n) = P(f_n(x_n, u_n, w_n) = x | x_n, u_n) = P(x_{n+1} = x | x_n, u_n).
This requires that the encoding of the state is sufficiently rich.

40-42 Inventory Control Revisited
Suppose that inventory has a lead time of 2 periods. Orders can still be placed in any period. Is this an MDP with state = current inventory?
No! Transition probabilities depend on the order that is currently in transit.
It is an MDP if we augment the state space so that x_n = (current inventory, inventory arriving next period).
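
A minimal sketch of the augmented transition, with the state laid out as a pair (on-hand inventory, order arriving next period). The lost-sales clipping at zero and the function name are assumptions for illustration; the slides only describe the augmented state informally.

```python
def step_leadtime2(state, order, demand):
    """One transition of the lead-time-2 inventory model on the augmented state."""
    on_hand, arriving = state
    # The order already in transit arrives next period; unmet demand is lost.
    next_on_hand = max(on_hand - demand, 0) + arriving
    # The order placed now becomes "arriving next period" one step later.
    return (next_on_hand, order)
```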

43-45 State Augmentation
In the extreme, choosing the state to be the full history x̃_{n-1} = (x_0, u_0, ..., u_{n-2}, x_{n-1}) suffices, since
P(x̃_n = x | x̃_{n-1}, u_{n-1}) = P(x̃_n = x | x_0, u_0, ..., x_{n-1}, u_{n-1}).
For the next few weeks we will assume the Markov property holds. Computational tractability usually requires a compact state representation.

46-47 Example: selling an asset
An instance of optimal stopping.
- Deadline to sell within N periods.
- Potential buyers make offers in sequence.
- The agent chooses to accept or reject each offer.
- The asset is sold once an offer is accepted.
- Offers are no longer available once declined.
- Offers are statistically independent.
- Profits can be invested with interest rate r > 0 per period.
Class exercise:
1. Formulate this as a finite horizon MDP.
2. Write down the DP algorithm.

48 Example: selling an asset
- Special terminal state t (costless and absorbing).
- x_k ≠ t is the offer considered at time k; x_0 = 0 is a fictitious null offer.
- g_k(x_k, sell) = (1 + r)^{N-k} x_k.
- x_k = w_{k-1} for independent offers w_0, w_1, ....

49 Example: selling an asset
DP algorithm:
J_k(t) = 0 for all k,
J_N(x) = x,
J_k(x) = max{ (1 + r)^{N-k} x, E[J_{k+1}(w_k)] }.
A threshold policy is optimal: sell if and only if
x_k ≥ E[J_{k+1}(w_k)] / (1 + r)^{N-k}.
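
The thresholds are easy to compute by backward recursion once the offer distribution is discretized. The sketch below is an illustration only: the offer values, probabilities, horizon, and interest rate are made up, and the function name is hypothetical.

```python
import numpy as np

# Made-up offer distribution and parameters for illustration.
offers = np.array([0.0, 50.0, 100.0, 150.0, 200.0])
probs = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
r, N = 0.05, 10

def selling_thresholds(offers, probs, r, N):
    """Accept thresholds: at time k < N, sell iff the current offer >= thresh[k]."""
    J_next = offers.copy()                      # J_N(x) = x
    thresh = [None] * N
    for k in range(N - 1, -1, -1):
        cont = float(probs @ J_next)            # E[J_{k+1}(w_k)], value of rejecting
        thresh[k] = cont / (1 + r) ** (N - k)
        # J_k(x) = max{(1 + r)^{N-k} x, E[J_{k+1}(w_k)]}
        J_next = np.maximum((1 + r) ** (N - k) * offers, cont)
    return thresh

print(selling_thresholds(offers, probs, r, N))  # one threshold per period k = 0, ..., N-1
```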
