EE365: Markov Decision Processes



- Markov decision processes
- Markov decision problem
- Examples

Markov decision processes

Markov decision processes
- add input (or action or control) to Markov chain with costs
- input selects from a set of possible transition probabilities
- input is function of state (in standard information pattern)

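Definition: Dynamical system form
$x_{t+1} = f_t(x_t, u_t, w_t), \quad t = 0, 1, \ldots, T-1$
- state $x_t \in \mathcal{X}$
- action or input $u_t \in \mathcal{U}$
- uncertainty or disturbance $w_t \in \mathcal{W}$
- dynamics functions $f_t : \mathcal{X} \times \mathcal{U} \times \mathcal{W} \to \mathcal{X}$
- $x_0, w_0, \ldots, w_{T-1}$ are independent RVs
- variation (state dependent input space): $u_t \in \mathcal{U}_t(x_t) \subseteq \mathcal{U}$ ($\mathcal{U}_t(x_t)$ is set of allowed actions in state $x_t$ at time $t$)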

Policy
- action is function of state: $u_t = \mu_t(x_t)$, $t = 0, \ldots, T-1$
- $\mu_t : \mathcal{X} \to \mathcal{U}$ is state feedback function at time $t$
- $\mu = (\mu_0, \ldots, \mu_{T-1})$ is the policy (or control law)
- number of possible policies: $|\mathcal{U}|^{|\mathcal{X}| T}$
  - for each $t = 0, \ldots, T-1$, for each $x \in \mathcal{X}$, we can choose $\mu_t(x) \in \mathcal{U}$
  - very large for any case of interest
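The policy count can be checked on a tiny case. The sketch below is illustrative only: `num_policies` and `enumerate_policies` are hypothetical helper names, and states and actions are assumed to be small finite lists.

```python
from itertools import product

def num_policies(n_states, n_actions, T):
    """|U|^(|X| T): one action choice per state, per time step."""
    return n_actions ** (n_states * T)

def enumerate_policies(states, actions, T):
    """Yield every policy as a tuple (mu_0, ..., mu_{T-1}) of dicts state -> action."""
    # all maps X -> U for a single time step
    per_stage = [dict(zip(states, choice))
                 for choice in product(actions, repeat=len(states))]
    # a policy picks one such map for each t
    return product(per_stage, repeat=T)

count_small = len(list(enumerate_policies([0, 1], [0, 1], 2)))
print(count_small, num_policies(2, 2, 2))
print(num_policies(10, 3, 5))   # already far too many to enumerate
```

Enumeration is only feasible for toy sizes, which is the point of the "very large" remark.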

Closed-loop system
- with policy, (closed-loop) dynamics is $x_{t+1} = F_t(x_t, w_t) = f_t(x_t, \mu_t(x_t), w_t)$, $t = 0, 1, \ldots, T-1$
- $F_t$ are closed-loop state transition functions
- $x_0, \ldots, x_T$ is Markov

Information patterns
- $u_t = \mu_t(x_t)$ is standard information pattern
  - action is function of current state
  - also called state feedback control
- some nonstandard information patterns:
  - full information (or prescient): $u_t = \mu_t(x_0, w_0, \ldots, w_{T-1})$
  - no information: $u_t = \mu_t()$ (i.e., $u_0, \ldots, u_{T-1}$ are fixed)
  - initial state (also called open-loop): $u_t = \mu_t(x_0)$
  - state and disturbance: $u_t = \mu_t(x_t, w_t)$

Cost function
- total cost is $J = \mathbf{E}\left( \sum_{t=0}^{T-1} g_t(x_t, u_t, w_t) + g_T(x_T) \right)$
- stage cost functions $g_t : \mathcal{X} \times \mathcal{U} \times \mathcal{W} \to \mathbf{R}$
- terminal cost function $g_T : \mathcal{X} \to \mathbf{R}$
- variation: allow $g_t$ to take on value $+\infty$ to encode constraints on state-action pairs ($-\infty$ for rewards, when we maximize)
- we sometimes write $J^\mu$ to show dependence of cost on policy

Closed-loop stage cost functions
- closed-loop stage cost functions: $G_t(x) = \mathbf{E}_{w_t}\, g_t(x, \mu_t(x), w_t)$, $t = 0, \ldots, T-1$ (note that $x_t \perp w_t$, i.e., $x_t$ and $w_t$ are independent)
- closed-loop terminal cost function: $G_T(x) = g_T(x)$

Cost function: Special cases
- deterministic cost: $g_t$ do not depend on $w_t$
- time-invariant: $g_0, \ldots, g_T$ are the same
- terminal cost only: $g_0 = \cdots = g_{T-1} = 0$
- state-control separable (deterministic case): $g_t(x_t, u_t, w_t) = q_t(x_t) + r_t(u_t)$
  - $q_t : \mathcal{X} \to \mathbf{R}$ is state cost function
  - $r_t : \mathcal{U} \to \mathbf{R}$ is action cost function

Value iteration to compute cost
- we can use value iteration to compute $J$ (deterministic cost for simplicity)
- take $V_T(x) = g_T(x)$ and, for $t = T-1, \ldots, 0$,
  $V_t(x) = g_t(x, \mu_t(x)) + \mathbf{E}\, V_{t+1}(f_t(x, \mu_t(x), w_t))$
  (expectation is over $w_t$)
- $J = \pi_0 V_0$
- computation cost is $T |\mathcal{X}| |\mathcal{W}|$ operations (fewer for sparse transitions)
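The backward recursion for a fixed policy can be sketched in a few lines. This is a minimal illustration, not course code: `evaluate_policy` and the toy instance (three states, a fair 0/1 disturbance, clipped dynamics) are assumptions made up for the example, and the stage cost is taken to be deterministic as on the slide.

```python
def evaluate_policy(states, dist_w, f, g, g_T, mu, pi0, T):
    """Return (J, V_0) for a fixed policy, using
    V_T = g_T,  V_t(x) = g_t(x, mu_t(x)) + E V_{t+1}(f_t(x, mu_t(x), w_t))."""
    V = {x: g_T(x) for x in states}                  # V_T
    for t in range(T - 1, -1, -1):                   # t = T-1, ..., 0
        V_next, V = V, {}
        for x in states:
            u = mu(t, x)
            # expectation over w_t: |W| terms per state, T|X||W| in total
            expected_next = sum(p * V_next[f(t, x, u, w)]
                                for w, p in dist_w.items())
            V[x] = g(t, x, u) + expected_next
    J = sum(pi0[x] * V[x] for x in states)           # J = pi_0 V_0
    return J, V

# Hypothetical toy instance (not from the slides)
states = [0, 1, 2]
dist_w = {0: 0.5, 1: 0.5}                            # distribution of w_t
f = lambda t, x, u, w: min(max(x + u - w, 0), 2)     # clipped dynamics
g = lambda t, x, u: x + 2 * u                        # deterministic stage cost
g_T = lambda x: x                                    # terminal cost
mu = lambda t, x: 1 if x == 0 else 0                 # state feedback policy
pi0 = {0: 1.0, 1: 0.0, 2: 0.0}                       # x_0 = 0 with probability 1

J, V0 = evaluate_policy(states, dist_w, f, g, g_T, mu, pi0, T=2)
print(J, V0)
```

The three nested loops (over $t$, $x$, and $w$) make the $T|\mathcal{X}||\mathcal{W}|$ operation count visible.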

Concrete form
- $\mathcal{X} = \{1, \ldots, n\}$, $\mathcal{U} = \{1, \ldots, m\}$
- transition probabilities (time-invariant case) given by $P_{ijk} = \mathbf{Prob}(x_{t+1} = j \mid x_t = i, u_t = k)$
- $P_{ijk}$ is probability that next state is $j$, when current state is $i$ and control action $k$ is taken
- $P$ is 3-D array (often sparse)
- in state $i$, action chooses next state distribution from choices $P_{i,:,k} = [P_{i1k} \cdots P_{ink}]$, $k = 1, \ldots, m$
- for time-varying case, $P$ is 4-D array (!!)
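One way to connect the dynamical system form to the array $P$ is to sum the disturbance probabilities that lead to each next state. The sketch below assumes a time-invariant $f$ and a finite disturbance set; `transition_array` and the toy dynamics are hypothetical, and indices are 0-based in the code although the slides number states and actions from 1.

```python
def transition_array(n, m, f, dist_w):
    """P[i][j][k] = Prob(x_{t+1} = j | x_t = i, u_t = k), 0-based indices."""
    P = [[[0.0] * m for _ in range(n)] for _ in range(n)]
    for i in range(n):
        for k in range(m):
            for w, p in dist_w.items():
                # every disturbance value w sends (i, k) to state f(i, k, w)
                P[i][f(i, k, w)][k] += p
    return P

# Hypothetical toy dynamics: 3 states, 2 actions, fair 0/1 disturbance
f = lambda x, u, w: min(max(x + u - w, 0), 2)
P = transition_array(3, 2, f, {0: 0.5, 1: 0.5})

# next-state distribution P[i][:][k] for state i = 0, action k = 1
print([P[0][j][1] for j in range(3)])
```

Each slice `P[i][:][k]` is a probability distribution over next states, so it sums to one.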

Concrete form
- stage costs (time-invariant case) given by $C_{ijk}$, $i, j = 1, \ldots, n$, $k = 1, \ldots, m$
- $C_{ijk}$ is cost when state $i$ transitions to state $j$ with action $k$
- $C$ is 3-D array (often sparse); can assume that $C_{ijk} = 0$ when $P_{ijk} = 0$
- state-action separable case: $C_{ijk} = q_i + r_k$
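With $P$ and $C$ stored as arrays, fixing a policy picks out one slice per state, which gives the closed-loop transition matrix and the expected stage cost per state. The sketch below is an illustration under assumptions: `closed_loop` is a hypothetical name, the policy is taken to be time-invariant for brevity (the slides allow $\mu_t$ to vary with $t$), indices are 0-based, and the numbers are made up.

```python
def closed_loop(P, C, mu):
    """Closed-loop transition matrix and expected stage cost for policy mu.
    P[i][j][k] and C[i][j][k] use 0-based indices; mu[i] is the action in state i."""
    n = len(P)
    # row i of the closed-loop chain is the slice P[i][:][mu[i]]
    P_cl = [[P[i][j][mu[i]] for j in range(n)] for i in range(n)]
    # expected stage cost in state i: sum_j P_ijk * C_ijk with k = mu[i]
    G = [sum(P[i][j][mu[i]] * C[i][j][mu[i]] for j in range(n))
         for i in range(n)]
    return P_cl, G

# Hypothetical 2-state, 2-action instance with separable costs C_ijk = q_i + r_k
P = [[[0.5, 1.0], [0.5, 0.0]],
     [[0.25, 0.0], [0.75, 1.0]]]
q, r = [0.0, 1.0], [0.0, 2.0]
C = [[[q[i] + r[k] for k in range(2)] for j in range(2)] for i in range(2)]

P_cl, G = closed_loop(P, C, mu=[0, 1])
print(P_cl, G)
```

In the separable case the expected stage cost reduces to $q_i + r_{\mu(i)}$, since the slice sums to one.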

Markov decision problem

Markov decision process
- Markov decision process (MDP) defined by
  - (action dependent) state transition functions $f_0, \ldots, f_{T-1}$
  - distributions of $x_0, w_0, \ldots, w_{T-1}$
  - stage cost functions $g_0, \ldots, g_{T-1}$
  - terminal cost function $g_T$
- policy defined by state feedback functions $\mu_0, \ldots, \mu_{T-1}$
- combining Markov decision process with policy, we get closed-loop Markov chain with costs

Markov decision problem
- given Markov decision process, cost with policy $\mu$ is $J^\mu$
- Markov decision problem: find a policy $\mu$ that minimizes $J^\mu$
- number of possible policies: $|\mathcal{U}|^{|\mathcal{X}| T}$ (very large for any case of interest)
- there can be multiple optimal policies
- we will see how to find an optimal policy next lecture
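For a toy instance the problem can be solved by exhaustive search, which also shows that optimal policies need not be unique. This is only a brute-force sketch to make the definition concrete, not the method the lecture goes on to present; `policy_cost`, `brute_force`, and all numbers are hypothetical, with 0-based indices.

```python
from itertools import product

def policy_cost(P, C, g_T, pi0, policy):
    """Expected total cost J of a policy; policy[t][i] is the action in state i at time t."""
    n = len(P)
    V = list(g_T)                                    # V_T
    for mu_t in reversed(policy):                    # t = T-1, ..., 0
        V = [sum(P[i][j][mu_t[i]] * (C[i][j][mu_t[i]] + V[j]) for j in range(n))
             for i in range(n)]
    return sum(pi0[i] * V[i] for i in range(n))

def brute_force(P, C, g_T, pi0, T):
    """Evaluate all |U|^(|X| T) policies; return (min cost, minimizers, count)."""
    n, m = len(P), len(P[0][0])
    stage_maps = list(product(range(m), repeat=n))   # all maps X -> U
    costs = {pol: policy_cost(P, C, g_T, pi0, pol)
             for pol in product(stage_maps, repeat=T)}
    J_star = min(costs.values())
    optimal = [pol for pol, J in costs.items() if abs(J - J_star) < 1e-12]
    return J_star, optimal, len(costs)

# Hypothetical instance: state 1 is expensive, action 1 pays 1 to move to state 0
P = [[[0.5, 1.0], [0.5, 0.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
q, r = [0.0, 4.0], [0.0, 1.0]
C = [[[q[i] + r[k] for k in range(2)] for j in range(2)] for i in range(2)]

J_star, optimal, count = brute_force(P, C, g_T=[0.0, 4.0], pi0=[0.5, 0.5], T=2)
print(J_star, optimal, count)
```

In this instance two policies attain the minimum: they differ only in the action chosen at $t = 1$ in state 1, which is never visited under the optimal action at $t = 0$.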

Examples

Trading
simple trading model for one asset:
- hold (integer) number of shares $q_t \in [Q^{\min}, Q^{\max}]$ in period $t$
- buy $u_t$ shares at time $t$, $u_t \in [Q^{\min} - q_t, Q^{\max} - q_t]$, so $q_{t+1} = q_t + u_t$
- price $p_t \in \{P_1, \ldots, P_k\}$ is Markov; $p_t$ known before $u_t$ is chosen
- revenue is $-u_t p_t - T(u_t) - S((q_t)_-)$
  - $T(u_t) \geq 0$ is transaction cost
  - $S((q_t)_-) \geq 0$ is shorting cost
- $q_0 = 0$; we require $q_T = 0$
- maximize total expected revenue over $t = 0, \ldots, T-1$

Trading
MDP model:
- state is $x_t = (q_t, p_t)$
- stage cost is negative revenue
- terminal cost is $g_T(0) = 0$; $g_T(q) = \infty$ for $q \neq 0$
- (trading) policy gives number of assets to buy (sell) as function of time $t$, current holdings $q_t$, and price $p_t$
- presumably, good policy buys when $p_t$ is low and sells when $p_t$ is high
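The pieces of the trading MDP can be written down directly. The sketch below is illustrative only: the slides require just $T(u_t) \geq 0$ and $S((q_t)_-) \geq 0$, so the linear forms of `transaction_cost` and `shorting_cost`, the position limits, and all function names are assumptions. The price dynamics are left out.

```python
import math

Q_MIN, Q_MAX = -2, 2              # hypothetical position limits

def allowed_actions(q):
    """Integer trades u with Q_MIN <= q + u <= Q_MAX."""
    return list(range(Q_MIN - q, Q_MAX - q + 1))

def transaction_cost(u, rate=0.5):
    # assumed form T(u) = rate * |u|; the model only requires T(u) >= 0
    return rate * abs(u)

def shorting_cost(short, rate=0.25):
    # assumed form S(z) = rate * z, applied to z = (q)_- >= 0
    return rate * short

def stage_cost(q, p, u):
    """Negative revenue: u*p + T(u) + S((q)_-)."""
    short = max(-q, 0)            # negative part (q)_-
    return u * p + transaction_cost(u) + shorting_cost(short)

def terminal_cost(q):
    """Encodes the requirement q_T = 0 as an infinite cost otherwise."""
    return 0.0 if q == 0 else math.inf

def next_holdings(q, u):
    return q + u

print(allowed_actions(1))
print(stage_cost(-1, 10, 2))      # buying 2 shares at price 10 while short 1
print(stage_cost(2, 10, -2))      # selling 2 shares at price 10
```

A negative stage cost corresponds to positive revenue, as in the selling case.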

Variations
how do we handle (model) the following, and what assumptions would we need to make?
- price movements that depend on $u_t$ (price impact)
- imperfect fulfillment (i.e., you might not buy or sell the full amount $u_t$)
- price movements that depend on a signal $s_t \in \{S_1, \ldots, S_r\}$ that you know at time $t$
