Part 4: Markov Decision Processes


Slide 1: Part 4: Markov Decision Processes (© Vikram Krishnamurthy)

Aim: This part covers discrete-time Markov decision processes whose state is completely observed. The key idea covered is stochastic dynamic programming. We apply stochastic dynamic programming to solve fully observed Markov decision processes (MDPs). Later we will tackle partially observed Markov decision processes (POMDPs). Issues such as general state spaces and measurability are omitted; instead we focus on structural aspects of stochastic dynamic programming.

Slide 2: History

- Classical control (frequency domain): root locus, Bode diagrams, stability, PID (1950s).
- State-space theory (Kalman, 1960s). Modern control (time domain): state-variable feedback, observability, controllability.
- Optimal and stochastic control (1960s to 1990s): dynamic programming (Bellman), LQ and Markov decision processes (1960s), partially observed stochastic control = filtering + control, stochastic adaptive control (1980s and 1990s).
- Robust stochastic control: H-infinity control (1990s).
- Scheduling control of computer networks, manufacturing systems (1990s).
- Neuro-dynamic programming (reinforcement learning), 1990s.

Slide 3: Applications

- Control in telecom and sensor networks: admission, access and power control in wireless networks and computer networks.
- Sensor scheduling and optimal search.
- Robotic navigation and intelligent control.
- Process scheduling and manufacturing.
- Aeronautics: autopilots, missile guidance systems, satellite navigation systems.

Slide 4: Fully Observed MDP

1. Discrete-time dynamic system: the state $\{x_k\} \in X$ evolves for time $k = 0,1,\dots,N$ as
$$ x_{k+1} = A_k(x_k, u_k, w_k), \qquad x_0 \sim \pi_0(\cdot) $$
Observations: $y_k = x_k$. Control: $u_k \in U_k(x_k)$. Process noise: $w_k$ i.i.d.

2. Policy class: consider admissible policies $\pi = \{\mu_0, \dots, \mu_{N-1}\}$ where $u_k = \mu_k(x_k) \in U_k(x_k)$.

3. Cost function: additive cost
$$ J_\pi(x_0) = E\Big\{ c_N(x_N) + \sum_{k=0}^{N-1} c_k(x_k, \mu_k(x_k)) \Big\} \qquad (1) $$
where $c_N$ denotes the terminal cost.

Aim: compute the optimal policy
$$ J^*(x_0) = \min_{\pi \in \Pi} J_\pi(x_0) $$
where $\Pi$ is the set of admissible policies. $J^*(x_0)$ is called the optimal cost (value) function.

Slide 5: Terminology

- Finite horizon: $N$ finite.
- Fully observed: $y_k = x_k$. Partially observed: $y_k = C_k(x_k, v_k)$ (next part).
- Infinite horizon:
  1. Average cost: $J_\pi(x_0) = \lim_{N\to\infty} \frac{1}{N} E\big\{ \sum_{k=0}^{N-1} c(x_k, \mu(x_k)) \big\}$
  2. Discounted cost, $\rho \in (0,1)$: $J_\pi(x_0) = E\big\{ \sum_{k=0}^{\infty} \rho^k c(x_k, \mu(x_k)) \big\}$

Remarks:
1. Average cost problems need more technical conditions and are somewhat harder than discounted cost problems.
2. This is also called a stochastic dynamic optimization or sequential decision making problem.

Slide 6: Application Examples

2.1 Finite state Markov decision processes (MDP)

$x_k$ is an $S$-state Markov chain with transition probabilities $P_{ij}(u) = P(x_{k+1}=j \mid x_k=i, u_k=u)$, $i,j \in \{1,\dots,S\}$. Cost function as in (1). Numerous applications in OR, EE and gambling theory.

Benchmark example: machine (or sensor) replacement.
- State: $x_k \in \{0,1\}$; $x_k = 0$ operational, $x_k = 1$ failed.
- Control: $u_k \in \{0,1\}$; $u_k = 0$ keep machine, $u_k = 1$ replace by a new one.
- Transition probability matrices: let $\theta = P(x_{k+1}=1 \mid x_k=0, u_k=0)$. Then
$$ P(0) = \begin{bmatrix} 1-\theta & \theta \\ 0 & 1 \end{bmatrix}, \qquad P(1) = \begin{bmatrix} 1 & 0 \\ 1 & 0 \end{bmatrix} $$
- Cost: minimize $E\{\sum_{k=0}^{N-1} c(x_k, u_k)\}$ where $c(0,0) = 0$, $c(1,0) = C$, $c(x,1) = R$.

Slide 7: Other Fully Observed Problems

1. Linear quadratic (LQ) control. Fully observed problem:
$$ x_{k+1} = A_k x_k + B_k u_k + w_k $$
$$ J_\pi(x_0) = E\Big\{ x_N' Q_N x_N + \sum_{k=0}^{N-1} \big( u_k' R_k u_k + x_k' Q_k x_k \big) \Big\}, \qquad Q_k \ge 0,\ R_k > 0 $$
$w_k$ is zero-mean, finite-variance white noise (not necessarily Gaussian). For detailed analysis and design examples, see Anderson and Moore.

Remarks: (1) Linear control methods have explicit solutions. (2) They may be applied to nonlinear systems operating on a small-signal basis via linearization. (3) Selection of Q and R involves engineering judgment. (4) For the partially observed problem, LQG control is widely used.

Slide 8: Optimal Stopping Problems (Termination Control)

Asset selling problem: you want to sell an asset. Offers $w_0, w_1, \dots, w_{N-1}$ are i.i.d. If offer $k$ is accepted, you invest $w_k$ at a fixed interest rate $r$. If you reject an offer, you wait until the next offer; rejected offers cannot be recalled. Offer $N-1$ must be accepted if all earlier offers were rejected.

Aim: what is the optimal policy for accepting and rejecting offers?

Formulation: $w_k$ is the offer value (real valued, say). Control space: accept (sell) $u^1$, reject (do not sell) $u^2$. State space: the reals plus $T$ (termination state).
$$ x_{k+1} = A_k(x_k, u_k, w_k) = \begin{cases} T & \text{if } x_k = T, \text{ or } x_k \neq T \text{ and } u_k = u^1 \\ w_k & \text{otherwise} \end{cases} \qquad k = 1,\dots,N-1 $$
Reward: maximize $E\{ c_N(x_N) + \sum_{k=0}^{N-1} c_k(x_k, u_k, w_k) \}$ where
$$ c_N(x_N) = \begin{cases} x_N & \text{if } x_N \neq T \\ 0 & \text{otherwise} \end{cases}, \qquad c_k(x_k, u_k, w_k) = \begin{cases} (1+r)^{N-k} x_k & \text{if } x_k \neq T \text{ and } u_k = u^1 \\ 0 & \text{otherwise} \end{cases} $$
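The backward DP for this problem can be made concrete once the offer distribution is discretized; a minimal sketch, where the uniform offer grid, horizon and interest rate are assumptions chosen purely for illustration:

```python
import numpy as np

# Backward DP for the asset-selling problem on an assumed finite offer grid.
# Accepting offer x at time k is worth (1+r)**(N-k) * x in time-N money.
def asset_selling_dp(offer_values, offer_probs, N, r):
    offer_values = np.asarray(offer_values, dtype=float)
    offer_probs = np.asarray(offer_probs, dtype=float)
    J = np.zeros((N + 1, len(offer_values)))
    J[N] = offer_values                      # the last offer must be accepted
    thresholds = np.zeros(N)                 # accept iff current offer >= threshold
    for k in range(N - 1, -1, -1):
        accept = (1 + r) ** (N - k) * offer_values
        reject = offer_probs @ J[k + 1]      # expectation over the next offer
        J[k] = np.maximum(accept, reject)
        ok = accept >= reject
        thresholds[k] = offer_values[ok].min() if ok.any() else np.inf
    return J, thresholds

# Example: offers uniform on {0, 1, ..., 10}, horizon N = 5, interest 5%.
J, thr = asset_selling_dp(np.arange(11), np.ones(11) / 11, N=5, r=0.05)
print("accept thresholds by time:", thr)
```

Because the reject value does not depend on the current offer while the accept value is increasing in it, the optimal policy is a time-varying threshold, which is what the sketch reports.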

Slide 9: Further Examples

Rational thief problem: on night $k$ the thief can either retire with accumulated earnings $x_k$ or rob a house and gain $w_k$. The thief is caught with probability $p$; if caught, all money is lost. Assume the $w_k$ are i.i.d. with mean $\bar w$. Compute the optimal policy over $N$ nights.

3. Scheduling problems: job scheduling on a single processor. $N$ jobs are to be done in sequential order. Job $i$ requires a random time $T_i$; the $T_i$ are i.i.d. If job $i$ is completed at time $t$, the reward is $\rho^t R_i$, $0 < \rho < 1$. Find the schedule that maximizes the total reward.

See Bertsekas, Ross or Puterman for a wealth of examples, and journals such as IEEE Transactions on Automatic Control, Machine Learning, and Annals of Operations Research.

Slide 10: Dynamic Programming (DP)

3.1 Principle of Optimality (Bellman, 1962): an optimal policy has the property that, whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.

Example: if the shortest path from node 1 to node 2 is via a, b, c, d, then the shortest path from b to 2 is via c, d.

DP is an algorithm that uses the principle of optimality to determine the optimal policy. DP is widely used in control, OR and discrete optimization. DP yields a functional equation of the form
$$ J_k(x) = \min_u \Big[ c_k(x,u) + \int_{\mathbb{R}^d} J_{k+1}(A_k(x,u,w))\, p(w)\, dw \Big] $$

Slide 11: [Figure 1: Deterministic shortest path problem (graph on nodes 1 to 5; edge distances are listed on the next slide).]

Slide 12: DP for the Shortest Path Problem

Compute the shortest distance and path from node 1 to node 5. Let $c_{ij}$ = distance from node $i$ to node $j$: $c_{12}=3$, $c_{13}=2$, $c_{23}=1$, $c_{24}=2$, $c_{34}=2$, $c_{35}=7$, $c_{45}=4$.

Define $J_i^*$ = minimum distance from node $i$ to node 5, and $u_i^*$ = next node on the minimum path from $i$ to 5. DP yields:
- $J_5^* = 0$
- $J_4^* = 4$; $u_4^* = 5$
- $J_3^* = \min_{u \in \{4,5\}} (c_{3u} + J_u^*) = \min\{c_{34}+J_4^*,\, c_{35}+J_5^*\} = \min\{2+4,\, 7\} = 6$; $u_3^* = 4$
- $J_2^* = \min_{u \in \{3,4\}} (c_{2u} + J_u^*) = \min\{c_{23}+J_3^*,\, c_{24}+J_4^*\} = \min\{1+6,\, 2+4\} = 6$; $u_2^* = 4$
- $J_1^* = \min\{c_{12}+J_2^*,\, c_{13}+J_3^*\} = \min\{3+6,\, 2+6\} = 8$; $u_1^* = 3$

Backtracking: the shortest path from 1 to 5 is via 3, 4 (i.e. 1 → 3 → 4 → 5).
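The backward recursion above is easy to code; a minimal sketch using exactly the edge distances listed on this slide:

```python
import numpy as np

# Backward DP for the deterministic shortest path example (nodes 1..5, destination 5).
dist = {(1, 2): 3, (1, 3): 2, (2, 3): 1, (2, 4): 2,
        (3, 4): 2, (3, 5): 7, (4, 5): 4}

def shortest_path_dp(dist, dest):
    nodes = sorted({i for edge in dist for i in edge})
    J = {dest: 0.0}          # J[i] = minimum distance from i to dest
    u = {}                   # u[i] = next node on an optimal path
    for i in reversed([n for n in nodes if n != dest]):   # work backwards
        choices = [(dist[(i, j)] + J[j], j) for j in nodes if (i, j) in dist and j in J]
        J[i], u[i] = min(choices)
    return J, u

J, u = shortest_path_dp(dist, dest=5)
print(J)   # {5: 0.0, 4: 4.0, 3: 6.0, 2: 6.0, 1: 8.0}
path, node = [1], 1
while node != 5:
    node = u[node]
    path.append(node)
print(path)  # [1, 3, 4, 5]
```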

Slide 13: Remarks on DP

1. We worked backwards from node 5 to node 1; this is called backward DP. Forward DP (FDP) yields the identical result.
2. The minimum path problem arises in optimal network flow design, critical path design and the Viterbi algorithm.
3. It is important to note that the cost (distance) between nodes is path independent. If it is path dependent, DP does not yield the optimal solution.
4. Backtracking is a characteristic of DP. "Life can only be understood going backwards; but it must be lived going forwards."
5. We will focus on solving stochastic control problems via DP (SDP).
6. The proof of the principle of optimality is straightforward (by contradiction).

Slide 14: Stochastic Dynamic Programming

4.1 Principle of Optimality (stochastic control version): consider the fully observed problem with cost function
$$ J_\pi(x_0) = E\Big\{ c_N(x_N) + \sum_{k=0}^{N-1} c_k(x_k, \mu_k(x_k)) \,\Big|\, x_0 \Big\} $$
Let $\pi^* = \{\mu_0^*, \mu_1^*, \dots, \mu_{N-1}^*\}$ be an optimal policy. Suppose the state at time $i$ is $x_i$ when using $\pi^*$. Consider the subproblem of minimizing
$$ E\Big\{ c_N(x_N) + \sum_{k=i}^{N-1} c_k(x_k, \mu_k(x_k)) \,\Big|\, x_i \Big\} $$
Then $\{\mu_i^*, \mu_{i+1}^*, \dots, \mu_{N-1}^*\}$ is optimal for this subproblem.

Slide 15: Solution via Stochastic Dynamic Programming

Recall the fully observed system is $x_{k+1} = A_k(x_k, u_k, w_k)$, $k = 0,1,\dots,N-1$.

Aim: determine the optimal policy to minimize
$$ J_\pi(x_0) = E\Big\{ c_N(x_N) + \sum_{k=0}^{N-1} c_k(x_k, \mu_k(x_k)) \,\Big|\, x_0 \Big\} $$

Outline: the solution consists of two stages.
1. Backward SDP to determine the optimal policy; this is data independent (offline). It creates a lookup table: for each state $x_k$ it gives the optimal $u_k$.
2. Forward implementation of the controller: given $x_k$, pick the optimal $u_k$ by table lookup.

    k      opt control if x_k = e_1    opt control if x_k = e_2
    1      u*_{1,1}                    u*_{2,1}
    2      u*_{1,2}                    u*_{2,2}
    ...    ...                         ...
    N-1    u*_{1,N-1}                  u*_{2,N-1}

Slide 16: SDP Solution

Given the system $x_{k+1} = A_k(x_k, u_k, w_k)$, $k = 0,1,\dots,N-1$, and cost
$$ J_\pi(x_0) = E\Big\{ c_N(x_N) + \sum_{k=0}^{N-1} c_k(x_k, \mu_k(x_k)) \,\Big|\, x_0 \Big\} $$
for every initial state $x_0$, the optimal cost is $J^*(x_0) = \inf_\pi J_\pi(x_0) = J_0(x_0)$. Here $J_0(x_0)$ is given by the last step of the backward DP algorithm for $k = N, N-1, \dots, 0$:
$$ J_N(x_N) = c_N(x_N) \quad \text{(terminal cost)} $$
$$ J_k(x) = \min_{u \in U_k(x)} \big[ c_k(x,u) + E\{ J_{k+1}(x_{k+1}) \mid x_k = x \} \big] = \min_{u \in U_k(x)} \Big[ c_k(x,u) + \int_{\mathbb{R}} J_{k+1}(A_k(x,u,w))\, p(w)\, dw \Big] $$
$$ \mu_k^*(x) = \arg\min_{u \in U_k(x)} \Big[ c_k(x,u) + \int_{\mathbb{R}} J_{k+1}(A_k(x,u,w))\, p(w)\, dw \Big] $$
The function
$$ J_k(x) \overset{\text{defn}}{=} \min_{\mu_k,\dots,\mu_{N-1}} E\Big\{ \sum_{t=k}^{N} c_t(x_t, u_t) \,\Big|\, x_k = x \Big\} $$
is called the value-to-go function.

Slide 17: Remarks on SDP

Mathematical rigor: if $w_k \in D_k$ where $D_k$ is countable, then the above results are rigorous:
$$ J_k(x) = \min_{u \in U_k(x)} \Big[ c_k(x,u) + \sum_{w \in D_k} J_{k+1}(A_k(x,u,w))\, p(w) \Big] $$
Otherwise the results are informal, in that we have not specified the function spaces to which $u$, $x$ and $w$ belong. For the general case, measurable selection theorems are required:
(i) $c_k(x,u)$ must be a measurable function;
(ii) for min to replace inf, compactness and lower semi-continuity are needed.

General conditions (Hernandez-Lerma & Lasserre):
(a) $U_k(x)$ is compact and $c_k(x,\cdot)$ is lower semi-continuous on $U_k(x)$ for all $x \in X$;
(b) $\int_{\mathbb{R}} J_{k+1}(A_k(x,u,w))\, p(w)\, dw$ is lower semi-continuous on $U_k(x)$ for every $x \in X$ and every continuous bounded function $J_{k+1}$ on $X$ (this does not work for LQ!).
Condition (a) can be replaced by inf-compactness of $c_k(x,u)$: for all $x \in X$ and $r \in \mathbb{R}$, the set $\{u \in U_k(x) : c_k(x,u) \le r\}$ is compact.

Slide 18: Finite State Markov Decision Processes (MDP)

Markov chain $x_k \in \{1,2,\dots,S\}$ with $P(x_{k+1}=j \mid x_k=i, u_k=u) = P_{ij}(u)$, $i,j \in \{1,\dots,S\}$.

Aim: minimize $J_\pi(x_0) = E\big\{ c_N(x_N) + \sum_{k=0}^{N-1} c_k(x_k, \mu_k(x_k)) \big\}$.

DP yields, for $i = 1,2,\dots,S$ and $k = N-1, N-2, \dots, 1, 0$:
$$ J_N(i) = c_N(i) $$
$$ J_k(i) = \min_{u_k} \big[ c_k(i,u_k) + E\{ J_{k+1}(x_{k+1}) \mid x_k = i \} \big] = \min_{u_k} \Big[ c_k(i,u_k) + \sum_{j=1}^{S} J_{k+1}(j)\, P(x_{k+1}=j \mid x_k=i, u_k) \Big] = \min_{u_k} \Big[ c_k(i,u_k) + \sum_{j=1}^{S} J_{k+1}(j)\, P_{ij}(u_k) \Big] $$
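A minimal sketch of this backward recursion for a generic finite-state, finite-action MDP; the dictionaries P and c holding per-action transition matrices and stage-cost vectors are an assumed data layout, not notation from the slides:

```python
import numpy as np

# Backward DP for a finite-state, finite-action MDP.
# P[u]: S x S transition matrix under action u; c[u]: S-vector of stage costs c(i, u).
def backward_dp(P, c, c_terminal, N):
    actions = sorted(P)
    S = len(c_terminal)
    J = np.asarray(c_terminal, dtype=float)        # J_N = terminal cost
    policy = np.zeros((N, S), dtype=int)           # lookup table mu*_k(i)
    for k in range(N - 1, -1, -1):
        # Q[a, i] = c(i, a) + sum_j P_ij(a) J_{k+1}(j)
        Q = np.array([c[a] + P[a] @ J for a in actions])
        policy[k] = np.argmin(Q, axis=0)
        J = Q.min(axis=0)                          # J_k
    return J, policy                               # J_0 and the lookup table
```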

Slide 19: In matrix-vector notation:
$$ J_k = \min_u \big[ c_k(u) + P(u) J_{k+1} \big] $$

Slide 20: Lookup Table

Dynamic programming creates a lookup table:
$$ J_k(i) = \min_{u_k} \Big[ c_k(i,u_k) + \sum_{j=1}^{S} J_{k+1}(j)\, P_{ij}(u_k) \Big] $$
$$ u^*_{i,k} = \mu_k^*(x_k = i) \overset{\text{defn}}{=} \arg\min_{u_k} \Big[ c_k(i,u_k) + \sum_{j=1}^{S} J_{k+1}(j)\, P_{ij}(u_k) \Big] $$

Thus we have a lookup table:

    k      opt control if x_k = 1    opt control if x_k = 2
    1      u*_{1,1}                  u*_{2,1}
    2      u*_{1,2}                  u*_{2,2}
    3      u*_{1,3}                  u*_{2,3}
    ...    ...                       ...
    N-2    u*_{1,N-2}                u*_{2,N-2}
    N-1    u*_{1,N-1}                u*_{2,N-1}

Remarks: (i) This requires O(NS) memory. (ii) If $N = \infty$ (infinite horizon) and $c_k(x_k,u_k) = \rho^k c(x_k,u_k)$, then the entries $u^*_{i,k}$ in the lookup table converge to values independent of $k$: a steady-state (stationary) policy, requiring O(S) memory.

Slide 21: Controller Implementation

1. Set $k = 0$; initial condition $x_0 = i$.
2. Select the optimal control $u^*_{x_k,k} = \mu_k^*(x_k)$ from the DP lookup table. Instantaneous cost: $c_k(x_k, u^*_{x_k,k})$.
3. The Markov chain evolves randomly according to $P(u^*_{x_k,k})$, generating the new state $x_{k+1}$.
4. If $k = N-1$, stop. Else set $k = k+1$ and go to step 2.

[Block diagram: the Markov chain, with transition probabilities $P(u^*_{x_k,k})$, feeds the state $x_k$ into the lookup table, which returns the control $u^*_{x_k,k}$.]
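A minimal sketch of this forward implementation, assuming a lookup table policy[k, i] such as the one produced by the backward DP sketch above:

```python
import numpy as np

# Simulate the closed loop: at each time, read the control from the lookup table,
# pay the instantaneous cost, and draw the next state from P(u).
def run_controller(P, c, policy, x0, rng=np.random.default_rng(0)):
    N = policy.shape[0]
    x, total_cost, traj = x0, 0.0, [x0]
    for k in range(N):
        u = policy[k, x]                       # table lookup
        total_cost += c[u][x]                  # instantaneous cost
        x = rng.choice(len(c[u]), p=P[u][x])   # random transition under P(u)
        traj.append(x)
    return total_cost, traj
```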

Slide 22: Design Example: Machine Replacement

Recall that $x_k \in \{0,1\}$ and $u_k \in \{0,1\}$ (the two states are relabelled 1 and 2 below).
$$ P(0) = \begin{bmatrix} 1-\theta & \theta \\ 0 & 1 \end{bmatrix}, \quad c(0) = \begin{bmatrix} 0 \\ c \end{bmatrix}, \qquad P(1) = \begin{bmatrix} 1 & 0 \\ 1 & 0 \end{bmatrix}, \quad c(1) = \begin{bmatrix} R \\ R \end{bmatrix} $$
Cost: minimize $E\{ \sum_{k=0}^{N-1} c(x_k, u_k) \}$.

DP solution for the control policy: $J_N(1) = 0$, $J_N(2) = 0$. For $k = N-1, N-2, \dots, 1, 0$:
$$ \begin{bmatrix} J_k(1) \\ J_k(2) \end{bmatrix} = \min_{u \in \{0,1\}} \big[ c(u) + P(u) J_{k+1} \big] = \begin{bmatrix} \min\{ (1-\theta) J_{k+1}(1) + \theta J_{k+1}(2),\; R + J_{k+1}(1) \} \\ \min\{ c + J_{k+1}(2),\; R + J_{k+1}(1) \} \end{bmatrix} $$

Slide 23: Design Example: Continued

e.g. if N = 4, θ = 0.1, c = 4, R = 3:

    k    J_k(1)    J_k(2)    u*_{1,k}    u*_{2,k}

e.g. if N = 4, θ = 0.1, c = 4, R = 6:

    k    J_k(1)    J_k(2)    u*_{1,k}    u*_{2,k}

(The entries of both tables follow from the recursion on the previous slide; see the sketch below.)
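A minimal sketch that generates the two tables under the stated parameters, using the transition matrices and cost vectors from the previous slide:

```python
import numpy as np

# Backward DP J_k = min_u [c(u) + P(u) J_{k+1}] for the machine replacement example.
def machine_replacement_tables(N, theta, c, R):
    P = {0: np.array([[1 - theta, theta], [0.0, 1.0]]),   # keep
         1: np.array([[1.0, 0.0], [1.0, 0.0]])}           # replace
    cost = {0: np.array([0.0, c]), 1: np.array([R, R])}
    J = np.zeros(2)                                        # J_N = 0
    rows = []
    for k in range(N - 1, -1, -1):
        Q = np.array([cost[u] + P[u] @ J for u in (0, 1)])  # Q[u, state]
        u_star = Q.argmin(axis=0)
        J = Q.min(axis=0)
        rows.append((k, J[0], J[1], u_star[0], u_star[1]))
    return rows[::-1]   # one row (k, J_k(1), J_k(2), u*_{1,k}, u*_{2,k}) per stage

for R in (3, 6):
    print(f"N=4, theta=0.1, c=4, R={R}")
    for row in machine_replacement_tables(N=4, theta=0.1, c=4, R=R):
        print(row)
```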

Slide 24: [Figure: Markov decision process, machine repair example. Top panel: machine state versus time (state 0 = machine operational, state 1 = machine failed). Bottom panel: control input versus time (0 = leave as is, 1 = repair machine). It is not profitable to repair after a certain time.]

Slide 25: Remarks

1. Note that DP minimizes the expected cost. The actual cost of a sample path is a random variable, which may be lower than the expected value.
2. In the machine repair example, there is a threshold time after which it is no longer profitable to repair the machine.

Slide 26: Perspective

1. DP is a widely used optimization algorithm in stochastic control, combinatorial optimization and operations research. We have looked at backward DP; forward DP is similar and yields the identical result (e.g. the Viterbi algorithm is a shortest path algorithm).
2. DP solves fully observed stochastic control. LQ and MDP problems have explicit solutions; most other problems do not.
3. We considered additive cost functions. Risk-sensitive control considers exponential cost functions; see Elliott et al.
4. Why feedback control is essential (next slide).

Things not covered:
1. Engineering of LQ control (Anderson & Moore).
2. Detailed mathematics (Bertsekas & Shreve).
3. Numerical approximations for solving DP.
4. Properties of the value-to-go function (Ross).

Slide 27: Why Feedback Control Is Essential

1. Open-loop systems are a special case of closed-loop systems, for both deterministic and stochastic systems.
2. In deterministic systems, for every closed-loop system there is an equivalent open-loop system (example: the linear case).
3. For stochastic systems, closed-loop (feedback) and open-loop systems are not equivalent. We show for a stochastic system that (i) no open-loop system has the same properties as a feedback system, and (ii) feedback always achieves a better cost than open loop in optimal control.

Slide 28: Feedback Is Essential in Stochastic Systems

Consider the stochastic system $x_{k+1} = x_k + u_k + w_k$, where $x_0$ and the $w_k$ are zero-mean white noise with variance $\sigma^2$.

Closed loop: suppose the feedback is $u_k = -x_k$. Then $x_{k+1} = w_k$.

Open loop: $x_k = x_0 + \sum_{m=0}^{k-1} u_m + \sum_{m=0}^{k-1} w_m$, so the mean is $E\{x_k\} = \sum_{m=0}^{k-1} u_m$ and the variance is $E\{x_0^2\} + \sum_{m=0}^{k-1} E\{w_m^2\} = (k+1)\sigma^2$. No open-loop control can produce $x_k = w_{k-1}$.

Cost: suppose $J = E\{\sum_{k=0}^{N} x_k^2\}$.
Closed loop: $J = E\{x_0^2\} + \sum_{k=1}^{N} E\{w_{k-1}^2\} = (N+1)\sigma^2$.
Open loop: $J = \sum_{k=0}^{N} E\big\{\big(x_0 + \sum_{m=0}^{k-1} u_m + \sum_{m=0}^{k-1} w_m\big)^2\big\} \ge \sum_{k=0}^{N} (k+1)\sigma^2 = \tfrac{1}{2}(N+1)(N+2)\sigma^2$.

Feedback is superior to any open-loop control.
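A Monte Carlo sketch of the closed-loop versus open-loop comparison above; sigma = 1 and N = 10 are assumed values chosen only so the two expected costs (11 and 66) are easy to verify:

```python
import numpy as np

# Compare the closed-loop policy u_k = -x_k with the open-loop policy u_k = 0
# for x_{k+1} = x_k + u_k + w_k, estimating J = E{ sum_{k=0}^N x_k^2 }.
rng = np.random.default_rng(0)
N, sigma, runs = 10, 1.0, 100_000

def simulate(feedback):
    x = sigma * rng.standard_normal(runs)          # x_0
    cost = x ** 2
    for _ in range(N):
        u = -x if feedback else 0.0
        x = x + u + sigma * rng.standard_normal(runs)
        cost += x ** 2
    return cost.mean()

print("closed loop:", simulate(True))    # approx (N+1) sigma^2 = 11
print("open loop:  ", simulate(False))   # approx (N+1)(N+2)/2 sigma^2 = 66
```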

Slide 29: Infinite Horizon Results

$$ x_{k+1} = A(x_k, u_k, w_k), \qquad k = 0, 1, \dots $$
$$ J_\pi(x_0) = \lim_{N \to \infty} E\Big\{ \sum_{k=0}^{N-1} \rho^k c(x_k, \mu_k(x_k)) \Big\} $$
where $0 < \rho < 1$ is the discount factor. Admissible policies: $\pi = \{\mu_0, \mu_1, \dots\}$ where $u_k = \mu_k(x_k)$. The optimal cost is $J^*(x) = \min_{\pi \in \Pi} J_\pi(x)$.

Define the class of stationary policies $\pi = \{\mu, \mu, \dots\}$; to simplify notation, write $J_\pi$ as $J_\mu$.

We require $J_\pi(x_0)$ to be finite. Examples include: (i) stochastic shortest path problems: $\rho = 1$ with a cost-free termination state, where termination is inevitable; (ii) discounted problems with bounded cost $|c(x,u)| \le M$.

We will only consider discounted problems. Then
$$ J_\pi(x_0) = E\Big\{ \sum_{k=0}^{\infty} \rho^k c(x_k, \mu_k(x_k)) \Big\} $$

Slide 30: DP for the Finite Horizon Version

Consider minimizing the cost
$$ E\Big\{ \rho^N J(x_N) + \sum_{k=0}^{N-1} \rho^k c(x_k, u_k) \Big\} $$
The DP recursion yields, for $k = 0, \dots, N-1$,
$$ J_k(x) = \min_u E\{ \rho^k c(x,u) + J_{k+1}(A(x,u,w)) \} $$
initialized by $J_N(x) = \rho^N J(x)$. Define
$$ V_k(x) = \frac{J_{N-k}(x)}{\rho^{N-k}} $$
Then the DP recursion can be written, for $k = 0, 1, \dots, N-1$, as
$$ V_{k+1}(x) = \min_u E\{ c(x,u) + \rho V_k(A(x,u,w)) \} $$

8.2 Main Result: $\lim_{k \to \infty} V_k(x) = V^*(x)$, where $V^*$ is the optimal value function for the infinite horizon problem. Bellman's equation holds:
$$ V^*(x) = \min_u E\{ c(x,u) + \rho V^*(A(x,u,w)) \} $$

Slide 31: The Bellman Operator

Define the operator
$$ (TV)(x) \overset{\text{defn}}{=} \min_u E\{ c(x,u) + \rho V(A(x,u,w)) \} $$
for any function $V(x)$. Then Bellman's equation is $TV^* = V^*$.

$T$ is monotone: suppose $V$ and $V'$ are such that $V(x) \le V'(x)$ for all $x$. Then $(TV)(x) \le (TV')(x)$ for all $x$.

Define also, for any stationary policy $\mu$,
$$ (T_\mu V)(x) = E\{ c(x,\mu(x)) + \rho V(A(x,\mu(x),w)) \} $$

Result (Bertsekas Vol. 2, p. 12): for every stationary policy $\mu$, the associated cost satisfies $V_\mu = T_\mu V_\mu$.

This result means that for any stationary policy $\mu$, the policy cost $V_\mu$ can be computed by solving $V_\mu = T_\mu V_\mu$. In the finite-state MDP case, $V_\mu$ can be computed exactly, since $T_\mu V_\mu = c(\mu) + \rho P(\mu) V_\mu$. Thus
$$ V_\mu = c(\mu) + \rho P(\mu) V_\mu \;\Longrightarrow\; [I - \rho P(\mu)] V_\mu = c(\mu) $$

Note: Bellman's equation is a functional equation; it can rarely be solved explicitly.

Slide 32: Infinite Horizon MDPs: Numerical Methods

Bellman's equation ($V^* = TV^*$) for a finite-state MDP is
$$ V^*(i) = \min_u \Big[ c(i,u) + \rho \sum_{j=1}^{S} P_{ij}(u) V^*(j) \Big] $$
In vector notation,
$$ V^* = \min_u \big[ c(u) + \rho P(u) V^* \big] $$
where $V^*$ and $c(u)$ are $S$-dimensional vectors.

Recall that the optimal policy $\mu^*$ for the MDP allocates $u_k = \mu^*(x_k)$, i.e. we need to construct the one-row lookup table:

    k        opt control if x_k = 1    opt control if x_k = 2
    any k    u*_1                      u*_2

1. Linear programming: since $\lim_{N\to\infty} T^N V = V^*$ for all $V$, using the monotonicity of $T$:
$$ V \le TV \;\Longrightarrow\; V \le V^* = TV^* $$
Thus $V^*$ is the largest $V$ that satisfies $V \le TV$:
$$ \max \lambda' \mathbf{1} \quad \text{s.t.} \quad \lambda \le c(u) + \rho P(u) \lambda \ \text{ for all } u $$
This is an LP with $S\,|U|$ constraints. In queueing problems (that satisfy conservation laws) these constraints form a polymatroid.
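A minimal sketch of this LP using scipy.optimize.linprog; the machine-replacement matrices and the value rho = 0.9 in the example are assumptions chosen for illustration:

```python
import numpy as np
from scipy.optimize import linprog

# Maximize sum_i lambda_i subject to lambda <= c(u) + rho P(u) lambda for every action u,
# i.e. (I - rho P(u)) lambda <= c(u). The optimizer returns (approximately) V*.
def mdp_lp(P, c, rho):
    actions = sorted(P)
    S = len(c[actions[0]])
    A_ub = np.vstack([np.eye(S) - rho * P[u] for u in actions])
    b_ub = np.concatenate([c[u] for u in actions])
    res = linprog(c=-np.ones(S),                 # maximize 1' lambda
                  A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * S)
    return res.x

# Example: machine replacement with rho = 0.9, theta = 0.1, c = 4, R = 3.
theta = 0.1
P = {0: np.array([[1 - theta, theta], [0, 1]]), 1: np.array([[1, 0], [1, 0]])}
c = {0: np.array([0.0, 4.0]), 1: np.array([3.0, 3.0])}
print(mdp_lp(P, c, rho=0.9))
```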

Slide 33: Value Iteration and Policy Iteration

2. Value iteration: a successive approximation method for solving $V^* = TV^*$, i.e. a finite horizon approximation. Initialize $V_0$; the successive approximation procedure is $V_{k+1} = TV_k$, i.e.
$$ V_{k+1} = \min_u \big[ c(u) + \rho P(u) V_k \big] $$
Convergence follows from a contraction mapping argument. One can show
$$ \| V_k - V^* \| \le \frac{2\rho}{1-\rho}\, \| V_k - V_{k-1} \| $$

3. Policy iteration: for any stationary policy $\mu$, recall $T_\mu V = c(\mu) + \rho P(\mu) V$ for any $V$, and the cost function corresponding to $\mu$, i.e. $V_\mu$, satisfies $T_\mu V_\mu = V_\mu$. This means that for any stationary policy $\mu$ we can solve for $V_\mu$:
$$ V_\mu = c(\mu) + \rho P(\mu) V_\mu \;\Longrightarrow\; (I - \rho P(\mu)) V_\mu = c(\mu) $$
Policy iteration algorithm: initialize $\mu_0$ arbitrarily, then iterate:
(i) Policy evaluation: $V_{\mu_k}$ is the solution of the linear equation $[I - \rho P(\mu_k)] V_{\mu_k} = c(\mu_k)$.
(ii) Policy improvement: $\mu_{k+1} = \arg\min_u \big[ c(u) + \rho P(u) V_{\mu_k} \big]$.
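Minimal sketches of value iteration and policy iteration as stated above, using the same assumed P/c dictionary layout as the earlier sketches:

```python
import numpy as np

def value_iteration(P, c, rho, tol=1e-8):
    actions = sorted(P)
    V = np.zeros(len(c[actions[0]]))
    while True:
        Q = np.array([c[u] + rho * P[u] @ V for u in actions])   # apply T
        V_new = Q.min(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, np.argmin(Q, axis=0)
        V = V_new

def policy_iteration(P, c, rho):
    actions = sorted(P)
    S = len(c[actions[0]])
    mu = np.zeros(S, dtype=int)                       # arbitrary initial policy
    while True:
        # (i) policy evaluation: solve [I - rho P(mu)] V = c(mu)
        P_mu = np.array([P[actions[mu[i]]][i] for i in range(S)])
        c_mu = np.array([c[actions[mu[i]]][i] for i in range(S)])
        V = np.linalg.solve(np.eye(S) - rho * P_mu, c_mu)
        # (ii) policy improvement
        Q = np.array([c[u] + rho * P[u] @ V for u in actions])
        mu_new = np.argmin(Q, axis=0)
        if np.array_equal(mu_new, mu):
            return V, mu
        mu = mu_new
```

Both routines can be cross-checked against each other (and against the LP sketch above) on the machine replacement matrices; they should return the same value function and stationary policy.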

Slide 34: Structural Results

When is the optimal policy monotone in the state? Two concepts: submodularity and stochastic orders.

Submodularity: $\phi(x,u)$ is submodular in $(x,u)$ if
$$ \phi(x, u+1) - \phi(x, u) \ge \phi(x+1, u+1) - \phi(x+1, u) $$
Examples of functions submodular in $(x,u)$: (i) $\phi(x,u) = -xu$; (ii) any $\phi(x)$ or $\phi(u)$ is trivially submodular; (iii) $\max(x,u)$; (iv) the sum of submodular functions is submodular.

Theorem [Topkis]: consider $\phi : X \times U \to \mathbb{R}$. If $\phi(x,u)$ is submodular, then $u^*(x) = \arg\min_u \phi(x,u)$ is increasing in $x$.

First-order stochastic dominance: $\pi_1$ first-order stochastically dominates $\pi_2$ if $\sum_{i=j}^{X} \pi_1(i) \ge \sum_{i=j}^{X} \pi_2(i)$ for $j = 1, \dots, X$. This is denoted $\pi_1 \ge_s \pi_2$ (or $\pi_2 \le_s \pi_1$). Example: $\pi_1 = [0.3,\, 0.2,\, 0.5]$ and $\pi_2 = [0.2,\, 0.4,\, 0.4]$ are not orderable.

Theorem: let $\mathcal{V}$ denote the set of all $X$-dimensional vectors $v$ with nondecreasing components, i.e. $v_1 \le v_2 \le \dots \le v_X$. Then $\pi_1 \ge_s \pi_2$ iff $v' \pi_1 \ge v' \pi_2$ for all $v \in \mathcal{V}$.
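A small numerical illustration of the two concepts above; the grids and the choice of $-xu$ as the submodular test function are assumptions made only for this illustration:

```python
import numpy as np

# First-order stochastic dominance: compare tail sums of two pmfs.
def dominates(pi1, pi2):
    """True if pi1 >=_s pi2, i.e. every tail sum of pi1 dominates that of pi2."""
    t1 = np.cumsum(pi1[::-1])[::-1]
    t2 = np.cumsum(pi2[::-1])[::-1]
    return bool(np.all(t1 >= t2 - 1e-12))

pi1, pi2 = np.array([0.3, 0.2, 0.5]), np.array([0.2, 0.4, 0.4])
print(dominates(pi1, pi2), dominates(pi2, pi1))   # False False: not orderable

# Topkis: phi(x,u) = -x*u is submodular, so argmin_u phi(x,u) is increasing in x.
xs, us = np.arange(5), np.arange(5)
phi = -np.outer(xs, us)
print(phi.argmin(axis=1))    # nondecreasing in x, e.g. [0 4 4 4 4]
```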

Slide 35: Monotone Policies

Assumptions:
(A1) $c(x,u,k)$ is decreasing in $x$.
(A2) $P_x(u) \le_s P_{x+1}(u)$.
(A3) $c(x,u,k)$ is submodular in $(x,u)$; that is, $c(x,u+1,k) - c(x,u,k)$ is decreasing in $x$.
(A4) $P(u)$ is tail supermodular: $\sum_{j \ge l} \big( P_{xj}(u+1) - P_{xj}(u) \big)$ is increasing in $x$.

Theorem: assume that a finite-horizon Markov decision process satisfies conditions (A1), (A2), (A3) and (A4). Then $\mu_k^*(x)$ is increasing in $x$. The same proof applies for infinite horizon discounted cost and average cost.

Slide 36: Proof Outline

Define
$$ Q_k(i,u) \overset{\text{defn}}{=} c(i,u,k) + J_{k+1}' P_i(u), \qquad J_k(i) = \min_{u \in U} Q_k(i,u), \qquad \mu_k^*(i) = \arg\min_{u \in U} Q_k(i,u) $$
where $J_{k+1} = \big[ J_{k+1}(1), \dots, J_{k+1}(X) \big]'$.

Step 1. Assuming (A1) and (A2), $Q_k(i,u)$ is decreasing in $i$; therefore $J_k(i)$ is decreasing in $i$.
Step 2. Assuming (A3) and (A4), $Q_k(i,u)$ is submodular; therefore $\mu_k^*(i) = \arg\min_{u \in U} Q_k(i,u)$ is increasing in $i$.

Proof of Step 1 (mathematical induction): $Q_N(i,u) = c(i,N)$ is decreasing in $i$ by (A1). Suppose $Q_{k+1}(j,\bar u)$ is decreasing in $j$; then $J_{k+1}(j) = \min_{\bar u} Q_{k+1}(j,\bar u)$ is decreasing in $j$. Next, $P_i(u) \le_s P_{i+1}(u)$ by (A2), so $J_{k+1}' P_i(u) \ge J_{k+1}' P_{i+1}(u)$. Finally, $c(i,u,k)$ is decreasing in $i$ by (A1), so $c(i,u,k) + J_{k+1}' P_i(u) \ge c(i+1,u,k) + J_{k+1}' P_{i+1}(u)$.

Proof of Step 2: consider $Q_k(i,u) = c(i,u,k) + J_{k+1}' P_i(u)$. By (A3), $c(i,u,k)$ is submodular. Applying (A4), and since the elements of $J_{k+1}$ are decreasing, $J_{k+1}' P_i(u)$ is submodular. The sum of submodular functions is submodular.

Slide 37: How Does the Optimal Cost Depend on the Transition Matrix?

Consider two MDPs with identical costs but different transition matrices $P$ and $\bar P$.
(A1) $c(x,u,k)$ is decreasing in $x$.
(A2) $P_x(u) \le_s P_{x+1}(u)$.
(A5) $P_x(u) \le_s \bar P_x(u)$ for all $x$.

Theorem [Muller 1997]: the optimal cost incurred under $\bar P$ is smaller than that incurred under $P$, i.e. $\bar J_k(i) \le J_k(i)$.

Proof: define
$$ Q_k(i,u) = c(i,u,k) + J_{k+1}' P_i(u), \qquad \bar Q_k(i,u) = c(i,u,k) + \bar J_{k+1}' \bar P_i(u) $$
The proof is by induction. Clearly $J_N(i) = \bar J_N(i) = c(i,N)$ for all $i \in X$. Suppose $J_{k+1}(i) \ge \bar J_{k+1}(i)$ for all $i \in X$; therefore $J_{k+1}' P_i(u) \ge \bar J_{k+1}' P_i(u)$. By (A1) and (A2), $\bar J_{k+1}(i)$ is decreasing in $i$. By (A5), $P_i \le_s \bar P_i$, therefore $\bar J_{k+1}' P_i \ge \bar J_{k+1}' \bar P_i$. So
$$ c(i,u,k) + J_{k+1}' P_i(u) \ge c(i,u,k) + \bar J_{k+1}' \bar P_i(u), $$
or equivalently $Q_k(i,u) \ge \bar Q_k(i,u)$.

Slide 38: Neuro-Dynamic Programming Methods

The next two methods are simulation based. That is, although the parameters are unknown, the system can be simulated or observed under any choice of actions. They form the core of reinforcement learning, or neuro-dynamic programming. The key idea in them is the Robbins-Monro stochastic approximation algorithm.

Result (Robbins-Monro algorithm). Aim: solve the algebraic equation $X = E\{H(X)\}$ where $H$ is a noisy function; that is, we can measure samples $Y_n = H(X_n)$. Algorithm:
$$ X_{n+1} = X_n + \gamma_n (Y_n - X_n) $$
The key idea behind stochastic approximation is to replace $E\{H(X)\}$ by the sample $Y_n = H(X_n)$.

Remarks: the implicit assumption is that $E\{H(X)\}$ cannot be computed in closed form; this is the case when the density function is unknown. The step size is typically $\gamma_n = 1/n$. Stochastic approximations are widely used in adaptive signal processing, e.g. adaptive filtering algorithms such as the LMS and RLS algorithms. The recursive EM algorithm covered earlier is another example.
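A minimal sketch of the Robbins-Monro iteration; the function H below is an assumed toy example (not from the slides) whose fixed point $X = E\{H(X)\}$ is 2:

```python
import numpy as np

# Robbins-Monro: X_{n+1} = X_n + gamma_n (Y_n - X_n), with Y_n a noisy sample of H(X_n).
# Toy example: H(x) = 0.5*x + 1 + noise, so the equation X = E{H(X)} gives X = 2.
rng = np.random.default_rng(0)
X = 0.0
for n in range(1, 10_001):
    Y = 0.5 * X + 1.0 + rng.standard_normal()   # noisy sample Y_n = H(X_n)
    X = X + (1.0 / n) * (Y - X)                 # step size gamma_n = 1/n
print(X)   # converges to approximately 2
```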

Slide 39: Q-Learning

4. Q-learning (simulation based). Define the Q-factor
$$ Q(i,u) = c(i,u) + \rho \sum_{j=1}^{S} P_{ij}(u) V^*(j) $$
From Bellman's equation this yields
$$ Q(i,u) = c(i,u) + \rho \sum_{j=1}^{S} P_{ij}(u) \min_{u'} Q(j,u') $$
The trick above expresses $Q$ as an expectation of a minimum:
$$ Q(i,u) = c(i,u) + \rho\, E\big\{ \min_{u'} Q(x_{k+1}, u') \mid x_k = i \big\} $$
Hence it can be solved via the Robbins-Monro algorithm:
$$ Q_{k+1}(i,u) = Q_k(i,u) + \gamma \Big( c(i,u) + \rho \min_{u'} Q_k(j,u') - Q_k(i,u) \Big) $$
where $j$ is generated from $(i,u)$ via simulation of $P_{ij}(u)$.

Remarks: (i) the above recursion does not require knowledge of $P(u)$; (ii) Q-learning is merely a stochastic approximation algorithm; (iii) NDP is widely used in artificial intelligence, where it is called reinforcement learning.

5. Temporal difference methods: these can be used to compute the cost of a policy by simulation (details omitted).
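A minimal Q-learning sketch following the recursion above. The simulator draws $j \sim P_{ij}(u)$; the machine-replacement parameters, the uniform exploration of actions, and the 1/(number of visits) step size are assumptions made for illustration:

```python
import numpy as np

# Q-learning for a finite-state discounted MDP; the transition matrices are used
# only inside the simulator, never by the learning update itself.
rng = np.random.default_rng(0)
theta, rho = 0.1, 0.9
P = {0: np.array([[1 - theta, theta], [0, 1]]), 1: np.array([[1, 0], [1, 0]])}
c = {0: np.array([0.0, 4.0]), 1: np.array([3.0, 3.0])}
S, U = 2, 2

Q = np.zeros((S, U))
visits = np.zeros((S, U))
i = 0
for _ in range(200_000):
    u = rng.integers(U)                          # explore actions uniformly
    j = rng.choice(S, p=P[u][i])                 # simulate the next state
    visits[i, u] += 1
    gamma = 1.0 / visits[i, u]                   # decreasing step size
    Q[i, u] += gamma * (c[u][i] + rho * Q[j].min() - Q[i, u])
    i = j

print(Q)                  # learned Q-factors
print(Q.argmin(axis=1))   # greedy policy; for these parameters, replace only in the failed state
```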

Slide 40: Summary and Extensions

Stochastic dynamic programming (SDP) involves solving a functional equation. This yields a (possibly infinite dimensional) lookup table. Two types of problems were considered: (i) finite horizon and (ii) infinite horizon (steady-state controller). For infinite horizon finite-state MDPs there are several numerical algorithms, e.g. policy iteration, value iteration, linear programming and neuro-dynamic programming.

We have not covered continuous-time finite-state MDPs. These arise in the control of queueing systems, e.g. in telecommunications. By a process called uniformization, a continuous-time MDP can be converted to an equivalent discrete-time MDP. A generalization of continuous-time MDPs is semi-Markov decision processes; these are widely studied in discrete event systems. Finally, MDPs with constraints can also be considered; often the optimal policy is then randomized.
