Dynamic Programming and Stochastic Control
1 Dynamic Programming and Stochastic Control Dr. Alex Leong Department of Electrical Engineering (EIM-E) Paderborn University, Germany Dr. Alex Leong DP and Stochastic Control Paderborn University 1 / 158
2 Outline 1 Introduction
3 Introduction What is dynamic programming (DP)? A method for solving multi-stage decision problems (sequential decision making). There is often some randomness in what happens in the future. Optimize the set of decisions to achieve a good overall outcome. Richard Bellman popularized DP in the 1950s.
4 Examples 1) Inventory control A store sells a product, e.g. ice cream, and orders supplies once a week. Sales during the week are random. How much supply should the store get to maximize expected profit over the summer? Order too little: can't meet demand. Order too much: storage/refrigeration cost.
5 Examples 2) Parts replacement, e.g. a bus engine. At the start of each month, decide whether the engine on a bus should be replaced, to maximize expected profit. If replaced: profit = earnings − replacement cost − maintenance. If not replaced: profit = earnings − maintenance. Earnings will decrease if the engine breaks down, and P(breakdown) is age dependent.
6 Examples 3) Formula 1 engines, replace or not? 20 races, 4 engines (in 2017). Decide whether to replace the engine at the start of each race, to maximize the chance of winning the championship.
7 Examples 4) Queueing (see Figure 1) Packets arrive at queues 1 and 2. If both queues transmit at the same time, there is a collision. If a collision occurs, retransmit at the next time with a certain probability. Choose the retransmission probabilities to maximize throughput. Figure 1: Queueing
8 Examples 5) LQR (Linear Quadratic Regulator) Linear system: x_{k+1} = A x_k + B u_k (deterministic problem). Assume knowledge of x_k at time k (perfect state info). Choose the sequence of u_k to
min_{u_0, u_1,..., u_{N−1}} Σ_{k=0}^{N−1} (x_k^T Q x_k + u_k^T R u_k) + x_N^T Q x_N.
N = number of stages = horizon. N finite ⇒ finite horizon.
9 Examples 6) x_{k+1} = A x_k + B u_k + w_k, where w_k is random noise. Assume x_k known (perfect state info). Choose the sequence of u_k to
min_{u_0, u_1,..., u_{N−1}} E[ Σ_{k=0}^{N−1} (x_k^T Q x_k + u_k^T R u_k) + x_N^T Q x_N ].
10 Examples 7) LQG (Linear Quadratic Gaussian) control x_{k+1} = A x_k + B u_k + w_k, y_k = C x_k + v_k, with v_k, w_k Gaussian noise. Case of imperfect state info. Based on the measurements y_k, choose u_k to
min_{u_0, u_1,..., u_{N−1}} E[ Σ_{k=0}^{N−1} (x_k^T Q x_k + u_k^T R u_k) + x_N^T Q x_N ].
11 Examples 8) Infinite horizon
min_{u_0, u_1,...} lim_{N→∞} (1/N) E[ Σ_{k=0}^{N−1} (x_k^T Q x_k + u_k^T R u_k) + x_N^T Q x_N ].
Note: Here we divide by N, otherwise the summation often blows up.
12 Examples 9) Shortest paths (see Figure 2) Find the shortest path from A to D (deterministic problem). Can be solved using the Viterbi algorithm (1967), which can be regarded as a special case of (forward) DP. Applications: decoding of convolutional codes (communications); channel equalization (communications); estimation of hidden Markov models (signal processing). Figure 2: Shortest paths problem
13 Outline 2 The Dynamic Programming Principle and Dynamic Programming Algorithm Basic Structure of Dynamic Programming Problem Dynamic Programming Principle of Optimality Dynamic Programming Algorithm Shortest Path Problems
14 Basic structure of stochastic DP problem Two ingredients: a discrete-time system and a cost function. 1. Discrete-time system: x_{k+1} = f_k(x_k, u_k, w_k), k = 0, 1,..., N−1 (or k = 1, 2,..., N). k is the time index. x_k is the state at time k; it summarizes the past information that is relevant for future optimization. u_k is the control/decision/action at time k; it lies in a set U_k(x_k) which may depend on k and x_k. w_k is a random disturbance (noise), with a probability distribution P(· | k, x_k, u_k) which may depend on k, x_k, u_k.
15 Basic structure of stochastic DP problem x_{k+1} = f_k(x_k, u_k, w_k), k = 0, 1,..., N−1. N is the horizon, or number of times control is applied. f_k is the function that describes how the system evolves over time. Examples: f_k = A x_k + B u_k + w_k (linear system); f_k = x_k u_k + w_k (non-linear); f_k = cos x_k + w_k sin u_k (non-linear).
16 Basic structure of stochastic DP problem 2. Cost function, which is additive over time:
E[ Σ_{k=0}^{N−1} g_k(x_k, u_k, w_k) + g_N(x_N) ].
The expectation is used because of the random w_k. g_k is the function that represents the cost at time k. Examples: g_k = x_k + u_k; g_k = x_k^2 + C u_k^2, where C is a constant. g_N(x_N) is the terminal cost.
17 Basic structure of stochastic DP problem Objective: Minimize the cost function over the controls u_0 = μ_0(x_0), u_1 = μ_1(x_1),..., u_{N−1} = μ_{N−1}(x_{N−1}). The choice of u_k depends on x_k. Optimization over policies: rules/functions μ_k for generating u_k for every possible value of x_k. The expected cost of policy π = (μ_0, μ_1,..., μ_{N−1}) starting at x_0 is
J_π(x_0) = E[ Σ_{k=0}^{N−1} g_k(x_k, μ_k(x_k), w_k) + g_N(x_N) ].
Optimal policy: π* = argmin_π J_π(x_0). Optimal cost starting at x_0: J*(x_0) = min_π J_π(x_0).
18 Examples 1) Inventory example x_k = amount of stock at time k. u_k = stock ordered at time k. w_k = demand at time k, with some probability distribution, e.g. uniform. System: x_{k+1} = x_k + u_k − w_k (= f_k(x_k, u_k, w_k)). x_k can be negative with this model. Alternative model: x_{k+1} = max(0, x_k + u_k − w_k). Cost function at time k: g_k(x_k, u_k, w_k) = r(x_k) + C u_k, where r(x_k) is the penalty for holding excess stock and C is the cost per item.
19 Examples 1) Inventory example (cont.) Terminal cost: R(x_N) is the penalty for having excess stock at the end. Cost function: E[ Σ_{k=0}^{N−1} (r(x_k) + C u_k) + R(x_N) ]. The amount u_k to order can depend on the inventory level x_k. Can have constraints on u_k, e.g. x_k + u_k ≤ max. storage. Optimization over policies: find the rule which tells you how much to order for every possible stock level x_k.
20 Examples 2) Example 6 of the previous section System: x_{k+1} = A x_k + B u_k + w_k (= f_k). Cost function: E[ Σ_{k=0}^{N−1} (x_k^T Q x_k + u_k^T R u_k) + x_N^T Q x_N ], with stage cost g_k = x_k^T Q x_k + u_k^T R u_k and terminal cost g_N(x_N) = x_N^T Q x_N. Objective: Determine u_k = μ_k(x_k), k = 0, 1,..., N−1, to minimize the cost function. The solution turns out to be u*_k = L_k x_k for some matrices L_k (derived in a later lecture).
21 Examples 3) Shortest paths (see Figure 3) Figure 3: Shortest path problem x_k = which node we're in at stage k. u_k = which path we take to get to stage k+1. w_k = 0 (no randomness). Cost function = sum of the values along the paths we choose.
22 Open loop vs. Closed loop Open loop: the controls (u_0, u_1,..., u_{N−1}) are chosen at the beginning (time 0). Closed loop: a policy (μ_0, μ_1,..., μ_{N−1}) is chosen, where at time k the control u_k = μ_k(x_k) can depend on x_k, so it can adapt to conditions. E.g. in the inventory problem: if the current stock level x_k is high, order less; if x_k is low, order more. Closed loop is always at least as good as open loop. For deterministic problems, open loop is as good as closed loop: we can predict exactly the future states given the initial state and the sequence of controls. For stochastic problems, one should generally use closed loop.
23 D.P. Principle of Optimality Intuition Figure 4: Shortest path problem Consider the shortest path problem in Figure 4. Shortest path from A to F (shown in red): A → C → D → F. Shortest path from C to F: C → D → F, a subpath of the shortest path from A to F. Shortest path from D to F: D → F, also a subpath of the shortest path from A to F.
24 D.P. Principle of Optimality Observation: The shortest path from A to F contains the shortest paths from the intermediate nodes to F. Why? Suppose there were a shorter path from C to F other than C → D → F. Then we could construct a new path A → C → ... → F (a new shortest path) which is shorter than A → C → D → F, contradicting A → C → D → F being the shortest.
25 D.P. Principle of Optimality Formal statement: Basic problem
min_π E{ Σ_{k=0}^{N−1} g_k(x_k, μ_k(x_k), w_k) + g_N(x_N) }.
Let π* = {μ*_0, μ*_1,..., μ*_{N−1}} be the optimal policy. Consider the tail subproblem
min_{μ_i, μ_{i+1},..., μ_{N−1}} E{ Σ_{k=i}^{N−1} g_k(x_k, μ_k(x_k), w_k) + g_N(x_N) },
where we are at state x_i at time i and wish to minimize the cost-to-go from time i to time N. The D.P. principle of optimality then says that {μ*_i, μ*_{i+1},..., μ*_{N−1}} is optimal for the tail subproblem.
26 D.P. Principle of Optimality Proof: If {μ̃_i,..., μ̃_{N−1}} were a better policy for the tail subproblem, then {μ*_0, μ*_1,..., μ*_{i−1}, μ̃_i,..., μ̃_{N−1}} would be a better policy for the original problem, contradicting {μ*_0, μ*_1,..., μ*_{N−1}} being optimal. How can we make use of the D.P. principle? Idea: Construct an optimal policy in stages. Solve the tail subproblem involving the last stage, to obtain μ*_{N−1}. Solve the tail subproblem involving the last two stages, making use of μ*_{N−1}, to obtain μ*_{N−2}. Solve the tail subproblem involving the last three stages, making use of μ*_{N−2}, μ*_{N−1}, to obtain μ*_{N−3}. ... Solve the tail subproblem involving the last N stages, making use of μ*_1,..., μ*_{N−1}, to obtain μ*_0.
27 D.P. Algorithm Basic problem:
min_π E{ Σ_{k=0}^{N−1} g_k(x_k, μ_k(x_k), w_k) + g_N(x_N) }.
D.P. algorithm: For each possible x_k, compute
J_N(x_N) = g_N(x_N),
J_k(x_k) = min_{u_k ∈ U_k(x_k)} E{ g_k(x_k, u_k, w_k) + J_{k+1}(f_k(x_k, u_k, w_k)) }, for k = N−1, N−2,..., 1, 0.
Theorem: 1. The optimal cost J*(x_0) = J_0(x_0), where J_0(x_0) is the quantity computed by the D.P. algorithm. 2. Let μ*_k(·) be the function that generates the minimizing u_k in the D.P. algorithm, i.e. μ*_k(x_k) = u*_k. Then {μ*_0, μ*_1,..., μ*_{N−1}} is the optimal policy for the basic problem. Proof: See later.
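As a sketch, the backward recursion above can be written out for finite state, control, and disturbance spaces. The function names and problem encoding here are our own illustration, not from the notes:

```python
# A minimal backward D.P. solver for finite state/control/disturbance spaces.
# J_N(x) = g_N(x); J_k(x) = min_u E{ g_k(x,u,w) + J_{k+1}(f_k(x,u,w)) }.

def dp_solve(states, controls, f, g, g_N, w_dist, N):
    """Return cost-to-go tables J[k][x] and a policy mu[k][x].

    controls(k, x): admissible controls U_k(x_k)
    f(k, x, u, w) : next state
    g(k, x, u, w) : stage cost
    g_N(x)        : terminal cost
    w_dist        : list of (w, probability) pairs
    """
    J = [dict() for _ in range(N + 1)]
    mu = [dict() for _ in range(N)]
    for x in states:
        J[N][x] = g_N(x)
    for k in range(N - 1, -1, -1):            # k = N-1, ..., 0
        for x in states:
            best_u, best_cost = None, float("inf")
            for u in controls(k, x):
                # expectation of stage cost plus cost-to-go over the disturbance
                cost = sum(p * (g(k, x, u, w) + J[k + 1][f(k, x, u, w)])
                           for w, p in w_dist)
                if cost < best_cost:
                    best_u, best_cost = u, cost
            J[k][x] = best_cost
            mu[k][x] = best_u
    return J, mu
```

Note that the loops visit every state at every stage, reflecting the comment below that all tail subproblems are solved whether or not they are eventually reached.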
28 D.P. Algorithm Comments: The D.P. algorithm needs to be run for all possible states x_k; it solves all tail subproblems (we don't know at the start which subproblem we will need). It can be computationally expensive if the number of states/controls is large, and is often done on a computer. Suboptimal methods can reduce complexity.
29 Inventory Example x_k = level of stock at time k. u_k = amount ordered at time k. w_k = demand at time k. x_{k+1} = max(0, x_k + u_k − w_k) = f_k(x_k, u_k, w_k); excess demand is lost. Storage constraint: x_k + u_k ≤ 2. Cost at time k = purchasing cost + storage cost = u_k + (x_k + u_k − w_k)^2 = g_k(x_k, u_k, w_k), where the cost per item is 1 euro and (x_k + u_k − w_k)^2 is the storage cost. Terminal cost g_N(x_N) = 0. Probability distribution of w_k: P(w_k = 0) = 0.1, P(w_k = 1) = 0.7, P(w_k = 2) = 0.2.
30 Inventory Example Problem: Find the optimal policy for horizon N = 3, i.e.
min_{(μ_0, μ_1, μ_2)} E{ Σ_{k=0}^{2} g_k(x_k, μ_k(x_k), w_k) }.
Apply the D.P. algorithm:
J_3(x_3) = g_3(x_3) = 0,
J_k(x_k) = min_{u_k ∈ U_k(x_k)} E{ u_k + (x_k + u_k − w_k)^2 + J_{k+1}(max(0, x_k + u_k − w_k)) }, k = 2, 1, 0.
Question: What values can x_k take? (With the storage constraint, x_k ∈ {0, 1, 2}.)
31 Inventory Example Period 2: Compute J_2(x_2) for all possible values of x_2.
J_2(0) = min_{u_2 ∈ {0,1,2}} E{ u_2 + (0 + u_2 − w_2)^2 + J_3(x_3) }  (J_3 = 0 for all x_3)
= min_{u_2 ∈ {0,1,2}} u_2 + u_2^2 · 0.1 + (u_2 − 1)^2 · 0.7 + (u_2 − 2)^2 · 0.2.
If u_2 = 0: 0 + 0 · 0.1 + 1 · 0.7 + 4 · 0.2 = 1.5.
If u_2 = 1: 1 + 1 · 0.1 + 0 · 0.7 + 1 · 0.2 = 1.3.
If u_2 = 2: 2 + 4 · 0.1 + 1 · 0.7 + 0 · 0.2 = 3.1.
⇒ J_2(0) = 1.3 and μ*_2(0) = 1.
32 Inventory Example
J_2(1) = min_{u_2 ∈ {0,1}} u_2 + (1 + u_2)^2 · 0.1 + u_2^2 · 0.7 + (u_2 − 1)^2 · 0.2.
If u_2 = 0: 0.3 (check this!). If u_2 = 1: 2.1. ⇒ J_2(1) = 0.3 and μ*_2(1) = 0.
J_2(2) = min_{u_2 ∈ {0}} E{ u_2 + (2 + u_2 − w_2)^2 } = 4 · 0.1 + 1 · 0.7 + 0 · 0.2 = 1.1.
⇒ J_2(2) = 1.1 and μ*_2(2) = 0.
33 Inventory Example Period 1: Compute J_1(x_1) for all possible values of x_1.
J_1(0) = min_{u_1 ∈ {0,1,2}} E{ u_1 + (u_1 − w_1)^2 + J_2(max(0, u_1 − w_1)) }
= min_{u_1 ∈ {0,1,2}} u_1 + (u_1^2 + J_2(max(0, u_1))) · 0.1 + ((u_1 − 1)^2 + J_2(max(0, u_1 − 1))) · 0.7 + ((u_1 − 2)^2 + J_2(max(0, u_1 − 2))) · 0.2,
with the J_2 values taken from the previous stage.
u_1 = 0: (0 + J_2(0)) · 0.1 + (1 + J_2(0)) · 0.7 + (4 + J_2(0)) · 0.2 = 2.8.
u_1 = 1: 1 + (1 + J_2(1)) · 0.1 + (0 + J_2(0)) · 0.7 + (1 + J_2(0)) · 0.2 = 2.5.
u_1 = 2: 2 + (4 + J_2(2)) · 0.1 + (1 + J_2(1)) · 0.7 + (0 + J_2(0)) · 0.2 = 3.68.
⇒ J_1(0) = 2.5 and μ*_1(0) = 1.
34 Inventory Example
J_1(1) = min_{u_1 ∈ {0,1}} E{ u_1 + (1 + u_1 − w_1)^2 + J_2(max(0, 1 + u_1 − w_1)) }.
u_1 = 0: 1.5 (check!). u_1 = 1: 2.68. ⇒ J_1(1) = 1.5 and μ*_1(1) = 0.
J_1(2) = 1.68, μ*_1(2) = 0 (check!).
Period 0: Compute J_0(x_0) for all possible x_0 (tutorial problem).
Solution: J_0(0) = 3.7, J_0(1) = 2.7, J_0(2) = 2.818; μ*_0(0) = 1, μ*_0(1) = 0, μ*_0(2) = 0.
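The stage-by-stage computations above can be reproduced with a short script, a direct transliteration of the D.P. recursion for this example:

```python
# Backward D.P. for the inventory example: N = 3, storage limit x + u <= 2,
# demand P(w=0)=0.1, P(w=1)=0.7, P(w=2)=0.2, stage cost u + (x+u-w)^2.

P_W = {0: 0.1, 1: 0.7, 2: 0.2}
N = 3
STATES = (0, 1, 2)

J = {N: {x: 0.0 for x in STATES}}       # terminal cost g_3 = 0
mu = {}
for k in range(N - 1, -1, -1):          # k = 2, 1, 0
    J[k], mu[k] = {}, {}
    for x in STATES:
        best = None
        for u in range(0, 3 - x):       # storage constraint x + u <= 2
            cost = sum(p * (u + (x + u - w) ** 2
                            + J[k + 1][max(0, x + u - w)])
                       for w, p in P_W.items())
            if best is None or cost < best[0]:
                best = (cost, u)
        J[k][x], mu[k][x] = best

print(round(J[2][0], 2), mu[2][0])   # 1.3 1, matching the Period 2 slide
```

Running it reproduces the tables in the slides, e.g. J_2(0) = 1.3 with μ*_2(0) = 1, and J_0(0) = 3.7 with μ*_0(0) = 1.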
35 Scheduling Example Example: Scheduling problem (deterministic problem). Four operations need to be performed: A, B, C, D. B has to occur after A; D has to occur after C. Costs: c_AB = 2, c_AC = 3, c_AD = 4, c_BC = 3, c_BD = 1, c_CA = 4, c_CB = 4, c_CD = 6, c_DA = 3, c_DB = 3. Startup costs: S_A = 5, S_C = 3. What is the optimal order?
36 Scheduling Example Figure: Scheduling (state transition tree over the partial sequences A, C, AB, AC, CA, CD,..., with arc costs; minimum cost-to-go shown in red)
37 Scheduling Example Use the D.P. algorithm. Let state = set of operations already performed (see Figure Scheduling). No terminal costs for this problem.
Tail subproblems of length 1: easy, only one choice at each state, e.g. if the state is ACD, the next operation has to be B.
Tail subproblems of length 2:
State AB: only one choice, next operation is C.
State AC: if next operation is B, cost = c_CB + c_BD = 4 + 1 = 5; if next operation is D, cost = c_CD + c_DB = 6 + 3 = 9. Choose B.
State CA: if next operation is B, cost = c_AB + c_BD = 2 + 1 = 3; if next operation is D, cost = c_AD + c_DB = 4 + 3 = 7. Choose B.
State CD: only one choice, next operation is A.
38 Scheduling Example Tail subproblems of length 3:
State A: if next operation is B, cost = c_AB + 9 = 2 + 9 = 11; if next operation is C, cost = c_AC + 5 = 3 + 5 = 8. Choose C.
State C: if next operation is A, cost = c_CA + 3 = 4 + 3 = 7; if next operation is D, cost = c_CD + 5 = 6 + 5 = 11. Choose A.
Original problem of length 4:
If start with A: cost = S_A + 8 = 5 + 8 = 13. If start with C: cost = S_C + 7 = 3 + 7 = 10. Choose C.
Therefore, the optimal sequence is CABD, and the optimal cost is 10.
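Because the problem is small and deterministic, the D.P. answer can be cross-checked by brute-force enumeration of all feasible orderings (a sanity check, not part of the notes):

```python
# Brute-force check of the scheduling example: enumerate all orderings of
# A, B, C, D with B after A and D after C, using the costs given above.
from itertools import permutations

c = {("A","B"): 2, ("A","C"): 3, ("A","D"): 4, ("B","C"): 3, ("B","D"): 1,
     ("C","A"): 4, ("C","B"): 4, ("C","D"): 6, ("D","A"): 3, ("D","B"): 3}
startup = {"A": 5, "C": 3}      # only A or C can be performed first

best = None
for order in permutations("ABCD"):
    if order.index("B") < order.index("A"):   # B must occur after A
        continue
    if order.index("D") < order.index("C"):   # D must occur after C
        continue
    cost = startup[order[0]] + sum(c[(order[i], order[i + 1])]
                                   for i in range(3))
    if best is None or cost < best[0]:
        best = (cost, "".join(order))

print(best)   # (10, 'CABD'), agreeing with the D.P. solution
```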
39 Proof that D.P. Algorithm gives Optimal Solution Basic problem:
min_π E{ Σ_{k=0}^{N−1} g_k(x_k, μ_k(x_k), w_k) + g_N(x_N) }.
D.P. algorithm: For each possible x_k, compute
J_N(x_N) = g_N(x_N),
J_k(x_k) = min_{u_k ∈ U_k(x_k)} E{ g_k(x_k, u_k, w_k) + J_{k+1}(f_k(x_k, u_k, w_k)) }, for k = N−1, N−2,..., 1, 0.
Theorem: 1. The optimal cost J*(x_0) = J_0(x_0), where J_0(x_0) is the quantity computed by the D.P. algorithm. 2. Let μ*_k(·) be the function that generates the minimizing u_k in the D.P. algorithm, i.e. μ*_k(x_k) = u*_k. Then {μ*_0, μ*_1,..., μ*_{N−1}} is the optimal policy for the basic problem.
40 Proof that D.P. Algorithm gives Optimal Solution Notation: Given a policy π = (μ_0, μ_1,..., μ_{N−1}), let π^k = (μ_k, μ_{k+1},..., μ_{N−1}) be the tail policy and let
J*_k(x_k) = min_{π^k} E{ Σ_{i=k}^{N−1} g_i(x_i, μ_i(x_i), w_i) + g_N(x_N) }
be the optimal cost for the tail subproblem. Let J_k(x_k) = quantity computed by the D.P. algorithm. Want to show that J*_k(x_k) = J_k(x_k) for all x_k, k. Proof is by mathematical induction.
Initial step (k = N): By definition of J*_k(x_k), J*_N(x_N) = g_N(x_N). By definition of the D.P. algorithm, J_N(x_N) = g_N(x_N). ⇒ J*_N(x_N) = J_N(x_N).
41 Proof that D.P. Algorithm gives Optimal Solution Induction step: Assume J*_l(x_l) = J_l(x_l) for l = N, N−1,..., k+1. Want to show that J*_k(x_k) = J_k(x_k). From the definition of J*_k(x_k),
J*_k(x_k) = min_{π^k} E{ Σ_{i=k}^{N−1} g_i(x_i, μ_i(x_i), w_i) + g_N(x_N) }
= min_{(μ_k, π^{k+1})} E{ g_k(x_k, μ_k(x_k), w_k) + Σ_{i=k+1}^{N−1} g_i(x_i, μ_i(x_i), w_i) + g_N(x_N) }
= min_{μ_k} E{ g_k(x_k, μ_k(x_k), w_k) + min_{π^{k+1}} E[ Σ_{i=k+1}^{N−1} g_i(x_i, μ_i(x_i), w_i) + g_N(x_N) ] }
by the D.P. principle (optimize the tail subproblem, then μ_k).
42 Proof that D.P. Algorithm gives Optimal Solution
= min_{μ_k} E{ g_k(x_k, μ_k(x_k), w_k) + J*_{k+1}(f_k(x_k, μ_k(x_k), w_k)) }  (by definition of J*_{k+1}(x_{k+1}))
= min_{μ_k} E{ g_k(x_k, μ_k(x_k), w_k) + J_{k+1}(f_k(x_k, μ_k(x_k), w_k)) }  (by the induction hypothesis)
= min_{u_k ∈ U_k(x_k)} E{ g_k(x_k, u_k, w_k) + J_{k+1}(f_k(x_k, u_k, w_k)) }  (using the fact that min_μ F(x, μ(x)) = min_{u ∈ U(x)} F(x, u))
= J_k(x_k)  (from the D.P. algorithm equations).
So J*_k(x_k) = J_k(x_k), and μ*_k(x_k) = u*_k is the optimal policy. By induction, this is true for k = N, N−1,..., 1, 0. In particular, J*(x_0) = J*_0(x_0) = J_0(x_0) is the optimal cost.
43 Shortest Paths in a Trellis Figure 6: Shortest paths in a trellis (initial state s, artificial terminal state t, stages 0 to N). Find the shortest path from s to t. States ↔ nodes; controls ↔ arcs. a^k_ij: cost of the transition from state i at stage k to state j at stage k+1. a^N_it: terminal cost of state i. Cost function = length of the path from s to t.
44 Shortest Paths in a Trellis D.P. algorithm:
J_N(i) = a^N_it,
J_k(i) = min_j [ a^k_ij + J_{k+1}(j) ], k = N−1,..., 1, 0.
Optimal cost = J_0(s) = length of the shortest path from s to t. Example: Find the shortest path from stage 1 to stage 3 in Figure 7 (shortest path shown in red). Figure 7: Shortest paths example
45 Shortest Paths in a Trellis Redraw as a trellis with an initial node s and a terminal node t (the arcs from s and into t have cost 0), see Figure 8. Figure 8: Redrawn shortest paths example Here N = 3. Call the top node state 1 and the bottom node state 2. Stage N: J_3(1) = 0, J_3(2) = 0.
46 Shortest Paths in a Trellis
Stage 2: J_2(1) = min{ a^2_{11} + J_3(1), a^2_{12} + J_3(2) } = 100; J_2(2) = min{ a^2_{21} + J_3(1), a^2_{22} + J_3(2) } = 350.
Stage 1: J_1(1) = min{ a^1_{11} + J_2(1), a^1_{12} + J_2(2) } = 400; J_1(2) = min{ a^1_{21} + J_2(1), a^1_{22} + J_2(2) } = 250.
Stage 0: J_0(s) = min{ 0 + J_1(1), 0 + J_1(2) } = 250.
The shortest path for the original problem is shown in red in Figure 7.
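The trellis recursion J_k(i) = min_j [a^k_ij + J_{k+1}(j)] is a few lines of code. The arc costs below are illustrative placeholders (the numeric labels of Figures 7 and 8 did not survive transcription), not the values from the figures:

```python
# Backward D.P. on a small two-state trellis.
# a[k][i][j]: cost of moving from state i at stage k to state j at stage k+1.
a = [
    [[100, 200], [300, 150]],   # stage 0 -> stage 1 (illustrative costs)
    [[ 50, 400], [250, 100]],   # stage 1 -> stage 2 (illustrative costs)
]
terminal = [0, 0]               # a^N_it: terminal arc costs into t

K = len(a)                      # number of transition stages
J = [None] * (K + 1)
nxt = [None] * K                # nxt[k][i]: best successor of state i at stage k
J[K] = list(terminal)
for k in range(K - 1, -1, -1):
    J[k], nxt[k] = [], []
    for i in range(len(a[k])):
        costs = [a[k][i][j] + J[k + 1][j] for j in range(len(J[k + 1]))]
        J[k].append(min(costs))
        nxt[k].append(costs.index(min(costs)))

print(J[0])   # [150, 250]: shortest path length from each stage-0 state
```

Following `nxt` forward from the best stage-0 state recovers the shortest path itself.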
47 Forward D.P. Algorithm Observe that the optimal path s → t is also the optimal path t → s if the directions of the arcs are reversed. The shortest path algorithm can therefore be run forwards in time (see Bertsekas for the equations). Figure 9 shows the result of forward D.P. on the shortest paths example. Forward D.P. is useful in real-time applications, where data arrives just before you need to make a decision; the Viterbi algorithm uses this idea. Shortest paths is a deterministic problem, so forward D.P. works. For stochastic problems there is no such concept of forward D.P.: it is impossible to guarantee that any given state can be reached.
48 Forward D.P. Algorithm Figure 9: Forward D.P. on shortest paths example
49 Viterbi Algorithm Applications Estimation of hidden Markov models (HMMs): x_k = Markov chain; the state transitions in x_k are not observed (hidden). We observe z_k, with r(z; i, j) = probability of observing z given a transition of the Markov chain x_k from state i to state j. Estimation problem: Given Z_N = {z_1, z_2,..., z_N}, find the sequence X̂_N = {x̂_0, x̂_1,..., x̂_N} over all possible {x_0, x_1,..., x_N} that maximizes P(X_N | Z_N). Note that P(X_N | Z_N) = P(X_N, Z_N) / P(Z_N), and P(Z_N) is constant given Z_N, so
max_{x_0,...,x_N} P(X_N | Z_N) ⇔ max_{x_0,...,x_N} P(X_N, Z_N) ⇔ max_{x_0,...,x_N} ln P(X_N, Z_N).
50 Viterbi Algorithm Applications After some calculations (see Bertsekas), one can show that the problem is equivalent to:
min_{x_0,...,x_N} [ −ln(π_{x_0}) − Σ_{k=1}^{N} ln( π_{x_{k−1} x_k} r(z_k, x_{k−1}, x_k) ) ],
where π_{x_0} = probability of the initial state, π_{x_{k−1} x_k} = transition probabilities of the Markov chain, and −ln π_{x_0} and −ln(π_{x_{k−1} x_k} r(z_k, x_{k−1}, x_k)) can be regarded as lengths of the different stages ⇒ a shortest path problem through a trellis. Other applications: decoding of convolutional codes; channel equalization in the presence of ISI (inter-symbol interference).
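A sketch of the resulting shortest-path computation: the Viterbi recursion minimizes the accumulated negative log-probabilities stage by stage. The two-state chain and observation model below are made up for illustration and are not from the notes:

```python
# Viterbi as a shortest-path problem: minimize
#   -ln(pi_{x0}) - sum_k ln(pi_{x_{k-1} x_k} * r(z_k, x_{k-1}, x_k)).
import math

pi0 = [0.6, 0.4]                      # initial state probabilities (assumed)
P = [[0.7, 0.3], [0.4, 0.6]]          # transition probabilities (assumed)
# r[z][i][j]: probability of observing z given a transition i -> j (assumed)
r = {0: [[0.9, 0.2], [0.8, 0.3]],
     1: [[0.1, 0.8], [0.2, 0.7]]}

def viterbi(z_seq):
    n = len(pi0)
    cost = [-math.log(pi0[i]) for i in range(n)]   # path lengths so far
    back = []                                      # backpointers per stage
    for z in z_seq:
        new_cost, pred = [], []
        for j in range(n):
            cands = [cost[i] - math.log(P[i][j] * r[z][i][j])
                     for i in range(n)]
            new_cost.append(min(cands))
            pred.append(cands.index(min(cands)))
        cost, back = new_cost, back + [pred]
    # backtrack the most likely state sequence
    x = [cost.index(min(cost))]
    for pred in reversed(back):
        x.append(pred[x[-1]])
    return list(reversed(x))

print(viterbi([0, 0, 1]))   # [0, 0, 0, 1]
```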
51 General Shortest Path Problems No trellis structure, e.g. find the shortest path from each node to node 5 in Figure 10. Figure 10: General shortest path problem Graph with N + 1 nodes {1, 2,..., N, t}; a_ij = cost of moving from node i to node j. Find the shortest path from each node i to node t.
52 General Shortest Path Problems Assume some a_ij's can be negative, but all cycles have non-negative length. Then the shortest path will not involve more than N arcs. Reformulate as a trellis-type shortest path problem with N arcs, by allowing arcs from node i to itself with cost a_ii = 0.
D.P. algorithm: J_{N−1}(i) = a_it; J_k(i) = min_j { a_ij + J_{k+1}(j) }, k = N−2,..., 1, 0.
This algorithm is essentially the Bellman-Ford algorithm. Other algorithms have also been invented, e.g. Dijkstra's algorithm, which can be used when all the a_ij's are non-negative.
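The reformulation with zero-cost self-loops can be sketched directly; the small graph below is illustrative, not the one in Figure 10:

```python
# D.P. form of Bellman-Ford: with self-loop costs a_ii = 0,
# iterate J_k(i) = min_j { a_ij + J_{k+1}(j) } backwards over N-1 stages.
INF = float("inf")

def bellman_ford_dp(a, a_t):
    """a[i][j]: arc cost i -> j (INF if no arc, a[i][i] = 0);
    a_t[i]: cost of the final arc from node i to the terminal node t."""
    n = len(a)
    J = list(a_t)                     # J_{N-1}(i) = a_it
    for _ in range(n - 1):            # k = N-2, ..., 0
        J = [min(a[i][j] + J[j] for j in range(n)) for i in range(n)]
    return J                          # J[i] = shortest path length i -> t

# illustrative 3-node graph: 0 -> 1 (cost 1), 1 -> 2 (cost 2), plus arcs to t
a = [[0, 1, INF],
     [INF, 0, 2],
     [INF, INF, 0]]
a_t = [10, 5, 1]
print(bellman_ford_dp(a, a_t))   # [4, 3, 1]
```

Here the shortest path from node 0 is 0 → 1 → 2 → t with length 1 + 2 + 1 = 4, beating the direct arc of cost 10.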
53 Outline 3 Problems with Perfect State Information Linear Quadratic Control Optimal Stopping Problems
54 Problems with Perfect State Information Will study some problems where analytical solutions can be obtained: linear quadratic control; optimal stopping problems; + others in Chapter 4 of Bertsekas.
55 Linear Quadratic Control (Linear) system: x_{k+1} = A x_k + B u_k + w_k, k = 0, 1,..., N−1. (Quadratic) cost function:
E{ Σ_{k=0}^{N−1} (x_k^T Q x_k + u_k^T R u_k) + x_N^T Q x_N }.
Problem: Determine the optimal policy to minimize the cost function. x_k, u_k, w_k are column vectors; A, B, Q, R are matrices; the w_k are independent and zero mean; Q is positive semi-definite; R is positive definite.
56 Linear Quadratic Control Definition: A symmetric matrix M is positive semi-definite if x^T M x ≥ 0 for all vectors x; M is positive definite if x^T M x > 0 for all x ≠ 0. One characterization: M is positive semi-definite ⇔ all eigenvalues of M are ≥ 0; M is positive definite ⇔ all eigenvalues of M are > 0. The D.P. algorithm applied to this problem gives:
J_N(x_N) = x_N^T Q x_N,
J_k(x_k) = min_{u_k} E{ x_k^T Q x_k + u_k^T R u_k + J_{k+1}(A x_k + B u_k + w_k) }, k = N−1,..., 1, 0.
57 Linear Quadratic Control It turns out that the minimization can be done analytically.
J_{N−1}(x_{N−1}) = min_{u_{N−1}} E{ x_{N−1}^T Q x_{N−1} + u_{N−1}^T R u_{N−1} + (A x_{N−1} + B u_{N−1} + w_{N−1})^T Q (A x_{N−1} + B u_{N−1} + w_{N−1}) }.
Expanding the quadratic, all cross terms involving w_{N−1} vanish in expectation since E[w_{N−1}] = 0, leaving
J_{N−1}(x_{N−1}) = x_{N−1}^T (A^T Q A + Q) x_{N−1} + E{ w_{N−1}^T Q w_{N−1} } + min_{u_{N−1}} { u_{N−1}^T (R + B^T Q B) u_{N−1} + 2 x_{N−1}^T A^T Q B u_{N−1} }.
58 Linear Quadratic Control Digression Problem: min_x f(x). How to solve? For unconstrained scalar problems, differentiate and set the derivative equal to 0, e.g. min_x (x−2)^2: d/dx (x−2)^2 = 2(x−2) = 0 ⇒ x = 2. Similarly, differentiate u_{N−1}^T (R + B^T Q B) u_{N−1} + 2 x_{N−1}^T A^T Q B u_{N−1} with respect to the vector u_{N−1} and set it equal to zero. Note that
∂(u^T A u)/∂u = 2 A u,  ∂(a^T u)/∂u = a,
where a and u are column vectors and A is a symmetric matrix. Using these formulas, we obtain
2(R + B^T Q B) u_{N−1} + 2 B^T Q A x_{N−1} = 0 ⇒ u*_{N−1} = −(R + B^T Q B)^{−1} B^T Q A x_{N−1}.
59 Linear Quadratic Control Substituting u*_{N−1} = −(R + B^T Q B)^{−1} B^T Q A x_{N−1} back into the expression for J_{N−1}(x_{N−1}), we obtain
J_{N−1}(x_{N−1}) = x_{N−1}^T (A^T Q A + Q) x_{N−1} + E{ w_{N−1}^T Q w_{N−1} } + x_{N−1}^T A^T Q B (R + B^T Q B)^{−1} (R + B^T Q B) (R + B^T Q B)^{−1} B^T Q A x_{N−1} − 2 x_{N−1}^T A^T Q B (R + B^T Q B)^{−1} B^T Q A x_{N−1}
= x_{N−1}^T (A^T Q A + Q) x_{N−1} − x_{N−1}^T A^T Q B (R + B^T Q B)^{−1} B^T Q A x_{N−1} + E{ w_{N−1}^T Q w_{N−1} }
= x_{N−1}^T (A^T Q A + Q − A^T Q B (R + B^T Q B)^{−1} B^T Q A) x_{N−1} + E{ w_{N−1}^T Q w_{N−1} }
= x_{N−1}^T K_{N−1} x_{N−1} + E{ w_{N−1}^T Q w_{N−1} },
with K_{N−1} = A^T Q A + Q − A^T Q B (R + B^T Q B)^{−1} B^T Q A.
60 Linear Quadratic Control Continuing on, one can show that u*_{N−2} = −(B^T K_{N−1} B + R)^{−1} B^T K_{N−1} A x_{N−2}, and more generally (tutorial problem) that
μ*_k(x_k) = −(B^T K_{k+1} B + R)^{−1} B^T K_{k+1} A x_k,
where K_N = Q and
K_k = A^T K_{k+1} A + Q − A^T K_{k+1} B (B^T K_{k+1} B + R)^{−1} B^T K_{k+1} A.
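As a scalar illustration of this Riccati recursion (our own example, not from the notes): with A = B = Q = R = 1, the recursion becomes K_k = K_{k+1} + 1 − K_{k+1}^2/(K_{k+1} + 1), whose fixed point K satisfies K^2 = K + 1, i.e. the golden ratio:

```python
# Scalar Riccati recursion K_k = A^2 K' + Q - (A K' B)^2 / (B^2 K' + R),
# iterated backwards from K_N = Q until it converges.
A, B, Q, R = 1.0, 1.0, 1.0, 1.0

def riccati_step(K_next):
    return A * K_next * A + Q - (A * K_next * B) ** 2 / (B * K_next * B + R)

def gain(K_next):
    # mu*_k(x) = L_k x with L_k = -(B^T K_{k+1} B + R)^{-1} B^T K_{k+1} A
    return -(B * K_next * A) / (B * K_next * B + R)

K = Q                    # K_N = Q
for _ in range(50):      # backward in time; K_k converges as the horizon grows
    K = riccati_step(K)

print(round(K, 6))        # 1.618034, i.e. (1 + sqrt(5)) / 2
print(round(gain(K), 6))  # -0.618034; closed loop A + B*L ~ 0.382, stable
```

The limiting K and gain here anticipate the asymptotic behaviour discussed on the next slides: K solves the algebraic Riccati equation and |A + BL| < 1.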
61 Certainty Equivalence Certainty equivalence: the optimal policy is the same as solving the problem for the deterministic system x_{k+1} = A x_k + B u_k + E[w_k], where w_k is replaced by its expected value E[w_k] = 0, i.e. the standard LQR problem.
62 Asymptotic Behaviour Definition: A pair of matrices (A, B), where A is n × n and B is n × m, is controllable if the n × nm matrix [B AB A^2 B ... A^{n−1} B] has full rank (all rows linearly independent). A pair (A, C), where A is n × n and C is m × n, is observable if (A^T, C^T) is controllable.
63 Asymptotic Behaviour Theorem: If (A, B) is controllable and Q can be written as Q = C^T C, where (A, C) is observable, then:
1. K_k → K as the horizon N − k → ∞, with K satisfying the algebraic Riccati equation K = A^T K A + Q − A^T K B (B^T K B + R)^{−1} B^T K A.
2. The steady-state controller μ*(x_k) = L x_k, where L = −(B^T K B + R)^{−1} B^T K A, stabilizes the system, i.e. the eigenvalues of A + BL have magnitude < 1.
Proof: See Bertsekas. Note: If u_k = L x_k, then x_{k+1} = A x_k + B u_k + w_k = (A + BL) x_k + w_k; x_k stays bounded when the eigenvalues of A + BL have magnitude < 1.
64 Other Variations x_{k+1} = A_k x_k + B_k u_k + w_k, with A_k, B_k random, unknown, independent. Optimal policy:
μ*_k(x_k) = −(R + E{B_k^T K_{k+1} B_k})^{−1} E{B_k^T K_{k+1} A_k} x_k,
where K_N = Q and
K_k = E{A_k^T K_{k+1} A_k} + Q − E{A_k^T K_{k+1} B_k} (E{B_k^T K_{k+1} B_k} + R)^{−1} E{B_k^T K_{k+1} A_k}.
May not have certainty equivalence; may not have a steady-state solution. Another variation: x_{k+1} = A x_k + B_k u_k + w_k, where B_k is random, independent, and is only revealed to us at time k. Motivation: wireless channels. Similar to Leong, Dey, Anand, "Optimal LQG control over continuous fading channels", Proc. IFAC World Congress.
65 Optimal Stopping Problems At each state, there is a stop control that stops the system, i.e. moves to, and stays in, a stop state. Pure stopping problem: the only other control is continue. For pure stopping problems, the policy is characterized by a partition of the set of states into a stop region and a continue region, which may depend on time.
66 Example (Asset selling) A person has an asset for sale, e.g. a house. At each time k = 0, 1,..., N−1, the person receives a random offer w_k for the asset. Assume the w_k's are independent. Either accept w_k at time k+1 and invest the money at interest rate r, or reject w_k and wait for the offer w_{k+1}. Must accept the last offer w_{N−1} at time N if every previous offer was rejected. Find the policy that maximizes the (expected) revenue at the N-th period.
67 Example (Asset selling) States: x_k = T: asset already sold (= stop state); x_k = w_{k−1}: offer currently under consideration. Controls: {accept, reject}. The system evolves as
x_{k+1} = f_k(x_k, w_k, u_k) = T, if 1) x_k = T, or 2) x_k ≠ T and u_k = accept; w_k, otherwise.
68 Example (Asset selling) Rewards at time k:
g_N(x_N) = x_N, if x_N ≠ T; 0, otherwise.
g_k(x_k, u_k, w_k) = (1+r)^{N−k} x_k, if x_k ≠ T and u_k = accept; 0, otherwise.
(For compound interest over n years, final amount = (1+r)^n × initial amount.) Note: From the way the rewards are defined, g_k is non-zero for only one k ∈ {0, 1,..., N−1}.
69 Example (Asset selling)
Expected total reward = E[ Σ_{k=0}^{N-1} g_k(x_k, u_k, w_k) + g_N(x_N) ]
D.P. algorithm (for reward maximization):
J_N(x_N) = g_N(x_N) = x_N, if x_N ≠ T; 0, otherwise.
J_k(x_k) = max_{u_k} E[ g_k(x_k, u_k, w_k) + J_{k+1}(x_{k+1}) ]
70 Example (Asset selling)
If x_k = T, then g_k(x_k, u_k, w_k) = 0 and J_{k+1}(x_{k+1}) = 0, by the property that g_k is non-zero for only one k (the reward was incurred prior to time k).
If x_k ≠ T, then
E[ g_k(x_k, u_k, w_k) + J_{k+1}(x_{k+1}) ] = (1 + r)^{N-k} x_k, if u_k = accept; 0 + E[J_{k+1}(w_k)], if u_k = reject.
So
J_k(x_k) = max_{u_k} E[ g_k(x_k, u_k, w_k) + J_{k+1}(x_{k+1}) ]
         = max( (1 + r)^{N-k} x_k, E[J_{k+1}(w_k)] ), if x_k ≠ T; 0, if x_k = T,
and the optimal policy is of the form: u_k = accept if (1 + r)^{N-k} x_k > E[J_{k+1}(w_k)], i.e.
u_k = accept, if x_k > E[J_{k+1}(w_k)] / (1 + r)^{N-k}; reject, otherwise.
71 Example (Asset selling)
Let α_k = E[J_{k+1}(w_k)] / (1 + r)^{N-k}.
Can show (see Bertsekas) that α_k ≥ α_{k+1} for all k if the w_k are i.i.d.
Intuition: an offer acceptable at time k should also be acceptable at time k + 1.
Figure 11: Asset selling. The thresholds α_1 ≥ α_2 ≥ ... ≥ α_{N-1} decrease with k; accept above the threshold curve, reject below it.
72 Example (Asset selling)
Can also show that if the w_k are i.i.d. and N → ∞, the optimal policy converges to the stationary policy
u_k = accept, if x_k > ᾱ; reject, if x_k ≤ ᾱ,
where ᾱ is a constant.
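The thresholds α_k can be computed by backward induction once an offer distribution is fixed. A minimal sketch, assuming hypothetical i.i.d. offers uniform on a small grid and a hypothetical rate r:

```python
# Backward induction for the asset-selling thresholds alpha_k.
# Hypothetical setup: i.i.d. offers uniform on {0, 0.1, ..., 1.0}, r = 5%.
N, r = 10, 0.05
offers = [i / 10 for i in range(11)]

J = {w: w for w in offers}   # J_N(x) = x for x != T
alphas = {}
for k in range(N - 1, 0, -1):
    EJ = sum(J.values()) / len(offers)           # E[J_{k+1}(w_k)]
    alphas[k] = EJ / (1 + r) ** (N - k)          # threshold alpha_k
    J = {w: max((1 + r) ** (N - k) * w, EJ) for w in offers}  # J_k

print([round(alphas[k], 3) for k in range(1, N)])
```

The printed sequence should be non-increasing in k, consistent with the property α_k ≥ α_{k+1} above.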
73 General Stopping Problems
Pure stopping problem: stop and continue are the only possible controls.
General stopping problem: stop, or choose a control u_k from U(x_k) (where U(x_k) has more than one element).
Consider the time-invariant case: f(x_k, u_k, w_k) and g(x_k, u_k, w_k) don't depend on k, and the w_k are i.i.d.
Stopping at time k incurs cost t(x_k). Must stop by the last stage.
D.P. algorithm:
J_N(x_N) = t(x_N),
J_k(x_k) = min[ t(x_k), min_{u_k ∈ U(x_k)} E{ g(x_k, u_k, w_k) + J_{k+1}(f(x_k, u_k, w_k)) } ]
Optimal to stop when
t(x_k) ≤ min_{u_k ∈ U(x_k)} E{ g(x_k, u_k, w_k) + J_{k+1}(f(x_k, u_k, w_k)) }
74 General Stopping Problems
Stopping set at time k (the set of states where we stop):
T_k = { x : t(x) ≤ min_{u ∈ U(x)} E[ g(x, u, w) + J_{k+1}(f(x, u, w)) ] }
Note that J_{N-1}(x) ≤ J_N(x) for all x, since J_N(x) = t(x) and
J_{N-1}(x) = min[ t(x), min_{u ∈ U(x)} E[ g(x, u, w) + J_N(f(x, u, w)) ] ] ≤ t(x) = J_N(x)
Can show that J_k(x) ≤ J_{k+1}(x) (monotonicity principle: tutorial problem).
Then we have T_0 ⊆ T_1 ⊆ T_2 ⊆ ... ⊆ T_k ⊆ T_{k+1} ⊆ ... ⊆ T_{N-1},
i.e. the set of states in which we stop increases with time.
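The nesting of the stopping sets can be checked numerically on a toy problem; the chain, stopping cost, continuation cost and horizon below are all hypothetical.

```python
# Toy general stopping problem on states {0,...,5}: stop cost t(x) = x,
# continuation cost g = 0.3, continue moves to x-1 or x+1 (clipped) w.p. 1/2.
states = list(range(6))
N, g = 5, 0.3
t = lambda x: float(x)
succ = lambda x: [max(x - 1, 0), min(x + 1, 5)]

J = {x: t(x) for x in states}          # J_N = t
stop_sets = []
for k in range(N - 1, -1, -1):
    cont = {x: g + sum(J[y] for y in succ(x)) / 2 for x in states}
    stop_sets.append({x for x in states if t(x) <= cont[x]})  # T_k
    J = {x: min(t(x), cont[x]) for x in states}

stop_sets.reverse()                     # stop_sets[k] = T_k
print(stop_sets)
```

The computed sets satisfy T_0 ⊆ T_1 ⊆ ... ⊆ T_{N-1}, as the monotonicity argument above predicts.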
75 Special Case
If f(x, u, w) ∈ T_{N-1} for all x ∈ T_{N-1}, u ∈ U(x), w (i.e. the set T_{N-1} is absorbing), then T_0 = T_1 = T_2 = ... = T_{N-1}.
Proof: see Bertsekas.
This simplifies the optimal policy; it is called the one-step lookahead policy.
76 Special Case
E.g. asset selling with past offers retained.
Same situation as before, except that previously rejected offers can be accepted at a later time.
The state evolves as x_{k+1} = max(x_k, w_k) (instead of x_{k+1} = w_k before).
Can show (see Bertsekas) that T_{N-1} = { x : x ≥ ᾱ } for some constant ᾱ.
This set is absorbing, since the best offer received so far cannot decrease over time.
So the optimal policy at every time k is to accept if the best offer exceeds ᾱ.
Have a constant threshold ᾱ even for a finite horizon N.
77 Outline
4 Problems with Imperfect State Information
Reformulation as Perfect State Information Problem
Linear Quadratic Control with Noisy Measurements
Sufficient Statistics
78 Problems with Imperfect State Information
The state x_k is not known to the controller. Instead we have noisy observations z_k of the form
z_0 = h_0(x_0, v_0), z_k = h_k(x_k, u_{k-1}, v_k), k = 1, 2, ..., N - 1,
where v_k is observation noise, with a probability distribution
P_{v_k}( · | x_0, ..., x_k, u_0, ..., u_{k-1}, w_0, ..., w_{k-1}, v_0, ..., v_{k-1})
which can depend on states, controls and disturbances.
Examples: h_k(x_k, u_{k-1}, v_k) = x_k + v_k, h_k(x_k, u_{k-1}, v_k) = sin(x_k) + u_{k-1} v_k
79 Problems with Imperfect State Information
The initial state x_0 is random with distribution P_{x_0}.
u_k ∈ U_k, where U_k does not depend on the (unknown) x_k.
Information vector, i.e. the information available to the controller at time k, defined as
I_0 = z_0, I_k = (z_0, ..., z_k, u_0, ..., u_{k-1}), k = 1, 2, ..., N - 1.
Policies π = (µ_0, ..., µ_{N-1}), where now µ_k(I_k) ∈ U_k (before: µ_k(x_k)).
80 Basic Problem with Imperfect State Information
Find π that minimizes the cost function
J_π = E{ Σ_{k=0}^{N-1} g_k(x_k, µ_k(I_k), w_k) + g_N(x_N) }
subject to the system equation x_{k+1} = f_k(x_k, µ_k(I_k), w_k)
and the measurement equation z_k = h_k(x_k, µ_{k-1}(I_{k-1}), v_k).
Question: how to solve this problem?
81 Reformulation as Perfect State Information Problem
Idea: define a new system where the state is I_k. Then we have a D.P. algorithm, etc.
By definition,
I_{k+1} = (z_0, ..., z_k, z_{k+1}, u_0, ..., u_{k-1}, u_k) = (I_k, z_{k+1}, u_k),
i.e. I_{k+1} = (I_k, u_k, z_{k+1}).
82 Reformulation as Perfect State Information Problem
Regard I_{k+1} = (I_k, u_k, z_{k+1}) as a dynamical system with state I_k, control u_k and disturbance z_{k+1}.
Next note that E[g_k(x_k, u_k, w_k)] = E[ E[g_k(x_k, u_k, w_k) | I_k, u_k] ] (recall that E[X] = E[E[X | Y]]).
Define g̃_k(I_k, u_k) = E[g_k(x_k, u_k, w_k) | I_k, u_k] = cost per stage of the new system, and g̃_N(I_N) = E[g_N(x_N) | I_N] = terminal cost.
The cost function becomes
E{ Σ_{k=0}^{N-1} g_k(x_k, µ_k(I_k), w_k) + g_N(x_N) } = E{ Σ_{k=0}^{N-1} g̃_k(I_k, µ_k(I_k)) + g̃_N(I_N) }.
83 Reformulation as Perfect State Information Problem
D.P. algorithm for the reformulated perfect state information problem:
J_N(I_N) = g̃_N(I_N) = E[g_N(x_N) | I_N]
J_k(I_k) = min_{u_k ∈ U_k} E{ g̃_k(I_k, u_k) + J_{k+1}(I_k, u_k, z_{k+1}) }
         = min_{u_k ∈ U_k} E{ g_k(x_k, u_k, w_k) + J_{k+1}(I_k, u_k, z_{k+1}) | I_k }, k = N - 1, ..., 0.
Optimal cost J* = E{ J_0(z_0) }.
84 Linear Quadratic Control with Noisy Measurements
System: x_{k+1} = A x_k + B u_k + w_k
Cost function: E[ Σ_{k=0}^{N-1} (x_k^T Q x_k + u_k^T R u_k) + x_N^T Q x_N ], where the summand is g_k(x_k, u_k, w_k) and the terminal term is g_N(x_N).
Observations: z_k = C x_k + v_k
The w_k are independent, zero mean.
From the D.P. algorithm: J_N(I_N) = E[ x_N^T Q x_N | I_N ].
85 Linear Quadratic Control with Noisy Measurements
J_{N-1}(I_{N-1}) = min_{u_{N-1}} E{ x_{N-1}^T Q x_{N-1} + u_{N-1}^T R u_{N-1} + E[ (A x_{N-1} + B u_{N-1} + w_{N-1})^T Q (A x_{N-1} + B u_{N-1} + w_{N-1}) | I_N ] | I_{N-1} }
= min_{u_{N-1}} E{ x_{N-1}^T Q x_{N-1} + u_{N-1}^T R u_{N-1} + (A x_{N-1} + B u_{N-1} + w_{N-1})^T Q (A x_{N-1} + B u_{N-1} + w_{N-1}) | I_{N-1} }
(using the tower property E(E(X | Y) | Z) = E(X | Z) when Y contains more information than Z)
= ... (expand, simplify and use E(w_{N-1} | I_{N-1}) = 0)
= E[ x_{N-1}^T (A^T Q A + Q) x_{N-1} | I_{N-1} ] + E[ w_{N-1}^T Q w_{N-1} | I_{N-1} ]
+ min_{u_{N-1}} { u_{N-1}^T (B^T Q B + R) u_{N-1} + 2 E[x_{N-1} | I_{N-1}]^T A^T Q B u_{N-1} }
Differentiate with respect to u_{N-1} and set equal to zero:
2 (B^T Q B + R) u_{N-1} + 2 B^T Q A E[x_{N-1} | I_{N-1}] = 0
⇒ u*_{N-1} = -(B^T Q B + R)^{-1} B^T Q A E[x_{N-1} | I_{N-1}]
86 Linear Quadratic Control with Noisy Measurements
Substituting the expression for u*_{N-1} back in:
J_{N-1}(I_{N-1}) = E[ x_{N-1}^T (A^T Q A + Q) x_{N-1} | I_{N-1} ] + E[ w_{N-1}^T Q w_{N-1} ]
+ E[x_{N-1} | I_{N-1}]^T A^T Q B (B^T Q B + R)^{-1} (B^T Q B + R) (B^T Q B + R)^{-1} B^T Q A E[x_{N-1} | I_{N-1}]
- 2 E[x_{N-1} | I_{N-1}]^T A^T Q B (B^T Q B + R)^{-1} B^T Q A E[x_{N-1} | I_{N-1}]
= E[ x_{N-1}^T (A^T Q A + Q) x_{N-1} | I_{N-1} ] + E[ w_{N-1}^T Q w_{N-1} ]
- E[x_{N-1} | I_{N-1}]^T A^T Q B (B^T Q B + R)^{-1} B^T Q A E[x_{N-1} | I_{N-1}]
= E[ x_{N-1}^T (A^T Q A + Q) x_{N-1} | I_{N-1} ] + E[ w_{N-1}^T Q w_{N-1} ]
+ E[ (x_{N-1} - E[x_{N-1} | I_{N-1}])^T A^T Q B (B^T Q B + R)^{-1} B^T Q A (x_{N-1} - E[x_{N-1} | I_{N-1}]) | I_{N-1} ]
- E[ x_{N-1}^T A^T Q B (B^T Q B + R)^{-1} B^T Q A x_{N-1} | I_{N-1} ],
with P_{N-1} = A^T Q B (B^T Q B + R)^{-1} B^T Q A.
87 Linear Quadratic Control with Noisy Measurements
We have
J_{N-1}(I_{N-1}) = E[ x_{N-1}^T K_{N-1} x_{N-1} | I_{N-1} ] + E[ w_{N-1}^T Q w_{N-1} ] + E[ (x_{N-1} - E[x_{N-1} | I_{N-1}])^T P_{N-1} (x_{N-1} - E[x_{N-1} | I_{N-1}]) | I_{N-1} ]
where
P_{N-1} = A^T Q B (B^T Q B + R)^{-1} B^T Q A
K_{N-1} = A^T Q A + Q - P_{N-1}.
88 Linear Quadratic Control with Noisy Measurements
For period N - 2,
J_{N-2}(I_{N-2}) = min_{u_{N-2}} E{ x_{N-2}^T Q x_{N-2} + u_{N-2}^T R u_{N-2} + J_{N-1}(I_{N-1}) | I_{N-2} }
= E{ x_{N-2}^T Q x_{N-2} | I_{N-2} } + min_{u_{N-2}} [ u_{N-2}^T R u_{N-2} + E{ x_{N-1}^T K_{N-1} x_{N-1} | I_{N-2} } ]
+ E[ (x_{N-1} - E[x_{N-1} | I_{N-1}])^T P_{N-1} (x_{N-1} - E[x_{N-1} | I_{N-1}]) | I_{N-2} ] + E( w_{N-1}^T Q w_{N-1} )
Then we can obtain u*_{N-2} = -(B^T K_{N-1} B + R)^{-1} B^T K_{N-1} A E[x_{N-2} | I_{N-2}]
Note that in the above the term E[ (x_{N-1} - E[x_{N-1} | I_{N-1}])^T P_{N-1} (x_{N-1} - E[x_{N-1} | I_{N-1}]) | I_{N-2} ] can be taken outside the minimization (see Bertsekas for proof).
Intuition: the estimation error x_k - E[x_k | I_k] can't be influenced by the choice of control.
89 Linear Quadratic Control with Noisy Measurements
Continuing on, the general solution is
µ*_k(I_k) = u*_k = -(B^T K_{k+1} B + R)^{-1} B^T K_{k+1} A E[x_k | I_k] = L_k E[x_k | I_k]
where
K_N = Q
P_k = A^T K_{k+1} B (B^T K_{k+1} B + R)^{-1} B^T K_{k+1} A
K_k = A^T K_{k+1} A + Q - P_k
Comparison with the perfect state information case:
The L_k matrix is the same.
x_k is replaced by E[x_k | I_k].
How to compute E[x_k | I_k]?
90 Linear Quadratic Control with Noisy Measurements
Summary so far:
System: x_{k+1} = A x_k + B u_k + w_k, z_k = C x_k + v_k
Problem: min E[ Σ_{k=0}^{N-1} (x_k^T Q x_k + u_k^T R u_k) + x_N^T Q x_N ]
The optimal solution is
µ*_k(I_k) = -(B^T K_{k+1} B + R)^{-1} B^T K_{k+1} A E[x_k | I_k] = L_k E[x_k | I_k]
where I_k = (z_0, ..., z_k, u_0, ..., u_{k-1}).
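The gain sequence L_k can be sketched by running the Riccati recursion backward; the scalar values of A, B, Q, R, N below are hypothetical.

```python
# Scalar backward Riccati recursion for the LQG control gains L_k.
# A, B, Q, R, N below are hypothetical illustration values.
A, B, Q, R, N = 1.1, 1.0, 1.0, 0.5, 20

K = Q  # K_N = Q
gains = []
for k in range(N - 1, -1, -1):
    L = -(B * K * A) / (B * K * B + R)       # L_k
    gains.append(L)
    P = (A * K * B) ** 2 / (B * K * B + R)   # P_k
    K = A * K * A + Q - P                    # K_k

gains.reverse()  # gains[k] = L_k, so u_k = gains[k] * E[x_k | I_k]
print(gains[0], K)
```

These are the same gains as in the perfect state information problem; only x_k is replaced by the estimate E[x_k | I_k].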
91 Linear Quadratic Control with Noisy Measurements
The optimal controller can be decomposed into two parts:
1) An estimator, which computes E[x_k | I_k].
2) An actuator, which multiplies E[x_k | I_k] by L_k.
L_k is the same gain matrix as in the perfect state information case; only x_k is replaced by E[x_k | I_k].
The estimator and actuator can be designed separately.
Known as the separation principle/theorem.
92 LQG Control
Remaining problem: how do we compute E[x_k | I_k]?
Very difficult problem in general (the subject is called non-linear filtering).
When the system is linear and w_k, v_k are Gaussian, E[x_k | I_k] can be computed analytically.
The procedure/algorithm is known as the Kalman filter (ref: Anderson and Moore, Optimal Filtering), and the overall controller is called the LQG (linear quadratic Gaussian) controller.
93 Kalman Filter
System: x_{k+1} = A x_k + B u_k + w_k, z_k = C x_k + v_k
w_k ~ N(0, Σ_w) i.i.d., Σ_w = E[w_k w_k^T]
v_k ~ N(0, Σ_v) i.i.d., Σ_v = E[v_k v_k^T]
Define the state estimates
x̂_{k|k} = E[x_k | I_k], x̂_{k+1|k} = E[x_{k+1} | I_k]
and the estimation error covariance matrices
Σ_{k|k} = E[ (x_k - x̂_{k|k})(x_k - x̂_{k|k})^T | I_k ]
Σ_{k+1|k} = E[ (x_{k+1} - x̂_{k+1|k})(x_{k+1} - x̂_{k+1|k})^T | I_k ]
94 Kalman Filter
Then x̂_{k|k}, x̂_{k+1|k}, Σ_{k|k}, Σ_{k+1|k} can be computed recursively using the Kalman filter equations:
x̂_{k|k} = x̂_{k|k-1} + Σ_{k|k-1} C^T (C Σ_{k|k-1} C^T + Σ_v)^{-1} (z_k - C x̂_{k|k-1})
x̂_{k+1|k} = A x̂_{k|k} + B u_k
Σ_{k|k} = Σ_{k|k-1} - Σ_{k|k-1} C^T (C Σ_{k|k-1} C^T + Σ_v)^{-1} C Σ_{k|k-1}
Σ_{k+1|k} = A Σ_{k|k} A^T + Σ_w, k = 0, 1, ..., N - 1
Proof: see Bertsekas, or Anderson and Moore.
Beware: many people who work in Kalman filtering use Q for Σ_w, R for Σ_v, and K_k for the Kalman gain Σ_{k|k-1} C^T (C Σ_{k|k-1} C^T + Σ_v)^{-1}, but here Q, R, K_k have been used for different things. People also use P_{k+1|k} for Σ_{k+1|k}, P_{k|k} for Σ_{k|k}, etc.
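A scalar sketch of the four recursions above, with hypothetical numbers and u_k = 0 so the B u_k terms drop out:

```python
import random

# Scalar Kalman filter for x_{k+1} = A x_k + w_k, z_k = C x_k + v_k.
# All numbers are hypothetical illustration values; u_k = 0 throughout.
random.seed(0)
A, C, Sw, Sv = 0.9, 1.0, 0.1, 0.2
N = 500

x = random.gauss(0.0, 1.0)     # true state, x_0 ~ N(0, 1)
xhat, Sigma = 0.0, 1.0         # xhat_{0|-1} and Sigma_{0|-1}
sq_err = 0.0
for k in range(N):
    z = C * x + random.gauss(0.0, Sv ** 0.5)
    # measurement update
    gain = Sigma * C / (C * Sigma * C + Sv)
    xhat_f = xhat + gain * (z - C * xhat)     # xhat_{k|k}
    Sigma_f = Sigma - gain * C * Sigma        # Sigma_{k|k}
    sq_err += (x - xhat_f) ** 2
    # time update (u_k = 0)
    x = A * x + random.gauss(0.0, Sw ** 0.5)
    xhat = A * xhat_f                          # xhat_{k+1|k}
    Sigma = A * Sigma_f * A + Sw               # Sigma_{k+1|k}

print(sq_err / N, Sigma_f)  # empirical MSE vs. steady-state Sigma_{k|k}
```

The empirical mean squared error settles near the filtered covariance Σ_{k|k}, which is the consistency one would expect from the definitions above.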
95 Kalman Filter Properties
In general, the mean squared error E[ (x_k - x̂_k)^T (x_k - x̂_k) | I_k ] is minimized when x̂_k = E[x_k | I_k].
The Kalman filter equations compute E[x_k | I_k] when the noises are Gaussian, and the (optimal) estimates are linear functions of the measurements z_k.
Even when the noises are not Gaussian, the x̂_{k|k} computed by the Kalman filter equations gives the best linear estimate of x_k.
So it is a useful suboptimal solution when the noises are non-Gaussian.
96 Kalman Filter Properties
Recall that if the pair (A, B) is controllable and (A, Q^{1/2}) is observable, the optimal controller has a steady-state solution.
Similarly, if (A, C) is observable and (A, Σ_w^{1/2}) is controllable, then Σ_{k|k-1} converges to a steady-state value Σ̄ as k → ∞, where Σ̄ satisfies the algebraic Riccati equation
Σ̄ = A Σ̄ A^T - A Σ̄ C^T (C Σ̄ C^T + Σ_v)^{-1} C Σ̄ A^T + Σ_w
So we have a steady-state estimator:
x̂_{k|k} = x̂_{k|k-1} + Σ̄ C^T (C Σ̄ C^T + Σ_v)^{-1} (z_k - C x̂_{k|k-1})
x̂_{k+1|k} = A x̂_{k|k} + B u_k
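One way to obtain Σ̄ numerically is simply to iterate the covariance recursions until Σ_{k+1|k} stops changing; a scalar sketch with hypothetical numbers:

```python
# Iterate the covariance recursion to the fixed point of the ARE
# (scalar sketch; A, C, Sw, Sv are hypothetical illustration values).
A, C, Sw, Sv = 0.9, 1.0, 0.1, 0.2

Sigma = 1.0  # any positive initialization
for _ in range(1000):
    Sf = Sigma - Sigma * C * (C * Sigma * C + Sv) ** -1 * C * Sigma
    Sigma = A * Sf * A + Sw

# The limit satisfies the algebraic Riccati equation:
resid = Sigma - (A * Sigma * A
                 - A * Sigma * C * (C * Sigma * C + Sv) ** -1 * C * Sigma * A
                 + Sw)
print(Sigma, resid)
```

The residual of the algebraic Riccati equation at the limit is numerically zero, confirming convergence to Σ̄.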
97 Sufficient Statistics
Information vector I_k = (z_0, ..., z_k, u_0, ..., u_{k-1}).
The dimension of I_k increases with time k. Inconvenient for large k.
Sufficient statistic: a function S_k(I_k) which summarizes all essential content in I_k for computing the optimal control, i.e. µ*_k(I_k) = µ̄_k(S_k(I_k)) for some function µ̄_k.
S_k(I_k) is preferably of smaller dimension than I_k.
98 Examples of Sufficient Statistics
1) I_k itself.
2) The conditional state distribution/belief state P_{x_k | I_k}, assuming that the distribution of v_k depends only on x_{k-1}, u_{k-1}, w_{k-1}.
If the number of states is finite then P_{x_k | I_k} is a vector, e.g. if the states are 1, 2, ..., n, then
P_{x_k | I_k} = ( P(x_k = 1 | I_k), P(x_k = 2 | I_k), ..., P(x_k = n | I_k) ).
The dimension of the vector is n, which doesn't grow with k.
3) Special case: E[x_k | I_k] is a sufficient statistic for the LQG problem (though not a sufficient statistic in general).
99 Conditional State Distribution
The conditional state distribution P_{x_k | I_k} can be generated recursively, as
P_{x_{k+1} | I_{k+1}} = Φ_k(P_{x_k | I_k}, u_k, z_{k+1})
for some function Φ_k(·, ·, ·). Then the D.P. algorithm can be written as
J_k(P_{x_k | I_k}) = min_{u_k ∈ U_k} E[ g_k(x_k, u_k, w_k) + J_{k+1}(Φ_k(P_{x_k | I_k}, u_k, z_{k+1})) | I_k ].
A general formula for Φ_k(·, ·, ·) can be derived, but it is quite complicated (see Bertsekas). We will derive some examples from first principles.
100 Example 1: Search Problem
At each period, decide whether to search a site that may contain a treasure.
If the treasure is present and we search, we find it with probability β and take it.
States: {treasure present, treasure not present}
Controls: {search, no search}
Regard each search result as an (imperfect) observation of the state.
Let p_k = probability that the treasure is present at the start of time k.
If we do not search, p_{k+1} = p_k.
If we search and find the treasure, p_{k+1} = 0.
101 Example 1
If we search and don't find the treasure,
p_{k+1} = P(treasure present at k | don't find at k)
        = P(treasure present at k and don't find at k) / P(don't find at k)
        = p_k (1 - β) / ( p_k (1 - β) + (1 - p_k) ),
with the (1 - p_k) term corresponding to the treasure not being present (and hence not found).
Thus
p_{k+1} = p_k, if we do not search at time k;
          0, if we search and find the treasure;
          p_k (1 - β) / ( p_k (1 - β) + (1 - p_k) ), if we search and don't find the treasure,
i.e. p_{k+1} = Φ_k(p_k, u_k, z_{k+1}) for an appropriate function Φ_k.
102 Example 1
Now let the treasure be worth V, let each search cost C, and suppose that once we decide not to search we can't search again at future times.
The D.P. algorithm gives:
J_k(p_k) = max over {no search, search} of
[ 0, -C + p_k β V + p_k β J_{k+1}(0) + (1 - p_k β) J_{k+1}( p_k (1 - β) / ( p_k (1 - β) + 1 - p_k ) ) ]
= max[ 0, -C + p_k β V + (1 - p_k β) J_{k+1}( p_k (1 - β) / ( p_k (1 - β) + 1 - p_k ) ) ]
(where p_k β J_{k+1}(0) = 0, since the treasure is already found)
Can show that J_k(p_k) = 0 for p_k ≤ C / (β V), and that it is optimal to search iff the expected reward p_k β V ≥ the cost of search C. (Tutorial problem)
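The belief recursion and the D.P. iteration above can be sketched directly; the values of V, C, β, N below are hypothetical.

```python
# Backward DP for the treasure-search problem on the belief p.
# V, C, beta, N are hypothetical illustration values.
V, C, beta, N = 10.0, 1.0, 0.5, 8

def next_p(p):
    # belief after searching and not finding the treasure
    return p * (1 - beta) / (p * (1 - beta) + (1 - p))

def J(k, p):
    if k == N:
        return 0.0
    search = -C + p * beta * V + (1 - p * beta) * J(k + 1, next_p(p))
    return max(0.0, search)  # max over {no search, search}

# Threshold: searching is worthwhile iff p*beta*V >= C, i.e. p >= 0.2 here.
print(J(0, 0.1), J(0, 0.5))
```

Below the threshold p = C/(βV) the value is 0 (never search); above it, searching yields a strictly positive expected reward, matching the tutorial claim.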
103 Example 2: Research Paper*
A process {P_{e,k}} evolves in the following way, for k = 1, ..., N:
P_{e,k+1} = P̄, if ν_{k+1} γ_{e,k+1} = 1; A P_{e,k} A^T + Q, if ν_{k+1} γ_{e,k+1} = 0,
where P̄, A, Q are some matrices.
{γ_{e,k}} is an i.i.d. Bernoulli process with P(γ_{e,k} = 1) = λ_e, P(γ_{e,k} = 0) = 1 - λ_e, for all k. ν_k ∈ {0, 1}.
{P_{e,k}} is not observed at all (no observation z_k).
*Leong, Quevedo, Dolz, Dey, "On Remote State Estimation in the Presence of an Eavesdropper," Proc. IFAC World Congress, 2017.
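A small Monte Carlo sketch of this process in the scalar case. All parameter values below are hypothetical, and ν_k = 1 is assumed for all k, which is a simplification not made in the paper.

```python
import random

# Monte Carlo sketch of the eavesdropper covariance process (scalar case).
# Pbar, A, Q, lam_e are hypothetical; nu_k = 1 is assumed for all k.
random.seed(1)
Pbar, A, Q, lam_e = 1.0, 1.2, 0.5, 0.6
N, runs = 30, 1000

total = 0.0
for _ in range(runs):
    P = Pbar  # P_{e,1}
    for k in range(N):
        if random.random() < lam_e:   # gamma_{e,k+1} = 1: reset to Pbar
            P = Pbar
        else:                         # gamma_{e,k+1} = 0: covariance grows
            P = A * P * A + Q
    total += P

print(total / runs)  # average eavesdropper error covariance at time N
```

Since A is unstable here, the covariance grows between resets, so the average stays above P̄; how large it gets is governed by the reception probability λ_e.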
More informationPakes (1986): Patents as Options: Some Estimates of the Value of Holding European Patent Stocks
Pakes (1986): Patents as Options: Some Estimates of the Value of Holding European Patent Stocks Spring 2009 Main question: How much are patents worth? Answering this question is important, because it helps
More information1 Answers to the Sept 08 macro prelim - Long Questions
Answers to the Sept 08 macro prelim - Long Questions. Suppose that a representative consumer receives an endowment of a non-storable consumption good. The endowment evolves exogenously according to ln
More informationMaking Decisions. CS 3793 Artificial Intelligence Making Decisions 1
Making Decisions CS 3793 Artificial Intelligence Making Decisions 1 Planning under uncertainty should address: The world is nondeterministic. Actions are not certain to succeed. Many events are outside
More informationSYLLABUS AND SAMPLE QUESTIONS FOR MS(QE) Syllabus for ME I (Mathematics), 2012
SYLLABUS AND SAMPLE QUESTIONS FOR MS(QE) 2012 Syllabus for ME I (Mathematics), 2012 Algebra: Binomial Theorem, AP, GP, HP, Exponential, Logarithmic Series, Sequence, Permutations and Combinations, Theory
More informationLecture Quantitative Finance Spring Term 2015
implied Lecture Quantitative Finance Spring Term 2015 : May 7, 2015 1 / 28 implied 1 implied 2 / 28 Motivation and setup implied the goal of this chapter is to treat the implied which requires an algorithm
More informationReinforcement Learning. Slides based on those used in Berkeley's AI class taught by Dan Klein
Reinforcement Learning Slides based on those used in Berkeley's AI class taught by Dan Klein Reinforcement Learning Basic idea: Receive feedback in the form of rewards Agent s utility is defined by the
More informationTrust Region Methods for Unconstrained Optimisation
Trust Region Methods for Unconstrained Optimisation Lecture 9, Numerical Linear Algebra and Optimisation Oxford University Computing Laboratory, MT 2007 Dr Raphael Hauser (hauser@comlab.ox.ac.uk) The Trust
More information1 Dynamic programming
1 Dynamic programming A country has just discovered a natural resource which yields an income per period R measured in terms of traded goods. The cost of exploitation is negligible. The government wants
More informationFinancial Mathematics III Theory summary
Financial Mathematics III Theory summary Table of Contents Lecture 1... 7 1. State the objective of modern portfolio theory... 7 2. Define the return of an asset... 7 3. How is expected return defined?...
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning Monte Carlo Methods Mark Schmidt University of British Columbia Winter 2019 Last Time: Markov Chains We can use Markov chains for density estimation, d p(x) = p(x 1 ) p(x }{{}
More informationCHOICE THEORY, UTILITY FUNCTIONS AND RISK AVERSION
CHOICE THEORY, UTILITY FUNCTIONS AND RISK AVERSION Szabolcs Sebestyén szabolcs.sebestyen@iscte.pt Master in Finance INVESTMENTS Sebestyén (ISCTE-IUL) Choice Theory Investments 1 / 65 Outline 1 An Introduction
More informationStochastic Games and Bayesian Games
Stochastic Games and Bayesian Games CPSC 532L Lecture 10 Stochastic Games and Bayesian Games CPSC 532L Lecture 10, Slide 1 Lecture Overview 1 Recap 2 Stochastic Games 3 Bayesian Games Stochastic Games
More informationLecture outline W.B.Powell 1
Lecture outline What is a policy? Policy function approximations (PFAs) Cost function approximations (CFAs) alue function approximations (FAs) Lookahead policies Finding good policies Optimizing continuous
More informationCS 188: Artificial Intelligence. Outline
C 188: Artificial Intelligence Markov Decision Processes (MDPs) Pieter Abbeel UC Berkeley ome slides adapted from Dan Klein 1 Outline Markov Decision Processes (MDPs) Formalism Value iteration In essence
More informationDrunken Birds, Brownian Motion, and Other Random Fun
Drunken Birds, Brownian Motion, and Other Random Fun Michael Perlmutter Department of Mathematics Purdue University 1 M. Perlmutter(Purdue) Brownian Motion and Martingales Outline Review of Basic Probability
More informationSOLVING ROBUST SUPPLY CHAIN PROBLEMS
SOLVING ROBUST SUPPLY CHAIN PROBLEMS Daniel Bienstock Nuri Sercan Özbay Columbia University, New York November 13, 2005 Project with Lucent Technologies Optimize the inventory buffer levels in a complicated
More informationOutline. 1 Introduction. 2 Algorithms. 3 Examples. Algorithm 1 General coordinate minimization framework. 1: Choose x 0 R n and set k 0.
Outline Coordinate Minimization Daniel P. Robinson Department of Applied Mathematics and Statistics Johns Hopkins University November 27, 208 Introduction 2 Algorithms Cyclic order with exact minimization
More information1 The EOQ and Extensions
IEOR4000: Production Management Lecture 2 Professor Guillermo Gallego September 16, 2003 Lecture Plan 1. The EOQ and Extensions 2. Multi-Item EOQ Model 1 The EOQ and Extensions We have explored some of
More informationMAT 4250: Lecture 1 Eric Chung
1 MAT 4250: Lecture 1 Eric Chung 2Chapter 1: Impartial Combinatorial Games 3 Combinatorial games Combinatorial games are two-person games with perfect information and no chance moves, and with a win-or-lose
More informationInformation Acquisition under Persuasive Precedent versus Binding Precedent (Preliminary and Incomplete)
Information Acquisition under Persuasive Precedent versus Binding Precedent (Preliminary and Incomplete) Ying Chen Hülya Eraslan January 9, 216 Abstract We analyze a dynamic model of judicial decision
More information16 MAKING SIMPLE DECISIONS
253 16 MAKING SIMPLE DECISIONS Let us associate each state S with a numeric utility U(S), which expresses the desirability of the state A nondeterministic action a will have possible outcome states Result(a)
More informationAM 121: Intro to Optimization Models and Methods
AM 121: Intro to Optimization Models and Methods Lecture 18: Markov Decision Processes Yiling Chen and David Parkes Lesson Plan Markov decision processes Policies and Value functions Solving: average reward,
More informationOPTIMAL PORTFOLIO CONTROL WITH TRADING STRATEGIES OF FINITE
Proceedings of the 44th IEEE Conference on Decision and Control, and the European Control Conference 005 Seville, Spain, December 1-15, 005 WeA11.6 OPTIMAL PORTFOLIO CONTROL WITH TRADING STRATEGIES OF
More information91.420/543: Artificial Intelligence UMass Lowell CS Fall 2010
91.420/543: Artificial Intelligence UMass Lowell CS Fall 2010 Lecture 17 & 18: Markov Decision Processes Oct 12 13, 2010 A subset of Lecture 9 slides from Dan Klein UC Berkeley Many slides over the course
More informationMarkov Decision Processes: Making Decision in the Presence of Uncertainty. (some of) R&N R&N
Markov Decision Processes: Making Decision in the Presence of Uncertainty (some of) R&N 16.1-16.6 R&N 17.1-17.4 Different Aspects of Machine Learning Supervised learning Classification - concept learning
More informationHomework Assignments
Homework Assignments Week 1 (p. 57) #4.1, 4., 4.3 Week (pp 58 6) #4.5, 4.6, 4.8(a), 4.13, 4.0, 4.6(b), 4.8, 4.31, 4.34 Week 3 (pp 15 19) #1.9, 1.1, 1.13, 1.15, 1.18 (pp 9 31) #.,.6,.9 Week 4 (pp 36 37)
More informationLecture 3: Factor models in modern portfolio choice
Lecture 3: Factor models in modern portfolio choice Prof. Massimo Guidolin Portfolio Management Spring 2016 Overview The inputs of portfolio problems Using the single index model Multi-index models Portfolio
More informationEE365: Markov Decision Processes
EE365: Markov Decision Processes Markov decision processes Markov decision problem Examples 1 Markov decision processes 2 Markov decision processes add input (or action or control) to Markov chain with
More informationDefinition 4.1. In a stochastic process T is called a stopping time if you can tell when it happens.
102 OPTIMAL STOPPING TIME 4. Optimal Stopping Time 4.1. Definitions. On the first day I explained the basic problem using one example in the book. On the second day I explained how the solution to the
More informationCS 343: Artificial Intelligence
CS 343: Artificial Intelligence Markov Decision Processes II Prof. Scott Niekum The University of Texas at Austin [These slides based on those of Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC
More informationApproximate Revenue Maximization with Multiple Items
Approximate Revenue Maximization with Multiple Items Nir Shabbat - 05305311 December 5, 2012 Introduction The paper I read is called Approximate Revenue Maximization with Multiple Items by Sergiu Hart
More informationThe Values of Information and Solution in Stochastic Programming
The Values of Information and Solution in Stochastic Programming John R. Birge The University of Chicago Booth School of Business JRBirge ICSP, Bergamo, July 2013 1 Themes The values of information and
More informationDynamic Appointment Scheduling in Healthcare
Brigham Young University BYU ScholarsArchive All Theses and Dissertations 2011-12-05 Dynamic Appointment Scheduling in Healthcare McKay N. Heasley Brigham Young University - Provo Follow this and additional
More informationMulti-period Portfolio Choice and Bayesian Dynamic Models
Multi-period Portfolio Choice and Bayesian Dynamic Models Petter Kolm and Gordon Ritter Courant Institute, NYU Paper appeared in Risk Magazine, Feb. 25 (2015) issue Working paper version: papers.ssrn.com/sol3/papers.cfm?abstract_id=2472768
More informationMartingales. by D. Cox December 2, 2009
Martingales by D. Cox December 2, 2009 1 Stochastic Processes. Definition 1.1 Let T be an arbitrary index set. A stochastic process indexed by T is a family of random variables (X t : t T) defined on a
More informationPart 3: Trust-region methods for unconstrained optimization. Nick Gould (RAL)
Part 3: Trust-region methods for unconstrained optimization Nick Gould (RAL) minimize x IR n f(x) MSc course on nonlinear optimization UNCONSTRAINED MINIMIZATION minimize x IR n f(x) where the objective
More informationThe Irrevocable Multi-Armed Bandit Problem
The Irrevocable Multi-Armed Bandit Problem Ritesh Madan Qualcomm-Flarion Technologies May 27, 2009 Joint work with Vivek Farias (MIT) 2 Multi-Armed Bandit Problem n arms, where each arm i is a Markov Decision
More informationAsymptotic results discrete time martingales and stochastic algorithms
Asymptotic results discrete time martingales and stochastic algorithms Bernard Bercu Bordeaux University, France IFCAM Summer School Bangalore, India, July 2015 Bernard Bercu Asymptotic results for discrete
More informationMarkov Decision Processes (MDPs) CS 486/686 Introduction to AI University of Waterloo
Markov Decision Processes (MDPs) CS 486/686 Introduction to AI University of Waterloo Outline Sequential Decision Processes Markov chains Highlight Markov property Discounted rewards Value iteration Markov
More informationSYSM 6304: Risk and Decision Analysis Lecture 6: Pricing and Hedging Financial Derivatives
SYSM 6304: Risk and Decision Analysis Lecture 6: Pricing and Hedging Financial Derivatives M. Vidyasagar Cecil & Ida Green Chair The University of Texas at Dallas Email: M.Vidyasagar@utdallas.edu October
More informationOptimizing Portfolios
Optimizing Portfolios An Undergraduate Introduction to Financial Mathematics J. Robert Buchanan 2010 Introduction Investors may wish to adjust the allocation of financial resources including a mixture
More informationarxiv: v1 [math.pr] 6 Apr 2015
Analysis of the Optimal Resource Allocation for a Tandem Queueing System arxiv:1504.01248v1 [math.pr] 6 Apr 2015 Liu Zaiming, Chen Gang, Wu Jinbiao School of Mathematics and Statistics, Central South University,
More information