Dynamic Programming and Stochastic Control


1 Dynamic Programming and Stochastic Control Dr. Alex Leong, Department of Electrical Engineering (EIM-E), Paderborn University, Germany

2 Outline 1 Introduction

3 Introduction What is dynamic programming (DP)? A method for solving multi-stage decision problems (sequential decision making). There is often some randomness in what happens in the future. Optimize a set of decisions to achieve a good overall outcome. Richard Bellman popularized DP in the 1950s.

4 Examples 1) Inventory control A store sells a product, e.g. ice cream. Order supplies once a week. Sales during the week are random. How much supply should the store get to maximize expected profit over the summer? Order too little, can't meet demand. Order too much, storage/refrigeration cost.

5 Examples 2) Parts replacement, e.g. a bus engine. At the start of each month, decide whether the engine on a bus should be replaced, to maximize expected profit. If replace, profit = earnings - replacement cost - maintenance. If don't replace, profit = earnings - maintenance. Earnings will decrease if the engine breaks down. P(breakdown) is age dependent.

6 Examples 3) Formula 1 engines, replace or not? 20 races, 4 engines (in 2017). Decide whether to replace the engine at the start of each race, to maximize the chance of winning the championship.

7 Examples 4) Queueing (see Figure 1) Packets arrive at queues 1 and 2. If both queues transmit at the same time, there is a collision. If there is a collision, retransmit at the next time with a certain probability. Choose the retransmission probabilities to maximize throughput. Figure 1: Queueing

8 Examples 5) LQR (Linear Quadratic Regulator) Linear system: x_{k+1} = A x_k + B u_k (deterministic problem). Assume knowledge of x_k at time k (perfect state info). Choose the sequence of u_k to solve min_{u_0,u_1,...,u_{N-1}} \sum_{k=0}^{N-1} (x_k^T Q x_k + u_k^T R u_k) + x_N^T Q x_N. N = number of stages = horizon. N finite: finite horizon.

9 Examples 6) x_{k+1} = A x_k + B u_k + w_k, where w_k = random noise. Assume x_k known (perfect state info). Choose the sequence of u_k to solve min_{u_0,u_1,...,u_{N-1}} E[ \sum_{k=0}^{N-1} (x_k^T Q x_k + u_k^T R u_k) + x_N^T Q x_N ].

10 Examples 7) LQG (Linear Quadratic Gaussian) Control x_{k+1} = A x_k + B u_k + w_k, y_k = C x_k + v_k, with v_k, w_k Gaussian noise. Case of imperfect state info. Based on the measurements y_k, choose u_k to solve min_{u_0,u_1,...,u_{N-1}} E[ \sum_{k=0}^{N-1} (x_k^T Q x_k + u_k^T R u_k) + x_N^T Q x_N ].

11 Examples 8) Infinite horizon min_{u_0,u_1,...} lim_{N→∞} (1/N) E[ \sum_{k=0}^{N-1} (x_k^T Q x_k + u_k^T R u_k) + x_N^T Q x_N ]. Note: Here we divide by N, otherwise the summation often blows up.

12 Examples 9) Shortest paths (see Figure 2) Find the shortest path from A to stage D (deterministic problem). Can solve using the Viterbi algorithm (1967), which can be regarded as a special case of (forward) DP. Applications: decoding of convolutional codes (communications), channel equalization (communications), estimation of hidden Markov models (signal processing). Figure 2: Shortest paths problem

13 Outline 2 The Dynamic Programming Principle and Dynamic Programming Algorithm Basic Structure of Dynamic Programming Problem Dynamic Programming Principle of Optimality Dynamic Programming Algorithm Shortest Path Problems

14 Basic structure of stochastic DP problem Two ingredients: a discrete-time system and a cost function. 1. Discrete-time system x_{k+1} = f_k(x_k, u_k, w_k), k = 0, 1, ..., N-1 (or k = 1, 2, ..., N). k is the time index. x_k is the state at time k; it summarizes past information that is relevant for future optimization. u_k is the control/decision/action at time k, lying in a set U_k(x_k) which may depend on k and x_k. w_k is a random disturbance (noise), with a probability distribution P(· | k, x_k, u_k) which may depend on k, x_k, u_k.

15 Basic structure of stochastic DP problem x_{k+1} = f_k(x_k, u_k, w_k), k = 0, 1, ..., N-1. N is the horizon, or the number of times control is applied. f_k is the function that describes how the system evolves over time. Examples: f_k = A x_k + B u_k + w_k (linear system); f_k = x_k u_k + w_k (nonlinear); f_k = cos x_k + w_k sin u_k (nonlinear).

16 Basic structure of stochastic DP problem 2. Cost function which is additive over time: E[ \sum_{k=0}^{N-1} g_k(x_k, u_k, w_k) + g_N(x_N) ]. The expectation is used because of the random w_k. g_k is the function that represents the cost at time k. Examples: g_k = x_k + u_k; g_k = x_k^2 + C u_k^2, where C is a constant. g_N(x_N) is the terminal cost.

17 Basic structure of stochastic DP problem Objective: Minimize the cost function over the controls u_0 = µ_0(x_0), u_1 = µ_1(x_1), ..., u_{N-1} = µ_{N-1}(x_{N-1}). The choice of u_k depends on x_k. Optimization over policies: rules/functions µ_k for generating u_k for every possible value of x_k. The expected cost of policy π = (µ_0, µ_1, ..., µ_{N-1}) starting at x_0 is J_π(x_0) = E[ \sum_{k=0}^{N-1} g_k(x_k, µ_k(x_k), w_k) + g_N(x_N) ]. Optimal policy: π* = argmin_π J_π(x_0). Optimal cost starting at x_0: J*(x_0) = min_π J_π(x_0).

18 Examples 1) Inventory example x_k = amount of stock at time k. u_k = stock ordered at time k. w_k = demand at time k, with some probability distribution, e.g. uniform. System: x_{k+1} = x_k + u_k - w_k (= f_k(x_k, u_k, w_k)). x_k can be negative with this model. Alternative model: x_{k+1} = max(0, x_k + u_k - w_k). Cost function at time k: g_k(x_k, u_k, w_k) = r(x_k) + C u_k. r(x_k) is the penalty for holding excess stock. C is the cost per item.

19 Examples 1) Inventory example (cont.) Terminal cost: R(x_N) is the penalty for having excess stock at the end. Cost function: E[ \sum_{k=0}^{N-1} (r(x_k) + C u_k) + R(x_N) ]. The amount u_k to order can depend on the inventory level x_k. Can have constraints on u_k, e.g. x_k + u_k ≤ max. storage. Optimization over policies: Find the rule which tells you how much to order for every possible stock level x_k.

20 Examples 2) Example 6 of the previous section System: x_{k+1} = A x_k + B u_k + w_k (= f_k). Cost function: E[ \sum_{k=0}^{N-1} (x_k^T Q x_k + u_k^T R u_k) + x_N^T Q x_N ], with g_k = x_k^T Q x_k + u_k^T R u_k and g_N(x_N) = x_N^T Q x_N. Objective: Determine u_k = µ_k(x_k), k = 0, 1, ..., N-1, to minimize the cost function. The solution turns out to be u_k* = L_k x_k for some matrices L_k. (Derived in a later lecture.)

21 Examples 3) Shortest paths (see Figure 3) Figure 3: Shortest path problem x_k = which node we're in at stage k. u_k = which path we take to get to stage k+1. w_k = zero (no randomness). Cost function = sum of the values along the paths we choose.

22 Open loop vs. Closed loop Open loop: Controls (u_0, u_1, ..., u_{N-1}) chosen at the beginning (time 0). Closed loop: Policy (µ_0, µ_1, ..., µ_{N-1}) chosen, where at time k, µ_k(x_k) = u_k can depend on x_k. Can adapt to conditions, e.g. the inventory problem: if the current stock level x_k is high, order less; if x_k is low, order more. Closed loop is always at least as good as open loop. For deterministic problems, open loop is as good as closed loop: we can predict exactly the future states given the initial state and the sequence of controls. For stochastic problems, we generally should use closed loop.

23 D.P. Principle of Optimality Intuition Figure 4: Shortest path problem (nodes A to F with arc lengths). Consider the shortest path problem in Figure 4. The shortest path from A to F, shown in red, is A C D F. Shortest path from C to F: C D F, a subpath of the shortest path from A to F. Shortest path from D to F: D F, a subpath of the shortest path from A to F.

24 D.P. Principle of Optimality Observation: The shortest path from A to F contains the shortest paths from intermediate nodes to F. Why? Suppose there is a shorter path from C to F which is not C D F. Then we can construct a new path A C ... F (a new shortest path) which is shorter than A C D F, which contradicts A C D F being the shortest.

25 D.P. Principle of Optimality Formal statement: Basic problem min_π E[ \sum_{k=0}^{N-1} g_k(x_k, µ_k(x_k), w_k) + g_N(x_N) ]. Let π* = {µ_0*, µ_1*, ..., µ_{N-1}*} be the optimal policy. Consider the tail subproblem min_{µ_i, µ_{i+1}, ..., µ_{N-1}} E[ \sum_{k=i}^{N-1} g_k(x_k, µ_k(x_k), w_k) + g_N(x_N) ], where we are at state x_i at time i and we wish to minimize the cost to go from time i to time N. The D.P. principle of optimality then says that {µ_i*, µ_{i+1}*, ..., µ_{N-1}*} is optimal for the tail subproblem.

26 D.P. Principle of Optimality Proof: If {µ̃_i, ..., µ̃_{N-1}} were a better policy for the tail subproblem, then {µ_0*, µ_1*, ..., µ_{i-1}*, µ̃_i, ..., µ̃_{N-1}} would be a better policy for the original problem, contradicting {µ_0*, µ_1*, ..., µ_{N-1}*} being optimal. How can we make use of the D.P. principle? Idea: Construct an optimal policy in stages. Solve the tail subproblem involving the last stage, to obtain µ_{N-1}*. Solve the tail subproblem involving the last two stages, making use of µ_{N-1}*, to obtain µ_{N-2}*. Solve the tail subproblem involving the last three stages, making use of µ_{N-2}*, µ_{N-1}*, to obtain µ_{N-3}*. ... Solve the tail subproblem involving the last N stages, making use of µ_1*, ..., µ_{N-1}*, to obtain µ_0*.

27 D.P. Algorithm Basic problem: min_π E[ \sum_{k=0}^{N-1} g_k(x_k, µ_k(x_k), w_k) + g_N(x_N) ]. D.P. algorithm: For each possible x_k, compute J_N(x_N) = g_N(x_N), J_k(x_k) = min_{u_k ∈ U_k(x_k)} E{ g_k(x_k, u_k, w_k) + J_{k+1}(f_k(x_k, u_k, w_k)) }, for k = N-1, N-2, ..., 1, 0. Theorem: 1. Optimal cost J*(x_0) = J_0(x_0), where J_0(x_0) is the quantity computed by the D.P. algorithm. 2. Let µ_k*(·) be the function that generates the minimizing u_k in the D.P. algorithm, i.e. µ_k*(x_k) = u_k*. Then {µ_0*, µ_1*, ..., µ_{N-1}*} is the optimal policy for the basic problem. Proof: See later.
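To make the backward recursion concrete, here is a minimal Python sketch of the D.P. algorithm for finite state, control and disturbance spaces. The arguments (states, U, f, g, gN, w_dist) are placeholders supplied by the user, not quantities defined in the lecture.

# Minimal sketch of the backward D.P. algorithm for finite spaces.
# All problem data (states, U, f, g, gN, w_dist) are assumptions supplied by the user.
def dp_backward(states, U, f, g, gN, w_dist, N):
    """Return value functions J[k][x] and policies mu[k][x], k = 0..N."""
    J = [dict() for _ in range(N + 1)]
    mu = [dict() for _ in range(N)]
    for x in states:
        J[N][x] = gN(x)                      # terminal cost
    for k in range(N - 1, -1, -1):           # k = N-1, ..., 0
        for x in states:
            best_u, best_val = None, float("inf")
            for u in U(k, x):
                # expected stage cost plus cost-to-go
                val = sum(p * (g(k, x, u, w) + J[k + 1][f(k, x, u, w)])
                          for w, p in w_dist(k, x, u))
                if val < best_val:
                    best_u, best_val = u, val
            J[k][x], mu[k][x] = best_val, best_u
    return J, mu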

28 D.P. Algorithm Comments: The D.P. algorithm needs to be run for all possible states x_k. It solves all tail subproblems (you don't know at the start which subproblem you will need). Can be computationally expensive if the number of states/controls is large. Often done on a computer. Suboptimal methods can reduce complexity.

29 Inventory Example x_k = level of stock at time k. u_k = amount ordered at time k. w_k = demand at time k. x_{k+1} = max(0, x_k + u_k - w_k) = f_k(x_k, u_k, w_k); excess demand is lost. Storage constraint: x_k + u_k ≤ 2. Cost at time k = purchasing cost (cost per item = 1 euro) + storage cost (x_k + u_k - w_k)^2, i.e. g_k(x_k, u_k, w_k) = u_k + (x_k + u_k - w_k)^2. Terminal cost g_N(x_N) = 0. Probability distribution of w_k: P(w_k = 0) = 0.1, P(w_k = 1) = 0.7, P(w_k = 2) = 0.2.

30 Inventory Example Problem: Find the optimal policy for horizon N = 3, i.e. min_{(µ_0, µ_1, µ_2)} E[ \sum_{k=0}^{2} g_k(x_k, µ_k(x_k), w_k) ]. Apply the D.P. algorithm: J_3(x_3) = g_3(x_3) = 0, J_k(x_k) = min_{u_k ∈ U_k(x_k)} E{ u_k + (x_k + u_k - w_k)^2 + J_{k+1}(max(0, x_k + u_k - w_k)) }, k = 2, 1, 0. Question: What values can x_k take?

31 Inventory Example Period 2: Compute J_2(x_2) for all possible values of x_2. J_2(0) = min_{u_2 ∈ {0,1,2}} E{ u_2 + (0 + u_2 - w_2)^2 + J_3(x_3) } (with J_3(x_3) = 0 for all x_3) = min_{u_2 ∈ {0,1,2}} u_2 + E{(u_2 - w_2)^2} = min_{u_2 ∈ {0,1,2}} u_2 + (u_2 - 0)^2 · 0.1 + (u_2 - 1)^2 · 0.7 + (u_2 - 2)^2 · 0.2. If u_2 = 0: 0 + 0 · 0.1 + 1 · 0.7 + 4 · 0.2 = 1.5. If u_2 = 1: 1 + 0.1 + 0 + 0.2 = 1.3. If u_2 = 2: 2 + 0.4 + 0.7 + 0 = 3.1. So J_2(0) = 1.3 and µ_2*(0) = 1.

32 Inventory Example J_2(1) = min_{u_2 ∈ {0,1}} u_2 + (1 + u_2)^2 · 0.1 + (1 + u_2 - 1)^2 · 0.7 + (1 + u_2 - 2)^2 · 0.2. If u_2 = 0: 0.3 (check this!). If u_2 = 1: 2.1. So J_2(1) = 0.3 and µ_2*(1) = 0. J_2(2) = min_{u_2 ∈ {0}} E{ u_2 + (2 + u_2 - w_2)^2 } = 4 · 0.1 + 1 · 0.7 + 0 · 0.2 = 1.1. So J_2(2) = 1.1 and µ_2*(2) = 0.

33 Inventory Example Period 1: Compute J_1(x_1) for all possible values of x_1. J_1(0) = min_{u_1 ∈ {0,1,2}} E{ u_1 + (u_1 - w_1)^2 + J_2(max(0, 0 + u_1 - w_1)) } = min_{u_1 ∈ {0,1,2}} u_1 + (u_1^2 + J_2(max(0, u_1))) · 0.1 + ((u_1 - 1)^2 + J_2(max(0, u_1 - 1))) · 0.7 + ((u_1 - 2)^2 + J_2(max(0, u_1 - 2))) · 0.2. u_1 = 0: (0 + J_2(0)) · 0.1 + (1 + J_2(0)) · 0.7 + (4 + J_2(0)) · 0.2 = 2.8 (J_2 from the previous stage). u_1 = 1: 1 + (1 + J_2(1)) · 0.1 + (0 + J_2(0)) · 0.7 + (1 + J_2(0)) · 0.2 = 2.5. u_1 = 2: 2 + (4 + J_2(2)) · 0.1 + (1 + J_2(1)) · 0.7 + (0 + J_2(0)) · 0.2 = 3.68. So J_1(0) = 2.5 and µ_1*(0) = 1.

34 Inventory Example J_1(1) = min_{u_1 ∈ {0,1}} E{ u_1 + (1 + u_1 - w_1)^2 + J_2(max(0, 1 + u_1 - w_1)) }. u_1 = 0: 1.5 (check!). u_1 = 1: 2.68. So J_1(1) = 1.5 and µ_1*(1) = 0. J_1(2) = 1.68, µ_1*(2) = 0 (check!). Period 0: Compute J_0(x_0) for all possible x_0 (tutorial problem). Solution: J_0(0) = 3.7, J_0(1) = 2.7, J_0(2) = 2.818, µ_0*(0) = 1, µ_0*(1) = 0, µ_0*(2) = 0.
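The inventory example can also be checked numerically with a short sketch of the same recursion (function and variable names are illustrative); running it reproduces the hand-computed values of J_k and the policies µ_k*.

# Sketch: numerical check of the inventory example (N = 3, storage limit 2).
def inventory_dp(N=3, max_storage=2, w_dist=((0, 0.1), (1, 0.7), (2, 0.2))):
    states = range(max_storage + 1)
    J = {N: {x: 0.0 for x in states}}             # terminal cost g_N = 0
    mu = {}
    for k in range(N - 1, -1, -1):
        J[k], mu[k] = {}, {}
        for x in states:
            best_u, best_val = None, float("inf")
            for u in range(max_storage - x + 1):  # enforce x + u <= 2
                val = sum(p * (u + (x + u - w) ** 2 + J[k + 1][max(0, x + u - w)])
                          for w, p in w_dist)
                if val < best_val:
                    best_u, best_val = u, val
            J[k][x], mu[k][x] = best_val, best_u
    return J, mu

J, mu = inventory_dp()
print(J[0], mu[0])   # approx J_0 = {0: 3.7, 1: 2.7, 2: 2.818}, mu_0 = {0: 1, 1: 0, 2: 0}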

35 Scheduling Example Example: Scheduling problem (deterministic problem) Four operations need to be performed: A, B, C, D. B has to occur after A, D has to occur after C. Costs: c_AB = 2, c_AC = 3, c_AD = 4, c_BC = 3, c_BD = 1, c_CA = 4, c_CB = 4, c_CD = 6, c_DA = 3, c_DB = 3. Startup costs: S_A = 5, S_C = 3. What is the optimal order?

36 Scheduling Example Figure: Scheduling (tree of states, i.e. sets of operations performed so far, from A or C up to complete sequences such as ABCD, ACBD, ACDB, CABD, CADB, CDAB, with the minimum cost to go at each state shown in red).

37 Scheduling Example Use the D.P. algorithm. Let the state = set of operations already performed, see Figure Scheduling. No terminal costs for this problem. Tail subproblems of length 1: easy, only one choice at each state, e.g. if the state is ACD, the next operation has to be B. Tail subproblems of length 2: State AB, only one choice, next operation is C. State AC, if the next operation is B: cost = 4 + 1 = 5. State AC, if the next operation is D: cost = 6 + 3 = 9. Choose B. State CA, if the next operation is B: cost = 2 + 1 = 3. State CA, if the next operation is D: cost = 4 + 3 = 7. Choose B. State CD, only one choice, next operation is A.

38 Scheduling Example Tail subproblems of length 3: State A, if the next operation is B: cost = 2 + 9 = 11. State A, if the next operation is C: cost = 3 + 5 = 8. Choose C. State C, if the next operation is A: cost = 4 + 3 = 7. State C, if the next operation is D: cost = 6 + 5 = 11. Choose A. Original problem of length 4: If we start with A: cost = 5 + 8 = 13. If we start with C: cost = 3 + 7 = 10. Choose C. Therefore, the optimal sequence = CABD, and the optimal cost = 10.
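Since this problem is tiny, the D.P. result can also be checked by brute force over all orderings that respect the precedence constraints. The sketch below uses the costs from the slide; the helper names are illustrative.

# Sketch: brute-force check of the scheduling example.
from itertools import permutations

trans = {("A","B"): 2, ("A","C"): 3, ("A","D"): 4, ("B","C"): 3, ("B","D"): 1,
         ("C","A"): 4, ("C","B"): 4, ("C","D"): 6, ("D","A"): 3, ("D","B"): 3}
startup = {"A": 5, "C": 3}

def cost(seq):
    c = startup[seq[0]]
    for a, b in zip(seq, seq[1:]):
        c += trans[(a, b)]
    return c

valid = [p for p in permutations("ABCD")
         if p.index("A") < p.index("B")      # B after A
         and p.index("C") < p.index("D")]    # D after C
best = min(valid, key=cost)
print("".join(best), cost(best))             # expected: CABD 10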

39 Proof that D.P. Algorithm gives Optimal Solution Basic problem: min_π E[ \sum_{k=0}^{N-1} g_k(x_k, µ_k(x_k), w_k) + g_N(x_N) ]. D.P. algorithm: For each possible x_k, compute J_N(x_N) = g_N(x_N), J_k(x_k) = min_{u_k ∈ U_k(x_k)} E{ g_k(x_k, u_k, w_k) + J_{k+1}(f_k(x_k, u_k, w_k)) }, for k = N-1, N-2, ..., 1, 0. Theorem: 1. Optimal cost J*(x_0) = J_0(x_0), where J_0(x_0) is the quantity computed by the D.P. algorithm. 2. Let µ_k*(·) be the function that generates the minimizing u_k in the D.P. algorithm, i.e. µ_k*(x_k) = u_k*. Then {µ_0*, µ_1*, ..., µ_{N-1}*} is the optimal policy for the basic problem.

40 Proof that D.P. Algorithm gives Optimal Solution Notation: Given a policy π = (µ_0, µ_1, ..., µ_{N-1}), let π_k = (µ_k, µ_{k+1}, ..., µ_{N-1}) = tail policy, and let J_k*(x_k) = min_{π_k} E{ \sum_{i=k}^{N-1} g_i(x_i, µ_i(x_i), w_i) + g_N(x_N) } be the optimal cost for the tail subproblem. Let J_k(x_k) = quantity computed by the D.P. algorithm. Want to show that J_k*(x_k) = J_k(x_k), for all x_k, k. Proof is by mathematical induction. Initial step (k = N): By definition of J_k*(x_k), J_N*(x_N) = g_N(x_N). By definition of the D.P. algorithm, J_N(x_N) = g_N(x_N). Hence J_N*(x_N) = J_N(x_N).

41 Proof that D.P. Algorithm gives Optimal Solution Induction step: Assume J_l*(x_l) = J_l(x_l) for l = N, N-1, ..., k+1. Want to show that J_k*(x_k) = J_k(x_k). From the definition of J_k*(x_k), J_k*(x_k) = min_{π_k} E{ \sum_{i=k}^{N-1} g_i(x_i, µ_i(x_i), w_i) + g_N(x_N) } = min_{(µ_k, π_{k+1})} E{ g_k(x_k, µ_k(x_k), w_k) + \sum_{i=k+1}^{N-1} g_i(x_i, µ_i(x_i), w_i) + g_N(x_N) } = min_{µ_k} E{ g_k(x_k, µ_k(x_k), w_k) + min_{π_{k+1}} E[ \sum_{i=k+1}^{N-1} g_i(x_i, µ_i(x_i), w_i) + g_N(x_N) ] } by the D.P. principle (optimize the tail subproblem, then µ_k).

42 Proof that D.P. Algorithm gives Optimal Solution = min_{µ_k} E{ g_k(x_k, µ_k(x_k), w_k) + J_{k+1}*(f_k(x_k, µ_k(x_k), w_k)) } by definition of J_{k+1}*(x_{k+1}) = min_{µ_k} E{ g_k(x_k, µ_k(x_k), w_k) + J_{k+1}(f_k(x_k, µ_k(x_k), w_k)) } by the induction hypothesis = min_{u_k ∈ U_k(x_k)} E{ g_k(x_k, u_k, w_k) + J_{k+1}(f_k(x_k, u_k, w_k)) } using the fact that min_µ F(x, µ(x)) = min_{u ∈ U(x)} F(x, u) = J_k(x_k) from the D.P. algorithm equations. So J_k*(x_k) = J_k(x_k), and µ_k*(x_k) = u_k* is the optimal policy. By induction, this is true for k = N, N-1, ..., 1, 0. In particular, J*(x_0) = J_0*(x_0) = J_0(x_0) is the optimal cost.

43 Shortest Paths in a Trellis Figure 6: Shortest paths in a trellis (initial state s, artificial terminal state t, stages 0 to N). Find the shortest path from a node in Stage 1 to a node in Stage N. States correspond to nodes, controls to arcs. a_{ij}^k: cost of transition from state i at stage k to state j at stage k+1. a_{it}^N: terminal cost of state i. Cost function = length of the path from s to t.

44 Shortest Paths in a Trellis D.P. algorithm: J_N(i) = a_{it}^N, J_k(i) = min_j [ a_{ij}^k + J_{k+1}(j) ], k = N-1, ..., 1, 0. Optimal cost = J_0(s) = length of the shortest path from s to t. Example: Find the shortest path from stage 1 to stage 3 in Figure 7 (shortest path shown in red). Figure 7: Shortest paths example

45 Shortest Paths in a Trellis Redraw as a trellis with an initial node s and a terminal node t, see Figure 8. Figure 8: Redrawn shortest paths example. Here N = 3. Call the top node state 1 and the bottom node state 2. Stage N: J_3(1) = 0, J_3(2) = 0.

46 Shortest Paths in a Trellis Stage 2: J_2(1) = min{ a_{11}^2 + J_3(1), a_{12}^2 + J_3(2) } = 100, J_2(2) = min{ a_{21}^2 + J_3(1), a_{22}^2 + J_3(2) } = 350. Stage 1: J_1(1) = min{ a_{11}^1 + J_2(1), a_{12}^1 + J_2(2) } = 400, J_1(2) = min{ a_{21}^1 + J_2(1), a_{22}^1 + J_2(2) } = 250. Stage 0: J_0(s) = min{ 0 + J_1(1), 0 + J_1(2) } = 250. The shortest path for the original problem is shown in red in Figure 7.

47 Forward D.P. Algorithm Observe that the optimal path s → t is also the optimal path t → s if the directions of the arcs are reversed. The shortest path algorithm can therefore be run forwards in time (see Bertsekas for the equations). Figure 9 shows the result of forward D.P. on the shortest paths example. Forward D.P. is useful in real-time applications, where data arrives just before you need to make a decision. The Viterbi algorithm uses this idea. Shortest paths is a deterministic problem, so forward D.P. works. For stochastic problems, there is no such concept of forward D.P.: it is impossible to guarantee that any given state can be reached.

48 Forward D.P. Algorithm Figure 9: Forward D.P. on shortest paths example

49 Viterbi Algorithm Applications Estimation of hidden Markov models (HMMs): x_k = Markov chain state; transitions in x_k are not observed (hidden). We observe z_k, with r(z, i, j) = probability of observing z given a transition of the Markov chain x_k from state i to j. Estimation problem: Given Z_N = {z_1, z_2, ..., z_N}, find a sequence X̂_N = {x̂_0, x̂_1, ..., x̂_N} over all possible {x_0, x_1, ..., x_N} that maximizes P(X_N | Z_N). Note that P(X_N | Z_N) = P(X_N, Z_N) / P(Z_N), and P(Z_N) is constant given Z_N. So max_{x_0,...,x_N} P(X_N | Z_N) ⇔ max_{x_0,...,x_N} P(X_N, Z_N) ⇔ max_{x_0,...,x_N} ln P(X_N, Z_N).

50 Viterbi Algorithm Applications After some calculations (see Bertsekas), one can show that the problem is equivalent to min_{x_0,...,x_N} [ -ln(π_{x_0}) - \sum_{k=1}^{N} ln(π_{x_{k-1} x_k} r(z_k, x_{k-1}, x_k)) ], where π_{x_0} = probability of the initial state, π_{x_{k-1} x_k} = transition probabilities of the Markov chain, and -ln π_{x_0} and -ln(π_{x_{k-1} x_k} r(z_k, x_{k-1}, x_k)) can be regarded as arc lengths of the different stages, giving a shortest path problem through a trellis. Other applications: decoding of convolutional codes; channel equalization in the presence of ISI (inter-symbol interference).
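A minimal sketch of the resulting shortest-path recursion (the Viterbi algorithm) is given below. The initial distribution pi0, transition matrix P and observation probabilities r(z, i, j) are assumed model inputs with strictly positive probabilities; the code simply minimizes the negative log-probability arc lengths described above.

# Sketch: Viterbi as a shortest path through a trellis (negative log-probability arc lengths).
import math

def viterbi(z_seq, states, pi0, P, r):
    """Return the state sequence maximizing P(X | Z) for the given HMM data."""
    D = {i: -math.log(pi0[i]) for i in states}     # cost to arrive at stage 0
    back = []
    for z in z_seq:
        D_new, bp = {}, {}
        for j in states:
            # arc length from i to j: -ln( P[i][j] * r(z, i, j) )
            costs = {i: D[i] - math.log(P[i][j] * r(z, i, j)) for i in states}
            best_i = min(costs, key=costs.get)
            D_new[j], bp[j] = costs[best_i], best_i
        D, back = D_new, back + [bp]
    x = [min(D, key=D.get)]                        # backtrack the shortest path
    for bp in reversed(back):
        x.append(bp[x[-1]])
    return list(reversed(x))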

51 General Shortest Path Problems No trellis structure, e.g. find the shortest path from each node to node 5 in Figure 10. Figure 10: General shortest path problem. Graph with N + 1 nodes {1, 2, ..., N, t}, with a_{ij} = cost of moving from node i to node j. Find the shortest path from each node i to node t.

52 General Shortest Path Problems Assume some a_{ij}'s can be negative, but all cycles have non-negative length. Then the shortest path will not involve more than N arcs. Reformulate as a trellis-type shortest path problem with N arcs, by allowing arcs from node i to itself with cost a_{ii} = 0. D.P. algorithm: J_{N-1}(i) = a_{it}, J_k(i) = min_j { a_{ij} + J_{k+1}(j) }, k = N-2, ..., 1, 0. This algorithm is essentially the Bellman-Ford algorithm. Other algorithms have also been invented, e.g. Dijkstra's algorithm, which can be used when all a_{ij}'s are positive.
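A minimal sketch of this Bellman-Ford-type recursion, with the node set and arc costs a_ij supplied by the user; missing arcs are treated as infinite cost, and the self-loop trick a_ii = 0 is implemented by keeping the previous value in the minimization.

# Sketch: Bellman-Ford-type D.P. for shortest paths from every node i to a terminal node t.
INF = float("inf")

def shortest_paths_to_t(nodes, t, a):
    """nodes includes t; a[i] maps neighbour j to arc cost a_ij (missing key = no arc)."""
    J = {i: 0.0 if i == t else a[i].get(t, INF) for i in nodes}        # J_{N-1}(i) = a_it
    for _ in range(len(nodes) - 1):                                    # shortest paths use <= N arcs
        J = {i: 0.0 if i == t else
                min(J[i], min(a[i].get(j, INF) + J[j] for j in nodes)) # keep J[i]: self-loop a_ii = 0
             for i in nodes}
    return J

# Hypothetical usage with made-up arc costs:
print(shortest_paths_to_t({1, 2, 3, "t"}, "t", {1: {2: 1, "t": 4}, 2: {"t": 1}, 3: {1: 2}}))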

53 Outline 3 Problems with Perfect State Information Linear Quadratic Control Optimal Stopping Problems

54 Problems with Perfect State Information Will study some problems where analytical solutions can be obtained: Linear quadratic control Optimal stopping problems + others in Chapter 4 of Bertsekas

55 Linear Quadratic Control (Linear) System: x_{k+1} = A x_k + B u_k + w_k, k = 0, 1, ..., N-1. (Quadratic) Cost function: E{ \sum_{k=0}^{N-1} (x_k^T Q x_k + u_k^T R u_k) + x_N^T Q x_N }. Problem: Determine the optimal policy to minimize the cost function. x_k, u_k, w_k are column vectors; A, B, Q, R are matrices. The w_k are independent and zero mean. Q is positive semi-definite. R is positive definite.

56 Linear Quadratic Control Definition: A symmetric matrix M is positive semi-definite if x^T M x ≥ 0 for all vectors x; M is positive definite if x^T M x > 0 for all x ≠ 0. One characterization: M is positive semi-definite ⇔ all eigenvalues of M are ≥ 0; M is positive definite ⇔ all eigenvalues of M are > 0. The D.P. algorithm applied to this problem gives: J_N(x_N) = x_N^T Q x_N, J_k(x_k) = min_{u_k} E{ x_k^T Q x_k + u_k^T R u_k + J_{k+1}(A x_k + B u_k + w_k) }, k = N-1, ..., 1, 0.

57 Linear Quadratic Control It turns out that the minimization can be done analytically. J_{N-1}(x_{N-1}) = min_{u_{N-1}} E{ x_{N-1}^T Q x_{N-1} + u_{N-1}^T R u_{N-1} + (A x_{N-1} + B u_{N-1} + w_{N-1})^T Q (A x_{N-1} + B u_{N-1} + w_{N-1}) } = min_{u_{N-1}} E{ x_{N-1}^T Q x_{N-1} + u_{N-1}^T R u_{N-1} + x_{N-1}^T A^T Q A x_{N-1} + x_{N-1}^T A^T Q B u_{N-1} + x_{N-1}^T A^T Q w_{N-1} + u_{N-1}^T B^T Q A x_{N-1} + u_{N-1}^T B^T Q B u_{N-1} + u_{N-1}^T B^T Q w_{N-1} + w_{N-1}^T Q A x_{N-1} + w_{N-1}^T Q B u_{N-1} + w_{N-1}^T Q w_{N-1} } = x_{N-1}^T (A^T Q A + Q) x_{N-1} + E{ w_{N-1}^T Q w_{N-1} } + min_{u_{N-1}} { u_{N-1}^T (R + B^T Q B) u_{N-1} + 2 x_{N-1}^T A^T Q B u_{N-1} }.

58 Linear Quadratic Control Digression. Problem: min_x f(x). How to solve? For unconstrained scalar problems, we can differentiate and set the derivative equal to 0, e.g. min_x (x - 2)^2: d/dx (x - 2)^2 = 2(x - 2) = 0 ⇒ x = 2. Similarly, differentiate u_{N-1}^T (R + B^T Q B) u_{N-1} + 2 x_{N-1}^T A^T Q B u_{N-1} with respect to the vector u_{N-1} and set it equal to zero. Note that ∂(u^T A u)/∂u = 2 A u and ∂(a^T u)/∂u = a, where a and u are column vectors and A is a symmetric matrix. Using the above formulas, obtain 2(R + B^T Q B) u_{N-1} + 2 B^T Q A x_{N-1} = 0 ⇒ u_{N-1}* = -(R + B^T Q B)^{-1} B^T Q A x_{N-1}.

59 Linear Quadratic Control Substituting u_{N-1}* = -(R + B^T Q B)^{-1} B^T Q A x_{N-1} back into the expression for J_{N-1}(x_{N-1}), we obtain J_{N-1}(x_{N-1}) = x_{N-1}^T (A^T Q A + Q) x_{N-1} + E{ w_{N-1}^T Q w_{N-1} } + x_{N-1}^T A^T Q B (R + B^T Q B)^{-1} (R + B^T Q B) (R + B^T Q B)^{-1} B^T Q A x_{N-1} - 2 x_{N-1}^T A^T Q B (R + B^T Q B)^{-1} B^T Q A x_{N-1} = x_{N-1}^T (A^T Q A + Q) x_{N-1} - x_{N-1}^T A^T Q B (R + B^T Q B)^{-1} B^T Q A x_{N-1} + E{ w_{N-1}^T Q w_{N-1} } = x_{N-1}^T (A^T Q A + Q - A^T Q B (R + B^T Q B)^{-1} B^T Q A) x_{N-1} + E{ w_{N-1}^T Q w_{N-1} } = x_{N-1}^T K_{N-1} x_{N-1} + E{ w_{N-1}^T Q w_{N-1} }, with K_{N-1} = A^T Q A + Q - A^T Q B (R + B^T Q B)^{-1} B^T Q A.

60 Linear Quadratic Control Continuing on, one can show that u_{N-2}* = -(B^T K_{N-1} B + R)^{-1} B^T K_{N-1} A x_{N-2}, and more generally (tutorial problem) that µ_k*(x_k) = -(B^T K_{k+1} B + R)^{-1} B^T K_{k+1} A x_k, where K_N = Q, K_k = A^T K_{k+1} A + Q - A^T K_{k+1} B (B^T K_{k+1} B + R)^{-1} B^T K_{k+1} A.
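The recursion for K_k and the resulting gains L_k is straightforward to implement; below is a small NumPy sketch (the matrices A, B, Q, R and the horizon N are whatever the user supplies).

# Sketch: finite-horizon Riccati recursion for the LQ gains, using NumPy.
import numpy as np

def lq_gains(A, B, Q, R, N):
    """Return gains L[0..N-1] with u_k = L[k] @ x_k, and the matrices K[0..N]."""
    K = [None] * (N + 1)
    L = [None] * N
    K[N] = Q                                            # K_N = Q
    for k in range(N - 1, -1, -1):
        S = B.T @ K[k + 1] @ B + R
        L[k] = -np.linalg.solve(S, B.T @ K[k + 1] @ A)  # L_k = -(B'K B + R)^{-1} B'K A
        K[k] = A.T @ K[k + 1] @ A + Q + A.T @ K[k + 1] @ B @ L[k]
    return L, K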

61 Certainty Equivalence Certainty equivalence: The optimal policy is the same as the one obtained by solving the problem for the deterministic system x_{k+1} = A x_k + B u_k + E[w_k], where w_k is replaced by its expected value E[w_k] = 0, i.e. the standard LQR problem.

62 Asymptotic Behaviour Definition: A pair of matrices (A, B), where A is n × n and B is n × m, is controllable if the n × nm matrix [ B  AB  A^2 B  ...  A^{n-1} B ] has full rank (all rows linearly independent). A pair (A, C), where A is n × n and C is m × n, is observable if (A^T, C^T) is controllable.

63 Asymptotic Behaviour Theorem: If (A, B) is controllable and Q can be written as Q = C^T C, where (A, C) is observable, then: 1. K_k → K as k → ∞, with K satisfying the algebraic Riccati equation K = A^T K A + Q - A^T K B (B^T K B + R)^{-1} B^T K A. 2. The steady-state controller µ*(x_k) = L x_k, where L = -(B^T K B + R)^{-1} B^T K A, stabilizes the system, i.e. the eigenvalues of A + BL have magnitude < 1. Proof: See Bertsekas. Note: If u_k = L x_k, then x_{k+1} = A x_k + B u_k + w_k = (A + BL) x_k + w_k, and x_k stays bounded when the eigenvalues of A + BL have magnitude < 1.
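A quick numerical illustration of this theorem iterates the Riccati recursion to an (approximate) fixed point and checks the closed-loop eigenvalues; the matrices below are arbitrary illustrative values, not taken from the lecture.

# Sketch: iterate the Riccati recursion to a fixed point and check that A + B L is stable.
import numpy as np

A = np.array([[1.1, 0.3], [0.0, 0.9]])    # illustrative (assumed) system matrices
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])

K = Q
for _ in range(500):                       # K_k -> K
    S = B.T @ K @ B + R
    K = A.T @ K @ A + Q - A.T @ K @ B @ np.linalg.solve(S, B.T @ K @ A)

L = -np.linalg.solve(B.T @ K @ B + R, B.T @ K @ A)
print(np.abs(np.linalg.eigvals(A + B @ L)))   # all magnitudes should be < 1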

64 Other Variations x_{k+1} = A_k x_k + B_k u_k + w_k, with A_k, B_k random, unknown, independent. Optimal policy: µ_k*(x_k) = -(R + E{B_k^T K_{k+1} B_k})^{-1} E{B_k^T K_{k+1} A_k} x_k, where K_N = Q, K_k = E{A_k^T K_{k+1} A_k} + Q - E{A_k^T K_{k+1} B_k}(E{B_k^T K_{k+1} B_k} + R)^{-1} E{B_k^T K_{k+1} A_k}. May not have certainty equivalence; may not have a steady-state solution. x_{k+1} = A x_k + B_k u_k + w_k, where B_k is random, independent, and is only revealed to us at time k. Motivation: wireless channels. Similar to Leong, Dey, Anand, "Optimal LQG control over continuous fading channels," Proc. IFAC World Congress.

65 Optimal Stopping Problems At each state, there is a stop control that stops the system, i.e. moves to and stays in a stop state. Pure stopping problem: the only other control is continue. For pure stopping problems, the policy is characterized by a partition of the set of states into a stop region and a continue region, which may depend on time.

66 Example (Asset selling) A person has an asset for sale, e.g. a house. At each time k = 0, 1, ..., N-1, the person receives a random offer w_k for the asset. Assume the w_k's are independent. Either accept w_k at time k+1 and invest the money at interest rate r, or reject w_k and wait for offer w_{k+1}. Must accept the last offer w_{N-1} at time N if every previous offer was rejected. Find the policy that maximizes the (expected) revenue at the N-th period.

67 Example (Asset selling) States: If x_k = T: asset already sold (= stop state). If x_k = w_{k-1}: offer currently under consideration. Controls: {accept, reject}. The system evolves as x_{k+1} = f_k(x_k, w_k, u_k) = T if 1) x_k = T, or 2) x_k ≠ T and u_k = accept; and x_{k+1} = w_k otherwise.

68 Example (Asset selling) Rewards at time k: g_N(x_N) = x_N if x_N ≠ T, and 0 otherwise. g_k(x_k, u_k, w_k) = (1 + r)^{N-k} x_k if x_k ≠ T and u_k = accept, and 0 otherwise. (For compound interest over n years, final amount = (1 + r)^n × initial amount.) Note: From the way the rewards are defined, g_k is non-zero for only one k ∈ {0, 1, ..., N-1}.

69 Example (Asset selling) Expected total reward = E[ \sum_{k=0}^{N-1} g_k(x_k, u_k, w_k) + g_N(x_N) ]. D.P. algorithm (for reward maximization): J_N(x_N) = g_N(x_N) = x_N if x_N ≠ T, and 0 otherwise; J_k(x_k) = max_{u_k} E[ g_k(x_k, u_k, w_k) + J_{k+1}(x_{k+1}) ].

70 Example (Asset selling) If x_k = T, then g_k(x_k, u_k, w_k) = 0 and J_{k+1}(x_{k+1}) = 0, by the property of g_k being non-zero for only one k and the reward having been incurred prior to time k. If x_k ≠ T, then E[g_k(x_k, u_k, w_k) + J_{k+1}(x_{k+1})] = (1 + r)^{N-k} x_k if u_k = accept, and 0 + E[J_{k+1}(w_k)] if u_k = reject. So J_k(x_k) = max_{u_k} E[g_k(x_k, u_k, w_k) + J_{k+1}(x_{k+1})] = max( (1 + r)^{N-k} x_k, E[J_{k+1}(w_k)] ) if x_k ≠ T, and 0 if x_k = T, and the optimal policy is of the form: u_k* = accept if (1 + r)^{N-k} x_k > E[J_{k+1}(w_k)], i.e. u_k* = accept if x_k > E[J_{k+1}(w_k)] / (1 + r)^{N-k}, and reject otherwise.

71 Example (Asset selling) Let α_k = E[J_{k+1}(w_k)] / (1 + r)^{N-k}. Can show (see Bertsekas) that α_k ≥ α_{k+1} for all k if the w_k are i.i.d. Intuition: an offer acceptable at time k should also be acceptable at time k+1. See Figure 11. Figure 11: Asset selling (the thresholds α_1, α_2, ..., α_{N-1} decrease over time k; accept above the threshold curve, reject below).
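For a discrete offer distribution, the thresholds α_k can be computed with a short backward recursion; the offer values, probabilities and interest rate below are illustrative assumptions, not values from the lecture.

# Sketch: compute the accept/reject thresholds alpha_k for the asset selling example.
def asset_selling_thresholds(offers, probs, r, N):
    """Offer value offers[i] occurs with probability probs[i]; returns alpha[0..N-1]."""
    # J_N(x) = x, and J_k(x) = max((1+r)**(N-k) * x, E[J_{k+1}(w)]) for an unsold asset.
    alpha = [0.0] * N
    EJ = sum(p * w for w, p in zip(offers, probs))        # E[J_N(w)]
    for k in range(N - 1, -1, -1):
        alpha[k] = EJ / (1 + r) ** (N - k)                # threshold at time k
        EJ = sum(p * max((1 + r) ** (N - k) * w, EJ)      # E[J_k(w)] for the previous stage
                 for w, p in zip(offers, probs))
    return alpha

print(asset_selling_thresholds([1.0, 2.0, 3.0], [0.3, 0.4, 0.3], 0.05, 5))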

72 Example (Asset selling) Can also show that if the w_k are i.i.d. and N → ∞, then the optimal policy converges to the stationary policy: u_k* = accept if x_k > ᾱ, and reject if x_k ≤ ᾱ, where ᾱ is a constant.

73 General Stopping Problems Pure stopping problem: stop or continue are the only possible controls. General stopping problem: stop, or choose a control u_k from U(x_k) (where U has more than one element). Consider the time-invariant case: f(x_k, u_k, w_k), g(x_k, u_k, w_k) don't depend on k, and w_k is i.i.d. Stop at time k with cost t(x_k). Must stop by the last stage. D.P. algorithm: J_N(x_N) = t(x_N), J_k(x_k) = min[ t(x_k), min_{u_k ∈ U(x_k)} E{ g(x_k, u_k, w_k) + J_{k+1}(f(x_k, u_k, w_k)) } ]. Optimal to stop when t(x_k) ≤ min_{u_k ∈ U(x_k)} E{ g(x_k, u_k, w_k) + J_{k+1}(f(x_k, u_k, w_k)) }.

74 General Stopping Problems The stopping set at time k (the set of states where you stop) is defined as T_k = { x | t(x) ≤ min_{u ∈ U(x)} E[ g(x, u, w) + J_{k+1}(f(x, u, w)) ] }. Note that J_{N-1}(x) ≤ J_N(x) for all x, since J_N(x) = t(x) and J_{N-1}(x) = min[ t(x), min_{u ∈ U(x)} E[ g(x, u, w) + J_N(f(x, u, w)) ] ] ≤ t(x) = J_N(x). Can show that J_k(x) ≤ J_{k+1}(x) (monotonicity principle: tutorial problem). Then we have T_0 ⊆ T_1 ⊆ T_2 ⊆ ... ⊆ T_k ⊆ T_{k+1} ⊆ ... ⊆ T_{N-1}, i.e. the set of states in which we stop increases with time.

75 Special Case If f(x, u, w) ∈ T_{N-1} for all x ∈ T_{N-1}, u ∈ U(x), w, i.e. the set T_{N-1} is absorbing, then T_0 = T_1 = T_2 = ... = T_{N-1}. Proof: See Bertsekas. This simplifies the optimal policy, which is called the one-step lookahead policy.

76 Special Case E.g. asset selling with past offers retained: same situation as before, except that previously rejected offers can be accepted at a later time. The state evolves as x_{k+1} = max(x_k, w_k) (instead of x_{k+1} = w_k before). Can show (see Bertsekas) that T_{N-1} = { x | x ≥ ᾱ } for some constant ᾱ. This set is absorbing, since the best offer received so far cannot decrease over time. So the optimal policy at every time k is to accept if the best offer > ᾱ. We have a constant threshold ᾱ even for finite horizon N.

77 Outline 4 Problems with Imperfect State Information Reformulation as Perfect State Information Problem Linear Quadratic Control with Noisy Measurements Sufficient Statistics

78 Problems with Imperfect State Information The state x_k is not known to the controller. Instead we have noisy observations z_k of the form z_0 = h_0(x_0, v_0), z_k = h_k(x_k, u_{k-1}, v_k), k = 1, 2, ..., N-1, where v_k is observation noise, with a probability distribution P_v(· | x_0, ..., x_k, u_0, ..., u_{k-1}, w_0, ..., w_{k-1}, v_0, ..., v_{k-1}) which can depend on states, controls and disturbances. Examples: h_k(x_k, u_{k-1}, v_k) = x_k + v_k, h_k(x_k, u_{k-1}, v_k) = sin x_k + u_{k-1} v_k.

79 Problems with Imperfect State Information The initial state x_0 is random with distribution P_{x_0}. u_k ∈ U_k, where U_k does not depend on the (unknown) x_k. The information vector, i.e. the information available to the controller at time k, is defined as I_0 = z_0, I_k = (z_0, ..., z_k, u_0, ..., u_{k-1}), k = 1, 2, ..., N-1. Policies π = (µ_0, ..., µ_{N-1}), where now µ_k(I_k) ∈ U_k (before, µ_k(x_k)).

80 Basic Problem with Imperfect State Information Find the π that minimizes the cost function J_π = E{ \sum_{k=0}^{N-1} g_k(x_k, µ_k(I_k), w_k) + g_N(x_N) } subject to the system equation x_{k+1} = f_k(x_k, µ_k(I_k), w_k) and the measurement equation z_k = h_k(x_k, µ_{k-1}(I_{k-1}), v_k). Question: How to solve this problem?

81 Reformulation as Perfect State Information Problem Idea: Define a new system where the state is I_k. Then we have the D.P. algorithm etc. By definition, I_{k+1} = (z_0, ..., z_k, z_{k+1}, u_0, ..., u_{k-1}, u_k) = (I_k, u_k, z_{k+1}).

82 Reformulation as Perfect State Information Problem Regard I_{k+1} = (I_k, u_k, z_{k+1}) as a dynamical system with state I_k, control u_k and disturbance z_{k+1}. Next note that E[g_k(x_k, u_k, w_k)] = E[ E[g_k(x_k, u_k, w_k) | I_k, u_k] ] (recall that E[X] = E[E[X | Y]]). Define g̃_k(I_k, u_k) = E[g_k(x_k, u_k, w_k) | I_k, u_k] = cost per stage of the new system, and g̃_N(I_N) = E[g_N(x_N) | I_N] = terminal cost. The cost function becomes E{ \sum_{k=0}^{N-1} g_k(x_k, µ_k(I_k), w_k) + g_N(x_N) } = E{ \sum_{k=0}^{N-1} g̃_k(I_k, µ_k(I_k)) + g̃_N(I_N) }.

83 Reformulation as Perfect State Information Problem The D.P. algorithm for the reformulated perfect state information problem is: J_N(I_N) = g̃_N(I_N) = E[g_N(x_N) | I_N], J_k(I_k) = min_{u_k ∈ U_k} E{ g̃_k(I_k, u_k) + J_{k+1}(I_k, u_k, z_{k+1}) } = min_{u_k ∈ U_k} E{ g_k(x_k, u_k, w_k) + J_{k+1}(I_k, u_k, z_{k+1}) | I_k }, k = N-1, ..., 1, 0. Optimal cost J* = E{J_0(z_0)}.

84 Linear Quadratic Control with Noisy Measurements System: x_{k+1} = A x_k + B u_k + w_k. Cost function: E[ \sum_{k=0}^{N-1} (x_k^T Q x_k + u_k^T R u_k) + x_N^T Q x_N ], with g_k(x_k, u_k, w_k) = x_k^T Q x_k + u_k^T R u_k and g_N(x_N) = x_N^T Q x_N. Observations: z_k = C x_k + v_k. The w_k are independent, zero mean. From the D.P. algorithm: J_N(I_N) = E[x_N^T Q x_N | I_N].

85 Linear Quadratic Control with Noisy Measurements J_{N-1}(I_{N-1}) = min_{u_{N-1}} E{ x_{N-1}^T Q x_{N-1} + u_{N-1}^T R u_{N-1} + E[ (A x_{N-1} + B u_{N-1} + w_{N-1})^T Q (A x_{N-1} + B u_{N-1} + w_{N-1}) | I_N ] | I_{N-1} } = min_{u_{N-1}} E{ x_{N-1}^T Q x_{N-1} + u_{N-1}^T R u_{N-1} + (A x_{N-1} + B u_{N-1} + w_{N-1})^T Q (A x_{N-1} + B u_{N-1} + w_{N-1}) | I_{N-1} } (using the tower property E(E(X | Y) | Z) = E(X | Z) if Y contains more information than Z) = ... (expand, simplify and use E(w_{N-1} | I_{N-1}) = 0) = E[ x_{N-1}^T (A^T Q A + Q) x_{N-1} | I_{N-1} ] + E[ w_{N-1}^T Q w_{N-1} | I_{N-1} ] + min_{u_{N-1}} { u_{N-1}^T (B^T Q B + R) u_{N-1} + 2 E[x_{N-1} | I_{N-1}]^T A^T Q B u_{N-1} }. Differentiate with respect to u_{N-1} and set equal to zero: 2(B^T Q B + R) u_{N-1} + 2 B^T Q A E[x_{N-1} | I_{N-1}] = 0 ⇒ u_{N-1}* = -(B^T Q B + R)^{-1} B^T Q A E[x_{N-1} | I_{N-1}].

86 Linear Quadratic Control with Noisy Measurements Substituting the expression for u_{N-1}* back in: J_{N-1}(I_{N-1}) = E[ x_{N-1}^T (A^T Q A + Q) x_{N-1} | I_{N-1} ] + E[ w_{N-1}^T Q w_{N-1} ] + E[x_{N-1} | I_{N-1}]^T A^T Q B (B^T Q B + R)^{-1} (B^T Q B + R) (B^T Q B + R)^{-1} B^T Q A E[x_{N-1} | I_{N-1}] - 2 E[x_{N-1} | I_{N-1}]^T A^T Q B (B^T Q B + R)^{-1} B^T Q A E[x_{N-1} | I_{N-1}] = E[ x_{N-1}^T (A^T Q A + Q) x_{N-1} | I_{N-1} ] + E(w_{N-1}^T Q w_{N-1}) - E(x_{N-1} | I_{N-1})^T A^T Q B (B^T Q B + R)^{-1} B^T Q A E(x_{N-1} | I_{N-1}) = E[ x_{N-1}^T (A^T Q A + Q) x_{N-1} | I_{N-1} ] + E(w_{N-1}^T Q w_{N-1}) + E[ (x_{N-1} - E[x_{N-1} | I_{N-1}])^T A^T Q B (B^T Q B + R)^{-1} B^T Q A (x_{N-1} - E[x_{N-1} | I_{N-1}]) | I_{N-1} ] - E[ x_{N-1}^T A^T Q B (B^T Q B + R)^{-1} B^T Q A x_{N-1} | I_{N-1} ], where the matrix A^T Q B (B^T Q B + R)^{-1} B^T Q A is denoted P_{N-1}.

87 Linear Quadratic Control with Noisy Measurements We have J_{N-1}(I_{N-1}) = E[ x_{N-1}^T K_{N-1} x_{N-1} | I_{N-1} ] + E[ w_{N-1}^T Q w_{N-1} ] + E[ (x_{N-1} - E[x_{N-1} | I_{N-1}])^T P_{N-1} (x_{N-1} - E[x_{N-1} | I_{N-1}]) | I_{N-1} ], where P_{N-1} = A^T Q B (B^T Q B + R)^{-1} B^T Q A and K_{N-1} = A^T Q A + Q - P_{N-1}.

88 Linear Quadratic Control with Noisy Measurements For period N-2, J_{N-2}(I_{N-2}) = min_{u_{N-2}} E{ x_{N-2}^T Q x_{N-2} + u_{N-2}^T R u_{N-2} + J_{N-1}(I_{N-1}) | I_{N-2} } = E{ x_{N-2}^T Q x_{N-2} | I_{N-2} } + min_{u_{N-2}} [ u_{N-2}^T R u_{N-2} + E{ x_{N-1}^T K_{N-1} x_{N-1} | I_{N-2} } ] + E[ (x_{N-1} - E[x_{N-1} | I_{N-1}])^T P_{N-1} (x_{N-1} - E[x_{N-1} | I_{N-1}]) | I_{N-2} ] + E(w_{N-1}^T Q w_{N-1}). Then we can obtain u_{N-2}* = -(B^T K_{N-1} B + R)^{-1} B^T K_{N-1} A E[x_{N-2} | I_{N-2}]. Note that in the above the term E[ (x_{N-1} - E[x_{N-1} | I_{N-1}])^T P_{N-1} (x_{N-1} - E[x_{N-1} | I_{N-1}]) | I_{N-2} ] can be taken outside the minimization (see Bertsekas for proof). Intuition: the estimation error x_k - E[x_k | I_k] can't be influenced by the choice of control.

89 Linear Quadratic Control with Noisy Measurements Continuing on, the general solution is: µ_k*(I_k) = u_k* = -(B^T K_{k+1} B + R)^{-1} B^T K_{k+1} A E[x_k | I_k] = L_k E[x_k | I_k], where K_N = Q, P_k = A^T K_{k+1} B (B^T K_{k+1} B + R)^{-1} B^T K_{k+1} A, K_k = A^T K_{k+1} A + Q - P_k. Comparison with the perfect state information case: the L_k matrix is the same, and x_k is replaced by E[x_k | I_k]. How to compute E[x_k | I_k]?

90 Linear Quadratic Control with Noisy Measurements Summary so far: System x_{k+1} = A x_k + B u_k + w_k, z_k = C x_k + v_k. Problem: min E[ \sum_{k=0}^{N-1} (x_k^T Q x_k + u_k^T R u_k) + x_N^T Q x_N ]. The optimal solution is µ_k*(I_k) = -(B^T K_{k+1} B + R)^{-1} B^T K_{k+1} A E[x_k | I_k] = L_k E[x_k | I_k], where I_k = (z_0, ..., z_k, u_0, ..., u_{k-1}).

91 Linear Quadratic Control with Noisy Measurements The optimal controller can be decomposed into two parts: 1) an estimator, which computes E[x_k | I_k]; 2) an actuator, which multiplies E[x_k | I_k] by L_k. L_k is the same gain matrix as in the perfect state information case; we only replace x_k with E[x_k | I_k]. The estimator and actuator can be designed separately. This is known as the separation principle/theorem.

92 LQG Control Remaining problem: How do we compute E[x_k | I_k]? This is a very difficult problem in general (a subject called nonlinear filtering). When the system is linear and w_k, v_k are Gaussian, E[x_k | I_k] can be computed analytically. The procedure/algorithm is known as the Kalman filter (ref: Anderson and Moore, Optimal Filtering), and the overall controller is called the LQG (linear quadratic Gaussian) controller.

93 Kalman Filter System: x_{k+1} = A x_k + B u_k + w_k, z_k = C x_k + v_k, with w_k ~ N(0, Σ_w) i.i.d., Σ_w = E[w_k w_k^T], and v_k ~ N(0, Σ_v) i.i.d., Σ_v = E[v_k v_k^T]. Define the state estimates x̂_{k|k} = E[x_k | I_k], x̂_{k+1|k} = E[x_{k+1} | I_k], and the estimation error covariance matrices Σ_{k|k} = E[(x_k - x̂_{k|k})(x_k - x̂_{k|k})^T | I_k], Σ_{k+1|k} = E[(x_{k+1} - x̂_{k+1|k})(x_{k+1} - x̂_{k+1|k})^T | I_k].

94 Kalman Filter Then x̂_{k|k}, x̂_{k+1|k}, Σ_{k|k}, Σ_{k+1|k} can be computed recursively using the Kalman filter equations: x̂_{k|k} = x̂_{k|k-1} + Σ_{k|k-1} C^T (C Σ_{k|k-1} C^T + Σ_v)^{-1} (z_k - C x̂_{k|k-1}), x̂_{k+1|k} = A x̂_{k|k} + B u_k, Σ_{k|k} = Σ_{k|k-1} - Σ_{k|k-1} C^T (C Σ_{k|k-1} C^T + Σ_v)^{-1} C Σ_{k|k-1}, Σ_{k+1|k} = A Σ_{k|k} A^T + Σ_w, for k = 0, 1, ..., N-1. Proof: see Bertsekas, or Anderson and Moore. Beware: Many people who work in Kalman filtering like to use Q for Σ_w, R for Σ_v, and K_k for the Kalman gain Σ_{k|k-1} C^T (C Σ_{k|k-1} C^T + Σ_v)^{-1}, but here Q, R, K_k have been used for different things. People also use P_{k+1|k} for Σ_{k+1|k}, P_{k|k} for Σ_{k|k}, etc.
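A minimal NumPy sketch of one Kalman filter step, following the slide's equations; the initialization x̂_{0|-1} = E[x_0] and Σ_{0|-1} = cov(x_0) is assumed to be supplied by the user.

# Sketch: one step of the Kalman filter, in the slide's notation (NumPy).
import numpy as np

def kalman_step(xp, Sp, z, u, A, B, C, Sw, Sv):
    """xp = x̂_{k|k-1}, Sp = Σ_{k|k-1}; returns (x̂_{k|k}, Σ_{k|k}, x̂_{k+1|k}, Σ_{k+1|k})."""
    S = C @ Sp @ C.T + Sv                          # innovation covariance
    G = Sp @ C.T @ np.linalg.inv(S)                # gain Σ_{k|k-1} C^T (C Σ C^T + Σ_v)^{-1}
    xf = xp + G @ (z - C @ xp)                     # measurement update
    Sf = Sp - G @ C @ Sp
    xn = A @ xf + B @ u                            # time update
    Sn = A @ Sf @ A.T + Sw
    return xf, Sf, xn, Sn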

95 Kalman Filter Properties In general the mean squared error E[(x_k - x̂_k)^T (x_k - x̂_k) | I_k] is minimized when x̂_k = E[x_k | I_k]. The Kalman filter equations compute E[x_k | I_k] when the noises are Gaussian, and the (optimal) estimates are then linear functions of the measurements z_k. Even when the noises are not Gaussian, x̂_{k|k} computed by the Kalman filter equations gives the best linear estimate of x_k. This is a useful suboptimal solution when the noises are non-Gaussian.

96 Kalman Filter Properties Recall that if the pair (A, B) is controllable and (A, Q^{1/2}) is observable, the optimal controller has a steady-state solution. Similarly, if (A, C) is observable and (A, Σ_w^{1/2}) is controllable, then Σ_{k|k-1} converges to a steady-state value Σ as k → ∞, where Σ satisfies the algebraic Riccati equation Σ = A Σ A^T - A Σ C^T (C Σ C^T + Σ_v)^{-1} C Σ A^T + Σ_w. So we have a steady-state estimator: x̂_{k|k} = x̂_{k|k-1} + Σ C^T (C Σ C^T + Σ_v)^{-1} (z_k - C x̂_{k|k-1}), x̂_{k+1|k} = A x̂_{k|k} + B u_k.

97 Sufficient Statistics Information vector I_k = (z_0, ..., z_k, u_0, ..., u_{k-1}). The dimension of I_k increases with time k, which is inconvenient for large k. Sufficient statistic: a function S_k(I_k) which summarizes all the essential content in I_k for computing the optimal control, i.e. µ_k*(I_k) = µ̄(S_k(I_k)) for some function µ̄. S_k(I_k) is preferably of smaller dimension than I_k.

98 Examples of Sufficient Statistics 1) I_k itself. 2) The conditional state distribution/belief state P_{x_k | I_k}, assuming that the distribution of v_k depends only on x_{k-1}, u_{k-1}, w_{k-1}. If the number of states is finite then P_{x_k | I_k} is a vector, e.g. if the states are 1, 2, ..., n, then P_{x_k | I_k} = ( P(x_k = 1 | I_k), P(x_k = 2 | I_k), ..., P(x_k = n | I_k) ). The dimension of the vector is n, which doesn't grow with k. 3) Special case: E[x_k | I_k] is a sufficient statistic for the LQG problem (though not a sufficient statistic in general).

99 Conditional State Distribution The conditional state distribution P_{x_k | I_k} can be generated recursively, as P_{x_{k+1} | I_{k+1}} = Φ_k(P_{x_k | I_k}, u_k, z_{k+1}) for some function Φ_k(·,·,·). Then the D.P. algorithm can be written as J_k(P_{x_k | I_k}) = min_{u_k ∈ U_k} E[ g_k(x_k, u_k, w_k) + J_{k+1}(Φ_k(P_{x_k | I_k}, u_k, z_{k+1})) | I_k ]. A general formula for Φ_k(·,·,·) can be derived, but it is quite complicated (see Bertsekas). We will derive some examples from first principles.

100 Example 1: Search Problem At each period, decide whether to search a site that may contain a treasure. If the treasure is present and we search, we find it with probability β and take it. States: {treasure present, treasure not present}. Controls: {search, no search}. Regard each search result as an (imperfect) observation of the state. Let p_k = probability that the treasure is present at the start of time k. If we do not search, p_{k+1} = p_k. If we search and find the treasure, p_{k+1} = 0.

101 Example 1 If we search and don't find the treasure, p_{k+1} = P(treasure present at k | don't find at k) = P(treasure present at k, don't find at k) / P(don't find at k) = p_k (1 - β) / ( p_k (1 - β) + (1 - p_k) ), with (1 - p_k) corresponding to treasure not present & don't find. Thus p_{k+1} = p_k if we do not search at time k; p_{k+1} = 0 if we search and find the treasure; p_{k+1} = p_k (1 - β) / ( p_k (1 - β) + (1 - p_k) ) if we search and don't find the treasure. This is the Φ_k(p_k, u_k, z_{k+1}) function.

102 Example 1 Now let the treasure be worth V, let each search cost C, and suppose that once we decide not to search we can't search again at future times. The D.P. algorithm gives: J_k(p_k) = max_{ {no search, search} } [ 0, -C + p_k β V + p_k β J_{k+1}(0) + (1 - p_k β) J_{k+1}( p_k (1 - β) / (p_k (1 - β) + 1 - p_k) ) ] = max_{ {no search, search} } [ 0, -C + p_k β V + (1 - p_k β) J_{k+1}( p_k (1 - β) / (p_k (1 - β) + 1 - p_k) ) ] (where p_k β J_{k+1}(0) = 0 since the treasure has already been found). Can show that J_k(p_k) = 0 for p_k ≤ C / (β V), and that it is optimal to search iff the expected reward p_k β V ≥ the cost of search C. (Tutorial problem.)
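Because the belief update has a simple closed form, J_k(p_k) can be evaluated by direct recursion. In the sketch below the values of V, C, β and the horizon N are illustrative assumptions; the printed values illustrate the threshold behaviour, with J_k(p_k) = 0 for p_k ≤ C/(βV).

# Sketch: value of the search problem by direct recursion (V, C, beta, N are illustrative values).
def J(p, k, N=10, V=10.0, C=1.0, beta=0.6):
    """Optimal expected reward-to-go when the treasure is present with probability p at time k."""
    if k == N:
        return 0.0
    p_next = p * (1 - beta) / (p * (1 - beta) + 1 - p)   # belief after an unsuccessful search
    search = -C + p * beta * V + (1 - p * beta) * J(p_next, k + 1, N, V, C, beta)
    return max(0.0, search)                               # stop searching vs. search once more

# Threshold C/(beta*V) = 1/6 here: J(p, 0) = 0 below it, positive above it.
for p in (0.05, 0.15, 0.2, 0.5):
    print(p, J(p, 0))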

103 Example 2: Research Paper* A process {P_{e,k}} evolves in the following way, for k = 1, ..., N: P_{e,k+1} = P if ν_{k+1} γ_{e,k+1} = 1, and P_{e,k+1} = A P_{e,k} A^T + Q if ν_{k+1} γ_{e,k+1} = 0, where P, A, Q are some matrices. {γ_{e,k}} is an i.i.d. Bernoulli process with P(γ_{e,k} = 1) = λ_e, P(γ_{e,k} = 0) = 1 - λ_e, for all k. ν_k ∈ {0, 1}. {P_{e,k}} is not observed at all (no observation z_k). *Leong, Quevedo, Dolz, Dey, "On Remote State Estimation in the Presence of an Eavesdropper," Proc. IFAC World Congress, 2017.


More information

IEOR E4004: Introduction to OR: Deterministic Models

IEOR E4004: Introduction to OR: Deterministic Models IEOR E4004: Introduction to OR: Deterministic Models 1 Dynamic Programming Following is a summary of the problems we discussed in class. (We do not include the discussion on the container problem or the

More information

Introduction to Sequential Monte Carlo Methods

Introduction to Sequential Monte Carlo Methods Introduction to Sequential Monte Carlo Methods Arnaud Doucet NCSU, October 2008 Arnaud Doucet () Introduction to SMC NCSU, October 2008 1 / 36 Preliminary Remarks Sequential Monte Carlo (SMC) are a set

More information

Optimal Stopping. Nick Hay (presentation follows Thomas Ferguson s Optimal Stopping and Applications) November 6, 2008

Optimal Stopping. Nick Hay (presentation follows Thomas Ferguson s Optimal Stopping and Applications) November 6, 2008 (presentation follows Thomas Ferguson s and Applications) November 6, 2008 1 / 35 Contents: Introduction Problems Markov Models Monotone Stopping Problems Summary 2 / 35 The Secretary problem You have

More information

CSEP 573: Artificial Intelligence

CSEP 573: Artificial Intelligence CSEP 573: Artificial Intelligence Markov Decision Processes (MDP)! Ali Farhadi Many slides over the course adapted from Luke Zettlemoyer, Dan Klein, Pieter Abbeel, Stuart Russell or Andrew Moore 1 Outline

More information

16 MAKING SIMPLE DECISIONS

16 MAKING SIMPLE DECISIONS 247 16 MAKING SIMPLE DECISIONS Let us associate each state S with a numeric utility U(S), which expresses the desirability of the state A nondeterministic action A will have possible outcome states Result

More information

RMSC 4005 Stochastic Calculus for Finance and Risk. 1 Exercises. (c) Let X = {X n } n=0 be a {F n }-supermartingale. Show that.

RMSC 4005 Stochastic Calculus for Finance and Risk. 1 Exercises. (c) Let X = {X n } n=0 be a {F n }-supermartingale. Show that. 1. EXERCISES RMSC 45 Stochastic Calculus for Finance and Risk Exercises 1 Exercises 1. (a) Let X = {X n } n= be a {F n }-martingale. Show that E(X n ) = E(X ) n N (b) Let X = {X n } n= be a {F n }-submartingale.

More information

Dynamic Programming (DP) Massimo Paolucci University of Genova

Dynamic Programming (DP) Massimo Paolucci University of Genova Dynamic Programming (DP) Massimo Paolucci University of Genova DP cannot be applied to each kind of problem In particular, it is a solution method for problems defined over stages For each stage a subproblem

More information

CS 188: Artificial Intelligence Spring Announcements

CS 188: Artificial Intelligence Spring Announcements CS 188: Artificial Intelligence Spring 2011 Lecture 9: MDPs 2/16/2011 Pieter Abbeel UC Berkeley Many slides over the course adapted from either Dan Klein, Stuart Russell or Andrew Moore 1 Announcements

More information

Computer Vision Group Prof. Daniel Cremers. 7. Sequential Data

Computer Vision Group Prof. Daniel Cremers. 7. Sequential Data Group Prof. Daniel Cremers 7. Sequential Data Bayes Filter (Rep.) We can describe the overall process using a Dynamic Bayes Network: This incorporates the following Markov assumptions: (measurement) (state)!2

More information

Yao s Minimax Principle

Yao s Minimax Principle Complexity of algorithms The complexity of an algorithm is usually measured with respect to the size of the input, where size may for example refer to the length of a binary word describing the input,

More information

I R TECHNICAL RESEARCH REPORT. A Framework for Mixed Estimation of Hidden Markov Models. by S. Dey, S. Marcus T.R

I R TECHNICAL RESEARCH REPORT. A Framework for Mixed Estimation of Hidden Markov Models. by S. Dey, S. Marcus T.R TECHNICAL RESEARCH REPORT A Framework for Mixed Estimation of Hidden Markov Models by S. Dey, S. Marcus T.R. 98-31 I R INSTITUTE FOR SYSTEMS RESEARCH ISR develops, applies and teaches advanced methodologies

More information

Martingale Pricing Theory in Discrete-Time and Discrete-Space Models

Martingale Pricing Theory in Discrete-Time and Discrete-Space Models IEOR E4707: Foundations of Financial Engineering c 206 by Martin Haugh Martingale Pricing Theory in Discrete-Time and Discrete-Space Models These notes develop the theory of martingale pricing in a discrete-time,

More information

Reasoning with Uncertainty

Reasoning with Uncertainty Reasoning with Uncertainty Markov Decision Models Manfred Huber 2015 1 Markov Decision Process Models Markov models represent the behavior of a random process, including its internal state and the externally

More information

CSE 473: Artificial Intelligence

CSE 473: Artificial Intelligence CSE 473: Artificial Intelligence Markov Decision Processes (MDPs) Luke Zettlemoyer Many slides over the course adapted from Dan Klein, Stuart Russell or Andrew Moore 1 Announcements PS2 online now Due

More information

Markov Chains (Part 2)

Markov Chains (Part 2) Markov Chains (Part 2) More Examples and Chapman-Kolmogorov Equations Markov Chains - 1 A Stock Price Stochastic Process Consider a stock whose price either goes up or down every day. Let X t be a random

More information

Interpolation. 1 What is interpolation? 2 Why are we interested in this?

Interpolation. 1 What is interpolation? 2 Why are we interested in this? Interpolation 1 What is interpolation? For a certain function f (x we know only the values y 1 = f (x 1,,y n = f (x n For a point x different from x 1,,x n we would then like to approximate f ( x using

More information

Chapter 4 - Insurance Benefits

Chapter 4 - Insurance Benefits Chapter 4 - Insurance Benefits Section 4.4 - Valuation of Life Insurance Benefits (Subsection 4.4.1) Assume a life insurance policy pays $1 immediately upon the death of a policy holder who takes out the

More information

Information Acquisition under Persuasive Precedent versus Binding Precedent (Preliminary and Incomplete)

Information Acquisition under Persuasive Precedent versus Binding Precedent (Preliminary and Incomplete) Information Acquisition under Persuasive Precedent versus Binding Precedent (Preliminary and Incomplete) Ying Chen Hülya Eraslan March 25, 2016 Abstract We analyze a dynamic model of judicial decision

More information

Math-Stat-491-Fall2014-Notes-V

Math-Stat-491-Fall2014-Notes-V Math-Stat-491-Fall2014-Notes-V Hariharan Narayanan December 7, 2014 Martingales 1 Introduction Martingales were originally introduced into probability theory as a model for fair betting games. Essentially

More information

AMH4 - ADVANCED OPTION PRICING. Contents

AMH4 - ADVANCED OPTION PRICING. Contents AMH4 - ADVANCED OPTION PRICING ANDREW TULLOCH Contents 1. Theory of Option Pricing 2 2. Black-Scholes PDE Method 4 3. Martingale method 4 4. Monte Carlo methods 5 4.1. Method of antithetic variances 5

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning Monte Carlo Methods Mark Schmidt University of British Columbia Winter 2018 Last Time: Markov Chains We can use Markov chains for density estimation, p(x) = p(x 1 ) }{{} d p(x

More information

Complex Decisions. Sequential Decision Making

Complex Decisions. Sequential Decision Making Sequential Decision Making Outline Sequential decision problems Value iteration Policy iteration POMDPs (basic concepts) Slides partially based on the Book "Reinforcement Learning: an introduction" by

More information

Economic optimization in Model Predictive Control

Economic optimization in Model Predictive Control Economic optimization in Model Predictive Control Rishi Amrit Department of Chemical and Biological Engineering University of Wisconsin-Madison 29 th February, 2008 Rishi Amrit (UW-Madison) Economic Optimization

More information

CS 3331 Numerical Methods Lecture 2: Functions of One Variable. Cherung Lee

CS 3331 Numerical Methods Lecture 2: Functions of One Variable. Cherung Lee CS 3331 Numerical Methods Lecture 2: Functions of One Variable Cherung Lee Outline Introduction Solving nonlinear equations: find x such that f(x ) = 0. Binary search methods: (Bisection, regula falsi)

More information

4 Reinforcement Learning Basic Algorithms

4 Reinforcement Learning Basic Algorithms Learning in Complex Systems Spring 2011 Lecture Notes Nahum Shimkin 4 Reinforcement Learning Basic Algorithms 4.1 Introduction RL methods essentially deal with the solution of (optimal) control problems

More information

Stochastic Games and Bayesian Games

Stochastic Games and Bayesian Games Stochastic Games and Bayesian Games CPSC 532l Lecture 10 Stochastic Games and Bayesian Games CPSC 532l Lecture 10, Slide 1 Lecture Overview 1 Recap 2 Stochastic Games 3 Bayesian Games 4 Analyzing Bayesian

More information

Lecture 5 January 30

Lecture 5 January 30 EE 223: Stochastic Estimation and Control Spring 2007 Lecture 5 January 30 Lecturer: Venkat Anantharam Scribe: aryam Kamgarpour 5.1 Secretary Problem The problem set-up is explained in Lecture 4. We review

More information

Pakes (1986): Patents as Options: Some Estimates of the Value of Holding European Patent Stocks

Pakes (1986): Patents as Options: Some Estimates of the Value of Holding European Patent Stocks Pakes (1986): Patents as Options: Some Estimates of the Value of Holding European Patent Stocks Spring 2009 Main question: How much are patents worth? Answering this question is important, because it helps

More information

1 Answers to the Sept 08 macro prelim - Long Questions

1 Answers to the Sept 08 macro prelim - Long Questions Answers to the Sept 08 macro prelim - Long Questions. Suppose that a representative consumer receives an endowment of a non-storable consumption good. The endowment evolves exogenously according to ln

More information

Making Decisions. CS 3793 Artificial Intelligence Making Decisions 1

Making Decisions. CS 3793 Artificial Intelligence Making Decisions 1 Making Decisions CS 3793 Artificial Intelligence Making Decisions 1 Planning under uncertainty should address: The world is nondeterministic. Actions are not certain to succeed. Many events are outside

More information

SYLLABUS AND SAMPLE QUESTIONS FOR MS(QE) Syllabus for ME I (Mathematics), 2012

SYLLABUS AND SAMPLE QUESTIONS FOR MS(QE) Syllabus for ME I (Mathematics), 2012 SYLLABUS AND SAMPLE QUESTIONS FOR MS(QE) 2012 Syllabus for ME I (Mathematics), 2012 Algebra: Binomial Theorem, AP, GP, HP, Exponential, Logarithmic Series, Sequence, Permutations and Combinations, Theory

More information

Lecture Quantitative Finance Spring Term 2015

Lecture Quantitative Finance Spring Term 2015 implied Lecture Quantitative Finance Spring Term 2015 : May 7, 2015 1 / 28 implied 1 implied 2 / 28 Motivation and setup implied the goal of this chapter is to treat the implied which requires an algorithm

More information

Reinforcement Learning. Slides based on those used in Berkeley's AI class taught by Dan Klein

Reinforcement Learning. Slides based on those used in Berkeley's AI class taught by Dan Klein Reinforcement Learning Slides based on those used in Berkeley's AI class taught by Dan Klein Reinforcement Learning Basic idea: Receive feedback in the form of rewards Agent s utility is defined by the

More information

Trust Region Methods for Unconstrained Optimisation

Trust Region Methods for Unconstrained Optimisation Trust Region Methods for Unconstrained Optimisation Lecture 9, Numerical Linear Algebra and Optimisation Oxford University Computing Laboratory, MT 2007 Dr Raphael Hauser (hauser@comlab.ox.ac.uk) The Trust

More information

1 Dynamic programming

1 Dynamic programming 1 Dynamic programming A country has just discovered a natural resource which yields an income per period R measured in terms of traded goods. The cost of exploitation is negligible. The government wants

More information

Financial Mathematics III Theory summary

Financial Mathematics III Theory summary Financial Mathematics III Theory summary Table of Contents Lecture 1... 7 1. State the objective of modern portfolio theory... 7 2. Define the return of an asset... 7 3. How is expected return defined?...

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning Monte Carlo Methods Mark Schmidt University of British Columbia Winter 2019 Last Time: Markov Chains We can use Markov chains for density estimation, d p(x) = p(x 1 ) p(x }{{}

More information

CHOICE THEORY, UTILITY FUNCTIONS AND RISK AVERSION

CHOICE THEORY, UTILITY FUNCTIONS AND RISK AVERSION CHOICE THEORY, UTILITY FUNCTIONS AND RISK AVERSION Szabolcs Sebestyén szabolcs.sebestyen@iscte.pt Master in Finance INVESTMENTS Sebestyén (ISCTE-IUL) Choice Theory Investments 1 / 65 Outline 1 An Introduction

More information

Stochastic Games and Bayesian Games

Stochastic Games and Bayesian Games Stochastic Games and Bayesian Games CPSC 532L Lecture 10 Stochastic Games and Bayesian Games CPSC 532L Lecture 10, Slide 1 Lecture Overview 1 Recap 2 Stochastic Games 3 Bayesian Games Stochastic Games

More information

Lecture outline W.B.Powell 1

Lecture outline W.B.Powell 1 Lecture outline What is a policy? Policy function approximations (PFAs) Cost function approximations (CFAs) alue function approximations (FAs) Lookahead policies Finding good policies Optimizing continuous

More information

CS 188: Artificial Intelligence. Outline

CS 188: Artificial Intelligence. Outline C 188: Artificial Intelligence Markov Decision Processes (MDPs) Pieter Abbeel UC Berkeley ome slides adapted from Dan Klein 1 Outline Markov Decision Processes (MDPs) Formalism Value iteration In essence

More information

Drunken Birds, Brownian Motion, and Other Random Fun

Drunken Birds, Brownian Motion, and Other Random Fun Drunken Birds, Brownian Motion, and Other Random Fun Michael Perlmutter Department of Mathematics Purdue University 1 M. Perlmutter(Purdue) Brownian Motion and Martingales Outline Review of Basic Probability

More information

SOLVING ROBUST SUPPLY CHAIN PROBLEMS

SOLVING ROBUST SUPPLY CHAIN PROBLEMS SOLVING ROBUST SUPPLY CHAIN PROBLEMS Daniel Bienstock Nuri Sercan Özbay Columbia University, New York November 13, 2005 Project with Lucent Technologies Optimize the inventory buffer levels in a complicated

More information

Outline. 1 Introduction. 2 Algorithms. 3 Examples. Algorithm 1 General coordinate minimization framework. 1: Choose x 0 R n and set k 0.

Outline. 1 Introduction. 2 Algorithms. 3 Examples. Algorithm 1 General coordinate minimization framework. 1: Choose x 0 R n and set k 0. Outline Coordinate Minimization Daniel P. Robinson Department of Applied Mathematics and Statistics Johns Hopkins University November 27, 208 Introduction 2 Algorithms Cyclic order with exact minimization

More information

1 The EOQ and Extensions

1 The EOQ and Extensions IEOR4000: Production Management Lecture 2 Professor Guillermo Gallego September 16, 2003 Lecture Plan 1. The EOQ and Extensions 2. Multi-Item EOQ Model 1 The EOQ and Extensions We have explored some of

More information

MAT 4250: Lecture 1 Eric Chung

MAT 4250: Lecture 1 Eric Chung 1 MAT 4250: Lecture 1 Eric Chung 2Chapter 1: Impartial Combinatorial Games 3 Combinatorial games Combinatorial games are two-person games with perfect information and no chance moves, and with a win-or-lose

More information

Information Acquisition under Persuasive Precedent versus Binding Precedent (Preliminary and Incomplete)

Information Acquisition under Persuasive Precedent versus Binding Precedent (Preliminary and Incomplete) Information Acquisition under Persuasive Precedent versus Binding Precedent (Preliminary and Incomplete) Ying Chen Hülya Eraslan January 9, 216 Abstract We analyze a dynamic model of judicial decision

More information

16 MAKING SIMPLE DECISIONS

16 MAKING SIMPLE DECISIONS 253 16 MAKING SIMPLE DECISIONS Let us associate each state S with a numeric utility U(S), which expresses the desirability of the state A nondeterministic action a will have possible outcome states Result(a)

More information

AM 121: Intro to Optimization Models and Methods

AM 121: Intro to Optimization Models and Methods AM 121: Intro to Optimization Models and Methods Lecture 18: Markov Decision Processes Yiling Chen and David Parkes Lesson Plan Markov decision processes Policies and Value functions Solving: average reward,

More information

OPTIMAL PORTFOLIO CONTROL WITH TRADING STRATEGIES OF FINITE

OPTIMAL PORTFOLIO CONTROL WITH TRADING STRATEGIES OF FINITE Proceedings of the 44th IEEE Conference on Decision and Control, and the European Control Conference 005 Seville, Spain, December 1-15, 005 WeA11.6 OPTIMAL PORTFOLIO CONTROL WITH TRADING STRATEGIES OF

More information

91.420/543: Artificial Intelligence UMass Lowell CS Fall 2010

91.420/543: Artificial Intelligence UMass Lowell CS Fall 2010 91.420/543: Artificial Intelligence UMass Lowell CS Fall 2010 Lecture 17 & 18: Markov Decision Processes Oct 12 13, 2010 A subset of Lecture 9 slides from Dan Klein UC Berkeley Many slides over the course

More information

Markov Decision Processes: Making Decision in the Presence of Uncertainty. (some of) R&N R&N

Markov Decision Processes: Making Decision in the Presence of Uncertainty. (some of) R&N R&N Markov Decision Processes: Making Decision in the Presence of Uncertainty (some of) R&N 16.1-16.6 R&N 17.1-17.4 Different Aspects of Machine Learning Supervised learning Classification - concept learning

More information

Homework Assignments

Homework Assignments Homework Assignments Week 1 (p. 57) #4.1, 4., 4.3 Week (pp 58 6) #4.5, 4.6, 4.8(a), 4.13, 4.0, 4.6(b), 4.8, 4.31, 4.34 Week 3 (pp 15 19) #1.9, 1.1, 1.13, 1.15, 1.18 (pp 9 31) #.,.6,.9 Week 4 (pp 36 37)

More information

Lecture 3: Factor models in modern portfolio choice

Lecture 3: Factor models in modern portfolio choice Lecture 3: Factor models in modern portfolio choice Prof. Massimo Guidolin Portfolio Management Spring 2016 Overview The inputs of portfolio problems Using the single index model Multi-index models Portfolio

More information

EE365: Markov Decision Processes

EE365: Markov Decision Processes EE365: Markov Decision Processes Markov decision processes Markov decision problem Examples 1 Markov decision processes 2 Markov decision processes add input (or action or control) to Markov chain with

More information

Definition 4.1. In a stochastic process T is called a stopping time if you can tell when it happens.

Definition 4.1. In a stochastic process T is called a stopping time if you can tell when it happens. 102 OPTIMAL STOPPING TIME 4. Optimal Stopping Time 4.1. Definitions. On the first day I explained the basic problem using one example in the book. On the second day I explained how the solution to the

More information

CS 343: Artificial Intelligence

CS 343: Artificial Intelligence CS 343: Artificial Intelligence Markov Decision Processes II Prof. Scott Niekum The University of Texas at Austin [These slides based on those of Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC

More information

Approximate Revenue Maximization with Multiple Items

Approximate Revenue Maximization with Multiple Items Approximate Revenue Maximization with Multiple Items Nir Shabbat - 05305311 December 5, 2012 Introduction The paper I read is called Approximate Revenue Maximization with Multiple Items by Sergiu Hart

More information

The Values of Information and Solution in Stochastic Programming

The Values of Information and Solution in Stochastic Programming The Values of Information and Solution in Stochastic Programming John R. Birge The University of Chicago Booth School of Business JRBirge ICSP, Bergamo, July 2013 1 Themes The values of information and

More information

Dynamic Appointment Scheduling in Healthcare

Dynamic Appointment Scheduling in Healthcare Brigham Young University BYU ScholarsArchive All Theses and Dissertations 2011-12-05 Dynamic Appointment Scheduling in Healthcare McKay N. Heasley Brigham Young University - Provo Follow this and additional

More information

Multi-period Portfolio Choice and Bayesian Dynamic Models

Multi-period Portfolio Choice and Bayesian Dynamic Models Multi-period Portfolio Choice and Bayesian Dynamic Models Petter Kolm and Gordon Ritter Courant Institute, NYU Paper appeared in Risk Magazine, Feb. 25 (2015) issue Working paper version: papers.ssrn.com/sol3/papers.cfm?abstract_id=2472768

More information

Martingales. by D. Cox December 2, 2009

Martingales. by D. Cox December 2, 2009 Martingales by D. Cox December 2, 2009 1 Stochastic Processes. Definition 1.1 Let T be an arbitrary index set. A stochastic process indexed by T is a family of random variables (X t : t T) defined on a

More information

Part 3: Trust-region methods for unconstrained optimization. Nick Gould (RAL)

Part 3: Trust-region methods for unconstrained optimization. Nick Gould (RAL) Part 3: Trust-region methods for unconstrained optimization Nick Gould (RAL) minimize x IR n f(x) MSc course on nonlinear optimization UNCONSTRAINED MINIMIZATION minimize x IR n f(x) where the objective

More information

The Irrevocable Multi-Armed Bandit Problem

The Irrevocable Multi-Armed Bandit Problem The Irrevocable Multi-Armed Bandit Problem Ritesh Madan Qualcomm-Flarion Technologies May 27, 2009 Joint work with Vivek Farias (MIT) 2 Multi-Armed Bandit Problem n arms, where each arm i is a Markov Decision

More information

Asymptotic results discrete time martingales and stochastic algorithms

Asymptotic results discrete time martingales and stochastic algorithms Asymptotic results discrete time martingales and stochastic algorithms Bernard Bercu Bordeaux University, France IFCAM Summer School Bangalore, India, July 2015 Bernard Bercu Asymptotic results for discrete

More information

Markov Decision Processes (MDPs) CS 486/686 Introduction to AI University of Waterloo

Markov Decision Processes (MDPs) CS 486/686 Introduction to AI University of Waterloo Markov Decision Processes (MDPs) CS 486/686 Introduction to AI University of Waterloo Outline Sequential Decision Processes Markov chains Highlight Markov property Discounted rewards Value iteration Markov

More information

SYSM 6304: Risk and Decision Analysis Lecture 6: Pricing and Hedging Financial Derivatives

SYSM 6304: Risk and Decision Analysis Lecture 6: Pricing and Hedging Financial Derivatives SYSM 6304: Risk and Decision Analysis Lecture 6: Pricing and Hedging Financial Derivatives M. Vidyasagar Cecil & Ida Green Chair The University of Texas at Dallas Email: M.Vidyasagar@utdallas.edu October

More information

Optimizing Portfolios

Optimizing Portfolios Optimizing Portfolios An Undergraduate Introduction to Financial Mathematics J. Robert Buchanan 2010 Introduction Investors may wish to adjust the allocation of financial resources including a mixture

More information

arxiv: v1 [math.pr] 6 Apr 2015

arxiv: v1 [math.pr] 6 Apr 2015 Analysis of the Optimal Resource Allocation for a Tandem Queueing System arxiv:1504.01248v1 [math.pr] 6 Apr 2015 Liu Zaiming, Chen Gang, Wu Jinbiao School of Mathematics and Statistics, Central South University,

More information