Dynamic Programming and Stochastic Control
1 Dynamic Programming and Stochastic Control Dr. Alex Leong Department of Electrical Engineering (EIM-E) Paderborn University, Germany Dr. Alex Leong DP and Stochastic Control Paderborn University 1 / 158
2 Outline 1 Introduction
3 Introduction What is dynamic programming (DP)? A method for solving multi-stage decision problems (sequential decision making). There is often some randomness in what happens in the future. Optimize the set of decisions to achieve a good overall outcome. Richard Bellman popularized DP in the 1950s.
4 Examples 1) Inventory control A store sells a product, e.g. ice cream, and orders supplies once a week. Sales during the week are random. How much supply should the store get to maximize expected profit over the summer? Order too little: can't meet demand. Order too much: storage/refrigeration cost.
5 Examples 2) Parts replacement, e.g. a bus engine. At the start of each month, decide whether the engine on a bus should be replaced, to maximize expected profit. If replaced: profit = earnings − replacement cost − maintenance. If not replaced: profit = earnings − maintenance. Earnings will decrease if the engine breaks down, and P(breakdown) is age dependent.
6 Examples 3) Formula 1 engines, replace or not? 20 races, 4 engines (in 2017). Decide whether to replace the engine at the start of each race, to maximize the chance of winning the championship.
7 Examples 4) Queueing (see Figure 1) Packets arrive at queues 1 and 2. If both queues transmit at the same time, there is a collision. If a collision occurs, retransmit at the next time with a certain probability. Choose the retransmission probabilities to maximize throughput. Figure 1: Queueing
8 Examples 5) LQR (Linear Quadratic Regulator) Linear system: x_{k+1} = A x_k + B u_k (deterministic problem). Assume knowledge of x_k at time k (perfect state info). Choose the sequence of u_k to
min_{u_0, u_1,..., u_{N−1}} Σ_{k=0}^{N−1} (x_k^T Q x_k + u_k^T R u_k) + x_N^T Q x_N.
N = number of stages = horizon. N finite ⇒ finite horizon.
9 Examples 6) x_{k+1} = A x_k + B u_k + w_k, where w_k is random noise. Assume x_k known (perfect state info). Choose the sequence of u_k to
min_{u_0, u_1,..., u_{N−1}} E[ Σ_{k=0}^{N−1} (x_k^T Q x_k + u_k^T R u_k) + x_N^T Q x_N ].
10 Examples 7) LQG (Linear Quadratic Gaussian) control x_{k+1} = A x_k + B u_k + w_k, y_k = C x_k + v_k, with v_k, w_k Gaussian noise. Case of imperfect state info. Based on the measurements y_k, choose u_k to
min_{u_0, u_1,..., u_{N−1}} E[ Σ_{k=0}^{N−1} (x_k^T Q x_k + u_k^T R u_k) + x_N^T Q x_N ].
11 Examples 8) Infinite horizon
min_{u_0, u_1,...} lim_{N→∞} (1/N) E[ Σ_{k=0}^{N−1} (x_k^T Q x_k + u_k^T R u_k) + x_N^T Q x_N ].
Note: Here we divide by N, otherwise the summation often blows up.
12 Examples 9) Shortest paths (see Figure 2) Find the shortest path from A to D (deterministic problem). Can be solved using the Viterbi algorithm (1967), which can be regarded as a special case of (forward) DP. Applications: decoding of convolutional codes (communications); channel equalization (communications); estimation of hidden Markov models (signal processing). Figure 2: Shortest paths problem
13 Outline 2 The Dynamic Programming Principle and Dynamic Programming Algorithm Basic Structure of Dynamic Programming Problem Dynamic Programming Principle of Optimality Dynamic Programming Algorithm Shortest Path Problems
14 Basic structure of stochastic DP problem Two ingredients: a discrete-time system and a cost function. 1. Discrete-time system: x_{k+1} = f_k(x_k, u_k, w_k), k = 0, 1,..., N−1 (or k = 1, 2,..., N). k is the time index. x_k is the state at time k; it summarizes the past information that is relevant for future optimization. u_k is the control/decision/action at time k; it lies in a set U_k(x_k) which may depend on k and x_k. w_k is a random disturbance (noise), with a probability distribution P(· | k, x_k, u_k) which may depend on k, x_k, u_k.
15 Basic structure of stochastic DP problem x_{k+1} = f_k(x_k, u_k, w_k), k = 0, 1,..., N−1. N is the horizon, or number of times control is applied. f_k is the function that describes how the system evolves over time. Examples: f_k = A x_k + B u_k + w_k (linear system); f_k = x_k u_k + w_k (non-linear); f_k = cos x_k + w_k sin u_k (non-linear).
16 Basic structure of stochastic DP problem 2. Cost function, which is additive over time:
E[ Σ_{k=0}^{N−1} g_k(x_k, u_k, w_k) + g_N(x_N) ].
The expectation is used because of the random w_k. g_k is the function that represents the cost at time k. Examples: g_k = x_k + u_k; g_k = x_k^2 + C u_k^2, where C is a constant. g_N(x_N) is the terminal cost.
17 Basic structure of stochastic DP problem Objective: Minimize the cost function over the controls u_0 = μ_0(x_0), u_1 = μ_1(x_1),..., u_{N−1} = μ_{N−1}(x_{N−1}). The choice of u_k depends on x_k. Optimization over policies: rules/functions μ_k for generating u_k for every possible value of x_k. The expected cost of policy π = (μ_0, μ_1,..., μ_{N−1}) starting at x_0 is
J_π(x_0) = E[ Σ_{k=0}^{N−1} g_k(x_k, μ_k(x_k), w_k) + g_N(x_N) ].
Optimal policy: π* = argmin_π J_π(x_0). Optimal cost starting at x_0: J*(x_0) = min_π J_π(x_0).
18 Examples 1) Inventory example x_k = amount of stock at time k. u_k = stock ordered at time k. w_k = demand at time k, with some probability distribution, e.g. uniform. System: x_{k+1} = x_k + u_k − w_k (= f_k(x_k, u_k, w_k)). x_k can be negative with this model. Alternative model: x_{k+1} = max(0, x_k + u_k − w_k). Cost function at time k: g_k(x_k, u_k, w_k) = r(x_k) + C u_k, where r(x_k) is the penalty for holding excess stock and C is the cost per item.
19 Examples 1) Inventory example (cont.) Terminal cost: R(x_N) is the penalty for having excess stock at the end. Cost function: E[ Σ_{k=0}^{N−1} (r(x_k) + C u_k) + R(x_N) ]. The amount u_k to order can depend on the inventory level x_k. Can have constraints on u_k, e.g. x_k + u_k ≤ max. storage. Optimization over policies: find the rule which tells you how much to order for every possible stock level x_k.
20 Examples 2) Example 6 of the previous section System: x_{k+1} = A x_k + B u_k + w_k (= f_k). Cost function: E[ Σ_{k=0}^{N−1} (x_k^T Q x_k + u_k^T R u_k) + x_N^T Q x_N ], with stage cost g_k = x_k^T Q x_k + u_k^T R u_k and terminal cost g_N(x_N) = x_N^T Q x_N. Objective: Determine u_k = μ_k(x_k), k = 0, 1,..., N−1, to minimize the cost function. The solution turns out to be u*_k = L_k x_k for some matrices L_k (derived in a later lecture).
21 Examples 3) Shortest paths (see Figure 3) Figure 3: Shortest path problem x_k = which node we're in at stage k. u_k = which path we take to get to stage k+1. w_k = 0 (no randomness). Cost function = sum of the values along the paths we choose.
22 Open loop vs. Closed loop Open loop: the controls (u_0, u_1,..., u_{N−1}) are chosen at the beginning (time 0). Closed loop: a policy (μ_0, μ_1,..., μ_{N−1}) is chosen, where at time k the control u_k = μ_k(x_k) can depend on x_k, so it can adapt to conditions. E.g. in the inventory problem: if the current stock level x_k is high, order less; if x_k is low, order more. Closed loop is always at least as good as open loop. For deterministic problems, open loop is as good as closed loop: we can predict exactly the future states given the initial state and the sequence of controls. For stochastic problems, one should generally use closed loop.
23 D.P. Principle of Optimality Intuition Figure 4: Shortest path problem Consider the shortest path problem in Figure 4. Shortest path from A to F (shown in red): A → C → D → F. Shortest path from C to F: C → D → F, a subpath of the shortest path from A to F. Shortest path from D to F: D → F, also a subpath of the shortest path from A to F.
24 D.P. Principle of Optimality Observation: The shortest path from A to F contains the shortest paths from the intermediate nodes to F. Why? Suppose there were a shorter path from C to F other than C → D → F. Then we could construct a new path A → C → ... → F (a new shortest path) which is shorter than A → C → D → F, contradicting A → C → D → F being the shortest.
25 D.P. Principle of Optimality Formal statement: Basic problem
min_π E{ Σ_{k=0}^{N−1} g_k(x_k, μ_k(x_k), w_k) + g_N(x_N) }.
Let π* = {μ*_0, μ*_1,..., μ*_{N−1}} be the optimal policy. Consider the tail subproblem
min_{μ_i, μ_{i+1},..., μ_{N−1}} E{ Σ_{k=i}^{N−1} g_k(x_k, μ_k(x_k), w_k) + g_N(x_N) },
where we are at state x_i at time i and wish to minimize the cost-to-go from time i to time N. The D.P. principle of optimality then says that {μ*_i, μ*_{i+1},..., μ*_{N−1}} is optimal for the tail subproblem.
26 D.P. Principle of Optimality Proof: If {μ̃_i,..., μ̃_{N−1}} were a better policy for the tail subproblem, then {μ*_0, μ*_1,..., μ*_{i−1}, μ̃_i,..., μ̃_{N−1}} would be a better policy for the original problem, contradicting {μ*_0, μ*_1,..., μ*_{N−1}} being optimal. How can we make use of the D.P. principle? Idea: Construct an optimal policy in stages. Solve the tail subproblem involving the last stage, to obtain μ*_{N−1}. Solve the tail subproblem involving the last two stages, making use of μ*_{N−1}, to obtain μ*_{N−2}. Solve the tail subproblem involving the last three stages, making use of μ*_{N−2}, μ*_{N−1}, to obtain μ*_{N−3}. ... Solve the tail subproblem involving the last N stages, making use of μ*_1,..., μ*_{N−1}, to obtain μ*_0.
27 D.P. Algorithm Basic problem:
min_π E{ Σ_{k=0}^{N−1} g_k(x_k, μ_k(x_k), w_k) + g_N(x_N) }.
D.P. algorithm: For each possible x_k, compute
J_N(x_N) = g_N(x_N),
J_k(x_k) = min_{u_k ∈ U_k(x_k)} E{ g_k(x_k, u_k, w_k) + J_{k+1}(f_k(x_k, u_k, w_k)) }, for k = N−1, N−2,..., 1, 0.
Theorem: 1. The optimal cost J*(x_0) = J_0(x_0), where J_0(x_0) is the quantity computed by the D.P. algorithm. 2. Let μ*_k(·) be the function that generates the minimizing u_k in the D.P. algorithm, i.e. μ*_k(x_k) = u*_k. Then {μ*_0, μ*_1,..., μ*_{N−1}} is the optimal policy for the basic problem. Proof: See later.
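As a sketch, the backward recursion above can be written out for finite state, control, and disturbance spaces. The function names and problem encoding here are our own illustration, not from the notes:

```python
# A minimal backward D.P. solver for finite state/control/disturbance spaces.
# J_N(x) = g_N(x); J_k(x) = min_u E{ g_k(x,u,w) + J_{k+1}(f_k(x,u,w)) }.

def dp_solve(states, controls, f, g, g_N, w_dist, N):
    """Return cost-to-go tables J[k][x] and a policy mu[k][x].

    controls(k, x): admissible controls U_k(x_k)
    f(k, x, u, w) : next state
    g(k, x, u, w) : stage cost
    g_N(x)        : terminal cost
    w_dist        : list of (w, probability) pairs
    """
    J = [dict() for _ in range(N + 1)]
    mu = [dict() for _ in range(N)]
    for x in states:
        J[N][x] = g_N(x)
    for k in range(N - 1, -1, -1):            # k = N-1, ..., 0
        for x in states:
            best_u, best_cost = None, float("inf")
            for u in controls(k, x):
                # expectation of stage cost plus cost-to-go over the disturbance
                cost = sum(p * (g(k, x, u, w) + J[k + 1][f(k, x, u, w)])
                           for w, p in w_dist)
                if cost < best_cost:
                    best_u, best_cost = u, cost
            J[k][x] = best_cost
            mu[k][x] = best_u
    return J, mu
```

Note that the loops visit every state at every stage, reflecting the comment below that all tail subproblems are solved whether or not they are eventually reached.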
28 D.P. Algorithm Comments: The D.P. algorithm needs to be run for all possible states x_k; it solves all tail subproblems (we don't know at the start which subproblem we will need). It can be computationally expensive if the number of states/controls is large, and is often done on a computer. Suboptimal methods can reduce complexity.
29 Inventory Example x_k = level of stock at time k. u_k = amount ordered at time k. w_k = demand at time k. x_{k+1} = max(0, x_k + u_k − w_k) = f_k(x_k, u_k, w_k); excess demand is lost. Storage constraint: x_k + u_k ≤ 2. Cost at time k = purchasing cost + storage cost = u_k + (x_k + u_k − w_k)^2 = g_k(x_k, u_k, w_k), where the cost per item is 1 euro and (x_k + u_k − w_k)^2 is the storage cost. Terminal cost g_N(x_N) = 0. Probability distribution of w_k: P(w_k = 0) = 0.1, P(w_k = 1) = 0.7, P(w_k = 2) = 0.2.
30 Inventory Example Problem: Find the optimal policy for horizon N = 3, i.e.
min_{(μ_0, μ_1, μ_2)} E{ Σ_{k=0}^{2} g_k(x_k, μ_k(x_k), w_k) }.
Apply the D.P. algorithm:
J_3(x_3) = g_3(x_3) = 0,
J_k(x_k) = min_{u_k ∈ U_k(x_k)} E{ u_k + (x_k + u_k − w_k)^2 + J_{k+1}(max(0, x_k + u_k − w_k)) }, k = 2, 1, 0.
Question: What values can x_k take? (With the storage constraint, x_k ∈ {0, 1, 2}.)
31 Inventory Example Period 2: Compute J_2(x_2) for all possible values of x_2.
J_2(0) = min_{u_2 ∈ {0,1,2}} E{ u_2 + (0 + u_2 − w_2)^2 + J_3(x_3) }  (J_3 = 0 for all x_3)
= min_{u_2 ∈ {0,1,2}} u_2 + u_2^2 · 0.1 + (u_2 − 1)^2 · 0.7 + (u_2 − 2)^2 · 0.2.
If u_2 = 0: 0 + 0 · 0.1 + 1 · 0.7 + 4 · 0.2 = 1.5.
If u_2 = 1: 1 + 1 · 0.1 + 0 · 0.7 + 1 · 0.2 = 1.3.
If u_2 = 2: 2 + 4 · 0.1 + 1 · 0.7 + 0 · 0.2 = 3.1.
⇒ J_2(0) = 1.3 and μ*_2(0) = 1.
32 Inventory Example
J_2(1) = min_{u_2 ∈ {0,1}} u_2 + (1 + u_2)^2 · 0.1 + u_2^2 · 0.7 + (u_2 − 1)^2 · 0.2.
If u_2 = 0: 0.3 (check this!). If u_2 = 1: 2.1. ⇒ J_2(1) = 0.3 and μ*_2(1) = 0.
J_2(2) = min_{u_2 ∈ {0}} E{ u_2 + (2 + u_2 − w_2)^2 } = 4 · 0.1 + 1 · 0.7 + 0 · 0.2 = 1.1.
⇒ J_2(2) = 1.1 and μ*_2(2) = 0.
33 Inventory Example Period 1: Compute J_1(x_1) for all possible values of x_1.
J_1(0) = min_{u_1 ∈ {0,1,2}} E{ u_1 + (u_1 − w_1)^2 + J_2(max(0, u_1 − w_1)) }
= min_{u_1 ∈ {0,1,2}} u_1 + (u_1^2 + J_2(max(0, u_1))) · 0.1 + ((u_1 − 1)^2 + J_2(max(0, u_1 − 1))) · 0.7 + ((u_1 − 2)^2 + J_2(max(0, u_1 − 2))) · 0.2,
with the J_2 values taken from the previous stage.
u_1 = 0: (0 + J_2(0)) · 0.1 + (1 + J_2(0)) · 0.7 + (4 + J_2(0)) · 0.2 = 2.8.
u_1 = 1: 1 + (1 + J_2(1)) · 0.1 + (0 + J_2(0)) · 0.7 + (1 + J_2(0)) · 0.2 = 2.5.
u_1 = 2: 2 + (4 + J_2(2)) · 0.1 + (1 + J_2(1)) · 0.7 + (0 + J_2(0)) · 0.2 = 3.68.
⇒ J_1(0) = 2.5 and μ*_1(0) = 1.
34 Inventory Example
J_1(1) = min_{u_1 ∈ {0,1}} E{ u_1 + (1 + u_1 − w_1)^2 + J_2(max(0, 1 + u_1 − w_1)) }.
u_1 = 0: 1.5 (check!). u_1 = 1: 2.68. ⇒ J_1(1) = 1.5 and μ*_1(1) = 0.
J_1(2) = 1.68, μ*_1(2) = 0 (check!).
Period 0: Compute J_0(x_0) for all possible x_0 (tutorial problem).
Solution: J_0(0) = 3.7, J_0(1) = 2.7, J_0(2) = 2.818; μ*_0(0) = 1, μ*_0(1) = 0, μ*_0(2) = 0.
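The stage-by-stage computations above can be reproduced with a short script, a direct transliteration of the D.P. recursion for this example:

```python
# Backward D.P. for the inventory example: N = 3, storage limit x + u <= 2,
# demand P(w=0)=0.1, P(w=1)=0.7, P(w=2)=0.2, stage cost u + (x+u-w)^2.

P_W = {0: 0.1, 1: 0.7, 2: 0.2}
N = 3
STATES = (0, 1, 2)

J = {N: {x: 0.0 for x in STATES}}       # terminal cost g_3 = 0
mu = {}
for k in range(N - 1, -1, -1):          # k = 2, 1, 0
    J[k], mu[k] = {}, {}
    for x in STATES:
        best = None
        for u in range(0, 3 - x):       # storage constraint x + u <= 2
            cost = sum(p * (u + (x + u - w) ** 2
                            + J[k + 1][max(0, x + u - w)])
                       for w, p in P_W.items())
            if best is None or cost < best[0]:
                best = (cost, u)
        J[k][x], mu[k][x] = best

print(round(J[2][0], 2), mu[2][0])   # 1.3 1, matching the Period 2 slide
```

Running it reproduces the tables in the slides, e.g. J_2(0) = 1.3 with μ*_2(0) = 1, and J_0(0) = 3.7 with μ*_0(0) = 1.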
35 Scheduling Example Example: Scheduling problem (deterministic problem). Four operations need to be performed: A, B, C, D. B has to occur after A; D has to occur after C. Costs: c_AB = 2, c_AC = 3, c_AD = 4, c_BC = 3, c_BD = 1, c_CA = 4, c_CB = 4, c_CD = 6, c_DA = 3, c_DB = 3. Startup costs: S_A = 5, S_C = 3. What is the optimal order?
36 Scheduling Example Figure: Scheduling (state transition tree over the partial sequences A, C, AB, AC, CA, CD,..., with arc costs; minimum cost-to-go shown in red)
37 Scheduling Example Use the D.P. algorithm. Let state = set of operations already performed (see Figure Scheduling). No terminal costs for this problem.
Tail subproblems of length 1: easy, only one choice at each state, e.g. if the state is ACD, the next operation has to be B.
Tail subproblems of length 2:
State AB: only one choice, next operation is C.
State AC: if next operation is B, cost = c_CB + c_BD = 4 + 1 = 5; if next operation is D, cost = c_CD + c_DB = 6 + 3 = 9. Choose B.
State CA: if next operation is B, cost = c_AB + c_BD = 2 + 1 = 3; if next operation is D, cost = c_AD + c_DB = 4 + 3 = 7. Choose B.
State CD: only one choice, next operation is A.
38 Scheduling Example Tail subproblems of length 3:
State A: if next operation is B, cost = c_AB + 9 = 2 + 9 = 11; if next operation is C, cost = c_AC + 5 = 3 + 5 = 8. Choose C.
State C: if next operation is A, cost = c_CA + 3 = 4 + 3 = 7; if next operation is D, cost = c_CD + 5 = 6 + 5 = 11. Choose A.
Original problem of length 4:
If start with A: cost = S_A + 8 = 5 + 8 = 13. If start with C: cost = S_C + 7 = 3 + 7 = 10. Choose C.
Therefore, the optimal sequence is CABD, and the optimal cost is 10.
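Because the problem is small and deterministic, the D.P. answer can be cross-checked by brute-force enumeration of all feasible orderings (a sanity check, not part of the notes):

```python
# Brute-force check of the scheduling example: enumerate all orderings of
# A, B, C, D with B after A and D after C, using the costs given above.
from itertools import permutations

c = {("A","B"): 2, ("A","C"): 3, ("A","D"): 4, ("B","C"): 3, ("B","D"): 1,
     ("C","A"): 4, ("C","B"): 4, ("C","D"): 6, ("D","A"): 3, ("D","B"): 3}
startup = {"A": 5, "C": 3}      # only A or C can be performed first

best = None
for order in permutations("ABCD"):
    if order.index("B") < order.index("A"):   # B must occur after A
        continue
    if order.index("D") < order.index("C"):   # D must occur after C
        continue
    cost = startup[order[0]] + sum(c[(order[i], order[i + 1])]
                                   for i in range(3))
    if best is None or cost < best[0]:
        best = (cost, "".join(order))

print(best)   # (10, 'CABD'), agreeing with the D.P. solution
```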
39 Proof that D.P. Algorithm gives Optimal Solution Basic problem:
min_π E{ Σ_{k=0}^{N−1} g_k(x_k, μ_k(x_k), w_k) + g_N(x_N) }.
D.P. algorithm: For each possible x_k, compute
J_N(x_N) = g_N(x_N),
J_k(x_k) = min_{u_k ∈ U_k(x_k)} E{ g_k(x_k, u_k, w_k) + J_{k+1}(f_k(x_k, u_k, w_k)) }, for k = N−1, N−2,..., 1, 0.
Theorem: 1. The optimal cost J*(x_0) = J_0(x_0), where J_0(x_0) is the quantity computed by the D.P. algorithm. 2. Let μ*_k(·) be the function that generates the minimizing u_k in the D.P. algorithm, i.e. μ*_k(x_k) = u*_k. Then {μ*_0, μ*_1,..., μ*_{N−1}} is the optimal policy for the basic problem.
40 Proof that D.P. Algorithm gives Optimal Solution Notation: Given a policy π = (μ_0, μ_1,..., μ_{N−1}), let π^k = (μ_k, μ_{k+1},..., μ_{N−1}) be the tail policy and let
J*_k(x_k) = min_{π^k} E{ Σ_{i=k}^{N−1} g_i(x_i, μ_i(x_i), w_i) + g_N(x_N) }
be the optimal cost for the tail subproblem. Let J_k(x_k) = quantity computed by the D.P. algorithm. Want to show that J*_k(x_k) = J_k(x_k) for all x_k, k. Proof is by mathematical induction.
Initial step (k = N): By definition of J*_k(x_k), J*_N(x_N) = g_N(x_N). By definition of the D.P. algorithm, J_N(x_N) = g_N(x_N). ⇒ J*_N(x_N) = J_N(x_N).
41 Proof that D.P. Algorithm gives Optimal Solution Induction step: Assume J*_l(x_l) = J_l(x_l) for l = N, N−1,..., k+1. Want to show that J*_k(x_k) = J_k(x_k). From the definition of J*_k(x_k),
J*_k(x_k) = min_{π^k} E{ Σ_{i=k}^{N−1} g_i(x_i, μ_i(x_i), w_i) + g_N(x_N) }
= min_{(μ_k, π^{k+1})} E{ g_k(x_k, μ_k(x_k), w_k) + Σ_{i=k+1}^{N−1} g_i(x_i, μ_i(x_i), w_i) + g_N(x_N) }
= min_{μ_k} E{ g_k(x_k, μ_k(x_k), w_k) + min_{π^{k+1}} E[ Σ_{i=k+1}^{N−1} g_i(x_i, μ_i(x_i), w_i) + g_N(x_N) ] }
by the D.P. principle (optimize the tail subproblem, then μ_k).
42 Proof that D.P. Algorithm gives Optimal Solution
= min_{μ_k} E{ g_k(x_k, μ_k(x_k), w_k) + J*_{k+1}(f_k(x_k, μ_k(x_k), w_k)) }  (by definition of J*_{k+1}(x_{k+1}))
= min_{μ_k} E{ g_k(x_k, μ_k(x_k), w_k) + J_{k+1}(f_k(x_k, μ_k(x_k), w_k)) }  (by the induction hypothesis)
= min_{u_k ∈ U_k(x_k)} E{ g_k(x_k, u_k, w_k) + J_{k+1}(f_k(x_k, u_k, w_k)) }  (using the fact that min_μ F(x, μ(x)) = min_{u ∈ U(x)} F(x, u))
= J_k(x_k)  (from the D.P. algorithm equations).
So J*_k(x_k) = J_k(x_k), and μ*_k(x_k) = u*_k is the optimal policy. By induction, this is true for k = N, N−1,..., 1, 0. In particular, J*(x_0) = J*_0(x_0) = J_0(x_0) is the optimal cost.
43 Shortest Paths in a Trellis Figure 6: Shortest paths in a trellis (initial state s, artificial terminal state t, stages 0 to N). Find the shortest path from s to t. States ↔ nodes; controls ↔ arcs. a^k_ij: cost of the transition from state i at stage k to state j at stage k+1. a^N_it: terminal cost of state i. Cost function = length of the path from s to t.
44 Shortest Paths in a Trellis D.P. algorithm:
J_N(i) = a^N_it,
J_k(i) = min_j [ a^k_ij + J_{k+1}(j) ], k = N−1,..., 1, 0.
Optimal cost = J_0(s) = length of the shortest path from s to t. Example: Find the shortest path from stage 1 to stage 3 in Figure 7 (shortest path shown in red). Figure 7: Shortest paths example
45 Shortest Paths in a Trellis Redraw as a trellis with an initial node s and a terminal node t (the arcs from s and into t have cost 0), see Figure 8. Figure 8: Redrawn shortest paths example Here N = 3. Call the top node state 1 and the bottom node state 2. Stage N: J_3(1) = 0, J_3(2) = 0.
46 Shortest Paths in a Trellis
Stage 2: J_2(1) = min{ a^2_{11} + J_3(1), a^2_{12} + J_3(2) } = 100; J_2(2) = min{ a^2_{21} + J_3(1), a^2_{22} + J_3(2) } = 350.
Stage 1: J_1(1) = min{ a^1_{11} + J_2(1), a^1_{12} + J_2(2) } = 400; J_1(2) = min{ a^1_{21} + J_2(1), a^1_{22} + J_2(2) } = 250.
Stage 0: J_0(s) = min{ 0 + J_1(1), 0 + J_1(2) } = 250.
The shortest path for the original problem is shown in red in Figure 7.
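The trellis recursion J_k(i) = min_j [a^k_ij + J_{k+1}(j)] is a few lines of code. The arc costs below are illustrative placeholders (the numeric labels of Figures 7 and 8 did not survive transcription), not the values from the figures:

```python
# Backward D.P. on a small two-state trellis.
# a[k][i][j]: cost of moving from state i at stage k to state j at stage k+1.
a = [
    [[100, 200], [300, 150]],   # stage 0 -> stage 1 (illustrative costs)
    [[ 50, 400], [250, 100]],   # stage 1 -> stage 2 (illustrative costs)
]
terminal = [0, 0]               # a^N_it: terminal arc costs into t

K = len(a)                      # number of transition stages
J = [None] * (K + 1)
nxt = [None] * K                # nxt[k][i]: best successor of state i at stage k
J[K] = list(terminal)
for k in range(K - 1, -1, -1):
    J[k], nxt[k] = [], []
    for i in range(len(a[k])):
        costs = [a[k][i][j] + J[k + 1][j] for j in range(len(J[k + 1]))]
        J[k].append(min(costs))
        nxt[k].append(costs.index(min(costs)))

print(J[0])   # [150, 250]: shortest path length from each stage-0 state
```

Following `nxt` forward from the best stage-0 state recovers the shortest path itself.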
47 Forward D.P. Algorithm Observe that the optimal path s → t is also the optimal path t → s if the directions of the arcs are reversed. The shortest path algorithm can therefore be run forwards in time (see Bertsekas for the equations). Figure 9 shows the result of forward D.P. on the shortest paths example. Forward D.P. is useful in real-time applications, where data arrives just before you need to make a decision; the Viterbi algorithm uses this idea. Shortest paths is a deterministic problem, so forward D.P. works. For stochastic problems there is no such concept of forward D.P.: it is impossible to guarantee that any given state can be reached.
48 Forward D.P. Algorithm Figure 9: Forward D.P. on shortest paths example
49 Viterbi Algorithm Applications Estimation of hidden Markov models (HMMs): x_k = Markov chain; the state transitions in x_k are not observed (hidden). We observe z_k, with r(z; i, j) = probability of observing z given a transition of the Markov chain x_k from state i to state j. Estimation problem: Given Z_N = {z_1, z_2,..., z_N}, find the sequence X̂_N = {x̂_0, x̂_1,..., x̂_N} over all possible {x_0, x_1,..., x_N} that maximizes P(X_N | Z_N). Note that P(X_N | Z_N) = P(X_N, Z_N) / P(Z_N), and P(Z_N) is constant given Z_N, so
max_{x_0,...,x_N} P(X_N | Z_N) ⇔ max_{x_0,...,x_N} P(X_N, Z_N) ⇔ max_{x_0,...,x_N} ln P(X_N, Z_N).
50 Viterbi Algorithm Applications After some calculations (see Bertsekas), one can show that the problem is equivalent to:
min_{x_0,...,x_N} [ −ln(π_{x_0}) − Σ_{k=1}^{N} ln( π_{x_{k−1} x_k} r(z_k, x_{k−1}, x_k) ) ],
where π_{x_0} = probability of the initial state, π_{x_{k−1} x_k} = transition probabilities of the Markov chain, and −ln π_{x_0} and −ln(π_{x_{k−1} x_k} r(z_k, x_{k−1}, x_k)) can be regarded as lengths of the different stages ⇒ a shortest path problem through a trellis. Other applications: decoding of convolutional codes; channel equalization in the presence of ISI (inter-symbol interference).
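A sketch of the resulting shortest-path computation: the Viterbi recursion minimizes the accumulated negative log-probabilities stage by stage. The two-state chain and observation model below are made up for illustration and are not from the notes:

```python
# Viterbi as a shortest-path problem: minimize
#   -ln(pi_{x0}) - sum_k ln(pi_{x_{k-1} x_k} * r(z_k, x_{k-1}, x_k)).
import math

pi0 = [0.6, 0.4]                      # initial state probabilities (assumed)
P = [[0.7, 0.3], [0.4, 0.6]]          # transition probabilities (assumed)
# r[z][i][j]: probability of observing z given a transition i -> j (assumed)
r = {0: [[0.9, 0.2], [0.8, 0.3]],
     1: [[0.1, 0.8], [0.2, 0.7]]}

def viterbi(z_seq):
    n = len(pi0)
    cost = [-math.log(pi0[i]) for i in range(n)]   # path lengths so far
    back = []                                      # backpointers per stage
    for z in z_seq:
        new_cost, pred = [], []
        for j in range(n):
            cands = [cost[i] - math.log(P[i][j] * r[z][i][j])
                     for i in range(n)]
            new_cost.append(min(cands))
            pred.append(cands.index(min(cands)))
        cost, back = new_cost, back + [pred]
    # backtrack the most likely state sequence
    x = [cost.index(min(cost))]
    for pred in reversed(back):
        x.append(pred[x[-1]])
    return list(reversed(x))

print(viterbi([0, 0, 1]))   # [0, 0, 0, 1]
```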
51 General Shortest Path Problems No trellis structure, e.g. find the shortest path from each node to node 5 in Figure 10. Figure 10: General shortest path problem Graph with N + 1 nodes {1, 2,..., N, t}; a_ij = cost of moving from node i to node j. Find the shortest path from each node i to node t.
52 General Shortest Path Problems Assume some a_ij's can be negative, but all cycles have non-negative length. Then the shortest path will not involve more than N arcs. Reformulate as a trellis-type shortest path problem with N arcs, by allowing arcs from node i to itself with cost a_ii = 0.
D.P. algorithm: J_{N−1}(i) = a_it; J_k(i) = min_j { a_ij + J_{k+1}(j) }, k = N−2,..., 1, 0.
This algorithm is essentially the Bellman-Ford algorithm. Other algorithms have also been invented, e.g. Dijkstra's algorithm, which can be used when all the a_ij's are non-negative.
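The reformulation with zero-cost self-loops can be sketched directly; the small graph below is illustrative, not the one in Figure 10:

```python
# D.P. form of Bellman-Ford: with self-loop costs a_ii = 0,
# iterate J_k(i) = min_j { a_ij + J_{k+1}(j) } backwards over N-1 stages.
INF = float("inf")

def bellman_ford_dp(a, a_t):
    """a[i][j]: arc cost i -> j (INF if no arc, a[i][i] = 0);
    a_t[i]: cost of the final arc from node i to the terminal node t."""
    n = len(a)
    J = list(a_t)                     # J_{N-1}(i) = a_it
    for _ in range(n - 1):            # k = N-2, ..., 0
        J = [min(a[i][j] + J[j] for j in range(n)) for i in range(n)]
    return J                          # J[i] = shortest path length i -> t

# illustrative 3-node graph: 0 -> 1 (cost 1), 1 -> 2 (cost 2), plus arcs to t
a = [[0, 1, INF],
     [INF, 0, 2],
     [INF, INF, 0]]
a_t = [10, 5, 1]
print(bellman_ford_dp(a, a_t))   # [4, 3, 1]
```

Here the shortest path from node 0 is 0 → 1 → 2 → t with length 1 + 2 + 1 = 4, beating the direct arc of cost 10.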
53 Outline 3 Problems with Perfect State Information Linear Quadratic Control Optimal Stopping Problems
54 Problems with Perfect State Information Will study some problems where analytical solutions can be obtained: linear quadratic control; optimal stopping problems; + others in Chapter 4 of Bertsekas.
55 Linear Quadratic Control (Linear) system: x_{k+1} = A x_k + B u_k + w_k, k = 0, 1,..., N−1. (Quadratic) cost function:
E{ Σ_{k=0}^{N−1} (x_k^T Q x_k + u_k^T R u_k) + x_N^T Q x_N }.
Problem: Determine the optimal policy to minimize the cost function. x_k, u_k, w_k are column vectors; A, B, Q, R are matrices; the w_k are independent and zero mean; Q is positive semi-definite; R is positive definite.
56 Linear Quadratic Control Definition: A symmetric matrix M is positive semi-definite if x^T M x ≥ 0 for all vectors x; M is positive definite if x^T M x > 0 for all x ≠ 0. One characterization: M is positive semi-definite ⇔ all eigenvalues of M are ≥ 0; M is positive definite ⇔ all eigenvalues of M are > 0. The D.P. algorithm applied to this problem gives:
J_N(x_N) = x_N^T Q x_N,
J_k(x_k) = min_{u_k} E{ x_k^T Q x_k + u_k^T R u_k + J_{k+1}(A x_k + B u_k + w_k) }, k = N−1,..., 1, 0.
57 Linear Quadratic Control It turns out that the minimization can be done analytically.
J_{N−1}(x_{N−1}) = min_{u_{N−1}} E{ x_{N−1}^T Q x_{N−1} + u_{N−1}^T R u_{N−1} + (A x_{N−1} + B u_{N−1} + w_{N−1})^T Q (A x_{N−1} + B u_{N−1} + w_{N−1}) }.
Expanding the quadratic, all cross terms involving w_{N−1} vanish in expectation since E[w_{N−1}] = 0, leaving
J_{N−1}(x_{N−1}) = x_{N−1}^T (A^T Q A + Q) x_{N−1} + E{ w_{N−1}^T Q w_{N−1} } + min_{u_{N−1}} { u_{N−1}^T (R + B^T Q B) u_{N−1} + 2 x_{N−1}^T A^T Q B u_{N−1} }.
58 Linear Quadratic Control Digression Problem: min_x f(x). How to solve? For unconstrained scalar problems, differentiate and set the derivative equal to 0, e.g. min_x (x−2)^2: d/dx (x−2)^2 = 2(x−2) = 0 ⇒ x = 2. Similarly, differentiate u_{N−1}^T (R + B^T Q B) u_{N−1} + 2 x_{N−1}^T A^T Q B u_{N−1} with respect to the vector u_{N−1} and set it equal to zero. Note that
∂(u^T A u)/∂u = 2 A u,  ∂(a^T u)/∂u = a,
where a and u are column vectors and A is a symmetric matrix. Using these formulas, we obtain
2(R + B^T Q B) u_{N−1} + 2 B^T Q A x_{N−1} = 0 ⇒ u*_{N−1} = −(R + B^T Q B)^{−1} B^T Q A x_{N−1}.
59 Linear Quadratic Control Substituting u*_{N−1} = −(R + B^T Q B)^{−1} B^T Q A x_{N−1} back into the expression for J_{N−1}(x_{N−1}), we obtain
J_{N−1}(x_{N−1}) = x_{N−1}^T (A^T Q A + Q) x_{N−1} + E{ w_{N−1}^T Q w_{N−1} } + x_{N−1}^T A^T Q B (R + B^T Q B)^{−1} (R + B^T Q B) (R + B^T Q B)^{−1} B^T Q A x_{N−1} − 2 x_{N−1}^T A^T Q B (R + B^T Q B)^{−1} B^T Q A x_{N−1}
= x_{N−1}^T (A^T Q A + Q) x_{N−1} − x_{N−1}^T A^T Q B (R + B^T Q B)^{−1} B^T Q A x_{N−1} + E{ w_{N−1}^T Q w_{N−1} }
= x_{N−1}^T (A^T Q A + Q − A^T Q B (R + B^T Q B)^{−1} B^T Q A) x_{N−1} + E{ w_{N−1}^T Q w_{N−1} }
= x_{N−1}^T K_{N−1} x_{N−1} + E{ w_{N−1}^T Q w_{N−1} },
with K_{N−1} = A^T Q A + Q − A^T Q B (R + B^T Q B)^{−1} B^T Q A.
60 Linear Quadratic Control Continuing on, one can show that u*_{N−2} = −(B^T K_{N−1} B + R)^{−1} B^T K_{N−1} A x_{N−2}, and more generally (tutorial problem) that
μ*_k(x_k) = −(B^T K_{k+1} B + R)^{−1} B^T K_{k+1} A x_k,
where K_N = Q and
K_k = A^T K_{k+1} A + Q − A^T K_{k+1} B (B^T K_{k+1} B + R)^{−1} B^T K_{k+1} A.
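As a scalar illustration of this Riccati recursion (our own example, not from the notes): with A = B = Q = R = 1, the recursion becomes K_k = K_{k+1} + 1 − K_{k+1}^2/(K_{k+1} + 1), whose fixed point K satisfies K^2 = K + 1, i.e. the golden ratio:

```python
# Scalar Riccati recursion K_k = A^2 K' + Q - (A K' B)^2 / (B^2 K' + R),
# iterated backwards from K_N = Q until it converges.
A, B, Q, R = 1.0, 1.0, 1.0, 1.0

def riccati_step(K_next):
    return A * K_next * A + Q - (A * K_next * B) ** 2 / (B * K_next * B + R)

def gain(K_next):
    # mu*_k(x) = L_k x with L_k = -(B^T K_{k+1} B + R)^{-1} B^T K_{k+1} A
    return -(B * K_next * A) / (B * K_next * B + R)

K = Q                    # K_N = Q
for _ in range(50):      # backward in time; K_k converges as the horizon grows
    K = riccati_step(K)

print(round(K, 6))        # 1.618034, i.e. (1 + sqrt(5)) / 2
print(round(gain(K), 6))  # -0.618034; closed loop A + B*L ~ 0.382, stable
```

The limiting K and gain here anticipate the asymptotic behaviour discussed on the next slides: K solves the algebraic Riccati equation and |A + BL| < 1.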
61 Certainty Equivalence Certainty equivalence: the optimal policy is the same as solving the problem for the deterministic system x_{k+1} = A x_k + B u_k + E[w_k], where w_k is replaced by its expected value E[w_k] = 0, i.e. the standard LQR problem.
62 Asymptotic Behaviour Definition: A pair of matrices (A, B), where A is n × n and B is n × m, is controllable if the n × nm matrix [B AB A^2 B ... A^{n−1} B] has full rank (all rows linearly independent). A pair (A, C), where A is n × n and C is m × n, is observable if (A^T, C^T) is controllable.
63 Asymptotic Behaviour Theorem: If (A, B) is controllable and Q can be written as Q = C^T C, where (A, C) is observable, then:
1. K_k → K as the horizon N − k → ∞, with K satisfying the algebraic Riccati equation K = A^T K A + Q − A^T K B (B^T K B + R)^{−1} B^T K A.
2. The steady-state controller μ*(x_k) = L x_k, where L = −(B^T K B + R)^{−1} B^T K A, stabilizes the system, i.e. the eigenvalues of A + BL have magnitude < 1.
Proof: See Bertsekas. Note: If u_k = L x_k, then x_{k+1} = A x_k + B u_k + w_k = (A + BL) x_k + w_k; x_k stays bounded when the eigenvalues of A + BL have magnitude < 1.
64 Other Variations x_{k+1} = A_k x_k + B_k u_k + w_k, with A_k, B_k random, unknown, independent. Optimal policy:
μ*_k(x_k) = −(R + E{B_k^T K_{k+1} B_k})^{−1} E{B_k^T K_{k+1} A_k} x_k,
where K_N = Q and
K_k = E{A_k^T K_{k+1} A_k} + Q − E{A_k^T K_{k+1} B_k} (E{B_k^T K_{k+1} B_k} + R)^{−1} E{B_k^T K_{k+1} A_k}.
May not have certainty equivalence; may not have a steady-state solution. Another variation: x_{k+1} = A x_k + B_k u_k + w_k, where B_k is random, independent, and is only revealed to us at time k. Motivation: wireless channels. Similar to Leong, Dey, Anand, "Optimal LQG control over continuous fading channels", Proc. IFAC World Congress.
65 Optimal Stopping Problems At each state, there is a stop control that stops the system, i.e. moves to, and stays in, a stop state. Pure stopping problem: the only other control is continue. For pure stopping problems, the policy is characterized by a partition of the set of states into a stop region and a continue region, which may depend on time.
66 Example (Asset selling) A person has an asset for sale, e.g. a house. At each time k = 0, 1,..., N−1, the person receives a random offer w_k for the asset. Assume the w_k's are independent. Either accept w_k at time k+1 and invest the money at interest rate r, or reject w_k and wait for the offer w_{k+1}. Must accept the last offer w_{N−1} at time N if every previous offer was rejected. Find the policy that maximizes the (expected) revenue at the N-th period.
67 Example (Asset selling) States: x_k = T: asset already sold (= stop state); x_k = w_{k−1}: offer currently under consideration. Controls: {accept, reject}. The system evolves as
x_{k+1} = f_k(x_k, w_k, u_k) = T, if 1) x_k = T, or 2) x_k ≠ T and u_k = accept; w_k, otherwise.
68 Example (Asset selling) Rewards at time k:
g_N(x_N) = x_N, if x_N ≠ T; 0, otherwise.
g_k(x_k, u_k, w_k) = (1+r)^{N−k} x_k, if x_k ≠ T and u_k = accept; 0, otherwise.
(For compound interest over n years, final amount = (1+r)^n × initial amount.) Note: From the way the rewards are defined, g_k is non-zero for only one k ∈ {0, 1,..., N−1}.
69 Example (Asset selling)
Expected total reward = E[ Σ_{k=0}^{N-1} g_k(x_k, u_k, w_k) + g_N(x_N) ]
D.P. algorithm (for reward maximization):
J_N(x_N) = g_N(x_N) = x_N, if x_N ≠ T; 0, otherwise.
J_k(x_k) = max_{u_k} E[ g_k(x_k, u_k, w_k) + J_{k+1}(x_{k+1}) ]
70 Example (Asset selling)
If x_k = T, then g_k(x_k, u_k, w_k) = 0 and J_{k+1}(x_{k+1}) = 0, by the property that g_k is non-zero for only one k (the reward was incurred prior to time k).
If x_k ≠ T, then
E[ g_k(x_k, u_k, w_k) + J_{k+1}(x_{k+1}) ] = (1 + r)^{N-k} x_k, if u_k = accept; 0 + E[J_{k+1}(w_k)], if u_k = reject.
So
J_k(x_k) = max_{u_k} E[ g_k(x_k, u_k, w_k) + J_{k+1}(x_{k+1}) ]
         = max( (1 + r)^{N-k} x_k, E[J_{k+1}(w_k)] ), if x_k ≠ T; 0, if x_k = T,
and the optimal policy is of the form: u_k = accept if (1 + r)^{N-k} x_k > E[J_{k+1}(w_k)], i.e.
u_k = accept, if x_k > E[J_{k+1}(w_k)] / (1 + r)^{N-k}; reject, otherwise.
71 Example (Asset selling)
Let α_k = E[J_{k+1}(w_k)] / (1 + r)^{N-k}.
Can show (see Bertsekas) that α_k ≥ α_{k+1} for all k if the w_k are i.i.d.
Intuition: an offer acceptable at time k should also be acceptable at time k + 1.
Figure 11: Asset selling. The thresholds α_1 ≥ α_2 ≥ ... ≥ α_{N-1} decrease with k; accept above the threshold curve, reject below it.
72 Example (Asset selling)
Can also show that if the w_k are i.i.d. and N → ∞, the optimal policy converges to the stationary policy
u_k = accept, if x_k > ᾱ; reject, if x_k ≤ ᾱ,
where ᾱ is a constant.
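The thresholds α_k can be computed by backward induction once an offer distribution is fixed. A minimal sketch, assuming hypothetical i.i.d. offers uniform on a small grid and a hypothetical rate r:

```python
# Backward induction for the asset-selling thresholds alpha_k.
# Hypothetical setup: i.i.d. offers uniform on {0, 0.1, ..., 1.0}, r = 5%.
N, r = 10, 0.05
offers = [i / 10 for i in range(11)]

J = {w: w for w in offers}   # J_N(x) = x for x != T
alphas = {}
for k in range(N - 1, 0, -1):
    EJ = sum(J.values()) / len(offers)           # E[J_{k+1}(w_k)]
    alphas[k] = EJ / (1 + r) ** (N - k)          # threshold alpha_k
    J = {w: max((1 + r) ** (N - k) * w, EJ) for w in offers}  # J_k

print([round(alphas[k], 3) for k in range(1, N)])
```

The printed sequence should be non-increasing in k, consistent with the property α_k ≥ α_{k+1} above.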
73 General Stopping Problems
Pure stopping problem: stop and continue are the only possible controls.
General stopping problem: stop, or choose a control u_k from U(x_k) (where U(x_k) has more than one element).
Consider the time-invariant case: f(x_k, u_k, w_k) and g(x_k, u_k, w_k) don't depend on k, and the w_k are i.i.d.
Stopping at time k incurs cost t(x_k). Must stop by the last stage.
D.P. algorithm:
J_N(x_N) = t(x_N),
J_k(x_k) = min[ t(x_k), min_{u_k ∈ U(x_k)} E{ g(x_k, u_k, w_k) + J_{k+1}(f(x_k, u_k, w_k)) } ]
Optimal to stop when
t(x_k) ≤ min_{u_k ∈ U(x_k)} E{ g(x_k, u_k, w_k) + J_{k+1}(f(x_k, u_k, w_k)) }
74 General Stopping Problems
Stopping set at time k (the set of states where we stop):
T_k = { x : t(x) ≤ min_{u ∈ U(x)} E[ g(x, u, w) + J_{k+1}(f(x, u, w)) ] }
Note that J_{N-1}(x) ≤ J_N(x) for all x, since J_N(x) = t(x) and
J_{N-1}(x) = min[ t(x), min_{u ∈ U(x)} E[ g(x, u, w) + J_N(f(x, u, w)) ] ] ≤ t(x) = J_N(x)
Can show that J_k(x) ≤ J_{k+1}(x) (monotonicity principle: tutorial problem).
Then we have T_0 ⊆ T_1 ⊆ T_2 ⊆ ... ⊆ T_k ⊆ T_{k+1} ⊆ ... ⊆ T_{N-1},
i.e. the set of states in which we stop increases with time.
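The nesting of the stopping sets can be checked numerically on a toy problem; the chain, stopping cost, continuation cost and horizon below are all hypothetical.

```python
# Toy general stopping problem on states {0,...,5}: stop cost t(x) = x,
# continuation cost g = 0.3, continue moves to x-1 or x+1 (clipped) w.p. 1/2.
states = list(range(6))
N, g = 5, 0.3
t = lambda x: float(x)
succ = lambda x: [max(x - 1, 0), min(x + 1, 5)]

J = {x: t(x) for x in states}          # J_N = t
stop_sets = []
for k in range(N - 1, -1, -1):
    cont = {x: g + sum(J[y] for y in succ(x)) / 2 for x in states}
    stop_sets.append({x for x in states if t(x) <= cont[x]})  # T_k
    J = {x: min(t(x), cont[x]) for x in states}

stop_sets.reverse()                     # stop_sets[k] = T_k
print(stop_sets)
```

The computed sets satisfy T_0 ⊆ T_1 ⊆ ... ⊆ T_{N-1}, as the monotonicity argument above predicts.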
75 Special Case
If f(x, u, w) ∈ T_{N-1} for all x ∈ T_{N-1}, u ∈ U(x), w (i.e. the set T_{N-1} is absorbing), then T_0 = T_1 = T_2 = ... = T_{N-1}.
Proof: see Bertsekas.
This simplifies the optimal policy; it is called the one-step lookahead policy.
76 Special Case
E.g. asset selling with past offers retained.
Same situation as before, except that previously rejected offers can be accepted at a later time.
The state evolves as x_{k+1} = max(x_k, w_k) (instead of x_{k+1} = w_k before).
Can show (see Bertsekas) that T_{N-1} = { x : x ≥ ᾱ } for some constant ᾱ.
This set is absorbing, since the best offer received so far cannot decrease over time.
So the optimal policy at every time k is to accept if the best offer exceeds ᾱ.
Have a constant threshold ᾱ even for a finite horizon N.
77 Outline
4 Problems with Imperfect State Information
Reformulation as Perfect State Information Problem
Linear Quadratic Control with Noisy Measurements
Sufficient Statistics
78 Problems with Imperfect State Information
The state x_k is not known to the controller. Instead we have noisy observations z_k of the form
z_0 = h_0(x_0, v_0), z_k = h_k(x_k, u_{k-1}, v_k), k = 1, 2, ..., N - 1,
where v_k is observation noise, with a probability distribution
P_{v_k}( · | x_0, ..., x_k, u_0, ..., u_{k-1}, w_0, ..., w_{k-1}, v_0, ..., v_{k-1})
which can depend on states, controls and disturbances.
Examples: h_k(x_k, u_{k-1}, v_k) = x_k + v_k, h_k(x_k, u_{k-1}, v_k) = sin(x_k) + u_{k-1} v_k
79 Problems with Imperfect State Information
The initial state x_0 is random with distribution P_{x_0}.
u_k ∈ U_k, where U_k does not depend on the (unknown) x_k.
Information vector, i.e. the information available to the controller at time k, defined as
I_0 = z_0, I_k = (z_0, ..., z_k, u_0, ..., u_{k-1}), k = 1, 2, ..., N - 1.
Policies π = (µ_0, ..., µ_{N-1}), where now µ_k(I_k) ∈ U_k (before: µ_k(x_k)).
80 Basic Problem with Imperfect State Information
Find π that minimizes the cost function
J_π = E{ Σ_{k=0}^{N-1} g_k(x_k, µ_k(I_k), w_k) + g_N(x_N) }
subject to the system equation x_{k+1} = f_k(x_k, µ_k(I_k), w_k)
and the measurement equation z_k = h_k(x_k, µ_{k-1}(I_{k-1}), v_k).
Question: how to solve this problem?
81 Reformulation as Perfect State Information Problem
Idea: define a new system where the state is I_k. Then we have a D.P. algorithm, etc.
By definition,
I_{k+1} = (z_0, ..., z_k, z_{k+1}, u_0, ..., u_{k-1}, u_k) = (I_k, z_{k+1}, u_k),
i.e. I_{k+1} = (I_k, u_k, z_{k+1}).
82 Reformulation as Perfect State Information Problem
Regard I_{k+1} = (I_k, u_k, z_{k+1}) as a dynamical system with state I_k, control u_k and disturbance z_{k+1}.
Next note that E[g_k(x_k, u_k, w_k)] = E[ E[g_k(x_k, u_k, w_k) | I_k, u_k] ] (recall that E[X] = E[E[X | Y]]).
Define g̃_k(I_k, u_k) = E[g_k(x_k, u_k, w_k) | I_k, u_k] = cost per stage of the new system, and g̃_N(I_N) = E[g_N(x_N) | I_N] = terminal cost.
The cost function becomes
E{ Σ_{k=0}^{N-1} g_k(x_k, µ_k(I_k), w_k) + g_N(x_N) } = E{ Σ_{k=0}^{N-1} g̃_k(I_k, µ_k(I_k)) + g̃_N(I_N) }.
83 Reformulation as Perfect State Information Problem
D.P. algorithm for the reformulated perfect state information problem:
J_N(I_N) = g̃_N(I_N) = E[g_N(x_N) | I_N]
J_k(I_k) = min_{u_k ∈ U_k} E{ g̃_k(I_k, u_k) + J_{k+1}(I_k, u_k, z_{k+1}) }
         = min_{u_k ∈ U_k} E{ g_k(x_k, u_k, w_k) + J_{k+1}(I_k, u_k, z_{k+1}) | I_k }, k = N - 1, ..., 0.
Optimal cost J* = E{ J_0(z_0) }.
84 Linear Quadratic Control with Noisy Measurements
System: x_{k+1} = A x_k + B u_k + w_k
Cost function: E[ Σ_{k=0}^{N-1} (x_k^T Q x_k + u_k^T R u_k) + x_N^T Q x_N ], where the summand is g_k(x_k, u_k, w_k) and the terminal term is g_N(x_N).
Observations: z_k = C x_k + v_k
The w_k are independent, zero mean.
From the D.P. algorithm: J_N(I_N) = E[ x_N^T Q x_N | I_N ].
85 Linear Quadratic Control with Noisy Measurements
J_{N-1}(I_{N-1}) = min_{u_{N-1}} E{ x_{N-1}^T Q x_{N-1} + u_{N-1}^T R u_{N-1} + E[ (A x_{N-1} + B u_{N-1} + w_{N-1})^T Q (A x_{N-1} + B u_{N-1} + w_{N-1}) | I_N ] | I_{N-1} }
= min_{u_{N-1}} E{ x_{N-1}^T Q x_{N-1} + u_{N-1}^T R u_{N-1} + (A x_{N-1} + B u_{N-1} + w_{N-1})^T Q (A x_{N-1} + B u_{N-1} + w_{N-1}) | I_{N-1} }
(using the tower property E(E(X | Y) | Z) = E(X | Z) when Y contains more information than Z)
= ... (expand, simplify and use E(w_{N-1} | I_{N-1}) = 0)
= E[ x_{N-1}^T (A^T Q A + Q) x_{N-1} | I_{N-1} ] + E[ w_{N-1}^T Q w_{N-1} | I_{N-1} ]
+ min_{u_{N-1}} { u_{N-1}^T (B^T Q B + R) u_{N-1} + 2 E[x_{N-1} | I_{N-1}]^T A^T Q B u_{N-1} }
Differentiate with respect to u_{N-1} and set equal to zero:
2 (B^T Q B + R) u_{N-1} + 2 B^T Q A E[x_{N-1} | I_{N-1}] = 0
⇒ u*_{N-1} = -(B^T Q B + R)^{-1} B^T Q A E[x_{N-1} | I_{N-1}]
86 Linear Quadratic Control with Noisy Measurements
Substituting the expression for u*_{N-1} back in:
J_{N-1}(I_{N-1}) = E[ x_{N-1}^T (A^T Q A + Q) x_{N-1} | I_{N-1} ] + E[ w_{N-1}^T Q w_{N-1} ]
+ E[x_{N-1} | I_{N-1}]^T A^T Q B (B^T Q B + R)^{-1} (B^T Q B + R) (B^T Q B + R)^{-1} B^T Q A E[x_{N-1} | I_{N-1}]
- 2 E[x_{N-1} | I_{N-1}]^T A^T Q B (B^T Q B + R)^{-1} B^T Q A E[x_{N-1} | I_{N-1}]
= E[ x_{N-1}^T (A^T Q A + Q) x_{N-1} | I_{N-1} ] + E[ w_{N-1}^T Q w_{N-1} ]
- E[x_{N-1} | I_{N-1}]^T A^T Q B (B^T Q B + R)^{-1} B^T Q A E[x_{N-1} | I_{N-1}]
= E[ x_{N-1}^T (A^T Q A + Q) x_{N-1} | I_{N-1} ] + E[ w_{N-1}^T Q w_{N-1} ]
+ E[ (x_{N-1} - E[x_{N-1} | I_{N-1}])^T A^T Q B (B^T Q B + R)^{-1} B^T Q A (x_{N-1} - E[x_{N-1} | I_{N-1}]) | I_{N-1} ]
- E[ x_{N-1}^T A^T Q B (B^T Q B + R)^{-1} B^T Q A x_{N-1} | I_{N-1} ],
with P_{N-1} = A^T Q B (B^T Q B + R)^{-1} B^T Q A.
87 Linear Quadratic Control with Noisy Measurements
We have
J_{N-1}(I_{N-1}) = E[ x_{N-1}^T K_{N-1} x_{N-1} | I_{N-1} ] + E[ w_{N-1}^T Q w_{N-1} ] + E[ (x_{N-1} - E[x_{N-1} | I_{N-1}])^T P_{N-1} (x_{N-1} - E[x_{N-1} | I_{N-1}]) | I_{N-1} ]
where
P_{N-1} = A^T Q B (B^T Q B + R)^{-1} B^T Q A
K_{N-1} = A^T Q A + Q - P_{N-1}.
88 Linear Quadratic Control with Noisy Measurements
For period N - 2,
J_{N-2}(I_{N-2}) = min_{u_{N-2}} E{ x_{N-2}^T Q x_{N-2} + u_{N-2}^T R u_{N-2} + J_{N-1}(I_{N-1}) | I_{N-2} }
= E{ x_{N-2}^T Q x_{N-2} | I_{N-2} } + min_{u_{N-2}} [ u_{N-2}^T R u_{N-2} + E{ x_{N-1}^T K_{N-1} x_{N-1} | I_{N-2} } ]
+ E[ (x_{N-1} - E[x_{N-1} | I_{N-1}])^T P_{N-1} (x_{N-1} - E[x_{N-1} | I_{N-1}]) | I_{N-2} ] + E( w_{N-1}^T Q w_{N-1} )
Then we can obtain u*_{N-2} = -(B^T K_{N-1} B + R)^{-1} B^T K_{N-1} A E[x_{N-2} | I_{N-2}]
Note that in the above the term E[ (x_{N-1} - E[x_{N-1} | I_{N-1}])^T P_{N-1} (x_{N-1} - E[x_{N-1} | I_{N-1}]) | I_{N-2} ] can be taken outside the minimization (see Bertsekas for proof).
Intuition: the estimation error x_k - E[x_k | I_k] can't be influenced by the choice of control.
89 Linear Quadratic Control with Noisy Measurements
Continuing on, the general solution is
µ*_k(I_k) = u*_k = -(B^T K_{k+1} B + R)^{-1} B^T K_{k+1} A E[x_k | I_k] = L_k E[x_k | I_k]
where
K_N = Q
P_k = A^T K_{k+1} B (B^T K_{k+1} B + R)^{-1} B^T K_{k+1} A
K_k = A^T K_{k+1} A + Q - P_k
Comparison with the perfect state information case:
The L_k matrix is the same.
x_k is replaced by E[x_k | I_k].
How to compute E[x_k | I_k]?
90 Linear Quadratic Control with Noisy Measurements
Summary so far:
System: x_{k+1} = A x_k + B u_k + w_k, z_k = C x_k + v_k
Problem: min E[ Σ_{k=0}^{N-1} (x_k^T Q x_k + u_k^T R u_k) + x_N^T Q x_N ]
The optimal solution is
µ*_k(I_k) = -(B^T K_{k+1} B + R)^{-1} B^T K_{k+1} A E[x_k | I_k] = L_k E[x_k | I_k]
where I_k = (z_0, ..., z_k, u_0, ..., u_{k-1}).
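The gain sequence L_k can be sketched by running the Riccati recursion backward; the scalar values of A, B, Q, R, N below are hypothetical.

```python
# Scalar backward Riccati recursion for the LQG control gains L_k.
# A, B, Q, R, N below are hypothetical illustration values.
A, B, Q, R, N = 1.1, 1.0, 1.0, 0.5, 20

K = Q  # K_N = Q
gains = []
for k in range(N - 1, -1, -1):
    L = -(B * K * A) / (B * K * B + R)       # L_k
    gains.append(L)
    P = (A * K * B) ** 2 / (B * K * B + R)   # P_k
    K = A * K * A + Q - P                    # K_k

gains.reverse()  # gains[k] = L_k, so u_k = gains[k] * E[x_k | I_k]
print(gains[0], K)
```

These are the same gains as in the perfect state information problem; only x_k is replaced by the estimate E[x_k | I_k].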
91 Linear Quadratic Control with Noisy Measurements
The optimal controller can be decomposed into two parts:
1) An estimator, which computes E[x_k | I_k].
2) An actuator, which multiplies E[x_k | I_k] by L_k.
L_k is the same gain matrix as in the perfect state information case; only x_k is replaced by E[x_k | I_k].
The estimator and actuator can be designed separately.
Known as the separation principle/theorem.
92 LQG Control
Remaining problem: how do we compute E[x_k | I_k]?
Very difficult problem in general (the subject is called non-linear filtering).
When the system is linear and w_k, v_k are Gaussian, E[x_k | I_k] can be computed analytically.
The procedure/algorithm is known as the Kalman filter (ref: Anderson and Moore, Optimal Filtering), and the overall controller is called the LQG (linear quadratic Gaussian) controller.
93 Kalman Filter
System: x_{k+1} = A x_k + B u_k + w_k, z_k = C x_k + v_k
w_k ~ N(0, Σ_w) i.i.d., Σ_w = E[w_k w_k^T]
v_k ~ N(0, Σ_v) i.i.d., Σ_v = E[v_k v_k^T]
Define the state estimates
x̂_{k|k} = E[x_k | I_k], x̂_{k+1|k} = E[x_{k+1} | I_k]
and the estimation error covariance matrices
Σ_{k|k} = E[ (x_k - x̂_{k|k})(x_k - x̂_{k|k})^T | I_k ]
Σ_{k+1|k} = E[ (x_{k+1} - x̂_{k+1|k})(x_{k+1} - x̂_{k+1|k})^T | I_k ]
94 Kalman Filter
Then x̂_{k|k}, x̂_{k+1|k}, Σ_{k|k}, Σ_{k+1|k} can be computed recursively using the Kalman filter equations:
x̂_{k|k} = x̂_{k|k-1} + Σ_{k|k-1} C^T (C Σ_{k|k-1} C^T + Σ_v)^{-1} (z_k - C x̂_{k|k-1})
x̂_{k+1|k} = A x̂_{k|k} + B u_k
Σ_{k|k} = Σ_{k|k-1} - Σ_{k|k-1} C^T (C Σ_{k|k-1} C^T + Σ_v)^{-1} C Σ_{k|k-1}
Σ_{k+1|k} = A Σ_{k|k} A^T + Σ_w, k = 0, 1, ..., N - 1
Proof: see Bertsekas, or Anderson and Moore.
Beware: many people who work in Kalman filtering use Q for Σ_w, R for Σ_v, and K_k for the Kalman gain Σ_{k|k-1} C^T (C Σ_{k|k-1} C^T + Σ_v)^{-1}, but here Q, R, K_k have been used for different things. People also use P_{k+1|k} for Σ_{k+1|k}, P_{k|k} for Σ_{k|k}, etc.
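A scalar sketch of the four recursions above, with hypothetical numbers and u_k = 0 so the B u_k terms drop out:

```python
import random

# Scalar Kalman filter for x_{k+1} = A x_k + w_k, z_k = C x_k + v_k.
# All numbers are hypothetical illustration values; u_k = 0 throughout.
random.seed(0)
A, C, Sw, Sv = 0.9, 1.0, 0.1, 0.2
N = 500

x = random.gauss(0.0, 1.0)     # true state, x_0 ~ N(0, 1)
xhat, Sigma = 0.0, 1.0         # xhat_{0|-1} and Sigma_{0|-1}
sq_err = 0.0
for k in range(N):
    z = C * x + random.gauss(0.0, Sv ** 0.5)
    # measurement update
    gain = Sigma * C / (C * Sigma * C + Sv)
    xhat_f = xhat + gain * (z - C * xhat)     # xhat_{k|k}
    Sigma_f = Sigma - gain * C * Sigma        # Sigma_{k|k}
    sq_err += (x - xhat_f) ** 2
    # time update (u_k = 0)
    x = A * x + random.gauss(0.0, Sw ** 0.5)
    xhat = A * xhat_f                          # xhat_{k+1|k}
    Sigma = A * Sigma_f * A + Sw               # Sigma_{k+1|k}

print(sq_err / N, Sigma_f)  # empirical MSE vs. steady-state Sigma_{k|k}
```

The empirical mean squared error settles near the filtered covariance Σ_{k|k}, which is the consistency one would expect from the definitions above.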
95 Kalman Filter Properties
In general, the mean squared error E[ (x_k - x̂_k)^T (x_k - x̂_k) | I_k ] is minimized when x̂_k = E[x_k | I_k].
The Kalman filter equations compute E[x_k | I_k] when the noises are Gaussian, and the (optimal) estimates are linear functions of the measurements z_k.
Even when the noises are not Gaussian, the x̂_{k|k} computed by the Kalman filter equations gives the best linear estimate of x_k.
So it is a useful suboptimal solution when the noises are non-Gaussian.
96 Kalman Filter Properties
Recall that if the pair (A, B) is controllable and (A, Q^{1/2}) is observable, the optimal controller has a steady-state solution.
Similarly, if (A, C) is observable and (A, Σ_w^{1/2}) is controllable, then Σ_{k|k-1} converges to a steady-state value Σ̄ as k → ∞, where Σ̄ satisfies the algebraic Riccati equation
Σ̄ = A Σ̄ A^T - A Σ̄ C^T (C Σ̄ C^T + Σ_v)^{-1} C Σ̄ A^T + Σ_w
So we have a steady-state estimator:
x̂_{k|k} = x̂_{k|k-1} + Σ̄ C^T (C Σ̄ C^T + Σ_v)^{-1} (z_k - C x̂_{k|k-1})
x̂_{k+1|k} = A x̂_{k|k} + B u_k
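One way to obtain Σ̄ numerically is simply to iterate the covariance recursions until Σ_{k+1|k} stops changing; a scalar sketch with hypothetical numbers:

```python
# Iterate the covariance recursion to the fixed point of the ARE
# (scalar sketch; A, C, Sw, Sv are hypothetical illustration values).
A, C, Sw, Sv = 0.9, 1.0, 0.1, 0.2

Sigma = 1.0  # any positive initialization
for _ in range(1000):
    Sf = Sigma - Sigma * C * (C * Sigma * C + Sv) ** -1 * C * Sigma
    Sigma = A * Sf * A + Sw

# The limit satisfies the algebraic Riccati equation:
resid = Sigma - (A * Sigma * A
                 - A * Sigma * C * (C * Sigma * C + Sv) ** -1 * C * Sigma * A
                 + Sw)
print(Sigma, resid)
```

The residual of the algebraic Riccati equation at the limit is numerically zero, confirming convergence to Σ̄.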
97 Sufficient Statistics
Information vector I_k = (z_0, ..., z_k, u_0, ..., u_{k-1}).
The dimension of I_k increases with time k. Inconvenient for large k.
Sufficient statistic: a function S_k(I_k) which summarizes all essential content in I_k for computing the optimal control, i.e. µ*_k(I_k) = µ̄_k(S_k(I_k)) for some function µ̄_k.
S_k(I_k) is preferably of smaller dimension than I_k.
98 Examples of Sufficient Statistics
1) I_k itself.
2) The conditional state distribution/belief state P_{x_k | I_k}, assuming that the distribution of v_k depends only on x_{k-1}, u_{k-1}, w_{k-1}.
If the number of states is finite then P_{x_k | I_k} is a vector, e.g. if the states are 1, 2, ..., n, then
P_{x_k | I_k} = ( P(x_k = 1 | I_k), P(x_k = 2 | I_k), ..., P(x_k = n | I_k) ).
The dimension of the vector is n, which doesn't grow with k.
3) Special case: E[x_k | I_k] is a sufficient statistic for the LQG problem (though not a sufficient statistic in general).
99 Conditional State Distribution
The conditional state distribution P_{x_k | I_k} can be generated recursively, as
P_{x_{k+1} | I_{k+1}} = Φ_k(P_{x_k | I_k}, u_k, z_{k+1})
for some function Φ_k(·, ·, ·). Then the D.P. algorithm can be written as
J_k(P_{x_k | I_k}) = min_{u_k ∈ U_k} E[ g_k(x_k, u_k, w_k) + J_{k+1}(Φ_k(P_{x_k | I_k}, u_k, z_{k+1})) | I_k ].
A general formula for Φ_k(·, ·, ·) can be derived, but it is quite complicated (see Bertsekas). We will derive some examples from first principles.
100 Example 1: Search Problem
At each period, decide whether to search a site that may contain a treasure.
If the treasure is present and we search, we find it with probability β and take it.
States: {treasure present, treasure not present}
Controls: {search, no search}
Regard each search result as an (imperfect) observation of the state.
Let p_k = probability that the treasure is present at the start of time k.
If we do not search, p_{k+1} = p_k.
If we search and find the treasure, p_{k+1} = 0.
101 Example 1
If we search and don't find the treasure,
p_{k+1} = P(treasure present at k | don't find at k)
        = P(treasure present at k and don't find at k) / P(don't find at k)
        = p_k (1 - β) / ( p_k (1 - β) + (1 - p_k) ),
with the (1 - p_k) term corresponding to the treasure not being present (and hence not found).
Thus
p_{k+1} = p_k, if we do not search at time k;
          0, if we search and find the treasure;
          p_k (1 - β) / ( p_k (1 - β) + (1 - p_k) ), if we search and don't find the treasure,
i.e. p_{k+1} = Φ_k(p_k, u_k, z_{k+1}) for an appropriate function Φ_k.
102 Example 1
Now let the treasure be worth V, let each search cost C, and suppose that once we decide not to search we can't search again at future times.
The D.P. algorithm gives:
J_k(p_k) = max over {no search, search} of
[ 0, -C + p_k β V + p_k β J_{k+1}(0) + (1 - p_k β) J_{k+1}( p_k (1 - β) / ( p_k (1 - β) + 1 - p_k ) ) ]
= max[ 0, -C + p_k β V + (1 - p_k β) J_{k+1}( p_k (1 - β) / ( p_k (1 - β) + 1 - p_k ) ) ]
(where p_k β J_{k+1}(0) = 0, since the treasure is already found)
Can show that J_k(p_k) = 0 for p_k ≤ C / (β V), and that it is optimal to search iff the expected reward p_k β V ≥ the cost of search C. (Tutorial problem)
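The belief recursion and the D.P. iteration above can be sketched directly; the values of V, C, β, N below are hypothetical.

```python
# Backward DP for the treasure-search problem on the belief p.
# V, C, beta, N are hypothetical illustration values.
V, C, beta, N = 10.0, 1.0, 0.5, 8

def next_p(p):
    # belief after searching and not finding the treasure
    return p * (1 - beta) / (p * (1 - beta) + (1 - p))

def J(k, p):
    if k == N:
        return 0.0
    search = -C + p * beta * V + (1 - p * beta) * J(k + 1, next_p(p))
    return max(0.0, search)  # max over {no search, search}

# Threshold: searching is worthwhile iff p*beta*V >= C, i.e. p >= 0.2 here.
print(J(0, 0.1), J(0, 0.5))
```

Below the threshold p = C/(βV) the value is 0 (never search); above it, searching yields a strictly positive expected reward, matching the tutorial claim.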
103 Example 2: Research Paper*
A process {P_{e,k}} evolves in the following way, for k = 1, ..., N:
P_{e,k+1} = P̄, if ν_{k+1} γ_{e,k+1} = 1; A P_{e,k} A^T + Q, if ν_{k+1} γ_{e,k+1} = 0,
where P̄, A, Q are some matrices.
{γ_{e,k}} is an i.i.d. Bernoulli process with P(γ_{e,k} = 1) = λ_e, P(γ_{e,k} = 0) = 1 - λ_e, for all k. ν_k ∈ {0, 1}.
{P_{e,k}} is not observed at all (no observation z_k).
*Leong, Quevedo, Dolz, Dey, "On Remote State Estimation in the Presence of an Eavesdropper," Proc. IFAC World Congress, 2017.
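A small Monte Carlo sketch of this process in the scalar case. All parameter values below are hypothetical, and ν_k = 1 is assumed for all k, which is a simplification not made in the paper.

```python
import random

# Monte Carlo sketch of the eavesdropper covariance process (scalar case).
# Pbar, A, Q, lam_e are hypothetical; nu_k = 1 is assumed for all k.
random.seed(1)
Pbar, A, Q, lam_e = 1.0, 1.2, 0.5, 0.6
N, runs = 30, 1000

total = 0.0
for _ in range(runs):
    P = Pbar  # P_{e,1}
    for k in range(N):
        if random.random() < lam_e:   # gamma_{e,k+1} = 1: reset to Pbar
            P = Pbar
        else:                         # gamma_{e,k+1} = 0: covariance grows
            P = A * P * A + Q
    total += P

print(total / runs)  # average eavesdropper error covariance at time N
```

Since A is unstable here, the covariance grows between resets, so the average stays above P̄; how large it gets is governed by the reception probability λ_e.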
More informationPakes (1986): Patents as Options: Some Estimates of the Value of Holding European Patent Stocks
Pakes (1986): Patents as Options: Some Estimates of the Value of Holding European Patent Stocks Spring 2009 Main question: How much are patents worth? Answering this question is important, because it helps
More information1 Answers to the Sept 08 macro prelim - Long Questions
Answers to the Sept 08 macro prelim - Long Questions. Suppose that a representative consumer receives an endowment of a non-storable consumption good. The endowment evolves exogenously according to ln
More informationMaking Decisions. CS 3793 Artificial Intelligence Making Decisions 1
Making Decisions CS 3793 Artificial Intelligence Making Decisions 1 Planning under uncertainty should address: The world is nondeterministic. Actions are not certain to succeed. Many events are outside
More informationSYLLABUS AND SAMPLE QUESTIONS FOR MS(QE) Syllabus for ME I (Mathematics), 2012
SYLLABUS AND SAMPLE QUESTIONS FOR MS(QE) 2012 Syllabus for ME I (Mathematics), 2012 Algebra: Binomial Theorem, AP, GP, HP, Exponential, Logarithmic Series, Sequence, Permutations and Combinations, Theory
More informationLecture Quantitative Finance Spring Term 2015
implied Lecture Quantitative Finance Spring Term 2015 : May 7, 2015 1 / 28 implied 1 implied 2 / 28 Motivation and setup implied the goal of this chapter is to treat the implied which requires an algorithm
More informationReinforcement Learning. Slides based on those used in Berkeley's AI class taught by Dan Klein
Reinforcement Learning Slides based on those used in Berkeley's AI class taught by Dan Klein Reinforcement Learning Basic idea: Receive feedback in the form of rewards Agent s utility is defined by the
More informationTrust Region Methods for Unconstrained Optimisation
Trust Region Methods for Unconstrained Optimisation Lecture 9, Numerical Linear Algebra and Optimisation Oxford University Computing Laboratory, MT 2007 Dr Raphael Hauser (hauser@comlab.ox.ac.uk) The Trust
More information1 Dynamic programming
1 Dynamic programming A country has just discovered a natural resource which yields an income per period R measured in terms of traded goods. The cost of exploitation is negligible. The government wants
More informationFinancial Mathematics III Theory summary
Financial Mathematics III Theory summary Table of Contents Lecture 1... 7 1. State the objective of modern portfolio theory... 7 2. Define the return of an asset... 7 3. How is expected return defined?...
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning Monte Carlo Methods Mark Schmidt University of British Columbia Winter 2019 Last Time: Markov Chains We can use Markov chains for density estimation, d p(x) = p(x 1 ) p(x }{{}
More informationCHOICE THEORY, UTILITY FUNCTIONS AND RISK AVERSION
CHOICE THEORY, UTILITY FUNCTIONS AND RISK AVERSION Szabolcs Sebestyén szabolcs.sebestyen@iscte.pt Master in Finance INVESTMENTS Sebestyén (ISCTE-IUL) Choice Theory Investments 1 / 65 Outline 1 An Introduction
More informationStochastic Games and Bayesian Games
Stochastic Games and Bayesian Games CPSC 532L Lecture 10 Stochastic Games and Bayesian Games CPSC 532L Lecture 10, Slide 1 Lecture Overview 1 Recap 2 Stochastic Games 3 Bayesian Games Stochastic Games
More informationLecture outline W.B.Powell 1
Lecture outline What is a policy? Policy function approximations (PFAs) Cost function approximations (CFAs) alue function approximations (FAs) Lookahead policies Finding good policies Optimizing continuous
More informationCS 188: Artificial Intelligence. Outline
C 188: Artificial Intelligence Markov Decision Processes (MDPs) Pieter Abbeel UC Berkeley ome slides adapted from Dan Klein 1 Outline Markov Decision Processes (MDPs) Formalism Value iteration In essence
More informationDrunken Birds, Brownian Motion, and Other Random Fun
Drunken Birds, Brownian Motion, and Other Random Fun Michael Perlmutter Department of Mathematics Purdue University 1 M. Perlmutter(Purdue) Brownian Motion and Martingales Outline Review of Basic Probability
More informationSOLVING ROBUST SUPPLY CHAIN PROBLEMS
SOLVING ROBUST SUPPLY CHAIN PROBLEMS Daniel Bienstock Nuri Sercan Özbay Columbia University, New York November 13, 2005 Project with Lucent Technologies Optimize the inventory buffer levels in a complicated
More informationOutline. 1 Introduction. 2 Algorithms. 3 Examples. Algorithm 1 General coordinate minimization framework. 1: Choose x 0 R n and set k 0.
Outline Coordinate Minimization Daniel P. Robinson Department of Applied Mathematics and Statistics Johns Hopkins University November 27, 208 Introduction 2 Algorithms Cyclic order with exact minimization
More information1 The EOQ and Extensions
IEOR4000: Production Management Lecture 2 Professor Guillermo Gallego September 16, 2003 Lecture Plan 1. The EOQ and Extensions 2. Multi-Item EOQ Model 1 The EOQ and Extensions We have explored some of
More informationMAT 4250: Lecture 1 Eric Chung
1 MAT 4250: Lecture 1 Eric Chung 2Chapter 1: Impartial Combinatorial Games 3 Combinatorial games Combinatorial games are two-person games with perfect information and no chance moves, and with a win-or-lose
More informationInformation Acquisition under Persuasive Precedent versus Binding Precedent (Preliminary and Incomplete)
Information Acquisition under Persuasive Precedent versus Binding Precedent (Preliminary and Incomplete) Ying Chen Hülya Eraslan January 9, 216 Abstract We analyze a dynamic model of judicial decision
More information16 MAKING SIMPLE DECISIONS
253 16 MAKING SIMPLE DECISIONS Let us associate each state S with a numeric utility U(S), which expresses the desirability of the state A nondeterministic action a will have possible outcome states Result(a)
More informationAM 121: Intro to Optimization Models and Methods
AM 121: Intro to Optimization Models and Methods Lecture 18: Markov Decision Processes Yiling Chen and David Parkes Lesson Plan Markov decision processes Policies and Value functions Solving: average reward,
More informationOPTIMAL PORTFOLIO CONTROL WITH TRADING STRATEGIES OF FINITE
Proceedings of the 44th IEEE Conference on Decision and Control, and the European Control Conference 005 Seville, Spain, December 1-15, 005 WeA11.6 OPTIMAL PORTFOLIO CONTROL WITH TRADING STRATEGIES OF
More information91.420/543: Artificial Intelligence UMass Lowell CS Fall 2010
91.420/543: Artificial Intelligence UMass Lowell CS Fall 2010 Lecture 17 & 18: Markov Decision Processes Oct 12 13, 2010 A subset of Lecture 9 slides from Dan Klein UC Berkeley Many slides over the course
More informationMarkov Decision Processes: Making Decision in the Presence of Uncertainty. (some of) R&N R&N
Markov Decision Processes: Making Decision in the Presence of Uncertainty (some of) R&N 16.1-16.6 R&N 17.1-17.4 Different Aspects of Machine Learning Supervised learning Classification - concept learning
More informationHomework Assignments
Homework Assignments Week 1 (p. 57) #4.1, 4., 4.3 Week (pp 58 6) #4.5, 4.6, 4.8(a), 4.13, 4.0, 4.6(b), 4.8, 4.31, 4.34 Week 3 (pp 15 19) #1.9, 1.1, 1.13, 1.15, 1.18 (pp 9 31) #.,.6,.9 Week 4 (pp 36 37)
More informationLecture 3: Factor models in modern portfolio choice
Lecture 3: Factor models in modern portfolio choice Prof. Massimo Guidolin Portfolio Management Spring 2016 Overview The inputs of portfolio problems Using the single index model Multi-index models Portfolio
More informationEE365: Markov Decision Processes
EE365: Markov Decision Processes Markov decision processes Markov decision problem Examples 1 Markov decision processes 2 Markov decision processes add input (or action or control) to Markov chain with
More informationDefinition 4.1. In a stochastic process T is called a stopping time if you can tell when it happens.
102 OPTIMAL STOPPING TIME 4. Optimal Stopping Time 4.1. Definitions. On the first day I explained the basic problem using one example in the book. On the second day I explained how the solution to the
More informationCS 343: Artificial Intelligence
CS 343: Artificial Intelligence Markov Decision Processes II Prof. Scott Niekum The University of Texas at Austin [These slides based on those of Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC
More informationApproximate Revenue Maximization with Multiple Items
Approximate Revenue Maximization with Multiple Items Nir Shabbat - 05305311 December 5, 2012 Introduction The paper I read is called Approximate Revenue Maximization with Multiple Items by Sergiu Hart
More informationThe Values of Information and Solution in Stochastic Programming
The Values of Information and Solution in Stochastic Programming John R. Birge The University of Chicago Booth School of Business JRBirge ICSP, Bergamo, July 2013 1 Themes The values of information and
More informationDynamic Appointment Scheduling in Healthcare
Brigham Young University BYU ScholarsArchive All Theses and Dissertations 2011-12-05 Dynamic Appointment Scheduling in Healthcare McKay N. Heasley Brigham Young University - Provo Follow this and additional
More informationMulti-period Portfolio Choice and Bayesian Dynamic Models
Multi-period Portfolio Choice and Bayesian Dynamic Models Petter Kolm and Gordon Ritter Courant Institute, NYU Paper appeared in Risk Magazine, Feb. 25 (2015) issue Working paper version: papers.ssrn.com/sol3/papers.cfm?abstract_id=2472768
More informationMartingales. by D. Cox December 2, 2009
Martingales by D. Cox December 2, 2009 1 Stochastic Processes. Definition 1.1 Let T be an arbitrary index set. A stochastic process indexed by T is a family of random variables (X t : t T) defined on a
More informationPart 3: Trust-region methods for unconstrained optimization. Nick Gould (RAL)
Part 3: Trust-region methods for unconstrained optimization Nick Gould (RAL) minimize x IR n f(x) MSc course on nonlinear optimization UNCONSTRAINED MINIMIZATION minimize x IR n f(x) where the objective
More informationThe Irrevocable Multi-Armed Bandit Problem
The Irrevocable Multi-Armed Bandit Problem Ritesh Madan Qualcomm-Flarion Technologies May 27, 2009 Joint work with Vivek Farias (MIT) 2 Multi-Armed Bandit Problem n arms, where each arm i is a Markov Decision
More informationAsymptotic results discrete time martingales and stochastic algorithms
Asymptotic results discrete time martingales and stochastic algorithms Bernard Bercu Bordeaux University, France IFCAM Summer School Bangalore, India, July 2015 Bernard Bercu Asymptotic results for discrete
More informationMarkov Decision Processes (MDPs) CS 486/686 Introduction to AI University of Waterloo
Markov Decision Processes (MDPs) CS 486/686 Introduction to AI University of Waterloo Outline Sequential Decision Processes Markov chains Highlight Markov property Discounted rewards Value iteration Markov
More informationSYSM 6304: Risk and Decision Analysis Lecture 6: Pricing and Hedging Financial Derivatives
SYSM 6304: Risk and Decision Analysis Lecture 6: Pricing and Hedging Financial Derivatives M. Vidyasagar Cecil & Ida Green Chair The University of Texas at Dallas Email: M.Vidyasagar@utdallas.edu October
More informationOptimizing Portfolios
Optimizing Portfolios An Undergraduate Introduction to Financial Mathematics J. Robert Buchanan 2010 Introduction Investors may wish to adjust the allocation of financial resources including a mixture
More informationarxiv: v1 [math.pr] 6 Apr 2015
Analysis of the Optimal Resource Allocation for a Tandem Queueing System arxiv:1504.01248v1 [math.pr] 6 Apr 2015 Liu Zaiming, Chen Gang, Wu Jinbiao School of Mathematics and Statistics, Central South University,
More information