6.231 DYNAMIC PROGRAMMING LECTURE 5 LECTURE OUTLINE


- Stopping problems
- Scheduling problems
- Minimax control

PURE STOPPING PROBLEMS

- Two possible controls:
  - Stop (incur a one-time stopping cost, and move to a cost-free and absorbing stop state)
  - Continue [using $x_{k+1} = f_k(x_k, w_k)$ and incurring the cost-per-stage]
- Each policy consists of a partition of the set of states $x_k$ into two regions:
  - Stop region, where we stop
  - Continue region, where we continue

[Figure: the state space partitioned into a CONTINUE REGION and a STOP REGION, with the stop state shown separately.]

EXAMPLE: ASSET SELLING

- A person has an asset, and at $k = 0, 1, \ldots, N-1$ receives a random offer $w_k$.
- May accept $w_k$ and invest the money at a fixed rate of interest $r$, or reject $w_k$ and wait for $w_{k+1}$. Must accept the last offer $w_{N-1}$.
- DP algorithm ($x_k$: current offer, $T$: stop state):

$$
J_N(x_N) = \begin{cases} x_N & \text{if } x_N \ne T, \\ 0 & \text{if } x_N = T, \end{cases}
$$

$$
J_k(x_k) = \begin{cases} \max\bigl[(1+r)^{N-k} x_k,\ E\{J_{k+1}(w_k)\}\bigr] & \text{if } x_k \ne T, \\ 0 & \text{if } x_k = T. \end{cases}
$$

- Optimal policy: accept the offer $x_k$ if $x_k > \alpha_k$, reject the offer $x_k$ if $x_k < \alpha_k$, where

$$
\alpha_k = \frac{E\{J_{k+1}(w_k)\}}{(1+r)^{N-k}}.
$$
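The backward recursion for the asset selling example can be sketched numerically. The following is a minimal illustration, assuming the offers are i.i.d. and take finitely many values; the function name `asset_selling_thresholds` and the sample offer values and probabilities are hypothetical and not part of the lecture.

```python
def asset_selling_thresholds(offers, probs, r, N):
    """Return [alpha_0, ..., alpha_{N-1}] for the asset selling DP.

    Assumption of this sketch: offers are i.i.d., and offers[i] occurs
    with probability probs[i].
    """
    J_next = list(offers)            # J_N(x) = x on the offer support
    alphas = [0.0] * N
    for k in range(N - 1, -1, -1):
        growth = (1.0 + r) ** (N - k)                      # (1+r)^(N-k)
        cont = sum(p * j for p, j in zip(probs, J_next))   # E{J_{k+1}(w_k)}
        alphas[k] = cont / growth                          # threshold alpha_k
        # J_k(x) = max[(1+r)^(N-k) x, E{J_{k+1}(w_k)}] on the support
        J_next = [max(growth * x, cont) for x in offers]
    return alphas


offers = [1.0, 2.0, 3.0]   # hypothetical offer values
probs = [0.3, 0.4, 0.3]    # hypothetical probabilities
alphas = asset_selling_thresholds(offers, probs, r=0.05, N=6)
```

Here `alphas[k]` is the threshold against which the current offer $x_k$ is compared. For the last index, the formula reduces to $\alpha_{N-1} = E\{w\}/(1+r)$, because rejecting at time $N-1$ means accepting $w_{N-1}$ one period later.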

FURTHER ANALYSIS

[Figure: thresholds $\alpha_1, \alpha_2, \ldots, \alpha_{N-1}$ plotted against $k = 0, 1, \ldots, N$, separating the ACCEPT region from the REJECT region.]

- Can show that $\alpha_k \ge \alpha_{k+1}$ for all $k$.
- Proof: Let $V_k(x_k) = J_k(x_k)/(1+r)^{N-k}$ for $x_k \ne T$. Then the DP algorithm is

$$
V_N(x_N) = x_N, \qquad V_k(x_k) = \max\Bigl[x_k,\ (1+r)^{-1} E_w\{V_{k+1}(w)\}\Bigr].
$$

  We have $\alpha_k = E_w\{V_{k+1}(w)\}/(1+r)$, so it is enough to show that $V_k(x) \ge V_{k+1}(x)$ for all $x$ and $k$. Start with $V_{N-1}(x) \ge V_N(x)$ and use the monotonicity property of DP. Q.E.D.
- We can also show that if $w$ is bounded, $\alpha_k \to \bar{a}$ as $k \to -\infty$. Suggests that for an infinite horizon the optimal policy is stationary.
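The limit of the thresholds can be explored numerically. Since $V_{k+1}(w) = \max[w, \alpha_{k+1}]$ for $k+1 < N$, the thresholds satisfy $\alpha_k = E_w\{\max[w, \alpha_{k+1}]\}/(1+r)$, and a limit must be a fixed point of this map. The sketch below iterates the map; it assumes i.i.d. nonnegative offers on a finite set, and the function name `limiting_threshold` and the sample data are hypothetical.

```python
def limiting_threshold(offers, probs, r, tol=1e-12, max_iter=10_000):
    """Iterate alpha <- E{max(w, alpha)} / (1+r) until it stops changing.

    Assumptions of this sketch: i.i.d. nonnegative offers, with offers[i]
    occurring with probability probs[i].
    """
    alpha = 0.0  # plays the role of alpha_N, since max(w, 0) = w = V_N(w)
    for _ in range(max_iter):
        new_alpha = sum(p * max(w, alpha) for w, p in zip(offers, probs)) / (1.0 + r)
        if abs(new_alpha - alpha) < tol:
            return new_alpha
        alpha = new_alpha
    return alpha


# Hypothetical data: offers 1, 2, 3 with probabilities 0.3, 0.4, 0.3.
a_bar = limiting_threshold([1.0, 2.0, 3.0], [0.3, 0.4, 0.3], r=0.05)
```

Because $\max[w, \alpha]$ changes by at most as much as $\alpha$ does, the map is a contraction with modulus $1/(1+r)$ when $r > 0$, so the iteration converges for bounded offers.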

GENERAL STOPPING PROBLEMS

- At time $k$, we may stop at cost $t(x_k)$ or choose a control $u_k \in U(x_k)$ and continue:

$$
J_N(x_N) = t(x_N),
$$

$$
J_k(x_k) = \min\Bigl[\, t(x_k),\ \min_{u_k \in U(x_k)} E\bigl\{ g(x_k,u_k,w_k) + J_{k+1}\bigl(f(x_k,u_k,w_k)\bigr) \bigr\} \Bigr].
$$

- Optimal to stop at time $k$ for $x$ in the set

$$
T_k = \Bigl\{ x \ \Big|\ t(x) \le \min_{u \in U(x)} E\bigl\{ g(x,u,w) + J_{k+1}\bigl(f(x,u,w)\bigr) \bigr\} \Bigr\}.
$$

- Since $J_{N-1}(x) \le J_N(x)$, we have $J_k(x) \le J_{k+1}(x)$ for all $k$, so

$$
T_0 \subset \cdots \subset T_k \subset T_{k+1} \subset \cdots \subset T_{N-1}.
$$

- Interesting case is when all the $T_k$ are equal (to $T_{N-1}$, the set where it is better to stop than to go one step and stop).
- Can be shown to be true if $f(x,u,w) \in T_{N-1}$ for all $x \in T_{N-1}$, $u \in U(x)$, $w$.
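The nesting of the stopping sets can be checked on a small finite example. The sketch below assumes a finite state space, the same finite control set at every state, and known transition probabilities. The function name `stopping_sets` and the random-walk data are hypothetical.

```python
def stopping_sets(t, g, P, N):
    """Return [T_0, ..., T_{N-1}] for a finite stationary stopping problem.

    t[x]       : stopping cost at state x
    g[u][x]    : expected stage cost of control u at state x
    P[u][x][y] : probability of moving from x to y under control u
    (Assumption of this sketch: every control is available at every state.)
    """
    n = len(t)
    J = list(t)                      # J_N(x) = t(x)
    T = [None] * N
    for k in range(N - 1, -1, -1):
        # Best expected cost of continuing for one stage from each state
        cont = [min(g[u][x] + sum(P[u][x][y] * J[y] for y in range(n))
                    for u in range(len(g)))
                for x in range(n)]
        T[k] = {x for x in range(n) if t[x] <= cont[x]}
        J = [min(t[x], cont[x]) for x in range(n)]
    return T


# Hypothetical example: random walk on {0,...,4} that stays put at the ends,
# a single "continue" control with stage cost 0.5.
n = 5
P0 = [[0.0] * n for _ in range(n)]
for x in range(n):
    P0[x][max(x - 1, 0)] += 0.5
    P0[x][min(x + 1, n - 1)] += 0.5
T = stopping_sets(t=[0.0, 2.0, 5.0, 2.0, 0.0], g=[[0.5] * n], P=[P0], N=3)
```

In this example the state can leave $T_{N-1}$ (from state 1 it can move to state 2), so the sufficient condition for all $T_k$ to be equal does not hold, and the earlier stopping sets are strictly smaller.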

SCHEDULING PROBLEMS

- We have a set of tasks to perform; the ordering is subject to optimal choice.
- Costs depend on the order.
- There may be stochastic uncertainty, and precedence and resource availability constraints.
- Some of the hardest combinatorial problems are of this type (e.g., traveling salesman, vehicle routing, etc.).
- Some special problems admit a simple quasi-analytical solution method:
  - Optimal policy has an index form, i.e., each task has an easily calculable cost index, and it is optimal to select the task that has the minimum value of index (multiarmed bandit problems, to be discussed later).
  - Some problems can be solved by an interchange argument (start with some schedule, interchange two adjacent tasks, and see what happens). They require existence of an optimal policy which is open-loop.

EXAMPLE: THE QUIZ PROBLEM

- Given a list of $N$ questions. If question $i$ is answered correctly (given probability $p_i$), we receive reward $R_i$; if not, the quiz terminates. Choose the order of questions to maximize expected reward.
- Let $i$ and $j$ be the $k$th and $(k+1)$st questions in an optimally ordered list

$$
L = (i_0, \ldots, i_{k-1}, i, j, i_{k+2}, \ldots, i_{N-1}).
$$

$$
\begin{aligned}
E\{\text{reward of } L\} ={}& E\bigl\{\text{reward of } \{i_0, \ldots, i_{k-1}\}\bigr\} \\
&+ p_{i_0} \cdots p_{i_{k-1}} \bigl(p_i R_i + p_i p_j R_j\bigr) \\
&+ p_{i_0} \cdots p_{i_{k-1}}\, p_i p_j\, E\bigl\{\text{reward of } \{i_{k+2}, \ldots, i_{N-1}\}\bigr\}.
\end{aligned}
$$

- Consider the list with $i$ and $j$ interchanged:

$$
L' = (i_0, \ldots, i_{k-1}, j, i, i_{k+2}, \ldots, i_{N-1}).
$$

- Since $L$ is optimal, $E\{\text{reward of } L\} \ge E\{\text{reward of } L'\}$, so it follows that $p_i R_i + p_i p_j R_j \ge p_j R_j + p_j p_i R_i$, or

$$
\frac{p_i R_i}{1 - p_i} \ge \frac{p_j R_j}{1 - p_j}.
$$
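The interchange condition suggests ordering the questions by decreasing index $p_i R_i/(1-p_i)$. The sketch below compares that ordering with a brute-force search over all orderings on a small instance. It assumes every $p_i < 1$, and the function names and sample data are hypothetical.

```python
from itertools import permutations


def expected_reward(order, p, R):
    """Expected total reward when questions are attempted in the given order."""
    total, prob_alive = 0.0, 1.0
    for i in order:
        prob_alive *= p[i]           # probability that questions up to i are all correct
        total += prob_alive * R[i]
    return total


def index_order(p, R):
    """Questions sorted by decreasing index p_i R_i / (1 - p_i); assumes p_i < 1."""
    return sorted(range(len(p)), key=lambda i: p[i] * R[i] / (1.0 - p[i]), reverse=True)


p = [0.9, 0.5, 0.7, 0.3]     # hypothetical success probabilities
R = [1.0, 6.0, 3.0, 10.0]    # hypothetical rewards
best_by_index = index_order(p, R)
best_by_search = max(permutations(range(len(p))),
                     key=lambda order: expected_reward(order, p, R))
```

The brute-force search visits $N!$ orderings, so it is only practical as a check on small instances, whereas the index ordering needs a single sort.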

MINIMAX CONTROL

- Consider the basic problem with the difference that the disturbance $w_k$, instead of being random, is just known to belong to a given set $W_k(x_k,u_k)$.
- Find policy $\pi$ that minimizes the cost

$$
J_\pi(x_0) = \max_{\substack{w_k \in W_k(x_k,\mu_k(x_k)) \\ k=0,1,\ldots,N-1}} \left[ g_N(x_N) + \sum_{k=0}^{N-1} g_k\bigl(x_k,\mu_k(x_k),w_k\bigr) \right].
$$

- The DP algorithm takes the form

$$
J_N(x_N) = g_N(x_N),
$$

$$
J_k(x_k) = \min_{u_k \in U(x_k)} \max_{w_k \in W_k(x_k,u_k)} \Bigl[ g_k(x_k,u_k,w_k) + J_{k+1}\bigl(f_k(x_k,u_k,w_k)\bigr) \Bigr]
$$

  (Section 1.6 in the text).
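For finite state, control, and disturbance sets, the minimax DP recursion can be carried out by direct enumeration. The following is a minimal sketch under that assumption. The function `minimax_dp`, its argument names, and the small integer example are hypothetical and not taken from the text.

```python
def minimax_dp(states, controls, disturbances, f, g, g_terminal, N):
    """Backward minimax DP over finite sets; returns (J_0, policy).

    controls(x)           -> iterable of admissible u (the set U(x))
    disturbances(k, x, u) -> iterable of possible w (the set W_k(x, u))
    f(k, x, u, w)         -> next state, which must belong to `states`
    g(k, x, u, w)         -> stage cost; g_terminal(x) -> terminal cost
    """
    J = {x: g_terminal(x) for x in states}          # J_N
    policy = [None] * N
    for k in range(N - 1, -1, -1):
        J_new, mu = {}, {}
        for x in states:
            best_u, best_val = None, float("inf")
            for u in controls(x):
                # Worst-case cost of applying u at x
                worst = max(g(k, x, u, w) + J[f(k, x, u, w)]
                            for w in disturbances(k, x, u))
                if worst < best_val:
                    best_u, best_val = u, worst
            J_new[x], mu[x] = best_val, best_u
        J, policy[k] = J_new, mu
    return J, policy


# Hypothetical example: integer state kept in [-2, 2], cost |x| per stage.
clip = lambda x: max(-2, min(2, x))
J0, policy = minimax_dp(
    states=range(-2, 3),
    controls=lambda x: (-1, 0, 1),
    disturbances=lambda k, x, u: (-1, 0, 1),
    f=lambda k, x, u, w: clip(x + u + w),
    g=lambda k, x, u, w: abs(x),
    g_terminal=abs,
    N=2,
)
```

Here `policy[k][x]` is a control attaining the minimum in the recursion at stage $k$ and state $x$; ties are broken in favor of the first control enumerated.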

DERIVATION OF MINIMAX DP ALGORITHM

- Similar to the DP algorithm for stochastic problems. The optimal cost $J^*(x_0)$ is

$$
\begin{aligned}
J^*(x_0) &= \min_{\mu_0} \cdots \min_{\mu_{N-1}}\ \max_{w_0 \in W[x_0,\mu_0(x_0)]} \cdots \max_{w_{N-1} \in W[x_{N-1},\mu_{N-1}(x_{N-1})]} \Biggl[ \sum_{k=0}^{N-1} g_k\bigl(x_k,\mu_k(x_k),w_k\bigr) + g_N(x_N) \Biggr] \\
&= \min_{\mu_0} \cdots \min_{\mu_{N-2}} \Biggl[ \min_{\mu_{N-1}}\ \max_{w_0 \in W[x_0,\mu_0(x_0)]} \cdots \max_{w_{N-2} \in W[x_{N-2},\mu_{N-2}(x_{N-2})]} \Biggl[ \sum_{k=0}^{N-2} g_k\bigl(x_k,\mu_k(x_k),w_k\bigr) \\
&\qquad + \max_{w_{N-1} \in W[x_{N-1},\mu_{N-1}(x_{N-1})]} \Bigl[ g_{N-1}\bigl(x_{N-1},\mu_{N-1}(x_{N-1}),w_{N-1}\bigr) + J_N(x_N) \Bigr] \Biggr] \Biggr].
\end{aligned}
$$

- Interchange the min over $\mu_{N-1}$ and the max over $w_0, \ldots, w_{N-2}$, and similarly continue backwards, with $N-1$ in place of $N$, etc. After $N$ steps we obtain $J^*(x_0) = J_0(x_0)$.
- Construct optimal policy by minimizing in the RHS of the DP algorithm.

UNKNOWN-BUT-BOUNDED CONTROL

- For each $k$, keep the $x_k$ of the controlled system

$$
x_{k+1} = f_k\bigl(x_k,\mu_k(x_k),w_k\bigr)
$$

  inside a given set $X_k$, the target set at time $k$.
- This is a minimax control problem, where the cost at stage $k$ is

$$
g_k(x_k) = \begin{cases} 0 & \text{if } x_k \in X_k, \\ 1 & \text{if } x_k \notin X_k. \end{cases}
$$

- We must reach at time $k$ the set

$$
\bar{X}_k = \bigl\{ x_k \mid J_k(x_k) = 0 \bigr\}
$$

  in order to be able to maintain the state within the subsequent target sets.
- Start with $\bar{X}_N = X_N$, and for $k = 0, 1, \ldots, N-1$,

$$
\bar{X}_k = \bigl\{ x_k \in X_k \mid \text{there exists } u_k \in U_k(x_k) \text{ such that } f_k(x_k,u_k,w_k) \in \bar{X}_{k+1} \text{ for all } w_k \in W_k(x_k,u_k) \bigr\}.
$$
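The backward construction of the sets $\bar{X}_k$ can be sketched for finite sets by checking, for each state, whether some control keeps every possible successor inside $\bar{X}_{k+1}$. The function `effective_target_sets` and the integer example are hypothetical illustrations, not taken from the text.

```python
def effective_target_sets(X, controls, disturbances, f, N):
    """Return [X_bar_0, ..., X_bar_N] given target sets X[0..N].

    controls(k, x)        -> iterable of admissible u (the set U_k(x))
    disturbances(k, x, u) -> iterable of possible w (the set W_k(x, u))
    f(k, x, u, w)         -> next state
    """
    X_bar = [None] * (N + 1)
    X_bar[N] = set(X[N])                       # X_bar_N = X_N
    for k in range(N - 1, -1, -1):
        X_bar[k] = {
            x for x in X[k]
            # some control keeps every possible successor inside X_bar_{k+1}
            if any(all(f(k, x, u, w) in X_bar[k + 1]
                       for w in disturbances(k, x, u))
                   for u in controls(k, x))
        }
    return X_bar


# Hypothetical example: x_{k+1} = x_k + u_k + w_k with |u_k| <= 2, |w_k| <= 1.
X = [set(range(-5, 6)), set(range(-5, 6)), set(range(-3, 4)), {-1, 0, 1}]
X_bar = effective_target_sets(
    X,
    controls=lambda k, x: range(-2, 3),
    disturbances=lambda k, x, u: (-1, 0, 1),
    f=lambda k, x, u, w: x + u + w,
    N=3,
)
```

In this example each $\bar{X}_k$ with $k < N$ is strictly smaller than the corresponding $X_k$: being inside the target set at time $k$ is not enough if no control can guarantee staying inside the later target sets.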

MIT OpenCourseWare
Dynamic Programming and Stochastic Control, Fall 2015
For information about citing these materials or our Terms of Use, visit:
