Dynamic Programming and Optimal Control Volume 1


Dynamic Programming and Optimal Control, Volume 1, SECOND EDITION

Dimitri P. Bertsekas, Massachusetts Institute of Technology

Selected Theoretical Problem Solutions

Athena Scientific, Belmont, MA, 2000

WWW site for book information and orders

NOTE

This solution set is meant to be a significant extension of the scope and coverage of the book. Solutions to all of the book's exercises marked with the symbol www have been included. The solutions are continuously updated and improved, and additional material, including new problems and their solutions, is being added. The solutions may be reproduced and distributed for personal or educational uses. Please send comments and suggestions for additions and improvements to the author at bertsekas@lids.mit.edu or dimitrib@mit.edu.

Vol. I, Chapter 1

www

We first show the result given in the hint. For all $\mu \in M$ we have
$$\max_{w \in W}\bigl\{G_0(w) + G_1\bigl(f(w), \mu(f(w))\bigr)\bigr\} \ge \max_{w \in W}\Bigl\{G_0(w) + \min_{u \in U} G_1\bigl(f(w), u\bigr)\Bigr\},$$
and, taking the minimum over $\mu \in M$, we obtain
$$\min_{\mu \in M}\max_{w \in W}\bigl\{G_0(w) + G_1\bigl(f(w), \mu(f(w))\bigr)\bigr\} \ge \max_{w \in W}\Bigl\{G_0(w) + \min_{u \in U} G_1\bigl(f(w), u\bigr)\Bigr\}. \qquad (*)$$
We must therefore show the reverse inequality. For any $\epsilon > 0$, let $\mu_\epsilon \in M$ be such that
$$G_1\bigl(f(w), \mu_\epsilon(f(w))\bigr) \le \min_{u \in U} G_1\bigl(f(w), u\bigr) + \epsilon, \qquad \forall\, w \in W.$$
(Such a $\mu_\epsilon$ exists because of the assumption that $\min_{u \in U} G_1(f(w), u) > -\infty$.) Then
$$\min_{\mu \in M}\max_{w \in W}\bigl\{G_0(w) + G_1\bigl(f(w), \mu(f(w))\bigr)\bigr\} \le \max_{w \in W}\bigl\{G_0(w) + G_1\bigl(f(w), \mu_\epsilon(f(w))\bigr)\bigr\} \le \max_{w \in W}\Bigl\{G_0(w) + \min_{u \in U} G_1\bigl(f(w), u\bigr)\Bigr\} + \epsilon.$$
Since $\epsilon > 0$ can be taken arbitrarily small, we obtain the reverse inequality to $(*)$, and thus the desired result.

To see how this result can fail without the condition $\min_{u \in U} G_1(f(w), u) > -\infty$ for all $w$, let $\Re$ be the real line, let $u$ be a real number, let $w = (w_1, w_2)$ be a two-dimensional vector, let there be no constraints on $u$ and $w$ ($U = \Re$, $W = \Re \times \Re$), and let $G_0(w) = w_1$, $f(w) = w_2$, and $G_1(f(w), u) = f(w) + u$. Then for every $\mu \in M$,
$$\max_{w \in W}\bigl\{G_0(w) + G_1\bigl(f(w), \mu(f(w))\bigr)\bigr\} = \max_{w_1 \in \Re,\, w_2 \in \Re}\bigl\{w_1 + w_2 + \mu(w_2)\bigr\} = \infty,$$
so that
$$\min_{\mu \in M}\max_{w \in W}\bigl\{G_0(w) + G_1\bigl(f(w), \mu(f(w))\bigr)\bigr\} = \infty.$$
On the other hand,
$$\max_{w \in W}\Bigl\{G_0(w) + \min_{u \in U} G_1\bigl(f(w), u\bigr)\Bigr\} = \max_{w_1 \in \Re,\, w_2 \in \Re}\Bigl\{w_1 + \min_{u \in \Re}\{w_2 + u\}\Bigr\} = -\infty,$$
since $\min_{u \in \Re}\{w_2 + u\} = -\infty$ for all $w_2$.

We now turn to showing the DP algorithm. We have
$$J^*(x_0) = \min_{\mu_0, \ldots, \mu_{N-1}}\ \max_{\substack{w_k \in W(x_k, \mu_k(x_k)) \\ k = 0, \ldots, N-1}}\Biggl[\sum_{k=0}^{N-1} g_k\bigl(x_k, \mu_k(x_k), w_k\bigr) + g_N(x_N)\Biggr]$$
$$= \min_{\mu_0, \ldots, \mu_{N-2}}\ \max_{\substack{w_k \in W(x_k, \mu_k(x_k)) \\ k = 0, \ldots, N-2}}\Biggl[\sum_{k=0}^{N-2} g_k\bigl(x_k, \mu_k(x_k), w_k\bigr) + \min_{\mu_{N-1}}\ \max_{w_{N-1} \in W(x_{N-1}, \mu_{N-1}(x_{N-1}))}\bigl\{g_{N-1}\bigl(x_{N-1}, \mu_{N-1}(x_{N-1}), w_{N-1}\bigr) + J_N(x_N)\bigr\}\Biggr].$$

Applying the result of the hint with the identifications
$$w = (w_0, w_1, \ldots, w_{N-2}), \qquad u = u_{N-1}, \qquad f(w) = x_{N-1},$$
$$G_0(w) = \begin{cases} \sum_{k=0}^{N-2} g_k\bigl(x_k, \mu_k(x_k), w_k\bigr) & \text{if } w_k \in W_k\bigl(x_k, \mu_k(x_k)\bigr) \text{ for all } k, \\ -\infty & \text{otherwise}, \end{cases}$$
$$G_1\bigl(f(w), u\bigr) = \begin{cases} \hat G_1\bigl(f(w), u\bigr) & \text{if } u \in U_{N-1}\bigl(f(w)\bigr), \\ \infty & \text{otherwise}, \end{cases}$$
where
$$\hat G_1\bigl(f(w), u\bigr) = \max_{w_{N-1} \in W_{N-1}(f(w), u)}\bigl\{g_{N-1}\bigl(f(w), u, w_{N-1}\bigr) + J_N\bigl(f_{N-1}(f(w), u, w_{N-1})\bigr)\bigr\},$$
we have
$$J^*(x_0) = \min_{\mu_0, \ldots, \mu_{N-2}}\ \max_{\substack{w_k \in W(x_k, \mu_k(x_k)) \\ k = 0, \ldots, N-2}}\Biggl[\sum_{k=0}^{N-2} g_k\bigl(x_k, \mu_k(x_k), w_k\bigr) + J_{N-1}(x_{N-1})\Biggr].$$
The required condition $\min_{u \in U} G_1(f(w), u) > -\infty$ for all $w$ is implied by the assumption $J_{N-1}(x_{N-1}) > -\infty$ for all $x_{N-1}$. Without this assumption, mathematical anomalies of the type demonstrated in the earlier example may arise. By working with the preceding expression for $J^*(x_0)$ and by similarly continuing backwards, with $N-1$ in place of $N$, etc., after $N$ steps we obtain $J^*(x_0) = J_0(x_0)$.

1.13 www

The DP algorithm is
$$J_N(x_N) = c_N' x_N,$$
$$J_k(x_k) = \min_{u_k} \mathop{E}_{w_k, A_k}\bigl\{c_k' x_k + g_k(u_k) + J_{k+1}\bigl(A_k x_k + f_k(u_k) + w_k\bigr)\bigr\}.$$
We will show by induction that $J_k(x_k)$ is affine. Clearly $J_N(x_N)$ is affine. Assume that $J_{k+1}(x_{k+1})$ is affine, that is,
$$J_{k+1}(x_{k+1}) = b_{k+1}' x_{k+1} + d_{k+1}.$$
Then
$$J_k(x_k) = \min_{u_k} \mathop{E}_{w_k, A_k}\bigl\{c_k' x_k + g_k(u_k) + b_{k+1}' A_k x_k + b_{k+1}' f_k(u_k) + b_{k+1}' w_k + d_{k+1}\bigr\}$$
$$= \bigl(c_k + E\{A_k'\}\, b_{k+1}\bigr)' x_k + b_{k+1}' E\{w_k\} + \min_{u_k}\bigl\{g_k(u_k) + b_{k+1}' f_k(u_k)\bigr\} + d_{k+1}.$$
Note that $E\{A_k\}$ and $E\{w_k\}$ do not depend on $x_k$ or $u_k$. If the optimal value is finite, then $\min_{u_k}\{g_k(u_k) + b_{k+1}' f_k(u_k)\}$ is a real number and $J_k(x_k)$ is affine. Furthermore, the optimal control at each stage solves this minimization, which is independent of $x_k$. Thus, the optimal policy consists of constant functions $\mu_k^*$.
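The induction above can be illustrated numerically. The sketch below uses a scalar system with hypothetical data (the coefficients, control set, and the functions $g_k$, $f_k$ are all made up for the illustration); since the value functions are affine, the expectations enter only through $E\{A_k\}$ and $E\{w_k\}$, and the slope obeys $b_k = c_k + E\{A_k\}\, b_{k+1}$.

```python
# Numeric sketch of Exercise 1.13 for a scalar system (all data hypothetical):
# x_{k+1} = A_k x_k + f_k(u_k) + w_k, stage cost c_k x_k + g_k(u_k).
# J_k should be affine: J_k(x) = b_k x + d_k with b_k = c_k + E[A_k] b_{k+1}.

N = 3
c = [1.0, 0.5, 2.0]      # stage cost coefficients c_k (hypothetical)
cN = 1.5                 # terminal cost: J_N(x) = cN * x
EA, Ew = 0.9, 0.1        # E[A_k], E[w_k], taken stage-independent here
U = [-1.0, 0.0, 1.0]     # finite control set

def g(u):                # control cost g_k(u), hypothetical
    return u * u

def f(u):                # control drift f_k(u), hypothetical
    return 2.0 * u

def J(k, x):
    # DP recursion; since J_{k+1} is affine, E{J_{k+1}(A x + f(u) + w)}
    # equals J_{k+1}(E[A] x + f(u) + E[w]).
    if k == N:
        return cN * x
    return min(c[k] * x + g(u) + J(k + 1, EA * x + f(u) + Ew) for u in U)

# Closed-form slope recursion b_k = c_k + E[A_k] b_{k+1}.
b = [0.0] * (N + 1)
b[N] = cN
for k in reversed(range(N)):
    b[k] = c[k] + EA * b[k + 1]

# J_0 should be affine in x with slope b[0]: check the slope numerically.
slope = (J(0, 5.0) - J(0, 1.0)) / 4.0
print(abs(slope - b[0]) < 1e-9)  # True: J_0 is affine with slope b_0
```

The minimizing $u$ is the same at every $x$, in line with the constant optimal policy derived above.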

1.16 www

(a) Given a sequence of matrix multiplications $M_1 M_2 \cdots M_k M_{k+1} \cdots M_N$, we represent it by a sequence of numbers $\{n_1, \ldots, n_{N+1}\}$, where $n_k \times n_{k+1}$ is the dimension of $M_k$. Let the initial state be $x_0 = \{n_1, \ldots, n_{N+1}\}$. Then choosing the first multiplication to be carried out corresponds to choosing an element from the set $x_0 \setminus \{n_1, n_{N+1}\}$. For instance, choosing $n_2$ corresponds to multiplying $M_1$ and $M_2$, which results in a matrix of dimension $n_1 \times n_3$, and the initial state must be updated to discard $n_2$, the control applied at that stage. Hence at each stage the state represents the dimensions of the matrices resulting from the multiplications done so far. The allowable controls at stage $k$ are $u_k \in x_k \setminus \{n_1, n_{N+1}\}$. The system equation evolves according to
$$x_{k+1} = x_k \setminus \{u_k\}.$$
Note that the control will be applied $N-1$ times, so the horizon of this problem is $N-1$. The terminal state is $x_{N-1} = \{n_1, n_{N+1}\}$ and the terminal cost is 0. The cost at stage $k$ is given by the number of multiplications,
$$g_k(x_k, u_k) = n_a n_{u_k} n_b, \qquad \text{where } n_{u_k} = u_k \text{ and}$$
$$a = \max\bigl\{i \in \{1, \ldots, N+1\} \mid i < u_k,\ n_i \in x_k\bigr\}, \qquad b = \min\bigl\{i \in \{1, \ldots, N+1\} \mid i > u_k,\ n_i \in x_k\bigr\},$$
with $u_k$ identified with the index of the dimension it removes. The DP algorithm for this problem is
$$J_{N-1}(x_{N-1}) = 0,$$
$$J_k(x_k) = \min_{u_k \in x_k \setminus \{n_1, n_{N+1}\}}\bigl\{n_a n_{u_k} n_b + J_{k+1}\bigl(x_k \setminus \{u_k\}\bigr)\bigr\}, \qquad k = 0, \ldots, N-2.$$

Now consider the given problem, where $N = 3$ and $M_1$ is $2 \times 10$, $M_2$ is $10 \times 5$, $M_3$ is $5 \times 1$. The optimal order is $M_1 (M_2 M_3)$, requiring $10 \cdot 5 \cdot 1 + 2 \cdot 10 \cdot 1 = 70$ multiplications.

(b) In this part we can choose a much simpler state space. Let the state at stage $k$ be $\{a, b\}$, where $a, b \in \{1, \ldots, N\}$ are the indices of the first and the last matrix in the current partial product. There are two possible controls at each stage, which we denote by $L$ and $R$. Note that $L$ can be applied only when $a \ne 1$ and $R$ can be applied only when $b \ne N$. The system equation evolves according to
$$x_{k+1} = \begin{cases} \{a-1,\, b\} & \text{if } u_k = L, \\ \{a,\, b+1\} & \text{if } u_k = R, \end{cases} \qquad k = 1, \ldots, N-1.$$
The terminal state is $x_N = \{1, N\}$, with terminal cost 0. The cost at stage $k$ is given by
$$g_k(x_k, u_k) = \begin{cases} n_{a-1}\, n_a\, n_{b+1} & \text{if } u_k = L, \\ n_a\, n_{b+1}\, n_{b+2} & \text{if } u_k = R, \end{cases} \qquad k = 1, \ldots, N-1.$$
To handle the initial stage, we can take $x_0$ to be the empty set and $u_0 \in \{1, \ldots, N\}$. The next state will be $x_1 = \{u_0, u_0\}$ and the cost incurred at the initial stage will be 0 for all possible controls.
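The DP of part (b) can be sketched in code. This is a minimal illustration, not the book's own program; the function name and graph of states are choices made here. The state is the 0-based index pair of the current contiguous partial product, and the two controls multiply on the left or on the right.

```python
from functools import lru_cache

# Sketch of the DP formulation in 1.16(b): the state (a, b) gives the first
# and last matrix of the current partial product (0-based here), and the
# controls L/R multiply by the next matrix on the left or on the right.
# dims[i] x dims[i+1] is the dimension of the (i+1)-st matrix.

def matrix_chain_cost(dims):
    n = len(dims) - 1  # number of matrices

    @lru_cache(maxsize=None)
    def best(a, b):
        # minimal remaining cost once the 0-based block [a, b] is formed
        if a == 0 and b == n - 1:
            return 0
        left = right = float("inf")
        if a > 0:        # control L
            left = dims[a - 1] * dims[a] * dims[b + 1] + best(a - 1, b)
        if b < n - 1:    # control R
            right = dims[a] * dims[b + 1] * dims[b + 2] + best(a, b + 1)
        return min(left, right)

    # initial stage: choose the starting matrix at zero cost
    return min(best(i, i) for i in range(n))

# The book's example: M1 is 2x10, M2 is 10x5, M3 is 5x1.
print(matrix_chain_cost([2, 10, 5, 1]))  # 70, corresponding to M1 (M2 M3)
```

The memoized recursion visits each of the $O(N^2)$ index pairs once, reflecting the smaller state space of part (b) compared with the subsets used in part (a).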

1.19 www

Let $t_1 < t_2 < \cdots < t_{N-1}$ denote the times where $g_1(t) = g_2(t)$. Clearly, it is never optimal to switch functions at any other times. We can therefore divide the problem into $N-1$ stages, where we want to determine for each stage $k$ whether or not to switch activities at time $t_k$. Define
$$x_k = \begin{cases} 0 & \text{if on activity } g_1 \text{ just before time } t_k, \\ 1 & \text{if on activity } g_2 \text{ just before time } t_k, \end{cases} \qquad u_k = \begin{cases} 0 & \text{to continue the current activity}, \\ 1 & \text{to switch between activities}. \end{cases}$$
Then the state at time $t_{k+1}$ is simply $x_{k+1} = (x_k + u_k) \bmod 2$, and the profit for stage $k$ is
$$g_k(x_k, u_k) = \int_{t_k}^{t_{k+1}} g_{1 + x_{k+1}}(t)\, dt - u_k c.$$
The DP algorithm is then
$$J_N(x_N) = 0, \qquad J_k(x_k) = \max_{u_k}\bigl\{g_k(x_k, u_k) + J_{k+1}\bigl((x_k + u_k) \bmod 2\bigr)\bigr\}.$$

www

(a) Consider the problem with the state equal to the number of free rooms. At state $x \ge 1$ with $y$ customers remaining, if the innkeeper quotes a rate $r_i$, the transition probability is $p_i$ to state $x - 1$ (with a reward of $r_i$) and $1 - p_i$ to state $x$ (with a reward of 0). The DP algorithm for this problem starts with the terminal conditions
$$J(x, 0) = J(0, y) = 0, \qquad \forall\, x \ge 0,\ y \ge 0,$$
and is given by
$$J(x, y) = \max_{i = 1, \ldots, m}\bigl[p_i\bigl(r_i + J(x-1,\, y-1)\bigr) + (1 - p_i)\, J(x,\, y-1)\bigr], \qquad x \ge 1.$$
From the above equation and the terminal conditions, we can compute sequentially $J(1,1), J(1,2), \ldots, J(1,y)$ up to any desired integer $y$. Then we can calculate $J(2,1), J(2,2), \ldots, J(2,y)$, etc.

We first prove by induction on $y$ that for all $y$ we have
$$J(x, y) \ge J(x-1, y), \qquad \forall\, x \ge 1.$$
Indeed this is true for $y = 0$. Assuming it is true for a given $y$, we will prove that $J(x, y+1) \ge J(x-1, y+1)$ for all $x \ge 1$. This relation holds for $x = 1$ since $r_i > 0$. For $x \ge 2$, by using the DP recursion, this relation is written as
$$\max_{i = 1, \ldots, m}\bigl[p_i\bigl(r_i + J(x-1, y)\bigr) + (1 - p_i)\, J(x, y)\bigr] \ge \max_{i = 1, \ldots, m}\bigl[p_i\bigl(r_i + J(x-2, y)\bigr) + (1 - p_i)\, J(x-1, y)\bigr].$$

By the induction hypothesis, each of the terms on the left-hand side is no less than the corresponding term on the right-hand side, so the above relation holds.

The optimal rate is the one that maximizes in the DP algorithm, or equivalently, the one that maximizes
$$p_i r_i + p_i\bigl(J(x-1,\, y-1) - J(x,\, y-1)\bigr).$$
The highest rate $r_m$ simultaneously maximizes $p_i r_i$ and minimizes $p_i$. Since $J(x-1,\, y-1) - J(x,\, y-1) \le 0$, as proved above, we see that the highest rate simultaneously maximizes $p_i r_i$ and $p_i\bigl(J(x-1,\, y-1) - J(x,\, y-1)\bigr)$, and so it maximizes their sum.

(b) The algorithm given is the algorithm of Exercise 1.22 applied to the problem of part (a). Clearly, it is optimal to accept an offer of $r_i$ if $r_i$ is larger than the threshold
$$\bar r(x, y) = J(x,\, y-1) - J(x-1,\, y-1).$$

www

(a) The total net expected profit from the (buy/sell) investment decisions, after transaction costs are deducted, is
$$E\Biggl\{\sum_{k=0}^{N-1}\bigl(u_k P_k(x_k) - c\,|u_k|\bigr)\Biggr\},$$
where
$$u_k = \begin{cases} 1 & \text{if a unit of stock is bought at the $k$th period}, \\ -1 & \text{if a unit of stock is sold at the $k$th period}, \\ 0 & \text{otherwise}. \end{cases}$$
With a policy that maximizes this expression, we simultaneously maximize the expected total worth of the stock held at time $N$ minus the investment costs (including sale revenues). The DP algorithm is given by
$$J_N(x_N) = 0,$$
$$J_k(x_k) = \max_{u_k = -1, 0, 1}\bigl[u_k P_k(x_k) - c\,|u_k| + E\bigl\{J_{k+1}(x_{k+1}) \mid x_k\bigr\}\bigr],$$
where $J_{k+1}(x_{k+1})$ is the optimal expected profit when the stock price is $x_{k+1}$ at time $k+1$. Since $u_k$ does not influence $x_{k+1}$ or $E\{J_{k+1}(x_{k+1}) \mid x_k\}$, a decision $u_k \in \{-1, 0, 1\}$ that maximizes $u_k P_k(x_k) - c\,|u_k|$ at time $k$ is optimal. Since $P_k(x_k)$ is monotonically nonincreasing in $x_k$, it follows that it is optimal to set
$$u_k = \begin{cases} 1 & \text{if } x_k \le \underline{x}_k, \\ -1 & \text{if } x_k \ge \bar{x}_k, \\ 0 & \text{otherwise}, \end{cases}$$
where $\underline{x}_k$ and $\bar{x}_k$ are as in the problem statement. Note that the optimal expected profit $J_k(x_k)$ is given by
$$J_k(x_k) = E\Biggl\{\sum_{i=k}^{N-1}\ \max_{u_i = -1, 0, 1}\bigl(u_i P_i(x_i) - c\,|u_i|\bigr)\Biggr\}.$$

(b) Let $n_k$ be the number of units of stock held at time $k$. If $n_k$ is less than $N - k$ (the number of remaining decisions), then the value of $n_k$ should influence the decision at time $k$. We thus take as state the pair $(x_k, n_k)$, and the corresponding DP algorithm takes the form
$$V_k(x_k, n_k) = \begin{cases} \max_{u_k \in \{-1, 0, 1\}}\bigl[u_k P_k(x_k) - c\,|u_k| + E\bigl\{V_{k+1}(x_{k+1},\, n_k + u_k) \mid x_k\bigr\}\bigr] & \text{if } n_k \ge 1, \\[4pt] \max_{u_k \in \{0, 1\}}\bigl[u_k P_k(x_k) - c\,|u_k| + E\bigl\{V_{k+1}(x_{k+1},\, n_k + u_k) \mid x_k\bigr\}\bigr] & \text{if } n_k = 0, \end{cases}$$
with
$$V_N(x_N, n_N) = 0.$$
Note that we have
$$V_k(x_k, n_k) = J_k(x_k) \qquad \text{if } n_k \ge N - k,$$
where $J_k(x_k)$ is given by the formula derived in part (a). Using the above DP algorithm, we can calculate $V_{N-1}(x_{N-1}, n_{N-1})$ for all values of $n_{N-1}$, then calculate $V_{N-2}(x_{N-2}, n_{N-2})$ for all values of $n_{N-2}$, etc.

To show the stated property of the optimal policy, we note that $V_k(x_k, n_k)$ is monotonically nondecreasing in $n_k$, since as $n_k$ decreases, the remaining decisions become more constrained. An optimal policy at time $k$ is to buy if
$$P_k(x_k) - c + E\bigl\{V_{k+1}(x_{k+1},\, n_k + 1) - V_{k+1}(x_{k+1},\, n_k) \mid x_k\bigr\} \ge 0, \tag{1}$$
and to sell if
$$-P_k(x_k) - c + E\bigl\{V_{k+1}(x_{k+1},\, n_k - 1) - V_{k+1}(x_{k+1},\, n_k) \mid x_k\bigr\} \ge 0. \tag{2}$$
The expected value in Eq. (1) is nonnegative, which implies that if $x_k \le \underline{x}_k$, implying that $P_k(x_k) - c \ge 0$, then the buying decision is optimal. Similarly, the expected value in Eq. (2) is nonpositive, which implies that if $x_k < \bar{x}_k$, implying that $-P_k(x_k) - c < 0$, then the selling decision cannot be optimal. It is possible that buying at a price greater than $\underline{x}_k$ is optimal, depending on the size of the expected value term in Eq. (1).

(c) Let $m_k$ be the number of allowed purchase decisions at time $k$, i.e., $m$ plus the number of sale decisions up to $k$, minus the number of purchase decisions up to $k$. If $m_k$ is less than $N - k$ (the number of remaining decisions), then the value of $m_k$ should influence the decision at time $k$. We thus take as state the pair $(x_k, m_k)$, and the corresponding DP algorithm takes the form
$$W_k(x_k, m_k) = \begin{cases} \max_{u_k \in \{-1, 0, 1\}}\bigl[u_k P_k(x_k) - c\,|u_k| + E\bigl\{W_{k+1}(x_{k+1},\, m_k - u_k) \mid x_k\bigr\}\bigr] & \text{if } m_k \ge 1, \\[4pt] \max_{u_k \in \{-1, 0\}}\bigl[u_k P_k(x_k) - c\,|u_k| + E\bigl\{W_{k+1}(x_{k+1},\, m_k - u_k) \mid x_k\bigr\}\bigr] & \text{if } m_k = 0, \end{cases}$$
with
$$W_N(x_N, m_N) = 0.$$
From this point the analysis is similar to that of part (b).

(d) The DP algorithm takes the form
$$H_k(x_k, m_k, n_k) = \max_{u_k \in \{-1, 0, 1\}}\bigl[u_k P_k(x_k) - c\,|u_k| + E\bigl\{H_{k+1}(x_{k+1},\, m_k - u_k,\, n_k + u_k) \mid x_k\bigr\}\bigr]$$
if $m_k \ge 1$ and $n_k \ge 1$; similar formulas apply for the cases where $m_k = 0$ and/or $n_k = 0$ (compare with the DP algorithms of parts (b) and (c)).

(e) Let $r$ be the interest rate, so that $x$ invested dollars at time $k$ will become $(1+r)^{N-k} x$ dollars at time $N$. Once we redefine the expected profit $P_k(x_k)$ to be
$$P_k(x) = E\{x_N \mid x_k = x\} - (1+r)^{N-k} x,$$
the preceding analysis applies.

1.27 www

We consider part (b), since part (a) is essentially a special case. We will consider the problem of placing $N-2$ points between the endpoints $A$ and $B$ of the given subarc. We will show that the polygon of maximal area is obtained when the $N-2$ points are equally spaced on the subarc between $A$ and $B$.

Based on geometric considerations, we impose the restriction that the angle between any two successive points is no more than $\pi$. As the subarc is traversed in the clockwise direction, we number sequentially the encountered points as $x_1, x_2, \ldots, x_N$, where $x_1$ and $x_N$ are the two endpoints $A$ and $B$ of the arc, respectively.

For any point $x$ on the subarc, we denote by $\phi$ the angle between $x$ and $x_N$ (measured clockwise), and we denote by $A_k(\phi)$ the maximal area of a polygon with vertices the center of the circle, the points $x$ and $x_N$, and $N - k - 1$ additional points on the subarc that lie between $x$ and $x_N$. Without loss of generality, we assume that the radius of the circle is 1, so that the area of the triangle that has as vertices two points on the circle and the center of the circle is $(1/2)\sin u$, where $u$ is the angle corresponding to the center. By viewing as state the angle $\phi_k$ between $x_k$ and $x_N$, and as control the angle $u_k$ between $x_k$ and $x_{k+1}$, we obtain the following DP algorithm:
$$A_k(\phi_k) = \max_{0 \le u_k \le \min\{\phi_k, \pi\}}\Bigl[\tfrac{1}{2}\sin u_k + A_{k+1}(\phi_k - u_k)\Bigr], \qquad k = 1, \ldots, N-2. \tag{1}$$
Once $x_{N-1}$ is chosen, there is no issue of further choice of a point lying between $x_{N-1}$ and $x_N$, so we have
$$A_{N-1}(\phi) = \tfrac{1}{2}\sin\phi, \tag{2}$$
using the formula for the area of the triangle formed by $x_{N-1}$, $x_N$, and the center of the circle.

It can be verified by induction that the above algorithm admits the closed-form solution
$$A_k(\phi_k) = \tfrac{1}{2}(N - k)\sin\Bigl(\frac{\phi_k}{N - k}\Bigr), \qquad k = 1, \ldots, N-1, \tag{3}$$
and that the optimal choice for $u_k$ is given by
$$u_k^* = \frac{\phi_k}{N - k}.$$
Indeed, the formula (3) holds for $k = N-1$, by Eq. (2). Assuming that Eq. (3) holds for $k+1$, we have from the DP algorithm (1)
$$A_k(\phi_k) = \max_{0 \le u_k \le \min\{\phi_k, \pi\}} H_k(u_k, \phi_k), \tag{4}$$
where
$$H_k(u_k, \phi_k) = \tfrac{1}{2}\sin u_k + \tfrac{1}{2}(N - k - 1)\sin\Bigl(\frac{\phi_k - u_k}{N - k - 1}\Bigr). \tag{5}$$
It can be verified that for a fixed $\phi_k$, and in the range $0 \le u_k \le \min\{\phi_k, \pi\}$, the function $H_k(\cdot, \phi_k)$ is concave (its second derivative is negative) and its derivative is 0 only at the point $u_k = \phi_k/(N - k)$, which must therefore be its unique maximum. Substituting this value of $u_k$ in Eqs. (4) and (5), we obtain
$$A_k(\phi_k) = \tfrac{1}{2}\sin\Bigl(\frac{\phi_k}{N - k}\Bigr) + \tfrac{1}{2}(N - k - 1)\sin\Bigl(\frac{\phi_k - \phi_k/(N - k)}{N - k - 1}\Bigr) = \tfrac{1}{2}(N - k)\sin\Bigl(\frac{\phi_k}{N - k}\Bigr),$$
and the induction is complete.

Thus, given an optimally placed point $x_k$ on the subarc with corresponding angle $\phi_k$, the next point $x_{k+1}$ is obtained by advancing clockwise by $\phi_k/(N - k)$. This process, when started at $x_1$ with $\phi_1$ equal to the angle between $x_1$ and $x_N$, yields as the optimal solution an equally spaced placement of the points on the subarc.
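The closed form can be checked numerically against a brute-force version of the DP recursion; the grid maximization below is a stand-in for the analytic concavity argument, with hypothetical values of $N$ and $\phi$.

```python
import math

# Numeric check of the closed form in 1.27: the DP recursion
#   A_k(phi) = max_{0 <= u <= min(phi, pi)} [ (1/2) sin u + A_{k+1}(phi - u) ],
#   A_{N-1}(phi) = (1/2) sin phi,
# should agree with A_k(phi) = (1/2)(N - k) sin(phi / (N - k)).

def A(k, phi, N, grid=200):
    if k == N - 1:
        return 0.5 * math.sin(phi)
    hi = min(phi, math.pi)
    # brute-force maximization over a grid of candidate angles u
    return max(0.5 * math.sin(hi * j / grid) +
               A(k + 1, phi - hi * j / grid, N, grid)
               for j in range(grid + 1))

N, phi = 4, 2.0          # hypothetical instance
dp = A(1, phi, N)
closed = 0.5 * (N - 1) * math.sin(phi / (N - 1))
print(abs(dp - closed) < 1e-3)  # True: grid DP matches the closed form
```

The grid value slightly undershoots the closed form (it only visits grid points), which is why a small tolerance is used.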

Vol. I, Chapter 2

www

(a) We denote by $P_k$ the OPEN list after $k$ nodes have been removed from OPEN (i.e., after $k$ iterations of the algorithm). We also denote by $d_j^k$ the value of $d_j$ at this time. Let $b_k = \min_{j \in P_k}\{d_j^k\}$.

First, we show by induction that $b_0 \le b_1 \le \cdots \le b_k$. Indeed, $b_0 = 0$ and $b_1 = \min_j\{a_{sj}\} \ge 0$, which implies that $b_0 \le b_1$. Next, we assume that $b_0 \le \cdots \le b_k$ for some $k \ge 1$; we shall prove that $b_k \le b_{k+1}$. Let $j_{k+1}$ be the node removed from OPEN during the $(k+1)$st iteration. By assumption, $d_{j_{k+1}}^k = \min_{j \in P_k}\{d_j^k\} = b_k$, and we also have
$$d_i^{k+1} = \min\bigl\{d_i^k,\ d_{j_{k+1}}^k + a_{j_{k+1} i}\bigr\}.$$
We have $P_{k+1} = \bigl(P_k \setminus \{j_{k+1}\}\bigr) \cup N_{k+1}$, where $N_{k+1}$ is the set of nodes $i \notin P_k$ satisfying $d_i^{k+1} = d_{j_{k+1}}^k + a_{j_{k+1} i}$. Therefore,
$$\min_{i \in P_{k+1}}\{d_i^{k+1}\} = \min\Bigl\{\min_{i \in P_k \setminus \{j_{k+1}\}}\{d_i^{k+1}\},\ \min_{i \in N_{k+1}}\{d_i^{k+1}\}\Bigr\}.$$
Clearly,
$$\min_{i \in N_{k+1}}\{d_i^{k+1}\} = \min_{i \in N_{k+1}}\bigl\{d_{j_{k+1}}^k + a_{j_{k+1} i}\bigr\} \ge d_{j_{k+1}}^k.$$
Moreover,
$$\min_{i \in P_k \setminus \{j_{k+1}\}}\{d_i^{k+1}\} = \min_{i \in P_k \setminus \{j_{k+1}\}}\min\bigl\{d_i^k,\ d_{j_{k+1}}^k + a_{j_{k+1} i}\bigr\} \ge \min\Bigl\{\min_{i \in P_k}\{d_i^k\},\ d_{j_{k+1}}^k\Bigr\} = d_{j_{k+1}}^k,$$
because we remove from OPEN the node with the minimum $d_i^k$. It follows that $b_{k+1} = \min_{i \in P_{k+1}}\{d_i^{k+1}\} \ge d_{j_{k+1}}^k = b_k$.

Now we may prove that once a node exits OPEN, it never re-enters. Indeed, suppose that some node $i$ exits OPEN after the $k^*$th iteration of the algorithm; then $d_i^{k^*-1} = b_{k^*-1}$. If node $i$ re-enters OPEN after the $l$th iteration (with $l > k^*$), then we have
$$d_i^{l-1} > d_i^l = d_{j_l}^{l-1} + a_{j_l i} \ge d_{j_l}^{l-1} = b_{l-1}.$$
On the other hand, since $d_i$ is non-increasing, we have $b_{k^*-1} = d_i^{k^*-1} \ge d_i^{l-1}$. Thus we obtain $b_{k^*-1} > b_{l-1}$, which contradicts the fact that $b_k$ is non-decreasing.

Next, we claim the following: after the $k$th iteration, $d_i^k$ equals the length of the shortest possible path from $s$ to node $i \in P_k$ under the restriction that all intermediate nodes belong to $C_k$ (the set of nodes that have exited OPEN). The proof will be done by induction on $k$. For $k = 1$, we have $C_1 = \{s\}$ and $d_i^1 = a_{si}$, and the claim is obviously true. Next, we assume that the claim is true after iterations $1, \ldots, k$; we shall show that it is also true after iteration $k+1$. The node $j_{k+1}$ removed from OPEN at the $(k+1)$st iteration satisfies $\min_{i \in P_k}\{d_i^k\} = d_{j_{k+1}}^k$. Notice now that all neighbors of the nodes in $C_k$ belong either to $C_k$ or to $P_k$. It follows that the shortest path from $s$ to $j_{k+1}$ either goes through $C_k$, or it exits $C_k$, then passes through a node $j \in P_k$, and eventually reaches $j_{k+1}$. If the latter case applies, then the length of this path is at least the length of the shortest path from $s$ to $j$ through $C_k$; by the induction hypothesis, this equals $d_j^k$, which is at least $d_{j_{k+1}}^k$. It follows that, for node $j_{k+1}$ exiting the OPEN list, $d_{j_{k+1}}^k$ equals the length of the shortest path from $s$ to $j_{k+1}$. Similarly, all nodes that have exited previously have their current estimate

of $d_i$ equal to the corresponding shortest distance from $s$.* Notice now that
$$d_i^{k+1} = \min\bigl\{d_i^k,\ d_{j_{k+1}}^k + a_{j_{k+1} i}\bigr\}.$$
For $i \notin P_k$ and $i \in P_{k+1}$, it follows that the only neighbor of $i$ in $C_{k+1} = C_k \cup \{j_{k+1}\}$ is node $j_{k+1}$; for such a node $i$, $d_i^k = \infty$, which leads to $d_i^{k+1} = d_{j_{k+1}}^k + a_{j_{k+1} i}$. For $i \ne j_{k+1}$ and $i \in P_k$, the augmentation of $C_k$ by including $j_{k+1}$ offers one more path from $s$ to $i$ through $C_{k+1}$, namely that through $j_{k+1}$. Recall that the shortest path from $s$ to $i$ through $C_k$ has length $d_i^k$ (by the induction hypothesis). Thus
$$d_i^{k+1} = \min\bigl\{d_i^k,\ d_{j_{k+1}}^k + a_{j_{k+1} i}\bigr\}$$
is the length of the shortest path from $s$ to $i$ through $C_{k+1}$. The fact that each node exits OPEN with its current estimate of $d_i$ equal to its shortest distance from $s$ has been proved in the course of the previous inductive argument.

(b) Since each node enters the OPEN list at most once, the algorithm will terminate in at most $N - 1$ iterations. Updating the $d_i$'s during an iteration and selecting the node to exit OPEN requires $O(N)$ arithmetic operations (i.e., a constant number of operations per node). Thus, the total number of operations is $O(N^2)$.

2.6 www

Proposition: If there exists a path from the origin to each node in $T$, the modified version of the label correcting algorithm terminates with UPPER $< \infty$ and yields a shortest path from the origin to each node in $T$. Otherwise the algorithm terminates with UPPER $= \infty$.

Proof: The proof is analogous to the proof of Proposition 3.1. To show that this algorithm terminates, we can use the identical argument in the proof of Proposition 3.1.

Now suppose that for some node $t \in T$, there is no path from $s$ to $t$. Then a node $i$ such that $(i, t)$ is an arc cannot enter the OPEN list, because this would establish that there is a path from $s$ to $i$, and therefore also a path from $s$ to $t$. Thus, $d_t$ is never changed and UPPER is never reduced from its initial value of $\infty$.

Suppose now that there is a path from $s$ to each node $t \in T$. Then, since there is a finite number of distinct lengths of paths from $s$ to each $t \in T$ that do not contain any cycles, and each cycle has nonnegative length, there is also a shortest path. For an arbitrary $t \in T$, let $(s, j_1, j_2, \ldots, j_k, t)$ be a shortest path and let $d_t^*$ be the corresponding shortest distance. We will show that the value of UPPER upon termination must be equal to $d^* = \max_{t \in T} d_t^*$. Indeed, each subpath $(s, j_1, \ldots, j_m)$, $m = 1, \ldots, k$, of the shortest path $(s, j_1, \ldots, j_k, t)$ must be a shortest path from $s$ to $j_m$. If the value of UPPER is larger than $d^*$ at termination, the same must be true throughout the algorithm, and therefore UPPER will also be larger than the length of all the paths $(s, j_1, \ldots, j_m)$, $m = 1, \ldots, k$, throughout the algorithm, in view of the nonnegative arc length assumption. If, for each $t \in T$, the parent node $j_k$ enters the OPEN list with $d_{j_k}$ equal to the shortest distance from $s$ to $j_k$, UPPER will be set to $d^*$ in step 2 immediately following the next time the last of the nodes $j_k$ is examined by the algorithm. It follows that, for some $t \in T$, the associated parent node $j_k$ will never enter the OPEN list with $d_{j_k}$ equal to the shortest distance from $s$ to $j_k$. Similarly, and using also the nonnegative length assumption, this means that node $j_{k-1}$ will never enter the OPEN list with $d_{j_{k-1}}$ equal to the shortest distance from $s$ to $j_{k-1}$. Proceeding backwards, we conclude that $j_1$ never enters the OPEN list with $d_{j_1}$ equal to the shortest distance from $s$ to $j_1$, which is equal to the length of the arc $(s, j_1)$. This happens, however, at the first iteration of the algorithm, and we obtain a contradiction. It follows that at termination, UPPER will be equal to $d^*$.

Finally, it can be seen that, upon termination of the algorithm, the path constructed by tracing the parent nodes backward from $t$ to $s$ has length equal to $d_t^*$ for each $t \in T$. Thus the path is a shortest path from $s$ to $t$.
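The modified label correcting method just analyzed can be sketched for a single destination $t$. This is a minimal illustration with a hypothetical graph, assuming $t$ is reachable from $s$: a node $j$ enters OPEN only when its new label beats both $d_j$ and UPPER, and UPPER is tightened whenever the label of $t$ improves.

```python
from collections import deque

# Sketch of the modified label correcting method (single destination t).
# succ: dict node -> list of (successor, nonnegative arc length).

def label_correcting(succ, s, t):
    d = {s: 0}
    parent = {}
    upper = float("inf")
    open_q = deque([s])
    while open_q:
        i = open_q.popleft()
        for j, a_ij in succ.get(i, []):
            if d[i] + a_ij < min(d.get(j, float("inf")), upper):
                d[j] = d[i] + a_ij
                parent[j] = i
                if j == t:
                    upper = d[j]        # step 2: UPPER := d_t
                else:
                    open_q.append(j)
    path, node = [t], t                 # trace parent nodes backward from t
    while node != s:
        node = parent[node]
        path.append(node)
    return upper, path[::-1]

g = {"s": [("1", 3), ("2", 1)], "2": [("1", 1)], "1": [("t", 1)]}
print(label_correcting(g, "s", "t"))  # (3, ['s', '2', '1', 't'])
```

If $t$ were unreachable, UPPER would remain $\infty$, in line with the proposition.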
* Strictly speaking, this is the shortest distance from $s$ to these nodes, because paths are directed from $s$ to the nodes.
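The OPEN-list method analyzed in part (a) above (Dijkstra's algorithm) can be sketched as follows; the example graph is hypothetical, and the dictionary-based representation is a choice made here.

```python
# Minimal sketch of the OPEN-list shortest path method of part (a)
# (Dijkstra's algorithm): at each iteration the node with minimal label
# exits OPEN and, with nonnegative arc lengths, never re-enters.

def dijkstra(succ, s):
    """succ: dict node -> list of (successor, arc length >= 0)."""
    d = {s: 0}
    open_list = {s}
    closed = set()                              # the set C_k of exited nodes
    while open_list:
        j = min(open_list, key=lambda i: d[i])  # node with minimal d_i exits
        open_list.remove(j)
        closed.add(j)
        for i, a_ji in succ.get(j, []):
            if d[j] + a_ji < d.get(i, float("inf")):
                d[i] = d[j] + a_ji
                if i not in closed:
                    open_list.add(i)
    return d

g = {"s": [("a", 1), ("b", 4)], "a": [("b", 2), ("t", 6)], "b": [("t", 1)]}
print(dijkstra(g, "s"))  # {'s': 0, 'a': 1, 'b': 3, 't': 4}
```

Selecting the minimum-label node by a linear scan, as here, gives the $O(N)$ per-iteration cost and $O(N^2)$ total cost noted in part (b).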

2.9 www

(a) It was shown in the proof of Proposition 3.2 that $(P, p)$ satisfies CS throughout the original algorithm. Note that deleting arcs does not cause the CS conditions to no longer hold. Therefore $(P, p)$ satisfies CS throughout this algorithm as well.

It was also shown in the text that if a pair $(P, p)$ satisfies the CS conditions, then the portion of the path $P$ between node $s$ and any node $i \in P$ is a shortest path from $s$ to $i$. Now consider any node $j$ that becomes the terminal node of $P$ through an extension using $(i, j)$. If there is a shortest path from $s$ to $t$ that does not include node $j$, then removing the arcs $(k, j)$ with $k \ne i$ yields a graph including the same shortest path. If the only shortest path from $s$ to $t$ does include node $j$, then, since $P$ is a shortest path from $s$ to $j$, there is a shortest path from $s$ to $t$ that has $P$ as its portion up to node $j$. Thus removing the arcs $(k, j)$ with $k \ne i$ yields a graph including a shortest path from the original graph. If node $j$ has no outgoing arcs, then no path from $s$ to $t$ can include $j$; thus removing $j$ yields a graph including the same paths from $s$ to $t$ as in the original graph. Therefore, both types of deletions leave the shortest distance from $s$ to $t$ unaffected.

We can view the auction algorithm with graph reduction as follows: an iteration of the original auction algorithm is applied, followed by arc deletions that do not affect the CS conditions and that leave the shortest distance from $s$ to $t$ unchanged; another iteration of the original auction algorithm is applied to the new graph, followed by arc deletions; and so on. If an iteration of the original auction algorithm yields a path $P$ with $t$ as the terminal node, $P$ is a shortest path from $s$ to $t$ in the latest modified graph. Since we have shown that each new graph has the same shortest distance from $s$ to $t$ as the original graph, $P$ must also be a shortest path from $s$ to $t$ in the original graph.

Now assume that no iteration of the original auction algorithm ever yields a path $P$ with $t$ as the terminal node; i.e., there is no path from $s$ to $t$. Assume that the modified algorithm never terminates. Since there are a finite number of arcs and nodes, there can be only a finite number of arc and node deletions. Consider the algorithm after the last deletion. Since there are no more deletions, there must be an outgoing arc $(i_1, i_2)$ from any terminal node $i_1$ of $P$. Since the algorithm never terminates, $i_2$ must eventually be added to $P$. There must also be an outgoing arc $(i_2, i_3)$ from $i_2$, and so on. However, there is only a finite number of nodes, so some node must be repeated, which implies there is a cycle in $P$. Since there are no arcs incident to $s$, there must be some arc not part of the cycle that is incident to a node $k$ in the cycle. But then this node has two incoming arcs. When $k$ became the terminal node of $P$, one of these arcs should have been deleted, yielding a contradiction. Thus the algorithm must terminate. Since there is no path from $s$ to $t$, the algorithm can only terminate by deleting $s$.

(b) Consider any cycle of zero length. Let $j$ be the first node of the cycle to become a terminal node of the path $P$. Let $i$ be the node preceding $j$ in the path $P$, and let $l$ be the node preceding $j$ in the cycle. All incoming arcs $(k, j)$ of $j$ with $k \ne i$, including the arc $(l, j)$, are then deleted. Therefore, our problem reduces to one in which there are no cycles of zero length.

(c) The iterations of the modified algorithm applied to the problem of Exercise 2.8 are given below. The first 13 iterations are the same as in the original algorithm, with the exception that at iteration 2, where the path is extended to include node 2, arc (4,2) is also deleted. As a result of this deletion, after the contraction in iteration 13, the price of node 4 is changed to $L$, resulting in faster convergence of the algorithm.

Iteration   Path           Price vector p            Action
1           (1)            (0,0,0,0,0)               contraction at 1
2           (1)            (1,0,0,0,0)               extension to 2
3           (1,2)          (1,0,0,0,0)               contraction at 2
4           (1)            (1,1,0,0,0)               contraction at 1
5           (1)            (2,1,0,0,0)               extension to 2
6           (1,2)          (2,1,0,0,0)               extension to 3
7           (1,2,3)        (2,1,0,0,0)               contraction at 3
8           (1,2)          (2,1,1,0,0)               contraction at 2
9           (1)            (2,2,1,0,0)               contraction at 1
10          (1)            (3,2,1,0,0)               extension to 2
11          (1,2)          (3,2,1,0,0)               extension to 3
12          (1,2,3)        (3,2,1,0,0)               extension to 4
13          (1,2,3,4)      (3,2,1,0,0)               contraction at 4
14          (1,2,3)        (3,2,1,L,0)               contraction at 3
15          (1,2)          (3,2,L+1,L,0)             contraction at 2
16          (1)            (3,L+2,L+1,L,0)           contraction at 1
17          (1)            (L+3,L+2,L+1,L,0)         extension to 2
18          (1,2)          (L+3,L+2,L+1,L,0)         extension to 3
19          (1,2,3)        (L+3,L+2,L+1,L,0)         extension to 4
20          (1,2,3,4)      (L+3,L+2,L+1,L,0)         extension to 5
21          (1,2,3,4,5)    (L+3,L+2,L+1,L,0)         done

2.13 www

(a) We first need to show that $d_i^k$ is the length of the shortest $k$-arc path originating at $i$, for $i \ne t$. For $k = 1$,
$$d_i^1 = \min_j c_{ij},$$
which is the length of the shortest arc out of $i$. Assume that $d_i^{k-1}$ is the length of the shortest $(k-1)$-arc path out of $i$. Then
$$d_i^k = \min_j\bigl\{c_{ij} + d_j^{k-1}\bigr\}.$$
If $d_i^k$ were not the length of the shortest $k$-arc path, the initial arc of that shortest path would have to pass through a node other than the minimizing $j$ above. This is true since $d_j^{k-1}$ is no greater than the length of any $(k-1)$-arc path out of $j$. Let $l$ be the alternative node. From the optimality principle, the distance of the path through $l$ is $c_{il} + d_l^{k-1}$, which by the DP recursion is at least $d_i^k$. But this contradicts the choice of $l$. Thus $d_i^k$ is the length of the shortest $k$-arc path out of $i$.

Since $d_t^k = 0$ for all $k$, once a $k$-arc path out of $i$ reaches $t$, we have $d_i^\kappa = d_i^k$ for all $\kappa \ge k$. But with all arc lengths positive, $d_i^k$ is then just the shortest distance from $i$ to $t$. Clearly, there is some finite $k$ such that the shortest $k$-arc path out of $i$ reaches $t$. If this were not true, the assumption of positive arc lengths would imply that the distance from $i$ to $t$ is infinite.
Thus, the algorithm will yield the shortest distances in a finite number of steps. We can estimate the number of steps $N_i$ as
$$N_i \le \frac{d_{it}}{\min_{j,k} c_{jk}},$$
since every arc on the shortest path from $i$ to $t$ has length at least $\min_{j,k} c_{jk}$.
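The $k$-arc DP of part (a) can be sketched directly; the arc lengths below are hypothetical and strictly positive, and the initial condition $d_i^0 = 0$ is the one used in part (a).

```python
# Sketch of the DP in 2.13(a): with d_t^k = 0 for all k and d_i^0 = 0,
#   d_i^k = min_j { c_ij + d_j^{k-1} }
# is the length of the shortest k-arc path out of i, and it settles at the
# shortest distance from i to t once k is large enough.

def k_arc_dp(c, t, K):
    """c: dict i -> dict j -> arc length c_ij; returns d^K for all nodes."""
    nodes = set(c) | {j for i in c for j in c[i]}
    d = {i: 0 for i in nodes}            # initial condition d_i^0 = 0
    for _ in range(K):
        d = {i: 0 if i == t else
                min((c_ij + d[j] for j, c_ij in c.get(i, {}).items()),
                    default=float("inf"))
             for i in nodes}
    return d

c = {1: {2: 1, 3: 5}, 2: {3: 1, 4: 4}, 3: {4: 1}}
print(sorted(k_arc_dp(c, 4, 3).items()))  # [(1, 3), (2, 2), (3, 1), (4, 0)]
```

After $K = 3$ iterations every shortest path in this small graph has reached $t = 4$, so the estimates have converged.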

(b) Let $\bar d_i^k$ be the distance estimate generated using the initial condition $d_i^0 = \infty$, and let $\underline d_i^k$ be the estimate generated using the initial condition $d_i^0 = 0$. In addition, let $d_i$ be the shortest distance from $i$ to $t$.

Lemma:
$$\underline d_i^k \le \underline d_i^{k+1} \le d_i \le \bar d_i^{k+1} \le \bar d_i^k, \tag{1}$$
$$\underline d_i^k = d_i = \bar d_i^k \quad \text{for } k \text{ sufficiently large}. \tag{2}$$

Proof: Relation (1) follows from the monotonicity property of DP; note that $\bar d_i^1 \le \bar d_i^0$ and that $\underline d_i^1 \ge \underline d_i^0$. Relation (2) follows immediately from the convergence of DP (given $d_i^0 = \infty$) and from part (a).

Proposition: For every $k$ there exists a time $T_k$ such that for all $T \ge T_k$,
$$\underline d_i^k \le d_i^T \le \bar d_i^k, \qquad i = 1, 2, \ldots, N,$$
where $d_i^T$ is the estimate held at node $i$ at time $T$.

Proof: The proof follows by induction. For $k = 0$ the proposition is true, given the positive arc length assumption. Assume it is true for a given $k$. Let $N(i)$ be the set of nodes adjacent to $i$. For every $j \in N(i)$ there exists a time $T_k^j$ such that
$$\underline d_j^k \le d_j^T \le \bar d_j^k, \qquad \forall\, T \ge T_k^j.$$
Let $T'$ be the first time $i$ updates its distance estimate after all of the estimates $d_j^{T_k^j}$, $j \in N(i)$, have arrived. Let $d_{ij}^{T'}$ be the estimate of $d_j$ that $i$ has at time $T'$. Note that this may differ from $d_j^{T_k^j}$, since later estimates from $j$ may have arrived before $T'$. From the Lemma,
$$\underline d_j^k \le d_{ij}^{T'} \le \bar d_j^k,$$
which, coupled with the monotonicity of DP, implies
$$\underline d_i^{k+1} \le d_i^T \le \bar d_i^{k+1}, \qquad \forall\, T \ge T'.$$
Since each node never stops transmitting, $T'$ is finite, and the proposition is proved.

Using the Lemma, we see that there is a finite $k$ such that $\underline d_i^\kappa = d_i = \bar d_i^\kappa$ for all $\kappa \ge k$. Thus, from the Proposition, there exists a finite time $T$ such that $d_i^{T'} = d_i$ for all $T' \ge T$ and all $i$.

Vol. I, Chapter 3

www

This problem is similar to the Brachistochrone Problem (Example 4.2) described in the text. As in that problem, we introduce the system $\dot x = u$, and we have a fixed terminal state problem with $x(0) = a$ and $x(T) = b$. Letting
$$g(x, u) = \frac{\sqrt{1 + u^2}}{C x},$$
the Hamiltonian is
$$H(x, u, p) = g(x, u) + p u.$$
Minimization of the Hamiltonian with respect to $u$ yields
$$p(t) = -\nabla_u g\bigl(x(t), u(t)\bigr).$$
Since the Hamiltonian is constant along an optimal trajectory, we have
$$g\bigl(x(t), u(t)\bigr) - \nabla_u g\bigl(x(t), u(t)\bigr)\, u(t) = \text{constant}.$$
Substituting the expression for $g$, we have
$$\frac{\sqrt{1 + u^2}}{C x} - \frac{u^2}{C x \sqrt{1 + u^2}} = \frac{1}{C x \sqrt{1 + u^2}} = \text{constant},$$
which simplifies to
$$\bigl(x(t)\bigr)^2\bigl(1 + (\dot x(t))^2\bigr) = \text{constant}.$$
Thus an optimal trajectory satisfies the differential equation
$$\dot x(t) = \frac{\sqrt{D - \bigl(x(t)\bigr)^2}}{x(t)}.$$
It can be seen through a straightforward calculation that the curve
$$\bigl(x(t)\bigr)^2 + (t - d)^2 = D$$
satisfies this differential equation, and thus the curve of minimum travel time from $A$ to $B$ is an arc of a circle.
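The "straightforward calculation" can be made explicit; using the same symbols as above:

```latex
% Implicit differentiation of (x(t))^2 + (t - d)^2 = D gives
\[
2\, x(t)\, \dot x(t) + 2\,(t - d) = 0
\quad\Longrightarrow\quad
\dot x(t) = -\frac{t - d}{x(t)}
= \mp\,\frac{\sqrt{D - (x(t))^2}}{x(t)},
\]
% since (t - d)^2 = D - (x(t))^2. Hence
\[
\bigl(x(t)\bigr)^2\bigl(1 + (\dot x(t))^2\bigr)
= x^2 + \bigl(D - x^2\bigr) = D,
\]
% which is exactly the constancy condition obtained from the Hamiltonian.
```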

3.9 www

We have the system $\dot x(t) = A x(t) + B u(t)$, for which we want to minimize the quadratic cost
$$x(T)' Q_T\, x(T) + \int_0^T\bigl(x(t)' Q\, x(t) + u(t)' R\, u(t)\bigr)\, dt.$$
The Hamiltonian here is
$$H(x, u, p) = x' Q x + u' R u + p'(A x + B u),$$
and the adjoint equation is
$$\dot p(t) = -A' p(t) - 2 Q x(t),$$
with the terminal condition
$$p(T) = 2 Q_T\, x(T).$$
Minimizing the Hamiltonian with respect to $u$ yields the optimal control
$$u^*(t) = \arg\min_u\bigl[x^*(t)' Q x^*(t) + u' R u + p(t)'\bigl(A x^*(t) + B u\bigr)\bigr] = -\tfrac{1}{2} R^{-1} B' p(t).$$
We now hypothesize a linear relation between $x^*(t)$ and $p(t)$:
$$2 K(t)\, x^*(t) = p(t), \qquad \forall\, t \in [0, T],$$
and show that $K(t)$ can be obtained by solving the Riccati equation. Substituting this value of $p(t)$ into the previous equation, we have
$$u^*(t) = -R^{-1} B' K(t)\, x^*(t).$$
By combining this result with the system equation, we have
$$\dot x^*(t) = \bigl(A - B R^{-1} B' K(t)\bigr) x^*(t). \tag{$*$}$$
Differentiating $2 K(t)\, x^*(t) = p(t)$ and using the adjoint equation yields
$$2 \dot K(t)\, x^*(t) + 2 K(t)\, \dot x^*(t) = -2 A' K(t)\, x^*(t) - 2 Q x^*(t).$$
Combining with $(*)$, we have
$$\dot K(t)\, x^*(t) + K(t)\bigl(A - B R^{-1} B' K(t)\bigr) x^*(t) = -A' K(t)\, x^*(t) - Q x^*(t),$$
and we thus see that $K(t)$ should satisfy the Riccati equation
$$\dot K(t) = -K(t) A - A' K(t) + K(t) B R^{-1} B' K(t) - Q.$$
From the terminal condition $p(T) = 2 Q_T\, x(T)$, we have $K(T) = Q_T$, from which we can solve for $K(t)$ using the Riccati equation. Once we have $K(t)$, we have the optimal control $u^*(t) = -R^{-1} B' K(t)\, x^*(t)$. By reversing the previous arguments, this control can then be shown to satisfy all the conditions of the Pontryagin Minimum Principle.
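The Riccati equation can be sanity-checked numerically in the scalar case. The data below are hypothetical: with $A = 0$, $B = 1$, $Q = 1$, $R = 1$ the equation reduces to $\dot K = K^2 - 1$ with $K(T) = Q_T$, and in reversed time $\tau = T - t$ it becomes $dK/d\tau = 1 - K^2$, whose solution for $|Q_T| < 1$ is $K = \tanh(\tau + \operatorname{atanh} Q_T)$.

```python
import math

# Euler-integrate Kdot = -KA - A'K + K B R^-1 B' K - Q backward from t = T,
# scalar case, and compare with the known tanh solution (A=0, B=Q=R=1).

def riccati_backward(A, B, Q, R, QT, T, steps=100000):
    dt = T / steps
    K = QT
    for _ in range(steps):
        Kdot = -2.0 * A * K + K * B * (1.0 / R) * B * K - Q
        K -= Kdot * dt   # step backward in t: K(t - dt) ~ K(t) - Kdot(t) dt
    return K

K0 = riccati_backward(A=0.0, B=1.0, Q=1.0, R=1.0, QT=0.5, T=1.0)
exact = math.tanh(1.0 + math.atanh(0.5))
print(abs(K0 - exact) < 1e-3)  # True
```

The same backward integration from $K(T) = Q_T$ is how $K(t)$ would be computed in the matrix case, with matrix products in place of the scalar ones.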

Vol. I, Chapter 4

(www)

(a) Clearly, $J_N(x)$ is continuous. Assume that $J_{k+1}(x)$ is continuous. We have

$$J_k(x) = \min_{u \in \{0,1,\ldots\}} \big\{ cu + L(x+u) + G(x+u) \big\},$$

where

$$G(y) = E_{w_k}\big\{ J_{k+1}(y - w_k) \big\}, \qquad L(y) = E_{w_k}\big\{ p\max(0, w_k - y) + h\max(0, y - w_k) \big\}.$$

Thus, $L$ is continuous. Since $J_{k+1}$ is continuous, $G$ is continuous for bounded $w_k$. Assume that $J_k$ is not continuous. Then there exists an $\hat x$ such that, as $y \to \hat x$, $J_k(y)$ does not approach $J_k(\hat x)$. Let

$$u_y = \arg\min_{u \in \{0,1,\ldots\}} \big\{ cu + L(y+u) + G(y+u) \big\}.$$

Since $L$ and $G$ are continuous, the discontinuity of $J_k$ at $\hat x$ implies

$$\lim_{y \to \hat x} u_y \ne u_{\hat x}.$$

But since $u_y$ is optimal for $y$,

$$\lim_{y \to \hat x} \big\{ cu_y + L(y+u_y) + G(y+u_y) \big\} < \lim_{y \to \hat x} \big\{ cu_{\hat x} + L(y+u_{\hat x}) + G(y+u_{\hat x}) \big\} = J_k(\hat x).$$

This contradicts the optimality of $u_{\hat x}$ for $\hat x$. Thus, $J_k$ is continuous.

(b) Let

$$Y_k(x) = J_k(x+1) - J_k(x).$$

Clearly $Y_N(x)$ is a nondecreasing function. Assume that $Y_{k+1}(x)$ is nondecreasing. Then

$$Y_k(x+\delta) - Y_k(x) = c(u_{x+\delta+1} - u_{x+\delta}) - c(u_{x+1} - u_x)$$
$$\qquad + L(x+\delta+1+u_{x+\delta+1}) - L(x+\delta+u_{x+\delta}) - L(x+1+u_{x+1}) + L(x+u_x)$$
$$\qquad + G(x+\delta+1+u_{x+\delta+1}) - G(x+\delta+u_{x+\delta}) - G(x+1+u_{x+1}) + G(x+u_x).$$

Since $J_k$ is continuous, $u_{y+\delta} = u_y$ for $\delta$ sufficiently small. Thus, with $\delta$ small,

$$Y_k(x+\delta) - Y_k(x) = L(x+\delta+1+u_{x+1}) - L(x+\delta+u_x) - L(x+1+u_{x+1}) + L(x+u_x)$$
$$\qquad + G(x+\delta+1+u_{x+1}) - G(x+\delta+u_x) - G(x+1+u_{x+1}) + G(x+u_x).$$

Now, since the control and penalty costs are linear, the optimal order given a stock of $x$ is at most the optimal order given a stock of $x+1$ plus one unit. Thus

$$u_{x+1} \le u_x \le u_{x+1} + 1.$$

If $u_x = u_{x+1} + 1$, then $Y_k(x+\delta) - Y_k(x) = 0$ and we have the desired result. Assume that $u_x = u_{x+1}$. Since $L$ is convex, $L(x+1) - L(x)$ is nondecreasing. Using the assumption that $Y_{k+1}(x)$ is nondecreasing, we have

$$Y_k(x+\delta) - Y_k(x) = \underbrace{L(x+\delta+1+u_x) - L(x+\delta+u_x) - \big[ L(x+1+u_x) - L(x+u_x) \big]}_{\ge\, 0}$$
$$\qquad + \underbrace{E_{w_k}\Big\{ J_{k+1}(x+\delta+1+u_x-w_k) - J_{k+1}(x+\delta+u_x-w_k) - \big[ J_{k+1}(x+1+u_x-w_k) - J_{k+1}(x+u_x-w_k) \big] \Big\}}_{\ge\, 0}$$
$$\ge 0.$$

Thus, $Y_k(x)$ is a nondecreasing function of $x$.

(c) From their definition and a straightforward induction it can be shown that $J_k(x)$ and $J_k(x,u)$ are bounded below. Furthermore, since $\lim_{x \to \infty} L(x) = \infty$, we obtain $\lim_{x \to \infty} J_k(x,0) = \infty$. From the definition of $J_k(x,u)$, we have

$$J_k(x,u) = J_k(x+1, u-1) + c, \qquad u \in \{1,2,\ldots\}. \tag{2}$$

Let $S_k$ be the smallest real number satisfying

$$J_k(S_k, 0) = J_k(S_k+1, 0) + c. \tag{1}$$

We show that $S_k$ is well defined. If no $S_k$ satisfying (1) exists, we must have, because $J_k$ is continuous, either

$$J_k(x,0) - J_k(x+1,0) > c \quad \text{for all } x \in \Re, \qquad \text{or} \qquad J_k(x,0) - J_k(x+1,0) < c \quad \text{for all } x \in \Re.$$

The first possibility implies that $J_k(x,0)$ is strictly decreasing, which contradicts the fact that $\lim_{x\to\infty} J_k(x,0) = \infty$. The second possibility implies that $J_k(x,0) + cx$ is increasing in $x$, so that $\lim_{x\to-\infty}\big(J_k(x,0) + cx\big) < \infty$. However, using the boundedness of $J_{k+1}(x)$ from below, we obtain $\lim_{x\to-\infty}\big(J_k(x,0) + cx\big) = \infty$. The contradiction shows that $S_k$ is well defined.

We now derive the form of an optimal policy $\mu_k^*(x)$. Fix some $x$ and consider first the case $x \ge S_k$. Using the fact that $J_k(x+1,u) - J_k(x,u)$ is a nondecreasing function of $x$, we have for any $u \in \{0,1,\ldots\}$

$$J_k(x+1,u) - J_k(x,u) \ge J_k(S_k+1,u) - J_k(S_k,u) = J_k(S_k+1,0) - J_k(S_k,0) = -c.$$

Therefore,

$$J_k(x,u+1) = J_k(x+1,u) + c \ge J_k(x,u), \qquad u \in \{0,1,\ldots\},\ x \ge S_k.$$

This shows that $u = 0$ minimizes $J_k(x,u)$ for all $x \ge S_k$. Now let $x \in [S_k - n,\, S_k - n + 1)$, $n \in \{1,2,\ldots\}$. Using (2), we have

$$J_k(x, n+m) - J_k(x,n) = J_k(x+n, m) - J_k(x+n, 0) \ge 0, \qquad m \in \{0,1,\ldots\}, \tag{3}$$

since $x + n \ge S_k$. However, if $u < n$, then $x + u < S_k$ and

$$J_k(x+u+1, 0) - J_k(x+u, 0) < J_k(S_k+1, 0) - J_k(S_k, 0) = -c.$$

Therefore,

$$J_k(x,u+1) = J_k(x+u+1, 0) + (u+1)c < J_k(x+u, 0) + uc = J_k(x,u), \qquad u < n. \tag{4}$$

Inequalities (3) and (4) show that $u = n$ minimizes $J_k(x,u)$ whenever $x \in [S_k - n,\, S_k - n + 1)$.
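The monotone-order structure established in parts (b) and (c) can be observed on a small numeric instance. The sketch below (a two-stage horizon with illustrative costs and a uniform demand, none of it from the exercise) computes the optimal integer order at each stock level and checks that it is nonincreasing in the stock:

```python
# A small numeric instance of the DP above: integer order quantities,
# linear order cost c, shortage cost p, holding cost h, uniform demand.
# All numbers are illustrative, not from the exercise.
c, p, h = 1.0, 4.0, 1.0
demand = [0, 1, 2]                       # equally likely values of w_k
xs = range(-5, 11)                       # stock levels examined
U = 10                                   # largest order considered

def L(y):
    # expected shortage + holding cost for post-order stock y
    return sum(p * max(0, w - y) + h * max(0, y - w) for w in demand) / len(demand)

# two-stage horizon: J1(x) = min_u [c u + L(x+u)], then G(y) = E[J1(y - w)]
J1 = {x: min(c * u + L(x + u) for u in range(U + 1)) for x in range(-20, 25)}

def G(y):
    return sum(J1[y - w] for w in demand) / len(demand)

order = {x: min(range(U + 1), key=lambda u: c * u + L(x + u) + G(x + u)) for x in xs}
# The optimal order is nonincreasing in the stock level, as part (b) implies.
assert all(order[x] >= order[x + 1] for x in list(xs)[:-1])
```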

4.18 www

Let the state $x_k$ be defined as

$$x_k = \begin{cases} T, & \text{if the selection has already terminated}, \\ 1, & \text{if the $k$th object observed has rank 1 among those observed so far}, \\ 0, & \text{otherwise}. \end{cases}$$

The system evolves according to

$$x_{k+1} = \begin{cases} T, & \text{if } u_k = \text{stop or } x_k = T, \\ w_k, & \text{if } u_k = \text{continue}. \end{cases}$$

The reward function is given by

$$g_k(x_k, u_k, w_k) = \begin{cases} k/N, & \text{if } x_k = 1 \text{ and } u_k = \text{stop}, \\ 0, & \text{otherwise}, \end{cases} \qquad g_N(x_N) = \begin{cases} 1, & \text{if } x_N = 1, \\ 0, & \text{otherwise}. \end{cases}$$

Note that if termination is selected at stage $k$ and $x_k \ne 1$, then the probability of success is 0. Thus, if $x_k = 0$ it is always optimal to continue. To complete the model we have to determine $P(w_k \mid x_k, u_k) = P(w_k)$ when the control is $u_k = \text{continue}$. At stage $k$, we have already observed $k$ objects, ranked among themselves. Since we know nothing else about these objects, the next object can, with equal probability, be in any of the $k+1$ possible rank positions relative to the objects already observed:

$$a_j < a_{i_1} < \cdots < a_{i_2} < \cdots < a_{i_k} \qquad (k+1 \text{ possible positions for the $(k+1)$st object}).$$

Thus,

$$P(w_k = 1) = \frac{1}{k+1}, \qquad P(w_k = 0) = \frac{k}{k+1}.$$

Proposition 1: If $k \in S_N = \big\{ i \mid \sum_{j=i}^{N-1} \tfrac{1}{j} \le 1 \big\}$, then

$$J_k(0) = \frac{k}{N}\sum_{i=k}^{N-1}\frac{1}{i}, \qquad J_k(1) = \frac{k}{N}.$$

Proof: For $k = N-1$,

$$J_{N-1}(0) = \max\Big\{ \underbrace{0}_{\text{stop}},\ \underbrace{E\{J_N(w_{N-1})\}}_{\text{continue}} \Big\} = \frac{1}{N}, \qquad J_{N-1}(1) = \max\Big\{ \underbrace{\frac{N-1}{N}}_{\text{stop}},\ \underbrace{\frac{1}{N}}_{\text{continue}} \Big\} = \frac{N-1}{N},$$

with $\mu_{N-1}^*(0) = \text{continue}$ and $\mu_{N-1}^*(1) = \text{stop}$. Note that $N-1 \in S_N$ for every $N$. Now assume the proposition is true for $J_{k+1}$. Then

$$J_k(0) = \max\Big\{ \underbrace{0}_{\text{stop}},\ \underbrace{E\{J_{k+1}(w_k)\}}_{\text{continue}} \Big\} = \frac{1}{k+1}\cdot\frac{k+1}{N} + \frac{k}{k+1}\cdot\frac{k+1}{N}\sum_{i=k+1}^{N-1}\frac{1}{i} = \frac{k}{N}\sum_{i=k}^{N-1}\frac{1}{i},$$

with $\mu_k^*(0) = \text{continue}$. If $k \in S_N$, then $\sum_{i=k}^{N-1} 1/i \le 1$, so

$$J_k(1) = \max\Big\{ \underbrace{\frac{k}{N}}_{\text{stop}},\ \underbrace{\frac{k}{N}\sum_{i=k}^{N-1}\frac{1}{i}}_{\text{continue}} \Big\} = \frac{k}{N},$$

with $\mu_k^*(1) = \text{stop}$. Q.E.D.

Proposition 2: If $k \notin S_N$, then

$$J_k(0) = J_k(1) = \frac{\delta-1}{N}\sum_{i=\delta-1}^{N-1}\frac{1}{i},$$

where $\delta$ is the minimum element of $S_N$.

Proof: For $k = \delta-1$,

$$J_{\delta-1}(0) = \frac{1}{\delta}J_\delta(1) + \frac{\delta-1}{\delta}J_\delta(0) = \frac{1}{\delta}\cdot\frac{\delta}{N} + \frac{\delta-1}{\delta}\cdot\frac{\delta}{N}\sum_{i=\delta}^{N-1}\frac{1}{i} = \frac{\delta-1}{N}\sum_{i=\delta-1}^{N-1}\frac{1}{i},$$

and, since $\sum_{i=\delta-1}^{N-1} 1/i > 1$ (because $\delta-1 \notin S_N$),

$$J_{\delta-1}(1) = \max\Big\{ \frac{\delta-1}{N},\ \frac{\delta-1}{N}\sum_{i=\delta-1}^{N-1}\frac{1}{i} \Big\} = \frac{\delta-1}{N}\sum_{i=\delta-1}^{N-1}\frac{1}{i},$$

with $\mu_{\delta-1}^*(0) = \mu_{\delta-1}^*(1) = \text{continue}$. Assume the proposition is true for $J_k$. Then

$$J_{k-1}(0) = \frac{1}{k}J_k(1) + \frac{k-1}{k}J_k(0) = J_k(0),$$

with $\mu_{k-1}^*(0) = \text{continue}$, and

$$J_{k-1}(1) = \max\Big\{ \frac{k-1}{N},\ \frac{1}{k}J_k(1) + \frac{k-1}{k}J_k(0) \Big\} = \max\Big\{ \frac{k-1}{N},\ \frac{\delta-1}{N}\sum_{i=\delta-1}^{N-1}\frac{1}{i} \Big\} = J_k(0),$$

with $\mu_{k-1}^*(1) = \text{continue}$. Q.E.D.

Thus the optimal policy is to continue until the $\delta$th object, where $\delta$ is the minimum integer such that $\sum_{i=\delta}^{N-1} \tfrac{1}{i} \le 1$, and then stop the first time an object is observed that has rank 1 among those observed so far.

(www)

(a) In order that $A_kx + B_ku + w \in X$ for all $w \in W_k$, it is sufficient that $A_kx + B_ku$ belong to some ellipsoid $\tilde X$ such that the vector sum of $\tilde X$ and $W_k$ is contained in $X$. The ellipsoid

$$\tilde X = \{ z \mid z'Fz \le 1 \}, \qquad \text{where } F^{-1} = (1-\beta)\big(\Psi^{-1} - \beta^{-1}D_k^{-1}\big)$$

for some scalar $\beta \in (0,1)$, has this property (based on the hint, and assuming that $F^{-1}$ is well defined as a positive definite matrix). Thus, it is sufficient that $x$ and $u$ are such that

$$(A_kx + B_ku)'F(A_kx + B_ku) \le 1. \tag{1}$$

In order that, for a given $x$, there exists $u$ with $u'R_ku \le 1$ such that Eq. (1) is satisfied, as well as $x'\Xi x \le 1$, it is sufficient that $x$ is such that

$$\min_{u \in \Re^m}\big[ x'\Xi x + u'R_ku + (A_kx + B_ku)'F(A_kx + B_ku) \big] \le 1, \tag{2}$$

or, by carrying out explicitly the quadratic minimization above,

$$x'Kx \le 1, \qquad \text{where } K = A_k'\big(F^{-1} + B_kR_k^{-1}B_k'\big)^{-1}A_k + \Xi.$$

The control law

$$\mu(x) = -\big(R_k + B_k'FB_k\big)^{-1}B_k'FA_kx$$

attains the minimum in Eq. (2) for all $x$, so it achieves reachability.

(b) Follows by iterative application of the result of part (a), starting with $k = N-1$ and proceeding backwards.

(c) Follows from the arguments of part (a).
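The threshold characterization of $\delta$ in the secretary problem above can be cross-checked by running the backward recursion directly. The sketch below (with an illustrative horizon $N$) uses exact rational arithmetic and compares the first stage at which stopping at a relative best becomes optimal against the harmonic-sum criterion:

```python
from fractions import Fraction

# Backward recursion for the secretary problem above: J_k(1), J_k(0),
# with P(w_k = 1) = 1/(k+1). N is an illustrative horizon.
N = 10
J1 = {N: Fraction(1)}                # J_N(1) = g_N(1) = 1
J0 = {N: Fraction(0)}
for k in range(N - 1, 0, -1):
    cont = Fraction(1, k + 1) * J1[k + 1] + Fraction(k, k + 1) * J0[k + 1]
    J0[k] = cont                                  # x_k = 0: always continue
    J1[k] = max(Fraction(k, N), cont)             # stopping pays k/N

# delta = first index at which stopping at a relative best is optimal
delta = min(k for k in range(1, N) if Fraction(k, N) >= J0[k])
# It agrees with the characterization via sum_{i=delta}^{N-1} 1/i <= 1.
assert delta == min(k for k in range(1, N)
                    if sum(Fraction(1, i) for i in range(k, N)) <= 1)
```

For $N = 10$ this gives $\delta = 4$, consistent with the familiar "observe roughly $N/e$ objects, then stop at the first relative best" rule.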

Vol. I, Chapter 5

(www)

Define

$$y_k = x_k + A_k^{-1}w_k + A_k^{-1}A_{k+1}^{-1}w_{k+1} + \cdots + A_k^{-1}\cdots A_{N-1}^{-1}w_{N-1}, \qquad y_N = x_N.$$

Then

$$y_k = x_k + A_k^{-1}(w_k - x_{k+1}) + A_k^{-1}y_{k+1} = x_k + A_k^{-1}(-A_kx_k - B_ku_k) + A_k^{-1}y_{k+1} = -A_k^{-1}B_ku_k + A_k^{-1}y_{k+1},$$

and

$$y_{k+1} = A_ky_k + B_ku_k.$$

Now, the cost function is the expected value of

$$x_N'Qx_N + \sum_{k=0}^{N-1}u_k'R_ku_k = y_0'K_0y_0 + \sum_{k=0}^{N-1}\big( y_{k+1}'K_{k+1}y_{k+1} - y_k'K_ky_k + u_k'R_ku_k \big),$$

where the matrices $K_k$ are given by the Riccati-type recursion

$$K_N = Q, \qquad K_k = A_k'\big( K_{k+1} - K_{k+1}B_k(B_k'K_{k+1}B_k + R_k)^{-1}B_k'K_{k+1} \big)A_k.$$

We have

$$y_{k+1}'K_{k+1}y_{k+1} - y_k'K_ky_k + u_k'R_ku_k$$
$$= (A_ky_k + B_ku_k)'K_{k+1}(A_ky_k + B_ku_k) + u_k'R_ku_k - y_k'A_k'\big( K_{k+1} - K_{k+1}B_k(B_k'K_{k+1}B_k + R_k)^{-1}B_k'K_{k+1} \big)A_ky_k$$
$$= y_k'A_k'K_{k+1}A_ky_k + 2y_k'A_k'K_{k+1}B_ku_k + u_k'B_k'K_{k+1}B_ku_k + u_k'R_ku_k - y_k'A_k'K_{k+1}A_ky_k + y_k'A_k'K_{k+1}B_kP_k^{-1}B_k'K_{k+1}A_ky_k$$
$$= -2y_k'L_k'P_ku_k + u_k'P_ku_k + y_k'L_k'P_kL_ky_k$$
$$= (u_k - L_ky_k)'P_k(u_k - L_ky_k),$$

where

$$P_k = B_k'K_{k+1}B_k + R_k, \qquad L_k = -P_k^{-1}B_k'K_{k+1}A_k.$$

Thus, the cost function can be written as

$$E\Big\{ y_0'K_0y_0 + \sum_{k=0}^{N-1}(u_k - L_ky_k)'P_k(u_k - L_ky_k) \Big\}.$$

The problem now is to find $\mu_k^*(I_k)$, $k = 0,1,\ldots,N-1$, that minimize over the admissible control laws $\mu_k(I_k)$, $k = 0,1,\ldots,N-1$, the cost function

$$E\Big\{ y_0'K_0y_0 + \sum_{k=0}^{N-1}\big( \mu_k(I_k) - L_ky_k \big)'P_k\big( \mu_k(I_k) - L_ky_k \big) \Big\}.$$

We do this minimization by first minimizing over $\mu_{N-1}$, then over $\mu_{N-2}$, etc. The minimization over $\mu_{N-1}$ involves just the last term in the sum and can be written as

$$\min_{\mu_{N-1}} E\Big\{ \big( \mu_{N-1}(I_{N-1}) - L_{N-1}y_{N-1} \big)'P_{N-1}\big( \mu_{N-1}(I_{N-1}) - L_{N-1}y_{N-1} \big) \Big\}$$
$$= E\Big\{ \min_{u_{N-1}} E\big\{ (u_{N-1} - L_{N-1}y_{N-1})'P_{N-1}(u_{N-1} - L_{N-1}y_{N-1}) \mid I_{N-1} \big\} \Big\}.$$

Thus this minimization yields the optimal control law for the last stage,

$$\mu_{N-1}^*(I_{N-1}) = L_{N-1}E\{ y_{N-1} \mid I_{N-1} \}.$$

(Recall here that, generically, $E\{z \mid I\}$ minimizes over $u$ the expression $E\{(u-z)'P(u-z) \mid I\}$ for any random variable $z$, any conditioning variable $I$, and any positive semidefinite matrix $P$.) The minimization over $\mu_{N-2}$ involves

$$E\Big\{ \big( \mu_{N-2}(I_{N-2}) - L_{N-2}y_{N-2} \big)'P_{N-2}\big( \mu_{N-2}(I_{N-2}) - L_{N-2}y_{N-2} \big) \Big\}$$
$$+ E\Big\{ \big( E\{y_{N-1} \mid I_{N-1}\} - y_{N-1} \big)'L_{N-1}'P_{N-1}L_{N-1}\big( E\{y_{N-1} \mid I_{N-1}\} - y_{N-1} \big) \Big\}.$$

However, as in the lemma of p. 104, the term $E\{y_{N-1} \mid I_{N-1}\} - y_{N-1}$ does not depend on any of the controls (it is a function of $x_0, w_0, \ldots, w_{N-2}, v_0, \ldots, v_{N-1}$). Thus the minimization over $\mu_{N-2}$ involves just the first term above, and yields, similarly as before,

$$\mu_{N-2}^*(I_{N-2}) = L_{N-2}E\{ y_{N-2} \mid I_{N-2} \}.$$

Proceeding similarly, we prove that for all $k$,

$$\mu_k^*(I_k) = L_kE\{ y_k \mid I_k \}.$$

Note: The preceding proof can be used to provide a quick proof of the separation theorem for linear-quadratic problems in the case where $x_0, w_0, \ldots, w_{N-1}, v_0, \ldots, v_{N-1}$ are independent. If the cost function is

$$E\Big\{ x_N'Q_Nx_N + \sum_{k=0}^{N-1}\big( x_k'Q_kx_k + u_k'R_ku_k \big) \Big\},$$

the preceding calculation can be modified to show that the cost function can be written as

$$E\Big\{ x_0'K_0x_0 + \sum_{k=0}^{N-1}\Big( (u_k - L_kx_k)'P_k(u_k - L_kx_k) + w_k'K_{k+1}w_k \Big) \Big\}.$$

By repeating the preceding proof, we then obtain the optimal control law as

$$\mu_k^*(I_k) = L_kE\{ x_k \mid I_k \}.$$
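The matrices $K_k$, $P_k$, and the gains $L_k$ used above are computed offline by a backward recursion. A scalar sketch (the values of $A$, $B$, $R$, the terminal weight $Q$, and the horizon $N$ are illustrative) matching this problem's cost, which has no running state penalty:

```python
# Scalar sketch of the quantities used above: the Riccati recursion for K_k,
# P_k = B'K_{k+1}B + R, and the gains L_k = -P_k^{-1} B'K_{k+1} A.
# A, B, R, Qterm (terminal weight) and N are illustrative values.
A, B, R, Qterm, N = 1.0, 1.0, 1.0, 1.0, 5

K = [0.0] * (N + 1)
K[N] = Qterm                              # K_N = Q (terminal cost x_N' Q x_N)
L, P = [0.0] * N, [0.0] * N
for k in range(N - 1, -1, -1):
    P[k] = B * K[k + 1] * B + R
    L[k] = -(B * K[k + 1] * A) / P[k]
    # no running state cost term, matching the cost x_N'Qx_N + sum u_k'R u_k
    K[k] = A * (K[k + 1] - K[k + 1] * B / P[k] * B * K[k + 1]) * A

# sanity check: the K_k stay positive; the certainty-equivalent control
# is then u_k = L_k * E[y_k | I_k]
assert all(Kk > 0 for Kk in K)
```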

5.3 www

The control at time $k$ is $(u_k, \alpha_k)$, where $\alpha_k$ is a variable taking value 1 (if the next measurement, at time $k+1$, is of type 1) or 2 (if the next measurement is of type 2). The cost function is

$$E\Big\{ x_N'Qx_N + \sum_{k=0}^{N-1}\big( x_k'Qx_k + u_k'Ru_k \big) + \sum_{k=0}^{N-1} g_{\alpha_k} \Big\}.$$

We apply the DP algorithm for $N = 2$. Using the Riccati equation, we have

$$J_1(I_1) = J_1(z_0, z_1, u_0, \alpha_0) = E_{x_1}\big\{ x_1'(A'QA + Q)x_1 \mid I_1 \big\} + E_{w_1}\{ w_1'Qw_1 \}$$
$$+ \min_{u_1}\big\{ u_1'(B'QB + R)u_1 + 2E\{x_1 \mid I_1\}'A'QBu_1 \big\} + \min\{ g_1, g_2 \}.$$

So

$$\mu_1^*(I_1) = -(B'QB + R)^{-1}B'QA\,E\{x_1 \mid I_1\}, \qquad \alpha_1^*(I_1) = \begin{cases} 1, & \text{if } g_1 \le g_2, \\ 2, & \text{otherwise}. \end{cases}$$

Note that the measurement selected at $k = 1$ does not depend on $I_1$. This is intuitively clear, since the measurement $z_2$ will not be used by the controller, so its selection should be based on the measurement cost alone and not on the basis of the quality of the estimate. The situation is different once more than one stage is considered. Using a simple modification of the analysis in Section 5.2 of the text, we have

$$J_0(I_0) = J_0(z_0) = \min_{u_0} E_{x_0, w_0}\big\{ x_0'Qx_0 + u_0'Ru_0 + (Ax_0 + Bu_0 + w_0)'K_0(Ax_0 + Bu_0 + w_0) \mid z_0 \big\}$$
$$+ \min_{\alpha_0}\Big[ E_{z_1}\Big\{ E_{x_1}\big\{ \big(x_1 - E\{x_1 \mid I_1\}\big)'P_1\big(x_1 - E\{x_1 \mid I_1\}\big) \mid I_1 \big\} \,\Big|\, z_0, u_0, \alpha_0 \Big\} + g_{\alpha_0} \Big]$$
$$+ E_{w_1}\{ w_1'Qw_1 \} + \min\{ g_1, g_2 \}.$$

The quantity in the second bracket is the covariance of the estimation error (weighted by $P_1$) and, as shown in the text, it does not depend on $u_0$. Thus the minimization in the second bracket is indicated only with respect to $\alpha_0$ and not $u_0$. Because all stochastic variables are Gaussian, the quantity in the second bracket also does not depend on $z_0$. (The weighted error covariance produced by the Kalman filter is precomputable and depends only on the system and measurement matrices and the noise covariances, but not on the measurements received.) In fact,

$$E_{z_1}\Big\{ E_{x_1}\big\{ \big(x_1 - E\{x_1 \mid I_1\}\big)'P_1\big(x_1 - E\{x_1 \mid I_1\}\big) \mid I_1 \big\} \,\Big|\, z_0, u_0, \alpha_0 \Big\} = \begin{cases} \mathrm{Tr}\big( P_1\Sigma^1 \big), & \text{if } \alpha_0 = 1, \\ \mathrm{Tr}\big( P_1\Sigma^2 \big), & \text{if } \alpha_0 = 2, \end{cases}$$

where $\mathrm{Tr}(\cdot)$ denotes the trace of a matrix, and $\Sigma^1$ ($\Sigma^2$) denotes the error covariance of the Kalman filter estimate if a measurement of type 1 (type 2) is taken at $k = 0$. Thus at time $k = 0$ the optimal measurement choice does not depend on $z_0$: it is of type 1 if

$$\mathrm{Tr}\big( P_1\Sigma^1 \big) + g_1 \le \mathrm{Tr}\big( P_1\Sigma^2 \big) + g_2,$$

and of type 2 otherwise.
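Because the error covariances are precomputable, the sensor choice at $k = 0$ reduces to a deterministic comparison. A scalar sketch (all numbers — system gain, noise variances, measurement costs, weighting, prior — are illustrative):

```python
# Scalar sketch of the comparison above: the one-step error variance of the
# Kalman filter under each measurement type is precomputable, so the sensor
# choice reduces to comparing Tr(P1 * Sigma) + g for the two types.
# All numbers are illustrative.
v1, v2 = 0.2, 2.0              # measurement noise variances of types 1 and 2
g1, g2 = 1.5, 0.1              # measurement costs
P1 = 1.0                       # weighting from the DP recursion
prior = 1.0                    # variance of x_1 before the measurement

def posterior_var(sigma, v):
    # scalar Kalman measurement update: 1/post = 1/prior + 1/v
    return 1.0 / (1.0 / sigma + 1.0 / v)

cost1 = P1 * posterior_var(prior, v1) + g1
cost2 = P1 * posterior_var(prior, v2) + g2
choice = 1 if cost1 <= cost2 else 2
# Here the cheap, noisy sensor wins despite its worse estimate.
assert choice == 2
```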

5.7 www

(a) We have

$$p_{k+1}^j = P(x_{k+1} = j \mid z_0, \ldots, z_{k+1}, u_0, \ldots, u_k) = P(x_{k+1} = j \mid I_{k+1}) = \frac{P(x_{k+1} = j, z_{k+1} \mid I_k, u_k)}{P(z_{k+1} \mid I_k, u_k)} \quad \text{(from Bayes' rule)}$$

$$= \frac{\sum_{i=1}^n P(x_k = i)\,p_{ij}(u_k)\,r_j(u_k, z_{k+1})}{\sum_{s=1}^n \sum_{i=1}^n P(x_k = i)\,p_{is}(u_k)\,r_s(u_k, z_{k+1})} = \frac{\sum_{i=1}^n p_k^i\,p_{ij}(u_k)\,r_j(u_k, z_{k+1})}{\sum_{s=1}^n \sum_{i=1}^n p_k^i\,p_{is}(u_k)\,r_s(u_k, z_{k+1})}.$$

Rewriting $p_{k+1}^j$ in vector form, we have

$$p_{k+1}^j = \frac{r_j(u_k, z_{k+1})\,\big[P(u_k)'P_k\big]_j}{\sum_{s=1}^n r_s(u_k, z_{k+1})\,\big[P(u_k)'P_k\big]_s}, \qquad j = 1, \ldots, n.$$

Therefore, denoting by $\odot$ the componentwise product of two vectors,

$$P_{k+1} = \frac{r(u_k, z_{k+1}) \odot \big[P(u_k)'P_k\big]}{r(u_k, z_{k+1})'\big[P(u_k)'P_k\big]}.$$

(b) The DP algorithm for this system is

$$J_{N-1}(P_{N-1}) = \min_u\Big[ \sum_{i=1}^n \sum_{j=1}^n p_{N-1}^i\,p_{ij}(u)\,g_{N-1}(i, u, j) \Big] = \min_u\Big[ \sum_{i=1}^n p_{N-1}^i\,\big[G_{N-1}(u)\big]_i \Big] = \min_u\big\{ P_{N-1}'G_{N-1}(u) \big\},$$

$$J_k(P_k) = \min_u\Big[ \sum_{i=1}^n p_k^i \sum_{j=1}^n p_{ij}(u)\,g_k(i, u, j) + \sum_{\theta=1}^q P(z_{k+1} = \theta \mid I_k, u)\,J_{k+1}(P_{k+1} \mid P_k, u, \theta) \Big]$$
$$= \min_u\Big\{ P_k'G_k(u) + \sum_{\theta=1}^q r(u, \theta)'\big[P(u)'P_k\big]\,J_{k+1}\Big( \frac{r(u, \theta) \odot \big[P(u)'P_k\big]}{r(u, \theta)'\big[P(u)'P_k\big]} \Big) \Big\}.$$

(c) For $k = N-1$ and any $\lambda > 0$,

$$J_{N-1}(\lambda P_{N-1}) = \min_u\big\{ \lambda P_{N-1}'G_{N-1}(u) \big\} = \lambda\min_u\big\{ P_{N-1}'G_{N-1}(u) \big\} = \lambda J_{N-1}(P_{N-1}).$$

Now assume that $J_k(\lambda P_k) = \lambda J_k(P_k)$. Then

$$J_{k-1}(\lambda P_{k-1}) = \min_u\Big\{ \lambda P_{k-1}'G_{k-1}(u) + \sum_{\theta=1}^q r(u, \theta)'\big[P(u)'\lambda P_{k-1}\big]\,J_k(P_k \mid P_{k-1}, u, \theta) \Big\}$$
$$= \lambda\min_u\Big\{ P_{k-1}'G_{k-1}(u) + \sum_{\theta=1}^q r(u, \theta)'\big[P(u)'P_{k-1}\big]\,J_k(P_k \mid P_{k-1}, u, \theta) \Big\} = \lambda J_{k-1}(P_{k-1}),$$

where we used the fact that the conditional belief $P_k$ given $(P_{k-1}, u, \theta)$ is unchanged when $P_{k-1}$ is scaled by $\lambda$. Q.E.D.

Note that induction is not necessary to show that $J_k(\lambda P_k) = \lambda J_k(P_k)$. For any $u$ and $\theta$, $r(u, \theta)'\big[P(u)'P_k\big]$ is a scalar. Therefore, letting $\lambda = r(u, \theta)'\big[P(u)'P_k\big]$, we have

$$J_k(P_k) = \min_u\Big\{ P_k'G_k(u) + \sum_{\theta=1}^q r(u, \theta)'\big[P(u)'P_k\big]\,J_{k+1}\Big( \frac{r(u, \theta) \odot \big[P(u)'P_k\big]}{r(u, \theta)'\big[P(u)'P_k\big]} \Big) \Big\}$$
$$= \min_u\Big\{ P_k'G_k(u) + \sum_{\theta=1}^q J_{k+1}\Big( r(u, \theta) \odot \big[P(u)'P_k\big] \Big) \Big\}.$$

(d) For $k = N-1$, we have $J_{N-1}(P_{N-1}) = \min_u\{P_{N-1}'G_{N-1}(u)\}$, and so $J_{N-1}$ has the desired form

$$J_{N-1}(P_{N-1}) = \min\big[ P_{N-1}'\alpha_{N-1}^1, \ldots, P_{N-1}'\alpha_{N-1}^{m_{N-1}} \big],$$

where $\alpha_{N-1}^j = G_{N-1}(u^j)$ and $u^1, \ldots, u^{m_{N-1}}$ are the possible values of the control. Assume that

$$J_{k+1}(P_{k+1}) = \min\big[ P_{k+1}'\alpha_{k+1}^1, \ldots, P_{k+1}'\alpha_{k+1}^{m_{k+1}} \big].$$

Then, using the expression from part (c) for $J_k(P_k)$,

$$J_k(P_k) = \min_u\Big\{ P_k'G_k(u) + \sum_{\theta=1}^q J_{k+1}\Big( r(u, \theta) \odot \big[P(u)'P_k\big] \Big) \Big\}$$
$$= \min_u\Big\{ P_k'G_k(u) + \sum_{\theta=1}^q \min_j\Big[ \big( r(u, \theta) \odot \big[P(u)'P_k\big] \big)'\alpha_{k+1}^j \Big] \Big\}$$
$$= \min_u\Big\{ P_k'G_k(u) + \sum_{\theta=1}^q \min_j\Big[ P_k'P(u)\big( r(u, \theta) \odot \alpha_{k+1}^j \big) \Big] \Big\}$$
$$= \min\big[ P_k'\alpha_k^1, \ldots, P_k'\alpha_k^{m_k} \big],$$

where the vectors $\alpha_k^i$ are all vectors of the form

$$G_k(u) + \sum_{\theta=1}^q P(u)\big( r(u, \theta) \odot \alpha_{k+1}^{j_\theta} \big),$$

for some control $u$ and some choice of $j_\theta \in \{1, \ldots, m_{k+1}\}$ for each $\theta$. The induction is thus complete.

Vol. I, Chapter 6

(www)

First, we note that $\alpha$-$\beta$ pruning is applicable only for arcs that point to right children, so that at least one sequence of moves (starting from the current position and ending at a terminal position, that is, one with no children) has already been considered. Furthermore, due to the depth-first search, the score at the ancestor positions has been derived without taking into account the positions that can be reached from the current position. Suppose now that $\alpha$-pruning applies at a position with Black to play. Then, if the current position is reached (due to a move by White), Black can respond in such a way that the final position will be worse (for White) than it would have been if the current position had not been reached. What $\alpha$-pruning saves is the search over even worse positions (emanating from the current position). The reason is that White will never play so that Black reaches the current position, because White certainly has a better alternative. A similar argument applies for $\beta$-pruning.

A second approach: Suppose that it is WHITE's turn to move. We shall prove that a $\beta$-cutoff occurring at the $n$th position will not affect the backed-up score. We have, from the definition of $\beta$,

$$\beta = \min\big\{ \text{TBS of all ancestors of } n \text{ where BLACK has the move} \big\}.$$

For a cutoff to occur: $\mathrm{TBS}(n) > \beta$. Observe first of all that $\beta = \mathrm{TBS}(n_1)$ for some ancestor $n_1$ of $n$ where BLACK has the move. Then there exists a path $n_1, n_2, \ldots, n_k, n$. Since it is WHITE's move at $n$, the temporary backed-up score $\mathrm{TBS}(n) = \max_i \mathrm{TBS}(n_i')$, where the $n_i'$ are the descendants of $n$, can only increase as further descendants are explored, and so remains greater than $\beta$. Consider now the position $n_k$. Then $\mathrm{TBS}(n_k)$ will either remain unchanged or will increase to a value greater than $\beta$ as a result of the exploration of node $n$. Proceeding similarly, we conclude that $\mathrm{TBS}(n_2)$ will either remain the same or change to a value greater than $\beta$. Finally, at node $n_1$, $\mathrm{TBS}(n_1)$ will not change, since it is BLACK's turn to move there and BLACK will choose the move with minimum score. Thus the backed-up score and the choice of the next move are unaffected by the $\beta$-cutoff. A similar argument holds for $\alpha$-pruning.

(www)

(a) We have for all $i$

$$F(i) \ge \hat F(i) = a_{ij(i)} + F\big(j(i)\big). \tag{1}$$

Assume, in order to come to a contradiction, that the graph of the $N-1$ arcs $\big(i, j(i)\big)$, $i = 1, \ldots, N-1$, contains a cycle $(i_1, i_2, \ldots, i_k, i_1)$. Using Eq. (1), we have

$$F(i_1) \ge a_{i_1i_2} + F(i_2),$$
$$F(i_2) \ge a_{i_2i_3} + F(i_3),$$
$$\vdots$$
$$F(i_k) \ge a_{i_ki_1} + F(i_1).$$

By adding the above inequalities, we obtain

$$0 \ge a_{i_1i_2} + a_{i_2i_3} + \cdots + a_{i_ki_1}.$$

Thus the length of the cycle $(i_1, i_2, \ldots, i_k, i_1)$ is nonpositive, a contradiction. Hence, the graph of the $N-1$ arcs $\big(i, j(i)\big)$, $i = 1, \ldots, N-1$, contains no cycle. Given any node $i \ne N$, we can start with arc $\big(i, j(i)\big)$,
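The pruning argument above says that cutoffs never change the backed-up score. A minimal self-contained check on a hypothetical depth-2 game tree (the tree and its leaf scores are made up for illustration):

```python
# A minimal alpha-beta search on a hypothetical game tree, illustrating the
# argument above: pruned subtrees cannot change the backed-up score.
def alphabeta(node, alpha, beta, maximizing):
    if not isinstance(node, list):            # leaf: static score
        return node
    if maximizing:
        v = float('-inf')
        for child in node:
            v = max(v, alphabeta(child, alpha, beta, False))
            alpha = max(alpha, v)
            if v >= beta:                     # beta cutoff: MIN avoids this line
                break
        return v
    v = float('inf')
    for child in node:
        v = min(v, alphabeta(child, alpha, beta, True))
        beta = min(beta, v)
        if v <= alpha:                        # alpha cutoff: MAX avoids this line
            break
    return v

def minimax(node, maximizing):
    if not isinstance(node, list):
        return node
    f = max if maximizing else min
    return f(minimax(c, not maximizing) for c in node)

tree = [[3, 5], [2, 9], [0, 7]]               # depth-2 tree, MAX to move
assert alphabeta(tree, float('-inf'), float('inf'), True) == minimax(tree, True)
```

Here the leaves 9 and 7 are never visited by `alphabeta`, yet the returned score equals the full minimax value, as the argument predicts.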


More information

Online Appendix Optimal Time-Consistent Government Debt Maturity D. Debortoli, R. Nunes, P. Yared. A. Proofs

Online Appendix Optimal Time-Consistent Government Debt Maturity D. Debortoli, R. Nunes, P. Yared. A. Proofs Online Appendi Optimal Time-Consistent Government Debt Maturity D. Debortoli, R. Nunes, P. Yared A. Proofs Proof of Proposition 1 The necessity of these conditions is proved in the tet. To prove sufficiency,

More information

Optimal Long-Term Supply Contracts with Asymmetric Demand Information. Appendix

Optimal Long-Term Supply Contracts with Asymmetric Demand Information. Appendix Optimal Long-Term Supply Contracts with Asymmetric Demand Information Ilan Lobel Appendix Wenqiang iao {ilobel, wxiao}@stern.nyu.edu Stern School of Business, New York University Appendix A: Proofs Proof

More information

Web Appendix: Proofs and extensions.

Web Appendix: Proofs and extensions. B eb Appendix: Proofs and extensions. B.1 Proofs of results about block correlated markets. This subsection provides proofs for Propositions A1, A2, A3 and A4, and the proof of Lemma A1. Proof of Proposition

More information

An Introduction to Econometrics. Wei Zhu Department of Mathematics First Year Graduate Student Oct22, 2003

An Introduction to Econometrics. Wei Zhu Department of Mathematics First Year Graduate Student Oct22, 2003 An Introduction to Econometrics Wei Zhu Department of Mathematics First Year Graduate Student Oct22, 2003 1 Chapter 1. What is econometrics? It is the application of statistical theories to economic ones

More information

Dynamic Programming and Reinforcement Learning

Dynamic Programming and Reinforcement Learning Dynamic Programming and Reinforcement Learning Daniel Russo Columbia Business School Decision Risk and Operations Division Fall, 2017 Daniel Russo (Columbia) Fall 2017 1 / 34 Supervised Machine Learning

More information

EC316a: Advanced Scientific Computation, Fall Discrete time, continuous state dynamic models: solution methods

EC316a: Advanced Scientific Computation, Fall Discrete time, continuous state dynamic models: solution methods EC316a: Advanced Scientific Computation, Fall 2003 Notes Section 4 Discrete time, continuous state dynamic models: solution methods We consider now solution methods for discrete time models in which decisions

More information

1 Answers to the Sept 08 macro prelim - Long Questions

1 Answers to the Sept 08 macro prelim - Long Questions Answers to the Sept 08 macro prelim - Long Questions. Suppose that a representative consumer receives an endowment of a non-storable consumption good. The endowment evolves exogenously according to ln

More information

Q1. [?? pts] Search Traces

Q1. [?? pts] Search Traces CS 188 Spring 2010 Introduction to Artificial Intelligence Midterm Exam Solutions Q1. [?? pts] Search Traces Each of the trees (G1 through G5) was generated by searching the graph (below, left) with a

More information

MATH3075/3975 FINANCIAL MATHEMATICS TUTORIAL PROBLEMS

MATH3075/3975 FINANCIAL MATHEMATICS TUTORIAL PROBLEMS MATH307/37 FINANCIAL MATHEMATICS TUTORIAL PROBLEMS School of Mathematics and Statistics Semester, 04 Tutorial problems should be used to test your mathematical skills and understanding of the lecture material.

More information

16 MAKING SIMPLE DECISIONS

16 MAKING SIMPLE DECISIONS 253 16 MAKING SIMPLE DECISIONS Let us associate each state S with a numeric utility U(S), which expresses the desirability of the state A nondeterministic action a will have possible outcome states Result(a)

More information

Macroeconomics and finance

Macroeconomics and finance Macroeconomics and finance 1 1. Temporary equilibrium and the price level [Lectures 11 and 12] 2. Overlapping generations and learning [Lectures 13 and 14] 2.1 The overlapping generations model 2.2 Expectations

More information

1 Dynamic programming

1 Dynamic programming 1 Dynamic programming A country has just discovered a natural resource which yields an income per period R measured in terms of traded goods. The cost of exploitation is negligible. The government wants

More information

Part 1: q Theory and Irreversible Investment

Part 1: q Theory and Irreversible Investment Part 1: q Theory and Irreversible Investment Goal: Endogenize firm characteristics and risk. Value/growth Size Leverage New issues,... This lecture: q theory of investment Irreversible investment and real

More information

Information Acquisition under Persuasive Precedent versus Binding Precedent (Preliminary and Incomplete)

Information Acquisition under Persuasive Precedent versus Binding Precedent (Preliminary and Incomplete) Information Acquisition under Persuasive Precedent versus Binding Precedent (Preliminary and Incomplete) Ying Chen Hülya Eraslan March 25, 2016 Abstract We analyze a dynamic model of judicial decision

More information

Finite Memory and Imperfect Monitoring

Finite Memory and Imperfect Monitoring Federal Reserve Bank of Minneapolis Research Department Finite Memory and Imperfect Monitoring Harold L. Cole and Narayana Kocherlakota Working Paper 604 September 2000 Cole: U.C.L.A. and Federal Reserve

More information

CS 3331 Numerical Methods Lecture 2: Functions of One Variable. Cherung Lee

CS 3331 Numerical Methods Lecture 2: Functions of One Variable. Cherung Lee CS 3331 Numerical Methods Lecture 2: Functions of One Variable Cherung Lee Outline Introduction Solving nonlinear equations: find x such that f(x ) = 0. Binary search methods: (Bisection, regula falsi)

More information

3.2 No-arbitrage theory and risk neutral probability measure

3.2 No-arbitrage theory and risk neutral probability measure Mathematical Models in Economics and Finance Topic 3 Fundamental theorem of asset pricing 3.1 Law of one price and Arrow securities 3.2 No-arbitrage theory and risk neutral probability measure 3.3 Valuation

More information

Lecture 5: Iterative Combinatorial Auctions

Lecture 5: Iterative Combinatorial Auctions COMS 6998-3: Algorithmic Game Theory October 6, 2008 Lecture 5: Iterative Combinatorial Auctions Lecturer: Sébastien Lahaie Scribe: Sébastien Lahaie In this lecture we examine a procedure that generalizes

More information

DRAFT. 1 exercise in state (S, t), π(s, t) = 0 do not exercise in state (S, t) Review of the Risk Neutral Stock Dynamics

DRAFT. 1 exercise in state (S, t), π(s, t) = 0 do not exercise in state (S, t) Review of the Risk Neutral Stock Dynamics Chapter 12 American Put Option Recall that the American option has strike K and maturity T and gives the holder the right to exercise at any time in [0, T ]. The American option is not straightforward

More information

MATH 5510 Mathematical Models of Financial Derivatives. Topic 1 Risk neutral pricing principles under single-period securities models

MATH 5510 Mathematical Models of Financial Derivatives. Topic 1 Risk neutral pricing principles under single-period securities models MATH 5510 Mathematical Models of Financial Derivatives Topic 1 Risk neutral pricing principles under single-period securities models 1.1 Law of one price and Arrow securities 1.2 No-arbitrage theory and

More information

Lecture 23: April 10

Lecture 23: April 10 CS271 Randomness & Computation Spring 2018 Instructor: Alistair Sinclair Lecture 23: April 10 Disclaimer: These notes have not been subjected to the usual scrutiny accorded to formal publications. They

More information

What can we do with numerical optimization?

What can we do with numerical optimization? Optimization motivation and background Eddie Wadbro Introduction to PDE Constrained Optimization, 2016 February 15 16, 2016 Eddie Wadbro, Introduction to PDE Constrained Optimization, February 15 16, 2016

More information

Dynamic Programming (DP) Massimo Paolucci University of Genova

Dynamic Programming (DP) Massimo Paolucci University of Genova Dynamic Programming (DP) Massimo Paolucci University of Genova DP cannot be applied to each kind of problem In particular, it is a solution method for problems defined over stages For each stage a subproblem

More information

Laurence Boxer and Ismet KARACA

Laurence Boxer and Ismet KARACA SOME PROPERTIES OF DIGITAL COVERING SPACES Laurence Boxer and Ismet KARACA Abstract. In this paper we study digital versions of some properties of covering spaces from algebraic topology. We correct and

More information

Stochastic Dual Dynamic Programming

Stochastic Dual Dynamic Programming 1 / 43 Stochastic Dual Dynamic Programming Operations Research Anthony Papavasiliou 2 / 43 Contents [ 10.4 of BL], [Pereira, 1991] 1 Recalling the Nested L-Shaped Decomposition 2 Drawbacks of Nested Decomposition

More information

1 Appendix A: Definition of equilibrium

1 Appendix A: Definition of equilibrium Online Appendix to Partnerships versus Corporations: Moral Hazard, Sorting and Ownership Structure Ayca Kaya and Galina Vereshchagina Appendix A formally defines an equilibrium in our model, Appendix B

More information

UNIT 2. Greedy Method GENERAL METHOD

UNIT 2. Greedy Method GENERAL METHOD UNIT 2 GENERAL METHOD Greedy Method Greedy is the most straight forward design technique. Most of the problems have n inputs and require us to obtain a subset that satisfies some constraints. Any subset

More information

Notes on Intertemporal Optimization

Notes on Intertemporal Optimization Notes on Intertemporal Optimization Econ 204A - Henning Bohn * Most of modern macroeconomics involves models of agents that optimize over time. he basic ideas and tools are the same as in microeconomics,

More information

Dynamic Portfolio Choice II

Dynamic Portfolio Choice II Dynamic Portfolio Choice II Dynamic Programming Leonid Kogan MIT, Sloan 15.450, Fall 2010 c Leonid Kogan ( MIT, Sloan ) Dynamic Portfolio Choice II 15.450, Fall 2010 1 / 35 Outline 1 Introduction to Dynamic

More information

Lecture 2 Dynamic Equilibrium Models: Three and More (Finite) Periods

Lecture 2 Dynamic Equilibrium Models: Three and More (Finite) Periods Lecture 2 Dynamic Equilibrium Models: Three and More (Finite) Periods. Introduction In ECON 50, we discussed the structure of two-period dynamic general equilibrium models, some solution methods, and their

More information

Sequential Decision Making

Sequential Decision Making Sequential Decision Making Dynamic programming Christos Dimitrakakis Intelligent Autonomous Systems, IvI, University of Amsterdam, The Netherlands March 18, 2008 Introduction Some examples Dynamic programming

More information

3 Arbitrage pricing theory in discrete time.

3 Arbitrage pricing theory in discrete time. 3 Arbitrage pricing theory in discrete time. Orientation. In the examples studied in Chapter 1, we worked with a single period model and Gaussian returns; in this Chapter, we shall drop these assumptions

More information

Two-Dimensional Bayesian Persuasion

Two-Dimensional Bayesian Persuasion Two-Dimensional Bayesian Persuasion Davit Khantadze September 30, 017 Abstract We are interested in optimal signals for the sender when the decision maker (receiver) has to make two separate decisions.

More information

Dynamic Admission and Service Rate Control of a Queue

Dynamic Admission and Service Rate Control of a Queue Dynamic Admission and Service Rate Control of a Queue Kranthi Mitra Adusumilli and John J. Hasenbein 1 Graduate Program in Operations Research and Industrial Engineering Department of Mechanical Engineering

More information

Characterization of the Optimum

Characterization of the Optimum ECO 317 Economics of Uncertainty Fall Term 2009 Notes for lectures 5. Portfolio Allocation with One Riskless, One Risky Asset Characterization of the Optimum Consider a risk-averse, expected-utility-maximizing

More information

Techniques for Calculating the Efficient Frontier

Techniques for Calculating the Efficient Frontier Techniques for Calculating the Efficient Frontier Weerachart Kilenthong RIPED, UTCC c Kilenthong 2017 Tee (Riped) Introduction 1 / 43 Two Fund Theorem The Two-Fund Theorem states that we can reach any

More information

Maximizing the Spread of Influence through a Social Network Problem/Motivation: Suppose we want to market a product or promote an idea or behavior in

Maximizing the Spread of Influence through a Social Network Problem/Motivation: Suppose we want to market a product or promote an idea or behavior in Maximizing the Spread of Influence through a Social Network Problem/Motivation: Suppose we want to market a product or promote an idea or behavior in a society. In order to do so, we can target individuals,

More information

Competitive Market Model

Competitive Market Model 57 Chapter 5 Competitive Market Model The competitive market model serves as the basis for the two different multi-user allocation methods presented in this thesis. This market model prices resources based

More information

Extraction capacity and the optimal order of extraction. By: Stephen P. Holland

Extraction capacity and the optimal order of extraction. By: Stephen P. Holland Extraction capacity and the optimal order of extraction By: Stephen P. Holland Holland, Stephen P. (2003) Extraction Capacity and the Optimal Order of Extraction, Journal of Environmental Economics and

More information

Department of Mathematics. Mathematics of Financial Derivatives

Department of Mathematics. Mathematics of Financial Derivatives Department of Mathematics MA408 Mathematics of Financial Derivatives Thursday 15th January, 2009 2pm 4pm Duration: 2 hours Attempt THREE questions MA408 Page 1 of 5 1. (a) Suppose 0 < E 1 < E 3 and E 2

More information

AMH4 - ADVANCED OPTION PRICING. Contents

AMH4 - ADVANCED OPTION PRICING. Contents AMH4 - ADVANCED OPTION PRICING ANDREW TULLOCH Contents 1. Theory of Option Pricing 2 2. Black-Scholes PDE Method 4 3. Martingale method 4 4. Monte Carlo methods 5 4.1. Method of antithetic variances 5

More information

ECE 586GT: Problem Set 1: Problems and Solutions Analysis of static games

ECE 586GT: Problem Set 1: Problems and Solutions Analysis of static games University of Illinois Fall 2018 ECE 586GT: Problem Set 1: Problems and Solutions Analysis of static games Due: Tuesday, Sept. 11, at beginning of class Reading: Course notes, Sections 1.1-1.4 1. [A random

More information

Ph.D. Preliminary Examination MICROECONOMIC THEORY Applied Economics Graduate Program June 2017

Ph.D. Preliminary Examination MICROECONOMIC THEORY Applied Economics Graduate Program June 2017 Ph.D. Preliminary Examination MICROECONOMIC THEORY Applied Economics Graduate Program June 2017 The time limit for this exam is four hours. The exam has four sections. Each section includes two questions.

More information

13.3 A Stochastic Production Planning Model

13.3 A Stochastic Production Planning Model 13.3. A Stochastic Production Planning Model 347 From (13.9), we can formally write (dx t ) = f (dt) + G (dz t ) + fgdz t dt, (13.3) dx t dt = f(dt) + Gdz t dt. (13.33) The exact meaning of these expressions

More information

Decomposition Methods

Decomposition Methods Decomposition Methods separable problems, complicating variables primal decomposition dual decomposition complicating constraints general decomposition structures Prof. S. Boyd, EE364b, Stanford University

More information

The Correlation Smile Recovery

The Correlation Smile Recovery Fortis Bank Equity & Credit Derivatives Quantitative Research The Correlation Smile Recovery E. Vandenbrande, A. Vandendorpe, Y. Nesterov, P. Van Dooren draft version : March 2, 2009 1 Introduction Pricing

More information

6 -AL- ONE MACHINE SEQUENCING TO MINIMIZE MEAN FLOW TIME WITH MINIMUM NUMBER TARDY. Hamilton Emmons \,«* Technical Memorandum No. 2.

6 -AL- ONE MACHINE SEQUENCING TO MINIMIZE MEAN FLOW TIME WITH MINIMUM NUMBER TARDY. Hamilton Emmons \,«* Technical Memorandum No. 2. li. 1. 6 -AL- ONE MACHINE SEQUENCING TO MINIMIZE MEAN FLOW TIME WITH MINIMUM NUMBER TARDY f \,«* Hamilton Emmons Technical Memorandum No. 2 May, 1973 1 il 1 Abstract The problem of sequencing n jobs on

More information

Mengdi Wang. July 3rd, Laboratory for Information and Decision Systems, M.I.T.

Mengdi Wang. July 3rd, Laboratory for Information and Decision Systems, M.I.T. Practice July 3rd, 2012 Laboratory for Information and Decision Systems, M.I.T. 1 2 Infinite-Horizon DP Minimize over policies the objective cost function J π (x 0 ) = lim N E w k,k=0,1,... DP π = {µ 0,µ

More information

Lecture Notes 1

Lecture Notes 1 4.45 Lecture Notes Guido Lorenzoni Fall 2009 A portfolio problem To set the stage, consider a simple nite horizon problem. A risk averse agent can invest in two assets: riskless asset (bond) pays gross

More information

Bargaining and Competition Revisited Takashi Kunimoto and Roberto Serrano

Bargaining and Competition Revisited Takashi Kunimoto and Roberto Serrano Bargaining and Competition Revisited Takashi Kunimoto and Roberto Serrano Department of Economics Brown University Providence, RI 02912, U.S.A. Working Paper No. 2002-14 May 2002 www.econ.brown.edu/faculty/serrano/pdfs/wp2002-14.pdf

More information

ELEMENTS OF MATRIX MATHEMATICS

ELEMENTS OF MATRIX MATHEMATICS QRMC07 9/7/0 4:45 PM Page 5 CHAPTER SEVEN ELEMENTS OF MATRIX MATHEMATICS 7. AN INTRODUCTION TO MATRICES Investors frequently encounter situations involving numerous potential outcomes, many discrete periods

More information