STP Problem Set 3 Solutions
4.4) Consider the separable sequential allocation problem introduced in Sections 3.3.3 and 4.6.3, where the goal is to maximize the sum
\[ f(x_1, \ldots, x_N) = \sum_{t=1}^{N} g(x_t) \]
subject to the constraints
\[ \sum_{i=1}^{N} x_i = M, \qquad x_1, \ldots, x_N \geq 0. \]
As explained in Section 3.3.3, this optimization problem can be formulated as a deterministic dynamic program by letting $S = [0, M]$, $A_s = [0, s]$, $r_t(s, a) = g(a)$ for $t = 1, \ldots, N - 1$ and $r_N(s) = g(s)$, with dynamical equation $s_{t+1} = s_t - a_t$. For this exercise, we will assume that the function $g(a)$ is strictly concave on the set $[0, M]$, meaning that for all points $x, y \in [0, M]$ and all $p \in [0, 1]$ we have
\[ p\,g(x) + (1 - p)\,g(y) \leq g\big(p x + (1 - p) y\big), \]
with strict inequality whenever $p \in (0, 1)$ and $x \neq y$.

If we let $u_t^*(s)$, $t = 1, \ldots, N$, denote the optimal value functions for this problem, then these satisfy the Bellman equations
\[ u_t^*(s) = \sup_{0 \leq a \leq s} \big\{ g(a) + u_{t+1}^*(s - a) \big\}, \quad t = 1, \ldots, N - 1, \qquad u_N^*(s) = g(s), \]
which can be solved using backwards recursion. For example, when $t = N - 1$, we have
\[ u_{N-1}^*(s) = \sup_{0 \leq a \leq s} \big\{ g(a) + g(s - a) \big\} = 2\,g(s/2), \]
since the concavity of $g$ implies that
\[ \tfrac{1}{2} g(a) + \tfrac{1}{2} g(s - a) \leq g\big( \tfrac{1}{2} a + \tfrac{1}{2} (s - a) \big) = g(s/2), \]
with equality if and only if $a = d_{N-1}^*(s) = s/2$. Similarly, when $t = N - 2$, then
\[ u_{N-2}^*(s) = \sup_{0 \leq a \leq s} \Big\{ g(a) + 2\,g\Big( \frac{s - a}{2} \Big) \Big\} = 3\,g(s/3), \]
since the concavity of $g$ implies that
\[ \tfrac{1}{3} g(a) + \tfrac{2}{3}\, g\Big( \frac{s - a}{2} \Big) \leq g\Big( \tfrac{1}{3} a + \tfrac{2}{3} \cdot \frac{s - a}{2} \Big) = g(s/3), \]
with equality if and only if $a = d_{N-2}^*(s) = s/3$.

These calculations suggest that the following general formula holds:
\[ u_{N-t+1}^*(s) = t\,g(s/t) \tag{1} \]
for $t = 1, \ldots, N$, with the unique optimal decision rule being $d_{N-t+1}^*(s) = s/t$. This can be verified by backwards induction. We first note that (1) holds for $t = 1, 2, 3$ as above. Suppose, then, that (1) is true for $t = 1, \ldots, n$. Then
\[ u_{N-n}^*(s) = \sup_{0 \leq a \leq s} \big\{ g(a) + u_{N-n+1}^*(s - a) \big\} = \sup_{0 \leq a \leq s} \Big\{ g(a) + n\,g\Big( \frac{s - a}{n} \Big) \Big\} = (n + 1)\,g\Big( \frac{s}{n + 1} \Big). \]
Indeed, the third equality follows from the concavity of $g$, since
\[ \frac{1}{n + 1}\, g(a) + \frac{n}{n + 1}\, g\Big( \frac{s - a}{n} \Big) \leq g\Big( \frac{1}{n + 1}\, a + \frac{n}{n + 1} \cdot \frac{s - a}{n} \Big) = g\Big( \frac{s}{n + 1} \Big), \]
with equality if and only if $a = d_{N-n}^*(s) = s/(n + 1)$.

The optimal allocation pattern can now be determined by forward recursion using the optimal decision rules given above. When $t = 1$, we have $s_1 = M$, $a_1 = d_1^*(M) = M/N$ and $r(s_1, a_1) = g(M/N)$. Likewise, when $t = 2$, we have $s_2 = M(N - 1)/N$, $a_2 = d_2^*(M(N - 1)/N) = M/N$ and $r(s_2, a_2) = g(M/N)$. Indeed, by forward induction, we can establish that the optimal sequence of states, actions and rewards is
\[ s_t = \frac{M(N - t + 1)}{N}, \qquad a_t = \frac{M}{N}, \qquad r(s_t, a_t) = g(M/N) \]
for $t = 1, \ldots, N$.

Note: In Section 4.6.3, the problem is reformulated as a minimization of the sum $f(x_1, \ldots, x_N) = \sum_{t=1}^{N} g(x_t)$, and then we need to assume that $g$ is convex.
4.5) Consider the separable sequential allocation problem as formulated in the preceding exercise, but now suppose that $g$ is convex, i.e.,
\[ p\,g(x) + (1 - p)\,g(y) \geq g\big(p x + (1 - p) y\big). \]
As before, we can solve the Bellman equations using backwards recursion. When $t = N - 1$, we have
\[ u_{N-1}^*(s) = \sup_{0 \leq a \leq s} \big\{ g(a) + g(s - a) \big\} = g(0) + g(s), \]
with maxima at $a = 0$ or $a = s$. Indeed, the convexity of $g$ implies that for any $a \in [0, s]$ we have
\[ g(a) \leq \Big( 1 - \frac{a}{s} \Big) g(0) + \frac{a}{s}\, g(s) \qquad \text{and} \qquad g(s - a) \leq \frac{a}{s}\, g(0) + \Big( 1 - \frac{a}{s} \Big) g(s), \]
and summing these inequalities gives
\[ g(a) + g(s - a) \leq g(0) + g(s) \]
for all such $a$, with equality when $a = 0$ or $a = s$. If we next let $t = N - 2$, then
\[ u_{N-2}^*(s) = \sup_{0 \leq a \leq s} \big\{ g(a) + g(0) + g(s - a) \big\} = 2\,g(0) + g(s) \]
by the argument given above, with maxima again at $a = 0$ or $a = s$. Thus, by backwards induction, we can establish that for all $t = 1, \ldots, N$
\[ u_t^*(s) = (N - t)\,g(0) + g(s) \]
and that $d_t^*(s) \in \{0, s\}$. It follows that there are at least $N$ different optimal allocation patterns, $\pi_1^*, \ldots, \pi_N^*$, where $\pi_i^*$ is the pattern obtained by allocating all $M$ of the resources during the $i$th decision epoch and 0 during all of the others.
4.20) Consider the equipment replacement model with two decision epochs ($N = 3$), where the condition of the equipment decays from epoch to epoch according to the equation $s_{t+1} = s_t + X_t$, where $X_t$ is geometrically distributed with parameter $\pi = 0.4$, i.e., $P(X_t = k) = (1 - \pi)\pi^k$ for $k = 0, 1, 2, \ldots$, and the costs are determined by the parameters $R = 0$, $K = 5$, $h(s) = 2s$, and $r_3(s) = (5 - s)^+$.

By Theorem 4.7.5, we know that this problem is monotone, which implies that for each decision epoch $t < N$ there exists a threshold value $s_t^*$ such that the optimal decision rule takes the form $d_t^*(s) = 0$ if $s < s_t^*$ and $d_t^*(s) = 1$ otherwise. In other words, the optimal policy is to order a replacement in epoch $t$ if and only if the condition of the equipment is greater than or equal to level $s_t^*$ (higher values indicating poorer condition).

This policy can be found by carrying out the monotone backward induction algorithm. We begin by observing that $u_3^*(s) = (5 - s)^+$ for all $s \geq 0$. Upon substituting this expression into the optimal value equations, we then have
\[ u_2^*(s) = \max\big\{ -2s + E[(5 - s - X)^+],\; -5 + E[(5 - X)^+] \big\}, \]
where $X \sim \text{Geometric}(\pi)$. A simple calculation shows that the values of the expression $E[(5 - s - X)^+]$ are 4.34016, 3.3504, 2.376, 1.44 and 0.6 for $s = 0, 1, 2, 3$ and 4, respectively, and 0 for $s \geq 5$. Accordingly, we obtain the following values for $u_2^*(s)$:
\[ u_2^*(s) = \begin{cases} 4.34016 & \text{if } s = 0 \\ 1.3504 & \text{if } s = 1 \\ -0.65984 & \text{if } s \geq 2, \end{cases} \]
and we see that $s_2^* = 2$. Using the optimal value equations one more time gives
\[ u_1^*(s) = \max\big\{ -2s + E[u_2^*(s + X)],\; -5 + E[u_2^*(X)] \big\}, \]
where the expectations $E[u_2^*(s + X)]$ are equal to 2.8226176, 0.546304 and $-0.65984$ for $s = 0, 1$ or 2. Substituting these into the optimality equation shows that
\[ u_1^*(s) = \begin{cases} 2.8226176 & \text{if } s = 0 \\ -1.453696 & \text{if } s = 1 \\ -2.1773824 & \text{if } s \geq 2, \end{cases} \]
and so $s_1^* = 2$.
4.21) We consider a two-armed bandit model for a project management problem with two projects that are available for selection in each of three periods ($N = 4$). Project 1 yields a reward of one unit and always occupies state $s$, while project 2 occupies either state $t$ or state $u$. If project 2 is selected when it occupies state $t$, then it yields a reward of 0 and moves to state $u$ at the next decision epoch; if selected when it occupies state $u$, then it yields a reward of 2 units and moves to state $t$ with probability 0.5 and otherwise remains in state $u$. Assume a terminal reward of 0 and that project 2 does not change state when it is not selected.

The policy that maximizes the total expected reward over the three decision epochs can be found with the Bellman equations. Since there is no terminal reward, we have $u_4^*(\cdot) = 0$ on the state space $S = \{(s, t), (s, u)\}$, and so
\[ u_3^*((s, t)) = \max\{1, 0\} = 1, \qquad u_3^*((s, u)) = \max\{1, 2\} = 2, \]
with $d_3^*((s, t)) = 1$ and $d_3^*((s, u)) = 2$. Similarly,
\[ u_2^*((s, t)) = \max\big\{ 1 + u_3^*((s, t)),\; 0 + u_3^*((s, u)) \big\} = 2, \]
\[ u_2^*((s, u)) = \max\big\{ 1 + u_3^*((s, u)),\; 2 + 0.5\big( u_3^*((s, t)) + u_3^*((s, u)) \big) \big\} = 3.5, \]
with $d_2^*((s, t)) \in \{1, 2\}$ (both choices are optimal) and $d_2^*((s, u)) = 2$. Lastly,
\[ u_1^*((s, t)) = \max\big\{ 1 + u_2^*((s, t)),\; 0 + u_2^*((s, u)) \big\} = 3.5, \]
\[ u_1^*((s, u)) = \max\big\{ 1 + u_2^*((s, u)),\; 2 + 0.5\big( u_2^*((s, t)) + u_2^*((s, u)) \big) \big\} = 4.75, \]
with $d_1^*((s, t)) = d_1^*((s, u)) = 2$.

4.22) Our aim is to find a foraging strategy that maximizes the probability of survival over 30 days, using the model of lion foraging described in the problem statement. Recall that the reward function for this problem is
\[ r_N(s) = \begin{cases} 1 & \text{if } s > 0 \\ 0 & \text{if } s = 0, \end{cases} \]
with $r_t(s, a) = 0$ for all $s$ and $a$ when $t = 1, \ldots, 29$. Here $s \in S = [0, 30]$ specifies the energy reserves of an individual lion, with $s = 0$ corresponding to death of that lion, while the action $a \in A_s = \{0, \ldots, 6\}$ is either $a = 0$ if the lion chooses not to hunt on that day or $a \geq 1$ if the lion chooses to hunt in a group of size $a$.
I will assume that the lion can choose to hunt whenever its energy reserves are positive, but we could also reasonably require $s > 0.5$ since each hunt uses this amount of energy. Since $u_{30}^*(s) = r_{30}(s)$, the optimal value equations for $t = 29$ take the form
\[ u_{29}^*(s) = \max_a E[u_{30}^*(X_{30}) \mid X_{29} = s, Y_{29} = a] \]
\[ = \max\Big\{ u_{30}^*\big((s - 6)^+\big),\; \max_{1 \leq a \leq 6} \big\{ p_a\, u_{30}^*(\Phi(s, a)) + (1 - p_a)\, u_{30}^*\big((s - 6.5)^+\big) \big\} \Big\} \]
\[ = \begin{cases} 0 & \text{if } s = 0 \\ 0.43 & \text{if } 0 < s \leq 6 \\ 1 & \text{if } 6 < s \leq 30, \end{cases} \]
where $p_a$ is the probability of a successful hunt by a group containing $a$ animals and the function
\[ \Phi(s, a) \triangleq \min\big( 30,\; s + 164/a - 6.5 \big) \]
specifies the lion's energy reserves following a successful hunt by such a group. When $s > 6.5$, all actions are optimal on day 29 (with respect to this criterion), and every optimal decision rule is consistent with the following specification:
\[ d_{29}^*(s) = \begin{cases} 6 & \text{if } 0 < s \leq 6 \\ 0 & \text{if } 6 < s \leq 6.5 \\ 0, \ldots, 6 & \text{if } 6.5 < s \leq 30. \end{cases} \]
In general, the optimal value equations for this problem take the form
\[ u_t^*(s) = \begin{cases} 0 & \text{if } s = 0 \\ \max\Big\{ u_{t+1}^*\big((s - 6)^+\big),\; \max\limits_{1 \leq a \leq 6} \big\{ p_a\, u_{t+1}^*(\Phi(s, a)) + (1 - p_a)\, u_{t+1}^*\big((s - 6.5)^+\big) \big\} \Big\} & \text{otherwise.} \end{cases} \]
These are most easily solved with the help of a computer; a sample C program that does this is attached. When this program is executed, one finds that there is a unique optimal decision rule for days 1 to 20, while different rules are optimal on days 21 to 29; moreover, multiple rules are optimal on days 25 to 29. These are indicated below:
\[ d_t^*(s) = \begin{cases} 0 & \text{if } s \in (6, 6.5] \cup (12.5, 13] \cup (19, 19.5] \cup (25.5, 26] \\ 5 & \text{if } s \in (0, 31/6) \\ 6 & \text{if } s \in (31/6, 6] \cup (6.5, 12.5] \cup (13, 19] \cup (19.5, 25.5] \cup (26, 30] \end{cases} \]
for $t = 1, \ldots, 20$;
\[ d_t^*(s) = \begin{cases} 0 & \text{if } s \in (6, 6.5] \cup (12.5, 13] \cup (19, 19.5] \cup (25.5, 26] \\ 5 & \text{if } s \in (0, 14/3] \\ 6 & \text{if } s \in (14/3, 6] \cup (6.5, 12.5] \cup (13, 19] \cup (19.5, 25.5] \cup (26, 30] \end{cases} \]
for $t = 21, \ldots, 23$;
\[ d_t^*(s) = \begin{cases} 0 & \text{if } s \in (6, 6.5] \cup (12.5, 13] \cup (19, 19.5] \cup (25.5, 26] \\ 5 & \text{if } s \in (0, 31/6] \\ 6 & \text{if } s \in (31/6, 6] \cup (6.5, 12.5] \cup (13, 19] \cup (19.5, 25.5] \cup (26, 30] \end{cases} \]
for $t = 24$; and
\[ d_t^*(s) = \begin{cases} 5 & \text{if } s \in (0, 19/6] \\ 6 & \text{if } s \in (19/6, 6] \cup (6.5, 12.5] \cup (13, 19] \cup (19.5, 25.5] \cup (26, 30] \\ 0, 6 & \text{if } s \in (6, 6.5] \cup (12.5, 13] \cup (19, 19.5] \cup (25.5, 26] \end{cases} \]
for $t = 25$;
\[ d_t^*(s) = \begin{cases} 0 & \text{if } s \in (24, 24.5] \\ 6 & \text{if } s \in (0, 6] \cup (6.5, 12.5] \cup (13, 19] \cup (19.5, 24] \\ 0, 6 & \text{if } s \in (6, 6.5] \cup (12.5, 13] \cup (19, 19.5] \\ 0, \ldots, 6 & \text{if } s \in (24.5, 30] \end{cases} \]
for $t = 26$;
\[ d_t^*(s) = \begin{cases} 0 & \text{if } s \in (18, 18.5] \\ 6 & \text{if } s \in (0, 6] \cup (6.5, 12.5] \cup (13, 18] \\ 0, 6 & \text{if } s \in (6, 6.5] \cup (12.5, 13] \\ 0, \ldots, 6 & \text{if } s \in (18.5, 30] \end{cases} \]
for $t = 27$;
\[ d_t^*(s) = \begin{cases} 0 & \text{if } s \in (12, 12.5] \\ 6 & \text{if } s \in (0, 6] \cup (6.5, 12] \\ 0, 6 & \text{if } s \in (6, 6.5] \\ 0, \ldots, 6 & \text{if } s \in (12.5, 30] \end{cases} \]
for $t = 28$; and
\[ d_t^*(s) = \begin{cases} 0 & \text{if } s \in (6, 6.5] \\ 6 & \text{if } s \in (0, 6] \\ 0, \ldots, 6 & \text{if } s \in (6.5, 30] \end{cases} \]
for $t = 29$.

The near-periodicity of the optimal decision rules for days 1 to 23 seems odd and is probably not robust to natural variation in daily energy budgets or storage capacity. Furthermore, the optimality of foraging in a group of size 5 when an individual's energy reserves are low is at least partly an artifact of the assumption that energy stores exceeding the maximum storage capacity can be used to satisfy daily energy costs. For example, if the model is modified by setting
\[ \Phi(s, a) \triangleq \big( \min(30,\; s + 164/a) - 6.5 \big)^+, \]
then the optimal policy no longer has this feature. Arguably, the most robust prediction of the model is that lions should hunt in large groups when possible, as this increases the likelihood that the hunt is successful and only marginally reduces the amount by which an individual's energy reserves increase following a successful hunt.
lion_mdp.c:

```c
/* Solution of the Bellman equations for the lion foraging model in Puterman */
#include <stdio.h>

int main(void)
{
    int t, i, gain, loss, a, flag;
    int dec[9001][7];           /* dec[i][a] = 1 if action a is optimal in state i */
    double p[7];
    double pmax, psurv[7];
    static double u[31][9001];  /* states: s = i/300 for i = 0,...,9000 */

    /* capture probabilities */
    p[1] = 0.15; p[2] = 0.33; p[3] = 0.37;
    p[4] = 0.4;  p[5] = 0.42; p[6] = 0.43;

    /* boundary conditions for the policy evaluation algorithm */
    u[30][0] = 0;
    for (i = 1; i <= 9000; i++) u[30][i] = 1;

    /* dead state: decision rule is printed as "0" */
    for (a = 0; a <= 6; a++) dec[0][a] = 0;
    dec[0][0] = 1;

    /* backwards induction */
    for (t = 29; t >= 1; t--) {
        printf("\nt = %d\n", t);
        u[t][0] = 0.0;
        for (i = 1; i <= 9000; i++) {
            /* a = 0: lion doesn't hunt (daily cost 6 kg = 1800/300) */
            if (i > 1800) psurv[0] = u[t+1][i - 1800];
            else psurv[0] = 0;
            pmax = psurv[0];

            /* a > 0: lion hunts in a group of size a (cost 6.5 kg = 1950/300;
               a successful hunt yields 164/a kg per lion, i.e. 49200/(300a) --
               the kill size was lost in transcription and is reconstructed here
               from the thresholds reported in the solution) */
            for (a = 1; a <= 6; a++) {
                gain = i - 1950 + 49200 / a;   /* successful hunt */
                if (gain > 9000) gain = 9000;
                if (gain < 0) gain = 0;
                loss = i - 1950;               /* unsuccessful hunt */
                if (loss < 0) loss = 0;
                psurv[a] = p[a] * u[t+1][gain] + (1 - p[a]) * u[t+1][loss];
                if (psurv[a] > pmax) pmax = psurv[a];
            }

            for (a = 0; a <= 6; a++) {
                dec[i][a] = 0;
                if (psurv[a] == pmax)
                    dec[i][a] = 1;  /* mark the optimal actions */
            }
            u[t][i] = pmax;         /* survival probability under the optimal action */
        }

        printf("i = 0 (s = %lf): 0\n", 0.0);
        for (i = 1; i <= 9000; i++) {
            flag = 0;
            for (a = 0; a <= 6; a++)
                if (dec[i][a] != dec[i-1][a]) flag = 1;
            /* print optimal decision rules at break points */
            if (flag == 1) {
                printf("i = %d (s = %lf): ", i, ((double) i) / 300);
                for (a = 0; a <= 6; a++)
                    if (dec[i][a] == 1) printf("%d ", a);
                printf("\n");
            }
        }
    }
    return 0;
}
```
4.30) We need to find an optimal policy for a call option to purchase 100 shares of stock at a cost of \$31 per share over a 30 day period, when the initial price is \$30 per share and the daily price increases by \$0.10 with probability 0.6, remains the same with probability 0.1, and decreases by \$0.10 with probability 0.3. We assume that the transaction cost is \$50 and that the option expires at the end of the 30 day period. This is an example of an American call option, since the buyer has the right to purchase the stock at any time within those 30 days. In contrast, a European call option would allow the buyer to purchase the stock only on the expiration date. In either case, the buyer is not obligated to purchase the stock.

As explained in Section 3.4.4, we can formulate this model as an optimal stopping problem with no holding cost and reward
\[ r_t(s, Q) = 100(s - 31) - 50 = 100s - 3150 \]
when $s \in [0, \infty)$, and $r_t(s, C) = r_t(\Delta, Q) = 0$ for all $s \in S$ and all $t = 1, \ldots, 30$, where $\Delta$ denotes the cemetery state entered once the option has been exercised. Because the cemetery state is absorbing, it is clear that $u_t^*(\Delta) = 0$ for all $t = 1, \ldots, 31$. Furthermore, because the option has no value after its expiration date, we also know that $u_{31}^*(s) = r_{31}(s) = 0$ for all $s \in S$. Accordingly, the optimal value equation for $t = 30$ has the form
\[ u_{30}^*(s) = \max\{ 0,\; 100s - 3150 \} = (100s - 3150)^+, \]
which shows that the option to purchase the shares should be exercised on the 30th day if and only if the price of the stock is greater than or equal to \$31.50, i.e., an optimal decision rule for $t = 30$ is
\[ d_{30}^*(s) = \begin{cases} Q & \text{if } s \geq 31.5 \\ C & \text{otherwise.} \end{cases} \]
Similarly, the optimal value equations for $t \leq 29$ can be written as
\[ u_t^*(s) = \max\big\{ 0.6\, u_{t+1}^*(s + 0.1) + 0.1\, u_{t+1}^*(s) + 0.3\, u_{t+1}^*\big((s - 0.1)^+\big),\; 100s - 3150 \big\} \tag{2} \]
and I claim that the maximum is achieved when $d_t(s) = C$, so that
\[ u_t^*(s) = 0.6\, u_{t+1}^*(s + 0.1) + 0.1\, u_{t+1}^*(s) + 0.3\, u_{t+1}^*\big((s - 0.1)^+\big). \tag{3} \]
To verify this claim, first note that because the functions $u_{30}^*(s)$ and $100s - 3150$ are both convex, it follows that $u_{29}^*(s)$ is also convex: the expectation in (2) is a non-negative combination of convex functions of $s$, and the maximum of two convex functions is again a convex function. This in turn implies that $u_{28}^*(s)$ is convex, and then backwards induction allows us to conclude that all of the optimal value functions $u_t^*(s)$, $t = 1, \ldots, 30$, are convex. Similarly, either by induction or by an appeal to Proposition 4.7.3, we can conclude that all of the optimal value functions are non-decreasing. Taken together, these two properties imply that
\[ u_t^*(s) \geq 0.6\, u_{t+1}^*(s + 0.1) + 0.1\, u_{t+1}^*(s) + 0.3\, u_{t+1}^*\big((s - 0.1)^+\big) \]
\[ \geq 0.6\, u_{t+1}^*(s + 0.1) + 0.1\, u_{t+1}^*(s) + 0.3\, u_{t+1}^*(s - 0.1) \]
\[ \geq u_{t+1}^*\big( 0.6(s + 0.1) + 0.1 s + 0.3(s - 0.1) \big) = u_{t+1}^*(s + 0.03) \geq u_{t+1}^*(s). \]
However, since $u_{30}^*(s) \geq 100s - 3150$ for all $s \in \mathbb{R}$, a third induction argument shows that $u_t^*(s) \geq 100s - 3150$ for all $t = 1, \ldots, 30$, which then leads to (3).
These arguments have shown that the optimal policy for this problem is to wait until the expiration date and then purchase the shares if their price is at least \$31.50 per share. Thus, to calculate the value of the option when the initial price is \$30 per share, we merely need to evaluate $u_1^*(30)$, which we can do with the help of the policy evaluation algorithm and a computer (see the attached C code). This shows that the value of the option is very small. Indeed, since on average the share price increases by only three cents a day, the expected price of the stock at the end of the 30 day period is only \$30.90 per share, which is below the strike price.
call_option_mdp.c:

```c
/* Solution of the Bellman equations for the call option model in Puterman */
#include <stdio.h>

int main(void)
{
    int t;
    int i;                     /* state s = i * $0.10 */
    int dec[32][601];          /* optimal decision rules */
    static double u[32][601];  /* optimal value functions */
    double valc, valq;

    /* policy evaluation algorithm */
    for (i = 0; i <= 600; i++)
        u[31][i] = 0;          /* boundary conditions: no scrap value */

    for (t = 30; t >= 1; t--) {
        for (i = 0; i <= 600 - (31 - t); i++) {
            /* valc = expected value if the shares are not purchased */
            if (i > 0)
                valc = 0.6 * u[t+1][i+1] + 0.1 * u[t+1][i] + 0.3 * u[t+1][i-1];
            else
                valc = 0.6 * u[t+1][i+1] + 0.4 * u[t+1][i];

            /* valq = expected value if the shares are purchased */
            valq = 10 * ((double) (i - 310)) - 50;

            if (valc > valq) {
                u[t][i] = valc;
                dec[t][i] = 0;  /* C is optimal */
            } else {
                u[t][i] = valq;
                dec[t][i] = 1;  /* Q is optimal */
            }
        }
        printf("t = %d, u_t(30) = %lf, d_t(30) = %d\n", t, u[t][300], dec[t][300]);
    }

    printf("value of option = %lf\n", u[1][300]);
    return 0;
}
```
More information13.3 A Stochastic Production Planning Model
13.3. A Stochastic Production Planning Model 347 From (13.9), we can formally write (dx t ) = f (dt) + G (dz t ) + fgdz t dt, (13.3) dx t dt = f(dt) + Gdz t dt. (13.33) The exact meaning of these expressions
More informationAM 121: Intro to Optimization Models and Methods
AM 121: Intro to Optimization Models and Methods Lecture 18: Markov Decision Processes Yiling Chen and David Parkes Lesson Plan Markov decision processes Policies and Value functions Solving: average reward,
More informationOn solving multistage stochastic programs with coherent risk measures
On solving multistage stochastic programs with coherent risk measures Andy Philpott Vitor de Matos y Erlon Finardi z August 13, 2012 Abstract We consider a class of multistage stochastic linear programs
More informationDynamic Programming (DP) Massimo Paolucci University of Genova
Dynamic Programming (DP) Massimo Paolucci University of Genova DP cannot be applied to each kind of problem In particular, it is a solution method for problems defined over stages For each stage a subproblem
More informationUtility Indifference Pricing and Dynamic Programming Algorithm
Chapter 8 Utility Indifference ricing and Dynamic rogramming Algorithm In the Black-Scholes framework, we can perfectly replicate an option s payoff. However, it may not be true beyond the Black-Scholes
More informationSequential Investment, Hold-up, and Strategic Delay
Sequential Investment, Hold-up, and Strategic Delay Juyan Zhang and Yi Zhang February 20, 2011 Abstract We investigate hold-up in the case of both simultaneous and sequential investment. We show that if
More informationUncertainty in Equilibrium
Uncertainty in Equilibrium Larry Blume May 1, 2007 1 Introduction The state-preference approach to uncertainty of Kenneth J. Arrow (1953) and Gérard Debreu (1959) lends itself rather easily to Walrasian
More informationB. Online Appendix. where ɛ may be arbitrarily chosen to satisfy 0 < ɛ < s 1 and s 1 is defined in (B1). This can be rewritten as
B Online Appendix B1 Constructing examples with nonmonotonic adoption policies Assume c > 0 and the utility function u(w) is increasing and approaches as w approaches 0 Suppose we have a prior distribution
More informationProblem 1: Random variables, common distributions and the monopoly price
Problem 1: Random variables, common distributions and the monopoly price In this problem, we will revise some basic concepts in probability, and use these to better understand the monopoly price (alternatively
More informationTDT4171 Artificial Intelligence Methods
TDT47 Artificial Intelligence Methods Lecture 7 Making Complex Decisions Norwegian University of Science and Technology Helge Langseth IT-VEST 0 helgel@idi.ntnu.no TDT47 Artificial Intelligence Methods
More informationFundamental Theorems of Welfare Economics
Fundamental Theorems of Welfare Economics Ram Singh October 4, 015 This Write-up is available at photocopy shop. Not for circulation. In this write-up we provide intuition behind the two fundamental theorems
More informationAnswers to Problem Set 4
Answers to Problem Set 4 Economics 703 Spring 016 1. a) The monopolist facing no threat of entry will pick the first cost function. To see this, calculate profits with each one. With the first cost function,
More informationHomework 2: Solutions Sid Banerjee Problem 1: Practice with Dynamic Programming Formulation
Problem 1: Practice with Dynamic Programming Formulation A product manager has to order stock daily. Each unit cost is c, there is a fixed cost of K for placing an order. If you order on day t, the items
More informationNon-Deterministic Search
Non-Deterministic Search MDP s 1 Non-Deterministic Search How do you plan (search) when your actions might fail? In general case, how do you plan, when the actions have multiple possible outcomes? 2 Example:
More informationMathematics of Finance Final Preparation December 19. To be thoroughly prepared for the final exam, you should
Mathematics of Finance Final Preparation December 19 To be thoroughly prepared for the final exam, you should 1. know how to do the homework problems. 2. be able to provide (correct and complete!) definitions
More informationKIER DISCUSSION PAPER SERIES
KIER DISCUSSION PAPER SERIES KYOTO INSTITUTE OF ECONOMIC RESEARCH http://www.kier.kyoto-u.ac.jp/index.html Discussion Paper No. 657 The Buy Price in Auctions with Discrete Type Distributions Yusuke Inami
More informationThe Uncertain Volatility Model
The Uncertain Volatility Model Claude Martini, Antoine Jacquier July 14, 008 1 Black-Scholes and realised volatility What happens when a trader uses the Black-Scholes (BS in the sequel) formula to sell
More informationAn optimal policy for joint dynamic price and lead-time quotation
Lingnan University From the SelectedWorks of Prof. LIU Liming November, 2011 An optimal policy for joint dynamic price and lead-time quotation Jiejian FENG Liming LIU, Lingnan University, Hong Kong Xianming
More informationOptimal Dam Management
Optimal Dam Management Michel De Lara et Vincent Leclère July 3, 2012 Contents 1 Problem statement 1 1.1 Dam dynamics.................................. 2 1.2 Intertemporal payoff criterion..........................
More informationThe investment game in incomplete markets.
The investment game in incomplete markets. M. R. Grasselli Mathematics and Statistics McMaster University RIO 27 Buzios, October 24, 27 Successes and imitations of Real Options Real options accurately
More informationDepartment of Economics The Ohio State University Final Exam Answers Econ 8712
Department of Economics The Ohio State University Final Exam Answers Econ 872 Prof. Peck Fall 207. (35 points) The following economy has three consumers, one firm, and four goods. Good is the labor/leisure
More information16 MAKING SIMPLE DECISIONS
253 16 MAKING SIMPLE DECISIONS Let us associate each state S with a numeric utility U(S), which expresses the desirability of the state A nondeterministic action a will have possible outcome states Result(a)
More informationDepartment of Economics The Ohio State University Final Exam Questions and Answers Econ 8712
Prof. Peck Fall 016 Department of Economics The Ohio State University Final Exam Questions and Answers Econ 871 1. (35 points) The following economy has one consumer, two firms, and four goods. Goods 1
More informationMicroeconomic Foundations I Choice and Competitive Markets. David M. Kreps
Microeconomic Foundations I Choice and Competitive Markets David M. Kreps PRINCETON UNIVERSITY PRESS PRINCETON AND OXFORD Contents Preface xiii Chapter One. Choice, Preference, and Utility 1 1.1. Consumer
More informationTerm Structure Lattice Models
IEOR E4706: Foundations of Financial Engineering c 2016 by Martin Haugh Term Structure Lattice Models These lecture notes introduce fixed income derivative securities and the modeling philosophy used to
More informationCapacity Expansion Games with Application to Competition in Power May 19, Generation 2017 Investmen 1 / 24
Capacity Expansion Games with Application to Competition in Power Generation Investments joint with René Aïd and Mike Ludkovski CFMAR 10th Anniversary Conference May 19, 017 Capacity Expansion Games with
More informationIntroduction to Dynamic Programming
Introduction to Dynamic Programming http://bicmr.pku.edu.cn/~wenzw/bigdata2018.html Acknowledgement: this slides is based on Prof. Mengdi Wang s and Prof. Dimitri Bertsekas lecture notes Outline 2/65 1
More informationRobust Pricing and Hedging of Options on Variance
Robust Pricing and Hedging of Options on Variance Alexander Cox Jiajie Wang University of Bath Bachelier 21, Toronto Financial Setting Option priced on an underlying asset S t Dynamics of S t unspecified,
More informationStochastic Optimal Control
Stochastic Optimal Control Lecturer: Eilyan Bitar, Cornell ECE Scribe: Kevin Kircher, Cornell MAE These notes summarize some of the material from ECE 5555 (Stochastic Systems) at Cornell in the fall of
More informationForecast Horizons for Production Planning with Stochastic Demand
Forecast Horizons for Production Planning with Stochastic Demand Alfredo Garcia and Robert L. Smith Department of Industrial and Operations Engineering Universityof Michigan, Ann Arbor MI 48109 December
More informationKØBENHAVNS UNIVERSITET (Blok 2, 2011/2012) Naturvidenskabelig kandidateksamen Continuous time finance (FinKont) TIME ALLOWED : 3 hours
This question paper consists of 3 printed pages FinKont KØBENHAVNS UNIVERSITET (Blok 2, 211/212) Naturvidenskabelig kandidateksamen Continuous time finance (FinKont) TIME ALLOWED : 3 hours This exam paper
More informationDynamic Programming and Reinforcement Learning
Dynamic Programming and Reinforcement Learning Daniel Russo Columbia Business School Decision Risk and Operations Division Fall, 2017 Daniel Russo (Columbia) Fall 2017 1 / 34 Supervised Machine Learning
More informationNotes on the EM Algorithm Michael Collins, September 24th 2005
Notes on the EM Algorithm Michael Collins, September 24th 2005 1 Hidden Markov Models A hidden Markov model (N, Σ, Θ) consists of the following elements: N is a positive integer specifying the number of
More informationOptimal liquidation with market parameter shift: a forward approach
Optimal liquidation with market parameter shift: a forward approach (with S. Nadtochiy and T. Zariphopoulou) Haoran Wang Ph.D. candidate University of Texas at Austin ICERM June, 2017 Problem Setup and
More informationEE365: Risk Averse Control
EE365: Risk Averse Control Risk averse optimization Exponential risk aversion Risk averse control 1 Outline Risk averse optimization Exponential risk aversion Risk averse control Risk averse optimization
More informationAnswer: Let y 2 denote rm 2 s output of food and L 2 denote rm 2 s labor input (so
The Ohio State University Department of Economics Econ 805 Extra Problems on Production and Uncertainty: Questions and Answers Winter 003 Prof. Peck () In the following economy, there are two consumers,
More informationHomework 2: Dynamic Moral Hazard
Homework 2: Dynamic Moral Hazard Question 0 (Normal learning model) Suppose that z t = θ + ɛ t, where θ N(m 0, 1/h 0 ) and ɛ t N(0, 1/h ɛ ) are IID. Show that θ z 1 N ( hɛ z 1 h 0 + h ɛ + h 0m 0 h 0 +
More information