Stochastic Optimal Control


Lecturer: Eilyan Bitar, Cornell ECE
Scribe: Kevin Kircher, Cornell MAE

These notes summarize some of the material from ECE 5555 (Stochastic Systems) at Cornell in the fall of 2013. The theme is optimal control of discrete-time stochastic systems using dynamic programming. Although the course covered estimation as well as control, these notes only cover perfectly observed systems. The primary sources are Dynamic Programming and Optimal Control by Dimitri Bertsekas, Stochastic Systems: Estimation, Identification and Adaptive Control by P.R. Kumar and Pravin Varaiya, and Prof. Eilyan Bitar's lecture notes.

Contents

Part I: Theory
  The basic optimal control problem
  Dynamic programming

Part II: Applications
  Optimal stopping
  Inventory control
  Portfolio analysis
  LQ systems

Part I: Theory

The Basic Problem

Definitions

The state of a system is a variable that separates the future from the past. The state is constrained at each time k by

  x_k \in X_k \subseteq \mathbb{R}^n, \quad k = 0, 1, \dots, N,

where X_k is the feasible state space. The initial state x_0 is generally treated either as given, or as an n-dimensional random vector.

The control or input u_k is constrained to a feasible control space U_k:

  u_k \in U_k(x_k) \subseteq \mathbb{R}^m,

where m \le n.

The state disturbance w_k is a p-dimensional random vector,

  w_k \in W_k \subseteq \mathbb{R}^p,

where p \le n. The disturbance w_k is characterized by a probability distribution P_{w_k}(\cdot \mid x_k, u_k) which, by assumption, does not depend explicitly on prior disturbances w_0, \dots, w_{k-1}.

The state transition map is a function f_k : X_k \times U_k \times W_k \to X_{k+1} that describes the evolution of the state in time:

  x_{k+1} = f_k(x_k, u_k, w_k).

The measurement noise v_k is a q-dimensional random vector,

  v_k \in V_k \subseteq \mathbb{R}^q,

where q \le n. The noise v_k is characterized by a probability distribution

  P_{v_k}(\cdot \mid x_0, \dots, x_k, u_0, \dots, u_{k-1}, w_0, \dots, w_{k-1}, v_0, \dots, v_{k-1}),

which (unlike the disturbance) may depend explicitly on prior states, controls, disturbances and noises.

The observation y_k is a function of the state and noise:

  y_k = h_k(x_k, v_k) \in Y_k \subseteq \mathbb{R}^r,

where r \le n and h_k : X_k \times V_k \to Y_k is the measurement model.

The stage cost is a function g_k : X_k \times U_k(x_k) \times W_k \to \mathbb{R}. The terminal cost is a function g_N : X_N \to \mathbb{R}.

A policy or control law \pi is a sequence of decision functions \mu_k(y^k), where y^k = (y_0, \dots, y_k). Explicitly,

  \pi = \big( \mu_0(y_0), \mu_1(y_0, y_1), \dots, \mu_{N-1}(y_0, y_1, \dots, y_{N-1}) \big)
      = \big( \mu_0(y^0), \mu_1(y^1), \dots, \mu_{N-1}(y^{N-1}) \big).

A policy is called admissible or feasible if, for all k = 0, \dots, N-1, all y^k \in Y_0 \times \dots \times Y_k, and all x_k \in X_k,

  \mu_k(y^k) \in U_k(x_k)   and   f_k(x_k, \mu_k(y^k), w_k) \in X_{k+1}.

Denote the set of all admissible policies by \Pi.

The expected cost of a policy \pi, for a given initial state x_0, is the expected sum of the terminal cost and all stage costs under \pi:

  J_\pi(x_0) = \mathbb{E}_{w_k}\Big[ g_N(x_N^\pi) + \sum_{k=0}^{N-1} g_k(x_k^\pi, \mu_k(y^{k,\pi}), w_k) \Big].

The superscript \pi emphasizes the dependence of x_k and y_k on the policy \pi. Remember that y_k depends implicitly on x_k through the observation map. The expected cost of a policy \pi over all initial states, then, is J(\pi) = \mathbb{E}_{x_0}[J_\pi(x_0)].

Optimal Control

The goal of optimal control is as follows. Given a joint pdf on \{x_0, w_0, \dots, w_N, v_0, \dots, v_N\}, state transition maps f_k, and measurement models h_k, we seek to design a feasible policy \pi that minimizes the expected cost over all initial states, J(\pi). This is a harder problem than static optimization because we are optimizing over functions rather than numbers.

Formally, the basic optimal control problem is

  \min_{\pi \in \Pi} \; J(\pi)
  s.t. \; x_{k+1}^\pi = f_k(x_k^\pi, u_k^\pi, w_k),
       \; y_k^\pi = h_k(x_k^\pi, v_k),
       \; u_k^\pi = \mu_k(y_0^\pi, \dots, y_k^\pi) \in U_k(x_k^\pi).

A policy \pi^* is called optimal if it minimizes the expected cost,

  J(\pi^*) = \mathbb{E}_{x_0}[J_{\pi^*}(x_0)] = \min_{\pi \in \Pi} J(\pi).

We call J_{\pi^*}(x_0) the optimal cost function or optimal value function. It maps every initial state x_0 into an optimal cost. Interesting fact: in many cases, a policy \pi^* that minimizes J(\pi), the expected cost over all initial states, is also optimal for any particular initial state x_0.

Controlled Markov Chains

For stochastic systems with finite state spaces, we can use an alternative formulation. The controlled Markov chain description of a stochastic system is specified by

1. state transition probabilities P_{x_{k+1} \mid x_k, u_k}(x_{k+1} \mid x_k, u_k),
2. observation maps h_k(x_k, v_k) or observation probabilities P_{y_k \mid x_k}(y_k \mid x_k), and
3. the joint distribution of the basic RVs x_0, v_0, \dots, v_{N-1}.

More specifically, we can describe a controlled Markov chain as follows.

State: x_k \in X = \{1, 2, \dots, I\}. The feasible state space X is time invariant and finite.

Control: u_k \in U(x_k). The feasible control space U may be infinite.

State transition probability matrix, P(u) \in [0, 1]^{I \times I}:

  P(u) = [\, p_{ij}(u) \,], \quad 1 \le i, j \le I,

where p_{ij}(u) = \mathbb{P}(x_{k+1} = j \mid x_k = i, u_k = u).

Costs: the expected cost under a policy \pi, for initial condition x_0, is

  J_\pi(x_0) = \mathbb{E}\Big[ g_N(x_N^\pi) + \sum_{k=0}^{N-1} g_k(x_k^\pi, u_k^\pi) \Big].

Here the expectation is just a sum, weighted by the state transition probabilities p_{ij}(u), and g_k(x_k^\pi, u_k^\pi) and g_N(x_N^\pi) are the stage and terminal costs. The Bellman Equations are therefore

  V_N^*(i) = g_N(i),
  V_k^*(i) = \inf_{u \in U(i)} \Big\{ g_k(i, u) + \sum_{j=1}^{I} p_{ij}(u) \, V_{k+1}^*(j) \Big\}.

Information Patterns

Let I_k be the set of all observations that your controller is allowed to depend upon at time k. A stochastic system can follow one of four information patterns, defined by their particular forms of I_k.

1. Oracle. I_k = \{w_0, \dots, w_{N-1}\} \cup \{v_0, \dots, v_{N-1}\} \cup \{x_0\}. Perfect knowledge of past, present and future: realizations of all disturbances, noises and the initial state are known.

2. No knowledge. I_k = \emptyset.

3. Perfect information. I_k = \{x_k\}. The current state is completely observed, with no noise.

4. Imperfect information. I_k = \{y^k\}. The current state is incompletely observed, possibly with noise.

Mathematically, patterns 1 and 2 are easy; patterns 3 and 4 are hard. In order of performance, 1 > 3 > 4 > 2.
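As a concrete (if toy) illustration of this recursion, here is a minimal sketch in Python. The three-state, two-control chain, its stage costs, and the horizon are invented placeholders, not from the lecture; only the backward Bellman recursion itself mirrors the equations above.

```python
# Finite-horizon Bellman recursion for a controlled Markov chain.
# P[u] is an I-by-I transition matrix, g[u] an I-vector of stage costs
# under control u, g_N the terminal cost vector, N the horizon.
import numpy as np

def finite_horizon_dp(P, g, g_N, N):
    I = len(g_N)
    V = np.zeros((N + 1, I))
    mu = np.zeros((N, I), dtype=int)   # optimal decision functions mu_k(i)
    V[N] = g_N
    for k in range(N - 1, -1, -1):     # backward in time
        # Q[u, i] = g_k(i, u) + sum_j p_ij(u) V_{k+1}(j)
        Q = np.array([g[u] + P[u] @ V[k + 1] for u in range(len(P))])
        mu[k] = Q.argmin(axis=0)
        V[k] = Q.min(axis=0)
    return V, mu

# Toy instance: 3 states, 2 controls, stationary costs and transitions.
P = [np.array([[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.1, 0.9]]),
     np.array([[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5]])]
g = [np.array([1.0, 2.0, 3.0]), np.array([2.0, 1.0, 1.0])]
V, mu = finite_horizon_dp(P, g, g_N=np.zeros(3), N=10)
print(V[0], mu[0])
```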

Dynamic Programming

Dynamic programming (DP) is the primary tool for solving optimal control problems. Here we present the dynamic programming algorithm for a perfectly observed Markovian system, i.e. one with

  y_k^\pi = x_k^\pi  (perfect observation)   and   u_k^\pi = \mu_k(x_k^\pi)  (Markovian policies).

Assume that the random vectors \{x_0, w_0, \dots, w_N\} are independent.

Dynamic Programming Algorithm

The DP algorithm consists of the following three steps:

1. Define the optimal terminal value function by V_N^*(x_N) = g_N(x_N).

2. At each stage k = N-1, \dots, 0, recursively define the optimal stage value function by solving the following optimization problem:

     V_k^*(x_k) = \min_{u_k \in U_k(x_k)} \mathbb{E}_{w_k}\big[ g_k(x_k, u_k, w_k) + V_{k+1}^*\big( f_k(x_k, u_k, w_k) \big) \big].

   The solution to the above yields
   - the optimal stage decision function \mu_k^*(x_k), and
   - the optimal stage value function V_k^*.

3. At stage 0, calculate the optimal expected cost over all initial states:

     J(\pi^*) = \mathbb{E}_{x_0}[V_0^*(x_0)].

The optimal policy \pi^* is the sequence of decision functions found in Step 2. The equations in Steps 1 and 2 are collectively referred to as the Bellman Equation.

Optimality of the Bellman Equation

Some theorems necessary for proving that dynamic programming works.
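Here is a minimal sketch of the DP algorithm for a perfectly observed system with a finite state grid, a finite control set, and a finite-support disturbance. The dynamics f, costs g and g_N, and the disturbance distribution are hypothetical choices made only so the recursion runs end to end; they are not from the notes.

```python
# DP algorithm: V_k(x) = min_u E_w[ g(x,u,w) + V_{k+1}(f(x,u,w)) ].
import numpy as np

states = np.arange(-5, 6)          # x_k in {-5, ..., 5}
controls = np.array([-1, 0, 1])    # u_k in {-1, 0, 1}
w_vals, w_probs = np.array([-1, 0, 1]), np.array([0.25, 0.5, 0.25])
N = 20

def f(x, u, w):                    # assumed dynamics: clipped random walk
    return int(np.clip(x + u + w, states[0], states[-1]))

def g(x, u, w):                    # assumed stage cost
    return x**2 + 0.5 * u**2

g_N = {x: float(x**2) for x in states}   # terminal cost

V = {N: g_N}
policy = {}
for k in range(N - 1, -1, -1):
    V[k], policy[k] = {}, {}
    for x in states:
        # expected cost-to-go for each control, averaging over w_k
        q = [sum(p * (g(x, u, w) + V[k + 1][f(x, u, w)])
                 for w, p in zip(w_vals, w_probs)) for u in controls]
        best = int(np.argmin(q))
        V[k][x], policy[k][x] = q[best], controls[best]

print(V[0][0], policy[0][3])       # optimal cost-to-go from x=0, action at x=3
```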

Preliminaries

Let's build up to the proof of optimality of the Bellman Equation. Here are a few useful results, presented without proof. The proofs can be found in Kumar and Varaiya, pages 72-77.

Lemma (Markov Property). Let \Pi_M \subseteq \Pi be the set of all Markovian feasible policies, i.e. those where the policy depends on the current state only, so \mu_k = \mu_k(x_k). If \pi \in \Pi_M, then

  x_{k+1}^\pi \perp \{x_{k-1}^\pi, x_{k-2}^\pi, \dots\} \mid x_k^\pi

or, equivalently,

  x_{k+1}^\pi = f_k(x_k^\pi, \mu_k(x_k^\pi), w_k)   with   w_k \perp \{x_0^\pi, \dots, x_{k-1}^\pi\}.

A Markovian policy is one where the control depends only on the current state. If the policy is Markovian, then the state process is Markovian too.

Definition. At stage k, the expected cost-to-go for the Basic Problem under policy \pi is

  J_k^\pi(x_0) = \mathbb{E}_{w_k}\Big[ g_N(x_N^\pi) + \sum_{j=k}^{N-1} g_j(x_j^\pi, u_j^\pi, w_j) \;\Big|\; x_0^\pi, \dots, x_k^\pi \Big].

The expected total cost under \pi over all initial states x_0, then, is J(\pi) = \mathbb{E}_{x_0}[J_0^\pi(x_0)].

The Comparison Principle

Lemma (Comparison Principle). Let V_k(x_k), k = 0, \dots, N, be any collection of functions that satisfy

  C1. V_N(x_N) \le g_N(x_N) for all x_N \in X_N, and
  C2. V_k(x_k) \le \mathbb{E}_{w_k}\big[ g_k(x_k, u_k, w_k) + V_{k+1}(f_k(x_k, u_k, w_k)) \big] for all x_k \in X_k, u_k \in U_k(x_k), and k = 0, \dots, N-1.

Then for any \pi \in \Pi,

  V_k(x_k^\pi) \le J_k^\pi(x_0), \quad k = 0, \dots, N.

Corollary. Let V_0(x_0), \dots, V_N(x_N) satisfy C1 and C2. Then J(\pi^*) \ge \mathbb{E}_{x_0}[V_0(x_0)], and if J_0^\pi(x_0) = V_0(x_0), then \pi is optimal.

In words: such value functions lower bound the expected cost of every feasible policy, and hence the optimal expected cost. If the total expected cost under a feasible policy \pi equals the stage 0 value function, then \pi is optimal.

Optimality of DP

Definition. Let the following define the Bellman Equation:

  A1. V_N(x_N) = g_N(x_N),
  A2. V_k(x_k) = \inf_{u_k \in U_k(x_k)} \mathbb{E}_{w_k}\big[ g_k(x_k, u_k, w_k) + V_{k+1}( f_k(x_k, u_k, w_k) ) \big].

Theorem (Optimality of DP).

1. For any \pi \in \Pi, V_k(x_k^\pi) \le J_k^\pi(x_0), and in particular, J(\pi) \ge \mathbb{E}_{x_0}[V_0(x_0)].

2. If a Markov policy \pi \in \Pi_M achieves the infimum in A2, then (i) \pi is optimal, (ii) V_k(x_k^\pi) = J_k^\pi(x_0), and (iii) J(\pi) = \mathbb{E}_{x_0}[V_0(x_0)] (i.e. the inequalities in 1 are binding).

3. A Markov policy \pi \in \Pi_M is optimal only if, for each k, the infimum at x_k^\pi is achieved by u_k^\pi = \mu_k(x_k^\pi), i.e.

     V_k(x_k^\pi) = \mathbb{E}_{w_k}\big[ g_k(x_k^\pi, \mu_k(x_k^\pi), w_k) + V_{k+1}(f_k(x_k^\pi, \mu_k(x_k^\pi), w_k)) \big].

In short: a feasible Markov policy is optimal if and only if it satisfies the Bellman Equation.

Monotonicity of DP

Consider the stationary version of the basic problem, i.e. X_k = X, U_k = U, f_k = f, g_k = g for all k, and w_k iid. If, in this case, the terminal value function dominates the stage N-1 value function, then the value functions are monotonically non-decreasing in k:

  V_N^*(x) \ge V_{N-1}^*(x) for all x \in X  \implies  V_{k+1}^*(x) \ge V_k^*(x) for all x \in X and all k.

Similarly, if the stage N-1 value function dominates the terminal value function, then the value functions are monotonically non-increasing in k:

  V_N^*(x) \le V_{N-1}^*(x) for all x \in X  \implies  V_{k+1}^*(x) \le V_k^*(x) for all x \in X and all k.
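A quick numerical sanity check of part 1 of the theorem, on a made-up three-state, two-control chain (the same kind of toy instance as before, not from the notes): the DP value function computed by the Bellman recursion lower bounds the cost-to-go of an arbitrary fixed Markov policy, evaluated exactly by backward policy evaluation.

```python
# Check: V_0 (from the Bellman recursion) <= J_0^pi (cost of a fixed policy pi),
# componentwise, for an arbitrary suboptimal Markov policy.
import numpy as np

I_states, N = 3, 15
P = [np.array([[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.1, 0.9]]),
     np.array([[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5]])]
g = [np.array([1.0, 2.0, 3.0]), np.array([2.0, 1.0, 1.0])]
g_N = np.zeros(I_states)

V, J_pi = g_N.copy(), g_N.copy()
pi = np.array([0, 0, 0])                      # fixed Markov policy: always control 0
for k in range(N - 1, -1, -1):
    Q = np.array([g[u] + P[u] @ V for u in range(len(P))])
    V = Q.min(axis=0)                          # Bellman recursion
    J_pi = np.array([g[pi[i]][i] + P[pi[i]][i] @ J_pi for i in range(I_states)])

print(V)        # optimal cost-to-go V_0
print(J_pi)     # cost-to-go of pi; componentwise >= V_0
```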

Part II: Applications

We studied the following problems in the segment of the class on perfectly observed systems.

1. Moving on a graph.
2. (Deterministic) shortest path.
3. Multi-period gambling (a.k.a. dynamic portfolio analysis).
4. Optimal stopping problems:
   - asset selling
   - deadline purchasing
   - the secretary problem
   - general stopping problems.
5. Perfectly observed linear systems with quadratic costs (LQ):
   - finite horizon
   - infinite horizon.
6. Inventory control (book only).
7. Scheduling (e.g. the EV problem; see Bertsekas 4.5).

Optimal Stopping Problems

Asset Selling

A series of offers w_k is received at stages k = 0, \dots, N. The decisions are to either stop (in which case the offer is taken and invested at interest rate r) or to continue. If at stage N you haven't taken an offer, you must take offer w_N.

Structure of the optimal policy: threshold. If x_k > \alpha_k, accept. For iid offers, the thresholds monotonically decrease with time, i.e. \alpha_{k+1} \le \alpha_k. If the horizon N is very large, the optimal policy is well approximated by a stationary threshold policy: accept if x_k > \bar{\alpha}.

A good practice problem: derive the thresholds \alpha_k and \bar{\alpha}.
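A numerical sketch of the threshold recursion, assuming (purely for illustration, not from the lecture) iid Uniform[0, 1] offers, an accepted offer invested at rate r until stage N, and rejected offers lost:

```python
# Asset selling: V_N(x) = x, V_k(x) = max{ (1+r)^(N-k) x, E[V_{k+1}(w)] },
# with threshold alpha_k = E[V_{k+1}(w)] / (1+r)^(N-k).
import numpy as np

N, r = 20, 0.05
w = np.linspace(0.0, 1.0, 2001)          # grid approximating Uniform[0,1] offers
V_next = w.copy()                        # V_N(x) = x: must accept the last offer
alpha = np.zeros(N)                      # accept at stage k iff x_k > alpha[k]

for k in range(N - 1, -1, -1):
    cont = V_next.mean()                 # E[V_{k+1}(w)], value of rejecting and continuing
    accept = (1 + r) ** (N - k) * w      # value of accepting offer x_k = w now
    alpha[k] = cont / (1 + r) ** (N - k) # indifference point between accept and continue
    V_next = np.maximum(accept, cont)    # V_k(x) = max{accept, continue}

print(np.round(alpha, 3))                # thresholds decrease as k -> N
```

For this toy model the printed thresholds indeed decrease with k, consistent with the monotonicity claim above.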

Deadline Purchasing

A similar problem to asset selling. Some stuff has to be bought by some deadline. At each stage there's a random price w_k. Whether the prices are correlated or iid, the optimal policy ends up being a threshold policy: purchase if x_k < \alpha_k. The \alpha_k are given by a linear system, and for both the iid and correlated price cases the thresholds are monotonically increasing with time, i.e. \alpha_{k+1} \ge \alpha_k.

Good/easy practice problem: derive the thresholds \alpha_k for the iid case. Harder problem: do the same for the correlated case.

General Stopping Problems

A stationary problem (X_k = X, U_k = U, f_k = f, g_k = g, w_k iid). At each stage, you can pay a cost t(x_k) to stop, at which point you stay stopped forever at no additional cost. Stopping is mandatory by stage N. The Bellman Equations for this problem are

  V_N^*(x_N) = t(x_N),
  V_k^*(x_k) = \min\Big\{ t(x_k), \; \min_{u_k \in U(x_k)} \mathbb{E}_{w_k}\big[ g(x_k, u_k, w_k) + V_{k+1}^*(f(x_k, u_k, w_k)) \big] \Big\}.

The optimal policy is to stop when x_k \in T_k, where T_k is the optimal stopping set, defined by

  T_k = \Big\{ x_k : t(x_k) \le \min_{u_k \in U(x_k)} \mathbb{E}_{w_k}\big[ g(x_k, u_k, w_k) + V_{k+1}^*(f(x_k, u_k, w_k)) \big] \Big\}.

For the general stopping problem, T_0 \subseteq \dots \subseteq T_k \subseteq T_{k+1} \subseteq \dots \subseteq T_{N-1}, i.e. the optimal stopping sets get larger as time goes on. For the special case where the stage N-1 stopping set is absorbing, i.e.

  f(x, u, w) \in T_{N-1} for all x \in T_{N-1}, u \in U, and w,

we get the nice result that T_k = T_{N-1} for all k. In words, under this condition the optimal stopping set at stage k is the set of all states for which it is better to stop now than to proceed for one more stage and then stop. Policies of this type are called one-step look-ahead policies.

Good problems: work through Examples 4.4.2 and, time permitting, 4.4.1.

The Secretary Problem

You plan to interview (or date) N secretaries (partners). You want to maximize the probability that you hire (marry) the best secretary (wife). (NB: this is a different objective than our typical metric of optimizing the expected value.) Assume that you interview (date) one candidate per stage and that each candidate has a random quality w_k \in [0, 1].

There are four possible states: (you've stopped or you haven't) x (the current candidate is the best so far or not). There are three possible controls: accept the current candidate, reject them, or do nothing. If you've already stopped, your only choice is to do nothing. If you've already seen a better candidate than your current option, then you can't stop now. If you reach the last stage without hiring (marrying) anyone, then you have to hire (marry) your last option.

Result: the optimal policy is a threshold policy: stop at the first best-so-far candidate at or after stage k^*, where

  k^* = \inf\Big\{ k \le N : \frac{k}{N} \ge V_k^*(0) \Big\},  \quad  V_k^*(0) = \frac{k}{N} \sum_{j=k}^{N-1} \frac{1}{j}

is a non-increasing function of k, and the state 0 means you've yet to pick, but the current candidate is not the best you've seen so far. As expected, the thresholds V_k^*(0) get lower as time goes on.

A nice approximate solution that's independent of the distribution of w_k: stop after k^* secretaries, where k^* \approx N/e. Expect to date 10 people of random and iid quality? Marry the fourth one (10/2.7 \approx 4).

Inventory Control

Here the state is the stuff held in inventory, the control is how much new stuff to order at each stage (at a unit cost c), and the disturbance is the demand at each stage.

No Fixed Cost

In the simple version of this problem, the stage cost is p per unit of shortage if x_k < 0 (a shortage cost) and h per unit held if x_k > 0 (a holding cost). This leads to a convex optimization with a single solution, and therefore a threshold policy: order S_k - x_k if x_k < S_k, and order nothing if x_k \ge S_k. We can interpret S_k as the optimal amount of stock to have at stage k. Formally, we have

  V_k(x_k) = \min_{y_k \ge x_k} G_k(y_k) - c x_k,

where

  G_k(y) = c y + H(y) + \mathbb{E}_{w}\big[ V_{k+1}(y - w) \big],
  H(y) = p \, \mathbb{E}_{w_k}[\max(0, w_k - y)] + h \, \mathbb{E}_{w_k}[\max(0, y - w_k)],
  y_k = x_k + u_k,  and  S_k = \arg\min_{y \in \mathbb{R}} G_k(y).
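A small sketch checking the secretary-problem threshold numerically; N = 10 is an arbitrary illustrative horizon, matching the dating example above.

```python
# The continuation value V_k(0) = (k/N) * sum_{j=k}^{N-1} 1/j is compared with
# the win probability k/N of accepting a best-so-far candidate at stage k;
# k* is the first stage where accepting is at least as good.
import math

N = 10
def V0(k):                      # value of rejecting at stage k and continuing
    return (k / N) * sum(1.0 / j for j in range(k, N))

k_star = min(k for k in range(1, N) if k / N >= V0(k))
print("k* =", k_star)                        # first stage where you stop
print("N/e ~", round(N / math.e, 2))         # distribution-free approximation
```

For N = 10 this reproduces the "marry the fourth one" answer (k^* = 4), close to N/e \approx 3.7.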

Positive Fixed Cost

In this case, we add a fixed cost K to the stage cost (which is still either a shortage or a holding cost). The optimal policy is still a threshold policy, but with a different threshold: order S_k - x_k if x_k < s_k, and order nothing if x_k \ge s_k. The derivation is harder, however, because the cost function is no longer convex, but K-convex. Formal definitions: y_k, G_k(y), S_k and H(y) are defined as above, but now

  s_k = \min\{ y : G_k(y) = K + G_k(S_k) \}.

A policy of this form is called a multiperiod (s, S) policy.

Dynamic Portfolio Analysis

In this problem, an investor decides how to distribute an initial fortune x_0 among a collection of n assets with random rates of return e_1, \dots, e_n. The investor also has the option of a risk-free asset with rate of return s. The controls u_1, \dots, u_n are the investments in each of the risky assets. We assume the objective is to maximize the expectation of a utility function U(x), assumed concave and twice differentiable.

Single Period

The investor's fortune after a single period is

  x_1 = s x_0 + \sum_{i=1}^{n} (e_i - s) u_i.

When the utility function satisfies -U'(x)/U''(x) = a + b x (e.g. exponential, logarithmic or power functions) for all x and for some scalars a, b, then the optimal portfolio is the linear policy

  \mu_i^*(x_0) = \alpha_i (a + b s x_0), \quad i = 1, \dots, n,

where the \alpha_i are constants. The single-period optimal policy is called myopic.

Multiperiod

Here we reinvest the stage k-1 fortune at stage k:

  x_{k+1} = s_k x_k + \sum_{i=1}^{n} (e_i^k - s_k) u_i^k, \quad k = 0, \dots, N-1,

and we maximize the expected utility of the terminal fortune x_N. If the utility function has the properties above, then we again get linear controls:

  \mu_k^*(x_k) = \alpha_k \Big( \frac{a}{s_{N-1} \cdots s_{k+1}} + b s_k x_k \Big) \in \mathbb{R}^n,

where \alpha_k is a random vector that depends on the joint distribution of the rates of return.

For utility functions of the form \ln(x) or b^{-1}(bx)^{1 - 1/b} with b \neq 0, b \neq 1, or for risk-free rate of return s_k = s = 1 for all k, it turns out that the myopic policy is optimal at every stage. In some other cases, a partially myopic (limited look-ahead) policy is optimal. For the infinite horizon problem, the optimal multiperiod policy approaches myopia.

Perfectly Observed Linear Systems with Quadratic Costs (a.k.a. LQ Systems)

Finite Horizon

First we study the non-stationary linear system

  x_{k+1} = A_k x_k + B_k u_k + w_k, \quad k = 0, \dots, N-1.

(Here w_k is an n-dimensional random vector; if randomness does not enter on all channels, then the corresponding elements of w_k are just zero.) The stage and terminal costs are quadratic, so the Bellman Equations are

  V_N^*(x_N) = x_N^T Q_N x_N,
  V_k^*(x_k) = \min_{u_k} \Big\{ x_k^T Q_k x_k + u_k^T R_k u_k + \mathbb{E}_{w_k}\big[ V_{k+1}^*(A_k x_k + B_k u_k + w_k) \big] \Big\}.

We assume that the matrices Q_k are symmetric positive semidefinite and the matrices R_k are symmetric positive definite. The controls u_k are unconstrained, and the disturbances w_k are zero-mean, finite-variance independent random vectors with distributions independent of the state and control.

Result. The optimal value functions V_k^* are quadratic, so the optimal control law is linear in the state:

  \mu_k^*(x_k) = L_k x_k,    (1)

where

  L_k = -(B_k^T K_{k+1} B_k + R_k)^{-1} B_k^T K_{k+1} A_k,

and the matrices K_k are given by the following recursion, known as the discrete-time Riccati equation (DRE):

  K_N = Q_N,
  K_k = A_k^T \big( K_{k+1} - K_{k+1} B_k (B_k^T K_{k+1} B_k + R_k)^{-1} B_k^T K_{k+1} \big) A_k + Q_k.

The optimal cost (for a given initial state) is

  J_0^*(x_0) = x_0^T K_0 x_0 + \sum_{k=0}^{N-1} \mathbb{E}_{w_k}\big[ w_k^T K_{k+1} w_k \big].
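The finite-horizon result is easy to compute numerically. Below is a minimal sketch of the backward DRE recursion and the gains L_k, on made-up (A, B, Q, R) matrices; nothing about the specific numbers comes from the notes.

```python
# Backward DRE recursion: K_N = Q_N, then K_k and L_k for k = N-1, ..., 0.
import numpy as np

A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)            # stage state cost (also used as terminal cost Q_N here)
R = np.array([[0.1]])
N = 50

K = Q.copy()             # K_N = Q_N
L = [None] * N
for k in range(N - 1, -1, -1):
    L[k] = -np.linalg.solve(B.T @ K @ B + R, B.T @ K @ A)       # L_k
    K = A.T @ (K - K @ B @ np.linalg.solve(B.T @ K @ B + R, B.T @ K)) @ A + Q

print(L[0])              # gain applied at stage 0: u_0 = L[0] @ x_0
print(K)                 # K_0; optimal cost from x_0 is x_0^T K_0 x_0 plus noise terms
```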

It's occasionally useful to express the disturbance term in the optimal cost above as

  \mathbb{E}_{w_k}\big[ w_k^T K_{k+1} w_k \big] = \mathrm{Tr}\big( K_{k+1} \mathrm{Cov}(w_k, w_k) \big),
  where  \mathrm{Cov}(w_k, w_k) = \mathbb{E}_{w_k}\big[ w_k w_k^T \big].

Infinite Horizon

Now we consider the (mostly) stationary linear system

  x_{k+1} = A x_k + B u_k + w_k, \quad k = 0, \dots, N-1,

with Q_k = Q and R_k = R. (It's mostly stationary because the disturbances w_k are not assumed to be iid.) In the large-horizon regime (N \gg 1, k \ll N), the optimal finite-horizon control law (1) is well approximated by

  \mu^*(x) = L x,  where  L = -(B^T K B + R)^{-1} B^T K A,

and the matrix K solves the discrete-time algebraic Riccati equation (DARE):

  K = A^T \big( K - K B (B^T K B + R)^{-1} B^T K \big) A + Q.

This linear control law is easy to implement and applicable to a wide range of problems. It's so nice to work with, in fact, that people often squish problems into the LQ framework so that these powerful results (and the analogous ones for the LQG problem) can be applied. Many variations of the LQ problem have been studied, such as non-zero-mean disturbances, tracking a trajectory rather than driving the state to 0, and random system matrices. The last is a tool for overcoming the sensitivity to modeling error that these results often exhibit.

When is it valid to use the asymptotic approximation, i.e. to use the DARE in place of the DRE? When the following conditions are met:

1. The system is stationary: A_k = A, B_k = B, Q_k = Q, R_k = R for all k,
2. Q = Q^T \succeq 0 and R = R^T \succ 0,
3. (A, B) is a controllable pair, and
4. (A, C) is an observable pair, where we write Q = C^T C.

Under these conditions, we are guaranteed that:

1. There exists a K \succeq 0 such that, for every terminal condition K_N, the DRE iterates satisfy K_k(K_N) \to K as the number of stages to go N - k \to \infty,
2. K is the unique positive semidefinite solution of the DARE, and
3. The closed-loop system x_{k+1} = (A + B L) x_k is stable, where L = -(B^T K B + R)^{-1} B^T K A.
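One simple (if crude) way to get K is to run the DRE until it stops changing, which is exactly the limit described in guarantee 1 above. The sketch below does this on the same made-up matrices as the finite-horizon sketch and then checks closed-loop stability.

```python
# Iterate the DRE to a fixed point to approximate the DARE solution K,
# then form the stationary gain L and check that A + B L is stable.
import numpy as np

A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)
R = np.array([[0.1]])

K = Q.copy()
for _ in range(10_000):                       # run the DRE until it stops changing
    K_new = A.T @ (K - K @ B @ np.linalg.solve(B.T @ K @ B + R, B.T @ K)) @ A + Q
    if np.max(np.abs(K_new - K)) < 1e-10:
        K = K_new
        break
    K = K_new

L = -np.linalg.solve(B.T @ K @ B + R, B.T @ K @ A)
print(K)                                      # approximate DARE solution
print(np.abs(np.linalg.eigvals(A + B @ L)))   # all < 1: closed loop is stable
```

In practice one would typically call a library DARE solver (e.g. scipy.linalg.solve_discrete_are) rather than iterating by hand; the fixed-point iteration above is just the DRE run until the horizon is effectively infinite.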