EECS291E Hybrid Systems

Lecture 21: Dynamic Programming

Karl Henrik Johansson

Dynamic programming is a method to solve optimal control problems. Here we introduce the notion by discussing dynamic programming for a combinatorial problem and dynamic programming for continuous-time systems.

1 Discrete-Time Systems

Consider the problem of traveling at minimum cost from point A to point B in the following graph:

[Figure: weighted graph with start vertex A and end vertex B.]

The weights on the edges denote the cost of taking a particular way between two vertices. We assume that it is only possible to move north-east and south-east in the graph. One way of solving this problem is, of course, to calculate the total cost for all possible paths going from A to B. A more efficient way (in particular, if the number of vertices is large) is to use dynamic programming.

To illustrate the idea, let us introduce some notation. Let $x(t) \in \{1, \dots, 4\} \times \{1, \dots, 4\}$ denote the state at (discrete) time $t \in \{0, \dots, 6\}$, where each time step corresponds to moving one step to the right in the graph. The vertices are labeled after the rows (pointing south-east) and the columns (pointing north-east) in the graph, such that the initial state A is $x(0) = (1, 1)$ and the final state B is $x(6) = (4, 4)$. At each time $t \in \{0, \dots, 5\}$, a control action $u(t) \in \{\pm 1\}$ is to be chosen, which decides how the next step should be taken ($u = +1$ for north-east and $u = -1$ for south-east). Note that for some states the control is constrained; for example, at the state $(4, 1)$ only $u = +1$ is possible.

Define the cost function
$$J(u(\cdot)) = \sum_{k=0}^{5} W(x(k), x(k+1))$$
for a path from A to B, where $W(x(k), x(k+1))$ is the weight on the edge $(x(k), x(k+1))$. (Here $x(\cdot)$ denotes a function, in contrast to a single point, $x(k)$, say. Note also that, of course, $x(\cdot)$ depends on $u(\cdot)$ through the dynamics imposed by the graph.) Let the cost-to-go function $J^*(x(t), t)$ denote the optimal (minimum) cost starting in $x(t)$ at time $t$, i.e.,
$$J^*(x(t), t) = \min_{u(t), \dots, u(5)} \sum_{k=t}^{5} W(x(k), x(k+1)).$$

The dynamic programming solution to the traveling problem is based on deriving the cost-to-go function for all states, starting with the final state and then going backwards until the initial state is reached. We do the bookkeeping by specifying the value $J^*(x(t), t)$ at the vertex corresponding to $(x(t), t)$. At the final state, we obviously have $J^*((4, 4), 6) = 0$. At $(3, 4)$, we have
$$J^*((3, 4), 5) = J^*((4, 4), 6) + W((3, 4), (4, 4)) = 0 + W((3, 4), (4, 4)),$$
and similarly we have $J^*((4, 3), 5) = W((4, 3), (4, 4))$, as indicated in the following detail of the graph:

[Figure: detail of the graph near B, with the cost-to-go values and arrows at (3, 4), (4, 3), and (4, 4).]

The arrows indicate the optimal control, i.e., the optimal action to take at a certain state. For $(3, 4)$ and $(4, 3)$ the optimal controls are, of course, the trivial choices shown above. Deriving $J^*(\cdot, 4)$ yields the following detail of the graph:

[Figure: detail of the graph with the cost-to-go values and arrows added for the states at time t = 4.]

Continuing like this yields the cost-to-go value and the optimal arrow at every vertex of the graph.
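The backward sweep described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the edge weights of the lecture's graph are not reproduced here, so `weight` is a hypothetical stand-in, and the convention that $u = +1$ increases the second index while $u = -1$ increases the first is assumed (it is consistent with only $u = +1$ being feasible at $(4, 1)$). The helper names `moves`, `backward_dp`, `optimal_path`, and `brute_force` are likewise made up for this sketch.

```python
from math import comb


def weight(x, x_next):
    """Hypothetical positive edge weight (NOT the weights of the lecture's graph)."""
    (r, c), (r2, c2) = x, x_next
    return 1 + (3 * r + 5 * c + 7 * r2 + 2 * c2) % 4


def moves(x, n):
    """Feasible (u, next state) pairs. Assumed convention: u = +1 increases
    the second index, u = -1 increases the first index."""
    r, c = x
    out = []
    if c < n:
        out.append((+1, (r, c + 1)))
    if r < n:
        out.append((-1, (r + 1, c)))
    return out


def backward_dp(n, weight):
    """Cost-to-go J[x] and optimal control policy[x], computed backwards in time."""
    J = {(n, n): 0}                      # cost-to-go at the final state is zero
    policy = {}
    for t in range(2 * n - 3, -1, -1):   # t = 2n-3, ..., 1, 0
        for r in range(1, n + 1):
            c = t + 2 - r                # states at time t satisfy r + c = t + 2
            if 1 <= c <= n:
                x = (r, c)
                # Minimize edge weight plus cost-to-go of the successor state.
                J[x], policy[x] = min(
                    (weight(x, x_next) + J[x_next], u) for u, x_next in moves(x, n)
                )
    return J, policy


def optimal_path(policy, n):
    """Follow the arrows (the feedback policy) from (1, 1) to (n, n)."""
    x, path = (1, 1), [(1, 1)]
    while x != (n, n):
        x = (x[0], x[1] + 1) if policy[x] == +1 else (x[0] + 1, x[1])
        path.append(x)
    return path


def brute_force(x, n, weight):
    """Minimum cost and number of paths from x, by enumerating every path."""
    if x == (n, n):
        return 0, 1
    best, count = None, 0
    for _, x_next in moves(x, n):
        cost, k = brute_force(x_next, n, weight)
        cost += weight(x, x_next)
        count += k
        if best is None or cost < best:
            best = cost
    return best, count


J, policy = backward_dp(4, weight)
print("optimal cost from A:", J[(1, 1)])
print("optimal path:", optimal_path(policy, 4))
print("cost-to-go values computed:", len(policy), "| paths enumerated:", comb(6, 3))
```

The dictionary `policy` plays the role of the arrows in the graph: it stores one control action per vertex, so the optimal path is recovered simply by following it from the initial state.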

[Figure: the complete graph from A to B, with the cost-to-go value and the optimal arrow at each vertex.]

From this, we may easily depict the optimal solution going from A to B by just following the arrows:

[Figure: the optimal path from A to B.]

This is the dynamic programming solution to the graph traveling problem. In general, we have
$$J^*(x(t), t) = \min_{u(t)} \left[ J^*(x(t+1), t+1) + W(x(t), x(t+1)) \right].$$
Hence, the cost-to-go function at time $t$ depends only on the cost-to-go function at time $t + 1$ and the optimal cost going from $x(t)$ to $x(t + 1)$. This is a simple, but fundamental, fact, which is called the principle of optimality. The principle of optimality states that if
$$u(i), u(i+1), \dots, u(j), \dots, u(k)$$
gives the optimal solution
$$x(i), x(i+1), \dots, x(j), \dots, x(k),$$

then the truncated control $u(j), \dots, u(k)$ gives the optimal solution from $x(j)$. It thus follows that it is not necessary to solve the whole optimal control problem at once; instead we may solve individual pieces going backwards from the final state and then patch these pieces together, as was done in the example.

Comparing dynamic programming with the approach of deriving all possible paths from A to B, we see that dynamic programming leads to the calculation of 15 numbers while there are 20 possible paths. In general, for a graph with $N \times N$ vertices, we have $N^2 - 1$ cost-to-go values to calculate compared to $(2(N-1))!/((N-1)!)^2$ paths. For example, for $N = 8$, we get 63 compared to 3432.

Dynamic programming automatically leads to a feedback solution: to each state (vertex) there is a dedicated control action (arrow). This means that the solution is robust: for instance, if an impulse disturbance moves the state from one location to another, the optimal control is still applied.

Dynamic programming is particularly efficient in multi-stage decision problems when there are few control choices at each stage. In the traveling problem, we have only two choices at each time step. If the dimensions of the control space and the state space are large, dynamic programming may be less appealing. The phrase "curse of dimensionality" was given to the problem of exponential growth of a hyper-volume as a function of dimensionality.

Dynamic programming may also be applied to discrete-time systems specified as difference equations. Next, however, we consider the continuous-time set-up.

2 Continuous-Time Systems

Consider the continuous-time control system
$$\dot{x}(t) = f(x(t), u(t), t), \qquad x(t_0) = x_0, \tag{1}$$
where $t \in [t_0, t_f]$, $x(t) \in \mathbb{R}^n$, and $u(t) \in \mathbb{R}$. Assume that $f$ is smooth in all its arguments and that $u$ is piecewise continuous. Define the cost function
$$J(u(\cdot)) = \int_{t_0}^{t_f} L(x(s), u(s), s)\, ds + \phi(x(t_f), t_f), \tag{2}$$
where the running cost $L$ and the terminal cost $\phi$ are smooth functions.
Both the initial time $t_0$ and the final time $t_f > t_0$ are assumed to be fixed. The optimal control problem is to minimize $J$ with respect to $u : [t_0, t_f] \to \mathbb{R}$ subject to (1). The optimal control is denoted $u^* : [t_0, t_f] \to \mathbb{R}$ and the corresponding trajectory $x^* : [t_0, t_f] \to \mathbb{R}^n$. The cost-to-go function is defined as the minimum cost to go from any state $x \in \mathbb{R}^n$ at time $t \in [t_0, t_f]$ to $x(t_f)$ and is given by
$$J^*(x, t) = \min_{u(\cdot)} \left[ \int_{t}^{t_f} L(x(s), u(s), s)\, ds + \phi(x(t_f), t_f) \right]. \tag{3}$$

The cost-to-go function at $t = t_0$ is hence the value of the cost function for the optimal control, i.e.,
$$J^*(x^*(t_0), t_0) = J(u^*(\cdot)).$$

The principle of optimality for continuous-time control systems reads as follows. Let $x^*$ and $u^*$ be the optimal trajectory and optimal control as defined above. Assume $\bar{x}_0 = x^*(\bar{t}_0)$ for some $\bar{t}_0 \in (t_0, t_f)$. Then, the optimal control for (1) with respect to (2), where $t_0$ is replaced by $\bar{t}_0$ and $x_0$ by $\bar{x}_0$, is given by the truncated control $\bar{u} : [\bar{t}_0, t_f] \to \mathbb{R}$ with $\bar{u}(t) = u^*(t)$ for all $t \in [\bar{t}_0, t_f]$. This means, similar to the discrete case, that the optimal control problem can be solved by going backwards from the final state: first solve the optimal control problem on some interval $[t_f - \Delta, t_f]$, then on $[t_f - 2\Delta, t_f]$, etc., and finally patch the solutions together to get the solution on $[t_0, t_f]$.

Hamilton-Jacobi-Bellman

Using the principle of optimality, we do a heuristic derivation of the Hamilton-Jacobi-Bellman equation. Consider an optimal trajectory going from $x^*(t_0)$ to $x^*(t_f)$, shown as a solid line in the following figure:

[Figure: optimal trajectory from $x^*(t_0)$ to $x^*(t_f)$ (solid line) and a trajectory deviating from it between $x' = x(t')$ and $x'' = x(t' + \Delta t)$ (dashed line).]

Consider a non-optimal trajectory deviating from the optimal one only between $x(t') = x'$ and $x(t' + \Delta t) = x''$, and from $x''$ following an optimal trajectory to the final state at $t = t_f$. The deviation is shown with a dashed line above. Denote the cost for starting in $x'$ at $t'$ and passing through $x''$ by $J'(x', t')$. Clearly,
$$J'(x', t') \ge J^*(x', t')$$
with equality only for $\Delta t = 0$, i.e., only if $u(t) = u^*(t)$ for all $t \in (t', t_f)$. If $\Delta t > 0$ is small, then
$$J'(x', t') \approx J^*(x'', t' + \Delta t) + L(x', u', t') \Delta t,$$
where we used the notation $u' = u(t')$. Hence, assuming that the following minima are attained, we get
$$\begin{aligned}
J^*(x', t') &= \min_{u'} \left[ J^*(x'', t' + \Delta t) + L(x', u', t') \Delta t \right] \\
&= \min_{u'} \left[ J^*(x' + f(x', u', t') \Delta t,\, t' + \Delta t) + L(x', u', t') \Delta t \right] \\
&= \min_{u'} \left[ J^*(x', t') + \frac{\partial J^*}{\partial x}(x', t') f(x', u', t') \Delta t + \frac{\partial J^*}{\partial t}(x', t') \Delta t + L(x', u', t') \Delta t \right],
\end{aligned}$$
where the last two equalities hold to first order in $\Delta t$.

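Note that the minima are taken pointwise, i.e., over $u' \in \mathbb{R}$. Since $J^*(x', t')$ and $\frac{\partial J^*}{\partial t}(x', t')$ do not depend on $u'$, they can be moved outside the minimization; cancelling $J^*(x', t')$, dividing by $\Delta t$, and letting $\Delta t \to 0$ yields the Hamilton-Jacobi-Bellman equation
$$-\frac{\partial J^*}{\partial t}(x, t) = \min_{u \in \mathbb{R}} \left[ \frac{\partial J^*}{\partial x}(x, t) f(x, u, t) + L(x, u, t) \right], \qquad J^*(x, t_f) = \phi(x, t_f).$$
This is a partial differential equation with boundary condition given by the terminal cost. Under certain conditions, the solution of this equation corresponds to the optimal cost. The optimal control is equal to
$$u^*(t) = \arg\min_{u \in \mathbb{R}} \left[ \frac{\partial J^*}{\partial x}(x, t) f(x, u, t) + L(x, u, t) \right].$$
Note that this pointwise minimization gives a feedback control law: given the function $J^*$, the control is derived from the current state $x(t)$.

We rephrase the Hamilton-Jacobi-Bellman equation in a slightly more general setting in the following theorem. Here the final time $t_f$ is a variable and $x(t_f)$ is constrained by the terminal condition $\psi(x(t_f), t_f) = 0$, where $\psi$ is a smooth function.

Theorem 1 (Hamilton-Jacobi-Bellman Equation). Assume the $C^1$ function $V$ satisfies the Hamilton-Jacobi-Bellman equation
$$-\frac{\partial V}{\partial t}(x, t) = \min_{u \in \mathbb{R}} \left[ \frac{\partial V}{\partial x}(x, t) f(x, u, t) + L(x, u, t) \right], \qquad \forall x, t,$$
$$V(x, t_f) = \phi(x, t_f), \qquad \forall x \in \{ z : \psi(z, t_f) = 0 \}.$$
Also, assume the minimum is attained for all $x$ and $t$ for some $\hat{u} : \mathbb{R}^n \times [t_0, t_f] \to \mathbb{R}$, which is piecewise $C^0$ in $t$. Finally, assume that the solution to
$$\dot{x}(t) = f(x(t), \hat{u}(x(t), t), t)$$
is unique for all initial states. Then, $V$ is the unique solution of the Hamilton-Jacobi-Bellman equation and it is equal to the optimal cost-to-go function, i.e.,
$$V(x, t) = J^*(x, t), \qquad \forall x, t.$$
Furthermore, the optimal control is given by
$$u^*(t) = \hat{u}(x(t), t) = \arg\min_{u \in \mathbb{R}} \left[ \frac{\partial V}{\partial x}(x(t), t) f(x(t), u, t) + L(x(t), u, t) \right].$$

Linear Quadratic Control

Consider the optimal control problem given by the linear time-invariant system
$$\dot{x}(t) = A x(t) + B u(t)$$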

and cost function
$$J(u(\cdot)) = \int_{t_0}^{t_f} \left( x(t)^T Q x(t) + \rho u(t)^2 \right) dt + x(t_f)^T Q_f x(t_f),$$
where $t_0$ and $t_f$ are fixed, $Q$ and $Q_f$ are positive semidefinite matrices, and $\rho > 0$. The Hamilton-Jacobi-Bellman equation is then equal to
$$-\frac{\partial V}{\partial t}(x, t) = \min_{u \in \mathbb{R}} \left[ \frac{\partial V}{\partial x}(x, t) (A x + B u) + x^T Q x + \rho u^2 \right]$$
with boundary condition
$$V(x, t_f) = x^T Q_f x.$$
Let us try a solution of the form $V(x, t) = x^T P(t) x$ with $P(t) = P(t)^T$. Then
$$\frac{\partial V}{\partial t} = x^T \dot{P}(t) x, \qquad \frac{\partial V}{\partial x} = 2 x^T P(t),$$
which gives
$$-x^T \dot{P}(t) x = \min_{u \in \mathbb{R}} \left[ 2 x^T P(t) A x + 2 x^T P(t) B u + x^T Q x + \rho u^2 \right].$$
The minimum is attained for $u = -\rho^{-1} B^T P(t) x$. Substituting this into the Hamilton-Jacobi-Bellman equation gives
$$-x^T \dot{P}(t) x = 2 x^T P(t) A x - 2 \rho^{-1} x^T P(t) B B^T P(t) x + x^T Q x + x^T P(t) B \rho^{-1} B^T P(t) x.$$
Since $2 x^T P(t) A x = x^T (P(t) A + A^T P(t)) x$ and the identity must hold for all $x$, $P(t)$ must satisfy the continuous-time Riccati equation
$$\dot{P}(t) = -P(t) A - A^T P(t) + P(t) B \rho^{-1} B^T P(t) - Q, \qquad P(t_f) = Q_f.$$
Solving this matrix differential equation thus gives the optimal linear quadratic control as
$$u^*(t) = -\rho^{-1} B^T P(t) x(t),$$
which is a time-varying linear state-feedback control law. Note that it follows from Theorem 1 that this control is unique, so there exists no nonlinear controller that performs equally well or better.

Background

The term dynamic programming was coined by Bellman [1]. There are several good textbooks on optimal control and dynamic programming, e.g., [2-5]. The example in Section 1 is from [3].
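As a concrete illustration of the Riccati equation in the linear quadratic section above, the following sketch integrates the scalar case backwards in time from the terminal condition. It is a minimal example under stated assumptions, not part of the original notes: the state is scalar, the control weight is named `rho`, the parameter values are chosen only because they admit a closed-form check, and the function names and the fixed-step Runge-Kutta scheme are choices made for this sketch. For $a = 0$, $b = 1$, $q = 1$, $\rho = 1$, $q_f = 0$ the equation reduces to $\dot{p} = p^2 - 1$ with $p(t_f) = 0$, whose solution is $p(t) = \tanh(t_f - t)$.

```python
import math


def riccati_rhs(p, a, b, q, rho):
    """Scalar Riccati right-hand side: dp/dt = -2*a*p + (b*p)**2 / rho - q."""
    return -2.0 * a * p + (b * p) ** 2 / rho - q


def solve_riccati(a, b, q, rho, q_f, t0, t_f, steps=1000):
    """Integrate the scalar Riccati equation backwards from p(t_f) = q_f.

    Uses classical fourth-order Runge-Kutta with step -h (an assumed choice).
    Returns (ts, ps) ordered from t0 to t_f.
    """
    h = (t_f - t0) / steps
    p = q_f
    ps = [p]
    for _ in range(steps):
        k1 = riccati_rhs(p, a, b, q, rho)
        k2 = riccati_rhs(p - 0.5 * h * k1, a, b, q, rho)
        k3 = riccati_rhs(p - 0.5 * h * k2, a, b, q, rho)
        k4 = riccati_rhs(p - h * k3, a, b, q, rho)
        p -= h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0   # one step backwards in time
        ps.append(p)
    ps.reverse()                                          # reorder from t0 to t_f
    ts = [t0 + i * h for i in range(steps + 1)]
    return ts, ps


def feedback_gain(p, b, rho):
    """Gain k(t) in the state feedback u(t) = -k(t) * x(t), with k = b * p / rho."""
    return b * p / rho


ts, ps = solve_riccati(a=0.0, b=1.0, q=1.0, rho=1.0, q_f=0.0, t0=0.0, t_f=2.0)
print("numerical p(t0):", ps[0])
print("closed form tanh(t_f - t0):", math.tanh(2.0))
print("feedback gain at t0:", feedback_gain(ps[0], 1.0, 1.0))
```

The gain is time-varying even though the system is time-invariant, because the horizon is finite: it decays to zero as $t \to t_f$ when there is no terminal cost.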

References

1. R. Bellman. Dynamic Programming. Princeton University Press, Princeton, NJ.
2. D. P. Bertsekas. Dynamic Programming and Optimal Control: Volumes I-II. Athena Scientific, Belmont, MA.
3. A. E. Bryson and Y.-C. Ho. Applied Optimal Control. Hemisphere Publishing Corporation.
4. G. Leitmann. An Introduction to Optimal Control. McGraw-Hill, New York, NY.
5. L. C. Young. Optimal Control Theory. Chelsea.
