Dynamic Programming (DP) Massimo Paolucci University of Genova

DP cannot be applied to every kind of problem. In particular, it is a solution method for problems defined over stages: for each stage a subproblem is defined, and the overall solution is obtained recursively. DP is based on the so-called Optimality Principle, which allows one to reduce the solution of a given problem to the solutions of a series of subproblems.

An example
A company has three machineries and 5M available to expand them.
C_i: possible investment on machinery i
R_i: corresponding profit

Possibility of investment   Machinery 1 (C1, R1)   Machinery 2 (C2, R2)   Machinery 3 (C3, R3)
1                           0, 0                   0, 0                   0, 0
2                           1, 5                   2, 8                   1, 3
3                           2, 6                   3, 9                   -, -
4                           -, -                   4, 12                  -, -

Target: finding the most rewarding investment.

In this simple example one might enumerate all possible alternatives, but in general explicit enumeration is cumbersome and inefficient: for each alternative one has to solve the whole problem; unfeasible alternatives are not recognized a priori; at each step, the information obtained during the computation of previous alternatives is not exploited.

In this example, the approach based on Dynamic Programming can be introduced via a graph. The model:
stage: machinery
state x_i: money allocated to the machineries from stage 0 up to stage i (0 <= x_i <= 5)
arc (x_i, x_{i+1}): most rewarding allocation of the amount x_{i+1} - x_i to machinery i+1
weight of each arc: profit associated with the corresponding investment

[Figure: stage graph with the single node x_0 = 0 at stage 0, states 0, 1, 2, 3, 4, 5 at stages 1 and 2, and the node x_3 = 5 at stage 3; each arc (x_i, x_{i+1}) is labelled with the profit of allocating x_{i+1} - x_i to machinery i+1.]

The problem consists in finding the path 0-5 (from stage 0 to stage 3) with the largest weight. Each path represents an admissible solution. Some arcs represent overspending situations.

Backward phase
The graph is travelled backward, starting from stage 3. To each node one associates f_i(x_i) := length of the maximal path between x_i and x_3.

stage 3: f_3(5) = 0
stage 2: f_2(x_2) = R(x_2, x_3) = length of the maximal path between x_2 and x_3
f_2(0) = f_2(1) = f_2(2) = f_2(3) = f_2(4) = 3
f_2(5) = 0

stage 1: f_1(x_1) = length of the maximal path between x_1 and x_3
f_1(x_1) = max_{x_2} [ R(x_1, x_2) + f_2(x_2) ]
where f_2(x_2) is the length of the maximal path between x_2 and x_3, and R(x_1, x_2) is the length of the arc between x_1 and x_2.

stage 1:
f_1(0) = max[0+3, 0+3, 8+3, 9+3, 12+3, 12+0] = 15
f_1(1) = max[0+3, 0+3, 8+3, 9+3, 12+0] = 12
f_1(2) = max[0+3, 0+3, 8+3, 9+0] = 11
f_1(3) = max[0+3, 0+3, 8+0] = 8
f_1(4) = max[0+3, 0+0] = 3
f_1(5) = max[0+0] = 0

stage 0: f_0(0) = length of the maximal path between x_0 = 0 and x_3
f_0(0) = max_{x_1} [ R(0, x_1) + f_1(x_1) ] = max[0+15, 5+12, 6+11, 6+8, 6+3, 6+0] = 17

Forward phase
There exist various alternative maximal paths, all of total profit 17, which can be found by travelling the solution forward:
path 1) 0 - 1 - 4 - 5    possibilities of investment 2, 3, 2
path 2) 0 - 1 - 5 - 5    possibilities of investment 2, 4, 1
path 3) 0 - 2 - 4 - 5    possibilities of investment 3, 2, 2
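The two phases can be made concrete with a short sketch. The following Python snippet is not part of the original slides: it is a minimal illustration under the same assumptions (an arc may "overspend", i.e. allocate more money than the chosen option requires), with the option tables restating the data of the example and all names (OPTIONS, arc_profit, backward, forward) chosen for illustration.

```python
BUDGET = 5
# (cost, profit) options of the example, one list per machinery
OPTIONS = [
    [(0, 0), (1, 5), (2, 6)],           # machinery 1
    [(0, 0), (2, 8), (3, 9), (4, 12)],  # machinery 2
    [(0, 0), (1, 3)],                   # machinery 3
]

def arc_profit(stage, spend):
    """Best profit obtainable on machinery stage+1 with budget `spend`
    (overspending allowed: the best affordable option is taken)."""
    return max(r for c, r in OPTIONS[stage] if c <= spend)

def backward():
    """Backward phase: f_i(x_i) = maximal profit from stage i to stage 3."""
    f = [dict() for _ in range(4)]
    f[3] = {BUDGET: 0}                        # final state: all money allocated
    for i in (2, 1, 0):
        for x in (range(BUDGET + 1) if i > 0 else [0]):
            f[i][x] = max(arc_profit(i, y - x) + f[i + 1][y]
                          for y in f[i + 1] if y >= x)
    return f

def forward(f):
    """Forward phase: recover one optimal sequence of states x_0, ..., x_3."""
    path = [0]
    for i in range(3):
        x = path[-1]
        path.append(max((y for y in f[i + 1] if y >= x),
                        key=lambda y: arc_profit(i, y - x) + f[i + 1][y]))
    return path

f = backward()
print(f[0][0])     # 17, the maximal total profit
print(forward(f))  # one optimal path, e.g. [0, 1, 4, 5]
```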

We have seen a deterministic example. The DP approach can be applied, with suitable modifications, also to stochastic contexts.

The backward equations of DP (deterministic context)
Let us consider the following general case: a chain of states x_0 -> x_1 -> x_2 -> ... -> x_{N-1} -> x_N, traversed over N stages (i = 0, ..., N-1).
For each stage i one has x_i in {1, ..., q}, i.e., there are q possible different states; in general, the state is a vector in R^n.
A cost T_i(x_i, x_{i+1}) is associated with each arc (e.g., the cost incurred when the arc is travelled).

Minimum total cost: T* = min_{x_1,...,x_{N-1}} Σ_{i=0}^{N-1} T_i(x_i, x_{i+1})
Number of possible paths: q^{N-1}
E.g., with q = 10 and N = 21 there are 10^20 paths. If a path is computed in 10^-6 sec, then finding the solution by enumeration requires 10^14 sec, about 3·10^6 years!!!

DP solves the problem with a backward procedure, in which a subproblem is solved at each stage. To each node one associates the optimal cost required to reach the final stage starting from it.
Stage N-1: T*_{N-1}(x_{N-1}) = T_{N-1}(x_{N-1}, x_N)
For instance, in the case of the route of an airplane, such a cost is the time required to reach x_N from x_{N-1}.

Stage N-2: T*_{N-2}(x_{N-2}) = min_{x_{N-1}} [ T_{N-2}(x_{N-2}, x_{N-1}) + T*_{N-1}(x_{N-1}) ]
Here T*_{N-2}(x_{N-2}) is the minimum time to reach x_N starting from x_{N-2}, T_{N-2}(x_{N-2}, x_{N-1}) is the minimum time to reach x_{N-1} starting from x_{N-2}, and T*_{N-1}(x_{N-1}) is the minimum time to reach x_N starting from x_{N-1}.

In general (backward equations of DP):
T*_i(x_i) = min_{x_{i+1}} [ T_i(x_i, x_{i+1}) + T*_{i+1}(x_{i+1}) ],   i = N-1, N-2, ..., 0
T*_N(x_N) = 0
T*_i(x_i) is called the Cost-To-Go.
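As a minimal sketch (not from the slides), the backward equations can be implemented as follows in Python; costs[i] is assumed to map each existing arc (x, y) between stage i and stage i+1 to its cost T_i(x, y), and all names are illustrative.

```python
def backward_dp(costs, num_stages, states, final_states):
    """Return the Cost-To-Go tables T*_i(x_i), i = N-1, ..., 0."""
    # T*_N(x_N) = 0 for every admissible final state
    cost_to_go = {num_stages: {x: 0.0 for x in final_states}}
    for i in range(num_stages - 1, -1, -1):
        cost_to_go[i] = {}
        for x in states[i]:
            # T*_i(x_i) = min_{x_{i+1}} [ T_i(x_i, x_{i+1}) + T*_{i+1}(x_{i+1}) ]
            candidates = [costs[i][(x, y)] + cost_to_go[i + 1][y]
                          for y in cost_to_go[i + 1] if (x, y) in costs[i]]
            if candidates:
                cost_to_go[i][x] = min(candidates)
    return cost_to_go
```

With q states per stage, each stage requires at most q sums per state, i.e. the q^2 operations per stage discussed below.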

Optimality Principle: whatever the state at a certain stage is, one has to proceed by following the optimal trajectory:
T* = min_{x_1} min_{x_2} ... min_{x_{N-1}} [ T_0(x_0, x_1) + T_1(x_1, x_2) + ... + T_{N-1}(x_{N-1}, x_N) ]
At each stage one has to make q sums for each of the q states, i.e. q·q = q^2 sums per stage. Hence, in total one has q^2·N operations.
Example: 10^2 · 21 · 10^-6 sec = 2.1·10^-3 sec = 2.1 msec. Compare with 3·10^6 years!!!

Bellman's theorem
It proves the correctness of the backward equations of DP.
T* = min_{x_1,...,x_{N-1}} Σ_{i=0}^{N-1} T_i(x_i, x_{i+1})    (x_0: departure, x_N: arrival)
T* = min_{x_1,...,x_{N-1}} [ T_0(x_0, x_1) + Σ_{i=1}^{N-1} T_i(x_i, x_{i+1}) ]
x_1 influences all the terms, but x_2, ..., x_{N-1} influence only the terms inside the sum.

Hence
T* = min_{x_1} [ T_0(x_0, x_1) + min_{x_2,...,x_{N-1}} Σ_{i=1}^{N-1} T_i(x_i, x_{i+1}) ]
The inner minimum is the Cost-To-Go T*_1(x_1), i.e., the optimal cost from the second stage till the last one.

The equation can be rewritten as
T* = min_{x_1} [ T_0(x_0, x_1) + T*_1(x_1) ]    (1st equation of DP)
By writing T*_1(x_1) explicitly:
T*_1(x_1) = min_{x_2,...,x_{N-1}} [ T_1(x_1, x_2) + Σ_{i=2}^{N-1} T_i(x_i, x_{i+1}) ]
          = min_{x_2} [ T_1(x_1, x_2) + min_{x_3,...,x_{N-1}} Σ_{i=2}^{N-1} T_i(x_i, x_{i+1}) ]
          = min_{x_2} [ T_1(x_1, x_2) + T*_2(x_2) ]    (2nd equation of DP)
... and so on.

The curse of dimensionality
In general the state is not a scalar, but an n-dimensional vector. Hence, at each stage of DP one has to keep the information regarding all possible combinations of values.

For instance, when the state has dimension 2 (i.e., n = 2) and each component of x_i can take q possible values, then one has q·q = q^2 states for each stage, and over the N-1 stages the number of possible paths is q^2 · q^2 · ... · q^2 = q^{2(N-1)}; in general, for dimension n, it is q^{n(N-1)}.
[Figure: grid of states (x_1, x_2) at stage 1, ..., stage N.]
Number of operations of DP in dimension n: q^n · q^n = q^{2n} sums per stage, i.e. q^{2n}·N operations over N stages.
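The growth of the two operation counts can be checked with a few lines of purely illustrative arithmetic (not from the slides):

```python
def enumeration_paths(q, N, n):
    return (q ** n) ** (N - 1)   # q^{n(N-1)} possible paths to enumerate

def dp_operations(q, N, n):
    return (q ** (2 * n)) * N    # q^{2n} sums per stage, over N stages

for n in (1, 2, 3):
    print(n, enumeration_paths(10, 21, n), dp_operations(10, 21, n))
# For n = 1 this reproduces the figures above: 10^20 paths versus
# 10^2 * 21 = 2100 sums for DP.
```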

Dynamic Programming: summing up
To apply DP, it is required that the problem can be divided into stages; for every stage, a decision policy has to be determined. A number of states is associated with each stage. The effect of the decision at each stage consists in transforming the present state into a new state, associated with the next stage.

Given the current state, the decisions taken at the previous stages do not influence the decisions at the next stages. The backward solution process determines the optimal decision policy for each state of the previous stage; the optimal decision policy is thus obtained recursively. The optimal sequence of decisions is then determined by travelling the solution forward.