3. The Dynamic Programming Algorithm (cont'd)
Last lecture we introduced the DPA. In this lecture, we first apply the DPA to the chess match example, and then show how to deal with problems that do not match the standard form outlined in Section 2.1.

Example 1: Chess match strategy (revisited)

Consider a two-game chess match with an opponent. Our objective is to develop a strategy that maximizes the chance of winning the match. Each game can have one of two outcomes:

1) Win/Lose: 1 point for the winner, 0 for the loser;
2) Draw: 0.5 points for each player.

In addition, if at the end of two games the score is equal, the players keep playing new games until one wins, and thereby wins the match (also known as sudden death).

There are two possible playing styles for our player: timid and bold. When playing timid, our player draws with probability p_d and loses with probability (1 - p_d). When playing bold, our player wins with probability p_w and loses with probability (1 - p_w). We also assume that p_d > p_w, a necessary condition for this problem to make sense. We want to find a control policy that maximizes the probability of winning the match. We will solve this using the DPA (replacing min with max).

The state x_k is the difference between our player's score and the opponent's score at the end of game k. That is, x_0 = 0, x_1 in S_1 = {-1, 0, 1}, x_2 in S_2 = {-2, -1, 0, 1, 2}. The control inputs u_k are the two playing styles, that is, u_k in U = {timid, bold}.

Dynamics: model as a finite state system using transition probabilities (see Section 1.3): x_{k+1} = w_k, k = 0, 1, where

  Pr(w_k = x_k     | u_k = timid) = p_d
  Pr(w_k = x_k - 1 | u_k = timid) = 1 - p_d
  Pr(w_k = x_k + 1 | u_k = bold)  = p_w
  Pr(w_k = x_k - 1 | u_k = bold)  = 1 - p_w

with p_d > p_w.

Date compiled: October 6,
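The transition model above can be encoded directly as a small lookup; a minimal sketch (the function name and the dictionary representation of the distribution are mine, not from the notes):

```python
def transition(x, u, p_d, p_w):
    """Distribution of w_k (next score difference) given x_k = x and style u.

    Returns a dict {next_state: probability}, mirroring the four
    transition probabilities of the chess-match model.
    """
    if u == "timid":
        return {x: p_d, x - 1: 1.0 - p_d}       # draw keeps the gap, loss drops it
    elif u == "bold":
        return {x + 1: p_w, x - 1: 1.0 - p_w}   # win raises the gap, loss drops it
    raise ValueError("u must be 'timid' or 'bold'")
```

Each returned distribution sums to one, which makes for a quick sanity check of the model.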
Cost: we want to maximize the probability of winning. This is equivalent to the standard form

  E[ g_2(x_2) + sum_{k=0}^{1} g_k(x_k, u_k, w_k) ]

where

  g_k(x_k, u_k, w_k) = 0,  k in {0, 1}

  g_2(x_2) = 1    if x_2 > 0
             p_w  if x_2 = 0
             0    if x_2 < 0

To see that the expected cost is equal to the probability of winning P_win, let q_+ := Pr(x_2 > 0), q_0 := Pr(x_2 = 0), q_- := Pr(x_2 < 0). The probability of winning is P_win = q_+ + q_0 p_w, and the expected value of the cost is

  E[g_2(x_2)] = q_+ * 1 + q_0 * p_w + q_- * 0 = q_+ + q_0 p_w.

Now apply the DPA:

Initialization:

  J_2(x) = 1    if x > 0
           p_w  if x = 0
           0    if x < 0

Recursion:

  J_k(x) = max_{u in U} E_{w_k | x_k = x, u_k = u}[ g_k(x_k, u_k, w_k) + J_{k+1}(x_{k+1}) ],  x in S_k, k in {0, 1}
         = max_{u in U} E_{w_k | x_k = x, u_k = u}[ J_{k+1}(w_k) ]
         = max{ p_d J_{k+1}(x) + (1 - p_d) J_{k+1}(x - 1),  p_w J_{k+1}(x + 1) + (1 - p_w) J_{k+1}(x - 1) }

Henceforth the first entry of the maximum will denote the cost associated with timid play and the second with bold play.

k = 1:

  J_1(x) = max{ p_d J_2(x) + (1 - p_d) J_2(x - 1),  p_w J_2(x + 1) + (1 - p_w) J_2(x - 1) }

x_1 = 1:

  J_1(1) = max{ p_d + (1 - p_d) p_w,  p_w + (1 - p_w) p_w }

Comparing the two entries yields

  (p_d + (1 - p_d) p_w) - (p_w + (1 - p_w) p_w) = (p_d - p_w)(1 - p_w) > 0  (since p_d > p_w).

Therefore µ_1(1) = timid and J_1(1) = p_d + (1 - p_d) p_w.
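The hand computation of the remaining cases can be cross-checked numerically. A sketch of the complete backward pass (function and variable names are mine; since p_d > p_w makes all comparisons strict here, the tie-breaking direction never matters):

```python
def solve_match(p_d, p_w):
    """Backward DPA for the two-game chess match, maximizing win probability."""
    # Initialization: J_2(x) = 1 if x > 0, p_w if x = 0 (sudden death), 0 if x < 0.
    J = {x: 1.0 if x > 0 else (p_w if x == 0 else 0.0) for x in range(-2, 3)}
    policy = {}
    for k in (1, 0):
        J_prev = {}
        for x in range(-k, k + 1):  # reachable states S_k
            timid = p_d * J[x] + (1.0 - p_d) * J[x - 1]
            bold = p_w * J[x + 1] + (1.0 - p_w) * J[x - 1]
            policy[(k, x)] = "timid" if timid > bold else "bold"
            J_prev[x] = max(timid, bold)
        J = J_prev
    return J[0], policy
```

With, e.g., p_d = 0.9 and p_w = 0.45 this reproduces timid play at (k, x) = (1, 1), bold play everywhere else, and J_0(0) = p_d p_w + (1 - p_d) p_w^2 + (1 - p_w) p_w^2.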
x_1 = 0:

  J_1(0) = max{ p_d p_w + (1 - p_d) * 0,  p_w + (1 - p_w) * 0 } = max{ p_d p_w, p_w }

Therefore µ_1(0) = bold and J_1(0) = p_w.

x_1 = -1:

  J_1(-1) = max{ p_d * 0 + (1 - p_d) * 0,  p_w p_w + (1 - p_w) * 0 } = max{ 0, p_w^2 }

Therefore µ_1(-1) = bold and J_1(-1) = p_w^2.

k = 0:

  J_0(x) = max{ p_d J_1(x) + (1 - p_d) J_1(x - 1),  p_w J_1(x + 1) + (1 - p_w) J_1(x - 1) }

x_0 = 0:

  J_0(0) = max{ p_d J_1(0) + (1 - p_d) J_1(-1),  p_w J_1(1) + (1 - p_w) J_1(-1) }
         = max{ p_d p_w + (1 - p_d) p_w^2,  p_w (p_d + (1 - p_d) p_w) + (1 - p_w) p_w^2 }
         = max{ p_d p_w + (1 - p_d) p_w^2,  p_d p_w + (1 - p_d) p_w^2 + (1 - p_w) p_w^2 }

Therefore µ_0(0) = bold and J_0(0) = p_d p_w + (1 - p_d) p_w^2 + (1 - p_w) p_w^2.

Optimal match strategy: play timid if and only if ahead in the score.

3.1 Converting non-standard problems to the standard form

At first glance, the class of systems in our standard problem formulation of Section 1.1 may seem limiting, but it is in fact general enough to handle other types of problems via state augmentation. We include some examples below.

3.1.1 Time Lags

Assume the dynamics have the following form:

  x_{k+1} = f_k(x_k, x_{k-1}, u_k, u_{k-1}, w_k)

Let y_k := x_{k-1}, s_k := u_{k-1}, and define the augmented state vector x̃_k := (x_k, y_k, s_k). The dynamics of the augmented state then become

  x̃_{k+1} = (x_{k+1}, y_{k+1}, s_{k+1}) = (f_k(x_k, y_k, u_k, s_k, w_k), x_k, u_k) =: f̃_k(x̃_k, u_k, w_k)

which now matches the standard form. Note that this procedure works for an arbitrary number of time lags.
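The time-lag augmentation can be expressed as a generic wrapper around any lagged dynamics; a sketch (the wrapper name and the tuple encoding of x̃_k are my choices, not from the notes):

```python
def augment_time_lag(f_k):
    """Wrap dynamics x_{k+1} = f_k(x_k, x_{k-1}, u_k, u_{k-1}, w_k) into
    standard form on the augmented state x~_k = (x_k, y_k, s_k),
    where y_k = x_{k-1} and s_k = u_{k-1}."""
    def f_aug(x_aug, u, w):
        x, y, s = x_aug                       # (x_k, x_{k-1}, u_{k-1})
        return (f_k(x, y, u, s, w), x, u)     # (x_{k+1}, y_{k+1}, s_{k+1})
    return f_aug
```

For example, the scalar lagged dynamics x_{k+1} = x_k + 0.5 x_{k-1} + u_k + w_k become a standard-form map on triples, with the wrapper shifting the current state and input into the lag slots.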
3.1.2 Correlated Disturbances

Disturbances w_k that are correlated across time (colored noise) can commonly be modeled as the output of a linear system driven by independent random variables as follows:

  w_k = C_k y_{k+1}
  y_{k+1} = A_k y_k + ξ_k

where A_k, C_k are given and ξ_k, k = 0, ..., N - 1, are independent random variables. Let the augmented state vector x̃_k := (x_k, y_k). Note that now y_k must be observed at time k, which can be done using a state estimator. The dynamics of the augmented state then become

  x̃_{k+1} = (x_{k+1}, y_{k+1}) = (f_k(x_k, u_k, C_k(A_k y_k + ξ_k)), A_k y_k + ξ_k) =: f̃_k(x̃_k, u_k, ξ_k)

which now matches the standard form.

3.1.3 Forecasts

Consider the case where at each time period we have access to a forecast that reveals the probability distribution of w_k, and possibly of future disturbances. For example, assume that w_k is independent of x_k and u_k. At the beginning of each period k, we receive a prediction y_k (forecast) that w_k will attain a probability distribution out of a given finite collection of distributions

  { p_{w_k | y_k}(· | 1), p_{w_k | y_k}(· | 2), ..., p_{w_k | y_k}(· | m) }.

In particular, we receive a forecast that y_k = i, and thus p_{w_k | y_k}(· | i) is used to generate w_k. Furthermore, the forecast itself has a given a-priori probability distribution, namely y_{k+1} = ξ_k, where the ξ_k are independent random variables taking value i in {1, 2, ..., m} with probability p_{ξ_k}(i).

Let the augmented state vector x̃_k := (x_k, y_k). Since the forecast y_k is known at time k, we still have perfect state information. We define our new disturbance as w̃_k := (w_k, ξ_k), with probability distribution

  p(w̃_k | x̃_k, u_k) = p(w_k, ξ_k | x_k, y_k, u_k)
                     = p(w_k | x_k, y_k, u_k, ξ_k) p(ξ_k | x_k, y_k, u_k)
                     = p(w_k | y_k) p(ξ_k).

Note that w_k depends only on x̃_k (in particular on y_k), and ξ_k does not depend on anything. The dynamics therefore become

  x̃_{k+1} = (x_{k+1}, y_{k+1}) = (f_k(x_k, u_k, w_k), ξ_k) =: f̃_k(x̃_k, u_k, w̃_k)

which now matches the standard form. The associated DPA becomes:
Initialization:

  J_N(x̃) = J_N(x, y) = g_N(x),  x in S_N, y in {1, ..., m}

Recursion:

  J_k(x̃) = J_k(x, y)
          = min_{u in U_k(x)} E_{w̃_k | x̃_k = (x, y), u_k = u}[ g_k(x_k, u_k, w_k) + J_{k+1}(f_k(x_k, u_k, w_k), ξ_k) ]
          = min_{u in U_k(x)} E_{w_k | y_k = y} E_{ξ_k}[ g_k(x, u, w_k) + J_{k+1}(f_k(x, u, w_k), ξ_k) ]      (1)
          = min_{u in U_k(x)} E_{w_k | y_k = y}[ g_k(x, u, w_k) + E_{ξ_k}[ J_{k+1}(f_k(x, u, w_k), ξ_k) ] ]   (2)
          = min_{u in U_k(x)} E_{w_k | y_k = y}[ g_k(x, u, w_k) + sum_{i=1}^{m} p_{ξ_k}(i) J_{k+1}(f_k(x, u, w_k), i) ],

  x in S_k, y in {1, ..., m}, k = N - 1, ..., 0

Steps:

1. Using p(w̃_k | x̃_k, u_k) = p(w_k | y_k) p(ξ_k).
2. Since g_k(x_k, u_k, w_k) is not a function of the random variable ξ_k.

3.1.4 The Curse of Dimensionality

The above-mentioned conversions come at the price of increased computational complexity: the augmented state space increases the computational burden exponentially. This is sometimes known as the curse of dimensionality.
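One backward step of the forecast-augmented recursion can be sketched directly from its final form, J_k(x, y) = min over u of the expectation over w given y of the stage cost plus the ξ-averaged cost-to-go (all names are mine; small finite sets stand in for U_k, the disturbance support, and the forecast set):

```python
def forecast_dp_step(J_next, U, W, g, f, p_w_given_y, p_xi):
    """One backward step of the forecast DPA:
    J_k(x, y) = min_u sum_w p(w|y) [ g(x,u,w) + sum_i p_xi[i] J_next(f(x,u,w), i) ].
    """
    def J_k(x, y):
        best = float("inf")
        for u in U:
            total = 0.0
            for w in W:
                # Inner expectation over the next forecast xi_k.
                cont = sum(p_xi[i] * J_next(f(x, u, w), i) for i in p_xi)
                total += p_w_given_y(w, y) * (g(x, u, w) + cont)
            best = min(best, total)
        return best
    return J_k
```

As a toy check: with zero stage cost, dynamics f(x, u, w) = x + u + w, and J_next(x, i) = x independent of the forecast, the step reduces to min_u (x + u + E[w | y]).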