Markov Decision Processes II

Daisuke Oyama
Topics in Economic Theory
December 17, 2014

Review
Finite state space $S$, finite action space $A$.
The value of a policy $\sigma \in A^S$:
\[ v_\sigma = \sum_{t=0}^{\infty} \beta^t Q_\sigma^t r_\sigma, \]
which satisfies $v_\sigma = r_\sigma + \beta Q_\sigma v_\sigma$.
The value function $v^* \in \mathbb{R}^S$:
\[ v^*(s) = \sup_{\pi \in \Pi^M} v_\pi(s), \]
where $\Pi^M$ is the set of Markov plans.
In the end, for a $v^*$-greedy policy $\sigma^*$ we have $v^* = v_{\sigma^*}$.

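Review: Operators
$T_\sigma \colon \mathbb{R}^S \to \mathbb{R}^S$, $\sigma \in A^S$:
\[ T_\sigma v = r_\sigma + \beta Q_\sigma v. \]
$v_\sigma$ is the unique fixed point of $T_\sigma$.
$T \colon \mathbb{R}^S \to \mathbb{R}^S$:
\[ T v = \max_{\sigma \in A^S} \left( r_\sigma + \beta Q_\sigma v \right). \]
By definition, $T_\sigma v \le T v$ for any $\sigma$ and $v$.
$\sigma$ is $v$-greedy if $T_\sigma v = T v$.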
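The operators above can be sketched in a few lines of plain Python. The two-state, two-action MDP below (`BETA`, `r`, `Q`) and the helper names are assumptions made up for illustration; they do not come from the slides. Because component $s$ of $r_\sigma + \beta Q_\sigma v$ depends on $\sigma$ only through $\sigma(s)$, the maximum over $\sigma \in A^S$ can be taken state by state.

```python
# Hypothetical toy MDP (an assumption for illustration, not from the slides).
BETA = 0.9
r = [[1.0, 0.0], [0.0, 2.0]]        # r[s][a]: reward in state s under action a
Q = [[[1.0, 0.0], [0.0, 1.0]],      # Q[s][a][t]: probability of moving s -> t
     [[1.0, 0.0], [0.0, 1.0]]]


def q_value(v, s, a):
    """Component s of r_sigma + beta * Q_sigma v when sigma(s) = a."""
    return r[s][a] + BETA * sum(p * v[t] for t, p in enumerate(Q[s][a]))


def T_sigma(sigma, v):
    """Policy operator: T_sigma v = r_sigma + beta * Q_sigma v."""
    return [q_value(v, s, sigma[s]) for s in range(len(v))]


def T(v):
    """Bellman operator: statewise maximum over actions."""
    return [max(q_value(v, s, a) for a in range(len(r[s])))
            for s in range(len(v))]


def greedy(v):
    """A v-greedy policy (ties broken toward the lowest action index)."""
    return [max(range(len(r[s])), key=lambda a: q_value(v, s, a))
            for s in range(len(v))]


v = [0.0, 0.0]
sigma = greedy(v)
```

With these definitions, `T_sigma(sigma, v)` coincides with `T(v)` for the greedy `sigma`, while any other policy gives a componentwise smaller or equal vector.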

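$T_\sigma$ and $T$ are monotone.
$T_\sigma (v + c\mathbf{1}) = T_\sigma v + \beta c \mathbf{1}$ and $T (v + c\mathbf{1}) = T v + \beta c \mathbf{1}$.
$T_\sigma$ and $T$ are $\beta$-contractions.
The unique fixed point of $T_\sigma$ is $v_\sigma$.
The unique fixed point of $T$ is $v^*$.
A $v^*$-greedy policy $\sigma^*$ (which exists) is an optimal policy. ($T_{\sigma^*} v^* = T v^* = v^* \Rightarrow v^* = v_{\sigma^*}$.)
For any $v$, $T_\sigma^n v \to v_\sigma$ and $T^n v \to v^*$ as $n \to \infty$.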
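A small numerical check of the contraction and convergence properties, as a sketch on a made-up MDP. The data `BETA`, `r`, `Q` and the names `T`, `sup_norm` are assumptions for illustration only. In this toy example, staying in state 1 forever is optimal, so $v^* = (18, 20)$.

```python
# Hypothetical toy MDP (an assumption for illustration, not from the slides).
BETA = 0.9
r = [[1.0, 0.0], [0.0, 2.0]]        # r[s][a]
Q = [[[1.0, 0.0], [0.0, 1.0]],      # Q[s][a][t]
     [[1.0, 0.0], [0.0, 1.0]]]


def T(v):
    """Bellman operator: statewise maximum of r + beta * Q v over actions."""
    return [max(r[s][a] + BETA * sum(p * v[t] for t, p in enumerate(Q[s][a]))
                for a in range(len(r[s])))
            for s in range(len(v))]


def sup_norm(x, y):
    """Sup-norm distance between two vectors."""
    return max(abs(a - b) for a, b in zip(x, y))


v, w = [0.0, 0.0], [5.0, -3.0]

# beta-contraction: ||Tv - Tw|| <= beta * ||v - w|| (small slack for rounding).
contraction_ok = sup_norm(T(v), T(w)) <= BETA * sup_norm(v, w) + 1e-12

# Repeated application of T approaches the fixed point v*.
for _ in range(500):
    v = T(v)
```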

Policy Iteration
1. Set $n = 0$. Choose any $\sigma_0$; or choose any $v_0$ and let $\sigma_0$ be a $v_0$-greedy policy.
2. [Policy evaluation] Solve $(I - \beta Q_{\sigma_n}) x = r_{\sigma_n}$ for $x$ and let $v_{n+1} = x$.
3. [Policy improvement] Compute a $v_{n+1}$-greedy policy $\sigma_{n+1}$, i.e., a $\sigma_{n+1}$ such that $T_{\sigma_{n+1}} v_{n+1} = T v_{n+1}$.
4. If $\sigma_{n+1} = \sigma_n$, then return $\hat\sigma = \sigma_n$ and $\hat v = v_{n+1}$. Otherwise, let $n = n + 1$ and go to Step 2.
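The steps above can be sketched as follows in plain Python. The toy MDP and all function names are assumptions for illustration; the linear system in the evaluation step is solved with a small Gauss-Jordan routine to keep the example dependency-free, where a library solver would normally be used.

```python
# Hypothetical toy MDP (an assumption for illustration, not from the slides).
BETA = 0.9
r = [[1.0, 0.0], [0.0, 2.0]]        # r[s][a]
Q = [[[1.0, 0.0], [0.0, 1.0]],      # Q[s][a][t]
     [[1.0, 0.0], [0.0, 1.0]]]


def q_value(v, s, a):
    return r[s][a] + BETA * sum(p * v[t] for t, p in enumerate(Q[s][a]))


def greedy(v):
    """A v-greedy policy (ties broken toward the lowest action index)."""
    return [max(range(len(r[s])), key=lambda a: q_value(v, s, a))
            for s in range(len(v))]


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda i: abs(M[i][c]))
        M[c], M[p] = M[p], M[c]
        for i in range(n):
            if i != c:
                f = M[i][c] / M[c][c]
                M[i] = [x - f * y for x, y in zip(M[i], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]


def evaluate(sigma):
    """Policy evaluation: solve (I - beta * Q_sigma) x = r_sigma."""
    n = len(sigma)
    A = [[(1.0 if i == j else 0.0) - BETA * Q[i][sigma[i]][j]
          for j in range(n)] for i in range(n)]
    b = [r[i][sigma[i]] for i in range(n)]
    return solve(A, b)


def policy_iteration(sigma):
    while True:
        v = evaluate(sigma)            # Step 2: v_{n+1}
        new_sigma = greedy(v)          # Step 3: sigma_{n+1}
        if new_sigma == sigma:         # Step 4: policy is stable
            return sigma, v
        sigma = new_sigma


sigma_hat, v_hat = policy_iteration([0, 0])
```

On this toy problem the iteration passes through the policies (0, 0), (0, 1), (1, 1) and stops with values close to (18, 20).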

Proposition 1
The policy iteration algorithm terminates in finitely many steps; $\hat\sigma$ is an optimal policy and $\hat v$ is the optimal value.

$\varepsilon$-optimality
Let $v^*$ be the value function.
$v$ is a $\delta$-approximation of $v^*$ if $\|v - v^*\| < \delta$.
$\sigma$ is an $\varepsilon$-optimal policy if $v_\sigma$ is an $\varepsilon$-approximation of $v^*$.

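Error Bounds 1
Lemma 2
For any $v \in \mathbb{R}^S$,
\[ \|v^* - T v\| \le \frac{\beta}{1 - \beta} \|T v - v\|. \]
Proof
$\|v^* - T v\| \le \|v^* - T^m v\| + \|T^m v - T v\|$, where
\[ \text{second term} \le \sum_{k=1}^{m-1} \|T^{k+1} v - T^k v\| \le \sum_{k=1}^{m-1} \beta^k \|T v - v\| = \frac{\beta - \beta^m}{1 - \beta} \|T v - v\|. \]
Let $m \to \infty$.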

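Lemma 3
For any $v \in \mathbb{R}^S$ and any $T v$-greedy policy $\sigma$,
\[ \|v_\sigma - T v\| \le \frac{\beta}{1 - \beta} \|T v - v\|. \]
Proof
Denote $u = T v$. Recall that $v_\sigma = T_\sigma v_\sigma$ and $T_\sigma u = T u$. Then,
\[
\begin{aligned}
\|v_\sigma - u\| = \|T_\sigma v_\sigma - u\|
&\le \|T_\sigma v_\sigma - T u\| + \|T u - u\| \\
&= \|T_\sigma v_\sigma - T_\sigma u\| + \|T u - T v\| \\
&\le \beta \|v_\sigma - u\| + \beta \|u - v\|.
\end{aligned}
\]
Rearranging terms yields the desired inequality.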

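Proposition 4
For any $v \in \mathbb{R}^S$ and any $T v$-greedy policy $\sigma$,
\[ \|v_\sigma - v^*\| \le \frac{2\beta}{1 - \beta} \|T v - v\|. \]
Proof
By the previous two lemmas,
\[ \|v_\sigma - v^*\| \le \|v_\sigma - T v\| + \|T v - v^*\| \le \frac{\beta}{1 - \beta} \|T v - v\| + \frac{\beta}{1 - \beta} \|T v - v\|. \]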

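Error Bounds 2
For $x \in \mathbb{R}^S$, write $m(x) = \min_i x_i$ and $M(x) = \max_i x_i$.
Lemma 5
For any $v \in \mathbb{R}^S$ and any $v$-greedy policy $\sigma$,
\[
\begin{aligned}
v + \frac{1}{1 - \beta} m(T v - v)\mathbf{1}
&\le T v + \frac{\beta}{1 - \beta} m(T v - v)\mathbf{1} \\
&\le v_\sigma \le v^* \\
&\le T v + \frac{\beta}{1 - \beta} M(T v - v)\mathbf{1}
\le v + \frac{1}{1 - \beta} M(T v - v)\mathbf{1}.
\end{aligned}
\]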

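For $x \in \mathbb{R}^S$, write $\mathrm{span}(x) = M(x) - m(x)$ $(= \max_i x_i - \min_i x_i)$.
Proposition 6
For any $v \in \mathbb{R}^S$ and any $v$-greedy policy $\sigma$,
\[ \|v^* - v_\sigma\| \le \frac{\beta}{1 - \beta} \mathrm{span}(T v - v), \]
and
\[ \left\| v^* - \left( T v + \frac{\beta}{1 - \beta} \cdot \frac{m(T v - v) + M(T v - v)}{2} \mathbf{1} \right) \right\| \le \frac{\beta}{1 - \beta} \cdot \frac{\mathrm{span}(T v - v)}{2}. \]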
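The span-based bounds can be checked numerically. The sketch below uses a made-up two-state MDP and helper names that are assumptions for illustration; $v^*$ and $v_\sigma$ are approximated by iterating $T$ and $T_\sigma$ many times, which relies on the convergence property from the review slides.

```python
# Hypothetical toy MDP (an assumption for illustration, not from the slides).
BETA = 0.9
r = [[1.0, 0.0], [0.0, 2.0]]        # r[s][a]
Q = [[[1.0, 0.0], [0.0, 1.0]],      # Q[s][a][t]
     [[1.0, 0.0], [0.0, 1.0]]]


def q_value(v, s, a):
    return r[s][a] + BETA * sum(p * v[t] for t, p in enumerate(Q[s][a]))


def T(v):
    return [max(q_value(v, s, a) for a in range(len(r[s])))
            for s in range(len(v))]


def T_sigma(sigma, v):
    return [q_value(v, s, sigma[s]) for s in range(len(v))]


def greedy(v):
    return [max(range(len(r[s])), key=lambda a: q_value(v, s, a))
            for s in range(len(v))]


def iterate(op, v, n=2000):
    """Apply the operator op to v a total of n times."""
    for _ in range(n):
        v = op(v)
    return v


v = [0.0, 0.0]
sigma = greedy(v)                              # a v-greedy policy
Tv = T(v)
d = [x - y for x, y in zip(Tv, v)]             # T v - v
m, M = min(d), max(d)
c = BETA / (1 - BETA)

v_star = iterate(T, v)                         # approximates v*
v_sigma = iterate(lambda x: T_sigma(sigma, x), v)   # approximates v_sigma

lower = [x + c * m for x in Tv]                # T v + beta/(1-beta) m 1
upper = [x + c * M for x in Tv]                # T v + beta/(1-beta) M 1
mid = [x + c * (m + M) / 2 for x in Tv]        # midpoint estimate of v*

policy_gap = max(abs(a - b) for a, b in zip(v_star, v_sigma))
value_gap = max(abs(a - b) for a, b in zip(v_star, mid))
```

In this example `policy_gap` is compared with `c * (M - m)` and `value_gap` with `c * (M - m) / 2`; some of the inequalities hold with equality here, so any comparison should allow for rounding error.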

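Proof of Lemma 5
Take any $v \in \mathbb{R}^S$, and let $\sigma$ be a $v$-greedy policy: $T_\sigma v = T v$.
(Recall $m(x) = \min_i x_i$ and $M(x) = \max_i x_i$.)
Clearly, $T_\sigma v = T v \ge v + m(T v - v)\mathbf{1}$.
By the properties of $T_\sigma$,
\[
\begin{aligned}
T_\sigma^2 v &\ge T_\sigma (v + m(T v - v)\mathbf{1}) = T_\sigma v + \beta m(T v - v)\mathbf{1} \ge v + (1 + \beta) m(T v - v)\mathbf{1}, \\
T_\sigma^3 v &\ge T_\sigma (v + (1 + \beta) m(T v - v)\mathbf{1}) = T_\sigma v + \beta (1 + \beta) m(T v - v)\mathbf{1} \ge v + (1 + \beta + \beta^2) m(T v - v)\mathbf{1}, \\
&\;\;\vdots
\end{aligned}
\]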

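We thus have
\[ T_\sigma^n v \ge T_\sigma v + (\beta + \cdots + \beta^{n-1}) m(T v - v)\mathbf{1} \ge v + (1 + \beta + \cdots + \beta^{n-1}) m(T v - v)\mathbf{1}. \]
Letting $n \to \infty$, we have
\[ v_\sigma \ge T_\sigma v + \frac{\beta}{1 - \beta} m(T v - v)\mathbf{1} \ge v + \frac{1}{1 - \beta} m(T v - v)\mathbf{1}. \]
Note that $T_\sigma v = T v$.
By a similar procedure, we have
\[ v^* \le T v + \frac{\beta}{1 - \beta} M(T v - v)\mathbf{1} \le v + \frac{1}{1 - \beta} M(T v - v)\mathbf{1}. \]
Note finally that $v_\sigma \le v^*$.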

Remarks
Similar estimates with $-\|T v - v\|$ and $\|T v - v\|$ in place of $m(T v - v)$ and $M(T v - v)$ hold.
(Start with $-\|T v - v\|\mathbf{1} \le T v - v \le \|T v - v\|\mathbf{1}$.)
Since $-m(x) \le \|x\|$ and $M(x) \le \|x\|$, we have $\mathrm{span}(T v - v) \le 2\|T v - v\|$.

Error Bounds and Termination Conditions
Bounds: Bound 1, Bound 2
Algorithms: value iteration, modified policy iteration

Value Iteration with Norm Bounds
Specify $\varepsilon > 0$.
1. Set $n = 0$. Choose any $v_0$.
2. Let $v_{n+1} = T v_n$.
3. If $\|v_{n+1} - v_n\| < \frac{1 - \beta}{2\beta} \varepsilon$, then return $\hat v = v_{n+1}$ and a $\hat v$-greedy policy $\hat\sigma$. Otherwise, let $n = n + 1$ and go to Step 2.
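A minimal sketch of this algorithm in plain Python. The toy MDP and the function names are assumptions for illustration, not part of the slides.

```python
# Hypothetical toy MDP (an assumption for illustration, not from the slides).
BETA = 0.9
r = [[1.0, 0.0], [0.0, 2.0]]        # r[s][a]
Q = [[[1.0, 0.0], [0.0, 1.0]],      # Q[s][a][t]
     [[1.0, 0.0], [0.0, 1.0]]]


def q_value(v, s, a):
    return r[s][a] + BETA * sum(p * v[t] for t, p in enumerate(Q[s][a]))


def T(v):
    return [max(q_value(v, s, a) for a in range(len(r[s])))
            for s in range(len(v))]


def greedy(v):
    return [max(range(len(r[s])), key=lambda a: q_value(v, s, a))
            for s in range(len(v))]


def sup_norm(x, y):
    return max(abs(a - b) for a, b in zip(x, y))


def value_iteration(eps, v):
    """Iterate v <- T v until ||v_{n+1} - v_n|| < (1 - beta) / (2 beta) * eps."""
    threshold = (1 - BETA) / (2 * BETA) * eps
    while True:
        v_next = T(v)                              # Step 2
        if sup_norm(v_next, v) < threshold:        # Step 3
            return v_next, greedy(v_next)
        v = v_next


eps = 1e-3
v_hat, sigma_hat = value_iteration(eps, [0.0, 0.0])
```

In this toy problem the optimal value is (18, 20), so `v_hat` should lie within `eps / 2` of that vector and `sigma_hat` should pick action 1 in both states.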

Proposition 7
Given an $\varepsilon > 0$, the value iteration algorithm as described terminates in finitely many steps; $\hat\sigma$ is an $\varepsilon$-optimal policy and $\hat v$ is an $\frac{\varepsilon}{2}$-approximation of $v^*$.
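The slides state this proposition without proof. One way to see it, sketched here using only the contraction property, Lemma 2, and Proposition 4, runs as follows; at termination $\hat v = T v_n$ and $\hat\sigma$ is $T v_n$-greedy.

```latex
\begin{align*}
\text{Termination:}\quad
  & \|v_{n+1} - v_n\| = \|T v_n - T v_{n-1}\| \le \beta^n \|v_1 - v_0\| \to 0 . \\
\text{Stopping rule:}\quad
  & \|T v_n - v_n\| < \frac{1 - \beta}{2\beta}\,\varepsilon . \\
\text{Lemma 2:}\quad
  & \|v^* - \hat v\| = \|v^* - T v_n\|
    \le \frac{\beta}{1 - \beta}\,\|T v_n - v_n\| < \frac{\varepsilon}{2} . \\
\text{Proposition 4:}\quad
  & \|v_{\hat\sigma} - v^*\|
    \le \frac{2\beta}{1 - \beta}\,\|T v_n - v_n\| < \varepsilon .
\end{align*}
```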

Modified Policy Iteration with Span Seminorm Bounds
Specify $\varepsilon > 0$ and an integer $k \ge 1$.
1. Set $n = 0$. Choose any $v_0$.
2. [Policy improvement] Compute a $v_n$-greedy policy $\sigma_{n+1}$, i.e., a $\sigma_{n+1}$ such that $T_{\sigma_{n+1}} v_n = T v_n$. Compute also $u_n = T v_n$ $(= T_{\sigma_{n+1}} v_n)$.
3. If $\mathrm{span}(u_n - v_n) < \frac{1 - \beta}{\beta} \varepsilon$, then return $\hat\sigma = \sigma_{n+1}$ and
\[ \hat v = u_n + \frac{\beta}{1 - \beta} \cdot \frac{m(u_n - v_n) + M(u_n - v_n)}{2} \mathbf{1}. \]
Otherwise, go to the next step.
4. [Partial policy evaluation] Let $v_{n+1} = (T_{\sigma_{n+1}})^k v_n = (T_{\sigma_{n+1}})^{k-1} u_n$. Let $n = n + 1$ and go to Step 2.
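The four steps above can be sketched as follows. The toy MDP, the choice `k = 5`, and the function names are assumptions for illustration only.

```python
# Hypothetical toy MDP (an assumption for illustration, not from the slides).
BETA = 0.9
r = [[1.0, 0.0], [0.0, 2.0]]        # r[s][a]
Q = [[[1.0, 0.0], [0.0, 1.0]],      # Q[s][a][t]
     [[1.0, 0.0], [0.0, 1.0]]]


def q_value(v, s, a):
    return r[s][a] + BETA * sum(p * v[t] for t, p in enumerate(Q[s][a]))


def T_sigma(sigma, v):
    return [q_value(v, s, sigma[s]) for s in range(len(v))]


def greedy(v):
    return [max(range(len(r[s])), key=lambda a: q_value(v, s, a))
            for s in range(len(v))]


def modified_policy_iteration(eps, k, v):
    c = BETA / (1 - BETA)
    while True:
        sigma = greedy(v)                       # Step 2: v_n-greedy policy
        u = T_sigma(sigma, v)                   # u_n = T v_n
        d = [x - y for x, y in zip(u, v)]       # u_n - v_n
        m, M = min(d), max(d)
        if M - m < eps / c:                     # Step 3: span < (1-beta)/beta * eps
            return sigma, [x + c * (m + M) / 2 for x in u]
        v = u                                   # Step 4: v_{n+1} = T_sigma^{k-1} u_n
        for _ in range(k - 1):
            v = T_sigma(sigma, v)


eps = 1e-3
sigma_hat, v_hat = modified_policy_iteration(eps, 5, [0.0, 0.0])
```

With `k = 1` the partial evaluation step is empty and the loop reduces to value iteration with a span-based stopping rule; larger `k` moves the update closer to a full policy evaluation.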

Fact 1
For modified policy iteration, as $n \to \infty$, $v_n \to v^*$ and hence $\mathrm{span}(T v_n - v_n) \to 0$.
Proposition 8
Given an $\varepsilon > 0$, the modified policy iteration algorithm as described terminates in finitely many steps; $\hat\sigma$ is an $\varepsilon$-optimal policy and $\hat v$ is an $\frac{\varepsilon}{2}$-approximation of $v^*$.
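A sketch of why Proposition 8 follows, using Fact 1 for termination and Proposition 6 for the two bounds; at termination $u_n = T v_n$ and $\hat\sigma = \sigma_{n+1}$ is $v_n$-greedy.

```latex
\begin{align*}
\text{Termination (Fact 1):}\quad
  & \mathrm{span}(T v_n - v_n) \to 0,
    \text{ so the stopping rule is met for some finite } n . \\
\text{Stopping rule:}\quad
  & \mathrm{span}(u_n - v_n) < \frac{1 - \beta}{\beta}\,\varepsilon . \\
\text{Proposition 6, first bound:}\quad
  & \|v^* - v_{\hat\sigma}\|
    \le \frac{\beta}{1 - \beta}\,\mathrm{span}(T v_n - v_n) < \varepsilon . \\
\text{Proposition 6, second bound:}\quad
  & \|v^* - \hat v\|
    \le \frac{\beta}{1 - \beta} \cdot \frac{\mathrm{span}(T v_n - v_n)}{2}
    < \frac{\varepsilon}{2} .
\end{align*}
```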

