Long Term Values in MDPs Second Workshop on Open Games
Slide 1: A (Co)Algebraic Perspective on Long-Term Values in MDPs
Second Workshop on Open Games
Helle Hvid Hansen, Delft University of Technology
Helle Hvid Hansen (TU Delft), 2nd Workshop on Open Games, Oxford, 4-6 July
Slide 2: Introduction
Joint work with Frank Feys (Delft) and Larry Moss (Indiana): Long-Term Values in Markov Decision Processes, (Co)Algebraically. Proc. of Coalgebraic Methods in Computer Science (CMCS 2018).
Aim: apply (co)algebraic techniques to reason about Markov decision processes, and more generally about infinite games and equilibria (cf. Abramsky & Winschel, Lescanne, Hedges, Zahn, Ghani, Kupke, Lambert, Nordvall-Forsberg, ...).
Slide 3: Outline
1. MDP Preliminaries
2. Part I: Long-Term Values from b-corecursive Algebras
3. Part II: Policy Improvement (Co)Inductively
4. Conclusion
Slide 4: MDPs: Planning under Uncertainty
Markov decision processes (MDPs) are state-based models of sequential decision-making under uncertainty. The system/agent chooses actions and collects rewards, but does not have full control over transitions. The decision maker wants to find a policy/plan that maximizes expected long-term rewards.
Applications: maintenance schedules, production planning, finance, reinforcement learning, ...
MDPs are one-player stochastic games.
Slide 5: MDP Example
A start-up company needs to decide whether to Advertise or Save money.
[Figure: a four-state MDP with states Poor & Unknown (+0), Poor & Famous (+0), Rich & Unknown (+10), Rich & Famous (+10); in each state the company chooses Save (S) or Advertise (A) and moves along the labelled edges with probability 1/2 or 1.]
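The slide's example can be written down concretely. A minimal plain-Python sketch: the state labels and rewards follow the diagram, but the transition probabilities are only partly recoverable from the garbled figure, so the ones below are illustrative assumptions.

```python
# Start-up MDP from the slide, as plain Python. State labels and rewards follow
# the diagram; the transition probabilities are illustrative assumptions.
states = ["PU", "PF", "RU", "RF"]   # Poor&Unknown, Poor&Famous, Rich&Unknown, Rich&Famous
actions = ["S", "A"]                # Save, Advertise
reward = {"PU": 0.0, "PF": 0.0, "RU": 10.0, "RF": 10.0}   # the map u : S -> R

# t[s][a] is a finitely-supported distribution over next states (the (D S)^A part).
t = {
    "PU": {"S": {"PU": 1.0},            "A": {"PU": 0.5, "PF": 0.5}},
    "PF": {"S": {"PU": 0.5, "RF": 0.5}, "A": {"PF": 1.0}},
    "RU": {"S": {"PU": 0.5, "RU": 0.5}, "A": {"PU": 0.5, "PF": 0.5}},
    "RF": {"S": {"RU": 0.5, "RF": 0.5}, "A": {"PF": 1.0}},
}

# Sanity check: every t(s)(a) is a probability distribution.
for s in states:
    for a in actions:
        assert abs(sum(t[s][a].values()) - 1.0) < 1e-12
```

A deterministic stationary policy is then simply a dict from states to actions, e.g. `{"PU": "A", "PF": "S", "RU": "S", "RF": "S"}`.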
Slide 6: Markov Decision Processes
Def. A (discrete, time-independent) Markov decision process (MDP) is a Set-coalgebra $m = \langle u, t \rangle : S \to \mathbb{R} \times (\mathcal{D}S)^A$, where
- $S$ is a finite set of states,
- $A$ is a finite set of actions,
- $(\mathcal{D}, \delta, \mu)$ is the monad of finitely-supported distributions,
- $t : S \to (\mathcal{D}S)^A$ is a probabilistic transition function,
- $u : S \to \mathbb{R}$ is a reward function.
(Alternatively, $m : S \to (\mathbb{R} \times \mathcal{D}S)^A$, i.e. rewards are given on transitions.)
Def. A (deterministic, stationary) policy is a map $\sigma : S \to A$.
Slide 7: Expected Rewards via Trace Semantics
Given $m = \langle u, t \rangle : S \to \mathbb{R} \times (\mathcal{D}S)^A$ and a policy $\sigma : S \to A$, we get:
- a Markov reward process $m_\sigma = \langle u, t_\sigma \rangle : S \to \mathbb{R} \times \mathcal{D}S$, where $t_\sigma(s) := t(s)(\sigma(s))$;
- $m_\sigma^\sharp : \mathcal{D}S \to \mathbb{R} \times \mathcal{D}S$ by determinisation (cf. Jacobs, Silva, Sokolova), details coming up.
The trace map $\mathrm{trc} : S \to \mathbb{R}^\omega$ is the composite of the unit $\delta_S : S \to \mathcal{D}S$ with the unique morphism $! : \mathcal{D}S \to \mathbb{R}^\omega$ from $m_\sigma^\sharp$ into the final $(\mathbb{R} \times \mathrm{Id})$-coalgebra $\mathbb{R}^\omega \to \mathbb{R} \times \mathbb{R}^\omega$.
Trace semantics: $\mathrm{trc}(s) = (r_0^\sigma(s), r_1^\sigma(s), r_2^\sigma(s), \ldots)$, where $r_n^\sigma(s)$ is the expected reward at time step $n$ starting from $s$.
Slide 8: Distributive Laws (cf. Bartels (2004))
Let $(T, \eta, \mu)$ be a monad on $\mathcal{C}$ and $F : \mathcal{C} \to \mathcal{C}$ a functor. A distributive law of $(T, \eta, \mu)$ over $F$ is a natural transformation $\lambda : TF \Rightarrow FT$ satisfying $\lambda \circ \eta F = F\eta$ and $\lambda \circ \mu F = F\mu \circ \lambda T \circ T\lambda$.
Given such a $\lambda$, we obtain liftings
$F^\lambda : \mathrm{EM}(T) \to \mathrm{EM}(T)$: $(TA \xrightarrow{\alpha} A) \mapsto (TFA \xrightarrow{\lambda_A} FTA \xrightarrow{F\alpha} FA)$,
$T_\lambda : \mathrm{Coalg}(F) \to \mathrm{Coalg}(F)$: $(X \xrightarrow{c} FX) \mapsto (TX \xrightarrow{Tc} TFX \xrightarrow{\lambda_X} FTX)$,
... and determinisation
$(-)^\sharp : \mathrm{Coalg}(FT) \to \mathrm{Coalg}(F^\lambda)$: $(X \xrightarrow{c} FTX) \mapsto (TX \xrightarrow{Tc} TFTX \xrightarrow{\lambda_{TX}} FT^2X \xrightarrow{F\mu_X} FTX)$.
Slide 9: Distributive Law for Markov Reward Processes
A Markov reward process $m_\sigma : S \to \mathbb{R} \times \mathcal{D}S$ is an $H\mathcal{D}$-coalgebra, where $H = \mathbb{R} \times \mathrm{Id}$. There is a distributive law of $(\mathcal{D}, \delta, \mu)$ over $H$ (cf. Jacobs (2006)):
$\chi_X : \mathcal{D}(\mathbb{R} \times X) \xrightarrow{\langle \mathcal{D}\pi_1, \mathcal{D}\pi_2 \rangle} \mathcal{D}\mathbb{R} \times \mathcal{D}X \xrightarrow{E \times \mathrm{id}} \mathbb{R} \times \mathcal{D}X$, i.e. $\chi = (E \times \mathrm{id}) \circ \langle \mathcal{D}\pi_1, \mathcal{D}\pi_2 \rangle$,
which yields the determinisation $m_\sigma^\sharp : \mathcal{D}S \to \mathbb{R} \times \mathcal{D}S$ of the Markov reward process $m_\sigma$:
$m_\sigma^\sharp(\varphi) = \langle (E \circ \mathcal{D}u)(\varphi),\ (\mu_S \circ \mathcal{D}t_\sigma)(\varphi) \rangle = \big( \sum_{s \in S} u(s)\,\varphi(s),\ \ s' \mapsto \sum_{s \in S} t_\sigma(s)(s')\,\varphi(s) \big)$.
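Concretely, the determinisation formula above says: expected reward now, pushforward distribution next. A minimal plain-Python sketch on an illustrative two-state reward process (the numbers `u`, `t_sigma` are assumptions, not from the slides):

```python
# Determinised Markov reward process m_sigma^# : D(S) -> R x D(S), following the
# slide's formula: first component (E . D u), second component (mu_S . D t_sigma).
# Illustrative two-state example; u and t_sigma are assumed numbers.
u = {"x": 0.0, "y": 10.0}
t_sigma = {"x": {"x": 0.5, "y": 0.5}, "y": {"y": 1.0}}   # t_sigma(s) in D(S)

def m_sharp(phi):
    """Map a distribution phi over states to (expected reward, next distribution)."""
    expected_reward = sum(u[s] * p for s, p in phi.items())   # (E . D u)(phi)
    next_phi = {}
    for s, p in phi.items():                                  # (mu_S . D t_sigma)(phi)
        for s2, p2 in t_sigma[s].items():
            next_phi[s2] = next_phi.get(s2, 0.0) + p * p2
    return expected_reward, next_phi

r0, phi1 = m_sharp({"x": 1.0})   # start from the point distribution delta_x
```

Iterating `m_sharp` from $\delta_s$ and collecting the first components yields exactly the trace stream $(r_0^\sigma(s), r_1^\sigma(s), \ldots)$ of the next slide.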
Slide 10: Trace Semantics of Markov Reward Processes
As before, $\mathrm{trc} : S \to \mathbb{R}^\omega$ is $\delta_S$ followed by the unique morphism from the determinised coalgebra $m_\sigma^\sharp : \mathcal{D}S \to \mathbb{R} \times \mathcal{D}S$ into the final $(\mathbb{R} \times \mathrm{Id})$-coalgebra $\mathbb{R}^\omega$, and $\mathrm{trc}(s) = (r_0^\sigma(s), r_1^\sigma(s), r_2^\sigma(s), \ldots)$, where $r_n^\sigma(s)$ is the expected reward at time step $n$ starting from $s$.
The long-term expected value of $\sigma$ in $s$ depends on how you evaluate the stream $(r_0^\sigma(s), r_1^\sigma(s), r_2^\sigma(s), \ldots)$; different evaluation criteria exist.
Slide 11: Long-Term Value via Discounted Sums
Let $0 \le \gamma < 1$ be a discount factor.
Def. The long-term value of policy $\sigma$ according to the discounted sum criterion is $V^\sigma : S \to \mathbb{R}$:
$V^\sigma(s) = \sum_{n=0}^{\infty} \gamma^n\, r_n^\sigma(s)$.
The sum converges because the reward map $u : S \to \mathbb{R}$ is bounded.
We define: $\sigma \le \tau$ if $V^\sigma(s) \le V^\tau(s)$ for all $s$. $\sigma$ is an optimal policy if $\tau \le \sigma$ for all $\tau$.
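The discounted sum can be approximated directly from the trace stream: start from the point distribution $\delta_s$, read off the expected reward, push the distribution forward through $t_\sigma$, and accumulate $\gamma^n r_n$. A sketch on an illustrative two-state Markov reward process (assumed numbers, not the slide's example):

```python
# V^sigma(s) as a truncated discounted sum of the trace stream (r_0(s), r_1(s), ...).
# Illustrative two-state Markov reward process with assumed numbers.
gamma = 0.9
u = {"x": 0.0, "y": 10.0}
t_sigma = {"x": {"x": 0.5, "y": 0.5}, "y": {"y": 1.0}}

def value(s, n_steps=600):
    """Sum gamma^n * r_n(s) for n < n_steps (truncation error <= gamma^n_steps * sup|V|)."""
    phi = {s: 1.0}                  # delta_s, the unit of the distribution monad
    total, discount = 0.0, 1.0
    for _ in range(n_steps):
        r_n = sum(u[q] * p for q, p in phi.items())   # expected reward at step n
        total += discount * r_n
        discount *= gamma
        nxt = {}                    # push phi forward: (mu_S . D t_sigma)(phi)
        for q, p in phi.items():
            for q2, p2 in t_sigma[q].items():
                nxt[q2] = nxt.get(q2, 0.0) + p * p2
        phi = nxt
    return total

V = {s: value(s) for s in u}
```

Here the exact values can be read off the linear system $V = u + \gamma\, t_\sigma V$: $V(y) = 10/(1-\gamma) = 100$ and $V(x) = 0.45\,V(y)/0.55$, which the truncated sum reproduces to high precision.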
Slide 12: Optimal Value
Def. The optimal value function $V^* : S \to \mathbb{R}$ of $m$ is defined as $V^*(s) = \max_\sigma V^\sigma(s)$.
Classical facts (w.r.t. the discounted sum criterion), cf. (Puterman, 2014):
- If $\sigma$ is optimal, then $V^\sigma = V^*$.
- An optimal policy always exists.
- Optimal policies need not be unique.
- Stationary (memoryless), deterministic policies suffice.
Slide 13: Outline
1. MDP Preliminaries
2. Part I: Long-Term Values from b-corecursive Algebras
3. Part II: Policy Improvement (Co)Inductively
4. Conclusion
Slide 14: V^σ as a Coalgebra-to-Algebra Morphism
$V^\sigma : S \to \mathbb{R}$ satisfies, for all $s \in S$:
$V^\sigma(s) = u(s) + \gamma \sum_{s' \in S} t_\sigma(s)(s')\, V^\sigma(s')$   (1)
i.e. $V^\sigma = u + \gamma\, t_\sigma \cdot V^\sigma$ (as a linear system). So $V^\sigma$ arises as a fixpoint of the linear operator $\Psi_\sigma : \mathbb{R}^S \to \mathbb{R}^S$ given by $\Psi_\sigma(v) = u + \gamma\, t_\sigma \cdot v$.
Observation: we can re-express (1) as $V^\sigma$ being a coalgebra-to-algebra morphism from $m_\sigma = \langle u, t_\sigma \rangle : S \to \mathbb{R} \times \mathcal{D}S$ to $\alpha_\gamma \circ (\mathbb{R} \times E) : \mathbb{R} \times \mathcal{D}\mathbb{R} \to \mathbb{R}$:
$V^\sigma = \alpha_\gamma \circ (\mathbb{R} \times E) \circ (\mathbb{R} \times \mathcal{D}V^\sigma) \circ m_\sigma$,
where $\alpha_\gamma : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ is $\alpha_\gamma(x, y) = x + \gamma y$.
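Since $\Psi_\sigma$ is a $\gamma$-contraction on bounded functions, its fixpoint $V^\sigma$ can be computed by plain iteration from any bounded starting point. A sketch, reusing illustrative two-state numbers rather than the slide's example:

```python
# Compute V^sigma as the fixpoint of Psi_sigma(v) = u + gamma * t_sigma . v
# by Banach-style iteration. Illustrative two-state example with assumed numbers.
gamma = 0.9
u = {"x": 0.0, "y": 10.0}
t_sigma = {"x": {"x": 0.5, "y": 0.5}, "y": {"y": 1.0}}

def psi(v):
    """One application of the linear operator Psi_sigma."""
    return {s: u[s] + gamma * sum(p * v[q] for q, p in t_sigma[s].items())
            for s in u}

v = {s: 0.0 for s in u}          # any bounded starting point works
while True:
    v_next = psi(v)
    if max(abs(v_next[s] - v[s]) for s in u) < 1e-10:
        break                    # sup-norm change below tolerance: v ~ fixpoint
    v = v_next
```

Because the contraction factor is $\gamma$, the final iterate is within (change)/(1-$\gamma$) of the true fixpoint, here within $10^{-9}$.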
Slide 15: V^σ via a Universal Property?
Recall that a corecursive algebra (for a functor $F$) is an $F$-algebra $\alpha : FA \to A$ such that every coalgebra $f : C \to FC$ has a unique solution $\hat f : C \to A$ with $\hat f = \alpha \circ F\hat f \circ f$.
Question: Is $\alpha_\gamma \circ (\mathbb{R} \times E)$ a corecursive algebra?
By (Capretta et al., 2004): let $H = \mathbb{R} \times \mathrm{Id}$. Then $\alpha_\gamma \circ (\mathbb{R} \times E)$ is a corecursive algebra (for $H\mathcal{D}$) iff $\alpha_\gamma$ is a corecursive algebra (for $H$).
Slide 16: Is α_γ a Corecursive Algebra?
$\alpha_\gamma : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ is corecursive (for $H = \mathbb{R} \times \mathrm{Id}$) if for all $f : X \to \mathbb{R} \times X$ there is a unique $\hat f : X \to \mathbb{R}$ such that $\hat f = \alpha_\gamma \circ (\mathbb{R} \times \hat f) \circ f$.
Consider the coalgebra $f : X \to \mathbb{R} \times X$ given by the stream system $x_0 \xrightarrow{a_0} x_1 \xrightarrow{a_1} x_2 \xrightarrow{a_2} \cdots$, i.e. $f(x_n) = (a_n, x_{n+1})$.
Then $\hat f$ is a solution iff $\hat f(x_n) = a_n + \gamma\, \hat f(x_{n+1})$ for $n = 0, 1, 2, \ldots$.
This system has infinitely many solutions when $\gamma > 0$, even if $(a_n)_n$ is bounded. So $\alpha_\gamma$ is not corecursive for $\gamma > 0$. However, if $(a_n)_n$ is bounded, then the system has a unique bounded solution.
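To see the non-uniqueness concretely: given one solution $\hat f$, every real constant $c$ yields another, obtained by adding a geometric perturbation.

```latex
% If \hat f solves \hat f(x_n) = a_n + \gamma \hat f(x_{n+1}) for all n,
% then for any c \in \mathbb{R} the map \hat f_c(x_n) := \hat f(x_n) + c\,\gamma^{-n}
% is also a solution:
a_n + \gamma\,\hat f_c(x_{n+1})
  = a_n + \gamma\,\hat f(x_{n+1}) + \gamma \cdot c\,\gamma^{-(n+1)}
  = \hat f(x_n) + c\,\gamma^{-n}
  = \hat f_c(x_n).
```

For $0 < \gamma < 1$ and $c \neq 0$ the term $c\,\gamma^{-n}$ is unbounded, which is exactly why restricting to bounded solutions restores uniqueness.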
Slide 17: Bounded Corecursive Algebras (bca)
To get uniqueness, incorporate boundedness information.
Def. A b-category $(\mathcal{C}, B)$ is a category $\mathcal{C}$ together with a subclass $B \subseteq \mathrm{Mor}(\mathcal{C})$ of bounded morphisms such that $f \in B$ implies $f \circ g \in B$ for all composable $g$. (Also known as a sieve.)
Main example: $(\mathbf{Met}, B)$, where $\mathbf{Met}$ is metric spaces with all maps and $B$ is the class of all bounded maps.
Def. Let $(\mathcal{C}, B)$ be a b-category and $F : \mathcal{C} \to \mathcal{C}$ an endofunctor. An $F$-algebra $\alpha : FA \to A$ is a b-corecursive algebra (bca) if every coalgebra $f : X \to FX$ with $f \in B$ has a unique solution $\hat f : X \to A$ in $B$ such that $\hat f = \alpha \circ F\hat f \circ f$.
Slide 18: V^σ from the Universal Property of a bca
We show that $\alpha_\gamma \circ (\mathbb{R} \times E)$ is a bca for $H\mathcal{D}$:
1. Develop some theory of b-categories (b-functors, b-natural transformations, B-preservation properties, ...).
2. Prove a b-version of the (Capretta et al.) result (Theorem 2): from a bca for $H$ we obtain a bca for $H\mathcal{D}$, under certain conditions.
3. Show that $\alpha_\gamma$ is a bca for $H$.
4. Show that the conditions of Theorem 2 apply.
Slide 19: Step 1.1: b-Categories
Let $(\mathcal{C}, B)$ and $(\mathcal{C}', B')$ be b-categories, and let $F, G : \mathcal{C} \to \mathcal{C}'$ be functors.
Def. A $\mathcal{C}'$-arrow $f$ preserves $B'$ if $g \in B'$ implies $f \circ g \in B'$ (whenever $f \circ g$ is defined).
- $F$ is a b-functor if $f \in B$ implies that $Ff$ preserves $B'$.
- $F$ is a strong b-functor if $f \in B$ implies $Ff \in B'$.
- A natural transformation $\sigma : F \Rightarrow G$ is a b-natural transformation if all components $\sigma_X$ preserve $B'$.
Slide 20: Step 1.2: MDPs in (Met, B)
The functors $H = \mathbb{R} \times \mathrm{Id}$ and $\mathcal{D}$ are lifted to $\mathbf{Met}$:
- The product $(X, d_X) \times (Y, d_Y)$ is given the maximum metric.
- $\mathcal{D}(X, d_X) = (\mathcal{D}X, \bar{d}_X)$, where $\bar{d}_X$ is the Kantorovich lifting of $d_X$ to $\mathcal{D}X$.
We have:
- $H : \mathbf{Met} \to \mathbf{Met}$ is a b-functor on $(\mathbf{Met}, B)$, but not a strong one.
- $\mathcal{D} : \mathbf{Met} \to \mathbf{Met}$ is a strong b-functor on $(\mathbf{Met}, B)$.
- $\delta$, $\mu$, $\chi$ are b-natural transformations w.r.t. $(\mathbf{Met}, B)$.
- $B$ (bounded maps) is closed under determinisation: $c \in B$ implies $c^\sharp \in B$.
Slide 21: Step 2: b-Version of (Capretta et al.)
Theorem 2. Let $(\mathcal{C}, B)$ be a b-category, $F$ a $\mathcal{C}$-endofunctor, $(T, \eta, \mu)$ a monad on $\mathcal{C}$, and $\lambda$ a distributive law of $(T, \eta, \mu)$ over $F$. Assume further that $T$ is a strong b-functor and that $\lambda$ and $F\mu$ are b-natural in $(\mathcal{C}, B)$.
1. If $\beta : F^\lambda(A, \theta) \to (A, \theta)$ is an $F^\lambda$-algebra in $\mathrm{EM}(T)$ such that the underlying $\beta : FA \to A$ is a bca for $F$, and $\theta$ preserves $B$, then $\beta \circ F\theta : FTA \to A$ is a bca for $FT$.
2. Moreover, for all $g : X \to FTX$ in $B$, we have $\bar{g} = \widehat{g^\sharp} \circ \eta_X$ and $\widehat{g^\sharp} = \theta \circ T\bar{g}$, where $h \mapsto \hat{h}$ denotes the solution operation for the bca $\beta : FA \to A$ and $h \mapsto \bar{h}$ denotes the solution operation for the bca $\beta \circ F\theta : FTA \to A$.
Slide 22: Steps 3+4: Obtaining bcas
Step 3: Apply the Banach fixed-point theorem to show that $\alpha_\gamma$ is a bca for $H$.
Step 4: Apply Theorem 2:
- $\alpha_\gamma$ is an $H^\chi$-algebra on $(\mathbb{R}, E)$, i.e. the square $E \circ \mathcal{D}\alpha_\gamma = \alpha_\gamma \circ (E \times E) \circ \langle \mathcal{D}\pi_1, \mathcal{D}\pi_2 \rangle : \mathcal{D}(\mathbb{R} \times \mathbb{R}) \to \mathbb{R}$ commutes.
- $E$ preserves $B$.
Slide 23: Optimal Value V* from a bca
$V^* : S \to \mathbb{R}$ is the coalgebra-to-algebra morphism from $m = \langle u, t \rangle : S \to \mathbb{R} \times (\mathcal{D}S)^A$ to the algebra $\alpha_\gamma \circ (\mathbb{R} \times (\max_A \circ\, E^A)) : \mathbb{R} \times (\mathcal{D}\mathbb{R})^A \to \mathbb{R}$, i.e.
$V^* = \alpha_\gamma \circ (\mathbb{R} \times (\max_A \circ\, E^A)) \circ (\mathbb{R} \times (\mathcal{D}V^*)^A) \circ m$.
But this bca is not obtained via Theorem 2. Problem: $\max_A$ is not affine/linear. We can show directly that we have a bca.
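Even though $\max_A$ is not linear, the operator $v \mapsto u + \gamma \max_a (t_a \cdot v)$ is still a $\gamma$-contraction on bounded functions, so $V^*$ can again be computed by Banach iteration (value iteration). A sketch on the start-up example, with the same illustrative probabilities assumed earlier (the garbled diagram does not pin them down):

```python
# Value iteration: iterate the Bellman optimality operator
#   v |-> u + gamma * max_a E_{t_a}[v]
# which is a gamma-contraction. Transition probabilities are assumed numbers.
gamma = 0.9
states = ["PU", "PF", "RU", "RF"]
actions = ["S", "A"]
u = {"PU": 0.0, "PF": 0.0, "RU": 10.0, "RF": 10.0}
t = {
    "PU": {"S": {"PU": 1.0},            "A": {"PU": 0.5, "PF": 0.5}},
    "PF": {"S": {"PU": 0.5, "RF": 0.5}, "A": {"PF": 1.0}},
    "RU": {"S": {"PU": 0.5, "RU": 0.5}, "A": {"PU": 0.5, "PF": 0.5}},
    "RF": {"S": {"RU": 0.5, "RF": 0.5}, "A": {"PF": 1.0}},
}

def bellman(v):
    """One application of the Bellman optimality operator."""
    return {s: u[s] + gamma * max(sum(p * v[q] for q, p in t[s][a].items())
                                  for a in actions)
            for s in states}

v_star = {s: 0.0 for s in states}
for _ in range(500):              # geometric convergence: error ~ gamma^500
    v_star = bellman(v_star)
```

The fixpoint reached here is (up to numerical tolerance) the optimal value function $V^*$ for the assumed numbers.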
Slide 24: Outline
1. MDP Preliminaries
2. Part I: Long-Term Values from b-corecursive Algebras
3. Part II: Policy Improvement (Co)Inductively
4. Conclusion
Slide 25: Policy Iteration
For $\sigma : S \to A$ and $\varphi \in \mathcal{D}S$, let $l_\sigma(\varphi) = \sum_{s' \in S} \varphi(s')\, V^\sigma(s')$ (the expected long-term value for $\sigma$ w.r.t. $\varphi$).
Policy Iteration Algorithm:
1. Initialise $\sigma_0$ to any policy.
2. Compute $V^{\sigma_k}$ (e.g. by solving a system of linear equations).
3. Define $\sigma_{k+1}$ by $\sigma_{k+1}(s) := \mathrm{argmax}_{a \in A} \{ l_{\sigma_k}(t_a(s)) \}$.
4. If $\sigma_{k+1} = \sigma_k$ then stop, else go to step 2.
Termination: since $A^S$ is finite. Correctness: follows if $\sigma_{k+1} \ge \sigma_k$.
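The four steps above can be sketched directly in plain Python. The MDP numbers are the same illustrative assumptions used for the start-up example; evaluation (step 2) is done by iterating $\Psi_\sigma$ rather than by a linear solve, which for a discount factor below 1 converges to the same $V^\sigma$:

```python
# Policy iteration on the (assumed) start-up MDP numbers.
gamma = 0.9
states = ["PU", "PF", "RU", "RF"]
actions = ["S", "A"]
u = {"PU": 0.0, "PF": 0.0, "RU": 10.0, "RF": 10.0}
t = {
    "PU": {"S": {"PU": 1.0},            "A": {"PU": 0.5, "PF": 0.5}},
    "PF": {"S": {"PU": 0.5, "RF": 0.5}, "A": {"PF": 1.0}},
    "RU": {"S": {"PU": 0.5, "RU": 0.5}, "A": {"PU": 0.5, "PF": 0.5}},
    "RF": {"S": {"RU": 0.5, "RF": 0.5}, "A": {"PF": 1.0}},
}

def evaluate(sigma, iters=500):
    """Step 2: V^sigma, here by iterating Psi_sigma (a linear solve also works)."""
    v = {s: 0.0 for s in states}
    for _ in range(iters):
        v = {s: u[s] + gamma * sum(p * v[q] for q, p in t[s][sigma[s]].items())
             for s in states}
    return v

def l(v_sigma, phi):
    """l_sigma(phi): expected long-term value of the distribution phi."""
    return sum(p * v_sigma[q] for q, p in phi.items())

sigma = {s: "S" for s in states}                       # step 1: arbitrary initial policy
while True:
    v = evaluate(sigma)                                 # step 2
    sigma_next = {s: max(actions, key=lambda a: l(v, t[s][a]))
                  for s in states}                      # step 3: greedy improvement
    if sigma_next == sigma:                             # step 4: stop when stable
        break
    sigma = sigma_next
```

At termination the returned policy is greedy with respect to its own value function, which (by the policy improvement lemma on the next slides) is the optimality condition.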
Slide 26: Policy Improvement
By definition, for all $s \in S$, $\sigma_{k+1}(s) = \mathrm{argmax}_{a \in A} \{ l_{\sigma_k}(t_a(s)) \}$, which implies that for all $s \in S$, $l_{\sigma_k}(t_{\sigma_{k+1}}(s)) \ge l_{\sigma_k}(t_{\sigma_k}(s))$, i.e. $l_{\sigma_k} \circ t_{\sigma_{k+1}} \ge l_{\sigma_k} \circ t_{\sigma_k}$ (in the pointwise order on $\mathbb{R}^S$).
Policy Improvement Lemma: For all $\sigma, \tau$: if $l_\sigma \circ t_\tau \ge l_\sigma \circ t_\sigma$, then $V^\tau \ge V^\sigma$.
Slide 27: Contraction (Co)Induction
Def. (Ordered metric space) An ordered (complete) metric space $(M, d, \le)$ is a (complete) metric space $(M, d)$ together with a partial order $(M, \le)$ such that for all $y \in M$, the sets $\{z \mid z \le y\}$ and $\{z \mid y \le z\}$ are closed in the metric topology.
Example: $B(X, \mathbb{R})$ with the pointwise order and supremum metric.
Theorem (Contraction (Co)Induction). Let $M$ be a non-empty, ordered complete metric space. If $f : M \to M$ is both contractive and order-preserving, then the fixpoint $x^*$ of $f$ is:
(i) a least pre-fixpoint (if $f(x) \le x$, then $x^* \le x$), and
(ii) a greatest post-fixpoint (if $x \le f(x)$, then $x \le x^*$).
Cf. Metric Coinduction (Kozen & Ruozzi, 2009) and (Denardo, 1967).
Slide 28: Proof of Policy Improvement
Policy Improvement Lemma: For all $\sigma, \tau$: if $l_\sigma \circ t_\tau \ge l_\sigma \circ t_\sigma$, then $V^\tau \ge V^\sigma$.
Proof: Apply contraction (co)induction to $\Psi_\tau : \mathbb{R}^S \to \mathbb{R}^S$, $\Psi_\tau(v) = u + \gamma\, t_\tau \cdot v$, which is contractive and order-preserving and has $V^\tau$ as its fixpoint. We have:
$l_\sigma \circ t_\tau \ge l_\sigma \circ t_\sigma$
$\Rightarrow\ u + \gamma\,(l_\sigma \circ t_\tau) \ge u + \gamma\,(l_\sigma \circ t_\sigma)$
$\Rightarrow\ \Psi_\tau(V^\sigma) \ge \Psi_\sigma(V^\sigma) = V^\sigma$
$\Rightarrow\ V^\tau \ge V^\sigma$, by contraction (co)induction, since $V^\sigma$ is a post-fixpoint of $\Psi_\tau$.
Slide 29: Conclusion
- Value functions $V^\sigma$ and $V^*$ from b-corecursive algebras.
- Coinductive proof of the policy improvement theorem.
- We still need to resort to the Banach FPT to get fixpoints; cf. metric coinduction (Kozen & Ruozzi, CALCO 2009).
Future work:
- Stochastic games (an MDP is a 1-player stochastic game): existence of Nash equilibria = Kakutani + Contraction (Co)Induction.
- Other types of equilibria (subgame perfect, Markov, ...).
- Connections to open games (Hedges et al.).
- Semantics of equilibria (Pavlovic, 2009).
- Coalgebraic infinite games (Abramsky & Winschel, 2017).
- Learning: reinforcement learning.
Thanks!
More informationExponential utility maximization under partial information
Exponential utility maximization under partial information Marina Santacroce Politecnico di Torino Joint work with M. Mania AMaMeF 5-1 May, 28 Pitesti, May 1th, 28 Outline Expected utility maximization
More informationFinal exam solutions
EE365 Stochastic Control / MS&E251 Stochastic Decision Models Profs. S. Lall, S. Boyd June 5 6 or June 6 7, 2013 Final exam solutions This is a 24 hour take-home final. Please turn it in to one of the
More informationTABLEAU-BASED DECISION PROCEDURES FOR HYBRID LOGIC
TABLEAU-BASED DECISION PROCEDURES FOR HYBRID LOGIC THOMAS BOLANDER AND TORBEN BRAÜNER Abstract. Hybrid logics are a principled generalization of both modal logics and description logics. It is well-known
More informationRobust hedging with tradable options under price impact
- Robust hedging with tradable options under price impact Arash Fahim, Florida State University joint work with Y-J Huang, DCU, Dublin March 2016, ECFM, WPI practice is not robust - Pricing under a selected
More informationForecast Horizons for Production Planning with Stochastic Demand
Forecast Horizons for Production Planning with Stochastic Demand Alfredo Garcia and Robert L. Smith Department of Industrial and Operations Engineering Universityof Michigan, Ann Arbor MI 48109 December
More informationPh.D. Preliminary Examination MICROECONOMIC THEORY Applied Economics Graduate Program June 2017
Ph.D. Preliminary Examination MICROECONOMIC THEORY Applied Economics Graduate Program June 2017 The time limit for this exam is four hours. The exam has four sections. Each section includes two questions.
More informationBrief Notes on the Category Theoretic Semantics of Simply Typed Lambda Calculus
University of Cambridge 2017 MPhil ACS / CST Part III Category Theory and Logic (L108) Brief Notes on the Category Theoretic Semantics of Simply Typed Lambda Calculus Andrew Pitts Notation: comma-separated
More informationbeing saturated Lemma 0.2 Suppose V = L[E]. Every Woodin cardinal is Woodin with.
On NS ω1 being saturated Ralf Schindler 1 Institut für Mathematische Logik und Grundlagenforschung, Universität Münster Einsteinstr. 62, 48149 Münster, Germany Definition 0.1 Let δ be a cardinal. We say
More informationarxiv: v1 [math.oc] 23 Dec 2010
ASYMPTOTIC PROPERTIES OF OPTIMAL TRAJECTORIES IN DYNAMIC PROGRAMMING SYLVAIN SORIN, XAVIER VENEL, GUILLAUME VIGERAL Abstract. We show in a dynamic programming framework that uniform convergence of the
More informationISBN ISSN
UNIVERSITY OF OSLO Department of Informatics A Logic-based Approach to Decision Making (extended version) Research Report 441 Magdalena Ivanovska Martin Giese ISBN 82-7368-373-7 ISSN 0806-3036 A Logic-based
More informationAMH4 - ADVANCED OPTION PRICING. Contents
AMH4 - ADVANCED OPTION PRICING ANDREW TULLOCH Contents 1. Theory of Option Pricing 2 2. Black-Scholes PDE Method 4 3. Martingale method 4 4. Monte Carlo methods 5 4.1. Method of antithetic variances 5
More informationSlides III - Complete Markets
Slides III - Complete Markets Julio Garín University of Georgia Macroeconomic Theory II (Ph.D.) Spring 2017 Macroeconomic Theory II Slides III - Complete Markets Spring 2017 1 / 33 Outline 1. Risk, Uncertainty,
More informationLecture 4: Model-Free Prediction
Lecture 4: Model-Free Prediction David Silver Outline 1 Introduction 2 Monte-Carlo Learning 3 Temporal-Difference Learning 4 TD(λ) Introduction Model-Free Reinforcement Learning Last lecture: Planning
More informationStochastic Games and Bayesian Games
Stochastic Games and Bayesian Games CPSC 532L Lecture 10 Stochastic Games and Bayesian Games CPSC 532L Lecture 10, Slide 1 Lecture Overview 1 Recap 2 Stochastic Games 3 Bayesian Games Stochastic Games
More informationSPDE and portfolio choice (joint work with M. Musiela) Princeton University. Thaleia Zariphopoulou The University of Texas at Austin
SPDE and portfolio choice (joint work with M. Musiela) Princeton University November 2007 Thaleia Zariphopoulou The University of Texas at Austin 1 Performance measurement of investment strategies 2 Market
More informationAnswers to Problem Set 4
Answers to Problem Set 4 Economics 703 Spring 016 1. a) The monopolist facing no threat of entry will pick the first cost function. To see this, calculate profits with each one. With the first cost function,
More informationLecture 14: Basic Fixpoint Theorems (cont.)
Lecture 14: Basic Fixpoint Theorems (cont) Predicate Transformers Monotonicity and Continuity Existence of Fixpoints Computing Fixpoints Fixpoint Characterization of CTL Operators 1 2 E M Clarke and E
More informationFinite Memory and Imperfect Monitoring
Federal Reserve Bank of Minneapolis Research Department Finite Memory and Imperfect Monitoring Harold L. Cole and Narayana Kocherlakota Working Paper 604 September 2000 Cole: U.C.L.A. and Federal Reserve
More informationStochastic Proximal Algorithms with Applications to Online Image Recovery
1/24 Stochastic Proximal Algorithms with Applications to Online Image Recovery Patrick Louis Combettes 1 and Jean-Christophe Pesquet 2 1 Mathematics Department, North Carolina State University, Raleigh,
More informationNot 0,4 2,1. i. Show there is a perfect Bayesian equilibrium where player A chooses to play, player A chooses L, and player B chooses L.
Econ 400, Final Exam Name: There are three questions taken from the material covered so far in the course. ll questions are equally weighted. If you have a question, please raise your hand and I will come
More informationThe Irrevocable Multi-Armed Bandit Problem
The Irrevocable Multi-Armed Bandit Problem Ritesh Madan Qualcomm-Flarion Technologies May 27, 2009 Joint work with Vivek Farias (MIT) 2 Multi-Armed Bandit Problem n arms, where each arm i is a Markov Decision
More informationAppendix: Common Currencies vs. Monetary Independence
Appendix: Common Currencies vs. Monetary Independence A The infinite horizon model This section defines the equilibrium of the infinity horizon model described in Section III of the paper and characterizes
More informationOptimal stopping problems for a Brownian motion with a disorder on a finite interval
Optimal stopping problems for a Brownian motion with a disorder on a finite interval A. N. Shiryaev M. V. Zhitlukhin arxiv:1212.379v1 [math.st] 15 Dec 212 December 18, 212 Abstract We consider optimal
More information