Convergence of Best-response Dynamics in Zero-sum Stochastic Games


David Leslie, Steven Perkins, and Zibo Xu

April 3, 2015

Abstract

Given a two-player zero-sum discounted-payoff stochastic game, we introduce three classes of continuous-time best-response dynamics: stopping-time best-response dynamics, closed-loop best-response dynamics, and open-loop best-response dynamics. We show the global convergence of the first two classes to the set of minimax strategy profiles, and the convergence of the last class when the players are not patient. We also show that the payoffs in a modified closed-loop best-response dynamic converge to the asymptotic value of the zero-sum stochastic game.

1 Introduction

The continuous-time best-response dynamic is a well-known evolutionary dynamic. It takes the form of a differential inclusion, with a constant revision rate in myopic optimization; see Matsui (1989), Gilboa and Matsui (1991), Hofbauer (1995), and Balkenborg et al. (2013). A state in the dynamic specifies the strategy profile of all players, and the frequency of a strategy increases only if it is a best response to the current state. The continuous-time best-response dynamic has been analyzed in various classes of games; see Hofbauer and Sigmund (1998) and Sandholm (2010). In particular, convergence of a continuous-time best-response dynamic to the set of Nash equilibria has been shown in Harris (1994), Hofbauer (1995), and Hofbauer and Sorin (2006) for two-player zero-sum games, in Harris (1994) for weighted-potential games, and in Berger (2005) for 2 × n games.

A zero-sum stochastic game with discounted payoff was introduced in Shapley (1953). The study of non-zero-sum stochastic games builds on the zero-sum case: by the Folk Theorem, the analysis of the former often needs the values of related zero-sum stochastic games, which serve as punishment for any player who deviates from the strategy agreed upon beforehand; see Dutta (1995).

In the present paper, we introduce several candidates for the continuous-time best-response dynamic in two-player zero-sum stochastic games with discounted payoff. We should first point out that it is tricky to define a best-response dynamic in a stochastic game, and indeed no established notion is available in the literature yet. We construct a few dynamics in the present paper, and the convergence results depend heavily on the conditions of the dynamic. The key question is what information and how much rationality the players have in the game. For instance, given the stationary strategy of the opponent, can a player find her best stationary strategy against it? This innocent-looking question is not easy to answer, or at least the search takes quite a long time, even if the player is equipped with a modern computer. Note that the best stationary strategy may not be a pure strategy, because the discounted payoff in a stochastic game may be concave or convex in a player's strategy. A more suitable question for myopic players may be whether, given a stationary strategy profile, a player can compute the discounted payoff, or whether a referee is available to provide this information.

Regarding convergence of best-response dynamics, one may wonder why we do not simply view a zero-sum stochastic game as a general zero-sum game with a compact and convex strategy space for each player and apply results such as those in Hofbauer and Sorin (2006). Our answer is that the transition between states after each stage of a stochastic game makes the dynamic more complicated than those studied in Hofbauer and Sorin (2006). In particular, Hofbauer and Sorin (2006) consider only games whose payoff is concave in player 1's strategy and convex in player 2's strategy. In a stochastic game, however, it can be the opposite, i.e., convex for player 1 and concave for player 2: player 1's strategy in the current state may give her not only a better payoff today but also a better state position tomorrow. Another natural attempt is to approximate the stochastic game by a super normal-form game in which each player's strategy is a bounded sequence of actions in the state games. However, one can only find an approximate value of the game this way. Furthermore, the stationary minimax strategy of the stochastic game may still look elusive, as we can only see how players best respond in a truncated stochastic game. Finally, as the discount factor increases, i.e., as the players become more and more patient, we need a bigger and bigger super normal-form game to approximate the stochastic game.

For an evolutionary dynamic in a stochastic game, we assume that a player does not have the full rationality to compute the global best response given the strategy profiles of all players.

However, the player should be able to see the difference between her state-game payoff and the continuation payoff in that state. For the stopping-time best-response dynamics and closed-loop best-response dynamics defined in this paper, we specify how the continuation payoff vector evolves in each state game along the dynamic. We then study how to apply the convergence result for continuous-time best-response dynamics in normal-form zero-sum games in the context of evolving state games.

We first follow the original idea in Shapley (1953), define the stopping-time best-response dynamic in a zero-sum stochastic game, and show the convergence of the dynamic to the set of minimax strategy profiles. This dynamic requires only that both players play best responses against each other, as in a standard continuous-time best-response dynamic. The continuation payoff vector in each state game is updated to the current payoff vector of that state game only at countably many times, and the time interval between two consecutive updates is sufficiently long.

The stopping-time best-response dynamic reminds us that, for a state game with a fixed continuation payoff vector, the best-response dynamic converges to the set of minimax strategy profiles of that zero-sum game; but how can this be applied to learning in the original stochastic game? To this end, for each state s, we introduce a feedback loop from the current payoff of the state game at s to the continuation payoff of state s assumed in each state game. We further require that the players play best responses against each other in all state games simultaneously, while the commonly shared continuation payoff vector approaches the state-game payoff vector as in continuous-time fictitious play. As time passes, the continuation payoffs change more and more slowly, but the players can still adjust their strategies at the same speed as before. The key to the convergence of the closed-loop best-response dynamic in a zero-sum stochastic game is simply this difference in adjustment speed between the best-response dynamics on the players' strategies and the fictitious play on the continuation payoffs.

In the literature, there are a few papers concerning algorithms for computing the value of a zero-sum stochastic game with discounted payoff; see, e.g., Vrieze and Tijs (1982) and Borkar (2002). However, they do not take the perspective of an evolutionary or learning process. The closed-loop best-response dynamic instead follows a rudimentary approach observed in the real world: when the state is changing very slowly, the players may simply play best responses in the current state, even if they know that the future states depend on their current behavior and that the currently assumed continuation payoffs may not match the payoffs generated in the state games.

Familiar examples are production that consumes natural resources and the issue of global warming. In the long run, what is obtained in a state game must match the corresponding continuation payoff, and the players will eventually learn what behavior is best suited and how much payoff can be sustained in each state.

We can go further and propose a variant of the closed-loop best-response dynamic such that the value in each state game converges to the asymptotic value of the zero-sum stochastic game as the discount factor increases to 1. For this purpose, we simply make the discount factor change even more slowly than the continuation payoffs. Note that we only obtain convergence in value here, not necessarily in stationary strategies. Similarly to Harris (1994), we can show that the rate of convergence in payoff terms is $1/t$ and $1/\ln t$ for the closed-loop best-response dynamic and the discount-factor-converging best-response dynamic, respectively.

For the case in which the feedback loop is unavailable, we introduce open-loop best-response dynamics, and assume that a referee tells each player her discounted payoff given the current stationary strategy profile of both players in the zero-sum stochastic game. In this dynamic, a player is unable to understand the infinitely long stochastic state-transition process, but naively reduces the stochastic game to a zero-sum state game whose continuation payoff vector is the current discounted payoff vector. When the players play best responses against each other in such a state game, each player assumes that the other player will use the same current strategy from stage 2 onwards, even if she is aware that the opponent is adjusting her strategy at stage 1 of the stochastic game. We regard the open-loop best-response dynamic as a primary model for studying myopic behavior in state games that approximate the stochastic game. We are also interested in whether this dynamic can serve as an alternative algorithm to compute the value of the zero-sum stochastic game. In Section 3.3, we show that when the discount factor is not too big, i.e., when the players are not patient, the open-loop best-response dynamic converges to the set of stationary minimax strategy profiles.

2 The Model

We begin by reviewing two-player normal-form zero-sum games.

2.1 Normal-form Zero-sum Games

In a zero-sum game $G$ in which the pure strategy sets of players 1 and 2 are $A^1$ and $A^2$, respectively, the $(a^1, a^2)$ element $u(a^1, a^2)$ of the payoff matrix denotes the payoff to player 1 when player 1 plays $a^1$ and player 2 plays $a^2$. We can then linearly extend the payoff function of player 1 to mixed strategies, i.e., $u(x^1, x^2)$ is defined for any $x^1 \in \Delta(A^1)$ and $x^2 \in \Delta(A^2)$. Recall that the value of the game is
$$v(G) = \max_{x^1 \in \Delta(A^1)} \min_{x^2 \in \Delta(A^2)} u(x^1, x^2) = \min_{x^2 \in \Delta(A^2)} \max_{x^1 \in \Delta(A^1)} u(x^1, x^2).$$
A minimax strategy of player 1 guarantees her a payoff no less than $v(G)$, regardless of the strategy of player 2; similarly, a minimax strategy of player 2 guarantees that player 1's payoff is no more than $v(G)$. A minimax strategy profile is also a Nash equilibrium in $G$.

2.1.1 Preliminary Results

Lemma 2.1. Given a positive finite number $c$, if we modify the payoff function $u$ to $u'$ with the property $|u'(a^1, a^2) - u(a^1, a^2)| \le c$ for all $(a^1, a^2) \in A^1 \times A^2$, then for any (mixed) strategy profile $(x^1, x^2)$, $|u'(x^1, x^2) - u(x^1, x^2)| \le c$.

Proof. This follows from the linearity of $u$.

Lemma 2.2. Given a positive finite number $c$, if we modify the payoff function $u$ to $u'$ with the property $|u'(a^1, a^2) - u(a^1, a^2)| \le c$ for all $(a^1, a^2) \in A^1 \times A^2$, then $|v(G) - v(G')| \le c$, where $G'$ is the game with the modified payoff function $u'$.

Proof. For any minimax strategy profile $(x^1, x^2)$ in $G$ and any minimax strategy profile $(\tilde x^1, \tilde x^2)$ in $G'$, we have
$$u(x^1, \tilde x^2) \ge u(x^1, x^2) \quad\text{and}\quad u'(\tilde x^1, \tilde x^2) \ge u'(x^1, \tilde x^2).$$
Thus
$$u(x^1, x^2) - u'(\tilde x^1, \tilde x^2) \le u(x^1, \tilde x^2) - u'(x^1, \tilde x^2) \le \max_{a^1, a^2} |u'(a^1, a^2) - u(a^1, a^2)| \le c,$$
by Lemma 2.1. Similarly, we can show that
$$u'(\tilde x^1, \tilde x^2) - u(x^1, x^2) \le \max_{a^1, a^2} |u'(a^1, a^2) - u(a^1, a^2)| \le c.$$
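The value and minimax strategies above are the basic computational objects used throughout the paper. As an illustration only (not part of the original text), here is a minimal numerical sketch of computing $v(G)$ and a minimax strategy of player 1 by linear programming; the use of numpy and scipy.optimize.linprog is an assumption of the sketch, and any LP solver would do.

```python
# Sketch (not from the paper): value and a minimax strategy of a finite
# zero-sum game via linear programming.
import numpy as np
from scipy.optimize import linprog

def solve_matrix_game(U):
    """Return (v, x1): the value and a maximin mixed strategy of player 1
    for the payoff matrix U (rows = player 1's actions)."""
    m, n = U.shape
    # Variables (x1_1, ..., x1_m, v).  Maximise v subject to
    #   sum_i x1_i * U[i, j] >= v  for every column j,
    #   sum_i x1_i = 1,  x1 >= 0,  v free.
    c = np.zeros(m + 1); c[-1] = -1.0                  # minimise -v
    A_ub = np.hstack([-U.T, np.ones((n, 1))])          # v - x1^T U[:, j] <= 0
    b_ub = np.zeros(n)
    A_eq = np.concatenate([np.ones(m), [0.0]]).reshape(1, -1)
    b_eq = np.array([1.0])
    bounds = [(0, None)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1], res.x[:m]

# Matching pennies: value 0, minimax strategy (1/2, 1/2).
v, x1 = solve_matrix_game(np.array([[1.0, -1.0], [-1.0, 1.0]]))
print(v, x1)
```

By symmetry, a minimax strategy of player 2 can be obtained by applying the same routine to $-U^\top$.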

2.1.2 Best-response Dynamics in Normal-form Zero-sum Games

In continuous-time best-response dynamics, player 1 revises her strategy according to the set of current best-response strategies $br^1(x^2) := \operatorname*{argmax}_{\rho^1 \in \Delta(A^1)} u(\rho^1, x^2)$; similarly, for player 2, $br^2(x^1) := \operatorname*{argmin}_{\rho^2 \in \Delta(A^2)} u(x^1, \rho^2)$. A best-response dynamic in a normal-form zero-sum game $G$ satisfies
$$\dot x^i \in br^i(x^{-i}) - x^i, \quad i = 1, 2, \qquad (2.1)$$
where the dot denotes the derivative with respect to time, and we have suppressed the time argument. Since best-response strategies are in general not unique, this is, rigorously speaking, a differential inclusion, not always reducible to a differential equation. In normal-form zero-sum games, the set $br^i(x^{-i})$ is upper semi-continuous in $x^{-i}$. Hence, from any initial strategy profile $(x^1(0), x^2(0))$, a solution trajectory $(x(t))_{t \ge 0}$ exists, and $x(t)$ is Lipschitz continuous and satisfies (2.1) for almost all $t \ge 0$; see Aubin and Cellina (1984).

Given a strategy profile $(x^1, x^2)$, we define
$$H(x^2) := \max_{y^1 \in \Delta(A^1)} u(y^1, x^2), \qquad (2.2)$$
and
$$L(x^1) := \min_{y^2 \in \Delta(A^2)} u(x^1, y^2). \qquad (2.3)$$
For any $t \ge 0$, we define along a solution trajectory $(x(t))_{t \ge 0}$
$$w(t) := H(x^2(t)) - L(x^1(t)). \qquad (2.4)$$
We call $w(t)$ the energy of the dynamic at time $t$. It is straightforward to see that
$$|u(x^1(t), x^2(t)) - v(G)| \le w(t), \quad t \ge 0, \qquad (2.5)$$
and that $w(t) = 0$ if and only if $x^1(t)$ and $x^2(t)$ are minimax strategies of players 1 and 2, respectively. Thus, at such a time $t$, $u(x^1(t), x^2(t)) = v(G)$.

Harris (1994) and Hofbauer and Sorin (2006) show the following result.

Theorem 2.3. Given a normal-form zero-sum game $G$, along every solution trajectory $(x(t))_{t \ge 0}$ of (2.1), $w(t)$ satisfies
$$\dot w(t) = -w(t) \quad\text{for almost all } t, \qquad (2.6)$$
hence
$$w(t) = e^{-t} w(0). \qquad (2.7)$$
Thus, every solution trajectory of (2.1) converges to the set of minimax strategy profiles in $G$.
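To illustrate Theorem 2.3 numerically (this sketch is not from the paper), one can discretize the inclusion (2.1) with a forward Euler step and watch the energy (2.4) decay approximately like $e^{-t}$; the payoff matrix below is an arbitrary example.

```python
# Sketch (not from the paper): forward-Euler discretisation of the
# best-response inclusion (2.1) for a matrix game, tracking the energy (2.4).
import numpy as np

U = np.array([[3.0, -1.0], [-2.0, 4.0]])   # arbitrary example payoff matrix
rng = np.random.default_rng(0)
x1 = rng.dirichlet(np.ones(U.shape[0]))    # initial mixed strategies
x2 = rng.dirichlet(np.ones(U.shape[1]))
dt, T = 0.01, 10.0

for step in range(int(T / dt)):
    b1 = np.zeros_like(x1); b1[np.argmax(U @ x2)] = 1.0   # a best response of player 1
    b2 = np.zeros_like(x2); b2[np.argmin(x1 @ U)] = 1.0   # a best response of player 2
    x1 += dt * (b1 - x1)                                  # Euler step for (2.1)
    x2 += dt * (b2 - x2)
    w = np.max(U @ x2) - np.min(x1 @ U)                   # energy (2.4)

print("final energy", w, "strategies", x1, x2)
```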

Sketch Proof: $\dot w = H'(x^2; \dot x^2) - L'(x^1; \dot x^1)$, where $f'(x^i; \dot x^i)$ denotes the one-sided directional derivative of $f$ at $x^i$ in the direction $\dot x^i$. The best-response strategies picked along the solution trajectory at time $t$ are denoted by $b^1_t \in br^1(x^2_t)$ and $b^2_t \in br^2(x^1_t)$. Then, by the envelope theorem, we have $\dot w = u(b^1, \dot x^2) - u(\dot x^1, b^2)$. From (2.1), it follows that
$$\dot w = u(b^1, b^2 - x^2) - u(b^1 - x^1, b^2) = -u(b^1, x^2) + u(x^1, b^2) = -w.$$

Hofbauer and Sorin (2006) have extended this convergence result to continuous concave-convex zero-sum games, i.e., games in which $u(x^1, x^2)$ is concave in $x^1$ for each fixed $x^2$ and convex in $x^2$ for each fixed $x^1$. Barron et al. (2009) further extend the convergence result to all continuous quasiconcave-quasiconvex zero-sum games for best-response dynamics on convex/concave envelopes of the payoff function.

2.2 Zero-sum Stochastic Games

A two-player zero-sum discounted-payoff stochastic game is a tuple $\Gamma = \langle I, S, A, P, r, \omega \rangle$ constructed as follows.

Let $I = \{1, 2\}$ be the set of players.

Let $S$ be a finite set of states. For each player $i$ in state $s$, $A^i_s$ denotes her finite set of actions. For each state $s$, we write $A_s := A^1_s \times A^2_s$ for the set of action pairs.

For each pair of states $(s, s')$ and each action pair $a \in A_s$, $P_{s,s'}(a)$ is the transition probability from state $s$ to state $s'$ given the action pair $a$.

We define $r_s(\cdot)$ to be the stage payoff function for player 1. That is, when the process is at state $s$, $r_s(a)$ is the stage payoff to player 1 for the action pair $a \in A_s$. Note that, in a zero-sum game, player 2 always receives stage payoff $-r_s(a)$.

$\omega$ is a discount factor that determines the importance of future stage payoffs relative to the current stage payoff.

In any state $s$, player $i$ can play a mixed action $\pi^i_s \in \Delta(A^i_s)$: $\pi^i_s(a^i)$ denotes the probability that, when in state $s$, player $i$ selects action $a^i \in A^i_s$. In this paper, we only consider stationary strategies for both players. A stationary strategy $\pi^i \in \Delta^i := \prod_{s \in S} \Delta(A^i_s)$ of player $i$ specifies, for each state $s$, a mixed action $\pi^i_s$ to be played whenever the state is $s$. For convenience, we may write $\Delta^i_s$ for $\Delta(A^i_s)$. We denote a strategy profile by $\pi = (\pi^1, \pi^2) = ((\pi^1_s)_{s \in S}, (\pi^2_s)_{s \in S})$, and the set of strategy profiles by $\Delta := \Delta^1 \times \Delta^2$.

Given a strategy profile $\pi$, for any state $s$, we may write
$$r_s(\pi) = r_s(\pi^1_s, \pi^2_s) = \sum_{a \in A_s} \pi^1_s(a^1)\,\pi^2_s(a^2)\,r_s(a),$$
and a similar convention applies to the transition probability $P_{s,s'}(\pi)$ from state $s$ to $s'$. We can then define the expected discounted payoff for player 1 starting in state $s$ under the strategy profile $\pi$ as
$$u_s(\pi) := E\Bigl[(1-\omega) \sum_{n=0}^{\infty} \omega^n r_{s_n}(\pi) \,\Big|\, s_0 = s\Bigr], \qquad (2.8)$$
where $\{s_n\}_{n \in \mathbb{N}}$ is the stochastic process representing the state at each iteration, and the factor $(1-\omega)$ normalizes the discounted payoff. Of course, player 2 has expected discounted payoff $-u_s(\pi)$. Define
$$b_1 := \min_{s \in S,\, a \in A_s} r_s(a), \qquad b_2 := \max_{s \in S,\, a \in A_s} r_s(a), \qquad B := [b_1, b_2]. \qquad (2.9)$$
Then $B$ is the set of achievable discounted payoffs in $\Gamma$, and $u_s(\pi) \in B$ for any strategy profile $\pi$ and any starting state $s$.

A Nash equilibrium $\tilde\pi$ requires that, for both players $i = 1, 2$ and all states $s$ in $S$,
$$u^i_s(\tilde\pi) \ge u^i_s(\pi^i, \tilde\pi^{-i}), \quad \forall \pi^i \in \Delta^i. \qquad (2.10)$$
Note that in a zero-sum stochastic game $\Gamma$, any strategy of any player $i$ in a Nash equilibrium is a minimax strategy of that player. Shapley (1953) proves that for every two-player zero-sum discounted-payoff stochastic game $\Gamma$ and every starting state $s$, there exists a unique optimal value $V_s$, called the value of state $s$, equal to the expected discounted payoff that player 1 can guarantee by any minimax strategy. Shapley (1953) further shows the existence of a stationary minimax strategy profile, and that for any stationary minimax strategy profile $\tilde\pi$, $V_s$ satisfies the equations
$$V_s = (1-\omega)\, r_s(\tilde\pi) + \omega \sum_{s' \in S} P_{s,s'}(\tilde\pi)\, V_{s'}, \quad \forall s \in S. \qquad (2.11)$$
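Because the strategies are stationary, the normalized payoff (2.8) can be computed without simulating the process: it solves the linear system $u = (1-\omega)\, r(\pi) + \omega P(\pi)\, u$. The following sketch (an illustration, not part of the paper; numpy assumed) does exactly that for given averaged stage payoffs and transition matrix.

```python
# Sketch (not from the paper): the discounted payoff (2.8) of a fixed
# stationary profile solves  u = (1 - omega) * r(pi) + omega * P(pi) u.
# r_pi[s] and P_pi[s, s'] are the stage payoff and transition probabilities
# already averaged under pi.
import numpy as np

def discounted_payoff(r_pi, P_pi, omega):
    """r_pi: (S,) stage payoffs under pi; P_pi: (S, S) transition matrix under pi."""
    S = len(r_pi)
    return np.linalg.solve(np.eye(S) - omega * P_pi, (1 - omega) * r_pi)

# Two-state toy example (hypothetical numbers, for illustration only).
r_pi = np.array([1.0, 0.0])
P_pi = np.array([[0.9, 0.1],
                 [0.5, 0.5]])
print(discounted_payoff(r_pi, P_pi, omega=0.9))
```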

We can also study the asymptotic behavior in a stochastic game $\Gamma(\omega)$ as $\omega$ increases to 1. Given a finite stochastic game, at each state $s \in S$ the asymptotic value $\lim_{\omega \to 1} V_s(\omega)$ exists; see Bewley and Kohlberg (1976) and Mertens and Neyman (1981).

Denote the sets of stationary minimax strategies of player 1 and player 2 in the stochastic game by $X^1$ and $X^2$, respectively, and the set of stationary minimax strategy profiles by $X$.

2.3 A State Game

If every stage payoff after the initial stage is a constant, then the stochastic game reduces to a one-shot normal-form game. Given a zero-sum stochastic game $\Gamma$, for any vector $\bar u = (\bar u_s)_{s \in S}$ of finite numbers, we define for each state $s$ a normal-form zero-sum state game $G_s(\bar u)$ on the action sets $A^1_s$ and $A^2_s$ of players 1 and 2, respectively: the payoff function of player 1 in $G_s(\bar u)$ is
$$z^{\bar u}_s(a) := (1-\omega)\, r_s(a) + \omega \sum_{s' \in S} P_{s,s'}(a)\, \bar u_{s'}, \quad a \in A_s. \qquad (2.12)$$
As $G_s(\bar u)$ is a zero-sum game, player 2 receives payoff $-z^{\bar u}_s(a)$. We view an action at state $s$ in $\Gamma$ as a strategy in the state game $G_s(\bar u)$. The payoff function above can be linearly extended to mixed strategy profiles in $\Delta(A_s)$. We call the vector $\bar u$ in (2.12) the continuation payoff vector of the state game. We denote the value of this state game by $v_s(\bar u)$ and a minimax strategy of player $i$ by $x^i_s(\bar u)$, for $i = 1, 2$. We call $G_s(V)$ the value state game at state $s$, where $V = (V_{s'})_{s' \in S}$ is defined in (2.11).

To define a best-response dynamic in a stochastic game and to show its convergence, we will apply continuous-time best-response dynamics in state games. We are now ready to study three classes of continuous-time best-response dynamics in zero-sum stochastic games with discounted payoff.
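Equations (2.11) and (2.12) suggest the classical way to compute the values $V_s$: iterate the map that replaces $\bar u_s$ by the value of the state game $G_s(\bar u)$. The sketch below (an illustration, not part of the paper) implements this value iteration, with the matrix-game value computed by the same LP as in the earlier sketch; scipy is again an assumption, and the data layout (r[s] a payoff matrix, P[s][(i, j)] a probability vector over states) is hypothetical.

```python
# Sketch (not from the paper): Shapley-style value iteration.  Each step builds
# the state games G_s(u) of (2.12) from the current payoff vector u and replaces
# u_s by the value of G_s(u); the fixed point satisfies (2.11).
import numpy as np
from scipy.optimize import linprog

def matrix_game_value(U):
    m, n = U.shape
    c = np.zeros(m + 1); c[-1] = -1.0
    A_ub = np.hstack([-U.T, np.ones((n, 1))])
    A_eq = np.concatenate([np.ones(m), [0.0]]).reshape(1, -1)
    bounds = [(0, None)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(n), A_eq=A_eq, b_eq=[1.0], bounds=bounds)
    return res.x[-1]

def shapley_value_iteration(r, P, omega, iters=200):
    """r[s]: payoff matrix at state s; P[s][(i, j)]: distribution over next states
    for the action pair (i, j) at s.  Returns approximate values V_s."""
    S = len(r)
    u = np.zeros(S)
    for _ in range(iters):
        u_new = np.empty(S)
        for s in range(S):
            m, n = r[s].shape
            z = np.array([[(1 - omega) * r[s][i, j] + omega * P[s][(i, j)] @ u
                           for j in range(n)] for i in range(m)])   # payoff (2.12)
            u_new[s] = matrix_game_value(z)
        u = u_new
    return u
```

Since this map is a contraction with modulus $\omega$ in the sup norm, the iterates converge geometrically to $(V_s)_{s \in S}$.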

3 Best-response Dynamics in Zero-sum Stochastic Games

All three classes of continuous-time best-response dynamics below may be viewed as variations of an agent-form best-response dynamic; see Appendix B. In these dynamics, each player in the stochastic game is represented by an agent in each state, and the two agents at a state continuously play best responses against each other in that state game under some condition, while all other agents play simultaneously in all other states. The profile of the strategies of player $i$'s agents in all state games at any time $t$ is the evolving (stationary) strategy of player $i$ in the stochastic game at time $t$ in the dynamic.

3.1 Stopping-time Best-response Dynamics

For each state of the zero-sum stochastic game $\Gamma$, we construct below a continuous-time best-response dynamic in that state game, in which the continuation payoff vector is updated at countably many discrete times. To be specific, the continuation payoffs in each state game are updated at a sequence of stopping times defined in terms of the energy $w_s$ in that state game.

We choose an arbitrary positive number $\mu \in (0, 1)$, and we first pick an arbitrary payoff vector $\bar u_0$ with each element $\bar u_{s,0} \in B$. Given any initial condition $(x_s(0))_{s \in S}$, for each state $s$, consider a continuous-time best-response dynamic $(x_s(t))_{0 \le t \le T_1}$ defined as in (2.1) in the state game $G_s(\bar u_0)$, where $T_1$ is defined in (3.2) below. If there is no ambiguity, we abbreviate the payoff function of $G_s(\bar u_0)$ to $z^0_s(\cdot)$. We let $w_s(t)$ denote the energy in $G_s(\bar u_0)$ at time $t \in (0, T_1]$. Note that for $t \in (0, T_1]$, $x_s(t)$ is a minimax strategy profile in the state game $G_s(\bar u_0)$ if and only if $w_s(t) = 0$. Moreover, from (2.5), it follows that
$$|z^0_s(x_s(t)) - v_s(\bar u_0)| \le w_s(t). \qquad (3.1)$$
We stop the best-response dynamic at time
$$T_1 := \min\{\, t \ge 0 : \max_{s \in S} w_s(t) \le \mu \,\}, \qquad (3.2)$$
and record $\bar u_1 := (z^0_s(x_s(T_1)))_{s \in S}$. We then run the best-response dynamics in the state games $G_s(\bar u_1)$ at all states $s$ for all $t \in (T_1, T_2]$, with
$$T_2 := \min\{\, t \ge T_1 : \max_{s \in S} w_s(t) \le \mu/2 \,\},$$
where $w_s(t)$ is now defined in $G_s(\bar u_1)$. After recording $\bar u_2 := (z^1_s(x_s(T_2)))_{s \in S}$, we then run best-response dynamics in the state games $G_s(\bar u_2)$, and so on. For completeness, we let $T_0 = 0$.

In the best-response dynamics $(x_s(t))_{s \in S}$ thus defined, there is an increasing sequence $(T_n)_{n \in \mathbb{N}}$, possibly with $T_n = T_{n+1}$ for some $n$, such that for each $n \ge 0$,
$$T_{n+1} := \min\{\, t \ge T_n : \max_{s \in S} w_s(t) \le \mu/2^n \,\}, \qquad (3.3)$$
where $w_s(t)$ is defined in $G_s(\bar u_n)$, and $\bar u_{n+1} := (z^n_s(x_s(T_{n+1})))_{s \in S}$ is defined recursively. For every finite time $t \ge 0$, there is $n \ge 0$ such that
$$T_n \le t \le T_{n+1}. \qquad (3.4)$$
Note that we run the best-response dynamic in all state games $G_s(\bar u_n)$ with this $n$ at such $t$. If there is no ambiguity about $n$, we further denote
$$y_s(t) := z^n_s(x_s(t)), \quad s \in S,$$
at each $t \in (T_n, T_{n+1}]$.

Under the stopping-time best-response dynamics $(x_s(t))_{s \in S, t \ge 0}$ thus defined, at each state $s$, $x_s(t)$ is Lipschitz continuous except at countably many times, and $w_s(t)$ is continuous on every interval $(T_n, T_{n+1}]$ for $n \ge 0$, but possibly discontinuous at some $T_n$.
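A discrete-time caricature of this construction (not from the paper; same hypothetical data layout as the sketches above) keeps $\bar u$ fixed, runs Euler steps of (2.1) in every state game until the largest energy falls below the current threshold $\mu/2^n$, then records the state-game payoffs as the new continuation vector.

```python
# Sketch (not from the paper): Euler caricature of the stopping-time dynamic.
import numpy as np

def state_game_payoff(r_s, P_s, u_bar, omega):
    m, n = r_s.shape
    return np.array([[(1 - omega) * r_s[i, j] + omega * P_s[(i, j)] @ u_bar
                      for j in range(n)] for i in range(m)])        # (2.12)

def stopping_time_br(r, P, omega, mu=0.5, rounds=8, dt=0.01, inner_steps=20000):
    S = len(r)
    u_bar = np.zeros(S)
    x1 = [np.ones(r[s].shape[0]) / r[s].shape[0] for s in range(S)]
    x2 = [np.ones(r[s].shape[1]) / r[s].shape[1] for s in range(S)]
    for n in range(rounds):
        threshold = mu / 2 ** n
        # BR dynamics with u_bar held fixed; the iteration cap is a safeguard,
        # since the Euler discretisation cannot reach arbitrarily small energies.
        for _ in range(inner_steps):
            w = np.zeros(S)
            for s in range(S):
                z = state_game_payoff(r[s], P[s], u_bar, omega)
                b1 = np.zeros_like(x1[s]); b1[np.argmax(z @ x2[s])] = 1.0
                b2 = np.zeros_like(x2[s]); b2[np.argmin(x1[s] @ z)] = 1.0
                x1[s] += dt * (b1 - x1[s]); x2[s] += dt * (b2 - x2[s])
                w[s] = np.max(z @ x2[s]) - np.min(x1[s] @ z)
            if w.max() <= threshold:                  # stopping time, cf. (3.3)
                break
        u_bar = np.array([x1[s] @ state_game_payoff(r[s], P[s], u_bar, omega) @ x2[s]
                          for s in range(S)])          # record the new continuation vector
    return u_bar, x1, x2
```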

Lemma 3.1. Each $T_n$ is bounded.

Proof. From the definition of $b_1$, $b_2$, and $B$ in (2.9), it follows that a continuation payoff $\bar u_{s,n}$ is always in $B$. Thus, at any state $s$ and any time $t \ge 0$,
$$0 \le w_s(t) \le b_2 - b_1. \qquad (3.5)$$
From Theorem 2.3, it follows that $\dot w_s(t) = -w_s(t)$ for almost all $t$ in $[T_n, T_{n+1})$ for any $n \ge 0$. Therefore, at any state $s$,
$$w_s(t) = w_s(T_n)\, \exp(-(t - T_n)), \quad T_n < t \le T_{n+1},\ n \ge 0. \qquad (3.6)$$
From (3.5) and (3.3), we have $\lim_{t \downarrow 0} w_s(T_n + t) \le b_2 - b_1$ and $\max_{s \in S} w_s(T_{n+1}) = \mu/2^n$ when $T_n < T_{n+1}$. For such a pair $(T_n, T_{n+1})$, we may then further infer from (3.6) that
$$T_{n+1} - T_n \le \ln(b_2 - b_1) - \ln\frac{\mu}{2^n}. \qquad (3.7)$$

Theorem 3.2. For each state $s$, as $t \to \infty$, $y_s(t) \to V_s$, and $x^1(t)$ and $x^2(t)$ converge to the sets of stationary minimax strategies of players 1 and 2, respectively, in the stochastic game $\Gamma$.

Proof. The proof is analogous to the argument used in Shapley (1953). For each $s \in S$ and $n \ge 0$, denote by $v_{s,n}$ the value of the state game $G_s(\bar u_n)$. Note that $\bar u_{s,n+1}$ is the payoff to player 1 in the state game $G_s(\bar u_n)$ at time $t = T_{n+1}$. After time $T_{n+1}$, the state game $G_s(\bar u_n)$ transforms into $G_s(\bar u_{n+1})$ with continuation payoffs $y_s(T_{n+1}) = \bar u_{s,n+1}$ for each state $s$ in $S$. From (3.3), it follows that at each state $s$, for each $n \ge 0$,
$$|\bar u_{s,n+1} - v_{s,n}| \le \frac{\mu}{2^n}. \qquad (3.8)$$
Therefore, for all $n > 0$,
$$|\bar u_{s,n+1} - \bar u_{s,n}| \le \frac{\mu}{2^n} + \frac{\mu}{2^{n-1}} + |v_{s,n} - v_{s,n-1}|. \qquad (3.9)$$
Comparing the state games $G_s(\bar u_{n-1})$ and $G_s(\bar u_n)$, we find that each element of the payoff matrix changes by at most $\omega \max_{s' \in S} |\bar u_{s',n-1} - \bar u_{s',n}|$. Hence, by Lemma 2.2,
$$\max_{s \in S} |v_{s,n-1} - v_{s,n}| \le \omega \max_{s \in S} |\bar u_{s,n-1} - \bar u_{s,n}|. \qquad (3.10)$$
From (3.9) and (3.10), it follows that
$$\max_{s \in S} |\bar u_{s,n+1} - \bar u_{s,n}| \le \frac{3\mu}{2^n} + \omega \max_{s \in S} |\bar u_{s,n-1} - \bar u_{s,n}|.$$
By iteration, we find that for any $n > 0$,
$$\max_{s \in S} |\bar u_{s,n+1} - \bar u_{s,n}| \le \omega^n \Bigl(\max_{s \in S} |\bar u_{s,1} - \bar u_{s,0}|\Bigr) + \sum_{k=1}^{n} \omega^{n-k}\, \frac{3\mu}{2^k}. \qquad (3.11)$$
Note that for fixed $n$, $\sum_{k=1}^{n} \omega^{n-k}/2^k$ is increasing with respect to $\omega \in (0, 1)$. It then follows that
$$\sum_{k=1}^{n} \frac{\omega^{n-k}}{2^k} < \begin{cases} \dfrac{\omega^n}{2\omega - 1}, & \text{when } \omega > 0.75, \\ 2\,(0.75)^n, & \text{when } \omega \le 0.75. \end{cases}$$
Thus, from (3.11), as $n$ increases to $\infty$, $\max_{s \in S} |\bar u_{s,n+1} - \bar u_{s,n}|$ decreases to 0. From (3.8), it then follows that
$$\max_{s \in S} |\bar u_{s,n} - v_{s,n}| \to 0. \qquad (3.12)$$
For each state $s$, as the state game $G_s(\bar u_{n-1})$ transforms into $G_s(\bar u_n)$, $w_s(T_n)$ jumps by at most $2\omega \max_{s' \in S} |\bar u_{s',n} - \bar u_{s',n-1}|$, i.e.,
$$\lim_{t \downarrow 0} w_s(T_n + t) - w_s(T_n) \le 2\omega \max_{s' \in S} |\bar u_{s',n} - \bar u_{s',n-1}|.$$
From (3.3), it then follows that
$$w_s(t) \le 2\omega \max_{s' \in S} |\bar u_{s',n} - \bar u_{s',n-1}| + \frac{\mu}{2^{n-1}}$$
for all $t \in (T_n, T_{n+1}]$.

Thus, $w_s(t)$ decreases to 0 as $t$ increases to $\infty$. Hence $\max_{s \in S} |y_s(t) - \bar u_{s,n}|$ decreases to 0, where $n$ is defined with respect to $t$ as in (3.4). Therefore, by (3.12) and (2.11), for all $s \in S$, $y_s(t) \to V_s$ as $t \to \infty$. The convergence of $x^i(t)$ to $X^i$ for $i = 1, 2$ follows the standard arguments in Shapley (1953).

Comment: By (3.7), we can define for the best-response dynamic a sequence of bounded stopping times independent of $(w_s(t))_{s \in S}$ such that Theorem 3.2 still holds.

3.2 Closed-loop Best-response Dynamics

Inspired by Shapley (1953) and the stopping-time best-response dynamics, we study a continuous-time dynamical system in which the continuation payoff vector is slowly and continuously affected by the current payoffs in all state games. There is a closed loop between the continuation payoffs and the state-game payoffs, and each player's strategy in each state game is always moving towards her current best response there.

Given a zero-sum stochastic game $\Gamma$, we adapt the continuous-time best-response dynamics $(x_s(t))_{t \ge 0}$ in each evolving state game in the following way. Pick an arbitrary $\bar u(0) = (\bar u_s(0))_{s \in S}$ with $\bar u_s(0) \in B$ for every $s \in S$, where $B$ is defined in (2.9). Suppose that the initial stationary strategy profile $(x_s(0))_{s \in S}$ is given. At each time $t \ge 0$, for each state $s \in S$, we consider the state game $G_s(t)$ with continuation payoff vector $\bar u(t)$ defined by the dynamical system
$$\dot{\bar u}_s(t) = \frac{y_s(t) - \bar u_s(t)}{1 + t}, \qquad (3.13)$$
$$\dot x^i_s \in br^i(x^{-i}_s) - x^i_s, \quad i = 1, 2, \qquad (3.14)$$
where $y_s(t) := z^{\bar u(t)}_s(x_s(t))$ is the payoff to player 1 in the state game $G_s(t)$. We call the dynamical system so defined a closed-loop best-response dynamic. Equation (3.14) says that the best-response dynamics defined in (2.1) are played in the state game $G_s(t)$ at every time $t \ge 0$, while the continuation payoff vector of $G_s(t)$ continuously evolves according to (3.13). Thus, there is feedback from $y_s$ to $\bar u_s$ such that $\bar u_s$ is always moving towards $y_s$, though more and more slowly. We may view this as fictitious play applied to the continuation payoff vector with respect to the state-game payoff vector.
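For intuition only (not from the paper), a forward-Euler discretization of the closed-loop system (3.13)-(3.14) looks as follows; the data layout is the same hypothetical one used in the earlier sketches.

```python
# Sketch (not from the paper): Euler discretisation of the closed-loop dynamic.
# Strategies follow best responses in the current state games while the
# continuation vector u_bar drifts towards the state-game payoffs at rate 1/(1+t).
import numpy as np

def closed_loop_br(r, P, omega, T=200.0, dt=0.01):
    S = len(r)
    u_bar = np.zeros(S)
    x1 = [np.ones(r[s].shape[0]) / r[s].shape[0] for s in range(S)]
    x2 = [np.ones(r[s].shape[1]) / r[s].shape[1] for s in range(S)]
    t = 0.0
    while t < T:
        y = np.zeros(S)
        for s in range(S):
            m, n = r[s].shape
            z = np.array([[(1 - omega) * r[s][i, j] + omega * P[s][(i, j)] @ u_bar
                           for j in range(n)] for i in range(m)])   # G_s(t), cf. (2.12)
            b1 = np.zeros(m); b1[np.argmax(z @ x2[s])] = 1.0
            b2 = np.zeros(n); b2[np.argmin(x1[s] @ z)] = 1.0
            x1[s] += dt * (b1 - x1[s])                               # (3.14)
            x2[s] += dt * (b2 - x2[s])
            y[s] = x1[s] @ z @ x2[s]                                 # y_s(t)
        u_bar += dt * (y - u_bar) / (1.0 + t)                        # (3.13)
        t += dt
    return u_bar, x1, x2
```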

By an argument similar to that for (2.1), from any initial condition $(x_s(0), \bar u_s(0))_{s \in S}$, there exists a solution trajectory $(x_s(t), \bar u_s(t))_{s \in S, t \ge 0}$, where $x_s(t)$ and $\bar u_s(t)$ are Lipschitz continuous and satisfy the system for almost all $t \ge 0$ at all states $s$; see Aubin and Cellina (1984).

For each state $s$ in $S$ and each time $t \ge 0$, denote the value of the state game $G_s(t)$ by $v_s(t)$. We still define $w_s(t)$ to be the energy in $G_s(t)$, as in (2.4).

Lemma 3.3. For each state $s$ in $S$, as $t$ increases to infinity, $y_s(t)$ converges to $v_s(t)$ in the state game $G_s(t)$, and $w_s(t)$ decreases to 0.

Proof. We prove this by the standard results on the convergence of best-response dynamics in normal-form zero-sum games. Suppose that a finite number $\epsilon > 0$ is given. The definition of $b_1$ and $b_2$ in (2.9) implies that at any state $s$, $|y_s(t) - \bar u_s(t)| \le b_2 - b_1$ for all $t \ge 0$. Therefore, it follows from (3.13) that there exists $t_\epsilon > 0$ such that
$$|\dot{\bar u}_s(t)| \le \epsilon, \quad \forall t \ge t_\epsilon,\ \forall s \in S. \qquad (3.15)$$
On the one hand, from Harris (1994) and Hofbauer and Sorin (2006), we know that if $\bar u_s(t)$ and $w_s(t)$ are differentiable and $\dot{\bar u}_s(t) = 0$ holds for all states $s$ at some time $t$, then $\dot w_s(t) = -w_s(t)$ at all $s$. On the other hand, from Lemma 2.1, it follows that given any period of time $[t_1, t_2]$, if $\max_{s \in S} |\bar u_s(t_1) - \bar u_s(t_2)| \le c$ for some $c > 0$ and $x(t_1) = x(t_2)$ in the state games $G_s(t_1)$ and $G_s(t_2)$, then $|w_s(t_1) - w_s(t_2)| \le 2\omega c$. From these two observations, it follows that the total derivative of $w_s$ satisfies
$$\dot w_s \le -w_s + 2\omega \max_{s' \in S} |\dot{\bar u}_{s'}|$$
for almost all $t \ge 0$. Taking (3.15) into account, we can find a time $T_\epsilon > t_\epsilon$ such that $w_s(t) \le 2\epsilon$ for all states $s$ and all times $t \ge T_\epsilon$. Note that $\epsilon$ in (3.15) can be arbitrarily small. Hence, $y_s(t)$ converges to $v_s(t)$ in the evolving state game $G_s(t)$, and $w_s(t)$ converges to 0.

Recall that for all states $s$, $\bar u_s(t)$ and $v_s(t)$ are Lipschitz continuous and hence differentiable almost everywhere. For convenience, in the lemmata below we mean the right derivative whenever the derivative of $\bar u_s$ or $v_s$ does not exist. We take an arbitrarily small $\epsilon > 0$. Here is some preparation for the next lemma.

For any time $t \ge 0$, we mark a state
$$\hat s(t) \in \arg\max_{s \in S} |y_s(t) - \bar u_s(t)|. \qquad (3.16)$$
For any time $t \ge 0$, we also mark a state
$$\tilde s(t) \in \arg\max_{s \in S} |v_s(t) - \bar u_s(t)|. \qquad (3.17)$$
From Lemma 3.3, there exists a time $t_1$ such that for all $t \ge t_1$ and all states $s$ in $S$,
$$|y_s(t) - v_s(t)| \le \frac{(1-\omega)\epsilon}{64}. \qquad (3.18)$$

Lemma 3.4. For any time $t \ge t_1$, if
$$|\bar u_{\hat s(t)}(t) - y_{\hat s(t)}(t)| \ge \frac{\epsilon}{4}, \qquad (3.19)$$
then for any state $s$ with the property
$$|\bar u_s(t) - v_s(t)| \ge |\bar u_{\tilde s(t)}(t) - v_{\tilde s(t)}(t)| - \frac{(1-\omega)\epsilon}{32}, \qquad (3.20)$$
we have
$$\frac{d}{dt}|\bar u_s - v_s| \le -\frac{1}{2}(1-\omega)\Bigl|\frac{d\bar u_{\hat s(\cdot)}}{dt}\Bigr| \le -\frac{(1-\omega)\epsilon}{8(1+t)}. \qquad (3.21)$$

This lemma says that, under condition (3.19), at any state $s$ with property (3.20) the distance $|\bar u_s - v_s|$ is decreasing at a speed at least linear in $1/(1+t)$.

Proof. From Lemma 2.2, the definition of $\hat s(t)$, and the differential equation (3.13), it follows that
$$\forall s \in S, \quad \Bigl|\frac{dv_s}{dt}\Bigr| \le \omega \Bigl|\frac{d\bar u_{\hat s(\cdot)}}{dt}\Bigr|. \qquad (3.22)$$
For a state $s$ with property (3.20) at time $t \ge t_1$, we may infer from the definition of $t_1$, from (3.18), and from the definition of $\tilde s(t)$ that
$$|\bar u_s(t) - y_s(t)| \ge |\bar u_{\hat s(t)}(t) - y_{\hat s(t)}(t)| - \frac{(1-\omega)\epsilon}{16}.$$
Thus, from (3.19), it follows that
$$\Bigl|\frac{d\bar u_s}{dt}\Bigr| \ge \frac{|\bar u_{\hat s(t)}(t) - y_{\hat s(t)}(t)| - (1-\omega)\epsilon/16}{1+t} \ge \Bigl(\omega + \frac{1-\omega}{2}\Bigr)\Bigl|\frac{d\bar u_{\hat s(\cdot)}}{dt}\Bigr|, \qquad (3.23)$$
since $(1-\omega)\epsilon/16 \le \tfrac{1-\omega}{4}\,|\bar u_{\hat s(t)}(t) - y_{\hat s(t)}(t)|$ by (3.19). Note that $\bar u_s$ is moving towards $v_s$, regardless of the movement of $v_s$. By (3.22), we have
$$\frac{d}{dt}|\bar u_s - v_s| \le -\Bigl(\omega + \frac{1-\omega}{2}\Bigr)\Bigl|\frac{d\bar u_{\hat s(\cdot)}}{dt}\Bigr| + \omega \Bigl|\frac{d\bar u_{\hat s(\cdot)}}{dt}\Bigr| = -\frac{1-\omega}{2}\Bigl|\frac{d\bar u_{\hat s(\cdot)}}{dt}\Bigr|.$$
We complete the proof by (3.19).

Lemma 3.5. There exists a time $t(\epsilon)$ such that for all $t \ge t(\epsilon)$, $|\bar u_{\hat s(t)}(t) - y_{\hat s(t)}(t)| \le \epsilon$.

Proof. For any time $t \ge t_1$,
$$|\bar u_{\hat s(t)}(t) - v_{\hat s(t)}(t)| \ge |\bar u_{\tilde s(t)}(t) - v_{\tilde s(t)}(t)| - \frac{(1-\omega)\epsilon}{32},$$
so the state $\hat s(t)$ has property (3.20). From Lemma 3.4, it follows that for all $t \ge t_1$ with property (3.19),
$$\frac{d}{dt}\bigl|\bar u_{\hat s(\cdot)} - v_{\hat s(\cdot)}\bigr| \le -\frac{(1-\omega)\epsilon}{8(1+t)}. \qquad (3.24)$$
Hence, there exists a time period $(t_1, t_2)$ such that (3.24) holds at any $t$ between $t_1$ and $t_2$, and $|\bar u_{\hat s(t_2)}(t_2) - y_{\hat s(t_2)}(t_2)| \le \epsilon/4$.

Recall that $X^i$ is the set of stationary minimax strategies of player $i$ in $\Gamma$.

Theorem 3.6. For each state $s$ and for both players, as time $t$ increases to infinity, both $y_s(t)$ and $\bar u_s(t)$ converge to $v_s(t)$; $x^i_s(t)$ converges to the set of minimax strategies of player $i = 1, 2$ in the state game $G_s(t)$, and hence $x^i(t)$ converges to $X^i$ in the stochastic game $\Gamma$.

Proof. Note that $y_s(t) \to v_s(t)$ by Lemma 3.3. Taking the sequence $(\epsilon/2^n)_{n \ge 0}$ in Lemma 3.5, we find that $\bar u_s(t)$ also converges to $v_s(t)$. We complete the proof by the results in Shapley (1953), in particular the equations (2.11).

We now show that, as the discount factor increases to 1, $\bar u_s(t)$ along any solution trajectory of the following system converges to the asymptotic value of the zero-sum stochastic game:
$$\dot{\bar u}_s(t) = \frac{y_s(t) - \bar u_s(t)}{t + 2}, \qquad (3.25)$$
$$\dot x^i_s \in br^i(x^{-i}_s) - x^i_s, \quad i = 1, 2, \qquad (3.26)$$
$$\dot\omega(t) = \frac{1 - \omega(t)}{(2 + t)\ln(t + 2)}. \qquad (3.27)$$
We call such a dynamic an $\omega$-converging best-response dynamic. Again, one can show the existence of a solution trajectory of this dynamical system. Note that it is straightforward to see from (3.27) that
$$\omega(t) = 1 - \frac{e^{-c}}{\ln(t + 2)}, \qquad (3.28)$$
where $e^{-c}/\ln 2 = 1 - \omega(0)$.
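The closed form (3.28) can be checked by direct differentiation (a short verification added here, not in the original text):
$$\frac{d}{dt}\Bigl(1 - \frac{e^{-c}}{\ln(t+2)}\Bigr) = \frac{e^{-c}}{(t+2)\ln^2(t+2)} = \frac{1}{(2+t)\ln(t+2)}\cdot\frac{e^{-c}}{\ln(t+2)} = \frac{1-\omega(t)}{(2+t)\ln(t+2)},$$
which is exactly (3.27); evaluating (3.28) at $t = 0$ gives $e^{-c}/\ln 2 = 1 - \omega(0)$.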

17 Lemma 3.7. In any ω-converging best-response dynamic, for all ɛ > 0, there exists t(ɛ) such that for all t t(ɛ), u s(t) y s(t) < ɛ. Proof. We firstly observe that Lemma 3.3 still holds for any ω-converging bestresponse dynamic and that both u s (t) and v s (t) are still Lipschitz continuous. We can then find a time t 1 with the property t t 1, s S, y s (t) v s (t) ɛ 64. (3.29) We define δ := max{ b 1, b 2 }, where b 1 and b 2 are defined in (2.9). We then take a time t 2 t 1 such that t t 2, 4δ ɛ ln(t + 2) ɛ 8. (3.30) We still use the notation of s(t) and s(t) introduced in (3.16) and (3.17), respectively. Suppose that at a time t t 2 Then, by (3.25) y s(t) (t) u s(t) (t) ɛ/4. (3.31) du s( ) dt ɛ 4(2 + t), (3.32) at this t. We now consider both v s( ) and u s( ) as functions of ω and t. (When in closed-loop best-response dynamics, they are functions of t only.) From Lemma 2.2 and (3.30), it follows that at this t v s( ) ω du s( ) dt dω dt δ(1 ω(t)) (t+2) ln(t+2) ɛ 4(2+t) On the other hand, (3.21) implies that v s( ) u s( ) t = 4δ(1 ω(t)) ɛ ln(t + 2) 1 ω(t) 2 du s( ) dt 1 ω(t). (3.33) 8. (3.34) Note that u s( ) is moving to v s( ) regardless of the movement of v s( ). Thus, from (3.33), (3.34), and (3.32), it follows that d v s( ) u s( ) dt 3(1 ω(t)) 8 We may further deduce from (3.28) that where c is defined in (3.28). d v s( ) u s( ) dt du s( ) dt ω(t)) 3ɛ(1. (3.35) 32(t + 2) 3ɛe c 32(t + 2) ln(t + 2), (3.36) Thus, there exists time t 3 t 2 such that (3.36) holds for all t between t 2 and t 3, and y s(t3 )(t 3 ) u s(t3 )(t 3 ) ɛ/4. 17

Theorem 3.8. For each state $s$, as time $t$ increases to infinity in the $\omega$-converging best-response dynamic, $y_s(t)$, $\bar u_s(t)$, and $v_s(t)$ all converge to the asymptotic value at state $s$ of the stochastic game as the discount factor $\omega$ increases to 1.

Proof. This follows from Lemma 3.7 and Theorem 3.6.

3.3 Open-loop Best-response Dynamics

In contrast to the closed-loop best-response dynamics, for open-loop ones in zero-sum stochastic games, the continuation payoff vector in each state game is equal to the expected discounted payoff generated by the current stationary strategy profile in the stochastic game starting from that state.

3.3.1 The Dynamics

Given a stationary strategy profile $\pi$, recall the expected discounted payoff $u_s(\pi)$ at each state $s$ defined in (2.8). Denote the vector $(u_s(\pi))_{s \in S}$ by $u(\pi)$. For each state $s$, we denote the payoff function of player 1 in the state game $G_s(u(\pi))$ by $Q_s$, i.e., for an action pair $a \in A_s$ and the current strategy profile $\pi$,
$$Q_s(\pi, a) := (1-\omega)\, r_s(a) + \omega \sum_{s' \in S} P_{s,s'}(a)\, u_{s'}(\pi). \qquad (3.37)$$
We can linearly extend this payoff function to $Q_s(\pi, \rho_s)$ for a strategy profile $\pi$ and $\rho_s \in \Delta(A_s)$ at state $s$. Note that $Q_s(\pi, \pi_s) = u_s(\pi)$.

Given the current strategy profile $\pi$, the best-response sets of player 1 and player 2 for $Q$ at state $s$ are
$$BR^1_s(\pi) := \operatorname*{argmax}_{\rho^1_s \in \Delta^1_s} Q_s(\pi, \rho^1_s, \pi^2_s) \qquad (3.38)$$
and
$$BR^2_s(\pi) := \operatorname*{argmin}_{\rho^2_s \in \Delta^2_s} Q_s(\pi, \pi^1_s, \rho^2_s), \qquad (3.39)$$
respectively. We denote $\{BR^1_s(\pi), BR^2_s(\pi)\}$ by $BR_s(\pi)$.

The open-loop best-response dynamic in the stochastic game is defined by the differential inclusions
$$\dot\pi^i_s \in BR^i_s(\pi) - \pi^i_s, \quad \forall i,\ \forall s. \qquad (3.40)$$
A solution of the open-loop best-response inclusion, also called an open-loop best-response trajectory, is an absolutely continuous function $\pi(\cdot) : \mathbb{R}_+ \to \Delta$ that satisfies (3.40) for almost all $t$ in $\mathbb{R}_+$.
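As an illustration (not from the paper; numpy assumed, same hypothetical data layout as before), one Euler step of (3.40) can be sketched as follows: solve the linear system behind (2.8) for the current profile, form $Q_s$ as in (3.37), and move each agent towards a best response.

```python
# Sketch (not from the paper): one Euler step of the open-loop inclusion (3.40).
import numpy as np

def open_loop_step(r, P, omega, x1, x2, dt=0.01):
    S = len(r)
    # Stage payoffs and transition matrix induced by the current profile.
    r_pi = np.array([x1[s] @ r[s] @ x2[s] for s in range(S)])
    P_pi = np.array([[sum(x1[s][i] * x2[s][j] * P[s][(i, j)][sp]
                          for i in range(r[s].shape[0])
                          for j in range(r[s].shape[1]))
                      for sp in range(S)] for s in range(S)])
    u = np.linalg.solve(np.eye(S) - omega * P_pi, (1 - omega) * r_pi)   # (2.8)
    for s in range(S):
        m, n = r[s].shape
        Q = np.array([[(1 - omega) * r[s][i, j] + omega * P[s][(i, j)] @ u
                       for j in range(n)] for i in range(m)])           # (3.37)
        b1 = np.zeros(m); b1[np.argmax(Q @ x2[s])] = 1.0                # BR^1_s, (3.38)
        b2 = np.zeros(n); b2[np.argmin(x1[s] @ Q)] = 1.0                # BR^2_s, (3.39)
        x1[s] += dt * (b1 - x1[s])                                      # (3.40)
        x2[s] += dt * (b2 - x2[s])
    return x1, x2, u
```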

An open-loop best-response trajectory starts from the initial strategy profile $\pi(0)$. At any time $t$ along the trajectory, given the current strategy profile $\pi(t)$, each player $i$ calculates the expected discounted payoff $u^i_s(t)$ and the set $BR^i_s(t)$ at every state $s$, based on (2.8) and (3.38) or (3.39). Each player then chooses an element of $BR^i_s(t)$ to generate $\dot\pi^i_s(\cdot)$ at $t$, which specifies the adjustment direction and rate of the mixed action $\pi^i_s(t)$.

3.3.2 The Result

As we can see, the adjustment of a continuation payoff in the open-loop best-response dynamic depends on the movement of the current strategy profile, while in the closed-loop dynamic it depends on the distance between the continuation payoff and the current state-game payoff, as well as on the current time. Perkins (2013) shows the following theorem.

Theorem 3.9. Given any two-player zero-sum stochastic game with
$$\omega \le \frac{1}{1 + \max_{s \in S} \sum_{s' \in S} \max_{a \in A_s} P_{s,s'}(a)}, \qquad (3.41)$$
from any initial strategy profile $\pi_0$, any open-loop best-response trajectory converges to the set of stationary minimax strategy profiles. That is,
$$\lim_{t \to \infty} \pi(t) \in X \quad\text{and}\quad \lim_{t \to \infty} u_s(\pi(t)) = V_s \quad \forall s \in S.$$

The convergence result holds when the discount factor $\omega$ is not too big. In particular, for a zero-sum stochastic game with $|S|$ states, a sufficient condition for the convergence of an open-loop best-response trajectory to $X$ is $\omega < 1/(1 + |S|)$.

Comment: The proof in Perkins (2013) elaborates on the Lyapunov-function approach used in the proof for best-response dynamics in normal-form zero-sum games. To be specific, if the continuation payoffs were fixed in each state game, then the Lyapunov-function technique could be applied in all state games, as in normal-form zero-sum games. We also know that in the dynamic, if the continuation payoff $u_s$ changes by $\delta$, then the energy $w_s$ in the Lyapunov function changes by at most $\omega\delta$. However, the problem is that the contribution to $w_s$ due to the magnitude of $\dot u_s$ may overpower the declining tendency of $w_s$ in the one-shot state game $G_s$. We have so far only found an upper bound on $\dot u_s$ in terms of $Q_s$, the transition probabilities, and $\omega$; see the corresponding lemma in Perkins (2013). In short, at each $s$, $u_s$ in a stochastic game may be convex due to the discount factor and the transition probabilities, while the Lyapunov function can only be applied to a concave (or linear) payoff function; see Hofbauer and Sorin (2006). The closer $\omega$ is to 1 in a stochastic game, and the more diverse the transition probabilities, the more convex $u_s$ can be.
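The sufficient condition (3.41) is easy to evaluate from the transition data alone; a small sketch (not from the paper, same hypothetical layout as above):

```python
# Sketch (not from the paper): the discount bound of (3.41),
#   omega <= 1 / (1 + max_s sum_{s'} max_a P_{s,s'}(a)).
import numpy as np

def open_loop_discount_bound(P, num_states):
    worst = 0.0
    for s in range(num_states):
        # For each destination s', take the largest transition probability over
        # action pairs, then sum over destinations.
        col_max = np.max(np.stack(list(P[s].values())), axis=0)
        worst = max(worst, col_max.sum())
    return 1.0 / (1.0 + worst)
```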

[Figure 1: BR may not be a global best response.]

3.3.3 Discussion

We would like to point out that in a stochastic game $\Gamma$, for player 1, $BR^1_s(\pi)$ in (3.38) is the set of best responses for the payoff $Q_s(\pi, \cdot)$ and a subset of the better responses for the expected discounted payoff $u_s(\pi)$ (see Lemma 3.10), but $BR^1_s(\pi)$ is not necessarily the set of best responses for $u_s(\pi)$.

The example below (see Figure 2) is a one-player stochastic game, a so-called Markov decision process. We may view it as a trivial two-player zero-sum game if we let player 2 have no effect on any $r(s)$ by any means. From any state in this game, the player can move to an adjacent state, or stay at the current state, in either case with probability 1. The stage payoffs are independent of the player's moves: $r(s_1) = 1$, $r(s_5) = 100$, $r(s_2) = r(s_3) = r(s_4) = 0$. Suppose that the initial strategy satisfies $\pi(s_3, s_1) = 1$ and $\pi(s_4, s_2) = 1$, while at all other states the player just stays put at time 0. It follows that
$$Q_{s_3}(\pi(0), (s_3, s_1)) = \omega > Q_{s_3}(\pi(0), (s_3, s_4)) = 0,$$
and hence $(s_3, s_4)$ is not included in $BR_{s_3}(\pi_0)$. However, when $\omega$ is close to 1, the best-response strategy for the expected discounted payoff $u$ at any time $t$ is a constant one, denoted by $\hat\pi$, in which the player always moves towards state $s_5$, i.e., $\hat\pi(s_1, s_3) = \hat\pi(s_3, s_4) = \hat\pi(s_4, s_5) = \hat\pi(s_2, s_4) = \hat\pi(s_5, s_5) = 1$.

The difference between $BR_{s_3}(\pi_0)$ and $\hat\pi$ arises from the fact that $BR_{s_3}(\pi_0)$ is optimal for the payoff in the state game $G_{s_3}(u(0))$, while $\hat\pi$ is optimal for the expected discounted payoff of the player in the whole game. The best-response dynamic in (3.40) is in the agent-form setting: the player has one agent in each state, and each agent chooses an action independently. In contrast, a strategy of the player in the whole game is a sequence of correlated actions. (See Appendix B for more discussion of the agent-form best-response dynamic.)
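The numbers in this example are easy to verify; the following sketch (an illustration added here, not in the paper) computes $u(\pi(0))$ from the linear system behind (2.8) and then the two $Q_{s_3}$ values, assuming $\omega = 0.95$ for concreteness.

```python
# Sketch (not from the paper): numerical check of the one-player example.
# Under the initial policy the player moves s3 -> s1, s4 -> s2 and stays
# elsewhere; the normalised payoffs u_s(pi(0)) then give
# Q_{s3}(pi(0), s3 -> s1) = omega and Q_{s3}(pi(0), s3 -> s4) = 0.
import numpy as np

omega = 0.95
r = np.array([1.0, 0.0, 0.0, 0.0, 100.0])        # r(s1), ..., r(s5)
next_state = {0: 0, 1: 1, 2: 0, 3: 1, 4: 4}      # initial policy pi(0)

P = np.zeros((5, 5))
for s, s_next in next_state.items():
    P[s, s_next] = 1.0
u = np.linalg.solve(np.eye(5) - omega * P, (1 - omega) * r)   # (2.8) under pi(0)

Q_s3_to_s1 = (1 - omega) * r[2] + omega * u[0]
Q_s3_to_s4 = (1 - omega) * r[2] + omega * u[3]
print(Q_s3_to_s1, Q_s3_to_s4)   # approx. omega and 0: the myopic BR avoids s4
```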

However, we can still show that $BR^1_s(\pi)$ is a subset of the better-response strategies for player 1 at $s$. The proof of the following lemma is straightforward.

Lemma 3.10. Given a strategy profile $\pi$ in a zero-sum stochastic game $\Gamma$, for player 1 at any state $s$ and any mixed action $x^1_s \in \Delta(A^1_s)$,
$$Q_s(\pi, x^1_s, \pi^2_s) \ge u_s(\pi) \;\Longrightarrow\; u_s(\pi_{-s}, x^1_s, \pi^2_s) \ge Q_s(\pi, x^1_s, \pi^2_s)$$
and
$$Q_s(\pi, x^1_s, \pi^2_s) \le u_s(\pi) \;\Longrightarrow\; u_s(\pi_{-s}, x^1_s, \pi^2_s) \le Q_s(\pi, x^1_s, \pi^2_s),$$
where $\pi_{-s} = \prod_{s' \ne s} \pi_{s'}$.

Appendix A Minor results on open-loop best-response dynamics

For further research, we present additional results regarding the players' behavior in the open-loop best-response dynamics of a zero-sum stochastic game $\Gamma$ in which each player has at most two actions in each state and the minimax strategy of each player in each value state game is a mixed strategy. To ease the exposition, we consider in this toy game the case $|S| = 2$ for the stochastic game $\Gamma$. The results can be generalized to any finite $S$.

We define the maximum payoff to player 1 given the strategy profile $\pi$ as
$$U^1_s(\pi) := \max_{\rho^1_s \in \Delta^1_s} Q_s(\pi, \rho^1_s, \pi^2_s), \qquad (A.1)$$
and similarly the minimum payoff to player 1 as
$$U^2_s(\pi) := \min_{\rho^2_s \in \Delta^2_s} Q_s(\pi, \pi^1_s, \rho^2_s). \qquad (A.2)$$
For convenience, given a best-response trajectory $(\pi(t))_{t \ge 0}$, we may sometimes write $u_s(t)$ for $u_s(\pi(t))$, $BR^i_s(t)$ for $BR^i_s(\pi(t))$, $U^i_s(t)$ for $U^i_s(\pi(t))$, and, given a mixed action $a^i \in \Delta^i_s$, $Q_s(t, a^i)$ for $Q_s(\pi(t), a^i, \pi^{-i}_s(t))$.

We denote the two states in $S$ by $\alpha$ and $\beta$. When we refer to continuation payoffs $(u, v)$ in a state game $G_s$ at any state $s$, we mean that the continuation payoff of state $\alpha$ is $u$ and that of $\beta$ is $v$. We pick a strategy $\bar\pi^i_s$ in each $X^i_s$ for each state $s$ and each player $i$. Given a best-response trajectory $(\pi(t))_{t \ge 0}$, we put $\delta_s(t) := V_s - u_s(t)$ for $s \in \{\alpha, \beta\}$. The lemmata in this section concern the behavior of player 1, but the dual results hold for player 2.

22 Lemma A.1. If at time t, δ α (t) = max s S δ s (t), then Uα(t) 1 Q α (t, π α) 1 V α ωδ α (t). If this δ α (t) > 0, then for π α (t), π α 1 is a better-response strategy for player 1 in the state game G α with continuation payoffs (V α δ α (t), V β δ α (t)). Proof. Since π α 1 is a component of the minimax strategy of player 1, we may infer that Q α (t, π α) 1 =r α ( π α, 1 πα(t)) 2 + ω P αs ( π α, 1 πα(t))u 2 s (t) s S =r α ( π 1 α, π 2 α(t)) + ω s S P αs ( π 1 α, π 2 α(t))(v s δ s (t)) V α ωδ α (t). Here is a dual lemma. Lemma A.2. If at time t, δ α (t) = min s S δ s (t), then Q α (t, π 1 α) V α ωδ α (t). If this δ α (t) < 0, then for π α (t), π 1 α is not a better-response strategy for player 1 in the state game G α with continuation payoffs (V α δ α (t), V β δ α (t)). Recall that in (3.40) the strategy of a player is alway moving to a current best-response strategy in the best-response dynamic, and that A 1 s = 2 for all s in Γ. At time t, if BR 1 s(π(t)) = {a 1 s}, i.e., the best-response strategy for player 1 in the state game G s (t) is a pure strategy a 1 s, and if the strategy π 1 s X 1 s is a convex combination of a 1 s and π 1 s(t), then the strategy π 1 s(t) is also moving towards π 1 s. In fact, π 1 s(t) is moving towards any better-response strategy when BR 1 s(π(t)) is a singleton. In this case, Q s (t, π i s) > u s (t). Lemma A.3. If u α (t) < V α and δ β (t) δ α (t), then Q α (t, π 1 α) > u α (t), π 1 α(t) is moving towards π 1 α, and BR 1 α(t) is the same as the set of best-response strategies of player 1 in the state game G α with continuation payoffs (V α δ α (t), V β δ α (t)). If it also satisfies that δ α (t) < δ β (t)/ω, (A.3) then Q β (t, π β 1) > u β(t), πβ 1(t) is also moving towards π1 β, and BR1 β (t) is the same as the set of best-response strategies of player 1 in the state game G β with continuation payoffs (V α δ β (t), V β δ β (t)). 22

23 Note that this result is independent of player 2 s strategy π 2 (t). Proof. The conclusion for state α follows from Lemma A.1. At state β, we first observe that 0 < δ β (t) δ α (t) and Q β (t, π β) 1 (A.4) =r β ( π β, 1 πβ(t)) 2 + ωp βα ( π β, 1 πβ(t))(v 2 α δ α (t)) + ωp ββ ( π β, 1 πβ(t))(v 2 β δ β (t)) =V β ωp βα ( π β, 1 πβ(t))δ 2 α (t) ωp ββ ( π β, 1 πβ(t))δ 2 β (t). 1 From (A.3), it follows that for any P ββ ( π β, πβ 2 (t)), δ α (t) < ( 1 ωpββ ( π 1 β, π 2 β (t))) δ β (t) ω ( 1 P ββ ( π 1 β, π 2 β (t))), and thus, Therefore, δ β (t) > ωp βα( π 1 β, π2 β (t))δ α(t) 1 ωp ββ ( π 1 β, π2 β (t)). δ β (t) > ωp βα ( π 1 β, π 2 β(t))δ α (t) + ωp ββ ( π 1 β, π 2 β(t))δ β (t). Combined with (A.4), it follows that π β 1 is a better-response strategy for player 1 in the state game G β (t). Since there are only two pure strategies for each player in G β (t), πβ 1 (t) moving to the best-response strategy is equivalent to moving to a better-response strategy. Finally, by a similar argument to Lemma A.1 and the case of state α above, we reach the conclusion that BRβ 1 (t) is the same as the set of best-response strategies of player 1 in the state game G β with continuation payoffs (V α δ β (t), V β δ β (t)). Here is a dual lemma. Lemma A.4. If u α (t) > V α and δ β (t) δ α (t), then Q α (t, π 1 α) < u α (t), π 1 α(t) is moving away from π 1 α, and BR 1 α(t) is the same as the set of best-response strategies of player 1 in the state game G α with continuation payoffs (V α δ α (t), V β δ α (t)). If it also satisfies that δ α (t) < δ β (t)/ω, then Q β (t, π β 1) < u β(t), πβ 1(t) is also moving away from π1 β, and BR1 β (t) is the same as the set of best-response strategies of player 1 in the state game G β with continuation payoffs (V α δ β (t), V β δ β (t)). The next lemma concerns the behavior in the best-response dynamic when δ α δ β < 0. 23

Lemma A.5. If $u_\alpha(t) < V_\alpha$ and $u_\beta(t) \ge V_\beta$, then regardless of player 2's current strategy $\pi^2(t)$, it follows that $\pi^1_\alpha(t) \ne \bar\pi^1_\alpha$ and $\pi^1_\beta(t) \ne \bar\pi^1_\beta$. Moreover, player 1 is moving towards $\bar\pi^1_\alpha$ at state $\alpha$ and moving away from $\bar\pi^1_\beta$ at state $\beta$ at time $t$. $BR^1_\alpha(t)$ and $BR^1_\beta(t)$ are the same as the sets of best-response strategies of player 1 in the state game $G_\alpha$ with continuation payoffs $(V_\alpha - \delta_\alpha(t), V_\beta - \delta_\alpha(t))$ and in the state game $G_\beta$ with continuation payoffs $(V_\alpha - \delta_\beta(t), V_\beta - \delta_\beta(t))$, respectively.

Proof. This is a corollary of Lemma A.3 and Lemma A.4.

Appendix B Agent-form Best-response Dynamics

Given a zero-sum stochastic game $\Gamma$ with discounted payoff, for any stationary strategy $\pi^i$ of player $i$ and any state $s$, we denote player $i$'s actions at all states except $s$ by $\pi^i_{-s}$. Given a strategy profile $\pi$ and a state $s$, the agent-form best-response sets of player 1 and player 2 for the expected discounted payoff $u_s$ are defined as
$$ABR^1_s(\pi) := \operatorname*{argmax}_{\rho^1_s \in \Delta^1_s} u_s(\rho^1_s, \pi^1_{-s}, \pi^2) \qquad (B.1)$$
and
$$ABR^2_s(\pi) := \operatorname*{argmin}_{\rho^2_s \in \Delta^2_s} u_s(\pi^1, \rho^2_s, \pi^2_{-s}), \qquad (B.2)$$
respectively. We can then define player $i$'s agent-form best-response differential inclusion
$$\forall s \in S, \quad \dot\pi^i_s \in ABR^i_s(\pi) - \pi^i_s, \quad i = 1, 2. \qquad (B.3)$$
Again, as in normal-form zero-sum games, the set $ABR^i_s(\pi)$ is upper semi-continuous. Hence, from any initial strategy profile $\pi(0)$, a solution trajectory exists, and $\pi(t)$ is Lipschitz continuous and satisfies (B.3) for almost all $t \ge 0$.

We conjecture that not every agent-form best-response trajectory in every $\Gamma$ converges to the set of minimax strategy profiles. As for the general convergence results in Barron et al. (2009), in a stochastic game we need to consider the expected discounted payoff function $u : X \to \mathbb{R}$ rather than $u_s$ at only one state $s$. It would be interesting to characterize which $\Gamma$ have $u$ quasiconcave in $\pi^1$ for each fixed $\pi^2$ (or satisfying the weaker condition in that paper).

By the one-player game example in Section 3.3.3, we can show that $ABR^1_s(\pi)$ may still fail to be the set of best-response strategies of player 1's agent for $u_s(\pi)$ at state $s$: when players best respond to each other in the state game $G_s(u(\pi_t))$ at time $t$, they do not take into account that the continuation payoff vector $u(\pi_t)$ is itself adapting under the current strategies.

References

Aubin, J.-P., and A. Cellina (1984): Differential Inclusions. Springer, Berlin.

Balkenborg, D., C. Kuzmics, and J. Hofbauer (2013): "Refined Best-Response Correspondence and Dynamics," Theoretical Economics, 8(1).

Barron, E.N., R. Goebel, and R.R. Jensen (2010): "Best Response Dynamics for Continuous Games," Proceedings of the American Mathematical Society, 138(3).

Berger, U. (2005): "Fictitious play in 2 × n games," Journal of Economic Theory, 120.

Bewley, T., and E. Kohlberg (1976): "The asymptotic theory of stochastic games," Mathematics of Operations Research, 1.

Borkar, V. (2002): "Reinforcement learning in Markovian evolutionary games," Advances in Complex Systems, 5.

Dutta, P.K. (1995): "A Folk Theorem for Stochastic Games," Journal of Economic Theory, 66.

Harris, C. (1998): "On the Rate of Convergence of Continuous-Time Fictitious Play," Games and Economic Behavior, 22.

Hofbauer, J. (1995): "Stability for the Best Response Dynamics," mimeo, University of Vienna.

Hofbauer, J., and K. Sigmund (1998): Evolutionary Games and Population Dynamics. Cambridge University Press.

Hofbauer, J., and S. Sorin (2006): "Best Response Dynamics for Continuous Zero-sum Games," Discrete and Continuous Dynamical Systems - Series B, 6(1).

Gilboa, I., and A. Matsui (1991): "Social Stability and Equilibrium," Econometrica, 59.

Matsui, A. (1989): "Social Stability and Equilibrium," CMS-DMS No. 819, Northwestern University.

Mertens, J.-F., and A. Neyman (1981): "Stochastic games," International Journal of Game Theory, 10.

Perkins, S. (2013): Advanced Stochastic Approximation Frameworks and their Applications. PhD thesis, University of Bristol.

Sandholm, W.H. (2010): Population Games and Evolutionary Dynamics. MIT Press.

Shapley, L. (1953): "Stochastic Games," Proceedings of the National Academy of Sciences of the United States of America, 39.

Vrieze, O., and S. Tijs (1982): "Fictitious play applied to sequences of games and discounted stochastic games," International Journal of Game Theory, 11.


More information

Lecture 7: Bayesian approach to MAB - Gittins index

Lecture 7: Bayesian approach to MAB - Gittins index Advanced Topics in Machine Learning and Algorithmic Game Theory Lecture 7: Bayesian approach to MAB - Gittins index Lecturer: Yishay Mansour Scribe: Mariano Schain 7.1 Introduction In the Bayesian approach

More information

INTERIM CORRELATED RATIONALIZABILITY IN INFINITE GAMES

INTERIM CORRELATED RATIONALIZABILITY IN INFINITE GAMES INTERIM CORRELATED RATIONALIZABILITY IN INFINITE GAMES JONATHAN WEINSTEIN AND MUHAMET YILDIZ A. In a Bayesian game, assume that the type space is a complete, separable metric space, the action space is

More information

Game Theory. Lecture Notes By Y. Narahari. Department of Computer Science and Automation Indian Institute of Science Bangalore, India August 2012

Game Theory. Lecture Notes By Y. Narahari. Department of Computer Science and Automation Indian Institute of Science Bangalore, India August 2012 Game Theory Lecture Notes By Y. Narahari Department of Computer Science and Automation Indian Institute of Science Bangalore, India August 2012 Chapter 6: Mixed Strategies and Mixed Strategy Nash Equilibrium

More information

Copyright (C) 2001 David K. Levine This document is an open textbook; you can redistribute it and/or modify it under the terms of version 1 of the

Copyright (C) 2001 David K. Levine This document is an open textbook; you can redistribute it and/or modify it under the terms of version 1 of the Copyright (C) 2001 David K. Levine This document is an open textbook; you can redistribute it and/or modify it under the terms of version 1 of the open text license amendment to version 2 of the GNU General

More information

A class of coherent risk measures based on one-sided moments

A class of coherent risk measures based on one-sided moments A class of coherent risk measures based on one-sided moments T. Fischer Darmstadt University of Technology November 11, 2003 Abstract This brief paper explains how to obtain upper boundaries of shortfall

More information

Long run equilibria in an asymmetric oligopoly

Long run equilibria in an asymmetric oligopoly Economic Theory 14, 705 715 (1999) Long run equilibria in an asymmetric oligopoly Yasuhito Tanaka Faculty of Law, Chuo University, 742-1, Higashinakano, Hachioji, Tokyo, 192-03, JAPAN (e-mail: yasuhito@tamacc.chuo-u.ac.jp)

More information

arxiv: v1 [math.oc] 23 Dec 2010

arxiv: v1 [math.oc] 23 Dec 2010 ASYMPTOTIC PROPERTIES OF OPTIMAL TRAJECTORIES IN DYNAMIC PROGRAMMING SYLVAIN SORIN, XAVIER VENEL, GUILLAUME VIGERAL Abstract. We show in a dynamic programming framework that uniform convergence of the

More information

Microeconomics II. CIDE, MsC Economics. List of Problems

Microeconomics II. CIDE, MsC Economics. List of Problems Microeconomics II CIDE, MsC Economics List of Problems 1. There are three people, Amy (A), Bart (B) and Chris (C): A and B have hats. These three people are arranged in a room so that B can see everything

More information

Bargaining and Competition Revisited Takashi Kunimoto and Roberto Serrano

Bargaining and Competition Revisited Takashi Kunimoto and Roberto Serrano Bargaining and Competition Revisited Takashi Kunimoto and Roberto Serrano Department of Economics Brown University Providence, RI 02912, U.S.A. Working Paper No. 2002-14 May 2002 www.econ.brown.edu/faculty/serrano/pdfs/wp2002-14.pdf

More information

10.1 Elimination of strictly dominated strategies

10.1 Elimination of strictly dominated strategies Chapter 10 Elimination by Mixed Strategies The notions of dominance apply in particular to mixed extensions of finite strategic games. But we can also consider dominance of a pure strategy by a mixed strategy.

More information

A reinforcement learning process in extensive form games

A reinforcement learning process in extensive form games A reinforcement learning process in extensive form games Jean-François Laslier CNRS and Laboratoire d Econométrie de l Ecole Polytechnique, Paris. Bernard Walliser CERAS, Ecole Nationale des Ponts et Chaussées,

More information

Repeated Games. September 3, Definitions: Discounting, Individual Rationality. Finitely Repeated Games. Infinitely Repeated Games

Repeated Games. September 3, Definitions: Discounting, Individual Rationality. Finitely Repeated Games. Infinitely Repeated Games Repeated Games Frédéric KOESSLER September 3, 2007 1/ Definitions: Discounting, Individual Rationality Finitely Repeated Games Infinitely Repeated Games Automaton Representation of Strategies The One-Shot

More information

Socially-Optimal Design of Crowdsourcing Platforms with Reputation Update Errors

Socially-Optimal Design of Crowdsourcing Platforms with Reputation Update Errors Socially-Optimal Design of Crowdsourcing Platforms with Reputation Update Errors 1 Yuanzhang Xiao, Yu Zhang, and Mihaela van der Schaar Abstract Crowdsourcing systems (e.g. Yahoo! Answers and Amazon Mechanical

More information

Lecture Note Set 3 3 N-PERSON GAMES. IE675 Game Theory. Wayne F. Bialas 1 Monday, March 10, N-Person Games in Strategic Form

Lecture Note Set 3 3 N-PERSON GAMES. IE675 Game Theory. Wayne F. Bialas 1 Monday, March 10, N-Person Games in Strategic Form IE675 Game Theory Lecture Note Set 3 Wayne F. Bialas 1 Monday, March 10, 003 3 N-PERSON GAMES 3.1 N-Person Games in Strategic Form 3.1.1 Basic ideas We can extend many of the results of the previous chapter

More information

Martingales. by D. Cox December 2, 2009

Martingales. by D. Cox December 2, 2009 Martingales by D. Cox December 2, 2009 1 Stochastic Processes. Definition 1.1 Let T be an arbitrary index set. A stochastic process indexed by T is a family of random variables (X t : t T) defined on a

More information

Homework 2: Dynamic Moral Hazard

Homework 2: Dynamic Moral Hazard Homework 2: Dynamic Moral Hazard Question 0 (Normal learning model) Suppose that z t = θ + ɛ t, where θ N(m 0, 1/h 0 ) and ɛ t N(0, 1/h ɛ ) are IID. Show that θ z 1 N ( hɛ z 1 h 0 + h ɛ + h 0m 0 h 0 +

More information

STOCHASTIC REPUTATION DYNAMICS UNDER DUOPOLY COMPETITION

STOCHASTIC REPUTATION DYNAMICS UNDER DUOPOLY COMPETITION STOCHASTIC REPUTATION DYNAMICS UNDER DUOPOLY COMPETITION BINGCHAO HUANGFU Abstract This paper studies a dynamic duopoly model of reputation-building in which reputations are treated as capital stocks that

More information

Chapter 3. Dynamic discrete games and auctions: an introduction

Chapter 3. Dynamic discrete games and auctions: an introduction Chapter 3. Dynamic discrete games and auctions: an introduction Joan Llull Structural Micro. IDEA PhD Program I. Dynamic Discrete Games with Imperfect Information A. Motivating example: firm entry and

More information

Finite Memory and Imperfect Monitoring

Finite Memory and Imperfect Monitoring Federal Reserve Bank of Minneapolis Research Department Staff Report 287 March 2001 Finite Memory and Imperfect Monitoring Harold L. Cole University of California, Los Angeles and Federal Reserve Bank

More information

Online Appendix for Debt Contracts with Partial Commitment by Natalia Kovrijnykh

Online Appendix for Debt Contracts with Partial Commitment by Natalia Kovrijnykh Online Appendix for Debt Contracts with Partial Commitment by Natalia Kovrijnykh Omitted Proofs LEMMA 5: Function ˆV is concave with slope between 1 and 0. PROOF: The fact that ˆV (w) is decreasing in

More information

CONSISTENCY AMONG TRADING DESKS

CONSISTENCY AMONG TRADING DESKS CONSISTENCY AMONG TRADING DESKS David Heath 1 and Hyejin Ku 2 1 Department of Mathematical Sciences, Carnegie Mellon University, Pittsburgh, PA, USA, email:heath@andrew.cmu.edu 2 Department of Mathematics

More information

The folk theorem revisited

The folk theorem revisited Economic Theory 27, 321 332 (2006) DOI: 10.1007/s00199-004-0580-7 The folk theorem revisited James Bergin Department of Economics, Queen s University, Ontario K7L 3N6, CANADA (e-mail: berginj@qed.econ.queensu.ca)

More information

6.207/14.15: Networks Lecture 10: Introduction to Game Theory 2

6.207/14.15: Networks Lecture 10: Introduction to Game Theory 2 6.207/14.15: Networks Lecture 10: Introduction to Game Theory 2 Daron Acemoglu and Asu Ozdaglar MIT October 14, 2009 1 Introduction Outline Review Examples of Pure Strategy Nash Equilibria Mixed Strategies

More information

Reinforcement Learning (1): Discrete MDP, Value Iteration, Policy Iteration

Reinforcement Learning (1): Discrete MDP, Value Iteration, Policy Iteration Reinforcement Learning (1): Discrete MDP, Value Iteration, Policy Iteration Piyush Rai CS5350/6350: Machine Learning November 29, 2011 Reinforcement Learning Supervised Learning: Uses explicit supervision

More information

Mixed Strategies. Samuel Alizon and Daniel Cownden February 4, 2009

Mixed Strategies. Samuel Alizon and Daniel Cownden February 4, 2009 Mixed Strategies Samuel Alizon and Daniel Cownden February 4, 009 1 What are Mixed Strategies In the previous sections we have looked at games where players face uncertainty, and concluded that they choose

More information

Subgame Perfect Cooperation in an Extensive Game

Subgame Perfect Cooperation in an Extensive Game Subgame Perfect Cooperation in an Extensive Game Parkash Chander * and Myrna Wooders May 1, 2011 Abstract We propose a new concept of core for games in extensive form and label it the γ-core of an extensive

More information

Reinforcement Learning (1): Discrete MDP, Value Iteration, Policy Iteration

Reinforcement Learning (1): Discrete MDP, Value Iteration, Policy Iteration Reinforcement Learning (1): Discrete MDP, Value Iteration, Policy Iteration Piyush Rai CS5350/6350: Machine Learning November 29, 2011 Reinforcement Learning Supervised Learning: Uses explicit supervision

More information

No-arbitrage theorem for multi-factor uncertain stock model with floating interest rate

No-arbitrage theorem for multi-factor uncertain stock model with floating interest rate Fuzzy Optim Decis Making 217 16:221 234 DOI 117/s17-16-9246-8 No-arbitrage theorem for multi-factor uncertain stock model with floating interest rate Xiaoyu Ji 1 Hua Ke 2 Published online: 17 May 216 Springer

More information

Information Acquisition under Persuasive Precedent versus Binding Precedent (Preliminary and Incomplete)

Information Acquisition under Persuasive Precedent versus Binding Precedent (Preliminary and Incomplete) Information Acquisition under Persuasive Precedent versus Binding Precedent (Preliminary and Incomplete) Ying Chen Hülya Eraslan January 9, 216 Abstract We analyze a dynamic model of judicial decision

More information

Chair of Communications Theory, Prof. Dr.-Ing. E. Jorswieck. Übung 5: Supermodular Games

Chair of Communications Theory, Prof. Dr.-Ing. E. Jorswieck. Übung 5: Supermodular Games Chair of Communications Theory, Prof. Dr.-Ing. E. Jorswieck Übung 5: Supermodular Games Introduction Supermodular games are a class of non-cooperative games characterized by strategic complemetariteis

More information

Credibilistic Equilibria in Extensive Game with Fuzzy Payoffs

Credibilistic Equilibria in Extensive Game with Fuzzy Payoffs Credibilistic Equilibria in Extensive Game with Fuzzy Payoffs Yueshan Yu Department of Mathematical Sciences Tsinghua University Beijing 100084, China yuyueshan@tsinghua.org.cn Jinwu Gao School of Information

More information

On Forchheimer s Model of Dominant Firm Price Leadership

On Forchheimer s Model of Dominant Firm Price Leadership On Forchheimer s Model of Dominant Firm Price Leadership Attila Tasnádi Department of Mathematics, Budapest University of Economic Sciences and Public Administration, H-1093 Budapest, Fővám tér 8, Hungary

More information

On the Lower Arbitrage Bound of American Contingent Claims

On the Lower Arbitrage Bound of American Contingent Claims On the Lower Arbitrage Bound of American Contingent Claims Beatrice Acciaio Gregor Svindland December 2011 Abstract We prove that in a discrete-time market model the lower arbitrage bound of an American

More information

Strategies and Nash Equilibrium. A Whirlwind Tour of Game Theory

Strategies and Nash Equilibrium. A Whirlwind Tour of Game Theory Strategies and Nash Equilibrium A Whirlwind Tour of Game Theory (Mostly from Fudenberg & Tirole) Players choose actions, receive rewards based on their own actions and those of the other players. Example,

More information

On the existence of coalition-proof Bertrand equilibrium

On the existence of coalition-proof Bertrand equilibrium Econ Theory Bull (2013) 1:21 31 DOI 10.1007/s40505-013-0011-7 RESEARCH ARTICLE On the existence of coalition-proof Bertrand equilibrium R. R. Routledge Received: 13 March 2013 / Accepted: 21 March 2013

More information

Sequential Investment, Hold-up, and Strategic Delay

Sequential Investment, Hold-up, and Strategic Delay Sequential Investment, Hold-up, and Strategic Delay Juyan Zhang and Yi Zhang February 20, 2011 Abstract We investigate hold-up in the case of both simultaneous and sequential investment. We show that if

More information

Regret Minimization and Security Strategies

Regret Minimization and Security Strategies Chapter 5 Regret Minimization and Security Strategies Until now we implicitly adopted a view that a Nash equilibrium is a desirable outcome of a strategic game. In this chapter we consider two alternative

More information

Finding Equilibria in Games of No Chance

Finding Equilibria in Games of No Chance Finding Equilibria in Games of No Chance Kristoffer Arnsfelt Hansen, Peter Bro Miltersen, and Troels Bjerre Sørensen Department of Computer Science, University of Aarhus, Denmark {arnsfelt,bromille,trold}@daimi.au.dk

More information

Making Complex Decisions

Making Complex Decisions Ch. 17 p.1/29 Making Complex Decisions Chapter 17 Ch. 17 p.2/29 Outline Sequential decision problems Value iteration algorithm Policy iteration algorithm Ch. 17 p.3/29 A simple environment 3 +1 p=0.8 2

More information

Bilateral trading with incomplete information and Price convergence in a Small Market: The continuous support case

Bilateral trading with incomplete information and Price convergence in a Small Market: The continuous support case Bilateral trading with incomplete information and Price convergence in a Small Market: The continuous support case Kalyan Chatterjee Kaustav Das November 18, 2017 Abstract Chatterjee and Das (Chatterjee,K.,

More information

Yao s Minimax Principle

Yao s Minimax Principle Complexity of algorithms The complexity of an algorithm is usually measured with respect to the size of the input, where size may for example refer to the length of a binary word describing the input,

More information

Economics 209A Theory and Application of Non-Cooperative Games (Fall 2013) Repeated games OR 8 and 9, and FT 5

Economics 209A Theory and Application of Non-Cooperative Games (Fall 2013) Repeated games OR 8 and 9, and FT 5 Economics 209A Theory and Application of Non-Cooperative Games (Fall 2013) Repeated games OR 8 and 9, and FT 5 The basic idea prisoner s dilemma The prisoner s dilemma game with one-shot payoffs 2 2 0

More information

Ph.D. Preliminary Examination MICROECONOMIC THEORY Applied Economics Graduate Program June 2017

Ph.D. Preliminary Examination MICROECONOMIC THEORY Applied Economics Graduate Program June 2017 Ph.D. Preliminary Examination MICROECONOMIC THEORY Applied Economics Graduate Program June 2017 The time limit for this exam is four hours. The exam has four sections. Each section includes two questions.

More information

Discounted Stochastic Games

Discounted Stochastic Games Discounted Stochastic Games Eilon Solan October 26, 1998 Abstract We give an alternative proof to a result of Mertens and Parthasarathy, stating that every n-player discounted stochastic game with general

More information

Convergence Analysis of Monte Carlo Calibration of Financial Market Models

Convergence Analysis of Monte Carlo Calibration of Financial Market Models Analysis of Monte Carlo Calibration of Financial Market Models Christoph Käbe Universität Trier Workshop on PDE Constrained Optimization of Certain and Uncertain Processes June 03, 2009 Monte Carlo Calibration

More information

Mixed Strategies. In the previous chapters we restricted players to using pure strategies and we

Mixed Strategies. In the previous chapters we restricted players to using pure strategies and we 6 Mixed Strategies In the previous chapters we restricted players to using pure strategies and we postponed discussing the option that a player may choose to randomize between several of his pure strategies.

More information

Game Theory for Wireless Engineers Chapter 3, 4

Game Theory for Wireless Engineers Chapter 3, 4 Game Theory for Wireless Engineers Chapter 3, 4 Zhongliang Liang ECE@Mcmaster Univ October 8, 2009 Outline Chapter 3 - Strategic Form Games - 3.1 Definition of A Strategic Form Game - 3.2 Dominated Strategies

More information

FDPE Microeconomics 3 Spring 2017 Pauli Murto TA: Tsz-Ning Wong (These solution hints are based on Julia Salmi s solution hints for Spring 2015.

FDPE Microeconomics 3 Spring 2017 Pauli Murto TA: Tsz-Ning Wong (These solution hints are based on Julia Salmi s solution hints for Spring 2015. FDPE Microeconomics 3 Spring 2017 Pauli Murto TA: Tsz-Ning Wong (These solution hints are based on Julia Salmi s solution hints for Spring 2015.) Hints for Problem Set 3 1. Consider the following strategic

More information

Maintaining a Reputation Against a Patient Opponent 1

Maintaining a Reputation Against a Patient Opponent 1 Maintaining a Reputation Against a Patient Opponent July 3, 006 Marco Celentani Drew Fudenberg David K. Levine Wolfgang Pesendorfer ABSTRACT: We analyze reputation in a game between a patient player and

More information

CS364A: Algorithmic Game Theory Lecture #14: Robust Price-of-Anarchy Bounds in Smooth Games

CS364A: Algorithmic Game Theory Lecture #14: Robust Price-of-Anarchy Bounds in Smooth Games CS364A: Algorithmic Game Theory Lecture #14: Robust Price-of-Anarchy Bounds in Smooth Games Tim Roughgarden November 6, 013 1 Canonical POA Proofs In Lecture 1 we proved that the price of anarchy (POA)

More information

Assets with possibly negative dividends

Assets with possibly negative dividends Assets with possibly negative dividends (Preliminary and incomplete. Comments welcome.) Ngoc-Sang PHAM Montpellier Business School March 12, 2017 Abstract The paper introduces assets whose dividends can

More information

Game Theory. Lecture Notes By Y. Narahari. Department of Computer Science and Automation Indian Institute of Science Bangalore, India October 2012

Game Theory. Lecture Notes By Y. Narahari. Department of Computer Science and Automation Indian Institute of Science Bangalore, India October 2012 Game Theory Lecture Notes By Y. Narahari Department of Computer Science and Automation Indian Institute of Science Bangalore, India October 2012 COOPERATIVE GAME THEORY The Core Note: This is a only a

More information

Sequential Investment, Hold-up, and Strategic Delay

Sequential Investment, Hold-up, and Strategic Delay Sequential Investment, Hold-up, and Strategic Delay Juyan Zhang and Yi Zhang December 20, 2010 Abstract We investigate hold-up with simultaneous and sequential investment. We show that if the encouragement

More information

Equilibrium selection and consistency Norde, Henk; Potters, J.A.M.; Reijnierse, Hans; Vermeulen, D.

Equilibrium selection and consistency Norde, Henk; Potters, J.A.M.; Reijnierse, Hans; Vermeulen, D. Tilburg University Equilibrium selection and consistency Norde, Henk; Potters, J.A.M.; Reijnierse, Hans; Vermeulen, D. Published in: Games and Economic Behavior Publication date: 1996 Link to publication

More information

A Decentralized Learning Equilibrium

A Decentralized Learning Equilibrium Paper to be presented at the DRUID Society Conference 2014, CBS, Copenhagen, June 16-18 A Decentralized Learning Equilibrium Andreas Blume University of Arizona Economics ablume@email.arizona.edu April

More information

MA300.2 Game Theory 2005, LSE

MA300.2 Game Theory 2005, LSE MA300.2 Game Theory 2005, LSE Answers to Problem Set 2 [1] (a) This is standard (we have even done it in class). The one-shot Cournot outputs can be computed to be A/3, while the payoff to each firm can

More information

Tangent Lévy Models. Sergey Nadtochiy (joint work with René Carmona) Oxford-Man Institute of Quantitative Finance University of Oxford.

Tangent Lévy Models. Sergey Nadtochiy (joint work with René Carmona) Oxford-Man Institute of Quantitative Finance University of Oxford. Tangent Lévy Models Sergey Nadtochiy (joint work with René Carmona) Oxford-Man Institute of Quantitative Finance University of Oxford June 24, 2010 6th World Congress of the Bachelier Finance Society Sergey

More information

Alp E. Atakan and Mehmet Ekmekci

Alp E. Atakan and Mehmet Ekmekci REPUTATION IN REPEATED MORAL HAZARD GAMES 1 Alp E. Atakan and Mehmet Ekmekci We study an infinitely repeated game where two players with equal discount factors play a simultaneous-move stage game. Player

More information

3.2 No-arbitrage theory and risk neutral probability measure

3.2 No-arbitrage theory and risk neutral probability measure Mathematical Models in Economics and Finance Topic 3 Fundamental theorem of asset pricing 3.1 Law of one price and Arrow securities 3.2 No-arbitrage theory and risk neutral probability measure 3.3 Valuation

More information

NAIVE REINFORCEMENT LEARNING WITH ENDOGENOUS ASPIRATIONS. University College London, U.K., and Texas A&M University, U.S.A. 1.

NAIVE REINFORCEMENT LEARNING WITH ENDOGENOUS ASPIRATIONS. University College London, U.K., and Texas A&M University, U.S.A. 1. INTERNATIONAL ECONOMIC REVIEW Vol. 41, No. 4, November 2000 NAIVE REINFORCEMENT LEARNING WITH ENDOGENOUS ASPIRATIONS By Tilman Börgers and Rajiv Sarin 1 University College London, U.K., and Texas A&M University,

More information

Introduction to Game Theory Lecture Note 5: Repeated Games

Introduction to Game Theory Lecture Note 5: Repeated Games Introduction to Game Theory Lecture Note 5: Repeated Games Haifeng Huang University of California, Merced Repeated games Repeated games: given a simultaneous-move game G, a repeated game of G is an extensive

More information

CHAPTER 14: REPEATED PRISONER S DILEMMA

CHAPTER 14: REPEATED PRISONER S DILEMMA CHAPTER 4: REPEATED PRISONER S DILEMMA In this chapter, we consider infinitely repeated play of the Prisoner s Dilemma game. We denote the possible actions for P i by C i for cooperating with the other

More information

Equilibrium payoffs in finite games

Equilibrium payoffs in finite games Equilibrium payoffs in finite games Ehud Lehrer, Eilon Solan, Yannick Viossat To cite this version: Ehud Lehrer, Eilon Solan, Yannick Viossat. Equilibrium payoffs in finite games. Journal of Mathematical

More information

Characterization of the Optimum

Characterization of the Optimum ECO 317 Economics of Uncertainty Fall Term 2009 Notes for lectures 5. Portfolio Allocation with One Riskless, One Risky Asset Characterization of the Optimum Consider a risk-averse, expected-utility-maximizing

More information

REPUTATION WITH LONG RUN PLAYERS

REPUTATION WITH LONG RUN PLAYERS REPUTATION WITH LONG RUN PLAYERS ALP E. ATAKAN AND MEHMET EKMEKCI Abstract. Previous work shows that reputation results may fail in repeated games with long-run players with equal discount factors. Attention

More information

Rohini Kumar. Statistics and Applied Probability, UCSB (Joint work with J. Feng and J.-P. Fouque)

Rohini Kumar. Statistics and Applied Probability, UCSB (Joint work with J. Feng and J.-P. Fouque) Small time asymptotics for fast mean-reverting stochastic volatility models Statistics and Applied Probability, UCSB (Joint work with J. Feng and J.-P. Fouque) March 11, 2011 Frontier Probability Days,

More information

Information Aggregation in Dynamic Markets with Strategic Traders. Michael Ostrovsky

Information Aggregation in Dynamic Markets with Strategic Traders. Michael Ostrovsky Information Aggregation in Dynamic Markets with Strategic Traders Michael Ostrovsky Setup n risk-neutral players, i = 1,..., n Finite set of states of the world Ω Random variable ( security ) X : Ω R Each

More information