Existence of Multiagent Equilibria with Limited Agents


Existence of Multiagent Equilibria with Limited Agents

Michael Bowling
Manuela Veloso
Computer Science Department, Carnegie Mellon University, Pittsburgh PA

Abstract

Multiagent learning is a necessary yet challenging problem as multiagent systems become more prevalent and environments become more dynamic. Much of the groundbreaking work in this area draws on notable results from game theory, in particular, the concept of Nash equilibria. Learners that directly learn equilibria obviously rely on their existence. Learners that instead seek to play optimally with respect to the other players also depend upon equilibria, since equilibria are, and are the only, learning fixed points. From another perspective, agents with limitations are real and common. These may be undesired physical limitations as well as self-imposed rational limitations, such as abstraction and approximation techniques, used to make learning tractable. This article explores the interactions of these two important concepts, raising for the first time the question of whether equilibria continue to exist when agents have limitations. We look at the general effects limitations can have on agent behavior, and define a natural extension of equilibria that accounts for these limitations. Using this formalization, we show that the existence of equilibria cannot be guaranteed in general. We then prove their existence for certain classes of domains and agent limitations. These results have wide applicability as they are not tied to any particular learning algorithm or specific instance of agent limitations. We then present empirical results from a specific multiagent learner applied to a specific instance of limited agents. These results demonstrate that learning with limitations is possible, and our theoretical analysis of equilibria under limitations is relevant.

1. Introduction

Multiagent domains are becoming more prevalent as more applications and situations require multiple agents. Learning in these systems is as useful and important as in single-agent domains, possibly more so. Optimal behavior in a multiagent system may depend on the behavior of the other agents. For example, in robot soccer, passing the ball may only be optimal if the defending goalie is going to move to block the player's shot and no defender will move to intercept the pass. This is complicated by the fact that the behavior of the other agents is often not predictable by the agent designer, making learning and adaptation a necessary component of the agent itself. In addition, the behavior of the other agents, and therefore the optimal response, can be changing as they also adapt to achieve their own goals. Game theory provides a framework for reasoning about these strategic interactions. The game theoretic concepts of stochastic games and Nash equilibria are the foundation for much of the recent research in multiagent learning, e.g., (Littman, 1994; Hu & Wellman, 1998; Greenwald & Hall, 2002; Bowling & Veloso, 2002). Nash equilibria define a course of action for each agent, such that no agent could benefit by changing their behavior. So, all agents are playing optimally, given that the other agents continue to play according to the equilibrium.

From the agent design perspective, optimal agents in realistic environments are not practical. Agents are faced with all sorts of limitations. Some limitations may physically prevent certain behavior, e.g., a soccer robot that has traction limits on its acceleration. Other limitations are self-imposed to help guide an agent's learning, e.g., using a subproblem solution for advancing the ball down the field. In short, limitations prevent agents from playing optimally and possibly from following a Nash equilibrium. This clash between the concept of equilibria and the reality of limited agents is a topic of critical importance. Do equilibria exist when agents have limitations? Are there classes of domains or classes of limitations where equilibria are guaranteed to exist? This article introduces these questions and provides concrete answers. Section 2 introduces the stochastic game framework as a model for multiagent learning. We define the game theoretic concept of equilibria, and examine the dependence of current multiagent learning algorithms on this concept. Section 3 enumerates and classifies some common agent limitations and presents two formal models incorporating the effects of limitations into the stochastic game framework. Section 4 is the major contribution of the article, presenting both proofs of existence for certain domains and limitations as well as counterexamples for others. Section 5 gives an example of how these results affect and relate to one particular multiagent learning algorithm. We present the first known results of applying an explicitly multiagent learning algorithm in a setting with limited agents. Finally, Section 6 concludes with implications of this work and future directions.

2. Stochastic Games

A stochastic game is a tuple (n, S, A_1...n, T, R_1...n), where n is the number of agents; S is a set of states; A_i is the set of actions available to agent i, with A being the joint action space A_1 × ... × A_n; T is a transition function, S × A × S → [0, 1], such that ∀s ∈ S, a ∈ A, Σ_{s'∈S} T(s, a, s') = 1; and R_i is a reward function for the ith agent, S × A → ℝ. This is very similar to the framework of a Markov Decision Process (MDP). Instead of a single agent, though, there are multiple agents whose joint action determines the next state and rewards to the agents. The goal of an agent, as in MDPs, is to maximize its long-term reward. Notice, though, that each agent has its own independent reward function that it is seeking to maximize. The goal of maximizing long-term reward will be made formal in Section 2.2.
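To make the tuple concrete, here is a minimal sketch of this structure in Python (our own illustration, not from the article; all names are ours), including the check that T(s, a, ·) is a probability distribution. The example encodes a two-player game with a single, self-looping state — the Rock-Paper-Scissors matrix game discussed below:

    import itertools
    import numpy as np

    class StochasticGame:
        """(n, S, A_1..n, T, R_1..n): T[s][a] is a distribution over next
        states, R[i][s][a] is agent i's reward, a a tuple of per-agent actions."""
        def __init__(self, n, n_states, action_counts, T, R):
            self.n, self.n_states, self.action_counts = n, n_states, action_counts
            self.T, self.R = T, R
            for s in range(n_states):       # check sum over s' of T(s, a, s') = 1
                for a in itertools.product(*[range(k) for k in action_counts]):
                    assert abs(sum(T[s][a]) - 1.0) < 1e-9

    # Rock-Paper-Scissors as a stochastic game with one self-looping state.
    rps_payoff = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]])
    joint = list(itertools.product(range(3), range(3)))
    T = {0: {a: [1.0] for a in joint}}                  # always stay in state 0
    R = [{0: {a: rps_payoff[a] for a in joint}},        # row player
         {0: {a: -rps_payoff[a] for a in joint}}]       # column player (zero-sum)
    rps = StochasticGame(2, 1, (3, 3), T, R)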

Stochastic games can equally be thought of as an extension of the concept of matrix games to multiple states. Two common matrix games are in Table 1. In these games there are two players; one selects a row and the other selects a column of the matrix. The entry of the matrix they jointly select determines the payoffs. Rock-Paper-Scissors in Table 1(a) is a zero-sum game, where the column player receives the negative of the row player's payoff. In the general case (general-sum games; e.g., Bach or Stravinsky in Table 1(b)) each player has an independent matrix that determines its payoff. Stochastic games, then, can be viewed as having a matrix game associated with each state. The immediate payoffs at a particular state are determined by the matrix entries R_i(s, a). After selecting actions and receiving their rewards from the matrix game, the players transition to another state and associated matrix game, which is determined by their joint action. So stochastic games contain both MDPs (when n = 1) and matrix games (when |S| = 1) as subsets of the framework.

(a) Rock-Paper-Scissors:
    R_r(s_0) = [  0  -1   1 ]        R_c(s_0) = -R_r(s_0)
               [  1   0  -1 ]
               [ -1   1   0 ]

(b) Bach or Stravinsky:
    R_r(s_0) = [ 2  0 ]              R_c(s_0) = [ 1  0 ]
               [ 0  1 ]                         [ 0  2 ]

Table 1: Two example matrix games.

2.1 Policies

Unlike in single-agent settings, deterministic policies, which associate a single action with every state, can often be exploited in multiagent settings. Consider Rock-Paper-Scissors as shown in Table 1(a). If the column player were to play any action deterministically, the row player could win a payoff of one every time. This requires us to consider stochastic strategies and policies. A stochastic policy for player i, π_i : S → PD(A_i), is a function that maps states to mixed strategies, which are probability distributions over the player's actions. We use the notation Π_i to be the set of all possible stochastic policies available to player i, and Π = Π_1 × ... × Π_n to be the set of joint policies of all the players. We also use the notation π_{-i} to refer to a particular joint policy of all the players except player i, and Π_{-i} to refer to the set of such joint policies. Finally, the notation ⟨π_i, π_{-i}⟩ refers to the joint policy where player i follows π_i while the other players follow their policy from π_{-i}. In this work, we make the distinction between the concept of stochastic policies and mixtures of policies. A mixture of policies, σ_i : PD(S → A_i), is a probability distribution over the set of deterministic policies. An agent following a mixture of policies selects a deterministic policy according to its mixture distribution at the start of the game and always follows this policy. This is similar to the distinction between mixed strategies and behavioral strategies in extensive-form games (Kuhn, 1953). This work focuses on stochastic policies as they (i) are a more compact representation, requiring |A_i| · |S| parameters instead of |A_i|^|S| parameters to represent the complete space of policies, (ii) are the common notion of stochastic policies in single-agent behavior learning, e.g., (Jaakkola, Singh, & Jordan, 1994; Sutton, McAllester, Singh, & Mansour, 2000; Ng, Parr, & Koller, 1999), and (iii) don't require the artificial commitment to a single deterministic policy at the start of the game, which can be difficult to understand within a learning context.

2.2 Reward Formulations

There are a number of possible reward formulations in single-agent learning that define the agent's notion of optimality. These formulations also apply to stochastic games. We will explore two of these reward formulations in this article: discounted reward and average reward. Although this work focuses on discounted reward, many of our theoretical results also apply to average reward.

Discounted Reward. In the discounted reward formulation, the value of future rewards is diminished by a discount factor γ. Formally, given a joint policy π for all the agents, the value to agent i of starting at state s ∈ S is,

    V_i^π(s) = Σ_{t=0}^{∞} γ^t E{ r_i^t | s_0 = s, π },    (1)

where r_i^t is the immediate reward to player i at time t, with the expectation conditioned on s as the initial state and the players following the joint policy π. In our formulation, we will assume an initial state, s_0 ∈ S, is given and define the goal of each agent as maximizing V_i^π(s_0). This differs from the usual goal in MDPs and stochastic games, which is to simultaneously maximize the value of all states. We require this weaker goal since our exploration into agent limitations makes simultaneous maximization unattainable.¹ This same distinction was required by Sutton and colleagues (Sutton et al., 2000) in their work on parameterized policies, one example of an agent limitation.

Average Reward. In the average reward formulation all rewards in the sequence are equally weighted. Formally, this corresponds to,

    V_i^π(s) = lim_{T→∞} (1/T) Σ_{t=1}^{T} E{ r_i^t | s_0 = s, π },    (2)

with the expectation defined as in Equation 1. As is common with this formulation, we assume that the stochastic game is ergodic. A stochastic game is ergodic if for all joint policies any state can be reached in finite time from any other state with non-zero probability. This assumption makes the value of a policy independent of the initial state. Therefore, ∀s, s' ∈ S, V_i^π(s) = V_i^π(s'). So any policy that maximizes the average value from one state maximizes the average value from all states. These results, along with more details on the average reward formulation for MDPs, are summarized by Mahadevan (1996).

For either formulation we will use the notation V_i^π to refer to the value of the joint policy π to agent i, which in either formulation is simply V_i^π(s_0), where s_0 can be any arbitrary state for the average reward formulation.

1. This fact is demonstrated later by the example in Fact 5 in Section 4. In this game with the described limitation, if the column player randomizes among its actions, then the row player cannot simultaneously maximize the value of the left and right states.
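Equation 1 can be evaluated exactly for a fixed joint policy: the joint policy induces a Markov chain, and V = (I − γP_π)^{-1} r_π. The following sketch is ours (assuming small tabular games in the representation from the earlier sketch) and performs this evaluation with a single linear solve:

    import numpy as np

    def discounted_values(T, R_i, policies, gamma):
        """V_i^pi(s) for every s. T[s][a][s'] transition probabilities,
        R_i[s][a] agent i's reward, policies[j][s] a distribution over
        agent j's actions, with a a joint-action tuple."""
        n_states = len(T)
        P = np.zeros((n_states, n_states))   # induced chain P_pi(s, s')
        r = np.zeros(n_states)               # expected one-step reward to agent i
        for s in range(n_states):
            for a in T[s]:
                p_a = np.prod([policies[j][s][a[j]] for j in range(len(a))])
                r[s] += p_a * R_i[s][a]
                for s2 in range(n_states):
                    P[s, s2] += p_a * T[s][a][s2]
        return np.linalg.solve(np.eye(n_states) - gamma * P, r)

    # Reusing the Rock-Paper-Scissors encoding above: uniform play has value 0.
    uniform = {0: np.ones(3) / 3}
    print(discounted_values(T, R[0], [uniform, uniform], gamma=0.9))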

2.3 Best-Response and Equilibria

Even with the concept of stochastic policies and well-defined reward formulations, there are still no optimal policies that are independent of the other players' policies. We can, though, define a notion of best-response.

Definition 1 For a game, the best-response function for player i, BR_i(π_{-i}), is the set of all policies that are optimal given the other player(s) play the joint policy π_{-i}. A policy π_i is optimal given π_{-i} if and only if,

    ∀π'_i ∈ Π_i    V_i^{⟨π_i, π_{-i}⟩} ≥ V_i^{⟨π'_i, π_{-i}⟩}.

The major advancement that has driven much of the development of game theory, matrix games, and stochastic games is the notion of a best-response equilibrium, or Nash equilibrium (Nash, Jr., 1950).

Definition 2 A Nash equilibrium is a joint policy, π_{i=1...n}, with

    ∀i = 1, ..., n    π_i ∈ BR_i(π_{-i}).

Basically, each player is playing a best-response to the other players' policies. So, no player can do better by changing policies given that all the other players continue to follow the equilibrium policy. What makes the notion of an equilibrium interesting is that at least one, possibly many, exist in all matrix games and stochastic games. This was proven by Nash (1950) for matrix games, Shapley (1953) for zero-sum discounted stochastic games, Fink (1964) for general-sum discounted stochastic games, and Mertens and Neyman (1981) for zero-sum average reward stochastic games. The existence of equilibria of general-sum average reward stochastic games is still an open problem (Filar & Vrieze, 1997). In the Rock-Paper-Scissors example in Table 1(a), the only equilibrium consists of each player playing the mixed strategy where all the actions have equal probability. In the Bach-or-Stravinsky example in Table 1(b), there are three equilibria. Two consist of both players selecting their first action or both selecting their second. The third involves both players selecting their preferred cooperative action with probability 2/3, and the other action with probability 1/3.
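Definitions 1 and 2 suggest a direct numerical check: against a fixed opponent strategy, some pure strategy is always a best-response, so a candidate joint strategy in a two-player matrix game is a Nash equilibrium exactly when no pure deviation improves either player's payoff. A small sketch (ours):

    import numpy as np

    def is_nash(R_row, R_col, x, y, tol=1e-8):
        """x, y: mixed strategies. It suffices to compare against pure deviations,
        since some pure strategy always attains the best-response value."""
        row_val, col_val = x @ R_row @ y, x @ R_col @ y
        return (np.max(R_row @ y) <= row_val + tol and   # no row deviation helps
                np.max(x @ R_col) <= col_val + tol)      # no column deviation helps

    R_row = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]])
    uniform = np.ones(3) / 3
    print(is_nash(R_row, -R_row, uniform, uniform))              # True: equilibrium
    print(is_nash(R_row, -R_row, np.array([1., 0, 0]), uniform)) # False: pure Rock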

2.4 Learning in Stochastic Games

Learning in stochastic games has received much attention in recent years as the natural extension of MDPs to multiple agents. The Minimax-Q algorithm (Littman, 1994) was the first reinforcement learning algorithm to explicitly consider the stochastic game framework. Developed for discounted reward, zero-sum stochastic games, the essence of the algorithm was to use Q-learning to learn the values of joint actions. The value of the next state was then computed by solving for the value of the unique Nash equilibrium of that state's Q-values. Littman proved that under usual exploration requirements, Minimax-Q would converge to the Nash equilibrium of the game, independent of the opponent's play. Other algorithms have since been presented for learning in stochastic games. We will summarize these algorithms by broadly grouping them into two categories: equilibria learners and best-response learners. The main focus of this summarization is to demonstrate how the existence of equilibria under limitations is a critical question for existing algorithms.

Equilibria Learners. Minimax-Q has been extended in many different ways. Nash-Q (Hu & Wellman, 1998), Friend-or-Foe-Q (Littman, 2001), and Correlated-Q (Greenwald & Hall, 2002) are all variations on this same theme with different restrictions on the applicable class of games or the notion of equilibria learned. All of the algorithms, though, seek to learn an equilibrium of the game directly, by iteratively computing intermediate equilibria. They are, generally speaking, guaranteed to converge to their part of an equilibrium solution regardless of the play or convergence of the other agents. We refer collectively to these algorithms as equilibria learners. What is important to observe is that these algorithms depend explicitly on the existence of equilibria. If an agent or agents were limited in such a way that no equilibria existed, then these algorithms would be, for the most part, ill-defined.²

2. It should be noted that in the case of Minimax-Q, the algorithm and solution concept are still well-defined. A policy that maximizes its worst-case value may still exist even if limitations make it such that no equilibria exist. But this minimax optimal policy might not necessarily be part of an equilibrium. Later, in Section 4, Fact 5, we present an example of a zero-sum stochastic game and agent limitations where the minimax optimal policies exist but do not comprise an equilibrium.

Best-Response Learners. Another class of algorithms is the class of best-response learners. These algorithms do not explicitly seek to learn equilibria, instead seeking to learn best-responses to the other agents. The simplest example of one of these algorithms is Q-learning (Watkins, 1989). Although not an explicitly multiagent algorithm, it was one of the first algorithms applied to multiagent environments (Tan, 1993; Sen, Sekaran, & Hale, 1994). Another less naive best-response learning algorithm is WoLF-PHC (Bowling & Veloso, 2002), which varies the learning rate to account for the other agents learning simultaneously. Other best-response learners include Fictitious Play (Robinson, 1951; Vrieze, 1987), Opponent-Modeling Q-Learning (Uther & Veloso, 1997), Joint Action Learners (Claus & Boutilier, 1998), and any single-agent learning algorithm that learns optimal policies. Although these algorithms have no explicit dependence on equilibria, there is an important implicit dependence. If algorithms that learn best-responses converge when playing each other, then it must be to a Nash equilibrium (Bowling & Veloso, 2002). Therefore, Nash equilibria are, and are the only, learning fixed points. In the context of agent limitations, this means that if limitations cause equilibria to not exist, then best-response learners could not converge. This is exactly one of the problems faced by Q-learning in stochastic games. Q-learning is limited to deterministic policies. This deterministic policy limitation can, in fact, cause no equilibria to exist (see Fact 1 in Section 4). So there are many games for which Q-learning cannot converge when playing with other best-response learners, such as other Q-learners.

In summary, both equilibria and best-response learners depend on the existence of equilibria. The next section explores agent limitations that are likely to be faced in realistic learning situations. In Section 4, we then present our main results examining the effect these limitations have on the existence of equilibria, and consequently on both equilibria and best-response learners.
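For concreteness, the per-state computation at the heart of an equilibria learner such as Minimax-Q — solving the zero-sum matrix game formed by the state's Q-values — can be written as a small linear program. This sketch is ours and assumes SciPy is available:

    import numpy as np
    from scipy.optimize import linprog

    def minimax_strategy(Q):
        """Row player's maximin mixed strategy and game value for payoff matrix Q.
        Variables [x_1..x_m, v]; maximize v s.t. x^T Q[:, j] >= v for every j."""
        m, k = Q.shape
        c = np.zeros(m + 1); c[-1] = -1.0                # minimize -v
        A_ub = np.hstack([-Q.T, np.ones((k, 1))])        # v - x^T Q[:, j] <= 0
        b_ub = np.zeros(k)
        A_eq = np.ones((1, m + 1)); A_eq[0, -1] = 0.0    # probabilities sum to 1
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                      bounds=[(0, 1)] * m + [(None, None)], method="highs")
        return res.x[:m], res.x[-1]

    x, v = minimax_strategy(np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]]))
    print(np.round(x, 3), round(v, 3))   # ~[0.333 0.333 0.333], value 0.0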

3. Limitations

The solution concept of Nash equilibria depends on all the agents playing optimally. From the agent development perspective, agents have limitations that prevent this from being a reality. The working definition of limitation in this article is anything that can restrict the agent from learning or playing optimal policies. Broadly speaking, limitations can be classified into two categories: physical limitations and rational limitations. Physical limitations are those caused by the interaction of the agent with its environment and are often unavoidable. Rational limitations are limitations specifically chosen by the agent designer to make the learning problem tractable, either in memory or time. We briefly explore some of these limitations informally before presenting a formal model of limitations that attempts to capture their effect within the stochastic game framework.

3.1 Physical Limitations

One obvious physical limitation is that the agent simply is broken. A mobile agent may cease to move, or less drastically may lose the use of one of its actuators, preventing certain movements. Similarly, another agent may appear to be broken when in fact the motion is simply outside its capabilities. For example, in a mobile robot environment where the rules allow robots to move up to two meters per second, there may be a robot that isn't capable of reaching that speed. An agent that is not broken may suffer from poor control where its actions aren't always carried out as desired, e.g., due to poorly tuned servos, inadequate wheel traction, or high system latency. Another common physical limitation is hardwired behavior. Most agents in dynamic domains need some amount of hard-wiring for fast response and safety. For example, many mobile robot platforms are programmed to immediately stop if an obstacle is too close. These hardwired actions prevent certain behavior by the agent, which is often unsafe but is potentially optimal. Sensing is a common area of agent limitations, containing everything from noise to partial observability. Here we'll mention just one broad category of sensing problems: state aliasing. This occurs when an agent cannot distinguish between two different states of the world. An agent may need to remember past states and actions in order to properly distinguish the states, or may simply execute the same action in both states.

3.2 Rational Limitations

Rational limitations are a requirement for agents to learn in even moderately sized problems. Techniques for making learning scale, which often focus on near-optimal solutions, continue to be proposed and investigated in single-agent learning. They are likely to be even more necessary in multiagent environments, which tend to have larger state spaces. We will examine a few specific methods. In domains with sparse rewards one common technique is reward shaping, e.g., (Mataric, 1994). A designer artificially rewards the agent for actions the designer believes to be progressing toward the sparse rewards. This can often speed learning by focusing exploration, but also can cause the agent to learn suboptimal policies. For example, in robotic soccer, moving the ball down the field is a good heuristic for goal progression, but at times the optimal goal-scoring policy is to pass the ball backwards to an open teammate. Subproblem reuse also has a similar effect, where a subgoal is used in a portion of the state space to speed learning, e.g., (Hauskrecht, Meuleau, Kaelbling, Dean, & Boutilier, 1998; Bowling & Veloso, 1999). These subgoals, though, may not be optimal for the global problem and so prevent the agent from playing optimally. Temporally abstract options, either provided (Sutton, Precup, & Singh, 1998) or learned (McGovern & Barto, 2001; Uther, 2002), also enforce a particular subpolicy on a portion of the state space. Although in theory the primitive actions are still available to the agents to play optimal policies, in practice abstraction away from primitive actions is often necessary in large or continuous state spaces. Parameterized policies are receiving a great deal of attention as a way for reinforcement learning to scale to large problems, e.g., (Williams & Baird, 1993; Sutton et al., 2000; Baxter & Bartlett,

2001). The idea is to give the learner a policy that depends on far fewer parameters than the entire policy space actually would require. Learning is then performed in this smaller space of parameters using gradient techniques. This simplifies and speeds learning at the expense of possibly not being able to represent the optimal policy in the parameter space.

3.3 Models of Limitations

This enumeration of limitations shows that there are a number and variety of limitations with which agents may be faced, and they cannot be realistically avoided. In order to understand their impact on equilibria, we model limitations formally within the game theoretic framework. We introduce two models that capture broad classes of limitations: implicit games and restricted policy spaces.

Implicit Games. Limitations may cause an agent to play suboptimally, but it may be that the agent is actually playing optimally in a different game. If this new game can be defined within the stochastic game framework we call this the implicit game, in contrast to the original game called the explicit game. For example, reward shaping adds artificial rewards to help guide the agent's search. Although the agent is no longer learning an optimal policy in the explicit game, it is learning an optimal policy of some game, specifically the game with these additional rewards added to that agent's R_i function. Another example is due to broken actuators preventing an agent from taking some action. The agent may be suboptimal in the explicit game, while still being optimal in the implicit game defined by removing these actions from the agent's action set, A_i. We can formalize this concept in the following definition.

Definition 3 Given a stochastic game (n, S, A_1...n, T, R_1...n), the tuple (n, S, Â_1...n, T̂, R̂_1...n) is an implicit game if and only if it is itself a stochastic game and there exist mappings

    τ_i : S × Â_i → A_i,

such that,

    ∀s, s' ∈ S  ∀â ∈ Â    T̂(s, ⟨â_i⟩_{i=1...n}, s') = T(s, ⟨τ_i(s, â_i)⟩_{i=1...n}, s').

Reward shaping and broken actuators can both be captured within this model. For reward shaping, the implicit game is (n, S, A_1...n, T, R̂_1...n), where R̂_i adds the shaped reward into the original reward, R_i. In this case the τ_i mappings are just the identity, τ_i(s, a) = a. For the broken actuator example, let a_0 ∈ A_i be some null action for agent i and let a_b ∈ A_i be some broken action for agent i that under the limitation has the same effect as the null action. The implicit game, then, is (n, S, A_1...n, T̂, R̂_1...n), where,

    T̂(s, a, s') = T(s, ⟨a_{-i}, a_0⟩, s')   if a_i = a_b,
                   T(s, a, s')                otherwise;

    R̂(s, a)     = R(s, ⟨a_{-i}, a_0⟩)       if a_i = a_b,
                   R(s, a)                    otherwise;

and,

    τ_i(s, a) = a_0  if a = a_b,
                a    otherwise.
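The broken-actuator construction above is mechanical enough to state as code. A sketch under our own representation (T[s][a] a next-state distribution, joint actions as tuples, R a list of per-agent reward tables), remapping agent i's broken action onto its null action:

    def broken_actuator_implicit_game(T, R, i, a_broken, a_null):
        """Build (T_hat, R_hat) in which agent i's broken action behaves like
        its null action; tau_i(s, a) = a_null if a == a_broken else a."""
        def remap(a):                        # the tau mapping, applied jointly
            a = list(a)
            if a[i] == a_broken:
                a[i] = a_null
            return tuple(a)
        T_hat = {s: {a: T[s][remap(a)] for a in T[s]} for s in T}
        R_hat = [{s: {a: R_j[s][remap(a)] for a in R_j[s]} for s in R_j}
                 for R_j in R]
        return T_hat, R_hat

The resulting implicit game can be handed to any standard solver; by the argument below, its equilibria are also equilibria of the explicit game under the limitation.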

Limitations captured by this model can be easily analyzed with respect to their effect on the existence of equilibria. Using the intuitive definition of equilibria as a joint policy such that no player can do better by changing policies, an equilibrium in the implicit game achieves this definition for the explicit game. Since all stochastic games have at least one equilibrium, so must the implicit game, and therefore the explicit game when accounting for the agents' limitations also has an equilibrium. On the other hand, many of the limitations described above cannot be modeled in this way. None of the limitations of abstraction, subproblem reuse, parameterized policies, or state aliasing lend themselves to be described by this model. This leads us to our second, and in many ways more general, model of limitations.

Restricted Policy Spaces. The second model is that of restricted policy spaces, which models limitations as restricting the agent from playing certain policies. For example, a fixed exploration strategy restricts the player to policies that select all actions with some minimum probability. Parameterized policy spaces have a restricted policy space corresponding to the space of policies that can be represented by their parameters. We can define this formally.

Definition 4 A restricted policy space for player i is a non-empty and compact subset, Π̄_i ⊆ Π_i.

The assumption of compactness³ may at first appear strange, but it is not particularly limiting, and is critical for any equilibria analysis. It should be straightforward to see that parameterized policies, exploration, state aliasing (with no memory), and subproblem reuse all can be captured as a restriction on the policies the agent can play. Therefore they can be naturally described as restricted policy spaces. On the other hand, the analysis of the existence of equilibria under this model is not at all straightforward. Since restricted policy spaces capture most of the really interesting limitations we have discussed, this is precisely the focus of the next section. Before moving on to this analysis, we summarize our enumeration of limitations in Table 2. The limitations that we have discussed are listed, along with the model that most naturally captures their effect on agent behavior.

4. Existence of Equilibria

In this section we define formally the concept of restricted equilibria, which account for agents' restricted policy spaces. We then carefully analyze what can be proven about the existence of restricted equilibria. The results presented range from somewhat trivial examples (Facts 1, 2, 3, and 4) and applications of known results from game theory and basic analysis (Theorems 1 and 5) to results that we believe are completely new (Theorems 2, 3, and 4), as well as a critical counterexample to the wider existence of restricted equilibria (Fact 5). But all of the results are in a sense novel, since this specific question has received no direct attention in either the game theory or the multiagent learning literature.

3. Since Π̄_i is a subset of a bounded set, the requirement that Π̄_i is compact merely adds that the limit point of any sequence of elements from the set is also in the set.
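As one concrete instance of Definition 4 (our example, not the article's), a softmax policy over state-action features restricts the agent to the image of a low-dimensional parameter vector; with the parameters θ ranging over a bounded closed set, that image is a compact subset of Π_i:

    import numpy as np

    def softmax_policy(theta, features):
        """pi(s, a) proportional to exp(theta . phi(s, a)). The restricted policy
        space is the image of the low-dimensional theta, not all of Pi_i.
        features[s] is an (n_actions, d) array of features phi(s, a)."""
        policy = {}
        for s, phi in features.items():
            z = phi @ theta
            z -= z.max()                     # for numerical stability
            e = np.exp(z)
            policy[s] = e / e.sum()
        return policy

    # Two states, three actions, but only d = 2 parameters: both states share
    # theta, so (as with state aliasing) their strategies cannot be set freely.
    features = {0: np.array([[1., 0.], [0., 1.], [1., 1.]]),
                1: np.array([[0., 1.], [1., 0.], [1., 1.]])}
    print(softmax_policy(np.array([0.5, -0.5]), features))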

Physical Limitations             Implicit Games    Restricted Policies
Broken Actuators                       X                    X
Hardwired Behavior                     X                    X
Poor Control                           X
State Aliasing                                              X

Rational Limitations             Implicit Games    Restricted Policies
Reward Shaping or Incentives           X
Exploration                            X                    X
State Abstraction/Options                                   X
Subproblems                                                 X
Parameterized Policy                                        X

Table 2: Common agent limitations. The column check-marks correspond to whether the limitation can be modeled straightforwardly using implicit games and/or restricted policy spaces.

4.1 Restricted Equilibria

We begin by defining the concept of equilibria under the model of restricted policy spaces. First we need a notion of best-response that accounts for the players' limitations.

Definition 5 A restricted best-response for player i, BR̄_i(π_{-i}), is the set of all policies from Π̄_i that are optimal given the other player(s) play the joint policy π_{-i}.

We can now use this to define an equilibrium.

Definition 6 A restricted equilibrium is a joint policy, π_{i=1...n}, where,

    π_i ∈ BR̄_i(π_{-i}).

So no player can, within their restricted policy space, do better by changing policies given that all the other players continue to follow the equilibrium policy.

4.2 Existence of Restricted Equilibria

We can now state some results about when equilibria are preserved by restricted policy spaces, and when they are not. Unless otherwise stated (as in Theorems 2 and 4, which only apply to discounted reward), the results presented here apply equally to both the discounted reward and the average reward formulations. We will separate the proofs for the two reward formulations when needed. The first four facts show that the question of the existence of restricted equilibria does not have a trivial answer.

Fact 1 Restricted equilibria do not necessarily exist.

Proof. Consider the Rock-Paper-Scissors matrix game with players restricted to the space of deterministic policies. There are nine joint deterministic policies, and none of these joint policies are equilibria.

Fact 2 There exist restricted policy spaces such that restricted equilibria exist.

Proof. One trivial restricted equilibrium is in the case where all agents have a singleton policy subspace. The singleton joint policy therefore must be a restricted equilibrium.

Fact 3 If π is a Nash equilibrium and π ∈ Π̄, then π is a restricted equilibrium.

Proof. If π is a Nash equilibrium, then we have

    ∀i ∈ {1 ... n}  ∀π'_i ∈ Π_i    V_i^π ≥ V_i^{⟨π'_i, π_{-i}⟩}.

Since Π̄_i ⊆ Π_i, we also have

    ∀i ∈ {1 ... n}  ∀π'_i ∈ Π̄_i    V_i^π ≥ V_i^{⟨π'_i, π_{-i}⟩},

and thus π is a restricted equilibrium.

On the other hand, the converse is not true; not all restricted equilibria are of this trivial variety.

Fact 4 There exist non-trivial restricted equilibria that are neither Nash equilibria nor come from singleton policy spaces.

Proof. Consider the Rock-Paper-Scissors matrix game from Table 1. Suppose the column player is forced, due to some limitation, to play Paper exactly half the time, but is free to choose between Rock and Scissors otherwise. This is a restricted policy space that excludes the only Nash equilibrium of the game. We can solve this game using the implicit game model, by giving the limited player only two actions, s_1 = (0.5, 0.5, 0) and s_2 = (0, 0.5, 0.5), which the player can mix between. This is depicted graphically in Figure 1. We can solve the implicit game and convert the two actions back to actions of the explicit game to find a restricted equilibrium. Notice this restricted equilibrium is not a Nash equilibrium.

[Figure 1: Example of a restricted equilibrium that is not a Nash equilibrium. Here, the column player in Rock-Paper-Scissors is restricted to playing only linear combinations of the strategies s_1 = ⟨1/2, 1/2, 0⟩ and s_2 = ⟨0, 1/2, 1/2⟩. The Nash equilibrium of the explicit game is ⟨1/3, 1/3, 1/3⟩ for each player; the restricted equilibrium is ⟨0, 1/3, 2/3⟩ for the row player and ⟨1/3, 1/2, 1/6⟩ for the column player.]

Notice that the Fact 4 example has a convex policy space, i.e., all linear combinations of policies in the set are also in the set. Also, notice that the Fact 1 counterexample has a non-convex policy space. This suggests that restricted equilibria may exist as long as the restricted policy space is convex. We can prove this for matrix games, but unfortunately it is not generally true for stochastic games.
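Fact 4's restricted equilibrium can be computed exactly as the proof describes: form the 3×2 implicit payoff matrix whose columns are the limited player's two available mixtures, solve the resulting zero-sum game, and map the column strategy back to explicit actions. A sketch (ours), reusing the minimax_strategy routine from the Section 2.4 example; its output matches the strategies listed in Figure 1:

    import numpy as np

    R_row = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]])
    s1, s2 = np.array([0.5, 0.5, 0.0]), np.array([0.0, 0.5, 0.5])

    M = R_row @ np.column_stack([s1, s2])   # 3x2 implicit game for the row player
    x, v = minimax_strategy(M)              # unrestricted row player
    q, _ = minimax_strategy(-M.T)           # restricted column player
    y = q[0] * s1 + q[1] * s2               # back to explicit actions
    print(np.round(x, 3), np.round(y, 3), round(v, 3))
    # ~[0. 0.333 0.667], [0.333 0.5 0.167], value 1/6 to the unrestricted player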

Theorem 1 When |S| = 1, i.e., in matrix games, if Π̄_i is convex, then there exists a restricted equilibrium.

Proof. One might think of proving this by appealing to implicit games as was used in Fact 4. In fact, if Π̄_i were a convex hull of a finite number of strategies, this would be the case. In order to prove it for any convex Π̄_i, we apply Rosen's theorem about the existence of equilibria in concave games (Rosen, 1965). In order to use this theorem we need to show the following:

1. Π̄ is non-empty, compact, and convex.
2. V_i^π as a function over π ∈ Π̄ is continuous.
3. For any π ∈ Π̄, the function over π'_i ∈ Π̄_i defined as V_i^{⟨π'_i, π_{-i}⟩} is concave.

Condition 1 is by assumption. In matrix games, where S = {s_0}, we can simplify the definition of a policy's value from Equations 1 and 2:

    V_i^π = (1/(1−γ)) Σ_{a∈A} R_i(s_0, a) ∏_{i=1...n} π_i(s_0, a_i),    (3)

where the factor 1/(1−γ) is taken to be 1 for the average reward formulation. Equation 3 shows that the value is a multilinear function with respect to the joint policy and therefore is continuous. So Condition 2 is satisfied. Observe that by fixing the policies for all but one player, Equation 3 becomes a linear function over the remaining player's policy and so is also concave, satisfying Condition 3. Therefore Rosen's theorem applies and this game has a restricted equilibrium.

Fact 5 For a stochastic game, even if Π̄_i is convex, restricted equilibria do not necessarily exist.

Proof. Consider the stochastic game in Figure 2. This is a zero-sum game where only the payoffs to the row player are shown. The discount factor γ ∈ (0, 1). The actions available to the row player are U and D, and for the column player L or R. From the initial state, the column player may select either L or R, which results in no rewards but with high probability, 1 − ε, transitions to the specified state (regardless of the row player's action), and with low probability, ε, transitions to the opposite state. Notice that this stochasticity is not explicitly shown in Figure 2. In each of the resulting states the players play the matrix game shown and then deterministically transition back to the initial state. Notice that this game is unichain, where all the states are in a single ergodic set, thus satisfying the average reward formulation requirement.

13 L s R s L s R Fgure 2: An example stochastc game where convex restrcted polcy spaces don t preserve the exstence of equlbra. Now consder the restrcted polcy space where players have to play ther actons wth the same probablty n all states. So, Π = { π Π s, s S a A π (s, a) = π (s, a) }. (4) Notce that ths s a convex set of polces. That s, f polces x and x 2 are n Π (accordng to Equaton 4), then for any α [, ], x 3 must also be n Π, where, x 3 (s, a) = αx (s, a) + ( α)x 2 (s, a). (5) Ths can be seen by examnng x 3 (s, a) for any s S. From Equaton 5, we have, x 3 (s, a) = αx (s, a) + ( α)x 2 (s, a) (6) = αx (s, a) + ( α)x 2 (s, a) (7) = x 3 (s, a). (8) Therefore, x 3 s n Π and hence Π s convex. Ths game, though, does not have a restrcted equlbrum. The four possble jont determnstc polces, (U, L), (U, R), (D, L), and (D, R), are not equlbra. So f there exsts an equlbrum t must be mxed. Consder any mxed strategy for the row player. If ths strategy plays U wth probablty less than 2 then the unque best-response for the column player s to play L; f greater than 2 then the unque best-response s to play R; f equal then the unque best-responses are to play L or R determnstcally. In all cases all best-responses are determnstc, so ths rules out mxed strategy equlbra, and so no equlbra exsts. Convexty s not a strong enough property to guarantee the exstence of restrcted equlbra. Standard equlbrum proof technques fal for ths example due to the fact that the player s bestresponse sets are not convex, even though ther restrcted polcy spaces are convex. Notce that the best-response to the row player mxng equally between actons s to play ether of ts actons determnstcally. But, lnear combnatons of these actons (e.g., mxng equally) are not bestresponses. Ths ntuton s proven n the followng lemma. Lemma For any stochastc game, f Π s convex and for all π Π, BR (π ) s convex, then there exsts a restrcted equlbrum. 3

Proof. The proof relies on Kakutani's fixed point theorem. We first need to show some facts about the restricted best-response function. First, remember that Π̄_i is non-empty and compact. Also, note that the value (with both discounted and average reward) to a player at any state of a joint policy is a continuous function of that joint policy (Filar & Vrieze, 1997, Lemma 5.1.4). Therefore, from basic analysis (Gaughan, 1993, Theorem 3.5 and Corollary 3.1), the set of maximizing (or optimal) points must be a non-empty and compact set. So BR̄_i(π_{-i}) is non-empty and compact. Define the set-valued function, F : Π̄ → Π̄,

    F(π) = ×_{i=1}^{n} BR̄_i(π_{-i}).

We want to show F has a fixed point. To apply Kakutani's fixed point theorem we must show the following conditions to be true:

1. Π̄ is a non-empty, compact, and convex subset of a Euclidean space,
2. F(π) is non-empty,
3. F(π) is compact and convex, and
4. F is upper hemi-continuous.

Since the Cartesian product of non-empty, compact, and convex sets is non-empty, compact, and convex, we have condition (1) by the assumptions on Π̄_i. By the facts about BR̄_i from above and the lemma's assumptions we similarly get conditions (2) and (3). What remains is to show condition (4). Consider two sequences x^j → x ∈ Π̄ and y^j → y ∈ Π̄ such that ∀j, y^j ∈ F(x^j). It must be shown that y ∈ F(x), or just y_i ∈ BR̄_i(x_{-i}). Let v be y_i's value against x_{-i}. By contradiction, assume there exists a y'_i with higher value, v'; let δ = v' − v. Since the value function is continuous, we can choose an N large enough that the value of y'_i against x^N differs from v' by at most δ/4,⁴ and the value of y_i against x^N differs from v by at most δ/4, and the value of y^N_i against x^N differs from the value of y_i against x^N by at most δ/4. The comparison of values of these various joint policies is shown in Figure 3. Adding all of these together, we have a point in the sequence, y^{n>N}, whose value against x^n is less than the value of y'_i against x^n. So y^n_i ∉ BR̄_i(x^n_{-i}), and therefore y^n ∉ F(x^n), creating our contradiction.

[Figure 3: An illustration of the demonstration by contradiction that the best-response functions are upper hemi-continuous. The values of the joint policies (y', x), (y', x^N), (y^N, x^N), (y, x^N), and (y, x) range from v' down to v, with consecutive differences of at most δ/4.]

We can now apply Kakutani's fixed point theorem. So there exists π ∈ Π̄ such that π ∈ F(π). This means π_i ∈ BR̄_i(π_{-i}), and therefore this is a restricted equilibrium.

4. This value is arbitrarily selected and is only required to be strictly smaller than δ/3.

The consequence of this lemma is that, if we can prove that the sets of restricted best-responses are convex, then restricted equilibria exist. As we have stated earlier, this was not true of the counterexample in Fact 5. The next four theorems all further limit either the restricted policy spaces or the stochastic game to situations where the best-response sets are provably convex. We will first examine a specific class of restricted policy spaces, and then examine specific classes of stochastic games.

4.2.1 A SUBCLASS OF RESTRICTED POLICIES

Our first result for general stochastic games uses a stronger notion of convexity for restricted policy spaces.

Definition 7 A restricted policy space Π̄_i is statewise convex if it is the Cartesian product over all states of convex strategy sets. Equivalently, if for all x_1, x_2 ∈ Π̄_i and all functions α : S → [0, 1], the policy

    x_3(s, a) = α(s)x_1(s, a) + (1 − α(s))x_2(s, a)

is also in Π̄_i.

Theorem 2 In the discounted reward formulation, if Π̄_i is statewise convex, then there exists a restricted equilibrium.

Proof. With statewise convex policy spaces, there exist optimal policies in the strong sense, as mentioned in Section 2.2. Specifically, there exists a policy that can simultaneously maximize the value of all states. Formally, for any π_{-i} there exists a π_i ∈ Π̄_i such that,

    ∀s ∈ S  ∀π'_i ∈ Π̄_i    V_i^{⟨π_i, π_{-i}⟩}(s) ≥ V_i^{⟨π'_i, π_{-i}⟩}(s).

Suppose this were not true, i.e., there were two policies, each of which maximized the value of different states. We can construct a new policy that in each state follows the policy whose value is larger for that state. This policy will maximize the value of both states that those policies maximized, and due to statewise convexity is also in Π̄_i. We will use that fact to redefine optimality to this strong sense for this proof. We will now make use of Lemma 1. First, notice the lemma's proof still holds even with this new definition of optimality. We just showed that under this redefinition, BR̄_i(π_{-i}) is non-empty, and the same argument for compactness of BR̄_i(π_{-i}) holds. So we can make use of Lemma 1 and what remains is to prove that BR̄_i(π_{-i}) is convex. Since π_{-i} is a fixed policy for all the other players, this defines an MDP for player i (Filar & Vrieze, 1997, Corollary 4.2.1). So we need to show that the set of policies from the player's restricted set that are optimal for this MDP is a convex set. Concretely, if x_1, x_2 ∈ Π̄_i are optimal for this MDP, then the policy x_3(s, a) = αx_1(s, a) + (1 − α)x_2(s, a) is also optimal for any α ∈ [0, 1]. Since x_1 and x_2 are optimal in the strong sense, i.e., maximizing the value of all states simultaneously, they must have the same per-state value. Here, we will use the notation V^x(s) to refer to the value of policy x from state s in this fixed MDP. The value function for any policy satisfies the Bellman equations, specifically,

    ∀s ∈ S    V^x(s) = Σ_a x(s, a) ( R(s, a) + γ Σ_{s'} T(s, a, s') V^x(s') ).    (9)

For x_3 then we get the following,

    V^{x_3}(s) = Σ_a x_3(s, a) ( R(s, a) + γ Σ_{s'} T(s, a, s') V^{x_3}(s') )    (10)

               = Σ_a ( αx_1(s, a) + (1 − α)x_2(s, a) ) ( R(s, a) + γ Σ_{s'} T(s, a, s') V^{x_3}(s') )    (11)
               = α Σ_a x_1(s, a) ( R(s, a) + γ Σ_{s'} T(s, a, s') V^{x_3}(s') )
                 + (1 − α) Σ_a x_2(s, a) ( R(s, a) + γ Σ_{s'} T(s, a, s') V^{x_3}(s') ).    (12)

Notice that V^{x_3}(s) = V^{x_1}(s) = V^{x_2}(s) satisfies these equations. So x_3 has the same values as x_1 and x_2, and is therefore also optimal. Therefore BR̄_i(π_{-i}) is convex, and from Lemma 1 we get the existence of restricted equilibria under this stricter notion of optimality, which also makes the policies a restricted equilibrium under our original notion of optimality, that is, only maximizing the value of the initial state.

4.2.2 SUBCLASSES OF STOCHASTIC GAMES

Unfortunately, most rational limitations that allow reinforcement learning to scale are not statewise convex restrictions, and usually have some dependence between states. For example, parameterized policies involve far fewer parameters than the number of states, which can be intractably large, and so the space of policies cannot select actions at each state independently. Similarly, subproblems force whole portions of the state space to follow the same subproblem solution. Therefore, these portions of the state space do not select their actions independently. One way to relax from statewise convexity to general convexity is to consider only a subset of stochastic games.

Theorem 3 Consider no-control stochastic games, where all transitions are independent of the players' actions, i.e.,

    ∀s, s' ∈ S  ∀a, b ∈ A    T(s, a, s') = T(s, b, s').

If Π̄_i is convex, then there exists a restricted equilibrium.

Proof (Discounted Reward). This proof also makes use of Lemma 1, leaving us only to show that BR̄_i(π_{-i}) is convex. Just as in the proof of Theorem 2, we will consider the MDP defined for player i when the other players follow the fixed policy π_{-i}. As before, it suffices to show that for this MDP, if x_1, x_2 ∈ Π̄_i are optimal, then the policy x_3(s, a) = αx_1(s, a) + (1 − α)x_2(s, a) is also optimal for any α ∈ [0, 1]. Again we use the notation V^π(s) to refer to the traditional value of a policy π at state s in this fixed MDP. Since T(s, a, s') is independent of a, we can simplify the Bellman equations (Equation 9) to

    V^x(s) = Σ_a x(s, a)R(s, a) + γ Σ_{s'} Σ_a x(s, a) T(s, a, s') V^x(s')    (13)
           = Σ_a x(s, a)R(s, a) + γ Σ_{s'} T(s, ·, s') V^x(s').               (14)

For the policy x_3, the value of state s is then,

    V^{x_3}(s) = α Σ_a x_1(s, a)R(s, a) + (1 − α) Σ_a x_2(s, a)R(s, a) + γ Σ_{s'} T(s, ·, s') V^{x_3}(s').    (15)

Using Equation 14 for both x_1 and x_2 we get,

    V^{x_3}(s) = α ( V^{x_1}(s) − γ Σ_{s'} T(s, ·, s') V^{x_1}(s') )
                 + (1 − α) ( V^{x_2}(s) − γ Σ_{s'} T(s, ·, s') V^{x_2}(s') )
                 + γ Σ_{s'} T(s, ·, s') V^{x_3}(s')    (16)
               = αV^{x_1}(s) + (1 − α)V^{x_2}(s)
                 + γ Σ_{s'} T(s, ·, s') ( V^{x_3}(s') − αV^{x_1}(s') − (1 − α)V^{x_2}(s') ).    (17)

Notice that a solution to these equations is V^{x_3}(s) = αV^{x_1}(s) + (1 − α)V^{x_2}(s). Therefore V^{x_3}(s') is equal to V^{x_1}(s') and V^{x_2}(s'), which are equal since both are optimal. So x_3 is optimal, and BR̄_i(π_{-i}) is convex. Applying Lemma 1, we get that restricted equilibria exist.

Proof (Average Reward). An equivalent definition to Equation 2 of a policy's average reward is,

    V^π(s) = Σ_{s'∈S} d^π(s') Σ_a π(s', a) R(s', a),    (18)

where d^π(s') defines the distribution over states visited while following π after infinite time. For a stochastic game or MDP that is unichain, we know that this distribution is independent of the initial state. In the case of no-control stochastic games or MDPs, this distribution becomes independent of the actions and policies of the players, and depends solely on the transition probabilities. So Equation 18 can be written,

    V^π(s) = Σ_{s'∈S} d(s') Σ_a π(s', a) R(s', a).    (19)

As before, we must show that BR̄_i(π_{-i}) is convex to apply Lemma 1. Consider the MDP defined for player i when the other players follow the policy π_{-i}. It suffices to show that for this MDP, if x_1, x_2 ∈ Π̄_i are optimal, then the policy x_3(s, a) = αx_1(s, a) + (1 − α)x_2(s, a) is also optimal for any α ∈ [0, 1]. Using Equation 19, we can write the value of x_3 as,

    V^{x_3}(s) = Σ_{s'} d(s') Σ_a x_3(s', a) R(s', a)    (20)
               = Σ_{s'} d(s') Σ_a ( αx_1(s', a) + (1 − α)x_2(s', a) ) R(s', a)    (21)
               = Σ_{s'} d(s') ( α Σ_a x_1(s', a) R(s', a) + (1 − α) Σ_a x_2(s', a) R(s', a) )    (22)
               = α Σ_{s'} d(s') Σ_a x_1(s', a) R(s', a) + (1 − α) Σ_{s'} d(s') Σ_a x_2(s', a) R(s', a)    (23)
               = αV^{x_1}(s) + (1 − α)V^{x_2}(s).    (24)

Therefore x_3 has the same average reward as x_1 and x_2 and so is also optimal. So BR̄_i(π_{-i}) is convex and by Lemma 1 there exists an equilibrium.
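Equation 19 makes the average-reward value linear in the policy because the state distribution d is policy-independent; this is easy to check numerically. A sketch (ours) for a two-state no-control chain:

    import numpy as np

    def stationary_distribution(P):
        """Solve d = d P with sum(d) = 1 for an ergodic chain (least squares)."""
        n = P.shape[0]
        A = np.vstack([P.T - np.eye(n), np.ones(n)])
        b = np.zeros(n + 1); b[-1] = 1.0
        return np.linalg.lstsq(A, b, rcond=None)[0]

    def average_value(d, policy, R):
        """V = sum_s d(s) sum_a pi(s, a) R(s, a); policy, R: (n_states, n_actions)."""
        return float(np.sum(d[:, None] * policy * R))

    P = np.array([[0.1, 0.9], [0.6, 0.4]])   # action-independent transitions
    R = np.array([[1.0, 0.0], [0.0, 2.0]])
    d = stationary_distribution(P)
    x1 = np.array([[1.0, 0.0], [1.0, 0.0]])
    x2 = np.array([[0.0, 1.0], [0.0, 1.0]])
    x3 = 0.5 * x1 + 0.5 * x2                 # a convex combination of policies
    assert abs(average_value(d, x3, R) -
               0.5 * (average_value(d, x1, R) + average_value(d, x2, R))) < 1e-12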

We can now merge Theorem 2 and Theorem 3, allowing us to prove existence of equilibria for a general class of games where only one of the players' actions affects the next state.

Theorem 4 Consider single-controller stochastic games (Filar & Vrieze, 1997), where all transitions depend solely on player 1's actions, i.e.,

    ∀s, s' ∈ S  ∀a, b ∈ A    a_1 = b_1 ⇒ T(s, a, s') = T(s, b, s').

In the discounted reward formulation, if Π̄_1 is statewise convex and Π̄_{i≠1} is convex, then there exists a restricted equilibrium.

Proof. This proof again makes use of Lemma 1, leaving us to show that BR̄_i(π_{-i}) is convex. For i = 1 we use the argument from the proof of Theorem 2. For i ≠ 1 we use the argument from Theorem 3.

The previous results have looked at stochastic games whose transition functions have particular properties. Our final theorem examines stochastic games where the rewards have a particular structure. Specifically, we address team games, where the agents all receive equal payoffs.

Theorem 5 For team games, i.e.,

    ∀i, j ∈ {1, ..., n}  ∀s ∈ S  ∀a ∈ A    R_i(s, a) = R_j(s, a),

there exists a restricted equilibrium.

Proof. The only constraints on the players' restricted policy spaces are those stated at the beginning of this section: non-empty and compact. Since Π̄ is compact, being a Cartesian product of compact sets, and player one's value in either formulation is a continuous function of the joint policy, the value function attains its maximum (Gaughan, 1993, Corollary 3.1). Specifically, there exists π ∈ Π̄ such that,

    ∀π' ∈ Π̄    V_1^π ≥ V_1^{π'}.

Since V_i = V_1 we then get that the policy π maximizes all the players' rewards, and so each must be playing a restricted best-response to the others' policies.

4.3 Summary

Facts 1 and 5 provide counterexamples that show the threat limitations pose to equilibria. Theorems 1, 2, 4, and 5 give us four general classes of stochastic games and restricted policy spaces where equilibria are guaranteed to exist. The fact that equilibria do not exist in general raises concerns about equilibria as a general basis for multiagent learning in domains where agents have limitations. On the other hand, combined with the model of implicit games, the presented theoretical results lay the initial groundwork for understanding when equilibria can be relied on and when their existence may be in question. These contributions also provide some formal foundation for applying multiagent learning in limited agent problems.

5. Learning with Limitations

In Section 2, we highlighted the importance of the existence of equilibria to multiagent learning algorithms. This section presents results of applying a particular learning algorithm to a setting of limited agents. We use the best-response learner WoLF-PHC (Bowling & Veloso, 2002). This algorithm is rational, that is, it is guaranteed to converge to a best-response if the other players' policies converge. In addition, it has been empirically shown to converge in self-play, where both players use WoLF-PHC for learning. In this article we apply this algorithm in self-play to matrix games, both with and without player limitations. Since the algorithm is rational, if the players converge, their converged policies must be an equilibrium (Bowling & Veloso, 2002). The specific limitations we examine fall into both the restricted policy space model as well as the implicit game model. One player is restricted to playing strategies that are the convex hull of a subset of the available strategies. From Theorem 1, there exists a restricted equilibrium with these limitations. For best-response learners, this amounts to a possible convergence point for the players. For the limited player, the WoLF-PHC algorithm is modified slightly so that the player maintains Q-values of its restricted set of available strategies and performs its usual hill-climbing in the mixed space of these strategies. The unlimited player is unchanged and completely uninformed of the limitation of its opponent.

5.1 Rock-Paper-Scissors

The first game we examine is Rock-Paper-Scissors. Figure 4 shows the results of learning when neither player is limited. Each graph shows the mixed policy the player is playing over time. The labels to the right of the graph signify the probabilities of each action in the game's unique Nash equilibrium. Observe that the players' strategies converge to this learning fixed point.

[Figure 4: WoLF-PHC in Rock-Paper-Scissors. Neither player is limited. Each plot shows one player's probabilities P(Rock), P(Paper), P(Scissors) over time; both converge to the equilibrium, with E(Reward) = 0 for each player.]
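For reference, the following is a condensed sketch of the WoLF-PHC update for a single matrix-game state, written from the algorithm's published description (Bowling & Veloso, 2002) with our own simplifications — in particular, the crude projection back onto the probability simplex — so it should be read as an illustration, not the authors' implementation:

    import numpy as np

    class WoLFPHC:
        """Win or Learn Fast policy hill-climbing for one matrix-game state."""
        def __init__(self, n_actions, alpha=0.1, d_win=0.01, d_lose=0.04):
            self.Q = np.zeros(n_actions)
            self.pi = np.ones(n_actions) / n_actions      # current policy
            self.avg_pi = np.ones(n_actions) / n_actions  # long-run average policy
            self.count = 0.0
            self.alpha, self.d_win, self.d_lose = alpha, d_win, d_lose

        def act(self, rng):
            return rng.choice(len(self.pi), p=self.pi)

        def update(self, a, reward):
            self.Q[a] += self.alpha * (reward - self.Q[a])
            self.count += 1.0
            self.avg_pi += (self.pi - self.avg_pi) / self.count
            # "winning" if the current policy outperforms the average policy
            delta = (self.d_win if self.pi @ self.Q > self.avg_pi @ self.Q
                     else self.d_lose)
            step = np.full(len(self.pi), -delta / (len(self.pi) - 1))
            step[np.argmax(self.Q)] = delta               # climb toward greedy action
            self.pi = np.clip(self.pi + step, 0.0, 1.0)
            self.pi /= self.pi.sum()                      # simplified projection

    rng = np.random.default_rng(0)
    agents = [WoLFPHC(3), WoLFPHC(3)]
    R_row = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]])
    for _ in range(50000):
        a1, a2 = agents[0].act(rng), agents[1].act(rng)
        agents[0].update(a1, R_row[a1, a2])
        agents[1].update(a2, -R_row[a1, a2])

The larger losing step (d_lose > d_win) is the "learn fast while losing" principle that stabilizes self-play; two such learners on Rock-Paper-Scissors drift toward the uniform fixed point, consistent with Figure 4.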

Figure 5 shows the results of restricting player 1 to a convex restricted policy space, defined by requiring the player to play Paper exactly half the time. This is the same restriction as shown graphically in Figure 1. The graphs again show the players' strategies over time, and the labels to the right now label the game's restricted equilibrium, which accounts for the limitation (see Figure 1). The players' strategies now converge to this new learning fixed point. If we examine the expected rewards to the players, we see that the unrestricted player gets a higher expected reward in the restricted equilibrium than in the game's Nash equilibrium (1/6 compared to 0). In summary, both players learn optimal best-response policies, with the unrestricted learner appropriately taking advantage of the other player's limitation.

[Figure 5: WoLF-PHC in Rock-Paper-Scissors. Player 1 must play Paper with probability 0.5. The strategies converge to the restricted equilibrium, with E(Reward) = −0.167 for the limited player and E(Reward) = 0.167 for the unlimited player.]

5.2 Colonel Blotto

The second game we examined is Colonel Blotto (Gintis, 2000), which is also a zero-sum matrix game. In this game, players simultaneously allot regiments to one of two battlefields. If one player allots more armies to a battlefield than the other, they receive a reward of one plus the number of armies defeated, and the other player loses this amount. If the players tie, then the reward is zero for both. In the unlimited game, the row player has four regiments to allot, and the column player has only three. The matrix of payoffs for this game is shown in Figure 6.

[Figure 6: Colonel Blotto Game. The row player's rewards are shown; the column player receives the negative of this reward.]

Figure 7 shows experimental results with unlimited players. The labels on the right signify the probabilities associated with the Nash equilibrium to which the players' strategies converge. Player 1 is then given the limitation that it could only allot two of its armies; the other two would be allotted randomly. This is also a convex restricted policy space and therefore by Theorem 1 has a restricted equilibrium. Figure 8 shows the learning results. The labels to the right correspond to the action probabilities for the restricted equilibrium, which was computed by hand. As in Rock-Paper-Scissors, the players' strategies converge to the new learning fixed point. Similarly, the expected reward for the unrestricted player resulting from the restricted equilibrium is considerably
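Since the payoff matrix in Figure 6 is fully determined by the rules just described, it can be regenerated mechanically; the following sketch (our code) enumerates the allotments and scores each battlefield as one plus the armies defeated:

    import itertools
    import numpy as np

    def blotto_payoffs(row_regiments=4, col_regiments=3, fields=2):
        """Row player's payoff matrix; the column player receives its negation."""
        def allotments(total):
            return [a for a in itertools.product(range(total + 1), repeat=fields)
                    if sum(a) == total]
        rows, cols = allotments(row_regiments), allotments(col_regiments)
        R = np.zeros((len(rows), len(cols)))
        for i, ra in enumerate(rows):
            for j, ca in enumerate(cols):
                for f in range(fields):
                    if ra[f] > ca[f]:
                        R[i, j] += 1 + ca[f]    # win: one plus armies defeated
                    elif ra[f] < ca[f]:
                        R[i, j] -= 1 + ra[f]    # loss: symmetric penalty
        return rows, cols, R

    rows, cols, R = blotto_payoffs()
    print(rows); print(cols); print(R)          # a 5x4 zero-sum payoff matrix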


More information

Measures of Spread IQR and Deviation. For exam X, calculate the mean, median and mode. For exam Y, calculate the mean, median and mode.

Measures of Spread IQR and Deviation. For exam X, calculate the mean, median and mode. For exam Y, calculate the mean, median and mode. Part 4 Measures of Spread IQR and Devaton In Part we learned how the three measures of center offer dfferent ways of provdng us wth a sngle representatve value for a data set. However, consder the followng

More information

Appendix - Normally Distributed Admissible Choices are Optimal

Appendix - Normally Distributed Admissible Choices are Optimal Appendx - Normally Dstrbuted Admssble Choces are Optmal James N. Bodurtha, Jr. McDonough School of Busness Georgetown Unversty and Q Shen Stafford Partners Aprl 994 latest revson September 00 Abstract

More information

Quiz on Deterministic part of course October 22, 2002

Quiz on Deterministic part of course October 22, 2002 Engneerng ystems Analyss for Desgn Quz on Determnstc part of course October 22, 2002 Ths s a closed book exercse. You may use calculators Grade Tables There are 90 ponts possble for the regular test, or

More information

- contrast so-called first-best outcome of Lindahl equilibrium with case of private provision through voluntary contributions of households

- contrast so-called first-best outcome of Lindahl equilibrium with case of private provision through voluntary contributions of households Prvate Provson - contrast so-called frst-best outcome of Lndahl equlbrum wth case of prvate provson through voluntary contrbutons of households - need to make an assumpton about how each household expects

More information

Parallel Prefix addition

Parallel Prefix addition Marcelo Kryger Sudent ID 015629850 Parallel Prefx addton The parallel prefx adder presented next, performs the addton of two bnary numbers n tme of complexty O(log n) and lnear cost O(n). Lets notce the

More information

New Distance Measures on Dual Hesitant Fuzzy Sets and Their Application in Pattern Recognition

New Distance Measures on Dual Hesitant Fuzzy Sets and Their Application in Pattern Recognition Journal of Artfcal Intellgence Practce (206) : 8-3 Clausus Scentfc Press, Canada New Dstance Measures on Dual Hestant Fuzzy Sets and Ther Applcaton n Pattern Recognton L Xn a, Zhang Xaohong* b College

More information

Applications of Myerson s Lemma

Applications of Myerson s Lemma Applcatons of Myerson s Lemma Professor Greenwald 28-2-7 We apply Myerson s lemma to solve the sngle-good aucton, and the generalzaton n whch there are k dentcal copes of the good. Our objectve s welfare

More information

Problem Set #4 Solutions

Problem Set #4 Solutions 4.0 Sprng 00 Page Problem Set #4 Solutons Problem : a) The extensve form of the game s as follows: (,) Inc. (-,-) Entrant (0,0) Inc (5,0) Usng backwards nducton, the ncumbent wll always set hgh prces,

More information

Static (or Simultaneous- Move) Games of Complete Information

Static (or Simultaneous- Move) Games of Complete Information Statc (or Smultaneous- Move) Games of Complete Informaton Nash Equlbrum Best Response Functon F. Valognes - Game Theory - Chp 3 Outlne of Statc Games of Complete Informaton Introducton to games Normal-form

More information

Quadratic Games. First version: February 24, 2017 This version: August 3, Abstract

Quadratic Games. First version: February 24, 2017 This version: August 3, Abstract Quadratc Games Ncolas S. Lambert Gorgo Martn Mchael Ostrovsky Frst verson: February 24, 2017 Ths verson: August 3, 2018 Abstract We study general quadratc games wth multdmensonal actons, stochastc payoff

More information

Optimal Service-Based Procurement with Heterogeneous Suppliers

Optimal Service-Based Procurement with Heterogeneous Suppliers Optmal Servce-Based Procurement wth Heterogeneous Supplers Ehsan Elah 1 Saf Benjaafar 2 Karen L. Donohue 3 1 College of Management, Unversty of Massachusetts, Boston, MA 02125 2 Industral & Systems Engneerng,

More information

Quadratic Games. First version: February 24, 2017 This version: December 12, Abstract

Quadratic Games. First version: February 24, 2017 This version: December 12, Abstract Quadratc Games Ncolas S. Lambert Gorgo Martn Mchael Ostrovsky Frst verson: February 24, 2017 Ths verson: December 12, 2017 Abstract We study general quadratc games wth mult-dmensonal actons, stochastc

More information

Survey of Math: Chapter 22: Consumer Finance Borrowing Page 1

Survey of Math: Chapter 22: Consumer Finance Borrowing Page 1 Survey of Math: Chapter 22: Consumer Fnance Borrowng Page 1 APR and EAR Borrowng s savng looked at from a dfferent perspectve. The dea of smple nterest and compound nterest stll apply. A new term s the

More information

Clearing Notice SIX x-clear Ltd

Clearing Notice SIX x-clear Ltd Clearng Notce SIX x-clear Ltd 1.0 Overvew Changes to margn and default fund model arrangements SIX x-clear ( x-clear ) s closely montorng the CCP envronment n Europe as well as the needs of ts Members.

More information

A New Uniform-based Resource Constrained Total Project Float Measure (U-RCTPF) Roni Levi. Research & Engineering, Haifa, Israel

A New Uniform-based Resource Constrained Total Project Float Measure (U-RCTPF) Roni Levi. Research & Engineering, Haifa, Israel Management Studes, August 2014, Vol. 2, No. 8, 533-540 do: 10.17265/2328-2185/2014.08.005 D DAVID PUBLISHING A New Unform-based Resource Constraned Total Project Float Measure (U-RCTPF) Ron Lev Research

More information

Problem Set 6 Finance 1,

Problem Set 6 Finance 1, Carnege Mellon Unversty Graduate School of Industral Admnstraton Chrs Telmer Wnter 2006 Problem Set 6 Fnance, 47-720. (representatve agent constructon) Consder the followng two-perod, two-agent economy.

More information

MULTIPLE CURVE CONSTRUCTION

MULTIPLE CURVE CONSTRUCTION MULTIPLE CURVE CONSTRUCTION RICHARD WHITE 1. Introducton In the post-credt-crunch world, swaps are generally collateralzed under a ISDA Master Agreement Andersen and Pterbarg p266, wth collateral rates

More information

A DUAL EXTERIOR POINT SIMPLEX TYPE ALGORITHM FOR THE MINIMUM COST NETWORK FLOW PROBLEM

A DUAL EXTERIOR POINT SIMPLEX TYPE ALGORITHM FOR THE MINIMUM COST NETWORK FLOW PROBLEM Yugoslav Journal of Operatons Research Vol 19 (2009), Number 1, 157-170 DOI:10.2298/YUJOR0901157G A DUAL EXTERIOR POINT SIMPLEX TYPE ALGORITHM FOR THE MINIMUM COST NETWORK FLOW PROBLEM George GERANIS Konstantnos

More information

Members not eligible for this option

Members not eligible for this option DC - Lump sum optons R6.2 Uncrystallsed funds penson lump sum An uncrystallsed funds penson lump sum, known as a UFPLS (also called a FLUMP), s a way of takng your penson pot wthout takng money from a

More information

Dynamic Analysis of Knowledge Sharing of Agents with. Heterogeneous Knowledge

Dynamic Analysis of Knowledge Sharing of Agents with. Heterogeneous Knowledge Dynamc Analyss of Sharng of Agents wth Heterogeneous Kazuyo Sato Akra Namatame Dept. of Computer Scence Natonal Defense Academy Yokosuka 39-8686 JAPAN E-mal {g40045 nama} @nda.ac.jp Abstract In ths paper

More information

Production and Supply Chain Management Logistics. Paolo Detti Department of Information Engeneering and Mathematical Sciences University of Siena

Production and Supply Chain Management Logistics. Paolo Detti Department of Information Engeneering and Mathematical Sciences University of Siena Producton and Supply Chan Management Logstcs Paolo Dett Department of Informaton Engeneerng and Mathematcal Scences Unversty of Sena Convergence and complexty of the algorthm Convergence of the algorthm

More information

A Distributed Algorithm for Constrained Multi-Robot Task Assignment for Grouped Tasks

A Distributed Algorithm for Constrained Multi-Robot Task Assignment for Grouped Tasks A Dstrbuted Algorthm for Constraned Mult-Robot Tas Assgnment for Grouped Tass Lngzh Luo Robotcs Insttute Carnege Mellon Unversty Pttsburgh, PA 15213 lngzhl@cs.cmu.edu Nlanjan Charaborty Robotcs Insttute

More information

Lecture Note 2 Time Value of Money

Lecture Note 2 Time Value of Money Seg250 Management Prncples for Engneerng Managers Lecture ote 2 Tme Value of Money Department of Systems Engneerng and Engneerng Management The Chnese Unversty of Hong Kong Interest: The Cost of Money

More information

Bargaining over Strategies of Non-Cooperative Games

Bargaining over Strategies of Non-Cooperative Games Games 05, 6, 73-98; do:0.3390/g603073 Artcle OPEN ACCESS games ISSN 073-4336 www.mdp.com/ournal/games Barganng over Strateges of Non-Cooperatve Games Guseppe Attanas, *, Aurora García-Gallego, Nkolaos

More information

Appendix for Solving Asset Pricing Models when the Price-Dividend Function is Analytic

Appendix for Solving Asset Pricing Models when the Price-Dividend Function is Analytic Appendx for Solvng Asset Prcng Models when the Prce-Dvdend Functon s Analytc Ovdu L. Caln Yu Chen Thomas F. Cosmano and Alex A. Hmonas January 3, 5 Ths appendx provdes proofs of some results stated n our

More information

An Application of Alternative Weighting Matrix Collapsing Approaches for Improving Sample Estimates

An Application of Alternative Weighting Matrix Collapsing Approaches for Improving Sample Estimates Secton on Survey Research Methods An Applcaton of Alternatve Weghtng Matrx Collapsng Approaches for Improvng Sample Estmates Lnda Tompkns 1, Jay J. Km 2 1 Centers for Dsease Control and Preventon, atonal

More information

2.1 Rademacher Calculus... 3

2.1 Rademacher Calculus... 3 COS 598E: Unsupervsed Learnng Week 2 Lecturer: Elad Hazan Scrbe: Kran Vodrahall Contents 1 Introducton 1 2 Non-generatve pproach 1 2.1 Rademacher Calculus............................... 3 3 Spectral utoencoders

More information

Economics 1410 Fall Section 7 Notes 1. Define the tax in a flexible way using T (z), where z is the income reported by the agent.

Economics 1410 Fall Section 7 Notes 1. Define the tax in a flexible way using T (z), where z is the income reported by the agent. Economcs 1410 Fall 2017 Harvard Unversty Yaan Al-Karableh Secton 7 Notes 1 I. The ncome taxaton problem Defne the tax n a flexble way usng T (), where s the ncome reported by the agent. Retenton functon:

More information

Members not eligible for this option

Members not eligible for this option DC - Lump sum optons R6.1 Uncrystallsed funds penson lump sum An uncrystallsed funds penson lump sum, known as a UFPLS (also called a FLUMP), s a way of takng your penson pot wthout takng money from a

More information

Finite Math - Fall Section Future Value of an Annuity; Sinking Funds

Finite Math - Fall Section Future Value of an Annuity; Sinking Funds Fnte Math - Fall 2016 Lecture Notes - 9/19/2016 Secton 3.3 - Future Value of an Annuty; Snkng Funds Snkng Funds. We can turn the annutes pcture around and ask how much we would need to depost nto an account

More information

Linear Combinations of Random Variables and Sampling (100 points)

Linear Combinations of Random Variables and Sampling (100 points) Economcs 30330: Statstcs for Economcs Problem Set 6 Unversty of Notre Dame Instructor: Julo Garín Sprng 2012 Lnear Combnatons of Random Varables and Samplng 100 ponts 1. Four-part problem. Go get some

More information

Global Optimization in Multi-Agent Models

Global Optimization in Multi-Agent Models Global Optmzaton n Mult-Agent Models John R. Brge R.R. McCormck School of Engneerng and Appled Scence Northwestern Unversty Jont work wth Chonawee Supatgat, Enron, and Rachel Zhang, Cornell 11/19/2004

More information

Taxation and Externalities. - Much recent discussion of policy towards externalities, e.g., global warming debate/kyoto

Taxation and Externalities. - Much recent discussion of policy towards externalities, e.g., global warming debate/kyoto Taxaton and Externaltes - Much recent dscusson of polcy towards externaltes, e.g., global warmng debate/kyoto - Increasng share of tax revenue from envronmental taxaton 6 percent n OECD - Envronmental

More information

Optimising a general repair kit problem with a service constraint

Optimising a general repair kit problem with a service constraint Optmsng a general repar kt problem wth a servce constrant Marco Bjvank 1, Ger Koole Department of Mathematcs, VU Unversty Amsterdam, De Boelelaan 1081a, 1081 HV Amsterdam, The Netherlands Irs F.A. Vs Department

More information

A Polynomial-Time Algorithm for Action-Graph Games

A Polynomial-Time Algorithm for Action-Graph Games A Polynomal-Tme Algorthm for Acton-Graph Games Albert Xn Jang Kevn Leyton-Brown Department of Computer Scence Unversty of Brtsh Columba {ang;evnlb}@cs.ubc.ca Abstract Acton-Graph Games (AGGs) (Bhat & Leyton-Brown

More information

A Game-Theoretic Approach for Integrity Assurance in Resource-Bounded Systems

A Game-Theoretic Approach for Integrity Assurance in Resource-Bounded Systems Internatonal Journal of Informaton Securty 208 7:22 242 https://do.org/0.007/s0207-07-0364-2 A Game-Theoretc Approach for Integrty Assurance n Resource-Bounded Systems Aron aszka Yevgeny Vorobeychk Xenofon

More information

Fast Laplacian Solvers by Sparsification

Fast Laplacian Solvers by Sparsification Spectral Graph Theory Lecture 19 Fast Laplacan Solvers by Sparsfcaton Danel A. Spelman November 9, 2015 Dsclamer These notes are not necessarly an accurate representaton of what happened n class. The notes

More information

When is the lowest equilibrium payoff in a repeated game equal to the min max payoff?

When is the lowest equilibrium payoff in a repeated game equal to the min max payoff? JID:YJETH AID:3744 /FLA [m1+; v 1.113; Prn:21/08/2009; 11:31] P.1 (1-22) Journal of Economc Theory ( ) www.elsever.com/locate/jet When s the lowest equlbrum payoff n a repeated game equal to the mn max

More information

occurrence of a larger storm than our culvert or bridge is barely capable of handling? (what is The main question is: What is the possibility of

occurrence of a larger storm than our culvert or bridge is barely capable of handling? (what is The main question is: What is the possibility of Module 8: Probablty and Statstcal Methods n Water Resources Engneerng Bob Ptt Unversty of Alabama Tuscaloosa, AL Flow data are avalable from numerous USGS operated flow recordng statons. Data s usually

More information

Notes on experimental uncertainties and their propagation

Notes on experimental uncertainties and their propagation Ed Eyler 003 otes on epermental uncertantes and ther propagaton These notes are not ntended as a complete set of lecture notes, but nstead as an enumeraton of some of the key statstcal deas needed to obtan

More information

Topics on the Border of Economics and Computation November 6, Lecture 2

Topics on the Border of Economics and Computation November 6, Lecture 2 Topcs on the Border of Economcs and Computaton November 6, 2005 Lecturer: Noam Nsan Lecture 2 Scrbe: Arel Procacca 1 Introducton Last week we dscussed the bascs of zero-sum games n strategc form. We characterzed

More information

2) In the medium-run/long-run, a decrease in the budget deficit will produce:

2) In the medium-run/long-run, a decrease in the budget deficit will produce: 4.02 Quz 2 Solutons Fall 2004 Multple-Choce Questons ) Consder the wage-settng and prce-settng equatons we studed n class. Suppose the markup, µ, equals 0.25, and F(u,z) = -u. What s the natural rate of

More information

University of Toronto November 9, 2006 ECO 209Y MACROECONOMIC THEORY. Term Test #1 L0101 L0201 L0401 L5101 MW MW 1-2 MW 2-3 W 6-8

University of Toronto November 9, 2006 ECO 209Y MACROECONOMIC THEORY. Term Test #1 L0101 L0201 L0401 L5101 MW MW 1-2 MW 2-3 W 6-8 Department of Economcs Prof. Gustavo Indart Unversty of Toronto November 9, 2006 SOLUTION ECO 209Y MACROECONOMIC THEORY Term Test #1 A LAST NAME FIRST NAME STUDENT NUMBER Crcle your secton of the course:

More information

University of Toronto November 9, 2006 ECO 209Y MACROECONOMIC THEORY. Term Test #1 L0101 L0201 L0401 L5101 MW MW 1-2 MW 2-3 W 6-8

University of Toronto November 9, 2006 ECO 209Y MACROECONOMIC THEORY. Term Test #1 L0101 L0201 L0401 L5101 MW MW 1-2 MW 2-3 W 6-8 Department of Economcs Prof. Gustavo Indart Unversty of Toronto November 9, 2006 SOLUTION ECO 209Y MACROECONOMIC THEORY Term Test #1 C LAST NAME FIRST NAME STUDENT NUMBER Crcle your secton of the course:

More information

Multifactor Term Structure Models

Multifactor Term Structure Models 1 Multfactor Term Structure Models A. Lmtatons of One-Factor Models 1. Returns on bonds of all maturtes are perfectly correlated. 2. Term structure (and prces of every other dervatves) are unquely determned

More information

Jeffrey Ely. October 7, This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License.

Jeffrey Ely. October 7, This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License. October 7, 2012 Ths work s lcensed under the Creatve Commons Attrbuton-NonCommercal-ShareAlke 3.0 Lcense. Recap We saw last tme that any standard of socal welfare s problematc n a precse sense. If we want

More information

Stackelberg vs. Nash in Security Games: Interchangeability, Equivalence, and Uniqueness

Stackelberg vs. Nash in Security Games: Interchangeability, Equivalence, and Uniqueness Stackelberg vs. Nash n Securty Games: Interchangeablty, Equvalence, and Unqueness Zhengyu Yn 1, Dmytro Korzhyk 2, Chrstopher Kekntveld 1, Vncent Contzer 2, and Mlnd Tambe 1 1 Unversty of Southern Calforna,

More information

Cyclic Scheduling in a Job shop with Multiple Assembly Firms

Cyclic Scheduling in a Job shop with Multiple Assembly Firms Proceedngs of the 0 Internatonal Conference on Industral Engneerng and Operatons Management Kuala Lumpur, Malaysa, January 4, 0 Cyclc Schedulng n a Job shop wth Multple Assembly Frms Tetsuya Kana and Koch

More information

Final Exam. 7. (10 points) Please state whether each of the following statements is true or false. No explanation needed.

Final Exam. 7. (10 points) Please state whether each of the following statements is true or false. No explanation needed. Fnal Exam Fall 4 Econ 8-67 Closed Book. Formula Sheet Provded. Calculators OK. Tme Allowed: hours Please wrte your answers on the page below each queston. (5 ponts) Assume that the rsk-free nterest rate

More information

WHEN IS THE LOWEST EQUILIBRIUM PAYOFF IN A REPEATED GAME EQUAL TO THE MINMAX PAYOFF? OLIVIER GOSSNER and JOHANNES HÖRNER

WHEN IS THE LOWEST EQUILIBRIUM PAYOFF IN A REPEATED GAME EQUAL TO THE MINMAX PAYOFF? OLIVIER GOSSNER and JOHANNES HÖRNER WHEN IS THE LOWEST EQUILIBRIUM PAYOFF IN A REPEATED GAME EQUAL TO THE MINMAX PAYOFF? BY OLIVIER GOSSNER and JOHANNES HÖRNER COWLES FOUNDATION PAPER NO. 1294 COWLES FOUNDATION FOR RESEARCH IN ECONOMICS

More information

A Single-Product Inventory Model for Multiple Demand Classes 1

A Single-Product Inventory Model for Multiple Demand Classes 1 A Sngle-Product Inventory Model for Multple Demand Classes Hasan Arslan, 2 Stephen C. Graves, 3 and Thomas Roemer 4 March 5, 2005 Abstract We consder a sngle-product nventory system that serves multple

More information

Learning to compete, coordinate, and cooperate in repeated games using reinforcement learning

Learning to compete, coordinate, and cooperate in repeated games using reinforcement learning Mach Learn (2011) 82: 281 314 DOI 10.1007/s10994-010-5192-9 Learnng to compete, coordnate, and cooperate n repeated games usng renforcement learnng Jacob W. Crandall Mchael A. Goodrch Receved: 13 February

More information

Introduction to PGMs: Discrete Variables. Sargur Srihari

Introduction to PGMs: Discrete Variables. Sargur Srihari Introducton to : Dscrete Varables Sargur srhar@cedar.buffalo.edu Topcs. What are graphcal models (or ) 2. Use of Engneerng and AI 3. Drectonalty n graphs 4. Bayesan Networks 5. Generatve Models and Samplng

More information

Economic Design of Short-Run CSP-1 Plan Under Linear Inspection Cost

Economic Design of Short-Run CSP-1 Plan Under Linear Inspection Cost Tamkang Journal of Scence and Engneerng, Vol. 9, No 1, pp. 19 23 (2006) 19 Economc Desgn of Short-Run CSP-1 Plan Under Lnear Inspecton Cost Chung-Ho Chen 1 * and Chao-Yu Chou 2 1 Department of Industral

More information

Ch Rival Pure private goods (most retail goods) Non-Rival Impure public goods (internet service)

Ch Rival Pure private goods (most retail goods) Non-Rival Impure public goods (internet service) h 7 1 Publc Goods o Rval goods: a good s rval f ts consumpton by one person precludes ts consumpton by another o Excludable goods: a good s excludable f you can reasonably prevent a person from consumng

More information

REPUTATION WITHOUT COMMITMENT IN FINITELY-REPEATED GAMES

REPUTATION WITHOUT COMMITMENT IN FINITELY-REPEATED GAMES REPUTATION WITHOUT COMMITMENT IN FINITELY-REPEATED GAMES JONATHAN WEINSTEIN AND MUHAMET YILDIZ A. In the reputaton lterature, players have commtment types whch represent the possblty that they do not have

More information

Elton, Gruber, Brown, and Goetzmann. Modern Portfolio Theory and Investment Analysis, 7th Edition. Solutions to Text Problems: Chapter 9

Elton, Gruber, Brown, and Goetzmann. Modern Portfolio Theory and Investment Analysis, 7th Edition. Solutions to Text Problems: Chapter 9 Elton, Gruber, Brown, and Goetzmann Modern Portfolo Theory and Investment Analyss, 7th Edton Solutons to Text Problems: Chapter 9 Chapter 9: Problem In the table below, gven that the rskless rate equals

More information

PREFERENCE DOMAINS AND THE MONOTONICITY OF CONDORCET EXTENSIONS

PREFERENCE DOMAINS AND THE MONOTONICITY OF CONDORCET EXTENSIONS PREFERECE DOMAIS AD THE MOOTOICITY OF CODORCET EXTESIOS PAUL J. HEALY AD MICHAEL PERESS ABSTRACT. An alternatve s a Condorcet wnner f t beats all other alternatves n a parwse majorty vote. A socal choce

More information

/ Computational Genomics. Normalization

/ Computational Genomics. Normalization 0-80 /02-70 Computatonal Genomcs Normalzaton Gene Expresson Analyss Model Computatonal nformaton fuson Bologcal regulatory networks Pattern Recognton Data Analyss clusterng, classfcaton normalzaton, mss.

More information

Option pricing and numéraires

Option pricing and numéraires Opton prcng and numérares Daro Trevsan Unverstà degl Stud d Psa San Mnato - 15 September 2016 Overvew 1 What s a numerare? 2 Arrow-Debreu model Change of numerare change of measure 3 Contnuous tme Self-fnancng

More information

Note on Cubic Spline Valuation Methodology

Note on Cubic Spline Valuation Methodology Note on Cubc Splne Valuaton Methodology Regd. Offce: The Internatonal, 2 nd Floor THE CUBIC SPLINE METHODOLOGY A model for yeld curve takes traded yelds for avalable tenors as nput and generates the curve

More information

Stochastic Investment Decision Making with Dynamic Programming

Stochastic Investment Decision Making with Dynamic Programming Proceedngs of the 2010 Internatonal Conference on Industral Engneerng and Operatons Management Dhaka, Bangladesh, January 9 10, 2010 Stochastc Investment Decson Makng wth Dynamc Programmng Md. Noor-E-Alam

More information

Facility Location Problem. Learning objectives. Antti Salonen Farzaneh Ahmadzadeh

Facility Location Problem. Learning objectives. Antti Salonen Farzaneh Ahmadzadeh Antt Salonen Farzaneh Ahmadzadeh 1 Faclty Locaton Problem The study of faclty locaton problems, also known as locaton analyss, s a branch of operatons research concerned wth the optmal placement of facltes

More information

Pivot Points for CQG - Overview

Pivot Points for CQG - Overview Pvot Ponts for CQG - Overvew By Bran Bell Introducton Pvot ponts are a well-known technque used by floor traders to calculate ntraday support and resstance levels. Ths technque has been around for decades,

More information

Formation of Coalition Structures as a Non-Cooperative Game

Formation of Coalition Structures as a Non-Cooperative Game Formaton of Coalton Structures as a Non-Cooperatve Game Dmtry Levando Natonal Research Unversty Hgher School of Economcs, Moscow, Russa dlevando@hse.ru Abstract. The paper proposes a lst of requrements

More information

Mathematical Thinking Exam 1 09 October 2017

Mathematical Thinking Exam 1 09 October 2017 Mathematcal Thnkng Exam 1 09 October 2017 Name: Instructons: Be sure to read each problem s drectons. Wrte clearly durng the exam and fully erase or mark out anythng you do not want graded. You may use

More information

Intensive vs Extensive Margin Tradeo s in a Simple Monetary Search Model

Intensive vs Extensive Margin Tradeo s in a Simple Monetary Search Model Intensve vs Extensve Margn Tradeo s n a Smple Monetary Search Model Sébasten Lotz y Unversty of Pars 2 Andre Shevchenko z Mchgan State Unversty Aprl 2006 hrstopher Waller x Unversty of Notre Dame Abstract

More information

REFINITIV INDICES PRIVATE EQUITY BUYOUT INDEX METHODOLOGY

REFINITIV INDICES PRIVATE EQUITY BUYOUT INDEX METHODOLOGY REFINITIV INDICES PRIVATE EQUITY BUYOUT INDEX METHODOLOGY 1 Table of Contents INTRODUCTION 3 TR Prvate Equty Buyout Index 3 INDEX COMPOSITION 3 Sector Portfolos 4 Sector Weghtng 5 Index Rebalance 5 Index

More information

Online Appendix for Merger Review for Markets with Buyer Power

Online Appendix for Merger Review for Markets with Buyer Power Onlne Appendx for Merger Revew for Markets wth Buyer Power Smon Loertscher Lesle M. Marx July 23, 2018 Introducton In ths appendx we extend the framework of Loertscher and Marx (forthcomng) to allow two

More information

Meaningful cheap talk must improve equilibrium payoffs

Meaningful cheap talk must improve equilibrium payoffs Mathematcal Socal Scences 37 (1999) 97 106 Meanngful cheap talk must mprove equlbrum payoffs Lanny Arvan, Luıs Cabral *, Vasco Santos a b, c a Unversty of Illnos at Urbana-Champagn, Department of Economcs,

More information

Two Period Models. 1. Static Models. Econ602. Spring Lutz Hendricks

Two Period Models. 1. Static Models. Econ602. Spring Lutz Hendricks Two Perod Models Econ602. Sprng 2005. Lutz Hendrcks The man ponts of ths secton are: Tools: settng up and solvng a general equlbrum model; Kuhn-Tucker condtons; solvng multperod problems Economc nsghts:

More information