Coalitional Bargaining with Agent Type Uncertainty

Georgios Chalkiadakis and Craig Boutilier
Department of Computer Science, University of Toronto, Toronto, Canada
{ gehalk, cebly }@cs.toronto.edu

Abstract

Coalition formation is a problem of great interest in AI, allowing groups of autonomous, individually rational agents to form stable teams. Automating the negotiations underlying coalition formation is, naturally, of special concern. However, research to date in both AI and economics has largely ignored the potential presence of uncertainty in coalitional bargaining. We present a model of discounted coalitional bargaining where agents are uncertain about the types (or capabilities) of potential partners, and hence the value of a coalition. We cast the problem as a Bayesian game in extensive form, and describe its Perfect Bayesian Equilibria as the solutions to a polynomial program. We then present a heuristic algorithm using iterative coalition formation to approximate the optimal solution, and evaluate its performance.

1 Introduction

Coalition formation, widely studied in game theory and economics [8], has attracted much attention in AI as a means of dynamically forming partnerships or teams of cooperating agents. While most models of coalition formation (e.g., coalitional bargaining processes) assume that agents have full knowledge of the types of their potential partners, in most natural settings this will not be the case. Generally, agents will be uncertain about various characteristics of others (e.g., their capabilities), which in turn imposes uncertainty on the value of any coalition. This presents the opportunity to learn about the types of others based on their behavior during negotiation and by observing their performance in settings where coalitions form repeatedly. Agents must be able to form coalitions and divide the generated value even in such settings. Here we present a model of discounted coalitional bargaining under agent type uncertainty.
We formulate this as a Bayesian extensive game with observable actions [8], where the actions correspond to proposing choices of potential partners and a payoff allocation, or accepting or rejecting such proposals. Our model generalizes related bargaining models by explicitly dealing with uncertainty about agent types (or capabilities) and coalitional values. We formulate the perfect Bayesian equilibrium (PBE) solution of this game as a decidable polynomial program. The complexity of the program makes it intractable for all but trivial problems, so we propose an alternative heuristic algorithm to find good agent strategies in the coalitional bargaining game. Preliminary experiments illustrate the performance of this heuristic approach.

Although there is a considerable body of work on coalitional bargaining, no existing models deal with explicit type uncertainty. Okada [7] suggests a form of coalitional bargaining where agreement can be reached in one bargaining round if the proposer is chosen randomly. Chatterjee et al. [3] present a bargaining model with a fixed proposer order, which results in a delay of agreement. Neither model deals with type uncertainty; instead, they focus on calculating subgame-perfect equilibria (SPE). Suijs et al. [9] introduce stochastic cooperative games (SCGs), comprising a set of agents, a set of coalitional actions, and a function assigning to each action a random variable with finite expectation, representing action-dependent coalition payoff. Though they provide strong theoretical foundations for games with this restricted form of action uncertainty, they do not explicitly model a coalition formation process. Kraus et al. [4] model coalition formation under a restricted form of uncertainty regarding coalitional values in a request-for-proposal domain. However, type uncertainty is not captured; rather, the mean value of coalitions is common knowledge, and a manager handles proposals (they also focus on social welfare maximization rather than individual rationality).
Chalkiadakis and Boutilier [2] propose an explicit model of type uncertainty and show how this translates into coalitional value uncertainty. We adopt their model in our paper. However, their results focus on stability concepts and how coalitions evolve during repeated interaction, as agents gradually learn more about each other's capabilities (in reinforcement learning style). The actual coalition formation processes used are fairly simple and are not influenced by strategic considerations, nor do agents update their beliefs about other agents' types during bargaining. Our work analyzes the actual bargaining process in more depth.

2 Bayesian Coalitional Bargaining

We begin by describing the Bayesian coalition formation model and then define our coalitional bargaining game. We assume a set of agents $N = \{1, \dots, n\}$, and for each agent $i$ a finite set of possible types $T_i$. Each agent $i$ has a specific type $t_i \in T_i$. We let $T = \times_{i \in N} T_i$ denote the set of type profiles. Each $i$ knows its own type $t_i$, but not those of other agents. Agent $i$'s beliefs $\mu_i$ comprise a joint distribution over $T_{-i}$, where $\mu_i(t_{-i})$ is the probability $i$ assigns to other agents having type profile $t_{-i}$. Intuitively, $i$'s type reflects its abilities, and its beliefs about the types of others capture its uncertainty about their abilities. For instance, if a carpenter wants to find a plumber and electrician with whom to build a house, her decision to propose (or join) such a partnership, to engage in a specific type of project, and to accept a specific share of the surplus generated should all depend on her probabilistic assessment of their abilities.

A coalition $C \subseteq N$ of members with actual types $t_C$ has a value $V(t_C)$, representing the value this group can achieve by acting optimally. However, this simple characteristic function representation of the model [8] is insufficient, since this value is not common knowledge. An agent $i$ can only assess the expected value of such a coalition based on its beliefs:

$$V_i(C) = \sum_{t_C \in T_C} \mu_i(t_C)\, V(t_C).$$

A coalition structure $CS$ partitions $N$ into coalitions of agents. A payoff allocation $P = \langle x_i \rangle$, given the stochastic nature of payoffs in this setting, assigns to each agent in coalition $C$ its share of the value attained by $C$ (and must be such that $\sum_{i \in C} x_i = 1$ for each $C \in CS$). Chalkiadakis and Boutilier [2] define the Bayesian core as a generalization of the standard core concept, capturing an intuitive notion of stability in the Bayesian coalition formation game.

While coalition structures and allocations can sometimes be computed centrally, in many situations they emerge as the result of some bargaining process among the agents, who propose, accept and reject partnership agreements [3]. We now define a (Bayesian) coalitional bargaining game for the model above as a Bayesian extensive game with observable actions. The game proceeds in stages, with a randomly chosen agent proposing a coalition and allocation of payments to partners, who then accept or reject the proposal. A finite set of bargaining actions is available to the agents.
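The expected coalition value $V_i(C)$ defined above can be illustrated with a minimal Python sketch. The function name, the dictionary encoding of the joint beliefs, and the additive toy value function are assumptions made for illustration only; they do not come from the paper.

```python
def expected_coalition_value(joint_beliefs, own_type, value_fn):
    """Expected value of a coalition from one agent's point of view.

    joint_beliefs: {partner_type_profile (tuple): probability}, the agent's
                   joint belief over the types of the other coalition members
    own_type:      the agent's own (known) type
    value_fn:      hypothetical characteristic function V on a tuple of types
    """
    return sum(prob * value_fn((own_type,) + profile)
               for profile, prob in joint_beliefs.items())


# Toy example (hypothetical numbers): types are integer skill levels and a
# coalition's value is simply the sum of its members' skill levels.
carpenter_type = 2
beliefs = {(1, 1): 0.5, (1, 0): 0.3, (0, 1): 0.1, (0, 0): 0.1}
value = expected_coalition_value(beliefs, carpenter_type, sum)
print(value)
```

Because the beliefs are passed as a joint distribution, correlated beliefs about partners are handled without any independence assumption.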
A bargaining action corresponds to either some proposal $\pi = \langle C, P_C \rangle$ to form a coalition $C$ with a specific payoff allocation $P_C$ specifying payoff shares $x_i$ to each $i \in C$, or to the acceptance or rejection of such a proposal. The finite-horizon game proceeds in $S$ stages, and initially all agents are active. At the beginning of stage $s \le S$, one of the (say $n$) active agents $i$ is chosen randomly with probability $\gamma = 1/n$ to make a proposal $\langle C, P_C \rangle$ (with $i \in C$). Each other $j \in C$ simultaneously (without knowledge of other responses) either accepts or rejects this proposal. If all $j \in C$ accept, the agents in $C$ are made inactive and removed from the game. Value $V_s(t_C) = \delta^{s-1} V(t_C)$ is realized by $C$ at $s$, and split according to $P_C$, where $\delta \in (0, 1)$ is the discount factor.^1 If any $j \in C$ rejects the proposal, the agents remain active (no coalition is formed). At the end of a stage, the responses are observed by all participants. At the end of stage $S$, any $i$ not in any coalition receives its discounted reservation value $\delta^{S-1} V(t_i)$ (discounted singleton coalition value).

^1 Agents could have different $\delta$'s. As long as these are common knowledge, our analysis holds with only trivial modifications.

3 Perfect Bayesian Equilibrium

The coalitional bargaining game described above is clearly an extensive form Bayesian game. We assume each agent will adopt a suitable behavioral strategy, associating with each node in the game tree at which it must make a decision a distribution over action choices for each of its possible types. Furthermore, since it is uncertain about the types of other agents, its observed history of other agents' proposals and responses gives it information about their types (assuming they are rational). Thus, the preferred solution concept is that of a perfect Bayesian equilibrium (PBE) [8]. A PBE comprises a profile of behavioral strategies for each agent as well as a system of beliefs dictating what each agent believes about the types of its counterparts at each node in the game tree.
The standard rationality requirements must also hold: the strategy for each agent maximizes its expected utility given its beliefs; and each agent's beliefs are updated from stage to stage using Bayes rule, given the specific strategies being played. In this section, we formulate the constraints that must hold on both strategies and beliefs in order to form a PBE.

Let $\sigma_i$ denote a behavioral strategy for $i$, mapping information sets (or observable histories $h$) in the game tree at which $i$ must act into distributions over admissible actions $A(h)$. If $i$ is a proposer at $h$ (at stage $s$), let $A(h) = \mathcal{P}$, the finite set of proposals available at $h$. Then $\sigma_i^{h,t_i}(\pi)$ denotes the (behavioral strategy) probability that $i$ makes proposal $\pi \in \mathcal{P}$ at $h$ given its type is $t_i$. If $i$ is a responder at $h$, then $\sigma_i^{h,t_i}(y)$ is the probability with which $i$ accepts the proposal on the table (says yes) at $h$ (and $\sigma_i^{h,t_i}(n) = 1 - \sigma_i^{h,t_i}(y)$ is the probability $i$ says no). Let $\mu_i$ denote $i$'s beliefs, with $\mu_i^{h,t_i}(t_{-i})$ being $i$'s beliefs about the types of others at $h$ given its own type is $t_i$. We define the PBE constraints for the game by first defining the values to (generic) agent $i$ at each node and information set in the game tree, given a fixed strategy for other agents, and the rationality constraints on his strategies and beliefs. We proceed in stages.

(1) Let $\xi$ be a proposal node for $i$ at history $h$ at stage $s$. Since the only uncertainty in information set $h$ involves the types of other agents, each $\xi \in h$ corresponds to one such type vector $t_{-i} \in T_{-i}$; let $h(t_{-i})$ denote this node in $h$. The value to $i$ of a proposal $\pi = \langle C, P_C \rangle$ at $h(t_{-i})$ is:

$$q_i^{h(t_{-i}),t_i}(\pi) = p_{acc}^{h(t_{-i})}(\pi)\, x_i V_s(t_C) + \sum_{r} p^{h(t_{-i})}(\pi, r)\, q_i^{\xi/\pi/r,\,t_i}$$

where: $p_{acc}^{h(t_{-i})}(\pi)$ is the probability that all $j \in C$ (other than $i$) accept $\pi$ (this is easily defined in terms of their fixed strategies); $x_i$ is $i$'s payoff share in $P_C$; $r$ ranges over response vectors in which at least one $j \in C$ refuses the proposal; $p^{h(t_{-i})}(\pi, r)$ denotes the probability of such a response; and $q_i^{\xi/\pi/r,\,t_i}$ denotes the continuation payoff for $i$ at stage $s+1$ at the node $\xi/\pi/r$ (the node reached from $\xi$ after proposal $\pi$ and responses $r$).
This continuation payoff is defined (recursively) below. The value of $\pi$ at history $h$ (as opposed to a node) is determined by taking the expectation w.r.t. possible types:

$$q_i^{h,t_i}(\pi) = \sum_{t_{-i}} \mu_i^{h,t_i}(t_{-i})\, q_i^{h(t_{-i}),t_i}(\pi).$$

(2) Suppose $i$ is a responder at node $\xi = h(t_{-i})$ in history $h$ at stage $s$. As above, $\xi$ corresponds to specific $t_{-i}$ in $h$. W.l.o.g. we can assume $i$ is the first responder (since all responses are simultaneous). Let $p_{acc}^{h(t_{-i})}(\pi)$ denote the probability that all other responders accept $\pi$. We then define the value to $i$ of accepting $\pi$ at $\xi$ as:

$$q_i^{h(t_{-i}),t_i}(y) = p_{acc}^{h(t_{-i})}(\pi)\, x_i V_s(t_C) + \sum_{r} p^{h(t_{-i})}(\pi, r)\, q_i^{\xi/y/r,\,t_i}$$

where again $r$ ranges over response vectors in which at least one $j \in C$, $j \neq i$, refuses $\pi$; $p^{h(t_{-i})}(\pi, r)$ is the probability of such a response; and $q_i^{\xi/y/r,\,t_i}$ is the continuation payoff for $i$ at stage $s+1$ after responses $r$ by its counterparts. The value of accepting at $h$ is given by the expectation over type vectors $t_C$ w.r.t. $i$'s beliefs $\mu_i^{h,t_i}$ as above. The value of rejecting $\pi$ at $\xi = h(t_{-i})$ is the expected continuation payoff at stage $s+1$:

$$q_i^{h(t_{-i}),t_i}(n) = \sum_{r} p^{h(t_{-i})}(\pi, r)\, q_i^{\xi/n/r,\,t_i}$$

(where $r$ ranges over all responses, including pure positive responses, of the others).

(3) We have defined the value for $i$ taking a specific action at any of its information sets. It is now straightforward to define the value to $i$ of reaching any other stage $s$ node controlled by $j \neq i$ or by nature (i.e., chance nodes where a random proposer is chosen). First we note that, by assuming $i$ responds first to any proposal, our definition above means that we need not compute the value to $i$ at any response node (or information set) controlled by $j$. For an information set $h_j$ where $j$ makes a proposal, consider a node $\xi = h_j(t_j)$ where $j$ is assumed to be of type $t_j$. Then, $j$'s strategy $\sigma_j^{h_j,t_j}$ specifies a distribution over proposals $\pi$ (determined given the values $q_j^{h_j,t_j}(\pi)$, which can be calculated as above, and $j$'s type $t_j$). Agent $i$'s value $q_i^{t_i,h_j(t_j)}$ at this node is given by the expectation (w.r.t. this strategy distribution) of its accept or reject values (or, if it is not involved in a proposal, its expected continuation value at stage $s+1$ given the responses of others). Its value at $h_j$ is then $Q_i^{t_i}(h_j) = \sum_{t_j} \mu_i^{h_j,t_i}(t_j)\, q_i^{t_i,h_j(t_j)}$. We define $Q_i^{t_i}(h_i)$ (where $i$ is the proposer) as in Case 1 above. Finally, $i$'s value at information set $h$ that defines the stage $s$ continuation game (i.e., where nature chooses the proposer) is

$$q_i^{h,t_i} = \frac{1}{m} \sum_{j \le m} Q_i^{t_i}(h_j)$$

where $m$ is the number of active agents, and $h_j$ is the information set following $h$ in which $j$ is the proposer.

(4) We are now able to define the rationality constraints.
We require that the payoff from the equilibrium behavioral strategy $\sigma$ is at least as large as the payoff of using any pure strategy. Specifically, in PBE, for all $i$, $t_i \in T_i$, all $h$ that correspond to one of $i$'s information sets, and all actions $b \in A(h)$, we have:

$$\sum_{t_{-i}} \mu_i^{h}(t_{-i}) \sum_{a \in A(h)} \sigma_i^{h,t_i}(a)\, q_i^{h(t_{-i}),t_i}(a) \;\ge\; \sum_{t_{-i}} \mu_i^{h}(t_{-i})\, q_i^{h(t_{-i}),t_i}(b)$$

We also add constraints for the Bayesian update of belief variables for any agent $i$ regarding type $t_j^{\kappa}$ of agent $j$ performing $a_j$ at any $h$ (for all $i$, $t_i \in T_i$, all $h$ and all $a_j$), where $h \circ a_j$ denotes the history $h$ extended by action $a_j$:

$$\mu_i^{h \circ a_j,\,t_i}(t_j^{\kappa}) = \frac{\mu_i^{h,t_i}(t_j^{\kappa})\, \sigma_j^{h,t_j^{\kappa}}(a_j)}{\sum_{t_j^{k} \in T_j} \mu_i^{h,t_i}(t_j^{k})\, \sigma_j^{h,t_j^{k}}(a_j)}$$

Finally, we add the obvious constraints specifying the domain of the various variables denoting strategies or beliefs (they take values in [0, 1] and sum up to 1 as appropriate). This ends the formulation of the program describing the PBE.

This is a polynomial constraint satisfaction problem: finding a solution to this system of constraints is equivalent to deciding whether a system of polynomial equations and inequalities has a solution [1]. The problem is decidable, but intractable. For example, an algorithm for deciding this problem has been proposed with exponential complexity [1]. Specifically, the complexity of deciding whether a system of $s$ polynomials, each of degree at most $d$ in $k$ variables, has a solution is $s^{k+1} d^{O(k)}$. In our case, assuming a random choice of proposer at each of S rounds, we can show that if α is the number of pure strategies, N the number of agents, T the number of types, then s = O(N S ), d = NS and k = O(αNT ). This is due to a variety of combinatorial interactions evident in the constraints above, creating as they do interdependencies between belief and strategy variables.

In summary, the formulation above characterizes the PBE solution of our coalitional bargaining game as a solution of a polynomial program. However, it does not seem possible that this solution can be efficiently computed in general. Nevertheless, this PBE formulation may prove useful for the computation of a PBE in a bargaining setting with a limited number of agents, types, proposals and bargaining stages.
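The Bayesian belief-update constraint of this section can be read as an ordinary posterior computation over one counterpart's types. The following Python sketch illustrates it under assumed encodings (dictionaries for the prior and for the counterpart's type-dependent strategy); the function name and the fallback for zero-probability actions are illustrative choices, not part of the paper.

```python
def update_type_belief(prior, strategy, action):
    """Posterior over a counterpart's types after observing one action.

    prior:    {type: probability} current belief about the counterpart
    strategy: {type: {action: probability}} the counterpart's assumed
              behavioral strategy at the current history
    action:   the action actually observed
    """
    weights = {t: p * strategy[t].get(action, 0.0) for t, p in prior.items()}
    norm = sum(weights.values())
    if norm == 0.0:
        # Off-path observation: Bayes rule is undefined here. Keeping the
        # prior is an assumption of this sketch, not something the paper fixes.
        return dict(prior)
    return {t: w / norm for t, w in weights.items()}


# Hypothetical example: a "strong" type rejects a proposal more often.
prior = {"strong": 0.5, "weak": 0.5}
strategy = {"strong": {"y": 0.1, "n": 0.9}, "weak": {"y": 0.6, "n": 0.4}}
posterior = update_type_belief(prior, strategy, "n")
print(posterior)
```

Observing a rejection shifts probability mass toward the type that rejects more often, which is the mechanism that lets agents learn about partners during bargaining.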
4 Approximations

The calculation of the PBE solution is extremely complex due to both the size of the strategy space (as a function of the size of the game tree, which grows exponentially with the problem horizon), and the dependence between variables representing strategies and beliefs, as explained above. We present an approximation strategy that circumvents these issues to some degree by: (a) performing only a small lookahead in the game tree in order to decide on an action at any stage of the game; and (b) fixing the beliefs of each agent during this process. This latter approach, in particular, allows us to solve the game tree by backward induction, essentially computing an equilibrium for this fixed-beliefs game. Note that while beliefs are held fixed during the lookahead (while computing an immediate action), they do get updated once the action is selected and executed, and thus do evolve based on the actions of others (this is in the spirit of receding horizon control). Furthermore, we allow sampling of type vectors in the computation to further reduce the tree size.

More precisely, at any stage of the game, with a particular collection of active agents (each with their own beliefs), we implement the following steps:

1. An agent (e.g., proposer) constructs a game tree consisting of the next d rounds of bargaining (for some small lookahead d).^2 All active agents are assumed to have fixed beliefs at each node in this tree corresponding to their beliefs at the current stage. The agent computes its optimal action for the current round using backward induction to approximate an equilibrium (similar in nature to an SPE) of this limited depth game. (We elaborate below.) Furthermore, agents sample partners' types when calculating the values of coalitions and proposals.

2. Each player executes its action computed for the current round of bargaining. If a coalition is formed, it breaks away, leaving the remaining players as active.

3. All active agents update their beliefs, given the observed actions of others in the current round, using Bayesian updating. Further, each agent keeps track of the belief updates that any other agent of a specific type would perform at this point.

4. The next bargaining round is implemented by repeating these steps until a complete coalition structure is determined or the maximum number of bargaining rounds is reached.

^2 If fewer than d rounds remain, the tree is suitably truncated.

We stress that the algorithm above does not approximate the PBE solution; getting good bounds for a true PBE approximation would only be likely by assuming belief updating at every node of the game tree mentioned in Step 1. However, if our algorithmic assumptions are shared by all agents, each can determine their best responses to others' (approximately) optimal play, and thus their play approximates an equilibrium of the fixed-beliefs game. Indeed, we can define a sequential equilibrium under fixed beliefs (SEFB) as an extension of the SPE and a restriction of the PBE for a fixed-beliefs bargaining game, and can show the following (stated informally here):

Theorem 1 If the Bayesian core (BC) of a Bayesian coalitional game G [2] is non-empty, and so is the BC of each one of G's subgames, then regardless of nature's choice of proposers there is an SEFB strategy profile of the corresponding fixed-beliefs discounted Bayesian coalitional bargaining game that produces a BC element; and conversely, if there is an order independent^3 SEFB profile for a Bayesian coalitional bargaining game, then it leads to a configuration that is in the BC of the underlying G.

This result describes some notion of equivalence between cooperative and non-cooperative Bayesian coalition formation solution concepts, and is similar to results (e.g., Moldovanu et al. [5]) for non-stochastic environments.
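Steps 1 to 4 of the heuristic can be pictured as a simple driver loop. The Python sketch below is only a skeleton under stated assumptions: `choose_proposal`, `respond` and `update_beliefs` are hypothetical callbacks standing in for the limited-lookahead backward induction (Step 1) and the Bayesian updating (Step 3), neither of which is implemented here, and the proposer is drawn uniformly at random as in the bargaining game.

```python
import random


def run_episode(agents, choose_proposal, respond, update_beliefs,
                max_rounds, rng=None):
    """Skeleton of one bargaining episode (hypothetical interfaces).

    choose_proposal(proposer, active) -> (coalition, shares)        # Step 1
    respond(agent, proposer, coalition, shares, active) -> bool     # Step 1
    update_beliefs(active, proposer, coalition, shares, responses)  # Step 3
    Returns a list of (coalition, shares, round_formed) triples.
    """
    rng = rng or random.Random(0)
    active = list(agents)
    structure = []
    for round_no in range(1, max_rounds + 1):
        if not active:
            break  # a complete coalition structure has been determined
        proposer = rng.choice(active)  # nature picks the proposer uniformly
        coalition, shares = choose_proposal(proposer, active)
        responses = {j: respond(j, proposer, coalition, shares, active)
                     for j in coalition if j != proposer}
        if all(responses.values()):
            # Step 2: the coalition forms and its members become inactive.
            structure.append((tuple(coalition), dict(shares), round_no))
            active = [a for a in active if a not in coalition]
        # Step 3: remaining agents revise beliefs from the observed actions.
        update_beliefs(active, proposer, coalition, shares, responses)
    # Agents still active at the horizon end up as singleton coalitions.
    structure.extend(((a,), {a: 1.0}, max_rounds) for a in active)
    return structure
```

A caller would plug in its own proposal, response and belief-update routines; the loop itself only fixes the order in which the four steps are repeated.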
Theorem 1 also further motivates Step 1 of our heuristic algorithm, equating fixed-belief equilibrium computation with determination of ($i$'s part of) the Bayesian core. We now elaborate on this process.

We assume that the agents proceed to negotiations that will last $d$ rounds (corresponding to the algorithm's lookahead value $d$) under the assumption that all beliefs will remain fixed to their present values throughout the (Step 1) process. We will present the deliberations of agent $i$ during negotiations. For fixed types $t_{-i}$ of possible partners, drawn according to $\mu_i$, $i$ will reason about the game tree and assume fixed beliefs of other agents. (Agents will keep track of the updates of other agents' beliefs after this stage of bargaining; see Step 3 above.) Then, $i$ can calculate the optimal action of any $t_j$ agent (including himself) at any information set by taking expectations over the corresponding tree nodes.

We begin our analysis at the last stage $d$ of negotiations. In any node $\xi$ after history $h$ where $i$ of type $t_i$ is a responder to proposal $\pi \in \mathcal{P}$ and assumes a specific type vector for partners, he expects a value for accepting that is different to his (discounted) reservation value only if all other responders accept the proposal as well:

$$q_i^{h(t_{-i}),t_i}(y) = \begin{cases} x_i V_d(t_C) & \text{if all } t_j \in t_C \text{ accept} \\ V_d(t_i) & \text{otherwise} \end{cases} \qquad (1)$$

However, to evaluate this acceptance condition, $i$ would need to know the other responders' strategies (which in turn depend on $i$'s strategy). Therefore, $i$ will make the simplifying assumption that all other responders $j$ evaluate their response to $\pi$ by assuming that the rest of the agents (including $i$) will accept the proposal. Thus, any $j$ with $t_j \in t_C$ is assumed by $i$ to accept if he evaluates his expected payoff from acceptance as being at least as great as his (discounted) reservation payoff:

$$x_j \sum_{t_{-j} \in t_C} \mu_j(t_{-j})\, V_d(\{t_{-j}, t_j\}) \;\ge\; V_d(t_j) \qquad (2)$$

With this assumption, $i$ is able to evaluate the acceptance condition in Eq. 1 above, and so calculate a specific $q_i^{h(t_{-i}),t_i}(y)$ value.

^3 A strategy profile is order independent iff, when played, it leads to a specific $\langle CS, P \rangle$, independently of the choice of proposers.
Note that the use of this assumption can sometimes lead $i$ to overestimate the value of a node. At node $\xi = h(t_{-i})$, $i$ can also evaluate his refusal value as $q_i^{h(t_{-i}),t_i}(n) = V_d(t_i)$ in this last round. Then, responder $i$'s actual strategy at $h$ can be evaluated as the strategy maximizing $i$'s expected value given $\mu_i^{h,t_i}$:

$$\sigma_i^{h,t_i} = \arg\max_{r \in \{y,n\}} \Big\{ \sum_{t_{-i} \in t_C} \mu_i^{h,t_i}(t_{-i})\, q_i^{h(t_{-i}),t_i}(r) \Big\}$$

If $i$ is a proposer of type $t_i$ deliberating at $\xi = h(t_{-i})$, the value of making proposal $\pi$ is:

$$q_i^{h(t_{-i}),t_i}(\pi) = \begin{cases} x_i V_d(t_C) & \text{if } \sigma_j^{h,t_j} = y \text{ for all responders } j, \\ V_d(t_i) & \text{otherwise} \end{cases} \qquad (3)$$

(i.e., $i$ will get his reservation value unless all the responders of the specific type configuration agree to this proposal). Furthermore, $i$'s expected value $q_i^{h,t_i}(\pi)$ from making proposal $\pi$ to coalition $C$ at $h$ can be determined given $\mu_i^{h,t_i}$. Thus, the best proposal that $i$ of type $t_i$ can make to coalition $C$ is the one with maximum expected payoff: $\sigma_i^{C;h,t_i} = \arg\max_{\pi} q_i^{h,t_i}(\pi)$, with expected payoff $q_i^{C;h,t_i}$. However, $i$ can also propose to other coalitions at $h$. Therefore, the coalition $C^*$ to which $i$ should propose is the one that guarantees him the maximum expected payoff: $C^* = \arg\max_C \{ q_i^{C;h,t_i} \}$. If $P^*$ is the payoff allocation associated with that proposal, then the optimal coalition-allocation pair that $t_i$ can propose in this subgame (that starts with $i$ proposing at $h$) is $\sigma_i^{*;h,t_i} = \{C^*, P^*\}$, with maximum expected payoff $q_i^{C^*;h,t_i}$. Finally, if there exists more than one optimal proposal for $i$, $i$ randomly selects any of them (this is taken into account in agents' deliberations accordingly).

Of course, when the subgame starts an agent does not know who the proposer in this subgame will be; and $i$ has only probabilistic beliefs about the types of his potential partners. Thus, $i$ has to calculate his continuation payoff $q_i^{d:\xi,t_i}$ at stage $d$ (that starts at node $\xi$) with $m$ participants, in the way explained in the previous section. This is straightforward, as $i$ can calculate his expected payoffs from participating in any subgame where some $j$ proposes, given that any $i$ can calculate the optimal strategies (and associated payoffs) for any $j$ in this round $d$ subgame.
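The last-round reasoning above reduces to two small computations: the acceptance test of Eq. 2 for a responder, and the choice of a payoff-maximizing proposal with random tie-breaking for a proposer. The Python sketch below illustrates both; the function names, the dictionary encodings and the additive toy value function are assumptions introduced for illustration.

```python
import random


def responder_accepts(share, beliefs, own_type, value_fn, discount, stage):
    """Acceptance test in the spirit of Eq. 2 (last lookahead round).

    beliefs:  {partner_type_profile (tuple): probability} held by the responder
    value_fn: hypothetical characteristic function V on a tuple of types
    The responder assumes everyone else accepts, and compares its share of
    the expected discounted coalition value with its discounted reservation
    (singleton) value.
    """
    factor = discount ** (stage - 1)
    expected = sum(p * factor * value_fn((own_type,) + others)
                   for others, p in beliefs.items())
    reservation = factor * value_fn((own_type,))
    return share * expected >= reservation


def best_proposal(expected_payoffs, rng):
    """Pick a proposal with maximum expected payoff; ties broken at random.

    expected_payoffs: {proposal: expected payoff q of making that proposal}
    """
    best = max(expected_payoffs.values())
    candidates = [p for p, q in expected_payoffs.items() if q == best]
    return rng.choice(candidates)


# Hypothetical example: value is the sum of integer types.
beliefs = {(2,): 0.5, (0,): 0.5}
print(responder_accepts(0.6, beliefs, 1, sum, discount=0.9, stage=2))
print(best_proposal({"pi_1": 1.0, "pi_2": 2.0}, random.Random(0)))
```

In earlier lookahead rounds the reservation value in the comparison would be replaced by the continuation payoff from the next round, which this sketch does not model.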
Now consider play in a subgame starting in period $d-1$, again with the participation of $m$ agents. The analysis for this round can be performed in a way completely similar to the one performed for the last round of negotiations. However, there is one main difference: the payoffs in the case of a rejection are now the continuation payoffs (for agents of specific type) from the last round subgame. We have to incorporate this difference in our calculations. Other than that, we can employ a similar line of argument to the one used for identifying the equilibrium strategies in the last period. Proceeding in this way, we define the continuation payoffs and players' strategies for each prior round, and finally determine the first round actions for any proposer of type $t_i$ or any responder of type $t_j$ responding to any proposal.

5 Experimental Evaluation

To evaluate our approach, we first conducted experiments in two settings, each with 5 agents having 5 possible types. Agents repeatedly engage in episodes of coalition formation, each episode consisting of a number of negotiation rounds. We compare our Bayesian equilibrium approximation method (BE) with KST, an algorithm inspired by a method presented by Kraus et al. [4]. Though their method is better tailored to other settings, focusing on social welfare maximization, it is a rare example of a successfully tested discounted coalitional bargaining method under some restricted form of uncertainty, which combines heuristics with principled game theoretic techniques. It essentially calculates an approximation of a kernel-stable allocation for coalitions that form in each negotiation round, with agents intentionally compromising part of their payoff in order to form coalitions. Like [4], our KST uses a compromise factor of 0.8, but we assume no central authority, only one agent proposing per round, and coalition values estimated given type uncertainty.

During an episode, agents progressively build a coalition structure and agree on a payment allocation. The action executed by a coalition at the end of an episode (the coalitional action) results in one of three possible stochastic outcomes $o \in O = \{0, 1, 2\}$, each of differing value.
Each agent's type determines its quality, and the quality of a coalition is dictated by the sum of the quality of its members less a penalty for coalition size.^4 Coalition quality then determines the odds of realizing a specific outcome (higher quality coalitions have greater potential). Finally, the value of a coalition given member types is the expected value w.r.t. the distribution over outcomes.

^4 We omit the details here. We only note that agent 0 (of type 0) is detrimental to any coalition (in our first two settings).

In our first setting, singleton coalitions receive a penalty of -1 quality points. We compare BE and KST under various learning models by measuring average total reward garnered by all coalitions in 30 runs of 500 formation episodes each, with a limit of 10 bargaining rounds per episode and a bargaining discount factor of δ = 0.9. We also compare average reward to the reward that can be attained using the optimal, fixed kernel-stable coalition structure { 1, 2, 3, 4, 0 }. We compared BE and KST using agents that update their prior over partner types after observing coalitional actions, thus learning by reinforcement (RL) after each episode, and those that do not (No RL). In all cases, BE agents update their beliefs after observing the bargaining actions of others during each negotiation round. There are 388 proposals a BE agent considers when negotiating in a stage with all five agents present (fewer in other cases).

Table 1(a) shows performance when each agent has a uniform prior regarding the types of others. The BE algorithm consistently outperforms KST, even though KST promotes social welfare (i.e., is well-aligned with the total reward criterion) rather than individual rationality. KST agents without RL always converge to the coalition structure { 4, 3, 2, 0, 1 }; this is because the lack of information about their counterparts discourages them from cooperating. When KST agents learn from observed actions after each episode (KST-Uni-RL) they form the coalitions { 2, 3, 4, 0, 1 } in the last episode in 16 of 30 runs.
BE agents, in contrast, form coalitions based on evolving beliefs about others, and do not form the optimal structure { 1, 2, 3, 4, 0 }.^5 Rather they tend to form coalitions of 2 or 3 members which exclude agent 0 from being their partner. In addition, payoff division for BE agents is more aligned with individual rationality than it is with KST. The shares of (averaged) total payoff of KST-Uni-RL agents 0-4 are 0.8%, 0.7%, 28.8%, 29.6%, 40.1%, respectively, while for BE-Uni-RL (SS:10, LA:2) they are 1.3%, 13.4%, 18.8%, 29.5%, 37%; this more accurately reflects the power [6] of the agents. BE results are reasonably robust with changing sample size and lookahead value (at least in this environment with 3125 possible type vectors in a 5-agent coalition).

We attribute the poor performance of KST agents to the fact that they make their proposals without in any way taking into consideration the changing beliefs of others. With the beliefs of the agents varying, negotiations drag (up to the maximum of 10 rounds) due to refusals, resulting in reduced payoffs. BE agents do not suffer from this problem, since they keep track of all possible partners' updated beliefs, and use them during negotiation. Thus, they typically form a coalition structure within the first four rounds of an episode.

We also experimented with a second setting in which singleton coalitions receive a penalty of -2 quality points (rather than -1 above), and where $q(t_C) = \sum_{t_i \in t_C} q(t_i) / |C|$ (as coalitions get bigger they get penalized to reflect coordination difficulties). This setting makes the quality of coalitions more difficult to distinguish. Here, a near-optimal configuration contains the structure { 4, 3, 2, 1, 0 }. We use three different priors: uniform, misinformed (agents have an initial belief of 0.8 that an agent with type $t$ has type $t+2$), and informed (belief 0.8 in the true type of each other agent). The results (Table 1(b)) indicate that KST agents again do not do very well, engaging in long negotiations due to unaccounted-for differences in beliefs among the various agents.
KST-Uni-RL agents, for example, typically use all ten bargaining rounds; in contrast, BE-Uni-RL usually form structures within 3 rounds. Even when KST uses informed priors, the fact that the expected value of coalitions is not common knowledge takes its toll. BE agents, on the other hand, derive the true types of their partners with certainty in all experiments, and typically form profitable configurations with structures such as { 4, 3, 2, 1, 0 } or { 4, 2, 3, 1, 0 }. We can also see that RL enhances the performance of BE agents slightly, helping them further differentiate the quality of various partners.

^5 Nor should they, given bargaining horizon and δ; the kernel and other stability concepts do not consider bargaining dynamics.

Table 1: Settings results (average). SS: sample size; LA: lookahead; Uni: uniform, Mis: misinformed, Inf: informed prior.

(a) Setting A

Method                          Reward
Optimal CS (expected)           [...]
KST-Uni-NoRL                    [...] (49.4%)
KST-Uni-RL                      [...] (67.3%)
BE-Uni-NoRL SS=20, LA=[...]     [...] (91.2%)
BE-Uni-RL SS=20, LA=[...]       [...] (87.8%)
BE-Uni-NoRL SS=10, LA=[...]     [...] (93.4%)
BE-Uni-RL SS=10, LA=[...]       [...] (91.3%)
BE-Uni-NoRL SS=3, LA=[...]      [...] (93.1%)
BE-Uni-RL SS=3, LA=[...]        [...] (91.6%)

(b) Setting B (BE uses SS=10, LA=2)

Method                          Reward
Optimal CS (expected)           [...]
KST-Uni-NoRL                    [...] (59.6%)
KST-Uni-RL                      [...] (59.5%)
BE-Uni-NoRL                     [...] (93.7%)
BE-Uni-RL                       [...] (95.2%)
KST-Mis-NoRL                    [...] (59.6%)
KST-Mis-RL                      [...] (63.9%)
BE-Mis-NoRL                     [...] (93.5%)
BE-Mis-RL                       [...] (95.3%)
KST-Inf-NoRL                    [...] (65.6%)
KST-Inf-RL                      [...] (73%)
BE-Inf-NoRL                     [...] (93.3%)
BE-Inf-RL                       32401 (95.6%)

(c) Setting C; Uniform Priors; BE uses SS=5, LA=2; A/B denotes observed relative power of A over B

Method       Q        A/B
KST-NoRL     [...]    [...]
BE-NoRL      [...]    [...]
KST-NoRL     [...]    [...]
BE-NoRL      [...]    [...]
KST-RL       [...]    [...]
BE-RL        [...]    [...]
KST-RL       [...]    [...]
BE-RL        [...]    [...]

We also report briefly on the results in a setting with 8 agents, of 2 possible types per agent (4 agents of type A, 4 of type B). The relative power of type A over B is [...].^6 In this setting, forming coalitions by mixing agent types is detrimental, with the exception of the ⟨A, A, B, B⟩ ("optimal"), ⟨A, A, B⟩ and ⟨A, B⟩ coalitions. There are 2841 proposals an agent considers when negotiating in a stage with all 8 agents present. The setting makes discovery of opponent types difficult, and thus rational agents should settle for suboptimal coalitions (hopefully using them as stepping stones to form better ones later). We also varied the bargaining δ (0.95 and 0.5). Agents do not accumulate much reward in this setting, bargaining for many rounds. Instead of reporting reward, we report expected value $Q$ of formation decisions, $Q = \sum_C f_C V(C)$, with $f_C$ being the observed average frequency with which coalition $C$ forms and $V(C)$ its expected value.
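The measure $Q = \sum_C f_C V(C)$ is a frequency-weighted sum, as the following minimal Python sketch shows. The function name, the tuple encoding of coalitions and all numbers are hypothetical and serve only to illustrate the formula; they are not results from the experiments.

```python
def formation_quality(frequencies, expected_values):
    """Q = sum over coalitions C of f_C * V(C).

    frequencies:     {coalition: observed average frequency f_C of forming}
    expected_values: {coalition: expected value V(C)}
    """
    return sum(f * expected_values[c] for c, f in frequencies.items())


# Hypothetical frequencies and values for two coalition compositions.
freq = {("A", "A", "B", "B"): 0.5, ("A", "B"): 0.25}
vals = {("A", "A", "B", "B"): 8.0, ("A", "B"): 4.0}
print(formation_quality(freq, vals))  # 0.5 * 8.0 + 0.25 * 4.0 = 5.0
```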
Results (Table 1(c)) show that BE agents outperform KST agents both in terms of social welfare and individual rationality (the observed relative power of types, that is, the fraction of respective observed payoffs, is close to the true power), and that RL updates are quite beneficial. Further, lowering the discount rate to 0.5 forces the agents to form coalitions early, but also contributes to better decisions, because it enables the agents to discover the types of opponents with more accuracy, effectively reducing the number of possible opponent responses during bargaining (intuitively, given more time, both a strong and a weak type might refuse a proposal, while if time is pressing the weak might be the only one to accept).

^6 Relative power A/B is the expected payoff of A in coalitions excluding B, over the expected payoff of B in coalitions without A.

6 Concluding Remarks and Future Work

We proposed an algorithm for coalitional bargaining under uncertainty about the capabilities of potential partners. It uses iterative coalition formation with belief updating based on the observed actions of others during bargaining, and is motivated by our formulation of the PBE solution of a coalitional bargaining game. The algorithm performs well empirically, and can be combined with belief updates after observing the results of coalitional actions (in reinforcement learning style). Future and current work includes implementing a continuous bargaining action space version of our algorithm, and also incorporating it within a broader RL framework facilitating coalition formation and sequential coalitional decision making under uncertainty. We are also investigating approximation bounds for our heuristic algorithm.

Acknowledgments

Thanks to Vangelis Markakis for extremely useful discussions and helpful comments.

References

[1] S. Basu, R. Pollack, and M.-F. Roy. On the Combinatorial and Algebraic Complexity of Quantifier Elimination. Journal of the ACM, 43(6).
[2] G. Chalkiadakis and C. Boutilier. Bayesian Reinforcement Learning for Coalition Formation Under Uncertainty. In Proc. of AAMAS'04.
[3] K. Chatterjee, B. Dutta, and K. Sengupta. A Noncooperative Theory of Coalitional Bargaining. Review of Economic Studies, 60.
[4] S. Kraus, O. Shehory, and G. Taase. The Advantages of Compromising in Coalition Formation with Incomplete Information. In Proc. of AAMAS'04.
[5] B. Moldovanu and E. Winter. Order Independent Equilibria. Games and Economic Behavior, 9.
[6] R.B. Myerson. Game Theory: Analysis of Conflict.
[7] A. Okada. A Noncooperative Coalitional Bargaining Game With Random Proposers. Games and Economic Behavior, 16.
[8] M.J. Osborne and A. Rubinstein. A Course in Game Theory.
[9] J. Suijs, P. Borm, A. De Waegenaere, and S. Tijs. Cooperative Games with Stochastic Payoffs. European Journal of Operational Research, 113.
