Regret Minimization Algorithms for the Follower's Behaviour Identification in Leadership Games

Lorenzo Bisi, Giuseppe De Nittis, Francesco Trovò, Marcello Restelli, Nicola Gatti
Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milano, 20133, Italy
lorenzo.bisi@mail.polimi.it, {giuseppe.denittis, francesco1.trovo, marcello.restelli, nicola.gatti}@polimi.it

Abstract

We study, for the first time, a leadership game in which one agent, acting as leader, faces another agent, acting as follower, whose behaviour is not known a priori by the leader, being one among a set of possible behavioural profiles. The main motivation is that in real-world applications the common game-theoretical assumption of perfect rationality is rarely met, and any specific assumption on bounded rationality models, if wrong, could lead to a significant loss for the leader. The question we pose is whether and how the leader can learn the behavioural profile of the follower in leadership games. This is a natural online identification problem: in fact, the leader aims at identifying the follower's behavioural profile so as to exploit at best the potential non-rationality of the opponent, while minimizing the regret due to the initial lack of information. We propose two algorithms based on different approaches and we provide a regret analysis. Furthermore, we experimentally evaluate the pseudo-regret of the algorithms in concrete leadership games, showing that our algorithms outperform the online learning algorithms available in the state of the art.

1 INTRODUCTION

The study of scenarios in which multiple strategic agents interact is a challenging problem that has been central in Artificial Intelligence for many years. The modelling of these scenarios can be elegantly achieved by means of non-cooperative game theory tools (Fudenberg and Tirole, 1991), while the task of solving a game is in many cases an open problem, in which the most suitable techniques to adopt strictly depend on the information available to the agents. Two extreme situations can be distinguished: when all the information about the game is common to the players (e.g., utility functions and rationality, either perfect or bounded), the problem is basically an optimization problem, solvable by means of techniques from operations research (Shoham and Leyton-Brown, 2008); conversely, when players have no information about their opponents, the problem is a multi-agent learning problem, and learning techniques are commonly employed (Tuyls and Weiss, 2012). Some attempts have also been made to pair these two approaches, allowing agents to play at the equilibrium if the opponent is rational and to play off the equilibrium, learning to exploit her at best, otherwise (Conitzer and Sandholm, 2007). Recently, there has been an increasing interest in leadership games, where an agent called leader publicly commits to a strategy and subsequently another agent called follower observes the commitment and then takes her decision. Such a paradigm has been successfully employed in a number of applications in the security domain (Basilico et al., 2017; Pita et al., 2008; Tsai et al., 2009), where a defender (acting as leader) must protect some targets in an environment from an attacker (acting as follower), who aims at compromising such targets without being detected. The success of leadership games in real-world applications is due to a number of reasons: committing to a strategy is the best the leader can do, the equilibrium-finding problem is conceptually simple since the follower can merely play her best response to the commitment of the leader without any strategic reasoning about the leader's behaviour, and the solution is unique except for degeneracy.
The crucial issue is that in real-world applications the follower may not be perfectly rational, not necessarily playing her best response to the leader's commitment. For instance, a terrorist could decide either to attack a target that is not patrolled, since she is sure not to be caught, or a target that is not so valuable in itself, but whose compromise would cause a huge panic reaction in the population (e.g., this is what happened in the November 2015 Paris attacks at the Bataclan theatre). The same challenge may be faced by a company that aims at planning the production of a product and has to decide when and how it is convenient to enter a market in which another company is already the leader; this is the well-known Stackelberg duopoly (Von Stackelberg, 1934). Whenever the assumption of perfect rationality is not met, each agent may in principle exploit her opponent's strategy. In the present paper, we focus on leadership games in which the follower may not be rational. The literature provides a number of models of bounded rationality (An et al., 2013; Nguyen et al., 2013). Probably the most elegant one is the Quantal Response (QR) model (McFadden, 1984), which fixes the probability distribution over the non-optimal actions of an agent on the basis of their optimality gap. The crucial issue is that all the works on bounded rationality make an assumption about the specific behaviour of the opponent, and this assumption may never be met in real-world applications. In that case, such an assumption may lead to an arbitrarily large loss for the leader. Differently from the existing literature, we study the original single-agent learning problem in which the behaviour of the follower is one among a set of possible behavioural profiles, e.g., the rational one (i.e., best response), a rationally bounded one based on the QR, or a stochastic strategy, and the leader does not initially know it, but she can learn it so as to exploit the opponent's behaviour at best. Our goal is to design online learning techniques capable of identifying the behaviour of the follower while minimizing the regret due to the initial lack of information. We propose a set of algorithms based on sequential learning techniques (Bubeck et al., 2012) that are able to infer the behaviour of the follower the leader is playing against by exploiting the repeated interactions between the two players.

Original contributions. The main original contributions we provide in this paper are as follows. We define a novel scenario in which a leader plays against a follower whose behaviour is unknown but belongs to a set of known profiles. We show that state-of-the-art bandit and expert algorithms suitable for our problem suffer from a logarithmic and a linear regret, respectively, in the length of the time horizon. Thus, we introduce two novel approaches to deal with our problem, bridging together game-theoretical techniques and online learning tools. In the first approach, the leader maintains a belief about the follower and updates it during the game. We name the algorithm Follow the Belief (FB) and we provide a finite-time analysis showing that the regret of the algorithm is constant in the length of the time horizon. In the second approach, namely Follow the Regret (FR), the learning policy is driven directly by the estimated expected regret and is based on a backward induction procedure. Finally, we provide a thorough experimental evaluation in concrete leadership settings inspired by security domains, comparing our algorithms with the main algorithms available in the state of the art of the online learning field and showing that our approaches provide a remarkable improvement in terms of expected pseudo-regret minimization.

2 RELATED WORKS

Here, we mention the main works related to ours. We mainly refer to the literature on security games since most of the works on leadership games with bounded rationality and/or learning deal with these games. Security games model the problem of finding the optimal schedule of scarce resources when facing strategic adversaries. Many of them deal with real-world problems: e.g., in Pita et al.
(2008) game-theoretic techniques have been applied to ensure the security of the Los Angeles International Airport (LAX), in Tsai et al. (2009) the authors exploit the Stackelberg paradigm to study how to schedule undercover federal air marshals on domestic U.S. flights, while in Pita et al. (2011) such a paradigm is employed to allocate the scarce resources of the Transportation Security Administration (TSA) to provide protection within several U.S. airports. A higher degree of interaction among the agents is captured in Basilico et al. (2017), where an alarm system to detect potential attacks is introduced. The main issue is that such works only deal with a fully rational attacker, while in real-life scenarios attackers might be rationally bounded. Bounded rationality has been introduced in security game models in the so-called Green Security Games (GSGs), a generalization of Stackelberg games (Fang et al., 2015). A remarkable example is Qian et al. (2014), in which the problem of protecting natural resources from illegal extraction is studied: since such extractions are frequent, it is possible for the defender to learn the distribution of the resources by analyzing the attacker's behaviour. A recent application in which an ad hoc adaptation of the QR function, named Subjective Utility Quantal Response (SUQR) (Nguyen et al., 2013), has been employed is the prevention of poaching of endangered species (Ford et al., 2014; Yang et al., 2014). Here, the QR is employed to model the non-rational behaviour of the poachers. In a similar setting, Qian et al. (2016) analyze the problem in which the defender is aware only of the attack activities at the targets they protect, modeling it with a Restless Multi-Armed Bandit and using the Whittle index policy to compute patrol strategies. In security games, Balcan et al. (2015); Blum et al.

(2015); Paruchuri et al. (2008) deal with a single rational attacker whose preferences may be of multiple types, in Bayesian fashion. Specifically, the different attackers are discriminated according to the evaluations they give to the targets, thus leading to the problem of solving Bayesian Stackelberg Games. The main limitation of all the aforementioned works is that the defender plays against an attacker whose behavioural profile is a priori known, while in real-world situations it may be unknown. When dealing with sequential decision learning problems, a customary approach consists in exploiting Multi-Armed Bandit (MAB) techniques, as done by Klíma et al. (2014) and Xu et al. (2016). Even though both works focus on minimizing the expected regret, the different actions corresponding to the arms are the possible targets that may be chosen, while in our work we are discriminating among different attacker types.

3 PROBLEM FORMULATION

Although our work can in principle be employed for any leadership scenario, for the sake of clarity we focus on security domains, thus referring to the leader as defender and to the follower as attacker. Let us consider a 2-player normal-form repeated game G_N defined over a finite number of rounds N ∈ ℕ, where a defender D and an attacker A play against each other in some environment with some valuable targets M = {1, ..., M}, characterized by values v = (v_1, ..., v_M)^T, v_m ∈ (0, 1]. The goal of the defender D is to protect such targets, while the attacker A aims at compromising them. The space of actions of D and A is given by the set of targets, such that D chooses the target to protect, while A chooses the target to attack. The course of the game is represented in Figure 1. Specifically, at each round n ∈ {1, ..., N}, the defender D announces the strategy she commits to, σ_{D,n} ∈ Δ_M (Line 1), where Δ_M denotes the M-dimensional simplex, while A observes such a commitment (Line 2). Then, they concurrently play their actions over the target space (Line 3), i.e., the defender plays action i_{D,n} ∈ M according to σ_{D,n}, while A, the follower, plays i_{A,n} ∈ M according to some attacker model σ_A(σ_{D,n}) ∈ Δ_M. The game is zero-sum: if D and A choose the same target at round n, they both get a utility equal to 0; conversely, if A attacks the i-th target while D decides to protect the j-th one, A gets v_i and D gets −v_i, since she lost the target. More concisely, the defender incurs in the loss (Line 4):

l_n := v_{i_{A,n}} \mathbb{I}\{i_{A,n} \neq i_{D,n}\},   (1)

not suffering from any loss if both players select the same target (hereafter, we denote with \mathbb{I}\{E\} the indicator function of a generic event E).

For each n ∈ {1, ..., N}:
1. D publicly commits to a strategy σ_{D,n}
2. A observes the strategy D committed to
3. D and A play i_{D,n} and i_{A,n}, respectively
4. D incurs in loss l_n according to Equation (1)
Figure 1: Leader-follower interaction.

Hereafter, we assume that the defender is able to compute the best response strategy σ_D(A) ∈ Δ_M if she is given the attacker model she is playing against. Similarly, we denote with σ_A(σ_D) ∈ Δ_M the response A plays against strategy σ_D of D. According to such an assumption, we can compute the expected loss of D against a generic attacker A as:

L(A) := \sum_{m \in M} σ_A(σ_D(A))_m \, v_m \, (1 − σ_D(A)_m),   (2)

where σ(·)_m is the probability associated with target m by strategy σ. The problem we study in this work is defined as follows.

Definition 1. The Follower's Behaviour Identification in Security Games (I-SG) problem is a tuple (G_N, A, A*), where G_N is a 2-player normal-form repeated game and A = {A_1, ..., A_K} is a set of possible attacker behavioural profiles, with A* ∈ A denoting the actual profile of the attacker in G_N, unknown to the defender D.
In this work, we cast the I-SG problem as a sequential decision learning problem where, at each round, the defender aims at selecting her best response to the attacker in order to identify the actual attacker profile A* ∈ A while minimizing the loss suffered during the learning process.

Definition 2. A policy U is an algorithm able to provide at each round n a strategy profile σ_{D,n} for the defender D. Formally: U(h_n) := σ_{D,n}, where h_n is the history collected so far, i.e., all the strategies declared by the defender {σ_{D,1}, ..., σ_{D,n−1}}, the actions played by the two players {i_{D,1}, i_{A,1}, ..., i_{D,n−1}, i_{A,n−1}} in the past rounds, and the corresponding losses {l_1, ..., l_{n−1}}.
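To make the protocol of Figure 1 concrete, the following minimal Python sketch simulates one round of the game and computes the expected loss of Equation (2). It is illustrative only: the helper names (`play_round`, `expected_loss`) and the representation of an attacker model as a function mapping a commitment to a distribution over targets are our own assumptions, not notation from the paper.

```python
import numpy as np

def play_round(rng, values, sigma_D, attacker):
    """One round of Figure 1: D commits to sigma_D, the attacker observes it,
    both sample a target, and D suffers the loss of Equation (1)."""
    sigma_A = attacker(sigma_D)                    # attacker model sigma_A(sigma_D)
    i_D = rng.choice(len(values), p=sigma_D)       # defender's realized action
    i_A = rng.choice(len(values), p=sigma_A)       # attacker's realized action
    loss = values[i_A] if i_A != i_D else 0.0      # l_n = v_{i_A} * I{i_A != i_D}
    return i_D, i_A, loss

def expected_loss(values, sigma_D, attacker):
    """Expected loss of D when committing to sigma_D against a given attacker model;
    Equation (2) is the special case sigma_D = sigma_D(A)."""
    sigma_A = attacker(sigma_D)
    return float(np.sum(sigma_A * values * (1.0 - sigma_D)))
```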

We evaluate the performance of a given policy U over a finite-time horizon of N rounds by means of the expected cumulative pseudo-regret, defined as:

R_N(U) = \mathbb{E}\left[\sum_{n=1}^{N} l_n\right] − L^* N,

where L^* := L(A^*) is the expected loss incurred by the defender if she plays the best response to the actual attacker A^*, l_n is the loss incurred by using the policy U at round n, and the expectation \mathbb{E}[\cdot] is taken w.r.t. the stochasticity of the attacker strategy, the defender strategy, and the policy U. The goal of a generic policy U is to minimize the pseudo-regret R_N(U) incurred while learning the true attacker's profile.

4 ANALYSED ATTACKER PROFILES

In this section, we describe the different attacker profiles we study in this work and formalize the definition of the attacker strategy σ_A(·) for two sets of attackers, grouped depending on their ability to change their behaviour w.r.t. the strategy D commits to. Specifically, on one side we take into account stochastic attackers, which disregard the strategy of D; on the other, we focus on strategy-aware attackers, able to modify their strategies depending on the defender's announced strategy σ_{D,n}.

4.1 STOCHASTIC ATTACKER

The first class of attackers is the Stochastic (Sto) one, where the attacking player does not take into account the strategy σ_{D,n} announced by the defender D and thus has a fixed probability over time of attacking the available targets. This class of attackers models opponents focused on specific targets and whose preferences are not influenced by the defender's behaviour. At round n, a stochastic attacker Sto plays according to the strategy:

σ_{Sto}(σ) = p(Sto)   ∀σ ∈ Δ_M,

where p(Sto) ∈ Δ_M is a probability distribution over the targets, which is known to D. In this case, the defender's best response σ_D(Sto) is defined as:

σ_D(Sto)_m = 1 if m = \arg\max_{i ∈ M} \{v_i \, p(Sto)_i\}, and 0 otherwise.

4.2 STRATEGY-AWARE ATTACKER

The second class of attackers we examine in this paper consists of strategy-aware attackers, corresponding to followers able to modify their strategy depending on the strategy of the defender D. In particular, we study Stackelberg (Sta) attackers (Von Stackelberg, 1934), who are able to exploit the information provided by the strategy profile declared by the defender D and optimally respond to it, and SUQR attackers (Nguyen et al., 2013), having bounded rationality and being capable of partially exploiting the information provided by the defender, disregarding heavily patrolled targets.

Stackelberg Attacker. Given a strategy profile declaration σ_{D,n}, a Stackelberg attacker Sta responds with:

σ_{Sta}(σ) = \arg\max_{σ' ∈ Δ_M} \sum_{m ∈ M} σ'_m \, v_m \, (1 − σ_m),

and the defender's best response to this attacker is:

σ_D(Sta) = \arg\min_{σ ∈ Δ_M} \max_{σ' ∈ Δ_M} \sum_{m ∈ M} σ'_m \, v_m \, (1 − σ_m),

as reported in (Conitzer and Sandholm, 2006), where it is proved that, for 2-player zero-sum games, computing the optimal mixed strategy for the leader to commit to is equivalent to computing the minmax strategy, i.e., minimizing the maximum expected utility that the opponent can obtain.

SUQR Attacker. The SUQR attacker responds to the commitment σ_{D,n} as:

σ_{SUQR}(σ)_m = \frac{\exp\{-α σ_m + β v_m + γ\}}{\sum_{h=1}^{M} \exp\{-α σ_h + β v_h + γ\}},

where α ∈ ℝ^+ and β, γ ∈ ℝ are parameters known to the defender, characterizing the attacker and depending on the underlying application. In this case, we do not have a closed form for the best response, but we can compute the minmax solution to the problem following the steps taken in (Yang et al., 2011). We will refer to σ_D(SUQR) as the best response to an attacker with a SUQR profile.
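As a complement to the definitions above, the sketch below instantiates the three attacker models and the defender's minmax commitment against the Stackelberg attacker as a small linear program. This is a didactic reconstruction under our own naming (e.g., `defender_minimax`); the use of `scipy.optimize.linprog` is an implementation choice of ours, not something prescribed by the paper.

```python
import numpy as np
from scipy.optimize import linprog

def stochastic_attacker(p):
    """Sto: fixed distribution p over targets, independent of the commitment."""
    return lambda sigma_D: p

def stackelberg_attacker(values):
    """Sta: best response to the observed commitment (ties broken arbitrarily)."""
    def respond(sigma_D):
        best = int(np.argmax(values * (1.0 - sigma_D)))   # maximizes v_m (1 - sigma_m)
        resp = np.zeros_like(values)
        resp[best] = 1.0
        return resp
    return respond

def suqr_attacker(values, alpha, beta, gamma):
    """SUQR: softmax response with weights (alpha, beta, gamma)."""
    def respond(sigma_D):
        logits = -alpha * sigma_D + beta * values + gamma
        w = np.exp(logits - logits.max())                 # numerically stable softmax
        return w / w.sum()
    return respond

def defender_minimax(values):
    """Commitment against Sta: minimize max_m v_m (1 - sigma_m) via a small LP
    (zero-sum minmax). Variables are sigma_1..sigma_M and the auxiliary bound z."""
    M = len(values)
    c = np.zeros(M + 1); c[-1] = 1.0                      # minimize z
    A_ub = np.hstack([-np.diag(values), -np.ones((M, 1))])  # v_m(1 - sigma_m) <= z
    b_ub = -values
    A_eq = np.ones((1, M + 1)); A_eq[0, -1] = 0.0         # sum of sigma_m equals 1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, 1)] * M + [(None, None)])
    return res.x[:M]
```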
5 IDENTIFYING THE ATTACKER

Initially, we describe how state-of-the-art techniques can be adapted to address the I-SG problem. Direct approaches are provided by MAB (Bubeck et al., 2012) and expert (Cesa-Bianchi and Lugosi, 2006) algorithms, where arms/experts represent the different attacker behavioural profiles in A. These are general-purpose techniques that do not exploit the structure of the problem we are tackling. Summarily, MAB algorithms do not use the expert feedback to learn the attacker behaviour, while expert algorithms do not differentiate among feedbacks received after the defender committed to different strategies. We show below the regret obtained when these algorithms are used in an I-SG problem.

When using MAB algorithms, we are able to directly apply to our problem the derivation of an upper bound over the pseudo-regret available in the literature.
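Concretely, a direct MAB adaptation treats each candidate profile as an arm: pulling arm k means committing to the best response σ_D(A_k) for the round and feeding the observed loss back to the bandit. The sketch below uses UCB1 with the reward rescaled to 1 − l_n ∈ [0, 1], an equivalent reformulation of the reward −l_n used in Theorem 1; the class name and the rescaling are our own choices.

```python
import numpy as np

class UCB1OverProfiles:
    """UCB1 treating each candidate attacker profile as an arm.
    Pulling arm k means committing to the precomputed best response to A_k;
    the observed loss l_n (in [0, 1]) is turned into the reward 1 - l_n."""

    def __init__(self, n_profiles):
        self.counts = np.zeros(n_profiles)
        self.means = np.zeros(n_profiles)
        self.t = 0

    def select(self):
        self.t += 1
        if np.any(self.counts == 0):                      # play each arm once first
            return int(np.argmin(self.counts))
        bonus = np.sqrt(2.0 * np.log(self.t) / self.counts)
        return int(np.argmax(self.means + bonus))         # optimistic arm choice

    def update(self, arm, loss):
        reward = 1.0 - loss
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]
```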

We can state the following result for the case of UCB1 (Auer et al., 2002).

Theorem 1 (UCB1 pseudo-regret upper bound). Let us consider an instance of the I-SG problem and apply the UCB1 algorithm, where each possible behavioural profile A_k ∈ A is an arm which receives reward −l_n if played. Then, we incur in the following pseudo-regret:

R_N(U) ≤ \sum_{k : \Delta L_k > 0} \frac{8 \ln N}{\Delta L_k} + \left(1 + \frac{\pi^2}{3}\right) \sum_{k=1}^{K} \Delta L_k,

where \Delta L_k = \sum_{m=1}^{M} σ_{A^*}(σ_D(A_k))_m \, v_m \, (1 − σ_D(A_k)_m) − L^* is the expected regret of playing the best response to attacker A_k when the real attacker is A^*.

When using an expert algorithm, for instance Follow the Perturbed Leader (FPL) (Cesa-Bianchi and Lugosi, 2006), we could exploit an (expert) feedback over all arms, since we can compute the expected loss also for the attacker profiles that have not been played at a given turn. Nevertheless, if the attacker is strategy aware and we adopt an expert feedback, D incurs in a linear regret. We formally state this result in the following theorem.

Theorem 2 (Expert pseudo-regret upper bound). Let us consider an instance of the I-SG problem and apply the FPL algorithm, where each possible profile A_k is an expert and receives, at round n, an expert reward equal to minus the loss she would have incurred, observing i_{A,n}, by playing the best response to the attacker A_k. Then, there always exists an attacker set A s.t. the defender D incurs in an expected pseudo-regret of:

R_N(U) \propto \Delta L \, N.

The proof of Theorem 2 is reported in Appendix A for reasons of space. The above results show that MAB techniques provide, in the general case, better guarantees than expert algorithms, assuring a worst-case pseudo-regret of O(\ln N) vs. O(N). In the following, we propose two different techniques that effectively exploit the information on both stochastic and strategy-aware attackers, providing better guarantees over the worst-case pseudo-regret. The first algorithm, Follow the Belief (FB), conducts the learning process taking into account the belief of the learner about the different behavioural profiles. The second method, Follow the Regret (FR), is based on a value iteration algorithm over the belief space that minimizes the expected regret over the next rounds.

5.1 FOLLOW THE BELIEF

The pseudo-code of FB is presented in Algorithm 1. At the beginning, FB initializes a set of active attackers P_1 = A and a belief b_1(A_k) = 1/K for all the attacker profiles A_k ∈ P_1 (Lines 1-3).

Algorithm 1 FB
1: P_1 = A
2: for all A_k ∈ P_1 do
3:   b_1(A_k) = 1/K
4: for all n ∈ {1, ..., N} do
5:   Select A_{k_n} = \arg\max_{A_k ∈ P_n} b_n(A_k)
6:   Play σ_D(A_{k_n})
7:   Observe attacker action i_{A,n}
8:   for all A_k ∈ P_n do
9:     if σ_{A_k}(σ_D(A_{k_n}))_{i_{A,n}} = 0 then
10:      P_{n+1} ← P_n \ A_k
11:    else
12:      Compute b_{n+1}(A_k) with Equation (3)

At each round n, the algorithm selects the attacker A_{k_n} for which the belief is the largest (where ties are broken arbitrarily), best responds with the strategy σ_D(A_{k_n}) and observes the action i_{A,n} actually played by the attacker (Lines 4-7). After that, the belief is updated as follows:

b_{n+1}(A_k) = \frac{w_n(A_k)}{\sum_{A ∈ P_n} w_n(A)},   (3)

where w_n(A_k) = b_n(A_k) \, σ_{A_k}(σ_D(A_{k_n}))_{i_{A,n}} (Lines 8-12). In other words, the algorithm updates the likelihood of the sequence of actions for each profile A_k ∈ P_n according to the observed action i_{A,n} at round n (Line 12). If the realization i_{A,n} is not consistent with attacker A_k (zero likelihood), profile A_k is removed from P_n (Line 10).
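Under the same illustrative representation used in the earlier sketches (attacker models as functions, `play_round` from the sketch in Section 3), Algorithm 1 can be implemented in a few lines; `follow_the_belief` is our own naming for this reconstruction.

```python
import numpy as np

def follow_the_belief(rng, values, profiles, best_responses, true_attacker, N):
    """Algorithm 1 (FB). profiles[k] is the attacker model sigma_{A_k}(.),
    best_responses[k] is the defender's best response sigma_D(A_k)."""
    K = len(profiles)
    active = list(range(K))                            # P_1 = A
    belief = {k: 1.0 / K for k in active}              # b_1(A_k) = 1/K
    losses = []
    for n in range(N):
        k_n = max(active, key=lambda k: belief[k])     # largest belief, ties arbitrary
        sigma_D = best_responses[k_n]                  # play sigma_D(A_{k_n})
        _, i_A, loss = play_round(rng, values, sigma_D, true_attacker)
        losses.append(loss)
        weights = {}
        for k in list(active):
            lik = profiles[k](sigma_D)[i_A]            # sigma_{A_k}(sigma_D)_{i_A}
            if lik == 0.0:                             # observation inconsistent with A_k
                active.remove(k)                       # eliminate the profile (Line 10)
            else:
                weights[k] = belief[k] * lik           # w_n(A_k)
        total = sum(weights.values())                  # the true profile is never removed
        belief = {k: w / total for k, w in weights.items()}   # Equation (3)
    return losses
```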
Let \bar{b}_{kj,t} := \mathbb{E}_{σ_D(A_j)}[B_{k,t}] be the expected value of the belief we get for attacker A_k when we are best responding to A_j and the true type is A^* ∈ A, and denote with \Delta b_k := \min_{j : A_j ∈ A} [\ln(\bar{b}_{*j,t}) − \ln(\bar{b}_{kj,t})] the minimum difference of such values. We can upper bound the regret of the FB algorithm as stated by the following theorem.

Theorem 3 (FB pseudo-regret upper bound). Given an instance of the I-SG problem s.t. \Delta b_k > 0 for each A_k ∈ A, by applying FB the defender incurs in a pseudo-regret of:

R_N(U) ≤ \sum_{k=1}^{K} \frac{2(λ_k^2 + λ_*^2) \, \Delta L_k}{(\Delta b_k)^2},

where λ_k := \max_{m ∈ M} \max_{σ ∈ S} \ln(σ_{A_k}(σ)_m) − \min_{m ∈ M} \min_{σ ∈ S} \ln(σ_{A_k}(σ)_m) \, \mathbb{I}\{σ_{A_k}(σ)_m ≠ 0\} is the range in which the logarithms of the belief realizations lie (excluding realizations equal to zero, which end the exploration of a profile) and S := \{σ_D(A_k)\}_{A_k ∈ A} is the set of the available best responses to the attacker profiles.

For space reasons, we report the proof of Theorem 3 in Appendix A. Comparing the derived results, we notice that the FB algorithm presents an upper bound over the pseudo-regret that is strictly better than that of MAB algorithms, i.e., a constant regret O(1) in N vs. a logarithmic one O(\ln N).

5.2 FOLLOW THE REGRET

FB adopts the belief as the discriminant factor to select the strategy profile to play in the next round. Conversely, in what follows we describe the FR algorithm, which is driven by a value iteration procedure that directly minimizes the expected regret over the remaining rounds {n+1, ..., N}. In principle, one should perform the procedure until the last round N, but, for computational purposes, an approximate solution can be obtained by setting a maximum level of recursion h_max and carrying on the optimization only over the rounds {n+1, ..., min{n + h_max, N}}.

Algorithm 2 FR(h_max)
1: for all A_k ∈ A do
2:   Initialize b_k^{(1)} = 1/K
3: for all n ∈ {1, ..., N} do
4:   \hat{R}_n = RE(1, b^{(n)}, h_max)
5:   Select A_{k_n} s.t. k_n = \arg\min_t \hat{R}_{t,n}
6:   Play σ_D(A_{k_n})
7:   Observe attacker action i_{A,n}
8:   for all A_k ∈ A do
9:     Compute b_k^{(n+1)} according to Equation (6)

Algorithm 3 RE(h, b, h_max)
1: for all A_k ∈ A do
2:   for all (i, j) ∈ M² do
3:     for all A_t ∈ A do
4:       \hat{b}_t ← b_t \, σ_{A_t}(σ_D(A_k))_j
5:     \hat{b} ← \hat{b} / \sum_m \hat{b}_m
6:     Compute r_{ij,k} according to Equation (4)
7:     if h < h_max then
8:       R = RE(h + 1, \hat{b}, h_max)
9:       r_{ij,k} ← r_{ij,k} + \min R
10:  Compute \hat{R}_k according to Equation (5)
11: Return \hat{R}

The pseudo-code of the FR algorithm is presented in Algorithm 2, which recursively exploits the subroutine in Algorithm 3. At first, the algorithm initializes a belief vector b_k^{(1)} = 1/K for each attacker A_k ∈ A (Line 2, Alg. 2). At each round n, the algorithm computes the estimated expected regret vector \hat{R}_n suffered by D if she plays the best response σ_D(A_k) to A_k, for each attacker profile A_k ∈ A (Line 4, Alg. 2), by recursively calling the Regret Estimator (RE) algorithm. Here, for every possible attacker A_k ∈ A and for every pair of possible actions of the defender and the attacker (i, j) ∈ M², we create a new belief vector \hat{b} by updating b according to the information that the attacker played action j (Lines 3-5, Alg. 3). After that, we compute r_{ij,k}, i.e., the estimated expected regret in the case the defender D plays action i_{D,n} = i and the attacker A plays i_{A,n} = j, measured w.r.t. the expected loss averaged over the beliefs \hat{b}, as follows:

r_{ij,k} = v_j \, \mathbb{I}\{i ≠ j\} − \sum_{t ∈ \{1,...,K\}} \hat{b}_t \, L(A_t).   (4)

If the maximum recursion level h_max has been reached, the above value corresponds to the total estimated expected regret; otherwise, we recursively compute the regret by calling RE over the following rounds and sum it to the instantaneous one r_{ij,k} (Line 9, Alg. 3). Finally, we compute the estimated total regret of choosing a specific attacker A_k for the next turn (Line 10, Alg. 3) as follows:

\hat{R}_k := \sum_{i=1}^{M} \sum_{j=1}^{M} r_{ij,k} \, σ_D(A_k)_i \sum_{A_t ∈ A} b_t \, σ_{A_t}(σ_D(A_k))_j,   (5)

where the regret r_{ij,k} is weighted by the probabilities that action i is selected by D and action j is selected by A. The defender D plays, for the current round, the best response to the attacker A_{k_n} providing the minimum estimated expected regret \hat{R}_{k_n} (Line 6, Alg. 2) and observes the action i_{A,n} undertaken by the attacker A^*. Finally, the algorithm updates the beliefs (Line 9, Alg. 2) as follows:

b_k^{(n+1)} = \frac{w_k}{\sum_{t ∈ \{1,...,K\}} w_t},   (6)

where w_k = b_k^{(n)} \, σ_{A_k}(σ_D(A_{k_n}))_{i_{A,n}}.
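A compact reconstruction of Algorithms 2-3 follows, with h_max = 1 by default and with Equation (4) read as the realized loss minus the belief-weighted expected loss L(A_t). As before, the function names and data layout are our own assumptions; `expected_loss` and `play_round` refer to the earlier sketch.

```python
import numpy as np

def regret_estimator(values, profiles, best_responses, exp_losses, belief, h, h_max):
    """Algorithm 3 (RE): estimated expected regret of best responding to each profile."""
    K, M = len(profiles), len(values)
    R_hat = np.zeros(K)
    for k in range(K):
        sigma_D = best_responses[k]
        responses = [profiles[t](sigma_D) for t in range(K)]   # sigma_{A_t}(sigma_D(A_k))
        for i in range(M):
            for j in range(M):
                b_hat = belief * np.array([responses[t][j] for t in range(K)])
                s = b_hat.sum()
                b_hat = b_hat / s if s > 0 else belief         # posterior after observing j
                r_ij = values[j] * (i != j) - b_hat @ exp_losses    # Equation (4)
                if h < h_max:                                  # recurse on future rounds
                    r_ij += regret_estimator(values, profiles, best_responses,
                                             exp_losses, b_hat, h + 1, h_max).min()
                prob = sigma_D[i] * sum(belief[t] * responses[t][j] for t in range(K))
                R_hat[k] += r_ij * prob                        # Equation (5)
    return R_hat

def follow_the_regret(rng, values, profiles, best_responses, true_attacker, N, h_max=1):
    """Algorithm 2 (FR): play the best response minimizing the estimated regret."""
    K = len(profiles)
    exp_losses = np.array([expected_loss(values, best_responses[k], profiles[k])
                           for k in range(K)])                 # L(A_k), Equation (2)
    belief = np.full(K, 1.0 / K)                               # b^{(1)} = 1/K
    losses = []
    for n in range(N):
        R_hat = regret_estimator(values, profiles, best_responses,
                                 exp_losses, belief, 1, h_max)
        k_n = int(np.argmin(R_hat))
        sigma_D = best_responses[k_n]
        _, i_A, loss = play_round(rng, values, sigma_D, true_attacker)
        losses.append(loss)
        w = belief * np.array([profiles[k](sigma_D)[i_A] for k in range(K)])
        belief = w / w.sum()                                   # Equation (6)
    return losses
```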
5.3 COMPUTATIONAL COMPLEXITY

In this section, we analyse the proposed algorithms from a computational perspective. FB has complexity O(KN), since it performs a belief update for each of the K attacker profiles and repeats this operation over the N rounds. Thus, it is linear both in the number of profiles and in the number of rounds the game is played. Conversely, FR requires much more computational time. Indeed, for each of the K attacker profiles, we consider M actions for both players and update the expected regret over the current beliefs of the K profiles. This leads to a cost of O(M²K²) for a single round and an overall computational cost of O(M²K²N) over the problem horizon N. If we want to compute the estimated expected regret \hat{R}_k from the current round to the end of the horizon by means of a forward procedure, the computational cost required by FR is O(M^{2(N−n)} K^{2(N−n)}) for round n. Thus, the final computational cost required by FR is \sum_{n=1}^{N} O(M^{2(N−n)} K^{2(N−n)}) = O\left(\frac{(MK)^{2N} − 1}{(MK)^{2} − 1}\right), which is O(M^{2N} K^{2N}).

6 EXPERIMENTAL EVALUATION

We compare the proposed algorithms FB and FR (with h_max = 1) with the state-of-the-art online learning approaches from the MAB (Bubeck et al., 2012) and expert (Kalai and Vempala, 2005) fields. In particular, we evaluate the UCB1 algorithm (Auer et al., 2002), from the MAB literature, and the FPL algorithm (Cesa-Bianchi and Lugosi, 2006), from the expert literature. In the experiments we also analyse the case in which one of the attacker behavioural profiles, namely U, is stochastic and her strategy is unknown to the defender D (to avoid possible misunderstandings, let us notice that the stochastic behaviour we describe in Section 4 is based on the assumption that the defender knows the strategy). In this case, we are still able to allow the leader to commit to a strategy that somehow minimizes the expected loss. Indeed, we can assign:

σ_{D,n}(U) = FPL(h_n),

where FPL(·) ∈ M is the pure strategy prescribed by the FPL algorithm. In this case the algorithm suffers from an additional regret due to the fact that, even if it is able to correctly detect the profile, it does not know the best response σ_D(U), but needs to learn it over time.

6.1 EXPERIMENTAL SETTING

The experimental setting is as follows. We use a time horizon of N = 1000 rounds, with a different number of targets M ∈ {5, 10} and different profile configurations C_i, listed in Table 1, in which we also report the number of different stochastic, SUQR, and unknown stochastic behavioural profiles for each configuration.

Table 1: Number and type of attacker profiles A used for the experiments and total number of attackers K.

       Sta  Sto  SUQR  U   K
C_1     1    1    -    -   2
C_2     1    -    1    -   2
C_3     1    1    1    -   3
C_4     1    5    -    -   6
C_5     1    -    5    -   6
C_6     1    5    5    -   11
C_7     1    5    5    1   12

The configurations are ordered from the one with the smallest number of behavioural profiles (K = 2) to the largest one (K = 12). In principle, these problems should be of increasing difficulty, since the algorithms have to identify the actual behaviour among a larger number of options. The strategies of the stochastic behavioural profiles Sto are drawn from a Dirichlet distribution with θ = 1_M (uniform distribution over Δ_M) and the target values v are uniformly sampled in [0, 1]^M. The parameters for the SUQR behavioural profiles are drawn from a uniform probability distribution over the intervals α ∈ [5, 15], β ∈ [0, 1] and γ ∈ [0, 1], whose choice is motivated by the experimental results obtained by Nguyen et al. (2013). For each combination of behavioural profiles and target-space size, 10 random configurations (i.e., target values v and attacker profile sets A) are generated and the actual behavioural profile A^* is drawn from a uniform probability distribution over the given profile set A. For each configuration we run 100 independent experiments and we compute the average regret. We evaluate the performance in terms of expected pseudo-regret R_n(U), with n ∈ {1, ..., N}, and of the computational time spent by the algorithms to execute a single run (N = 1000 rounds). Each component of the FPL noise vector z_i is drawn from a uniform probability distribution over the interval [0, \hat{v}KN], where \hat{v} = \max_{m ∈ M} v_m, as described in Cesa-Bianchi and Lugosi (2006), Chapter 4.
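The random instances of Section 6.1 can be generated as in the following sketch (a hypothetical helper of ours, not code released with the paper), which draws the target values, the Sto strategies, and the SUQR parameters from the distributions stated above.

```python
import numpy as np

def sample_configuration(rng, M, n_sto, n_suqr):
    """Draw one random instance as in Section 6.1."""
    values = rng.uniform(0.0, 1.0, size=M)                 # target values v in [0, 1]
    sto_strategies = [rng.dirichlet(np.ones(M))            # Sto: symmetric Dirichlet,
                      for _ in range(n_sto)]               # i.e., uniform on the simplex
    suqr_params = [(rng.uniform(5, 15),                    # alpha in [5, 15]
                    rng.uniform(0, 1),                     # beta in [0, 1]
                    rng.uniform(0, 1))                     # gamma in [0, 1]
                   for _ in range(n_suqr)]
    return values, sto_strategies, suqr_params
```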
6.2 EXPERIMENTAL RESULTS

We report in Table 2 the empirical pseudo-regret obtained in the experiments. It can be observed that the algorithms we propose dramatically outperform the baselines provided by the state of the art. Furthermore, there is no strong statistical evidence that one algorithm between FB and FR outperforms the other. We recall that FR is more computationally demanding than FB; thus, one might prefer FB for problems with many attacker behavioural profiles, since it has comparable performance w.r.t. FR and is computationally more efficient. Notably, the FPL algorithm generally improves its performance when tested over the larger target space M = 10. We think this could be induced by the fact that the specific configurations in which FPL gets linear regret (i.e., the ones considered in Theorem 2) are less likely to occur when we have a larger number of targets. Remarkably, our algorithms provide good performance also when a stochastic behavioural profile U, whose strategy is unknown to the defender, is present among the possible ones.

Table 2: Expected pseudo-regret R_N(U) over N = 1000 rounds and corresponding 95% confidence intervals for the different configurations.

M = 5
        C_1            C_2            C_3            C_4             C_5           C_6             C_7
UCB1    14.12 ± 1.88   8.62 ± 3.73    23.92 ± 5.23   45.75 ± 11.68   1.76 ± 0.41   75.82 ± 19.94   62.31 ± 12.22
FPL     18.71 ± 35.02  11.16 ± 5.98   38.5 ± 27.18   49.8 ± 62.33    0.77 ± 0.12   68.88 ± 64.13   72.5 ± 53.34
FB      0.19 ± 0.13    0.2 ± 0.18     0.5 ± 0.24     0.48 ± 0.2      0.09 ± 0.03   0.67 ± 0.2      7.92 ± 4.87
FR      0.1 ± 0.06     0.27 ± 0.36    0.42 ± 0.3     0.62 ± 0.24     0.07 ± 0.04   1.07 ± 1.1      4.84 ± 3.32

M = 10
        C_1            C_2            C_3            C_4             C_5           C_6             C_7
UCB1    16.77 ± 1.2    5.24 ± 2.79    21.2 ± 3.76    60.58 ± 8.89    4.24 ± 5.02   61.52 ± 22.48   58.93 ± 17.42
FPL     1.08 ± 0.2     5.97 ± 3.5     12.06 ± 4.31   2.63 ± 0.99     3.24 ± 3.96   17.69 ± 16.03   22.49 ± 12.26
FB      0.13 ± 0.03    0.1 ± 0.02     0.33 ± 0.16    0.57 ± 0.17     0.05 ± 0.01   0.58 ± 0.14     16.06 ± 6.89
FR      0.06 ± 0.05    0.12 ± 0.21    0.21 ± 0.12    0.43 ± 0.19     0.02 ± 0.02   0.6 ± 0.43      14.65 ± 8.1

In Figures 2 to 7 we show how the pseudo-regret R_n(U) evolves during the time horizon in the most challenging configurations, namely C_5, C_6 and C_7. The results in the other configurations, omitted here for reasons of space, confirm those obtained in C_5, C_6, C_7 and are reported in Appendix B. The plots are in a semilogarithmic scale for better readability. In all the presented configurations, except C_7 with M = 10, there is statistical significance that the FB and FR algorithms outperform the baselines on average, since the confidence intervals do not overlap after the first 50 rounds. In configuration C_7 with M = 10, our algorithms outperform the baselines only on average.

Finally, we analyze the computational effort required by our algorithms to solve instances over N = 1000 rounds and M ∈ {5, 10, 20, 40} targets (the computational times for the UCB1 and FPL algorithms are omitted since they are in line with those of FB). The average computational times are reported in Table 3 (the full version of Table 3, with confidence intervals, is reported in Appendix B).

Table 3: Computational time in seconds needed by FB and FR to solve an instance over N = 1000 rounds.

              C_1   C_2   C_3   C_4   C_5   C_6   C_7
M = 5   FB      6    11    12     4    24    15    15
        FR     77   121   170   146   652  1029  1114
M = 10  FB     10    22    23     7    63    47    48
        FR    356   679   887   960  4402  7527  7292
M = 20  FB     33   222   138    34   485   227   229
M = 40  FB    105  2061  1412   129  2348  1634  1643

There are three observations we can make. First, we could not report the values for M ∈ {20, 40} for FR, since the required computational cost is too high (more than 3600 seconds). Second, both FB and FR present the same trend w.r.t. the configurations: in fact, when the behavioural profile of the opponent can only be either Sta or Sto, both algorithms are about twice as efficient as in the cases in which SUQR adversaries are introduced. This is due to the fact that both the Sta and SUQR models exploit the strategy the defender commits to, making it more difficult to distinguish among them. The most difficult configuration is C_7, where the presence of a stochastic unknown adversary makes things even worse, since its distribution must also be estimated. Finally, as expected, we notice that FB is always faster than FR: in fact, while they are both polynomial in the number of actions available to the players, i.e., the number of targets, the former is linear while the latter is quadratic (since we set h_max = 1).

7 CONCLUSIONS AND FUTURE RESEARCH

In this work, we study, for the first time, a novel leadership game in which the leader plays against a follower whose behaviour is unknown, but belongs to a set of known profiles. We provide two novel approaches to tackle this problem, namely FB and FR, bridging together game-theoretical techniques and online learning tools.
In the FB algorithm the leader is driven by the beliefs on the possible follower profiles, while the FR one is based on a learning policy directly driven by the estimated expected regret, computed according to a value iteration procedure. For the first approach, we also provide a finite-time analysis, showing that the regret of the algorithm is constant in the number of rounds, while bandit and expert algorithms available in the state of the art suffer from a logarithmic and a linear regret, respectively. Finally, we experimentally evaluate the performance of our algorithms in leadership settings inspired by concrete security domains, showing that our approaches provide a remarkable improvement in terms of empirical pseudo-regret minimization w.r.t. the main algorithms available in the state of the art of the online learning field. In the future, we will study an upper bound over the regret of the FR algorithm. Furthermore, we will include new types of attacker profiles and we will extend the framework towards a multi-agent-learning setting, allowing the attacker to exploit a finite/infinite memory.

Figure 2: Expected pseudo-regret for the configuration C_5 with M = 5 targets.
Figure 3: Expected pseudo-regret for the configuration C_6 with M = 5 targets.
Figure 4: Expected pseudo-regret for the configuration C_7 with M = 5 targets.
Figure 5: Expected pseudo-regret for the configuration C_5 with M = 10 targets.
Figure 6: Expected pseudo-regret for the configuration C_6 with M = 10 targets.
Figure 7: Expected pseudo-regret for the configuration C_7 with M = 10 targets.

References

An, B., Brown, M., Vorobeychik, Y., and Tambe, M. (2013). Security games with surveillance cost and optimal timing of attack execution. In AAMAS, pages 223-230.
Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. MACH LEARN, 47(2):235-256.
Balcan, M., Blum, A., Haghtalab, N., and Procaccia, A. D. (2015). Commitment without regrets: Online learning in Stackelberg security games. In EC, pages 61-78.
Basilico, N., De Nittis, G., and Gatti, N. (2017). Adversarial patrolling with spatially uncertain alarm signals. ART INT, 246:220-257.
Blum, A., Haghtalab, N., and Procaccia, A. D. (2015). Learning to play Stackelberg security games. Technical report, Carnegie Mellon University, Computer Science Department.
Bubeck, S., Cesa-Bianchi, N., et al. (2012). Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1-122.
Cesa-Bianchi, N. and Lugosi, G. (2006). Prediction, Learning, and Games. Cambridge University Press.
Conitzer, V. and Sandholm, T. (2006). Computing the optimal strategy to commit to. In EC, pages 82-90.
Conitzer, V. and Sandholm, T. (2007). AWESOME: A general multiagent learning algorithm that converges in self-play and learns a best response against stationary opponents. MACH LEARN, 67(1-2):23-43.
Fang, F., Stone, P., and Tambe, M. (2015). When security games go green: designing defender strategies to prevent poaching and illegal fishing. In IJCAI, pages 2589-2595.
Ford, B., Kar, D., Delle Fave, F. M., Yang, R., and Tambe, M. (2014). PAWS: adaptive game-theoretic patrolling for wildlife protection. In AAMAS, pages 1641-1642.
Fudenberg, D. and Tirole, J. (1991). Game Theory. MIT Press.
Kalai, A. and Vempala, S. (2005). Efficient algorithms for online decision problems. J COMPUT SYST SCI, 71(3):291-307.
Klíma, R., Kiekintveld, C., and Lisý, V. (2014). Online learning methods for border patrol resource allocation. In GameSec, pages 340-349.
McDiarmid, C. (1989). On the method of bounded differences. LOND MATH S, 141(1):148-188.
McFadden, D. L. (1984). Econometric analysis of qualitative response models. Handbook of Econometrics, 2:1395-1457.
Nguyen, T., Yang, R., Azaria, A., Kraus, S., and Tambe, M. (2013). Analyzing the effectiveness of adversary modeling in security games. In AAAI, pages 718-724.
Paruchuri, P., Pearce, J. P., Marecki, J., Tambe, M., Ordóñez, F., and Kraus, S. (2008). Playing games for security: An efficient exact algorithm for solving Bayesian Stackelberg games. In AAMAS, pages 895-902.
Pita, J., Jain, M., Western, C., Portway, C., Tambe, M., Ordóñez, F., Kraus, S., and Paruchuri, P. (2008). Deployed ARMOR protection: The application of a game-theoretic model for security at the Los Angeles International Airport. In AAMAS, pages 125-132.
Pita, J., Tambe, M., Kiekintveld, C., Cullen, S., and Steigerwald, E. (2011). GUARDS: game theoretic security allocation on a national scale. In AAMAS, pages 37-44.
Qian, Y., Haskell, W. B., Jiang, A. X., and Tambe, M. (2014). Online planning for optimal protector strategies in resource conservation games. In AAMAS, pages 733-740.
Qian, Y., Zhang, C., Krishnamachari, B., and Tambe, M. (2016). Restless poachers: Handling exploration-exploitation tradeoffs in security domains. In AAMAS, pages 123-131.
Shoham, Y. and Leyton-Brown, K. (2008). Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations. Cambridge University Press.
Tsai, J., Rathi, S., Kiekintveld, C., Ordóñez, F., and Tambe, M. (2009). IRIS - A tool for strategic security allocation in transportation networks. In AAMAS, pages 1327-1334.
Tuyls, K. and Weiss, G. (2012). Multiagent learning: Basics, challenges, and prospects. AI MAG, 33(3):41.
Von Stackelberg, H. (1934). Marktform und Gleichgewicht. J. Springer.
Xu, H., Tran-Thanh, L., and Jennings, N. R. (2016). Playing repeated security games with no prior knowledge. In AAMAS, pages 104-112.
Yang, R., Ford, B., Tambe, M., and Lemieux, A. (2014). Adaptive resource allocation for wildlife protection against illegal poachers. In AAMAS, pages 453-460.
Yang, R., Kiekintveld, C., Ordóñez, F., Tambe, M., and John, R. (2011). Improving resource allocation strategy against human adversaries in security games. In IJCAI, pages 458-464.

A SUPPLEMENTAL MATERIAL

Theorem 2 (Expert pseudo-regret upper bound). Let us consider an instance of the I-SG problem and apply the FPL algorithm, where each possible profile A_k is an expert and receives, at round n, an expert reward equal to minus the loss she would have incurred, observing i_{A,n}, by playing the best response to the attacker A_k. Then, there always exists an attacker set A s.t. the defender D incurs in an expected pseudo-regret of:

R_N(U) \propto \Delta L \, N.

Proof. Let us analyse the I-SG problem in which the attacker profile set is A = {Sta, Sto}, the true attacker is A* = Sta, and we use the Follow the Leader algorithm (Cesa-Bianchi and Lugosi, 2006). Assume that the best response σ_D(Sto) to the stochastic attacker Sto corresponds to the pure strategy played by the Stackelberg attacker at the equilibrium, i.e., σ_{Sta}(σ_D(Sta)) = σ_D(Sto). Assume the target chosen by the two strategies is target \hat{m} with value v_{\hat{m}}, that the target of maximum value is \bar{m} with value v_{\bar{m}}, and that the stochastic attacker has strategy p s.t.:

p_m = α if m = \hat{m}, p_m = 1 − α if m = \bar{m}, p_m = 0 otherwise,

where α = (v_{\bar{m}} − L(Sta)) / v_{\bar{m}} and α v_{\hat{m}} > (1 − α) v_{\bar{m}}. In this case, the defender might commit to two different strategies:
- if the defender D declares her best response to the Stackelberg attacker, σ_D(Sta), for the turn, the round provides zero loss as feedback for the stochastic attacker expert and a loss equal to L(Sta) for the Stackelberg one;
- if the defender D selects the best response to the stochastic attacker, σ_D(Sto), the defender incurs a loss equal to (1 − α) v_{\bar{m}} = L(Sta) for the stochastic attacker expert and L(Sta) for the Stackelberg one; thus, in this case the two experts receive the same feedback.
Summarizing, the Stackelberg attacker expert always incurs in a loss greater than or equal to the one of the stochastic expert, even if the real attacker is the Stackelberg one. Thus, with probability greater than 0.5 we are incurring in a per-round loss of \Delta L for the entire horizon, with a total regret proportional to \Delta L \, N. Even by resorting to randomization, thus adopting FPL, we would have a probability of at least 0.5 − ε (being ε the probability with which FPL chooses a suboptimal option) of selecting the wrong expert; thus also the FPL algorithm would incur in a linear regret over the time horizon.

Theorem 3 (FB pseudo-regret upper bound). Given an instance of the I-SG problem s.t. \Delta b_k > 0 for each A_k ∈ A, by applying FB the defender incurs in a pseudo-regret of:

R_N(U) ≤ \sum_{k=1}^{K} \frac{2(λ_k^2 + λ_*^2)\, \Delta L_k}{(\Delta b_k)^2},

where λ_k := \max_{m ∈ M} \max_{σ ∈ S} \ln(σ_{A_k}(σ)_m) − \min_{m ∈ M} \min_{σ ∈ S} \ln(σ_{A_k}(σ)_m) \, \mathbb{I}\{σ_{A_k}(σ)_m ≠ 0\} is the range in which the logarithms of the belief realizations lie (excluding realizations equal to zero, which end the exploration of a profile) and S := \{σ_D(A_k)\}_{A_k ∈ A} is the set of the available best responses to the attacker profiles.

Proof. Let us analyze the regret of the FB algorithm. We get some regret whenever the algorithm selects a strategy profile corresponding to a type different from the real one. Thus, the regret is upper bounded by:

R_N(U) = \mathbb{E}\left[\sum_{n=1}^{N} l_n\right] − L^* N = \sum_{k=1}^{K} \Delta L_k \, \mathbb{E}[T_k(N)],

where we recall that:

- T_k(N) = \sum_{n=1}^{N} \mathbb{I}\{A_{k_n} = A_k\} is the number of times we played the best response σ_D(A_k) to attacker A_k;
- \Delta L_k = \sum_{m=1}^{M} σ_{A^*}(σ_D(A_k))_m \, v_m \, (1 − σ_D(A_k)_m) − L^* is the expected regret of playing the best response to attacker A_k when the real attacker is A^*.

In each round in which the algorithm selects a profile whose best response is not equal to the one of A^*, we incur in some regret. Let us define the variables B_{k,t} and B_{*,t}, denoting the belief we have for the possible attacker A_k and for the real attacker A^*, respectively, given the action played by the real attacker A^* at turn t. Moreover, let \bar{b}_{kj,t} := \mathbb{E}_{σ_D(A_j)}[B_{k,t}] be the expected value of the belief we get for attacker A_k when we are best responding to A_j and the true type is A^* ∈ A at round t. Note that \bar{b}_{kj,t} < \bar{b}_{*j,t} for every j, since \Delta b_k is positive. For each profile A_k ∈ A, we have:

\mathbb{E}[T_k(N)] = \mathbb{E}\left[\sum_{t=1}^{N} \mathbb{I}\{B_{k,t} ≥ B_{*,t}\}\right]   (7)
= \mathbb{E}\left[\sum_{t=1}^{N} \mathbb{I}\{\ln(B_{k,t}) ≥ \ln(B_{*,t})\}\right]   (8)
≤ \sum_{t=1}^{N} \mathbb{P}\left(\ln(B_{k,t}) ≥ \ln(B_{*,t})\right)   (9)
= \sum_{t=1}^{N} \mathbb{P}\left(\ln(B_{k,t}) − \ln(\bar{b}_{k j_t,t}) + \ln(\bar{b}_{k j_t,t}) − \ln(\bar{b}_{* j_t,t}) ≥ \ln(B_{*,t}) − \ln(\bar{b}_{* j_t,t})\right)   (10)
≤ \sum_{t=1}^{N} \mathbb{P}\left(\ln(B_{k,t}) − \ln(\bar{b}_{k j_t,t}) ≥ \frac{\Delta b_k}{2}\right)   (11)
 + \sum_{t=1}^{N} \mathbb{P}\left(\ln(\bar{b}_{* j_t,t}) − \ln(B_{*,t}) ≥ \frac{\Delta b_k}{2}\right),   (12)

where the two sums in (11) and (12) are denoted R_1 and R_2, respectively, j_t is the index of the attacker A_{j_t} we selected at round t, and we defined \Delta b_k := \min_{j : A_j ∈ A} [\ln(\bar{b}_{*j,t}) − \ln(\bar{b}_{kj,t})], i.e., the minimum, w.r.t. the best responses available to the defender, of the difference between the expected value of the log-likelihood of attacker A^* and of attacker A_k when the true profile is A^*. Equation (9) has been obtained from Equation (8) since \mathbb{E}[\mathbb{I}\{\cdot\}] = \mathbb{P}(\cdot), while Equation (10) has been computed from Equation (9) by adding \ln(\bar{b}_{k j_t,t}) − \ln(\bar{b}_{* j_t,t}) to both the l.h.s. and the r.h.s. of the inequality; the split into (11) and (12) then follows from \ln(\bar{b}_{*j_t,t}) − \ln(\bar{b}_{kj_t,t}) ≥ \Delta b_k and a union bound. We would like to point out that \Delta b_k does not depend on t, since the distributions of B_{k,t} and B_{*,t} are the same over the rounds. Let us focus on R_1. We use the McDiarmid inequality (McDiarmid, 1989) to bound the probability that the empirical

estimate of the log-likelihood expected value is higher than a certain upper bound, as follows:

R_1 = \sum_{t=1}^{N} \mathbb{P}\left(\ln(B_{k,t}) − \ln(\bar{b}_{k j_t,t}) ≥ \frac{\Delta b_k}{2}\right) ≤ \sum_{t=1}^{\infty} \exp\left\{-\frac{t\,(\Delta b_k)^2}{2 λ_k^2}\right\} ≤ \frac{2 λ_k^2}{(\Delta b_k)^2},

where we exploited \sum_{x=1}^{\infty} e^{-\kappa x} ≤ \frac{1}{\kappa}. We define λ_k := \max_{m ∈ M} \max_{σ ∈ S} \ln(σ_{A_k}(σ)_m) − \min_{m ∈ M} \min_{σ ∈ S} \ln(σ_{A_k}(σ)_m) \, \mathbb{I}\{σ_{A_k}(σ)_m ≠ 0\} as the range in which the logarithms of the belief realizations lie (excluding realizations equal to zero, which end the exploration of a profile), where we used the fact that \mathbb{E}[B_{k,t}] = \bar{b}_{k j_t,t} for every t, and S := \{σ_D(A_k)\}_{A_k ∈ A} is the set of the available best responses to the attacker profiles. A similar reasoning can be applied to R_2, getting an upper bound of the following form:

R_2 ≤ \frac{2 λ_*^2}{(\Delta b_k)^2}.

The regret becomes:

R_N(U) = \sum_{k=1}^{K} \Delta L_k \, \mathbb{E}[T_k(N)] ≤ \sum_{k=1}^{K} \Delta L_k \left(\frac{2 λ_k^2}{(\Delta b_k)^2} + \frac{2 λ_*^2}{(\Delta b_k)^2}\right) = \sum_{k=1}^{K} \frac{2(λ_k^2 + λ_*^2)\, \Delta L_k}{(\Delta b_k)^2},

which concludes the proof.

B ADDITIONAL RESULTS

For the sake of completeness, we report in Figures 8 and 9 all the graphs regarding the regret for all the configurations C_1, ..., C_7 and for the two dimensions of the target space, namely M ∈ {5, 10}. By inspecting this additional set of figures, one can see that the results are in line with what has been presented in Section 6 of the main paper, where the proposed techniques, namely FB and FR, are able to outperform the literature methods. Even here, there is no clear statistical evidence that one of the two methods outperforms the other. Moreover, we also provide in Figure 10 the results for configuration C_6 with a number of targets M = 40. In this configuration, we were able to run only the FB algorithm due to computational time constraints. The results show that FB has performance similar to the one experienced with smaller target spaces; thus, it is able to scale without a significant loss in terms of expected pseudo-regret R_N(U).

Figure 8: Expected pseudo-regret for the different configurations with M = 5 targets; panels (a)-(g) correspond to configurations C_1-C_7.

Figure 9: Expected pseudo-regret for the different configurations with M = 10 targets; panels (a)-(g) correspond to configurations C_1-C_7.

Figure 10: Expected pseudo-regret for the configuration C_6 with M = 40 targets.

We also report here, as Table 4, the full version of Table 3, with the time values up to the first decimal and with the corresponding confidence intervals.

Table 4: Computational time in seconds needed by FB and FR to solve an instance over N = 1000 rounds and the corresponding 95% confidence intervals.

               C_1            C_2             C_3             C_4            C_5              C_6              C_7
M = 5   FB     5.9 ± 1.7      11.1 ± 2.2      11.7 ± 2.9      3.5 ± 1.0      23.7 ± 2.4       14.9 ± 4.3       14.7 ± 3.2
        FR     77.0 ± 2.1     121.1 ± 3.2     170.4 ± 4.1     146.2 ± 4.7    651.7 ± 36.6     1029.2 ± 64.7    1113.7 ± 40.2
M = 10  FB     10.3 ± 2.6     21.9 ± 13.2     23.0 ± 17.9     7.1 ± 2.3      63.0 ± 7.4       47.22 ± 14.05    48.59 ± 13.48
        FR     356.1 ± 14.3   678.5 ± 15.9    887.0 ± 11.1    960.4 ± 13.0   4402.5 ± 14.2    7526.5 ± 189.9   7291.6 ± 23.7
M = 20  FB     33.5 ± 3.0     222.2 ± 126.9   137.8 ± 77.6    33.7 ± 1.2     484.5 ± 107.7    226.8 ± 45.3     229.5 ± 46.44
M = 40  FB     104.5 ± 7.1    2061.5 ± 837.2  1412.0 ± 812.1  128.9 ± 16.5   2347.9 ± 1223.2  1634.2 ± 487.6   1643.62 ± 468.8