A SUPPLEMENTAL MATERIAL

Theorem 2 (Expert pseudo-regret upper bound). Let us consider an instance of the I-SG problem and apply the FL algorithm, where each possible profile $A_k$ is an expert and receives, at round $n$, an expert reward equal to minus the loss she would have incurred, observing $i_A$, by playing the best response to the attacker $A_k$. Then, there always exists an attacker set $\mathcal{A}$ s.t. the defender $D$ incurs an expected pseudo-regret of:

$$R_N(U) \propto \Delta L \, N.$$

Proof. Let us analyse the I-SG problem in which the attacker profile set is $\mathcal{A} = \{Sta, Sto\}$, the true attacker is $A = Sta$ and we use the Follow the Leader algorithm (Cesa-Bianchi and Lugosi, 2006). Assume that the best response $\sigma_D(Sto)$ to the stochastic attacker $Sto$ corresponds to the pure strategy played by the Stackelberg attacker at the equilibrium, i.e., $\sigma_{Sta}(\sigma_D(Sta)) = \sigma_D(Sto)$. Assume that the target chosen by the two strategies is $\hat{m}$, with value $v_{\hat{m}}$, that the maximum value $v_{\bar{m}}$ is in target $\bar{m}$, and that the stochastic attacker has strategy $p$ s.t.:

$$p_m = \begin{cases} \alpha & \text{if } m = \hat{m}, \\ 1 - \alpha & \text{if } m = \bar{m}, \\ 0 & \text{otherwise,} \end{cases}$$

where $\alpha = \frac{v_{\bar{m}} - L(Sta)}{v_{\bar{m}}}$ and $\alpha v_{\hat{m}} > (1 - \alpha) v_{\bar{m}}$. In this case, the defender might commit to two different strategies: (i) if the defender $D$ declares its best response to the Stackelberg attacker $\sigma_D(Sta)$ for the turn, it would provide zero loss as feedback for the stochastic attacker expert and a loss equal to $L(Sta)$ to the Stackelberg one; (ii) if the defender $D$ selects the best response to the stochastic attacker $\sigma_D(Sto)$, the defender would incur a loss equal to $(1 - \alpha) v_{\bar{m}} = L(Sta)$ for the stochastic attacker expert and $L(Sta)$ for the Stackelberg one. Thus, in this second case the two types would receive the same feedback.

Summarizing, the Stackelberg attacker expert always incurs a loss greater than or equal to the one of the stochastic expert, even if the real attacker is Stackelberg. Thus, with a probability greater than 0.5 we incur a loss of $\Delta L$ for the entire horizon, with a total regret proportional to $\Delta L \, N$.
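The two-expert construction in the proof can be checked numerically. The sketch below is only an illustration and not part of the original argument: the values `v_max = 1.0`, `v_hat = 0.5` and `loss_sta = 0.25` are hypothetical, the function names are made up for this example, and uniform random tie-breaking in Follow the Leader is an assumption. The feedback given to the two experts is copied from the two cases described in the proof (zero loss for the stochastic expert when the best response to Sta is played, equal losses otherwise).

```python
import random


def alpha_from_values(v_max, loss_sta):
    """Return alpha such that (1 - alpha) * v_max equals loss_sta."""
    return (v_max - loss_sta) / v_max


def expert_feedback(followed, loss_sta):
    """Losses fed back to the two experts, as described in the proof.

    followed: the expert whose best response the defender commits to.
    """
    if followed == "Sta":
        # Playing sigma_D(Sta): zero loss for the Sto expert, L(Sta) for Sta.
        return {"Sta": loss_sta, "Sto": 0.0}
    # Playing sigma_D(Sto): both experts receive L(Sta).
    return {"Sta": loss_sta, "Sto": loss_sta}


def run_ftl(n_rounds, loss_sta, seed=0):
    """Follow the Leader over two experts (ties broken uniformly at random,
    which is an assumption of this sketch)."""
    rng = random.Random(seed)
    cumulative = {"Sta": 0.0, "Sto": 0.0}
    choices = []
    for _ in range(n_rounds):
        best = min(cumulative.values())
        leaders = [e for e in ("Sta", "Sto") if cumulative[e] == best]
        followed = rng.choice(leaders)
        choices.append(followed)
        feedback = expert_feedback(followed, loss_sta)
        for expert in cumulative:
            cumulative[expert] += feedback[expert]
    return choices, cumulative


# Hypothetical instance: values chosen so that alpha * v_hat > (1 - alpha) * v_max.
v_max, v_hat, loss_sta = 1.0, 0.5, 0.25
alpha = alpha_from_values(v_max, loss_sta)
choices, cumulative = run_ftl(n_rounds=1000, loss_sta=loss_sta, seed=1)
wrong_rounds = choices.count("Sto")  # the true attacker is Sta
print(f"alpha = {alpha}, rounds following the wrong expert: {wrong_rounds}")
```

Under these assumptions the Sta expert can never have a strictly smaller cumulative loss than the Sto expert, so the number of rounds spent following the wrong expert grows linearly with the horizon.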
Even by resorting to randomization, i.e., even by adopting FPL, we would select the wrong option with a probability of at least $0.5 - \varepsilon$ ($\varepsilon$ being the probability with which FPL chooses a suboptimal option); thus, the FPL algorithm would also incur a linear regret over the time horizon.

Theorem 3 (pseudo-regret upper bound). Given an instance of the I-SG problem s.t. $\Delta b_k > 0$ for each $A_k \in \mathcal{A}$ and applying the algorithm, the defender incurs a pseudo-regret of:

$$R_N(U) \le \sum_{k=1}^{K} \frac{2 (\lambda_k^2 + \lambda_{k^*}^2) \, \Delta L_k}{(\Delta b_k)^2},$$

where

$$\lambda_k := \max_{m \in M} \max_{\sigma \in S} \ln(\sigma_{A_k}(\sigma)_m) - \min_{m \in M} \min_{\sigma \in S} \ln(\sigma_{A_k}(\sigma)_m) \, I\{\sigma_{A_k}(\sigma)_m > 0\}$$

is the range where the logarithm of the belief realizations lies (excluding realizations equal to zero, which end the exploration of a profile), $\lambda_{k^*}$ is the same quantity for the real attacker, and $S := \sigma_D(\mathcal{A})$ is the set of the available best responses to the attackers' profiles.

Proof. Let us analyze the regret of the algorithm. We get some regret if the algorithm selects a strategy profile corresponding to a type different from the real one. Thus, the regret is upper bounded by:

$$R_N(U) = E\left[\sum_{n=1}^{N} l_n\right] - L N = E\left[\sum_{n=1}^{N} (l_n - L)\right] = \sum_{k=1}^{K} \Delta L_k \, E[T_k(N)],$$

where we recall that:
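$T_k(N) = \sum_{n=1}^{N} I\{A_{j_n} = A_k\}$ is the number of times we played the best response $\sigma_D(A_k)$ to attacker $A_k$;
$\Delta L_k = \sum_{m=1}^{M} \sigma_A(\sigma_D(A_k))_m \, v_m \, (1 - \sigma_D(A_k)_m) - L$ is the expected regret of playing the best response to attacker $A_k$ when the real attacker is $A$.

In each round in which the algorithm selects a profile whose best response is not equal to the one of $A$, we get some regret. Let us define the variables $B_{k,t}$ and $B_{k^*,t}$, denoting the belief we have for the possible attacker $A_k$ and for the real attacker $A = A_{k^*}$, respectively, of the action played by the real attacker $A$ at turn $t$. Moreover, let $b_{k,j,t} := E_{\sigma_D(A_j)}[B_{k,t}]$ be the expected value of the belief we get for attacker $A_k$ when we are best responding to $A_j$ and the true type is $A \in \mathcal{A}$ at round $t$. Note that $b_{k,j,t} < b_{k^*,j,t}$, $\forall j$, since $\Delta b_k$ is positive. For each profile $A_k \in \mathcal{A}$, we have:

$$
\begin{aligned}
E[T_k(N)] &\le \sum_{n=1}^{N} E\left[ I\left\{ \prod_{t=1}^{n} B_{k,t} \ge \prod_{t=1}^{n} B_{k^*,t} \right\} \right] && (7) \\
&= \sum_{n=1}^{N} E\left[ I\left\{ \sum_{t=1}^{n} \ln(B_{k,t}) \ge \sum_{t=1}^{n} \ln(B_{k^*,t}) \right\} \right] && (8) \\
&= \sum_{n=1}^{N} P\left( \sum_{t=1}^{n} \ln(B_{k,t}) \ge \sum_{t=1}^{n} \ln(B_{k^*,t}) \right) && (9) \\
&= \sum_{n=1}^{N} P\Bigg( \sum_{t=1}^{n} \left[ \ln(B_{k,t}) - \ln(b_{k,j_t,t}) \right] \ge \sum_{t=1}^{n} \left[ \ln(B_{k^*,t}) - \ln(b_{k^*,j_t,t}) \right] + \underbrace{\sum_{t=1}^{n} \left[ \ln(b_{k^*,j_t,t}) - \ln(b_{k,j_t,t}) \right]}_{\ge \, n \Delta b_k} \Bigg) && (10) \\
&\le \underbrace{\sum_{n=1}^{N} P\left( \sum_{t=1}^{n} \left[ \ln(B_{k,t}) - \ln(b_{k,j_t,t}) \right] \ge \frac{n \Delta b_k}{2} \right)}_{R_1} && (11) \\
&\quad + \underbrace{\sum_{n=1}^{N} P\left( \sum_{t=1}^{n} \left[ \ln(B_{k^*,t}) - \ln(b_{k^*,j_t,t}) \right] \le -\frac{n \Delta b_k}{2} \right)}_{R_2}, && (12)
\end{aligned}
$$

where $j_t$ is the index of the attacker $A_{j_t}$ we selected at round $t$ and we defined $\Delta b_k := \min_{j | A_j \in \mathcal{A}} \left[ \ln(b_{k^*,j,t}) - \ln(b_{k,j,t}) \right]$, i.e., the minimum w.r.t. the best response for the available attackers of the difference between the expected value of the log-likelihood of attacker $A_k$ and $A$ if the true profile is $A$. Equation (9) has been obtained from Equation (8) since $E[I\{\cdot\}] = P(\cdot)$, while Equation (10) has been computed from Equation (9) by subtracting $\sum_{t} \ln(b_{k,j_t,t})$ from both the l.h.s. and the r.h.s. of the inequality and by adding and subtracting $\sum_{t} \ln(b_{k^*,j_t,t})$ on the r.h.s. Equations (11) and (12) follow from a union bound: the event in Equation (10) requires that at least one of the two centered sums deviates by $n \Delta b_k / 2$. We would like to point out that $\Delta b_k$ does not depend on $t$ since the distribution of $B_{k,t}$ and $B_{k^*,t}$ is the same over rounds.

Let us focus on $R_1$. We use the McDiarmid inequality (McDiarmid, 1989) to bound the probability that the empirical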
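estimate of the log-likelihood expected value is higher than a certain upper bound as follows:

$$R_1 = \sum_{n=1}^{N} P\left( \sum_{t=1}^{n} \left[ \ln(B_{k,t}) - \ln(b_{k,j_t,t}) \right] \ge \frac{n \Delta b_k}{2} \right) \le \sum_{n=1}^{N} \exp\left\{ -\frac{n (\Delta b_k)^2}{2 \lambda_k^2} \right\} \le \frac{2 \lambda_k^2}{(\Delta b_k)^2},$$

where we exploited $\sum_{x=1}^{\infty} e^{-\kappa x} \le \frac{1}{\kappa}$. We define

$$\lambda_k := \max_{m \in M} \max_{\sigma \in S} \ln(\sigma_{A_k}(\sigma)_m) - \min_{m \in M} \min_{\sigma \in S} \ln(\sigma_{A_k}(\sigma)_m) \, I\{\sigma_{A_k}(\sigma)_m > 0\}$$

as the range where the logarithm of the belief realizations lies (excluding realizations equal to zero, which end the exploration of a profile), where we used the fact that $E[B_{k,t}] = b_{k,j_t,t}$, $\forall t$, and $S := \sigma_D(\mathcal{A})$ is the set of the available best responses to the attackers' profiles. A similar reasoning can be applied to $R_2$, getting an upper bound of the following form:

$$R_2 \le \frac{2 \lambda_{k^*}^2}{(\Delta b_k)^2},$$

where $\lambda_{k^*}$ is the corresponding range for the real attacker. The regret becomes:

$$R_N(U) = \sum_{k=1}^{K} \Delta L_k \, E[T_k(N)] \le \sum_{k=1}^{K} \Delta L_k \left( \frac{2 \lambda_k^2}{(\Delta b_k)^2} + \frac{2 \lambda_{k^*}^2}{(\Delta b_k)^2} \right) = \sum_{k=1}^{K} \frac{2 (\lambda_k^2 + \lambda_{k^*}^2) \, \Delta L_k}{(\Delta b_k)^2},$$

which concludes the proof.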
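To make the final bound concrete, the following Python sketch evaluates it on made-up numbers and numerically checks the geometric-series inequality used in the proof. It assumes the bound has the form $\sum_k 2(\lambda_k^2 + \lambda_{k^*}^2) \Delta L_k / (\Delta b_k)^2$, with $\lambda_{k^*}$ the range for the real attacker; the function names and all input values are hypothetical and serve only as an illustration.

```python
import math


def geometric_tail_sum(kappa, terms=10_000):
    """Truncated sum of exp(-kappa * x) for x = 1..terms."""
    return sum(math.exp(-kappa * x) for x in range(1, terms + 1))


def regret_bound(lambdas, lambda_true, delta_L, delta_b):
    """Evaluate sum_k 2 * (lambda_k^2 + lambda_true^2) * dL_k / db_k^2.

    lambdas, delta_L, delta_b: per-profile sequences (one entry per A_k).
    lambda_true: log-belief range of the real attacker (assumed notation).
    """
    total = 0.0
    for lam_k, dL_k, db_k in zip(lambdas, delta_L, delta_b):
        if db_k <= 0:
            # The bound is stated only for strictly positive gaps.
            raise ValueError("delta_b entries must be strictly positive")
        total += 2.0 * (lam_k ** 2 + lambda_true ** 2) * dL_k / db_k ** 2
    return total


# Hypothetical two-profile instance.
bound = regret_bound(
    lambdas=[1.0, 2.0],
    lambda_true=1.0,
    delta_L=[0.5, 0.25],
    delta_b=[0.5, 1.0],
)
print(bound)

# Sanity check of sum_{x>=1} exp(-kappa x) <= 1 / kappa for one kappa.
kappa = 0.5
print(geometric_tail_sum(kappa) <= 1.0 / kappa)
```

Note that the bound does not depend on the horizon $N$: smaller gaps $\Delta b_k$ make a profile harder to discard and inflate its term quadratically.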
B ADDITIONAL RESULTS

For the sake of completeness, we report in Figures 8 and 9 all the graphs regarding the regret for all the running configurations $C_1, \ldots, C_7$ and for the two dimensions of the target space, namely $M \in \{5, 10\}$. These additional figures are in line with what has been presented in Section 6 of the main paper, where the proposed techniques are able to outperform the literature methods. Even here, neither of the proposed methods provides statistical evidence that it is able to outperform the other. Moreover, we also provide in Figure 10 the results for configuration $C_6$ with a number of targets $M = 40$. In this configuration, we were able to run only one algorithm, due to computational time constraints. The results show that its performance is similar to the one observed with smaller target spaces, thus it is able to scale without significant loss in terms of expected pseudo-regret $R_N(U)$.

Figure 8: Expected pseudo-regret for the different configurations with M = 5 targets. Panels (a) to (g): Configurations $C_1$ to $C_7$; vertical axis: $R(U)$.
Figure 9: Expected pseudo-regret for the different configurations with M = 10 targets. Panels (a) to (g): Configurations $C_1$ to $C_7$; vertical axis: $R(U)$.
Figure 10: Expected pseudo-regret for the configuration $C_6$ with M = 40 targets. Vertical axis: $R(U)$.
Table 4: Computational time in seconds needed by the two algorithms to solve an instance over N = 1000 rounds and the corresponding 95% confidence intervals.

M = 40   M = 20   M = 10   M = 5

C_1 | C_2 | C_3 | C_4 | C_5 | C_6 | C_7
5.9 ± 1.7 | 11.1 ±. | 11.7 ±.9 | 3.5 ± 1.0 | 3.7 ±.4 | 14.9 ± 4.3 | 14.7 ± 3.
77.0 ±.1 | 11.1 ± 3. | 170.4 ± 4.1 | 146. ± 4.7 | 651.7 ± 36.6 | 109. ± 64.7 | 1113.7 ± 40.
10.3 ±.6 | 1.9 ± 13. | 3.0 ± 17.9 | 7.1 ±.3 | 63.0 ± 7.4 | 47. ± 14.05 | 48.59 ± 13.48
356.1 ± 14.3 | 678.5 ± 15.9 | 887.0 ± 11.1 | 960.4 ± 13.0 | 440.5 ± 14. | 756.5 ± 189.9 | 791.6 ± 3.7
33.5 ± 3.0 | . ± 16.9 | 137.8 ± 77.6 | 33.7 ± 1. | 484.5 ± 107.7 | 6.8 ± 45.3 | 9.5 ± 46.44
104.5 ± 7.1 | 061.5 ± 837. | 141.0 ± 81.1 | 18.9 ± 16.5 | 347.9 ± 13. | 1634. ± 487.6 | 1643.6 ± 468.8

We also report here Table 4, the full version of Table 3, with the time values up to the first decimal and also specifying the confidence interval.