Supplementary Material for Combinatorial Partial Monitoring Game with Linear Feedback and Its Application. A. Full proof for Theorems 4.1 and 4.2
Recall the following problem-specific constants from the main text: the size |σ| of the global observer set σ; the parameter L > 0 from the Lipschitz continuity assumption; the error bound β_σ = max ‖(M_σ^⊤ M_σ)^{-1} Σ_{i=1}^{|σ|} M_{x_i}^⊤ M_{x_i}(ν_i − ν_0)‖, where the maximum is taken over ν_0, ν_1, …, ν_{|σ|} ∈ [0,1]^n; and the maximum difference in expected reward R_max = max_{x_1, x_2 ∈ X, ν ∈ [0,1]^n} (r(x_1, ν) − r(x_2, ν)). For technical reasons, we also defined φ(ν) = max(min(ν, 1), 0) (entrywise) to adjust ν to the nearest vector in [0,1]^n, and set r(x, ν) := r(x, φ(ν)) for ν ∈ R^n \ [0,1]^n to preserve the Lipschitz continuity throughout R^n.

To make our proof clearer, we define v(t) as the state of any variable v by the end of time step t. Our analysis is based on the snapshot of all variables just before the statement t ← t + 1 in Algorithm GCB. One batch of plays in the exploration phase is called one round, after which n_σ is increased by 1. Denote by ν̂^{(j)} the estimated mean of outcomes after j rounds of exploration. For example, at time t, the estimated mean of outcomes is ν̂(t) and the exploration counter is n_σ(t), so we have ν̂^{(n_σ(t))} = ν̂(t). At time step t + 1, the player uses the previous knowledge ν̂(t) to get x̂(t + 1) = argmax_{x ∈ X} r(x, ν̂(t)) and x̂′(t + 1) = argmax_{x ∈ X \ {x̂(t+1)}} r(x, ν̂(t)).

In the following analysis, the frequency function is set to f_X(t) = ln t + ln |X|. Using f_X(t), we can construct the confidence interval √(α f_X(t) / n_σ) to eliminate failures with high probability. Define N(t) = |X| t, which will be used frequently in our analysis; then exp{−f_X(t)} ≤ N(t)^{-1}. Let α = 8L²β_σ²/a, where a > 0 is a parameter to be tuned later. The symbols used in the proof are listed in Table 1, and, to facilitate the understanding of our proof, each lemma is briefly summarized in Table 2.
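To make the estimation step concrete, the following sketch shows the clipping map φ and the pseudo-inverse recovery ṽ = M_σ⁺ y used throughout the analysis. The toy stacked matrix and all sizes here are illustrative assumptions, not from the paper:

```python
import numpy as np

# Hypothetical sizes: n outcome entries; a toy stacked matrix M_sigma
# standing in for the global observer set's matrix (full column rank).
rng = np.random.default_rng(0)
n = 4
M_sigma = np.vstack([np.eye(n), np.ones((1, n))])  # 5 x 4, rank n

def clip_phi(nu):
    """phi(nu) = max(min(nu, 1), 0): project nu onto [0, 1]^n entrywise."""
    return np.clip(nu, 0.0, 1.0)

def estimate_outcome(M_sigma, y):
    """Recover an outcome estimate via the Moore-Penrose pseudo-inverse:
    v_tilde = M_sigma^+ y, as in the inversion step of the analysis."""
    return np.linalg.pinv(M_sigma) @ y

nu = rng.uniform(0, 1, size=n)          # true mean outcome vector in [0,1]^n
y = M_sigma @ nu                        # noiseless stacked feedback
v_tilde = estimate_outcome(M_sigma, y)  # exact recovery when noiseless
assert np.allclose(v_tilde, nu)
```

With noisy feedback the recovery is only approximate, which is exactly why the concentration argument below is needed.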
Below we give an outline of the complete proof, which consists of three main parts.

First, we introduce some general insights, present the preliminaries, and prove basic properties of our model and of Algorithm GCB; these are shared by the proofs of both the distribution-independent and the distribution-dependent regret bounds. We obtain the concentration property of the empirical mean of outcomes via the Azuma-Hoeffding inequality (Fact A.2) in Lemma A.3. Lemma A.13 shows that our algorithm bounds the number of exploration rounds by O(T^{2/3} log^{1/3} T), which implies that the algorithm will not play exploration for too long. In Lemma A.14, we prove that when the gap between the estimated optimal action x̂ and the estimated second-best action x̂′ is large (i.e., the first condition in Line 8 holds), the estimated optimal action is sub-optimal only with low probability. This means that our global confidence bound excludes sub-optimal actions effectively.

In Section A.1, we prove the distribution-independent regret bound of O(T^{2/3} log T) (Theorem 4.1 in the main text). In Lemma A.15, we show that the probability of a large gap between the estimated optimal action x̂ and the real one x* decays exponentially with the number of exploration rounds. The penalty in the exploitation phase is then derived in Lemma A.16. We then use Lemmas A.13 and A.16 to prove Theorem 4.1 in the main text; hence the distribution-independent bound O(T^{2/3} log T) is achieved.

In Section A.2, we prove the distribution-dependent bound of O(log T) under the predetermined distribution p, assuming that the optimal action x* is unique (Theorem 4.2 in the main text). First, we show in Lemma A.17 that, when the algorithm has played Ω(ln t + ln |X|) rounds of exploration, the probability of the estimated optimal action x̂ being sub-optimal is low. Then, in Lemma A.18, we combine the results of Lemmas A.14 and A.17 and show that the algorithm exploits a sub-optimal action only with low probability. Thus, Lemma A.18 is enough to bound the regret of exploitation. Next, we bound the regret of exploration by bounding the number of exploration rounds to O(ln T + ln |X|) in Lemma A.22. This is done by showing that whenever the algorithm has conducted Θ(ln t + ln |X|) rounds of exploration, with high probability it switches to exploitation (Lemmas A.19 and A.20), and then aggregating multiple switches between exploration and exploitation in the proof of Lemma A.22. Finally, we combine Lemmas A.18 and A.22 to prove Theorem 4.2 in the main text.

Fact A.1. The following probability laws will be used in the analysis.
Symbols in the main text:
v(t): state of any variable v by the end of time step t
v ∈ [0,1]^n: outcomes of the environment
ṽ ∈ [0,1]^n: estimation of outcomes through inversion
ν: mean of outcomes
ν̂: empirical mean of outcomes
x ∈ X: action x in the action set X
r(x, v), r̄(x, ν): reward function taking x and v, and expected reward function taking x and ν
M_x ∈ R^{m_x × n}: transformation matrix of action x, where m_x depends on x
y(t) ∈ R^{m_{x(t)}}: feedback vector under the choice of x(t)
y: vector that stacks feedbacks from different times
Δ_x, Δ_max, Δ_min: reward gap of action x, the maximum gap, and the minimum (positive) gap
σ ⊆ X: global observer set of actions
L: Lipschitz constant
β_σ: distribution-independent error bound from σ
R_max: distribution-independent largest gap of expected reward
f_X(t): frequency function
x*: real optimal action
x̂(t): estimated optimal action at time t
x̂′(t): estimated second optimal action at time t

Symbols in the proof:
n_σ: exploration counter
µ(t), η(t): threshold functions
X_Good, X_Bad: good action set and bad action set
F_Good, F_Bad: events of choosing an action from the good set or from the bad set
L_CI, L^c_CI: event that the estimated gap is larger than the confidence interval, and its complement
E_Explore, E_InExplore, E_FinExplore: events of starting exploration, being in the middle of it, and being at its end
E_Exploit: event of exploitation
δ_{x_i,x_j}, δ̂_{x_i,x_j}(t): reward gap between actions x_i and x_j, and its estimated value at time t
G_k: event indicating the first occurrence of k rounds of exploration

Table 1. List of symbols in the proof.

Succinct interpretation of the results, with dependencies in parentheses:
Lemma A.3: Estimate of outcomes concentrates around the mean. (Fact A.2)
Lemma A.7: Difference of the real and estimated gaps is bounded.
Lemma A.8: Estimation error of outcomes is small compared to the confidence interval. (Lemma A.3)
Lemma A.13: The counter of exploration is bounded within O(T^{2/3} log^{1/3} T).
Lemma A.14: Choosing a bad action that passes the confidence-interval test occurs rarely. (Lemma A.8)
Lemma A.15: Incurring a large penalty for the current optimal action is rare. (Lemma A.3)
Lemma A.16: The penalty in the exploitation phase is bounded. (Lemmas A.14, A.15)
Theorem 4.1: Distribution-independent bound: O(T^{2/3} log T). (Lemmas A.13, A.16)
Lemma A.17: With enough exploration, finding a bad action is rare. (Lemma A.3)
Lemma A.18: Finding a bad action and exploiting it becomes rare as time elapses. (Lemmas A.14, A.17)
Lemma A.19: With enough exploration, finding a good action but still exploring becomes rare. (Lemmas A.3, A.7, A.13)
Lemma A.20: Once the algorithm performs enough exploration, it switches to exploitation. (Lemmas A.17, A.19)
Lemma A.22: Exploration rounds are bounded. (Lemma A.20)
Theorem 4.2: Distribution-dependent bound: O(log T). (Lemmas A.18, A.22)

Table 2. List of lemmas and their dependencies in the proof.
Law of Conditional Probability: Pr[A | B] = Pr[A ∩ B] / Pr[B]. Law of Total Probability: if {B_n : n = 1, 2, …} is a set of disjoint events whose union is the entire sample space, then Pr[A] = Σ_n Pr[A ∩ B_n].

Fact A.2 (Azuma-Hoeffding Inequality in Euclidean Space (Theorem 1.8 of (Hayes, 2003))). Let X = (X_0, …, X_n) be a very-weak martingale, that is, E[X_i | X_{i−1}] = X_{i−1} for every i, taking values in Euclidean space, X_i ∈ R^d for every i. Suppose X_0 = 0 and ‖X_i − X_{i−1}‖ ≤ 1 for i = 1, …, n. Then, for every ε > 0,

Pr[‖X_n‖ ≥ ε] < 2e^{1 − (ε−1)²/(2n)} < 2e · e^{−ε²/(2n)}. (1)

We can use the preceding fact to obtain the concentration property of outcomes during exploration.

Lemma A.3 (Concentration during exploration). After exploration rounds i = 1, 2, …, j, finishing at times t_1, t_2, …, t_j respectively, we use the inverse to get ṽ_i = I(M_σ, y(t_i)) = M_σ⁺ y_i, and their mean is ν̂^{(j)} = (1/j) Σ_{i=1}^j ṽ_i. Then, for all γ > 0:

Pr[‖ν − ν̂^{(j)}‖ ≥ γ] ≤ 2e · exp{−γ² j / (2β_σ²)}. (2)

Proof. For each i, let X_i be the partial sum X_i = Σ_{l=1}^i (ν − ṽ_l)/β_σ, where E[ν − ṽ_l] = 0 and ‖ν − ṽ_i‖/β_σ ≤ 1. So X_i − X_{i−1} = (ν − ṽ_i)/β_σ implies ‖X_i − X_{i−1}‖ ≤ 1. Moreover, ṽ_i is independent of the previous inverses ṽ_1, …, ṽ_{i−1}, so it holds that

E[X_i | X_{i−1}] = X_{i−1} + E[(ν − ṽ_i)/β_σ] = X_{i−1}. (3)−(5)

Therefore, X = (X_0, …, X_n) satisfies the definition of a very-weak martingale. Applying Fact A.2 yields, for every ε > 0, Pr[‖X_j‖ ≥ ε] < 2e · e^{−ε²/(2j)}. Let γ = εβ_σ/j; since ‖ν − ν̂^{(j)}‖ = (β_σ/j)‖X_j‖, we get, for all γ > 0,

Pr[‖ν − ν̂^{(j)}‖ ≥ γ] < 2e · exp{−γ² j / (2β_σ²)}. (6)

Under a predetermined outcome distribution p with mean outcome vector ν and x* = argmax_{x ∈ X} r(x, ν), in the main text we define the gaps:

Δ_x = r(x*, ν) − r(x, ν), (7)
Δ_max = max{Δ_x : x ∈ X}, (8)
Δ_min = min{Δ_x : x ∈ X, Δ_x > 0}. (9)

Definition A.4 (Good actions / bad actions). Based on the distance to the optimal action, define good actions and bad actions as:

X_Good = {x : x ∈ X, Δ_x = 0}, (10)
X_Bad = {x : x ∈ X, Δ_x > 0}. (11)

Therefore, X = X_Good ∪ X_Bad. Moreover, x* ∈ X_Good. (x* is unique if and only if |X_Good| = 1.)
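As a sanity check on Lemma A.3 (a toy Monte Carlo simulation, not a proof), one can compare the empirical frequency of large deviations of ν̂^{(j)} against the bound 2e · exp{−γ²j/(2β_σ²)}. The distribution and the crude choice of β_σ below are illustrative assumptions:

```python
import numpy as np

# Toy check of Lemma A.3: the empirical mean of bounded, unbiased vector
# estimates stays within gamma of nu, with failure probability at most
# about 2e * exp(-gamma^2 j / (2 beta^2)).
rng = np.random.default_rng(1)
n, j, trials = 3, 200, 2000
nu = np.full(n, 0.5)
beta = np.sqrt(n)                 # crude bound on ||nu - v_tilde_i||_2 here
gamma = 0.25

deviations = np.empty(trials)
for k in range(trials):
    v_tilde = rng.uniform(0, 1, size=(j, n))   # unbiased estimates of nu
    deviations[k] = np.linalg.norm(v_tilde.mean(axis=0) - nu)

empirical = np.mean(deviations >= gamma)
azuma_bound = 2 * np.e * np.exp(-gamma**2 * j / (2 * beta**2))
assert empirical <= azuma_bound   # the bound should dominate the frequency
```

In this toy setting the empirical failure frequency is essentially zero, well below the (loose) Azuma-Hoeffding bound.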
Definition A.5 (Events of finding a good action / bad action). Define x̂(t) = argmax_{x ∈ X} r(x, ν̂(t−1)) as the current optimal action at time t. Let F_Bad(t) be the event of failing to choose the optimal action at time t. Formally, F_Bad(t) and its complement event are:

F_Bad(t) = {x̂(t) ∈ X_Bad}, (12)
F_Good(t) = {x̂(t) ∈ X_Good}. (13)

To build the connection with the exploration round j, we define the time-invariant event F_Bad^{(j)} as the event in which the algorithm fails to choose the optimal action after j rounds of exploration:

F_Bad^{(j)} = {x̂^{(j)} ∈ X_Bad}, (14)
F_Good^{(j)} = {x̂^{(j)} ∈ X_Good}, (15)

where x̂^{(j)} = argmax_{x ∈ X} r(x, ν̂^{(j)}). By definition, it is always true that F_Bad^{(n_σ(t−1))} = F_Bad(t) and F_Good^{(n_σ(t−1))} = F_Good(t).

Definition A.6 (Estimated gap and real gap). For any pair of actions x_i, x_j ∈ X, define the gap of estimated reward between x_i and x_j as δ̂_{x_i,x_j}(t) = r(x_i, ν̂(t−1)) − r(x_j, ν̂(t−1)), and the gap of real reward between them as δ_{x_i,x_j} = r(x_i, ν) − r(x_j, ν).

Lemma A.7 (Bound of the gap). For any pair of actions x_i, x_j ∈ X, the following inequality holds over time t:

|δ̂_{x_i,x_j}(t) − δ_{x_i,x_j}| ≤ 2L‖ν − ν̂(t−1)‖. (16)

Proof.

|δ̂_{x_i,x_j}(t) − δ_{x_i,x_j}| = |(r(x_i, ν̂(t−1)) − r(x_i, ν)) − (r(x_j, ν̂(t−1)) − r(x_j, ν))| (17)
≤ |r(x_i, ν̂(t−1)) − r(x_i, ν)| + |r(x_j, ν̂(t−1)) − r(x_j, ν)| (18)
≤ L‖ν − ν̂(t−1)‖ + L‖ν − ν̂(t−1)‖ (19)
= 2L‖ν − ν̂(t−1)‖. (20)

Lemma A.8 (Small error in estimation). Given time t, for f_X(t) = ln t + ln |X|, α = 8L²β_σ²/a, and a > 0, for all γ > 0:

Pr[‖ν − ν̂(t−1)‖ ≥ γ √(α f_X(t) / n_σ(t−1))] ≤ (2e/|X|) N(t)^{1 − 4γ²L²/a}. (21)
Proof. Since the number of exploration rounds equals the counter n_σ(t−1) and ν̂(t−1) = ν̂^{(n_σ(t−1))}, we have:

Pr[‖ν − ν̂(t−1)‖ ≥ γ √(α f_X(t) / n_σ(t−1))]
= Σ_{j=1}^{t−1} Pr[‖ν − ν̂^{(n_σ(t−1))}‖ ≥ γ √(α f_X(t) / n_σ(t−1)) ∧ n_σ(t−1) = j] (22)
≤ Σ_{j=1}^{t−1} Pr[‖ν − ν̂^{(j)}‖ ≥ γ √(α f_X(t) / j)] (23)−(24)
≤ Σ_{j=1}^{t−1} 2e · exp{−γ² (α f_X(t)/j) j / (2β_σ²)} {Lemma A.3} (25)−(26)
= 2e (t−1) exp{−γ² α f_X(t) / (2β_σ²)} (27)
= 2e (t−1) N(t)^{−γ²α/(2β_σ²)}. (28)

As α = 8L²β_σ²/a and a > 0, the probability is:

Pr[‖ν − ν̂(t−1)‖ ≥ γ √(α f_X(t) / n_σ(t−1))] ≤ 2e (t−1) N(t)^{−4γ²L²/a} ≤ (2e/|X|) N(t)^{1 − 4γ²L²/a}, (29)

where the last step uses t − 1 ≤ t = N(t)/|X|.
6 Definition A.11. For simplicity, suppose α 8L β σ a, constant a > 0 and θ > 0, then we can define two threshold functions: Note that η(t) and µ(t) are values, not random variables. Proposition A.1. If t > T 0 (1+θa) α η(t) t fx (t) (5) µ(t) (1 + θa) αf X (t). (6), then µ(t) < η(t). (It can be verified by the definition.) Lemma A.1 (Exploration Ceiling). Let α 8L β σ a and a > 0. For any time t, if the exploration counter n σ (t 1) > η(t), the algorithm will play exploitation surely, i.e., Pr E Explore (t) n σ (t 1) > η(t) 0. (7) Proof. If n σ (t 1) > η(t), then Line 8 of Algorithm GCB will be true because of its second condition. According to the algorithm, it will not go to exploration phase, so we know that which restricts n σ (t 1) to no larger than η(t) + 1 at any time t. Pr E Explore (t) n σ (t 1) > η(t) 0, (8) Lemma A.14 (Low failure probability of the confidence interval). Let f X (t) ln t+ ln X, α 8L β σ a and 0 < a 1. For any time t, the probability that both choosing bad action and the gap is larger than confidence interval satisfies: Pr L CI (t) F Bad (t) e X N(t). (9) Proof. The definition of F Bad (t) {ˆx(t) X Bad } implies x X \ {ˆx(t)}. Their gap is ˆδˆx(t),x (t) r(ˆx(t), ˆν(t 1)) r(x, ˆν(t 1)) (40) r(ˆx(t), ˆν(t 1)) r(x, ˆν(t 1)) + r(x, ν) r(ˆx(t), ν) {Definition of x } (41) r(ˆx(t), ˆν(t 1)) r(ˆx(t), ν) + r(x, ν) r(x, ˆν(t 1)) (4) L ν ˆν(t 1). (4) Thus, we can write the probability as: Pr L CI (t) F Bad (t) αf Pr x X \ {ˆx(t)}, ˆδˆx(t),x X (t) n σ (t 1) F Bad (t) αf X (t) Pr ˆδˆx(t),x n σ (t 1) F Bad (t) αf X (t) Pr ˆδˆx(t),x n σ (t 1) Pr ν ˆν (nσ(t 1)) 1 αf X (t) L n σ (t 1) e X N(t)1 1 a {Lemma A.8 with γ 1 L } (49) e X N(t). {0 < a 1 } (50) (44) (45) (46) (47) (48)
A.1. Distribution-independent bound

Lemma A.15. For any ε > 0, j = 1, 2, …, and t ≥ 1, when the algorithm has played n_σ(t−1) = j rounds of exploration by time t, the probability of incurring penalty Δ_{x̂(t)} ≥ ε satisfies

Pr[Δ_{x̂(t)} ≥ ε | n_σ(t−1) = j] ≤ 2e · exp{−j ε² / (8L²β_σ²)}. (51)

Proof. Δ_{x̂(t)} is the real gap of reward between x* and x̂(t):

Δ_{x̂(t)} = δ_{x*,x̂(t)} ≤ δ_{x*,x̂(t)} + δ̂_{x̂(t),x*}(t) {Definition of x̂(t)} (52)−(53)
= δ_{x*,x̂(t)} − δ̂_{x*,x̂(t)}(t) (54)
≤ |r(x*, ν) − r(x*, ν̂(t−1))| + |r(x̂(t), ν) − r(x̂(t), ν̂(t−1))| (55)
≤ 2L‖ν − ν̂(t−1)‖ = 2L‖ν − ν̂^{(n_σ(t−1))}‖. (56)

When n_σ(t−1) = j, we conclude that the probability of incurring a large penalty is:

Pr[Δ_{x̂(t)} ≥ ε | n_σ(t−1) = j] ≤ Pr[2L‖ν − ν̂^{(n_σ(t−1))}‖ ≥ ε | n_σ(t−1) = j] (57)
= Pr[‖ν − ν̂^{(j)}‖ ≥ ε/(2L)] (58)
≤ 2e · exp{−j ε² / (8L²β_σ²)}. {Lemma A.3} (59)

In Algorithm GCB, exploitation is penalized with respect to the regret only if the algorithm chooses a bad action and exploits it simultaneously, i.e., both F_Bad(t) and E_Exploit(t) hold. When the algorithm chooses exploitation at time t, the regret at that time is E[Δ_{x̂(t)} · I(F_Bad(t) ∧ E_Exploit(t))].

Lemma A.16 (Penalty of exploitation). For any ε > 0, for Algorithm GCB with f_X(t) = ln t + ln |X|, α = 8L²β_σ²/a, 0 < a ≤ 1/3, and η(t) = t^{2/3} f_X(t)^{1/3}, the penalty in the exploitation phase at time t satisfies, in expectation:

E[Δ_{x̂(t)} I(F_Bad(t) ∧ E_Exploit(t))] ≤ ε + Δ_max (2e/(|X| N(t)²) + 2e · exp{−η(t) ε² / (8L²β_σ²)}). (60)

Proof. For any ε > 0, the expectation satisfies:

E[Δ_{x̂(t)} I(F_Bad(t) ∧ E_Exploit(t))] (61)
= E[Δ_{x̂(t)} | F_Bad(t) ∧ E_Exploit(t)] Pr[F_Bad(t) ∧ E_Exploit(t)] (62)
≤ ε Pr[Δ_{x̂(t)} < ε ∧ F_Bad(t) ∧ E_Exploit(t)] + Δ_max Pr[Δ_{x̂(t)} ≥ ε ∧ F_Bad(t) ∧ E_Exploit(t)] (63)−(64)
≤ ε + Δ_max Pr[Δ_{x̂(t)} ≥ ε ∧ F_Bad(t) ∧ E_Exploit(t)]. (65)

By definition, the exploitation event E_Exploit(t) = {L_CI(t) ∨ n_σ(t−1) > η(t)} happens when no other action lies within the confidence-interval gap (the event L_CI(t)) or when the counter satisfies n_σ(t−1) > η(t).
And we know that n σ (t) is no larger than η(t) + 1, because it is a hard constraint implied
8 by Lemma A.1. Therefore, the probability in the second term is the joint of these two events: Pr ˆx(t) ɛ F Bad (t) E Exploit (t) (66) Pr ˆx(t) ɛ F Bad (t) (L CI (t) n σ (t 1) > η(t)) (67) Pr ˆx(t) ɛ F Bad (t) L CI (t) n σ (t 1) η(t) + Pr ˆx(t) ɛ F Bad (t) n σ (t 1) > η(t) (68) Pr F Bad (t) L CI (t) + Pr ˆx(t) ɛ F Bad (t) n σ (t 1) η(t) + 1 (69) } e X N(t) + e η(t) ɛ exp { 8L βσ. {Lemma A.14 and A.15} (70) Therefore, we have E ˆx(t) I F Bad (t) E Exploit (t) ( }) e ɛ + max X N(t) + e η(t) ɛ exp { 8L βσ. (71) Theorem 4.1 (in the main text): (Distribution-independent bound). Let f X (t) ln t + ln X, and α 4L βσ. The distribution-independent regret bound of Algorithm GCB is: ) R(T ) R max σ T 8 (ln T + ln X ) + LT + Rmax ( σ + 4e X 4. (7) Proof. From the algorithm, we know that it either plays actions in the exploration phase or in the exploitation phase. The exploration phase will take time σ to finish, and its penalty is x σ x. And the penalty of playing exploitation is ˆx(t) at each time step t. R(T ) x E n σ (T ) + x σ T E ˆx(t) I F Bad (t) E Exploit (t). (7) From Lemma A.1, we can infer that if the exploration counter n σ (t) > η(t) t f X (t), it will no longer play exploration. Therefore, the expected number of rounds of exploration satisfies E n σ (T ) T f X (T ) + 1, so the regret for exploration is x E n σ (T ) ( ) x T fx (T ) + 1. (74) x σ Let ɛ 4L t 1, then η(t) t f X (t) and η(t)ɛ 8L βσ exploitation part: Therefore, we will have R(T ) x σ f X (t). Therefore, we can apply Lemma A.16 to get the regret of T E ˆx(t) I F Bad (t) E Exploit (t) (75) T ( }) e ɛ + max X N(t) + e η(t) ɛ exp { 8L β (76) σ T ( ) 4L t 1 e + max X + e N(t) (77) 8 ( ) LT e 1 + max X + e X 4. (78) x σ ( x T 8 (ln T + ln X ) + LT + x σ ) x + 4e X 4 max. (79)
As Δ_x and Δ_max are bounded by R_max under any distribution, substituting R_max into the bound above yields (80), the statement of Theorem 4.1.

A.2. Distribution-dependent Bound

Under a predetermined outcome distribution p, the minimum gap between the optimal action and any sub-optimal action is Δ_min. It follows that:

Lemma A.17 (Condition of choosing the optimal action). Suppose we have played j rounds of exploration by time t. If b ≥ 1 and j ≥ 8bL²β_σ² f_X(t) / Δ²_min, Algorithm GCB chooses the optimal action with high probability:

∀ j ≥ 8bL²β_σ² f_X(t) / Δ²_min: Pr[F_Bad^{(j)}] ≤ e|X| / (t N(t)^{b−1}). (81)

Proof. According to the definition, F_Bad^{(j)} occurs only if some sub-optimal action has the largest estimated reward.

Pr[F_Bad^{(j)}] (82)
≤ Pr[∃x_b ∈ X_Bad, ∃x_g ∈ X_Good: r(x_g, ν̂^{(j)}) ≤ r(x_b, ν̂^{(j)})] (83)−(84)
≤ Σ_{x_b ∈ X_Bad} Σ_{x_g ∈ X_Good} Pr[r(x_g, ν̂^{(j)}) − r(x_b, ν̂^{(j)}) ≤ 0] {Union bound} (85)
≤ Σ_{x_b ∈ X_Bad} Σ_{x_g ∈ X_Good} (Pr[r(x_g, ν) − r(x_g, ν̂^{(j)}) ≥ Δ_min/2] + Pr[r(x_b, ν̂^{(j)}) − r(x_b, ν) ≥ Δ_min/2]) (86)−(87)
≤ Σ_{x_b ∈ X_Bad} Σ_{x_g ∈ X_Good} 2 Pr[L‖ν − ν̂^{(j)}‖ ≥ Δ_min/2]. (88)

Thus, by Lemma A.3, it is

Pr[F_Bad^{(j)}] ≤ Σ_{x_b ∈ X_Bad, x_g ∈ X_Good} 4e · exp{−j Δ²_min / (8L²β_σ²)} (89)−(90)
≤ 4e |X_Bad| |X_Good| exp{−j Δ²_min / (8L²β_σ²)} (91)
≤ e|X|² exp{−j Δ²_min / (8L²β_σ²)}, {|X_Bad| + |X_Good| = |X|} (92)

where we used |X_Bad||X_Good| ≤ |X|²/4. Therefore, if j ≥ 8bL²β_σ² f_X(t) / Δ²_min with b ≥ 1, we can conclude:

Pr[F_Bad^{(j)}] ≤ e|X|² N(t)^{−b} = e|X| / (t N(t)^{b−1}). (93)
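The exploration budget that Lemma A.17 demands, roughly j ≳ 8bL²β_σ² f_X(t) / Δ²_min as read from the constants above, is easy to evaluate numerically. The values below are arbitrary toy choices, not constants from the paper:

```python
import math

# Toy evaluation of the Lemma A.17 condition (constants illustrative):
# after j >= 8 b L^2 beta_sigma^2 f_X(t) / Delta_min^2 rounds, the failure
# probability is at most e|X| / (t N(t)^(b-1)), with N(t) = |X| t.
L_const, beta_sigma, delta_min = 1.0, 2.0, 0.5
num_actions, t, b = 8, 1000, 2
f_x = math.log(t) + math.log(num_actions)       # frequency function f_X(t)
j_min = math.ceil(8 * b * L_const**2 * beta_sigma**2 * f_x / delta_min**2)
N_t = num_actions * t
failure_bound = math.e * num_actions / (t * N_t**(b - 1))
assert j_min > 0
assert failure_bound < 1e-2   # already tiny for these toy values
```

Note how the budget scales with 1/Δ²_min: halving the gap quadruples the required number of exploration rounds, which is the source of the distribution-dependent constant in Theorem 4.2.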
10 Lemma A.18 (Exploit the Optimal Action). Let α 8L β σ a, 0 < a 1 and θ. For any time t > T 0, the probability of F Bad (t) and playing exploitation in Algorithm GCB is: Pr E Exploit (t) F Bad (t) e N(t). (94) Proof. If t > T 0, and E Exploit (t) {L CI (t) n σ (t 1) > η(t)}, we can write the probability of exploitation as: Pr E Exploit (t) F Bad (t) Pr E Exploit (t) F Bad (t) n σ (t 1) > η(t) + Pr E Exploit (t) F Bad (t) n σ (t 1) η(t) (96) Pr F Bad (t) n σ (t 1) > η(t) + Pr L CI (t) F Bad (t) n σ (t 1) η(t) (97) Pr F Bad (t) n σ (t 1) > η(t) + Pr L CI (t) F Bad (t) (98) Pr F Bad (t) n σ (t 1) > η(t) + e X N(t). {Lemma A.14} Since we know that n σ (t 1) > η(t), 0 < a 1 and θ 1, then n σ (t 1) > η(t) > µ(t) By Lemma A.17, the following inequality holds: (1 + θa) a 8L β σf X (t) > 8L βσf X (t). Pr F Bad (t) n σ (t 1) > η(t) (100) t 1 jη(t) t jη(t) t 1 jη(t) t 1 jη(t) Therefore, we can get: Pr F Bad (t) n σ (t 1) j (101) Pr F(n Bad n σ(t 1)) σ(t 1) j Pr e N(t). F Bad (j) (95) (99) (10) (10) e tn(t) {Lemma A.17 with b } (104) (105) Pr E Exploit (t) F Bad (t) e N(t) + e X N(t) e N(t). (106) Lemma A.19 (The exploration probability will drop). Suppose the instance has unique optimal action under distribution p, i.e., X Good 1. Let α 8L β σ a, 0 < a 1. For any time t > T 0, when n σ (t 1) µ(t) (1 + θa) αf X (t) where θ, and the probability of F Good (t) and exploration happening simultaneously is: Pr E Explore (t) F Good (t) n σ (t 1) µ(t) e X N(t). (107) Proof. By definition, the event that exploration happens at time t is E Explore (t) {L c CI (t) n σ η(t)}. When t > T 0, it is true that η(t) > µ(t).
11 On one hand, if n σ (t 1) > η(t), then by Lemma A.1, we know that Pr E Explore (t) n σ (t 1) > η(t) (108) Pr E Explore (t) nσ (t 1) > η(t) Pr n σ (t 1) > η(t) (109) 0. On the other hand, for µ(t) n σ (t 1) η(t), whether to play exploration only depends on the event L c CI (t). If F Good (t) {ˆx(t) X Good} and with the assumption that X Good 1, we know that X Good (X \ {ˆx(t)}). So the gap at time t is, x X \ {ˆx(t)}, ˆδˆx(t),x (t) ˆδ x,x(t) δ x,x ˆδ x,x(t) δ x,x (111) (110) ˆδ x,x(t) δ x,x { is the imum gap} (11) L ν ˆν(t 1). {Lemma A.7} (11) And we also know that if n σ (t 1) µ(t) (1 + θa) αf X (t), αf X (t) n σ (t 1) 1 + θa, (114) thus we can get Let α 8L β σ a Pr E Explore (t) F Good (t) µ(t) n σ (t 1) η(t) (115) αf Pr x X \ {ˆx(t)}, ˆδˆx(t),x X (t) (t) n σ (t 1) F Good (t) µ(t) n σ (t 1) η(t) (116) αf X (t) Pr L ν ˆν(t 1) n σ (t 1) µ(t) n σ(t 1) η(t) (117) Pr L ν ˆν(t 1) 1 + θa µ(t) n σ(t 1) η(t) (118) Pr L ν ˆν (nσ(t 1)) θa 1 + θa µ(t) n σ (t 1) η(t) (119) Pr L ν ˆν (nσ(t 1)) θa 1 + θa µ(t) n σ (t 1) η(t) (10) η(t) Pr L ν ˆν (nσ(t 1)) θa 1 + θa n σ (t 1) j (11) jµ(t) η(t) jµ(t) η(t) jµ(t) Pr L ν ˆν (j) θa 1 + θa n σ (t 1) j (1) Pr L ν ˆν (j) θa 1 + θa. (1), 0 < a 1. For j µ(t),, η(t) and µ(t) (1 + θa) αf X (t) Pr ν ˆν (j) 1 L θa 1 + θa } { e exp (θa) j (1 + θa) 8L, recall Lemma A., then we have: (14) {Lemma A.} (15) e exp { θ f X (t) } {j µ(t)} (16) e N(t) θ. (17)
12 Therefore, we have: Pr E Explore (t) F Good (t) n σ (t 1) µ(t) (18) Pr E Explore (t) F Good (t) n σ (t 1) > η(t) + Pr E Explore (t) F Good (t) µ(t) n σ (t 1) η(t) (19) 0 + η(t) jµ(t) e t N(t) θ e X N(t)1 θ e N(t) θ (10) (11) (1) e X N(t). {Let θ } (1) When the instance has a unique optimal action x under distribution p, the following lemmata ensures that exploration will not continue endlessly, thus it will switch to exploitation gradually. For simplicity, we consider the case that the exploration round has already reached µ(t ) at given time T. Lemma A.0 (Switch to exploitation gradually). Suppose the instance has a unique optimal action x under distribution p. Given time T, if for time i T the exploration rounds n σ (i) µ(t ) has already been satisfied, where µ(t ) (1 + θa) αf X (T ), 0 < a 1. θ. Then t, max{i + 1, T 0 } t T, the probability of playing exploration is: Proof. As n σ (i) µ(t ), we know that Pr E Explore (t) n σ (i) µ(t ) 4e N(t). (14) n σ (i) µ(t ) n σ (t 1) n σ (i) µ(t ), (15) which implies that the event n σ (i) µ(t ) is the subset of the event n σ (t 1) µ(t ). From Lemma A.19, the first part is For the second part, as 0 < a 1 and θ, we can get µ(t ) (1 + θa) αf X (T ) Pr E Explore (t) n σ (i) µ(t ) (16) Pr E Explore (t) n σ (t 1) µ(t ) (17) Pr E Explore (t) F Good (t) n σ (t 1) µ(t ) + Pr E Explore (t) F Bad (t) n σ (t 1) µ(t ) (18) Pr E Explore (t) F Good (t) n σ (t 1) µ(t ) (19) + Pr F Bad (t) n σ (t 1) µ(t ). (140) Pr E Explore (t) F Good (t) n σ (t 1) µ(t ) (141) Pr E Explore (t) F Good (t) n σ (t 1) µ(t) (14) e X N(t). (14) (1 + θa) αf X (t) (1 + θa) a 8L β σf X (t) > 8L βσf X (t). (144)
13 Thus, by using Lemma A.17, it is Pr F Bad (t) n σ (t 1) µ(t ) (145) t Therefore, we can get t 1 jµ(t ) t 1 jµ(t ) t 1 jµ(t ) t 1 jµ(t ) Pr F Bad (t) n σ (t 1) j (146) Pr F(n Bad n σ(t 1)) σ(t 1) j Pr F(j) Bad Pr F Bad (j) n σ (t 1) j (147) (148) (149) e tn(t) {Lemma A.17 with b } (150) e N(t). (151) Pr E Explore (t) n σ (i) µ(t ) e X N + e N 4e N. (15) For counter n σ, the following definition characterizes its first occurrence to be k. Definition A.1. Given k, for any t, we define the event that n σ (t) k and n σ (t 1) k 1 as G k (t), i.e., G k (t) {n σ (t) k n σ (t 1) k 1}. Lemma A. (Exploration Numbers). Let µ(t ) (1 + θa) αf X (T ), 0 < a 1 and θ. If under distribution p, there is a unique optimal action, i.e., X Good 1, then the expected exploration round at time T (T 0 T ) is: E n σ (T ) µ(t ) + 4e T0 X 4 ln(t + 1) Pr E Explore (t). (15) Proof. Note that it takes σ time steps to play exploration and then to increase n σ by 1. E FinExplore (t) is the event that the algorithm finishes one round of exploration and updates n σ at time t. Then, we have E FinExplore (t) E Explore (t σ + 1) and t 1,,, σ 1, Pr E FinExplore (t) 0, meaning that the event never happens for t < σ. By definition, we can get: E n σ (T ) T Pr E FinExplore (t) T Pr E FinExplore (t) t σ Because the accumulation of exploration rounds is n σ (T ), therefore its expected number can be: Pr E Explore (t). (154) E n σ (T ) Pr E Explore (t) (155) Pr E Explore (t) n σ (T ) < µ(t ) + Pr E Explore (t) n σ (T ) µ(t ). (156)
14 The following inequality ensures that the first part is not large: Pr E Explore (t) n σ (T ) < µ(t ) (157) Pr n σ (T ) < µ(t ) Pr E Explore (t) n σ (T ) < µ(t ) (158) Pr n σ (T ) < µ(t ) Pr E Explore (t) nσ (T ) < µ(t ) (159) Pr n σ (T ) < µ(t ) T Pr E FinExplore (t) n σ (T ) < µ(t ) (160) Pr n σ (T ) < µ(t ) E n σ (T ) n σ (T ) < µ(t ) (161) Pr n σ (T ) < µ(t ) µ(t ). (16) We know the counter n σ could only increase by 1 at a time. For this reason, if the value of n σ (T ) exceeds µ(t ) at time T {, this event must happen within t µ(t ),, T. Thus, the occurrence of µ(t ) is equivalent to the union of events T } G µ(t )(i). By definition, each event G µ(t ) (i), i µ(t ),, T, is mutually exclusive. Therefore, we have { T } {n σ (T ) µ(t )} G µ(t )(i), and the second part is: T T Pr E Explore (t) n σ (T ) µ(t ) (16) Pr E Explore (t) Pr T T T ( E Explore (t) G µ(t ) (i) ) G µ(t ) (i) (164) (165) Pr E Explore (t) G µ(t ) (i) {Union bound} (166) Pr E Explore (t) G µ(t ) (i) (167) i Pr E Explore (t) G µ(t ) (i) + T ti+1 Pr E Explore (t) G µ(t ) (i). (168)
15 Now we will prove that the first term is in O(µ(T )): T T T T T T i Pr E Explore (t) G µ(t ) (i) (169) i Pr G µ(t ) (i) Pr E Explore (t) Gµ(T ) (i) (170) Pr G µ(t ) (i) i Pr E Explore (t) G µ(t ) (i) (171) Pr G µ(t ) (i) i+ σ 1 Pr E FinExplore (t) G µ(t ) (i) (17) Pr G µ(t ) (i) E n σ (i + σ 1) G µ(t ) (i) (17) Pr G µ(t ) (i) E n σ (i) + 1 Gµ(T ) (i) {n σ (i + σ 1) n σ (i) + 1} (174) Pr n σ (T ) µ(t ) (µ(t ) + 1), {Mutually exclusive} (175) Since G µ(t ) (i) {n σ (i) µ(t ) n σ (i 1) µ(t ) 1}, we can write the second term as: T T T ti+1 T 0 T 0 T 0 T Pr E Explore (t) G µ(t ) (i) (176) Pr E Explore (t) G µ(t ) (i) + Pr E Explore (t) G µ(t ) (i) + Pr E Explore (t) G µ(t ) (i) + T 0 Pr E Explore (t) T 0 T Pr E Explore (t) T + 4e X 4 T 0 Pr E Explore (t) + 4e ln T. X 4 Therefore, we can get T tmax{i+1,t 0+1} T tmax{i+1,t 0+1} T tmax{i+1,t 0+1} G µ(t ) (i) + 4e X 4 T T Pr E Explore (t) G µ(t ) (i) (177) Pr E Explore (t) n σ (i) µ(t ) (178) 4e N(t) ti t dt di {Lemma A.0} (179) {Mutually exclusive} (180) 1 di (181) i (18) E n σ (T ) µ(t ) e T0 X 4 ln T + Pr E Explore (t). (18)
16 Theorem 4. (in the main text): (Distribution-dependent bound). For Algorithm GCB, let f X (t) ln t + ln X, α 4L βσ. If the instance has a unique optimal action under outcome distribution p and mean outcome vector ν, the distribution-dependent regret bound of Algorithm GCB is: R(T ) 96L βσ x x σ (ln T + ln X ) + 4e X 4 ln T + 1 where x σ x, max and are problem-specific constants under the distribution p. ( e + max X L βσ ), (184) Proof. If we penalize each time the algorithm plays a sub-optimal action by max, then the regret function is composed of exploration and exploitation: R(T ) x σ x E n σ (T ) + max x σ x E n σ (T ) + max T E I E Exploit (t) F Bad (t) (185) T Pr E Exploit (t) F Bad (t). (186) Suppose it has unique optimal action X Good 1, from Lemma A. the expected rounds of exploration are: E n σ (T ) (1 + θa) αf X (T ) e T0 X 4 ln(t + 1) + The regret of exploitation phase can be inferred from Lemma A.18 that: max max Pr E Explore (t). (187) T Pr F Bad (t) E Exploit (t) (188) ( T tt 0+1 ( e max X 4 ( max e T0 N(t) + Pr F Bad (t) E Exploit (t) ) (189) T tt 0+1 e T0 X 4 + T0 1 t + Pr E Exploit (t) ) (190) Pr E Exploit (t) ). (191) Since for t 1,,, T 0, we perform either exploration or exploitation, the regret is no worse than max T 0, that is: T 0 x Pr E Explore (t) T 0 + max Pr E Exploit (t) max T 0. (19) x σ Thus, for f X (t) ln t + ln X, α 8L β σ a, 0 < a 1 and θ, R(T ) (1 + θa) α x (ln T + ln X ) + 4e x σ X 4 ln T + 1 where T 0 (1+θa) α. Let a 1, θ, and α 4L βσ. As a conclusion, we will get: R(T ) 96L β σ x (ln T + ln X ) + 4e X 4 ln T + 1 x σ ( ) e + max X 4 + T 0, (19) ( e + max X L βσ ). (194)
B. An Example of M_σ and Global Observer Set Construction for 1 < s < N in the Crowdsourcing Application

In this section, we provide an example of constructing the stacked matrix M_σ and the global observer set in the crowdsourcing application when we require 1 < s < N, where s is the number of matched worker-task pairs used for reporting the feedback. Recall that the feedback for a matching is the simple summation over these s matched worker-task pairs. This implies that for each matching x, the transformation matrix M_x contains a single row with exactly s entries equal to 1 and all other entries equal to 0.

As an illustration, consider the case where both N and M are divisible by s + 1. Then we can construct a full-rank square matrix M_σ such that, after rearranging the columns of M_σ, it is a block diagonal matrix with each block B being an (s+1)-by-(s+1) square matrix with 0 on the diagonal entries and 1 on the off-diagonal entries. (The original includes an illustration of such a matrix for the case s + 1 = N = M.)

It is clear that this M_σ has full column rank. To recover the NM actions (matchings) corresponding to the NM rows, we map each block B to a matching that matches s + 1 workers to s + 1 tasks such that these matchings share no common edges. This can be done in the following way. We partition the N workers into N/(s+1) groups of size s + 1 each, and partition the M tasks into M/(s+1) groups of size s + 1 each. Taking any group W of s + 1 workers and any group U of s + 1 tasks, we can find s + 1 non-overlapping matchings between W and U by rotation: in the j-th matching, the i-th worker is matched with the ((i + j) mod (s+1))-th task. Since we have NM/(s+1)² worker-task group pairs, and each group pair generates s + 1 non-overlapping matchings, in total we have NM/(s+1) non-overlapping matchings, and we map these matchings to the NM/(s+1) blocks in the rearranged matrix M_σ.
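The block construction and the rotation matchings above can be checked numerically. This sketch uses toy sizes (a single worker group and a single task group, s = 2) and is only an illustration of the construction, not code from the paper:

```python
import numpy as np

# Sketch of the Section-B construction: blocks B of size (s+1)x(s+1) with
# zero diagonal and ones elsewhere, stacked block-diagonally, give a
# full-rank square M_sigma; rotation matchings realize the rows.
s = 2
N = M = s + 1                                  # one worker and one task group
B = np.ones((s + 1, s + 1)) - np.eye(s + 1)    # each row has exactly s ones
num_blocks = (N * M) // (s + 1)
M_sigma = np.kron(np.eye(num_blocks), B)       # block-diagonal stacking

assert M_sigma.shape == (N * M, N * M)
assert np.linalg.matrix_rank(M_sigma) == N * M  # full column rank

# Rotation matchings between a worker group W and a task group U of size
# s+1: in the j-th matching, worker i is matched to task (i + j) mod (s+1).
matchings = [{(i, (i + j) % (s + 1)) for i in range(s + 1)}
             for j in range(s + 1)]
# The s+1 matchings are pairwise edge-disjoint and together cover W x U.
for a in range(s + 1):
    for b in range(a + 1, s + 1):
        assert matchings[a].isdisjoint(matchings[b])
assert len(set().union(*matchings)) == (s + 1) ** 2
```

The rank assertion reflects why B works as a block: B = J − I has eigenvalues s and −1, so it is invertible, and a block-diagonal matrix of invertible blocks is invertible.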
The above construction implies that we can find NM actions to form a global observer set, in which each action is a matching of s + 1 workers to s + 1 tasks, and each matching returns an aggregate performance feedback over s worker-task pairs in the matching. Thus the assumption on the existence of the global observer set holds, and the set can be constructed easily. The error bound for the above constructed M_σ is more complicated to analyze, but based on our empirical evaluation using Matlab, we believe that it is also a low-degree polynomial in N and M.

References

Hayes, Thomas P. A large-deviation inequality for vector-valued martingales. Combinatorics, Probability and Computing, 2003.
More informationQI SHANG: General Equilibrium Analysis of Portfolio Benchmarking
General Equilibrium Analysis of Portfolio Benchmarking QI SHANG 23/10/2008 Introduction The Model Equilibrium Discussion of Results Conclusion Introduction This paper studies the equilibrium effect of
More informationLecture 23: April 10
CS271 Randomness & Computation Spring 2018 Instructor: Alistair Sinclair Lecture 23: April 10 Disclaimer: These notes have not been subjected to the usual scrutiny accorded to formal publications. They
More informationLevin Reduction and Parsimonious Reductions
Levin Reduction and Parsimonious Reductions The reduction R in Cook s theorem (p. 266) is such that Each satisfying truth assignment for circuit R(x) corresponds to an accepting computation path for M(x).
More informationLecture 4. Finite difference and finite element methods
Finite difference and finite element methods Lecture 4 Outline Black-Scholes equation From expectation to PDE Goal: compute the value of European option with payoff g which is the conditional expectation
More information3.2 No-arbitrage theory and risk neutral probability measure
Mathematical Models in Economics and Finance Topic 3 Fundamental theorem of asset pricing 3.1 Law of one price and Arrow securities 3.2 No-arbitrage theory and risk neutral probability measure 3.3 Valuation
More informationMicroeconomic Theory II Preliminary Examination Solutions
Microeconomic Theory II Preliminary Examination Solutions 1. (45 points) Consider the following normal form game played by Bruce and Sheila: L Sheila R T 1, 0 3, 3 Bruce M 1, x 0, 0 B 0, 0 4, 1 (a) Suppose
More informationNotes on the symmetric group
Notes on the symmetric group 1 Computations in the symmetric group Recall that, given a set X, the set S X of all bijections from X to itself (or, more briefly, permutations of X) is group under function
More informationSublinear Time Algorithms Oct 19, Lecture 1
0368.416701 Sublinear Time Algorithms Oct 19, 2009 Lecturer: Ronitt Rubinfeld Lecture 1 Scribe: Daniel Shahaf 1 Sublinear-time algorithms: motivation Twenty years ago, there was practically no investigation
More informationGPD-POT and GEV block maxima
Chapter 3 GPD-POT and GEV block maxima This chapter is devoted to the relation between POT models and Block Maxima (BM). We only consider the classical frameworks where POT excesses are assumed to be GPD,
More informationSocially-Optimal Design of Crowdsourcing Platforms with Reputation Update Errors
Socially-Optimal Design of Crowdsourcing Platforms with Reputation Update Errors 1 Yuanzhang Xiao, Yu Zhang, and Mihaela van der Schaar Abstract Crowdsourcing systems (e.g. Yahoo! Answers and Amazon Mechanical
More informationRegret Minimization and Correlated Equilibria
Algorithmic Game heory Summer 2017, Week 4 EH Zürich Overview Regret Minimization and Correlated Equilibria Paolo Penna We have seen different type of equilibria and also considered the corresponding price
More informationMATH 5510 Mathematical Models of Financial Derivatives. Topic 1 Risk neutral pricing principles under single-period securities models
MATH 5510 Mathematical Models of Financial Derivatives Topic 1 Risk neutral pricing principles under single-period securities models 1.1 Law of one price and Arrow securities 1.2 No-arbitrage theory and
More informationThe Analytics of Information and Uncertainty Answers to Exercises and Excursions
The Analytics of Information and Uncertainty Answers to Exercises and Excursions Chapter 6: Information and Markets 6.1 The inter-related equilibria of prior and posterior markets Solution 6.1.1. The condition
More informationOn Complexity of Multistage Stochastic Programs
On Complexity of Multistage Stochastic Programs Alexander Shapiro School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, Georgia 30332-0205, USA e-mail: ashapiro@isye.gatech.edu
More informationEquity correlations implied by index options: estimation and model uncertainty analysis
1/18 : estimation and model analysis, EDHEC Business School (joint work with Rama COT) Modeling and managing financial risks Paris, 10 13 January 2011 2/18 Outline 1 2 of multi-asset models Solution to
More informationMartingales. by D. Cox December 2, 2009
Martingales by D. Cox December 2, 2009 1 Stochastic Processes. Definition 1.1 Let T be an arbitrary index set. A stochastic process indexed by T is a family of random variables (X t : t T) defined on a
More informationSingle-Parameter Mechanisms
Algorithmic Game Theory, Summer 25 Single-Parameter Mechanisms Lecture 9 (6 pages) Instructor: Xiaohui Bei In the previous lecture, we learned basic concepts about mechanism design. The goal in this area
More informationInterpolation. 1 What is interpolation? 2 Why are we interested in this?
Interpolation 1 What is interpolation? For a certain function f (x we know only the values y 1 = f (x 1,,y n = f (x n For a point x different from x 1,,x n we would then like to approximate f ( x using
More informationRoy Model of Self-Selection: General Case
V. J. Hotz Rev. May 6, 007 Roy Model of Self-Selection: General Case Results drawn on Heckman and Sedlacek JPE, 1985 and Heckman and Honoré, Econometrica, 1986. Two-sector model in which: Agents are income
More informationEssays on Some Combinatorial Optimization Problems with Interval Data
Essays on Some Combinatorial Optimization Problems with Interval Data a thesis submitted to the department of industrial engineering and the institute of engineering and sciences of bilkent university
More informationBandit Learning with switching costs
Bandit Learning with switching costs Jian Ding, University of Chicago joint with: Ofer Dekel (MSR), Tomer Koren (Technion) and Yuval Peres (MSR) June 2016, Harvard University Online Learning with k -Actions
More informationOn Packing Densities of Set Partitions
On Packing Densities of Set Partitions Adam M.Goyt 1 Department of Mathematics Minnesota State University Moorhead Moorhead, MN 56563, USA goytadam@mnstate.edu Lara K. Pudwell Department of Mathematics
More informationLecture 7: Bayesian approach to MAB - Gittins index
Advanced Topics in Machine Learning and Algorithmic Game Theory Lecture 7: Bayesian approach to MAB - Gittins index Lecturer: Yishay Mansour Scribe: Mariano Schain 7.1 Introduction In the Bayesian approach
More information4: SINGLE-PERIOD MARKET MODELS
4: SINGLE-PERIOD MARKET MODELS Marek Rutkowski School of Mathematics and Statistics University of Sydney Semester 2, 2016 M. Rutkowski (USydney) Slides 4: Single-Period Market Models 1 / 87 General Single-Period
More informationLecture Notes 6. Assume F belongs to a family of distributions, (e.g. F is Normal), indexed by some parameter θ.
Sufficient Statistics Lecture Notes 6 Sufficiency Data reduction in terms of a particular statistic can be thought of as a partition of the sample space X. Definition T is sufficient for θ if the conditional
More informationMax Registers, Counters and Monotone Circuits
James Aspnes 1 Hagit Attiya 2 Keren Censor 2 1 Yale 2 Technion Counters Model Collects Our goal: build a cheap counter for an asynchronous shared-memory system. Two operations: increment and read. Read
More informationCS599: Algorithm Design in Strategic Settings Fall 2012 Lecture 6: Prior-Free Single-Parameter Mechanism Design (Continued)
CS599: Algorithm Design in Strategic Settings Fall 2012 Lecture 6: Prior-Free Single-Parameter Mechanism Design (Continued) Instructor: Shaddin Dughmi Administrivia Homework 1 due today. Homework 2 out
More informationThe University of Chicago, Booth School of Business Business 41202, Spring Quarter 2011, Mr. Ruey S. Tsay. Solutions to Final Exam.
The University of Chicago, Booth School of Business Business 41202, Spring Quarter 2011, Mr. Ruey S. Tsay Solutions to Final Exam Problem A: (32 pts) Answer briefly the following questions. 1. Suppose
More informationModelling Returns: the CER and the CAPM
Modelling Returns: the CER and the CAPM Carlo Favero Favero () Modelling Returns: the CER and the CAPM 1 / 20 Econometric Modelling of Financial Returns Financial data are mostly observational data: they
More informationOutline. 1 Introduction. 2 Algorithms. 3 Examples. Algorithm 1 General coordinate minimization framework. 1: Choose x 0 R n and set k 0.
Outline Coordinate Minimization Daniel P. Robinson Department of Applied Mathematics and Statistics Johns Hopkins University November 27, 208 Introduction 2 Algorithms Cyclic order with exact minimization
More informationYao s Minimax Principle
Complexity of algorithms The complexity of an algorithm is usually measured with respect to the size of the input, where size may for example refer to the length of a binary word describing the input,
More informationGame Theory: Normal Form Games
Game Theory: Normal Form Games Michael Levet June 23, 2016 1 Introduction Game Theory is a mathematical field that studies how rational agents make decisions in both competitive and cooperative situations.
More informationA potentially useful approach to model nonlinearities in time series is to assume different behavior (structural break) in different subsamples
1.3 Regime switching models A potentially useful approach to model nonlinearities in time series is to assume different behavior (structural break) in different subsamples (or regimes). If the dates, the
More informationA No-Arbitrage Theorem for Uncertain Stock Model
Fuzzy Optim Decis Making manuscript No (will be inserted by the editor) A No-Arbitrage Theorem for Uncertain Stock Model Kai Yao Received: date / Accepted: date Abstract Stock model is used to describe
More informationLecture 5. 1 Online Learning. 1.1 Learning Setup (Perspective of Universe) CSCI699: Topics in Learning & Game Theory
CSCI699: Topics in Learning & Game Theory Lecturer: Shaddin Dughmi Lecture 5 Scribes: Umang Gupta & Anastasia Voloshinov In this lecture, we will give a brief introduction to online learning and then go
More informationTTIC An Introduction to the Theory of Machine Learning. Learning and Game Theory. Avrim Blum 5/7/18, 5/9/18
TTIC 31250 An Introduction to the Theory of Machine Learning Learning and Game Theory Avrim Blum 5/7/18, 5/9/18 Zero-sum games, Minimax Optimality & Minimax Thm; Connection to Boosting & Regret Minimization
More informationDRAFT. 1 exercise in state (S, t), π(s, t) = 0 do not exercise in state (S, t) Review of the Risk Neutral Stock Dynamics
Chapter 12 American Put Option Recall that the American option has strike K and maturity T and gives the holder the right to exercise at any time in [0, T ]. The American option is not straightforward
More informationThe Fixed Income Valuation Course. Sanjay K. Nawalkha Gloria M. Soto Natalia A. Beliaeva
Interest Rate Risk Modeling The Fixed Income Valuation Course Sanjay K. Nawalkha Gloria M. Soto Natalia A. Beliaeva Interest t Rate Risk Modeling : The Fixed Income Valuation Course. Sanjay K. Nawalkha,
More informationAsymptotic results discrete time martingales and stochastic algorithms
Asymptotic results discrete time martingales and stochastic algorithms Bernard Bercu Bordeaux University, France IFCAM Summer School Bangalore, India, July 2015 Bernard Bercu Asymptotic results for discrete
More informationStrategies and Nash Equilibrium. A Whirlwind Tour of Game Theory
Strategies and Nash Equilibrium A Whirlwind Tour of Game Theory (Mostly from Fudenberg & Tirole) Players choose actions, receive rewards based on their own actions and those of the other players. Example,
More informationMASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 11 10/9/2013. Martingales and stopping times II
MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.65/15.070J Fall 013 Lecture 11 10/9/013 Martingales and stopping times II Content. 1. Second stopping theorem.. Doob-Kolmogorov inequality. 3. Applications of stopping
More informationM5MF6. Advanced Methods in Derivatives Pricing
Course: Setter: M5MF6 Dr Antoine Jacquier MSc EXAMINATIONS IN MATHEMATICS AND FINANCE DEPARTMENT OF MATHEMATICS April 2016 M5MF6 Advanced Methods in Derivatives Pricing Setter s signature...........................................
More informationBayesian Linear Model: Gory Details
Bayesian Linear Model: Gory Details Pubh7440 Notes By Sudipto Banerjee Let y y i ] n i be an n vector of independent observations on a dependent variable (or response) from n experimental units. Associated
More informationMath-Stat-491-Fall2014-Notes-V
Math-Stat-491-Fall2014-Notes-V Hariharan Narayanan December 7, 2014 Martingales 1 Introduction Martingales were originally introduced into probability theory as a model for fair betting games. Essentially
More informationIs Greedy Coordinate Descent a Terrible Algorithm?
Is Greedy Coordinate Descent a Terrible Algorithm? Julie Nutini, Mark Schmidt, Issam Laradji, Michael Friedlander, Hoyt Koepke University of British Columbia Optimization and Big Data, 2015 Context: Random
More informationWeb Appendix: Proofs and extensions.
B eb Appendix: Proofs and extensions. B.1 Proofs of results about block correlated markets. This subsection provides proofs for Propositions A1, A2, A3 and A4, and the proof of Lemma A1. Proof of Proposition
More informationThe Real Numbers. Here we show one way to explicitly construct the real numbers R. First we need a definition.
The Real Numbers Here we show one way to explicitly construct the real numbers R. First we need a definition. Definitions/Notation: A sequence of rational numbers is a funtion f : N Q. Rather than write
More information4 Reinforcement Learning Basic Algorithms
Learning in Complex Systems Spring 2011 Lecture Notes Nahum Shimkin 4 Reinforcement Learning Basic Algorithms 4.1 Introduction RL methods essentially deal with the solution of (optimal) control problems
More information3 The Model Existence Theorem
3 The Model Existence Theorem Although we don t have compactness or a useful Completeness Theorem, Henkinstyle arguments can still be used in some contexts to build models. In this section we describe
More informationNon replication of options
Non replication of options Christos Kountzakis, Ioannis A Polyrakis and Foivos Xanthos June 30, 2008 Abstract In this paper we study the scarcity of replication of options in the two period model of financial
More informationLecture 17: More on Markov Decision Processes. Reinforcement learning
Lecture 17: More on Markov Decision Processes. Reinforcement learning Learning a model: maximum likelihood Learning a value function directly Monte Carlo Temporal-difference (TD) learning COMP-424, Lecture
More informationBrownian Motion. Richard Lockhart. Simon Fraser University. STAT 870 Summer 2011
Brownian Motion Richard Lockhart Simon Fraser University STAT 870 Summer 2011 Richard Lockhart (Simon Fraser University) Brownian Motion STAT 870 Summer 2011 1 / 33 Purposes of Today s Lecture Describe
More informationStatistical Tables Compiled by Alan J. Terry
Statistical Tables Compiled by Alan J. Terry School of Science and Sport University of the West of Scotland Paisley, Scotland Contents Table 1: Cumulative binomial probabilities Page 1 Table 2: Cumulative
More informationCHOICE THEORY, UTILITY FUNCTIONS AND RISK AVERSION
CHOICE THEORY, UTILITY FUNCTIONS AND RISK AVERSION Szabolcs Sebestyén szabolcs.sebestyen@iscte.pt Master in Finance INVESTMENTS Sebestyén (ISCTE-IUL) Choice Theory Investments 1 / 65 Outline 1 An Introduction
More informationThe University of Chicago, Booth School of Business Business 41202, Spring Quarter 2009, Mr. Ruey S. Tsay. Solutions to Final Exam
The University of Chicago, Booth School of Business Business 41202, Spring Quarter 2009, Mr. Ruey S. Tsay Solutions to Final Exam Problem A: (42 pts) Answer briefly the following questions. 1. Questions
More informationEvaluating Strategic Forecasters. Rahul Deb with Mallesh Pai (Rice) and Maher Said (NYU Stern) Becker Friedman Theory Conference III July 22, 2017
Evaluating Strategic Forecasters Rahul Deb with Mallesh Pai (Rice) and Maher Said (NYU Stern) Becker Friedman Theory Conference III July 22, 2017 Motivation Forecasters are sought after in a variety of
More informationLECTURE NOTES 10 ARIEL M. VIALE
LECTURE NOTES 10 ARIEL M VIALE 1 Behavioral Asset Pricing 11 Prospect theory based asset pricing model Barberis, Huang, and Santos (2001) assume a Lucas pure-exchange economy with three types of assets:
More informationChapter 3: Black-Scholes Equation and Its Numerical Evaluation
Chapter 3: Black-Scholes Equation and Its Numerical Evaluation 3.1 Itô Integral 3.1.1 Convergence in the Mean and Stieltjes Integral Definition 3.1 (Convergence in the Mean) A sequence {X n } n ln of random
More informationBasic Procedure for Histograms
Basic Procedure for Histograms 1. Compute the range of observations (min. & max. value) 2. Choose an initial # of classes (most likely based on the range of values, try and find a number of classes that
More informationAD in Monte Carlo for finance
AD in Monte Carlo for finance Mike Giles giles@comlab.ox.ac.uk Oxford University Computing Laboratory AD & Monte Carlo p. 1/30 Overview overview of computational finance stochastic o.d.e. s Monte Carlo
More informationLog-linear Dynamics and Local Potential
Log-linear Dynamics and Local Potential Daijiro Okada and Olivier Tercieux [This version: November 28, 2008] Abstract We show that local potential maximizer ([15]) with constant weights is stochastically
More information1 Dynamic programming
1 Dynamic programming A country has just discovered a natural resource which yields an income per period R measured in terms of traded goods. The cost of exploitation is negligible. The government wants
More informationMAT 4250: Lecture 1 Eric Chung
1 MAT 4250: Lecture 1 Eric Chung 2Chapter 1: Impartial Combinatorial Games 3 Combinatorial games Combinatorial games are two-person games with perfect information and no chance moves, and with a win-or-lose
More informationAmerican options and early exercise
Chapter 3 American options and early exercise American options are contracts that may be exercised early, prior to expiry. These options are contrasted with European options for which exercise is only
More informationThe stochastic calculus
Gdansk A schedule of the lecture Stochastic differential equations Ito calculus, Ito process Ornstein - Uhlenbeck (OU) process Heston model Stopping time for OU process Stochastic differential equations
More informationFIGURE A1.1. Differences for First Mover Cutoffs (Round one to two) as a Function of Beliefs on Others Cutoffs. Second Mover Round 1 Cutoff.
APPENDIX A. SUPPLEMENTARY TABLES AND FIGURES A.1. Invariance to quantitative beliefs. Figure A1.1 shows the effect of the cutoffs in round one for the second and third mover on the best-response cutoffs
More informationCounting Basics. Venn diagrams
Counting Basics Sets Ways of specifying sets Union and intersection Universal set and complements Empty set and disjoint sets Venn diagrams Counting Inclusion-exclusion Multiplication principle Addition
More informationSTOCHASTIC REPUTATION DYNAMICS UNDER DUOPOLY COMPETITION
STOCHASTIC REPUTATION DYNAMICS UNDER DUOPOLY COMPETITION BINGCHAO HUANGFU Abstract This paper studies a dynamic duopoly model of reputation-building in which reputations are treated as capital stocks that
More informationMonte Carlo Methods in Option Pricing. UiO-STK4510 Autumn 2015
Monte Carlo Methods in Option Pricing UiO-STK4510 Autumn 015 The Basics of Monte Carlo Method Goal: Estimate the expectation θ = E[g(X)], where g is a measurable function and X is a random variable such
More informationWeek 1 Quantitative Analysis of Financial Markets Basic Statistics A
Week 1 Quantitative Analysis of Financial Markets Basic Statistics A Christopher Ting http://www.mysmu.edu/faculty/christophert/ Christopher Ting : christopherting@smu.edu.sg : 6828 0364 : LKCSB 5036 October
More informationAn Intertemporal Capital Asset Pricing Model
I. Assumptions Finance 400 A. Penati - G. Pennacchi Notes on An Intertemporal Capital Asset Pricing Model These notes are based on the article Robert C. Merton (1973) An Intertemporal Capital Asset Pricing
More informationDynamic Replication of Non-Maturing Assets and Liabilities
Dynamic Replication of Non-Maturing Assets and Liabilities Michael Schürle Institute for Operations Research and Computational Finance, University of St. Gallen, Bodanstr. 6, CH-9000 St. Gallen, Switzerland
More informationBest-Reply Sets. Jonathan Weinstein Washington University in St. Louis. This version: May 2015
Best-Reply Sets Jonathan Weinstein Washington University in St. Louis This version: May 2015 Introduction The best-reply correspondence of a game the mapping from beliefs over one s opponents actions to
More informationLecture 11: Bandits with Knapsacks
CMSC 858G: Bandits, Experts and Games 11/14/16 Lecture 11: Bandits with Knapsacks Instructor: Alex Slivkins Scribed by: Mahsa Derakhshan 1 Motivating Example: Dynamic Pricing The basic version of the dynamic
More informationIEOR E4004: Introduction to OR: Deterministic Models
IEOR E4004: Introduction to OR: Deterministic Models 1 Dynamic Programming Following is a summary of the problems we discussed in class. (We do not include the discussion on the container problem or the
More informationTheoretical Statistics. Lecture 4. Peter Bartlett
1. Concentration inequalities. Theoretical Statistics. Lecture 4. Peter Bartlett 1 Outline of today s lecture We have been looking at deviation inequalities, i.e., bounds on tail probabilities likep(x
More informationmonotone circuit value
monotone circuit value A monotone boolean circuit s output cannot change from true to false when one input changes from false to true. Monotone boolean circuits are hence less expressive than general circuits.
More informationPh.D. Preliminary Examination MICROECONOMIC THEORY Applied Economics Graduate Program June 2017
Ph.D. Preliminary Examination MICROECONOMIC THEORY Applied Economics Graduate Program June 2017 The time limit for this exam is four hours. The exam has four sections. Each section includes two questions.
More informationNo-arbitrage theorem for multi-factor uncertain stock model with floating interest rate
Fuzzy Optim Decis Making 217 16:221 234 DOI 117/s17-16-9246-8 No-arbitrage theorem for multi-factor uncertain stock model with floating interest rate Xiaoyu Ji 1 Hua Ke 2 Published online: 17 May 216 Springer
More informationCentral Limit Theorem for the Realized Volatility based on Tick Time Sampling. Masaaki Fukasawa. University of Tokyo
Central Limit Theorem for the Realized Volatility based on Tick Time Sampling Masaaki Fukasawa University of Tokyo 1 An outline of this talk is as follows. What is the Realized Volatility (RV)? Known facts
More informationSingle Price Mechanisms for Revenue Maximization in Unlimited Supply Combinatorial Auctions
Single Price Mechanisms for Revenue Maximization in Unlimited Supply Combinatorial Auctions Maria-Florina Balcan Avrim Blum Yishay Mansour February 2007 CMU-CS-07-111 School of Computer Science Carnegie
More informationWrite legibly. Unreadable answers are worthless.
MMF 2021 Final Exam 1 December 2016. This is a closed-book exam: no books, no notes, no calculators, no phones, no tablets, no computers (of any kind) allowed. Do NOT turn this page over until you are
More informationTechniques for Calculating the Efficient Frontier
Techniques for Calculating the Efficient Frontier Weerachart Kilenthong RIPED, UTCC c Kilenthong 2017 Tee (Riped) Introduction 1 / 43 Two Fund Theorem The Two-Fund Theorem states that we can reach any
More informationRisk Neutral Measures
CHPTER 4 Risk Neutral Measures Our aim in this section is to show how risk neutral measures can be used to price derivative securities. The key advantage is that under a risk neutral measure the discounted
More informationHomework Assignments
Homework Assignments Week 1 (p. 57) #4.1, 4., 4.3 Week (pp 58 6) #4.5, 4.6, 4.8(a), 4.13, 4.0, 4.6(b), 4.8, 4.31, 4.34 Week 3 (pp 15 19) #1.9, 1.1, 1.13, 1.15, 1.18 (pp 9 31) #.,.6,.9 Week 4 (pp 36 37)
More informationThe University of Chicago, Booth School of Business Business 41202, Spring Quarter 2012, Mr. Ruey S. Tsay. Solutions to Final Exam
The University of Chicago, Booth School of Business Business 41202, Spring Quarter 2012, Mr. Ruey S. Tsay Solutions to Final Exam Problem A: (40 points) Answer briefly the following questions. 1. Consider
More informationRepresenting Risk Preferences in Expected Utility Based Decision Models
Representing Risk Preferences in Expected Utility Based Decision Models Jack Meyer Department of Economics Michigan State University East Lansing, MI 48824 jmeyer@msu.edu SCC-76: Economics and Management
More informationValue of Flexibility in Managing R&D Projects Revisited
Value of Flexibility in Managing R&D Projects Revisited Leonardo P. Santiago & Pirooz Vakili November 2004 Abstract In this paper we consider the question of whether an increase in uncertainty increases
More informationComprehensive Exam. August 19, 2013
Comprehensive Exam August 19, 2013 You have a total of 180 minutes to complete the exam. If a question seems ambiguous, state why, sharpen it up and answer the sharpened-up question. Good luck! 1 1 Menu
More informationYou Have an NP-Complete Problem (for Your Thesis)
You Have an NP-Complete Problem (for Your Thesis) From Propositions 27 (p. 242) and Proposition 30 (p. 245), it is the least likely to be in P. Your options are: Approximations. Special cases. Average
More informationMonte-Carlo Planning: Introduction and Bandit Basics. Alan Fern
Monte-Carlo Planning: Introduction and Bandit Basics Alan Fern 1 Large Worlds We have considered basic model-based planning algorithms Model-based planning: assumes MDP model is available Methods we learned
More informationAn Axiomatic Approach to Arbitration and Its Application in Bargaining Games
An Axiomatic Approach to Arbitration and Its Application in Bargaining Games Kang Rong School of Economics, Shanghai University of Finance and Economics Aug 30, 2012 Abstract We define an arbitration problem
More information