Bandit Learning with switching costs

Bandit Learning with Switching Costs. Jian Ding, University of Chicago. Joint work with Ofer Dekel (MSR), Tomer Koren (Technion), and Yuval Peres (MSR). June 2016, Harvard University.

Online Learning with k Actions. The cast: k actions (a.k.a. arms, experts), a player (a.k.a. learner), and an adversary (a.k.a. environment).

Round 1 (illustration): the adversary assigns a loss in [0, 1] to each of the k actions (e.g., 0.9, 0.2, 0.6); the player picks one action and suffers its loss (here 0.2).

Finite-Action Online Learning. Problems range from easy to hard to unlearnable along two axes: less feedback, and a more powerful adversary. Goal: a complete characterization of learning hardness.

Round t (illustration): on each round the randomized player and the adversary face the k actions, each carrying a loss value in [0, 1].

Two Types of Adversaries. An adaptive adversary takes the player's past actions into account when setting loss values. An oblivious adversary ignores the player's past actions when setting loss values.

Two Feedback Models. In the bandit feedback model, the player only sees the loss associated with his action (one number). In the full feedback model, the player also sees the losses associated with the other actions (k numbers).

Examples. Bandit feedback: display one of k news articles to maximize user clicks. Full feedback: invest in one stock on each day.

More Formally. Setting: a T-round repeated game between a randomized player and a deterministic adaptive adversary. Notation: the player's actions are X = {1, ..., k}. Before the game: the adversary chooses a sequence of loss functions f_1, ..., f_T, where f_t : X^t → [0, 1]. The game: for t = 1, ..., T, the player chooses a distribution µ_t over X and draws X_t ~ µ_t; the player suffers and observes the loss f_t(X_1, ..., X_t); with full feedback, the player also observes the map x ↦ f_t(X_1, ..., X_{t−1}, x).
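
A minimal simulation of this protocol (my own sketch, not from the talk; the player and adversary interfaces below are illustrative assumptions), with actions indexed 0, ..., k−1:

```python
import numpy as np

def play_game(T, k, player, loss_fns, full_feedback=False, seed=0):
    """Run the T-round game. loss_fns[t](x_1, ..., x_t) is the adaptive loss f_t;
    player(t, observations) returns a distribution mu_t over the k actions."""
    rng = np.random.default_rng(seed)
    history, observations, total = [], [], 0.0
    for t in range(T):
        mu = player(t, observations)
        x = rng.choice(k, p=mu)                      # X_t ~ mu_t
        history.append(x)
        loss = loss_fns[t](*history)                 # suffer f_t(X_1, ..., X_t)
        total += loss
        if full_feedback:                            # observe x -> f_t(X_1, ..., X_{t-1}, x)
            observations.append([loss_fns[t](*history[:-1], a) for a in range(k)])
        else:
            observations.append((x, loss))           # bandit feedback: one number
    return total
```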

Adaptive vs. Oblivious. Adaptive: f_t : X^t → [0, 1] can be any function. Oblivious: the adversary chooses l_1, ..., l_T, where l_t : X → [0, 1], and sets f_t(x_1, ..., x_t) = l_t(x_t). (An oblivious adversary is a special case of an adaptive one.)

Loss, Regret. Definition: the player's expected cumulative loss is E[Σ_{t=1}^T f_t(X_1, ..., X_t)]. Definition: the player's regret w.r.t. the best action is R(T) = E[Σ_{t=1}^T f_t(X_1, ..., X_t)] − min_{x∈X} Σ_{t=1}^T f_t(x, ..., x). Interpretation: R(T) = o(T) means the player gets better with time.

Minimax Regret. Regret measures a specific player's performance; we want to measure the inherent difficulty of the problem. Definition: the minimax regret R*(T) is the inf over randomized player strategies of the sup over adversary loss sequences of the resulting expected regret. R*(T) = Θ(√T): the problem is easy. R*(T) = Θ(T): the problem is unlearnable.
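
In symbols, the verbal definition above can be written as (my rendering, using the notation of the earlier slides):

```latex
R^{\star}(T) \;=\; \inf_{\mu_{1:T}} \; \sup_{f_{1:T}} \;
\left( \mathbb{E}\!\left[\sum_{t=1}^{T} f_t(X_1,\dots,X_t)\right]
\;-\; \min_{x \in X} \sum_{t=1}^{T} f_t(x,\dots,x) \right).
```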

Full + Oblivious, a.k.a. Predicting with Expert Advice. Littlestone & Warmuth (1994), Freund & Schapire (1997). The Multiplicative Weights Algorithm: sample X_t from µ_t, where µ_t(i) ∝ exp(−γ Σ_{j=1}^{t−1} l_j(i)). Theorem: γ = 1/√T yields R(T) = O(√(T log k)).
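
A minimal sketch of this update (my own code, not from the talk), assuming an oblivious loss matrix with entries in [0, 1]:

```python
import numpy as np

def multiplicative_weights(losses, gamma):
    """Hedge / multiplicative weights with full feedback.

    losses: array of shape (T, k) with losses[t, i] = l_t(i) in [0, 1].
    gamma:  learning rate (the slide takes gamma about 1/sqrt(T)).
    """
    T, k = losses.shape
    cum = np.zeros(k)                      # cumulative losses sum_{j < t} l_j(i)
    rng = np.random.default_rng(0)
    total = 0.0
    for t in range(T):
        w = np.exp(-gamma * cum)           # mu_t(i) proportional to exp(-gamma * cum[i])
        mu = w / w.sum()
        x = rng.choice(k, p=mu)
        total += losses[t, x]
        cum += losses[t]                   # full feedback: all k losses are revealed
    return total
```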

Bandit + Oblivious, a.k.a. The Adversarial Multiarmed Bandit Problem. Auer, Cesa-Bianchi, Freund, Schapire (2002). The EXP3 Algorithm: run the weighted majority algorithm with estimates of the full feedback vectors, l̂_t(i) = l_t(i)/µ_t(i) if i = X_t, and 0 otherwise. Theorem: E[l̂_t(i)] = l_t(i), and R(T) = O(√(Tk)).
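
A corresponding sketch (again mine and simplified; the uniform-exploration mixing used in the original EXP3 analysis is omitted), built on the importance-weighted estimator above:

```python
import numpy as np

def exp3(losses, gamma):
    """EXP3 sketch: multiplicative weights fed importance-weighted loss estimates.

    losses: array (T, k) of oblivious losses; only the chosen arm's loss is revealed.
    """
    T, k = losses.shape
    cum_est = np.zeros(k)                  # cumulative estimated losses hat-l_j(i)
    rng = np.random.default_rng(0)
    total = 0.0
    for t in range(T):
        w = np.exp(-gamma * cum_est)
        mu = w / w.sum()
        x = rng.choice(k, p=mu)
        loss = losses[t, x]                # bandit feedback: one number
        total += loss
        cum_est[x] += loss / mu[x]         # unbiased: E[hat-l_t(i)] = l_t(i)
    return total
```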

Adaptive Obstacle (Arora, Dekel, Tewari 2012). R*(T) = Θ(T) in any feedback model. Proof: some action must have probability at least 1/k under µ_1; w.l.o.g. assume µ_1(1) ≥ 1/k. Define f_t(x_1, ..., x_t) = 1 if x_1 = 1, and 0 otherwise. This loss guarantees expected regret µ_1(1) · T ≥ T/k = Ω(T).

The Characterization (so far). Against an oblivious adversary, both bandit and full feedback give Θ(√T) (easy); against an adaptive adversary, both give Θ(T) (unlearnable). Axes: less feedback, more powerful adversary. Boring: the feedback models seem to be equivalent (when k = 2, say).

Adding a Switching Cost. The switching cost adversary chooses l_1, ..., l_T, where l_t : X → [0, 1], and sets f_t(x_1, ..., x_t) = ½ (l_t(x_t) + 1[x_t ≠ x_{t−1}]). With full information, the Follow the Lazy Leader algorithm (Kalai-Vempala 05) guarantees R(T) = O(√T); so does Shrinking the Dartboard (Geulen-Vöcking-Winkler 10). (Hierarchy: oblivious ⊂ switching ⊂ adaptive.)
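
As a quick illustration (a sketch with my own naming, not the talk's code), the switching-cost loss just defined:

```python
def switching_cost_loss(l_t, x_t, x_prev):
    """f_t(x_1, ..., x_t) = (l_t(x_t) + 1[x_t != x_{t-1}]) / 2.

    The factor 1/2 keeps the combined loss in [0, 1]; x_prev is None on the first round.
    """
    switch = 0.0 if x_prev is None else float(x_t != x_prev)
    return 0.5 * (l_t(x_t) + switch)
```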

m-memory Adversary, Counterfactual Feedback. The m-memory adversary defines loss functions that depend only on the m + 1 most recent actions: f_t(x_1, ..., x_t) = f̃_t(x_{t−m}, ..., x_t). A Third Feedback Model: in the counterfactual feedback model, the player receives the entire loss function f_t. Merhav et al. (2002) proved R(T) = O(T^{2/3}); Gyorgy & Neu (2011) improved this to R(T) = O(√T).
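
In code, the m-memory restriction is just a projection of the history onto its last m + 1 entries (illustrative sketch; f_tilde is a hypothetical bounded loss on m + 1 actions):

```python
def m_memory_loss(f_tilde, history, m):
    """f_t(x_1, ..., x_t) = f_tilde(x_{t-m}, ..., x_t); assumes t > m."""
    return f_tilde(*history[-(m + 1):])
```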

Adversaries and Feedbacks. Adversaries, from weakest to strongest: oblivious, switching, m-memory, adaptive. Feedback models, from least to most informative: bandit, full, counterfactual.

The Characterization (so far). The grid of feedback models (bandit, full, counterfactual) against adversaries (oblivious, switching, m-memory, adaptive): the oblivious column is Θ(√T) (easy), the adaptive column is Θ(T) (unlearnable). Axes: less feedback, more powerful adversary.

The Characterization (so far). Oblivious: Θ(√T), easy. Adaptive: Θ(T), unlearnable. Switching: Ω(√T) and O(T^{2/3}), so easy? hard? (Arora, Dekel, Tewari 2012.)

The Characterization (so far). Oblivious: Θ(√T), easy. Adaptive: Θ(T), unlearnable. Switching-cost adversary with bandit feedback: Θ(T^{2/3}), hard. Cesa-Bianchi, Dekel, Shamir (2013) (unbounded losses); Dekel, Ding, Koren, Peres (2013).

Bandit + Switching: Upper Bound. Algorithm: split the T rounds into T/B blocks of length B; use EXP3 to choose an action x̂_j for each entire block; the feedback given to EXP3 is the average loss in the block. (Schedule: x̂_1 x̂_1 x̂_1 x̂_1 | x̂_2 x̂_2 x̂_2 x̂_2 | x̂_3 x̂_3 x̂_3 x̂_3 | x̂_4 x̂_4 x̂_4 x̂_4.) Regret analysis: R(T) ≲ T/B (switches) + B · O(√(T/B)) (loss) = O(T/B + √(TB)), minimized by selecting B = T^{1/3}, yielding R(T) = O(T^{2/3}).
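
A sketch of this blocked strategy (my code, reusing the simplified EXP3 update from before):

```python
import numpy as np

def blocked_exp3(losses, B, gamma):
    """Mini-batched EXP3 for bandits with switching costs (sketch).

    losses: array (T, k) of oblivious losses; only the played arm's loss is observed.
    B:      block length, an integer roughly T ** (1/3) as on the slide,
            so the player switches at most T/B times.
    """
    T, k = losses.shape
    cum_est = np.zeros(k)
    rng = np.random.default_rng(0)
    total = 0.0
    for j in range(0, T, B):
        w = np.exp(-gamma * cum_est)
        mu = w / w.sum()
        x = rng.choice(k, p=mu)                 # one arm for the whole block
        block = losses[j:j + B, x]
        total += block.sum()
        cum_est[x] += block.mean() / mu[x]      # feedback: importance-weighted block average
    return total
```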

Bandit + Switching: Lower Bound. Yao's Minimax Principle (1975): the expected regret of the best deterministic algorithm on a random loss sequence lower-bounds the worst-case expected regret of any randomized algorithm. Goal: find a random loss sequence for which all deterministic algorithms have expected regret Ω(T^{2/3}). For simplicity, assume k = 2.

Bandit + Switching: Lower Bound. Cesa-Bianchi, Dekel, Shamir 2013: a random walk construction. Let (S_t) be a Gaussian random walk and ε = T^{−1/3}. Randomly choose an action and assign it the loss sequence (S_t); the other action gets (S_t + ε). Key: 1/ε² switches are required before determining which action is worse! Drawback: the loss function is unbounded. Is hard learning an artifact of unboundedness?

Multi-Scale Random Walk. Define the loss of action 1: draw independent Gaussians ξ_1, ..., ξ_T ~ N(0, σ²); for each t, define a parent ρ(t) ∈ {0, ..., t−1}; define recursively L_0 = 1/2 and L_t = L_ρ(t) + ξ_t. (Figure: a parent graph on L_0, L_1, ..., L_7 with edges labeled ξ_1, ..., ξ_7; this is the loss of action 1.)
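
A sketch of this construction (my code; rho is any parent function as defined above):

```python
import numpy as np

def multiscale_walk(T, sigma, rho, seed=0):
    """Loss of action 1: L_0 = 1/2 and L_t = L_{rho(t)} + xi_t, xi_t ~ N(0, sigma^2)."""
    rng = np.random.default_rng(seed)
    xi = rng.normal(0.0, sigma, size=T + 1)
    L = np.empty(T + 1)
    L[0] = 0.5
    for t in range(1, T + 1):
        L[t] = L[rho(t)] + xi[t]    # parent rho(t) lies in {0, ..., t-1}
    return L[1:]                    # L_1, ..., L_T
```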

Examples. ρ(t) = t − gcd(t, 2^T) (figure: parent graph on L_0, ..., L_7). ρ(t) = 0: wide (figure: every L_t hangs directly off L_0). ρ(t) = t − 1: deep (figure: a simple random walk, i.e., a path).
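
The three parent functions in code (my sketch; for 1 ≤ t ≤ T, gcd(t, 2^T) is just the largest power of two dividing t):

```python
from math import gcd

def rho_balanced(t, T):
    """rho(t) = t - gcd(t, 2^T): depth and width both turn out to be logarithmic in T."""
    return t - gcd(t, 2 ** T)       # equivalently t - (t & -t) for 1 <= t <= T

def rho_wide(t, T):
    """rho(t) = 0: everything hangs off L_0 (wide, shallow)."""
    return 0

def rho_deep(t, T):
    """rho(t) = t - 1: a plain random walk (deep, narrow)."""
    return t - 1
```

To plug one of these into the earlier multiscale_walk sketch, fix T first, e.g. rho = lambda t: rho_balanced(t, T).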

The Second Action. Define the loss of action 2: draw a random sign χ with Pr(χ = +1) = Pr(χ = −1) = 1/2, and set L′_t = L_t + χε, where ε = T^{−1/3}. (Figure: the two loss sequences, action 1 (L_t) and action 2 (L′_t), over t = 1, ..., 18, separated by the gap ε.) Choosing the worse action Θ(T) times gives R(T) = Ω(T^{2/3}).
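
Putting the pieces together in code (my sketch; clipping to [0, 1] is one simple way to keep losses bounded, while the talk handles boundedness via the choice of σ below):

```python
import numpy as np

def two_action_losses(T, sigma, rho, seed=0):
    """Losses of both actions: action 2 is action 1 shifted by chi * eps, eps = T^(-1/3)."""
    rng = np.random.default_rng(seed)
    L1 = multiscale_walk(T, sigma, rho, seed=seed)   # from the earlier sketch
    chi = rng.choice([-1.0, 1.0])                    # hidden sign the learner must detect
    eps = T ** (-1.0 / 3.0)
    L2 = L1 + chi * eps
    return np.clip(L1, 0.0, 1.0), np.clip(L2, 0.0, 1.0)
```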

The Information in One Sample. To avoid choosing the worse action Θ(T) times, the algorithm must identify the value of χ. Fact: Q: how many samples are needed to estimate the mean of a Gaussian with accuracy ε? A: Θ(σ²/ε²). (Figure: the parent graph over L_0, ..., L_7 above the player's action sequence 2 1 1 1 2 2 2; some edges yield no info about χ, others yield one sample each.) How many red (informative) edges can there be? The player needs at least σ²/ε² = σ² T^{2/3} of them.

Counting the Information. Define width(ρ) as the maximum size of any vertical cut in the graph induced by ρ. (Figure: an example on L_0, ..., L_7 with width(ρ) = 3.) Lemma: a switch contributes at most width(ρ) samples.

Depth. Define depth(ρ) as the length of the longest path in the graph induced by ρ. (Figure: the parent graph on L_0, ..., L_7.) For the loss to remain bounded in [0, 1], set σ ≈ 1/depth(ρ).
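
Both quantities are easy to compute for a given parent function (my sketch; the vertical cut at position s counts the edges (ρ(t), t) that straddle s):

```python
def depth_and_width(T, rho):
    """depth = longest path from L_0; width = largest vertical cut in the parent graph."""
    d = [0] * (T + 1)                      # d[t] = number of edges from L_0 down to L_t
    for t in range(1, T + 1):
        d[t] = d[rho(t)] + 1
    depth = max(d)
    width = max(
        sum(1 for t in range(1, T + 1) if rho(t) < s <= t)   # edges straddling the cut at s
        for s in range(1, T + 1)
    )
    return depth, width
```

For example, rho(t) = t − 1 gives depth T and width 1, while rho(t) = 0 gives depth 1 and width T.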

Putting it All Together. σ² T^{2/3} samples are needed; each switch gives at most width(ρ) samples; keeping the loss bounded in [0, 1] forces σ ≈ 1/depth(ρ). Conclusion: the number of switches needed to determine the better action is at least about T^{2/3} / (width(ρ) · depth(ρ)²). Choose ρ(t) = t − gcd(t, 2^T). Lemma: depth(ρ) ≤ log(T) and width(ρ) ≤ log(T) + 1.
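
A quick check of the arithmetic (my bookkeeping of the slide's counting argument, with ε = T^{−1/3} and σ ≈ 1/depth(ρ)):

```latex
\#\text{switches} \;\gtrsim\; \frac{\sigma^2/\epsilon^2}{\mathrm{width}(\rho)}
\;=\; \frac{\sigma^2\, T^{2/3}}{\mathrm{width}(\rho)}
\;\approx\; \frac{T^{2/3}}{\mathrm{width}(\rho)\,\mathrm{depth}(\rho)^2}
\;\geq\; \frac{T^{2/3}}{(\log T + 1)\,\log^2 T}.
```

Each switch costs 1/2, so an algorithm that does identify χ pays Ω̃(T^{2/3}) in switching costs; an algorithm that does not pays εΘ(T) = Ω(T^{2/3}) by playing the worse action. Either way, R(T) = Ω̃(T^{2/3}), matching the upper bound up to logarithmic factors.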

Corollaries & Extensions. Corollary: exploration requires switching; e.g., EXP3 switches Θ(T) times. Dependence on k: the minimax regret of the multiarmed bandit with switching costs is Θ(T^{2/3} k^{1/3}). Implications for other models: the minimax regret of learning an adversarial deterministic MDP is Θ(T^{2/3}).

Summary. A complete characterization of learning hardness. There exist online learning problems that are hard yet learnable. Learning with bandit feedback can be strictly harder than learning with full feedback. Exploration requires extensive switching.

The End. (Closing figure: the easy / hard / unlearnable characterization.)