Bandit Learning with switching costs

Bandit Learning with Switching Costs. Jian Ding, University of Chicago. Joint work with Ofer Dekel (MSR), Tomer Koren (Technion), and Yuval Peres (MSR). June 2016, Harvard University.

Online Learning with k Actions. The cast: k actions (a.k.a. arms, experts), a player (a.k.a. learner), and an adversary (a.k.a. environment).

Round 1 (illustration): the adversary assigns a loss in [0, 1] to each of the k actions (e.g., 0.9, 0.2, 0.6); the player picks one action and suffers its loss (here 0.2).

Finite-Action Online Learning. Problems range from easy to hard to unlearnable along two axes: less feedback, and a more powerful adversary. Goal: a complete characterization of learning hardness.

Round t (illustration): on each round the randomized player and the adversary face the k actions, each carrying a loss value in [0, 1].

Two Types of Adversaries. An adaptive adversary takes the player's past actions into account when setting loss values. An oblivious adversary ignores the player's past actions when setting loss values.

Two Feedback Models. In the bandit feedback model, the player only sees the loss associated with his action (one number). In the full feedback model, the player also sees the losses associated with the other actions (k numbers).

Examples. Bandit feedback: display one of k news articles to maximize user clicks. Full feedback: invest in one stock on each day.

More Formally. Setting: a T-round repeated game between a randomized player and a deterministic adaptive adversary. Notation: the player's actions are X = {1, ..., k}. Before the game: the adversary chooses a sequence of loss functions f_1, ..., f_T, where f_t : X^t → [0, 1]. The game: for t = 1, ..., T, the player chooses a distribution µ_t over X and draws X_t ~ µ_t; the player suffers and observes the loss f_t(X_1, ..., X_t); with full feedback, the player also observes the map x ↦ f_t(X_1, ..., X_{t−1}, x).
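
A minimal simulation of this protocol (my own sketch, not from the talk; the player and adversary interfaces below are illustrative assumptions), with actions indexed 0, ..., k−1:

```python
import numpy as np

def play_game(T, k, player, loss_fns, full_feedback=False, seed=0):
    """Run the T-round game. loss_fns[t](x_1, ..., x_t) is the adaptive loss f_t;
    player(t, observations) returns a distribution mu_t over the k actions."""
    rng = np.random.default_rng(seed)
    history, observations, total = [], [], 0.0
    for t in range(T):
        mu = player(t, observations)
        x = rng.choice(k, p=mu)                      # X_t ~ mu_t
        history.append(x)
        loss = loss_fns[t](*history)                 # suffer f_t(X_1, ..., X_t)
        total += loss
        if full_feedback:                            # observe x -> f_t(X_1, ..., X_{t-1}, x)
            observations.append([loss_fns[t](*history[:-1], a) for a in range(k)])
        else:
            observations.append((x, loss))           # bandit feedback: one number
    return total
```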

Adaptive vs. Oblivious. Adaptive: f_t : X^t → [0, 1] can be any function. Oblivious: the adversary chooses l_1, ..., l_T, where l_t : X → [0, 1], and sets f_t(x_1, ..., x_t) = l_t(x_t). (An oblivious adversary is a special case of an adaptive one.)

Loss, Regret. Definition: the player's expected cumulative loss is E[Σ_{t=1}^T f_t(X_1, ..., X_t)]. Definition: the player's regret w.r.t. the best action is R(T) = E[Σ_{t=1}^T f_t(X_1, ..., X_t)] − min_{x∈X} Σ_{t=1}^T f_t(x, ..., x). Interpretation: R(T) = o(T) means the player gets better with time.

Minimax Regret. Regret measures a specific player's performance; we want to measure the inherent difficulty of the problem. Definition: the minimax regret R*(T) is the inf over randomized player strategies of the sup over adversary loss sequences of the resulting expected regret. R*(T) = Θ(√T): the problem is easy. R*(T) = Θ(T): the problem is unlearnable.
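
In symbols, the verbal definition above can be written as (my rendering, using the notation of the earlier slides):

```latex
R^{\star}(T) \;=\; \inf_{\mu_{1:T}} \; \sup_{f_{1:T}} \;
\left( \mathbb{E}\!\left[\sum_{t=1}^{T} f_t(X_1,\dots,X_t)\right]
\;-\; \min_{x \in X} \sum_{t=1}^{T} f_t(x,\dots,x) \right).
```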

Full + Oblivious, a.k.a. Predicting with Expert Advice. Littlestone & Warmuth (1994), Freund & Schapire (1997). The Multiplicative Weights Algorithm: sample X_t from µ_t, where µ_t(i) ∝ exp(−γ Σ_{j=1}^{t−1} l_j(i)). Theorem: γ = 1/√T yields R(T) = O(√(T log k)).
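
A minimal sketch of this update (my own code, not from the talk), assuming an oblivious loss matrix with entries in [0, 1]:

```python
import numpy as np

def multiplicative_weights(losses, gamma):
    """Hedge / multiplicative weights with full feedback.

    losses: array of shape (T, k) with losses[t, i] = l_t(i) in [0, 1].
    gamma:  learning rate (the slide takes gamma about 1/sqrt(T)).
    """
    T, k = losses.shape
    cum = np.zeros(k)                      # cumulative losses sum_{j < t} l_j(i)
    rng = np.random.default_rng(0)
    total = 0.0
    for t in range(T):
        w = np.exp(-gamma * cum)           # mu_t(i) proportional to exp(-gamma * cum[i])
        mu = w / w.sum()
        x = rng.choice(k, p=mu)
        total += losses[t, x]
        cum += losses[t]                   # full feedback: all k losses are revealed
    return total
```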

Bandit + Oblivious, a.k.a. The Adversarial Multiarmed Bandit Problem. Auer, Cesa-Bianchi, Freund, Schapire (2002). The EXP3 Algorithm: run the weighted majority algorithm with estimates of the full feedback vectors, l̂_t(i) = l_t(i)/µ_t(i) if i = X_t, and 0 otherwise. Theorem: E[l̂_t(i)] = l_t(i), and R(T) = O(√(Tk)).
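
A corresponding sketch (again mine and simplified; the uniform-exploration mixing used in the original EXP3 analysis is omitted), built on the importance-weighted estimator above:

```python
import numpy as np

def exp3(losses, gamma):
    """EXP3 sketch: multiplicative weights fed importance-weighted loss estimates.

    losses: array (T, k) of oblivious losses; only the chosen arm's loss is revealed.
    """
    T, k = losses.shape
    cum_est = np.zeros(k)                  # cumulative estimated losses hat-l_j(i)
    rng = np.random.default_rng(0)
    total = 0.0
    for t in range(T):
        w = np.exp(-gamma * cum_est)
        mu = w / w.sum()
        x = rng.choice(k, p=mu)
        loss = losses[t, x]                # bandit feedback: one number
        total += loss
        cum_est[x] += loss / mu[x]         # unbiased: E[hat-l_t(i)] = l_t(i)
    return total
```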

Adaptive Obstacle (Arora, Dekel, Tewari 2012). R*(T) = Θ(T) in any feedback model. Proof: some action must have probability at least 1/k under µ_1; w.l.o.g. assume µ_1(1) ≥ 1/k. Define f_t(x_1, ..., x_t) = 1 if x_1 = 1, and 0 otherwise. This loss guarantees expected regret µ_1(1) · T ≥ T/k = Ω(T).

The Characterization (so far). Against an oblivious adversary, both bandit and full feedback give Θ(√T) (easy); against an adaptive adversary, both give Θ(T) (unlearnable). Axes: less feedback, more powerful adversary. Boring: the feedback models seem to be equivalent (when k = 2, say).

Adding a Switching Cost. The switching cost adversary chooses l_1, ..., l_T, where l_t : X → [0, 1], and sets f_t(x_1, ..., x_t) = ½ (l_t(x_t) + 1[x_t ≠ x_{t−1}]). With full information, the Follow the Lazy Leader algorithm (Kalai-Vempala 05) guarantees R(T) = O(√T); so does Shrinking the Dartboard (Geulen-Vöcking-Winkler 10). (Hierarchy: oblivious ⊂ switching ⊂ adaptive.)
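
As a quick illustration (a sketch with my own naming, not the talk's code), the switching-cost loss just defined:

```python
def switching_cost_loss(l_t, x_t, x_prev):
    """f_t(x_1, ..., x_t) = (l_t(x_t) + 1[x_t != x_{t-1}]) / 2.

    The factor 1/2 keeps the combined loss in [0, 1]; x_prev is None on the first round.
    """
    switch = 0.0 if x_prev is None else float(x_t != x_prev)
    return 0.5 * (l_t(x_t) + switch)
```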

m-memory Adversary, Counterfactual Feedback. The m-memory adversary defines loss functions that depend only on the m + 1 most recent actions: f_t(x_1, ..., x_t) = f̃_t(x_{t−m}, ..., x_t). A Third Feedback Model: in the counterfactual feedback model, the player receives the entire loss function f_t. Merhav et al. (2002) proved R(T) = O(T^{2/3}); Gyorgy & Neu (2011) improved this to R(T) = O(√T).
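
In code, the m-memory restriction is just a projection of the history onto its last m + 1 entries (illustrative sketch; f_tilde is a hypothetical bounded loss on m + 1 actions):

```python
def m_memory_loss(f_tilde, history, m):
    """f_t(x_1, ..., x_t) = f_tilde(x_{t-m}, ..., x_t); assumes t > m."""
    return f_tilde(*history[-(m + 1):])
```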

Adversaries and Feedbacks. Adversaries, from weakest to strongest: oblivious, switching, m-memory, adaptive. Feedback models, from least to most informative: bandit, full, counterfactual.

The Characterization (so far). The grid of feedback models (bandit, full, counterfactual) against adversaries (oblivious, switching, m-memory, adaptive): the oblivious column is Θ(√T) (easy), the adaptive column is Θ(T) (unlearnable). Axes: less feedback, more powerful adversary.

The Characterization (so far). Oblivious: Θ(√T), easy. Adaptive: Θ(T), unlearnable. Switching: Ω(√T) and O(T^{2/3}), so easy? hard? (Arora, Dekel, Tewari 2012.)

The Characterization (so far). Oblivious: Θ(√T), easy. Adaptive: Θ(T), unlearnable. Switching-cost adversary with bandit feedback: Θ(T^{2/3}), hard. Cesa-Bianchi, Dekel, Shamir (2013) (unbounded losses); Dekel, Ding, Koren, Peres (2013).

Bandit + Switching: Upper Bound. Algorithm: split the T rounds into T/B blocks of length B; use EXP3 to choose an action x̂_j for each entire block; the feedback given to EXP3 is the average loss in the block. (Schedule: x̂_1 x̂_1 x̂_1 x̂_1 | x̂_2 x̂_2 x̂_2 x̂_2 | x̂_3 x̂_3 x̂_3 x̂_3 | x̂_4 x̂_4 x̂_4 x̂_4.) Regret analysis: R(T) ≲ T/B (switches) + B · O(√(T/B)) (loss) = O(T/B + √(TB)), minimized by selecting B = T^{1/3}, yielding R(T) = O(T^{2/3}).
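
A sketch of this blocked strategy (my code, reusing the simplified EXP3 update from before):

```python
import numpy as np

def blocked_exp3(losses, B, gamma):
    """Mini-batched EXP3 for bandits with switching costs (sketch).

    losses: array (T, k) of oblivious losses; only the played arm's loss is observed.
    B:      block length, an integer roughly T ** (1/3) as on the slide,
            so the player switches at most T/B times.
    """
    T, k = losses.shape
    cum_est = np.zeros(k)
    rng = np.random.default_rng(0)
    total = 0.0
    for j in range(0, T, B):
        w = np.exp(-gamma * cum_est)
        mu = w / w.sum()
        x = rng.choice(k, p=mu)                 # one arm for the whole block
        block = losses[j:j + B, x]
        total += block.sum()
        cum_est[x] += block.mean() / mu[x]      # feedback: importance-weighted block average
    return total
```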

Bandit + Switching: Lower Bound. Yao's Minimax Principle (1975): the expected regret of the best deterministic algorithm on a random loss sequence lower-bounds the worst-case expected regret of any randomized algorithm. Goal: find a random loss sequence for which all deterministic algorithms have expected regret Ω(T^{2/3}). For simplicity, assume k = 2.

Bandit + Switching: Lower Bound. Cesa-Bianchi, Dekel, Shamir 2013: a random walk construction. Let (S_t) be a Gaussian random walk and ε = T^{−1/3}. Randomly choose an action and assign it the loss sequence (S_t); the other action gets (S_t + ε). Key: 1/ε² switches are required before determining which action is worse! Drawback: the loss function is unbounded. Is hard learning an artifact of unboundedness?

Multi-Scale Random Walk. Define the loss of action 1: draw independent Gaussians ξ_1, ..., ξ_T ~ N(0, σ²); for each t, define a parent ρ(t) ∈ {0, ..., t−1}; define recursively L_0 = 1/2 and L_t = L_ρ(t) + ξ_t. (Figure: a parent graph on L_0, L_1, ..., L_7 with edges labeled ξ_1, ..., ξ_7; this is the loss of action 1.)
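
A sketch of this construction (my code; rho is any parent function as defined above):

```python
import numpy as np

def multiscale_walk(T, sigma, rho, seed=0):
    """Loss of action 1: L_0 = 1/2 and L_t = L_{rho(t)} + xi_t, xi_t ~ N(0, sigma^2)."""
    rng = np.random.default_rng(seed)
    xi = rng.normal(0.0, sigma, size=T + 1)
    L = np.empty(T + 1)
    L[0] = 0.5
    for t in range(1, T + 1):
        L[t] = L[rho(t)] + xi[t]    # parent rho(t) lies in {0, ..., t-1}
    return L[1:]                    # L_1, ..., L_T
```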

Examples. ρ(t) = t − gcd(t, 2^T) (figure: parent graph on L_0, ..., L_7). ρ(t) = 0: wide (figure: every L_t hangs directly off L_0). ρ(t) = t − 1: deep (figure: a simple random walk, i.e., a path).
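
The three parent functions in code (my sketch; for 1 ≤ t ≤ T, gcd(t, 2^T) is just the largest power of two dividing t):

```python
from math import gcd

def rho_balanced(t, T):
    """rho(t) = t - gcd(t, 2^T): depth and width both turn out to be logarithmic in T."""
    return t - gcd(t, 2 ** T)       # equivalently t - (t & -t) for 1 <= t <= T

def rho_wide(t, T):
    """rho(t) = 0: everything hangs off L_0 (wide, shallow)."""
    return 0

def rho_deep(t, T):
    """rho(t) = t - 1: a plain random walk (deep, narrow)."""
    return t - 1
```

To plug one of these into the earlier multiscale_walk sketch, fix T first, e.g. rho = lambda t: rho_balanced(t, T).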

The Second Action. Define the loss of action 2: draw a random sign χ with Pr(χ = +1) = Pr(χ = −1) = 1/2, and set L′_t = L_t + χε, where ε = T^{−1/3}. (Figure: the two loss sequences, action 1 (L_t) and action 2 (L′_t), over t = 1, ..., 18, separated by the gap ε.) Choosing the worse action Θ(T) times gives R(T) = Ω(T^{2/3}).
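
Putting the pieces together in code (my sketch; clipping to [0, 1] is one simple way to keep losses bounded, while the talk handles boundedness via the choice of σ below):

```python
import numpy as np

def two_action_losses(T, sigma, rho, seed=0):
    """Losses of both actions: action 2 is action 1 shifted by chi * eps, eps = T^(-1/3)."""
    rng = np.random.default_rng(seed)
    L1 = multiscale_walk(T, sigma, rho, seed=seed)   # from the earlier sketch
    chi = rng.choice([-1.0, 1.0])                    # hidden sign the learner must detect
    eps = T ** (-1.0 / 3.0)
    L2 = L1 + chi * eps
    return np.clip(L1, 0.0, 1.0), np.clip(L2, 0.0, 1.0)
```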

The Information in One Sample. To avoid choosing the worse action Θ(T) times, the algorithm must identify the value of χ. Fact: Q: how many samples are needed to estimate the mean of a Gaussian with accuracy ε? A: Θ(σ²/ε²). (Figure: the parent graph over L_0, ..., L_7 above the player's action sequence 2 1 1 1 2 2 2; some edges yield no info about χ, others yield one sample each.) How many red (informative) edges can there be? The player needs at least σ²/ε² = σ² T^{2/3} of them.

Counting the Information. Define width(ρ) as the maximum size of any vertical cut in the graph induced by ρ. (Figure: an example on L_0, ..., L_7 with width(ρ) = 3.) Lemma: a switch contributes at most width(ρ) samples.

Depth. Define depth(ρ) as the length of the longest path in the graph induced by ρ. (Figure: the parent graph on L_0, ..., L_7.) For the loss to remain bounded in [0, 1], set σ ≈ 1/depth(ρ).
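
Both quantities are easy to compute for a given parent function (my sketch; the vertical cut at position s counts the edges (ρ(t), t) that straddle s):

```python
def depth_and_width(T, rho):
    """depth = longest path from L_0; width = largest vertical cut in the parent graph."""
    d = [0] * (T + 1)                      # d[t] = number of edges from L_0 down to L_t
    for t in range(1, T + 1):
        d[t] = d[rho(t)] + 1
    depth = max(d)
    width = max(
        sum(1 for t in range(1, T + 1) if rho(t) < s <= t)   # edges straddling the cut at s
        for s in range(1, T + 1)
    )
    return depth, width
```

For example, rho(t) = t − 1 gives depth T and width 1, while rho(t) = 0 gives depth 1 and width T.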

Putting it All Together. σ² T^{2/3} samples are needed; each switch gives at most width(ρ) samples; keeping the loss bounded in [0, 1] forces σ ≈ 1/depth(ρ). Conclusion: the number of switches needed to determine the better action is at least about T^{2/3} / (width(ρ) · depth(ρ)²). Choose ρ(t) = t − gcd(t, 2^T). Lemma: depth(ρ) ≤ log(T) and width(ρ) ≤ log(T) + 1.
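
A quick check of the arithmetic (my bookkeeping of the slide's counting argument, with ε = T^{−1/3} and σ ≈ 1/depth(ρ)):

```latex
\#\text{switches} \;\gtrsim\; \frac{\sigma^2/\epsilon^2}{\mathrm{width}(\rho)}
\;=\; \frac{\sigma^2\, T^{2/3}}{\mathrm{width}(\rho)}
\;\approx\; \frac{T^{2/3}}{\mathrm{width}(\rho)\,\mathrm{depth}(\rho)^2}
\;\geq\; \frac{T^{2/3}}{(\log T + 1)\,\log^2 T}.
```

Each switch costs 1/2, so an algorithm that does identify χ pays Ω̃(T^{2/3}) in switching costs; an algorithm that does not pays εΘ(T) = Ω(T^{2/3}) by playing the worse action. Either way, R(T) = Ω̃(T^{2/3}), matching the upper bound up to logarithmic factors.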

Corollaries & Extensions. Corollary: exploration requires switching; e.g., EXP3 switches Θ(T) times. Dependence on k: the minimax regret of the multiarmed bandit with switching costs is Θ(T^{2/3} k^{1/3}). Implications for other models: the minimax regret of learning an adversarial deterministic MDP is Θ(T^{2/3}).

Summary. A complete characterization of learning hardness. There exist online learning problems that are hard yet learnable. Learning with bandit feedback can be strictly harder than learning with full feedback. Exploration requires extensive switching.

The End. (Closing figure: the easy / hard / unlearnable characterization.)