Zooming Algorithm for Lipschitz Bandits

Zooming Algorithm for Lipschitz Bandits
Alex Slivkins (Microsoft Research New York City)
Based on joint work with Robert Kleinberg and Eli Upfal (STOC'08)

Running examples

Dynamic pricing. You release a song which customers can download for a price. What price will maximize profit? Customers arrive one by one, and you can update the price.

Web advertisement. Every time someone visits your site, you display an ad. There are many ads to choose from. Which one will maximize the number of clicks? You can update your selection based on the clicks received.

Multi-Armed Bandits

In a (basic) MAB problem one has:
- a set of strategies (a.k.a. arms)

                 arms      payoffs
    pricing:     prices    payments
    web ads:     ads       clicks

- an expected payoff μ(x) ∈ [0,1] for each arm x (fixed but unknown)

In each round an algorithm:
- picks an arm x based on past history
- receives a payoff (money): an independent sample in [0,1] from a distribution D(x) with expectation μ(x)

[Figure: three example arms with μ = .6, .2, .4]
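A minimal sketch of the interaction protocol just described, using the three arms from the figure. Assumptions (mine, not from the talk): the payoff distributions D(x) are Bernoulli, and the uniform-random policy is only a placeholder for a real bandit algorithm.

```python
import random

# Three arms with (unknown to the algorithm) expected payoffs, as in the
# figure on the slide: mu = .6, .2, .4. Payoffs are Bernoulli samples in [0,1].
mu = {"A": 0.6, "B": 0.2, "C": 0.4}
rng = random.Random(0)

def pull(arm):
    """One round: an independent sample in [0,1] from D(arm) with expectation mu[arm]."""
    return 1.0 if rng.random() < mu[arm] else 0.0

# Interaction protocol: in each round the algorithm picks an arm based on past
# history and receives the payoff. Here the "algorithm" just picks uniformly.
history = []
for t in range(1000):
    arm = rng.choice(list(mu))        # placeholder policy; a real algorithm uses history
    history.append((arm, pull(arm)))

print(sum(p for _, p in history))     # total payoff over 1000 rounds
```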

Exploration vs Exploitation

Explore: try out new arms to get more info ... perhaps playing low-paying arms.
Exploit: play arms that seem best based on current info ... but maybe there is a better arm that we don't know about.

Classical setting since 1952; OR, Econ, CS: various versions and extensions.

Background

Early work: maximize expected time-discounted payoffs w.r.t. independent Bayesian priors over arms. Solved by the "Gittins index policy" (Gittins and Jones (1972)).

We focus on the prior-free version:
- playing arm x yields an i.i.d. sample with expectation μ(x)
- benchmark: μ* = max_x μ(x)
- regret in T rounds: R(T) = T·μ* − [expected total payoff]

Background

For a small number of arms K, the problem is well understood (Lai & Robbins (1985), Auer et al. (2002)).

Benchmark: μ* = max_x μ(x); regret: R(T) = T·μ* − [expected total payoff].
- R(T) ≤ O(K log T) for any fixed μ (instance-dependent bound)
- R(T) ≤ O((K T log K)^(1/2)) in the worst case
Both bounds are optimal, via relative-entropy arguments.

Bandits with side information

What if the strategy set is very large? Infinite? A needle in a haystack: hopeless unless we have side information.

Dynamic pricing: unlimited supply of identical digital goods, the seller can update the price; arms are prices.
- numerical similarity between arms
- known shape of the payoff function, e.g. smoothness

Web advertisement: a new user arrives, display one of the k ads, maximize #clicks; arms are ads.
- similarity between arms: topical taxonomy, feature vectors, etc.
- context: user profile, page features

Present scope: similarity between arms.

Lipschitz MAB problem

The algorithm is given a similarity metric L on arms such that

    |μ(x) − μ(y)| ≤ L(x, y) for all arms x, y    (Lipschitz condition)

In other words, the payoff function μ: x ↦ μ(x) is Lipschitz-continuous w.r.t. the metric space (X, L).

Problem instance: a (known) metric space (X, L) and an (unknown) payoff function μ.
- How to utilize this side information?
- What performance guarantees (regret) can be achieved?
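A toy check of the Lipschitz condition on ([0,1], |x − y|); the specific payoff function μ(x) = 0.7 − 0.5·|x − 0.6| is my example, not from the talk.

```python
import random

# Toy Lipschitz instance: arms are points x in [0,1] (e.g. prices) and
# mu(x) = 0.7 - 0.5*|x - 0.6| is 0.5-Lipschitz, so it satisfies
# |mu(x) - mu(y)| <= L(x, y) with L(x, y) = |x - y|.
mu = lambda x: 0.7 - 0.5 * abs(x - 0.6)
L = lambda x, y: abs(x - y)

rng = random.Random(0)
ok = all(
    abs(mu(x) - mu(y)) <= L(x, y)
    for x, y in ((rng.random(), rng.random()) for _ in range(10_000))
)
print(ok)  # True: the instance satisfies the Lipschitz condition
```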

A (very) naive algorithm

In each phase:
- choose K equally spaced arms (an ε-net), use an off-the-shelf K-armed bandit algorithm
- one of the chosen arms is close to the optimum!

Phase i lasts for 2^i rounds; K = 2^(i·d/(d+2)), where d = CoveringDim.

A (very) naive algorithm (continued)

Recall: in each phase, choose K equally spaced arms (an ε-net) and run an off-the-shelf K-armed bandit algorithm; phase i lasts 2^i rounds, with K = 2^(i·d/(d+2)) and d = CoveringDim.

Definition (covering dimension of a metric space): for every r > 0, the metric space can be covered with c·r^(−d) sets of diameter r; the c-covering dimension is the smallest such d.

Fact: CovDim ≤ DoublingDim ≤ EuclideanDim.

A (very) naive algorithm (continued)

Theorem: using off-the-shelf guarantees, the naive algorithm achieves R(T) ≤ O(T^(1 − 1/(d+2)) · log T).
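A minimal runnable sketch of the naive algorithm on ([0,1], |x − y|), assuming Bernoulli payoffs and UCB1 as the off-the-shelf K-armed algorithm; the payoff function and helper names are illustrative, not from the talk.

```python
import math, random

def ucb1(arms, rounds, pull):
    """Off-the-shelf K-armed bandit algorithm (UCB1), run for a fixed number of rounds."""
    n = {a: 0 for a in arms}
    total = {a: 0.0 for a in arms}
    reward = 0.0
    for t in range(1, rounds + 1):
        if min(n.values()) == 0:                      # play each arm once first
            a = min(n, key=n.get)
        else:                                         # then maximize the UCB index
            a = max(arms, key=lambda arm: total[arm] / n[arm]
                    + math.sqrt(2 * math.log(t) / n[arm]))
        r = pull(a)
        n[a] += 1
        total[a] += r
        reward += r
    return reward

def naive_lipschitz_bandit(mu, horizon, d=1, rng=random.Random(0)):
    """Naive algorithm on ([0,1], |x-y|): in phase i, run UCB1 on K = 2^(i*d/(d+2))
    equally spaced arms for 2^i rounds (d = covering dimension, here d = 1)."""
    reward, t, i = 0.0, 0, 1
    while t < horizon:
        K = max(2, int(2 ** (i * d / (d + 2))))       # number of equally spaced arms
        arms = [k / (K - 1) for k in range(K)]        # the net for this phase
        rounds = min(2 ** i, horizon - t)
        reward += ucb1(arms, rounds,
                       lambda x: 1.0 if rng.random() < mu(x) else 0.0)  # Bernoulli payoffs
        t += rounds
        i += 1
    return reward

# Toy 0.5-Lipschitz payoff function, peaked at x* = 0.6 (illustrative choice)
mu = lambda x: 0.7 - 0.5 * abs(x - 0.6)
print(naive_lipschitz_bandit(mu, horizon=4000))
```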

Is this the right algorithm??

The naive algorithm seems wasteful:
- it places equally spaced probes (what if some regions yield better payoffs than others?)
- after the probes are placed, all similarity information is discarded

For a given metric space, can we do better?
- ... in the worst case? YES
- ... for a nice problem instance (payoff function)? YES — this talk

[Figure: a payoff function μ(x) over x ∈ [0,1], with high- and low-payoff regions]

Better algorithm for nice instances

Goal: do as well as the naive algorithm in general, but perform "better" on "nice" problem instances.

Our results: zooming algorithm

Theorem: the zooming algorithm achieves regret

    R(T) ≤ O(c · T^(1 − 1/(d+2)) · log T),

where d can be taken to be the c-covering dimension of the similarity metric L, or (better) the c-zooming dimension of the problem instance (μ, L).

Definition (covering dimension of a metric space): for every r > 0, the metric space can be covered with c·r^(−d) sets of diameter r; c-CovDim is the smallest such d. The c-zooming dimension modifies this definition (next slide).

Our results: zooming algorithm (continued)

Theorem: the zooming algorithm achieves regret R(T) ≤ O(c · T^(1 − 1/(d+2)) · log T), where d is the c-zooming dimension of the problem instance (μ, L).

Definition (zooming dimension): same as the covering dimension, except that for each r > 0 only the set {x : r/2 ≤ μ* − μ(x) ≤ r} (rather than the whole metric space) needs to be covered with c·r^(−d) sets of diameter r; c-ZoomingDim is the smallest such d.

[Figure: payoff function with the near-optimal region highlighted]
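An illustrative computation (my example, not on the slides) of why the zooming dimension can be much smaller than the covering dimension:

```latex
% Metric space X = [0,1] with L(x,y) = |x-y|, so CovDim = 1.
% Take a payoff function with a single linear peak at x*:
\[
  \mu(x) = \mu^{\ast} - |x - x^{\ast}| \qquad \text{(1-Lipschitz)}.
\]
% For every r > 0, the set appearing in the zooming-dimension definition is
\[
  \{\, x : r/2 \le \mu^{\ast} - \mu(x) \le r \,\}
  \;=\; \{\, x : r/2 \le |x - x^{\ast}| \le r \,\},
\]
% i.e. two intervals of length r/2 each, covered by 2 sets of diameter r.
% Hence ZoomingDim = 0 (with multiplier c = 2) while CovDim = 1, and the regret
% guarantee improves from O(T^{2/3} log T) to O(T^{1/2} log T).
```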

Zooming algorithm

Maintain a finite set of active arms: start with no active arms, activate them one by one. In each round, play one of the active arms.
- ACTIVATION RULE: when do we add a new active arm, and which one?
- SELECTION RULE: choose which active arm to play next.

Activation rule

r_t(x) = confidence radius of arm x at time t:

    |SampleAverage_t(x) − μ(x)| ≤ r_t(x) w.h.p., by Chernoff bounds,

where

    r_t(x) = sqrt( 8 log t / #samples of x ).
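A quick sanity check of the confidence radius formula above; the Bernoulli arm and the specific numbers are my toy setup, not from the talk.

```python
import math
import random

def conf_radius(t, n_samples):
    """Confidence radius r_t(x) = sqrt(8 log t / #samples of x), as on the slide."""
    return math.sqrt(8.0 * math.log(t) / n_samples)

# Toy check: after n samples of a Bernoulli arm with mean mu = 0.4,
# the sample average is within r_t(x) of mu with high probability.
rng, mu, t, n = random.Random(1), 0.4, 10_000, 500
avg = sum(1.0 if rng.random() < mu else 0.0 for _ in range(n)) / n
print(abs(avg - mu) <= conf_radius(t, n))   # True (with high probability)
```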

Activation rule (continued)

Confidence ball: B_t(x) = B(x, r_t(x)).

Intuition: should we activate arm y? There is no point in activating an arm that is already covered by some confidence ball. Maintain the invariant: all arms are covered by the confidence balls of the active arms.

What if some arm becomes uncovered?

ACTIVATION RULE: if an arm y becomes uncovered, activate it. Initially its confidence radius r_t(y) is very large, so the confidence ball B(y, r_t(y)) covers the entire metric space.

Self-adjusting behavior ("zoom in on region R"): if the arms in R are good, they are played often, so many arms in R get activated.

[Figures: confidence balls around active arms x, with a candidate arm y]

Selection rule

Define index_t(x) = SampleAverage_t(x) + 2·r_t(x).
Recall: |SampleAverage_t(x) − μ(x)| ≤ r_t(x) w.h.p.

SELECTION RULE: play the active arm with the maximal index.

Why does it make sense? If the index is large, then either the sample average is large (⟹ a good arm), or the confidence radius is large (⟹ we need to explore it more).
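A minimal runnable sketch combining the activation and selection rules above. Assumptions (mine, not from the talk): the arm space is [0,1] with metric |x − y|, payoffs are Bernoulli, and the coverage invariant is checked approximately on a finite grid; the class and function names are illustrative.

```python
import math
import random

class ZoomingBandit:
    """Minimal sketch of the zooming algorithm on ([0,1], L(x,y) = |x - y|)."""

    def __init__(self, grid_size=200):
        self.grid = [i / grid_size for i in range(grid_size + 1)]  # stand-in for the arm space
        self.active = []                     # active arms
        self.n = {}                          # number of samples per active arm
        self.total = {}                      # total payoff per active arm

    def radius(self, x, t):
        """Confidence radius r_t(x) = sqrt(8 log t / #samples); infinite if unsampled."""
        return math.sqrt(8.0 * math.log(t) / self.n[x]) if self.n[x] > 0 else float("inf")

    def index(self, x, t):
        """index_t(x) = sample average + 2 * confidence radius."""
        if self.n[x] == 0:
            return float("inf")
        return self.total[x] / self.n[x] + 2.0 * self.radius(x, t)

    def covered(self, y, t):
        """y is covered if it lies in some confidence ball B(x, r_t(x)) of an active arm."""
        return any(abs(y - x) <= self.radius(x, t) for x in self.active)

    def activate_if_needed(self, t):
        """ACTIVATION RULE: if some arm is uncovered, activate it."""
        for y in self.grid:
            if not self.covered(y, t):
                self.active.append(y)
                self.n[y], self.total[y] = 0, 0.0
                return

    def play(self, mu, horizon, rng=random.Random(0)):
        reward = 0.0
        for t in range(1, horizon + 1):
            self.activate_if_needed(t)
            # SELECTION RULE: play the active arm with the maximal index
            x = max(self.active, key=lambda a: self.index(a, t))
            payoff = 1.0 if rng.random() < mu(x) else 0.0   # Bernoulli sample, mean mu(x)
            self.n[x] += 1
            self.total[x] += payoff
            reward += payoff
        return reward

# Toy 0.5-Lipschitz payoff function, peaked at x* = 0.6 (illustrative choice)
mu = lambda x: 0.7 - 0.5 * abs(x - 0.6)
print(ZoomingBandit().play(mu, horizon=2000))
```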

Sketch of analysis

Key fact: if arm x is played at time t, then index_t(x) ≥ μ*.
"Badness": Δ(x) = μ* − μ(x).

Consider active arms x such that r/2 ≤ Δ(x) ≤ r. To bound the regret, we show that:
- we don't activate too many "bad" arms — sparsity: L(x, y) ≥ Ω(r) for any two such arms;
- each "bad" arm is not played too often: #samples(x) ≤ O(1/r²) (up to a log factor).
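A rough version of the regret accounting, reconstructed (by me) from the two bullets above; constants and logarithmic factors are not tracked.

```latex
% Fix a scale r and consider active arms with badness \Delta(x) \in [r/2, r].
% By sparsity and the zooming dimension, there are at most c r^{-d} such arms,
% and each is played O(\log T / r^2) times, so
\[
  \text{regret at scale } r
  \;\le\; c\, r^{-d} \cdot r \cdot O\!\left(\tfrac{\log T}{r^{2}}\right)
  \;=\; O\!\left(c\, r^{-(d+1)} \log T\right).
\]
% Arms with badness below a threshold \rho contribute at most \rho T in total.
% Summing over scales r = 2^{-i} \ge \rho and optimizing the threshold:
\[
  R(T) \;\le\; O\!\left(c\,\rho^{-(d+1)}\log T\right) + \rho\,T,
  \qquad
  \rho = \Theta\!\left(\left(\tfrac{c \log T}{T}\right)^{1/(d+2)}\right)
  \;\Longrightarrow\;
  R(T) \;\le\; O\!\left(c\,T^{1-1/(d+2)}\log T\right).
\]
```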

Extensions

Relaxed assumptions:
- no need for the triangle inequality
- "weak Lipschitz condition": μ(x*) − μ(y) ≤ L(x*, y)

Special cases with (much) more efficient sampling:
- if max_x μ(x) = 1
- if μ(x) is a function f(L(x, S)) of the distance to a target set S, then ZoomingDim = CovDim(S)

Extension: contextual bandits

Contextual bandits: in each round, an adversary chooses a context x, the algorithm chooses an arm y, and the expected payoff is μ(x, y). (If arms are ads, contexts are page/user profiles.)

Similarity info: a metric space on (context, arm) pairs such that

    |μ(x, y) − μ(x', y')| ≤ L( (x, y), (x', y') ).

Contextual zooming algorithm (Slivkins (2009)): maintain active points in the contexts × arms space, with confidence balls whose radius reflects uncertainty; in each round, look at the active points relevant to the current context and pick the one with the largest index.

[Figure: the contexts × arms space with active points and confidence balls]