Multi-armed bandit problems


Multi-armed bandit problems Stochastic Decision Theory (2WB12) Arnoud den Boer 13 March 2013

Set-up: 13 and 14 March: lectures. 20 and 21 March: paper presentations (four groups, 45 min per group). Before 31 March: hand in exercises. Papers and exercises can be found at http://www.win.tue.nl/~aboer/sdt/sdt.html. Please make four groups and email me (a.v.d.boer@tue.nl) before the end of this week.

Outline for today: optimization under uncertainty; multi-armed bandit problems; upper bounds on the performance of policies; lower bounds on the performance of policies.

Decision making under uncertainty. Deterministic optimization problem: max_{x ∈ X} f(x; θ), where θ ∈ Θ is unknown. Robust optimization approach: max_{x ∈ X} min_{θ ∈ Θ} f(x; θ). Note the difference with min_{θ ∈ Θ} max_{x ∈ X} f(x; θ). Also note that X is known and deterministic (otherwise: stochastic programming, chance constraints).
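
As a toy illustration of the max-min idea (not from the lecture), the following Python sketch uses a hypothetical payoff function f and small finite sets X and Theta to show that max-min and min-max generally differ.

```python
# Toy illustration of the robust (max-min) approach on small finite sets.
# The payoff f and the sets X and Theta are hypothetical, chosen only for illustration.

def f(x, theta):
    # Payoff is 1 when the decision "guesses" the unknown parameter, 0 otherwise.
    return 1.0 if x == theta else 0.0

X = [0, 1]       # feasible decisions
Theta = [0, 1]   # possible parameter values

# Robust decision value: best guaranteed payoff against the worst-case parameter.
max_min = max(min(f(x, th) for th in Theta) for x in X)

# Note the difference with min-max, where x may react to theta.
min_max = min(max(f(x, th) for x in X) for th in Theta)

print("max-min:", max_min)  # 0.0 -- no single decision is good for every theta
print("min-max:", min_max)  # 1.0 -- for each theta some decision is good
```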

Decision making under uncertainty. In stochastic decision problems, f is random and one typically maximizes the expectation of f: max_{x ∈ X} E[f(x; θ)]. For example, θ parametrizes the distribution of a random variable Y_θ(x) which depends on x, and we solve max_{x ∈ X} E[f(x; Y_θ(x))].

Decision making under uncertainty. If data D = (x_i, y_i(x_i))_{1 ≤ i ≤ n} is available, the value of θ may be inferred. 1) Let θ̂ = θ̂(D) be an estimate of θ (e.g. least squares, MLE, ...). 2) Solve max_{x ∈ X} E[f(x; Y_θ̂(x))]. Robust alternatives are possible, e.g. max_{x ∈ X} min_{θ ∈ CI} E[f(x; Y_θ(x))], where CI is a 95% confidence interval: P(θ ∈ CI) ≥ 0.95.
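
A minimal estimate-then-optimize sketch (not from the lecture), assuming a hypothetical model E[f(x; Y_θ(x))] = θx − x² and simulated data; it computes a least-squares estimate, a rough 95% confidence interval, and both the plug-in and the robust decision.

```python
# Estimate-then-optimize sketch. The model E[f(x; Y_theta(x))] = theta*x - x**2
# and the data-generating process are hypothetical, chosen only for illustration.
import numpy as np

rng = np.random.default_rng(0)
theta_true = 2.0
x_hist = rng.uniform(0.0, 2.0, size=50)                             # past decisions
y_hist = theta_true * x_hist - x_hist**2 + rng.normal(0, 0.5, 50)   # noisy observed rewards

# 1) Estimate theta by least squares: y + x^2 = theta * x + noise.
theta_hat = np.sum(x_hist * (y_hist + x_hist**2)) / np.sum(x_hist**2)

# A rough 95% confidence interval for theta (normal approximation).
resid = y_hist + x_hist**2 - theta_hat * x_hist
se = np.sqrt(np.sum(resid**2) / (len(x_hist) - 1) / np.sum(x_hist**2))
ci = (theta_hat - 1.96 * se, theta_hat + 1.96 * se)

# 2) Plug-in optimization: max_x theta_hat*x - x^2  ->  x = theta_hat / 2.
x_plugin = theta_hat / 2

# Robust alternative: max_x min_{theta in CI} theta*x - x^2; for x >= 0 the
# worst case is the lower CI endpoint, so x = max(ci[0], 0) / 2.
x_robust = max(ci[0], 0.0) / 2

print(f"theta_hat={theta_hat:.2f}, CI=({ci[0]:.2f}, {ci[1]:.2f})")
print(f"plug-in decision x={x_plugin:.2f}, robust decision x={x_robust:.2f}")
```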

Decision making under uncertainty. Consider a discrete-time sequential stochastic decision problem under uncertainty: x_t = arg max_{x ∈ X} E[f(x; Y_θ(x))] (t ∈ N, θ unknown), where previous decisions x_1, ..., x_{t−1} and observed realizations of Y_θ(x_1), ..., Y_θ(x_{t−1}) can be used to estimate θ. Then periodically updating θ̂ may be beneficial.

Decision making under uncertainty. [Diagram: a feedback loop DATA → STATISTICS (estimate unknown parameters) → OPTIMIZATION (determine optimal decision) → collect new data → DATA.]

Decision making under uncertainty. Examples of sequential stochastic decision problems under uncertainty: clinical trials, optimal placement of online advertisements, recommendation systems, optimal routing, dynamic pricing, inventory control, ...

Decision making under uncertainty. Myopic policy: x_t ∈ arg max_{x ∈ X} E[f(x; Y_{θ̂_t}(x))] for all sufficiently large t, where θ̂_t is an estimate of θ based on x_1, ..., x_{t−1} and the realizations of Y_θ(x_1), ..., Y_θ(x_{t−1}). Typical questions: How well does a myopic policy perform? Is experimentation beneficial? Given a policy, what are the costs-for-learning? What are the lowest costs-for-learning achievable by any policy?
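
The following sketch (my own illustration, not part of the slides) runs the myopic policy on a hypothetical two-armed Bernoulli bandit; the loop makes the statistics / optimization / data-collection cycle explicit and hints at why purely myopic play can incur large regret.

```python
# Myopic (greedy / certainty-equivalent) policy on a two-armed Bernoulli bandit.
# The arm means and horizon are hypothetical, chosen only for illustration.
import numpy as np

def run_myopic(mu=(0.4, 0.6), n=10_000, seed=0):
    rng = np.random.default_rng(seed)
    K = len(mu)
    pulls = np.zeros(K)      # T_i: number of times arm i has been pulled
    sums = np.zeros(K)       # cumulative reward per arm
    # Initialize by pulling each arm once so every estimate is defined.
    for i in range(K):
        pulls[i] += 1
        sums[i] += rng.random() < mu[i]
    for t in range(K, n):
        mu_hat = sums / pulls            # "statistics" step: estimate parameters
        i = int(np.argmax(mu_hat))       # "optimization" step: myopically best arm
        reward = rng.random() < mu[i]    # "collect new data" step
        pulls[i] += 1
        sums[i] += reward
    regret = n * max(mu) - float(pulls @ np.asarray(mu))
    return pulls, regret

pulls, regret = run_myopic()
print("pulls per arm:", pulls, " regret:", regret)
# Over repeated runs the myopic policy sometimes locks onto the inferior arm
# forever, so its expected regret can grow linearly in n.
```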

Multi-armed bandit problems (MAB). Given K ≥ 2 independent slot machines ("bandits", "arms"). At each time point t = 1, ..., n ∈ N, exactly one arm has to be pulled. The reward of pulling arm i is random, with unknown finite mean µ_i. Let I_t denote the arm pulled at time t. Each I_t may depend on previously chosen arms and observed rewards, but not on the future. Goal: maximize the expected reward Σ_{t=1}^n E[µ_{I_t}]. Alternatively, minimize the regret R_n = n µ_{i*} − Σ_{t=1}^n E[µ_{I_t}], where i* ∈ arg max_i µ_i.
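
A minimal sketch of this setup (my own illustration, with hypothetical arm means): a Bernoulli bandit environment, the regret R_n for a given sequence of pulled arms, and a uniformly random baseline policy whose regret grows linearly in n.

```python
# Minimal Bernoulli K-armed bandit environment and regret computation.
# Arm means are hypothetical, chosen only for illustration.
import numpy as np

class BernoulliBandit:
    def __init__(self, means, seed=0):
        self.means = np.asarray(means, dtype=float)
        self.rng = np.random.default_rng(seed)

    def pull(self, i):
        # Reward of arm i is Bernoulli(mu_i), i.i.d. across pulls and arms.
        return float(self.rng.random() < self.means[i])

def regret(bandit, arms_played):
    # R_n = n * mu_star - sum_t mu_{I_t}  (expected regret given the chosen arms)
    mu = bandit.means
    return len(arms_played) * mu.max() - mu[np.asarray(arms_played)].sum()

# Baseline: picking arms uniformly at random incurs regret linear in n.
bandit = BernoulliBandit([0.3, 0.5, 0.7])
n = 10_000
arms = list(np.random.default_rng(1).integers(0, 3, size=n))
print("regret of uniform-random play:", regret(bandit, arms))  # roughly 0.2 * n
```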

Multi-armed bandit problems. Note: rewards of arm i are i.i.d., and independent of rewards from other arms; finite number of arms; no structure or ordering assumed among arms; stationary reward distributions; finite time horizon; non-Bayesian.

Multi-armed bandit problems. A simple policy: Use arm i during time periods (i−1)N + 1, ..., iN, for i = 1, ..., K. Estimate µ̂_i = N^{−1} Σ_{t=1}^N X_{i,t}, where X_{i,1}, ..., X_{i,N} are the N rewards observed from pulling arm i. Use an arm j such that µ̂_j = max_i µ̂_i during time periods KN + 1, ..., n. Observe: both exploration and exploitation. One can show R_n = O(log n) by choosing N appropriately.
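
A sketch of this explore-then-commit policy (my own illustration; the arm means, horizon n and exploration length N are hypothetical):

```python
# Explore-then-commit sketch of the "simple policy": pull each arm N times,
# then commit to the empirically best arm. Arm means, n and N are hypothetical.
import numpy as np

def explore_then_commit(mu, n, N, seed=0):
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, dtype=float)
    K = len(mu)
    arms_played = []
    rewards = np.zeros((K, N))
    # Exploration: arm i is used during periods (i-1)N + 1, ..., iN.
    for i in range(K):
        rewards[i] = rng.random(N) < mu[i]
        arms_played += [i] * N
    mu_hat = rewards.mean(axis=1)           # mu_hat_i = N^{-1} sum_t X_{i,t}
    j = int(np.argmax(mu_hat))              # empirically best arm
    # Exploitation: use arm j during periods KN + 1, ..., n.
    arms_played += [j] * (n - K * N)
    return n * mu.max() - mu[np.asarray(arms_played)].sum()   # expected regret

# With N of order log(n), the expected regret is O(log n) for fixed arm gaps.
print(explore_then_commit(mu=[0.4, 0.5, 0.6], n=100_000, N=200))
```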

Multi-armed bandit problems. Some disadvantages of the simple policy: it does not use all data to estimate µ_i; it needs to know n in advance; with positive probability the optimal arm is chosen only o(n) times. Alternative policy?

Multi-armed bandit problems. UCB1. Idea: determine a confidence interval for µ̂_i and use the arm with the highest upper confidence bound. Choose each arm once. For all t = K + 1, ..., n, play the machine j that maximizes µ̂_j + √(2 log t / T_j(t)), where µ̂_j is the average reward obtained from arm j, and T_j(t) is the number of times arm j has been played up to time t. Again one can show R_n = O(log n).
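
A sketch of UCB1 on a Bernoulli bandit (my own illustration; arm means and horizon are hypothetical, the index follows the formula above):

```python
# UCB1 sketch (Auer, Cesa-Bianchi and Fischer, 2002): play each arm once, then
# always play the arm with the largest index  mu_hat_j + sqrt(2 log t / T_j(t)).
# Arm means and horizon are hypothetical, chosen only for illustration.
import numpy as np

def ucb1(mu, n, seed=0):
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, dtype=float)
    K = len(mu)
    T = np.zeros(K)        # T_j(t): number of pulls of arm j so far
    sums = np.zeros(K)     # cumulative reward of arm j
    total_mean = 0.0
    for t in range(1, n + 1):
        if t <= K:
            j = t - 1                                    # play each arm once
        else:
            index = sums / T + np.sqrt(2.0 * np.log(t) / T)
            j = int(np.argmax(index))
        reward = float(rng.random() < mu[j])             # Bernoulli reward
        T[j] += 1
        sums[j] += reward
        total_mean += mu[j]
    return n * mu.max() - total_mean                     # expected regret

print(ucb1(mu=[0.4, 0.5, 0.6], n=100_000))   # grows like log(n), not linearly
```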

Multi-armed bandit problems. We have seen two policies with R_n = O(log n). Can any policy do better? No: every uniformly good policy has R_n = Ω(log n). Lai and Robbins (1985): for any uniformly good policy and any suboptimal arm j, lim inf_{t→∞} E[T_j(t)] / log t ≥ 1 / D_KL(X_j ‖ X_{i*}), where D_KL(P ‖ Q) is the Kullback-Leibler divergence between the distributions of P and Q, and "uniformly good" means E[T_j(n)] = o(n^a) for all a > 0 and all suboptimal arms j.
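
For concreteness (a standard instantiation, not on the slides): for Bernoulli arms the Kullback-Leibler divergence in the bound has a closed form.

```latex
% Lai-Robbins lower bound instantiated for Bernoulli arms,
% X_j ~ Ber(\mu_j) and X_{i^*} ~ Ber(\mu^*):
\[
  D_{\mathrm{KL}}\bigl(\mathrm{Ber}(\mu_j)\,\|\,\mathrm{Ber}(\mu^*)\bigr)
  = \mu_j \log\frac{\mu_j}{\mu^*} + (1-\mu_j)\log\frac{1-\mu_j}{1-\mu^*},
\]
% so any uniformly good policy must pull a suboptimal arm j at least
\[
  E[T_j(n)] \;\gtrsim\; \frac{\log n}{D_{\mathrm{KL}}(\mathrm{Ber}(\mu_j)\,\|\,\mathrm{Ber}(\mu^*))}
\]
% times, and hence its regret satisfies
\[
  R_n \;\gtrsim\; \sum_{j:\,\mu_j < \mu^*}
  \frac{(\mu^* - \mu_j)\,\log n}{D_{\mathrm{KL}}(\mathrm{Ber}(\mu_j)\,\|\,\mathrm{Ber}(\mu^*))}.
\]
```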

Multi-armed bandit problems. How to choose between different policies, each with logarithmic regret? Consider the constant before the log-term, the finite-time behavior, and the variance of the regret. Some numerical studies: Kuleshov and Precup (2000), Vermorel and Mohri (2005) (on website).

Reminder: presentations next week. Topics: 1) Incomplete learning (20 March); 2) Adversarial bandits (20 March); 3) Non-stationarity (21 March); 4) Continuum-armed bandits (21 March). See http://www.win.tue.nl/~aboer/sdt/sdt.html for papers and more information. Please make four groups and email me (a.v.d.boer@tue.nl) before the end of this week. First-come-first-served.