Multi-armed bandit problems
Stochastic Decision Theory (2WB12)
Arnoud den Boer, 13 March 2013
Set-up
13 and 14 March: lectures.
20 and 21 March: paper presentations (four groups, 45 min per group).
Before 31 March: hand in exercises.
Papers and exercises can be found at http://www.win.tue.nl/~aboer/sdt/sdt.html
Please form four groups and email me (a.v.d.boer@tue.nl) before the end of this week.
Outline for today
- Optimization under uncertainty
- Multi-armed bandit problems
- Upper bounds on the performance of policies
- Lower bounds on the performance of policies
Decision making under uncertainty
Deterministic optimization problem:
$\max_{x \in X} f(x; \theta)$,
where $\theta \in \Theta$ is unknown.
Robust optimization approach:
$\max_{x \in X} \min_{\theta \in \Theta} f(x; \theta)$.
Note the difference with $\min_{\theta \in \Theta} \max_{x \in X} f(x; \theta)$.
Also note that $X$ is known and deterministic (otherwise: stochastic programming, chance constraints).
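The order of the operators matters: in the max-min the decision maker commits first and nature then picks the worst parameter, while in the min-max nature commits first. A minimal Python sketch, with a hypothetical 2x2 payoff table, shows that the two values can differ strictly:

```python
# Hypothetical payoff table f(x, theta); rows are decisions, columns parameters.
f = {("x1", "t1"): 2.0, ("x1", "t2"): 0.0,
     ("x2", "t1"): 0.0, ("x2", "t2"): 2.0}
X, Theta = ["x1", "x2"], ["t1", "t2"]

# Robust value: we choose x, then nature picks the worst theta for that x.
maximin = max(min(f[x, th] for th in Theta) for x in X)   # = 0.0
# Reversed order: nature fixes theta, then we pick the best x for it.
minimax = min(max(f[x, th] for x in X) for th in Theta)   # = 2.0
print(maximin, minimax)   # maximin <= minimax always; here the gap is strict
```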
Decision making under uncertainty
In stochastic decision problems, $f$ is random and one typically maximizes the expectation of $f$:
$\max_{x \in X} E[f(x; \theta)]$.
For example, $\theta$ parametrizes the distribution of a random variable $Y_\theta(x)$ which depends on $x$, and we solve
$\max_{x \in X} E[f(x; Y_\theta(x))]$.
Decision making under uncertainty
If data $D = (x_i, y_i(x_i))_{1 \le i \le n}$ is available, the value of $\theta$ may be inferred:
1) Let $\hat{\theta} = \hat{\theta}(D)$ be an estimate of $\theta$ (e.g. least squares, maximum likelihood, ...).
2) Solve $\max_{x \in X} E[f(x; Y_{\hat{\theta}}(x))]$.
Robust alternatives are possible, e.g.
$\max_{x \in X} \min_{\theta \in CI} E[f(x; Y_\theta(x))]$,
where $CI$ is a 95% confidence interval: $P(\theta \in CI) \ge 0.95$.
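A minimal sketch of this estimate-then-optimize recipe and its robust variant, assuming an illustrative model, linear observations $Y_\theta(x) = \theta x + \text{noise}$ and objective $f(x; y) = y - x^2/2$, neither of which is prescribed by the slides:

```python
# Assumed model: Y_theta(x) = theta * x + N(0, 1) noise, f(x; y) = y - x**2 / 2,
# so E[f(x; Y_theta(x))] = theta * x - x**2 / 2, maximized at x = theta.
import numpy as np

rng = np.random.default_rng(0)
theta_true = 1.5
x_data = rng.uniform(0, 2, size=50)
y_data = theta_true * x_data + rng.normal(size=50)

# 1) Least-squares estimate of theta (no intercept).
theta_hat = float(x_data @ y_data / (x_data @ x_data))

# 2) Plug-in optimization: maximize theta_hat * x - x^2 / 2 over a grid.
grid = np.linspace(-3, 3, 601)
x_plugin = grid[np.argmax(theta_hat * grid - grid**2 / 2)]

# Robust alternative: maximize the worst case over a 95% CI for theta.
se = 1 / np.sqrt(x_data @ x_data)            # std. error under unit noise variance
lo, hi = theta_hat - 1.96 * se, theta_hat + 1.96 * se
worst_case = np.minimum(lo * grid, hi * grid) - grid**2 / 2
x_robust = grid[np.argmax(worst_case)]

print(theta_hat, x_plugin, x_robust)
```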
Decision making under uncertainty
Consider a discrete-time sequential stochastic decision problem under uncertainty:
$x_t = \arg\max_{x \in X} E[f(x; Y_\theta(x))]$ ($t \in \mathbb{N}$, $\theta$ unknown),
where previous decisions $x_1, \ldots, x_{t-1}$ and observed realizations of $Y_\theta(x_1), \ldots, Y_\theta(x_{t-1})$ can be used to estimate $\theta$. Then periodically updating $\hat{\theta}$ may be beneficial.
Decision making under uncertainty
[Diagram: a feedback loop. DATA feeds STATISTICS (estimate unknown parameters), which feeds OPTIMIZATION (determine optimal decision), which collects new DATA.]
Decision making under uncertainty
Examples of sequential stochastic decision problems under uncertainty:
- Clinical trials
- Optimal placement of online advertisements
- Recommendation systems
- Optimal routing
- Dynamic pricing
- Inventory control
- ...
Decision making under uncertainty
Myopic policy:
$x_t \in \arg\max_{x \in X} E[f(x; Y_{\hat{\theta}_t}(x))]$ for all sufficiently large $t$,
where $\hat{\theta}_t$ is an estimate of $\theta$, based on $x_1, \ldots, x_{t-1}$ and the realizations of $Y_\theta(x_1), \ldots, Y_\theta(x_{t-1})$.
Typical questions:
- How well does a myopic policy perform?
- Is experimentation beneficial?
- Given a policy, what are the costs-for-learning?
- What are the lowest costs-for-learning achievable by any policy?
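A minimal sketch of a myopic policy running the estimate/optimize/collect loop from the diagram, reusing the illustrative linear-quadratic model above (again an assumption, not a model from the course):

```python
# Assumed model: Y_theta(x) = theta * x + noise, f(x; y) = y - x**2 / 2,
# so the myopic decision is x_t = theta_hat_t (the plug-in maximizer).
import numpy as np

rng = np.random.default_rng(1)
theta_true = 1.5
xs = [1.0]                                   # one initial decision
ys = [theta_true * xs[0] + rng.normal()]     # and its observed realization

for t in range(100):
    x_arr, y_arr = np.array(xs), np.array(ys)
    theta_hat = float(x_arr @ y_arr / (x_arr @ x_arr))  # STATISTICS: re-estimate
    x_next = theta_hat                                  # OPTIMIZATION: myopic argmax
    xs.append(x_next)                                   # DATA: act, observe, repeat
    ys.append(theta_true * x_next + rng.normal())

print(xs[-1])   # tends to settle near theta_true = 1.5
```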
Multi-armed bandit problems (MAB)
- Given $K \ge 2$ independent slot machines ("bandits", "arms").
- At each time point $t = 1, \ldots, n$ ($n \in \mathbb{N}$), exactly one arm has to be pulled.
- The reward of pulling arm $i$ is random, with unknown finite mean $\mu_i$.
- Let $I_t$ denote the arm pulled at time $t$. Each $I_t$ may depend on previously chosen arms and observed rewards, but not on the future.
- Goal: maximize the expected reward $\sum_{t=1}^{n} E[\mu_{I_t}]$.
- Alternatively, minimize the regret $R_n = n \mu_{i^*} - \sum_{t=1}^{n} E[\mu_{I_t}]$, where $i^* \in \arg\max_i \mu_i$.
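These definitions map directly onto a small simulation skeleton. The sketch below assumes Bernoulli rewards (the slides only require finite means) and estimates the regret by Monte Carlo; all names and parameter values are illustrative:

```python
import numpy as np

def regret(policy, mu, n, runs=200, seed=0):
    """Monte Carlo estimate of R_n = n * mu_star - sum_t E[mu_{I_t}]."""
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu)
    total = 0.0
    for _ in range(runs):
        history = []                       # list of (arm, reward) pairs
        for t in range(1, n + 1):
            arm = policy(t, history)       # I_t may depend only on the past
            reward = float(rng.random() < mu[arm])
            history.append((arm, reward))
            total += mu[arm]
    return n * mu.max() - total / runs

# Example: round-robin pulling incurs linear regret (~0.3 per period here).
print(regret(lambda t, history: (t - 1) % 3, [0.2, 0.5, 0.8], n=1000))
```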
Multi-armed bandit problems
Note:
- Rewards of arm $i$ are i.i.d., and independent of rewards from other arms.
- Finite number of arms.
- No structure or ordering assumed among arms.
- Stationary reward distributions.
- Finite time horizon.
- Non-Bayesian.
Multi-armed bandit problems
A simple policy:
- Use arm $i$ during time periods $(i-1)N + 1, \ldots, iN$, for $i = 1, \ldots, K$.
- Estimate $\hat{\mu}_i = N^{-1} \sum_{t=1}^{N} X_{i,t}$, where $X_{i,1}, \ldots, X_{i,N}$ are the $N$ rewards observed from pulling arm $i$.
- Use an arm $j$ such that $\hat{\mu}_j = \max_i \hat{\mu}_i$ during time periods $KN + 1, \ldots, n$.
Observe: both exploration and exploitation.
One can show $R_n = O(\log n)$ by choosing $N$ appropriately.
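The simple policy can be written out directly as a sketch; the Bernoulli arms and the `pull` callback are illustrative assumptions, not part of the slides:

```python
import numpy as np

def explore_then_exploit(K, N, n, pull):
    """pull(arm) returns one random reward; n rounds are played in total."""
    totals = np.zeros(K)
    # Exploration: arm i occupies periods (i-1)*N + 1, ..., i*N.
    for i in range(K):
        for _ in range(N):
            totals[i] += pull(i)
    mu_hat = totals / N                  # sample mean of each arm's N rewards
    best = int(np.argmax(mu_hat))        # an arm j with mu_hat_j maximal
    # Exploitation: play the winner during periods K*N + 1, ..., n.
    for _ in range(n - K * N):
        pull(best)
    return best

# Usage with hypothetical Bernoulli arms:
rng = np.random.default_rng(2)
mu = [0.2, 0.5, 0.8]
print(explore_then_exploit(K=3, N=50, n=1000,
                           pull=lambda i: float(rng.random() < mu[i])))
```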
Multi-armed bandit problems
Some disadvantages of the simple policy:
- Does not use all data to estimate $\mu_i$.
- Needs to know $n$ in advance.
- With positive probability the exploration phase selects a suboptimal arm, in which case the optimal arm is chosen only $o(n)$ times.
Alternative policy?
Multi-armed bandit problems
UCB1:
Idea: determine a confidence interval for $\hat{\mu}_i$, and use the arm with the highest upper confidence bound.
- Choose each arm once.
- For all $t = K+1, \ldots, n$, play the machine $j$ that maximizes
$\hat{\mu}_j + \sqrt{\frac{2 \log t}{T_j(t)}}$,
where $\hat{\mu}_j$ is the average reward obtained from arm $j$, and $T_j(t)$ is the number of times arm $j$ has been played up to time $t$.
Again one can show $R_n = O(\log n)$.
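A sketch of UCB1 as stated on the slide (the index rule is the standard one of Auer, Cesa-Bianchi and Fischer, 2002); the Bernoulli test arms are an assumption:

```python
import numpy as np

def ucb1(K, n, pull):
    sums = np.zeros(K)       # cumulative reward per arm
    counts = np.zeros(K)     # T_j(t): number of pulls of arm j so far
    for j in range(K):       # initialization: choose each arm once
        sums[j] += pull(j)
        counts[j] += 1
    for t in range(K + 1, n + 1):
        # Index: average reward plus the confidence radius sqrt(2 log t / T_j).
        index = sums / counts + np.sqrt(2 * np.log(t) / counts)
        j = int(np.argmax(index))
        sums[j] += pull(j)
        counts[j] += 1
    return counts            # suboptimal arms end up with O(log n) pulls

rng = np.random.default_rng(3)
mu = [0.2, 0.5, 0.8]
print(ucb1(K=3, n=10000, pull=lambda i: float(rng.random() < mu[i])))
```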
Multi-armed bandit problems
We have seen two policies with $R_n = O(\log n)$. Can any policy do better?
No: every uniformly good policy has $R_n = \Omega(\log n)$. Lai and Robbins (1985): for any uniformly good policy and any suboptimal arm $j$,
$\liminf_{t \to \infty} \frac{E[T_j(t)]}{\log t} \ge \frac{1}{D_{KL}(X_j \| X_{i^*})}$,
where $D_{KL}(P \| Q)$ is the Kullback-Leibler divergence between the distributions of $P$ and $Q$, and "uniformly good" means $E[T_j(n)] = o(n^a)$ for all $a > 0$ and all suboptimal arms $j$.
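For Bernoulli arms the Kullback-Leibler divergence has the closed form $d(p, q) = p \log(p/q) + (1-p)\log\frac{1-p}{1-q}$, which makes the Lai-Robbins constant easy to evaluate; a small numeric sketch with illustrative means:

```python
import math

def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), 0 < p, q < 1."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

mu_star, mu_j = 0.8, 0.5
# Any uniformly good policy must pull arm j about log(t) / d(mu_j, mu_star) times.
print(1 / kl_bernoulli(mu_j, mu_star))   # ~ 4.48 pulls per unit of log(t)
```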
Multi-armed bandit problems
How to choose between different policies, each with logarithmic regret?
- The constant in front of the log term
- Finite-time behavior
- Variance of the regret
Some numerical studies: Kuleshov and Precup (2000), Vermorel and Mohri (2005) (on website).
Reminder: presentations next week
Topics:
1 Incomplete learning (20 March)
2 Adversarial bandits (20 March)
3 Non-stationarity (21 March)
4 Continuum-armed bandits (21 March)
See http://www.win.tue.nl/~aboer/sdt/sdt.html for papers and more information.
Please form four groups and email me (a.v.d.boer@tue.nl) before the end of this week. First-come-first-served.