Multi-armed bandit problems


Multi-armed bandit problems
Stochastic Decision Theory (2WB12)
Arnoud den Boer
13 March 2013

Set-up
- 13 and 14 March: lectures.
- 20 and 21 March: paper presentations (four groups, 45 min per group).
- Before 31 March: hand in exercises.
Papers and exercises can be found on the course website. Please form four groups and e-mail me before the end of this week.

Outline for today
- Optimization under uncertainty
- Multi-armed bandit problems
- Upper bounds on the performance of policies
- Lower bounds on the performance of policies

Decision making under uncertainty

Deterministic optimization problem:
$$\max_{x \in X} f(x; \theta),$$
where $\theta \in \Theta$ is unknown.

Robust optimization approach:
$$\max_{x \in X} \min_{\theta \in \Theta} f(x; \theta).$$

Note the difference with $\min_{\theta \in \Theta} \max_{x \in X} f(x; \theta)$. Also note that $X$ is known and deterministic (otherwise: stochastic programming, chance constraints).
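
The following sketch (my own illustration, not from the slides) contrasts the nominal and the robust solution on a small grid, for a made-up objective $f(x; \theta) = -(x - \theta)^2$:

```python
import numpy as np

def f(x, theta):
    return -(x - theta) ** 2  # hypothetical objective, for illustration only

X = np.linspace(0, 1, 101)         # feasible decisions
Theta = np.linspace(0.2, 0.9, 71)  # plausible parameter values

# Nominal: optimize for a single point guess theta = 0.5.
x_nominal = X[np.argmax(f(X, 0.5))]

# Robust: maximize the worst case over Theta.
worst_case = np.array([f(x, Theta).min() for x in X])
x_robust = X[np.argmax(worst_case)]

print(x_nominal, x_robust)  # the robust solution hedges against all of Theta
```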

In stochastic decision problems, $f$ is random and one typically maximizes the expectation of $f$:
$$\max_{x \in X} E[f(x; \theta)].$$
For example, $\theta$ parametrizes the distribution of a random variable $Y_\theta(x)$ which depends on $x$, and we solve
$$\max_{x \in X} E[f(x; Y_\theta(x))].$$

If data $D = (x_i, y_i(x_i))_{1 \le i \le n}$ is available, the value of $\theta$ may be inferred:
1) Let $\hat{\theta} = \hat{\theta}(D)$ be an estimate of $\theta$ (e.g. least squares, maximum likelihood).
2) Solve $\max_{x \in X} E[f(x; Y_{\hat{\theta}}(x))]$.

Robust alternatives are possible, e.g.
$$\max_{x \in X} \min_{\theta \in CI} E[f(x; Y_\theta(x))],$$
where $CI$ is a 95% confidence interval: $P(\theta \in CI) \ge 0.95$.
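
A minimal estimate-then-optimize sketch (my own illustration, assuming $Y_\theta(x) \sim N(\theta x, 1)$ and $f(x; y) = y - \tfrac{1}{2}x^2$, so the plug-in problem is maximized at $x = \hat{\theta}$):

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 2.0
x_data = rng.uniform(0, 1, size=50)
y_data = theta_true * x_data + rng.normal(size=50)

# Step 1: estimate theta by least squares (one-parameter fit through the origin).
theta_hat = x_data @ y_data / (x_data @ x_data)

# Step 2: maximize the plug-in expected objective
# E[f(x; Y_theta_hat(x))] = theta_hat * x - 0.5 * x**2 over a grid.
X = np.linspace(0, 3, 301)
x_star = X[np.argmax(theta_hat * X - 0.5 * X ** 2)]
print(theta_hat, x_star)
```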

Consider a discrete-time sequential stochastic decision problem under uncertainty:
$$x_t = \arg\max_{x \in X} E[f(x; Y_\theta(x))] \qquad (t \in \mathbb{N},\ \theta \text{ unknown}),$$
where previous decisions $x_1, \dots, x_{t-1}$ and observed realizations of $Y_\theta(x_1), \dots, Y_\theta(x_{t-1})$ can be used to estimate $\theta$. Then periodically updating $\hat{\theta}$ may be beneficial.

[Diagram, built up over several slides: a feedback loop connecting DATA, STATISTICS and OPTIMIZATION — estimate unknown parameters from the data (statistics), determine the optimal decision (optimization), collect new data, and repeat.]
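
The loop in the diagram can be written as a short skeleton (a sketch; `estimate`, `optimize` and `observe` are hypothetical placeholders, not functions from the slides):

```python
# Generic sketch of the DATA -> STATISTICS -> OPTIMIZATION feedback loop.
def sequential_decisions(T, estimate, optimize, observe):
    data = []
    for t in range(T):
        theta_hat = estimate(data)   # STATISTICS: estimate unknown parameters
        x_t = optimize(theta_hat)    # OPTIMIZATION: determine optimal decision
        y_t = observe(x_t)           # DATA: collect a new observation
        data.append((x_t, y_t))
    return data
```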

Examples of sequential stochastic decision problems under uncertainty:
- Clinical trials
- Optimal placement of online advertisements
- Recommendation systems
- Optimal routing
- Dynamic pricing
- Inventory control
- ...

Myopic policy:
$$x_t \in \arg\max_{x \in X} E[f(x; Y_{\hat{\theta}_t}(x))] \quad \text{for all sufficiently large } t,$$
where $\hat{\theta}_t$ is an estimate of $\theta$, based on $x_1, \dots, x_{t-1}$ and realizations of $Y_\theta(x_1), \dots, Y_\theta(x_{t-1})$.

Typical questions:
- How well does a myopic policy perform?
- Is experimentation beneficial?
- Given a policy, what are the costs-for-learning?
- What are the lowest costs-for-learning achievable by any policy?

Multi-armed bandit problems (MAB)

Given $K \ge 2$ independent slot machines ('bandits', 'arms'). At each time point $t = 1, \dots, n \in \mathbb{N}$, exactly one arm has to be pulled. The reward of pulling arm $i$ is random, with unknown finite mean $\mu_i$.

Let $I_t$ denote the arm pulled at time $t$. Each $I_t$ may depend on previously chosen arms and observed rewards, but not on the future.

Goal: maximize the expected reward $\sum_{t=1}^{n} E[\mu_{I_t}]$. Alternatively, minimize the regret
$$R_n = n \mu_{i^*} - \sum_{t=1}^{n} E[\mu_{I_t}], \qquad \text{where } i^* \in \arg\max_i \mu_i.$$
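
To make the setting concrete, here is a minimal simulation sketch (my own illustration, not from the slides): a Bernoulli bandit with made-up arm means, a generic policy interface, and the regret $R_n$ of a uniformly random baseline policy. Later sketches reuse `run`, `mu`, `n` and `rng` from this block.

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([0.3, 0.5, 0.7])  # hypothetical arm means; arm 2 is optimal
n = 10_000                      # time horizon

def run(policy):
    """Play n rounds; policy(t, pulls, sums) returns the arm to pull at time t."""
    pulls = np.zeros(len(mu), dtype=int)  # T_i: number of times arm i was played
    sums = np.zeros(len(mu))              # total observed reward per arm
    expected_reward = 0.0
    for t in range(n):
        i = policy(t, pulls, sums)
        reward = float(rng.random() < mu[i])  # Bernoulli(mu_i) reward
        pulls[i] += 1
        sums[i] += reward
        expected_reward += mu[i]
    return n * mu.max() - expected_reward     # regret R_n

# Baseline: a uniformly random policy incurs regret linear in n.
print(run(lambda t, pulls, sums: rng.integers(len(mu))))
```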

Note:
- Rewards of arm $i$ are i.i.d., and independent of rewards from other arms.
- Finite number of arms.
- No structure or ordering assumed among arms.
- Stationary reward distributions.
- Finite time horizon.
- Non-Bayesian.

A simple policy (implemented in the sketch below):
- Use arm $i$ during time periods $(i-1)N + 1, \dots, iN$, for $i = 1, \dots, K$.
- Estimate $\hat{\mu}_i = N^{-1} \sum_{t=1}^{N} X_{i,t}$, where $X_{i,1}, \dots, X_{i,N}$ are the $N$ rewards observed from pulling arm $i$.
- Use an arm $j$ such that $\hat{\mu}_j = \max_i \hat{\mu}_i$ during time periods $KN + 1, \dots, n$.

Observe: both exploration and exploitation. One can show $R_n = O(\log n)$ by choosing $N$ appropriately.
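
A sketch of this explore-then-commit policy in the simulation framework above (the default $N = 100$ is an arbitrary choice of mine; the slides choose $N$ as a function of $n$ to obtain logarithmic regret):

```python
def explore_then_commit(N=100):
    K = len(mu)
    committed = {"arm": None}
    def policy(t, pulls, sums):
        if t < K * N:                 # exploration: N pulls of each arm in turn
            return t // N
        if committed["arm"] is None:  # commit once, based on exploration data only
            committed["arm"] = int(np.argmax(sums / pulls))
        return committed["arm"]       # exploitation: play the committed arm
    return policy

print(run(explore_then_commit()))     # regret of explore-then-commit
```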

Some disadvantages of the simple policy:
- It does not use all data to estimate $\mu_i$.
- It needs to know $n$ in advance.
- With positive probability, the number of times the optimal arm is chosen is $o(n)$.

Alternative policy?

UCB1. Idea: determine a confidence bound for $\hat{\mu}_i$, and use the arm with the highest upper confidence bound (see the sketch below):
- Choose each arm once.
- For all $t = K + 1, \dots, n$, play the arm $j$ that maximizes
$$\hat{\mu}_j + \sqrt{\frac{2 \log t}{T_j(t)}},$$
where $\hat{\mu}_j$ is the average reward obtained from arm $j$, and $T_j(t)$ is the number of times arm $j$ has been played up to time $t$.

Again one can show $R_n = O(\log n)$.
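
A sketch of UCB1 in the same framework (my own illustration; the index follows the formula above, with time counted from 0 in the simulation):

```python
def ucb1(t, pulls, sums):
    K = len(pulls)
    if t < K:                                # initialization: choose each arm once
        return t
    means = sums / pulls                     # empirical mean reward per arm
    bonus = np.sqrt(2 * np.log(t) / pulls)   # exploration bonus per arm
    return int(np.argmax(means + bonus))     # highest upper confidence bound

print(run(ucb1))                             # regret of UCB1
```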

We have seen two policies with $R_n = O(\log n)$. Can any policy do better?

No. All policies have $R_n = \Omega(\log n)$. Lai and Robbins (1985): for any uniformly good policy and any suboptimal arm $j$,
$$\liminf_{t \to \infty} \frac{E[T_j(t)]}{\log t} \ge \frac{1}{D_{KL}(X_j \,\|\, X_{i^*})},$$
where $D_{KL}(P \,\|\, Q)$ is the Kullback-Leibler divergence between the distributions of $P$ and $Q$, and uniformly good means $E[T_j(n)] = o(n^a)$ for all $a > 0$ and all suboptimal arms $j$.
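
For Bernoulli arms the Kullback-Leibler divergence has a closed form, so the constant in the lower bound can be computed directly. A small sketch for the example means used above (my own illustration):

```python
def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

# Lai-Robbins: each suboptimal arm j must be pulled roughly at least
# log(n) / KL(mu_j, mu_star) times in expectation.
mu_star = mu.max()
for j, m in enumerate(mu):
    if m < mu_star:
        print(j, np.log(n) / kl_bernoulli(m, mu_star))
```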

How to choose between different policies, each with logarithmic regret? For instance, by comparing:
- the constant in front of the log-term,
- finite-time behavior,
- the variance of the regret.

Some numerical studies: Kuleshov and Precup (2000), Vermorel and Mohri (2005) (on the website). A quick comparison in the simulation framework above is sketched below.
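
A rough numerical comparison of the two sketched policies (the number of replications is an arbitrary choice of mine):

```python
# Compare mean and variance of regret over independent replications.
regrets_etc = [run(explore_then_commit()) for _ in range(20)]
regrets_ucb = [run(ucb1) for _ in range(20)]
for name, r in [("ETC", regrets_etc), ("UCB1", regrets_ucb)]:
    print(name, np.mean(r), np.var(r))
```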

Reminder: presentations next week

Topics:
1. Incomplete learning (20 March)
2. Adversarial bandits (20 March)
3. Non-stationarity (21 March)
4. Continuum-armed bandits (21 March)

See the course website for papers and more information. Please form four groups and e-mail me (a.v.d.boer@tue.nl) before the end of this week. First-come-first-served.
