Tuning bandit algorithms in stochastic environments


1 Tuning bandit algorithms in stochastic environments. Jean-Yves Audibert, CERTIS - Ecole des Ponts; Remi Munos, INRIA Futurs Lille; Csaba Szepesvári, University of Alberta. The 18th International Conference on Algorithmic Learning Theory, October 3, 2007, Sendai International Center, Sendai, Japan

2 Contents: Bandit problems; UCB and motivation; Tuning UCB by using variance estimates; Concentration of the regret; Finite horizon, finite regret (PAC-UCB); Conclusions

3 Exploration vs. Exploitation: Two treatments with unknown success probabilities. Goal: find the best treatment while losing the smallest number of patients. Explore or exploit?

4 Playing Bandits: Payoff is 0 or 1. Arm 1: $X_{11}=0$, $X_{12}=1$, $X_{13}=0$, $X_{14}=0$, $X_{15}$, $X_{16}$, $X_{17}$, ... Arm 2: $X_{21}=1$, $X_{22}=1$, $X_{23}=0$, $X_{24}=1$, $X_{25}=1$, $X_{26}=1$, $X_{27}$, ...

5 Exploration vs. Exploitation: Some Applications. Simple processes: clinical trials; job shop scheduling (random jobs); what ad to put on a web page. More complex processes (memory): optimizing production; controlling an inventory; optimal investment; poker.

6 Bandit Problems: Optimism in the Face of Uncertainty. Introduced by Lai and Robbins (1985) (?). i.i.d. payoffs: $X_{11}, X_{12}, \ldots, X_{1t}, \ldots$ and $X_{21}, X_{22}, \ldots, X_{2t}, \ldots$ Principle: the inflated value of an option is the maximum expected reward that looks quite possible given the observations so far. Select the option with the best inflated value.

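7 Some definitions: Payoff is 0 or 1. Now: $t=11$, $T_1(t-1)=4$, $T_2(t-1)=6$, $I_1=1$, $I_2=2$, ... Arm 1: $X_{11}=0$, $X_{12}=1$, $X_{13}=0$, $X_{14}=0$, $X_{15}$, $X_{16}$, $X_{17}$, ... Arm 2: $X_{21}=1$, $X_{22}=1$, $X_{23}=0$, $X_{24}=1$, $X_{25}=1$, $X_{26}=1$, $X_{27}$, ... Regret: $\hat{R}_n \stackrel{\text{def}}{=} \sum_{t=1}^{n} X_{k^*,t} - \sum_{t=1}^{n} X_{I_t,\,T_{I_t}(t)}$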
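
The bookkeeping behind these definitions can be sketched in a few lines of Python. This is only an illustration: the payoff lists are the observed 0/1 values read off the slide's example, and the names `observed`, `counts`, and `means` are hypothetical.

```python
# Observed 0/1 payoffs per arm, as read off the slide's example.
# Arm 1 has been played 4 times, arm 2 has been played 6 times.
observed = {
    1: [0, 1, 0, 0],
    2: [1, 1, 0, 1, 1, 1],
}

# T_k(t-1): number of times arm k was played before the current step t.
counts = {k: len(xs) for k, xs in observed.items()}

# Empirical mean payoff of each arm so far.
means = {k: sum(xs) / len(xs) for k, xs in observed.items()}

# Ten plays have happened, so the current time step is t = 11.
t = sum(counts.values()) + 1

print(counts, means, t)
```

The counts and the current time step agree with the values stated on the slide ($T_1(t-1)=4$, $T_2(t-1)=6$, $t=11$).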

8 Parametric Bandits [Lai&Robbins]: $X_{it} \sim p_{i,\theta_i}(\cdot)$, $\theta_i$ unknown, $t=1,2,\ldots$ Uncertainty set (reasonable values of $\theta$ given the experience so far): $U_{i,t} = \{\theta \mid p_{i,\theta}(X_{i,1:T_i(t)}) \text{ is large mod } (t, T_i(t))\}$. Inflated values: $Z_{i,t} = \max\{E_\theta \mid \theta \in U_{i,t}\}$. Rule: $I_t = \arg\max_i Z_{i,t}$

9 Bounds Upper bound: Lower bound: If an algorithm is uniformly good then..

10 UCB1 Algorithm (Auer et al., 2002). Algorithm: UCB1(b). 1. Try all options once. 2. Use option k with the highest index: Regret bound: $R_n$: expected loss due to not selecting the best option at time step n. Then:

11 Problem #1: When $b^2 \gg \sigma^2$, the regret should scale with $\sigma^2$ and not $b^2$!

12 UCB1-NORMAL. Algorithm: UCB1-NORMAL. 1. Try all options once. 2. Use option k with the highest index: $\hat\mu_{kt} + \sqrt{\frac{16\,\hat\sigma^2_{kt}\log(t)}{T_k(t-1)}}$. Regret bound:

13 Problem #1: The regret of UCB1(b) scales with $O(b^2)$. The regret of UCB1-NORMAL scales with $O(\sigma^2)$, but UCB1-NORMAL assumes normally distributed payoffs. UCB-Tuned(b): $\hat\mu_{kt} + \sqrt{\min\left(\frac{b^2}{4}, \sigma^2_{kt}\right)\frac{\log(t)}{T_k(t-1)}}$. Good experimental results. No theoretical guarantees.
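
A minimal Python sketch of the UCB-Tuned(b) index as written on this slide, reading it as the empirical mean plus $\sqrt{\min(b^2/4, \sigma^2_{kt})\,\log(t)/T_k(t-1)}$. The function name and argument names are hypothetical, and $\sigma^2_{kt}$ is assumed here to be the arm's empirical variance.

```python
import math


def ucb_tuned_index(mean, var, pulls, t, b):
    """UCB-Tuned(b) index of one arm, as read from the slide.

    mean  : empirical mean payoff of the arm
    var   : variance estimate of the arm (assumed: empirical variance)
    pulls : T_k(t-1), number of times the arm was played so far
    t     : current time step
    b     : a priori bound on the payoffs
    """
    # The variance term is capped at b^2/4, the largest variance
    # a random variable supported on [0, b] can have.
    capped_var = min(b * b / 4.0, var)
    return mean + math.sqrt(capped_var * math.log(t) / pulls)


# A low-variance arm receives a much smaller exploration bonus.
print(ucb_tuned_index(0.5, 0.01, 10, 100, 1.0))
print(ucb_tuned_index(0.5, 1.00, 10, 100, 1.0))
```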

14 UCB-V. Algorithm: UCB-V(b). 1. Try all options once. 2. Use option k with the highest index: $\hat\mu_{kt} + \sqrt{\frac{2.4\,\sigma^2_{kt}\log(t)}{T_k(t-1)}} + \frac{3b\log(t)}{T_k(t-1)}$. Regret bound:
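
The UCB-V(b) index can be sketched in Python as follows, reading the slide's formula as the empirical mean plus $\sqrt{2.4\,\sigma^2_{kt}\log(t)/T_k(t-1)}$ plus $3b\log(t)/T_k(t-1)$. The function and argument names are hypothetical, and $\sigma^2_{kt}$ is assumed to be the empirical variance of the arm.

```python
import math


def ucb_v_index(mean, var, pulls, t, b):
    """UCB-V(b) index of one arm, as read from the slide.

    mean  : empirical mean payoff of the arm
    var   : variance estimate of the arm (assumed: empirical variance)
    pulls : T_k(t-1), number of times the arm was played so far
    t     : current time step
    b     : a priori bound on the payoffs
    """
    log_t = math.log(t)
    variance_term = math.sqrt(2.4 * var * log_t / pulls)
    range_term = 3.0 * b * log_t / pulls
    return mean + variance_term + range_term


# With zero estimated variance only the range term 3*b*log(t)/pulls remains,
# which shrinks like 1/pulls rather than 1/sqrt(pulls).
print(ucb_v_index(0.5, 0.0, 50, 1000, 1.0))
print(ucb_v_index(0.5, 0.25, 50, 1000, 1.0))
```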

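15 Proof. The missing bound (hunch.net): with probability at least $1-\delta$, $|\hat\mu_t - \mu| \le \sigma_t\sqrt{\frac{2\log(3\delta^{-1})}{t}} + \frac{3b\log(3\delta^{-1})}{t}$. Bounding the sampling times of suboptimal arms (new bound).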
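
A small Python sketch of this empirical Bernstein-type confidence radius, followed by a seeded Monte Carlo check on Bernoulli payoffs. It assumes the radius has the form $\sigma_t\sqrt{2\log(3/\delta)/t} + 3b\log(3/\delta)/t$ with $\sigma_t^2$ the empirical variance. The function names, the Bernoulli parameter, and the sample sizes are illustrative choices, not taken from the slides.

```python
import math
import random


def empirical_bernstein_radius(samples, b, delta):
    """Confidence radius around the empirical mean of `samples` in [0, b]."""
    t = len(samples)
    mean = sum(samples) / t
    # Empirical variance with normalisation 1/t (an assumption of this sketch).
    var = sum((x - mean) ** 2 for x in samples) / t
    log_term = math.log(3.0 / delta)
    return math.sqrt(2.0 * var * log_term / t) + 3.0 * b * log_term / t


def coverage(p=0.3, t=200, delta=0.05, trials=500, seed=0):
    """Fraction of trials in which the true mean p lies within the radius."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        xs = [1.0 if rng.random() < p else 0.0 for _ in range(t)]
        mean = sum(xs) / t
        if abs(mean - p) <= empirical_bernstein_radius(xs, 1.0, delta):
            hits += 1
    return hits / trials


print(coverage())
```

The radius needs no prior knowledge of the variance, only the range bound b, which is what lets the index adapt to low-variance arms.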

16 Can we decrease exploration? Algorithm: UCB-V(b,ζ,c). 1. Try all options once. 2. Use option k with the highest index: $\hat\mu_{kt} + \sqrt{\frac{2\zeta\,\sigma^2_{kt}\log(t)}{T_k(t-1)}} + c\,\frac{3b\log(t)}{T_k(t-1)}$. Theorem: When $\zeta<1$, the regret will be polynomial for some bandit problems. When $c\zeta<1/6$, the regret will be polynomial for some bandit problems.
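
To make the two steps of UCB-V(b,ζ,c) concrete, here is a hedged Python simulation on Bernoulli arms. The arm means, the horizon, the default values ζ=1.2 and c=1 (chosen so that 2ζ=2.4 as in UCB-V(b)), and all names are assumptions of this sketch. The variance estimate is taken to be the empirical variance.

```python
import math
import random


def run_ucb_v(means, n, b=1.0, zeta=1.2, c=1.0, seed=0):
    """Play n rounds of UCB-V(b, zeta, c) on Bernoulli arms; return pull counts."""
    rng = random.Random(seed)
    num_arms = len(means)
    pulls = [0] * num_arms        # T_k(t-1)
    sums = [0.0] * num_arms       # running sum of payoffs per arm
    sq_sums = [0.0] * num_arms    # running sum of squared payoffs per arm

    for t in range(1, n + 1):
        if t <= num_arms:
            # Step 1: try all options once.
            k = t - 1
        else:
            # Step 2: use the option with the highest index.
            log_t = math.log(t)

            def index(i):
                s = pulls[i]
                mean = sums[i] / s
                var = max(sq_sums[i] / s - mean * mean, 0.0)
                return (mean
                        + math.sqrt(2.0 * zeta * var * log_t / s)
                        + c * 3.0 * b * log_t / s)

            k = max(range(num_arms), key=index)

        payoff = 1.0 if rng.random() < means[k] else 0.0
        pulls[k] += 1
        sums[k] += payoff
        sq_sums[k] += payoff * payoff

    return pulls


print(run_ucb_v([0.9, 0.1], 2000))
```

Lowering `zeta` or `c` reduces the exploration bonus, which is the regime the theorem on this slide warns about.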

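17 Concentration bounds. Averages concentrate: $\left|\frac{S_n}{n} - \mu\right| \le O\left(\sqrt{\frac{\log(\delta^{-1})}{n}}\right)$. Does the regret of UCB* concentrate? $\left|\frac{R_n}{n} - \mu\right| \le\ ??$, $\left|\frac{R_n}{E[R_n]} - 1\right| \le\ ??$ RISK??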

18 Logarithmic regret implies high risk. Theorem: Consider the pseudo-regret $R_n = \sum_{k=1}^{K} T_k(n)\,\Delta_k$. Then for any $\zeta>1$ and $z>\gamma\log(n)$, $P(R_n > z) \le C z^{-\zeta}$ (Gaussian tail: $P(R_n > z) \le C\exp(-z^2)$). Illustration: Two arms; $\Delta_2 = \mu_2 - \mu_1 > 0$. Modes of the law of $R_n$ at $O(\log(n))$ and $O(\Delta_2 n)$! Only happens when the support of the second best arm's distribution overlaps with that of the optimal arm.
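
The pseudo-regret in the theorem, read as the sum over arms of $T_k(n)\,\Delta_k$, can be computed from pull counts as in the Python sketch below. It assumes $\Delta_k$ is the gap between the best arm's mean and arm k's mean. The function name and the example numbers are hypothetical.

```python
def pseudo_regret(pulls, means):
    """Sum over arms of T_k(n) * Delta_k, with Delta_k = max(means) - means[k]."""
    best = max(means)
    return sum(count * (best - mu) for count, mu in zip(pulls, means))


# Two arms with means 0.9 and 0.1: every play of the second arm costs 0.8.
print(pseudo_regret([90, 10], [0.9, 0.1]))
print(pseudo_regret([10, 90], [0.9, 0.1]))
```

The two calls illustrate the two modes mentioned on the slide: a run that mostly plays the best arm has small pseudo-regret, while a run that locks onto the worse arm has pseudo-regret growing linearly in n.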

19 Finite horizon: PAC-UCB. Algorithm: PAC-UCB(N). 1. Try all options once. 2. Use option k with the highest index: $\hat\mu_{kt} + \sqrt{\frac{2\,\sigma^2_{kt}L_t}{T_k(t-1)}} + \frac{3bL_t}{T_k(t-1)}$, where $L_t = \log(NK(T_k(t-1)+1))$. Theorem: At time N, with probability $1-1/N$, the number of suboptimal plays is bounded by $O(\log(KN))$. Good when N is known beforehand.
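
A minimal Python sketch of the PAC-UCB(N) index, assuming the slide's formula reads as the empirical mean plus $\sqrt{2\sigma^2_{kt}L_t/T_k(t-1)}$ plus $3bL_t/T_k(t-1)$ with $L_t = \log(NK(T_k(t-1)+1))$. The function and argument names are hypothetical, and the variance is assumed to be the empirical variance.

```python
import math


def pac_ucb_index(mean, var, pulls, b, horizon, num_arms):
    """PAC-UCB(N) index of one arm, as read from the slide.

    horizon  : N, the number of rounds known in advance
    num_arms : K, the number of options
    pulls    : T_k(t-1), number of times the arm was played so far
    """
    # The log term depends on N, K and the arm's own pull count,
    # not on the current time step t.
    log_term = math.log(horizon * num_arms * (pulls + 1))
    return (mean
            + math.sqrt(2.0 * var * log_term / pulls)
            + 3.0 * b * log_term / pulls)


print(pac_ucb_index(0.5, 0.25, 10, 1.0, 100, 2))
print(pac_ucb_index(0.5, 0.25, 100, 1.0, 100, 2))
```

Because the exploration term no longer grows with t, an arm's index only changes when that arm is played, which fits the known-horizon setting the slide describes.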

20 Conclusions: Taking the variance into account lessens dependence on the a priori bound b. Low expected regret => high risk. PAC-UCB: finite regret, known horizon, exponential concentration of the regret. Optimal balance? Other algorithms? Greater generality: look up the paper!

21 Thank you! Questions?

22 References. Optimism in the face of uncertainty: Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4-22. UCB1 and more: Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3). Audibert, J., Munos, R., and Szepesvári, Cs. (2007). Tuning bandit algorithms in stochastic environments. ALT-2007.
