Exploration for sequential decision making: application to games, tree search, optimization, and planning


1 Exploration for sequential decision making: application to games, tree search, optimization, and planning
Rémi Munos, SequeL project (Sequential Learning), INRIA Lille - Nord Europe.
Master MVA, Reinforcement Learning course ("Apprentissage par renforcement"), 2008.

2 Exploration for sequential decision making: application to games, tree search, optimization, and planning
Global view:
Exploration in Reinforcement Learning
The multi-armed bandit problem
The optimistic strategy in uncertain environments
The robust strategy in adversarial environments
Adversarial bandits: application to games
Hierarchical bandits: application to tree search, to optimization, and to planning

3 Exploration in Reinforcement Learning
Assumption about RL: no initial model of the world (dynamics + rewards). Example: a child learns to ride a bicycle.
How should we act?
Exploitation: act optimally according to the current beliefs.
Exploration: learn more about the environment.
Tradeoff between exploration and exploitation.

4 Introduction to multi-armed bandits
General setting: at each round, several options (actions) are available to choose from, and a reward is provided according to the choice made. Our goal is to optimize the sum of rewards.
Many potential applications:
Clinical trials
Advertising: which ad to put on a web page?
Labor markets: which job should a worker choose?
Optimization of noisy functions
Numerical resource allocation

5 Example: a two-armed bandit
Say there are 2 arms. We have pulled the arms so far and recorded, for each round, the arm pulled and the reward obtained (a table with columns Time, Arm pulled, Reward of arm 1, Reward of arm 2; the numerical entries are shown on the slide).
Which arm should we pull next? What are the assumptions about the rewards? What is really our goal?

6 The stochastic bandit problem
Setting: a set of K arms, defined by random variables X_k ∈ [0,1] whose laws are unknown. At each time t, choose an arm k_t and receive a reward x_t, an i.i.d. sample of X_{k_t}.
Goal: find an arm selection policy so as to maximize the expected sum of rewards.
Definitions: let µ_k = E[X_k] be the expected value of arm k, µ* = max_k µ_k the optimal value, and k* an optimal arm.

7 Exploration-exploitation tradeoff
Define the cumulative regret:
R_n := Σ_{t=1}^n (µ* − µ_{k_t}) = Σ_{k=1}^K n_k Δ_k,
where Δ_k := µ* − µ_k, and n_k is the number of times arm k has been pulled up to time n. (The regret results from not pulling an optimal arm from the beginning.)
Goal: find an arm selection policy so as to minimize R_n.

8 Proposed solutions
This is an old problem [Robbins, 1952], and (maybe surprisingly) not fully solved yet! Many solutions have been proposed. Examples:
ε-greedy exploration: choose the apparently best action with probability 1 − ε, or a random action with probability ε.
Follow the perturbed leader: choose the best perturbed arm.
Bayesian exploration: assign a prior to the arm distributions and, based on the rewards, choose the arm with the best posterior mean, or with the highest probability of being the best.
Boltzmann exploration: choose arm k with probability proportional to exp(X̄_k / T), where T is a temperature parameter.
Optimistic exploration: choose an arm that has a possibility of being the best.
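As a concrete illustration of two of these heuristics, here is a minimal Python sketch (mine, not from the slides) of ε-greedy and Boltzmann selection on a toy Bernoulli bandit; all names and parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5, 0.7])   # unknown to the player
K = len(true_means)

def epsilon_greedy(emp_means, eps=0.1):
    # apparent best arm with prob. 1 - eps, a uniformly random arm with prob. eps
    if rng.random() < eps:
        return int(rng.integers(K))
    return int(np.argmax(emp_means))

def boltzmann(emp_means, temperature=0.1):
    # arm k with probability proportional to exp(emp_means[k] / temperature)
    logits = emp_means / temperature
    p = np.exp(logits - logits.max())     # subtract the max for numerical stability
    p /= p.sum()
    return int(rng.choice(K, p=p))

def run(select, n=5000):
    counts, sums = np.zeros(K), np.zeros(K)
    for _ in range(n):
        emp = np.divide(sums, counts, out=np.ones(K), where=counts > 0)  # unvisited arms start at 1
        k = select(emp)
        counts[k] += 1
        sums[k] += rng.random() < true_means[k]   # Bernoulli reward
    return counts

print("eps-greedy pulls:", run(epsilon_greedy))
print("Boltzmann pulls :", run(boltzmann))
```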

9 The UCB algorithm
Upper Confidence Bounds algorithm [Auer et al. 2002]: at each time n, select an arm
k_n ∈ argmax_k B_{k,n_k,n},
where
B_{k,n_k,n} := (1/n_k) Σ_{s=1}^{n_k} x_{k,s} + √(2 log(n) / n_k) = X̄_{k,n_k} + c_{n_k,n},
n_k is the number of times arm k has been pulled up to time n, and x_{k,s} is the s-th reward obtained when pulling arm k.
Note that B_{k,n_k,n} is the sum of an exploitation term (the empirical mean X̄_{k,n_k}) and an exploration term; c_{n_k,n} is a confidence interval term, so B_{k,n_k,n} is an upper confidence bound (UCB).
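A minimal, self-contained sketch of this rule on a toy Bernoulli problem (the arm means, horizon and tie-breaking are illustrative, not from the slides): pull each arm once, then always pull the arm maximizing X̄_k + √(2 log n / n_k).

```python
import math
import numpy as np

rng = np.random.default_rng(0)
true_means = [0.3, 0.5, 0.7]
K = len(true_means)

def ucb_run(n=10000):
    counts = [0] * K          # n_k
    sums = [0.0] * K          # sum of rewards of arm k
    regret = 0.0
    for t in range(1, n + 1):
        if t <= K:            # initialization: pull each arm once
            k = t - 1
        else:
            b = [sums[k] / counts[k] + math.sqrt(2 * math.log(t) / counts[k]) for k in range(K)]
            k = int(np.argmax(b))
        reward = float(rng.random() < true_means[k])   # Bernoulli(mu_k) reward
        counts[k] += 1
        sums[k] += reward
        regret += max(true_means) - true_means[k]      # expected (pseudo-)regret
    return counts, regret

counts, regret = ucb_run()
print("pulls per arm:", counts, " pseudo-regret:", round(regret, 1))
```

The sub-optimal arms end up with only a logarithmic number of pulls, which is exactly the behaviour quantified on slide 15.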

10 Intuition behind the UCB algorithm
Idea: select an arm that has a high probability of being the best, given what has been observed so far. This is the "optimism in the face of uncertainty" strategy.
Why? For any arm k and any time t, B_{k,t,n} ≥ µ_k with high probability.
Why? From the Chernoff-Hoeffding inequality:
P( |(1/n) Σ_{t=1}^n X_t − EX| ≥ ε ) ≤ 2 e^{−2nε²}.
What is that? A concentration inequality... What?

11 Concentration of measure phenomenon
What is that? Talagrand: "A random variable that depends on the influence of many independent random variables (but not too much on any of them) is essentially constant."
Example: let f(x_1,...,x_n) be a function of n variables such that changing any single coordinate changes f by at most 1/n, and let X_1,...,X_n be independent random variables. Then:
P( |f(X_1,...,X_n) − E f| ≥ ε ) ≤ 2 e^{−2nε²}.
Many examples (Chernoff-Hoeffding, Bennett, Bernstein, Efron-Stein, logarithmic Sobolev, Azuma, Talagrand inequalities).

12 Chernoff-Hoeffding inequality
Consider f(x_1,...,x_n) := (1/n) Σ_{t=1}^n x_t. If X_1, X_2,..., X_n are independent and identically distributed (i.i.d.) copies of a random variable X ∈ [0,1], then the empirical mean
f(X_1,...,X_n) = X̄_n := (1/n) Σ_{t=1}^n X_t
concentrates around E f = EX exponentially fast, i.e.
P( |X̄_n − EX| ≥ ε ) ≤ 2 e^{−2nε²}.
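A quick numerical check of the inequality (my own, not from the slides) for Bernoulli(p) samples: the empirical tail probability should sit below the bound 2 exp(−2nε²). The values of p, n, ε and the number of trials are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, eps, trials = 0.5, 200, 0.1, 20000

samples = rng.random((trials, n)) < p            # trials x n Bernoulli(p) draws
xbar = samples.mean(axis=1)                      # empirical means over n samples
empirical_tail = float(np.mean(np.abs(xbar - p) >= eps))
bound = 2 * np.exp(-2 * n * eps**2)
print(f"empirical P(|Xbar - p| >= eps) = {empirical_tail:.4f}   Hoeffding bound = {bound:.4f}")
```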

13 Chernoff-Hoeffding, sketch of proof
Chernoff-Hoeffding inequality: P( |X̄_n − EX| ≥ ε ) ≤ 2 e^{−2nε²}.   (1)
Proof. Markov's inequality: for Y ≥ 0, P(Y ≥ λ) ≤ EY/λ. Indeed: P(Y ≥ λ) = E[1_{Y ≥ λ}] = E[1_{Y/λ ≥ 1}] ≤ E[Y/λ].
Thus, for any s > 0,
P( X̄_n − EX ≥ ε ) = P( e^{(s/n) Σ_{t=1}^n (X_t − EX)} ≥ e^{sε} )
≤ e^{−sε} E[ e^{(s/n) Σ_{t=1}^n (X_t − EX)} ]
= e^{−sε} Π_{t=1}^n E[ e^{(s/n)(X_t − EX)} ].
Now use Hoeffding's inequality E[e^{s(X − EX)}] ≤ e^{s²/8} (applied with s/n in place of s) and set s = 4εn to deduce (1); the other tail is handled symmetrically.

14 Let's go back to the UCB algorithm
By Chernoff-Hoeffding,
P( |X̄_{k,t} − µ_k| ≥ √(2 log(n)/t) ) ≤ 2 e^{−2t · 2 log(n)/t} = 2 n^{−4}.
Thus the B-values
B_{k,t,n} := (1/t) Σ_{s=1}^t X_{k,s} + √(2 log(n)/t)
are upper confidence bounds on µ_k, i.e. with high probability, B_{k,n_k,n} ≥ µ_k.

15 Regret bound for UCB
Proposition: each sub-optimal arm k is visited, in expectation, at most
E[n_k(n)] ≤ 8 log(n)/Δ_k² + constant
times (where Δ_k := µ* − µ_k > 0). Thus the expected regret is bounded by:
E R_n = Σ_k E[n_k] Δ_k ≤ 8 Σ_{k: Δ_k > 0} log(n)/Δ_k + constant.
This is optimal (up to sub-logarithmic terms) since E R_n = Ω(log n) [Lai and Robbins, 1985].
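As a worked instance of this bound (the numbers are illustrative, not from the slides), take two arms with gap Δ = 0.1 and horizon n = 10⁴:

```python
import math

n, delta = 10_000, 0.1
pulls_bound = 8 * math.log(n) / delta**2        # bound on E[n_k(n)] for the sub-optimal arm
regret_bound = 8 * math.log(n) / delta          # corresponding bound on the expected regret
print(f"E[n_k] <= {pulls_bound:.0f} (+ const),  E[R_n] <= {regret_bound:.0f} (+ const)")
```

For this small gap the pull bound (about 7400) is close to the horizon; the logarithmic behaviour only becomes visible for larger n or larger gaps.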

16 Intuition of the proof
Let k be a sub-optimal arm, and k* be an optimal arm. At time n, if arm k is selected, this means that
B_{k,n_k,n} ≥ B_{k*,n_{k*},n}, i.e.
X̄_{k,n_k} + √(2 log(n)/n_k) ≥ X̄_{k*,n_{k*}} + √(2 log(n)/n_{k*}) ≥ µ*   with high probability,
and since X̄_{k,n_k} ≤ µ_k + √(2 log(n)/n_k) with high probability,
µ_k + 2 √(2 log(n)/n_k) ≥ µ*,   i.e.   n_k ≤ 8 log(n)/Δ_k².
Thus, with high probability, if n_k > 8 log(n)/Δ_k², then arm k will not be selected. Hence n_k ≤ 8 log(n)/Δ_k² + 1 with high probability.

17 Sketch of proof
Write u = 8 log(n)/Δ_k² + 1. We have:
n_k(n) ≤ u + Σ_{t=u+1}^n 1{ k_t = k and n_k(t) > u }
≤ u + Σ_{t=u+1}^n 1{ ∃ s: u < s ≤ t and ∃ s': 1 ≤ s' ≤ t such that B_{k,s,t} ≥ B_{k*,s',t} }
≤ u + Σ_{t=u+1}^n [ 1{ ∃ s: u < s ≤ t s.t. B_{k,s,t} > µ* } + 1{ ∃ s': 1 ≤ s' ≤ t s.t. B_{k*,s',t} ≤ µ* } ]
≤ u + Σ_{t=u+1}^n [ Σ_{s=u+1}^t 1{ B_{k,s,t} > µ* } + Σ_{s'=1}^t 1{ B_{k*,s',t} ≤ µ* } ].
Now, taking the expectation of both sides,
E[n_k(n)] ≤ u + Σ_{t=u+1}^n [ Σ_{s=u+1}^t P(B_{k,s,t} > µ*) + Σ_{s'=1}^t P(B_{k*,s',t} ≤ µ*) ]
≤ u + Σ_{t=u+1}^n [ Σ_{s=u+1}^t t^{−4} + Σ_{s'=1}^t t^{−4} ] ≤ u + π²/3.

18 PAC-UCB
Let β > 0. By slightly changing the confidence interval term, i.e.
B_{k,t} := X̄_{k,t} + √( log(K t² β^{−1}) / t ),
we obtain
P( |X̄_{k,t} − µ_k| ≤ √( log(K t² β^{−1}) / t ),  for all k ∈ {1,...,K} and all t ≥ 1 ) ≥ 1 − β.
PAC-UCB [Audibert et al. 2007]: with probability 1 − β, the regret is bounded by a constant independent of n:
R_n ≤ 6 log(K β^{−1}) Σ_{k: Δ_k > 0} 1/Δ_k.

19 Adversarial environment
What if the obtained rewards are no longer i.i.d. samples of random variables, but are arbitrary?
At time t, the rewards x_{k,t} are set (maybe by some opponent), but not revealed. The player chooses an arm k_t and receives x_{k_t,t}.
Can we expect to do almost as well as if all rewards were known ahead of time, i.e. as well as the best (constant) arm?
Example (rewards in parentheses were not observed by the player):
Arm 1: 10, (7), 9, 11, (12), ...
Arm 2: (9), 0, (18), (0), 14, (0), (16), (0), ...
Reward obtained: 74. Best sum of rewards: 96. Arm 1: 79, Arm 2: 57.

20 Notion of regret
Define the regret:
R_n = max_{k ∈ {1,...,K}} Σ_{t=1}^n x_{k,t} − Σ_{t=1}^n x_{k_t,t}.
Performance is assessed in terms of the best constant strategy. This generalizes to prediction using expert advice [Lugosi and Cesa-Bianchi, 2006].
Can we expect sup_{rewards} R_n / n → 0?
If the policy of the player is deterministic, there exists a reward sequence such that the performance is arbitrarily poor. We need internal randomization.

21 EXP3 algorithm
EXP3 algorithm (Explore-Exploit using Exponential weights) [Auer et al, 2002]: start with w_{k,1} = 1. At round t:
Set p_{k,t} := w_{k,t} / Σ_{i=1}^K w_{i,t},
Choose arm k_t according to the probabilities (p_{k,t})_k,
Receive the reward x_t = x_{k_t,t},
Set the estimated rewards x̃_{k,t} = (x_t / p_{k,t}) 1{k = k_t},
Update the weights: w_{k,t+1} = w_{k,t} exp(η_t x̃_{k,t}), for a learning rate η_t of order 1/√(K t).
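A sketch of an EXP3-style player: exponential weights over importance-weighted reward estimates. The fixed learning rate η = √(2 log K / (nK)) below is a common textbook tuning, not necessarily the exact one used on this slide, and the adversary in the usage example is my own; rewards must lie in [0, 1].

```python
import numpy as np

rng = np.random.default_rng(0)

def exp3(reward_fn, K, n):
    eta = np.sqrt(2 * np.log(K) / (n * K))
    weights = np.ones(K)
    total = 0.0
    for t in range(n):
        p = weights / weights.sum()
        k = int(rng.choice(K, p=p))
        x = reward_fn(k, t)                 # only the chosen arm's reward is observed
        xhat = np.zeros(K)
        xhat[k] = x / p[k]                  # unbiased importance-weighted estimate
        weights *= np.exp(eta * xhat)
        weights /= weights.max()            # rescale to avoid numerical overflow
        total += x
    return total

# Example: an oblivious adversary; arm 1 is better on average.
rewards = rng.random((2, 10000)) * np.array([[0.4], [0.6]])
total = exp3(lambda k, t: rewards[k, t], K=2, n=10000)
print("EXP3 total reward:", round(total, 1), "  best single arm:", round(rewards.sum(axis=1).max(), 1))
```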

22 EXP3 results
The expected regret (with respect to the internal randomization) satisfies
sup_{rewards} E R_n = O( √(n K log K) )
(proof not provided here, see [Lugosi and Cesa-Bianchi, 2006] or [Stoltz, 2005]).
Properties: this is optimal (in a worst-case sense) up to a logarithmic factor, since
sup_{rewards} E R_n = Ω( √(n K) ).
If all rewards are revealed to the learner (full information), similar algorithms achieve
sup_{rewards} E R_n = O( √(n log K) ).

23 In summary...
Distribution-dependent bounds:
UCB: E R_n = O( Σ_k (1/Δ_k) log n )
lower bound: E R_n = Ω( log n )
The bound for UCB depends on the gaps Δ_k. We may also derive bounds that do not depend on the reward distributions.
Distribution-free bounds:
UCB: sup_{distributions} E R_n = O( √(K n log n) )
EXP3: sup_{rewards} E R_n = O( √(K n log K) )
lower bound: sup_{rewards} E R_n = Ω( √(K n) )
Remark: although EXP3 has a better distribution-free rate with respect to n, it does not possess the nice log(n) distribution-dependent rate.

24 Population of bandits
Bandit (or regret minimization) algorithms = methods for rapidly selecting the best action: a basic building block for solving more complex problems.
We now consider a population of bandits:
Adversarial bandits
Collaborative bandits

25 Adversarial bandits Each bandit chooses actions in order to minimize its own regret. The resulting behavior may be interesting... Examples of applications: Multi-agent learning Congestion games, transmission networks Equilibrium in games

26 Game between bandits
Consider a 2-player zero-sum repeated game: A and B simultaneously play action 1 or 2, and A receives a reward r_A(A_t, B_t) given by a 2×2 payoff matrix in which A is rewarded when both players play the same action (A likes consensus, B likes conflicts).
Now, let A and B be bandit algorithms, each aiming at minimizing its own regret, i.e. for player A:
R_n(A) := max_{a ∈ {1,2}} Σ_{t=1}^n r_A(a, B_t) − Σ_{t=1}^n r_A(A_t, B_t).
What would happen?

27 Nash equilibrium
Nash equilibrium: a (mixed) strategy for both players such that no player has an incentive to change his own strategy unilaterally.
Here: A plays 1 with probability p_A = 1/2, and B plays 1 with probability p_B = 1/2.

28 Regret minimization ⇒ Nash equilibrium
Define the regret of A:
R_n(A) := max_{a ∈ {1,2}} Σ_{t=1}^n r_A(a, B_t) − Σ_{t=1}^n r_A(A_t, B_t),
and that of B accordingly.
Proposition: if both players perform a (Hannan) consistent regret-minimization strategy (i.e. R_n(A)/n → 0 and R_n(B)/n → 0), then the empirical frequencies of the chosen actions of both players converge to a Nash equilibrium.
(Remember that EXP3 is consistent!)
Note that in general, we have convergence towards correlated equilibria [Foster and Vohra, 1997].

29 Sketch of proof
Write p̄_A^n := (1/n) Σ_{t=1}^n 1{A_t = 1}, p̄_B^n := (1/n) Σ_{t=1}^n 1{B_t = 1}, and r_A(p, q) := E[r_A(A, B)] when A ~ p and B ~ q.
Regret minimization, R_n(A)/n → 0, means that for all ε > 0 and n large enough:
max_{a ∈ {1,2}} (1/n) Σ_{t=1}^n r_A(a, B_t) − (1/n) Σ_{t=1}^n r_A(A_t, B_t) ≤ ε,
i.e.  max_{a ∈ {1,2}} r_A(a, p̄_B^n) − r_A(p̄_A^n, p̄_B^n) ≤ ε,
i.e.  r_A(p, p̄_B^n) − r_A(p̄_A^n, p̄_B^n) ≤ ε  for all p ∈ [0,1].
Now, using R_n(B)/n → 0, we deduce that:
r_A(p, p̄_B^n) − ε ≤ r_A(p̄_A^n, p̄_B^n) ≤ r_A(p̄_A^n, q) + ε,  for all p, q ∈ [0,1].
Thus the empirical actions of both players are arbitrarily close to a Nash strategy.
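A toy illustration (my own sketch) of slides 26-29: two EXP3-style players repeatedly play the consensus/conflict game, which I encode here as r_A = 1 if both actions match and 0 otherwise, with r_B = 1 − r_A (a matching-pennies-like zero-sum game, my reading of the slide's matrix). Their empirical action frequencies should approach the Nash mixed strategy (1/2, 1/2). The learning rate and horizon are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

class Exp3Player:
    def __init__(self, K, eta):
        self.w = np.ones(K)
        self.eta = eta
    def act(self):
        self.p = self.w / self.w.sum()
        return int(rng.choice(len(self.w), p=self.p))
    def update(self, k, reward):
        xhat = np.zeros(len(self.w))
        xhat[k] = reward / self.p[k]        # importance-weighted reward estimate
        self.w *= np.exp(self.eta * xhat)
        self.w /= self.w.max()              # rescale for numerical stability

n = 100000
A, B = Exp3Player(2, 0.001), Exp3Player(2, 0.001)
freq = np.zeros(2)                          # how often each player chose action index 1
for t in range(n):
    a, b = A.act(), B.act()                 # actions coded 0/1 here (1/2 on the slides)
    rA = 1.0 if a == b else 0.0
    A.update(a, rA)
    B.update(b, 1.0 - rA)
    freq += [a, b]
print("fraction of rounds A and B chose their second action:", np.round(freq / n, 3))
```

Both printed frequencies land close to 0.5, i.e. the empirical play approaches the mixed Nash equilibrium of the slide, even though neither player knows the payoff matrix.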

30 Texas Hold'em Poker
In the 2-player Poker game, the Nash equilibrium is interesting (zero-sum game).
A policy maps each information set (my cards + board + pot) to probabilities over decisions (check, raise, fold). The space of policies is huge!
Idea: approximate the Nash equilibrium by using bandit algorithms assigned to each information set.
This provided the world's best 2-player pot-limit Texas Hold'em Poker program [Zinkevich et al., 2007].

31 Hierarchy of bandits We now consider another way of combining bandits: Hierarchy of bandits: the reward obtained when pulling an arm is itself the return of another bandit in a hierarchy. Applications to tree search, optimization, planning

32 The tree search problem
To each leaf j ∈ L of a tree is assigned a random variable X_j ∈ [0,1] whose law is unknown. At each time t, a leaf I_t ∈ L is selected and a reward x_t, an i.i.d. sample of X_{I_t}, is received.
Value of leaf j: µ_j = E[X_j]. Value of an inner node i: µ_i = max_{j ∈ L(i)} µ_j, where L(i) is the set of leaves below i. Optimal value: µ* = max_{j ∈ L} µ_j, attained along an optimal path down to an optimal leaf.
Goal: find an exploration policy that maximizes the expected sum of obtained rewards.
Idea: use bandit algorithms for efficient tree exploration.

33 UCB-based leaf selection policy
Leaf selection policy: to each node i is assigned a value B_i. The chosen leaf I_t is selected by following a path from the root to a leaf, where at each node i the next node is the child with the highest B-value.
Goal: design B-values (upper bounds on the true values µ_i of each node i) such that the resulting leaf selection policy maximizes the expected sum of obtained rewards.

34 Flat UCB
We implement UCB directly on the leaves:
B_i := X̄_{i,n_i} + √(2 log(n_p) / n_i)   if i is a leaf,
B_i := max_{j ∈ C(i)} B_j   otherwise,
where n_i is the number of visits of node i, n_p that of its parent, and C(i) its set of children.
Property (Chernoff-Hoeffding): with high probability, B_i ≥ µ_i for all nodes i.
Bound on the regret: any sub-optimal leaf j is visited in expectation at most E n_j = O(log(n)/Δ_j²) times (where Δ_j = µ* − µ_j). Thus the regret is bounded by:
E R_n = O( log(n) Σ_{j ∈ L, µ_j < µ*} 1/Δ_j ).
Problem: all leaves must be visited at least once!

35 UCT (UCB applied to Trees)
UCT [Kocsis and Szepesvári, 2006]:
B_i := X̄_{i,n_i} + √(2 log(n_p) / n_i),
where n_p is the number of visits of the parent node.
Intuition: explore first the most promising branches; adapts automatically to the effective smoothness of the tree; good results in computer Go.
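A compact sketch of the UCT selection rule on the tree-search problem of slide 32: a binary tree of fixed depth whose leaves return Bernoulli rewards. The tree shape, leaf means and constants are illustrative; real game-playing implementations interleave this selection with incremental tree growth and Monte-Carlo rollouts.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
DEPTH, K = 8, 2
leaf_mu = rng.random(K ** DEPTH)                  # unknown leaf values in [0, 1]

counts, sums = {}, {}                             # per-node visit counts and reward sums

def select_leaf():
    """From the root, repeatedly pick the child maximizing B = xbar + sqrt(2 log n_p / n_i)."""
    node = ()
    for _ in range(DEPTH):
        n_p = max(counts.get(node, 1), 1)
        best, best_b = None, -math.inf
        for a in range(K):
            child = node + (a,)
            n_i = counts.get(child, 0)
            b = math.inf if n_i == 0 else sums[child] / n_i + math.sqrt(2 * math.log(n_p) / n_i)
            if b > best_b:
                best, best_b = child, b
        node = best
    return node

def leaf_index(leaf):
    return int("".join(map(str, leaf)), 2)

for _ in range(20000):
    leaf = select_leaf()
    r = float(rng.random() < leaf_mu[leaf_index(leaf)])     # Bernoulli reward of that leaf
    for d in range(DEPTH + 1):                              # update statistics on the whole path
        node = leaf[:d]
        counts[node] = counts.get(node, 0) + 1
        sums[node] = sums.get(node, 0.0) + r

most_visited = max((k for k in counts if len(k) == DEPTH), key=lambda k: counts[k])
print("value of the most visited leaf:", round(leaf_mu[leaf_index(most_visited)], 3),
      "  optimal leaf value:", round(leaf_mu.max(), 3))
```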

36 The MoGo program
Collaborative work with Yizao Wang, Sylvain Gelly, Olivier Teytaud and many others; see [Gelly et al., 2006]. Features:
Explore-exploit with UCT
(Min-max) Monte-Carlo evaluation
Asymmetric tree expansion
Anytime algorithm
Use of domain features
Among the best computer-Go programs. Interestingly: stochastic methods for a deterministic problem!

37 Analysis of UCT
Properties: the rewards obtained at a (non-leaf) node i are not i.i.d., thus the B-values are not upper confidence bounds on the node values. However, all leaves are eventually visited infinitely often, so the algorithm is eventually consistent: the regret is O(log(n)) after an initial period... which may last very... very long!

38 Bad case for UCT
Consider a binary tree of depth D in which the left branches appear to be the best: they are explored for a very long time before the optimal leaf is eventually reached. The expected regret is disastrous:
E R_n = Ω( exp(exp(... exp(1)...)) ) + O(log(n)),
with D nested exponentials. Much, much worse than uniform exploration!

39 In short... So far we have seen: Flat-UCB: does not exploit possible smoothness, but very good in the worst case! UCT: indeed adapts automatically to the effective smoothness of the tree, but the price of this adaptivity may be very very high. In good cases, UCT is VERY efficient! In bad cases, UCT is VERY poor! We should use the actual smoothness of the tree, if any, to design relevant algorithms. How to define the smoothness of the tree?

40 BAST (Bandit Algorithm for Smooth Trees) [Coquelin and Munos, 2007]
Assumption: along an optimal path, for each node i of depth d and for all leaves j ∈ L(i),
µ* − µ_j ≤ δ_d,
where δ_d is a smoothness function (a Lipschitz-like property).
Define the B-values:
B_i := min{ max_{j ∈ C(i)} B_j,  X̄_{i,n_i} + √(2 log(n_p)/n_i) + δ_d }.
Remark: UCT = BAST with δ_d = 0; Flat-UCB = BAST with δ_d = ∞.

41 Lipschitz optimization
Problem: find the maximum of f: [0,1] → [0,1], assumed to be 1-Lipschitz: |f(x) − f(y)| ≤ ℓ(x,y). An evaluation of f at x_t returns f(x_t).

42 Lipschitz optimization (continued)
If we wish to minimize the regret
R_n = n max_{x ∈ [0,1]} f(x) − Σ_{t=1}^n f(x_t),
where should one sample the next point?

43 Lipschitz optimization (continued) The optimistic strategy (i.e. select the point with highest upper bound) is a good strategy!

44 Lipschitz optimization with noisy evaluations
The evaluation of f at x_t returns a noisy reward r_t such that E[r_t | x_t] = f(x_t).

45 How to define high-confidence upper bounds?

46 Select a domain
Select a domain X. Then, with probability 1 − β, by Azuma's inequality:
(1/n) Σ_{t=1}^n r_t + √( log(β^{−1}) / (2n) ) + diam(X) ≥ sup_{x ∈ X} f(x),
since (1/n) Σ_{t=1}^n [r_t − f(x_t)] ≥ −√( log(β^{−1}) / (2n) ) with probability 1 − β, and f(x_t) + diam(X) ≥ sup_{x ∈ X} f(x).

47 A sub-domain?
Select a sub-domain of X with a smaller diameter and compute the corresponding upper bound.

48 Another sub-domain?
Select a sub-domain of X containing more samples and compute the corresponding upper bound.

49 Combine previous bounds
Consider concentration inequalities on all domains and use a union bound to derive the lowest (tightest) upper bound.

50 A hierarchical decomposition
Use a tree to partition the domain at all scales. UCB:
B_i := min{ max_{j ∈ C(i)} B_j,  X̄_{i,n_i} + √(2 log(n)/n_i) + diam(i) }.

51 Multi-armed bandits in metric spaces
More generally: let X be a metric space with distance ℓ(x,y), and let f be a Lipschitz function: |f(x) − f(y)| ≤ ℓ(x,y). Write f* := sup_{x ∈ X} f(x).
X-armed bandit problem on X: at each round t, choose a point (arm) x_t ∈ X and receive a reward r_t, an independent sample drawn from a distribution ν(x_t) with mean f(x_t).
Goal: minimize the regret R_n := Σ_{t=1}^n (f* − r_t).

52 Hierarchical Optimistic Optimization (HOO) [Bubeck et al., 2008]
Consider a tree of partitions of X: each node i corresponds to a domain D_i of the space. Write diam(i) = sup_{x,y ∈ D_i} ℓ(x,y) for the diameter of D_i. Let T_t denote the set of expanded nodes at round t.
HOO algorithm: start with T_1 = {root} (the whole domain X). At each round t:
Follow a path from the root, choosing at each node the child with the largest B-value, until reaching a node i_t outside T_t,
Expand the node i_t: choose (arbitrarily) a point x_t ∈ D_{i_t} and add i_t to the tree,
Observe the reward r_t ~ ν(x_t) and update the B-values:
B_i := min{ max_{j ∈ C(i)} B_j,  X̄_{i,n_i} + √(2 log(n)/n_i) + diam(i) }.
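A simplified sketch of HOO on X = [0,1] with the dyadic binary partition and dissimilarity ℓ(x,y) = |x − y|, so that diam(node) is the length of its interval. The target function, noise, horizon and the "centre of the most-played node" recommendation are illustrative choices of mine; see [Bubeck et al., 2008] for the full algorithm and its tuning.

```python
import math
import random

random.seed(0)

def f(x):                                  # unknown mean-reward function, maximum at x = 0.7
    return 0.8 - abs(x - 0.7)

class Node:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
        self.n, self.sum = 0, 0.0
        self.children = None               # None = node not expanded yet
        self.B = math.inf                  # B-value (infinite until sampled)

root = Node(0.0, 1.0)
n = 3000
for t in range(1, n + 1):
    # 1) follow the children with the largest B-values until an unexpanded node is reached
    path, node = [root], root
    while node.children is not None:
        node = max(node.children, key=lambda c: c.B)
        path.append(node)
    # 2) expand it and play a point of its domain (here: the midpoint), observe a noisy reward
    mid = 0.5 * (node.lo + node.hi)
    node.children = [Node(node.lo, mid), Node(mid, node.hi)]
    reward = f(mid) + random.uniform(-0.1, 0.1)
    # 3) update counts, means and B-values bottom-up along the traversed path
    for p in reversed(path):
        p.n += 1
        p.sum += reward
        u = p.sum / p.n + math.sqrt(2 * math.log(t) / p.n) + (p.hi - p.lo)
        p.B = min(u, max(c.B for c in p.children))

# recommend the centre of the most-played deep node
node = root
while node.children is not None and any(c.n > 0 for c in node.children):
    node = max(node.children, key=lambda c: c.n)
print("recommended x =", round(0.5 * (node.lo + node.hi), 3), "(true maximizer: 0.7)")
```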

53 Properties of HOO
Properties: these B-values are true upper confidence bounds on the values of the optimal nodes; the tree grows in an asymmetric way, leaving the sub-optimal branches mostly unexplored; essentially, only the good paths are explored.

54 Example in 1d
The reward at the chosen point x_t is r_t ~ Bernoulli(f(x_t)).
(Figures on the slide: the shape of f with maximum µ*, the location of the selected leaves, and the resulting trees at stages n = 4000, n = 10⁴ and n = 10⁶.)

55 Analysis of the regret
Let d be the dimension of X (i.e. such that we need O(ε^{−d}) balls of radius ε to cover X). Then E R_n = Õ( n^{(d+1)/(d+2)} ). We also have a lower bound E R_n = Ω( n^{(d+1)/(d+2)} ) [Kleinberg et al., 2008].
Now let d' be the near-optimality dimension of f in X, i.e. such that we need O(ε^{−d'}) balls of radius ε to cover X_ε := {x ∈ X : f(x) ≥ f* − ε}. Then
E R_n = Õ( n^{(d'+1)/(d'+2)} ).
Much better! (Same result as the zooming algorithm of [Kleinberg et al., 2008], but with a simpler algorithm.)

56 Example 1
Here the function is locally Lipschitz around its maximum: f(x*) − f(x) = Θ(‖x − x*‖). It takes O(ε⁰) = O(1) balls of radius ε to cover X_ε. Thus d' = 0 and the regret is Õ(n^{1/2}).

57 Example 2
Here the function is locally Hölder (with exponent 2) around its maximum: f(x*) − f(x) = Θ(‖x − x*‖²). It takes O(ε^{−d/2}) balls of radius ε to cover X_ε (a ball of radius O(ε^{1/2})). Thus d' = d/2 and the regret is Õ( n^{(d+2)/(d+4)} ).

58 Smoother functions
Assume the function is locally Hölder (with exponent α) around its maximum: f(x*) − f(x) = Θ(‖x − x*‖^α). Then it takes O(ε^{d/α − d}) balls of radius ε to cover X_ε (a ball of radius O(ε^{1/α})). Thus d' = d(1 − 1/α) and the regret is Õ( n^{(α(d+1)−d)/(α(d+2)−d)} ).
What's wrong? The smoother the function, the poorer the regret! And the regret depends on the dimension of the space (very poor in high dimensions).

59 Generalization?
We would like to use a metric that captures the local smoothness around the maximum, i.e. for a function locally α-Hölder around its maximum, f(x*) − f(x) = Θ(‖x − x*‖^α), we would like to use ℓ(x,y) = ‖x − y‖^α.
Why? So that X_ε may be covered by O(1) balls (under this "metric") of radius ε.
But ℓ is not a metric! But who cares? But f cannot be Lipschitz w.r.t. ℓ everywhere! But who cares? BUT??? Shut up!

60 Generalization! [Bubeck et al., 2008]
Actually, in the regret analysis of the HOO algorithm we need neither the assumption that X is a metric space nor that f is Lipschitz, but (almost) only that f is one-sided locally Lipschitz around its maximum with respect to a dissimilarity measure ℓ, i.e.
f(x*) − f(x) ≤ ℓ(x, x*).
Example: consider X = [0,1]^d and assume that f is locally α-Hölder around its maximum, i.e. f(x*) − f(x) = Θ(‖x − x*‖^α). Then we choose ℓ(x,y) = ‖x − y‖^α, and X_ε is covered by O(1) balls of radius ε. Thus the near-optimality dimension is d' = 0, and the regret of HOO is
E R_n = Õ(√n),
whatever the dimension of the space is!

61 Conclusion on bandits in general spaces
If we possess a measure of dissimilarity which expresses how similar a near-optimal arm is to an optimal one, then the regret of the HOO algorithm is Õ(√n).
Applications:
In continuous domains: say f: [0,1]^d → [0,1] has a finite number of maxima and is α-Hölder around its maxima; if we know the exponent α, then the regret of HOO is Õ(√n) whatever the dimension d is. Optimization is not more difficult than integration.
In smooth trees, i.e. when two close paths have similar values, e.g. when the value of a path is the sum of discounted rewards obtained along this path. Application to planning.

62 Application to planning (joint work with Jean-François Hren)
Consider a controlled deterministic system with discounted rewards. From the current state x_t, consider the look-ahead tree of all possible reachable states. Use n computational resources (CPU time, number of calls to a generative model) to explore the tree and return a proposed action a_t. This induces a policy π_n.
Goal: minimize the loss resulting from using policy π_n instead of an optimal one:
R_n := V* − V^{π_n}.

63 Look-ahead tree for planning in deterministic systems
At time t, for the current state x_t, build the look-ahead tree:
Root of the tree = current state x_t,
Nodes = states reachable by a sequence of actions,
A path collects the discounted sum of rewards obtained along it: Σ_{t ≥ 0} γ^t r_t.
Explore the tree using n computational resources and propose an action as close as possible to the optimal one.

64 Uniform exploration
Assume all rewards lie in [0,1]. For any reward function, the regret satisfies
R_n = O( n^{−(log 1/γ)/(log K)} ),
and for some reward function,
R_n = Ω( n^{−(log 1/γ)/(log K)} )
(K is the branching factor of the tree).

65 Proof of the upper bound
With uniform exploration, it takes n = Ω(K^d) expansions to expand all nodes of depth d. Because of the discount factor γ, once all rewards up to depth d are known, the optimal value of the tree is known up to a precision ε = γ^d/(1 − γ), which allows taking an ε-optimal action. Thus
ε = O(γ^d) = O( n^{−(log 1/γ)/(log K)} ),
which upper bounds the regret (worst case).

66 Optimistic exploration (BAST/HOO algorithm in the deterministic setting)
For any node i of depth d, define the B-value
B_i := Σ_{t=0}^{d−1} γ^t r_t + γ^d/(1 − γ),
where the sum is over the rewards collected along the path from the root to i; B_i is an upper bound on the value v_i of node i.
At each round, expand the node with the highest B-value, observe the reward, update the B-values, and repeat until no more resources are available. Then return the maximizing action.
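A sketch of this optimistic planner: repeatedly expand the leaf with the largest B_i = Σ_{t<d} γ^t r_t + γ^d/(1 − γ). The B-value formula follows the slide; the toy 1-d dynamics and reward (a "move towards 0.8" problem) and the rule used to extract the recommended action are purely illustrative.

```python
import heapq
import math

GAMMA, K, BUDGET = 0.9, 2, 2000

def step(x, a):                       # deterministic transition and reward in [0, 1]
    x2 = min(1.0, max(0.0, x + (0.05 if a == 1 else -0.05)))
    return x2, 1.0 - abs(x2 - 0.8)

def optimistic_plan(x0):
    # each heap entry: (-B_i, tie-breaker, state, depth, discounted return u_i, first action)
    heap, tie = [(-1.0 / (1 - GAMMA), 0, x0, 0, 0.0, None)], 0
    best_u, best_action = -math.inf, 0
    for _ in range(BUDGET):           # each expansion consumes one unit of resource
        negB, _, x, d, u, a0 = heapq.heappop(heap)
        if u > best_u:                # remember the best discounted return found so far
            best_u, best_action = u, (a0 if a0 is not None else 0)
        for a in range(K):            # expand: create the K children of this node
            x2, r = step(x, a)
            u2 = u + (GAMMA ** d) * r
            b2 = u2 + (GAMMA ** (d + 1)) / (1 - GAMMA)
            tie += 1
            heapq.heappush(heap, (-b2, tie, x2, d + 1, u2, a0 if a0 is not None else a))
    return best_action

print("action proposed at x0 = 0.5:", optimistic_plan(0.5))   # expect action 1 (move right)
```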

67 Analysis of the regret
Define β such that the proportion of ε-optimal paths is O(ε^β), and let κ := K γ^β ∈ [1, K].
If κ > 1, then R_n = O( n^{−(log 1/γ)/(log κ)} ) (recall that for uniform planning R_n = O( n^{−(log 1/γ)/(log K)} )).
If κ = 1, then R_n = O( γ^{(1−γ)β n / c} ), where c is defined by the proportion of ε-optimal paths being bounded by c ε^β. This gives exponential rates.
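A quick numerical comparison (my own, with illustrative values of γ, K and κ) of the exponents in the two polynomial rates above:

```python
import math

gamma, K = 0.9, 2
print("uniform exponent      :", round(math.log(1 / gamma) / math.log(K), 3))
for kappa in (1.8, 1.4, 1.1):
    print(f"optimistic, kappa={kappa}:", round(math.log(1 / gamma) / math.log(kappa), 3))
```

The smaller κ is (few near-optimal paths), the faster the optimistic regret decays, while the uniform rate stays stuck at the exponent determined by the full branching factor K.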

68 Some intuition
Write T for the tree of all expandable nodes:
T = { node i of depth d such that v_i + γ^d/(1 − γ) ≥ v* },
i.e. the set of nodes from which one cannot decide whether the node is optimal or not.
At any round n, the set of expanded nodes satisfies T_n ⊂ T, and κ is the branching factor of T. The regret R_n = O( n^{−(log 1/γ)/(log κ)} ) comes from a search in the tree T with branching factor κ ∈ [1, K].

69 Upper and lower bounds
For any κ ∈ [1, K], define P_κ as the set of problems having this κ value. For any problem P ∈ P_κ, write R_{A(P)}(n) for the regret of using algorithm A on problem P with n computational resources. Then:
sup_{P ∈ P_κ} R_{A_uniform(P)}(n) = Θ( n^{−(log 1/γ)/(log K)} ),
sup_{P ∈ P_κ} R_{A_optimistic(P)}(n) = Θ( n^{−(log 1/γ)/(log κ)} ).

70 Numerical illustration
2d problem: x = (u, v). Dynamics:
u_{t+1} = u_t + v_t Δt,   v_{t+1} = v_t + a_t Δt.
Reward: r(u,v) = u².
Figure: set of expanded nodes (n = 3000) using uniform planning; maximal depth reached = 10.

71 Numerical illustration
The exploration of the poor paths is shallow, while the good paths are explored to much greater depth.
Figure: set of expanded nodes (n = 3000) using optimistic planning; maximal depth reached = 43.

72 Two inverted pendulums linked by a spring
State space of dimension 8, 4 actions, n = 3000 at each iteration.

73 References
Collective book of the PDMIA community, Processus Décisionnels de Markov et Intelligence Artificielle, Hermès.
J.-Y. Audibert, R. Munos, and Cs. Szepesvári, Tuning bandit algorithms in stochastic environments, ALT, 2007.
P. Auer, N. Cesa-Bianchi, and P. Fischer, Finite-time analysis of the multiarmed bandit problem, Machine Learning, 2002.
P. Auer, N. Cesa-Bianchi, Y. Freund, and R. Schapire, The non-stochastic multi-armed bandit problem, SIAM Journal on Computing, 2002.
S. Bubeck, R. Munos, G. Stoltz, and Cs. Szepesvári, Online optimization in X-armed bandits, NIPS, 2008.
P.-A. Coquelin and R. Munos, Bandit algorithms for tree search, UAI, 2007.
S. Gelly, Y. Wang, R. Munos, and O. Teytaud, Modification of UCT with patterns in Monte-Carlo Go, INRIA Research Report, 2006.
W. Hoeffding, Probability inequalities for sums of bounded random variables, Journal of the American Statistical Association, 1963.

74 References (continued)
J.-F. Hren and R. Munos, Optimistic planning in deterministic systems, EWRL, 2008.
M. Kearns, Y. Mansour, and A. Ng, A sparse sampling algorithm for near-optimal planning in large Markov decision processes, Machine Learning, 2002.
R. Kleinberg, Nearly tight bounds for the continuum-armed bandit problem, NIPS, 2004.
R. Kleinberg, A. Slivkins, and E. Upfal, Multi-armed bandits in metric spaces, ACM Symposium on Theory of Computing, 2008.
L. Kocsis and Cs. Szepesvári, Bandit based Monte-Carlo planning, ECML, 2006.
T. L. Lai and H. Robbins, Asymptotically efficient adaptive allocation rules, Advances in Applied Mathematics, 1985.
G. Lugosi and N. Cesa-Bianchi, Prediction, Learning, and Games, Cambridge University Press, 2006.
R. Sutton and A. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998.
M. Zinkevich, M. Bowling, M. Johanson, and C. Piccione, Regret minimization in games with incomplete information, NIPS, 2007.
