Exploration for sequential decision making: application to games, tree search, optimization, and planning


1 Exploration for sequential decision making: application to games, tree search, optimization, and planning
Rémi Munos, SequeL project (Sequential Learning), INRIA Lille - Nord Europe.
Master MVA, Reinforcement Learning course ("Apprentissage par renforcement"), 2008.

2 Exploration for sequential decision making: application to games, tree search, optimization, and planning
Global view:
Exploration in Reinforcement Learning
The multi-armed bandit problem
The optimistic strategy in uncertain environments
The robust strategy in adversarial environments
Adversarial bandits: application to games
Hierarchical bandits: application to tree search, to optimization, and to planning

3 Exploration in Reinforcement Learning
Assumption about RL: no initial model of the world (dynamics + rewards). Example: a child learns to ride a bicycle.
How should we act?
Exploitation: act optimally according to the current beliefs.
Exploration: learn more about the environment.
Tradeoff between exploration and exploitation.

4 Introduction to multi-armed bandits
General setting: at each round, several options (actions) are available to choose from, and a reward is provided according to the choice made. Our goal is to optimize the sum of rewards.
Many potential applications:
Clinical trials
Advertising: which ad to put on a web page?
Labor markets: which job should a worker choose?
Optimization of noisy functions
Numerical resource allocation

5 Example: a two-armed bandit
Say there are 2 arms. We have pulled the arms so far and recorded, for each round, the arm pulled and the reward obtained (a table with columns Time, Arm pulled, Reward of arm 1, Reward of arm 2; the numerical entries are shown on the slide).
Which arm should we pull next? What are the assumptions about the rewards? What is really our goal?

6 The stochastic bandit problem
Setting: a set of K arms, defined by random variables X_k ∈ [0,1] whose laws are unknown. At each time t, choose an arm k_t and receive a reward x_t, an i.i.d. sample of X_{k_t}.
Goal: find an arm selection policy so as to maximize the expected sum of rewards.
Definitions: let µ_k = E[X_k] be the expected value of arm k, µ* = max_k µ_k the optimal value, and k* an optimal arm.

7 Exploration-exploitation tradeoff
Define the cumulative regret:
R_n := Σ_{t=1}^n (µ* − µ_{k_t}) = Σ_{k=1}^K n_k Δ_k,
where Δ_k := µ* − µ_k, and n_k is the number of times arm k has been pulled up to time n. (The regret results from not pulling an optimal arm from the beginning.)
Goal: find an arm selection policy so as to minimize R_n.

8 Proposed solutions
This is an old problem [Robbins, 1952], and (maybe surprisingly) not fully solved yet! Many solutions have been proposed. Examples:
ε-greedy exploration: choose the apparently best action with probability 1 − ε, or a random action with probability ε.
Follow the perturbed leader: choose the best perturbed arm.
Bayesian exploration: assign a prior to the arm distributions and, based on the rewards, choose the arm with the best posterior mean, or with the highest probability of being the best.
Boltzmann exploration: choose arm k with probability proportional to exp(X̄_k / T), where T is a temperature parameter.
Optimistic exploration: choose an arm that has a possibility of being the best.
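As a concrete illustration of two of these heuristics, here is a minimal Python sketch (mine, not from the slides) of ε-greedy and Boltzmann selection on a toy Bernoulli bandit; all names and parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5, 0.7])   # unknown to the player
K = len(true_means)

def epsilon_greedy(emp_means, eps=0.1):
    # apparent best arm with prob. 1 - eps, a uniformly random arm with prob. eps
    if rng.random() < eps:
        return int(rng.integers(K))
    return int(np.argmax(emp_means))

def boltzmann(emp_means, temperature=0.1):
    # arm k with probability proportional to exp(emp_means[k] / temperature)
    logits = emp_means / temperature
    p = np.exp(logits - logits.max())     # subtract the max for numerical stability
    p /= p.sum()
    return int(rng.choice(K, p=p))

def run(select, n=5000):
    counts, sums = np.zeros(K), np.zeros(K)
    for _ in range(n):
        emp = np.divide(sums, counts, out=np.ones(K), where=counts > 0)  # unvisited arms start at 1
        k = select(emp)
        counts[k] += 1
        sums[k] += rng.random() < true_means[k]   # Bernoulli reward
    return counts

print("eps-greedy pulls:", run(epsilon_greedy))
print("Boltzmann pulls :", run(boltzmann))
```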

9 The UCB algorithm
Upper Confidence Bounds algorithm [Auer et al. 2002]: at each time n, select an arm
k_n ∈ argmax_k B_{k,n_k,n},
where
B_{k,n_k,n} := (1/n_k) Σ_{s=1}^{n_k} x_{k,s} + √(2 log(n) / n_k) = X̄_{k,n_k} + c_{n_k,n},
n_k is the number of times arm k has been pulled up to time n, and x_{k,s} is the s-th reward obtained when pulling arm k.
Note that B_{k,n_k,n} is the sum of an exploitation term (the empirical mean X̄_{k,n_k}) and an exploration term; c_{n_k,n} is a confidence interval term, so B_{k,n_k,n} is an upper confidence bound (UCB).
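A minimal, self-contained sketch of this rule on a toy Bernoulli problem (the arm means, horizon and tie-breaking are illustrative, not from the slides): pull each arm once, then always pull the arm maximizing X̄_k + √(2 log n / n_k).

```python
import math
import numpy as np

rng = np.random.default_rng(0)
true_means = [0.3, 0.5, 0.7]
K = len(true_means)

def ucb_run(n=10000):
    counts = [0] * K          # n_k
    sums = [0.0] * K          # sum of rewards of arm k
    regret = 0.0
    for t in range(1, n + 1):
        if t <= K:            # initialization: pull each arm once
            k = t - 1
        else:
            b = [sums[k] / counts[k] + math.sqrt(2 * math.log(t) / counts[k]) for k in range(K)]
            k = int(np.argmax(b))
        reward = float(rng.random() < true_means[k])   # Bernoulli(mu_k) reward
        counts[k] += 1
        sums[k] += reward
        regret += max(true_means) - true_means[k]      # expected (pseudo-)regret
    return counts, regret

counts, regret = ucb_run()
print("pulls per arm:", counts, " pseudo-regret:", round(regret, 1))
```

The sub-optimal arms end up with only a logarithmic number of pulls, which is exactly the behaviour quantified on slide 15.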

10 Intuition behind the UCB algorithm
Idea: select an arm that has a high probability of being the best, given what has been observed so far. This is the "optimism in the face of uncertainty" strategy.
Why? For any arm k and any time t, B_{k,t,n} ≥ µ_k with high probability.
Why? From the Chernoff-Hoeffding inequality:
P( |(1/n) Σ_{t=1}^n X_t − EX| ≥ ε ) ≤ 2 e^{−2nε²}.
What is that? A concentration inequality... What?

11 Concentration of measure phenomenon
What is that? Talagrand: "A random variable that depends on the influence of many independent random variables (but not too much on any of them) is essentially constant."
Example: let f(x_1,...,x_n) be a function of n variables such that changing any single coordinate changes f by at most 1/n, and let X_1,...,X_n be independent random variables. Then:
P( |f(X_1,...,X_n) − E f| ≥ ε ) ≤ 2 e^{−2nε²}.
Many examples (Chernoff-Hoeffding, Bennett, Bernstein, Efron-Stein, logarithmic Sobolev, Azuma, Talagrand inequalities).

12 Chernoff-Hoeffding inequality
Consider f(x_1,...,x_n) := (1/n) Σ_{t=1}^n x_t. If X_1, X_2,..., X_n are independent and identically distributed (i.i.d.) copies of a random variable X ∈ [0,1], then the empirical mean
f(X_1,...,X_n) = X̄_n := (1/n) Σ_{t=1}^n X_t
concentrates around E f = EX exponentially fast, i.e.
P( |X̄_n − EX| ≥ ε ) ≤ 2 e^{−2nε²}.
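A quick numerical check of the inequality (my own, not from the slides) for Bernoulli(p) samples: the empirical tail probability should sit below the bound 2 exp(−2nε²). The values of p, n, ε and the number of trials are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, eps, trials = 0.5, 200, 0.1, 20000

samples = rng.random((trials, n)) < p            # trials x n Bernoulli(p) draws
xbar = samples.mean(axis=1)                      # empirical means over n samples
empirical_tail = float(np.mean(np.abs(xbar - p) >= eps))
bound = 2 * np.exp(-2 * n * eps**2)
print(f"empirical P(|Xbar - p| >= eps) = {empirical_tail:.4f}   Hoeffding bound = {bound:.4f}")
```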

13 Chernoff-Hoeffding, sketch of proof
Chernoff-Hoeffding inequality: P( |X̄_n − EX| ≥ ε ) ≤ 2 e^{−2nε²}.   (1)
Proof. Markov's inequality: for Y ≥ 0, P(Y ≥ λ) ≤ EY/λ. Indeed: P(Y ≥ λ) = E[1_{Y ≥ λ}] = E[1_{Y/λ ≥ 1}] ≤ E[Y/λ].
Thus, for any s > 0,
P( X̄_n − EX ≥ ε ) = P( e^{(s/n) Σ_{t=1}^n (X_t − EX)} ≥ e^{sε} )
≤ e^{−sε} E[ e^{(s/n) Σ_{t=1}^n (X_t − EX)} ]
= e^{−sε} Π_{t=1}^n E[ e^{(s/n)(X_t − EX)} ].
Now use Hoeffding's inequality E[e^{s(X − EX)}] ≤ e^{s²/8} (applied with s/n in place of s) and set s = 4εn to deduce (1); the other tail is handled symmetrically.

14 Let's go back to the UCB algorithm
By Chernoff-Hoeffding,
P( |X̄_{k,t} − µ_k| ≥ √(2 log(n)/t) ) ≤ 2 e^{−2t · 2 log(n)/t} = 2 n^{−4}.
Thus the B-values
B_{k,t,n} := (1/t) Σ_{s=1}^t X_{k,s} + √(2 log(n)/t)
are upper confidence bounds on µ_k, i.e. with high probability, B_{k,n_k,n} ≥ µ_k.

15 Regret bound for UCB
Proposition: each sub-optimal arm k is visited, in expectation, at most
E[n_k(n)] ≤ 8 log(n)/Δ_k² + constant
times (where Δ_k := µ* − µ_k > 0). Thus the expected regret is bounded by:
E R_n = Σ_k E[n_k] Δ_k ≤ 8 Σ_{k: Δ_k > 0} log(n)/Δ_k + constant.
This is optimal (up to sub-logarithmic terms) since E R_n = Ω(log n) [Lai and Robbins, 1985].
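As a worked instance of this bound (the numbers are illustrative, not from the slides), take two arms with gap Δ = 0.1 and horizon n = 10⁴:

```python
import math

n, delta = 10_000, 0.1
pulls_bound = 8 * math.log(n) / delta**2        # bound on E[n_k(n)] for the sub-optimal arm
regret_bound = 8 * math.log(n) / delta          # corresponding bound on the expected regret
print(f"E[n_k] <= {pulls_bound:.0f} (+ const),  E[R_n] <= {regret_bound:.0f} (+ const)")
```

For this small gap the pull bound (about 7400) is close to the horizon; the logarithmic behaviour only becomes visible for larger n or larger gaps.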

16 Intuition of the proof
Let k be a sub-optimal arm, and k* be an optimal arm. At time n, if arm k is selected, this means that
B_{k,n_k,n} ≥ B_{k*,n_{k*},n}, i.e.
X̄_{k,n_k} + √(2 log(n)/n_k) ≥ X̄_{k*,n_{k*}} + √(2 log(n)/n_{k*}) ≥ µ*   with high probability,
and since X̄_{k,n_k} ≤ µ_k + √(2 log(n)/n_k) with high probability,
µ_k + 2 √(2 log(n)/n_k) ≥ µ*,   i.e.   n_k ≤ 8 log(n)/Δ_k².
Thus, with high probability, if n_k > 8 log(n)/Δ_k², then arm k will not be selected. Hence n_k ≤ 8 log(n)/Δ_k² + 1 with high probability.

17 Sketch of proof
Write u = 8 log(n)/Δ_k² + 1. We have:
n_k(n) ≤ u + Σ_{t=u+1}^n 1{ k_t = k and n_k(t) > u }
≤ u + Σ_{t=u+1}^n 1{ ∃ s: u < s ≤ t and ∃ s': 1 ≤ s' ≤ t such that B_{k,s,t} ≥ B_{k*,s',t} }
≤ u + Σ_{t=u+1}^n [ 1{ ∃ s: u < s ≤ t s.t. B_{k,s,t} > µ* } + 1{ ∃ s': 1 ≤ s' ≤ t s.t. B_{k*,s',t} ≤ µ* } ]
≤ u + Σ_{t=u+1}^n [ Σ_{s=u+1}^t 1{ B_{k,s,t} > µ* } + Σ_{s'=1}^t 1{ B_{k*,s',t} ≤ µ* } ].
Now, taking the expectation of both sides,
E[n_k(n)] ≤ u + Σ_{t=u+1}^n [ Σ_{s=u+1}^t P(B_{k,s,t} > µ*) + Σ_{s'=1}^t P(B_{k*,s',t} ≤ µ*) ]
≤ u + Σ_{t=u+1}^n [ Σ_{s=u+1}^t t^{−4} + Σ_{s'=1}^t t^{−4} ] ≤ u + π²/3.

18 PAC-UCB
Let β > 0. By slightly changing the confidence interval term, i.e.
B_{k,t} := X̄_{k,t} + √( log(K t² β^{−1}) / t ),
we obtain
P( |X̄_{k,t} − µ_k| ≤ √( log(K t² β^{−1}) / t ),  for all k ∈ {1,...,K} and all t ≥ 1 ) ≥ 1 − β.
PAC-UCB [Audibert et al. 2007]: with probability 1 − β, the regret is bounded by a constant independent of n:
R_n ≤ 6 log(K β^{−1}) Σ_{k: Δ_k > 0} 1/Δ_k.

19 Adversarial environment
What if the obtained rewards are no longer i.i.d. samples of random variables, but are arbitrary?
At time t, the rewards x_{k,t} are set (maybe by some opponent), but not revealed. The player chooses an arm k_t and receives x_{k_t,t}.
Can we expect to do almost as well as if all rewards were known ahead of time, i.e. as well as the best (constant) arm?
Example (rewards in parentheses were not observed by the player):
Arm 1: 10, (7), 9, 11, (12), ...
Arm 2: (9), 0, (18), (0), 14, (0), (16), (0), ...
Reward obtained: 74. Best sum of rewards: 96. Arm 1: 79, Arm 2: 57.

20 Notion of regret
Define the regret:
R_n = max_{k ∈ {1,...,K}} Σ_{t=1}^n x_{k,t} − Σ_{t=1}^n x_{k_t,t}.
Performance is assessed in terms of the best constant strategy. This generalizes to prediction using expert advice [Lugosi and Cesa-Bianchi, 2006].
Can we expect sup_{rewards} R_n / n → 0?
If the policy of the player is deterministic, there exists a reward sequence such that the performance is arbitrarily poor. We need internal randomization.

21 EXP3 algorithm
EXP3 algorithm (Explore-Exploit using Exponential weights) [Auer et al, 2002]: start with w_{k,1} = 1. At round t:
Set p_{k,t} := w_{k,t} / Σ_{i=1}^K w_{i,t},
Choose arm k_t according to the probabilities (p_{k,t})_k,
Receive the reward x_t = x_{k_t,t},
Set the estimated rewards x̃_{k,t} = (x_t / p_{k,t}) 1{k = k_t},
Update the weights: w_{k,t+1} = w_{k,t} exp(η_t x̃_{k,t}), for a learning rate η_t of order 1/√(K t).
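A sketch of an EXP3-style player: exponential weights over importance-weighted reward estimates. The fixed learning rate η = √(2 log K / (nK)) below is a common textbook tuning, not necessarily the exact one used on this slide, and the adversary in the usage example is my own; rewards must lie in [0, 1].

```python
import numpy as np

rng = np.random.default_rng(0)

def exp3(reward_fn, K, n):
    eta = np.sqrt(2 * np.log(K) / (n * K))
    weights = np.ones(K)
    total = 0.0
    for t in range(n):
        p = weights / weights.sum()
        k = int(rng.choice(K, p=p))
        x = reward_fn(k, t)                 # only the chosen arm's reward is observed
        xhat = np.zeros(K)
        xhat[k] = x / p[k]                  # unbiased importance-weighted estimate
        weights *= np.exp(eta * xhat)
        weights /= weights.max()            # rescale to avoid numerical overflow
        total += x
    return total

# Example: an oblivious adversary; arm 1 is better on average.
rewards = rng.random((2, 10000)) * np.array([[0.4], [0.6]])
total = exp3(lambda k, t: rewards[k, t], K=2, n=10000)
print("EXP3 total reward:", round(total, 1), "  best single arm:", round(rewards.sum(axis=1).max(), 1))
```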

22 EXP3 results
The expected regret (with respect to the internal randomization) satisfies
sup_{rewards} E R_n = O( √(n K log K) )
(proof not provided here, see [Lugosi and Cesa-Bianchi, 2006] or [Stoltz, 2005]).
Properties: this is optimal (in a worst-case sense) up to a logarithmic factor, since
sup_{rewards} E R_n = Ω( √(n K) ).
If all rewards are revealed to the learner (full information), similar algorithms achieve
sup_{rewards} E R_n = O( √(n log K) ).

23 In summary...
Distribution-dependent bounds:
UCB: E R_n = O( Σ_k (1/Δ_k) log n )
lower bound: E R_n = Ω( log n )
The bound for UCB depends on the gaps Δ_k. We may also derive bounds that do not depend on the reward distributions.
Distribution-free bounds:
UCB: sup_{distributions} E R_n = O( √(K n log n) )
EXP3: sup_{rewards} E R_n = O( √(K n log K) )
lower bound: sup_{rewards} E R_n = Ω( √(K n) )
Remark: although EXP3 has a better distribution-free rate with respect to n, it does not possess the nice log(n) distribution-dependent rate.

24 Population of bandits
Bandit (or regret minimization) algorithms = methods for rapidly selecting the best action: a basic building block for solving more complex problems.
We now consider a population of bandits:
Adversarial bandits
Collaborative bandits

25 Adversarial bandits Each bandit chooses actions in order to minimize its own regret. The resulting behavior may be interesting... Examples of applications: Multi-agent learning Congestion games, transmission networks Equilibrium in games

26 Game between bandits
Consider a 2-player zero-sum repeated game: A and B simultaneously play action 1 or 2, and A receives a reward r_A(A_t, B_t) given by a 2×2 payoff matrix in which A is rewarded when both players play the same action (A likes consensus, B likes conflicts).
Now, let A and B be bandit algorithms, each aiming at minimizing its own regret, i.e. for player A:
R_n(A) := max_{a ∈ {1,2}} Σ_{t=1}^n r_A(a, B_t) − Σ_{t=1}^n r_A(A_t, B_t).
What would happen?

27 Nash equilibrium
Nash equilibrium: a (mixed) strategy for both players such that no player has an incentive to change his own strategy unilaterally.
Here: A plays 1 with probability p_A = 1/2, and B plays 1 with probability p_B = 1/2.

28 Regret minimization ⇒ Nash equilibrium
Define the regret of A:
R_n(A) := max_{a ∈ {1,2}} Σ_{t=1}^n r_A(a, B_t) − Σ_{t=1}^n r_A(A_t, B_t),
and that of B accordingly.
Proposition: if both players perform a (Hannan) consistent regret-minimization strategy (i.e. R_n(A)/n → 0 and R_n(B)/n → 0), then the empirical frequencies of the chosen actions of both players converge to a Nash equilibrium.
(Remember that EXP3 is consistent!)
Note that in general, we have convergence towards correlated equilibria [Foster and Vohra, 1997].

29 Sketch of proof
Write p̄_A^n := (1/n) Σ_{t=1}^n 1{A_t = 1}, p̄_B^n := (1/n) Σ_{t=1}^n 1{B_t = 1}, and r_A(p, q) := E[r_A(A, B)] when A ~ p and B ~ q.
Regret minimization, R_n(A)/n → 0, means that for all ε > 0 and n large enough:
max_{a ∈ {1,2}} (1/n) Σ_{t=1}^n r_A(a, B_t) − (1/n) Σ_{t=1}^n r_A(A_t, B_t) ≤ ε,
i.e.  max_{a ∈ {1,2}} r_A(a, p̄_B^n) − r_A(p̄_A^n, p̄_B^n) ≤ ε,
i.e.  r_A(p, p̄_B^n) − r_A(p̄_A^n, p̄_B^n) ≤ ε  for all p ∈ [0,1].
Now, using R_n(B)/n → 0, we deduce that:
r_A(p, p̄_B^n) − ε ≤ r_A(p̄_A^n, p̄_B^n) ≤ r_A(p̄_A^n, q) + ε,  for all p, q ∈ [0,1].
Thus the empirical actions of both players are arbitrarily close to a Nash strategy.
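A toy illustration (my own sketch) of slides 26-29: two EXP3-style players repeatedly play the consensus/conflict game, which I encode here as r_A = 1 if both actions match and 0 otherwise, with r_B = 1 − r_A (a matching-pennies-like zero-sum game, my reading of the slide's matrix). Their empirical action frequencies should approach the Nash mixed strategy (1/2, 1/2). The learning rate and horizon are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

class Exp3Player:
    def __init__(self, K, eta):
        self.w = np.ones(K)
        self.eta = eta
    def act(self):
        self.p = self.w / self.w.sum()
        return int(rng.choice(len(self.w), p=self.p))
    def update(self, k, reward):
        xhat = np.zeros(len(self.w))
        xhat[k] = reward / self.p[k]        # importance-weighted reward estimate
        self.w *= np.exp(self.eta * xhat)
        self.w /= self.w.max()              # rescale for numerical stability

n = 100000
A, B = Exp3Player(2, 0.001), Exp3Player(2, 0.001)
freq = np.zeros(2)                          # how often each player chose action index 1
for t in range(n):
    a, b = A.act(), B.act()                 # actions coded 0/1 here (1/2 on the slides)
    rA = 1.0 if a == b else 0.0
    A.update(a, rA)
    B.update(b, 1.0 - rA)
    freq += [a, b]
print("fraction of rounds A and B chose their second action:", np.round(freq / n, 3))
```

Both printed frequencies land close to 0.5, i.e. the empirical play approaches the mixed Nash equilibrium of the slide, even though neither player knows the payoff matrix.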

30 Texas Hold'em Poker
In the 2-player Poker game, the Nash equilibrium is interesting (zero-sum game).
A policy maps each information set (my cards + board + pot) to probabilities over decisions (check, raise, fold). The space of policies is huge!
Idea: approximate the Nash equilibrium by using bandit algorithms assigned to each information set.
This provided the world's best 2-player pot-limit Texas Hold'em Poker program [Zinkevich et al., 2007].

31 Hierarchy of bandits We now consider another way of combining bandits: Hierarchy of bandits: the reward obtained when pulling an arm is itself the return of another bandit in a hierarchy. Applications to tree search, optimization, planning

32 The tree search problem
To each leaf j ∈ L of a tree is assigned a random variable X_j ∈ [0,1] whose law is unknown. At each time t, a leaf I_t ∈ L is selected and a reward x_t, an i.i.d. sample of X_{I_t}, is received.
Value of leaf j: µ_j = E[X_j]. Value of an inner node i: µ_i = max_{j ∈ L(i)} µ_j, where L(i) is the set of leaves below i. Optimal value: µ* = max_{j ∈ L} µ_j, attained along an optimal path down to an optimal leaf.
Goal: find an exploration policy that maximizes the expected sum of obtained rewards.
Idea: use bandit algorithms for efficient tree exploration.

33 UCB-based leaf selection policy
Leaf selection policy: to each node i is assigned a value B_i. The chosen leaf I_t is selected by following a path from the root to a leaf, where at each node i the next node is the child with the highest B-value.
Goal: design B-values (upper bounds on the true values µ_i of each node i) such that the resulting leaf selection policy maximizes the expected sum of obtained rewards.

34 Flat UCB
We implement UCB directly on the leaves:
B_i := X̄_{i,n_i} + √(2 log(n_p) / n_i)   if i is a leaf,
B_i := max_{j ∈ C(i)} B_j   otherwise,
where n_i is the number of visits of node i, n_p that of its parent, and C(i) its set of children.
Property (Chernoff-Hoeffding): with high probability, B_i ≥ µ_i for all nodes i.
Bound on the regret: any sub-optimal leaf j is visited in expectation at most E n_j = O(log(n)/Δ_j²) times (where Δ_j = µ* − µ_j). Thus the regret is bounded by:
E R_n = O( log(n) Σ_{j ∈ L, µ_j < µ*} 1/Δ_j ).
Problem: all leaves must be visited at least once!

35 UCT (UCB applied to Trees)
UCT [Kocsis and Szepesvári, 2006]:
B_i := X̄_{i,n_i} + √(2 log(n_p) / n_i),
where n_p is the number of visits of the parent node.
Intuition: explore first the most promising branches; adapts automatically to the effective smoothness of the tree; good results in computer Go.
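A compact sketch of the UCT selection rule on the tree-search problem of slide 32: a binary tree of fixed depth whose leaves return Bernoulli rewards. The tree shape, leaf means and constants are illustrative; real game-playing implementations interleave this selection with incremental tree growth and Monte-Carlo rollouts.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
DEPTH, K = 8, 2
leaf_mu = rng.random(K ** DEPTH)                  # unknown leaf values in [0, 1]

counts, sums = {}, {}                             # per-node visit counts and reward sums

def select_leaf():
    """From the root, repeatedly pick the child maximizing B = xbar + sqrt(2 log n_p / n_i)."""
    node = ()
    for _ in range(DEPTH):
        n_p = max(counts.get(node, 1), 1)
        best, best_b = None, -math.inf
        for a in range(K):
            child = node + (a,)
            n_i = counts.get(child, 0)
            b = math.inf if n_i == 0 else sums[child] / n_i + math.sqrt(2 * math.log(n_p) / n_i)
            if b > best_b:
                best, best_b = child, b
        node = best
    return node

def leaf_index(leaf):
    return int("".join(map(str, leaf)), 2)

for _ in range(20000):
    leaf = select_leaf()
    r = float(rng.random() < leaf_mu[leaf_index(leaf)])     # Bernoulli reward of that leaf
    for d in range(DEPTH + 1):                              # update statistics on the whole path
        node = leaf[:d]
        counts[node] = counts.get(node, 0) + 1
        sums[node] = sums.get(node, 0.0) + r

most_visited = max((k for k in counts if len(k) == DEPTH), key=lambda k: counts[k])
print("value of the most visited leaf:", round(leaf_mu[leaf_index(most_visited)], 3),
      "  optimal leaf value:", round(leaf_mu.max(), 3))
```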

36 The MoGo program
Collaborative work with Yizao Wang, Sylvain Gelly, Olivier Teytaud and many others; see [Gelly et al., 2006]. Features:
Explore-exploit with UCT
(Min-max) Monte-Carlo evaluation
Asymmetric tree expansion
Anytime algorithm
Use of domain features
Among the best computer-Go programs. Interestingly: stochastic methods for a deterministic problem!

37 Analysis of UCT
Properties: the rewards obtained at a (non-leaf) node i are not i.i.d., thus the B-values are not upper confidence bounds on the node values. However, all leaves are eventually visited infinitely often, so the algorithm is eventually consistent: the regret is O(log(n)) after an initial period... which may last very... very long!

38 Bad case for UCT
Consider a binary tree of depth D in which the left branches appear to be the best: they are explored for a very long time before the optimal leaf is eventually reached. The expected regret is disastrous:
E R_n = Ω( exp(exp(... exp(1)...)) ) + O(log(n)),
with D nested exponentials. Much, much worse than uniform exploration!

39 In short... So far we have seen: Flat-UCB: does not exploit possible smoothness, but very good in the worst case! UCT: indeed adapts automatically to the effective smoothness of the tree, but the price of this adaptivity may be very very high. In good cases, UCT is VERY efficient! In bad cases, UCT is VERY poor! We should use the actual smoothness of the tree, if any, to design relevant algorithms. How to define the smoothness of the tree?

40 BAST (Bandit Algorithm for Smooth Trees) [Coquelin and Munos, 2007]
Assumption: along an optimal path, for each node i of depth d and for all leaves j ∈ L(i),
µ* − µ_j ≤ δ_d,
where δ_d is a smoothness function (a Lipschitz-like property).
Define the B-values:
B_i := min{ max_{j ∈ C(i)} B_j,  X̄_{i,n_i} + √(2 log(n_p)/n_i) + δ_d }.
Remark: UCT = BAST with δ_d = 0; Flat-UCB = BAST with δ_d = ∞.

41 Lipschitz optimization
Problem: find the maximum of f: [0,1] → [0,1], assumed to be 1-Lipschitz: |f(x) − f(y)| ≤ ℓ(x,y). An evaluation of f at x_t returns f(x_t).

42 Lipschitz optimization (continued)
If we wish to minimize the regret
R_n = n max_{x ∈ [0,1]} f(x) − Σ_{t=1}^n f(x_t),
where should one sample the next point?

43 Lipschitz optimization (continued) The optimistic strategy (i.e. select the point with highest upper bound) is a good strategy!

44 Lipschitz optimization with noisy evaluations
The evaluation of f at x_t returns a noisy reward r_t such that E[r_t | x_t] = f(x_t).

45 How to define high-confidence upper bounds?

46 Select a domain
Select a domain X. Then, with probability 1 − β, by Azuma's inequality:
(1/n) Σ_{t=1}^n r_t + √( log(β^{−1}) / (2n) ) + diam(X) ≥ sup_{x ∈ X} f(x),
since (1/n) Σ_{t=1}^n [r_t − f(x_t)] ≥ −√( log(β^{−1}) / (2n) ) with probability 1 − β, and f(x_t) + diam(X) ≥ sup_{x ∈ X} f(x).

47 A sub-domain?
Select a sub-domain of X with a smaller diameter and compute the corresponding upper bound.

48 Another sub-domain?
Select a sub-domain of X containing more samples and compute the corresponding upper bound.

49 Combine previous bounds
Consider concentration inequalities on all domains and use a union bound to derive the lowest (tightest) upper bound.

50 A hierarchical decomposition
Use a tree to partition the domain at all scales. UCB:
B_i := min{ max_{j ∈ C(i)} B_j,  X̄_{i,n_i} + √(2 log(n)/n_i) + diam(i) }.

51 Multi-armed bandits in metric spaces
More generally: let X be a metric space with distance ℓ(x,y), and let f be a Lipschitz function: |f(x) − f(y)| ≤ ℓ(x,y). Write f* := sup_{x ∈ X} f(x).
X-armed bandit problem on X: at each round t, choose a point (arm) x_t ∈ X and receive a reward r_t, an independent sample drawn from a distribution ν(x_t) with mean f(x_t).
Goal: minimize the regret R_n := Σ_{t=1}^n (f* − r_t).

52 Hierarchical Optimistic Optimization (HOO) [Bubeck et al., 2008]
Consider a tree of partitions of X: each node i corresponds to a domain D_i of the space. Write diam(i) = sup_{x,y ∈ D_i} ℓ(x,y) for the diameter of D_i. Let T_t denote the set of expanded nodes at round t.
HOO algorithm: start with T_1 = {root} (the whole domain X). At each round t:
Follow a path from the root, choosing at each node the child with the largest B-value, until reaching a node i_t outside T_t,
Expand the node i_t: choose (arbitrarily) a point x_t ∈ D_{i_t} and add i_t to the tree,
Observe the reward r_t ~ ν(x_t) and update the B-values:
B_i := min{ max_{j ∈ C(i)} B_j,  X̄_{i,n_i} + √(2 log(n)/n_i) + diam(i) }.
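A simplified sketch of HOO on X = [0,1] with the dyadic binary partition and dissimilarity ℓ(x,y) = |x − y|, so that diam(node) is the length of its interval. The target function, noise, horizon and the "centre of the most-played node" recommendation are illustrative choices of mine; see [Bubeck et al., 2008] for the full algorithm and its tuning.

```python
import math
import random

random.seed(0)

def f(x):                                  # unknown mean-reward function, maximum at x = 0.7
    return 0.8 - abs(x - 0.7)

class Node:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
        self.n, self.sum = 0, 0.0
        self.children = None               # None = node not expanded yet
        self.B = math.inf                  # B-value (infinite until sampled)

root = Node(0.0, 1.0)
n = 3000
for t in range(1, n + 1):
    # 1) follow the children with the largest B-values until an unexpanded node is reached
    path, node = [root], root
    while node.children is not None:
        node = max(node.children, key=lambda c: c.B)
        path.append(node)
    # 2) expand it and play a point of its domain (here: the midpoint), observe a noisy reward
    mid = 0.5 * (node.lo + node.hi)
    node.children = [Node(node.lo, mid), Node(mid, node.hi)]
    reward = f(mid) + random.uniform(-0.1, 0.1)
    # 3) update counts, means and B-values bottom-up along the traversed path
    for p in reversed(path):
        p.n += 1
        p.sum += reward
        u = p.sum / p.n + math.sqrt(2 * math.log(t) / p.n) + (p.hi - p.lo)
        p.B = min(u, max(c.B for c in p.children))

# recommend the centre of the most-played deep node
node = root
while node.children is not None and any(c.n > 0 for c in node.children):
    node = max(node.children, key=lambda c: c.n)
print("recommended x =", round(0.5 * (node.lo + node.hi), 3), "(true maximizer: 0.7)")
```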

53 Properties of HOO
Properties: these B-values are true upper confidence bounds on the values of the optimal nodes; the tree grows in an asymmetric way, leaving the sub-optimal branches mostly unexplored; essentially, only the good paths are explored.

54 Example in 1d
The reward at the chosen point x_t is r_t ~ Bernoulli(f(x_t)).
(Figures on the slide: the shape of f with maximum µ*, the location of the selected leaves, and the resulting trees at stages n = 4000, n = 10⁴ and n = 10⁶.)

55 Analysis of the regret
Let d be the dimension of X (i.e. such that we need O(ε^{−d}) balls of radius ε to cover X). Then E R_n = Õ( n^{(d+1)/(d+2)} ). We also have a lower bound E R_n = Ω( n^{(d+1)/(d+2)} ) [Kleinberg et al., 2008].
Now let d' be the near-optimality dimension of f in X, i.e. such that we need O(ε^{−d'}) balls of radius ε to cover X_ε := {x ∈ X : f(x) ≥ f* − ε}. Then
E R_n = Õ( n^{(d'+1)/(d'+2)} ).
Much better! (Same result as the zooming algorithm of [Kleinberg et al., 2008], but with a simpler algorithm.)

56 Example 1
Here the function is locally Lipschitz around its maximum: f(x*) − f(x) = Θ(‖x − x*‖). It takes O(ε⁰) = O(1) balls of radius ε to cover X_ε. Thus d' = 0 and the regret is Õ(n^{1/2}).

57 Example 2
Here the function is locally Hölder (with exponent 2) around its maximum: f(x*) − f(x) = Θ(‖x − x*‖²). It takes O(ε^{−d/2}) balls of radius ε to cover X_ε (a ball of radius O(ε^{1/2})). Thus d' = d/2 and the regret is Õ( n^{(d+2)/(d+4)} ).

58 Smoother functions
Assume the function is locally Hölder (with exponent α) around its maximum: f(x*) − f(x) = Θ(‖x − x*‖^α). Then it takes O(ε^{d/α − d}) balls of radius ε to cover X_ε (a ball of radius O(ε^{1/α})). Thus d' = d(1 − 1/α) and the regret is Õ( n^{(α(d+1)−d)/(α(d+2)−d)} ).
What's wrong? The smoother the function, the poorer the regret! And the regret depends on the dimension of the space (very poor in high dimensions).

59 Generalization?
We would like to use a metric that captures the local smoothness around the maximum, i.e. for a function locally α-Hölder around its maximum, f(x*) − f(x) = Θ(‖x − x*‖^α), we would like to use ℓ(x,y) = ‖x − y‖^α.
Why? So that X_ε may be covered by O(1) balls (under this "metric") of radius ε.
But ℓ is not a metric! But who cares? But f cannot be Lipschitz w.r.t. ℓ everywhere! But who cares? BUT??? Shut up!

60 Generalization! [Bubeck et al., 2008]
Actually, in the regret analysis of the HOO algorithm we need neither the assumption that X is a metric space nor that f is Lipschitz, but (almost) only that f is one-sided locally Lipschitz around its maximum with respect to a dissimilarity measure ℓ, i.e.
f(x*) − f(x) ≤ ℓ(x, x*).
Example: consider X = [0,1]^d and assume that f is locally α-Hölder around its maximum, i.e. f(x*) − f(x) = Θ(‖x − x*‖^α). Then we choose ℓ(x,y) = ‖x − y‖^α, and X_ε is covered by O(1) balls of radius ε. Thus the near-optimality dimension is d' = 0, and the regret of HOO is
E R_n = Õ(√n),
whatever the dimension of the space is!

61 Conclusion on bandits in general spaces
If we possess a measure of dissimilarity which expresses how similar a near-optimal arm is to an optimal one, then the regret of the HOO algorithm is Õ(√n).
Applications:
In continuous domains: say f: [0,1]^d → [0,1] has a finite number of maxima and is α-Hölder around its maxima; if we know the exponent α, then the regret of HOO is Õ(√n) whatever the dimension d is. Optimization is not more difficult than integration.
In smooth trees, i.e. when two close paths have similar values, e.g. when the value of a path is the sum of discounted rewards obtained along this path. Application to planning.

62 Application to planning (joint work with Jean-François Hren)
Consider a controlled deterministic system with discounted rewards. From the current state x_t, consider the look-ahead tree of all possible reachable states. Use n computational resources (CPU time, number of calls to a generative model) to explore the tree and return a proposed action a_t. This induces a policy π_n.
Goal: minimize the loss resulting from using policy π_n instead of an optimal one:
R_n := V* − V^{π_n}.

63 Look-ahead tree for planning in deterministic systems
At time t, for the current state x_t, build the look-ahead tree:
Root of the tree = current state x_t,
Nodes = states reachable by a sequence of actions,
A path collects the discounted sum of rewards obtained along it: Σ_{t ≥ 0} γ^t r_t.
Explore the tree using n computational resources and propose an action as close as possible to the optimal one.

64 Uniform exploration
Assume all rewards lie in [0,1]. For any reward function, the regret satisfies
R_n = O( n^{−(log 1/γ)/(log K)} ),
and for some reward function,
R_n = Ω( n^{−(log 1/γ)/(log K)} )
(K is the branching factor of the tree).

65 Proof of the upper bound
With uniform exploration, it takes n = Ω(K^d) expansions to expand all nodes of depth d. Because of the discount factor γ, once all rewards up to depth d are known, the optimal value of the tree is known up to a precision ε = γ^d/(1 − γ), which allows taking an ε-optimal action. Thus
ε = O(γ^d) = O( n^{−(log 1/γ)/(log K)} ),
which upper bounds the regret (worst case).

66 Optimistic exploration (BAST/HOO algorithm in the deterministic setting)
For any node i of depth d, define the B-value
B_i := Σ_{t=0}^{d−1} γ^t r_t + γ^d/(1 − γ),
where the sum is over the rewards collected along the path from the root to i; B_i is an upper bound on the value v_i of node i.
At each round, expand the node with the highest B-value, observe the reward, update the B-values, and repeat until no more resources are available. Then return the maximizing action.
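A sketch of this optimistic planner: repeatedly expand the leaf with the largest B_i = Σ_{t<d} γ^t r_t + γ^d/(1 − γ). The B-value formula follows the slide; the toy 1-d dynamics and reward (a "move towards 0.8" problem) and the rule used to extract the recommended action are purely illustrative.

```python
import heapq
import math

GAMMA, K, BUDGET = 0.9, 2, 2000

def step(x, a):                       # deterministic transition and reward in [0, 1]
    x2 = min(1.0, max(0.0, x + (0.05 if a == 1 else -0.05)))
    return x2, 1.0 - abs(x2 - 0.8)

def optimistic_plan(x0):
    # each heap entry: (-B_i, tie-breaker, state, depth, discounted return u_i, first action)
    heap, tie = [(-1.0 / (1 - GAMMA), 0, x0, 0, 0.0, None)], 0
    best_u, best_action = -math.inf, 0
    for _ in range(BUDGET):           # each expansion consumes one unit of resource
        negB, _, x, d, u, a0 = heapq.heappop(heap)
        if u > best_u:                # remember the best discounted return found so far
            best_u, best_action = u, (a0 if a0 is not None else 0)
        for a in range(K):            # expand: create the K children of this node
            x2, r = step(x, a)
            u2 = u + (GAMMA ** d) * r
            b2 = u2 + (GAMMA ** (d + 1)) / (1 - GAMMA)
            tie += 1
            heapq.heappush(heap, (-b2, tie, x2, d + 1, u2, a0 if a0 is not None else a))
    return best_action

print("action proposed at x0 = 0.5:", optimistic_plan(0.5))   # expect action 1 (move right)
```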

67 Analysis of the regret
Define β such that the proportion of ε-optimal paths is O(ε^β), and let κ := K γ^β ∈ [1, K].
If κ > 1, then R_n = O( n^{−(log 1/γ)/(log κ)} ) (recall that for uniform planning R_n = O( n^{−(log 1/γ)/(log K)} )).
If κ = 1, then R_n = O( γ^{(1−γ)β n / c} ), where c is defined by the proportion of ε-optimal paths being bounded by c ε^β. This gives exponential rates.
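A quick numerical comparison (my own, with illustrative values of γ, K and κ) of the exponents in the two polynomial rates above:

```python
import math

gamma, K = 0.9, 2
print("uniform exponent      :", round(math.log(1 / gamma) / math.log(K), 3))
for kappa in (1.8, 1.4, 1.1):
    print(f"optimistic, kappa={kappa}:", round(math.log(1 / gamma) / math.log(kappa), 3))
```

The smaller κ is (few near-optimal paths), the faster the optimistic regret decays, while the uniform rate stays stuck at the exponent determined by the full branching factor K.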

68 Some intuition
Write T for the tree of all expandable nodes:
T = { node i of depth d such that v_i + γ^d/(1 − γ) ≥ v* },
i.e. the set of nodes from which one cannot decide whether the node is optimal or not.
At any round n, the set of expanded nodes satisfies T_n ⊂ T, and κ is the branching factor of T. The regret R_n = O( n^{−(log 1/γ)/(log κ)} ) comes from a search in the tree T with branching factor κ ∈ [1, K].

69 Upper and lower bounds
For any κ ∈ [1, K], define P_κ as the set of problems having this κ value. For any problem P ∈ P_κ, write R_{A(P)}(n) for the regret of using algorithm A on problem P with n computational resources. Then:
sup_{P ∈ P_κ} R_{A_uniform(P)}(n) = Θ( n^{−(log 1/γ)/(log K)} ),
sup_{P ∈ P_κ} R_{A_optimistic(P)}(n) = Θ( n^{−(log 1/γ)/(log κ)} ).

70 Numerical illustration
2d problem: x = (u, v). Dynamics:
u_{t+1} = u_t + v_t Δt,   v_{t+1} = v_t + a_t Δt.
Reward: r(u,v) = u².
Figure: set of expanded nodes (n = 3000) using uniform planning; maximal depth reached = 10.

71 Numerical illustration
The exploration of the poor paths is shallow, while the good paths are explored to much greater depth.
Figure: set of expanded nodes (n = 3000) using optimistic planning; maximal depth reached = 43.

72 Two inverted pendulums linked by a spring
State space of dimension 8, 4 actions, n = 3000 at each iteration.

73 References
Collective book of the PDMIA community, Processus Décisionnels de Markov et Intelligence Artificielle, Hermès.
J.-Y. Audibert, R. Munos, and Cs. Szepesvári, Tuning bandit algorithms in stochastic environments, ALT, 2007.
P. Auer, N. Cesa-Bianchi, and P. Fischer, Finite-time analysis of the multiarmed bandit problem, Machine Learning, 2002.
P. Auer, N. Cesa-Bianchi, Y. Freund, and R. Schapire, The non-stochastic multi-armed bandit problem, SIAM Journal on Computing, 2002.
S. Bubeck, R. Munos, G. Stoltz, and Cs. Szepesvári, Online optimization in X-armed bandits, NIPS, 2008.
P.-A. Coquelin and R. Munos, Bandit algorithms for tree search, UAI, 2007.
S. Gelly, Y. Wang, R. Munos, and O. Teytaud, Modification of UCT with patterns in Monte-Carlo Go, INRIA Research Report, 2006.
W. Hoeffding, Probability inequalities for sums of bounded random variables, Journal of the American Statistical Association, 1963.

74 References (continued)
J.-F. Hren and R. Munos, Optimistic planning in deterministic systems, EWRL, 2008.
M. Kearns, Y. Mansour, and A. Ng, A sparse sampling algorithm for near-optimal planning in large Markov decision processes, Machine Learning, 2002.
R. Kleinberg, Nearly tight bounds for the continuum-armed bandit problem, NIPS, 2004.
R. Kleinberg, A. Slivkins, and E. Upfal, Multi-armed bandits in metric spaces, ACM Symposium on Theory of Computing, 2008.
L. Kocsis and Cs. Szepesvári, Bandit based Monte-Carlo planning, ECML, 2006.
T. L. Lai and H. Robbins, Asymptotically efficient adaptive allocation rules, Advances in Applied Mathematics, 1985.
G. Lugosi and N. Cesa-Bianchi, Prediction, Learning, and Games, Cambridge University Press, 2006.
R. Sutton and A. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998.
M. Zinkevich, M. Bowling, M. Johanson, and C. Piccione, Regret minimization in games with incomplete information, NIPS, 2007.
