Rollout Allocation Strategies for Classification-based Policy Iteration


1 Rollout Allocation Strategies for Classification-based Policy Iteration. V. Gabillon, A. Lazaric & M. Ghavamzadeh. Workshop on Reinforcement Learning and Search in Very Large Spaces, June 25, 2010, Haifa, Israel.

2 Outline: Classification-based Policy Iteration; Rollout Allocation; Previous Works; New Allocation Strategies over Actions; Experiments.


4 Definitions I: Discounted Markov Decision Process (MDP). A discounted MDP $\mathcal{M}$ is a tuple $\langle X, A, r, p, \gamma \rangle$, where: the state space is $X$; the set of actions $A$ is finite ($|A| = M < \infty$); the transition model $p(\cdot \mid x, a)$ is a distribution over $X$; the reward function is $r : X \times A \times X \to \mathbb{R}$; and $\gamma \in (0, 1)$ is a discount factor. A deterministic policy is a mapping $\pi : X \to A$.

5 Definitions II: Rollout. Sampling a trajectory starting from state $x_0$ with action $a_0$ and then following $\pi$ yields the return
$R^\pi(x_0, a_0) = r(x_0, a_0, x_1) + \sum_{t > 0} \gamma^t\, r(x_t, \pi(x_t), x_{t+1})$, where $x_{t+1} \sim p(\cdot \mid x_t, a_t)$ and $a_t = \pi(x_t)$ for $t > 0$.
Its expectation is the action-value function: $\mathbb{E}[R^\pi(x, a)] = Q^\pi(x, a)$ for all $x \in X$ and $a \in A$.
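For concreteness, a minimal Python sketch of such a rollout estimator; it is not from the slides, and the generative model step(x, a) (returning a sampled next state and reward) and the policy pi(x) are assumed interfaces.

```python
def rollout_return(step, pi, x0, a0, gamma, horizon=200):
    """One truncated rollout: take a0 first, then follow pi, and accumulate
    the discounted rewards sum_t gamma^t * r_t."""
    ret, discount = 0.0, 1.0
    x, a = x0, a0
    for _ in range(horizon):
        x_next, r = step(x, a)        # generative model: samples x' ~ p(.|x, a) and r(x, a, x')
        ret += discount * r
        discount *= gamma
        x, a = x_next, pi(x_next)     # after the first step, follow pi
    return ret

def estimate_q(step, pi, x, a, gamma, n_rollouts):
    """Monte Carlo estimate of Q^pi(x, a): average of n_rollouts rollout returns."""
    returns = [rollout_return(step, pi, x, a, gamma) for _ in range(n_rollouts)]
    return sum(returns) / len(returns)
```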

6-8 Classification-based Policy Iteration. Without value-function representation! The policy improvement step is cast as a classification problem: from rollout estimates we build a training set of (state, greedy action) pairs that should be as accurate as possible, which raises the question of how to allocate the rollouts.

9 Outline: Classification-based Policy Iteration; Rollout Allocation; Previous Works; New Allocation Strategies over Actions; Experiments.

10 Rollout Allocation Problem.
The goal: learn the greedy policy, i.e., maximize the accuracy of the training set $D_T$ by identifying the greedy action in each state of the rollout set $D_R$.
The setting: a fixed budget $L$; at each iteration, $L$ rollouts are performed (using a generative model) to estimate the Q-values over $D_R$. The rollout set is sampled from a fixed distribution $\rho$ and is kept fixed across iterations.
The problem: how should we allocate the budget over actions, over states, and over state-action pairs?
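To make the setting concrete, below is a schematic sketch of one classification-based policy-iteration step under the uniform baseline allocation; estimate_q and train_classifier are assumed placeholders (any Monte Carlo Q-estimator and any multi-class classifier), not part of the original work.

```python
def cbpi_iteration(states, actions, budget, estimate_q, train_classifier):
    """One policy-improvement step cast as classification (uniform baseline).
    estimate_q(x, a, n) should average n rollout returns of the current policy."""
    per_pair = max(1, budget // (len(states) * len(actions)))  # c(x, a) under uniform allocation
    training_set = []
    for x in states:                                            # rollout set D_R, sampled once from rho
        q_hat = {a: estimate_q(x, a, per_pair) for a in actions}
        greedy = max(q_hat, key=q_hat.get)                      # estimated greedy action in x
        training_set.append((x, greedy))                        # D_T entry: (state, greedy-action label)
    return train_classifier(training_set)                       # the improved policy
```

The allocation strategies discussed next replace the uniform per-pair split in this loop.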

11 Example. What is the best strategy to maximize the accuracy of the training set with a fixed budget? Can we do better than just allocating the budget uniformly over the state-action pairs? [Figure: a rollout set of four states, each with three actions a1, a2, a3.]

12 Allocation Strategies.
Strategies over actions: the budget is allocated uniformly over states; in each state we then face a pure-exploration bandit problem.
Strategies over states: the budget of each state is allocated uniformly over its actions; the question is how many rollouts each state should receive (e.g., allocate more rollouts to states where the greedy action is harder to discriminate).
Strategies over states and actions: can we merge strategies from the two problems above, or should we design something new?


14 Outline: Classification-based Policy Iteration; Rollout Allocation; Previous Works; New Allocation Strategies over Actions; Experiments.

15 Previous Works. Rollout Classification-based Policy Iteration (RCPI) was introduced by Lagoudakis & Parr (2003) and Fern et al. (2004); there, the rollout allocation was uniform over states and actions. Dimitrakakis & Lagoudakis (2008) proposed rollout allocation strategies over states; the goal of their work is to build an accurate training set with as few rollouts as possible.

16 Outline: Classification-based Policy Iteration; Rollout Allocation; Previous Works; New Allocation Strategies over Actions; Experiments.

17 Rollout Allocation Strategies over Actions.
Pure-exploration multi-armed bandit: a player has $L$ rounds to explore $M$ arms/actions and then outputs the one with the highest estimated expectation. The number of rounds may be unknown.
The setting: the budget is allocated uniformly over states, $L(x) = L/N$. In each state $x$, the player estimates $Q^\pi(x, a)$ for every $a \in A$. After $c(x, a)$ rollouts, the estimate is
$\hat{Q}^\pi_{c(x,a)}(x, a) = \frac{1}{c(x,a)} \sum_{j=1}^{c(x,a)} R^\pi_j(x, a)$,
and at the end of the budget the player outputs $\hat{a}^* = \arg\max_{a \in A} \hat{Q}^\pi_{L(x)}(x, a)$.

18 Rollout Allocation Strategies over Actions.
The goal: minimize the probability of not identifying the greedy action w.r.t. $\pi$ at a state $x$ with a fixed number of rollouts:
$\mathbb{P}(\text{error}) = \mathbb{P}[\hat{a}^* \neq a^*] = \mathbb{P}\big[\arg\max_{a \in A} \hat{Q}^\pi_{L(x)}(x, a) \neq \arg\max_{a \in A} Q^\pi(x, a)\big]$.
Bubeck et al. (2009) proved that cumulative-regret minimizers (e.g., UCB) are not optimal solutions for this problem.
Notation: $\Delta(x, a) = Q(x, a^*) - Q(x, a)$, $\Delta(x) = \min_{a \in A,\, a \neq a^*} \Delta(x, a)$, and $H(x) = \sum_{a \in A,\, a \neq a^*} \frac{1}{\Delta(x, a)^2}$.
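As a small illustration of these quantities, the following sketch computes $\Delta(x,a)$, $\Delta(x)$ and $H(x)$ from a dictionary of Q-values; the numbers in the usage example are made up.

```python
def gaps_and_hardness(q):
    """q: dict mapping action -> Q^pi(x, a). Returns (Delta(x, a), Delta(x), H(x))."""
    best_a = max(q, key=q.get)
    deltas = {a: q[best_a] - q[a] for a in q if a != best_a}   # Delta(x, a) for suboptimal a
    delta_x = min(deltas.values())                             # Delta(x): smallest gap
    hardness = sum(1.0 / d ** 2 for d in deltas.values())      # H(x): hardness of the state
    return deltas, delta_x, hardness

# Example: three actions; the smaller the gaps, the larger H(x).
print(gaps_and_hardness({"a1": 1.0, "a2": 0.8, "a3": 0.2}))
```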

19 Successive Rejects (SR), Audibert et al. (2010).
$L(x)$ is split into $M - 1$ phases of lengths $n_1, \dots, n_{M-1}$. Let $A_x = A$ be the set of actions still under consideration.
For $m = 1, 2, \dots, M - 1$:
pull the remaining actions uniformly: for each $a \in A_x$, perform $n_m - n_{m-1}$ rollouts from $(x, a)$;
stop considering the worst action: remove $\arg\min_{a \in A_x} \hat{Q}_{n_m}(x, a)$ from $A_x$.
Return the unique element of $A_x$.
[Figure: bar chart of the number of pulls per action, showing how the per-phase budgets $n_1, n_1 + n_2, \dots$ sum to $L(x)$.]


25 Successive Rejects (SR).
The phase lengths are defined, for $m \in \{1, \dots, M-1\}$, as
$n_m = \Big\lceil \frac{1}{\overline{\log}(M)} \cdot \frac{L(x) - M}{M + 1 - m} \Big\rceil$, with $\overline{\log}(M) = \frac{1}{2} + \sum_{m=2}^{M} \frac{1}{m}$.
The (estimated) $K$ best actions are sampled $O(L(x)/K)$ times.
Characteristics: no parameter to tune, but knowledge of $L(x)$ is needed.
Theoretical results: $\mathbb{P}(\text{error})$ is $O\big(\exp\big(-\frac{L(x) - M}{\overline{\log}(M)\, H(x)}\big)\big)$ for Successive Rejects and $O\big(\exp\big(-\frac{L(x)\, \Delta(x)^2}{M}\big)\big)$ for Uniform, so SR performs better than Uniform when $H(x) \ll M / \Delta(x)^2$.
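Combining the algorithm and the phase lengths above, a minimal sketch of Successive Rejects for a single state might look as follows; rollout(a) is an assumed callable returning one sampled return $R^\pi(x, a)$, and the max(1, ...) guard for very small budgets is an addition of this sketch.

```python
import math

def successive_rejects(actions, budget, rollout):
    """Best-action identification in one state with Successive Rejects
    (Audibert et al., 2010). rollout(a) returns one sampled return R^pi(x, a)."""
    M = len(actions)
    log_bar = 0.5 + sum(1.0 / m for m in range(2, M + 1))          # log-bar(M)

    def n(m):
        # Phase length n_m as defined above; max(1, ...) guards tiny budgets.
        return max(1, math.ceil((budget - M) / (log_bar * (M + 1 - m))))

    remaining = list(actions)
    sums = {a: 0.0 for a in actions}
    counts = {a: 0 for a in actions}
    prev = 0
    for m in range(1, M):                                          # M - 1 phases
        for a in remaining:                                        # n_m - n_{m-1} new rollouts each
            for _ in range(n(m) - prev):
                sums[a] += rollout(a)
                counts[a] += 1
        prev = n(m)
        worst = min(remaining, key=lambda a: sums[a] / counts[a])  # lowest empirical Q-value
        remaining.remove(worst)
    return remaining[0]                                            # estimated greedy action
```

Called with rollout bound to the generative model of a given state, this returns the action SR would put in the training set for that state.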

26 UCB-E, Audibert et al. (2010).
Parameter: an exploration parameter $s > 0$.
Definition of the index: for $a \in A$, for $c \geq 1$ and with $B_{a,0} = +\infty$, let
$B_{a,c} = \hat{Q}_c(x, a) + \sqrt{\frac{s\,(L(x) - M)}{H(x)\, c}}$.
Algorithm: for each round $t = 1, 2, \dots, L(x)$, explore the action with the highest index, i.e., perform a rollout from $(x, A_t)$ with $A_t \in \arg\max_{a \in A} B_{a,\, c(x,a)}$.
Output the action with the maximum estimated Q-value.

27 UCB-E, Audibert et al. (2010).
Characteristics: it has an exploration parameter and needs knowledge of both $L(x)$ and $H(x)$; $\mathbb{P}(\text{error}) = O\big(\exp\big(-\frac{L(x) - M}{H(x)}\big)\big)$.
Versions: Adaptive (online under-estimation of $H(x)$) and Any-time (replace $L(x)$ with the current round $t$).
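A minimal sketch of UCB-E for a single state, following the index reconstructed above; the exact exploration width $\sqrt{s\,(L(x)-M)/(H(x)\,c)}$ is a reading of the garbled slide and should be treated as an assumption, and the adaptive and any-time variants are only indicated in comments.

```python
import math

def ucb_e(actions, budget, rollout, hardness, s=1.0):
    """Best-action identification with UCB-E (Audibert et al., 2010) in one state.
    hardness is H(x); rollout(a) returns one sampled return R^pi(x, a)."""
    M = len(actions)
    sums = {a: 0.0 for a in actions}
    counts = {a: 0 for a in actions}

    def index(a):
        if counts[a] == 0:
            return float("inf")                       # B_{a,0} = +infinity
        mean = sums[a] / counts[a]
        # Exploration width read as s * (L(x) - M) / (H(x) * c(x, a)).
        # Adaptive UCB-E would re-estimate H(x) online instead of taking it as input;
        # the any-time variant replaces (budget - M) by the current round t.
        return mean + math.sqrt(s * max(budget - M, 0) / (hardness * counts[a]))

    for _ in range(budget):                           # L(x) rounds
        a_t = max(actions, key=index)                 # pull the action with the highest index
        sums[a_t] += rollout(a_t)
        counts[a_t] += 1
    return max(actions,
               key=lambda a: sums[a] / counts[a] if counts[a] else float("-inf"))
```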

28 Outline: Classification-based Policy Iteration; Rollout Allocation; Previous Works; New Allocation Strategies over Actions; Experiments.

29 Domains. We used the classical settings of Mountain Car and Inverted Pendulum, but with increased noise.
Mountain Car (MC): formulation from Sutton & Barto (1998), noise increased to 1.2.
Inverted Pendulum (IP): formulation from Lagoudakis & Parr (2003), noise increased to 15.

30 Artificially increasing the number of actions.
[Figure: the enlarged set of player actions is mapped onto the real actions (left, null, right).]
Mapping in Mountain Car: one player action maps to forward, one to reverse, and all the others to the stay action.
Mapping in Inverted Pendulum: one maps to right, one to left, and the rest are stochastic actions with $P(\text{right}) = P(\text{left}) = P(\text{null}) = \frac{1}{3}$.
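A small sketch of this action-duplication trick for Mountain Car, assuming a generative model step(x, real_action) whose real actions are labeled "forward", "reverse" and "stay"; the Inverted Pendulum mapping would instead make the surplus actions sample uniformly among right, left and null.

```python
def make_expanded_step(step, n_actions):
    """Wrap a 3-action Mountain Car model so it exposes n_actions player actions:
    action 0 -> forward, action 1 -> reverse, every other action -> stay."""
    def expanded_step(x, a):
        if a == 0:
            return step(x, "forward")
        if a == 1:
            return step(x, "reverse")
        return step(x, "stay")   # the remaining n_actions - 2 actions are copies of stay
    return expanded_step
```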

31-32 Training set accuracy (SR vs Uniform).
[Figure: differences in the percentage of error over a fixed $D_R$, as a function of the number of actions (M) and the budget (L/MN), for several sizes of $D_R$.]
SR provides a more accurate training set than Uniform. Does this improvement propagate through iterations?

33-35 SR vs Uniform in Mountain Car.
[Figure: differences in the performance of the policies (steps), as a function of the number of actions (M) and the budget (L/MN), for rollout sets of different sizes (including 20, 50 and 100 states).]
When the problem is neither too hard nor too easy, SR outperforms Uniform.

36-37 SR vs Uniform in Inverted Pendulum.
[Figure: differences in the performance of the policies (steps), as a function of the number of actions (M) and the budget (L/MN), for rollout sets of different sizes (including 50, 100 and 200 states).]

38-39 Adaptive UCB-E vs SR in Mountain Car.
[Figure: differences in the performance of the policies with s = 1, as a function of the number of actions (M) and the budget (L/MN), for rollout sets of different sizes (including 20, 50 and 100 states).]

40-41 Adaptive UCB-E vs SR in Inverted Pendulum.
[Figure: differences in the performance of the policies with s = 1, as a function of the number of actions (M) and the budget (L/MN), for rollout sets of different sizes (including 50 and 100 states).]

42-43 Any-time Adaptive UCB-E vs Uniform in Mountain Car.
[Figure: differences in the performance of the policies with s = 1, as a function of the number of actions (M) and the budget (L/MN), for rollout sets of different sizes (including 20, 50 and 100 states).]

44-45 Any-time Adaptive UCB-E vs Uniform in Inverted Pendulum.
[Figure: differences in the performance of the policies with s = 1, as a function of the number of actions (M) and the budget (L/MN), for rollout sets of different sizes (including 50 and 100 states).]

46-47 Final Comparison.
[Figure: Mountain Car with 100 actions and 50 states; average number of steps to the goal as a function of the budget (L/MN), comparing Uniform, Successive Rejects, Adaptive UCB-E and Any-time Adaptive UCB-E.]

48 Future Work and Conclusions.
Summary: we studied different rollout allocation strategies for RCPI with a fixed budget; the proposed strategies over actions (SR, UCB-E, ...) significantly outperform the uniform strategy.
Future work: address rollout allocation over states and over state-action pairs; Any-time Adaptive UCB-E could be combined with any strategy over states.

49 Discussion

50 Bibliography I

51 Dimitrakakis et al.
Construct a rollout set $D_R = \{x_i\}_{i=1}^N$ with $x_i \overset{iid}{\sim} \rho$; set $rol = 0$ and $D_T = \emptyset$.
While $|D_T| \leq$ MAXSIZETRAINING and $rol \leq L$ do:
$x \leftarrow$ SelectState(); $c(x) \leftarrow c(x) + 1$;
sample all the actions in this state: $rol \leftarrow rol + M$;
if a dominating action $\hat{a}^*$ exists in $x$ then
$D_T \leftarrow D_T \cup \{((x, \hat{a}^*), +)\}$ and $D_T \leftarrow D_T \cup \{((x, a), -)\}$ for all $a \neq \hat{a}^*$,
and replace $x$ in $D_R$ by $y \sim \rho$.
SelectState: return $\arg\max_{x \in D_R} \big(\hat{\Delta}(x) + \text{UCB}(x)\big)$, where $\hat{\Delta}(x) = \min_{a \in A,\, a \neq \hat{a}^*} \big(\hat{Q}^{\pi_k}(x, \hat{a}^*) - \hat{Q}^{\pi_k}(x, a)\big)$.
Significance test: $\hat{a}^*$ is dominating if $c(x)\, \hat{\Delta}(x)^2 \geq K_{st}\, R_{max} \log(1/\delta)$, with $\delta$ an arbitrary accuracy parameter.
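A compact sketch of the state-selection rule and significance test described above; the accessors q_hats and ucb_width are assumed placeholders, and the test mirrors the (partly reconstructed) condition $c(x)\,\hat{\Delta}(x)^2 \geq K_{st}\,R_{max}\log(1/\delta)$.

```python
import math

def empirical_gap(q_hat):
    """Delta-hat(x): gap between the empirical best action and the runner-up."""
    best = max(q_hat, key=q_hat.get)
    return min(q_hat[best] - q_hat[a] for a in q_hat if a != best)

def select_state(rollout_set, q_hats, ucb_width):
    """Pick the state maximizing empirical gap plus exploration bonus UCB(x)."""
    return max(rollout_set, key=lambda x: empirical_gap(q_hats[x]) + ucb_width(x))

def is_dominating(q_hat, c_x, k_st, r_max, delta):
    """Significance test from the slide: c(x) * Delta-hat(x)^2 >= K_st * R_max * log(1/delta)."""
    return c_x * empirical_gap(q_hat) ** 2 >= k_st * r_max * math.log(1.0 / delta)
```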

52 References.
J.-Y. Audibert, S. Bubeck & R. Munos. Best arm identification in multi-armed bandits. COLT, 2010.
S. Bubeck, R. Munos & G. Stoltz. Pure exploration in multi-armed bandits problems. ALT, 2009.
C. Dimitrakakis & M. Lagoudakis. Rollout sampling approximate policy iteration. Machine Learning, 72(3), 2008.
A. Fern, S. Yoon & R. Givan. Approximate policy iteration with a policy language bias. NIPS, 2004.
M. Lagoudakis & R. Parr. Reinforcement learning as classification: Leveraging modern classifiers. ICML, 2003.
R. Sutton & A. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

53 Dimitrakakis & Lagoudakis (2008).
The target of these strategies is to create an accurate training set with as few rollouts as possible. Their work focuses on the states where the greedy action seems easier to discriminate from the other actions. When a dominant action $\hat{a}^*$ is found in state $x$, $(x, \hat{a}^*)$ is immediately put in $D_T$ and $x$ is replaced by a new state $x_{new} \sim \rho$. They propose no rollout allocation strategy over actions.

54 Regression-based Policy Iteration
