Reinforcement Learning


1 Reinforcement Learning Model-based RL and Integrated Learning-Planning Planning and Search, Model Learning, Dyna Architecture, Exploration-Exploitation (many slides from lectures of Marc Toussaint & David Silver) Hung Ngo MLR Lab, University of Stuttgart

2 RL Approaches
From experience data D = {(s, a, r, s')_t}_{t=0}^T:
- Model-based RL: learn a model P(s'|s,a), R(s,a); dynamic programming gives V(s), then a policy π(s)
- Model-free RL: learn the value function V(s), then a policy π(s)
- Policy Search: optimize the policy π(s) directly
From demonstration data D = {(s_{0:T}, a_{0:T})_d}_{d=1}^n:
- Imitation Learning: learn the policy π(s) directly
- Inverse RL: learn latent costs R(s,a); dynamic programming gives V(s), then a policy π(s)
2/53

3 Outline 1. Monte-Carlo planning, MCTS, TD-search 2. Model-based RL 3. Integrated learning & planning (Dyna) 4. Exploration vs. exploitation PAC-MDP, artificial curiosity & exploration bonus, Bayesian RL 3/53

4 1. Monte-Carlo Planning, Tree Search Online approximate planning for the now 4/53

5 Refresh: Planning with DP Backup
V^π(s) = E_π[ r_{t+1} + γ V^π(s_{t+1}) | s_t = s ] = Σ_a π(s,a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]
Full-width backup. Iterate for all states. 5/53
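To make the full-width backup concrete, here is a minimal tabular policy-evaluation sketch; the array layout (P[a], R[a], policy) and the function name are illustrative assumptions, not from the slides:

import numpy as np

def policy_evaluation(P, R, policy, gamma=0.95, tol=1e-6):
    """Iterative policy evaluation with full-width backups.

    P[a][s, s'] : transition probabilities, R[a][s, s'] : rewards,
    policy[s, a]: action probabilities (all tabular, illustrative shapes).
    """
    n_states = policy.shape[0]
    V = np.zeros(n_states)
    while True:
        # One sweep: back up every state from all of its successors (full width)
        V_new = np.zeros(n_states)
        for a in range(policy.shape[1]):
            V_new += policy[:, a] * np.sum(P[a] * (R[a] + gamma * V), axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new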

6 Heuristic/Forward Search
Plan for (only) now: use the MDP model to look ahead from the current state s_t
Forward-search algorithms select the best action by lookahead
They build a search tree with the current state s_t at the root
No need to solve the whole MDP, just the sub-MDP starting from now
[Figure: lookahead tree rooted at s_t with terminal leaf nodes T] 6/53

7 Heuristic/Forward Search
Plan for (only) now: use the MDP model to look ahead from the current state s_t
Build a search tree with the current state s_t as the root node
No need to solve the whole MDP, just the sub-MDP starting from now
Backup from the leaf nodes; leaf values could be pre-defined
The deep backups of heuristic search can be implemented as a sequence of individual one-step backups, ordered (e.g. selectively, depth-first) to focus on the current state and its likely successors; this is one reason why heuristic search can be so effective (Sutton & Barto, Figure 8.12)
Can we still do fine without having to build an exhaustive search tree? 7/53

8 Refresh: Sample-based Learning
During learning, the agent samples experience from the real world
Real experience: sampled from the true model, i.e., the environment
s_{t+1} ~ P(s'|s_t, a_t);  r_{t+1} ~ P(r|s_t, a_t)
Then use model-free RL: MC, TD(λ), SARSA, Q-learning, etc.
MC, TD(λ): sample backup. 8/53

9 Sample-based Planning
Use the model only to generate samples (as a simulator!)
Simulated experience: sampled from the estimated model
s_{t+1} ~ P̂(s'|s_t, a_t);  r_{t+1} ~ P̂(r|s_t, a_t) 9/53

10 Sample-based Planning
Use the model only to generate samples (as a simulator!)
Simulated experience: sampled from the estimated model
s_{t+1} ~ P̂(s'|s_t, a_t);  r_{t+1} ~ P̂(r|s_t, a_t)
Apply model-free RL (MC, TD, Sarsa, Q-learning) to the simulated experience
Sample-based planning methods are often more efficient: they break the curse of dimensionality, are computationally efficient, anytime, and parallelizable, and work for black-box models (only samples are required). 9/53
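As a hedged illustration of sample-based planning, the sketch below draws transitions from an estimated tabular model and applies Q-learning to the simulated experience; the model layout (P_hat[a], R_hat[a]) and function names are assumptions for this example:

import numpy as np

def q_planning(P_hat, R_hat, n_states, n_actions,
               gamma=0.95, alpha=0.1, n_updates=10000, rng=None):
    """Apply Q-learning to experience sampled from an estimated model.

    P_hat[a] : estimated transition matrix (n_states x n_states),
    R_hat[a] : estimated expected reward per (s, s') pair.
    """
    rng = rng or np.random.default_rng(0)
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_updates):
        # Sample a (s, a) pair, then a successor from the learned model
        s = rng.integers(n_states)
        a = rng.integers(n_actions)
        s_next = rng.choice(n_states, p=P_hat[a][s])
        r = R_hat[a][s, s_next]
        # Standard Q-learning backup applied to the simulated transition
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    return Q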

11 Simulation-based Search
Combine forward search + sample-based planning
Experience is simulated from now, i.e., from the current real state s_t:
{s_t, a^k_t, r^k_{t+1}, s^k_{t+1}, ..., s^k_{T_k}}_{k=1}^K ~ (M, π_s)
Apply model-free RL to the simulated episodes: MC search, TD search 10/53

12 Simple/Flat Monte-Carlo Search
Given a model M and a (fixed) simulation policy π_sim (e.g., random):
For each action a ∈ A, simulate K episodes from the current (real) state s_t:
{s_t, a, r^k_{t+1}, s^k_{t+1}, ..., s^k_{T_k}}_{k=1}^K ~ (M, π_sim)
Evaluate actions by average return (Monte-Carlo evaluation):
Q(s_t, a) = (1/K) Σ_k R^k_t  →  Q^{π_sim}(s_t, a) in probability, w.r.t. M
Select the current (real) action with maximum estimated value:
a_t = argmax_{a ∈ A} Q(s_t, a)
A branch is built but then thrown away. 11/53
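A minimal sketch of flat Monte-Carlo search; simulate_episode is an assumed black-box helper that rolls out the fixed simulation policy from (s_t, a) and returns the episode return:

import numpy as np

def flat_mc_search(s_t, actions, simulate_episode, K=100):
    """Flat Monte-Carlo search: evaluate each action by K rollouts
    of a fixed simulation policy, then act greedily."""
    Q = {}
    for a in actions:
        # Monte-Carlo evaluation: average return over K simulated episodes
        returns = [simulate_episode(s_t, a) for _ in range(K)]
        Q[a] = np.mean(returns)
    # Select the real action with the maximum estimated value
    return max(Q, key=Q.get)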

13 Monte-Carlo Evaluation in Go
[Figure: from the current position s, four simulation outcomes; two wins give V(s) = 2/4 = 0.5] 12/53

14 Monte-Carlo Evaluation in Go
[Figure: from the current position s, simulation outcomes give V(s) = 2/4 = 0.5]
Discuss AlphaGo: scale things up!
Value network pre-trained using expert games
Self-play using MCTS with a pre-trained rollout policy 12/53

15 Monte-Carlo Tree Search (MCTS)
Build a search tree during simulated episodes
Cache statistics of rewards and #visits at each (s, a) pair
Use them to update a tree policy, e.g., UCT (UCB applied to trees):
π_uct(s) = argmax_a [ Q̂(s, a) + β √(2 log n_s / n_sa) ],  ∀ s ∈ tree
Outside the current tree: just follow some default rollout policy 13/53

16 Monte-Carlo Tree Search (MCTS)
Build a search tree during simulated episodes
Cache statistics of rewards and #visits at each (s, a) pair
Use them to update a tree policy, e.g., UCT (UCB applied to trees):
π_uct(s) = argmax_a [ Q̂(s, a) + β √(2 log n_s / n_sa) ],  ∀ s ∈ tree
Outside the current tree: just follow some default rollout policy
Grow and visit more often the promising & rewarding parts 13/53
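A sketch of the UCT tree policy above; the dictionaries Q, n_s, n_sa holding value estimates and visit counts are assumed bookkeeping, not prescribed by the slides:

import math

def uct_action(s, actions, Q, n_s, n_sa, beta=1.0):
    """UCB applied to trees: pick the action maximizing Q plus an exploration bonus.
    Assumes n_s[s] is kept consistent with the per-action counts n_sa."""
    def score(a):
        if n_sa.get((s, a), 0) == 0:
            return float("inf")           # try untried actions first
        bonus = beta * math.sqrt(2.0 * math.log(n_s[s]) / n_sa[(s, a)])
        return Q.get((s, a), 0.0) + bonus
    return max(actions, key=score)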

17 Monte-Carlo Tree Search (MCTS)
Simulate K episodes following a tree policy and a rollout policy:
{s_t, a^k_t, r^k_{t+1}, s^k_{t+1}, ..., s^k_{T_k}}_{k=1}^K ~ (M, π^k_tree, π^k_rollout)
Q̂(s, a) = (1/n_sa) Σ_{k=1}^K Σ_{t'=t}^{T_k} R^k_{t'} · 1(s^k_{t'} = s, a^k_{t'} = a),  ∀ s ∈ tree
(here R^k_{t'} denotes the return from step t' of episode k, and n_sa the number of such visits) 14/53

18 Monte-Carlo Tree Search (MCTS)
Simulate K episodes following a tree policy and a rollout policy:
{s_t, a^k_t, r^k_{t+1}, s^k_{t+1}, ..., s^k_{T_k}}_{k=1}^K ~ (M, π^k_tree, π^k_rollout)
Q̂(s, a) = (1/n_sa) Σ_{k=1}^K Σ_{t'=t}^{T_k} R^k_{t'} · 1(s^k_{t'} = s, a^k_{t'} = a),  ∀ s ∈ tree
The greedy tree policy is improved after each simulated episode k:
interleaving MC policy evaluation & policy improvement within each simulated episode
exploit regions of the tree that currently appear better than others, while continuing to explore unknown or less-known parts of the tree 14/53

19 Monte-Carlo Tree Search (MCTS)
Simulate K episodes following a tree policy and a rollout policy:
{s_t, a^k_t, r^k_{t+1}, s^k_{t+1}, ..., s^k_{T_k}}_{k=1}^K ~ (M, π^k_tree, π^k_rollout)
Q̂(s, a) = (1/n_sa) Σ_{k=1}^K Σ_{t'=t}^{T_k} R^k_{t'} · 1(s^k_{t'} = s, a^k_{t'} = a),  ∀ s ∈ tree
The greedy tree policy is improved after each simulated episode k:
interleaving MC policy evaluation & policy improvement within each simulated episode
exploit regions of the tree that currently appear better than others, while continuing to explore unknown or less-known parts of the tree
The default/rollout policy: random, pretrained, or learned on real experience using e.g. model-free off-policy methods 14/53

20 Monte-Carlo Tree Search (MCTS)
Simulate K episodes following a tree policy and a rollout policy:
{s_t, a^k_t, r^k_{t+1}, s^k_{t+1}, ..., s^k_{T_k}}_{k=1}^K ~ (M, π^k_tree, π^k_rollout)
Q̂(s, a) = (1/n_sa) Σ_{k=1}^K Σ_{t'=t}^{T_k} R^k_{t'} · 1(s^k_{t'} = s, a^k_{t'} = a),  ∀ s ∈ tree
The greedy tree policy is improved after each simulated episode k:
interleaving MC policy evaluation & policy improvement within each simulated episode
exploit regions of the tree that currently appear better than others, while continuing to explore unknown or less-known parts of the tree
The default/rollout policy: random, pretrained, or learned on real experience using e.g. model-free off-policy methods
Converges on the optimal search-tree values/policy: Q̂(s, a) → Q*(s, a), ∀ s ∈ tree 14/53
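A sketch of the Monte-Carlo backup behind the Q̂ estimate above: after each simulated episode, every visited (s, a) pair is credited with the return that followed it (in MCTS this would be restricted to the in-tree portion of the episode); the data structures are assumptions for illustration:

def mc_backup(episode, Q, n_sa, gamma=1.0):
    """episode: list of (s, a, r) triples from one simulation.
    Updates running-average Q(s, a) over the return following each visit."""
    G = 0.0
    # Walk backwards so G accumulates the return from step t onwards
    for s, a, r in reversed(episode):
        G = r + gamma * G
        n_sa[(s, a)] = n_sa.get((s, a), 0) + 1
        q = Q.get((s, a), 0.0)
        Q[(s, a)] = q + (G - q) / n_sa[(s, a)]
    return Q, n_sa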

21 Monte-Carlo Tree Search in Go (1)
[Figure: tree-search illustration; node (x/y) = average return / number of trials] 15/53

22 Monte-Carlo Tree Search in Go (2)
[Figure: tree-search illustration] 16/53

23 Monte-Carlo Tree Search in Go (3)
[Figure: tree-search illustration] 17/53

24 Monte-Carlo Tree Search in Go (4)
[Figure: tree-search illustration] 18/53

25 Monte-Carlo Tree Search in Go (5)
[Figure: tree-search illustration] 19/53

26 Monte-Carlo Tree Search (MCTS)
[Figure (Sutton & Barto, Sec. 8.7): tree-growth illustration. Legend: new node in the tree; node stored in the tree; state visited but not stored; terminal outcome; current simulation; previous simulation] 20/53

27 Temporal-Difference Search: Bootstrapping
MC tree search applies MC control to the sub-MDP from now
TD search applies Sarsa to the sub-MDP from now
For each step of simulation, update action values by Sarsa:
Q(s, a) ← Q(s, a) + α ( r + γ Q(s', a') − Q(s, a) )
As for model-free RL, bootstrapping is helpful:
TD learning/search reduces variance but increases bias
TD(λ) learning/search can be much more efficient than MC 21/53
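A sketch of the per-step Sarsa backup used in TD search, with a plain dictionary for Q (an assumption for illustration):

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.95):
    """One-step Sarsa: bootstrap from the current estimate of the next (s', a')."""
    td_error = r + gamma * Q.get((s_next, a_next), 0.0) - Q.get((s, a), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error
    return Q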

28 2. Model-based RL
Built once, used forever! 22/53

29 The Big Picture: Planning, Learning, and Acting
Learning allows an agent to improve its policy from its interactions with the environment.
Planning allows it to improve its policy without further interaction. 23/53

30 Model-based RL
Model learning: given experience D = {(s_t, a_t, r_{t+1}, s_{t+1})}_{t=1}^H,
learning P(s', r | s, a) is a regression/density-estimation problem:
discrete state-action space: counting, P̂(s'|s, a) = n_{s'sa} / n_{sa}
continuous state-action space: P̂(s'|s, a) = N(s' | φ(s, a)^T β, Σ); estimate the parameters β (and perhaps Σ) as in regression 24/53
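A hedged sketch of model learning by counting for the discrete case; the storage layout and class name are assumptions:

from collections import defaultdict

class CountModel:
    """Tabular model: P_hat(s'|s,a) = n(s,a,s') / n(s,a), R_hat(s,a) = mean reward."""
    def __init__(self):
        self.n_sa = defaultdict(int)
        self.n_sas = defaultdict(int)
        self.r_sum = defaultdict(float)

    def update(self, s, a, r, s_next):
        self.n_sa[(s, a)] += 1
        self.n_sas[(s, a, s_next)] += 1
        self.r_sum[(s, a)] += r

    def p_hat(self, s, a, s_next):
        n = self.n_sa[(s, a)]
        return self.n_sas[(s, a, s_next)] / n if n else 0.0

    def r_hat(self, s, a):
        n = self.n_sa[(s, a)]
        return self.r_sum[(s, a)] / n if n else 0.0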

31 Model-based RL
Model learning: given experience D = {(s_t, a_t, r_{t+1}, s_{t+1})}_{t=1}^H,
learning P(s', r | s, a) is a regression/density-estimation problem:
discrete state-action space: counting, P̂(s'|s, a) = n_{s'sa} / n_{sa}
continuous state-action space: P̂(s'|s, a) = N(s' | φ(s, a)^T β, Σ); estimate the parameters β (and perhaps Σ) as in regression
D as a model: experience replay! 24/53

32 Model-based RL
Model learning: given experience D = {(s_t, a_t, r_{t+1}, s_{t+1})}_{t=1}^H,
learning P(s', r | s, a) is a regression/density-estimation problem:
discrete state-action space: counting, P̂(s'|s, a) = n_{s'sa} / n_{sa}
continuous state-action space: P̂(s'|s, a) = N(s' | φ(s, a)^T β, Σ); estimate the parameters β (and perhaps Σ) as in regression
D as a model: experience replay!
Example: a linear forward model per action, φ(s') = F_a φ(s), r = b_a^T φ(s)
Least-mean-squares (LMS) SGD update rule:
F ← F + α (φ(s') − F φ(s)) φ(s)^T;  b ← b + α (r − b^T φ(s)) φ(s) 24/53

33 Model-based RL
Model learning: given experience D = {(s_t, a_t, r_{t+1}, s_{t+1})}_{t=1}^H,
learning P(s', r | s, a) is a regression/density-estimation problem:
discrete state-action space: counting, P̂(s'|s, a) = n_{s'sa} / n_{sa}
continuous state-action space: P̂(s'|s, a) = N(s' | φ(s, a)^T β, Σ); estimate the parameters β (and perhaps Σ) as in regression
D as a model: experience replay!
Example: a linear forward model per action, φ(s') = F_a φ(s), r = b_a^T φ(s)
Least-mean-squares (LMS) SGD update rule:
F ← F + α (φ(s') − F φ(s)) φ(s)^T;  b ← b + α (r − b^T φ(s)) φ(s)
To construct V/π from the learned model, use planning:
discrete case: DP on the estimated model (VI, PI, etc.); sample-based planning (MCTS, TD-search): simple but powerful
continuous case: differential DP, planning-by-inference, etc. 24/53
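A sketch of the LMS/SGD updates above for a per-action linear forward model; the feature map phi, array shapes, and function name are assumptions:

import numpy as np

def lms_model_update(F, b, phi_s, phi_s_next, r, alpha=0.01):
    """One SGD step on a linear forward model that predicts
    phi(s') ~= F @ phi(s) and r ~= b @ phi(s)."""
    pred_error = phi_s_next - F @ phi_s           # vector prediction error
    F = F + alpha * np.outer(pred_error, phi_s)   # F <- F + a (phi(s') - F phi(s)) phi(s)^T
    b = b + alpha * (r - b @ phi_s) * phi_s       # b <- b + a (r - b^T phi(s)) phi(s)
    return F, b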

34 Model-based RL: Pros and Cons Advantages: Can efficiently learn model by supervised learning methods Rapid adaptation to new problems and situations (via planning) Can reason about model uncertainty 25/53

35 Model-based RL: Pros and Cons Advantages: Can efficiently learn model by supervised learning methods Rapid adaptation to new problems and situations (via planning) Can reason about model uncertainty Disadvantages: two sources of approximation error! In estimating the model and the value function If the model is inaccurate, planning will compute a suboptimal policy Hence, asymptotically, model-free methods are often better 25/53

36 Model-based RL: Pros and Cons Advantages: Can efficiently learn model by supervised learning methods Rapid adaptation to new problems and situations (via planning) Can reason about model uncertainty Disadvantages: two sources of approximation error! In estimating the model and the value function If the model is inaccurate, planning will compute a suboptimal policy Hence, asymptotically, model-free methods are often better Solution 1: reason explicitly about model uncertainty (BRL) Solution 2: use model-free RL when the model is wrong Solution 3: integrate model-based and model-free 25/53

37 3. Integrated Learning & Planning: Dyna
Combining the best of both worlds!
(Background excerpt on model-free vs. model-based decision making:) "... taught the commuter that on Friday evenings the best action at this intersection is to continue straight and avoid the freeway. Model-free methods are clearly easier to use in terms of online decision-making; however, much trial-and-error experience is required to make the values be good estimates of future consequences. Moreover, the cached values are inherently inflexible: although hearing about an unexpected traffic jam on the radio can immediately affect action selection that is based on a forward model, the effect of the traffic jam on a cached propensity such as avoid the freeway on Friday evening cannot be calculated without further trial-and-error learning on days in which this traffic jam occurs. Changes in the goal of behavior, as when moving to a new house, also expose the differences between the methods: whereas model-based decision making can be immediately sensitive to such a goal-shift, cached values are again slow to change appropriately. Indeed, many of us have experienced this directly in daily life after moving house. We clearly know the location of our new home, and can make our way to it by concentrating on the new route; but we can occasionally take an habitual wrong turn toward the old address if our minds wander. Such introspection, and a wealth of rigorous behavioral studies (see [15] for a review), suggests that the brain employs both model-free and model-based decision-making strategies in parallel, with each dominating in different circumstances [14]. Indeed, somewhat different neural substrates underlie each one [17]." 26/53

38 Dyna: Integrating Learning and Planning Model-free RL No model Learn value function (and/or policy) from real experience Model-based RL (using sample-based planning) Learn a model from real experience Plan value function/policy from simulated experience Dyna Learn a model from real experience Learn & plan value function/policy from real & simulated experience 27/53

39 The Dyna Architecture Two distributions of states and actions (experience) Learning distribution (real experience) Search distribution (simulated experience) Integrated approaches differ in generating search distributions simulated transitions: Dyna-Q, Dyna+Priority Sweeping simulated trajectories from TD-search: Dyna-2 28/53

40 Dyna-Q Algorithm
Steps (a)-(e): real experience; step (f): in simulation 29/53
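A sketch of planning step (f): replay previously observed transitions from Dyna-Q's simple deterministic model memory and apply the Q-learning backup; the dictionary layout is an assumption for illustration:

import random

def dyna_q_planning(Q, model, actions, n_planning=50, alpha=0.1, gamma=0.95):
    """Step (f) of Dyna-Q: n_planning simulated backups.
    model[(s, a)] = (r, s_next) stores previously observed transitions;
    Q is a dict of action values (illustrative data structures)."""
    observed = list(model.keys())
    if not observed:
        return Q
    for _ in range(n_planning):
        s, a = random.choice(observed)              # random previously visited pair
        r, s_next = model[(s, a)]                   # simulated experience from the model
        best_next = max(Q.get((s_next, b), 0.0) for b in actions)
        q = Q.get((s, a), 0.0)
        Q[(s, a)] = q + alpha * (r + gamma * best_next - q)
    return Q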

41 Dyna-Q Algorithm: Example on Simple Maze (Introduction to RL, Sutton & Barto) 30/53

42 Dyna-Q Algorithm: 1st and 2nd Episode (Introduction to RL, Sutton & Barto) 31/53

43 When the Model Is Wrong: Changed Environment The changed environment is harder 32/53

44 When the Model Is Wrong: Short-cut Maze The changed environment is easier 33/53

45 Dyna-Q+
This agent keeps track, for each state-action pair, of how many time steps t_sa have elapsed since the pair was last tried in a real interaction.
If the transition has not been tried in t_sa time steps, simulated experiences involving it are assigned a phantom reward (exploration bonus) r + κ √t_sa, for small κ.
This encourages behavior that tests long-untried actions: a form of artificial/computational curiosity (intrinsic rewards), by which the agent motivates itself to visit long-unvisited states. 34/53
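A small sketch of the Dyna-Q+ bonus applied to a simulated reward; the function and parameter names are illustrative:

import math

def bonus_reward(r_model, last_tried_step, current_step, kappa=1e-3):
    """Exploration bonus on simulated experience: r + kappa * sqrt(tau),
    where tau is the number of real steps since (s, a) was last tried."""
    tau = current_step - last_tried_step
    return r_model + kappa * math.sqrt(max(tau, 0))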

46 4. Exploration vs. Exploitation (Too much) curiosity kills the cat! 35/53

47 Exploration vs. Exploitation Dilemma
An RL agent starts to act without a model of the environment: it has to learn from its experience what to do in order to fulfill its tasks and achieve a high average return.
Online decision-making involves a fundamental choice:
Exploitation: make the best decision given current information
Exploration: gather more information
The best long-term strategy may involve short-term sacrifices: gather enough information to make the best overall decisions.
Exploration as fundamental intelligent behavior. 36/53

48 Exploration: Motivating Example 37/53

49 Exploration Strategies/Principles
Naive exploration: add noise to the greedy policy (e.g., ɛ-greedy)
Optimistic initialization: assume the best until proven otherwise
Optimism in the face of uncertainty: prefer actions with uncertain values (e.g., UCB: μ̂_sa + β √(2 log n_s / n_sa))
Probability matching: select actions according to the probability that they are best
Information state search: lookahead search incorporating the value of information (e.g., BRL)
Other heuristics: recency-based exploration bonus for non-stationary environments
Need a notion of optimality: sample complexity 38/53

50 Sample Complexity
Let M be an MDP with N states, K actions, discount factor γ ∈ [0, 1), and maximal reward R_max > 0.
Let A be an algorithm (that is, a reinforcement-learning agent) that acts in the environment, producing the history h_t = (s_0, a_0, r_1, s_1, a_1, r_2, ..., r_t, s_t).
Let V^A_{t,M} = E[ Σ_{i=0}^∞ γ^i r_{t+i} | h_t ]; V* is the optimal value function.
Define an accuracy threshold ɛ: V̂ ≥ V* − ɛ. 39/53

51 Sample Complexity and Efficient Exploration
Definition (Kakade, 2003): Let ɛ > 0 be a prescribed accuracy and δ > 0 an allowed probability of failure. The expression η(ɛ, δ, N, K, γ, R_max) is a sample-complexity bound for algorithm A if, independently of the choice of s_0, with probability at least 1 − δ, the number of timesteps for which V^A_{t,M} < V*(s_t) − ɛ is at most η(ɛ, δ, N, K, γ, R_max).
An algorithm whose sample complexity is polynomial in 1/ɛ, log(1/δ), N, K, 1/(1 − γ), R_max is called PAC-MDP (probably approximately correct in MDPs). 40/53

52 Sample Complexity of Exploration Strategies
Assume we have estimates Q̂(s, a).
ɛ-greedy: π(s) = argmax_a Q̂(s, a) with probability 1 − ɛ, a random action with probability ɛ
Most popular method
Converges to the optimal value function with probability 1 (all paths will be visited sooner or later), if the exploration rate diminishes according to an appropriate schedule
Problem: sample complexity exponential in the number of states
Boltzmann: choose action a with softmax probabilities exp(Q̂(s,a)/T) / Σ_{a'} exp(Q̂(s,a')/T)
Temperature T controls the amount of exploration
Problem again: sample complexity exponential in #states 41/53
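Hedged sketches of both strategies for a single state, assuming a 1-D array of action values (names are illustrative):

import numpy as np

def epsilon_greedy(Q_s, epsilon=0.1, rng=None):
    """Q_s: 1-D array of action values for the current state."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(Q_s)))   # explore: random action
    return int(np.argmax(Q_s))               # exploit: greedy action

def boltzmann(Q_s, temperature=1.0, rng=None):
    """Softmax/Boltzmann exploration; the temperature controls exploration."""
    rng = rng or np.random.default_rng()
    prefs = np.asarray(Q_s, dtype=float) / temperature
    prefs -= prefs.max()                     # numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(rng.choice(len(Q_s), p=probs))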

53 Sample Complexity of Exploration Strategies Other heuristics for exploration: minimize variance of action value estimates optimistic initial values ( optimism in the face of uncertainty ) state bonuses: frequency, recency, error etc. Again: sample complexity exponential in #states Bayesian RL: optimal exploration strategy maintain a probability distribution over MDP models (i.e., parameters) posterior distribution updated after each new observation (interaction) exploration strategy minimizes uncertainty of parameters Bayes-optimal solution to the exploration-exploitation tradeoff (i.e., no other policy is better in expectation w.r.t. prior distribution over MDPs) only tractable for very simple problems 42/53

54 PAC-MDP Algorithms
Explicit-Explore-or-Exploit (E3) & RMAX: principled approaches to the exploration-exploitation tradeoff with polynomial sample complexity.
Common intuition: again, optimism in the face of uncertainty. If faced with the choice between a certain and an uncertain reward region, try the uncertain one! 43/53

55 Explicit-Explore-or-Exploit (E3)
Model-based PAC-MDP (Kearns & Singh 02)
Assumes the maximum reward R_max is known.
Quantifies confidence in the model estimates by maintaining counts for executed state-action pairs.
A state s is known if all a ∈ A(s) have been executed sufficiently often. 44/53

56 Explicit-Explore-or-Exploit (E3)
Model-based PAC-MDP (Kearns & Singh 02)
Assumes the maximum reward R_max is known.
Quantifies confidence in the model estimates by maintaining counts for executed state-action pairs.
A state s is known if all a ∈ A(s) have been executed sufficiently often.
From the observed data, E3 constructs two MDPs:
MDP_known: includes the known states, with (approximately exact) estimates of P(s_{t+1} | s_t, a_t) and P(r_{t+1} | s_t, a_t): used for exploiting!
MDP_unknown: MDP_known plus a special state s* with a self-loop where the agent receives the maximum reward: used for exploring! 44/53

57 E3 Sketch
Input: state s.  Output: action a.
if s is known then                              // sufficiently accurate model estimates
    plan in MDP_known
    if the resulting plan has value above some threshold then
        return the first action of the plan     // exploitation
    else
        plan in MDP_unknown
        return the first action of the plan     // planned exploration
else
    return the action with the least observations in s   // direct exploration
45/53
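A sketch of the E3 decision rule above; the planning routines and the known-state test are assumed helpers passed in as callables, not part of the original pseudocode:

def e3_action(s, known, plan_known, plan_unknown, least_tried_action, value_threshold):
    """E3 action selection (illustrative skeleton).

    known(s)              -> True if all actions in s have been tried sufficiently often
    plan_known(s)         -> (value, first_action) of the plan in MDP_known
    plan_unknown(s)       -> first action of the plan in MDP_unknown
    least_tried_action(s) -> action with the fewest observations in s
    """
    if known(s):
        value, action = plan_known(s)
        if value > value_threshold:
            return action                 # exploitation
        return plan_unknown(s)            # planned exploration
    return least_tried_action(s)          # direct exploration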

58 E3 Example S. Singh (Tutorial 2005) 46/53

59 E3 Example 47/53

60 E3 Example
M: the true known-state MDP; M̂: the estimated known-state MDP 48/53

61 E3 Implementation Setting
T is the time horizon; G^T_max is the maximum T-step return (discounted case: G^T_max ≤ T · R_max).
A state is known if it has been visited O( (N T G^T_max / ɛ)^4 ν_max log(1/δ) ) times (ν_max is the maximum variance of the random payoffs over all states).
For the exploration/exploitation choice at known states: it is assumed that the optimal value function V* is given. If V̂ obtained from MDP_known > V* − ɛ, then exploit. 49/53

62 RMAX Algorithm
R-MAX solves only one single model (no separate MDP_known and MDP_unknown) and therefore implicitly explores or exploits.
R-MAX and E3 achieve roughly the same level of performance (Strehl's thesis).
R-MAX builds an approximate MDP based on the reward function
R̂(s, a) = R(s, a) if (s, a) is known (depending on some parameter m), and R_max otherwise. 50/53

63 RMAX's Pseudocode
Inputs: S, A, R_max, m.
// Initialization: all transitions go to heaven and are maximally rewarding!
Add a heaven state s* to the state space: S' = S ∪ {s*}.
Initialize R̂(s, a) = R_max and T̂(s'|s, a) = δ_{s*}(s') for all s, s' ∈ S', a ∈ A.
// Kronecker function: δ_{s*}(s') = 1 if s' = s*, and 0 otherwise.
Initialize a uniform random policy π.
Initialize all counters n(s, a) = 0, n(s, a, s') = 0, r(s, a) = 0 for all s, s' ∈ S', a ∈ A.
while not converged do
    // Actions are selected randomly until the first model update.
    Execute a = π(s), observe s', r.
    Update n(s, a) ← n(s, a) + 1; n(s, a, s') ← n(s, a, s') + 1; r(s, a) ← r(s, a) + r.
    if n(s, a) = m then                      // for small domains we can use n(s, a) ≥ m
        Update T̂(·|s, a) = n(s, a, ·)/n(s, a) and R̂(s, a) = r(s, a)/n(s, a).
        Update the policy π using the MDP model (T̂, R̂)   // e.g., Q-iteration.
51/53
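A compact sketch of the R-MAX counting and optimistic-model logic from the pseudocode above; the planner that consumes the model is left out, and the data layout and method names are assumptions:

from collections import defaultdict

class RMax:
    """Optimistic model: unknown (s, a) pairs transition to a maximally
    rewarding absorbing 'heaven' state until they have been tried m times."""
    def __init__(self, r_max, m):
        self.m, self.r_max = m, r_max
        self.n_sa = defaultdict(int)
        self.n_sas = defaultdict(int)
        self.r_sum = defaultdict(float)

    def observe(self, s, a, r, s_next):
        """Update counts; return True when (s, a) has just become known."""
        self.n_sa[(s, a)] += 1
        self.n_sas[(s, a, s_next)] += 1
        self.r_sum[(s, a)] += r
        return self.n_sa[(s, a)] == self.m

    def model(self, s, a, n_states):
        """Return (R_hat, T_hat over states plus heaven) for the planner."""
        heaven = n_states                          # index of the absorbing state
        if self.n_sa[(s, a)] < self.m:
            T = [0.0] * (n_states + 1)
            T[heaven] = 1.0
            return self.r_max, T                   # optimistic until known
        n = self.n_sa[(s, a)]
        T = [self.n_sas[(s, a, sp)] / n for sp in range(n_states)] + [0.0]
        return self.r_sum[(s, a)] / n, T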

64 RMAX Analysis
(Upper bound) There exists m = O( (N T^2 / ɛ^2) ln^2(NK/δ) ) such that, with probability at least 1 − δ, V^A(s_t) ≥ V*(s_t) − ɛ holds for all but O( (N^2 K T^3 / ɛ^3) ln^2(NK/δ) ) steps, where N is the number of states.
For the discounted case: T = log(1/ɛ) / (1 − γ).
The general PAC-MDP theorem is not easily adapted to the analysis of E3 because of its use of two internal models.
The original analysis depends on the horizon and the mixing time. 52/53

65 Limitations of E3 and RMAX E3 and RMAX are called efficient because their sample complexity is only polynomial in the number of states. This is usually too slow for practical algorithms but is probably the best that can be done in the worst case. In natural environments the number of states is enormous: exponential in the number of objects in the environment. Hence E3/RMAX sample complexity scales exponentially in the number of objects. Generalization over states and actions is crucial for exploration Exploration in relational RL (Lang & Toussaint 12) 53/53
