Reinforcement Learning


1 Reinforcement Learning Model-based RL and Integrated Learning-Planning Planning and Search, Model Learning, Dyna Architecture, Exploration-Exploitation (many slides from lectures of Marc Toussaint & David Silver) Hung Ngo MLR Lab, University of Stuttgart

2 RL Approaches
From experience data D = {(s, a, r, s')_t}_{t=0}^T:
- Model-based RL: learn a model P(s'|s,a), R(s,a); dynamic programming gives V(s), then a policy π(s)
- Model-free RL: learn the value function V(s), then a policy π(s)
- Policy Search: optimize the policy π(s) directly
From demonstration data D = {(s_{0:T}, a_{0:T})_d}_{d=1}^n:
- Imitation Learning: learn the policy π(s) directly
- Inverse RL: learn latent costs R(s,a); dynamic programming gives V(s), then a policy π(s)
2/53

3 Outline 1. Monte-Carlo planning, MCTS, TD-search 2. Model-based RL 3. Integrated learning & planning (Dyna) 4. Exploration vs. exploitation PAC-MDP, artificial curiosity & exploration bonus, Bayesian RL 3/53

4 1. Monte-Carlo Planning, Tree Search Online approximate planning for the now 4/53

5 Refresh: Planning with DP Backup
V^π(s) = E_π[ r_{t+1} + γ V^π(s_{t+1}) | s_t = s ] = Σ_a π(s,a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]
Full-width backup. Iterate for all states. 5/53
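To make the full-width backup concrete, here is a minimal tabular policy-evaluation sketch; the array layout (P[a], R[a], policy) and the function name are illustrative assumptions, not from the slides:

import numpy as np

def policy_evaluation(P, R, policy, gamma=0.95, tol=1e-6):
    """Iterative policy evaluation with full-width backups.

    P[a][s, s'] : transition probabilities, R[a][s, s'] : rewards,
    policy[s, a]: action probabilities (all tabular, illustrative shapes).
    """
    n_states = policy.shape[0]
    V = np.zeros(n_states)
    while True:
        # One sweep: back up every state from all of its successors (full width)
        V_new = np.zeros(n_states)
        for a in range(policy.shape[1]):
            V_new += policy[:, a] * np.sum(P[a] * (R[a] + gamma * V), axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new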

6 Heuristic/Forward Search
Plan for (only) now: use the MDP model to look ahead from the current state s_t
Forward-search algorithms select the best action by lookahead
They build a search tree with the current state s_t at the root
No need to solve the whole MDP, just the sub-MDP starting from now
[Figure: lookahead tree rooted at s_t with terminal leaf nodes T] 6/53

7 Heuristic/Forward Search
Plan for (only) now: use the MDP model to look ahead from the current state s_t
Build a search tree with the current state s_t as the root node
No need to solve the whole MDP, just the sub-MDP starting from now
Backup from the leaf nodes; leaf values could be pre-defined
The deep backups of heuristic search can be implemented as a sequence of individual one-step backups, ordered (e.g. selectively, depth-first) to focus on the current state and its likely successors; this is one reason why heuristic search can be so effective (Sutton & Barto, Figure 8.12)
Can we still do fine without having to build an exhaustive search tree? 7/53

8 Refresh: Sample-based Learning
During learning, the agent samples experience from the real world
Real experience: sampled from the true model, i.e., the environment
s_{t+1} ~ P(s'|s_t, a_t);  r_{t+1} ~ P(r|s_t, a_t)
Then use model-free RL: MC, TD(λ), SARSA, Q-learning, etc.
MC, TD(λ): sample backup. 8/53

9 Sample-based Planning
Use the model only to generate samples (as a simulator!)
Simulated experience: sampled from the estimated model
s_{t+1} ~ P̂(s'|s_t, a_t);  r_{t+1} ~ P̂(r|s_t, a_t) 9/53

10 Sample-based Planning
Use the model only to generate samples (as a simulator!)
Simulated experience: sampled from the estimated model
s_{t+1} ~ P̂(s'|s_t, a_t);  r_{t+1} ~ P̂(r|s_t, a_t)
Apply model-free RL (MC, TD, Sarsa, Q-learning) to the simulated experience
Sample-based planning methods are often more efficient: they break the curse of dimensionality, are computationally efficient, anytime, and parallelizable, and work for black-box models (only samples are required). 9/53
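As a hedged illustration of sample-based planning, the sketch below draws transitions from an estimated tabular model and applies Q-learning to the simulated experience; the model layout (P_hat[a], R_hat[a]) and function names are assumptions for this example:

import numpy as np

def q_planning(P_hat, R_hat, n_states, n_actions,
               gamma=0.95, alpha=0.1, n_updates=10000, rng=None):
    """Apply Q-learning to experience sampled from an estimated model.

    P_hat[a] : estimated transition matrix (n_states x n_states),
    R_hat[a] : estimated expected reward per (s, s') pair.
    """
    rng = rng or np.random.default_rng(0)
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_updates):
        # Sample a (s, a) pair, then a successor from the learned model
        s = rng.integers(n_states)
        a = rng.integers(n_actions)
        s_next = rng.choice(n_states, p=P_hat[a][s])
        r = R_hat[a][s, s_next]
        # Standard Q-learning backup applied to the simulated transition
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    return Q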

11 Simulation-based Search
Combine forward search + sample-based planning
Experience is simulated from now, i.e., from the current real state s_t:
{s_t, a^k_t, r^k_{t+1}, s^k_{t+1}, ..., s^k_{T_k}}_{k=1}^K ~ (M, π_s)
Apply model-free RL to the simulated episodes: MC search, TD search 10/53

12 Simple/Flat Monte-Carlo Search
Given a model M and a (fixed) simulation policy π_sim (e.g., random):
For each action a ∈ A, simulate K episodes from the current (real) state s_t:
{s_t, a, r^k_{t+1}, s^k_{t+1}, ..., s^k_{T_k}}_{k=1}^K ~ (M, π_sim)
Evaluate actions by average return (Monte-Carlo evaluation):
Q(s_t, a) = (1/K) Σ_k R^k_t  →  Q^{π_sim}(s_t, a) in probability, w.r.t. M
Select the current (real) action with maximum estimated value:
a_t = argmax_{a ∈ A} Q(s_t, a)
A branch is built but then thrown away. 11/53
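A minimal sketch of flat Monte-Carlo search; simulate_episode is an assumed black-box helper that rolls out the fixed simulation policy from (s_t, a) and returns the episode return:

import numpy as np

def flat_mc_search(s_t, actions, simulate_episode, K=100):
    """Flat Monte-Carlo search: evaluate each action by K rollouts
    of a fixed simulation policy, then act greedily."""
    Q = {}
    for a in actions:
        # Monte-Carlo evaluation: average return over K simulated episodes
        returns = [simulate_episode(s_t, a) for _ in range(K)]
        Q[a] = np.mean(returns)
    # Select the real action with the maximum estimated value
    return max(Q, key=Q.get)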

13 Monte-Carlo Evaluation in Go
[Figure: from the current position s, four simulation outcomes; two wins give V(s) = 2/4 = 0.5] 12/53

14 Monte-Carlo Evaluation in Go
[Figure: from the current position s, simulation outcomes give V(s) = 2/4 = 0.5]
Discuss AlphaGo: scale things up!
Value network pre-trained using expert games
Self-play using MCTS with a pre-trained rollout policy 12/53

15 Monte-Carlo Tree Search (MCTS)
Build a search tree during simulated episodes
Cache statistics of rewards and #visits at each (s, a) pair
Use them to update a tree policy, e.g., UCT (UCB applied to trees):
π_uct(s) = argmax_a [ Q̂(s, a) + β √(2 log n_s / n_sa) ],  ∀ s ∈ tree
Outside the current tree: just follow some default rollout policy 13/53

16 Monte-Carlo Tree Search (MCTS)
Build a search tree during simulated episodes
Cache statistics of rewards and #visits at each (s, a) pair
Use them to update a tree policy, e.g., UCT (UCB applied to trees):
π_uct(s) = argmax_a [ Q̂(s, a) + β √(2 log n_s / n_sa) ],  ∀ s ∈ tree
Outside the current tree: just follow some default rollout policy
Grow and visit more often the promising & rewarding parts 13/53
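A sketch of the UCT tree policy above; the dictionaries Q, n_s, n_sa holding value estimates and visit counts are assumed bookkeeping, not prescribed by the slides:

import math

def uct_action(s, actions, Q, n_s, n_sa, beta=1.0):
    """UCB applied to trees: pick the action maximizing Q plus an exploration bonus.
    Assumes n_s[s] is kept consistent with the per-action counts n_sa."""
    def score(a):
        if n_sa.get((s, a), 0) == 0:
            return float("inf")           # try untried actions first
        bonus = beta * math.sqrt(2.0 * math.log(n_s[s]) / n_sa[(s, a)])
        return Q.get((s, a), 0.0) + bonus
    return max(actions, key=score)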

17 Monte-Carlo Tree Search (MCTS)
Simulate K episodes following a tree policy and a rollout policy:
{s_t, a^k_t, r^k_{t+1}, s^k_{t+1}, ..., s^k_{T_k}}_{k=1}^K ~ (M, π^k_tree, π^k_rollout)
Q̂(s, a) = (1/n_sa) Σ_{k=1}^K Σ_{t'=t}^{T_k} R^k_{t'} · 1(s^k_{t'} = s, a^k_{t'} = a),  ∀ s ∈ tree
(here R^k_{t'} denotes the return from step t' of episode k, and n_sa the number of such visits) 14/53

18 Monte-Carlo Tree Search (MCTS)
Simulate K episodes following a tree policy and a rollout policy:
{s_t, a^k_t, r^k_{t+1}, s^k_{t+1}, ..., s^k_{T_k}}_{k=1}^K ~ (M, π^k_tree, π^k_rollout)
Q̂(s, a) = (1/n_sa) Σ_{k=1}^K Σ_{t'=t}^{T_k} R^k_{t'} · 1(s^k_{t'} = s, a^k_{t'} = a),  ∀ s ∈ tree
The greedy tree policy is improved after each simulated episode k:
interleaving MC policy evaluation & policy improvement within each simulated episode
exploit regions of the tree that currently appear better than others, while continuing to explore unknown or less-known parts of the tree 14/53

19 Monte-Carlo Tree Search (MCTS)
Simulate K episodes following a tree policy and a rollout policy:
{s_t, a^k_t, r^k_{t+1}, s^k_{t+1}, ..., s^k_{T_k}}_{k=1}^K ~ (M, π^k_tree, π^k_rollout)
Q̂(s, a) = (1/n_sa) Σ_{k=1}^K Σ_{t'=t}^{T_k} R^k_{t'} · 1(s^k_{t'} = s, a^k_{t'} = a),  ∀ s ∈ tree
The greedy tree policy is improved after each simulated episode k:
interleaving MC policy evaluation & policy improvement within each simulated episode
exploit regions of the tree that currently appear better than others, while continuing to explore unknown or less-known parts of the tree
The default/rollout policy: random, pretrained, or learned on real experience using e.g. model-free off-policy methods 14/53

20 Monte-Carlo Tree Search (MCTS)
Simulate K episodes following a tree policy and a rollout policy:
{s_t, a^k_t, r^k_{t+1}, s^k_{t+1}, ..., s^k_{T_k}}_{k=1}^K ~ (M, π^k_tree, π^k_rollout)
Q̂(s, a) = (1/n_sa) Σ_{k=1}^K Σ_{t'=t}^{T_k} R^k_{t'} · 1(s^k_{t'} = s, a^k_{t'} = a),  ∀ s ∈ tree
The greedy tree policy is improved after each simulated episode k:
interleaving MC policy evaluation & policy improvement within each simulated episode
exploit regions of the tree that currently appear better than others, while continuing to explore unknown or less-known parts of the tree
The default/rollout policy: random, pretrained, or learned on real experience using e.g. model-free off-policy methods
Converges on the optimal search-tree values/policy: Q̂(s, a) → Q*(s, a), ∀ s ∈ tree 14/53
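A sketch of the Monte-Carlo backup behind the Q̂ estimate above: after each simulated episode, every visited (s, a) pair is credited with the return that followed it (in MCTS this would be restricted to the in-tree portion of the episode); the data structures are assumptions for illustration:

def mc_backup(episode, Q, n_sa, gamma=1.0):
    """episode: list of (s, a, r) triples from one simulation.
    Updates running-average Q(s, a) over the return following each visit."""
    G = 0.0
    # Walk backwards so G accumulates the return from step t onwards
    for s, a, r in reversed(episode):
        G = r + gamma * G
        n_sa[(s, a)] = n_sa.get((s, a), 0) + 1
        q = Q.get((s, a), 0.0)
        Q[(s, a)] = q + (G - q) / n_sa[(s, a)]
    return Q, n_sa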

21 Monte-Carlo Tree Search in Go (1)
[Figure: tree-search illustration; node (x/y) = average return / number of trials] 15/53

22 Monte-Carlo Tree Search in Go (2)
[Figure: tree-search illustration] 16/53

23 Monte-Carlo Tree Search in Go (3)
[Figure: tree-search illustration] 17/53

24 Monte-Carlo Tree Search in Go (4)
[Figure: tree-search illustration] 18/53

25 Monte-Carlo Tree Search in Go (5)
[Figure: tree-search illustration] 19/53

26 Monte-Carlo Tree Search (MCTS)
[Figure (Sutton & Barto, Sec. 8.7): tree-growth illustration. Legend: new node in the tree; node stored in the tree; state visited but not stored; terminal outcome; current simulation; previous simulation] 20/53

27 Temporal-Difference Search: Bootstrapping
MC tree search applies MC control to the sub-MDP from now
TD search applies Sarsa to the sub-MDP from now
For each step of simulation, update action values by Sarsa:
Q(s, a) ← Q(s, a) + α ( r + γ Q(s', a') − Q(s, a) )
As for model-free RL, bootstrapping is helpful:
TD learning/search reduces variance but increases bias
TD(λ) learning/search can be much more efficient than MC 21/53
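A sketch of the per-step Sarsa backup used in TD search, with a plain dictionary for Q (an assumption for illustration):

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.95):
    """One-step Sarsa: bootstrap from the current estimate of the next (s', a')."""
    td_error = r + gamma * Q.get((s_next, a_next), 0.0) - Q.get((s, a), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error
    return Q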

28 2. Model-based RL
Built once, used forever! 22/53

29 The Big Picture: Planning, Learning, and Acting
Learning allows an agent to improve its policy from its interactions with the environment.
Planning allows it to improve its policy without further interaction. 23/53

30 Model-based RL
Model learning: given experience D = {(s_t, a_t, r_{t+1}, s_{t+1})}_{t=1}^H,
learning P(s', r | s, a) is a regression/density-estimation problem:
discrete state-action space: counting, P̂(s'|s, a) = n_{s'sa} / n_{sa}
continuous state-action space: P̂(s'|s, a) = N(s' | φ(s, a)^T β, Σ); estimate the parameters β (and perhaps Σ) as in regression 24/53
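A hedged sketch of model learning by counting for the discrete case; the storage layout and class name are assumptions:

from collections import defaultdict

class CountModel:
    """Tabular model: P_hat(s'|s,a) = n(s,a,s') / n(s,a), R_hat(s,a) = mean reward."""
    def __init__(self):
        self.n_sa = defaultdict(int)
        self.n_sas = defaultdict(int)
        self.r_sum = defaultdict(float)

    def update(self, s, a, r, s_next):
        self.n_sa[(s, a)] += 1
        self.n_sas[(s, a, s_next)] += 1
        self.r_sum[(s, a)] += r

    def p_hat(self, s, a, s_next):
        n = self.n_sa[(s, a)]
        return self.n_sas[(s, a, s_next)] / n if n else 0.0

    def r_hat(self, s, a):
        n = self.n_sa[(s, a)]
        return self.r_sum[(s, a)] / n if n else 0.0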

31 Model-based RL
Model learning: given experience D = {(s_t, a_t, r_{t+1}, s_{t+1})}_{t=1}^H,
learning P(s', r | s, a) is a regression/density-estimation problem:
discrete state-action space: counting, P̂(s'|s, a) = n_{s'sa} / n_{sa}
continuous state-action space: P̂(s'|s, a) = N(s' | φ(s, a)^T β, Σ); estimate the parameters β (and perhaps Σ) as in regression
D as a model: experience replay! 24/53

32 Model-based RL
Model learning: given experience D = {(s_t, a_t, r_{t+1}, s_{t+1})}_{t=1}^H,
learning P(s', r | s, a) is a regression/density-estimation problem:
discrete state-action space: counting, P̂(s'|s, a) = n_{s'sa} / n_{sa}
continuous state-action space: P̂(s'|s, a) = N(s' | φ(s, a)^T β, Σ); estimate the parameters β (and perhaps Σ) as in regression
D as a model: experience replay!
Example: a linear forward model per action, φ(s') = F_a φ(s), r = b_a^T φ(s)
Least-mean-squares (LMS) SGD update rule:
F ← F + α (φ(s') − F φ(s)) φ(s)^T;  b ← b + α (r − b^T φ(s)) φ(s) 24/53

33 Model-based RL
Model learning: given experience D = {(s_t, a_t, r_{t+1}, s_{t+1})}_{t=1}^H,
learning P(s', r | s, a) is a regression/density-estimation problem:
discrete state-action space: counting, P̂(s'|s, a) = n_{s'sa} / n_{sa}
continuous state-action space: P̂(s'|s, a) = N(s' | φ(s, a)^T β, Σ); estimate the parameters β (and perhaps Σ) as in regression
D as a model: experience replay!
Example: a linear forward model per action, φ(s') = F_a φ(s), r = b_a^T φ(s)
Least-mean-squares (LMS) SGD update rule:
F ← F + α (φ(s') − F φ(s)) φ(s)^T;  b ← b + α (r − b^T φ(s)) φ(s)
To construct V/π from the learned model, use planning:
discrete case: DP on the estimated model (VI, PI, etc.); sample-based planning (MCTS, TD-search): simple but powerful
continuous case: differential DP, planning-by-inference, etc. 24/53
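A sketch of the LMS/SGD updates above for a per-action linear forward model; the feature map phi, array shapes, and function name are assumptions:

import numpy as np

def lms_model_update(F, b, phi_s, phi_s_next, r, alpha=0.01):
    """One SGD step on a linear forward model that predicts
    phi(s') ~= F @ phi(s) and r ~= b @ phi(s)."""
    pred_error = phi_s_next - F @ phi_s           # vector prediction error
    F = F + alpha * np.outer(pred_error, phi_s)   # F <- F + a (phi(s') - F phi(s)) phi(s)^T
    b = b + alpha * (r - b @ phi_s) * phi_s       # b <- b + a (r - b^T phi(s)) phi(s)
    return F, b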

34 Model-based RL: Pros and Cons Advantages: Can efficiently learn model by supervised learning methods Rapid adaptation to new problems and situations (via planning) Can reason about model uncertainty 25/53

35 Model-based RL: Pros and Cons Advantages: Can efficiently learn model by supervised learning methods Rapid adaptation to new problems and situations (via planning) Can reason about model uncertainty Disadvantages: two sources of approximation error! In estimating the model and the value function If the model is inaccurate, planning will compute a suboptimal policy Hence, asymptotically, model-free methods are often better 25/53

36 Model-based RL: Pros and Cons Advantages: Can efficiently learn model by supervised learning methods Rapid adaptation to new problems and situations (via planning) Can reason about model uncertainty Disadvantages: two sources of approximation error! In estimating the model and the value function If the model is inaccurate, planning will compute a suboptimal policy Hence, asymptotically, model-free methods are often better Solution 1: reason explicitly about model uncertainty (BRL) Solution 2: use model-free RL when the model is wrong Solution 3: integrate model-based and model-free 25/53

37 3. Integrated Learning & Planning: Dyna
Combining the best of both worlds!
(Background excerpt on model-free vs. model-based decision making:) "... taught the commuter that on Friday evenings the best action at this intersection is to continue straight and avoid the freeway. Model-free methods are clearly easier to use in terms of online decision-making; however, much trial-and-error experience is required to make the values be good estimates of future consequences. Moreover, the cached values are inherently inflexible: although hearing about an unexpected traffic jam on the radio can immediately affect action selection that is based on a forward model, the effect of the traffic jam on a cached propensity such as avoid the freeway on Friday evening cannot be calculated without further trial-and-error learning on days in which this traffic jam occurs. Changes in the goal of behavior, as when moving to a new house, also expose the differences between the methods: whereas model-based decision making can be immediately sensitive to such a goal-shift, cached values are again slow to change appropriately. Indeed, many of us have experienced this directly in daily life after moving house. We clearly know the location of our new home, and can make our way to it by concentrating on the new route; but we can occasionally take an habitual wrong turn toward the old address if our minds wander. Such introspection, and a wealth of rigorous behavioral studies (see [15] for a review), suggests that the brain employs both model-free and model-based decision-making strategies in parallel, with each dominating in different circumstances [14]. Indeed, somewhat different neural substrates underlie each one [17]." 26/53

38 Dyna: Integrating Learning and Planning Model-free RL No model Learn value function (and/or policy) from real experience Model-based RL (using sample-based planning) Learn a model from real experience Plan value function/policy from simulated experience Dyna Learn a model from real experience Learn & plan value function/policy from real & simulated experience 27/53

39 The Dyna Architecture Two distributions of states and actions (experience) Learning distribution (real experience) Search distribution (simulated experience) Integrated approaches differ in generating search distributions simulated transitions: Dyna-Q, Dyna+Priority Sweeping simulated trajectories from TD-search: Dyna-2 28/53

40 Dyna-Q Algorithm
Steps (a)-(e): real experience; step (f): in simulation 29/53
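A sketch of planning step (f): replay previously observed transitions from Dyna-Q's simple deterministic model memory and apply the Q-learning backup; the dictionary layout is an assumption for illustration:

import random

def dyna_q_planning(Q, model, actions, n_planning=50, alpha=0.1, gamma=0.95):
    """Step (f) of Dyna-Q: n_planning simulated backups.
    model[(s, a)] = (r, s_next) stores previously observed transitions;
    Q is a dict of action values (illustrative data structures)."""
    observed = list(model.keys())
    if not observed:
        return Q
    for _ in range(n_planning):
        s, a = random.choice(observed)              # random previously visited pair
        r, s_next = model[(s, a)]                   # simulated experience from the model
        best_next = max(Q.get((s_next, b), 0.0) for b in actions)
        q = Q.get((s, a), 0.0)
        Q[(s, a)] = q + alpha * (r + gamma * best_next - q)
    return Q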

41 Dyna-Q Algorithm: Example on Simple Maze (Introduction to RL, Sutton & Barto) 30/53

42 Dyna-Q Algorithm: 1st and 2nd Episode (Introduction to RL, Sutton & Barto) 31/53

43 When the Model Is Wrong: Changed Environment The changed environment is harder 32/53

44 When the Model Is Wrong: Short-cut Maze The changed environment is easier 33/53

45 Dyna-Q+
This agent keeps track, for each state-action pair, of how many time steps t_sa have elapsed since the pair was last tried in a real interaction.
If the transition has not been tried in t_sa time steps, simulated experiences involving it are assigned a phantom reward (exploration bonus) r + κ √t_sa, for small κ.
This encourages behavior that tests long-untried actions: a form of artificial/computational curiosity (intrinsic rewards), by which the agent motivates itself to visit long-unvisited states. 34/53
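A small sketch of the Dyna-Q+ bonus applied to a simulated reward; the function and parameter names are illustrative:

import math

def bonus_reward(r_model, last_tried_step, current_step, kappa=1e-3):
    """Exploration bonus on simulated experience: r + kappa * sqrt(tau),
    where tau is the number of real steps since (s, a) was last tried."""
    tau = current_step - last_tried_step
    return r_model + kappa * math.sqrt(max(tau, 0))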

46 4. Exploration vs. Exploitation (Too much) curiosity kills the cat! 35/53

47 Exploration vs. Exploitation Dilemma
An RL agent starts to act without a model of the environment: it has to learn from its experience what to do in order to fulfill its tasks and achieve a high average return.
Online decision-making involves a fundamental choice:
Exploitation: make the best decision given current information
Exploration: gather more information
The best long-term strategy may involve short-term sacrifices: gather enough information to make the best overall decisions.
Exploration as fundamental intelligent behavior. 36/53

48 Exploration: Motivating Example 37/53

49 Exploration Strategies/Principles
Naive exploration: add noise to the greedy policy (e.g., ɛ-greedy)
Optimistic initialization: assume the best until proven otherwise
Optimism in the face of uncertainty: prefer actions with uncertain values (e.g., UCB: μ̂_sa + β √(2 log n_s / n_sa))
Probability matching: select actions according to the probability that they are best
Information state search: lookahead search incorporating the value of information (e.g., BRL)
Other heuristics: recency-based exploration bonus for non-stationary environments
Need a notion of optimality: sample complexity 38/53

50 Sample Complexity
Let M be an MDP with N states, K actions, discount factor γ ∈ [0, 1), and maximal reward R_max > 0.
Let A be an algorithm (that is, a reinforcement-learning agent) that acts in the environment, producing the history h_t = (s_0, a_0, r_1, s_1, a_1, r_2, ..., r_t, s_t).
Let V^A_{t,M} = E[ Σ_{i=0}^∞ γ^i r_{t+i} | h_t ]; V* is the optimal value function.
Define an accuracy threshold ɛ: V̂ ≥ V* − ɛ. 39/53

51 Sample Complexity and Efficient Exploration
Definition (Kakade, 2003): Let ɛ > 0 be a prescribed accuracy and δ > 0 an allowed probability of failure. The expression η(ɛ, δ, N, K, γ, R_max) is a sample-complexity bound for algorithm A if, independently of the choice of s_0, with probability at least 1 − δ, the number of timesteps for which V^A_{t,M} < V*(s_t) − ɛ is at most η(ɛ, δ, N, K, γ, R_max).
An algorithm whose sample complexity is polynomial in 1/ɛ, log(1/δ), N, K, 1/(1 − γ), R_max is called PAC-MDP (probably approximately correct in MDPs). 40/53

52 Sample Complexity of Exploration Strategies
Assume we have estimates Q̂(s, a).
ɛ-greedy: π(s) = argmax_a Q̂(s, a) with probability 1 − ɛ, a random action with probability ɛ
Most popular method
Converges to the optimal value function with probability 1 (all paths will be visited sooner or later), if the exploration rate diminishes according to an appropriate schedule
Problem: sample complexity exponential in the number of states
Boltzmann: choose action a with softmax probabilities exp(Q̂(s,a)/T) / Σ_{a'} exp(Q̂(s,a')/T)
Temperature T controls the amount of exploration
Problem again: sample complexity exponential in #states 41/53
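Hedged sketches of both strategies for a single state, assuming a 1-D array of action values (names are illustrative):

import numpy as np

def epsilon_greedy(Q_s, epsilon=0.1, rng=None):
    """Q_s: 1-D array of action values for the current state."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(Q_s)))   # explore: random action
    return int(np.argmax(Q_s))               # exploit: greedy action

def boltzmann(Q_s, temperature=1.0, rng=None):
    """Softmax/Boltzmann exploration; the temperature controls exploration."""
    rng = rng or np.random.default_rng()
    prefs = np.asarray(Q_s, dtype=float) / temperature
    prefs -= prefs.max()                     # numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(rng.choice(len(Q_s), p=probs))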

53 Sample Complexity of Exploration Strategies Other heuristics for exploration: minimize variance of action value estimates optimistic initial values ( optimism in the face of uncertainty ) state bonuses: frequency, recency, error etc. Again: sample complexity exponential in #states Bayesian RL: optimal exploration strategy maintain a probability distribution over MDP models (i.e., parameters) posterior distribution updated after each new observation (interaction) exploration strategy minimizes uncertainty of parameters Bayes-optimal solution to the exploration-exploitation tradeoff (i.e., no other policy is better in expectation w.r.t. prior distribution over MDPs) only tractable for very simple problems 42/53

54 PAC-MDP Algorithms
Explicit-Explore-or-Exploit (E3) & RMAX: principled approaches to the exploration-exploitation tradeoff with polynomial sample complexity.
Common intuition: again, optimism in the face of uncertainty. If faced with the choice between a certain and an uncertain reward region, try the uncertain one! 43/53

55 Explicit-Explore-or-Exploit (E3)
Model-based PAC-MDP (Kearns & Singh 02)
Assumes the maximum reward R_max is known.
Quantifies confidence in the model estimates by maintaining counts for executed state-action pairs.
A state s is known if all a ∈ A(s) have been executed sufficiently often. 44/53

56 Explicit-Explore-or-Exploit (E3)
Model-based PAC-MDP (Kearns & Singh 02)
Assumes the maximum reward R_max is known.
Quantifies confidence in the model estimates by maintaining counts for executed state-action pairs.
A state s is known if all a ∈ A(s) have been executed sufficiently often.
From the observed data, E3 constructs two MDPs:
MDP_known: includes the known states, with (approximately exact) estimates of P(s_{t+1} | s_t, a_t) and P(r_{t+1} | s_t, a_t): used for exploiting!
MDP_unknown: MDP_known plus a special state s* with a self-loop where the agent receives the maximum reward: used for exploring! 44/53

57 E3 Sketch
Input: state s.  Output: action a.
if s is known then                              // sufficiently accurate model estimates
    plan in MDP_known
    if the resulting plan has value above some threshold then
        return the first action of the plan     // exploitation
    else
        plan in MDP_unknown
        return the first action of the plan     // planned exploration
else
    return the action with the least observations in s   // direct exploration
45/53
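A sketch of the E3 decision rule above; the planning routines and the known-state test are assumed helpers passed in as callables, not part of the original pseudocode:

def e3_action(s, known, plan_known, plan_unknown, least_tried_action, value_threshold):
    """E3 action selection (illustrative skeleton).

    known(s)              -> True if all actions in s have been tried sufficiently often
    plan_known(s)         -> (value, first_action) of the plan in MDP_known
    plan_unknown(s)       -> first action of the plan in MDP_unknown
    least_tried_action(s) -> action with the fewest observations in s
    """
    if known(s):
        value, action = plan_known(s)
        if value > value_threshold:
            return action                 # exploitation
        return plan_unknown(s)            # planned exploration
    return least_tried_action(s)          # direct exploration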

58 E3 Example S. Singh (Tutorial 2005) 46/53

59 E3 Example 47/53

60 E3 Example
M: the true known-state MDP; M̂: the estimated known-state MDP 48/53

61 E3 Implementation Setting
T is the time horizon; G^T_max is the maximum T-step return (discounted case: G^T_max ≤ T · R_max).
A state is known if it has been visited O( (N T G^T_max / ɛ)^4 ν_max log(1/δ) ) times (ν_max is the maximum variance of the random payoffs over all states).
For the exploration/exploitation choice at known states: it is assumed that the optimal value function V* is given. If V̂ obtained from MDP_known > V* − ɛ, then exploit. 49/53

62 RMAX Algorithm
R-MAX solves only one single model (no separate MDP_known and MDP_unknown) and therefore implicitly explores or exploits.
R-MAX and E3 achieve roughly the same level of performance (Strehl's thesis).
R-MAX builds an approximate MDP based on the reward function
R̂(s, a) = R(s, a) if (s, a) is known (depending on some parameter m), and R_max otherwise. 50/53

63 RMAX's Pseudocode
Inputs: S, A, R_max, m.
// Initialization: all transitions go to heaven and are maximally rewarding!
Add a heaven state s* to the state space: S' = S ∪ {s*}.
Initialize R̂(s, a) = R_max and T̂(s'|s, a) = δ_{s*}(s') for all s, s' ∈ S', a ∈ A.
// Kronecker function: δ_{s*}(s') = 1 if s' = s*, and 0 otherwise.
Initialize a uniform random policy π.
Initialize all counters n(s, a) = 0, n(s, a, s') = 0, r(s, a) = 0 for all s, s' ∈ S', a ∈ A.
while not converged do
    // Actions are selected randomly until the first model update.
    Execute a = π(s), observe s', r.
    Update n(s, a) ← n(s, a) + 1; n(s, a, s') ← n(s, a, s') + 1; r(s, a) ← r(s, a) + r.
    if n(s, a) = m then                      // for small domains we can use n(s, a) ≥ m
        Update T̂(·|s, a) = n(s, a, ·)/n(s, a) and R̂(s, a) = r(s, a)/n(s, a).
        Update the policy π using the MDP model (T̂, R̂)   // e.g., Q-iteration.
51/53
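A compact sketch of the R-MAX counting and optimistic-model logic from the pseudocode above; the planner that consumes the model is left out, and the data layout and method names are assumptions:

from collections import defaultdict

class RMax:
    """Optimistic model: unknown (s, a) pairs transition to a maximally
    rewarding absorbing 'heaven' state until they have been tried m times."""
    def __init__(self, r_max, m):
        self.m, self.r_max = m, r_max
        self.n_sa = defaultdict(int)
        self.n_sas = defaultdict(int)
        self.r_sum = defaultdict(float)

    def observe(self, s, a, r, s_next):
        """Update counts; return True when (s, a) has just become known."""
        self.n_sa[(s, a)] += 1
        self.n_sas[(s, a, s_next)] += 1
        self.r_sum[(s, a)] += r
        return self.n_sa[(s, a)] == self.m

    def model(self, s, a, n_states):
        """Return (R_hat, T_hat over states plus heaven) for the planner."""
        heaven = n_states                          # index of the absorbing state
        if self.n_sa[(s, a)] < self.m:
            T = [0.0] * (n_states + 1)
            T[heaven] = 1.0
            return self.r_max, T                   # optimistic until known
        n = self.n_sa[(s, a)]
        T = [self.n_sas[(s, a, sp)] / n for sp in range(n_states)] + [0.0]
        return self.r_sum[(s, a)] / n, T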

64 RMAX Analysis
(Upper bound) There exists m = O( (N T^2 / ɛ^2) ln^2(NK/δ) ) such that, with probability at least 1 − δ, V^A(s_t) ≥ V*(s_t) − ɛ holds for all but O( (N^2 K T^3 / ɛ^3) ln^2(NK/δ) ) steps, where N is the number of states.
For the discounted case: T = log(1/ɛ) / (1 − γ).
The general PAC-MDP theorem is not easily adapted to the analysis of E3 because of its use of two internal models.
The original analysis depends on the horizon and the mixing time. 52/53

65 Limitations of E3 and RMAX E3 and RMAX are called efficient because their sample complexity is only polynomial in the number of states. This is usually too slow for practical algorithms but is probably the best that can be done in the worst case. In natural environments the number of states is enormous: exponential in the number of objects in the environment. Hence E3/RMAX sample complexity scales exponentially in the number of objects. Generalization over states and actions is crucial for exploration Exploration in relational RL (Lang & Toussaint 12) 53/53
