CS 360: Advanced Artificial Intelligence Class #16: Reinforcement Learning


1 CS 360: Advanced Artificial Intelligence
Class #16: Reinforcement Learning
Daniel M. Gaines
Note: content for slides adapted from Sutton and Barto [1998]

2 Introduction
Animals learn through interaction with an environment:
- an infant learning motor skills
- a pigeon learning to press a particular lever
- a dog learning commands

3 Operant Conditioning
Much of the motivation behind RL comes from the work of behaviorist B. F. Skinner.
Operant Conditioning is a form of behavior modification based on the Law of Effect:
- behavior followed by a pleasurable consequence tends to be repeated
- behavior followed by an unpleasant consequence tends to be eliminated
Thus (according to OC) behavior is a function of its consequences.
"The rat is always right." (B. F. Skinner)
Meaning: a creature's behavior is rational given its reinforcement history.
(Slides adapted from mschnake/reinf.htm)

4 Types of Reinforcement
- Positive Reinforcement: a pleasurable consequence administered after a desired behavior; strengthens the behavior.
  e.g. praising a dog after it performs a trick
- Extinction: withholding a previously administered positive reinforcement following an undesirable behavior; reduces the behavior.
  e.g. imposing an early curfew on a child who stayed out too late
- Punishment: an unpleasant consequence administered following an undesirable behavior; reduces the behavior.
  e.g. a choke chain for a dog
- Negative Reinforcement: withholding an unpleasant consequence following a correct behavior; strengthens the behavior.
  e.g. a boxer learning to block a jab

5 Reinforcement Schedules
- Continuous Reinforcement: every behavior is reinforced
- Partial Reinforcement: not every behavior is reinforced:
  - Fixed interval: a fixed time interval passes between reinforcements
  - Variable interval: the time interval varies between a min and a max
A continuous schedule results in faster learning, but also the fastest extinction if a reinforcement is missed.
A variable schedule is most effective for developing more permanent behavior.

6 The Skinner Box
Skinner theorized that our behavior is programmed by the reinforcement schedule delivered by our world.
The Skinner Box was a way to demonstrate this possibility:
- Skinner programmed these boxes to deliver reinforcements given various actions of its inhabitant
- he observed how actions changed over time

7 Components of reinforcement learning
- Agent: the decision maker/learner interacting with the environment; the agent selects actions
- Environment: the thing that the agent interacts with; it responds to actions, updates the situation (i.e. the state), and provides a reward
The state signal is assumed to have the Markov property:
1. it should summarize relevant past experiences
2. Pr(S_{t+1} | S_t, S_{t-1}, ..., S_0) = Pr(S_{t+1} | S_t)

8 Agent-Environment Interaction
[Figure: agent-environment interaction loop. The agent observes state s_t and reward r_t and selects action a_t; the environment responds with reward r_{t+1} and next state s_{t+1}.]

9 Overview of Class
We'll discuss three main approaches to solving RL problems:
- dynamic programming
- Monte Carlo methods
- temporal difference methods
We'll talk about integrating planning and learning.
And finally, approaches for addressing some weaknesses of these methods.

10 Examples
Pick-and-Place Robot: the robot should learn fast, smooth motions for picking up objects and placing them in the appropriate location
- actions: voltages applied to each motor in the joints
- states: joint angles and velocities
- reward: +1 for each object successfully picked and placed
- punishment: can give a small negative reward as a function of jerkiness of motion

11 Recycling Robot: a mobile robot collecting cans in an office environment
- actions: choose actions based on battery level:
  1. actively search for cans for some period of time
  2. remain stationary, waiting for someone to bring a can
  3. head back home for recharging
- states: battery level
- reward: positive for successfully depositing a can
- punishment: negative for allowing the battery level to drop to empty
- note: in this example, the RL agent is not the entire robot
  1. it monitors the condition of the robot itself (i.e. battery level)
  2. the environment includes the rest of the robot

12 Dynamic Programming Methods for RL
If we have an MDP modeling the system, we can define value and action-value functions.
- Value function: expresses the relationship between the value of a state and the value of its successors
  V^π(s) = R(s) + γ Σ_{s' ∈ S} Pr(s, π(s), s') V^π(s')
- Action-Value function: expresses the relationship between a state-action pair and the value of the states that can be reached from that state using that action
  Q^π(s, a) = R(s) + γ Σ_{s' ∈ S} Pr(s, a, s') Q^π(s', π(s'))
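To make these backups concrete, here is a minimal Python sketch of iterative policy evaluation for V^π. The tabular interface (dictionaries P and R, a deterministic policy pi, and the tolerance theta) is an assumption for illustration, not something defined in the slides.

```python
# Minimal sketch of iterative policy evaluation, assuming a small tabular MDP.
# P[s][a] is a list of (prob, next_state) pairs, R[s] is the immediate reward,
# and pi[s] is the action a deterministic policy takes in s -- all hypothetical names.

def evaluate_policy(states, P, R, pi, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Bellman backup for V^pi: value of s from its successors under pi(s)
            v_new = R[s] + gamma * sum(p * V[s2] for p, s2 in P[s][pi[s]])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:   # stop once V is (nearly) consistent with pi
            return V
```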

13 Backup Diagrams
[Figure: backup diagrams. (a) V^π(s): state s branches through actions a and rewards r to successor states s'. (b) Q^π(s, a): the pair (s, a) branches through reward r to successor states s' and actions a'.]

14 Graphical representation of relationships among states, actions and values
- Open circles represent states
- Solid circles represent actions
- V^π(s): s takes its value from the states that can be reached with the available actions; a given action could reach more than one state (noisy actions)
- Q^π(s, a): the pair (s, a) takes its value from the states that can be reached with action a and then following π thereafter

15 Optimal Value Function
V*(s) = max_a [ R(s) + γ Σ_{s' ∈ S} Pr(s, a, s') V*(s') ]

16 Optimal Action-Value Function
Q*(s, a) = R(s, a) + γ Σ_{s' ∈ S} Pr(s, a, s') max_{a'} Q*(s', a')

17 Dynamic Programming Methods for Solving MDPs
- Policy Iteration: iterates over two phases
  - policy evaluation: update the value function to reflect the current policy (makes the value function consistent with the current policy)
  - policy improvement: revise the policy given the value function (makes the policy greedy with respect to the current value function)
- Value Iteration: uses an update that combines one sweep of policy evaluation with policy improvement (a sketch follows below)
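As an illustration of the value-iteration sweep described above, here is a small sketch under the same assumed tabular MDP interface as before (P, R, gamma, theta are hypothetical names, not from the slides).

```python
# Sketch of value iteration: each sweep applies max_a [R(s) + gamma * sum_s' Pr(s,a,s') V(s')].
def value_iteration(states, actions, P, R, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = max(R[s] + gamma * sum(p * V[s2] for p, s2 in P[s][a]) for a in actions)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    # Extract the greedy policy with respect to the converged value function
    pi = {s: max(actions, key=lambda a: sum(p * V[s2] for p, s2 in P[s][a])) for s in states}
    return V, pi
```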

18 Is This Learning?
- When the agent generates a policy, has it actually learned anything?
- If the agent already knows the transition matrices and the reward function, then how is computing a policy learning?
- If you are told the rules of chess, does that mean you can automatically play a perfect game?
- Computing a policy allows you to solve problems in less time; this is similar to building macro operators.
- This is often called speed-up learning.

19 Monte Carlo Methods for RL
What if the agent is not given the transition matrices and/or the reward function?
- can we still build an optimal policy?
- we learned motor skills, but do you know the transition matrices for your motor actions?
- the agent must learn entirely from experience, rather than from a model
Monte Carlo Methods: solve RL by averaging sampled returns
- that is, we run some episodes and note the return values
- episode: a sequence of state transitions ending in a terminal state, e.g. starting at the beginning of a maze and moving to the end
- we update the estimated value of a state based on the average return we got for that state in the episodes

20 Monte Carlo Policy Evaluation
We want to estimate V^π(s), i.e. what is the value of being in s using π? But we don't have Pr or R.
Approach:
1. generate sample runs using π
2. observe the (discounted) returns obtained for s
3. compute the average

21 Monte Carlo Policy Evaluation
Algorithm: Monte Carlo Policy Evaluation
1. Initialize
   (a) π <- policy to be evaluated
   (b) V <- arbitrary value function
   (c) Returns(s) <- empty list, for all s in S
2. Repeat forever
   (a) Generate an episode using π
   (b) For each state s appearing in the episode
       i. R <- return following the first occurrence of s
          (i.e. the discounted reward we eventually accumulate after first reaching s, up to the terminal state)
       ii. Append R to Returns(s)
       iii. V^π(s) <- average(Returns(s))
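A minimal sketch of first-visit Monte Carlo policy evaluation. The helper run_episode(pi), which returns a list of (state, reward) pairs ending at a terminal state, is an assumed interface, not part of the slides.

```python
from collections import defaultdict

def mc_policy_evaluation(run_episode, pi, gamma=1.0, n_episodes=10000):
    returns = defaultdict(list)          # Returns(s): sampled returns observed for s
    V = defaultdict(float)
    for _ in range(n_episodes):
        episode = run_episode(pi)        # [(s0, r1), (s1, r2), ...]; r_{t+1} follows s_t
        first_visit = {}                 # remember the first time each state occurs
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)
        # Work backwards accumulating the discounted return G_t
        G = 0.0
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = r + gamma * G
            if first_visit[s] == t:      # first-visit MC: only the first occurrence counts
                returns[s].append(G)
                V[s] = sum(returns[s]) / len(returns[s])
    return V
```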

22 Example: Playing Blackjack
State features:
- value of the cards we have (we only care if it is greater than 11)
- value of the card the dealer is showing
- whether or not we have an ace that can be treated as 11
Actions:
- hit
- stay

23 What if we want to solve blackjack with a DP approach?
- need to know the probability of state transitions
- need to know the expected reward for each possible terminal state
  e.g.: you stay at 14; what is the expected reward given the dealer's showing card?
But we can easily simulate blackjack games and average the returns.

24 Example: Monte Carlo Value Estimation on Blackjack
[Figure: estimated value functions after 10,000 and after 500,000 episodes, for the usable-ace and no-usable-ace cases, plotted against dealer showing card and player sum.]

25 Backup Diagram for Monte Carlo Policy Evaluation

26 Monte Carlo Estimation of Action Values
With action models, the value function V is sufficient for generating the optimal policy:
- look ahead to the states that can be reached in one step given the possible actions
- but we must have the transition matrices for the actions
Without a model, it is particularly useful to estimate the action-value function Q*(s, a).
Approach: similar to value function estimation
1. generate sample runs
2. observe the (discounted) returns obtained for taking action a in state s
3. store the average in Q(s, a)

27 Exploitation vs. Exploration
Problem with the above approach: if we have a deterministic policy, then we will always take the same action each time we enter s. If we want to choose among actions, we need to try various actions in each state.
Conflict between exploitation and exploration:
- we would like to exploit the knowledge we have learned, e.g. you found a good path to the goal
- we would like to explore, looking for improvements, e.g. is there a better path?

28 Possibilities:
- exploring starts: pick a state-action pair as the start and then follow the policy thereafter; not always possible in real problems
- stochastic policy: use a stochastic policy with a non-zero probability of selecting all actions
Monte Carlo ES (Exploring Starts): a method for estimating the value function and the optimal policy together

29 Monte Carlo ES (Exploring Starts)
Algorithm: Monte Carlo ES
1. Initialize, for all s in S, a in A:
   (a) Q(s, a) <- arbitrary
   (b) π(s) <- arbitrary
   (c) Returns(s, a) <- empty list
2. Repeat forever
   (a) Generate an episode using exploring starts and π
   (b) For each pair (s, a) appearing in the episode
       i. R <- return following the first occurrence of (s, a)
          (i.e. the discounted reward we eventually accumulate after first taking a in s, up to the terminal state)
       ii. Append R to Returns(s, a)
       iii. Q(s, a) <- average(Returns(s, a))
   (c) For each state s appearing in the episode
       i. π(s) <- argmax_a Q(s, a)

30 Example: Monte Carlo ES on Blackjack
[Figure: the learned policy π* (HIT/STICK regions for the usable-ace and no-usable-ace cases) and the value function V*, plotted against dealer showing card and player sum.]

31 On-Policy vs. Off-Policy Monte Carlo Control
We would like to remove the assumption of exploring starts, but to converge on the optimal policy we still need all states and actions to be visited and selected infinitely often.
Two types of approaches:
- on-policy: evaluate and/or improve the policy used to generate samples; need to make the policy soft: π(s, a) > 0 for all s in S, a in A
- off-policy: the policy that is evaluated/improved is distinct from the one used to generate samples

32 ɛ-greedy Policies: An Approach to On-Policy Monte Carlo Control
ɛ-greedy policy:
- choose the action with the highest estimated value with probability 1 - ɛ
- choose a random action with probability ɛ
- thus, there is a nonzero probability of selecting each action from each state visited
ɛ-soft policy: π(s, a) >= ɛ / |A(s)|
An ɛ-greedy policy is a type of ɛ-soft policy.
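A one-function sketch of ɛ-greedy action selection over a tabular Q. The dictionary layout Q[(s, a)] is an assumption for illustration; Q is expected to hold (or default) an entry for every action available in s.

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """With probability epsilon pick a random action, otherwise the greedy one.
    Every action keeps probability at least epsilon / |A(s)|, so the policy stays epsilon-soft."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])
```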

33 We can use the Monte Carlo control algorithm from above, augmented with an ɛ-greedy policy:
- start with an ɛ-soft policy and move it towards an ɛ-greedy policy
We can show that it will converge on an ɛ-soft policy that is at least as good as all other ɛ-soft policies.
Proof idea: fold the ɛ random action selection into the noise model of the environment; the best one can do in this new environment with general policies is the same as the best one can do in the original environment with ɛ-soft policies.

34 Off-Policy Monte Carlo Control
It is also possible to optimize one policy π while gathering samples with a distinct policy π'.
- behavior policy: the policy used to sample the space; in this case π'
- estimation policy: the policy to be evaluated and improved; in this case π

35 It is still possible to learn the optimal policy:
- the behavior policy must be soft
- given an episode generated with π', we need to take into account the probability of that episode relative to π
- we can take the ratio of the probability of that episode occurring under π to that of it occurring under π'
- for details see Sutton and Barto [1998]

36 Temporal Difference (TD) Methods
Monte Carlo methods require us to reach the end of an episode before we can update the estimates; only then is the return known.
TD methods provide a useful update after each step, using the instantaneous reward we receive.

37 Example TD approach: given state s
- take a single step to s'
- observe the reward received going to s'
- update the estimate of the value of s (V(s)); a stepsize α determines how much to adjust the current estimate given the new sample
TD combines aspects of DP and Monte Carlo:
- like Monte Carlo: we don't know Pr and R ahead of time; the update is based on a sample backup, not a full backup
  - sample backup: we only consider one next state s' (the one that was sampled)
  - full backup: we consider all possible next states that could be reached (this is what DP does)
- like DP: the update is based on a single step

38 Temporal Difference Approach to Estimating V^π
Algorithm: TD Value Estimation
1. Initialize
   (a) V(s) <- arbitrary
   (b) π(s) <- policy to be evaluated
2. Repeat (for each episode):
   (a) Initialize s to the start state
   (b) Repeat (for each step of the episode):
       i. a <- action taken by π given s
       ii. Take action a, observe reward r and next state s'
       iii. V(s) <- V(s) + α [r + γ V(s') - V(s)]
       iv. s <- s'
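A small sketch of tabular TD(0) value estimation. The environment interface (env.reset() and env.step(a) returning (next_state, reward, done)) is an assumption loosely in the style of common RL toolkits, not something defined in the slides.

```python
from collections import defaultdict

def td0_evaluation(env, pi, alpha=0.1, gamma=0.99, n_episodes=1000):
    V = defaultdict(float)                 # unvisited (and terminal) states default to 0
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            a = pi(s)                      # action chosen by the policy being evaluated
            s_next, r, done = env.step(a)
            # TD(0) update: move V(s) toward the one-step target r + gamma * V(s')
            V[s] += alpha * (r + gamma * V[s_next] - V[s])
            s = s_next
    return V
```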

39 Backup Diagram for TD Value Estimation

40 Advantages of TD
- Compared to DP: Pr and R do not have to be known
- Compared to Monte Carlo: can learn on-line (i.e. we do not have to wait until the end of an episode)

41 Using TD to Solve the Control Problem
Objective: learn the action-value function Q(s, a).
Two possibilities: on-policy and off-policy. We'll talk about the on-policy method first.
Approach: generate episodes
- for each state-action pair (s, a), observe the reward r received and the next state-action pair (s', a')
- update the estimate of Q(s, a) based on the old estimate, r and Q(s', a')

42 Sarsa: the name of the approach stands for state, action, reward, state, action (i.e. the five elements of the update)

43 Sarsa: On-Policy TD Control
Algorithm: Sarsa
1. Initialize Q(s, a) arbitrarily
2. Repeat (for each episode):
   (a) Initialize s to the start state
   (b) Choose a given s using the policy derived from Q (ɛ-greedy)
   (c) Repeat (for each step of the episode) until s is terminal:
       i. Take action a, observe reward r and next state s'
       ii. Choose a' given s' using the policy derived from Q (ɛ-greedy)
       iii. Q(s, a) <- Q(s, a) + α [r + γ Q(s', a') - Q(s, a)]
       iv. s <- s'; a <- a'
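A tabular Sarsa sketch under the same assumed environment interface as the TD(0) sketch, with ɛ-greedy action selection folded in.

```python
import random
from collections import defaultdict

def sarsa(env, actions, alpha=0.1, gamma=0.99, epsilon=0.1, n_episodes=5000):
    Q = defaultdict(float)

    def choose(s):
        # epsilon-greedy with respect to the current Q
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(n_episodes):
        s = env.reset()
        a = choose(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = choose(s_next)
            # On-policy target: uses the action the behavior policy actually chose next
            target = r if done else r + gamma * Q[(s_next, a_next)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s_next, a_next
    return Q
```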

44 Backup Diagram for Sarsa
(Ask students what they think the diagram would be.)

45 Q-Learning: Off-Policy TD Control
Off-policy TD control is one of the most important breakthroughs of RL.
Approach:
- use a behavior policy to generate samples; each sample includes s, a, r, s'
- use the samples to update the estimate of Q*(s, a)
- update rule: Q(s_t, a_t) <- Q(s_t, a_t) + α [ r + γ max_a Q(s_{t+1}, a) - Q(s_t, a_t) ]

46 In other words: in an episode, the expected reward for the state-action pair (s_t, a_t) is relative to the max expected reward over all actions that can be taken in s_{t+1}.
The approach is called Q-Learning because we are learning the Q function.

47 Q-Learning: Off-Policy TD Control
Algorithm: Q-Learning
1. Initialize Q(s, a) arbitrarily
2. Repeat (for each episode):
   (a) Initialize s to the start state
   (b) Repeat (for each step of the episode) until s is terminal:
       i. Choose a given s using the policy derived from Q (ɛ-greedy)
       ii. Take action a, observe reward r and next state s'
       iii. Q(s, a) <- Q(s, a) + α [r + γ max_{a'} Q(s', a') - Q(s, a)]
       iv. s <- s'
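For comparison, a tabular Q-learning sketch; it differs from the Sarsa sketch only in the target, which maximizes over next actions instead of using the action actually taken (same assumed environment interface as before).

```python
import random
from collections import defaultdict

def q_learning(env, actions, alpha=0.1, gamma=0.99, epsilon=0.1, n_episodes=5000):
    Q = defaultdict(float)
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            # Behavior policy: epsilon-greedy with respect to the current Q
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            s_next, r, done = env.step(a)
            # Off-policy target: max over actions in s', regardless of what is taken next
            target = r if done else r + gamma * max(Q[(s_next, x)] for x in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```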

48 Backup Diagram for Q-Learning
(Ask students what they think the diagram would be.)

49 Why is Q-Learning Considered an Off-Policy Algorithm?
Its update does not take the ɛ-greedy behavior policy into account: it backs up the action that maximizes the Q function, but does not consider that that action might not always be the one actually selected.

50 Sarsa Compared to Q-Learning
[Figure: cliff-walking gridworld with start S, goal G, and "The Cliff"; the safe path and the optimal path along the cliff edge are marked. Plot: reward per episode for Sarsa vs. Q-learning.]

51 Notes on Example
- Q-Learning finds the optimal policy assuming a deterministic (greedy) policy.
- Sarsa finds the safer policy, which is appropriate for an ɛ-greedy approach.
- This demonstrates how Sarsa takes the behavior-generating policy into account (i.e. it is on-policy) and Q-Learning does not.

52 Actor-Critic Methods
[Figure: actor-critic architecture. The Actor (policy) selects actions; the Critic (value function) receives the state and reward from the Environment and produces a TD error that is fed back to the Actor.]

53 Stores a separate policy (actor) and value function (critic):
- the actor uses the policy to take actions
- the critic uses the value function to critique the actor's choices
The critic uses s_t and s_{t+1} to determine whether things have gone as anticipated; that is, it computes a TD error given by:
  δ_t = r_{t+1} + γ V(s_{t+1}) - V(s_t)
The TD error can be used to evaluate the choice of action a in state s:
- positive: the tendency to select a in this state should be strengthened
- negative: the tendency to select a in this state should be weakened
This should sound familiar given the discussion we had to motivate RL; note the similarity to the Law of Effect of Operant Conditioning.
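A bare-bones sketch of one tabular actor-critic step: the critic computes the TD error and the actor's preference for the chosen action is nudged in its direction. The preference table H and the softmax parameterization are assumptions for illustration and are not specified in the slides; a full actor-critic method would include further details.

```python
import math
import random
from collections import defaultdict

def actor_critic_step(H, V, s, a, r, s_next, alpha_v=0.1, alpha_h=0.1, gamma=0.99):
    """One update: the critic adjusts V and the actor adjusts its preference H using the TD error."""
    delta = r + gamma * V[s_next] - V[s]   # TD error: did things go better or worse than expected?
    V[s] += alpha_v * delta                # critic update
    H[(s, a)] += alpha_h * delta           # actor update: strengthen/weaken the tendency to pick a in s
    return delta

def actor_policy(H, s, actions):
    """Softmax over preferences: higher H(s, a) means a is chosen more often."""
    prefs = [math.exp(H[(s, a)]) for a in actions]
    total = sum(prefs)
    return random.choices(actions, weights=[p / total for p in prefs])[0]

# Example of the tables these functions expect (hypothetical usage):
# H, V = defaultdict(float), defaultdict(float)
```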

54 Credit Assignment
Let's assume we have an estimated action-value function and are now trying to use the policy to solve a problem.
- At each step t we choose actions that we expect will lead to a high payoff; Q(s_t, a_t) predicts the (discounted) reward for this state-action pair.
- Once we reach the end, we will get the actual reward.
- What happens if, looking back, we realize that some state-action pairs made incorrect predictions, either too high or too low?
Which state-action pairs should we update, and by how much? This is the credit assignment problem.
- Should pairs far back in time be held accountable for the problems that arise now?
- Should pairs far back in time be given credit for current successes?
- Perhaps they should, but maybe not as much as recent pairs.
- And shouldn't pairs that weren't even encountered during this episode be exempt from blame or credit?

55 Eligibility Traces
Eligibility traces provide a mechanism for determining how eligible a state-action pair (or just a state, when estimating V) is for an update.
The trace should answer the question: which state-action pairs (or just states) are eligible for update, and by how much?
Assumptions made by eligibility traces:
- pairs closer to a reward should receive more credit for the reward than pairs further away (eligibility decays)
- pairs that are encountered more times on the way to a reward should receive more credit than pairs encountered fewer times

56 Eligibility is incremented when we visit a pair (see next slide).
Computing eligibility: the eligibility for a state-action pair (s, a) at time t is given by:
  e_t(s, a) = γλ e_{t-1}(s, a) + 1   if s = s_t and a = a_t
  e_t(s, a) = γλ e_{t-1}(s, a)       otherwise
(for all s, a)

57 Example of Accumulating Eligibility
[Figure: an accumulating eligibility trace plotted over time, incremented at the times a state is visited and decaying in between.]

58 Updating State-Action Pair Reward Estimates
At time t we measure the prediction error by the TD error:
  δ_t = r_{t+1} + γ Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)
At time t, we adjust each state-action pair's prediction based on the error δ_t and its eligibility e_t(s, a):
  Q_{t+1}(s, a) = Q_t(s, a) + α δ_t e_t(s, a)

59 Can think of it as follows (see next slide):
- we are moving along through the state-action pairs
- at each step we get a reward and compute the error δ_t
- we "shout back" the error to the pairs we have visited in the past

60 Backward View of Eligibility Traces
[Figure: backward view. The TD error δ_t computed at state s_t is passed back to previously visited states s_{t-1}, s_{t-2}, s_{t-3}, ... in proportion to their eligibilities e_t.]

61 TD(λ): Bridge Between Monte Carlo and TD Methods
Consider what happens when we set λ = 0:
  e_t(s, a) = 1 if s = s_t and a = a_t, 0 otherwise (for all s, a)
- only the current state will be eligible for update, so we only update the current step
- this is analogous to our previous TD approaches; it is called TD(0)

62 As λ increases, but is still < 1:
- more of the preceding pairs are changed
- eligibility falls by γλ per step
Consider what happens when we set λ = 1:
  e_t(s, a) = γ e_{t-1}(s, a) + 1   if s = s_t and a = a_t
  e_t(s, a) = γ e_{t-1}(s, a)       otherwise (for all s, a)
- eligibility falls by just γ per step
- this is what we had for Monte Carlo

63 Backup Diagram for TD(λ)
What would the backup diagram for TD(λ) look like? Look back in the sequence of pairs and consider how a particular pair (s, a) is updated:
- when we take the next step, we'll update (s, a) based on the reward we got (1-step)
- when we take the next step, we will again update (s, a), but this time the eligibility will be less (2-step)
- when we take the next step, we will again update (s, a), but this time the eligibility will be less (3-step)
- and so on
So the update to (s, a) can be thought of as an average of n-step returns, i.e. the average of the returns we got after 1 step, 2 steps, 3 steps, ...

64 This is the forward view of TD(λ) (see next slide).
We can show this average over n-step returns graphically, like previous backup diagrams:
- a bar over the top indicates an average of returns
- the weights for each element of the average appear at the bottom
(see the following slides)

65 Forward View of Eligibility Traces
[Figure: forward view. From state s_t, the n-step returns look ahead through rewards r_{t+1}, r_{t+2}, r_{t+3}, ... and states s_{t+1}, s_{t+2}, s_{t+3}, ... up to the final reward r_T.]

66 TD(λ)'s Backup Diagram
[Figure: the λ-return backup diagram. The n-step returns receive weights (1 - λ), (1 - λ)λ, (1 - λ)λ^2, ..., with weight λ^{T-t-1} on the final return; the weights sum to 1.]

67 Weights Given to Returns
[Figure: the weight given to each n-step return starts at 1 - λ and decays by λ per step, with the remaining weight given to the actual, final return at time T; the total area is 1.]

68 Sarsa(λ)'s Backup Diagram
[Figure: Sarsa(λ) backup diagram for the pair (s_t, a_t). The n-step backups are weighted (1 - λ), (1 - λ)λ, (1 - λ)λ^2, ..., λ^{T-t-1}, ending at the terminal state s_T; the weights sum to 1.]

69 Sarsa(λ)
Algorithm: Sarsa(λ)
1. Initialize Q(s, a) arbitrarily, and e(s, a) = 0 for all s and a
2. Repeat (for each episode):
   (a) Initialize s and a
   (b) Repeat (for each step of the episode) until s is terminal:
       i. Take action a, observe reward r and next state s'
       ii. Choose a' given s' using the policy derived from Q (ɛ-greedy)
       iii. δ <- r + γ Q(s', a') - Q(s, a)
       iv. e(s, a) <- e(s, a) + 1
       v. For all s, a:
          A. Q(s, a) <- Q(s, a) + α δ e(s, a)
          B. e(s, a) <- γλ e(s, a)
       vi. s <- s'; a <- a'
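A tabular Sarsa(λ) sketch with accumulating eligibility traces, following the algorithm above (environment interface and ɛ-greedy helper as in the earlier sketches, which remain assumptions).

```python
import random
from collections import defaultdict

def sarsa_lambda(env, actions, alpha=0.1, gamma=0.99, lam=0.9, epsilon=0.1, n_episodes=2000):
    Q = defaultdict(float)

    def choose(s):
        # epsilon-greedy with respect to the current Q
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(n_episodes):
        e = defaultdict(float)              # eligibility traces, reset at the start of each episode
        s = env.reset()
        a = choose(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = choose(s_next)
            target = r if done else r + gamma * Q[(s_next, a_next)]
            delta = target - Q[(s, a)]
            e[(s, a)] += 1.0                # accumulating trace: bump the pair just visited
            for key in list(e):             # "shout back" the error to every eligible pair
                Q[key] += alpha * delta * e[key]
                e[key] *= gamma * lam       # then decay its eligibility
            s, a = s_next, a_next
    return Q
```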

70 Example: Sarsa(λ) and Gridworld
[Figure: three gridworld panels showing the path taken, the action values increased by one-step Sarsa, and the action values increased by Sarsa(λ) with λ = 0.9.]

71 Integrating Planning, Acting and Learning
The previous techniques will eventually lead to the optimal policy without knowing Pr and R ahead of time, but this can take a long time.
What if the system not only learned a policy, but also learned a model of the world?

72 Relationships Among Planning, Acting and Learning
[Figure: cycle diagram. Acting produces experience; experience drives both direct RL updates to the value/policy and model learning; the model supports planning, which also updates the value/policy.]

73 Certainty Equivalence
Algorithm:
1. Learn Pr and R by exploring the world
2. Then learn the optimal policy (e.g. through policy iteration)
Disadvantages:
- an arbitrary division between learning and acting
- random exploration could be dangerous and inefficient
- problems if the environment changes after learning

74 Dyna
Dyna combines learning and policy generation:
- uses experience to learn Pr and R
- uses experience to adjust the policy
- uses the model to adjust the policy

75 Dyna Architecture: Integrated Planning, Acting and Learning
[Figure: Dyna architecture. Real experience from the Environment drives both direct RL updates to the policy/value functions and model learning; the Model generates simulated experience (via search control) that drives planning updates to the same policy/value functions.]

76 Dyna
1. Loop forever
   (a) Get experience (s, a, r, s') from the world
   (b) Update the model:
       i. update Pr by increasing the statistic for the transition from s to s' with action a
       ii. update the statistic for receiving reward r for taking action a in s
   (c) Update Q from experience:
       i. Q(s, a) <- R(s, a) + γ Σ_{s'} Pr(s, a, s') max_{a'} Q(s', a')
   (d) Update Q from the model:
       i. Repeat N times:
          A. s <- random previously observed state
          B. a <- random action previously taken in s
          C. Q(s, a) <- R(s, a) + γ Σ_{s'} Pr(s, a, s') max_{a'} Q(s', a')
   (e) Choose the next action to perform
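A simplified Dyna-Q sketch in the spirit of the loop above: the same Q update is applied to real experience and to N simulated transitions drawn from a learned model. Here the model is a simple last-observed, deterministic table, and the update is the sample-based Q-learning backup rather than the full expected backup written in the loop above; both simplifications are assumptions for illustration.

```python
import random
from collections import defaultdict

def dyna_q(env, actions, n_planning=50, alpha=0.1, gamma=0.95, epsilon=0.1, n_episodes=500):
    Q = defaultdict(float)
    model = {}                                   # (s, a) -> (r, s_next); simple deterministic model

    def q_update(s, a, r, s_next):
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s_next, x)] for x in actions) - Q[(s, a)])

    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            a = random.choice(actions) if random.random() < epsilon else \
                max(actions, key=lambda x: Q[(s, x)])
            s_next, r, done = env.step(a)
            q_update(s, a, r, s_next)            # direct RL update from real experience
            model[(s, a)] = (r, s_next)          # model learning
            for _ in range(n_planning):          # planning: replay simulated experience
                (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
                q_update(ps, pa, pr, ps_next)
            s = s_next
    return Q
```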

77 Performance of Dyna
[Figure: a gridworld maze with start S and goal G, and a plot of steps per episode vs. episodes for 0 planning steps (direct RL only), 5 planning steps, and 50 planning steps.]

78 Snapshot of Dyna's Policies
[Figure: the gridworld policies learned without planning (N = 0) and with planning (N = 50), with start S and goal G.]

79 Blocked Maze
[Figure: a maze whose original path from S to G becomes blocked; plot of cumulative reward over time steps for Dyna-Q+, Dyna-Q and Dyna-AC.]

80 Shortcut Maze
[Figure: a maze in which a shortcut from S to G opens up; plot of cumulative reward over time steps for Dyna-Q+, Dyna-Q and Dyna-AC.]

81 Bibliography

82 Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
