CS 360: Advanced Artificial Intelligence Class #16: Reinforcement Learning
1 CS 360: Advanced Artificial Intelligence Class #16: Reinforcement Learning Daniel M. Gaines Note: content for slides adapted from Sutton and Barto [1998]
2 Introduction Animals learn through interaction with an environment: an infant learning motor skills, a pigeon learning to press a particular lever, a dog learning commands
3 Operant Conditioning 1 Much of the motivation behind RL comes from the work of behaviorist B. F. Skinner. Operant Conditioning is a form of behavior modification based on the Law of Effect: behavior followed by a pleasurable consequence tends to be repeated; behavior followed by an unpleasant consequence tends to be eliminated. Thus (according to OC) behavior is a function of its consequences. "The rat is always right" (B. F. Skinner), meaning a creature's behavior is rational given its reinforcement history. 1 Slides adapted from mschnake/reinf.htm
4 Types of Reinforcement Positive Reinforcement: a pleasurable consequence administered after a desired behavior; strengthens behavior (e.g. praising a dog after it performs a trick). Extinction: withholding a previously administered positive reinforcement following an undesirable behavior; reduces behavior (e.g. imposing an early curfew on a child who stayed out too late). Punishment: an unpleasant consequence administered following an undesirable behavior; reduces behavior (e.g. a choke chain for a dog). Negative Reinforcement: withholding an unpleasant consequence following a correct behavior; strengthens behavior (e.g. a boxer learning to block a jab)
5 Reinforcement Schedules Continuous Reinforcement: every behavior is reinforced. Partial Reinforcement: not every behavior is reinforced: Fixed interval: a fixed time interval passes between reinforcements. Variable interval: the time interval varies between a min and max. A continuous schedule results in faster learning but also the fastest extinction if a reinforcement is missed. A variable schedule is most effective for developing more permanent behavior
6 The Skinner Box Skinner theorized that our behavior is programmed by the reinforcement schedule delivered by our world. The Skinner Box was a way to demonstrate this possibility: Skinner programmed these boxes to deliver reinforcements given various actions of their inhabitants, and observed how those actions changed over time
7 Components of reinforcement learning Agent: the decision maker/learner interacting with the environment; the agent selects actions. Environment: the thing that the agent interacts with; it responds to actions, updates the situation (i.e. state), and provides reward. The state signal is assumed to have the Markov property: 1. it should summarize relevant past experiences 2. Pr(S_{t+1} | S_t, S_{t-1}, ..., S_0) = Pr(S_{t+1} | S_t)
8 Agent-Environment Interaction [Figure: the agent-environment loop — at time t the agent observes state s_t and reward r_t and selects action a_t; the environment returns reward r_{t+1} and next state s_{t+1}]
9 Overview of Class We'll discuss three main approaches to solving RL problems: dynamic programming, Monte Carlo methods, and temporal difference methods. We'll talk about integrating planning and learning. And finally, approaches for addressing some weaknesses of these methods
10 Examples Pick-and-Place Robot: the robot should learn fast, smooth motions for picking up objects and placing them in an appropriate location. actions: voltages applied to each motor in the joints. states: joint angles and velocities. reward: +1 for each object successfully picked and placed. punishment: can give a small negative reward as a function of jerkiness of motion
11 Recycling Robot: a mobile robot collecting cans in an office environment. actions: choose among actions based on battery level: 1. actively search for cans for some period of time 2. remain stationary, waiting for someone to bring a can 3. head back home for recharging. states: battery level. reward: positive for successfully depositing a can. punishment: negative for allowing the battery level to drop to empty. note: in this example, the RL agent is not the entire robot: 1. it monitors the condition of the robot itself (i.e. battery level) 2. the environment includes the rest of the robot
12 Dynamic Programming Methods for RL If we have an MDP modeling the system, we can define value and action-value functions. Value function: expresses the relationship between the value of a state and the value of its successors: V^π(s) = R(s) + γ Σ_{s'∈S} Pr(s, π(s), s') V^π(s'). Action-Value function: expresses the relationship between a state, action pair and the value of the states that can be reached from that state using that action: Q^π(s, a) = R(s, a) + γ Σ_{s'∈S} Pr(s, a, s') Q^π(s', π(s'))
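The policy-evaluation equation above is a fixed point that can be reached by simple iteration. The sketch below (the two-state MDP, its rewards, and the policy are invented for illustration, not from the lecture) repeatedly applies V(s) ← R(s) + γ Σ_{s'} Pr(s, π(s), s') V(s'):

```python
# Sketch: iterative evaluation of a fixed policy on a toy two-state MDP.
# States, transitions, rewards, and the policy are hypothetical.
GAMMA = 0.9

# Pr[(s, a)] -> list of (next_state, probability); R[s] -> immediate reward
Pr = {
    ("s0", "go"): [("s1", 1.0)],
    ("s1", "go"): [("s0", 0.5), ("s1", 0.5)],
}
R = {"s0": 0.0, "s1": 1.0}
policy = {"s0": "go", "s1": "go"}

def evaluate_policy(policy, sweeps=200):
    """Synchronous sweeps of V(s) <- R(s) + gamma * sum_s' Pr(s'|s,pi(s)) V(s')."""
    V = {s: 0.0 for s in policy}
    for _ in range(sweeps):
        V = {s: R[s] + GAMMA * sum(p * V[s2] for s2, p in Pr[(s, policy[s])])
             for s in policy}
    return V

V = evaluate_policy(policy)
```

Because the update is a γ-contraction, 200 sweeps are far more than enough to reach the unique fixed point of this small system.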
13 Backup Diagrams [Figure: (a) backup diagram for V^π(s) — state s branches to actions a, then to rewards r and successor states s'; (b) backup diagram for Q^π(s, a) — the pair (s, a) branches to rewards r and successor states s']
14 Graphical representation of relationships among states, actions and values. Open circles represent states; solid circles represent actions. V^π(s): s takes its value from the states that can be reached with the available actions; a given action could reach more than one state (noisy actions). Q^π(s, a): the pair (s, a) takes its value from the states that can be reached with action a and then following π thereafter
15 Optimal Value Function [Figure: backup diagram for V* — a max over actions a, then an expectation over rewards r and successors s'] V*(s) = max_a [ R(s) + γ Σ_{s'∈S} Pr(s, a, s') V*(s') ]
16 Optimal Action-Value Function [Figure: backup diagram for Q* — the pair (s, a) branches to rewards r and successors s', then a max over a'] Q*(s, a) = R(s, a) + γ Σ_{s'∈S} Pr(s, a, s') max_{a'} Q*(s', a')
17 Dynamic Programming Methods for Solving MDPs Policy Iteration: iterates over two phases policy evaluation: update the value function to reflect the current policy makes the value function consistent with current policy policy improvement: revise the policy given the value function makes the policy greedy with respect to the current value function Value Iteration: uses an update function which combines one sweep of policy evaluation with policy improvement 16
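The value-iteration update described above can be sketched in a few lines. The 2-state, 2-action MDP here is hypothetical, chosen so that the fixed point is easy to check by hand (staying in s1 pays 1 per step, so V*(s1) = 1/(1−γ) = 10):

```python
# Sketch of value iteration on a hypothetical 2-state, 2-action MDP.
GAMMA = 0.9
STATES = ["s0", "s1"]
ACTIONS = ["stay", "move"]
# Pr[(s, a)] -> list of (next_state, probability); R[(s, a)] -> reward
Pr = {
    ("s0", "stay"): [("s0", 1.0)], ("s0", "move"): [("s1", 1.0)],
    ("s1", "stay"): [("s1", 1.0)], ("s1", "move"): [("s0", 1.0)],
}
R = {("s0", "stay"): 0.0, ("s0", "move"): 0.0,
     ("s1", "stay"): 1.0, ("s1", "move"): 0.0}

def value_iteration(theta=1e-8):
    """Sweep V(s) <- max_a [R(s,a) + gamma * sum_s' Pr(s'|s,a) V(s')] until stable."""
    V = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            v = max(R[(s, a)] + GAMMA * sum(p * V[s2] for s2, p in Pr[(s, a)])
                    for a in ACTIONS)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            break
    # Extract the greedy policy with respect to the converged value function.
    pi = {s: max(ACTIONS, key=lambda a: R[(s, a)] + GAMMA *
                 sum(p * V[s2] for s2, p in Pr[(s, a)]))
          for s in STATES}
    return V, pi

V, pi = value_iteration()
```

The single update combines one sweep of evaluation with improvement (the max over actions), exactly as the slide describes.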
18 Is This Learning? When the agent generates a policy, has it actually learned anything? If the agent already knows the transition matrices and the reward function, then how is computing a policy learning? If you are told the rules of chess, does that mean you can automatically play a perfect game? Computing a policy allows you to solve problems in less time; this is similar to building macro operators. This is often called speed-up learning
19 Monte Carlo Methods for RL What if the agent is not given the transition matrices and/or reward function? Can we still build an optimal policy? We learned motor skills, but do you know the transition matrices for your motor actions? The agent must learn entirely from experience, rather than from a model. Monte Carlo Methods: solve RL by averaging sampled returns; that is, we run some episodes and note the return values. episode: a sequence of state transitions ending in a terminal state, e.g. starting at the beginning of a maze and moving to the end. We update the estimated value of a state based on the average return we got for that state in the episodes
20 Monte Carlo Policy Evaluation Want to estimate V^π(s), i.e. what is the value of being in s using π. But we don't have Pr or R. Approach: 1. generate sample runs using π 2. observe the (discounted) returns obtained for s 3. compute the average
21 Monte Carlo Policy Evaluation Algorithm: Monte Carlo Policy Evaluation
1. Initialize (a) π ← policy to be evaluated (b) V ← arbitrary value function (c) Returns(s) ← empty list, for all s ∈ S
2. Repeat forever (a) Generate an episode using π (b) For each state s appearing in the episode: i. R ← return following the first occurrence of s (i.e. the discounted reward we eventually accumulate from s until the terminal state) ii. Append R to Returns(s) iii. V^π(s) ← average(Returns(s))
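The bookkeeping in the algorithm above (first-visit returns, appended to a per-state list and averaged) can be sketched on a tiny episodic chain. The chain environment (states 0..3, terminal at 3, reward 1 on reaching the terminal) is invented for illustration; because it is deterministic, the averages equal the exact discounted returns:

```python
# Sketch: first-visit Monte Carlo policy evaluation on a hypothetical
# deterministic chain (states 0..3, terminal at 3).
GAMMA = 0.9

def run_episode():
    """One episode under the fixed policy "always move right"."""
    s, traj = 0, []
    while s != 3:
        s2 = s + 1
        r = 1.0 if s2 == 3 else 0.0   # reward only on entering the terminal state
        traj.append((s, r))
        s = s2
    return traj

def mc_evaluate(episodes=10):
    returns = {s: [] for s in range(3)}
    for _ in range(episodes):
        traj = run_episode()
        rewards = [r for _, r in traj]
        seen = set()
        for i, (s, _) in enumerate(traj):
            if s in seen:
                continue                       # first-visit: count s only once
            seen.add(s)
            # Discounted return following the first occurrence of s.
            G = sum(GAMMA ** (k - i) * rewards[k] for k in range(i, len(rewards)))
            returns[s].append(G)
    return {s: sum(g) / len(g) for s, g in returns.items()}

V = mc_evaluate()
```

Here V(2) = 1, V(1) = γ = 0.9, V(0) = γ² = 0.81; in a stochastic environment the averages would only converge to these expectations over many episodes.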
22 Example: Playing Blackjack State features: value of the cards we have (we only care if greater than 11), value of the card the dealer is showing, whether or not we have an ace that can be counted as 11. Actions: hit, stay
23 What if we want to solve blackjack with a DP approach? We would need to know the probability of state transitions and the expected reward for each possible terminal state, e.g.: you stay at 14 — what is the expected reward given the dealer's showing card? In contrast, we can easily simulate blackjack games and average the returns
24 Example: Monte Carlo Value Estimation on Blackjack [Figure: estimated value functions after 10,000 and after 500,000 episodes, plotted over player sum and dealer showing card, with and without a usable ace]
25 Backup Diagram for Monte Carlo Policy Evaluation [Figure: backup diagram — an entire sampled episode, from the root state down to a terminal state]
26 Monte Carlo Estimation of Action Values With action models, the value function V is sufficient for generating an optimal policy: look ahead to the states that can be reached in one step given the possible actions. But we must have the transition matrices for the actions. Without a model, it is particularly useful to estimate the action-value function Q^π(s, a). Approach: similar to value function estimation: 1. generate sample runs 2. observe the (discounted) returns obtained for taking action a in state s 3. store the average in Q(s, a)
27 Exploitation vs. Exploration Problem with the above approach: if we have a deterministic policy, then we will always take the same action each time we enter s; if we want to choose among actions, we need to try various actions in each state. Conflict between exploitation and exploration: we would like to exploit the knowledge we have learned (e.g. you found a good path to the goal); we would like to explore, looking for improvements (e.g. is there a better path?)
28 Possibilities: exploring starts: pick a state-action pair as the start and then follow the policy thereafter; not always possible in real problems. stochastic policy: use a stochastic policy with a non-zero probability of selecting all actions. Monte Carlo ES (Exploring Starts): a method for estimating the value function and optimal policy together
29 Monte Carlo ES (Exploring Starts) Algorithm: Monte Carlo ES
1. Initialize, for all s ∈ S, a ∈ A: (a) Q(s, a) ← arbitrary (b) π(s) ← arbitrary (c) Returns(s, a) ← empty list
2. Repeat forever (a) Generate an episode using exploring starts and π (b) For each pair (s, a) appearing in the episode: i. R ← return following the first occurrence of (s, a) (i.e. the discounted reward we eventually accumulate after taking a in s, until the terminal state) ii. Append R to Returns(s, a) iii. Q(s, a) ← average(Returns(s, a)) (c) For each state s appearing in the episode: i. π(s) ← argmax_a Q(s, a)
30 Example: Monte Carlo ES on Blackjack [Figure: the learned optimal policy π* (HIT/STICK regions) and optimal value function V*, with and without a usable ace, plotted over player sum and dealer showing card]
31 On-Policy vs. Off-Policy Monte Carlo Control We would like to remove the assumption of exploring starts, but to converge on an optimal policy we still need all states and actions to be visited and selected infinitely often. Two types of approaches: on-policy: evaluate and/or improve the policy used to generate samples; need to make the policy soft: π(s, a) > 0 for all s ∈ S, a ∈ A. off-policy: the policy that is evaluated/improved is distinct from the one used to generate samples
32 ε-greedy Policies: An Approach to On-Policy Monte Carlo Control ε-greedy policy: choose the action with the highest estimated value with probability 1 − ε; choose a random action with probability ε. Thus, there is a nonzero probability of selecting each action from each state visited. ε-soft policy: π(s, a) ≥ ε / |A(s)|. An ε-greedy policy is a type of ε-soft policy
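The ε-greedy rule above is a one-liner in practice. In this sketch the Q table is invented; the point is just the two-branch selection rule and the resulting mix of greedy and random choices:

```python
import random

# Sketch of epsilon-greedy action selection over a tabular Q function.
# The Q values below are hypothetical.
def epsilon_greedy(Q, state, actions, epsilon, rng):
    """With probability epsilon pick uniformly at random; otherwise pick argmax_a Q."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

rng = random.Random(0)
Q = {("s", "left"): 0.2, ("s", "right"): 1.0}
counts = {"left": 0, "right": 0}
for _ in range(1000):
    counts[epsilon_greedy(Q, "s", ["left", "right"], 0.1, rng)] += 1
```

With ε = 0.1, "right" is chosen about 95% of the time (90% greedy plus half of the 10% random picks), so every action keeps a nonzero selection probability, as the ε-soft condition requires.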
33 Can use the Monte Carlo control algorithm from above augmented with an ε-greedy policy: start with an ε-soft policy and move it towards an ε-greedy policy. Can show that it will converge on an ε-soft policy that is at least as good as all other ε-soft policies. proof idea: fold the ε random action selection into the noise model of the environment; the best one can do in this new environment with general policies is the same as the best one can do in the original environment with ε-soft policies
34 Off-Policy Monte Carlo Control It is also possible to optimize one policy π while gathering samples with a distinct policy π'. behavior policy: the policy used to sample the space, in this case π'. estimation policy: the policy to be evaluated and improved, in this case π
35 It is still possible to learn the optimal policy: the behavior policy must be soft; given an episode generated with π', we need to take into account the probability of that episode relative to π; we can take the ratio of the probability of that episode occurring with π and that episode occurring with π'. For details see Sutton and Barto [1998]
36 Temporal Difference (TD) Methods Monte Carlo methods require us to reach the end of an episode before we can update the estimates; only then is the return known. TD methods provide a useful update after each step, using the instantaneous reward we receive
37 Example TD approach: given state s, take a single step to s'; observe the reward received going to s'; update the estimate of the value of s (V(s)). A stepsize α is used to determine how much to adjust the current estimate given the new sample. TD combines aspects of DP and Monte Carlo. like Monte Carlo: we don't know Pr and R ahead of time; update based on a sample backup, not a full backup. sample backup: we only consider one next state s' (the one that was sampled). full backup: we consider all possible next states that could be reached (this is what DP does). like DP: update based on a single step
38 Temporal Difference Approach to Estimating V^π Algorithm: TD Value Estimation
1. Initialize (a) V(s) ← arbitrary (b) π(s) ← policy to be evaluated
2. Repeat (for each episode): (a) Initialize s to the start state (b) Repeat (for each step of episode): i. a ← action taken by π given s ii. Take action a, observe reward r and next state s' iii. V(s) ← V(s) + α [r + γV(s') − V(s)] iv. s ← s'
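The TD(0) update above can be sketched on the classic random-walk evaluation task (five non-terminal states, terminals at both ends, reward 1 on the right exit; this test problem is from the RL literature, not from the slide itself). The true values are s/6 for states s = 1..5:

```python
import random

# Sketch: TD(0) evaluation of the uniform-random policy on a 5-state random walk.
GAMMA, ALPHA = 1.0, 0.1

def td0(episodes=5000, seed=0):
    rng = random.Random(seed)
    V = {s: 0.0 for s in range(1, 6)}     # non-terminal states 1..5; 0 and 6 terminal
    for _ in range(episodes):
        s = 3                              # start in the middle
        while s not in (0, 6):
            s2 = s + rng.choice((-1, 1))   # random policy: step left or right
            r = 1.0 if s2 == 6 else 0.0    # reward only on the right exit
            v_next = 0.0 if s2 in (0, 6) else V[s2]
            # The TD(0) update: V(s) <- V(s) + alpha [r + gamma V(s') - V(s)]
            V[s] += ALPHA * (r + GAMMA * v_next - V[s])
            s = s2
    return V

V = td0()
```

Unlike Monte Carlo, each transition produces an update immediately; with a constant stepsize the estimates hover near the true values rather than converging exactly.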
39 Backup Diagram for TD Value Estimation [Figure: backup diagram — a single sampled transition from s to one successor s']
40 Advantages of TD Compared to DP Pr and R do not have to be known Compared to Monte Carlo can learn on-line (i.e. do not have to wait until end of episode) 39
41 Using TD to Solve the Control Problem Objective: learn the action-value function Q(s, a). Two possibilities: on-policy and off-policy. We'll talk about the on-policy method first. Approach: generate episodes; for each state-action pair (s, a), observe the reward r received and the next state-action pair (s', a'); update the estimate of Q(s, a) based on the old estimate, r and Q(s', a')
42 Sarsa: the name of the approach stands for: state, action, reward, state, action, i.e. the five elements of the update
43 Sarsa: On-Policy TD Control Algorithm: Sarsa
1. Initialize Q(s, a) arbitrarily
2. Repeat (for each episode): (a) Initialize s to the start state (b) Choose a given s using a policy derived from Q (ε-greedy) (c) Repeat (for each step of episode) until s is terminal: i. Take action a, observe reward r and next state s' ii. Choose a' given s' using a policy derived from Q (ε-greedy) iii. Q(s, a) ← Q(s, a) + α [r + γQ(s', a') − Q(s, a)] iv. s ← s'; a ← a'
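A minimal tabular Sarsa can be sketched on a hypothetical 4-state corridor (start 0, goal 3, actions L/R, reward −1 per step — this environment is invented, not from the lecture). Note the update uses the action a' actually chosen by the ε-greedy policy, which is what makes it on-policy:

```python
import random

# Sketch: tabular Sarsa on a hypothetical 4-state corridor.
GAMMA, ALPHA, EPS = 1.0, 0.5, 0.1
ACTIONS = ("L", "R")

def step(s, a):
    """Deterministic corridor dynamics; every step costs -1."""
    return (max(0, s - 1) if a == "L" else min(3, s + 1)), -1.0

def choose(Q, s, rng):
    if rng.random() < EPS:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

def sarsa(episodes=500, seed=1):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(4) for a in ACTIONS}
    for _ in range(episodes):
        s = 0
        a = choose(Q, s, rng)
        while s != 3:
            s2, r = step(s, a)
            if s2 == 3:                     # terminal: no bootstrap term
                Q[(s, a)] += ALPHA * (r - Q[(s, a)])
                break
            a2 = choose(Q, s2, rng)         # the actual next action -> on-policy
            Q[(s, a)] += ALPHA * (r + GAMMA * Q[(s2, a2)] - Q[(s, a)])
            s, a = s2, a2
    return Q

Q = sarsa()
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(3)}
```

After training, the greedy policy moves right everywhere, and Q(2, R) approaches the true cost of −1 for the final step.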
44 Backup Diagram for Sarsa Ask students what they think the diagram would be 43
45 Q-Learning: Off-Policy TD Control Off-policy TD control is one of the most important breakthroughs of RL. Approach: use a behavior policy to generate samples; each sample includes s, a, r, s'; use the samples to update the estimate of Q*(s, a). update rule: Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]
46 In other words: in an episode, the expected reward for the state-action pair (s_t, a_t) is relative to the max expected reward over all actions that can be taken in s_{t+1}. The approach is called Q-Learning because we are learning the Q function
47 Q-Learning: Off-Policy TD Control Algorithm: Q-Learning
1. Initialize Q(s, a) arbitrarily
2. Repeat (for each episode): (a) Initialize s to the start state (b) Repeat (for each step of episode) until s is terminal: i. Choose a given s using a policy derived from Q (ε-greedy) ii. Take action a, observe reward r and next state s' iii. Q(s, a) ← Q(s, a) + α [ r + γ max_{a'} Q(s', a') − Q(s, a) ] iv. s ← s'
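Q-learning differs from Sarsa only in the update target: it bootstraps from max_a' Q(s', a') regardless of which action the behavior policy will actually take. The sketch below uses a hypothetical 4-state corridor (start 0, goal 3, reward −1 per step; with γ = 1 the optimal values are Q*(s, R) = −(3 − s)):

```python
import random

# Sketch: tabular Q-learning on a hypothetical 4-state corridor.
GAMMA, ALPHA, EPS = 1.0, 0.5, 0.2
ACTIONS = ("L", "R")

def step(s, a):
    """Deterministic corridor dynamics; every step costs -1."""
    return (max(0, s - 1) if a == "L" else min(3, s + 1)), -1.0

def q_learning(episodes=1000, seed=2):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(4) for a in ACTIONS}
    for _ in range(episodes):
        s = 0
        while s != 3:
            # Behavior policy: epsilon-greedy on the current Q.
            if rng.random() < EPS:
                a = rng.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda x: Q[(s, x)])
            s2, r = step(s, a)
            # Off-policy target: max over next actions, not the action taken next.
            best_next = 0.0 if s2 == 3 else max(Q[(s2, x)] for x in ACTIONS)
            Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
            s = s2
    return Q

Q = q_learning()
```

Even with a fairly random behavior policy (ε = 0.2), the estimates converge toward the optimal action values, illustrating the off-policy property.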
48 Backup Diagram for Q-Learning Ask students what they think the diagram would be 47
49 Why is Q-Learning Considered an Off-Policy Algorithm? Its update does not take the ε-greedy policy into account: it backs up the action that maximizes the Q function, but does not consider that that action might not always be the one selected
50 Sarsa Compared to Q-Learning [Figure: the cliff-walking gridworld (start S, goal G, "The Cliff" along the bottom edge) with r = −1 per step, showing the safe path found by Sarsa and the optimal path along the cliff edge found by Q-learning, plus a plot of reward per episode for Sarsa and Q-learning]
51 Notes on Example Q-Learning finds the optimal policy, assuming a deterministic policy. Sarsa finds the safer policy, which is appropriate for an ε-greedy approach. This demonstrates how Sarsa takes the behavior-generating policy into account (i.e. it is on-policy) and Q-Learning does not
52 Actor-Critic Methods [Figure: actor-critic architecture — the actor (policy) selects actions; the critic (value function) observes state and reward from the environment and sends a TD error to the actor]
53 Stores a separate policy (actor) and value function (critic): the actor uses the policy to take actions; the critic uses the value function to critique the actor's choices. The critic uses s_t and s_{t+1} to determine if things have gone as anticipated; that is, it computes a TD error given by: δ_t = r_{t+1} + γV(s_{t+1}) − V(s_t). The TD error can be used to evaluate the choice of action a in state s: positive: the tendency to select a in this state should be strengthened; negative: suggests the tendency to select a in this state should be weakened. This should sound familiar given the discussion we had to motivate RL. Note the similarity to the Law of Effect of Operant Conditioning
54 Credit Assignment Let's assume we have an estimated action-value function and are now trying to use the policy to solve a problem. At each step t we are choosing actions that we expect will lead us to a high payoff; Q(s_t, a_t) predicts the (discounted) reward for this state-action pair. Once we reach the end, we will get the actual reward. What happens if, looking back, we realize that some state-action pairs made incorrect predictions, either too high or too low? Which state-action pairs should we update and by how much? This is the credit assignment problem. Should pairs far back in time be held accountable for the problems that arise now? Should pairs far back in time be given credit for current successes? Perhaps they should, but maybe not as much as recent pairs. And shouldn't pairs that weren't even encountered during this episode be exempt from blame or credit?
55 Eligibility Traces Eligibility traces provide a mechanism for determining how eligible a state-action pair (or just a state, when estimating V) is for an update. The trace should answer the question: which state-action pairs (or just states) are eligible for update, and by how much? Assumptions made by eligibility traces: pairs closer to a reward should receive more credit for the reward than pairs further away (eligibility decays); pairs that are encountered more times on the way to a reward should receive more credit than pairs encountered fewer times
56 Eligibility is incremented when we visit a pair (see next slide). Computing eligibility: the eligibility of a state-action pair (s, a) at time t is given by:
e_t(s, a) = γλ e_{t−1}(s, a) + 1   if s = s_t and a = a_t
e_t(s, a) = γλ e_{t−1}(s, a)        otherwise
for all s, a
57 Example of Accumulating Eligibility [Figure: an accumulating eligibility trace — the trace is incremented at each visit to a state and decays between visits]
58 Updating State-Action Pair Reward Estimates At time t we measure the prediction error by the TD error: δ_t = r_{t+1} + γQ_t(s_{t+1}, a_{t+1}) − Q_t(s_t, a_t). At time t, we adjust a state-action pair's prediction based on the error δ_t and its eligibility e_t(s, a): Q_{t+1}(s, a) = Q_t(s, a) + α δ_t e_t(s, a)
59 Can think of it as follows (see next slide): we are moving along through the state-action pairs; at each step we get a reward; we compute the error δ_t; we shout back the error to the pairs we have visited in the past
60 Backward View of Eligibility Traces [Figure: the TD error δ_t computed at s_t is shouted back to earlier states s_{t−1}, s_{t−2}, s_{t−3}, each weighted by its eligibility e_t]
61 TD(λ): Bridge Between Monte Carlo and TD Methods Consider what happens when we set λ = 0:
e_t(s, a) = 1   if s = s_t and a = a_t
e_t(s, a) = 0   otherwise
for all s, a. Only the current state will be eligible for update, so we only update the current step; this is analogous to our previous TD approaches, called TD(0)
62 As λ increases, but is still < 1: more of the preceding pairs are changed; eligibility falls by γλ per step. Consider what happens when we set λ = 1:
e_t(s, a) = γ e_{t−1}(s, a) + 1   if s = s_t and a = a_t
e_t(s, a) = γ e_{t−1}(s, a)        otherwise
for all s, a. Eligibility falls by just γ per step; this is what we had for Monte Carlo
63 Backup Diagram for TD(λ) What would the backup diagram for TD(λ) look like? Look back in the sequence of pairs and consider how a particular pair (s, a) is updated: when we take the next step, we'll update (s, a) based on the reward we got (1-step); when we take the next step, we will again update (s, a), but this time the eligibility will be less (2-step); when we take the next step, we will again update (s, a), but this time the eligibility will be less still (3-step); and so on. So, the update to (s, a) can be thought of as an average of n-step returns, i.e. the average of the returns we got after 1 step, 2 steps, 3 steps, ...
64 This is the forward view of TD(λ) (see next slide). We can show this average over n-step returns graphically, like the previous backup diagrams: a bar over the top indicates an average of returns; the weights for each element of the average appear at the bottom (see following slide)
65 Forward View of Eligibility Traces [Figure: from state s_t, looking forward in time to future states s_{t+1}, s_{t+2}, s_{t+3}, ... and rewards r_{t+1}, r_{t+2}, r_{t+3}, ..., r_T]
66 TD(λ)'s Backup Diagram [Figure: the TD(λ) / λ-return backup diagram — n-step returns weighted (1 − λ), (1 − λ)λ, (1 − λ)λ², ..., with weight λ^{T−t−1} on the final return; the weights sum to 1]
67 Weights Given to Returns [Figure: the weight (1 − λ)λ^{n−1} given to the n-step return decays by λ per step; the total area is 1; the remaining weight λ^{T−t−1} is given to the actual, final return]
68 Sarsa(λ)'s Backup Diagram [Figure: like the TD(λ) diagram but over state-action pairs, rooted at (s_t, a_t), with weights (1 − λ), (1 − λ)λ, (1 − λ)λ², ..., λ^{T−t−1}, ending at s_T]
69 Sarsa(λ) Algorithm: Sarsa(λ)
1. Initialize Q(s, a) arbitrarily, and e(s, a) = 0 for all s and a
2. Repeat (for each episode): (a) Initialize s and a (b) Repeat (for each step of episode) until s is terminal: i. Take action a, observe reward r and next state s' ii. Choose a' given s' using a policy derived from Q (ε-greedy) iii. δ ← r + γQ(s', a') − Q(s, a) iv. e(s, a) ← e(s, a) + 1 v. For all s, a: A. Q(s, a) ← Q(s, a) + αδe(s, a) B. e(s, a) ← γλe(s, a) vi. s ← s'; a ← a'
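The Sarsa(λ) algorithm above maps almost line-for-line to code. This sketch uses a hypothetical 4-state corridor (start 0, goal 3, actions L/R, reward −1 per step — an invented test environment, not from the lecture); the key difference from one-step Sarsa is that each TD error is broadcast to every pair in proportion to its accumulating trace:

```python
import random

# Sketch: Sarsa(lambda) with accumulating traces on a hypothetical corridor.
GAMMA, LAM, ALPHA, EPS = 1.0, 0.9, 0.1, 0.1
ACTIONS = ("L", "R")

def step(s, a):
    return (max(0, s - 1) if a == "L" else min(3, s + 1)), -1.0

def choose(Q, s, rng):
    if rng.random() < EPS:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

def sarsa_lambda(episodes=500, seed=3):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(4) for a in ACTIONS}
    for _ in range(episodes):
        e = {k: 0.0 for k in Q}            # traces reset at the start of each episode
        s = 0
        a = choose(Q, s, rng)
        while s != 3:
            s2, r = step(s, a)
            a2 = choose(Q, s2, rng) if s2 != 3 else None
            q_next = Q[(s2, a2)] if a2 is not None else 0.0
            delta = r + GAMMA * q_next - Q[(s, a)]
            e[(s, a)] += 1.0               # accumulating trace for the visited pair
            for k in Q:                    # broadcast delta to all eligible pairs
                Q[k] += ALPHA * delta * e[k]
                e[k] *= GAMMA * LAM        # decay every trace by gamma*lambda
            s, a = s2, a2
    return Q

Q = sarsa_lambda()
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(3)}
```

The inner loop over all pairs is O(|S||A|) per step; practical implementations often track only the pairs with non-negligible traces.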
70 Example: Sarsa(λ) and Gridworld [Figure: the path taken through the gridworld; the action values increased by one-step Sarsa; and the action values increased by Sarsa(λ) with λ = 0.9]
71 Integrating Planning, Acting and Learning The previous techniques will eventually lead to an optimal policy without knowing Pr and R ahead of time, but they can take a long time. What if the system not only learned a policy, but also learned a model of the world?
72 Relationships Among Planning, Acting and Learning [Figure: a loop relating value/policy, experience, and model — planning: model → value/policy; direct RL: experience → value/policy; acting: value/policy → experience; model learning: experience → model]
73 Certainty Equivalence Algorithm: 1. Learn Pr and R by exploring the world 2. Then learn the optimal policy (e.g. through policy iteration). Disadvantages: arbitrary decision between learning and acting; random exploration could be dangerous and inefficient; problems if the environment changes after learning
74 Dyna Dyna: combines learning and policy generation Uses experience to learn Pr and R Uses experience to adjust policy Uses model to adjust policy 73
75 Dyna Architecture: Integrated Planning, Acting and Learning [Figure: real experience from the environment drives direct RL updates to the policy/value functions and model learning; the model produces simulated experience, selected by search control, for planning updates]
76 Dyna
1. Loop forever (a) Get experience (s, a, r, s') from the world (b) Update model: i. Update Pr by increasing the statistic for the transition from s to s' with action a ii. Update the statistic for receiving reward r for taking action a in s (c) Update Q from experience: i. Q(s, a) ← R(s, a) + γ Σ_{s'} Pr(s, a, s') max_{a'} Q(s', a') (d) Update Q from model: i. Repeat N times: A. s ← random previously observed state B. a ← random action previously taken in s C. Q(s, a) ← R(s, a) + γ Σ_{s'} Pr(s, a, s') max_{a'} Q(s', a') (e) Choose next action to perform
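The Dyna loop above can be sketched as Dyna-Q, a common variant in which the real-experience update is a sample-based Q-learning step rather than a full expected backup, and the model simply memorizes deterministic transitions. The environment here is a hypothetical 4-state corridor (start 0, goal 3, reward −1 per step), invented for illustration:

```python
import random

# Sketch: Dyna-Q on a hypothetical corridor — direct RL + model learning + planning.
GAMMA, ALPHA, EPS, N = 0.95, 0.5, 0.1, 10
ACTIONS = ("L", "R")

def step(s, a):
    return (max(0, s - 1) if a == "L" else min(3, s + 1)), -1.0

def dyna_q(episodes=100, seed=4):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(4) for a in ACTIONS}
    model = {}                         # (s, a) -> (r, s'); deterministic world assumed
    for _ in range(episodes):
        s = 0
        while s != 3:
            if rng.random() < EPS:
                a = rng.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda x: Q[(s, x)])
            s2, r = step(s, a)
            # (c) direct RL update from real experience
            best = 0.0 if s2 == 3 else max(Q[(s2, x)] for x in ACTIONS)
            Q[(s, a)] += ALPHA * (r + GAMMA * best - Q[(s, a)])
            # (b) model learning: memorize the observed transition
            model[(s, a)] = (r, s2)
            # (d) planning: N simulated updates replayed from the model
            for _ in range(N):
                (ps, pa), (pr, ps2) = rng.choice(list(model.items()))
                pbest = 0.0 if ps2 == 3 else max(Q[(ps2, x)] for x in ACTIONS)
                Q[(ps, pa)] += ALPHA * (pr + GAMMA * pbest - Q[(ps, pa)])
            s = s2
    return Q

Q = dyna_q()
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(3)}
```

Each real step thus funds N extra "free" updates from remembered transitions, which is why more planning steps need fewer real episodes, as the performance slide shows.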
77 Performance of Dyna [Figure: a maze from S to G and learning curves of steps per episode vs. episodes for 0 planning steps (direct RL only), 5 planning steps, and 50 planning steps]
78 Snapshot of Dyna's Policies [Figure: policies found on the maze from S to G, without planning (N = 0) and with planning (N = 50)]
79 Blocked Maze [Figure: a maze whose path from S to G is blocked partway through learning; cumulative reward over time steps for Dyna-Q+, Dyna-Q, and Dyna-AC]
80 Shortcut Maze [Figure: a maze in which a shortcut from S to G opens up partway through learning; cumulative reward over time steps for Dyna-Q+, Dyna-Q, and Dyna-AC]
81 Bibliography 80
82 Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
COMP417 Introduction to Robotics and Intelligent Systems Reinforcement Learning - 2 Speaker: Sandeep Manjanna Acklowledgement: These slides use material from Pieter Abbeel s, Dan Klein s and John Schulman
More informationSequential Decision Making
Sequential Decision Making Dynamic programming Christos Dimitrakakis Intelligent Autonomous Systems, IvI, University of Amsterdam, The Netherlands March 18, 2008 Introduction Some examples Dynamic programming
More informationMarkov Decision Processes: Making Decision in the Presence of Uncertainty. (some of) R&N R&N
Markov Decision Processes: Making Decision in the Presence of Uncertainty (some of) R&N 16.1-16.6 R&N 17.1-17.4 Different Aspects of Machine Learning Supervised learning Classification - concept learning
More informationMaking Decisions. CS 3793 Artificial Intelligence Making Decisions 1
Making Decisions CS 3793 Artificial Intelligence Making Decisions 1 Planning under uncertainty should address: The world is nondeterministic. Actions are not certain to succeed. Many events are outside
More informationDecision Theory: Value Iteration
Decision Theory: Value Iteration CPSC 322 Decision Theory 4 Textbook 9.5 Decision Theory: Value Iteration CPSC 322 Decision Theory 4, Slide 1 Lecture Overview 1 Recap 2 Policies 3 Value Iteration Decision
More informationMarkov Decision Processes
Markov Decision Processes Ryan P. Adams COS 324 Elements of Machine Learning Princeton University We now turn to a new aspect of machine learning, in which agents take actions and become active in their
More informationThe Agent-Environment Interface Goals, Rewards, Returns The Markov Property The Markov Decision Process Value Functions Optimal Value Functions
The Agent-Environment Interface Goals, Rewards, Returns The Markov Property The Markov Decision Process Value Functions Optimal Value Functions Optimality and Approximation Finite MDP: {S, A, R, p, γ}
More informationReinforcement Learning
Reinforcement Learning MDP March May, 2013 MDP MDP: S, A, P, R, γ, µ State can be partially observable: Partially Observable MDPs () Actions can be temporally extended: Semi MDPs (SMDPs) and Hierarchical
More informationQ1. [?? pts] Search Traces
CS 188 Spring 2010 Introduction to Artificial Intelligence Midterm Exam Solutions Q1. [?? pts] Search Traces Each of the trees (G1 through G5) was generated by searching the graph (below, left) with a
More informationLecture 2: Making Good Sequences of Decisions Given a Model of World. CS234: RL Emma Brunskill Winter 2018
Lecture 2: Making Good Sequences of Decisions Given a Model of World CS234: RL Emma Brunskill Winter 218 Human in the loop exoskeleton work from Steve Collins lab Class Structure Last Time: Introduction
More informationChapter 6: Temporal Difference Learning
Chapter 6: emporal Difference Learning Objectives of this chapter: Introduce emporal Difference (D) learning Focus first on policy evaluation, or prediction, methods hen extend to control methods by following
More informationTDT4171 Artificial Intelligence Methods
TDT47 Artificial Intelligence Methods Lecture 7 Making Complex Decisions Norwegian University of Science and Technology Helge Langseth IT-VEST 0 helgel@idi.ntnu.no TDT47 Artificial Intelligence Methods
More informationReinforcement Learning
Reinforcement Learning Model-based RL and Integrated Learning-Planning Planning and Search, Model Learning, Dyna Architecture, Exploration-Exploitation (many slides from lectures of Marc Toussaint & David
More informationMarkov Decision Processes. Lirong Xia
Markov Decision Processes Lirong Xia Today ØMarkov decision processes search with uncertain moves and infinite space ØComputing optimal policy value iteration policy iteration 2 Grid World Ø The agent
More informationCS885 Reinforcement Learning Lecture 3b: May 9, 2018
CS885 Reinforcement Learning Lecture 3b: May 9, 2018 Intro to Reinforcement Learning [SutBar] Sec. 5.1-5.3, 6.1-6.3, 6.5, [Sze] Sec. 3.1, 4.3, [SigBuf] Sec. 2.1-2.5, [RusNor] Sec. 21.1-21.3, CS885 Spring
More informationMarkov Decision Processes
Markov Decision Processes Robert Platt Northeastern University Some images and slides are used from: 1. CS188 UC Berkeley 2. AIMA 3. Chris Amato Stochastic domains So far, we have studied search Can use
More informationTo earn the extra credit, one of the following has to hold true. Please circle and sign.
CS 188 Fall 2018 Introduction to Artificial Intelligence Practice Midterm 1 To earn the extra credit, one of the following has to hold true. Please circle and sign. A I spent 2 or more hours on the practice
More informationMDPs and Value Iteration 2/20/17
MDPs and Value Iteration 2/20/17 Recall: State Space Search Problems A set of discrete states A distinguished start state A set of actions available to the agent in each state An action function that,
More informationCS 188 Fall Introduction to Artificial Intelligence Midterm 1. ˆ You have approximately 2 hours and 50 minutes.
CS 188 Fall 2013 Introduction to Artificial Intelligence Midterm 1 ˆ You have approximately 2 hours and 50 minutes. ˆ The exam is closed book, closed notes except your one-page crib sheet. ˆ Please use
More informationCS 234 Winter 2019 Assignment 1 Due: January 23 at 11:59 pm
CS 234 Winter 2019 Assignment 1 Due: January 23 at 11:59 pm For submission instructions please refer to website 1 Optimal Policy for Simple MDP [20 pts] Consider the simple n-state MDP shown in Figure
More informationCS221 / Autumn 2018 / Liang. Lecture 8: MDPs II
CS221 / Autumn 218 / Liang Lecture 8: MDPs II cs221.stanford.edu/q Question If you wanted to go from Orbisonia to Rockhill, how would you get there? ride bus 1 ride bus 17 ride the magic tram CS221 / Autumn
More informationCS221 / Spring 2018 / Sadigh. Lecture 8: MDPs II
CS221 / Spring 218 / Sadigh Lecture 8: MDPs II cs221.stanford.edu/q Question If you wanted to go from Orbisonia to Rockhill, how would you get there? ride bus 1 ride bus 17 ride the magic tram CS221 /
More informationReinforcement Learning
Reinforcement Learning n-step bootstrapping Daniel Hennes 12.06.2017 University Stuttgart - IPVS - Machine Learning & Robotics 1 n-step bootstrapping Unifying Monte Carlo and TD n-step TD n-step Sarsa
More informationTemporal Abstraction in RL
Temporal Abstraction in RL How can an agent represent stochastic, closed-loop, temporally-extended courses of action? How can it act, learn, and plan using such representations? HAMs (Parr & Russell 1998;
More informationLecture 7: MDPs I. Question. Course plan. So far: search problems. Uncertainty in the real world
Lecture 7: MDPs I cs221.stanford.edu/q Question How would you get to Mountain View on Friday night in the least amount of time? bike drive Caltrain Uber/Lyft fly CS221 / Spring 2018 / Sadigh CS221 / Spring
More informationCS221 / Spring 2018 / Sadigh. Lecture 7: MDPs I
CS221 / Spring 2018 / Sadigh Lecture 7: MDPs I cs221.stanford.edu/q Question How would you get to Mountain View on Friday night in the least amount of time? bike drive Caltrain Uber/Lyft fly CS221 / Spring
More informationMarkov Decision Process
Markov Decision Process Human-aware Robotics 2018/02/13 Chapter 17.3 in R&N 3rd Ø Announcement: q Slides for this lecture are here: http://www.public.asu.edu/~yzhan442/teaching/cse471/lectures/mdp-ii.pdf
More informationMaking Complex Decisions
Ch. 17 p.1/29 Making Complex Decisions Chapter 17 Ch. 17 p.2/29 Outline Sequential decision problems Value iteration algorithm Policy iteration algorithm Ch. 17 p.3/29 A simple environment 3 +1 p=0.8 2
More informationAn Electronic Market-Maker
massachusetts institute of technology artificial intelligence laboratory An Electronic Market-Maker Nicholas Tung Chan and Christian Shelton AI Memo 21-5 April 17, 21 CBCL Memo 195 21 massachusetts institute
More informationProbabilistic Robotics: Probabilistic Planning and MDPs
Probabilistic Robotics: Probabilistic Planning and MDPs Slide credits: Wolfram Burgard, Dieter Fox, Cyrill Stachniss, Giorgio Grisetti, Maren Bennewitz, Christian Plagemann, Dirk Haehnel, Mike Montemerlo,
More informationSupplementary Material: Strategies for exploration in the domain of losses
1 Supplementary Material: Strategies for exploration in the domain of losses Paul M. Krueger 1,, Robert C. Wilson 2,, and Jonathan D. Cohen 3,4 1 Department of Psychology, University of California, Berkeley
More informationEnsemble Methods for Reinforcement Learning with Function Approximation
Ensemble Methods for Reinforcement Learning with Function Approximation Stefan Faußer and Friedhelm Schwenker Institute of Neural Information Processing, University of Ulm, 89069 Ulm, Germany {stefan.fausser,friedhelm.schwenker}@uni-ulm.de
More informationReasoning with Uncertainty
Reasoning with Uncertainty Markov Decision Models Manfred Huber 2015 1 Markov Decision Process Models Markov models represent the behavior of a random process, including its internal state and the externally
More informationMonte-Carlo Planning: Introduction and Bandit Basics. Alan Fern
Monte-Carlo Planning: Introduction and Bandit Basics Alan Fern 1 Large Worlds We have considered basic model-based planning algorithms Model-based planning: assumes MDP model is available Methods we learned
More informationDeep RL and Controls Homework 1 Spring 2017
10-703 Deep RL and Controls Homework 1 Spring 2017 February 1, 2017 Due February 17, 2017 Instructions You have 15 days from the release of the assignment until it is due. Refer to gradescope for the exact
More informationIntroduction to Fall 2011 Artificial Intelligence Midterm Exam
CS 188 Introduction to Fall 2011 Artificial Intelligence Midterm Exam INSTRUCTIONS You have 3 hours. The exam is closed book, closed notes except a one-page crib sheet. Please use non-programmable calculators
More informationMonte-Carlo Planning Look Ahead Trees. Alan Fern
Monte-Carlo Planning Look Ahead Trees Alan Fern 1 Monte-Carlo Planning Outline Single State Case (multi-armed bandits) A basic tool for other algorithms Monte-Carlo Policy Improvement Policy rollout Policy
More informationMonte-Carlo Planning: Introduction and Bandit Basics. Alan Fern
Monte-Carlo Planning: Introduction and Bandit Basics Alan Fern 1 Large Worlds We have considered basic model-based planning algorithms Model-based planning: assumes MDP model is available Methods we learned
More informationMDP Algorithms. Thomas Keller. June 20, University of Basel
MDP Algorithms Thomas Keller University of Basel June 20, 208 Outline of this lecture Markov decision processes Planning via determinization Monte-Carlo methods Monte-Carlo Tree Search Heuristic Search
More informationCS360 Homework 14 Solution
CS360 Homework 14 Solution Markov Decision Processes 1) Invent a simple Markov decision process (MDP) with the following properties: a) it has a goal state, b) its immediate action costs are all positive,
More informationCEC login. Student Details Name SOLUTIONS
Student Details Name SOLUTIONS CEC login Instructions You have roughly 1 minute per point, so schedule your time accordingly. There is only one correct answer per question. Good luck! Question 1. Searching
More informationEE266 Homework 5 Solutions
EE, Spring 15-1 Professor S. Lall EE Homework 5 Solutions 1. A refined inventory model. In this problem we consider an inventory model that is more refined than the one you ve seen in the lectures. The
More informationThe exam is closed book, closed calculator, and closed notes except your three crib sheets.
CS 188 Spring 2016 Introduction to Artificial Intelligence Final V2 You have approximately 2 hours and 50 minutes. The exam is closed book, closed calculator, and closed notes except your three crib sheets.
More informationCS 188 Fall Introduction to Artificial Intelligence Midterm 1. ˆ You have approximately 2 hours and 50 minutes.
CS 188 Fall 2013 Introduction to Artificial Intelligence Midterm 1 ˆ You have approximately 2 hours and 50 minutes. ˆ The exam is closed book, closed notes except your one-page crib sheet. ˆ Please use
More informationThe exam is closed book, closed calculator, and closed notes except your one-page crib sheet.
CS 188 Spring 2015 Introduction to Artificial Intelligence Midterm 1 You have approximately 2 hours and 50 minutes. The exam is closed book, closed calculator, and closed notes except your one-page crib
More informationIntroduction to Fall 2007 Artificial Intelligence Final Exam
NAME: SID#: Login: Sec: 1 CS 188 Introduction to Fall 2007 Artificial Intelligence Final Exam You have 180 minutes. The exam is closed book, closed notes except a two-page crib sheet, basic calculators
More informationCS 188 Fall Introduction to Artificial Intelligence Midterm 1
CS 188 Fall 2018 Introduction to Artificial Intelligence Midterm 1 You have 120 minutes. The time will be projected at the front of the room. You may not leave during the last 10 minutes of the exam. Do
More informationTemporal Abstraction in RL. Outline. Example. Markov Decision Processes (MDPs) ! Options
Temporal Abstraction in RL Outline How can an agent represent stochastic, closed-loop, temporally-extended courses of action? How can it act, learn, and plan using such representations?! HAMs (Parr & Russell
More informationCPS 270: Artificial Intelligence Markov decision processes, POMDPs
CPS 270: Artificial Intelligence http://www.cs.duke.edu/courses/fall08/cps270/ Markov decision processes, POMDPs Instructor: Vincent Conitzer Warmup: a Markov process with rewards We derive some reward
More informationMarkov Decision Processes (MDPs) CS 486/686 Introduction to AI University of Waterloo
Markov Decision Processes (MDPs) CS 486/686 Introduction to AI University of Waterloo Outline Sequential Decision Processes Markov chains Highlight Markov property Discounted rewards Value iteration Markov
More informationImportance Sampling for Fair Policy Selection
Importance Sampling for Fair Policy Selection Shayan Doroudi Carnegie Mellon University Pittsburgh, PA 15213 shayand@cs.cmu.edu Philip S. Thomas Carnegie Mellon University Pittsburgh, PA 15213 philipt@cs.cmu.edu
More informationFinal exam solutions
EE365 Stochastic Control / MS&E251 Stochastic Decision Models Profs. S. Lall, S. Boyd June 5 6 or June 6 7, 2013 Final exam solutions This is a 24 hour take-home final. Please turn it in to one of the
More informationLecture 7: Bayesian approach to MAB - Gittins index
Advanced Topics in Machine Learning and Algorithmic Game Theory Lecture 7: Bayesian approach to MAB - Gittins index Lecturer: Yishay Mansour Scribe: Mariano Schain 7.1 Introduction In the Bayesian approach
More informationThe Option-Critic Architecture
The Option-Critic Architecture Pierre-Luc Bacon, Jean Harb, Doina Precup Reasoning and Learning Lab McGill University, Montreal, Canada AAAI 2017 Intelligence: the ability to generalize and adapt efficiently
More information16 MAKING SIMPLE DECISIONS
247 16 MAKING SIMPLE DECISIONS Let us associate each state S with a numeric utility U(S), which expresses the desirability of the state A nondeterministic action A will have possible outcome states Result
More informationThe exam is closed book, closed calculator, and closed notes except your one-page crib sheet.
CS 188 Spring 2016 Introduction to Artificial Intelligence Midterm V2 You have approximately 2 hours and 50 minutes. The exam is closed book, closed calculator, and closed notes except your one-page crib
More informationA selection of MAS learning techniques based on RL
A selection of MAS learning techniques based on RL Ann Nowé 14/11/12 Herhaling titel van presentatie 1 Content Single stage setting Common interest (Claus & Boutilier, Kapetanakis&Kudenko) Conflicting
More information1 Dynamic programming
1 Dynamic programming A country has just discovered a natural resource which yields an income per period R measured in terms of traded goods. The cost of exploitation is negligible. The government wants
More informationSequential Coalition Formation for Uncertain Environments
Sequential Coalition Formation for Uncertain Environments Hosam Hanna Computer Sciences Department GREYC - University of Caen 14032 Caen - France hanna@info.unicaen.fr Abstract In several applications,
More informationAdaptive Experiments for Policy Choice. March 8, 2019
Adaptive Experiments for Policy Choice Maximilian Kasy Anja Sautmann March 8, 2019 Introduction The goal of many experiments is to inform policy choices: 1. Job search assistance for refugees: Treatments:
More informationThe Irrevocable Multi-Armed Bandit Problem
The Irrevocable Multi-Armed Bandit Problem Ritesh Madan Qualcomm-Flarion Technologies May 27, 2009 Joint work with Vivek Farias (MIT) 2 Multi-Armed Bandit Problem n arms, where each arm i is a Markov Decision
More information16 MAKING SIMPLE DECISIONS
253 16 MAKING SIMPLE DECISIONS Let us associate each state S with a numeric utility U(S), which expresses the desirability of the state A nondeterministic action a will have possible outcome states Result(a)
More informationEfficiency and Herd Behavior in a Signalling Market. Jeffrey Gao
Efficiency and Herd Behavior in a Signalling Market Jeffrey Gao ABSTRACT This paper extends a model of herd behavior developed by Bikhchandani and Sharma (000) to establish conditions for varying levels
More information