CS 360: Advanced Artificial Intelligence Class #16: Reinforcement Learning
1 CS 360: Advanced Artificial Intelligence Class #16: Reinforcement Learning Daniel M. Gaines Note: content for slides adapted from Sutton and Barto [1998]
2 Introduction Animals learn through interaction with an environment: an infant learning motor skills, a pigeon learning to press a particular lever, a dog learning commands
3 Operant Conditioning 1 Much of the motivation behind RL comes from the work of behaviorist B. F. Skinner. Operant Conditioning is a form of behavior modification based on the Law of Effect: behavior followed by a pleasurable consequence tends to be repeated; behavior followed by an unpleasant consequence tends to be eliminated. Thus (according to OC) behavior is a function of its consequences. "The rat is always right" (B. F. Skinner), meaning a creature's behavior is rational given its reinforcement history. 1 Slides adapted from mschnake/reinf.htm
4 Types of Reinforcement Positive Reinforcement: a pleasurable consequence administered after a desired behavior; strengthens behavior (e.g. praising a dog after it performs a trick). Extinction: withholding a previously administered positive reinforcement following an undesirable behavior; reduces behavior (e.g. imposing an early curfew on a child who stayed out too late). Punishment: an unpleasant consequence administered following an undesirable behavior; reduces behavior (e.g. a choke chain for a dog). Negative Reinforcement: withholding an unpleasant consequence following a correct behavior; strengthens behavior (e.g. a boxer learning to block a jab)
5 Reinforcement Schedules Continuous Reinforcement: every behavior is reinforced. Partial Reinforcement: not every behavior is reinforced: Fixed interval: a fixed time interval passes between reinforcements. Variable interval: the time interval varies between a min and max. A continuous schedule results in faster learning but also the fastest extinction if a reinforcement is missed. A variable schedule is most effective for developing more permanent behavior
6 The Skinner Box Skinner theorized that our behavior is programmed by the reinforcement schedule delivered by our world. The Skinner Box was a way to demonstrate this possibility: Skinner programmed these boxes to deliver reinforcements given various actions of their inhabitants, and observed how those actions changed over time
7 Components of reinforcement learning Agent: the decision maker/learner interacting with the environment; the agent selects actions. Environment: the thing that the agent interacts with; it responds to actions, updates the situation (i.e. state), and provides reward. The state signal is assumed to have the Markov property: 1. it should summarize relevant past experiences 2. Pr(S_{t+1} | S_t, S_{t-1}, ..., S_0) = Pr(S_{t+1} | S_t)
8 Agent-Environment Interaction [Figure: the agent-environment loop — at time t the agent observes state s_t and reward r_t and selects action a_t; the environment returns reward r_{t+1} and next state s_{t+1}]
9 Overview of Class We'll discuss three main approaches to solving RL problems: dynamic programming, Monte Carlo methods, and temporal difference methods. We'll talk about integrating planning and learning. And finally, approaches for addressing some weaknesses of these methods
10 Examples Pick-and-Place Robot: the robot should learn fast, smooth motions for picking up objects and placing them in an appropriate location. actions: voltages applied to each motor in the joints. states: joint angles and velocities. reward: +1 for each object successfully picked and placed. punishment: can give a small negative reward as a function of jerkiness of motion
11 Recycling Robot: a mobile robot collecting cans in an office environment. actions: choose among actions based on battery level: 1. actively search for cans for some period of time 2. remain stationary, waiting for someone to bring a can 3. head back home for recharging. states: battery level. reward: positive for successfully depositing a can. punishment: negative for allowing the battery level to drop to empty. note: in this example, the RL agent is not the entire robot: 1. it monitors the condition of the robot itself (i.e. battery level) 2. the environment includes the rest of the robot
12 Dynamic Programming Methods for RL If we have an MDP modeling the system, we can define value and action-value functions. Value function: expresses the relationship between the value of a state and the value of its successors: V^π(s) = R(s) + γ Σ_{s'∈S} Pr(s, π(s), s') V^π(s'). Action-Value function: expresses the relationship between a state, action pair and the value of the states that can be reached from that state using that action: Q^π(s, a) = R(s, a) + γ Σ_{s'∈S} Pr(s, a, s') Q^π(s', π(s'))
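The policy-evaluation equation above is a fixed point that can be reached by simple iteration. The sketch below (the two-state MDP, its rewards, and the policy are invented for illustration, not from the lecture) repeatedly applies V(s) ← R(s) + γ Σ_{s'} Pr(s, π(s), s') V(s'):

```python
# Sketch: iterative evaluation of a fixed policy on a toy two-state MDP.
# States, transitions, rewards, and the policy are hypothetical.
GAMMA = 0.9

# Pr[(s, a)] -> list of (next_state, probability); R[s] -> immediate reward
Pr = {
    ("s0", "go"): [("s1", 1.0)],
    ("s1", "go"): [("s0", 0.5), ("s1", 0.5)],
}
R = {"s0": 0.0, "s1": 1.0}
policy = {"s0": "go", "s1": "go"}

def evaluate_policy(policy, sweeps=200):
    """Synchronous sweeps of V(s) <- R(s) + gamma * sum_s' Pr(s'|s,pi(s)) V(s')."""
    V = {s: 0.0 for s in policy}
    for _ in range(sweeps):
        V = {s: R[s] + GAMMA * sum(p * V[s2] for s2, p in Pr[(s, policy[s])])
             for s in policy}
    return V

V = evaluate_policy(policy)
```

Because the update is a γ-contraction, 200 sweeps are far more than enough to reach the unique fixed point of this small system.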
13 Backup Diagrams [Figure: (a) backup diagram for V^π(s) — state s branches to actions a, then to rewards r and successor states s'; (b) backup diagram for Q^π(s, a) — the pair (s, a) branches to rewards r and successor states s']
14 Graphical representation of relationships among states, actions and values. Open circles represent states; solid circles represent actions. V^π(s): s takes its value from the states that can be reached with the available actions; a given action could reach more than one state (noisy actions). Q^π(s, a): the pair (s, a) takes its value from the states that can be reached with action a and then following π thereafter
15 Optimal Value Function [Figure: backup diagram for V* — a max over actions a, then an expectation over rewards r and successors s'] V*(s) = max_a [ R(s) + γ Σ_{s'∈S} Pr(s, a, s') V*(s') ]
16 Optimal Action-Value Function [Figure: backup diagram for Q* — the pair (s, a) branches to rewards r and successors s', then a max over a'] Q*(s, a) = R(s, a) + γ Σ_{s'∈S} Pr(s, a, s') max_{a'} Q*(s', a')
17 Dynamic Programming Methods for Solving MDPs Policy Iteration: iterates over two phases policy evaluation: update the value function to reflect the current policy makes the value function consistent with current policy policy improvement: revise the policy given the value function makes the policy greedy with respect to the current value function Value Iteration: uses an update function which combines one sweep of policy evaluation with policy improvement 16
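The value-iteration update described above can be sketched in a few lines. The 2-state, 2-action MDP here is hypothetical, chosen so that the fixed point is easy to check by hand (staying in s1 pays 1 per step, so V*(s1) = 1/(1−γ) = 10):

```python
# Sketch of value iteration on a hypothetical 2-state, 2-action MDP.
GAMMA = 0.9
STATES = ["s0", "s1"]
ACTIONS = ["stay", "move"]
# Pr[(s, a)] -> list of (next_state, probability); R[(s, a)] -> reward
Pr = {
    ("s0", "stay"): [("s0", 1.0)], ("s0", "move"): [("s1", 1.0)],
    ("s1", "stay"): [("s1", 1.0)], ("s1", "move"): [("s0", 1.0)],
}
R = {("s0", "stay"): 0.0, ("s0", "move"): 0.0,
     ("s1", "stay"): 1.0, ("s1", "move"): 0.0}

def value_iteration(theta=1e-8):
    """Sweep V(s) <- max_a [R(s,a) + gamma * sum_s' Pr(s'|s,a) V(s')] until stable."""
    V = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            v = max(R[(s, a)] + GAMMA * sum(p * V[s2] for s2, p in Pr[(s, a)])
                    for a in ACTIONS)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            break
    # Extract the greedy policy with respect to the converged value function.
    pi = {s: max(ACTIONS, key=lambda a: R[(s, a)] + GAMMA *
                 sum(p * V[s2] for s2, p in Pr[(s, a)]))
          for s in STATES}
    return V, pi

V, pi = value_iteration()
```

The single update combines one sweep of evaluation with improvement (the max over actions), exactly as the slide describes.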
18 Is This Learning? When the agent generates a policy, has it actually learned anything? If the agent already knows the transition matrices and the reward function, then how is computing a policy learning? If you are told the rules of chess, does that mean you can automatically play a perfect game? Computing a policy allows you to solve problems in less time; this is similar to building macro operators. This is often called speed-up learning
19 Monte Carlo Methods for RL What if the agent is not given the transition matrices and/or reward function? Can we still build an optimal policy? We learned motor skills, but do you know the transition matrices for your motor actions? The agent must learn entirely from experience, rather than from a model. Monte Carlo Methods: solve RL by averaging sampled returns; that is, we run some episodes and note the return values. episode: a sequence of state transitions ending in a terminal state, e.g. starting at the beginning of a maze and moving to the end. We update the estimated value of a state based on the average return we got for that state in the episodes
20 Monte Carlo Policy Evaluation Want to estimate V^π(s), i.e. what is the value of being in s using π. But we don't have Pr or R. Approach: 1. generate sample runs using π 2. observe the (discounted) returns obtained for s 3. compute the average
21 Monte Carlo Policy Evaluation Algorithm: Monte Carlo Policy Evaluation
1. Initialize (a) π ← policy to be evaluated (b) V ← arbitrary value function (c) Returns(s) ← empty list, for all s ∈ S
2. Repeat forever (a) Generate an episode using π (b) For each state s appearing in the episode: i. R ← return following the first occurrence of s (i.e. the discounted reward we eventually accumulate from s until the terminal state) ii. Append R to Returns(s) iii. V^π(s) ← average(Returns(s))
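The bookkeeping in the algorithm above (first-visit returns, appended to a per-state list and averaged) can be sketched on a tiny episodic chain. The chain environment (states 0..3, terminal at 3, reward 1 on reaching the terminal) is invented for illustration; because it is deterministic, the averages equal the exact discounted returns:

```python
# Sketch: first-visit Monte Carlo policy evaluation on a hypothetical
# deterministic chain (states 0..3, terminal at 3).
GAMMA = 0.9

def run_episode():
    """One episode under the fixed policy "always move right"."""
    s, traj = 0, []
    while s != 3:
        s2 = s + 1
        r = 1.0 if s2 == 3 else 0.0   # reward only on entering the terminal state
        traj.append((s, r))
        s = s2
    return traj

def mc_evaluate(episodes=10):
    returns = {s: [] for s in range(3)}
    for _ in range(episodes):
        traj = run_episode()
        rewards = [r for _, r in traj]
        seen = set()
        for i, (s, _) in enumerate(traj):
            if s in seen:
                continue                       # first-visit: count s only once
            seen.add(s)
            # Discounted return following the first occurrence of s.
            G = sum(GAMMA ** (k - i) * rewards[k] for k in range(i, len(rewards)))
            returns[s].append(G)
    return {s: sum(g) / len(g) for s, g in returns.items()}

V = mc_evaluate()
```

Here V(2) = 1, V(1) = γ = 0.9, V(0) = γ² = 0.81; in a stochastic environment the averages would only converge to these expectations over many episodes.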
22 Example: Playing Blackjack State features: value of the cards we have (we only care if greater than 11), value of the card the dealer is showing, whether or not we have an ace that can be counted as 11. Actions: hit, stay
23 What if we want to solve blackjack with a DP approach? We would need to know the probability of state transitions and the expected reward for each possible terminal state, e.g.: you stay at 14 — what is the expected reward given the dealer's showing card? In contrast, we can easily simulate blackjack games and average the returns
24 Example: Monte Carlo Value Estimation on Blackjack [Figure: estimated value functions after 10,000 and after 500,000 episodes, plotted over player sum and dealer showing card, with and without a usable ace]
25 Backup Diagram for Monte Carlo Policy Evaluation [Figure: backup diagram — an entire sampled episode, from the root state down to a terminal state]
26 Monte Carlo Estimation of Action Values With action models, the value function V is sufficient for generating an optimal policy: look ahead to the states that can be reached in one step given the possible actions. But we must have the transition matrices for the actions. Without a model, it is particularly useful to estimate the action-value function Q^π(s, a). Approach: similar to value function estimation: 1. generate sample runs 2. observe the (discounted) returns obtained for taking action a in state s 3. store the average in Q(s, a)
27 Exploitation vs. Exploration Problem with the above approach: if we have a deterministic policy, then we will always take the same action each time we enter s; if we want to choose among actions, we need to try various actions in each state. Conflict between exploitation and exploration: we would like to exploit the knowledge we have learned (e.g. you found a good path to the goal); we would like to explore, looking for improvements (e.g. is there a better path?)
28 Possibilities: exploring starts: pick a state-action pair as the start and then follow the policy thereafter; not always possible in real problems. stochastic policy: use a stochastic policy with a non-zero probability of selecting all actions. Monte Carlo ES (Exploring Starts): a method for estimating the value function and optimal policy together
29 Monte Carlo ES (Exploring Starts) Algorithm: Monte Carlo ES
1. Initialize, for all s ∈ S, a ∈ A: (a) Q(s, a) ← arbitrary (b) π(s) ← arbitrary (c) Returns(s, a) ← empty list
2. Repeat forever (a) Generate an episode using exploring starts and π (b) For each pair (s, a) appearing in the episode: i. R ← return following the first occurrence of (s, a) (i.e. the discounted reward we eventually accumulate after taking a in s, until the terminal state) ii. Append R to Returns(s, a) iii. Q(s, a) ← average(Returns(s, a)) (c) For each state s appearing in the episode: i. π(s) ← argmax_a Q(s, a)
30 Example: Monte Carlo ES on Blackjack [Figure: the learned optimal policy π* (HIT/STICK regions) and optimal value function V*, with and without a usable ace, plotted over player sum and dealer showing card]
31 On-Policy vs. Off-Policy Monte Carlo Control We would like to remove the assumption of exploring starts, but to converge on an optimal policy we still need all states and actions to be visited and selected infinitely often. Two types of approaches: on-policy: evaluate and/or improve the policy used to generate samples; need to make the policy soft: π(s, a) > 0 for all s ∈ S, a ∈ A. off-policy: the policy that is evaluated/improved is distinct from the one used to generate samples
32 ε-greedy Policies: An Approach to On-Policy Monte Carlo Control ε-greedy policy: choose the action with the highest estimated value with probability 1 − ε; choose a random action with probability ε. Thus, there is a nonzero probability of selecting each action from each state visited. ε-soft policy: π(s, a) ≥ ε / |A(s)|. An ε-greedy policy is a type of ε-soft policy
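The ε-greedy rule above is a one-liner in practice. In this sketch the Q table is invented; the point is just the two-branch selection rule and the resulting mix of greedy and random choices:

```python
import random

# Sketch of epsilon-greedy action selection over a tabular Q function.
# The Q values below are hypothetical.
def epsilon_greedy(Q, state, actions, epsilon, rng):
    """With probability epsilon pick uniformly at random; otherwise pick argmax_a Q."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

rng = random.Random(0)
Q = {("s", "left"): 0.2, ("s", "right"): 1.0}
counts = {"left": 0, "right": 0}
for _ in range(1000):
    counts[epsilon_greedy(Q, "s", ["left", "right"], 0.1, rng)] += 1
```

With ε = 0.1, "right" is chosen about 95% of the time (90% greedy plus half of the 10% random picks), so every action keeps a nonzero selection probability, as the ε-soft condition requires.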
33 Can use the Monte Carlo control algorithm from above augmented with an ε-greedy policy: start with an ε-soft policy and move it towards an ε-greedy policy. Can show that it will converge on an ε-soft policy that is at least as good as all other ε-soft policies. proof idea: fold the ε random action selection into the noise model of the environment; the best one can do in this new environment with general policies is the same as the best one can do in the original environment with ε-soft policies
34 Off-Policy Monte Carlo Control It is also possible to optimize one policy π while gathering samples with a distinct policy π'. behavior policy: the policy used to sample the space, in this case π'. estimation policy: the policy to be evaluated and improved, in this case π
35 It is still possible to learn the optimal policy: the behavior policy must be soft; given an episode generated with π', we need to take into account the probability of that episode relative to π; we can take the ratio of the probability of that episode occurring with π and that episode occurring with π'. For details see Sutton and Barto [1998]
36 Temporal Difference (TD) Methods Monte Carlo methods require us to reach the end of an episode before we can update the estimates; only then is the return known. TD methods provide a useful update after each step, using the instantaneous reward we receive
37 Example TD approach: given state s, take a single step to s'; observe the reward received going to s'; update the estimate of the value of s (V(s)). A stepsize α is used to determine how much to adjust the current estimate given the new sample. TD combines aspects of DP and Monte Carlo. like Monte Carlo: we don't know Pr and R ahead of time; update based on a sample backup, not a full backup. sample backup: we only consider one next state s' (the one that was sampled). full backup: we consider all possible next states that could be reached (this is what DP does). like DP: update based on a single step
38 Temporal Difference Approach to Estimating V^π Algorithm: TD Value Estimation
1. Initialize (a) V(s) ← arbitrary (b) π(s) ← policy to be evaluated
2. Repeat (for each episode): (a) Initialize s to the start state (b) Repeat (for each step of episode): i. a ← action taken by π given s ii. Take action a, observe reward r and next state s' iii. V(s) ← V(s) + α [r + γV(s') − V(s)] iv. s ← s'
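The TD(0) update above can be sketched on the classic random-walk evaluation task (five non-terminal states, terminals at both ends, reward 1 on the right exit; this test problem is from the RL literature, not from the slide itself). The true values are s/6 for states s = 1..5:

```python
import random

# Sketch: TD(0) evaluation of the uniform-random policy on a 5-state random walk.
GAMMA, ALPHA = 1.0, 0.1

def td0(episodes=5000, seed=0):
    rng = random.Random(seed)
    V = {s: 0.0 for s in range(1, 6)}     # non-terminal states 1..5; 0 and 6 terminal
    for _ in range(episodes):
        s = 3                              # start in the middle
        while s not in (0, 6):
            s2 = s + rng.choice((-1, 1))   # random policy: step left or right
            r = 1.0 if s2 == 6 else 0.0    # reward only on the right exit
            v_next = 0.0 if s2 in (0, 6) else V[s2]
            # The TD(0) update: V(s) <- V(s) + alpha [r + gamma V(s') - V(s)]
            V[s] += ALPHA * (r + GAMMA * v_next - V[s])
            s = s2
    return V

V = td0()
```

Unlike Monte Carlo, each transition produces an update immediately; with a constant stepsize the estimates hover near the true values rather than converging exactly.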
39 Backup Diagram for TD Value Estimation [Figure: backup diagram — a single sampled transition from s to one successor s']
40 Advantages of TD Compared to DP Pr and R do not have to be known Compared to Monte Carlo can learn on-line (i.e. do not have to wait until end of episode) 39
41 Using TD to Solve the Control Problem Objective: learn the action-value function Q(s, a). Two possibilities: on-policy and off-policy. We'll talk about the on-policy method first. Approach: generate episodes; for each state-action pair (s, a), observe the reward r received and the next state-action pair (s', a'); update the estimate of Q(s, a) based on the old estimate, r and Q(s', a')
42 Sarsa: the name of the approach stands for: state, action, reward, state, action, i.e. the five elements of the update
43 Sarsa: On-Policy TD Control Algorithm: Sarsa
1. Initialize Q(s, a) arbitrarily
2. Repeat (for each episode): (a) Initialize s to the start state (b) Choose a given s using a policy derived from Q (ε-greedy) (c) Repeat (for each step of episode) until s is terminal: i. Take action a, observe reward r and next state s' ii. Choose a' given s' using a policy derived from Q (ε-greedy) iii. Q(s, a) ← Q(s, a) + α [r + γQ(s', a') − Q(s, a)] iv. s ← s'; a ← a'
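A minimal tabular Sarsa can be sketched on a hypothetical 4-state corridor (start 0, goal 3, actions L/R, reward −1 per step — this environment is invented, not from the lecture). Note the update uses the action a' actually chosen by the ε-greedy policy, which is what makes it on-policy:

```python
import random

# Sketch: tabular Sarsa on a hypothetical 4-state corridor.
GAMMA, ALPHA, EPS = 1.0, 0.5, 0.1
ACTIONS = ("L", "R")

def step(s, a):
    """Deterministic corridor dynamics; every step costs -1."""
    return (max(0, s - 1) if a == "L" else min(3, s + 1)), -1.0

def choose(Q, s, rng):
    if rng.random() < EPS:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

def sarsa(episodes=500, seed=1):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(4) for a in ACTIONS}
    for _ in range(episodes):
        s = 0
        a = choose(Q, s, rng)
        while s != 3:
            s2, r = step(s, a)
            if s2 == 3:                     # terminal: no bootstrap term
                Q[(s, a)] += ALPHA * (r - Q[(s, a)])
                break
            a2 = choose(Q, s2, rng)         # the actual next action -> on-policy
            Q[(s, a)] += ALPHA * (r + GAMMA * Q[(s2, a2)] - Q[(s, a)])
            s, a = s2, a2
    return Q

Q = sarsa()
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(3)}
```

After training, the greedy policy moves right everywhere, and Q(2, R) approaches the true cost of −1 for the final step.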
44 Backup Diagram for Sarsa Ask students what they think the diagram would be 43
45 Q-Learning: Off-Policy TD Control Off-policy TD control is one of the most important breakthroughs of RL. Approach: use a behavior policy to generate samples; each sample includes s, a, r, s'; use the samples to update the estimate of Q*(s, a). update rule: Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]
46 In other words: in an episode, the expected reward for the state-action pair (s_t, a_t) is relative to the max expected reward over all actions that can be taken in s_{t+1}. The approach is called Q-Learning because we are learning the Q function
47 Q-Learning: Off-Policy TD Control Algorithm: Q-Learning
1. Initialize Q(s, a) arbitrarily
2. Repeat (for each episode): (a) Initialize s to the start state (b) Repeat (for each step of episode) until s is terminal: i. Choose a given s using a policy derived from Q (ε-greedy) ii. Take action a, observe reward r and next state s' iii. Q(s, a) ← Q(s, a) + α [ r + γ max_{a'} Q(s', a') − Q(s, a) ] iv. s ← s'
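Q-learning differs from Sarsa only in the update target: it bootstraps from max_a' Q(s', a') regardless of which action the behavior policy will actually take. The sketch below uses a hypothetical 4-state corridor (start 0, goal 3, reward −1 per step; with γ = 1 the optimal values are Q*(s, R) = −(3 − s)):

```python
import random

# Sketch: tabular Q-learning on a hypothetical 4-state corridor.
GAMMA, ALPHA, EPS = 1.0, 0.5, 0.2
ACTIONS = ("L", "R")

def step(s, a):
    """Deterministic corridor dynamics; every step costs -1."""
    return (max(0, s - 1) if a == "L" else min(3, s + 1)), -1.0

def q_learning(episodes=1000, seed=2):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(4) for a in ACTIONS}
    for _ in range(episodes):
        s = 0
        while s != 3:
            # Behavior policy: epsilon-greedy on the current Q.
            if rng.random() < EPS:
                a = rng.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda x: Q[(s, x)])
            s2, r = step(s, a)
            # Off-policy target: max over next actions, not the action taken next.
            best_next = 0.0 if s2 == 3 else max(Q[(s2, x)] for x in ACTIONS)
            Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
            s = s2
    return Q

Q = q_learning()
```

Even with a fairly random behavior policy (ε = 0.2), the estimates converge toward the optimal action values, illustrating the off-policy property.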
48 Backup Diagram for Q-Learning Ask students what they think the diagram would be 47
49 Why is Q-Learning Considered an Off-Policy Algorithm? Its update does not take the ε-greedy policy into account: it backs up the action that maximizes the Q function, but does not consider that that action might not always be the one selected
50 Sarsa Compared to Q-Learning [Figure: the cliff-walking gridworld (start S, goal G, "The Cliff" along the bottom edge) with r = −1 per step, showing the safe path found by Sarsa and the optimal path along the cliff edge found by Q-learning, plus a plot of reward per episode for Sarsa and Q-learning]
51 Notes on Example Q-Learning finds the optimal policy, assuming a deterministic policy. Sarsa finds the safer policy, which is appropriate for an ε-greedy approach. This demonstrates how Sarsa takes the behavior-generating policy into account (i.e. it is on-policy) and Q-Learning does not
52 Actor-Critic Methods [Figure: actor-critic architecture — the actor (policy) selects actions; the critic (value function) observes state and reward from the environment and sends a TD error to the actor]
53 Stores a separate policy (actor) and value function (critic): the actor uses the policy to take actions; the critic uses the value function to critique the actor's choices. The critic uses s_t and s_{t+1} to determine if things have gone as anticipated; that is, it computes a TD error given by: δ_t = r_{t+1} + γV(s_{t+1}) − V(s_t). The TD error can be used to evaluate the choice of action a in state s: positive: the tendency to select a in this state should be strengthened; negative: suggests the tendency to select a in this state should be weakened. This should sound familiar given the discussion we had to motivate RL. Note the similarity to the Law of Effect of Operant Conditioning
54 Credit Assignment Let's assume we have an estimated action-value function and are now trying to use the policy to solve a problem. At each step t we are choosing actions that we expect will lead us to a high payoff; Q(s_t, a_t) predicts the (discounted) reward for this state-action pair. Once we reach the end, we will get the actual reward. What happens if, looking back, we realize that some state-action pairs made incorrect predictions, either too high or too low? Which state-action pairs should we update and by how much? This is the credit assignment problem. Should pairs far back in time be held accountable for the problems that arise now? Should pairs far back in time be given credit for current successes? Perhaps they should, but maybe not as much as recent pairs. And shouldn't pairs that weren't even encountered during this episode be exempt from blame or credit?
55 Eligibility Traces Eligibility traces provide a mechanism for determining how eligible a state-action pair (or just a state, when estimating V) is for an update. The trace should answer the question: which state-action pairs (or just states) are eligible for update, and by how much? Assumptions made by eligibility traces: pairs closer to a reward should receive more credit for the reward than pairs further away (eligibility decays); pairs that are encountered more times on the way to a reward should receive more credit than pairs encountered fewer times
56 Eligibility is incremented when we visit a pair (see next slide). Computing eligibility: the eligibility of a state-action pair (s, a) at time t is given by:
e_t(s, a) = γλ e_{t−1}(s, a) + 1   if s = s_t and a = a_t
e_t(s, a) = γλ e_{t−1}(s, a)        otherwise
for all s, a
57 Example of Accumulating Eligibility [Figure: an accumulating eligibility trace — the trace is incremented at each visit to a state and decays between visits]
58 Updating State-Action Pair Reward Estimates At time t we measure the prediction error by the TD error: δ_t = r_{t+1} + γQ_t(s_{t+1}, a_{t+1}) − Q_t(s_t, a_t). At time t, we adjust a state-action pair's prediction based on the error δ_t and its eligibility e_t(s, a): Q_{t+1}(s, a) = Q_t(s, a) + α δ_t e_t(s, a)
59 Can think of it as follows (see next slide): we are moving along through the state-action pairs; at each step we get a reward; we compute the error δ_t; we shout back the error to the pairs we have visited in the past
60 Backward View of Eligibility Traces [Figure: the TD error δ_t computed at s_t is shouted back to earlier states s_{t−1}, s_{t−2}, s_{t−3}, each weighted by its eligibility e_t]
61 TD(λ): Bridge Between Monte Carlo and TD Methods Consider what happens when we set λ = 0:
e_t(s, a) = 1   if s = s_t and a = a_t
e_t(s, a) = 0   otherwise
for all s, a. Only the current state will be eligible for update, so we only update the current step; this is analogous to our previous TD approaches, called TD(0)
62 As λ increases, but is still < 1: more of the preceding pairs are changed; eligibility falls by γλ per step. Consider what happens when we set λ = 1:
e_t(s, a) = γ e_{t−1}(s, a) + 1   if s = s_t and a = a_t
e_t(s, a) = γ e_{t−1}(s, a)        otherwise
for all s, a. Eligibility falls by just γ per step; this is what we had for Monte Carlo
63 Backup Diagram for TD(λ) What would the backup diagram for TD(λ) look like? Look back in the sequence of pairs and consider how a particular pair (s, a) is updated: when we take the next step, we'll update (s, a) based on the reward we got (1-step); when we take the next step, we will again update (s, a), but this time the eligibility will be less (2-step); when we take the next step, we will again update (s, a), but this time the eligibility will be less still (3-step); and so on. So, the update to (s, a) can be thought of as an average of n-step returns, i.e. the average of the returns we got after 1 step, 2 steps, 3 steps, ...
64 This is the forward view of TD(λ) (see next slide). We can show this average over n-step returns graphically, like the previous backup diagrams: a bar over the top indicates an average of returns; the weights for each element of the average appear at the bottom (see following slide)
65 Forward View of Eligibility Traces [Figure: from state s_t, looking forward in time to future states s_{t+1}, s_{t+2}, s_{t+3}, ... and rewards r_{t+1}, r_{t+2}, r_{t+3}, ..., r_T]
66 TD(λ)'s Backup Diagram [Figure: the TD(λ) / λ-return backup diagram — n-step returns weighted (1 − λ), (1 − λ)λ, (1 − λ)λ², ..., with weight λ^{T−t−1} on the final return; the weights sum to 1]
67 Weights Given to Returns [Figure: the weight (1 − λ)λ^{n−1} given to the n-step return decays by λ per step; the total area is 1; the remaining weight λ^{T−t−1} is given to the actual, final return]
68 Sarsa(λ)'s Backup Diagram [Figure: like the TD(λ) diagram but over state-action pairs, rooted at (s_t, a_t), with weights (1 − λ), (1 − λ)λ, (1 − λ)λ², ..., λ^{T−t−1}, ending at s_T]
69 Sarsa(λ) Algorithm: Sarsa(λ)
1. Initialize Q(s, a) arbitrarily, and e(s, a) = 0 for all s and a
2. Repeat (for each episode): (a) Initialize s and a (b) Repeat (for each step of episode) until s is terminal: i. Take action a, observe reward r and next state s' ii. Choose a' given s' using a policy derived from Q (ε-greedy) iii. δ ← r + γQ(s', a') − Q(s, a) iv. e(s, a) ← e(s, a) + 1 v. For all s, a: A. Q(s, a) ← Q(s, a) + αδe(s, a) B. e(s, a) ← γλe(s, a) vi. s ← s'; a ← a'
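The Sarsa(λ) algorithm above maps almost line-for-line to code. This sketch uses a hypothetical 4-state corridor (start 0, goal 3, actions L/R, reward −1 per step — an invented test environment, not from the lecture); the key difference from one-step Sarsa is that each TD error is broadcast to every pair in proportion to its accumulating trace:

```python
import random

# Sketch: Sarsa(lambda) with accumulating traces on a hypothetical corridor.
GAMMA, LAM, ALPHA, EPS = 1.0, 0.9, 0.1, 0.1
ACTIONS = ("L", "R")

def step(s, a):
    return (max(0, s - 1) if a == "L" else min(3, s + 1)), -1.0

def choose(Q, s, rng):
    if rng.random() < EPS:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

def sarsa_lambda(episodes=500, seed=3):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(4) for a in ACTIONS}
    for _ in range(episodes):
        e = {k: 0.0 for k in Q}            # traces reset at the start of each episode
        s = 0
        a = choose(Q, s, rng)
        while s != 3:
            s2, r = step(s, a)
            a2 = choose(Q, s2, rng) if s2 != 3 else None
            q_next = Q[(s2, a2)] if a2 is not None else 0.0
            delta = r + GAMMA * q_next - Q[(s, a)]
            e[(s, a)] += 1.0               # accumulating trace for the visited pair
            for k in Q:                    # broadcast delta to all eligible pairs
                Q[k] += ALPHA * delta * e[k]
                e[k] *= GAMMA * LAM        # decay every trace by gamma*lambda
            s, a = s2, a2
    return Q

Q = sarsa_lambda()
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(3)}
```

The inner loop over all pairs is O(|S||A|) per step; practical implementations often track only the pairs with non-negligible traces.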
70 Example: Sarsa(λ) and Gridworld [Figure: the path taken through the gridworld; the action values increased by one-step Sarsa; and the action values increased by Sarsa(λ) with λ = 0.9]
71 Integrating Planning, Acting and Learning The previous techniques will eventually lead to an optimal policy without knowing Pr and R ahead of time, but they can take a long time. What if the system not only learned a policy, but also learned a model of the world?
72 Relationships Among Planning, Acting and Learning [Figure: a loop relating value/policy, experience, and model — planning: model → value/policy; direct RL: experience → value/policy; acting: value/policy → experience; model learning: experience → model]
73 Certainty Equivalence Algorithm: 1. Learn Pr and R by exploring the world 2. Then learn the optimal policy (e.g. through policy iteration). Disadvantages: arbitrary decision between learning and acting; random exploration could be dangerous and inefficient; problems if the environment changes after learning
74 Dyna Dyna: combines learning and policy generation Uses experience to learn Pr and R Uses experience to adjust policy Uses model to adjust policy 73
75 Dyna Architecture: Integrated Planning, Acting and Learning [Figure: real experience from the environment drives direct RL updates to the policy/value functions and model learning; the model produces simulated experience, selected by search control, for planning updates]
76 Dyna
1. Loop forever (a) Get experience (s, a, r, s') from the world (b) Update model: i. Update Pr by increasing the statistic for the transition from s to s' with action a ii. Update the statistic for receiving reward r for taking action a in s (c) Update Q from experience: i. Q(s, a) ← R(s, a) + γ Σ_{s'} Pr(s, a, s') max_{a'} Q(s', a') (d) Update Q from model: i. Repeat N times: A. s ← random previously observed state B. a ← random action previously taken in s C. Q(s, a) ← R(s, a) + γ Σ_{s'} Pr(s, a, s') max_{a'} Q(s', a') (e) Choose next action to perform
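The Dyna loop above can be sketched as Dyna-Q, a common variant in which the real-experience update is a sample-based Q-learning step rather than a full expected backup, and the model simply memorizes deterministic transitions. The environment here is a hypothetical 4-state corridor (start 0, goal 3, reward −1 per step), invented for illustration:

```python
import random

# Sketch: Dyna-Q on a hypothetical corridor — direct RL + model learning + planning.
GAMMA, ALPHA, EPS, N = 0.95, 0.5, 0.1, 10
ACTIONS = ("L", "R")

def step(s, a):
    return (max(0, s - 1) if a == "L" else min(3, s + 1)), -1.0

def dyna_q(episodes=100, seed=4):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(4) for a in ACTIONS}
    model = {}                         # (s, a) -> (r, s'); deterministic world assumed
    for _ in range(episodes):
        s = 0
        while s != 3:
            if rng.random() < EPS:
                a = rng.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda x: Q[(s, x)])
            s2, r = step(s, a)
            # (c) direct RL update from real experience
            best = 0.0 if s2 == 3 else max(Q[(s2, x)] for x in ACTIONS)
            Q[(s, a)] += ALPHA * (r + GAMMA * best - Q[(s, a)])
            # (b) model learning: memorize the observed transition
            model[(s, a)] = (r, s2)
            # (d) planning: N simulated updates replayed from the model
            for _ in range(N):
                (ps, pa), (pr, ps2) = rng.choice(list(model.items()))
                pbest = 0.0 if ps2 == 3 else max(Q[(ps2, x)] for x in ACTIONS)
                Q[(ps, pa)] += ALPHA * (pr + GAMMA * pbest - Q[(ps, pa)])
            s = s2
    return Q

Q = dyna_q()
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(3)}
```

Each real step thus funds N extra "free" updates from remembered transitions, which is why more planning steps need fewer real episodes, as the performance slide shows.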
77 Performance of Dyna [Figure: a maze from S to G and learning curves of steps per episode vs. episodes for 0 planning steps (direct RL only), 5 planning steps, and 50 planning steps]
78 Snapshot of Dyna's Policies [Figure: policies found on the maze from S to G, without planning (N = 0) and with planning (N = 50)]
79 Blocked Maze [Figure: a maze whose path from S to G is blocked partway through learning; cumulative reward over time steps for Dyna-Q+, Dyna-Q, and Dyna-AC]
80 Shortcut Maze [Figure: a maze in which a shortcut from S to G opens up partway through learning; cumulative reward over time steps for Dyna-Q+, Dyna-Q, and Dyna-AC]
81 Bibliography 80
82 Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
COMP417 Introduction to Robotics and Intelligent Systems Reinforcement Learning - 2 Speaker: Sandeep Manjanna Acklowledgement: These slides use material from Pieter Abbeel s, Dan Klein s and John Schulman
More informationSequential Decision Making
Sequential Decision Making Dynamic programming Christos Dimitrakakis Intelligent Autonomous Systems, IvI, University of Amsterdam, The Netherlands March 18, 2008 Introduction Some examples Dynamic programming
More informationMarkov Decision Processes: Making Decision in the Presence of Uncertainty. (some of) R&N R&N
Markov Decision Processes: Making Decision in the Presence of Uncertainty (some of) R&N 16.1-16.6 R&N 17.1-17.4 Different Aspects of Machine Learning Supervised learning Classification - concept learning
More informationMaking Decisions. CS 3793 Artificial Intelligence Making Decisions 1
Making Decisions CS 3793 Artificial Intelligence Making Decisions 1 Planning under uncertainty should address: The world is nondeterministic. Actions are not certain to succeed. Many events are outside
More informationDecision Theory: Value Iteration
Decision Theory: Value Iteration CPSC 322 Decision Theory 4 Textbook 9.5 Decision Theory: Value Iteration CPSC 322 Decision Theory 4, Slide 1 Lecture Overview 1 Recap 2 Policies 3 Value Iteration Decision
More informationMarkov Decision Processes
Markov Decision Processes Ryan P. Adams COS 324 Elements of Machine Learning Princeton University We now turn to a new aspect of machine learning, in which agents take actions and become active in their
More informationThe Agent-Environment Interface Goals, Rewards, Returns The Markov Property The Markov Decision Process Value Functions Optimal Value Functions
The Agent-Environment Interface Goals, Rewards, Returns The Markov Property The Markov Decision Process Value Functions Optimal Value Functions Optimality and Approximation Finite MDP: {S, A, R, p, γ}
More informationReinforcement Learning
Reinforcement Learning MDP March May, 2013 MDP MDP: S, A, P, R, γ, µ State can be partially observable: Partially Observable MDPs () Actions can be temporally extended: Semi MDPs (SMDPs) and Hierarchical
More informationQ1. [?? pts] Search Traces
CS 188 Spring 2010 Introduction to Artificial Intelligence Midterm Exam Solutions Q1. [?? pts] Search Traces Each of the trees (G1 through G5) was generated by searching the graph (below, left) with a
More informationLecture 2: Making Good Sequences of Decisions Given a Model of World. CS234: RL Emma Brunskill Winter 2018
Lecture 2: Making Good Sequences of Decisions Given a Model of World CS234: RL Emma Brunskill Winter 218 Human in the loop exoskeleton work from Steve Collins lab Class Structure Last Time: Introduction
More informationChapter 6: Temporal Difference Learning
Chapter 6: emporal Difference Learning Objectives of this chapter: Introduce emporal Difference (D) learning Focus first on policy evaluation, or prediction, methods hen extend to control methods by following
More informationTDT4171 Artificial Intelligence Methods
TDT47 Artificial Intelligence Methods Lecture 7 Making Complex Decisions Norwegian University of Science and Technology Helge Langseth IT-VEST 0 helgel@idi.ntnu.no TDT47 Artificial Intelligence Methods
More informationReinforcement Learning
Reinforcement Learning Model-based RL and Integrated Learning-Planning Planning and Search, Model Learning, Dyna Architecture, Exploration-Exploitation (many slides from lectures of Marc Toussaint & David
More informationMarkov Decision Processes. Lirong Xia
Markov Decision Processes Lirong Xia Today ØMarkov decision processes search with uncertain moves and infinite space ØComputing optimal policy value iteration policy iteration 2 Grid World Ø The agent
More informationCS885 Reinforcement Learning Lecture 3b: May 9, 2018
CS885 Reinforcement Learning Lecture 3b: May 9, 2018 Intro to Reinforcement Learning [SutBar] Sec. 5.1-5.3, 6.1-6.3, 6.5, [Sze] Sec. 3.1, 4.3, [SigBuf] Sec. 2.1-2.5, [RusNor] Sec. 21.1-21.3, CS885 Spring
More informationMarkov Decision Processes
Markov Decision Processes Robert Platt Northeastern University Some images and slides are used from: 1. CS188 UC Berkeley 2. AIMA 3. Chris Amato Stochastic domains So far, we have studied search Can use
More informationTo earn the extra credit, one of the following has to hold true. Please circle and sign.
CS 188 Fall 2018 Introduction to Artificial Intelligence Practice Midterm 1 To earn the extra credit, one of the following has to hold true. Please circle and sign. A I spent 2 or more hours on the practice
More informationMDPs and Value Iteration 2/20/17
MDPs and Value Iteration 2/20/17 Recall: State Space Search Problems A set of discrete states A distinguished start state A set of actions available to the agent in each state An action function that,
More informationCS 188 Fall Introduction to Artificial Intelligence Midterm 1. ˆ You have approximately 2 hours and 50 minutes.
CS 188 Fall 2013 Introduction to Artificial Intelligence Midterm 1 ˆ You have approximately 2 hours and 50 minutes. ˆ The exam is closed book, closed notes except your one-page crib sheet. ˆ Please use
More informationCS 234 Winter 2019 Assignment 1 Due: January 23 at 11:59 pm
CS 234 Winter 2019 Assignment 1 Due: January 23 at 11:59 pm For submission instructions please refer to website 1 Optimal Policy for Simple MDP [20 pts] Consider the simple n-state MDP shown in Figure
More informationCS221 / Autumn 2018 / Liang. Lecture 8: MDPs II
CS221 / Autumn 218 / Liang Lecture 8: MDPs II cs221.stanford.edu/q Question If you wanted to go from Orbisonia to Rockhill, how would you get there? ride bus 1 ride bus 17 ride the magic tram CS221 / Autumn
More informationCS221 / Spring 2018 / Sadigh. Lecture 8: MDPs II
CS221 / Spring 218 / Sadigh Lecture 8: MDPs II cs221.stanford.edu/q Question If you wanted to go from Orbisonia to Rockhill, how would you get there? ride bus 1 ride bus 17 ride the magic tram CS221 /
More informationReinforcement Learning
Reinforcement Learning n-step bootstrapping Daniel Hennes 12.06.2017 University Stuttgart - IPVS - Machine Learning & Robotics 1 n-step bootstrapping Unifying Monte Carlo and TD n-step TD n-step Sarsa
More informationTemporal Abstraction in RL
Temporal Abstraction in RL How can an agent represent stochastic, closed-loop, temporally-extended courses of action? How can it act, learn, and plan using such representations? HAMs (Parr & Russell 1998;
More informationLecture 7: MDPs I. Question. Course plan. So far: search problems. Uncertainty in the real world
Lecture 7: MDPs I cs221.stanford.edu/q Question How would you get to Mountain View on Friday night in the least amount of time? bike drive Caltrain Uber/Lyft fly CS221 / Spring 2018 / Sadigh CS221 / Spring
More informationCS221 / Spring 2018 / Sadigh. Lecture 7: MDPs I
CS221 / Spring 2018 / Sadigh Lecture 7: MDPs I cs221.stanford.edu/q Question How would you get to Mountain View on Friday night in the least amount of time? bike drive Caltrain Uber/Lyft fly CS221 / Spring
More informationMarkov Decision Process
Markov Decision Process Human-aware Robotics 2018/02/13 Chapter 17.3 in R&N 3rd Ø Announcement: q Slides for this lecture are here: http://www.public.asu.edu/~yzhan442/teaching/cse471/lectures/mdp-ii.pdf
More informationMaking Complex Decisions
Ch. 17 p.1/29 Making Complex Decisions Chapter 17 Ch. 17 p.2/29 Outline Sequential decision problems Value iteration algorithm Policy iteration algorithm Ch. 17 p.3/29 A simple environment 3 +1 p=0.8 2
More informationAn Electronic Market-Maker
massachusetts institute of technology artificial intelligence laboratory An Electronic Market-Maker Nicholas Tung Chan and Christian Shelton AI Memo 21-5 April 17, 21 CBCL Memo 195 21 massachusetts institute
More informationProbabilistic Robotics: Probabilistic Planning and MDPs
Probabilistic Robotics: Probabilistic Planning and MDPs Slide credits: Wolfram Burgard, Dieter Fox, Cyrill Stachniss, Giorgio Grisetti, Maren Bennewitz, Christian Plagemann, Dirk Haehnel, Mike Montemerlo,
More informationSupplementary Material: Strategies for exploration in the domain of losses
1 Supplementary Material: Strategies for exploration in the domain of losses Paul M. Krueger 1,, Robert C. Wilson 2,, and Jonathan D. Cohen 3,4 1 Department of Psychology, University of California, Berkeley
More informationEnsemble Methods for Reinforcement Learning with Function Approximation
Ensemble Methods for Reinforcement Learning with Function Approximation Stefan Faußer and Friedhelm Schwenker Institute of Neural Information Processing, University of Ulm, 89069 Ulm, Germany {stefan.fausser,friedhelm.schwenker}@uni-ulm.de
More informationReasoning with Uncertainty
Reasoning with Uncertainty Markov Decision Models Manfred Huber 2015 1 Markov Decision Process Models Markov models represent the behavior of a random process, including its internal state and the externally
More informationMonte-Carlo Planning: Introduction and Bandit Basics. Alan Fern
Monte-Carlo Planning: Introduction and Bandit Basics Alan Fern 1 Large Worlds We have considered basic model-based planning algorithms Model-based planning: assumes MDP model is available Methods we learned
More informationDeep RL and Controls Homework 1 Spring 2017
10-703 Deep RL and Controls Homework 1 Spring 2017 February 1, 2017 Due February 17, 2017 Instructions You have 15 days from the release of the assignment until it is due. Refer to gradescope for the exact
More informationIntroduction to Fall 2011 Artificial Intelligence Midterm Exam
CS 188 Introduction to Fall 2011 Artificial Intelligence Midterm Exam INSTRUCTIONS You have 3 hours. The exam is closed book, closed notes except a one-page crib sheet. Please use non-programmable calculators
More informationMonte-Carlo Planning Look Ahead Trees. Alan Fern
Monte-Carlo Planning Look Ahead Trees Alan Fern 1 Monte-Carlo Planning Outline Single State Case (multi-armed bandits) A basic tool for other algorithms Monte-Carlo Policy Improvement Policy rollout Policy
More informationMonte-Carlo Planning: Introduction and Bandit Basics. Alan Fern
Monte-Carlo Planning: Introduction and Bandit Basics Alan Fern 1 Large Worlds We have considered basic model-based planning algorithms Model-based planning: assumes MDP model is available Methods we learned
More informationMDP Algorithms. Thomas Keller. June 20, University of Basel
MDP Algorithms Thomas Keller University of Basel June 20, 208 Outline of this lecture Markov decision processes Planning via determinization Monte-Carlo methods Monte-Carlo Tree Search Heuristic Search
More informationCS360 Homework 14 Solution
CS360 Homework 14 Solution Markov Decision Processes 1) Invent a simple Markov decision process (MDP) with the following properties: a) it has a goal state, b) its immediate action costs are all positive,
More informationCEC login. Student Details Name SOLUTIONS
Student Details Name SOLUTIONS CEC login Instructions You have roughly 1 minute per point, so schedule your time accordingly. There is only one correct answer per question. Good luck! Question 1. Searching
More informationEE266 Homework 5 Solutions
EE, Spring 15-1 Professor S. Lall EE Homework 5 Solutions 1. A refined inventory model. In this problem we consider an inventory model that is more refined than the one you ve seen in the lectures. The
More informationThe exam is closed book, closed calculator, and closed notes except your three crib sheets.
CS 188 Spring 2016 Introduction to Artificial Intelligence Final V2 You have approximately 2 hours and 50 minutes. The exam is closed book, closed calculator, and closed notes except your three crib sheets.
More informationCS 188 Fall Introduction to Artificial Intelligence Midterm 1. ˆ You have approximately 2 hours and 50 minutes.
CS 188 Fall 2013 Introduction to Artificial Intelligence Midterm 1 ˆ You have approximately 2 hours and 50 minutes. ˆ The exam is closed book, closed notes except your one-page crib sheet. ˆ Please use
More informationThe exam is closed book, closed calculator, and closed notes except your one-page crib sheet.
CS 188 Spring 2015 Introduction to Artificial Intelligence Midterm 1 You have approximately 2 hours and 50 minutes. The exam is closed book, closed calculator, and closed notes except your one-page crib
More informationIntroduction to Fall 2007 Artificial Intelligence Final Exam
NAME: SID#: Login: Sec: 1 CS 188 Introduction to Fall 2007 Artificial Intelligence Final Exam You have 180 minutes. The exam is closed book, closed notes except a two-page crib sheet, basic calculators
More informationCS 188 Fall Introduction to Artificial Intelligence Midterm 1
CS 188 Fall 2018 Introduction to Artificial Intelligence Midterm 1 You have 120 minutes. The time will be projected at the front of the room. You may not leave during the last 10 minutes of the exam. Do
More informationTemporal Abstraction in RL. Outline. Example. Markov Decision Processes (MDPs) ! Options
Temporal Abstraction in RL Outline How can an agent represent stochastic, closed-loop, temporally-extended courses of action? How can it act, learn, and plan using such representations?! HAMs (Parr & Russell
More informationCPS 270: Artificial Intelligence Markov decision processes, POMDPs
CPS 270: Artificial Intelligence http://www.cs.duke.edu/courses/fall08/cps270/ Markov decision processes, POMDPs Instructor: Vincent Conitzer Warmup: a Markov process with rewards We derive some reward
More informationMarkov Decision Processes (MDPs) CS 486/686 Introduction to AI University of Waterloo
Markov Decision Processes (MDPs) CS 486/686 Introduction to AI University of Waterloo Outline Sequential Decision Processes Markov chains Highlight Markov property Discounted rewards Value iteration Markov
More informationImportance Sampling for Fair Policy Selection
Importance Sampling for Fair Policy Selection Shayan Doroudi Carnegie Mellon University Pittsburgh, PA 15213 shayand@cs.cmu.edu Philip S. Thomas Carnegie Mellon University Pittsburgh, PA 15213 philipt@cs.cmu.edu
More informationFinal exam solutions
EE365 Stochastic Control / MS&E251 Stochastic Decision Models Profs. S. Lall, S. Boyd June 5 6 or June 6 7, 2013 Final exam solutions This is a 24 hour take-home final. Please turn it in to one of the
More informationLecture 7: Bayesian approach to MAB - Gittins index
Advanced Topics in Machine Learning and Algorithmic Game Theory Lecture 7: Bayesian approach to MAB - Gittins index Lecturer: Yishay Mansour Scribe: Mariano Schain 7.1 Introduction In the Bayesian approach
More informationThe Option-Critic Architecture
The Option-Critic Architecture Pierre-Luc Bacon, Jean Harb, Doina Precup Reasoning and Learning Lab McGill University, Montreal, Canada AAAI 2017 Intelligence: the ability to generalize and adapt efficiently
More information16 MAKING SIMPLE DECISIONS
247 16 MAKING SIMPLE DECISIONS Let us associate each state S with a numeric utility U(S), which expresses the desirability of the state A nondeterministic action A will have possible outcome states Result
More informationThe exam is closed book, closed calculator, and closed notes except your one-page crib sheet.
CS 188 Spring 2016 Introduction to Artificial Intelligence Midterm V2 You have approximately 2 hours and 50 minutes. The exam is closed book, closed calculator, and closed notes except your one-page crib
More informationA selection of MAS learning techniques based on RL
A selection of MAS learning techniques based on RL Ann Nowé 14/11/12 Herhaling titel van presentatie 1 Content Single stage setting Common interest (Claus & Boutilier, Kapetanakis&Kudenko) Conflicting
More information1 Dynamic programming
1 Dynamic programming A country has just discovered a natural resource which yields an income per period R measured in terms of traded goods. The cost of exploitation is negligible. The government wants
More informationSequential Coalition Formation for Uncertain Environments
Sequential Coalition Formation for Uncertain Environments Hosam Hanna Computer Sciences Department GREYC - University of Caen 14032 Caen - France hanna@info.unicaen.fr Abstract In several applications,
More informationAdaptive Experiments for Policy Choice. March 8, 2019
Adaptive Experiments for Policy Choice Maximilian Kasy Anja Sautmann March 8, 2019 Introduction The goal of many experiments is to inform policy choices: 1. Job search assistance for refugees: Treatments:
More informationThe Irrevocable Multi-Armed Bandit Problem
The Irrevocable Multi-Armed Bandit Problem Ritesh Madan Qualcomm-Flarion Technologies May 27, 2009 Joint work with Vivek Farias (MIT) 2 Multi-Armed Bandit Problem n arms, where each arm i is a Markov Decision
More information16 MAKING SIMPLE DECISIONS
253 16 MAKING SIMPLE DECISIONS Let us associate each state S with a numeric utility U(S), which expresses the desirability of the state A nondeterministic action a will have possible outcome states Result(a)
More informationEfficiency and Herd Behavior in a Signalling Market. Jeffrey Gao
Efficiency and Herd Behavior in a Signalling Market Jeffrey Gao ABSTRACT This paper extends a model of herd behavior developed by Bikhchandani and Sharma (000) to establish conditions for varying levels
More information