Reinforcement Learning (1): Discrete MDP, Value Iteration, Policy Iteration
Piyush Rai
CS5350/6350: Machine Learning
November 29, 2011
Reinforcement Learning
- Supervised learning: uses explicit supervision (input-output pairs)
- Reinforcement learning: no explicit supervision
- Learning is modeled as interactions of an agent with an environment
- Based on a feedback mechanism (in the form of a reward function)
- Applications: robotics (autonomous driving, robot locomotion, etc.), (computer) game playing, online advertising, information retrieval (interactive search), and many more
Markov Decision Processes (MDP)
- Used for modeling the environment the agent is acting in
- Defined by a tuple (S, A, {P_sa}, γ, R)
- S is a set of states (today's class: finite state space)
- A is a set of actions
- P_sa is a probability distribution over the state space
  - i.e., the probability of switching to some state s' if we take action a in state s
  - For finite state spaces, P_sa is a vector of size |S| (and sums to 1)
- R : S × A → ℝ is the reward function (a function of state-action pairs)
  - Note: often the reward is a function of the state only, R : S → ℝ
- γ ∈ [0,1) is called the discount factor for future rewards
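To make the tuple concrete, here is a minimal sketch in Python/NumPy of how such a finite MDP can be stored as arrays; the particular numbers are invented for illustration and are not from the lecture.

```python
import numpy as np

# A toy finite MDP with |S| = 3 states and |A| = 2 actions.
# P[a, s, s'] = probability of landing in state s' after taking action a
# in state s, so each row P[a, s] is the distribution P_sa (and sums to 1).
P = np.array([
    [[0.8, 0.2, 0.0],   # action 0 taken in state 0
     [0.1, 0.8, 0.1],   # action 0 taken in state 1
     [0.0, 0.2, 0.8]],  # action 0 taken in state 2
    [[0.0, 1.0, 0.0],   # action 1 taken in state 0
     [0.0, 0.0, 1.0],   # action 1 taken in state 1
     [0.0, 0.0, 1.0]],  # action 1 taken in state 2
])
R = np.array([0.0, 0.0, 1.0])  # state-only rewards R(s), as on the slide
gamma = 0.9                    # discount factor γ ∈ [0, 1)
```

The later sketches below reuse this (P, R, gamma) layout.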
MDP Dynamics
- Start in some state s_0 ∈ S
- Choose action a_0 ∈ A in state s_0
- A new MDP state s_1 ∈ S is chosen according to P_{s_0 a_0}: s_1 ~ P_{s_0 a_0}
- Choose action a_1 ∈ A in state s_1
- A new MDP state s_2 ∈ S is chosen according to P_{s_1 a_1}: s_2 ~ P_{s_1 a_1}
- Choose action a_2 ∈ A in state s_2, and so on...
Payoff and Expected Payoff
- The payoff is the cumulative (discounted) reward
- Upon visiting states s_0, s_1, ... with actions a_0, a_1, ..., the payoff is
  R(s_0, a_0) + γ R(s_1, a_1) + γ² R(s_2, a_2) + ...
- The reward at time t is discounted by γ^t (note: γ < 1), so we care more about immediate rewards than about future rewards
- If rewards are defined in terms of states only, then the payoff is
  R(s_0) + γ R(s_1) + γ² R(s_2) + ...
- We want to choose actions over time so as to maximize the expected payoff
  E[R(s_0) + γ R(s_1) + γ² R(s_2) + ...]
- The expectation is with respect to the randomness in the state sequence (the initial state and the stochastic transitions)
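These definitions can be checked by simulation. Below is a small sketch, continuing the toy arrays above, that rolls out the MDP dynamics under a fixed state-to-action mapping and accumulates the discounted payoff of one trajectory; the function name and the truncation horizon are my own choices, not from the lecture.

```python
rng = np.random.default_rng(0)

def sample_payoff(P, R, gamma, policy, s0, horizon=100):
    """Follow `policy` (a length-|S| array of actions) from state s0 and
    return R(s0) + γ R(s1) + γ² R(s2) + ..., truncated after `horizon`
    steps (a good approximation, since the γ^t terms vanish for γ < 1)."""
    s, payoff, discount = s0, 0.0, 1.0
    for _ in range(horizon):
        payoff += discount * R[s]
        a = policy[s]                      # a_t = π(s_t)
        s = rng.choice(len(R), p=P[a, s])  # s_{t+1} ~ P_{s_t a_t}
        discount *= gamma
    return payoff
```

Averaging `sample_payoff` over many rollouts estimates the expected payoff.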
Policy Function
- A policy is a function π : S → A, mapping states to actions
- For an agent with policy π, the action in state s is a = π(s)
- Value function for a policy π:
  V^π(s) = E[R(s_0) + γ R(s_1) + γ² R(s_2) + ... | s_0 = s, π]
- V^π(s) is the expected payoff starting in state s and following policy π
- Bellman's equation gives a recursive definition of the value function:
  V^π(s) = R(s) + γ Σ_{s'∈S} P_{sπ(s)}(s') V^π(s') = R(s) + γ E_{s' ~ P_{sπ(s)}}[V^π(s')]
- It is the immediate reward plus the expected sum of future discounted rewards
Computing the Value Function
- Bellman's equation can be used to compute the value function V^π(s):
  V^π(s) = R(s) + γ Σ_{s'∈S} P_{sπ(s)}(s') V^π(s')
- For an MDP with finitely many states, this gives |S| linear equations in |S| unknowns, which are efficiently solvable
- The optimal value function is defined as V*(s) = max_π V^π(s)
- It is the best possible expected payoff that any policy π can achieve
- The optimal value function can also be written recursively:
  V*(s) = R(s) + max_{a∈A} γ Σ_{s'∈S} P_{sa}(s') V*(s')
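Since Bellman's equation is linear in the |S| unknowns V^π(s), it can be solved in one shot. A minimal sketch of this policy-evaluation step, under the array layout assumed earlier:

```python
def evaluate_policy(P, R, gamma, policy):
    """Solve the |S| Bellman equations V = R + γ P_π V exactly, where
    P_π[s, s'] = P_{s π(s)}(s') is the transition matrix induced by π."""
    n = len(R)
    P_pi = np.array([P[policy[s], s] for s in range(n)])
    # (I - γ P_π) V = R has a unique solution because γ < 1
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R)
```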
Optimal Policy
- The optimal value function:
  V*(s) = R(s) + max_{a∈A} γ Σ_{s'∈S} P_{sa}(s') V*(s')
- The optimal policy π* : S → A:
  π*(s) = argmax_{a∈A} Σ_{s'∈S} P_{sa}(s') V*(s')
- The optimal policy for state s gives the action a that maximizes the optimal value function for that state
- For every state s and every policy π:
  V*(s) = V^{π*}(s) ≥ V^π(s)
- Note: π* is the optimal policy for all states s; it doesn't matter what the initial MDP state is
Finding the Optimal Policy
- The optimal policy π* : S → A:
  π*(s) = argmax_{a∈A} Σ_{s'∈S} P_{sa}(s') V*(s')    (1)
- There are two standard methods to find it:
- Value iteration: zero-initialize and iteratively refine V(s); it converges towards V*(s). Finally, use equation (1) to find the optimal policy π*
- Policy iteration: random-initialize and iteratively refine π(s) by alternating between computing V(s) and then π(s) as per equation (1); π eventually converges to the optimal policy π*
Finding the Optimal Policy: Value Iteration
- Iteratively compute/refine the value function V until convergence:
  repeat: for every state s, update V(s) := R(s) + max_{a∈A} γ Σ_{s'∈S} P_{sa}(s') V(s')
- Value iteration property: V converges to V*
- Upon convergence, use π*(s) = argmax_{a∈A} Σ_{s'∈S} P_{sa}(s') V*(s')
- Note: the inner loop can update V(s) for all states simultaneously, or in some order
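A compact sketch of value iteration for the array layout used above; the stopping tolerance is an arbitrary choice of mine.

```python
def value_iteration(P, R, gamma, tol=1e-8):
    """Zero-initialize V, then apply the synchronous update
    V(s) := R(s) + max_a γ Σ_{s'} P_sa(s') V(s') until V stops changing.
    Returns the (near-)optimal values and the greedy policy of equation (1)."""
    V = np.zeros(len(R))
    while True:
        # Q[a, s] = R(s) + γ Σ_{s'} P_sa(s') V(s')
        Q = R[None, :] + gamma * np.einsum('asx,x->as', P, V)
        V_new = Q.max(axis=0)
        done = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if done:
            break
    policy = np.einsum('asx,x->as', P, V).argmax(axis=0)  # equation (1)
    return V, policy
```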
Finding the Optimal Policy: Policy Iteration
- Iteratively compute/refine the policy π until convergence:
  repeat: (a) let V := V^π, then (b) for every state s, let π(s) := argmax_{a∈A} Σ_{s'∈S} P_{sa}(s') V(s')
- Step (a) computes the value function for the current policy π; this can be done using Bellman's equations (solving |S| equations in |S| unknowns)
- Step (b) gives the policy that is greedy w.r.t. V
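A matching sketch of policy iteration, reusing `evaluate_policy` from above for step (a):

```python
def policy_iteration(P, R, gamma):
    """Alternate exact policy evaluation (step a) with greedy
    improvement (step b) until the policy stops changing."""
    policy = np.zeros(len(R), dtype=int)  # arbitrary initial policy
    while True:
        V = evaluate_policy(P, R, gamma, policy)              # step (a)
        greedy = np.einsum('asx,x->as', P, V).argmax(axis=0)  # step (b)
        if np.array_equal(greedy, policy):
            return V, policy
        policy = greedy
```

On the toy MDP above, `value_iteration(P, R, gamma)` and `policy_iteration(P, R, gamma)` should recover (approximately) the same optimal value function.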
Learning an MDP Model
- So far we assumed that the state transition probabilities {P_sa} are given and that the rewards R(s) at each state are known
- Often we don't know these and want to learn them
- They are learned using experience (i.e., a set of previous trials):
  s_i^(j) is the state at time i of trial j, and a_i^(j) is the corresponding action taken at that state
- Given this experience, the MLE estimate of the state transition probabilities is
  P_sa(s') = (# of times we took action a in state s and got to s') / (# of times we took action a in state s)
- Note: if action a is never taken in state s, the above ratio is 0/0; in that case set P_sa(s') = 1/|S| (a uniform distribution over all states)
- P_sa is easy to update if we gather more experience (i.e., do more trials): just add counts to the numerator and denominator
- Likewise, the expected reward R(s) in state s can be computed as the average reward observed in state s across all the trials
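One way these estimates might be implemented; the (state, action, reward, next-state) trial format is my own assumption, not something fixed by the lecture.

```python
def estimate_model(trials, n_states, n_actions):
    """MLE of P_sa and R(s) from experience. Each trial is a list of
    (s, a, r, s_next) tuples recorded while acting in the environment."""
    counts = np.zeros((n_actions, n_states, n_states))
    reward_sum = np.zeros(n_states)
    visits = np.zeros(n_states)
    for trial in trials:
        for s, a, r, s_next in trial:
            counts[a, s, s_next] += 1
            reward_sum[s] += r
            visits[s] += 1
    totals = counts.sum(axis=2, keepdims=True)
    # Never-taken (s, a) pairs get the uniform fallback 1/|S| from the slide
    P_hat = np.where(totals > 0, counts / np.maximum(totals, 1), 1.0 / n_states)
    R_hat = reward_sum / np.maximum(visits, 1)  # average observed reward per state
    return P_hat, R_hat
```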
MDP Learning + Policy Learning
- Alternate between learning the MDP (P_sa and R) and learning the policy
- The policy learning step can be done using value iteration or policy iteration
- The algorithm (using value iteration):
  - Randomly initialize the policy π
  - Repeat until convergence:
    1. Execute policy π in the MDP to generate a set of trials
    2. Use this experience to estimate P_sa and R
    3. Apply value iteration with the estimated P_sa and R; this gives a new estimate of the value function V
    4. Update policy π to be the greedy policy w.r.t. V
- Note: step 3 can be made more efficient by initializing V with the values from the previous iteration
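Putting the earlier sketches together, the loop might look as follows; `execute_policy` is a hypothetical environment-interaction function (it must return trials in the format `estimate_model` expects), and for simplicity this version re-runs value iteration from zero rather than warm-starting V as the note suggests.

```python
def learn_mdp_and_policy(execute_policy, n_states, n_actions, gamma, n_rounds=50):
    """Alternate between estimating the MDP from experience (steps 1-2)
    and re-solving the estimated MDP with value iteration (steps 3-4)."""
    rng = np.random.default_rng(0)
    policy = rng.integers(n_actions, size=n_states)  # random initial policy
    trials = []
    for _ in range(n_rounds):
        trials += execute_policy(policy)                            # step 1
        P_hat, R_hat = estimate_model(trials, n_states, n_actions)  # step 2
        V, policy = value_iteration(P_hat, R_hat, gamma)            # steps 3-4
    return policy
```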
Value Iteration vs Policy Iteration
- Small state spaces: policy iteration is typically very fast and converges quickly
- Large state spaces: policy iteration may be slow, because it needs to solve a large system of linear equations; value iteration is preferred in such cases
- Very large state spaces: the value function can be approximated using some regression algorithm, though the optimality guarantee is then lost
Next Class
- Continuous state MDPs
- State-space discretization
- Value function approximation
More information