Reinforcement Learning

Nancy Tyler
6 years ago

1 Reinforcement Learning MDP March-May, 2013

2 MDP
MDP: $\langle S, A, P, R, \gamma, \mu \rangle$
The state can be partially observable: Partially Observable MDPs (POMDPs)
Actions can be temporally extended: Semi-MDPs (SMDPs) and Hierarchical Learning
Rewards can be multiple: Multi-Objective RL
Rewards can be unknown: Inverse RL
Rewards can be intrinsic: Intrinsically Motivated RL
Information can be reused: Transfer Learning
There can be many learning agents: Stochastic/Markov Games

3 Visual States in a Maze

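4 POMDP definition
$\langle S, A, O, P, \Omega, R, \gamma, \mu \rangle$
$O$ is the set of observations
$\Omega(o \mid s', a)$ is the conditional probability of getting observation $o$ when reaching state $s'$ with action $a$
Value function: $V^\pi(b) = \sum_{s \in S} b(s) V^\pi(s)$
Bayesian belief propagation:
$b'(s') = \dfrac{\Omega(o \mid s', a) \sum_{s \in S} P(s' \mid s, a)\, b(s)}{\sum_{s'' \in S} \Omega(o \mid s'', a) \sum_{s \in S} P(s'' \mid s, a)\, b(s)}$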
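The Bayesian belief propagation step on this slide can be sketched in a few lines of plain Python. This is a minimal sketch, not the lecture's own code: the nested-list layouts `P[a][s][s2]` and `Omega[a][s2][o]`, the function name `belief_update`, and the two-state "listen" example are all assumptions made for illustration.

```python
def belief_update(b, a, o, P, Omega):
    """Return b'(s') after taking action a in belief b and observing o.

    b     : list of probabilities, one per state
    P     : P[a][s][s2] = P(s2 | s, a)            (assumed layout)
    Omega : Omega[a][s2][o] = Omega(o | s2, a)    (assumed layout)
    """
    n = len(b)
    # Numerator for each s': Omega(o | s', a) * sum_s P(s' | s, a) b(s)
    unnorm = [
        Omega[a][s2][o] * sum(P[a][s][s2] * b[s] for s in range(n))
        for s2 in range(n)
    ]
    z = sum(unnorm)  # denominator, i.e. the probability of observing o
    if z == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return [u / z for u in unnorm]


# Hypothetical 2-state example: a single "listen" action (index 0) that
# leaves the state unchanged and reports it correctly 85% of the time.
P = [[[1.0, 0.0], [0.0, 1.0]]]
Omega = [[[0.85, 0.15], [0.15, 0.85]]]
b1 = belief_update([0.5, 0.5], a=0, o=0, P=P, Omega=Omega)
```

Starting from a uniform belief, one observation pointing at the first state shifts the belief towards that state by exactly the observation accuracy.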

5 Value of Information
POMDPs capture the value of information.
Question: we don't know where the larger reward is. Should we go and read the map?
Answer: it depends on
- the difference between the rewards
- the cost of reading the map
- the accuracy of the map
POMDPs take all such considerations into account and provide an optimal policy.

6 In MDPs the state contains all relevant information
This aspect is covered in POMDPs if the belief state is a sufficient statistic of the state.
This also implies Markovianity of POMDPs.

7 What are POMDPs?
Initial belief: $b_0(s) = \Pr(S = s)$
Belief state updating: $b'(s') = \Pr(s' \mid o, a, b)$
Observation probabilities: $\Omega(o \mid s', a)$
Transition probabilities: $P(s' \mid s, a)$
Rewards: $R(s, a)$

8 Role of Observations 1D belief space for a 2 state POMDP with 3 possible observations

9 POMDP is a Continuous Space Belief MDP
$B$ = infinite set of belief states
$A$ = finite set of actions
Reward function: $\rho(b, a) = \sum_{s \in S} b(s) R(s, a)$
Transition function: $P(b' \mid b, a) = \sum_{o \in O} \Pr(b' \mid b, a, o) \Pr(o \mid b, a)$
where $\Pr(b' \mid b, a, o) = 1$ if the belief update with $b$, $a$, $o$ yields $b'$, and $0$ otherwise
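The two ingredients of the belief MDP that are easy to compute directly are the expected reward of a belief and the probability of each observation, which weights the successor beliefs in the transition function. The sketch below assumes the layouts `R[s][a]`, `P[a][s][s2]` and `Omega[a][s2][o]`. The function names and numbers are illustrative only, and the formula for Pr(o | b, a) is the normalizer of the belief update rather than something stated on this slide.

```python
def belief_reward(b, a, R):
    """rho(b, a) = sum_s b(s) R(s, a), with R[s][a] an assumed layout."""
    return sum(b[s] * R[s][a] for s in range(len(b)))


def observation_prob(b, a, o, P, Omega):
    """Pr(o | b, a) = sum_s' Omega(o | s', a) sum_s P(s' | s, a) b(s)."""
    n = len(b)
    return sum(
        Omega[a][s2][o] * sum(P[a][s][s2] * b[s] for s in range(n))
        for s2 in range(n)
    )


# Hypothetical 2-state, 1-action, 2-observation model.
R = [[1.0], [0.0]]
P = [[[1.0, 0.0], [0.0, 1.0]]]
Omega = [[[0.85, 0.15], [0.15, 0.85]]]
b = [0.25, 0.75]

rho = belief_reward(b, 0, R)
obs_probs = [observation_prob(b, 0, o, P, Omega) for o in range(2)]
```

Because each (b, a, o) triple yields exactly one successor belief, a belief has at most |O| successors per action, each reached with the probability in `obs_probs`.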

10 A Way to Simplify: Policy Tree
With one step remaining, the agent selects an action.
With two steps remaining, it takes an action, makes an observation, and then takes a final action.
In general, a t-step policy is a tree.

11 Value Function for Policy Tree
If $p$ is a one-step policy tree, then the value of executing its action in state $s$ is
$V_p(s) = R(s, a(p))$
More generally, if $p$ is a t-step policy tree, then
$V_p(s) = R(s, a(p)) + \gamma \sum_{s' \in S} P(s' \mid s, a(p)) \sum_{o \in O} \Omega(o \mid s', a(p))\, V_{o(p)}(s')$
where $a(p)$ is the action at the root of $p$ and $o(p)$ is the (t-1)-step subtree followed after observation $o$.
$V_p$ can be thought of as a vector associated with the policy tree $p$, since its dimension is the same as the number of states. We often use the notation $\alpha_p$ to refer to this vector:
$\alpha_p = \langle V_p(s_1), V_p(s_2), \ldots, V_p(s_n) \rangle$
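The recursion for the value of a policy tree translates directly into a recursive function that returns the vector of state values. This is a sketch under assumed conventions: a tree is encoded as a hypothetical pair `(action, children)` with `children` mapping each observation index to a subtree, and the model uses the layouts `R[s][a]`, `P[a][s][s2]`, `Omega[a][s2][o]`. None of these names come from the slides.

```python
def alpha_vector(tree, P, Omega, R, gamma):
    """Return [V_p(s) for every state s] for the policy tree p = tree.

    tree = (action, children); a one-step tree has children == {}.
    """
    a, children = tree
    n = len(R)
    if not children:
        return [R[s][a] for s in range(n)]
    # Alpha vectors of the subtrees o(p), one per observation.
    sub = {o: alpha_vector(c, P, Omega, R, gamma) for o, c in children.items()}
    alpha = []
    for s in range(n):
        future = sum(
            P[a][s][s2] * sum(Omega[a][s2][o] * sub[o][s2] for o in sub)
            for s2 in range(n)
        )
        alpha.append(R[s][a] + gamma * future)
    return alpha


# Hypothetical model: action 0 "listens" (no reward, 85% accurate),
# action 1 pays +1 in state 0 and -1 in state 1, action 2 the reverse.
R = [[0.0, 1.0, -1.0], [0.0, -1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
P = [identity, identity, identity]
obs = [[0.85, 0.15], [0.15, 0.85]]
Omega = [obs, obs, obs]

# Two-step tree: listen, then act according to the observation.
tree = (0, {0: (1, {}), 1: (2, {})})
alpha = alpha_vector(tree, P, Omega, R, gamma=0.9)
```

In this example both components of `alpha` equal 0.9 * (0.85 - 0.15) = 0.63: listening first makes the two-step tree equally good in either state.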

12 Value Function over Belief Space
As the exact world state cannot be observed, the agent must compute the expected value, over world states, of executing policy tree $p$ from belief state $b$:
$V_p(b) = \sum_{s \in S} b(s) V_p(s) = b \cdot \alpha_p$
To construct an optimal t-step policy, we must maximize over the set $P$ of all t-step policy trees:
$V_t(b) = \max_{p \in P} b \cdot \alpha_p$
As $V_p(b)$ is linear in $b$ for each $p \in P$, the value $V_t(b)$ is the upper surface of those functions, i.e., $V_t(b)$ is piecewise linear and convex.

13 Piecewise Linear Value Function
Let $V_{p_1}$, $V_{p_2}$ and $V_{p_3}$ be the value functions induced by policy trees $p_1$, $p_2$ and $p_3$.
Each of these value functions is of the form $V_{p_i}(b) = b \cdot \alpha_{p_i}$, which is a linear function of $b$.
Thus, each value function can be represented as a line, plane, or hyperplane, depending on the number of states, and the optimal t-step value is their upper surface:
$V_t(b) = \max_{p_i \in P} b \cdot \alpha_{p_i}$

14 The Familiar Optimal t step Value Function

15 Optimal t-step Policy
The optimal t-step policy is determined by projecting the optimal value function back down onto the belief space.
The projection of the optimal t-step value function yields a partition of the belief space into regions, within each of which a single policy tree $p$ has maximal value over the entire region.
The optimal action in that region, $a(p)$, is the action at the root of the policy tree $p$.
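The projection onto the belief space can be made concrete for a 2-state POMDP, where a belief is a single number. The sketch below takes the maximum of b · α over a small set of alpha vectors and records the root action of the winning tree at each belief. The three alpha vectors and the action labels are made-up illustration values, not results from the lecture.

```python
def best_tree(b, alphas):
    """Return (value, index) of the alpha vector maximizing b . alpha."""
    values = [sum(bs * al for bs, al in zip(b, alpha)) for alpha in alphas]
    i = max(range(len(alphas)), key=values.__getitem__)
    return values[i], i


# Hypothetical alpha vectors of three policy trees and their root actions.
alphas = [[1.0, -1.0], [0.63, 0.63], [-1.0, 1.0]]
root_actions = ["act-0", "listen", "act-1"]

# Sweep the 1D belief space b = (1 - x, x) and record the optimal root action.
partition = []
for k in range(11):
    x = k / 10
    _, i = best_tree([1 - x, x], alphas)
    partition.append(root_actions[i])
```

The resulting `partition` splits the belief interval into three contiguous regions: commit to one action when the belief is close to either corner, and gather information in between.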

16 POMDP Approximations
PBVI: Point-Based Value Iteration
Q-MDPs
AMDPs: Augmented MDPs
Monte Carlo MDPs

17 Abstraction in Learning and Planning
General problem in AI: semantics of abstract knowledge
Reduction of complexity by exploiting task structure
How can different levels of abstraction be related?
- spatial: states
- temporal: time scales
Environmentally implied or repeated action trajectories
Options are also called skills, macros, or temporally abstract actions
In other contexts also: behavioral primitives, behaviors, schemata

18 Options
Sequences of actions that follow a common theme but are not of fixed length
$o = \langle I, \pi, \beta \rangle$ (call-and-return option)
- $I \subseteq S$: set of starting states
- $\pi : S \times A \to [0, 1]$: probabilistic policy to be followed during $o$
- $\beta : S \to [0, 1]$: probability of terminating in each state
Example: docking of a robot
- $I$: all states in which the charger is in sight
- $\pi$: approach the charger if the "recharging necessary" bit is set
- $\beta$: terminate when docked or when the charger is not visible
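An option can be represented as a small record holding its three components, plus a loop that runs it call-and-return style. The sketch below is illustrative only: the `Option` class, `run_option`, and the corridor environment are hypothetical, and the policy is simplified to a deterministic state-to-action map instead of the probabilistic policy in the definition.

```python
import random
from dataclasses import dataclass
from typing import Callable, Set


@dataclass
class Option:
    initiation: Set[int]                  # I: states where the option may start
    policy: Callable[[int], int]          # pi: state -> action (deterministic here)
    termination: Callable[[int], float]   # beta: state -> termination probability


def run_option(option, state, step, rng=random):
    """Run the option until it terminates; return (final_state, rewards)."""
    if state not in option.initiation:
        raise ValueError("option is not available in this state")
    rewards = []
    while True:
        state, r = step(state, option.policy(state))
        rewards.append(r)
        if rng.random() < option.termination(state):
            return state, rewards


# Hypothetical corridor 0..5 with the charger at state 5, visible from 2 onwards.
def step(state, action):
    return min(state + action, 5), 0.0


dock = Option(
    initiation={2, 3, 4},
    policy=lambda s: 1,                                   # move towards the charger
    termination=lambda s: 1.0 if s == 5 or s < 2 else 0.0,
)
final_state, rewards = run_option(dock, 2, step)
```

Started in state 2, the docking option takes three primitive steps and terminates at the charger. The caller only sees the final state and the rewards collected on the way.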

19 Exploiting the Structure of the Environment: Rooms Example
4 rooms, 4 hallways
4 unreliable primitive actions: up, left, right, down; each fails 33% of the time
8 multi-step options (to each room's 2 hallways)
Given a goal location, plan the shortest route
All rewards are 0, $\gamma = 0.9$

20 Options and Semi-Markov Decision Processes (SMDPs)
[Figure: state trajectories over time under three models]
MDP: discrete time, homogeneous discount
SMDP: continuous time, discrete events, interval-dependent discount
Options over MDP: discrete time, overlaid discrete events, interval-dependent discount

21 Value Functions for Options
Value functions for options can be defined similarly to the MDP case:
$V^\pi(s) = E[r_{t+1} + \gamma r_{t+2} + \cdots \mid \pi, s, t]$
$Q^\pi(s, o) = E[r_{t+1} + \gamma r_{t+2} + \cdots \mid o, \pi, s, t]$
Consider policies $\pi \in \Pi(O)$ that can choose only among options:
$V^*_O(s) = \max_{\pi \in \Pi(O)} V^\pi(s)$
$Q^*_O(s, o) = \max_{\pi \in \Pi(O)} Q^\pi(s, o)$
These are optimal w.r.t. the Bellman criterion for $O$.

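22 Consequences of Choosing an Option
Reward part:
$R(s, o) = E[r_1 + \gamma r_2 + \cdots + \gamma^{k-1} r_k \mid s_0 = s,\ o \text{ taken in } s_0 \text{ for } k \text{ steps}]$
Next-state part:
$P(s' \mid s, o) = E[\gamma^k \delta_{s_k s'} \mid s_0 = s,\ o \text{ taken in } s_0 \text{ for } k \text{ steps}]$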
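Both expectations can be approximated by averaging over sampled executions of the option from a fixed start state. The following is a minimal sketch assuming each rollout is stored as a hypothetical pair `(rewards, final_state)`. The function name and the sample data are illustrative, not part of the lecture.

```python
def option_model(rollouts, gamma, n_states):
    """Estimate R(s, o) and P(. | s, o) from rollouts of o started in s.

    Each rollout is (rewards, final_state): the rewards r_1..r_k collected
    while the option ran and the state s_k in which it terminated.
    """
    r_hat = 0.0
    p_hat = [0.0] * n_states
    for rewards, final_state in rollouts:
        k = len(rewards)
        # Discounted return r_1 + gamma r_2 + ... + gamma^(k-1) r_k
        r_hat += sum(gamma ** i * r for i, r in enumerate(rewards))
        # gamma^k credited to the state where the option terminated
        p_hat[final_state] += gamma ** k
    m = len(rollouts)
    return r_hat / m, [p / m for p in p_hat]


# Two hypothetical rollouts of the same option, lasting 3 and 2 steps.
rollouts = [([0.0, 0.0, 1.0], 3), ([0.0, 1.0], 3)]
r_model, p_model = option_model(rollouts, gamma=0.9, n_states=4)
```

Note that the entries of `p_model` sum to less than 1: the next-state part is a discounted quantity, not a probability distribution.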

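23 Synchronous Value Iteration Generalized to Options
Initialize: $V_0(s) \leftarrow 0, \ \forall s \in S$
Iterate: $V_{k+1}(s) \leftarrow \max_{o \in O} \left( R(s, o) + \sum_{s' \in S} P(s' \mid s, o) V_k(s') \right), \ \forall s \in S$
No explicit $\gamma$ appears in the backup because $P(s' \mid s, o)$ is already discounted by $\gamma^k$.
Converges to the optimal value function, given the options: $\lim_{k \to \infty} V_k = V^*_O$
Once $V^*_O$ is computed, $\pi^*_O$ can be determined.
If $O = A$, we are back to conventional value iteration.
If $A \subseteq O$, then $V^*_O = V^*$.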
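The backup over option models can be sketched as follows. The layout is an assumption made for the example: `R[s]` and `P[s]` are dictionaries keyed by the options available in `s`, and `P[s][o]` holds the discounted next-state model as a list. The three-state chain and the option names are hypothetical.

```python
def option_value_iteration(R, P, n_states, tol=1e-10, max_iter=10_000):
    """Synchronous value iteration over option models.

    R[s][o]     : reward model of option o started in s
    P[s][o][s2] : discounted next-state model (gamma^k already folded in)
    States with no available option keep value 0.
    """
    V = [0.0] * n_states
    for _ in range(max_iter):
        V_new = [
            max(
                (R[s][o] + sum(P[s][o][s2] * V[s2] for s2 in range(n_states))
                 for o in R[s]),
                default=0.0,
            )
            for s in range(n_states)
        ]
        if max(abs(x - y) for x, y in zip(V_new, V)) < tol:
            return V_new
        V = V_new
    return V


# Hypothetical chain 0 -> 1 -> 2 with gamma = 0.9 and reward 1 on entering 2.
# "step" is a primitive action; "go" is a two-step option from state 0.
R = [{"step": 0.0, "go": 0.9}, {"step": 1.0}, {}]
P = [
    {"step": [0.0, 0.9, 0.0], "go": [0.0, 0.0, 0.81]},
    {"step": [0.0, 0.0, 0.9]},
    {},
]
V = option_value_iteration(R, P, n_states=3)
```

Here the primitive action is kept alongside the option, so the result matches ordinary value iteration on the chain, in line with the last statement on the slide.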

24 Rooms Example

25 If Goal ≠ Subgoal: Both Primitive Actions and Options

26 Benefits of Options
Transfer
- Solutions to sub-tasks can be saved and reused
- Domain knowledge can be provided as options and subgoals
Potentially much faster learning and planning
- By representing action at an appropriate temporal scale
Models of options are a form of knowledge representation
- Expressive
- Clear
- Suitable for learning and planning
Much more to learn than just one policy, one set of values
- Framework for constructivism
- For finding models of the world that are useful for rapid planning and learning

27 Disadvantages of Hierarchical RL
Which options are useful?
A suboptimal choice of options implies suboptimal behavior.
Options typically become rigid when high-level planning sets in.
Negative transfer: options learned for one task may be inappropriate for a relatively similar task.
The algorithm's complexity increases... no free lunches!

28 Tasks and Subgoals
An option is the solution to a subtask, e.g., in the rooms example: treat hallways as subgoals
Subgoal $\langle G, g \rangle$
- $G \subseteq S$: states that form a subgoal
- $g : G \to \mathbb{R}$: subgoal values
Value function for option $o$:
$V^o_g(s) = E[r_1 + \gamma r_2 + \cdots + \gamma^{k-1} r_k + g(s_k) \mid s_0 = s,\ o,\ s_k \in G]$
$V^*_g(s) = \max_o V^o_g(s)$
Learning: intra-option, off-policy

29 Intra-option Learning
Learning with options does not exclude learning of options.
SMDP learning: execute the option to termination, then update only the option taken.
Intra-option learning: after each primitive action, update all the options that could have taken that action.
Proven to converge to the correct values under the assumptions of 1-step Q-learning.
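A single intra-option update might look like the sketch below. The slide does not spell out the update rule, so this follows the standard intra-option Q-learning target (continue the option with probability 1 - β, otherwise terminate and pick the best option), which is an assumption here. It also assumes deterministic option policies, ignores initiation sets, and uses hypothetical names throughout.

```python
def intra_option_update(Q, options, s, a, r, s_next, alpha, gamma):
    """Update Q[s][o] for every option o whose policy picks action a in s.

    options maps a name to (policy, beta), with policy: state -> action
    (deterministic, an assumption) and beta: state -> termination probability.
    """
    best_next = max(Q[s_next].values())
    for name, (policy, beta) in options.items():
        if policy(s) != a:
            continue  # this option could not have taken the observed action
        # Value on arrival in s_next: keep following o, or terminate and re-choose.
        u = (1 - beta(s_next)) * Q[s_next][name] + beta(s_next) * best_next
        Q[s][name] += alpha * (r + gamma * u - Q[s][name])


# Hypothetical setup: options A and B both pick action 0, option C picks action 1.
options = {
    "A": (lambda s: 0, lambda s: 1.0),   # always terminates
    "B": (lambda s: 0, lambda s: 0.0),   # never terminates
    "C": (lambda s: 1, lambda s: 1.0),
}
Q = {
    0: {"A": 0.0, "B": 0.0, "C": 0.0},
    1: {"A": 1.0, "B": 0.0, "C": 2.0},
}
intra_option_update(Q, options, s=0, a=0, r=1.0, s_next=1, alpha=0.5, gamma=0.9)
```

One observed transition updates both A and B, while C is left untouched because it would have chosen a different action. This is where the data efficiency over SMDP learning comes from.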

30 Single-objective vs Multi-objective MDPs: MOMDPs
In some cases agents may have multiple objectives (i.e., reward functions).
The optimal policy may change according to the importance given to each objective.
The goal is to find the policies on the Pareto frontier:
- each policy has a performance value for each objective
- if a policy performs worse than another for each objective, it is said to be dominated
- the Pareto frontier is the set of non-dominated solutions
The solution is a (possibly infinite) set of policies.
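For a finite set of candidate policies, the non-dominated set can be found by pairwise comparison of their return vectors. The sketch below uses the usual Pareto convention (at least as good on every objective and strictly better on at least one), which is slightly broader than the wording on the slide. The policy names and return values are made up for illustration.

```python
def dominates(u, v):
    """True if u is at least as good as v everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(u, v)) and any(x > y for x, y in zip(u, v))


def pareto_frontier(returns):
    """Return the names of policies whose return vectors are not dominated."""
    return [
        name
        for name, u in returns.items()
        if not any(dominates(v, u) for other, v in returns.items() if other != name)
    ]


# Hypothetical expected returns of four policies on two objectives.
returns = {
    "pi1": (10.0, 1.0),
    "pi2": (6.0, 6.0),
    "pi3": (1.0, 10.0),
    "pi4": (5.0, 5.0),   # worse than pi2 on both objectives
}
frontier = pareto_frontier(returns)
```

Which frontier policy is finally chosen depends on the importance given to each objective. The filter only discards policies that no weighting could prefer.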

31 MDP without Reward: Intrinsically Motivated and Inverse Reinforcement Learning
It may happen that the reward function is not available. This framework is interesting in two scenarios:
Intrinsically motivated learning
- It is the kind of learning that allows puppies and babies to autonomously develop skills.
- The reward signal is not extrinsic, but is produced by the learning algorithm itself according to the environmental characteristics.
Inverse reinforcement learning
- In many real-world applications, we do not want an agent to learn from scratch by performing random exploratory actions.
- By observing the behavior of an expert, we want to infer the reward function she is optimizing.
- It is related to imitation learning.

32 Learning from Scratch vs Reuse of Knowledge
Usually, RL algorithms are designed to learn from scratch.
Animals and human beings never learn from scratch.
New tasks can be solved by reusing knowledge learned when solving similar tasks: Transfer Learning
What kind of knowledge can be transferred?
- value functions
- policies
- experience samples
Problems:
- When are two tasks similar?
- How can negative transfer be prevented?

33 Stationary vs Non-stationary MDPs: Multi-agent Learning
In many real-world problems, the transition dynamics and the reward function may be time dependent.
In the cyclostationary case, time can be added to the state space.
What happens in the presence of other agents?
- if the other agents' policies are stationary, we can ignore them
- if the other agents are learning, things get interesting...
- multi-agent learning is much more complex than single-agent learning
- it is closely related to Game Theory
- the optimal policy is replaced by best response and equilibrium policy
- competitive agents
- cooperative agents

The Problem of Temporal Abstraction
How do we connect the high level to the low level?
- the human level to the physical level?
- the decision level to the action level?
MDPs are great, search is great,