CS 188: Artificial Intelligence Fall Markov Decision Processes

CS 188: Artificial Intelligence, Fall 2007
Lecture 10: MDPs, 9/27/2007
Dan Klein, UC Berkeley

Markov Decision Processes

An MDP is defined by:
  A set of states s ∈ S
  A set of actions a ∈ A
  A transition function T(s, a, s')
    Probability that a from s leads to s', i.e., P(s' | s, a)
    Also called the model
  A reward function R(s, a, s')
    Sometimes just R(s) or R(s')
  A start state (or distribution)
  Maybe a terminal state

MDPs are a family of nondeterministic search problems
  Reinforcement learning: MDPs where we don't know the transition or reward functions
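The definition above maps fairly directly onto a small data structure. The following is a minimal sketch, not the course's code: the class name `MDP`, its field names, and the toy states "A", "B", "end" are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class MDP:
    """Container mirroring the MDP definition (all field names are illustrative)."""
    states: list
    actions: list
    # transitions[(s, a)] is a list of (s_next, probability, reward) triples
    transitions: dict
    start: object
    terminals: set = field(default_factory=set)

    def T(self, s, a, s_next):
        """Transition probability P(s_next | s, a)."""
        return sum(p for (s2, p, _) in self.transitions.get((s, a), [])
                   if s2 == s_next)

    def R(self, s, a, s_next):
        """Reward for the transition (s, a, s_next); 0.0 if it cannot occur."""
        for (s2, _, r) in self.transitions.get((s, a), []):
            if s2 == s_next:
                return r
        return 0.0


# A hypothetical three-state MDP, used only to exercise the class.
toy = MDP(
    states=["A", "B", "end"],
    actions=["go", "stop"],
    transitions={
        ("A", "go"): [("B", 0.8, 1.0), ("A", 0.2, 0.0)],
        ("A", "stop"): [("end", 1.0, 0.0)],
        ("B", "go"): [("A", 1.0, 2.0)],
        ("B", "stop"): [("end", 1.0, 5.0)],
    },
    start="A",
    terminals={"end"},
)
```

Storing the model as explicit (s', probability, reward) triples keeps T and R in one place, which is convenient for the one-step lookahead sums used later in the lecture.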

Example: High-Low

Three card types: 2, 3, 4
Infinite deck, twice as many 2's
Start with 3 showing
After each card, you say "high" or "low"
New card is flipped
If you're right, you win the points shown on the new card
Ties are no-ops
If you're wrong, game ends

Differences from expectimax:
  #1: get rewards as you go
  #2: you might play forever!

High-Low

States: 2, 3, 4, done
Actions: High, Low
Model: T(s, a, s'):
  P(s'=done | 4, High) = 3/4
  P(s'=2 | 4, High) = 0
  P(s'=3 | 4, High) = 0
  P(s'=4 | 4, High) = 1/4
  P(s'=done | 4, Low) = 0
  P(s'=2 | 4, Low) = 1/2
  P(s'=3 | 4, Low) = 1/4
  P(s'=4 | 4, Low) = 1/4
Rewards: R(s, a, s'):
  Number shown on s' if s ≠ s'
  0 otherwise
Start: 3

Note: could choose actions with search. How?
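The High-Low model can be generated from the rules rather than typed in by hand. The sketch below is one possible encoding, assuming the card probabilities 1/2, 1/4, 1/4 implied by "twice as many 2's"; the function name `high_low_outcomes` and the string "done" for the terminal state are illustrative choices.

```python
from fractions import Fraction

# Infinite deck with twice as many 2's as 3's or 4's.
CARD_PROB = {2: Fraction(1, 2), 3: Fraction(1, 4), 4: Fraction(1, 4)}
DONE = "done"


def high_low_outcomes(s, a):
    """Return {s_next: (probability, reward)} when card s shows and we say a."""
    outcomes = {}
    for card, p in CARD_PROB.items():
        if card == s:
            s_next, r = s, 0            # tie: no-op, no reward
        elif (a == "High") == (card > s):
            s_next, r = card, card      # correct guess: win the points shown
        else:
            s_next, r = DONE, 0         # wrong guess: game ends
        prev_p, _ = outcomes.get(s_next, (Fraction(0), 0))
        outcomes[s_next] = (prev_p + p, r)
    return outcomes


print(high_low_outcomes(4, "High"))
print(high_low_outcomes(4, "Low"))
```

For state 4 this reproduces the table above: saying High ends the game with probability 3/4 and ties with probability 1/4, while saying Low can never lose.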

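Example: High-Low

[Figure: search tree rooted at state 3 with q-states (3, High) and (3, Low), each followed again by High/Low choices. Edge labels recoverable from one q-state: T = 0.5, R = 2; T = 0.25, R = 3; T = 0, R = 4; T = 0.25, R = (value missing in source)]

MDP Search Trees

Each MDP state gives an expectimax-like search tree
  s is a state
  (s, a) is a q-state
  (s, a, s') is called a transition
    T(s, a, s') = P(s' | s, a)
    R(s, a, s')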

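Utilities of Sequences

In order to formalize optimality of a policy, need to understand utilities of sequences of rewards
Typically consider stationary preferences:
  $[s_0, s_1, s_2, \ldots] \succ [s_0, s_1', s_2', \ldots] \Leftrightarrow [s_1, s_2, \ldots] \succ [s_1', s_2', \ldots]$
Theorem: only two ways to define stationary utilities
  Additive utility: $U([s_0, s_1, s_2, \ldots]) = R(s_0) + R(s_1) + R(s_2) + \cdots$
  Discounted utility: $U([s_0, s_1, s_2, \ldots]) = R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots$
Assuming that reward depends only on state for these slides!

Infinite Utilities?!

Problem: infinite sequences with infinite rewards
Solutions:
  Finite horizon:
    Terminate after a fixed T steps
    Gives nonstationary policy (π depends on time left)
  Absorbing state(s): guarantee that for every policy, agent will eventually "die" (like "done" for High-Low)
  Discounting: for 0 < γ < 1
    $U([s_0, s_1, s_2, \ldots]) = \sum_{t=0}^{\infty} \gamma^t R(s_t) \le R_{\max} / (1 - \gamma)$
    Smaller γ means smaller "horizon": shorter term focus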
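The reason discounting keeps utilities finite is a single geometric-series step, worked out here. The symbol $R_{\max}$ (an upper bound on the per-step reward) and the numbers in the example are assumptions introduced for illustration.

```latex
\begin{align*}
U([s_0, s_1, s_2, \ldots])
  &= \sum_{t=0}^{\infty} \gamma^t R(s_t)
   \le \sum_{t=0}^{\infty} \gamma^t R_{\max}
   = \frac{R_{\max}}{1-\gamma},
   \qquad 0 < \gamma < 1 \\[4pt]
\text{e.g. } \gamma = 0.9,\ R_{\max} = 4
  &:\quad U \le \frac{4}{1 - 0.9} = 40
\end{align*}
```

With γ = 0.9 the bound behaves like an effective horizon of about 1/(1 − γ) = 10 steps, which is one way to read "smaller γ means smaller horizon".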

Discounting

Typically discount rewards by γ < 1 each time step
  Sooner rewards have higher utility than later rewards
  Also helps the algorithms converge

Episodes and Returns

An episode is a run of an MDP
  Sequence of transitions (s, a, s')
  Starts at start state
  Ends at terminal state (if it ends)
  Stochastic!
The utility, or return, of an episode
  The discounted sum of the rewards
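The return of a single episode is a short computation. This is a minimal sketch; the function name `discounted_return` and the example reward sequence are hypothetical.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over the rewards of one episode, t = 0, 1, 2, ..."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total


# A made-up episode that earned 2, then 0 (a tie), then 4.
episode_rewards = [2, 0, 4]
print(discounted_return(episode_rewards, gamma=1.0))  # undiscounted sum
print(discounted_return(episode_rewards, gamma=0.5))  # later rewards count less
```

Because episodes are stochastic, this number varies from run to run; the values V and Q defined later are expectations of it.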

Utilities under Policies

Fundamental operation: compute the utility of a state
Define the value (utility) of a state s, under a fixed policy π:
  $V^\pi(s)$ = expected return starting in s and following π
Recursive relation (one-step lookahead):
  $V^\pi(s) = \sum_{s'} T(s, \pi(s), s') \left[ R(s, \pi(s), s') + \gamma V^\pi(s') \right]$

Policy Evaluation

How do we calculate values for a fixed policy?
Idea one: it's just a linear system, solve with Matlab (or whatever)
Idea two: turn recursive equations into updates
  $V_i^\pi(s)$ = expected return over the next i transitions while following π
  $V_0^\pi(s) = 0$
  $V_{i+1}^\pi(s) \leftarrow \sum_{s'} T(s, \pi(s), s') \left[ R(s, \pi(s), s') + \gamma V_i^\pi(s') \right]$
  Equivalent to doing depth i search and plugging in zero at leaves
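"Idea two" can be sketched in a few lines. The code below is an illustrative implementation under stated assumptions: the function name `evaluate_policy`, the dictionary layout of the model, and the tiny two-state example are all hypothetical, and a fixed number of sweeps stands in for a proper convergence test.

```python
def evaluate_policy(transitions, policy, gamma=0.9, sweeps=100):
    """Iterative policy evaluation.

    transitions[(s, a)] is a list of (s_next, prob, reward) triples.
    States that never appear as a source are treated as terminal (value 0).
    """
    states = {s for (s, _) in transitions} | {
        s2 for outs in transitions.values() for (s2, _, _) in outs
    }
    V = {s: 0.0 for s in states}                      # V_0 = 0 everywhere
    for _ in range(sweeps):
        # V_{i+1}(s) = sum_{s'} T(s, pi(s), s') [R(s, pi(s), s') + gamma V_i(s')]
        V = {
            s: sum(p * (r + gamma * V[s2])
                   for (s2, p, r) in transitions.get((s, policy.get(s)), []))
            for s in states
        }
    return V


# Hypothetical model: from A, "go" stalls half the time; from B, "go" ends the episode.
model = {
    ("A", "go"): [("A", 0.5, 0.0), ("B", 0.5, 1.0)],
    ("B", "go"): [("end", 1.0, 2.0)],
}
values = evaluate_policy(model, {"A": "go", "B": "go"}, gamma=0.5)
print(values)
```

Solving the same equations by hand gives V(B) = 2 and V(A) = 0.25 V(A) + 1, so V(A) = 4/3, which is what "idea one" (the linear system) would return directly.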

Example: High-Low

Policy: always say "high"
Iterative updates:
[DEMO]

Q-Functions

Also, define a q-value, for a state and action (q-state)
  $Q^\pi(s, a)$ = expected return starting in s, taking action a and following π thereafter
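The "always say high" policy can be evaluated with the iterative updates, and a q-value then follows from one extra lookahead. This sketch assumes no discounting (γ = 1), which the slides do not state for High-Low, and uses illustrative names (`outcomes`, `q_value`).

```python
CARD_PROB = {2: 0.5, 3: 0.25, 4: 0.25}


def outcomes(s, a):
    """List of (s_next, prob, reward) when card s shows and we say a."""
    if s == "done":
        return []
    result = []
    for card, p in CARD_PROB.items():
        if card == s:
            result.append((s, p, 0.0))              # tie: no-op
        elif (a == "High") == (card > s):
            result.append((card, p, float(card)))   # correct guess
        else:
            result.append(("done", p, 0.0))         # wrong guess
    return result


def q_value(V, s, a, gamma=1.0):
    """One-step lookahead: take a in s, then collect V from the next state."""
    return sum(p * (r + gamma * V[s2]) for (s2, p, r) in outcomes(s, a))


# Iterative policy evaluation for the fixed policy "always High" (gamma = 1 assumed).
V = {2: 0.0, 3: 0.0, 4: 0.0, "done": 0.0}
for _ in range(200):
    V = {s: q_value(V, s, "High") for s in V}

print(V)                       # values under "always High"
print(q_value(V, 3, "Low"))    # Q^pi(3, Low): say Low once, then always High
```

Under these assumptions the values work out to V(4) = 0, V(3) = 4/3 and V(2) = 25/6, and Q(3, Low) = 41/12 exceeds V(3), which already hints that "always high" is not the best policy in state 3.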

Recap: MDP Quantities

Return = Sum of future discounted rewards in one episode (stochastic)
V: Expected return from a state under a policy
Q: Expected return from a q-state under a policy

Optimal Utilities

Fundamental operation: compute the optimal utilities of states s
Define the utility of a state s:
  $V^*(s)$ = expected return starting in s and acting optimally
Define the utility of a q-state (s, a):
  $Q^*(s, a)$ = expected return starting in s, taking action a and thereafter acting optimally
Define the optimal policy:
  $\pi^*(s)$ = optimal action from state s

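The Bellman Equations

Definition of utility leads to a simple relationship amongst optimal utility values:
  Optimal rewards = maximize over first action and then follow optimal policy
Formally:
  $V^*(s) = \max_a Q^*(s, a)$
  $Q^*(s, a) = \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^*(s') \right]$
  $V^*(s) = \max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^*(s') \right]$

Solving MDPs

We want to find the optimal policy π*
Proposal 1: modified expectimax search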
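Proposal 1 can be sketched as a depth-limited recursion on High-Low: maximize over actions, average over outcomes, and add rewards along the way. This is an illustrative sketch, assuming γ = 1 and a hard depth cutoff; the names `outcomes` and `expectimax_value` are not from the lecture.

```python
CARD_PROB = {2: 0.5, 3: 0.25, 4: 0.25}
ACTIONS = ("High", "Low")


def outcomes(s, a):
    """List of (s_next, prob, reward) when card s shows and we say a."""
    if s == "done":
        return []
    result = []
    for card, p in CARD_PROB.items():
        if card == s:
            result.append((s, p, 0.0))              # tie: no-op
        elif (a == "High") == (card > s):
            result.append((card, p, float(card)))   # correct guess
        else:
            result.append(("done", p, 0.0))         # wrong guess
    return result


def expectimax_value(s, depth, gamma=1.0):
    """Best expected return from s using at most `depth` more actions."""
    if s == "done" or depth == 0:
        return 0.0                                   # plug in zero at the leaves
    return max(
        sum(p * (r + gamma * expectimax_value(s2, depth - 1, gamma))
            for (s2, p, r) in outcomes(s, a))
        for a in ACTIONS
    )


for depth in (1, 2, 3):
    print(depth, expectimax_value(3, depth))
```

Without caching, each level branches over 2 actions and up to 3 outcomes, so the work grows exponentially with depth even though only four distinct states exist.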

MDP Search Trees?

Problems:
  This tree is usually infinite (why?)
  The same states appear over and over (why?)
  There's actually one tree per state (why?)
Ideas:
  Compute to a finite depth (like expectimax)
  Consider returns from sequences of increasing length
  Cache values so we don't repeat work

Value Estimates

Calculate estimates $V_k^*(s)$
  Not the optimal value of s!
  The optimal value considering only next k time steps (k rewards)
  As k → ∞, it approaches the optimal value
Why:
  If discounting, distant rewards become negligible
  If terminal states reachable from everywhere, fraction of episodes not ending becomes negligible
  Otherwise, can get infinite expected utility and then this approach actually won't work

Memoized Recursion?

Recurrences:
  $V_0^*(s) = 0$
  $V_k^*(s) = \max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V_{k-1}^*(s') \right]$
Cache all function call results so you never repeat work
What happened to the evaluation functions?

Value Iteration

Problems with the recursive computation:
  Have to keep all the $V_k^*(s)$ around all the time
  Don't know which depth $\pi_k(s)$ to ask for when planning
Solution: value iteration
  Calculate values for all states, bottom-up
  Keep increasing k until convergence
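Caching the recurrence is a one-line change in Python. The sketch below memoizes the depth-k value on High-Low with `functools.lru_cache`; γ = 1 and the names `outcomes` and `value_k` are assumptions made for illustration.

```python
from functools import lru_cache

CARD_PROB = {2: 0.5, 3: 0.25, 4: 0.25}
ACTIONS = ("High", "Low")
GAMMA = 1.0  # assumed: the High-Low example states no discount


def outcomes(s, a):
    """List of (s_next, prob, reward) when card s shows and we say a."""
    if s == "done":
        return []
    result = []
    for card, p in CARD_PROB.items():
        if card == s:
            result.append((s, p, 0.0))              # tie: no-op
        elif (a == "High") == (card > s):
            result.append((card, p, float(card)))   # correct guess
        else:
            result.append(("done", p, 0.0))         # wrong guess
    return result


@lru_cache(maxsize=None)
def value_k(s, k):
    """Optimal value of s considering only the next k time steps."""
    if s == "done" or k == 0:
        return 0.0
    return max(
        sum(p * (r + GAMMA * value_k(s2, k - 1)) for (s2, p, r) in outcomes(s, a))
        for a in ACTIONS
    )


deep = value_k(3, 50)
print(deep, value_k.cache_info())
```

The cache holds at most one entry per (state, k) pair, so the cost is linear in k rather than exponential. The drawbacks listed above remain: every depth stays in memory, and the recursion depth is limited by the interpreter.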

Value Iteration

Idea:
  Start with $V_0^*(s) = 0$, which we know is right (why?)
  Given $V_i^*$, calculate the values for all states for depth i+1:
    $V_{i+1}^*(s) \leftarrow \max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V_i^*(s') \right]$
  This is called a value update or Bellman update
  Repeat until convergence
Theorem: will converge to unique optimal values
  Basic idea: approximations get refined towards optimal values
  Policy may converge long before values do

Example: Bellman Updates

[Figure: worked Bellman update, not recoverable from the source]
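The bottom-up loop can be sketched as follows on the High-Low game. This is an illustrative implementation, not the lecture's demo code: it assumes γ = 1 (convergence then relies on "done" being reachable from every state), and the names `outcomes` and `value_iteration` and the stopping tolerance are choices made here.

```python
CARD_PROB = {2: 0.5, 3: 0.25, 4: 0.25}
STATES = [2, 3, 4, "done"]
ACTIONS = ("High", "Low")


def outcomes(s, a):
    """List of (s_next, prob, reward) when card s shows and we say a."""
    if s == "done":
        return []
    result = []
    for card, p in CARD_PROB.items():
        if card == s:
            result.append((s, p, 0.0))              # tie: no-op
        elif (a == "High") == (card > s):
            result.append((card, p, float(card)))   # correct guess
        else:
            result.append(("done", p, 0.0))         # wrong guess
    return result


def value_iteration(gamma=1.0, tol=1e-10, max_iters=10_000):
    """Repeat Bellman updates on all states until the largest change is below tol."""
    V = {s: 0.0 for s in STATES}                    # V_0 = 0
    for i in range(max_iters):
        new_V = {
            s: max(sum(p * (r + gamma * V[s2]) for (s2, p, r) in outcomes(s, a))
                   for a in ACTIONS)
            for s in STATES
        }
        change = max(abs(new_V[s] - V[s]) for s in STATES)
        V = new_V
        if change < tol:
            break
    return V, i + 1


V_star, iterations = value_iteration()
print(V_star, iterations)
```

Only the most recent value table is kept, which addresses the memory problem of the memoized recursion. Under the γ = 1 assumption the fixed point is V(2) = 25, V(3) = 18, V(4) = 25, which can be checked by substituting into the Bellman equation.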

Example: Value Iteration

[Figure: value estimates V_2 and V_3]
Information propagates outward from terminal states and eventually all states have correct value estimates
[DEMO]

Convergence*

Define the max-norm:
  $\|U\| = \max_s |U(s)|$
Theorem: For any two approximations U and V
  $\|U^{t+1} - V^{t+1}\| \le \gamma \|U^t - V^t\|$
  I.e. any distinct approximations must get closer to each other, so, in particular, any approximation must get closer to the true U and value iteration converges to a unique, stable, optimal solution
Theorem:
  $\|U^{t+1} - U^t\| < \epsilon \Rightarrow \|U^{t+1} - U\| < 2\epsilon\gamma / (1 - \gamma)$
  I.e. once the change in our approximation is small, it must also be close to correct
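The contraction property in the first theorem can be checked numerically. The sketch below applies one Bellman update to two arbitrary value tables for High-Low and compares max-norm distances before and after; γ = 0.9, the random tables, and all function names are assumptions made for this illustration.

```python
import random

CARD_PROB = {2: 0.5, 3: 0.25, 4: 0.25}
STATES = [2, 3, 4, "done"]
ACTIONS = ("High", "Low")
GAMMA = 0.9  # assumed discount; the theorem needs gamma < 1


def outcomes(s, a):
    """List of (s_next, prob, reward) when card s shows and we say a."""
    if s == "done":
        return []
    result = []
    for card, p in CARD_PROB.items():
        if card == s:
            result.append((s, p, 0.0))              # tie: no-op
        elif (a == "High") == (card > s):
            result.append((card, p, float(card)))   # correct guess
        else:
            result.append(("done", p, 0.0))         # wrong guess
    return result


def bellman_update(V):
    """One synchronous Bellman update of every state."""
    return {
        s: max(sum(p * (r + GAMMA * V[s2]) for (s2, p, r) in outcomes(s, a))
               for a in ACTIONS)
        for s in STATES
    }


def max_norm_distance(U, W):
    """||U - W|| = max over states of |U(s) - W(s)|."""
    return max(abs(U[s] - W[s]) for s in STATES)


rng = random.Random(0)
U = {2: rng.uniform(-10, 10), 3: rng.uniform(-10, 10), 4: rng.uniform(-10, 10), "done": 0.0}
W = {2: rng.uniform(-10, 10), 3: rng.uniform(-10, 10), 4: rng.uniform(-10, 10), "done": 0.0}

before = max_norm_distance(U, W)
after = max_norm_distance(bellman_update(U), bellman_update(W))
print(before, after, after <= GAMMA * before)
```

Whatever the starting tables, one update shrinks their distance by at least a factor γ, which is why repeated updates from any initial guess end up at the same fixed point.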

Policy Iteration

Alternative approach:
  Step 1: Policy evaluation: calculate utilities for a fixed policy (not optimal utilities!) until convergence
  Step 2: Policy improvement: update policy based on resulting converged (but not optimal!) utilities
  Repeat steps until policy converges
This is policy iteration
  Can converge faster under some conditions

Policy Iteration

Policy evaluation: with fixed current policy $\pi_k$, find values with simplified Bellman updates:
  $V_{i+1}^{\pi_k}(s) \leftarrow \sum_{s'} T(s, \pi_k(s), s') \left[ R(s, \pi_k(s), s') + \gamma V_i^{\pi_k}(s') \right]$
  Iterate until values converge
Policy improvement: with fixed utilities, find the best action according to one-step look-ahead
  $\pi_{k+1}(s) = \arg\max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^{\pi_k}(s') \right]$
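The two alternating steps can be sketched on High-Low, starting from the "always High" policy. This is a minimal illustration under assumptions: γ = 1, a fixed number of evaluation sweeps in place of a convergence test, and hypothetical function names.

```python
CARD_PROB = {2: 0.5, 3: 0.25, 4: 0.25}
STATES = [2, 3, 4]            # non-terminal states; "done" is terminal
ACTIONS = ("High", "Low")


def outcomes(s, a):
    """List of (s_next, prob, reward) when card s shows and we say a."""
    result = []
    for card, p in CARD_PROB.items():
        if card == s:
            result.append((s, p, 0.0))              # tie: no-op
        elif (a == "High") == (card > s):
            result.append((card, p, float(card)))   # correct guess
        else:
            result.append(("done", p, 0.0))         # wrong guess
    return result


def q_value(V, s, a, gamma):
    return sum(p * (r + gamma * V[s2]) for (s2, p, r) in outcomes(s, a))


def evaluate(policy, gamma, sweeps=500):
    """Step 1: simplified Bellman updates with the policy frozen."""
    V = {s: 0.0 for s in STATES}
    V["done"] = 0.0
    for _ in range(sweeps):
        new_V = {s: q_value(V, s, policy[s], gamma) for s in STATES}
        new_V["done"] = 0.0
        V = new_V
    return V


def improve(V, gamma):
    """Step 2: pick the best action by one-step look-ahead on fixed utilities."""
    return {s: max(ACTIONS, key=lambda a: q_value(V, s, a, gamma)) for s in STATES}


def policy_iteration(gamma=1.0, max_rounds=100):
    policy = {s: "High" for s in STATES}
    for _ in range(max_rounds):
        V = evaluate(policy, gamma)
        new_policy = improve(V, gamma)
        if new_policy == policy:        # policy converged
            break
        policy = new_policy
    return policy, V


best_policy, best_values = policy_iteration()
print(best_policy, best_values)
```

On this small game the policy settles after a single improvement step (High at 2, Low at 3 and 4), illustrating how the policy can converge in far fewer rounds than the values need sweeps.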

Comparison

In value iteration:
  Every pass (or "backup") updates both utilities (explicitly, based on current utilities) and policy (possibly implicitly, based on current policy)
In policy iteration:
  Several passes to update utilities with frozen policy
  Occasional passes to update policies
Hybrid approaches (asynchronous policy iteration):
  Any sequences of partial updates to either policy entries or utilities will converge if every state is visited infinitely often
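A simple instance of partial updates is to back up one randomly chosen state at a time, in place, instead of sweeping all states together. The sketch below shows only this utility-update half (asynchronous value backups), not full asynchronous policy iteration; γ = 1, the random schedule, and the function names are assumptions made here.

```python
import random

CARD_PROB = {2: 0.5, 3: 0.25, 4: 0.25}
STATES = [2, 3, 4]            # non-terminal states; "done" is terminal
ACTIONS = ("High", "Low")


def outcomes(s, a):
    """List of (s_next, prob, reward) when card s shows and we say a."""
    result = []
    for card, p in CARD_PROB.items():
        if card == s:
            result.append((s, p, 0.0))              # tie: no-op
        elif (a == "High") == (card > s):
            result.append((card, p, float(card)))   # correct guess
        else:
            result.append(("done", p, 0.0))         # wrong guess
    return result


def q_value(V, s, a, gamma):
    return sum(p * (r + gamma * V[s2]) for (s2, p, r) in outcomes(s, a))


def async_value_backups(n_backups=20_000, gamma=1.0, seed=0):
    """Back up one random state per step, overwriting its value in place."""
    rng = random.Random(seed)
    V = {2: 0.0, 3: 0.0, 4: 0.0, "done": 0.0}
    for _ in range(n_backups):
        s = rng.choice(STATES)   # every state keeps being visited
        V[s] = max(q_value(V, s, a, gamma) for a in ACTIONS)
    return V


V_async = async_value_backups()
print(V_async)
```

Because each backup immediately reuses the freshest values of the other states, no second table is needed, and the result matches the synchronous fixed point as long as no state is starved of updates.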
