10/12/2012. Logistics. Planning Agent. MDPs. Review: Expectimax. PS 2 due Thursday 10/18 (changed from Tuesday). PS 3 due Thursday 10/25.



CSE 473: Markov Decision Processes (MDPs)
Dan Weld
Many slides from Chris Bishop, Mausam, Dan Klein, Stuart Russell, Andrew Moore & Luke Zettlemoyer

Logistics
- PS 2 due Thursday 10/18 (changed from Tuesday)
- PS 3 due Thursday 10/25

MDPs (outline)
- Markov Decision Processes
  - Planning Under Uncertainty
  - Mathematical Framework
  - Bellman Equations
  - Value Iteration
  - Real-Time Dynamic Programming
  - Policy Iteration
- Reinforcement Learning
- Andrey Markov

Planning Agent
[Figure: an agent exchanging Percepts and Actions with its Environment, asking "What action next?"]
- Static vs. Dynamic (environment)
- Fully vs. Partially Observable
- Perfect vs. Noisy (percepts)
- Deterministic vs. Stochastic (actions)
- Instantaneous vs. Durative (actions)

Objective of an MDP
- Find a policy $\pi: S \to A$ which optimizes (discounted or undiscounted) one of:
  - minimizes expected cost to reach a goal
  - maximizes expected reward
  - maximizes expected (reward - cost)
- given a finite, infinite, or indefinite horizon

Review: Expectimax
- What if we don't know what the result of an action will be? E.g.,
  - In solitaire, the next card is unknown
  - In Pacman, the ghosts act randomly
- Can do expectimax search
  - Max nodes as in minimax search
  - Chance nodes, like min nodes, except the outcome is uncertain: take the average (expectation) of the children
  - Calculate expected utilities
- Today, we formalize this as a Markov Decision Process
  - Handles intermediate rewards & infinite plans
  - More efficient processing
[Figure: expectimax tree with a max node above chance nodes]
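The expectimax recursion reviewed above can be sketched in a few lines. This is a minimal illustration, not code from the course: the tuple encoding of nodes and the example tree are assumptions made only for this sketch.

```python
def expectimax(node):
    """Return the expected utility of a tree node.

    Hypothetical node encoding (an assumption of this sketch):
      ("leaf", utility)
      ("max", [child, ...])
      ("chance", [(probability, child), ...])
    """
    kind = node[0]
    if kind == "leaf":
        return node[1]
    if kind == "max":
        # The agent picks the child with the highest expected utility.
        return max(expectimax(child) for child in node[1])
    if kind == "chance":
        # The outcome is uncertain: average the children, weighted by probability.
        return sum(p * expectimax(child) for p, child in node[1])
    raise ValueError(f"unknown node kind: {kind}")


# Made-up example: two actions, each leading to a chance node.
tree = ("max", [
    ("chance", [(0.5, ("leaf", 8)), (0.5, ("leaf", 2))]),
    ("chance", [(0.9, ("leaf", 4)), (0.1, ("leaf", 10))]),
])
root_value = expectimax(tree)
```

The first action has expectation 0.5*8 + 0.5*2 = 5.0 and the second 0.9*4 + 0.1*10 = 4.6, so the max node takes the first.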


Grid World
- Walls block the agent's path
- The agent's actions may go astray:
  - 80% of the time, the North action takes the agent North (assuming no wall)
  - 10% of the time it actually goes West
  - 10% of the time it actually goes East
  - If there is a wall in the chosen direction, the agent stays put
- Small "living" reward each step
- Big rewards come at the end
- Goal: maximize the sum of rewards

Markov Decision Processes
- An MDP is defined by:
  - A set of states $s \in S$
  - A set of actions $a \in A$
  - A transition function $T(s, a, s')$: the probability that $a$ from $s$ leads to $s'$, i.e., $P(s' \mid s, a)$. Also called the model
  - A reward function $R(s, a, s')$; sometimes just $R(s)$ or $R(s')$
  - A start state (or distribution)
  - Maybe a terminal state
- MDPs: non-deterministic search
- Reinforcement learning: MDPs where we don't know the transition or reward functions

What is Markov about MDPs?
- Andrey Markov
- "Markov" generally means that, conditioned on the present state, the future is independent of the past
- For Markov decision processes, "Markov" means:
  $P(S_{t+1} = s' \mid S_t = s_t, A_t = a_t, S_{t-1} = s_{t-1}, A_{t-1} = a_{t-1}, \ldots, S_0 = s_0) = P(S_{t+1} = s' \mid S_t = s_t, A_t = a_t)$

Solving MDPs
- In deterministic single-agent search problems, we want an optimal plan, or sequence of actions, from the start to a goal
- In an MDP, we want an optimal policy $\pi^*: S \to A$
  - A policy $\pi$ gives an action for each state
  - An optimal policy maximizes expected utility if followed
  - Defines a reflex agent
[Figure: optimal policy when $R(s, a, s') = -0.03$ for all non-terminal states $s$]

Example Optimal Policies
[Figure: four grid-world panels showing the optimal policy for R(s) = -0.01, R(s) = -0.03, R(s) = -0.4, and R(s) = -2.0]
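The noisy Grid World motion described above can be written as a transition function $T(s, a, s')$. The sketch below is illustrative only: the 4x3 grid size, the wall position, the coordinate convention, and all function names are assumptions, since the slides do not give the layout in text form.

```python
# Unit moves for each compass action, as (dx, dy) with y increasing to the North.
MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
# The two perpendicular directions the agent may slip into.
SLIPS = {"N": ("W", "E"), "S": ("E", "W"), "E": ("N", "S"), "W": ("S", "N")}


def step(state, direction, width, height, walls):
    """Deterministic move; the agent stays put if blocked by a wall or the edge."""
    x, y = state
    dx, dy = MOVES[direction]
    nxt = (x + dx, y + dy)
    inside = 0 <= nxt[0] < width and 0 <= nxt[1] < height
    if nxt in walls or not inside:
        return state
    return nxt


def transition(state, action, width=4, height=3, walls=frozenset({(1, 1)})):
    """Return {next_state: probability}: 80% intended direction, 10% each side."""
    dist = {}
    left, right = SLIPS[action]
    for direction, p in ((action, 0.8), (left, 0.1), (right, 0.1)):
        nxt = step(state, direction, width, height, walls)
        dist[nxt] = dist.get(nxt, 0.0) + p
    return dist


dist_open = transition((2, 0), "N")     # no obstacle in any of the three directions
dist_corner = transition((0, 0), "N")   # slipping West hits the edge: stay put
dist_blocked = transition((1, 0), "N")  # the intended move hits the assumed wall
```

Probability mass from blocked moves is added to the current cell, which is how "the agent stays put" shows up in the model.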

Example: High-Low
- Three card types: 2, 3, 4
- Infinite deck, twice as many 2's
- Start with 3 showing
- After each card, you say "high" or "low"
- A new card is flipped
- If you're right, you win the points shown on the new card
- Ties are no-ops (no reward)
- If you're wrong, the game ends
- Differences from expectimax problems:
  - #1: you get rewards as you go
  - #2: you might play forever!

High-Low as an MDP
- States: 2, 3, 4, done
- Actions: High, Low
- Model: $T(s, a, s')$:
  - $P(s'=4 \mid 4, \text{Low}) = 1/4$
  - $P(s'=3 \mid 4, \text{Low}) = 1/4$
  - $P(s'=2 \mid 4, \text{Low}) = 1/2$
  - $P(s'=\text{done} \mid 4, \text{Low}) = 0$
  - $P(s'=4 \mid 4, \text{High}) = 1/4$
  - $P(s'=3 \mid 4, \text{High}) = 0$
  - $P(s'=2 \mid 4, \text{High}) = 0$
  - $P(s'=\text{done} \mid 4, \text{High}) = 3/4$
- Rewards: $R(s, a, s')$: the number shown on $s'$ if $s < s'$ and $a$ = "high" (and likewise if $s > s'$ and $a$ = "low"); 0 otherwise
- Start: 3

Search Tree: High-Low
[Figure: search tree from the start state branching on Low and High; branch labels shown: T = 0.5, R = 2; T = 0.25, R = 3; T = 0, R = 4; T = 0.25, R = 0]

MDP Search Trees
- Each MDP state gives an expectimax-like search tree
- $s$ is a state
- $(s, a)$ is a q-state
- $(s, a, s')$ is called a transition, with $T(s, a, s') = P(s' \mid s, a)$ and reward $R(s, a, s')$
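The High-Low model above is small enough to generate from the card probabilities. The following is a sketch under stated assumptions: the function name and the `(next_state, reward)` key format are made up for illustration, and the low-side reward is taken to mirror the high-side rule.

```python
from fractions import Fraction

# Infinite deck with twice as many 2's as 3's or 4's.
CARD_PROBS = {2: Fraction(1, 2), 3: Fraction(1, 4), 4: Fraction(1, 4)}


def high_low_model(state, action):
    """Return {(next_state, reward): probability} for a card state and a guess."""
    outcomes = {}
    for card, p in CARD_PROBS.items():
        if card == state:
            key = (card, 0)        # tie: no-op, no reward
        elif (card > state) == (action == "High"):
            key = (card, card)     # correct guess: win the points on the new card
        else:
            key = ("done", 0)      # wrong guess: the game ends
        outcomes[key] = outcomes.get(key, 0) + p
    return outcomes


from_4_low = high_low_model(4, "Low")
from_4_high = high_low_model(4, "High")
```

Summing over rewards, `from_4_low` and `from_4_high` reproduce the transition probabilities listed for state 4 on the slide.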


Utilities of Sequences
- In order to formalize optimality of a policy, we need to understand utilities of sequences of rewards
- Typically consider stationary preferences:
  $[r, r_0, r_1, r_2, \ldots] \succ [r, r_0', r_1', r_2', \ldots] \Leftrightarrow [r_0, r_1, r_2, \ldots] \succ [r_0', r_1', r_2', \ldots]$
- Theorem: only two ways to define stationary utilities
  - Additive utility: $U([r_0, r_1, r_2, \ldots]) = r_0 + r_1 + r_2 + \cdots$
  - Discounted utility: $U([r_0, r_1, r_2, \ldots]) = r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots$

Infinite Utilities?!
- Problem: infinite state sequences have infinite rewards
- Solutions:
  - Finite horizon: terminate episodes after a fixed T steps (e.g. life). Gives nonstationary policies ($\pi$ depends on the time left)
  - Absorbing state: guarantee that for every policy, a terminal state will eventually be reached (like "done" for High-Low)
  - Discounting: for $0 < \gamma < 1$, $U([r_0, \ldots, r_\infty]) = \sum_{t=0}^{\infty} \gamma^t r_t \le R_{\max} / (1 - \gamma)$
  - Smaller $\gamma$ means a smaller "horizon": shorter-term focus

Discounting
- Typically discount rewards by $\gamma < 1$ each time step
  - Sooner rewards have higher utility than later rewards
  - Also helps the algorithms converge

Recap: Defining MDPs
- Markov decision processes:
  - States $S$
  - Start state $s_0$
  - Actions $A$
  - Transitions $P(s' \mid s, a)$ (aka $T(s, a, s')$)
  - Rewards $R(s, a, s')$ (and discount $\gamma$)
- MDP quantities so far:
  - Policy $\pi$ = function that chooses an action for each state
  - Utility (aka "return") = sum of discounted rewards

Optimal Utilities
- Define the value of a state $s$: $V^*(s)$ = expected utility starting in $s$ and acting optimally
- Define the value of a q-state $(s, a)$: $Q^*(s, a)$ = expected utility starting in $s$, taking action $a$ and thereafter acting optimally
- Define the optimal policy: $\pi^*(s)$ = optimal action from state $s$

Why Not Search Trees?
- Why not solve with expectimax?
- Problems:
  - This tree is usually infinite (why?)
  - The same states appear over and over (why?)
  - We would search once per state (why?)
- Idea: value iteration
  - Compute optimal values for all states all at once using successive approximations
  - Will be a bottom-up dynamic program similar in cost to memoization
  - Do all planning offline, no replanning needed!
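A short numeric check makes the effect of discounting concrete. This is a minimal sketch; the function name and the reward sequences are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Compute r_0 + gamma*r_1 + gamma^2*r_2 + ... for a finite reward list."""
    total = 0.0
    # Fold from the last reward backwards: U_t = r_t + gamma * U_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


gamma = 0.5
early = discounted_return([1, 0, 0], gamma)  # reward arrives immediately
late = discounted_return([0, 0, 1], gamma)   # same reward, two steps later

# With R_max = 1, any reward stream is bounded by R_max / (1 - gamma).
bound = 1 / (1 - gamma)
long_run = discounted_return([1] * 50, gamma)
```

Here `early` is 1.0 while `late` is 0.25, so the sooner reward is worth more, and `long_run` stays below `bound` however long the sequence gets.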

The Bellman Equations
- The definition of "optimal utility" leads to a simple one-step lookahead relationship between optimal utility values:
  $V^*(s) = \max_a Q^*(s, a)$
  $Q^*(s, a) = \sum_{s'} T(s, a, s') [R(s, a, s') + \gamma V^*(s')]$
  $V^*(s) = \max_a \sum_{s'} T(s, a, s') [R(s, a, s') + \gamma V^*(s')]$
- Richard Bellman (1920-1984)

Bellman Backup (MDP)
- Given an estimate of the $V^*$ function (say $V_n$)
- Back up the $V_n$ function at state $s$ to calculate a new estimate ($V_{n+1}$):
  $Q_{n+1}(s, a) = \sum_{s'} T(s, a, s') [R(s, a, s') + \gamma V_n(s')]$
  $V_{n+1}(s) = \max_{a \in Ap(s)} Q_{n+1}(s, a)$
- $Q_{n+1}(s, a)$: value/cost of the strategy: execute action $a$ in $s$, execute $\pi_n$ subsequently
- $\pi_n = \arg\max_{a \in Ap(s)} Q_n(s, a)$

Bellman Backup (example)
[Figure: state $s$ with actions $a_1$, $a_2$, $a_3$ leading to successors with current estimates $V_0 = 0$, $V_0 = 1$, $V_0 = 2$]
- $Q_1(s, a_1) = 2 + \gamma \cdot 0 \approx 2$
- $Q_1(s, a_2) = 5 + \gamma (0.9 \times 1 + 0.1 \times 2) \approx 6.1$
- $Q_1(s, a_3) = 4.5 + \gamma \cdot 2 \approx 6.5$
- $V_1(s) = \max_a Q_1(s, a) = 6.5$, so the greedy action is $a_3$ (the approximations take $\gamma \approx 1$)

Value iteration [Bellman '57]
- Assign an arbitrary $V_0$ to each state
- repeat (iteration $n+1$)
  - for all states $s$: compute $V_{n+1}(s)$ by Bellman backup at $s$
- until $\max_s |V_{n+1}(s) - V_n(s)| < \epsilon$ ($\epsilon$-convergence; $|V_{n+1}(s) - V_n(s)|$ is Residual($s$))

Value Iteration
- Idea:
  - Start with $V_0^*(s) = 0$, which we know is right (why?)
  - Given $V_i^*$, calculate the values for all states for depth $i+1$:
    $V_{i+1}(s) \leftarrow \max_a \sum_{s'} T(s, a, s') [R(s, a, s') + \gamma V_i(s')]$
  - This is called a value update or Bellman update
  - Repeat until convergence
- Theorem: will converge to unique optimal values
  - Basic idea: approximations get refined towards optimal values
  - Policy may converge long before values do
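The value iteration loop above can be sketched as follows. This is a minimal sketch rather than the course's implementation: the two-state MDP, its numbers, and the dictionary layout `mdp[s][a] = [(prob, next_state, reward), ...]` are assumptions chosen so the fixed point can be checked by hand.

```python
def bellman_backup(mdp, V, s, gamma):
    """One-step lookahead: max over actions of expected reward plus discounted value."""
    return max(
        sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
        for outcomes in mdp[s].values()
    )


def value_iteration(mdp, gamma=0.9, epsilon=1e-8):
    """Synchronous value iteration; stops when the largest residual drops below epsilon."""
    V = {s: 0.0 for s in mdp}  # arbitrary initial assignment V_0
    while True:
        new_V = {s: bellman_backup(mdp, V, s, gamma) for s in mdp}
        residual = max(abs(new_V[s] - V[s]) for s in mdp)
        V = new_V
        if residual < epsilon:
            return V


# Hypothetical MDP: staying in s0 pays 1 per step; "go" usually reaches s1,
# where staying pays 2 per step.
MDP = {
    "s0": {"stay": [(1.0, "s0", 1.0)],
           "go": [(0.8, "s1", 0.0), (0.2, "s0", 0.0)]},
    "s1": {"stay": [(1.0, "s1", 2.0)]},
}
V_star = value_iteration(MDP)
```

By hand, $V^*(s_1) = 2 / (1 - 0.9) = 20$, and $V^*(s_0)$ solves $V = 0.9 (0.8 \cdot 20 + 0.2 V)$, giving $14.4 / 0.82 \approx 17.56$, which beats the value 10 of staying in $s_0$ forever.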


Value Estimates
- Calculate estimates $V_k^*(s)$
  - The optimal value considering only the next $k$ time steps ($k$ rewards)
  - As $k \to \infty$, $V_k$ approaches the optimal value
- Why:
  - If discounting, distant rewards become negligible
  - If terminal states are reachable from everywhere, the fraction of episodes not ending becomes negligible
  - Otherwise, can get infinite expected utility and then this approach actually won't work

Example: Bellman Updates
- Example: $\gamma = 0.9$, living reward = 0, noise = 0.2
[Figure: grid-world value estimates before and after a Bellman update]

Example: Value Iteration
[Figure: grid-world value estimates $V_1$ and $V_2$]
- Information propagates outward from terminal states and eventually all states have correct value estimates

Practice: Computing Actions
- Which action should we choose from state $s$:
  - Given optimal values Q? $\arg\max_a Q^*(s, a)$
  - Given optimal values V? $\arg\max_a \sum_{s'} T(s, a, s') [R(s, a, s') + \gamma V^*(s')]$
  - Lesson: actions are easier to select from Q's!

Comments
- Decision-theoretic algorithm
- Dynamic programming
- Fixed point computation
- Probabilistic version of the Bellman-Ford algorithm for shortest path computation
  - MDP$_1$: Stochastic Shortest Path Problem
- Time complexity
  - one iteration: $O(|S|^2 |A|)$
  - number of iterations: $\mathrm{poly}(|S|, |A|, 1/(1-\gamma))$
- Space complexity: $O(|S|)$
- Factored MDPs = planning under uncertainty
  - exponential space, exponential time
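The "actions are easier to select from Q's" lesson shows up directly in code: selecting from V needs the model and a one-step lookahead, while selecting from Q is a plain argmax. The sketch below uses a made-up two-state MDP and its hand-computed optimal values; every name and number in it is an assumption for illustration.

```python
GAMMA = 0.9
# Hypothetical model: mdp[s][a] = [(prob, next_state, reward), ...]
MDP = {
    "s0": {"stay": [(1.0, "s0", 1.0)],
           "go": [(0.8, "s1", 0.0), (0.2, "s0", 0.0)]},
    "s1": {"stay": [(1.0, "s1", 2.0)]},
}
# Optimal values for this model, worked out by hand from the Bellman equations.
V_STAR = {"s0": 14.4 / 0.82, "s1": 20.0}


def q_from_v(mdp, V, s, a, gamma):
    """One-step lookahead: needs the transition and reward model."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in mdp[s][a])


def action_from_v(mdp, V, s, gamma):
    """Greedy action from V: argmax over a lookahead through the model."""
    return max(mdp[s], key=lambda a: q_from_v(mdp, V, s, a, gamma))


def action_from_q(Q, s):
    """Greedy action from Q: a table lookup, no model required."""
    return max(Q[s], key=Q[s].get)


Q_STAR = {s: {a: q_from_v(MDP, V_STAR, s, a, GAMMA) for a in MDP[s]} for s in MDP}
best_from_v = action_from_v(MDP, V_STAR, "s0", GAMMA)
best_from_q = action_from_q(Q_STAR, "s0")
```

Both routes pick the same action in `s0`; the difference is only in what information each one needs at decision time.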

Convergence Properties
- $V_n \to V^*$ in the limit as $n \to \infty$
- $\epsilon$-convergence: the $V_n$ function is within $\epsilon$ of $V^*$
- Optimality: the current policy is within $2\epsilon$ of optimal
- Monotonicity (with $\le_p$ and $\ge_p$ denoting pointwise comparison):
  - $V_0 \le_p V^* \Rightarrow V_n \le_p V^*$ ($V_n$ monotonic from below)
  - $V_0 \ge_p V^* \Rightarrow V_n \ge_p V^*$ ($V_n$ monotonic from above)
  - otherwise $V_n$ is non-monotonic

Convergence
- Define the max-norm: $\|U\| = \max_s |U(s)|$
- Theorem: for any two approximations $U^t$ and $V^t$:
  $\|U^{t+1} - V^{t+1}\| \le \gamma \|U^t - V^t\|$
  - I.e. any distinct approximations must get closer to each other, so, in particular, any approximation must get closer to the true $V^*$ (aka $U$) and value iteration converges to a unique, stable, optimal solution
- Theorem:
  $\|U^{t+1} - U^t\| < \epsilon \Rightarrow \|U^{t+1} - U\| < 2\epsilon\gamma / (1 - \gamma)$
  - I.e. once the change in our approximation is small, it must also be close to correct

Value Iteration Complexity
- Problem size: $|A|$ actions and $|S|$ states
- Each iteration:
  - Computation: $O(|A| \cdot |S|^2)$
  - Space: $O(|S|)$
- Number of iterations: can be exponential in the discount factor $\gamma$

Asynchronous Value Iteration
- States may be backed up in any order, instead of systematically, iteration by iteration
- Theorem: as long as every state is backed up infinitely often, asynchronous value iteration converges to optimal

Asynchronous Value Iteration: Prioritized Sweeping
- Why back up a state if the values of its successors are the same?
- Prefer backing up a state whose successors had the most change
- Priority queue of (state, expected change in value)
- Back up in the order of priority
- After backing up a state, update the priority queue for all its predecessors
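Asynchronous value iteration can be sketched by backing up one state at a time, in place, in an arbitrary order. The sketch below picks states uniformly at random, which is only one of many valid orderings (prioritized sweeping would replace the random choice with a priority queue). The MDP, the function name, and the fixed number of backups are assumptions for illustration.

```python
import random


def async_value_iteration(mdp, gamma=0.9, backups=2000, seed=0):
    """Back up one randomly chosen state at a time, updating V in place."""
    rng = random.Random(seed)
    V = {s: 0.0 for s in mdp}
    states = list(mdp)
    for _ in range(backups):
        s = rng.choice(states)  # any order works if every state keeps being chosen
        V[s] = max(
            sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
            for outcomes in mdp[s].values()
        )
    return V


# Hypothetical MDP: mdp[s][a] = [(prob, next_state, reward), ...]
MDP = {
    "s0": {"stay": [(1.0, "s0", 1.0)],
           "go": [(0.8, "s1", 0.0), (0.2, "s0", 0.0)]},
    "s1": {"stay": [(1.0, "s1", 2.0)]},
}
V_async = async_value_iteration(MDP)
```

Because each backup immediately reuses the freshest values of the other states, there is no separate "old" and "new" value table as in the synchronous version.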

Asynchronous Value Iteration: Real Time Dynamic Programming [Barto, Bradtke, Singh '95]
- Trial: simulate the greedy policy starting from the start state; perform a Bellman backup on visited states
- RTDP: repeat trials until the value function converges
- Why? Why is the next slide saying min?

RTDP Trial
[Figure: from $s_0$, actions $a_1$, $a_2$, $a_3$ lead toward the Goal; $Q_{n+1}(s_0, a)$ is computed for each action, $V_{n+1}(s_0)$ is their Min, and $a_{greedy} = a_2$]

Comments
- Properties: if all states are visited infinitely often then $V_n \to V^*$
- Advantages: anytime; more probable states are explored quickly
- Disadvantages: complete convergence can be slow!

Labeled RTDP [Bonet & Geffner ICAPS '03]
- Stochastic Shortest Path Problems
  - Policy with min expected cost to reach the goal
- Initialize $v_0(s)$ with an admissible heuristic
  - Underestimates the remaining cost
- Theorem: if the residual of $V_k(s) < \epsilon$ and $V_k(s') < \epsilon$ for all successors $s'$ of $s$ in the greedy graph, then $V_k$ is $\epsilon$-consistent and will remain so
- Labeling algorithm detects convergence
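An RTDP trial in the cost-minimizing (stochastic shortest path) setting can be sketched as below. This is an illustrative sketch, not the algorithm as published: the three-state problem, its costs, the zero heuristic, and the fixed number of trials (instead of a convergence test or labeling) are all assumptions.

```python
import random


def q_cost(mdp, V, s, a):
    """Expected cost of taking a in s, then following the current estimates."""
    return sum(p * (c + V[s2]) for p, s2, c in mdp[s][a])


def rtdp_trial(mdp, V, start, goal, rng, max_steps=1000):
    """Simulate the greedy policy from start, backing up each visited state."""
    s = start
    for _ in range(max_steps):
        if s == goal:
            break
        a = min(mdp[s], key=lambda act: q_cost(mdp, V, s, act))  # greedy = min cost
        V[s] = q_cost(mdp, V, s, a)                              # Bellman backup
        # Sample the next state from the greedy action's outcome distribution.
        r, cumulative = rng.random(), 0.0
        for p, s2, _cost in mdp[s][a]:
            cumulative += p
            if r < cumulative:
                break
        s = s2


# Hypothetical problem: mdp[s][a] = [(prob, next_state, cost), ...]; "g" is the goal.
MDP = {
    "s0": {"walk": [(0.8, "s1", 1.0), (0.2, "s0", 1.0)],
           "jump": [(1.0, "g", 3.0)]},
    "s1": {"walk": [(0.8, "g", 1.0), (0.2, "s1", 1.0)]},
}
V = {"s0": 0.0, "s1": 0.0, "g": 0.0}  # zero is an admissible (underestimating) heuristic
rng = random.Random(0)
for _ in range(200):
    rtdp_trial(MDP, V, "s0", "g", rng)
```

By hand, the expected cost from `s1` is 1 / 0.8 = 1.25 and from `s0` it is 2.5 by walking, which is cheaper than the cost 3 of jumping.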

Changing the Search Space
- Value Iteration
  - Search in value space
  - Compute the resulting policy
- Policy Iteration
  - Search in policy space
  - Compute the resulting value

Utilities for Fixed Policies
- Another basic operation: compute the utility of a state $s$ under a fixed (generally non-optimal) policy
- Define the utility of a state $s$, under a fixed policy $\pi$: $V^\pi(s)$ = expected total discounted rewards (return) starting in $s$ and following $\pi$
- Recursive relation (one-step lookahead / Bellman equation):
  $V^\pi(s) = \sum_{s'} T(s, \pi(s), s') [R(s, \pi(s), s') + \gamma V^\pi(s')]$

Policy Evaluation
- How do we calculate the V's for a fixed policy?
- Idea one: modify Bellman updates
  $V_0^\pi(s) = 0$
  $V_{i+1}^\pi(s) \leftarrow \sum_{s'} T(s, \pi(s), s') [R(s, \pi(s), s') + \gamma V_i^\pi(s')]$
- Idea two: it's just a linear system, solve with Matlab (or whatever)

Policy Iteration
- Problem with value iteration:
  - Considering all actions each iteration is slow: takes $|A|$ times longer than policy evaluation
  - But the policy doesn't change each iteration, so time is wasted
- Alternative to value iteration:
  - Step 1: Policy evaluation: calculate utilities for a fixed policy (not optimal utilities!) until convergence (fast)
  - Step 2: Policy improvement: update the policy using one-step lookahead with the resulting converged (but not optimal!) utilities (slow but infrequent)
  - Repeat steps until the policy converges

Policy Iteration
- Policy evaluation: with the current policy $\pi_k$ fixed, find values with simplified Bellman updates; iterate until values converge:
  $V_{i+1}^{\pi_k}(s) \leftarrow \sum_{s'} T(s, \pi_k(s), s') [R(s, \pi_k(s), s') + \gamma V_i^{\pi_k}(s')]$
- Policy improvement: with fixed utilities, find the best action according to one-step lookahead:
  $\pi_{k+1}(s) = \arg\max_a \sum_{s'} T(s, a, s') [R(s, a, s') + \gamma V^{\pi_k}(s')]$

Policy iteration [Howard '60]
- Assign an arbitrary $\pi_0$ to each state
- repeat
  - Policy evaluation: compute $V_{n+1}$, the evaluation of $\pi_n$ (costly: $O(n^3)$ when solved as a linear system over $n$ states)
  - Policy improvement: for all states $s$, compute $\pi_{n+1}(s) = \arg\max_{a \in Ap(s)} Q_{n+1}(s, a)$
- until $\pi_{n+1} = \pi_n$
- Advantage: searching in a finite (policy) space as opposed to an uncountably infinite (value) space, so convergence is faster; all other properties follow!
- Modified Policy Iteration: approximate the evaluation by value iteration using the fixed policy
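The evaluate-then-improve loop can be sketched as follows. This is a minimal sketch under stated assumptions: policy evaluation uses the iterative "idea one" (simplified Bellman updates) rather than a linear solve, and the two-state MDP and all names are made up for illustration.

```python
def lookahead(mdp, V, s, a, gamma):
    """Expected reward plus discounted value of taking a in s."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in mdp[s][a])


def evaluate_policy(mdp, policy, gamma, tol=1e-10):
    """Policy evaluation: Bellman updates with the action fixed by the policy."""
    V = {s: 0.0 for s in mdp}
    while True:
        delta = 0.0
        for s in mdp:
            v = lookahead(mdp, V, s, policy[s], gamma)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V


def policy_iteration(mdp, gamma=0.9):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    policy = {s: next(iter(mdp[s])) for s in mdp}  # arbitrary initial policy
    while True:
        V = evaluate_policy(mdp, policy, gamma)
        improved = {
            s: max(mdp[s], key=lambda a: lookahead(mdp, V, s, a, gamma))
            for s in mdp
        }
        if improved == policy:
            return policy, V
        policy = improved


# Hypothetical MDP: mdp[s][a] = [(prob, next_state, reward), ...]
MDP = {
    "s0": {"stay": [(1.0, "s0", 1.0)],
           "go": [(0.8, "s1", 0.0), (0.2, "s0", 0.0)]},
    "s1": {"stay": [(1.0, "s1", 2.0)]},
}
pi_star, V_pi = policy_iteration(MDP)
```

On this example the initial policy stays in `s0` (value 10); one improvement step switches to "go", and the next evaluation confirms that the policy no longer changes. Capping the number of sweeps inside `evaluate_policy` would turn this into the modified variant.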

Modified Policy Iteration
- Assign an arbitrary $\pi_0$ to each state
- repeat
  - Policy evaluation: compute $V_{n+1}$, the approximate evaluation of $\pi_n$
  - Policy improvement: for all states $s$, compute $\pi_{n+1}(s) = \arg\max_{a \in Ap(s)} Q_{n+1}(s, a)$
- until $\pi_{n+1} = \pi_n$
- Advantage: probably the most competitive synchronous dynamic programming algorithm

Policy Iteration Complexity
- Problem size: $|A|$ actions and $|S|$ states
- Each iteration:
  - Computation: $O(|S|^3 + |A| \cdot |S|^2)$
  - Space: $O(|S|)$
- Number of iterations: unknown, but can be faster in practice
  - Convergence is guaranteed

Comparison
- In value iteration:
  - Every pass (or "backup") updates both utilities (explicitly, based on current utilities) and policy (possibly implicitly, based on current policy)
- In policy iteration:
  - Several passes to update utilities with a frozen policy
  - Occasional passes to update policies
- Hybrid approaches (asynchronous policy iteration):
  - Any sequence of partial updates to either policy entries or utilities will converge if every state is visited infinitely often

Recap: MDPs
- Markov decision processes:
  - States $S$
  - Actions $A$
  - Transitions $P(s' \mid s, a)$ (or $T(s, a, s')$)
  - Rewards $R(s, a, s')$ (and discount $\gamma$)
  - Start state $s_0$
- Quantities:
  - Returns = sum of discounted rewards
  - Values = expected future returns from a state (optimal, or for a fixed policy)
  - Q-values = expected future returns from a q-state (optimal, or for a fixed policy)
