Recap: MDPs. CS 188: Artificial Intelligence Fall Optimal Utilities. The Bellman Equations. Value Estimates. Practice: Computing Actions



CS 188: Artificial Intelligence, Fall 2008
Lecture 10: MDPs
9/30/2008
Dan Klein, UC Berkeley
Many slides over the course adapted from either Stuart Russell or Andrew Moore

Recap: MDPs
Markov decision processes:
  States S
  Actions A
  Transitions P(s'|s,a) (or T(s,a,s'))
  Rewards R(s,a,s') (and discount γ)
  Start state s0
Quantities:
  Policy = map of states to actions
  Episode = one run of an MDP
  Returns = sum of discounted rewards
  Values = expected future returns from a state
  Q-Values = expected future returns from a q-state
[DEMO Grid Values]

Optimal Utilities
Fundamental operation: compute the optimal utilities of all states s
Why? Optimal values define optimal policies!
Define the utility of a state s:
  V*(s) = expected return starting in s and acting optimally
Define the utility of a q-state (s,a):
  Q*(s,a) = expected return starting in s, taking action a and thereafter acting optimally
Define the optimal policy:
  π*(s) = optimal action from state s

The Bellman Equations
Definition of utility leads to a simple one-step lookahead relationship amongst optimal utility values:
  Optimal rewards = maximize over first action and then follow optimal policy
Formally:
$$V^*(s) = \max_a Q^*(s,a)$$
$$Q^*(s,a) = \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma V^*(s') \right]$$
$$V^*(s) = \max_a \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma V^*(s') \right]$$

Practice: Computing Actions
Which action should we choose from state s:
  Given optimal values V?
  Given optimal q-values Q?
Lesson: actions are easier to select from Q's!
[DEMO Grid Policies]

Value Estimates
Calculate estimates V_k*(s)
  Not the optimal value of s!
  The optimal value considering only the next k time steps (k rewards)
  As k → ∞, it approaches the optimal value
Why:
  If discounting, distant rewards become negligible
  If terminal states are reachable from everywhere, the fraction of episodes not ending becomes negligible
  Otherwise, can get infinite expected utility and then this approach actually won't work
[DEMO -- V_k]
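The "Practice: Computing Actions" lesson can be made concrete with a small sketch. Everything here is an assumption for illustration: the two-state MDP, the discount of 0.9, and the value table `V` are made up (the values are not claimed to be optimal). The point is only the mechanics: choosing an action from V needs the model T and R for a one-step lookahead, while choosing from Q is a plain argmax.

```python
# Hypothetical two-state MDP, used only to illustrate action selection.
# T[s][a] is a list of (next_state, probability, reward) triples.
GAMMA = 0.9  # assumed discount

T = {
    "cool": {
        "slow": [("cool", 1.0, 1.0)],
        "fast": [("cool", 0.5, 2.0), ("warm", 0.5, 2.0)],
    },
    "warm": {
        "slow": [("cool", 0.5, 1.0), ("warm", 0.5, 1.0)],
        "fast": [("warm", 1.0, -10.0)],
    },
}


def q_from_v(V, s, a):
    """One-step lookahead: sum over s' of T(s,a,s') * [R(s,a,s') + gamma * V(s')]."""
    return sum(p * (r + GAMMA * V[s2]) for s2, p, r in T[s][a])


def action_from_v(V, s):
    """Selecting from V requires the model (T and R) to run the lookahead."""
    return max(T[s], key=lambda a: q_from_v(V, s, a))


def action_from_q(Q, s):
    """Selecting from Q needs no model: just take the argmax over actions."""
    return max(Q[s], key=Q[s].get)


# Made-up value table (an assumption, not computed optimal values).
V = {"cool": 15.0, "warm": 10.0}
# Q-values derived from V by one-step lookahead.
Q = {s: {a: q_from_v(V, s, a) for a in T[s]} for s in T}

for s in T:
    print(s, action_from_v(V, s), action_from_q(Q, s))
```

Both selection routes agree by construction here, but only `action_from_q` could be run by an agent that does not know T or R.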

Memoized Recursion?
Recurrences:
$$V_0^*(s) = 0$$
$$V_k^*(s) = \max_a \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma V_{k-1}^*(s') \right]$$
Cache all function call results so you never repeat work
What happened to the evaluation functions?

Value Iteration
Problems with the recursive computation:
  Have to keep all the V_k*(s) around all the time
  Don't know which depth π_k(s) to ask for when planning
Solution: value iteration
  Calculate values for all states, bottom-up
  Keep increasing k until convergence

Value Iteration
Idea:
  Start with V_0*(s) = 0, which we know is right (why?)
  Given V_i*, calculate the values for all states for depth i+1:
$$V_{i+1}^*(s) \leftarrow \max_a \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma V_i^*(s') \right]$$
  Throw out old vector V_i*
  This is called a value update or Bellman update
  Repeat until convergence
Theorem: will converge to unique optimal values
  Basic idea: approximations get refined towards optimal values
  Policy may converge long before values do

Example: Bellman Updates
[Figure: Bellman update on the grid]

Example: Value Iteration
[Figure: grid values V_2 and V_3]
Information propagates outward from terminal states and eventually all states have correct value estimates
[DEMO]

Convergence*
Define the max-norm:
$$\|U\| = \max_s |U(s)|$$
Theorem: For any two approximations U and V
$$\|U^{t+1} - V^{t+1}\| \le \gamma \, \|U^t - V^t\|$$
I.e. any distinct approximations must get closer to each other, so, in particular, any approximation must get closer to the true U, and value iteration converges to a unique, stable, optimal solution
Theorem:
$$\|U^{t+1} - U^t\| < \epsilon \;\Rightarrow\; \|U^{t+1} - U\| < \frac{2 \epsilon \gamma}{1 - \gamma}$$
I.e. once the change in our approximation is small, it must also be close to correct
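The bottom-up procedure described above (start from zero, apply Bellman updates to every state, stop once the max-norm change is small) can be sketched as follows. This is a minimal sketch under stated assumptions: the three-state MDP, the discount of 0.5, and the tolerance are hypothetical, and terminal states are represented as states with no actions.

```python
def value_iteration(T, gamma, tol=1e-8, max_iters=10_000):
    """Synchronous value iteration.

    T[s][a] is a list of (next_state, probability, reward) triples;
    a state with no actions is treated as terminal with value 0.
    """
    V = {s: 0.0 for s in T}  # V_0(s) = 0 for all states
    for _ in range(max_iters):
        V_new = {}
        for s in T:
            if not T[s]:
                V_new[s] = 0.0  # terminal state
            else:
                # Bellman update: max over actions of expected one-step return
                V_new[s] = max(
                    sum(p * (r + gamma * V[s2]) for s2, p, r in outcomes)
                    for outcomes in T[s].values()
                )
        # Max-norm of the change between successive value vectors
        delta = max(abs(V_new[s] - V[s]) for s in T)
        V = V_new  # throw out the old vector
        if delta < tol:
            break
    return V


# Hypothetical MDP: "go" from a usually reaches b, and b exits with reward 10.
T = {
    "a": {
        "stay": [("a", 1.0, 0.0)],
        "go": [("b", 0.8, 0.0), ("a", 0.2, 0.0)],
    },
    "b": {"exit": [("end", 1.0, 10.0)]},
    "end": {},
}

V = value_iteration(T, gamma=0.5)
print(V)
```

For this made-up MDP the fixed point can be checked by hand: V(b) = 10, and V(a) satisfies V(a) = 0.8 · 0.5 · 10 + 0.2 · 0.5 · V(a), so V(a) = 40/9.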


Utilities for Fixed Policies
Another basic operation: compute the utility of a state s under a fixed (generally non-optimal) policy
Define the utility of a state s, under a fixed policy π:
  V^π(s) = expected total discounted rewards (return) starting in s and following π
Recursive relation (one-step look-ahead / Bellman equation):
$$V^\pi(s) = \sum_{s'} T(s,\pi(s),s') \left[ R(s,\pi(s),s') + \gamma V^\pi(s') \right]$$

Policy Evaluation
How do we calculate the V's for a fixed policy?
Idea one: turn recursive equations into updates
$$V_{i+1}^\pi(s) \leftarrow \sum_{s'} T(s,\pi(s),s') \left[ R(s,\pi(s),s') + \gamma V_i^\pi(s') \right]$$
Idea two: it's just a linear system, solve with Matlab (or whatever)
[DEMO Right-Only Policy]

Policy Iteration
Alternative approach:
  Step 1: Policy evaluation: calculate utilities for a fixed policy (not optimal utilities!) until convergence
  Step 2: Policy improvement: update policy using one-step lookahead with resulting converged (but not optimal!) utilities
  Repeat steps until policy converges
This is policy iteration
  It's still optimal!
  Can converge faster under some conditions

Policy Iteration
Policy evaluation: with fixed current policy π_k, find values with simplified Bellman updates:
$$V_{i+1}^{\pi_k}(s) \leftarrow \sum_{s'} T(s,\pi_k(s),s') \left[ R(s,\pi_k(s),s') + \gamma V_i^{\pi_k}(s') \right]$$
  Iterate until values converge
Policy improvement: with fixed utilities, find the best action according to one-step look-ahead:
$$\pi_{k+1}(s) = \arg\max_a \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma V^{\pi_k}(s') \right]$$

Comparison
In value iteration:
  Every pass (or "backup") updates both utilities (explicitly, based on current utilities) and policy (possibly implicitly, based on current policy)
In policy iteration:
  Several passes to update utilities with frozen policy
  Policy evaluation passes are faster than value iteration passes (why?)
  Occasional passes to update policies
Hybrid approaches (asynchronous policy iteration):
  Any sequences of partial updates to either policy entries or utilities will converge if every state is visited infinitely often

Reinforcement Learning
Reinforcement learning:
  Still have an MDP:
    A set of states s ∈ S
    A set of actions (per state) A
    A model T(s,a,s')
    A reward function R(s,a,s')
  Still looking for a policy π(s)
[DEMO]
New twist: don't know T or R
  I.e. don't know which states are good or what the actions do
  Must actually try actions and states out to learn
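The two policy iteration steps from the slides above (evaluate the frozen policy, then improve it by one-step lookahead) can be sketched as below. This is an illustrative sketch, not the course's reference code: the small MDP, the discount of 0.5, and the arbitrary initial policy are assumptions, and evaluation uses the iterative update rather than a linear solve.

```python
def evaluate_policy(T, policy, gamma, tol=1e-10):
    """Iterate the simplified Bellman update (no max) for a fixed policy."""
    V = {s: 0.0 for s in T}
    while True:
        V_new = {
            s: sum(p * (r + gamma * V[s2]) for s2, p, r in T[s][policy[s]])
            if T[s] else 0.0  # states with no actions are terminal
            for s in T
        }
        delta = max(abs(V_new[s] - V[s]) for s in T)
        V = V_new
        if delta < tol:
            return V


def improve_policy(T, V, gamma):
    """One-step lookahead: pick the best action under the given utilities."""
    return {
        s: max(
            T[s],
            key=lambda a: sum(p * (r + gamma * V[s2]) for s2, p, r in T[s][a]),
        )
        for s in T if T[s]
    }


def policy_iteration(T, gamma):
    # Arbitrary initial policy: the first listed action in each state.
    policy = {s: next(iter(T[s])) for s in T if T[s]}
    while True:
        V = evaluate_policy(T, policy, gamma)
        new_policy = improve_policy(T, V, gamma)
        if new_policy == policy:  # policy converged
            return policy, V
        policy = new_policy


# Hypothetical MDP; T[s][a] is a list of (next_state, probability, reward).
T = {
    "a": {
        "stay": [("a", 1.0, 0.0)],
        "go": [("b", 0.8, 0.0), ("a", 0.2, 0.0)],
    },
    "b": {"exit": [("end", 1.0, 10.0)]},
    "end": {},
}

policy, V = policy_iteration(T, gamma=0.5)
print(policy, V)
```

Note that each evaluation pass skips the max over actions, which is one way to see why evaluation passes are cheaper than value iteration passes.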

Example: Animal Learning
RL studied experimentally for more than 60 years in psychology
  Rewards: food, pain, hunger, drugs, etc.
  Mechanisms and sophistication debated
Example: foraging
  Bees learn near-optimal foraging plan in field of artificial flowers with controlled nectar supplies
  Bees have a direct neural connection from nectar intake measurement to motor planning area

Example: Backgammon
Reward only for win / loss in terminal states, zero otherwise
TD-Gammon learns a function approximation to V(s) using a neural network
Combined with depth 3 search, one of the top 3 players in the world
You could imagine training Pacman this way, but it's tricky!

Passive Learning
Simplified task
  You don't know the transitions T(s,a,s')
  You don't know the rewards R(s,a,s')
  You are given a policy π(s)
  Goal: learn the state values (and maybe the model)
In this case:
  No choice about what actions to take
  Just execute the policy and learn from experience
  We'll get to the general case soon

Example: Direct Estimation
[Figure: grid world with axes x and y; of the episode listing only the final step "(4,2) exit -100" survives]
γ = 1, R = -1
U(1,1) ~ (sum of the 2 observed returns from (1,1)) / 2 = -7
U(3,3) ~ (sum of the 3 observed returns from (3,3)) / 3 = 31.3

Model-Based Learning
In general, want to learn the optimal policy, not evaluate a fixed policy
Idea: adaptive dynamic programming
  Learn an initial model of the environment
  Solve for the optimal policy for this model (value or policy iteration)
  Refine model through experience and repeat
  Crucial: we have to make sure we actually learn about all of the model

Model-Based Learning
Idea:
  Learn the model empirically (rather than values)
  Solve the MDP as if the learned model were correct
Empirical model learning
  Simplest case:
    Count outcomes for each s,a
    Normalize to give estimate of T(s,a,s')
    Discover R(s,a,s') the first time we experience (s,a,s')
  More complex learners are possible (e.g. if we know that all squares have related action outcomes, e.g. "stationary noise")
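The simplest case of empirical model learning described above (count outcomes for each s,a, normalize, and record each reward the first time it is seen) can be sketched as follows. The function name `learn_model` and the list of observed transitions are hypothetical, made up for illustration rather than taken from the lecture's episodes.

```python
from collections import defaultdict


def learn_model(transitions):
    """Estimate T and R from observed (s, a, s_next, reward) tuples."""
    counts = defaultdict(lambda: defaultdict(int))
    R_hat = {}
    for s, a, s2, r in transitions:
        counts[(s, a)][s2] += 1          # count outcomes for each (s, a)
        R_hat.setdefault((s, a, s2), r)  # keep the first reward observed
    T_hat = {}
    for sa, outcomes in counts.items():
        total = sum(outcomes.values())
        # Normalize counts into estimated transition probabilities
        T_hat[sa] = {s2: n / total for s2, n in outcomes.items()}
    return T_hat, R_hat


# Hypothetical experience: grid squares as (x, y), every step costing -1.
experience = [
    ((2, 3), "right", (3, 3), -1),
    ((3, 3), "right", (3, 2), -1),
    ((2, 3), "right", (3, 3), -1),
    ((3, 3), "right", (4, 3), -1),
    ((3, 3), "right", (3, 2), -1),
]

T_hat, R_hat = learn_model(experience)
print(T_hat[((3, 3), "right")])
print(T_hat[((2, 3), "right")])
```

Pairs (s,a) that never appear in the experience get no estimate at all, which is why the slides stress learning about all of the model.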

Example: Model-Based Learning
[Figure: grid world with axes x and y; of the episode listing only the final step "(4,2) exit -100" survives]
γ = 1
T(<3,3>, right, <4,3>) = 1 / 3
T(<2,3>, right, <3,3>) = 2 / 2

Example: Greedy ADP
Imagine we find the lower path to the good exit first
Some states will never be visited following this policy from (1,1)
We'll keep re-using this policy because following it never collects the regions of the model we need to learn the optimal policy

What Went Wrong?
Problem with following optimal policy for current model:
  Never learn about better regions of the space if current policy neglects them
Fundamental tradeoff: exploration vs. exploitation
  Exploration: must take actions with suboptimal estimates to discover new rewards and increase eventual utility
  Exploitation: once the true optimal policy is learned, exploration reduces utility
  Systems must explore in the beginning and exploit in the limit

Model-Free Learning
Big idea: why bother learning T?
  Update V each time we experience a transition
  Frequent outcomes will contribute more updates (over time)
Temporal difference learning (TD)
  Policy still fixed!
  Move values toward value of whatever successor occurs:
$$V^\pi(s) \leftarrow (1 - \alpha) \, V^\pi(s) + \alpha \left[ R(s,\pi(s),s') + \gamma V^\pi(s') \right]$$

Example: Passive TD
[Figure: grid world; of the episode listing only the final step "(4,2) exit -100" survives]
Take γ = 1, α = 0.5

Problems with TD Value Learning
TD value learning is model-free for policy evaluation
However, if we want to turn our value estimates into a policy, we're sunk:
$$\pi(s) = \arg\max_a Q(s,a)$$
$$Q(s,a) = \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma V(s') \right]$$
Idea: learn Q-values directly
Makes action selection model-free too!
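The TD value update for a fixed policy can be sketched in a few lines. This is a minimal sketch using the slide's γ = 1 and α = 0.5; the two-step episode, the `"done"` marker for the terminal successor, and the choice to initialize unseen states to 0 are assumptions made for illustration.

```python
def td_update(V, s, r, s2, alpha, gamma):
    """Move V(s) toward the sample r + gamma * V(s2) by a step of size alpha."""
    sample = r + gamma * V.get(s2, 0.0)  # unseen states default to 0
    V[s] = (1 - alpha) * V.get(s, 0.0) + alpha * sample


# Hypothetical episode under a fixed policy: (state, reward, next_state).
# "done" is an assumed marker for the terminal successor after exiting.
episode = [
    ((3, 3), -1, (4, 3)),
    ((4, 3), 100, "done"),
]

V = {}
for _ in range(2):  # experience the same episode twice
    for s, r, s2 in episode:
        td_update(V, s, r, s2, alpha=0.5, gamma=1.0)

print(V)
```

After the first pass V(3,3) is only -0.5 because its successor still had value 0; the exit reward reaches it on the second pass, which shows how values propagate backward one transition at a time. No transition model is estimated anywhere in the update.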
