Announcements. CS 188: Artificial Intelligence Fall Recap: MDPs. Recap: Optimal Utilities. Practice: Computing Actions. Recap: Bellman Equations



CS 188: Artificial Intelligence
Fall 2009
Lecture 10: MDPs
9/29/2009
Dan Klein, UC Berkeley
Many slides over the course adapted from either Stuart Russell or Andrew Moore

Announcements
  P2: Due Wednesday
  P3: MDPs and Reinforcement Learning is up!
  W2: Out late this week

Recap: MDPs
  Markov decision processes:
    States S
    Actions A
    Transitions P(s'|s,a) (or T(s,a,s'))
    Rewards R(s,a,s') (and discount γ)
    Start state s0
  Quantities:
    Policy = map of states to actions
    Episode = one run of an MDP
    Utility = sum of discounted rewards
    Values = expected future utility from a state
    Q-Values = expected future utility from a q-state
  [DEMO MDP Quantities]

Recap: Optimal Utilities
  The utility of a state s:
    V*(s) = expected utility starting in s and acting optimally
  The utility of a q-state (s,a):
    Q*(s,a) = expected utility starting in s, taking action a and thereafter acting optimally
  The optimal policy:
    π*(s) = optimal action from state s
  Diagram labels: s is a state; (s,a) is a q-state; (s,a,s') is a transition

Recap: Bellman Equations
  Definition of utility leads to a simple one-step lookahead relationship amongst optimal utility values:
    Total optimal rewards = maximize over choice of (first action plus optimal future)
  Formally:
    $$V^*(s) = \max_a Q^*(s,a)$$
    $$Q^*(s,a) = \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma V^*(s') \right]$$
    $$V^*(s) = \max_a \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma V^*(s') \right]$$

Practice: Computing Actions
  Which action should we choose from state s:
    Given optimal values V?
      $$\pi^*(s) = \arg\max_a \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma V^*(s') \right]$$
    Given optimal q-values Q?
      $$\pi^*(s) = \arg\max_a Q^*(s,a)$$
  Lesson: actions are easier to select from Q's!
  [DEMO MDP action selection]
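The lesson that actions are easier to select from Q-values than from values can be sketched in a few lines. Everything below is an assumption made for illustration: the two-state MDP (`cool`, `warm`, plus a terminal `done`), the discount of 0.9, and the value table `V`, which is simply given rather than derived. The point is only that selecting from V needs the model for a one-step lookahead, while selecting from Q is a plain comparison of stored numbers.

```python
# Hypothetical MDP, invented for this sketch (not taken from the lecture).
# T[(s, a)] is a list of (next_state, probability, reward) triples.
GAMMA = 0.9
T = {
    ("cool", "slow"): [("cool", 1.0, 1.0)],
    ("cool", "fast"): [("cool", 0.5, 2.0), ("warm", 0.5, 2.0)],
    ("warm", "slow"): [("cool", 0.5, 1.0), ("warm", 0.5, 1.0)],
    ("warm", "fast"): [("done", 1.0, -10.0)],
}
ACTIONS = {"cool": ["slow", "fast"], "warm": ["slow", "fast"], "done": []}


def q_from_v(s, a, V):
    """One-step lookahead: sum over s' of T * (R + gamma * V(s'))."""
    return sum(p * (r + GAMMA * V[s2]) for s2, p, r in T[(s, a)])


def action_from_v(s, V):
    """Selecting from V needs the model (T and R) to look one step ahead."""
    return max(ACTIONS[s], key=lambda a: q_from_v(s, a, V))


def action_from_q(s, Q):
    """Selecting from Q needs no model: just take the argmax over actions."""
    return max(ACTIONS[s], key=lambda a: Q[(s, a)])


# Assumed value table, supplied by hand for the example.
V = {"cool": 15.0, "warm": 14.0, "done": 0.0}
# Q-values consistent with that table, built once using the model.
Q = {(s, a): q_from_v(s, a, V) for (s, a) in T}

for s in ("cool", "warm"):
    print(s, action_from_v(s, V), action_from_q(s, Q))
```

Both routes pick the same action in each state; the difference is that `action_from_q` never touches `T`.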

Value Estimates
  Calculate estimates V_k*(s)
    Not the optimal value of s!
    The optimal value considering only next k time steps (k rewards)
    As k → ∞, it approaches the optimal value
  Almost solution: recursion (i.e. expectimax)
  Correct solution: dynamic programming
  [DEMO: V_k]

Memoized Recursion?
  Recurrences (basically truncated expectimax):
    $$V_0^*(s) = 0$$
    $$V_k^*(s) = \max_a Q_k^*(s,a)$$
    $$Q_k^*(s,a) = \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma V_{k-1}^*(s') \right]$$
    $$\pi_k(s) = \arg\max_a Q_k^*(s,a)$$
  Cache all function call results so you never repeat work
  What happened to the evaluation functions?

Value Iteration
  Problems with the recursive computation:
    Have to keep all the V_k*(s) around all the time
    Don't know which depth π_k(s) to ask for when planning
  Solution: value iteration
    Calculate values for all states, bottom-up
    Keep increasing k until convergence

Value Iteration
  Idea:
    Start with V_0*(s) = 0, which we know is right (why?)
    Given V_i*, calculate the values for all states for depth i+1:
      $$V_{i+1}^*(s) \leftarrow \max_a \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma V_i^*(s') \right]$$
    Throw out old vector V_i*
    Repeat until convergence
    This is called a value update or Bellman update
  Theorem: will converge to unique optimal values
    Basic idea: approximations get refined towards optimal values
    Policy may converge long before values do

Convergence*
  Define the max-norm:
    $$\|U\| = \max_s |U(s)|$$
  Theorem: For any two approximations U and V
    $$\|U^{t+1} - V^{t+1}\| \le \gamma \, \|U^t - V^t\|$$
    I.e. any distinct approximations must get closer to each other, so, in particular, any approximation must get closer to the true U and value iteration converges to a unique, stable, optimal solution
  Theorem:
    $$\|U^{t+1} - U^t\| < \epsilon \;\Rightarrow\; \|U^{t+1} - U\| < \frac{2 \epsilon \gamma}{1 - \gamma}$$
    I.e. once the change in our approximation is small, it must also be close to correct

Utilities for Fixed Policies
  Another basic operation: compute the utility of a state s under a fixed (generally non-optimal) policy
  Define the utility of a state s, under a fixed policy π:
    V^π(s) = expected total discounted rewards (return) starting in s and following π
  Recursive relation (one-step lookahead / Bellman equation):
    $$V^\pi(s) = \sum_{s'} T(s,\pi(s),s') \left[ R(s,\pi(s),s') + \gamma V^\pi(s') \right]$$
  [DEMO Right-Only Policy]
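The value iteration loop described on these slides (start from zero, apply the Bellman update to every state, throw away the old vector, repeat until the values stop changing) might look like the sketch below. The small MDP, the discount of 0.9, and the stopping tolerance are assumptions chosen for illustration; `value_iteration` is a hypothetical helper name, not something defined in the lecture.

```python
# Hypothetical MDP, invented for this sketch (not taken from the lecture).
# T[(s, a)] is a list of (next_state, probability, reward) triples.
T = {
    ("cool", "slow"): [("cool", 1.0, 1.0)],
    ("cool", "fast"): [("cool", 0.5, 2.0), ("warm", 0.5, 2.0)],
    ("warm", "slow"): [("cool", 0.5, 1.0), ("warm", 0.5, 1.0)],
    ("warm", "fast"): [("done", 1.0, -10.0)],
}
ACTIONS = {"cool": ["slow", "fast"], "warm": ["slow", "fast"], "done": []}


def value_iteration(T, actions, gamma, tol=1e-8):
    """Repeat Bellman updates over all states until the max change is below tol."""
    V = {s: 0.0 for s in actions}  # V_0(s) = 0 for every state
    while True:
        new_V = {}
        for s, acts in actions.items():
            if not acts:  # terminal state: no actions, value stays 0
                new_V[s] = 0.0
                continue
            # Bellman update: max over actions of the expected one-step return.
            new_V[s] = max(
                sum(p * (r + gamma * V[s2]) for s2, p, r in T[(s, a)])
                for a in acts
            )
        # Max-norm of the change between successive value vectors.
        delta = max(abs(new_V[s] - V[s]) for s in actions)
        V = new_V  # throw out the old vector
        if delta < tol:
            return V


V = value_iteration(T, ACTIONS, gamma=0.9)
print({s: round(v, 3) for s, v in V.items()})
```

For this particular MDP the fixed point can be checked by hand: with the policy "fast when cool, slow when warm", the Bellman equations give V(cool) = 15.5 and V(warm) = 14.5, which is what the loop converges to.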

Policy Evaluation
  How do we calculate the V's for a fixed policy?
  Idea one: turn recursive equations into updates
    $$V_0^\pi(s) = 0$$
    $$V_{i+1}^\pi(s) \leftarrow \sum_{s'} T(s,\pi(s),s') \left[ R(s,\pi(s),s') + \gamma V_i^\pi(s') \right]$$
  Idea two: it's just a linear system, solve with Matlab (or whatever)

Policy Iteration
  Alternative approach:
    Step 1: Policy evaluation: calculate utilities for some fixed policy (not optimal utilities!) until convergence
    Step 2: Policy improvement: update policy using one-step look-ahead with resulting converged (but not optimal!) utilities as future values
    Repeat steps until policy converges
  This is policy iteration
    It's still optimal!
    Can converge faster under some conditions

Policy Iteration
  Policy evaluation: with fixed current policy π, find values with simplified Bellman updates:
    $$V_{i+1}^{\pi_k}(s) \leftarrow \sum_{s'} T(s,\pi_k(s),s') \left[ R(s,\pi_k(s),s') + \gamma V_i^{\pi_k}(s') \right]$$
    Iterate until values converge
  Policy improvement: with fixed utilities, find the best action according to one-step look-ahead
    $$\pi_{k+1}(s) = \arg\max_a \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma V^{\pi_k}(s') \right]$$

Comparison
  Both compute same thing (optimal values for all states)
  In value iteration:
    Every pass (or "backup") updates both utilities (explicitly, based on current utilities) and policy (implicitly, based on current utilities)
    Tracking the policy isn't necessary; we take the max
  In policy iteration:
    Several passes to update utilities with fixed policy
    After policy is evaluated, a new policy is chosen
  Together, these are dynamic programming for MDPs

Asynchronous Value Iteration*
  In value iteration, we update every state in each iteration
  Actually, any sequences of Bellman updates will converge if every state is visited infinitely often
  In fact, we can update the policy as seldom or often as we like, and we will still converge
  Idea: Update states whose value we expect to change:
    If $|V_{i+1}(s) - V_i(s)|$ is large then update predecessors of s

Reinforcement Learning
  Reinforcement learning:
    Still have an MDP:
      A set of states s ∈ S
      A set of actions (per state) A
      A model T(s,a,s')
      A reward function R(s,a,s')
    Still looking for a policy π(s)
    [DEMO]
  New twist: don't know T or R
    I.e. don't know which states are good or what the actions do
    Must actually try actions and states out to learn
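The two alternating steps of policy iteration can be sketched as follows. This is a minimal illustration under stated assumptions: the MDP, the discount of 0.9, the tolerance, the starting policy (first listed action everywhere), and the helper names `lookahead`, `evaluate`, `improve`, and `policy_iteration` are all invented here. Evaluation uses the iterative update ("idea one") rather than a linear solve.

```python
# Hypothetical MDP, invented for this sketch (not taken from the lecture).
# T[(s, a)] is a list of (next_state, probability, reward) triples.
GAMMA = 0.9
T = {
    ("cool", "slow"): [("cool", 1.0, 1.0)],
    ("cool", "fast"): [("cool", 0.5, 2.0), ("warm", 0.5, 2.0)],
    ("warm", "slow"): [("cool", 0.5, 1.0), ("warm", 0.5, 1.0)],
    ("warm", "fast"): [("done", 1.0, -10.0)],
}
ACTIONS = {"cool": ["slow", "fast"], "warm": ["slow", "fast"], "done": []}


def lookahead(s, a, V):
    """Expected one-step return of taking a in s, then following values V."""
    return sum(p * (r + GAMMA * V[s2]) for s2, p, r in T[(s, a)])


def evaluate(policy, tol=1e-10):
    """Policy evaluation: simplified Bellman updates (no max) until convergence."""
    V = {s: 0.0 for s in ACTIONS}
    while True:
        new_V = {
            s: lookahead(s, policy[s], V) if s in policy else 0.0
            for s in ACTIONS
        }
        delta = max(abs(new_V[s] - V[s]) for s in ACTIONS)
        V = new_V
        if delta < tol:
            return V


def improve(V):
    """Policy improvement: greedy one-step lookahead with fixed utilities."""
    return {
        s: max(acts, key=lambda a: lookahead(s, a, V))
        for s, acts in ACTIONS.items()
        if acts  # terminal states have no action to choose
    }


def policy_iteration():
    # Assumed starting point: the first listed action in every state.
    policy = {s: acts[0] for s, acts in ACTIONS.items() if acts}
    while True:
        V = evaluate(policy)
        new_policy = improve(V)
        if new_policy == policy:  # policy stopped changing: done
            return policy, V
        policy = new_policy


policy, V = policy_iteration()
print(policy, {s: round(v, 3) for s, v in V.items()})
```

On this example the loop stabilizes after two evaluation rounds: the all-"slow" starting policy is improved once, and the second evaluation confirms that no further change helps.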

Example: Animal Learning
  RL studied experimentally for more than 60 years in psychology
    Rewards: food, pain, hunger, drugs, etc.
    Mechanisms and sophistication debated
  Example: foraging
    Bees learn near-optimal foraging plan in field of artificial flowers with controlled nectar supplies
    Bees have a direct neural connection from nectar intake measurement to motor planning area

Example: Backgammon
  Reward only for win / loss in terminal states, zero otherwise
  TD-Gammon learns a function approximation to V(s) using a neural network
  Combined with depth 3 search, one of the top 3 players in the world
  You could imagine training Pacman this way but it's tricky! (It's also P3)

Passive Learning
  Simplified task
    You don't know the transitions T(s,a,s')
    You don't know the rewards R(s,a,s')
    You are given a policy π(s)
    Goal: learn the state values (and maybe the model)
    I.e., policy evaluation
  In this case:
    Learner along for the ride
    No choice about what actions to take
    Just execute the policy and learn from experience
    We'll get to the active case soon
    This is NOT offline planning!

Example: Direct Estimation
  [DEMO Optimal Policy]
  Grid world with axes x and y; γ = 1, R = -1
  Episodes: [listing lost in extraction; surviving step: (4,2) exit -100]
  V(1,1) ~ ( ) / 2 = -7
  V(3,3) ~ ( ) / 3 =

Model-Based Learning
  Idea:
    Learn the model empirically (rather than values)
    Solve the MDP as if the learned model were correct
    Better than direct estimation?
  Empirical model learning
    Simplest case:
      Count outcomes for each s,a
      Normalize to give estimate of T(s,a,s')
      Discover R(s,a,s') the first time we experience (s,a,s')
    More complex learners are possible (e.g. if we know that all squares have related action outcomes, e.g. "stationary noise")

Example: Model-Based Learning
  Grid world with axes x and y; γ = 1
  Episodes: [listing lost in extraction; surviving step: (4,2) exit -100]
  T(<3,3>, right, <4,3>) = 1 / 3
  T(<2,3>, right, <3,3>) = 2 / 2
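The "count outcomes, then normalize" recipe for empirical model learning can be sketched like this. The transition list below is hypothetical: the lecture's episode listing is not available here, so the observations were invented so that the resulting estimates match the two quoted on the slide (1/3 and 2/2). `learn_model` is likewise an assumed helper name.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical observations (s, a, s_next, reward), invented for this sketch.
observed = [
    ((2, 3), "right", (3, 3), -1),
    ((3, 3), "right", (3, 2), -1),
    ((2, 3), "right", (3, 3), -1),
    ((3, 3), "right", (4, 3), -1),
    ((3, 3), "right", (3, 2), -1),
]


def learn_model(transitions):
    """Estimate T by normalized counts; record R the first time a triple is seen."""
    outcome_counts = Counter()  # how often (s, a) led to s_next
    visit_counts = Counter()    # how often (s, a) was tried at all
    R_hat = {}
    for s, a, s2, r in transitions:
        outcome_counts[(s, a, s2)] += 1
        visit_counts[(s, a)] += 1
        R_hat.setdefault((s, a, s2), r)  # keep the first observed reward
    # Normalize each outcome count by the number of times (s, a) was tried.
    T_hat = {
        (s, a, s2): Fraction(n, visit_counts[(s, a)])
        for (s, a, s2), n in outcome_counts.items()
    }
    return T_hat, R_hat


T_hat, R_hat = learn_model(observed)
for key, prob in sorted(T_hat.items()):
    print(key, prob)
```

Exact fractions are used only to keep the counts readable; a float division would serve equally well. Once `T_hat` and `R_hat` exist, the MDP can be solved as if they were the true model.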

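Recap: Model-Based Policy Evaluation
  Simplified Bellman updates to calculate V for a fixed policy:
    New V is expected one-step-lookahead using current V
    Unfortunately, need T and R
    $$V_{i+1}^\pi(s) \leftarrow \sum_{s'} T(s,\pi(s),s') \left[ R(s,\pi(s),s') + \gamma V_i^\pi(s') \right]$$

Sample Avg to Replace Expectation?
  Who needs T and R? Approximate the expectation with samples (drawn from T!)
    $$\text{sample}_k = R(s,\pi(s),s'_k) + \gamma V_i^\pi(s'_k)$$
    $$V_{i+1}^\pi(s) \leftarrow \frac{1}{k} \sum_{j=1}^{k} \text{sample}_j$$

Model-Free Learning
  Big idea: why bother learning T?
    Update V each time we experience a transition
    Frequent outcomes will contribute more updates (over time)
  Temporal difference learning (TD)
    Policy still fixed!
    Move values toward value of whatever successor occurs: running average!
      $$\text{sample} = R(s,\pi(s),s') + \gamma V^\pi(s')$$
      $$V^\pi(s) \leftarrow (1-\alpha) V^\pi(s) + \alpha \cdot \text{sample}$$
      $$V^\pi(s) \leftarrow V^\pi(s) + \alpha \left( \text{sample} - V^\pi(s) \right)$$

Example: TD Policy Evaluation
  Episodes: [listing lost in extraction; surviving step: (4,2) exit -100]
  Take γ = 1, α =

Problems with TD Value Learning
  TD value learning is model-free for policy evaluation (passive learning)
  However, if we want to turn our value estimates into a policy, we're sunk:
    $$\pi(s) = \arg\max_a Q^*(s,a)$$
    $$Q^*(s,a) = \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma V^*(s') \right]$$
  Idea: learn Q-values directly
  Makes action selection model-free too!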
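The TD update for policy evaluation, a running average that nudges V(s) toward each observed sample, can be sketched as below. The states `A`, `B`, `end`, the rewards, and the learning rate α = 0.5 are assumptions picked so the arithmetic is easy to follow; `td_update` is a hypothetical helper name.

```python
GAMMA = 1.0   # discount; the slide example also uses gamma = 1
ALPHA = 0.5   # learning rate: an assumed value for this sketch


def td_update(V, s, reward, s_next):
    """Move V[s] toward the one-step sample: reward + gamma * V[s_next]."""
    sample = reward + GAMMA * V[s_next]
    V[s] = (1 - ALPHA) * V[s] + ALPHA * sample
    return V[s]


# Hypothetical states and experience, invented for illustration.
V = {"A": 0.0, "B": 0.0, "end": 0.0}
history = []
for s, r, s_next in [("A", -1.0, "B"), ("B", 100.0, "end"), ("A", -1.0, "B")]:
    history.append((s, td_update(V, s, r, s_next)))

print(history)
```

No transition model appears anywhere: each update uses only the state just left, the reward just received, and the state just entered. The second visit to `A` already reflects the large reward seen from `B`, which is how value information propagates backward over repeated experience.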
