Announcements. CS 188: Artificial Intelligence Fall Reinforcement Learning. Markov Decision Processes. Example Optimal Policies.




CS 188: Artificial Intelligence, Fall 2008
Lecture 9: MDPs, 9/25/2008
Dan Klein, UC Berkeley
Many slides over the course adapted from either Stuart Russell or Andrew Moore

Announcements
- Homework solutions / review sessions:
  - Monday 9/29, 7-9pm in 2050 Valley LSB
  - Tuesday 9/30, 6-8pm in 10 Evans
  - Check web for details
  - Covers W1-2, maybe [...]
- Homework / difficulty:
  - There's going to be a range, just like exam questions; some will be quite hard!
  - Homework graded accordingly
- Reading:
  - For MDPs / reinforcement learning, we're using an online reading
  - Different treatment than the R&N book, beware!

Reinforcement Learning [DEMOS]
- Basic idea:
  - Receive feedback in the form of rewards
  - Agent's utility is defined by the reward function
  - Must learn to act so as to maximize expected rewards
  - Change the rewards, change the learned behavior
- Examples:
  - Playing a game, reward at the end for winning / losing
  - Vacuuming a house, reward for each piece of dirt picked up
  - Automated taxi, reward for each passenger delivered

Markov Decision Processes
- An MDP is defined by:
  - A set of states s ∈ S
  - A set of actions a ∈ A
  - A transition function T(s, a, s')
    - Probability that a from s leads to s', i.e., P(s' | s, a)
    - Also called the model
  - A reward function R(s, a, s')
    - Sometimes just R(s) or R(s')
  - A start state (or distribution)
  - Maybe a terminal state
- MDPs are a family of nondeterministic search problems
- Reinforcement learning: MDPs where we don't know the transition or reward functions

Solving MDPs
- In deterministic single-agent search problems, want an optimal plan, or sequence of actions, from start to a goal
- In an MDP, we want an optimal policy π(s)
  - A policy gives an action for each state
  - Optimal policy maximizes expected utility if followed
  - Defines a reflex agent

Example Optimal Policies
- Optimal policy when R(s, a, s') = [...] for all non-terminal states s
- [Figure: optimal grid-world policies for several values of R(s), including R(s) = -0.4]
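The MDP definition and the notion of a policy can be sketched as a small data structure. This is only an illustration: the `MDP` class, its field names, and the three-state `toy` problem are hypothetical and do not come from the lecture.

```python
from dataclasses import dataclass, field


@dataclass
class MDP:
    """Container mirroring the slide's definition (all names are assumptions)."""
    states: list
    actions: list
    transitions: dict  # (s, a) -> {s_next: P(s_next | s, a)}
    rewards: dict      # (s, a, s_next) -> reward; missing entries mean 0
    start: str
    terminals: set = field(default_factory=set)

    def T(self, s, a, s_next):
        """Transition function T(s, a, s') = P(s' | s, a)."""
        return self.transitions.get((s, a), {}).get(s_next, 0.0)

    def R(self, s, a, s_next):
        """Reward function R(s, a, s')."""
        return self.rewards.get((s, a, s_next), 0.0)

    def is_valid(self, tol=1e-9):
        """Every (s, a) pair must define a probability distribution over s'."""
        return all(abs(sum(dist.values()) - 1.0) <= tol
                   for dist in self.transitions.values())


# Hypothetical toy problem: reach "end" from "A" via "B".
toy = MDP(
    states=["A", "B", "end"],
    actions=["stay", "go"],
    transitions={
        ("A", "stay"): {"A": 1.0},
        ("A", "go"): {"B": 0.8, "A": 0.2},
        ("B", "stay"): {"B": 1.0},
        ("B", "go"): {"end": 1.0},
    },
    rewards={("B", "go", "end"): 10.0},
    start="A",
    terminals={"end"},
)

# A policy gives one action for each non-terminal state (a reflex agent).
policy = {"A": "go", "B": "go"}
```

Storing the policy as a plain mapping from state to action reflects the point that, once computed, it needs no further search at run time.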

Example: High-Low
- Three card types: 2, 3, 4
- Infinite deck, twice as many 2's
- Start with 3 showing
- After each card, you say "high" or "low"
- New card is flipped
- If you're right, you win the points shown on the new card
- Ties are no-ops
- If you're wrong, game ends
- Differences from expectimax:
  - #1: get rewards as you go
  - #2: you might play forever!

High-Low
- States: 2, 3, 4, done
- Actions: High, Low
- Model: T(s, a, s'):
  - P(s'=done | 4, High) = 3/4
  - P(s'=2 | 4, High) = 0
  - P(s'=3 | 4, High) = 0
  - P(s'=4 | 4, High) = 1/4
  - P(s'=done | 4, Low) = 0
  - P(s'=2 | 4, Low) = 1/2
  - P(s'=3 | 4, Low) = 1/4
  - P(s'=4 | 4, Low) = 1/4
- Rewards: R(s, a, s'):
  - Number shown on s' if s ≠ s'
  - 0 otherwise
- Start: 3
- Note: could choose actions with search. How?

Example: High-Low
- [Figure: search tree for High-Low, with High and Low branches labeled by transition probability T and reward R]

MDP Search Trees
- Each MDP state gives an expectimax-like search tree
  - s is a state
  - (s, a) is a q-state
  - (s, a, s') is called a transition
  - T(s, a, s') = P(s' | s, a)
  - R(s, a, s')

Utilities of Sequences
- In order to formalize optimality of a policy, need to understand utilities of sequences of rewards
- Typically consider stationary preferences:
  $[r, r_0, r_1, r_2, \ldots] \succ [r, r_0', r_1', r_2', \ldots] \iff [r_0, r_1, r_2, \ldots] \succ [r_0', r_1', r_2', \ldots]$
- Theorem: only two ways to define stationary utilities
  - Additive utility: $U([r_0, r_1, r_2, \ldots]) = r_0 + r_1 + r_2 + \cdots$
  - Discounted utility: $U([r_0, r_1, r_2, \ldots]) = r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots$
- Assuming that reward depends only on state for these slides!

Infinite Utilities?!
- Problem: infinite sequences with infinite rewards
- Solutions:
  - Finite horizon:
    - Terminate episodes after a fixed T steps
    - Gives nonstationary policy (π depends on time left)
  - Absorbing state(s): guarantee that for every policy, agent will eventually "die" (like "done" for High-Low)
  - Discounting: for 0 < γ < 1,
    $U([r_0, r_1, \ldots]) = \sum_{t=0}^{\infty} \gamma^t r_t \le R_{\max} / (1 - \gamma)$
    - Smaller γ means smaller "horizon", i.e. shorter-term focus
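As a rough check of the High-Low model and of discounted utility, the sketch below rebuilds the transition table from the deck composition and sums a discounted reward sequence. The function names (`transition`, `reward`, `discounted_utility`) and the sample reward sequence are assumptions made for illustration.

```python
from fractions import Fraction

# Infinite deck with twice as many 2's as 3's or 4's.
CARD_PROBS = {2: Fraction(1, 2), 3: Fraction(1, 4), 4: Fraction(1, 4)}


def transition(s, a):
    """Return {s': P(s' | s, a)} when card s is showing and the guess is a."""
    dist = {2: Fraction(0), 3: Fraction(0), 4: Fraction(0), "done": Fraction(0)}
    for card, p in CARD_PROBS.items():
        if card == s:
            dist[card] += p        # tie: no-op, the same value stays showing
        elif (card > s) == (a == "High"):
            dist[card] += p        # correct guess: the new card is showing
        else:
            dist["done"] += p      # wrong guess: game ends
    return dist


def reward(s, a, s_next):
    """Number shown on s' if s != s' (and the game continues), else 0."""
    return s_next if s_next not in ("done", s) else 0


def discounted_utility(rewards, gamma):
    """U([r0, r1, ...]) = r0 + gamma*r1 + gamma^2*r2 + ..."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


print(transition(4, "High"))   # done has probability 3/4, card 4 has 1/4
print(discounted_utility([4, 2, 3], 0.5))
```

Using `Fraction` keeps the probabilities exact, so the output can be compared directly with the model listed on the slide.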

Discounting
- Typically discount rewards by γ < 1 each time step
  - Sooner rewards have higher utility than later rewards
  - Also helps the algorithms converge

Optimal Utilities
- Fundamental operation: compute the optimal utilities of states s (all at once)
- Why? Optimal values define optimal policies!
- Define the utility of a state s:
  V*(s) = expected return starting in s and acting optimally
- Define the utility of a q-state (s, a):
  Q*(s, a) = expected return starting in s, taking action a and thereafter acting optimally
- Define the optimal policy:
  π*(s) = optimal action from state s

The Bellman Equations
- Definition of utility leads to a simple one-step lookahead relationship amongst optimal utility values:
  Optimal rewards = maximize over first action and then follow optimal policy
- Formally:
  $V^*(s) = \max_a Q^*(s, a)$
  $Q^*(s, a) = \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^*(s') \right]$
  $V^*(s) = \max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^*(s') \right]$

Solving MDPs
- We want to find the optimal policy π*
- Proposal 1: modified expectimax search, starting from each state s, choosing $\pi^*(s) = \arg\max_a Q^*(s, a)$

Why Not Search Trees?
- Why not solve with expectimax?
- Problems:
  - This tree is usually infinite (why?)
  - Same states appear over and over (why?)
  - We would search once per state (why?)
- Idea: Value iteration
  - Compute optimal values for all states all at once using successive approximations
  - Will be a bottom-up dynamic program similar in cost to memoization
  - Do all planning offline, no replanning needed!

Value Estimates
- Calculate estimates V_k*(s)
  - Not the optimal value of s!
  - The optimal value considering only the next k time steps (k rewards)
  - As k → ∞, it approaches the optimal value
- Why:
  - If discounting, distant rewards become negligible
  - If terminal states are reachable from everywhere, the fraction of episodes not ending becomes negligible
  - Otherwise, can get infinite expected utility and then this approach actually won't work

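Memoized Recursion?
- Recurrences:
  $V_0^*(s) = 0$
  $V_i^*(s) = \max_a Q_i^*(s, a)$
  $Q_i^*(s, a) = \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V_{i-1}^*(s') \right]$
  $\pi_i(s) = \arg\max_a Q_i^*(s, a)$
- Cache all function call results so you never repeat work
- What happened to the evaluation functions?

Value Iteration
- Problems with the recursive computation:
  - Have to keep all the V_k*(s) around all the time
  - Don't know which depth π_k(s) to ask for when planning
- Solution: value iteration
  - Calculate values for all states, bottom-up
  - Keep increasing k until convergence

Value Iteration
- Idea:
  - Start with V_0*(s) = 0, which we know is right (why?)
  - Given V_i*, calculate the values for all states for depth i+1:
    $V_{i+1}(s) \leftarrow \max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V_i(s') \right]$
  - This is called a value update or Bellman update
  - Repeat until convergence
- Theorem: will converge to unique optimal values
  - Basic idea: approximations get refined towards optimal values
  - Policy may converge long before values do

Example: Bellman Updates
- [Figure: worked Bellman update]

Example: Value Iteration
- [Figure: value estimates V_2 and V_3]
- Information propagates outward from terminal states and eventually all states have correct value estimates [DEMO]

Convergence*
- Define the max-norm: $\lVert U \rVert = \max_s |U(s)|$
- Theorem: For any two approximations U and V,
  $\lVert U_{t+1} - V_{t+1} \rVert \le \gamma \lVert U_t - V_t \rVert$
  - I.e. any distinct approximations must get closer to each other, so, in particular, any approximation must get closer to the true U, and value iteration converges to a unique, stable, optimal solution
- Theorem:
  $\lVert U_{t+1} - U_t \rVert < \epsilon \implies \lVert U_{t+1} - U \rVert < 2 \epsilon \gamma / (1 - \gamma)$
  - I.e. once the change in our approximation is small, it must also be close to correct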
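A minimal value-iteration sketch for the High-Low game follows, stopping when the max-norm change between successive estimates falls below a threshold. The discount `GAMMA = 0.9`, the threshold `eps`, and all function names are assumptions chosen for illustration.

```python
GAMMA = 0.9  # assumed discount
CARD_PROBS = {2: 0.5, 3: 0.25, 4: 0.25}
STATES = (2, 3, 4, "done")
ACTIONS = ("High", "Low")


def outcomes(s, a):
    """Yield (probability, next_state, reward) triples; 'done' is terminal."""
    if s == "done":
        return
    for card, p in CARD_PROBS.items():
        if card == s:
            yield p, card, 0          # tie: no-op
        elif (card > s) == (a == "High"):
            yield p, card, card       # right guess
        else:
            yield p, "done", 0        # wrong guess


def q_value(V, s, a, gamma=GAMMA):
    """One-step lookahead: sum over s' of T * (R + gamma * V(s'))."""
    return sum(p * (r + gamma * V[s_next]) for p, s_next, r in outcomes(s, a))


def value_iteration(gamma=GAMMA, eps=1e-10, max_sweeps=100_000):
    """Repeat Bellman updates on all states until the max-norm change < eps."""
    V = {s: 0.0 for s in STATES}      # V_0(s) = 0
    for sweep in range(1, max_sweeps + 1):
        new_V = {s: max(q_value(V, s, a, gamma) for a in ACTIONS) for s in STATES}
        change = max(abs(new_V[s] - V[s]) for s in STATES)
        V = new_V
        if change < eps:
            return V, sweep
    return V, max_sweeps


def greedy_policy(V, gamma=GAMMA):
    """Extract a policy by one-step lookahead on the converged values."""
    return {s: max(ACTIONS, key=lambda a: q_value(V, s, a, gamma))
            for s in STATES if s != "done"}


V, sweeps = value_iteration()
policy = greedy_policy(V)
print(sweeps, policy)
```

Each sweep keeps only the previous estimate, which is the practical difference from the memoized recursion that stores every V_k.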

Practice: Computing Actions
- Which action should we choose from state s:
  - Given optimal values V?
    $\arg\max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^*(s') \right]$
  - Given optimal q-values Q?
    $\arg\max_a Q^*(s, a)$
- Lesson: actions are easier to select from Q's!

Utilities for Fixed Policies
- Another basic operation: compute the utility of a state s under a fixed (generally non-optimal) policy
- Define the utility of a state s, under a fixed policy π:
  V^π(s) = expected total discounted rewards (return) starting in s and following π
- Recursive relation (one-step look-ahead / Bellman equation):
  $V^\pi(s) = \sum_{s'} T(s, \pi(s), s') \left[ R(s, \pi(s), s') + \gamma V^\pi(s') \right]$

Policy Evaluation
- How do we calculate the V's for a fixed policy?
- Idea one: turn the recursive equation into updates
  $V_{i+1}^\pi(s) \leftarrow \sum_{s'} T(s, \pi(s), s') \left[ R(s, \pi(s), s') + \gamma V_i^\pi(s') \right]$
- Idea two: it's just a linear system, solve with Matlab (or whatever)

Policy Iteration
- Alternative approach:
  - Step 1: Policy evaluation: calculate utilities for a fixed policy (not optimal utilities!) until convergence
  - Step 2: Policy improvement: update policy using one-step lookahead with resulting converged (but not optimal!) utilities
  - Repeat steps until policy converges
- This is policy iteration
  - It's still optimal!
  - Can converge faster under some conditions

Policy Iteration
- Policy evaluation: with fixed current policy π_k, find values with simplified Bellman updates:
  $V_{i+1}^{\pi_k}(s) \leftarrow \sum_{s'} T(s, \pi_k(s), s') \left[ R(s, \pi_k(s), s') + \gamma V_i^{\pi_k}(s') \right]$
  - Iterate until values converge
- Policy improvement: with fixed utilities, find the best action according to one-step look-ahead:
  $\pi_{k+1}(s) = \arg\max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^{\pi_k}(s') \right]$

Comparison
- In value iteration:
  - Every pass (or "backup") updates both utilities (explicitly, based on current utilities) and policy (possibly implicitly, based on current policy)
- In policy iteration:
  - Several passes to update utilities with frozen policy
  - Occasional passes to update policies
- Hybrid approaches (asynchronous policy iteration):
  - Any sequences of partial updates to either policy entries or utilities will converge if every state is visited infinitely often
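The two steps of policy iteration can be sketched on the High-Low game as follows. Policy evaluation here uses repeated updates ("idea one") rather than solving the linear system; the discount `GAMMA = 0.9`, the all-"High" starting policy, and the function names are assumptions.

```python
GAMMA = 0.9  # assumed discount
CARD_PROBS = {2: 0.5, 3: 0.25, 4: 0.25}
STATES = (2, 3, 4, "done")
ACTIONS = ("High", "Low")


def outcomes(s, a):
    """Yield (probability, next_state, reward) triples; 'done' is terminal."""
    if s == "done":
        return
    for card, p in CARD_PROBS.items():
        if card == s:
            yield p, card, 0
        elif (card > s) == (a == "High"):
            yield p, card, card
        else:
            yield p, "done", 0


def q_value(V, s, a, gamma=GAMMA):
    """One-step lookahead value of taking a in s, then following V."""
    return sum(p * (r + gamma * V[s_next]) for p, s_next, r in outcomes(s, a))


def evaluate(policy, gamma=GAMMA, eps=1e-10):
    """Policy evaluation: simplified Bellman updates with the policy frozen."""
    V = {s: 0.0 for s in STATES}
    while True:
        new_V = {s: (q_value(V, s, policy[s], gamma) if s in policy else 0.0)
                 for s in STATES}
        change = max(abs(new_V[s] - V[s]) for s in STATES)
        V = new_V
        if change < eps:
            return V


def improve(policy, V, gamma=GAMMA):
    """Policy improvement: switch only when another action is strictly better."""
    new_policy = {}
    for s, current in policy.items():
        best, best_q = current, q_value(V, s, current, gamma)
        for a in ACTIONS:
            q = q_value(V, s, a, gamma)
            if q > best_q + 1e-12:
                best, best_q = a, q
        new_policy[s] = best
    return new_policy


def policy_iteration(gamma=GAMMA):
    """Alternate evaluation and improvement until the policy stops changing."""
    policy = {s: "High" for s in STATES if s != "done"}  # arbitrary start
    rounds = 0
    while True:
        rounds += 1
        V = evaluate(policy, gamma)
        new_policy = improve(policy, V, gamma)
        if new_policy == policy:
            return policy, V, rounds
        policy = new_policy


policy, V, rounds = policy_iteration()
print(rounds, policy)
```

Keeping the current action on ties is a design choice that prevents the loop from flipping between equally good policies.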
