Reinforcement Learning. CS 188: Artificial Intelligence Fall Grid World. Markov Decision Processes. What is Markov about MDPs?


Laureen Lamb
5 years ago


CS 188: Artificial Intelligence, Fall 2010
Lecture 9: MDPs, 9/2/2010
Dan Klein, UC Berkeley
Many slides over the course adapted from either Stuart Russell or Andrew Moore

Reinforcement Learning [DEMOS]
Basic idea:
- Receive feedback in the form of rewards
- Agent's utility is defined by the reward function
- Must (learn to) act so as to maximize expected rewards

Grid World [DEMO Gridworld Intro]
- The agent lives in a grid
- Walls block the agent's path
- The agent's actions do not always go as planned:
  - 80% of the time, the action North takes the agent North (if there is no wall there)
  - 10% of the time, North takes the agent West; 10% East
  - If there is a wall in the direction the agent would have been taken, the agent stays put
- Small living reward each step
- Big rewards come at the end
- Goal: maximize sum of rewards*

Grid Futures
[Figure: outcomes of the actions E, N, S, W in a deterministic grid world versus a stochastic grid world]

Markov Decision Processes
An MDP is defined by:
- A set of states $s \in S$
- A set of actions $a \in A$
- A transition function $T(s, a, s')$: the probability that $a$ from $s$ leads to $s'$, i.e., $P(s' \mid s, a)$. Also called the model
- A reward function $R(s, a, s')$. Sometimes just $R(s)$ or $R(s')$
- A start state (or distribution)
- Maybe a terminal state
MDPs are a family of nondeterministic search problems.
Reinforcement learning: MDPs where we don't know the transition or reward functions.

What is Markov about MDPs?
- Andrey Markov
- "Markov" generally means that given the present state, the future and the past are independent
- For Markov decision processes, "Markov" means:
$$P(S_{t+1} = s' \mid S_t = s_t, A_t = a_t, S_{t-1} = s_{t-1}, A_{t-1} = a_{t-1}, \ldots, S_0 = s_0) = P(S_{t+1} = s' \mid S_t = s_t, A_t = a_t)$$
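The noisy movement rule described for Grid World can be sketched as a transition function that returns the distribution $T(s, a, \cdot)$. This is only an illustration: the 4x3 layout, the wall position, the `(x, y)` coordinate convention, and every name below are assumptions, not taken from the lecture.

```python
# Hypothetical gridworld: 4 columns x 3 rows, one wall at (1, 1). Layout is assumed.
WIDTH, HEIGHT = 4, 3
WALLS = {(1, 1)}

# Displacement for each compass direction, with y increasing to the North.
MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
# The two perpendicular directions the agent may slip into for each intended action.
SLIPS = {"N": ("W", "E"), "S": ("E", "W"), "E": ("N", "S"), "W": ("S", "N")}


def step(state, direction):
    """Move one cell in `direction`; stay put if blocked by a wall or the grid edge."""
    x, y = state
    dx, dy = MOVES[direction]
    nxt = (x + dx, y + dy)
    inside = 0 <= nxt[0] < WIDTH and 0 <= nxt[1] < HEIGHT
    return nxt if inside and nxt not in WALLS else state


def transition(state, action, noise=0.2):
    """Return {next_state: probability}: 1 - noise for the intended direction,
    noise / 2 for each perpendicular direction."""
    left, right = SLIPS[action]
    dist = {}
    for direction, p in ((action, 1 - noise), (left, noise / 2), (right, noise / 2)):
        nxt = step(state, direction)
        dist[nxt] = dist.get(nxt, 0.0) + p
    return dist
```

With the default `noise=0.2`, calling `transition((0, 0), "N")` puts probability 0.8 on the cell to the North and 0.1 on each side; the westward slip hits the grid edge, so that mass stays on `(0, 0)`.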


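Solving MDPs
- In deterministic single-agent search problems, we want an optimal plan, or sequence of actions, from start to a goal
- In an MDP, we want an optimal policy $\pi^*: S \to A$
  - A policy $\pi$ gives an action for each state
  - An optimal policy maximizes expected utility if followed
  - Defines a reflex agent
[Figure: optimal policy when $R(s, a, s') = -0.03$ for all non-terminal states $s$]

Example Optimal Policies [Demo]
[Figure: optimal gridworld policies for several living rewards, including $R(s) = -0.03$ and $R(s) = -0.4$]

Example: High-Low
- Three card types: 2, 3, 4
- Infinite deck, twice as many 2's
- Start with 3 showing
- After each card, you say "high" or "low"
- New card is flipped
- If you're right, you win the points shown on the new card
- Ties are no-ops
- If you're wrong, game ends
Why not use expectimax?
- #1: get rewards as you go
- #2: you might play forever!

High-Low as an MDP
- States: 2, 3, 4, done
- Actions: High, Low
- Model: $T(s, a, s')$:
  - $P(s' = 4 \mid 4, \text{Low}) = 1/4$
  - $P(s' = 3 \mid 4, \text{Low}) = 1/4$
  - $P(s' = 2 \mid 4, \text{Low}) = 1/2$
  - $P(s' = \text{done} \mid 4, \text{Low}) = 0$
  - $P(s' = 4 \mid 4, \text{High}) = 1/4$
  - $P(s' = 3 \mid 4, \text{High}) = 0$
  - $P(s' = 2 \mid 4, \text{High}) = 0$
  - $P(s' = \text{done} \mid 4, \text{High}) = 3/4$
- Rewards: $R(s, a, s')$: number shown on $s'$ if $s \neq s'$, 0 otherwise
- Start: 3

Example: High-Low MDP Search Trees
- Each MDP state gives an expectimax-like search tree
- $s$ is a state
- $(s, a)$ is a q-state
- $(s, a, s')$ is called a transition, with $T(s, a, s') = P(s' \mid s, a)$ and reward $R(s, a, s')$
[Figure: search tree rooted at state 3 with q-states (3, Low) and (3, High); branch labels T = 0.5, R = 2; T = 0.25, R = 3; T = 0, R = 4; T = 0.25, R = 0]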
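The High-Low model can be written out as a small sketch that derives $T(s, a, \cdot)$ from the deck composition and checks it against the probabilities listed for state 4. The function names, the use of `Fraction`, and the choice to pay 0 on any transition into `done` are assumptions made for illustration.

```python
from fractions import Fraction

# Infinite deck with twice as many 2s as 3s or 4s.
CARD_PROBS = {2: Fraction(1, 2), 3: Fraction(1, 4), 4: Fraction(1, 4)}


def transition(state, action):
    """Return T(s, a, .) as {next_state: probability} for a showing card
    `state` in {2, 3, 4} and a guess `action` in {"High", "Low"}."""
    dist = {2: Fraction(0), 3: Fraction(0), 4: Fraction(0), "done": Fraction(0)}
    for card, p in CARD_PROBS.items():
        if card == state:
            dist[card] += p  # tie: a no-op, the same value keeps showing
        elif (card > state) == (action == "High"):
            dist[card] += p  # correct guess: the new card is now showing
        else:
            dist["done"] += p  # wrong guess: the game ends
    return dist


def reward(state, action, next_state):
    """Points shown on the new card if it differs from the old one, else 0.
    Assumption: a transition into "done" pays nothing."""
    if next_state == "done" or next_state == state:
        return 0
    return next_state
```

For example, `transition(4, "High")` assigns 3/4 to `done` and 1/4 to the tie on 4, matching the model above.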

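Utilities of Sequences
- In order to formalize optimality of a policy, we need to understand utilities of sequences of rewards
- Typically consider stationary preferences:
$$[r, r_0, r_1, r_2, \ldots] \succ [r, r_0', r_1', r_2', \ldots] \Leftrightarrow [r_0, r_1, r_2, \ldots] \succ [r_0', r_1', r_2', \ldots]$$
- Theorem: only two ways to define stationary utilities
  - Additive utility: $U([r_0, r_1, r_2, \ldots]) = r_0 + r_1 + r_2 + \cdots$
  - Discounted utility: $U([r_0, r_1, r_2, \ldots]) = r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots$

Infinite Utilities?!
- Problem: infinite state sequences have infinite rewards
- Solutions:
  - Finite horizon: terminate episodes after a fixed T steps (e.g. life). Gives nonstationary policies ($\pi$ depends on time left)
  - Absorbing state: guarantee that for every policy, a terminal state will eventually be reached (like "done" for High-Low)
  - Discounting: for $0 < \gamma < 1$, $U([r_0, r_1, \ldots]) = \sum_{t=0}^{\infty} \gamma^t r_t \le R_{\max} / (1 - \gamma)$
  - Smaller $\gamma$ means smaller "horizon": shorter term focus

Discounting
- Typically discount rewards by $\gamma < 1$ each time step
- Sooner rewards have higher utility than later rewards
- Also helps the algorithms converge

Recap: Defining MDPs
- Markov decision processes:
  - States $S$
  - Start state $s_0$
  - Actions $A$
  - Transitions $P(s' \mid s, a)$ (or $T(s, a, s')$)
  - Rewards $R(s, a, s')$ (and discount $\gamma$)
- MDP quantities so far:
  - Policy = choice of action for each state
  - Utility (or return) = sum of discounted rewards

Optimal Utilities [DEMO Grid Values]
- Fundamental operation: compute the values (optimal expectimax utilities) of states $s$
- Why? Optimal values define optimal policies!
- Define the value of a state $s$: $V^*(s)$ = expected utility starting in $s$ and acting optimally
- Define the value of a q-state $(s, a)$: $Q^*(s, a)$ = expected utility starting in $s$, taking action $a$ and thereafter acting optimally
- Define the optimal policy: $\pi^*(s)$ = optimal action from state $s$

The Bellman Equations
- Definition of "optimal utility" leads to a simple one-step lookahead relationship amongst optimal utility values: optimal rewards = maximize over first action and then follow optimal policy
- Formally:
$$V^*(s) = \max_a Q^*(s, a)$$
$$Q^*(s, a) = \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^*(s') \right]$$
$$V^*(s) = \max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^*(s') \right]$$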
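As a small illustration of additive versus discounted utility, the sum $r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots$ can be computed for a finite reward sequence as follows. The function name and the sample reward lists are made up for this sketch.

```python
def discounted_return(rewards, gamma):
    """Return r_0 + gamma * r_1 + gamma**2 * r_2 + ... for a finite reward list.

    Evaluated back to front, so each step is one multiply and one add.
    With gamma = 1 this reduces to the additive utility.
    """
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Additive utility (gamma = 1) versus discounted utility (gamma = 0.5).
additive = discounted_return([1, 1, 1], 1.0)
discounted = discounted_return([1, 1, 1], 0.5)

# The same reward is worth more when it arrives sooner.
sooner = discounted_return([1, 0, 0], 0.5)
later = discounted_return([0, 0, 1], 0.5)
```

Here `discounted` stays below the bound $R_{\max} / (1 - \gamma) = 2$ no matter how long the all-ones sequence grows, which is the sense in which discounting keeps infinite sequences finite.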

Solving MDPs
- We want to find the optimal policy $\pi^*$
- Proposal 1: modified expectimax search, starting from each state $s$:
$$\pi^*(s) = \arg\max_a Q^*(s, a)$$
$$Q^*(s, a) = \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^*(s') \right]$$
$$V^*(s) = \max_a Q^*(s, a)$$

Why Not Search Trees?
- Why not solve with expectimax?
- Problems:
  - This tree is usually infinite (why?)
  - Same states appear over and over (why?)
  - We would search once per state (why?)
- Idea: value iteration
  - Compute optimal values for all states all at once using successive approximations
  - Will be a bottom-up dynamic program similar in cost to memoization
  - Do all planning offline, no replanning needed!

Value Estimates
- Calculate estimates $V_k^*(s)$
  - Not the optimal value of $s$!
  - The optimal value considering only the next $k$ time steps ($k$ rewards)
  - As $k \to \infty$, it approaches the optimal value
- Almost solution: recursion (i.e. expectimax)
- Correct solution: dynamic programming

Value Iteration [DEMO: V_k]
- Idea:
  - Start with $V_0^*(s) = 0$, which we know is right (why?)
  - Given $V_i^*$, calculate the values for all states for depth $i + 1$:
$$V_{i+1}(s) \leftarrow \max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V_i(s') \right]$$
  - This is called a value update or Bellman update
  - Repeat until convergence
- Theorem: will converge to unique optimal values
  - Basic idea: approximations get refined towards optimal values
  - Policy may converge long before values do

Example: Bellman Updates
Example: $\gamma = 0.9$, living reward = 0, noise = 0.2
[Figure: one Bellman update on the gridworld; the max happens for $a$ = right, other actions not shown]

Example: Value Iteration [DEMO]
[Figure: gridworld value estimates $V_2$ and $V_3$]
- Information propagates outward from terminal states and eventually all states have correct value estimates
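The value iteration loop described above can be sketched in a few lines. This is a minimal illustration under assumed interfaces, not the course's implementation: `T(s, a)` is taken to return a dictionary of next-state probabilities, `actions(s)` returns an empty list for terminal states, and the three-state chain at the bottom is a made-up example.

```python
def value_iteration(states, actions, T, R, gamma, tol=1e-9, max_iters=10_000):
    """Repeat the Bellman update from V_0 = 0 until the largest change is below tol.

    Assumed interfaces: actions(s) -> list of actions (empty if terminal),
    T(s, a) -> {next_state: probability}, R(s, a, next_state) -> float.
    """
    V = {s: 0.0 for s in states}
    for _ in range(max_iters):
        new_V = {}
        for s in states:
            # One q-value per action: expected reward plus discounted next value.
            q_values = [
                sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in T(s, a).items())
                for a in actions(s)
            ]
            new_V[s] = max(q_values) if q_values else 0.0
        delta = max(abs(new_V[s] - V[s]) for s in states)
        V = new_V
        if delta < tol:
            break
    return V


# Hypothetical chain: from "a", "go" reaches "b" with probability 0.8 (else stays);
# from "b", "go" always reaches the terminal state "end" and pays 1.
STATES = ["a", "b", "end"]


def actions(s):
    return [] if s == "end" else ["go"]


def T(s, a):
    return {"b": 0.8, "a": 0.2} if s == "a" else {"end": 1.0}


def R(s, a, s2):
    return 1.0 if s2 == "end" else 0.0


V = value_iteration(STATES, actions, T, R, gamma=0.9)
```

In this chain the fixed point can be checked by hand: $V(b) = 1$ and $V(a) = 0.8 \cdot 0.9 \cdot 1 + 0.2 \cdot 0.9 \cdot V(a)$, so $V(a) = 0.72 / 0.82$. Building a fresh `new_V` each sweep keeps every update based on the previous iterate $V_i$, matching the update rule as stated.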

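Convergence*
- Define the max-norm: $\|U\| = \max_s |U(s)|$
- Theorem: for any two approximations $U^t$ and $V^t$:
$$\|U^{t+1} - V^{t+1}\| \le \gamma \, \|U^t - V^t\|$$
  - I.e. any distinct approximations must get closer to each other, so, in particular, any approximation must get closer to the true $U$, and value iteration converges to a unique, stable, optimal solution
- Theorem:
$$\|U^{t+1} - U^t\| < \epsilon \;\Rightarrow\; \|U^{t+1} - U\| < \frac{2 \epsilon \gamma}{1 - \gamma}$$
  - I.e. once the change in our approximation is small, it must also be close to correct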
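One way to see where the factor $\gamma$ in the first theorem comes from is the short sketch below. The operator notation $B$ for a single Bellman update is an assumed shorthand, not something the slides use. The first inequality relies on $|\max_a f(a) - \max_a g(a)| \le \max_a |f(a) - g(a)|$, and the last on the transition probabilities summing to 1.

```latex
\begin{align*}
(BU)(s) &= \max_a \sum_{s'} T(s,a,s')\,\bigl[R(s,a,s') + \gamma\,U(s')\bigr] \\
\bigl|(BU)(s) - (BV)(s)\bigr|
  &\le \max_a \Bigl|\sum_{s'} T(s,a,s')\,\gamma\,\bigl(U(s') - V(s')\bigr)\Bigr| \\
  &\le \gamma \max_a \sum_{s'} T(s,a,s')\,\bigl|U(s') - V(s')\bigr| \\
  &\le \gamma\,\|U - V\|
\end{align*}
```

Taking the maximum over $s$ on the left gives $\|BU - BV\| \le \gamma \|U - V\|$, which is the contraction stated in the theorem.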
