Non-Deterministic Search. CS 188: Artificial Intelligence Markov Decision Processes. Grid World Actions. Example: Grid World



CS 188: Artificial Intelligence
Markov Decision Processes
Dan Klein, Pieter Abbeel
University of California, Berkeley

Non-Deterministic Search

Example: Grid World
  A maze-like problem
    The agent lives in a grid
    Walls block the agent's path
  Noisy movement: actions do not always go as planned
    80% of the time, the action North takes the agent North (if there is no wall there)
    10% of the time, North takes the agent West; 10% East
    If there is a wall in the direction the agent would have been taken, the agent stays put
  The agent receives rewards each time step
    Small "living" reward each step (can be negative)
    Big rewards come at the end (good or bad)
  Goal: maximize sum of rewards

Grid World Actions
  [Figure: Deterministic Grid World vs. Stochastic Grid World. Gridworld example: Stuart Russell]

Markov Decision Processes
  An MDP is defined by:
    A set of states s ∈ S
    A set of actions a ∈ A
    A transition function T(s, a, s')
      Probability that a from s leads to s', i.e., P(s' | s, a)
      Also called the model or the dynamics
    A reward function R(s, a, s')
      Sometimes just R(s) or R(s')
    A start state
    Maybe a terminal state
  MDPs are non-deterministic search problems
    One way to solve them is with expectimax search
    We'll have a new tool soon

What is Markov about MDPs?
  "Markov" generally means that given the present state, the future and the past are independent
  For Markov decision processes, "Markov" means action outcomes depend only on the current state
  This is just like search, where the successor function could only depend on the current state (not the history)
  [Portrait: Andrey Markov]
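The noisy movement rule above can be written down as a transition function T(s, a, ·). The following is a minimal sketch, not the course's implementation: the 4x3 grid size, the wall at (1, 1), and all function and constant names are assumptions chosen for illustration.

```python
# Noisy Grid World transition model (sketch; layout and names are illustrative).
NORTH, SOUTH, EAST, WEST = (0, 1), (0, -1), (1, 0), (-1, 0)

# The two perpendicular "slip" directions for each intended action.
SIDEWAYS = {NORTH: (WEST, EAST), SOUTH: (EAST, WEST),
            EAST: (NORTH, SOUTH), WEST: (SOUTH, NORTH)}


def move(state, direction, width, height, walls):
    """Deterministic result of one step; the agent stays put if blocked."""
    x, y = state[0] + direction[0], state[1] + direction[1]
    if (x, y) in walls or not (0 <= x < width and 0 <= y < height):
        return state
    return (x, y)


def transition(state, action, width=4, height=3, walls=frozenset({(1, 1)})):
    """Return {s': P(s' | s, a)}: 80% intended direction, 10% each side."""
    left, right = SIDEWAYS[action]
    dist = {}
    for direction, prob in ((action, 0.8), (left, 0.1), (right, 0.1)):
        nxt = move(state, direction, width, height, walls)
        dist[nxt] = dist.get(nxt, 0.0) + prob  # merge outcomes that coincide
    return dist
```

From the corner (0, 0), choosing North moves up with probability 0.8, slips East with 0.1, and bumps into the boundary (staying put) with 0.1.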

Policies
  In deterministic single-agent search problems, we wanted an optimal plan, or sequence of actions, from start to a goal
  For MDPs, we want an optimal policy π*: S → A
    A policy π gives an action for each state
    An optimal policy is one that maximizes expected utility if followed
    An explicit policy defines a reflex agent
  Expectimax didn't compute entire policies
    It computed the action for a single state only
  [Figure: optimal policy when R(s, a, s') is a fixed living reward for all non-terminal states s]

Optimal Policies
  [Figure: four panels of optimal policies, one per living reward R(s); panel labels include R(s) = -0.4 and R(s) = -2.0]

Example: Racing
  A robot car wants to travel far, quickly
  Three states: Cool, Warm, Overheated
  Two actions: Slow, Fast
  Going faster gets double reward
  [Figure: transition diagram over Cool, Warm, Overheated with Slow and Fast arcs; arc labels include +1 and -10]

Racing Search Tree

MDP Search Trees
  Each MDP state projects an expectimax-like search tree
    s is a state
    (s, a) is a q-state
    (s, a, s') is called a transition
    T(s, a, s') = P(s' | s, a)
    R(s, a, s')
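To make the search-tree picture concrete, here is a small sketch of depth-limited expectimax on the racing MDP: max nodes are states, chance nodes are q-states. The transition probabilities and the +2 reward for Fast are assumptions taken from the usual form of this example (only +1 and -10 appear in the text above), and the table layout and function names are illustrative.

```python
# Racing MDP (ASSUMED numbers: probabilities and the +2 reward are not in the text).
# T[s][a] is a list of (probability, next_state, reward) transitions.
T = {
    "cool": {"slow": [(1.0, "cool", 1.0)],
             "fast": [(0.5, "cool", 2.0), (0.5, "warm", 2.0)]},
    "warm": {"slow": [(0.5, "cool", 1.0), (0.5, "warm", 1.0)],
             "fast": [(1.0, "overheated", -10.0)]},
    "overheated": {},  # terminal: no actions available
}


def q_value(s, a, depth, gamma=1.0):
    """Chance node (q-state): expectation over transitions (s, a, s')."""
    return sum(p * (r + gamma * value(s2, depth - 1, gamma))
               for p, s2, r in T[s][a])


def value(s, depth, gamma=1.0):
    """Max node (state): best q-value with `depth` time steps left."""
    if depth == 0 or not T[s]:
        return 0.0
    return max(q_value(s, a, depth, gamma) for a in T[s])
```

Under these assumed numbers, `value("cool", 2)` is 3.5 and `value("warm", 2)` is 2.5. The recursion revisits the same states many times, which is the inefficiency value iteration removes.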

Utilities of Sequences
  What preferences should an agent have over reward sequences?
    More or less? [1, 2, 2] or [2, 3, 4]
    Now or later? [0, 0, 1] or [1, 0, 0]

Discounting
  It's reasonable to maximize the sum of rewards
  It's also reasonable to prefer rewards now to rewards later
  One solution: values of rewards decay exponentially
    Worth Now: $1$    Worth Next Step: $\gamma$    Worth In Two Steps: $\gamma^2$
  How to discount?
    Each time we descend a level, we multiply in the discount once
  Why discount?
    Sooner rewards probably do have higher utility than later rewards
    Also helps our algorithms converge
  Example: discount of 0.5
    U([1,2,3]) = 1*1 + 0.5*2 + 0.25*3
    U([1,2,3]) < U([3,2,1])

Stationary Preferences
  Theorem: if we assume stationary preferences:
    $[a_1, a_2, \ldots] \succ [b_1, b_2, \ldots] \iff [r, a_1, a_2, \ldots] \succ [r, b_1, b_2, \ldots]$
  Then: there are only two ways to define utilities
    Additive utility: $U([r_0, r_1, r_2, \ldots]) = r_0 + r_1 + r_2 + \cdots$
    Discounted utility: $U([r_0, r_1, r_2, \ldots]) = r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots$

Infinite Utilities?!
  Problem: What if the game lasts forever? Do we get infinite rewards?
  Solutions:
    Finite horizon: (similar to depth-limited search)
      Terminate episodes after a fixed T steps (e.g. life)
      Gives nonstationary policies (π depends on time left)
    Discounting: use 0 < γ < 1
      Smaller γ means smaller "horizon": shorter term focus
    Absorbing state: guarantee that for every policy, a terminal state will eventually be reached (like "overheated" for racing)
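The discounted-utility definition is short enough to check by hand. A minimal sketch (the function name is illustrative):

```python
def discounted_utility(rewards, gamma):
    """U([r0, r1, r2, ...]) = r0 + gamma*r1 + gamma^2*r2 + ..."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


# Discount of 0.5: the same rewards are worth more when they arrive sooner.
u_late = discounted_utility([1, 2, 3], 0.5)   # 1 + 1 + 0.75
u_early = discounted_utility([3, 2, 1], 0.5)  # 3 + 1 + 0.25
```

Here `u_late` is 2.75 and `u_early` is 4.25, so U([1,2,3]) < U([3,2,1]). Setting `gamma=1` recovers additive utility.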

Recap: Defining MDPs
  Markov decision processes:
    Set of states S
    Start state s_0
    Set of actions A
    Transitions P(s' | s, a) (or T(s, a, s'))
    Rewards R(s, a, s') (and discount γ)
  MDP quantities so far:
    Policy = Choice of action for each state
    Utility = sum of (discounted) rewards

Solving MDPs

Optimal Quantities
  The value (utility) of a state s:
    V*(s) = expected utility starting in s and acting optimally
  The value (utility) of a q-state (s, a):
    Q*(s, a) = expected utility starting out having taken action a from state s and (thereafter) acting optimally
  The optimal policy:
    π*(s) = optimal action from state s
  s is a state, (s, a) is a q-state, (s, a, s') is a transition
  [demo: gridworld values]

Values of States
  Fundamental operation: compute the (expectimax) value of a state
    Expected utility under optimal action
    Average sum of (discounted) rewards
    This is just what expectimax computed!
  Recursive definition of value:
    $V^*(s) = \max_a Q^*(s, a)$
    $Q^*(s, a) = \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^*(s') \right]$
    $V^*(s) = \max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^*(s') \right]$

Racing Search Tree


Racing Search Tree
  We're doing way too much work with expectimax!
  Problem: States are repeated
    Idea: Only compute needed quantities once
  Problem: Tree goes on forever
    Idea: Do a depth-limited computation, but with increasing depths until change is small
    Note: deep parts of the tree eventually don't matter if γ < 1

Time-Limited Values
  Key idea: time-limited values
  Define V_k(s) to be the optimal value of s if the game ends in k more time steps
    Equivalently, it's what a depth-k expectimax would give from s
  [demo: time-limited values]

Computing Time-Limited Values

Value Iteration
  Start with V_0(s) = 0: no time steps left means an expected reward sum of zero
  Given vector of V_k(s) values, do one ply of expectimax from each state:
    $V_{k+1}(s) \leftarrow \max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V_k(s') \right]$
  Repeat until convergence
  Complexity of each iteration: O(S^2 A)
  Theorem: will converge to unique optimal values
    Basic idea: approximations get refined towards optimal values
    Policy may converge long before values do

Example: Value Iteration
  Assume no discount!
  [Figure: racing MDP with the vectors V_k and V_{k+1}]
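The value iteration update can be run directly on the racing example. This is a sketch under assumptions: the transition probabilities and the +2 reward for Fast follow the usual form of the racing MDP and are not stated in the text above; the table layout and function name are illustrative.

```python
# Racing MDP (ASSUMED numbers). T[s][a] = [(probability, next_state, reward), ...]
T = {
    "cool": {"slow": [(1.0, "cool", 1.0)],
             "fast": [(0.5, "cool", 2.0), (0.5, "warm", 2.0)]},
    "warm": {"slow": [(0.5, "cool", 1.0), (0.5, "warm", 1.0)],
             "fast": [(1.0, "overheated", -10.0)]},
    "overheated": {},  # terminal: no actions, value stays 0
}


def value_iteration(T, gamma=1.0, iterations=2):
    """Return [V_0, V_1, ..., V_iterations], each a dict state -> value."""
    V = {s: 0.0 for s in T}  # V_0(s) = 0: no time steps left
    history = [V]
    for _ in range(iterations):
        # One ply of expectimax from every state, reading only the old V_k.
        V = {s: max((sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                     for outcomes in T[s].values()), default=0.0)
             for s in T}
        history.append(V)
    return history


history = value_iteration(T, gamma=1.0, iterations=2)
```

With no discount and these assumed numbers, `history[1]` gives cool 2.0, warm 1.0, overheated 0.0, and `history[2]` gives cool 3.5, warm 2.5, overheated 0.0. Each sweep touches every (s, a, s') triple once, which is where the O(S^2 A) cost per iteration comes from.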

Convergence*
  How do we know the V_k vectors are going to converge?
  Case 1: If the tree has maximum depth M, then V_M holds the actual untruncated values
  Case 2: If the discount is less than 1
    Sketch: For any state, V_k and V_{k+1} can be viewed as depth k+1 expectimax results in nearly identical search trees
    The difference is that on the bottom layer, V_{k+1} has actual rewards while V_k has zeros
    That last layer is at best all R_MAX
    It is at worst R_MIN
    But everything is discounted by γ^k that far out
    So V_k and V_{k+1} are at most γ^k max|R| different
    So as k increases, the values converge

CS 188: Artificial Intelligence
Markov Decision Processes II
Dan Klein, Pieter Abbeel
University of California, Berkeley

Example: Grid World
  A maze-like problem
    The agent lives in a grid
    Walls block the agent's path
  Noisy movement: actions do not always go as planned
    80% of the time, the action North takes the agent North
    10% of the time, North takes the agent West; 10% East
    If there is a wall in the direction the agent would have been taken, the agent stays put
  The agent receives rewards each time step
    Small "living" reward each step (can be negative)
    Big rewards come at the end (good or bad)
  Goal: maximize sum of (discounted) rewards

Recap: MDPs
  Markov decision processes:
    States S
    Actions A
    Transitions P(s' | s, a) (or T(s, a, s'))
    Rewards R(s, a, s') (and discount γ)
    Start state s_0
  Quantities:
    Policy = map of states to actions
    Utility = sum of discounted rewards
    Values = expected future utility from a state (max node)
    Q-Values = expected future utility from a q-state (chance node)

Optimal Quantities
  The value (utility) of a state s:
    V*(s) = expected utility starting in s and acting optimally
  The value (utility) of a q-state (s, a):
    Q*(s, a) = expected utility starting out having taken action a from state s and (thereafter) acting optimally
  The optimal policy:
    π*(s) = optimal action from state s
  s is a state, (s, a) is a q-state, (s, a, s') is a transition
  [demo: gridworld values]

The Bellman Equations
  How to be optimal:
    Step 1: Take correct first action
    Step 2: Keep being optimal
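The Case 2 sketch on the Convergence slide earlier in this part can be written as a chain of inequalities. This is one way to finish the argument, assuming 0 ≤ γ < 1 and bounded rewards; the telescoping step is an added detail, not something stated on the slide.

```latex
% One-step difference, from the bottom-layer argument on the slide:
\max_s \left| V_{k+1}(s) - V_k(s) \right| \;\le\; \gamma^{k} \max |R|

% Telescoping over m further steps and summing the geometric series:
\max_s \left| V_{k+m}(s) - V_k(s) \right|
  \;\le\; \sum_{i=k}^{k+m-1} \gamma^{i} \max |R|
  \;\le\; \frac{\gamma^{k}}{1-\gamma} \max |R|
```

The right-hand side does not depend on m and goes to zero as k grows, so the V_k vectors form a Cauchy sequence and converge.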


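The Bellman Equations
  Definition of "optimal utility" via expectimax recurrence gives a simple one-step lookahead relationship amongst optimal utility values:
    $V^*(s) = \max_a Q^*(s, a)$
    $Q^*(s, a) = \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^*(s') \right]$
    $V^*(s) = \max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^*(s') \right]$
  These are the Bellman equations, and they characterize optimal values in a way we'll use over and over

Value Iteration
  Bellman equations characterize the optimal values:
    $V^*(s) = \max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^*(s') \right]$
  Value iteration computes them:
    $V_{k+1}(s) \leftarrow \max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V_k(s') \right]$
  Value iteration is just a fixed point solution method
    ... though the V_k vectors are also interpretable as time-limited values

Policy Methods

Policy Evaluation

Fixed Policies
  [Figure: two trees, "Do the optimal action" (branches over all actions a) and "Do what π says to do" (single branch π(s))]
  Expectimax trees max over all actions to compute the optimal values
  If we fixed some policy π(s), then the tree would be simpler: only one action per state
    ... though the tree's value would depend on which policy we fixed

Utilities for a Fixed Policy
  Another basic operation: compute the utility of a state s under a fixed (generally non-optimal) policy
  Define the utility of a state s, under a fixed policy π:
    V^π(s) = expected total discounted rewards starting in s and following π
  Recursive relation (one-step look-ahead / Bellman equation):
    $V^{\pi}(s) = \sum_{s'} T(s, \pi(s), s') \left[ R(s, \pi(s), s') + \gamma V^{\pi}(s') \right]$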
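The fixed-policy Bellman equation can be turned into an iterative update, with the max over actions replaced by the single action π(s). A minimal sketch on the racing MDP follows; the transition numbers, the discount of 0.9, the tolerance, and all names are assumptions for illustration.

```python
# Racing MDP (ASSUMED numbers). T[s][a] = [(probability, next_state, reward), ...]
T = {
    "cool": {"slow": [(1.0, "cool", 1.0)],
             "fast": [(0.5, "cool", 2.0), (0.5, "warm", 2.0)]},
    "warm": {"slow": [(0.5, "cool", 1.0), (0.5, "warm", 1.0)],
             "fast": [(1.0, "overheated", -10.0)]},
    "overheated": {},  # terminal
}


def evaluate_policy(T, policy, gamma=0.9, tol=1e-10):
    """Iterate V(s) <- sum_s' T(s,pi(s),s') [R + gamma V(s')] until it settles."""
    V = {s: 0.0 for s in T}
    while True:
        new_V = {s: (sum(p * (r + gamma * V[s2])
                         for p, s2, r in T[s][policy[s]])
                     if s in policy else 0.0)  # terminal states keep value 0
                 for s in T}
        if max(abs(new_V[s] - V[s]) for s in T) < tol:
            return new_V
        V = new_V


always_slow = evaluate_policy(T, {"cool": "slow", "warm": "slow"})
always_fast = evaluate_policy(T, {"cool": "fast", "warm": "fast"})
```

Always going slow earns +1 forever, so both non-terminal values approach 1 / (1 - 0.9) = 10. Always going fast overheats from Warm, so V(warm) is -10 and V(cool) comes out negative as well.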


Example: Policy Evaluation
  [Figure: values under two fixed policies, Always Go Right and Always Go Forward]

Policy Evaluation
  How do we calculate the V's for a fixed policy π?
  Idea 1: Turn recursive Bellman equations into updates (like value iteration)
    $V^{\pi}_0(s) = 0$
    $V^{\pi}_{k+1}(s) \leftarrow \sum_{s'} T(s, \pi(s), s') \left[ R(s, \pi(s), s') + \gamma V^{\pi}_k(s') \right]$
    Efficiency: O(S^2) per iteration
  Idea 2: Without the maxes, the Bellman equations are just a linear system
    Solve with Matlab (or your favorite linear system solver)

Policy Extraction

Computing Actions from Values
  Let's imagine we have the optimal values V*(s)
  How should we act?
    It's not obvious!
  We need to do a mini-expectimax (one step):
    $\pi^*(s) = \arg\max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^*(s') \right]$
  This is called policy extraction, since it gets the policy implied by the values

Computing Actions from Q-Values
  Let's imagine we have the optimal q-values
  How should we act?
    Completely trivial to decide!
    $\pi^*(s) = \arg\max_a Q^*(s, a)$
  Important lesson: actions are easier to select from q-values than values!
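The contrast between the two extraction rules shows up clearly in code: extracting from values needs the model T and a one-step lookahead, while extracting from q-values is a plain argmax. A sketch on the racing MDP, where the transition numbers, the value vector, and the function names are assumptions for illustration:

```python
# Racing MDP (ASSUMED numbers). T[s][a] = [(probability, next_state, reward), ...]
T = {
    "cool": {"slow": [(1.0, "cool", 1.0)],
             "fast": [(0.5, "cool", 2.0), (0.5, "warm", 2.0)]},
    "warm": {"slow": [(0.5, "cool", 1.0), (0.5, "warm", 1.0)],
             "fast": [(1.0, "overheated", -10.0)]},
    "overheated": {},  # terminal
}


def extract_policy_from_values(T, V, gamma):
    """pi(s) = argmax_a sum_s' T(s,a,s') [R(s,a,s') + gamma V(s')]."""
    policy = {}
    for s, actions in T.items():
        if actions:  # skip terminal states
            policy[s] = max(actions, key=lambda a: sum(
                p * (r + gamma * V[s2]) for p, s2, r in actions[a]))
    return policy


def extract_policy_from_q(Q):
    """pi(s) = argmax_a Q(s, a); no transition model needed."""
    return {s: max(qs, key=qs.get) for s, qs in Q.items() if qs}


V = {"cool": 3.5, "warm": 2.5, "overheated": 0.0}  # illustrative values
Q = {"cool": {"slow": 4.5, "fast": 5.0},
     "warm": {"slow": 4.0, "fast": -10.0}}
pi_from_v = extract_policy_from_values(T, V, gamma=1.0)
pi_from_q = extract_policy_from_q(Q)
```

Both calls return Fast in Cool and Slow in Warm. The q-values in `Q` are exactly the one-step lookahead numbers that the first function has to recompute from `T` and `V`.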


Policy Iteration

Problems with Value Iteration
  Value iteration repeats the Bellman updates:
    $V_{k+1}(s) \leftarrow \max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V_k(s') \right]$
  Problem 1: It's slow: O(S^2 A) per iteration
  Problem 2: The "max" at each state rarely changes
  Problem 3: The policy often converges long before the values
  [demo: value iteration]

Policy Iteration
  Alternative approach for optimal values:
    Step 1: Policy evaluation: calculate utilities for some fixed policy (not optimal utilities!) until convergence
    Step 2: Policy improvement: update policy using one-step look-ahead with resulting converged (but not optimal!) utilities as future values
    Repeat steps until policy converges
  This is policy iteration
    It's still optimal!
    Can converge (much) faster under some conditions

Policy Iteration
  Evaluation: For fixed current policy π_i, find values with policy evaluation:
    Iterate until values converge:
    $V^{\pi_i}_{k+1}(s) \leftarrow \sum_{s'} T(s, \pi_i(s), s') \left[ R(s, \pi_i(s), s') + \gamma V^{\pi_i}_k(s') \right]$
  Improvement: For fixed values, get a better policy using policy extraction
    One-step look-ahead:
    $\pi_{i+1}(s) = \arg\max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^{\pi_i}(s') \right]$

Comparison
  Both value iteration and policy iteration compute the same thing (all optimal values)
  In value iteration:
    Every iteration updates both the values and (implicitly) the policy
    We don't track the policy, but taking the max over actions implicitly recomputes it
  In policy iteration:
    We do several passes that update utilities with fixed policy (each pass is fast because we consider only one action, not all of them)
    After the policy is evaluated, a new policy is chosen (slow like a value iteration pass)
    The new policy will be better (or we're done)
  Both are dynamic programs for solving MDPs

Summary: MDP Algorithms
  So you want to...
    Compute optimal values: use value iteration or policy iteration
    Compute values for a particular policy: use policy evaluation
    Turn your values into a policy: use policy extraction (one-step lookahead)
  These all look the same!
    They basically are: they are all variations of Bellman updates
    They all use one-step lookahead expectimax fragments
    They differ only in whether we plug in a fixed policy or max over actions
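The two steps of policy iteration combine into a short loop. This sketch runs it on the racing MDP; the transition numbers, the discount of 0.9, the tolerance, the arbitrary starting policy, and all names are assumptions for illustration.

```python
# Racing MDP (ASSUMED numbers). T[s][a] = [(probability, next_state, reward), ...]
T = {
    "cool": {"slow": [(1.0, "cool", 1.0)],
             "fast": [(0.5, "cool", 2.0), (0.5, "warm", 2.0)]},
    "warm": {"slow": [(0.5, "cool", 1.0), (0.5, "warm", 1.0)],
             "fast": [(1.0, "overheated", -10.0)]},
    "overheated": {},  # terminal
}


def policy_iteration(T, gamma=0.9, tol=1e-10):
    """Alternate policy evaluation and one-step improvement until stable."""
    def backup(s, a, V):
        # One-step lookahead value of taking action a in state s.
        return sum(p * (r + gamma * V[s2]) for p, s2, r in T[s][a])

    # Arbitrary start: the first listed action in every non-terminal state.
    policy = {s: next(iter(acts)) for s, acts in T.items() if acts}
    while True:
        # Step 1: policy evaluation (one action per state, no max).
        V = {s: 0.0 for s in T}
        while True:
            new_V = {s: (backup(s, policy[s], V) if s in policy else 0.0)
                     for s in T}
            delta = max(abs(new_V[s] - V[s]) for s in T)
            V = new_V
            if delta < tol:
                break
        # Step 2: policy improvement via one-step look-ahead.
        new_policy = {s: max(T[s], key=lambda a: backup(s, a, V))
                      for s in policy}
        if new_policy == policy:
            return policy, V
        policy = new_policy


best_policy, best_values = policy_iteration(T)
```

Starting from always-slow, one improvement step switches Cool to Fast, and the next round leaves the policy unchanged. Under these assumed numbers the result is Fast in Cool and Slow in Warm, with values near 15.5 and 14.5.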

Double Bandits

Double-Bandit MDP
  Actions: Blue, Red
  States: Win, Lose
  [Figure: two-state diagram over W and L; arc labels include 0.75 and $0]
  No discount
  100 time steps
  Both states have the same value

Offline Planning
  Solving MDPs is offline planning
    You determine all quantities through computation
    You need to know the details of the MDP
    You do not actually play the game!
  No discount, 100 time steps, both states have the same value
  [Table: Value of Play Red and of Play Blue]

Let's Play!
  $2 $2 $0 $2 $2 $2 $2 $0 $0 $0

Online Planning
  Rules changed! Red's win chance is different.
  [Figure: the same two-state diagram with Red's probabilities shown as ??]

Let's Play!
  $0 $0 $0 $2 $0 $2 $0 $0 $0 $0
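The offline-planning computation for the double bandit is a one-line expectation, because there is no discount and both states have the same value. The sketch below rests on assumed payouts: Red paying $2 with probability 0.75 (else $0) is consistent with the labels above, but Blue paying $1 every time is an assumption, as are the names.

```python
# ASSUMED payouts: Blue always pays $1; Red pays $2 with probability 0.75, else $0.
BLUE = [(1.0, 1.0)]
RED = [(0.75, 2.0), (0.25, 0.0)]


def expected_total(payouts, steps=100):
    """Expected undiscounted winnings from pulling one arm `steps` times."""
    return steps * sum(p * reward for p, reward in payouts)


value_red = expected_total(RED)
value_blue = expected_total(BLUE)
```

Under these assumptions, always playing Red is worth 150 over 100 steps and always playing Blue is worth 100. Once Red's win chance is unknown, this computation is no longer available and the agent has to play to estimate it.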

What Just Happened?
  That wasn't planning, it was learning!
    Specifically, reinforcement learning
    There was an MDP, but you couldn't solve it with just computation
    You needed to actually act to figure it out
  Important ideas in reinforcement learning that came up
    Exploration: you have to try unknown actions to get information
    Exploitation: eventually, you have to use what you know
    Regret: even if you learn intelligently, you make mistakes
    Sampling: because of chance, you have to try things repeatedly
    Difficulty: learning can be much harder than solving a known MDP


Gridworld Values V*

Gridworld: Q*

CS 188: Artificial Intelligence
Markov Decision Processes II
Instructors: Dan Klein and Pieter Abbeel --- University of California, Berkeley
[These slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI