CS 188: Artificial Intelligence
Markov Decision Processes II
Instructors: Dan Klein and Pieter Abbeel --- University of California, Berkeley
[These slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.]

Example: Grid World
  A maze-like problem
    The agent lives in a grid
    Walls block the agent's path
  Noisy movement: actions do not always go as planned
    80% of the time, the action North takes the agent North
    10% of the time, North takes the agent West; 10% East
    If there is a wall in the direction the agent would have been taken, the agent stays put
  The agent receives rewards each time step
    Small "living" reward each step (can be negative)
    Big rewards come at the end (good or bad)
  Goal: maximize sum of (discounted) rewards

Recap: MDPs
  Markov decision processes:
    States S
    Actions A
    Transitions P(s'|s,a) (or T(s,a,s'))
    Rewards R(s,a,s') (and discount γ)
    Start state s0
  Quantities:
    Policy = map of states to actions
    Utility = sum of discounted rewards
    Values = expected future utility from a state (max node)
    Q-Values = expected future utility from a q-state (chance node)

Optimal Quantities
  The value (utility) of a state s:
    V*(s) = expected utility starting in s and acting optimally
  The value (utility) of a q-state (s,a):
    Q*(s,a) = expected utility starting out having taken action a from state s and (thereafter) acting optimally
  The optimal policy:
    π*(s) = optimal action from state s
  [Tree diagram: s is a state; (s,a) is a q-state; (s,a,s') is a transition]
  [Demo: gridworld values (L9D1)]

[Figures: Gridworld Values V*; Gridworld: Q*]
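The noisy-movement rule above can be sketched as a transition function. This is only an illustrative sketch: the grid encoding ('#' for walls), the names `MOVES`, `SLIPS`, `transitions` and `GRID`, and the choice to treat the grid edge like a wall are assumptions made here, not part of the slides.

```python
# Directions as (row_delta, col_delta); row 0 is the top row.
MOVES = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}
# For each intended action, the two perpendicular "slip" directions.
SLIPS = {"N": ("W", "E"), "S": ("E", "W"), "E": ("N", "S"), "W": ("S", "N")}


def transitions(grid, state, action, noise=0.2):
    """Return {next_state: probability} for taking `action` in `state`.

    `grid` is a list of strings; '#' marks a wall (hypothetical encoding).
    """
    def move(direction):
        dr, dc = MOVES[direction]
        r, c = state[0] + dr, state[1] + dc
        inside = 0 <= r < len(grid) and 0 <= c < len(grid[0])
        # Bumping into a wall (or, by assumption, the grid edge) leaves the agent in place.
        return (r, c) if inside and grid[r][c] != "#" else state

    left, right = SLIPS[action]
    outcomes = [(action, 1 - noise), (left, noise / 2), (right, noise / 2)]
    dist = {}
    for direction, p in outcomes:
        nxt = move(direction)
        dist[nxt] = dist.get(nxt, 0.0) + p  # merge outcomes that land on the same cell
    return dist


GRID = ["....",
        ".#..",
        "...."]
```

For example, `transitions(GRID, (1, 0), "E")` puts probability 0.8 on staying at `(1, 0)`, because the intended move runs into the wall, and 0.1 on each of the cells above and below.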

The Bellman Equations
  How to be optimal:
    Step 1: Take correct first action
    Step 2: Keep being optimal

The Bellman Equations
  Definition of "optimal utility" via expectimax recurrence gives a simple one-step lookahead relationship amongst optimal utility values:
    V*(s) = max_a Q*(s,a)
    Q*(s,a) = Σ_s' T(s,a,s') [R(s,a,s') + γ V*(s')]
    V*(s) = max_a Σ_s' T(s,a,s') [R(s,a,s') + γ V*(s')]
  These are the Bellman equations, and they characterize optimal values in a way we'll use over and over

Value Iteration
  Bellman equations characterize the optimal values:
    V*(s) = max_a Σ_s' T(s,a,s') [R(s,a,s') + γ V*(s')]
  Value iteration computes them:
    V_{k+1}(s) ← max_a Σ_s' T(s,a,s') [R(s,a,s') + γ V_k(s')]
  Value iteration is just a fixed point solution method
    ... though the V_k vectors are also interpretable as time-limited values

Convergence*
  How do we know the V_k vectors are going to converge?
  Case 1: If the tree has maximum depth M, then V_M holds the actual untruncated values
  Case 2: If the discount is less than 1
    Sketch: For any state, V_k and V_{k+1} can be viewed as depth k+1 expectimax results in nearly identical search trees
    The difference is that on the bottom layer, V_{k+1} has actual rewards while V_k has zeros
    That last layer is at best all R_MAX
    It is at worst R_MIN
    But everything is discounted by γ^k that far out
    So V_k and V_{k+1} are at most γ^k max|R| different
    So as k increases, the values converge

Policy Methods

Policy Evaluation
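A minimal sketch of the value iteration update, together with a check of the γ^k max|R| gap from the convergence argument. The two-state MDP, its rewards, and the names `value_iteration`, `MDP`, `GAMMA` and `R_MAX` are made up here for illustration; each action maps to a list of (probability, next state, reward) outcomes.

```python
def value_iteration(mdp, gamma, iterations):
    """Return the list [V_0, V_1, ..., V_iterations] of value tables."""
    values = {s: 0.0 for s in mdp}  # V_0(s) = 0 for every state
    history = [values]
    for _ in range(iterations):
        new_values = {}
        for s, actions in mdp.items():
            # Bellman update: max over actions of the expected one-step return.
            new_values[s] = max(
                sum(p * (r + gamma * values[s2]) for p, s2, r in outcomes)
                for outcomes in actions.values()
            )
        values = new_values
        history.append(values)
    return history


# Hypothetical MDP: {state: {action: [(probability, next_state, reward), ...]}}
MDP = {
    "A": {"stay": [(1.0, "A", 1.0)],
          "go":   [(0.8, "B", 0.0), (0.2, "A", 0.0)]},
    "B": {"stay": [(1.0, "B", 3.0)]},
}
GAMMA = 0.5
R_MAX = 3.0  # largest reward magnitude in MDP

history = value_iteration(MDP, GAMMA, 50)
# Largest change between consecutive value tables, one entry per iteration k.
gaps = [max(abs(history[k + 1][s] - history[k][s]) for s in MDP)
        for k in range(50)]
```

In this toy MDP the values approach V*(B) = 3 / (1 - 0.5) = 6 and V*(A) = 8/3, and every `gaps[k]` stays within `GAMMA**k * R_MAX`.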

Fixed Policies
  [Tree diagrams: "Do the optimal action" (max over actions a) vs. "Do what π says to do" (single action π(s))]
  Expectimax trees max over all actions to compute the optimal values
  If we fixed some policy π(s), then the tree would be simpler: only one action per state
    ... though the tree's value would depend on which policy we fixed

Utilities for a Fixed Policy
  Another basic operation: compute the utility of a state s under a fixed (generally non-optimal) policy
  Define the utility of a state s, under a fixed policy π:
    V^π(s) = expected total discounted rewards starting in s and following π
  Recursive relation (one-step look-ahead / Bellman equation):
    V^π(s) = Σ_s' T(s,π(s),s') [R(s,π(s),s') + γ V^π(s')]

Example: Policy Evaluation
  [Figures: the policies "Always Go Right" and "Always Go Forward", and the values obtained under each]

Policy Evaluation
  How do we calculate the V's for a fixed policy π?
  Idea 1: Turn recursive Bellman equations into updates (like value iteration)
    V^π_0(s) = 0
    V^π_{k+1}(s) ← Σ_s' T(s,π(s),s') [R(s,π(s),s') + γ V^π_k(s')]
    Efficiency: O(S^2) per iteration
  Idea 2: Without the maxes, the Bellman equations are just a linear system
    Solve with Matlab (or your favorite linear system solver)

Policy Extraction
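Idea 1 from the policy evaluation slide can be sketched as follows. The MDP, the two policies, and the names `evaluate_policy`, `MDP`, `ALWAYS_STAY` and `ALWAYS_GO` are hypothetical, chosen only to show that the resulting values depend on which policy is fixed.

```python
def evaluate_policy(mdp, policy, gamma, tol=1e-10):
    """Iterate the fixed-policy Bellman update until values change by less than tol."""
    values = {s: 0.0 for s in mdp}  # V^pi_0(s) = 0
    while True:
        # No max here: only the action chosen by the policy is backed up.
        new_values = {
            s: sum(p * (r + gamma * values[s2]) for p, s2, r in mdp[s][policy[s]])
            for s in mdp
        }
        delta = max(abs(new_values[s] - values[s]) for s in mdp)
        values = new_values
        if delta < tol:
            return values


# Hypothetical MDP: {state: {action: [(probability, next_state, reward), ...]}}
MDP = {
    "A": {"stay": [(1.0, "A", 1.0)],
          "go":   [(0.8, "B", 0.0), (0.2, "A", 0.0)]},
    "B": {"stay": [(1.0, "B", 3.0)]},
}
ALWAYS_STAY = {"A": "stay", "B": "stay"}
ALWAYS_GO = {"A": "go", "B": "stay"}

v_stay = evaluate_policy(MDP, ALWAYS_STAY, gamma=0.5)
v_go = evaluate_policy(MDP, ALWAYS_GO, gamma=0.5)
```

Under "always stay" the linear system for state A is V(A) = 1 + 0.5 V(A), so V(A) = 2; under "always go" it is V(A) = 0.8 · 0.5 · 6 + 0.2 · 0.5 · V(A), so V(A) = 8/3. The iterative updates converge to these same numbers, which is what Idea 2 would obtain by solving the linear system directly.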

Computing Actions from Values
  Let's imagine we have the optimal values V*(s)
  How should we act?
    It's not obvious!
  We need to do a mini-expectimax (one step):
    π*(s) = argmax_a Σ_s' T(s,a,s') [R(s,a,s') + γ V*(s')]
  This is called policy extraction, since it gets the policy implied by the values

Computing Actions from Q-Values
  Let's imagine we have the optimal q-values:
  How should we act?
    Completely trivial to decide!
    π*(s) = argmax_a Q*(s,a)
  Important lesson: actions are easier to select from q-values than values!

Policy Iteration

Problems with Value Iteration
  Value iteration repeats the Bellman updates:
    V_{k+1}(s) ← max_a Σ_s' T(s,a,s') [R(s,a,s') + γ V_k(s')]
  Problem 1: It's slow: O(S^2 A) per iteration
  Problem 2: The "max" at each state rarely changes
  Problem 3: The policy often converges long before the values
  [Demo: value iteration (L9D2)]

[Figures: value iteration snapshots at k=0 and k=1; Noise = 0.2, Discount = 0.9, Living reward = 0]
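The contrast between acting from values and acting from q-values can be sketched as below. The MDP and the names `q_from_values`, `policy_from_q`, `policy_from_values` and `V_STAR` are illustrative assumptions; `V_STAR` holds the optimal values of this particular toy MDP.

```python
def q_from_values(mdp, values, gamma):
    """One-step lookahead: turn state values into q-values (needs T and R)."""
    return {
        s: {a: sum(p * (r + gamma * values[s2]) for p, s2, r in outcomes)
            for a, outcomes in actions.items()}
        for s, actions in mdp.items()
    }


def policy_from_q(q_values):
    """With q-values in hand, acting is just an argmax over actions."""
    return {s: max(q, key=q.get) for s, q in q_values.items()}


def policy_from_values(mdp, values, gamma):
    """Policy extraction: the mini-expectimax followed by the argmax."""
    return policy_from_q(q_from_values(mdp, values, gamma))


# Hypothetical MDP: {state: {action: [(probability, next_state, reward), ...]}}
MDP = {
    "A": {"stay": [(1.0, "A", 1.0)],
          "go":   [(0.8, "B", 0.0), (0.2, "A", 0.0)]},
    "B": {"stay": [(1.0, "B", 3.0)]},
}
V_STAR = {"A": 8.0 / 3.0, "B": 6.0}  # optimal values of this toy MDP for gamma = 0.5

q_star = q_from_values(MDP, V_STAR, gamma=0.5)
pi_star = policy_from_values(MDP, V_STAR, gamma=0.5)
```

Note that `policy_from_values` needs the transition model and rewards, while `policy_from_q` needs nothing beyond the q-table, which is the point of the "important lesson" above.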

[Figures: value iteration snapshots at k=2 through k=12 and at k=100; Noise = 0.2, Discount = 0.9, Living reward = 0]

Policy Iteration
  Alternative approach for optimal values:
    Step 1: Policy evaluation: calculate utilities for some fixed policy (not optimal utilities!) until convergence
    Step 2: Policy improvement: update policy using one-step look-ahead with resulting converged (but not optimal!) utilities as future values
    Repeat steps until policy converges
  This is policy iteration
    It's still optimal!
    Can converge (much) faster under some conditions

Policy Iteration
  Evaluation: For fixed current policy π_i, find values with policy evaluation:
    Iterate until values converge:
      V^{π_i}_{k+1}(s) ← Σ_s' T(s,π_i(s),s') [R(s,π_i(s),s') + γ V^{π_i}_k(s')]
  Improvement: For fixed values, get a better policy using policy extraction
    One-step look-ahead:
      π_{i+1}(s) = argmax_a Σ_s' T(s,a,s') [R(s,a,s') + γ V^{π_i}(s')]

Comparison
  Both value iteration and policy iteration compute the same thing (all optimal values)
  In value iteration:
    Every iteration updates both the values and (implicitly) the policy
    We don't track the policy, but taking the max over actions implicitly recomputes it
  In policy iteration:
    We do several passes that update utilities with fixed policy (each pass is fast because we consider only one action, not all of them)
    After the policy is evaluated, a new policy is chosen (slow like a value iteration pass)
    The new policy will be better (or we're done)
  Both are dynamic programs for solving MDPs

Summary: MDP Algorithms
  So you want to...
    Compute optimal values: use value iteration or policy iteration
    Compute values for a particular policy: use policy evaluation
    Turn your values into a policy: use policy extraction (one-step lookahead)
  These all look the same!
    They basically are: they are all variations of Bellman updates
    They all use one-step lookahead expectimax fragments
    They differ only in whether we plug in a fixed policy or max over actions

Double Bandits

Double-Bandit MDP
  Actions: Blue, Red
  States: Win, Lose
  No discount
  100 time steps
  Both states have the same value
  [Diagram: transitions between states W and L for each action; surviving edge labels include 0.75 and $0]
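The two policy iteration steps above (evaluation, then improvement) can be sketched as a loop. The MDP, the helper `backup`, and the returned round counter are assumptions for illustration. Evaluation here is the iterative update run to a tolerance, so an MDP with exactly tied actions could need extra care that this sketch omits.

```python
def backup(mdp, values, gamma, s, a):
    """Expected one-step return of taking action a in state s."""
    return sum(p * (r + gamma * values[s2]) for p, s2, r in mdp[s][a])


def policy_iteration(mdp, gamma, tol=1e-10):
    """Alternate evaluation and improvement until the policy stops changing."""
    # Arbitrary starting policy: the first action listed for each state.
    policy = {s: next(iter(actions)) for s, actions in mdp.items()}
    rounds = 0
    while True:
        rounds += 1
        # Step 1: policy evaluation for the current fixed policy.
        values = {s: 0.0 for s in mdp}
        while True:
            new_values = {s: backup(mdp, values, gamma, s, policy[s]) for s in mdp}
            delta = max(abs(new_values[s] - values[s]) for s in mdp)
            values = new_values
            if delta < tol:
                break
        # Step 2: policy improvement by one-step look-ahead on those values.
        new_policy = {
            s: max(actions, key=lambda a: backup(mdp, values, gamma, s, a))
            for s, actions in mdp.items()
        }
        if new_policy == policy:
            return policy, values, rounds
        policy = new_policy


# Hypothetical MDP: {state: {action: [(probability, next_state, reward), ...]}}
MDP = {
    "A": {"stay": [(1.0, "A", 1.0)],
          "go":   [(0.8, "B", 0.0), (0.2, "A", 0.0)]},
    "B": {"stay": [(1.0, "B", 3.0)]},
}

policy, values, rounds = policy_iteration(MDP, gamma=0.5)
```

On this toy MDP the starting policy ("stay" everywhere) is evaluated, improved to "go" in state A, and confirmed unchanged on the second round, so the loop stops after two rounds with the same values that value iteration would reach.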

Offline Planning
  Solving MDPs is offline planning
    You determine all quantities through computation
    You need to know the details of the MDP
    You do not actually play the game!
  No discount; 100 time steps; both states have the same value
  [Table: value of "Play Red" vs. "Play Blue"]
  [Diagram: double-bandit MDP with states W and L]

Let's Play!
  [Observed payoffs: $2 $2 $0 $2 $2 $2 $2 $0 $0 $0]

Online Planning
  Rules changed! Red's win chance is different.
  [Diagram: double-bandit MDP with Red's probabilities replaced by "??"]

Let's Play!
  [Observed payoffs: $0 $0 $0 $2 $0 $2 $0 $0 $0 $0]

What Just Happened?
  That wasn't planning, it was learning!
    Specifically, reinforcement learning
    There was an MDP, but you couldn't solve it with just computation
    You needed to actually act to figure it out
  Important ideas in reinforcement learning that came up
    Exploration: you have to try unknown actions to get information
    Exploitation: eventually, you have to use what you know
    Regret: even if you learn intelligently, you make mistakes
    Sampling: because of chance, you have to try things repeatedly
    Difficulty: learning can be much harder than solving a known MDP

Next Time: Reinforcement Learning!
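The difference between offline planning and having to play can be sketched for the double bandit. The payoff tables below are assumptions: the transcript preserves only a 0.75 probability and $0 and $2 payoffs, so Red is taken here to pay $2 with probability 0.75 and $0 otherwise, and Blue is assumed to pay $1 every time. The names `PAYOFFS`, `offline_value`, `play` and `online_estimate` are hypothetical.

```python
import random

HORIZON = 100  # time steps, no discount

# Assumed payoff tables: {action: [(probability, payoff_in_dollars), ...]}
PAYOFFS = {
    "red":  [(0.75, 2.0), (0.25, 0.0)],
    "blue": [(1.0, 1.0)],  # assumption: Blue's payoff is not recoverable from the transcript
}


def offline_value(action, horizon=HORIZON):
    """Expected total payoff of always playing `action`, computed without playing."""
    per_step = sum(p * payoff for p, payoff in PAYOFFS[action])
    return horizon * per_step


def play(action, rng):
    """Sample one payoff by actually pulling the arm."""
    outcomes = PAYOFFS[action]
    return rng.choices([payoff for _, payoff in outcomes],
                       weights=[p for p, _ in outcomes])[0]


def online_estimate(action, pulls, seed=0):
    """Estimate the per-step payoff from samples, as a learner without the model must."""
    rng = random.Random(seed)
    return sum(play(action, rng) for _ in range(pulls)) / pulls
```

Under these assumed numbers, `offline_value("red")` is 150.0 and `offline_value("blue")` is 100.0, with no game played. Once Red's probabilities are unknown, only something like `online_estimate` is available, and its answer depends on chance and on how many pulls are spent exploring.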
