

CS 188: Artificial Intelligence
Markov Decision Processes II

Instructors: Dan Klein and Pieter Abbeel
University of California, Berkeley

[These slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.]

Stochastic Planning: MDPs
[Figure: an agent and its environment exchanging percepts and actions. Labels: Static, Fully Observable, Perfect, Stochastic, Instantaneous. Question posed: What action next?]

Recap: MDPs
Markov decision processes:
  States S
  Actions A
  Transitions P(s'|s,a) (or T(s,a,s'))
  Rewards R(s,a,s') (and discount γ)
  Start state s0
Quantities:
  Policy = map of states to actions
  Utility = sum of discounted rewards
  Values = expected future utility from a state (max node)
  Q-Values = expected future utility from a q-state (chance node)
[Tree diagram: s, a, (s, a), s']

Solving MDPs
  Value Iteration
  Real-Time Dynamic Programming
  Policy Iteration
  Reinforcement Learning

Optimal Quantities
The value (utility) of a state s: V*(s) = expected utility starting in s and acting optimally
The value (utility) of a q-state (s,a): Q*(s,a) = expected utility starting out having taken action a from state s and (thereafter) acting optimally
The optimal policy: π*(s) = optimal action from state s
[Tree diagram: s is a state, (s, a) is a q-state, (s,a,s') is a transition]

Gridworld Values V*
[Demo: gridworld values (L9D1)]
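The quantities in the recap can be made concrete with a small data layout. The sketch below is one possible encoding, not the course's code: the dictionary `T`, the function `discounted_utility`, the state and action names, and all numbers are assumptions made for illustration.

```python
# Hypothetical encoding of a tiny MDP (names and numbers are illustrative only).
# T[s][a] is a list of (probability, next_state, reward) triples, so it holds
# both the transition model T(s, a, s') and the reward R(s, a, s').
T = {
    "s0": {
        "a": [(0.8, "s1", 0.0), (0.2, "s0", 0.0)],
        "b": [(1.0, "s0", 0.5)],
    },
    "s1": {
        "a": [(1.0, "s1", 2.0)],
        "b": [(1.0, "s0", 0.0)],
    },
}


def discounted_utility(rewards, gamma):
    """Utility of a reward sequence: r0 + gamma*r1 + gamma^2*r2 + ..."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


# A policy is simply a map from states to actions.
policy = {"s0": "a", "s1": "a"}

# Three rewards of 1 with gamma = 0.5 give 1 + 0.5 + 0.25.
example_utility = discounted_utility([1, 1, 1], 0.5)
```

Storing outcomes as triples keeps the expectation over s' a single loop, which is the shape every Bellman-style update in these slides relies on.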


Gridworld: Q*
[Figure: gridworld Q-values]

The Bellman Equations
How to be optimal:
  Step 1: Take correct first action
  Step 2: Keep being optimal

The Bellman Equations
Definition of "optimal utility" via expectimax recurrence gives a simple one-step lookahead relationship amongst optimal utility values.
These are the Bellman equations, and they characterize optimal values in a way we'll use over and over.
[Tree diagram: s, a, (s, a), s']

Racing Search Tree
We're doing way too much work with expectimax!
Problem: States are repeated
  Idea: Only compute needed quantities once
Problem: Tree goes on forever
  Idea: Do a depth-limited computation, but with increasing depths until change is small
  Note: deep parts of the tree eventually don't matter if γ < 1

Time-Limited Values: Avoiding Redundant Computation
Key idea: time-limited values
Define V_k(s) to be the optimal value of s if the game ends in k more time steps
Equivalently, it's what a depth-k expectimax would give from s
[Demo: time-limited values (L8D6)]
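In the notation of the recap (T, R, γ), the one-step lookahead relationship and the time-limited values are usually written as below. This is the standard textbook form, offered as a sketch rather than a transcription of the slide.

```latex
\begin{align*}
V^*(s)   &= \max_a Q^*(s,a) \\
Q^*(s,a) &= \sum_{s'} T(s,a,s')\,\bigl[R(s,a,s') + \gamma V^*(s')\bigr] \\
V^*(s)   &= \max_a \sum_{s'} T(s,a,s')\,\bigl[R(s,a,s') + \gamma V^*(s')\bigr] \\[4pt]
V_0(s)     &= 0 \\
V_{k+1}(s) &= \max_a \sum_{s'} T(s,a,s')\,\bigl[R(s,a,s') + \gamma V_k(s')\bigr]
\end{align*}
```

The last two lines define the time-limited values recursively: a depth-(k+1) expectimax is one max/expectation layer placed on top of depth-k results.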


Value Iteration
Start with V_0(s) = 0: no time steps left means an expected reward sum of zero.
Given a vector of V_k(s) values, do one ply of expectimax from each state to obtain V_{k+1}(s). This step is called a "Bellman backup".
Repeat until convergence (trust me, it does).
[Tree diagram: V_{k+1}(s) at the root, V_k(s') at the leaves]

Example: Value Iteration
Assume no discount (gamma = 1) to keep the math simple!

Example: Bellman Backup
The three successor states have V_0 = 0, V_0 = 1, and V_0 = 2.
Q_1(s, a1) = 2 + γ × 0 ≈ 2
Q_1(s, a2) = 5 + γ × 0.9 × 1 + γ × 0.1 × 2 ≈ 6.1
Q_1(s, a3) = 4.5 + γ × 2 ≈ 6.5
V_1(s) = max of the three Q-values = 6.5, so the greedy action is a3.

Gridworld
[Figures: gridworld values at k=0 and k=1]
If the agent is in 4,3, it has only one legal action: get jewel. It gets a reward and the game is over.
If the agent is in the pit, it has only one legal action: die. It gets a penalty and the game is over.
The agent does NOT get a reward for moving INTO 4,3.
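The Bellman backup example above can be checked with a few lines of code. This is a minimal sketch under stated assumptions: the successor names `s1`, `s2`, `s3` and the function `bellman_backup` are hypothetical, rewards are attached to actions only, γ is taken as 1, and the immediate reward 4.5 for `a3` is inferred from the stated total of 6.5.

```python
def bellman_backup(actions, V, gamma):
    """One Bellman backup at a single state.

    `actions` maps an action name to (immediate_reward, [(prob, next_state), ...]).
    Returns (Q-values, backed-up value, greedy action).
    """
    Q = {
        a: reward + gamma * sum(p * V[s2] for p, s2 in outcomes)
        for a, (reward, outcomes) in actions.items()
    }
    greedy = max(Q, key=Q.get)
    return Q, Q[greedy], greedy


# V_0 at the three successor states (state names are assumed).
V0 = {"s1": 0.0, "s2": 1.0, "s3": 2.0}

actions = {
    "a1": (2.0, [(1.0, "s1")]),
    "a2": (5.0, [(0.9, "s2"), (0.1, "s3")]),
    "a3": (4.5, [(1.0, "s3")]),  # 4.5 is an inferred value, see the note above
}

Q1, V1, greedy_action = bellman_backup(actions, V0, gamma=1.0)
```

Running the backup once per state, for every state, is exactly one iteration of value iteration.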

[Figures: gridworld values after k = 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, and 100 iterations]

Value Iteration
Start with V_0(s) = 0
Given a vector of V_k(s) values, do one ply of expectimax from each state to obtain V_{k+1}(s)
Repeat until convergence
Complexity of each iteration: O(S^2 A)
Number of iterations: poly(|S|, |A|, 1/(1-γ))
Theorem: will converge to unique optimal values
[Tree diagram: V_{k+1}(s) at the root, V_k(s') at the leaves]

Value Iteration
Bellman equations characterize the optimal values.
Value iteration computes them.
Value iteration is just a fixed point solution method, though the V_k vectors are also interpretable as time-limited values.
[Tree diagram: V(s) at the root, V(s') at the leaves]

Convergence*
How do we know the V_k vectors will converge?
Case 1: If the tree has maximum depth M, then V_M holds the actual untruncated values
Case 2: If the discount is less than 1
  Sketch: For any state, V_k and V_{k+1} can be viewed as depth k+1 expectimax results in nearly identical search trees
  The max difference happens if there is a big reward at the k+1 level
  That last layer is at best all R_MAX
  But everything is discounted by γ^k that far out
  So V_k and V_{k+1} are at most γ^k max|R| different
  So as k increases, the values converge

Policy Extraction

Computing Actions from Values
Let's imagine we have the optimal values V*(s)
How should we act? It's not obvious!
We need to do a mini-expectimax (one step)
This is called policy extraction, since it gets the policy implied by the values

Computing Actions from Q-Values
Let's imagine we have the optimal q-values
How should we act? Completely trivial to decide!
Important lesson: actions are easier to select from q-values than values!
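Value iteration and policy extraction, as described above, fit in a few functions. The following is a minimal sketch, not the course implementation: the two-state MDP, its action names, and the helper names `q_value`, `value_iteration`, and `extract_policy` are all assumptions chosen so the result is easy to verify by hand.

```python
def q_value(T, V, s, a, gamma):
    """Expected one-step return of action a in state s, then V afterwards."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in T[s][a])


def value_iteration(T, gamma, tol=1e-9, max_iters=10_000):
    """Repeat full Bellman backups from V_0 = 0 until the largest change < tol."""
    V = {s: 0.0 for s in T}
    for _ in range(max_iters):
        new_V = {s: max(q_value(T, V, s, a, gamma) for a in T[s]) for s in T}
        delta = max(abs(new_V[s] - V[s]) for s in T)
        V = new_V
        if delta < tol:
            break
    return V


def extract_policy(T, V, gamma):
    """One-step lookahead (mini-expectimax): pick the argmax action per state."""
    return {s: max(T[s], key=lambda a: q_value(T, V, s, a, gamma)) for s in T}


# Hypothetical deterministic MDP: T[s][a] = [(probability, next_state, reward)].
T = {
    0: {"stay": [(1.0, 0, 0.5)], "go": [(1.0, 1, 0.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}

V = value_iteration(T, gamma=0.5)       # approaches V(0) = 2, V(1) = 4
policy = extract_policy(T, V, gamma=0.5)
```

With γ = 0.5, staying in state 1 forever is worth 2 / (1 - 0.5) = 4, and going there from state 0 is worth 0.5 × 4 = 2, which beats staying in state 0 (worth 1). Note that `extract_policy` needs the model `T` to turn values into actions, which is why q-values make action selection easier.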

Problems with Value Iteration
Value iteration repeats the Bellman updates.
Problem 1: It's slow: O(S^2 A) per iteration
Problem 2: The "max" at each state rarely changes
Problem 3: The policy often converges long before the values
[Tree diagram: s, a, (s, a), s']

VI → Asynchronous VI
Is it essential to back up all states in each iteration? No!
States may be backed up many times or not at all, in any order.
As long as no state gets starved, convergence properties still hold!
[Demo: value iteration (L9D2)]

Prioritization of Bellman Backups
Are all backups equally important?
Can we avoid some backups?
Can we schedule the backups more appropriately?
[Figures: gridworld values at k=1, k=2, and k=3]
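The asynchronous variant described above can be sketched by updating one state at a time, in place, in an arbitrary order. This is an illustrative sketch under assumptions: the MDP, the function name `async_value_iteration`, and the choice of a freshly shuffled order per sweep (one simple way to make sure no state is starved) are not taken from the slides.

```python
import random


def async_value_iteration(T, gamma, sweeps=100, seed=0):
    """In-place Bellman backups, one state at a time, in a shuffled order.

    Each backup immediately sees the newest values of the other states,
    unlike synchronous value iteration, which uses a frozen copy of V_k.
    """
    rng = random.Random(seed)
    V = {s: 0.0 for s in T}
    states = list(T)
    for _ in range(sweeps):
        rng.shuffle(states)  # any order works as long as no state is starved
        for s in states:
            V[s] = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in T[s][a])
                for a in T[s]
            )
    return V


# Hypothetical deterministic MDP: T[s][a] = [(probability, next_state, reward)].
T = {
    0: {"stay": [(1.0, 0, 0.5)], "go": [(1.0, 1, 0.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}

V = async_value_iteration(T, gamma=0.5)  # approaches V(0) = 2, V(1) = 4
```

The fixed number of sweeps is a simplification; a practical version would stop once the largest recent change falls below a tolerance.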


Asynch VI: Prioritized Sweeping
Why back up a state if the values of its successors are the same?
Prefer backing up a state whose successors had the most change
Priority queue of (state, expected change in value)
Back up in the order of priority
After backing up a state, update the priority queue for all predecessors

Solving MDPs
  Value Iteration
  Policy Iteration
  Reinforcement Learning

Policy Methods

Policy Evaluation

Fixed Policies
[Tree diagrams: "Do the optimal action" with nodes s, a, (s, a), s'; "Do what π says to do" with nodes s, π(s), (s, π(s)), s']
Expectimax trees max over all actions to compute the optimal values
If we fixed some policy π(s), then the tree would be simpler: only one action per state
  ... though the tree's value would depend on which policy we fixed

Utilities for a Fixed Policy
Another basic operation: compute the utility of a state s under a fixed (generally non-optimal) policy
Define the utility of a state s, under a fixed policy π: V^π(s) = expected total discounted rewards starting in s and following π
Recursive relation (one-step look-ahead / Bellman equation):
[Tree diagram: s, π(s), (s, π(s)), s']
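Returning to prioritized sweeping: the steps listed above (keep a priority queue keyed by expected change, back up the highest-priority state, then re-prioritize its predecessors) can be sketched as follows. This is one possible reading for a known, small MDP, not the slides' algorithm verbatim: the function name `prioritized_sweeping`, the use of the Bellman residual as the priority, and the example MDP are assumptions.

```python
import heapq


def prioritized_sweeping(T, gamma, tol=1e-9, max_backups=100_000):
    """Back up states in order of how much their value is expected to change."""
    V = {s: 0.0 for s in T}

    # predecessors[s2] = states that can reach s2 in one step
    predecessors = {s: set() for s in T}
    for s in T:
        for a in T[s]:
            for p, s2, _ in T[s][a]:
                if p > 0:
                    predecessors[s2].add(s)

    def backed_up_value(s):
        return max(
            sum(p * (r + gamma * V[s2]) for p, s2, r in T[s][a]) for a in T[s]
        )

    # heapq is a min-heap, so priorities are stored negated.
    heap = [(-abs(backed_up_value(s) - V[s]), s) for s in T]
    heapq.heapify(heap)

    backups = 0
    while heap and backups < max_backups:
        _, s = heapq.heappop(heap)
        new_value = backed_up_value(s)
        if abs(new_value - V[s]) < tol:
            continue  # stale queue entry: nothing left to gain here
        V[s] = new_value
        backups += 1
        # A change at s can only affect states that lead into s.
        for q in predecessors[s]:
            priority = abs(backed_up_value(q) - V[q])
            if priority >= tol:
                heapq.heappush(heap, (-priority, q))
    return V


# Hypothetical deterministic MDP: T[s][a] = [(probability, next_state, reward)].
T = {
    0: {"stay": [(1.0, 0, 0.5)], "go": [(1.0, 1, 0.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}

V = prioritized_sweeping(T, gamma=0.5)  # approaches V(0) = 2, V(1) = 4
```

Queue entries are allowed to go stale and are re-checked when popped, which avoids needing a decrease-key operation. The sketch assumes every state has at least one action.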


Example: Policy Evaluation
[Figures: two gridworld policies, "Always Go Right" and "Always Go Forward"]

Example: Policy Evaluation
[Figures: values under "Always Go Right" and under "Always Go Forward"]

Policy Evaluation
How do we calculate the V's for a fixed policy π?
Idea 1: Turn recursive Bellman equations into updates (like value iteration)
  Efficiency: O(S^2) per iteration
Idea 2: Without the maxes, the Bellman equations are just a linear system
  Solve with Matlab (or your favorite linear system solver)
[Tree diagram: s, π(s), (s, π(s)), s']

Policy Extraction

Computing Actions from Values
Let's imagine we have the optimal values V*(s)
How should we act? It's not obvious!
We need to do a mini-expectimax (one step)
This is called policy extraction, since it gets the policy implied by the values

Computing Actions from Q-Values
Let's imagine we have the optimal q-values
How should we act? Completely trivial to decide!
Important lesson: actions are easier to select from q-values than values!
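Both ideas for policy evaluation can be sketched side by side. The code below assumes the standard fixed-policy recurrence V^π(s) = Σ_s' T(s, π(s), s') [R(s, π(s), s') + γ V^π(s')]; the function names, the small Gaussian-elimination solver (standing in for "your favorite linear system solver"), and the example MDP are all illustrative assumptions.

```python
def evaluate_iterative(T, policy, gamma, tol=1e-10, max_iters=10_000):
    """Idea 1: repeat the fixed-policy update until the values stop changing."""
    V = {s: 0.0 for s in T}
    for _ in range(max_iters):
        new_V = {
            s: sum(p * (r + gamma * V[s2]) for p, s2, r in T[s][policy[s]])
            for s in T
        }
        delta = max(abs(new_V[s] - V[s]) for s in T)
        V = new_V
        if delta < tol:
            break
    return V


def solve_linear(A, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def evaluate_exact(T, policy, gamma):
    """Idea 2: solve (I - gamma * P_pi) V = R_pi as a linear system."""
    states = list(T)
    index = {s: i for i, s in enumerate(states)}
    n = len(states)
    A = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    b = [0.0] * n
    for s in states:
        for p, s2, r in T[s][policy[s]]:
            A[index[s]][index[s2]] -= gamma * p
            b[index[s]] += p * r
    x = solve_linear(A, b)
    return {s: x[index[s]] for s in states}


# Hypothetical deterministic MDP: T[s][a] = [(probability, next_state, reward)].
T = {
    0: {"stay": [(1.0, 0, 0.5)], "go": [(1.0, 1, 0.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}
always_stay = {0: "stay", 1: "stay"}

V_iter = evaluate_iterative(T, always_stay, gamma=0.5)
V_exact = evaluate_exact(T, always_stay, gamma=0.5)
```

Under the "always stay" policy, state 0 collects 0.5 per step and state 1 collects 2 per step, so with γ = 0.5 the values are 1 and 4. The linear solve gives the answer in one shot; the iterative version only approaches it.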


Policy Iteration

Policy Iteration
Alternative approach for optimal values:
Step 1: Policy evaluation: calculate utilities for some fixed policy (not optimal utilities!) until convergence
Step 2: Policy improvement: update policy using one-step look-ahead with resulting converged (but not optimal!) utilities as future values
Repeat steps until policy converges
This is policy iteration
It's still optimal!
Can converge (much) faster under some conditions

Policy Iteration
Evaluation: For fixed current policy π, find values with policy evaluation:
  Iterate until values converge
Improvement: For fixed values, get a better policy using policy extraction
  One-step look-ahead

Comparison
Both value iteration and policy iteration compute the same thing (all optimal values)
In value iteration:
  Every iteration updates both the values and (implicitly) the policy
  We don't track the policy, but taking the max over actions implicitly recomputes it
In policy iteration:
  We do several passes that update utilities with fixed policy (each pass is fast because we consider only one action, not all of them)
  After the policy is evaluated, a new policy is chosen (slow like a value iteration pass)
  The new policy will be better (or we're done)
Both are dynamic programs for solving MDPs

Summary: MDP Algorithms
So you want to...
  Compute optimal values: use value iteration or policy iteration
  Compute values for a particular policy: use policy evaluation
  Turn your values into a policy: use policy extraction (one-step lookahead)
These all look the same!
  They basically are: they are all variations of Bellman updates
  They all use one-step lookahead expectimax fragments
  They differ only in whether we plug in a fixed policy or max over actions

Double Bandits
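The two alternating steps of policy iteration described above can be sketched as a short loop. This is a minimal illustration under assumptions: the helper names, the iterative evaluation step, the arbitrary starting policy (the first listed action in each state), and the example MDP are not taken from the slides.

```python
def q_value(T, V, s, a, gamma):
    """Expected one-step return of action a in state s, then V afterwards."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in T[s][a])


def evaluate_policy(T, policy, gamma, tol=1e-10, max_iters=10_000):
    """Step 1: fixed-policy Bellman updates (one action per state, no max)."""
    V = {s: 0.0 for s in T}
    for _ in range(max_iters):
        new_V = {s: q_value(T, V, s, policy[s], gamma) for s in T}
        delta = max(abs(new_V[s] - V[s]) for s in T)
        V = new_V
        if delta < tol:
            break
    return V


def policy_iteration(T, gamma, max_rounds=1_000):
    """Alternate evaluation and improvement until the policy stops changing."""
    policy = {s: next(iter(T[s])) for s in T}  # arbitrary starting policy
    V = evaluate_policy(T, policy, gamma)
    for _ in range(max_rounds):
        # Step 2: one-step lookahead with the converged values as future values.
        improved = {
            s: max(T[s], key=lambda a: q_value(T, V, s, a, gamma)) for s in T
        }
        if improved == policy:
            break
        policy = improved
        V = evaluate_policy(T, policy, gamma)
    return policy, V


# Hypothetical deterministic MDP: T[s][a] = [(probability, next_state, reward)].
T = {
    0: {"stay": [(1.0, 0, 0.5)], "go": [(1.0, 1, 0.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}

policy, V = policy_iteration(T, gamma=0.5)
```

On this example the starting policy is "stay" everywhere (values 1 and 4); a single improvement step switches state 0 to "go", and the next improvement step changes nothing, so the loop stops.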

Double-Bandit MDP
Actions: Blue, Red
States: Win, Lose
No discount
100 time steps
Both states have the same value
[Diagram: states W and L with payoff and probability labels on the transitions; legible labels include $1 with probability 1.0, a probability of 0.75, and $0]

Offline Planning
Solving MDPs is offline planning
  You determine all quantities through computation
  You need to know the details of the MDP
  You do not actually play the game!
[Table: value of Play Red and of Play Blue, under no discount, 100 time steps, with both states having the same value]

Let's Play!
[Payoffs observed: $2 $2 $0 $2 $2 $2 $2 $0 $0 $0]

Online Planning
Rules changed! Red's win chance is different.
[Diagram: states W and L; Blue still pays $1 with probability 1.0; Red pays $2 or $0 with unknown probabilities (??)]

Let's Play!
[Payoffs observed: $0 $0 $0 $2 $0 $2 $0 $0 $0 $0]

What Just Happened?
That wasn't planning, it was learning!
  Specifically, reinforcement learning
  There was an MDP, but you couldn't solve it with just computation
  You needed to actually act to figure it out
Important ideas in reinforcement learning that came up:
  Exploration: you have to try unknown actions to get information
  Exploitation: eventually, you have to use what you know
  Regret: even if you learn intelligently, you make mistakes
  Sampling: because of chance, you have to try things repeatedly
  Difficulty: learning can be much harder than solving a known MDP
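The contrast between offline planning and learning by acting can be illustrated numerically. The payoff model below is an assumption, since the diagram labels are only partly legible: Blue is taken to pay $1 with probability 1.0, and Red is taken to pay $2 with probability 0.75 and $0 otherwise. The function names are hypothetical, and because both states have the same value and there is no discount, the plan reduces to comparing expected payoff per step.

```python
import random

# ASSUMED payoff model, as (probability, payoff in dollars) pairs.
BLUE = [(1.0, 1.0)]
RED = [(0.75, 2.0), (0.25, 0.0)]
STEPS = 100  # no discount, 100 time steps


def expected_total(arm, steps=STEPS):
    """Offline planning: expected winnings from always playing one arm."""
    return steps * sum(p * payoff for p, payoff in arm)


def estimate_mean_payoff(arm, pulls, seed=0):
    """Learning by acting: average payoff over sampled pulls of one arm."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(pulls):
        u = rng.random()
        cumulative = 0.0
        for p, payoff in arm:
            cumulative += p
            if u < cumulative:
                total += payoff
                break
    return total / pulls


plan_red = expected_total(RED)    # computed without playing
plan_blue = expected_total(BLUE)
sampled_red = estimate_mean_payoff(RED, pulls=10)  # noisy when pulls are few
```

With the model known, planning is pure arithmetic. When Red's probabilities are unknown, only something like `estimate_mean_payoff` is available, and a small number of pulls gives a noisy estimate, which is where exploration, sampling, and regret come in.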

Next Time: Reinforcement Learning!
