Static Fully Observable Stochastic What action next? Instantaneous Perfect


Helen Montgomery
5 years ago


CS 188: Artificial Intelligence
Markov Decision Processes K+1

Instructors: Dan Klein and Pieter Abbeel, University of California, Berkeley

[These slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.]

Midterm Review
  Agency: types of agents, types of environments
  Search
    Formulating a problem in terms of search
    Algorithms: DFS, BFS, IDS, best-first, uniform-cost, A*, local
    Heuristics: admissibility, consistency, creation, pattern DBs
  Constraints: formulation, search, forward checking, arc-consistency, structure
  Adversarial: min/max, alpha-beta, expectimax
  MDPs: formulation, Bellman eqns, V*, Q*, backups, value iteration, policy iteration

Stochastic Planning: MDPs
Recap: MDPs
  [Diagram labels: Static, Fully Observable, Perfect, Percepts, Environment, What action next?, Actions, Stochastic, Instantaneous]

Markov decision processes:
  States S
  Actions A
  Transitions $P(s' \mid s, a)$ (or $T(s, a, s')$)
  Rewards $R(s, a, s')$ (and discount $\gamma$)
  Start state $s_0$

Quantities:
  Policy = map of states to actions
  Utility = sum of discounted rewards
  Values = expected future utility from a state (max node)
  Q-Values = expected future utility from a q-state (chance node)

Solving MDPs
  The Bellman Equations
  Value Iteration
  Policy Iteration
  Reinforcement Learning

How to be optimal:
  Step 1: Take correct first action
  Step 2: Keep being optimal
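The "utility = sum of discounted rewards" quantity can be sketched in a few lines of Python. The function name and the reward sequence are illustrative assumptions, not part of the slides.

```python
def discounted_utility(rewards, gamma):
    """Sum of rewards, with the reward at step t weighted by gamma**t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Hypothetical reward sequence (not from the slides): 1 + 0.5*1 + 0.25*1
print(discounted_utility([1, 1, 1], 0.5))  # 1.75
```

A smaller discount makes later rewards count for less, so the same reward sequence yields a lower utility.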


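The Bellman Equations
  Definition of "optimal utility" via expectimax recurrence gives a simple one-step lookahead relationship amongst optimal utility values:
    $V^*(s) = \max_a Q^*(s, a)$
    $Q^*(s, a) = \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^*(s') \right]$
    $V^*(s) = \max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^*(s') \right]$
  These are the Bellman equations, and they characterize optimal values in a way we'll use over and over.

[Figures: Gridworld: Q*; Gridworld Values V*]

Value Iteration
  Forall s, initialize $V_0(s) = 0$ (no time steps left means an expected reward of zero); set k = 0
  Repeat (do Bellman backups):
    $Q_{k+1}(s, a) = \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V_k(s') \right]$
    $V_{k+1}(s) = \max_a Q_{k+1}(s, a)$
    k += 1
  Until $|V_{k+1}(s) - V_k(s)| < \epsilon$ forall s (convergence)
  [Tree labels: $V_{k+1}(s)$; a; s, a; s, a, s'; $V_k(s')$]

[Figure: Gridworld values at k = 0. Noise = 0.2, Discount = 0.9, Living reward = 0]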
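The value iteration loop can be sketched as follows. This is a minimal sketch under stated assumptions: the dictionary layout `mdp[s][a] = [(prob, next_state, reward), ...]`, the function name, and the toy MDP are all hypothetical and chosen only for illustration; they are not the Gridworld from the slides.

```python
def value_iteration(mdp, gamma=0.9, eps=1e-6):
    """mdp[s][a] is a list of (prob, next_state, reward) triples.
    States with no actions are treated as terminal and keep value 0."""
    V = {s: 0.0 for s in mdp}  # V_0(s) = 0 for all s
    while True:
        V_next = {}
        for s, actions in mdp.items():
            # Q_{k+1}(s, a) = sum_s' T(s,a,s') [R(s,a,s') + gamma V_k(s')]
            q_values = [
                sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for outcomes in actions.values()
            ]
            # V_{k+1}(s) = max_a Q_{k+1}(s, a)
            V_next[s] = max(q_values) if q_values else 0.0
        # Stop once no state's value changed by eps or more.
        if max(abs(V_next[s] - V[s]) for s in mdp) < eps:
            return V_next
        V = V_next

# Hypothetical toy MDP: A leads to B; from B, "go" usually reaches the terminal state.
toy_mdp = {
    "A": {"go": [(1.0, "B", 0.0)]},
    "B": {"go": [(0.8, "end", 1.0), (0.2, "B", 0.0)]},
    "end": {},
}
values = value_iteration(toy_mdp)
print(values)
```

Each sweep builds a fresh `V_next` from the previous `V`, which matches the synchronous update on the slide: every $V_{k+1}(s)$ is computed from $V_k$ only.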

[Figures: Gridworld values at k = 1, k = 2, k = 3, and k = 100. Noise = 0.2, Discount = 0.9, Living reward = 0]

Gridworld notes:
  If the agent is in 4,3, it only has one legal action: get jewel. It gets a reward and the game is over.
  If the agent is in the pit, it has only one legal action: die. It gets a penalty and the game is over.
  The agent does NOT get a reward for moving INTO 4,3.

Example: Bellman Backup / Policy Extraction
  $V_0(s_1) = 0$, $V_0(s_2) = 1$, $V_0(s_3) = 2$
  $Q_1(s, a_1) = 2 + \gamma \cdot 0 \approx 2$
  $Q_1(s, a_2) = 5 + \gamma \cdot 0.9 \cdot 1 + \gamma \cdot 0.1 \cdot 2 \approx 6.1$
  $Q_1(s, a_3) = 4.5 + \gamma \cdot 2 \approx 6.5$
  $V_1(s) = \max_a Q_1(s, a) = 6.5$
  $a_{greedy} = a_3$
  (The approximate values correspond to taking $\gamma \approx 1$.)
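The Bellman backup example can be checked numerically. The sketch below assumes $\gamma = 1$ (which reproduces the rounded values 2, 6.1, and 6.5) and assumes the transition structure implied by the three Q-value expressions: $a_1$ leads to $s_1$ with reward 2, $a_2$ leads to $s_2$ with probability 0.9 or $s_3$ with probability 0.1 with reward 5, and $a_3$ leads to $s_3$ with reward 4.5. The variable names are illustrative.

```python
gamma = 1.0  # assumption: the rounded results on the slide match gamma close to 1

V0 = {"s1": 0.0, "s2": 1.0, "s3": 2.0}

# action -> (immediate reward, [(probability, successor), ...]); structure assumed
actions = {
    "a1": (2.0, [(1.0, "s1")]),
    "a2": (5.0, [(0.9, "s2"), (0.1, "s3")]),
    "a3": (4.5, [(1.0, "s3")]),
}

# One Bellman backup: Q_1(s, a) = reward + gamma * expected V_0 of the successor
Q1 = {
    a: reward + gamma * sum(p * V0[s2] for p, s2 in outcomes)
    for a, (reward, outcomes) in actions.items()
}
V1 = max(Q1.values())            # V_1(s) = max_a Q_1(s, a)
a_greedy = max(Q1, key=Q1.get)   # policy extraction: the maximizing action
print(Q1, V1, a_greedy)
```

The same max that produces $V_1$ also identifies the greedy action, which is why the slide pairs the backup with policy extraction.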


Computing Actions from Q-Values
  Let's imagine we have the optimal q-values.
  How should we act? Completely trivial to decide!
    $\pi^*(s) = \arg\max_a Q^*(s, a)$
  Important lesson: actions are easier to select from q-values than values!

Problems with Value Iteration
  Value iteration repeats the Bellman updates:
    $V_{k+1}(s) \leftarrow \max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V_k(s') \right]$
  Problem 1: It's slow: $O(S^2 A)$ per iteration
  Problem 2: The "max" at each state rarely changes
  Problem 3: The policy often converges long before the values
  [Demo: value iteration (L9D2)]

VI → Asynchronous VI
  Is it essential to back up all states in each iteration? No!
  States may be backed up many times or not at all, in any order.
  As long as no state gets starved, convergence properties still hold!

Prioritization of Bellman Backups
  Are all backups equally important?
  Can we avoid some backups?
  Can we schedule the backups more appropriately?

[Figures: Gridworld values at k = 1 and k = 2. Noise = 0.2, Discount = 0.9, Living reward = 0]
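One way to sketch asynchronous value iteration is to update values in place, in an arbitrary order, so each backup immediately sees the newest values of its successors. The version below is an assumption-laden illustration: the `mdp[s][a] = [(prob, next_state, reward), ...]` layout, the fixed number of sweeps, and the toy MDP are hypothetical. Shuffling the order on every sweep while still visiting each state once is one simple way to make sure no state is starved.

```python
import random

def async_value_iteration(mdp, gamma=0.9, sweeps=200, seed=0):
    """In-place Bellman backups in a random order on every sweep."""
    rng = random.Random(seed)
    V = {s: 0.0 for s in mdp}
    non_terminal = [s for s in mdp if mdp[s]]  # terminal states keep value 0
    for _ in range(sweeps):
        rng.shuffle(non_terminal)  # any order is fine as long as every state is visited
        for s in non_terminal:
            # V is overwritten immediately, so later backups in this sweep use it.
            V[s] = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for outcomes in mdp[s].values()
            )
    return V

# Hypothetical toy MDP, not from the slides.
toy_mdp = {
    "A": {"go": [(1.0, "B", 0.0)]},
    "B": {"go": [(0.8, "end", 1.0), (0.2, "B", 0.0)]},
    "end": {},
}
values = async_value_iteration(toy_mdp)
print(values)
```

Unlike the synchronous update, there is no separate copy of the previous values; the result converges to the same fixed point.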


[Figure: Gridworld values at k = 3. Noise = 0.2, Discount = 0.9, Living reward = 0]

Asynch VI: Prioritized Sweeping
  Why back up a state if the values of its successors are the same?
  Prefer backing up a state whose successors had the most change
  Priority queue of (state, expected change in value)
  Back up in the order of priority
  After backing up a state, update the priority queue for all predecessors

Solving MDPs
  Value Iteration
  Policy Iteration
  Reinforcement Learning

Policy Methods

Policy Evaluation

Fixed Policies
  [Trees: "Do the optimal action" (s; a; s, a; s, a, s'; s') versus "Do what π says to do" (s; π(s); s, π(s); s, π(s), s'; s')]
  Expectimax trees max over all actions to compute the optimal values
  If we fixed some policy π(s), then the tree would be simpler: only one action per state
  ... though the tree's value would depend on which policy we fixed
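The prioritized sweeping idea can be sketched with a priority queue keyed on how much a backup would change each state's value. This is a simplified variant written for illustration, not necessarily the exact algorithm behind the slide: the `mdp[s][a] = [(prob, next_state, reward), ...]` layout, the threshold `theta`, the backup cap, and the toy MDP are all assumptions. Stale queue entries are tolerated; they only cause a redundant backup.

```python
import heapq

def backup(mdp, V, s, gamma):
    """One Bellman backup for a non-terminal state s."""
    return max(
        sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
        for outcomes in mdp[s].values()
    )

def prioritized_sweeping(mdp, gamma=0.9, theta=1e-8, max_backups=10_000):
    V = {s: 0.0 for s in mdp}
    # predecessors[s2] = states with some action that can lead to s2
    predecessors = {s: set() for s in mdp}
    for s, actions in mdp.items():
        for outcomes in actions.values():
            for p, s2, _ in outcomes:
                if p > 0:
                    predecessors[s2].add(s)

    heap = []  # entries are (-expected change in value, state)
    for s in mdp:
        if mdp[s]:
            change = abs(backup(mdp, V, s, gamma) - V[s])
            if change > theta:
                heapq.heappush(heap, (-change, s))

    backups = 0
    while heap and backups < max_backups:
        _, s = heapq.heappop(heap)       # state with the largest expected change
        V[s] = backup(mdp, V, s, gamma)
        backups += 1
        for q in predecessors[s]:        # only predecessors can be affected
            change = abs(backup(mdp, V, q, gamma) - V[q])
            if change > theta:
                heapq.heappush(heap, (-change, q))
    return V

# Hypothetical toy MDP, not from the slides.
toy_mdp = {
    "A": {"go": [(1.0, "B", 0.0)]},
    "B": {"go": [(0.8, "end", 1.0), (0.2, "B", 0.0)]},
    "end": {},
}
values = prioritized_sweeping(toy_mdp)
print(values)
```

Priorities are negated because `heapq` is a min-heap, so the entry with the largest expected change is popped first.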


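Utilities for a Fixed Policy
  Another basic operation: compute the utility of a state s under a fixed (generally non-optimal) policy
  Define the utility of a state s, under a fixed policy π:
    $V^\pi(s)$ = expected total discounted rewards starting in s and following π
  Recursive relation (one-step look-ahead / Bellman equation):
    $V^\pi(s) = \sum_{s'} T(s, \pi(s), s') \left[ R(s, \pi(s), s') + \gamma V^\pi(s') \right]$
  [Tree labels: s; π(s); s, π(s); s, π(s), s'; s']

Example: Policy Evaluation
  [Figures: "Always Go Right" and "Always Go Forward" policies and their values]

Policy Iteration
  Initialize π(s) to random actions
  Repeat
    Step 1: Policy evaluation: calculate utilities of π at each s using a nested loop
    Step 2: Policy improvement: update policy using one-step look-ahead
      For each s, what's the best action I could execute, assuming I then follow π?
      Let π'(s) = this best action.
    π = π'
  Until policy doesn't change

Policy Iteration Details
  Let i = 0
  Initialize $\pi_i(s)$ to random actions
  Repeat
    Step 1: Policy evaluation:
      Initialize k = 0; forall s, $V_0^{\pi_i}(s) = 0$
      Repeat until $V^{\pi_i}$ converges
        For each state s,
          $V_{k+1}^{\pi_i}(s) \leftarrow \sum_{s'} T(s, \pi_i(s), s') \left[ R(s, \pi_i(s), s') + \gamma V_k^{\pi_i}(s') \right]$
        Let k += 1
    Step 2: Policy improvement:
      For each state s,
        $\pi_{i+1}(s) = \arg\max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^{\pi_i}(s') \right]$
    If $\pi_i == \pi_{i+1}$ then it's optimal; return it.
    Else let i += 1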
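The two steps of policy iteration can be sketched as a pair of functions. This is a minimal sketch under stated assumptions: the `mdp[s][a] = [(prob, next_state, reward), ...]` layout, the function names, and the toy MDP are hypothetical, and the initial policy picks the first listed action for each state instead of a random one so that the run is reproducible.

```python
def policy_evaluation(mdp, policy, gamma=0.9, eps=1e-8):
    """Iterate the fixed-policy Bellman update until the values stop changing."""
    V = {s: 0.0 for s in mdp}
    while True:
        V_next = {
            s: (sum(p * (r + gamma * V[s2]) for p, s2, r in mdp[s][policy[s]])
                if mdp[s] else 0.0)  # terminal states keep value 0
            for s in mdp
        }
        if max(abs(V_next[s] - V[s]) for s in mdp) < eps:
            return V_next
        V = V_next

def policy_iteration(mdp, gamma=0.9):
    # Assumption: start from the first listed action instead of a random one.
    policy = {s: next(iter(acts)) for s, acts in mdp.items() if acts}
    while True:
        V = policy_evaluation(mdp, policy, gamma)      # Step 1
        new_policy = {}
        for s, acts in mdp.items():                    # Step 2
            if not acts:
                continue
            q = {a: sum(p * (r + gamma * V[s2]) for p, s2, r in outs)
                 for a, outs in acts.items()}
            best = max(q, key=q.get)
            # Keep the current action on (near-)ties so the policy cannot oscillate.
            if q[policy[s]] >= q[best] - 1e-12:
                best = policy[s]
            new_policy[s] = best
        if new_policy == policy:
            return policy, V                           # unchanged policy: done
        policy = new_policy

# Hypothetical toy MDP: "safe" ends now with reward 1; "risky" may pay 4 or loop.
toy_mdp = {
    "A": {"safe": [(1.0, "end", 1.0)],
          "risky": [(0.5, "end", 4.0), (0.5, "A", 0.0)]},
    "end": {},
}
best_policy, best_values = policy_iteration(toy_mdp)
print(best_policy, best_values)
```

Evaluation considers only the action the policy names, so each evaluation pass is cheaper than a value iteration pass; only the improvement step looks at every action.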

Example
  Initialize $\pi_0$ to "always go right"
  Perform policy evaluation
  Perform policy improvement
    Iterate through states
  Has policy changed? Yes! i += 1

Example
  $\pi_1$ says "always go up"
  Perform policy evaluation
  Perform policy improvement
    Iterate through states
  Has policy changed? No! We have the optimal policy

Example: Policy Evaluation
  [Figures: "Always Go Right" and "Always Go Forward" policies and their values]

Policy Iteration Properties
  Policy iteration finds the optimal policy, guaranteed!
  Often converges (much) faster

Comparison
  Both value iteration and policy iteration compute the same thing (all optimal values)
  In value iteration:
    Every iteration updates both the values and (implicitly) the policy
    We don't track the policy, but taking the max over actions implicitly recomputes it
  In policy iteration:
    We do several passes that update utilities with fixed policy (each pass is fast because we consider only one action, not all of them)
    After the policy is evaluated, a new policy is chosen (slow like a value iteration pass)
    The new policy will be better (or we're done)
  Both are dynamic programs for solving MDPs

Summary: MDP Algorithms
  So you want to...
    Compute optimal values: use value iteration or policy iteration
    Compute values for a particular policy: use policy evaluation
    Turn your values into a policy: use policy extraction (one-step lookahead)
  These all look the same!
    They basically are: they are all variations of Bellman updates
    They all use one-step lookahead expectimax fragments
    They differ only in whether we plug in a fixed policy or max over actions

Double Bandits

Next Time: Reinforcement Learning!
