
CSE 473: Markov Decision Processes (4/30/2012)
Dan Weld
Many slides from Chris Bishop, Mausam, Dan Klein, Stuart Russell, Andrew Moore & Luke Zettlemoyer

Overview
Introduction & Agents; Search, Heuristics & CSPs; Adversarial Search; Logical Knowledge Representation; Planning & MDPs; Reinforcement Learning; Uncertainty & Bayesian Networks; Machine Learning; NLP & Special Topics.

MDPs
Markov Decision Processes; Planning Under Uncertainty; Mathematical Framework; Bellman Equations; Value Iteration; Real-Time Dynamic Programming; Policy Iteration; Reinforcement Learning. Andrey Markov.

Planning Agent
Environment dimensions: Static vs. Dynamic; Fully vs. Partially Observable; Deterministic vs. Stochastic; Instantaneous vs. Durative; Perfect vs. Noisy percepts. The agent receives percepts, emits actions, and must decide: what action next?

Review: Expectimax
What if we don't know what the result of an action will be? E.g., in solitaire the next card is unknown; in Pacman the ghosts act randomly. We can do expectimax search: max nodes as in minimax search; chance nodes, like min nodes except the outcome is uncertain, take the average (expectation) of their children. Calculate expected utilities. Today we formalize this as a Markov Decision Process, which handles intermediate rewards & infinite plans and allows more efficient processing.

Grid World
Walls block the agent's path. The agent's actions may go astray: 80% of the time, the North action takes the agent North (assuming no wall); 10% of the time it actually goes West; 10% East. If there is a wall in the chosen direction, the agent stays put. There is a small "living" reward each step; big rewards come at the end. Goal: maximize the sum of rewards.
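The 80/10/10 action noise above is easy to state in code. Below is a minimal sketch (not from the slides) of a sampler for one noisy grid-world step; the coordinate convention (y grows northward), the veer table for the other three directions, and the noisy_step helper are all illustrative assumptions.

    import random

    # Each action veers perpendicular 10% of the time on each side,
    # generalizing the slide's North example to all four directions.
    VEER = {"N": ("W", "E"), "S": ("E", "W"), "E": ("N", "S"), "W": ("S", "N")}
    MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}

    def noisy_step(pos, action, walls, rng=random):
        """Sample the successor position for one noisy grid-world action."""
        r = rng.random()
        if r < 0.8:
            actual = action            # intended direction, 80% of the time
        elif r < 0.9:
            actual = VEER[action][0]   # veer one way, 10%
        else:
            actual = VEER[action][1]   # veer the other way, 10%
        dx, dy = MOVES[actual]
        nxt = (pos[0] + dx, pos[1] + dy)
        return pos if nxt in walls else nxt  # blocked moves stay put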

Markov Decision Processes
An MDP is defined by:
A set of states s ∈ S
A set of actions a ∈ A
A transition function T(s, a, s'): the probability that a from s leads to s', i.e., P(s' | s, a); also called the model
A reward function R(s, a, s'); sometimes just R(s) or R(s')
A start state (or distribution)
Maybe a terminal state
MDPs are non-deterministic search problems. Reinforcement learning: MDPs where we don't know the transition or reward functions.

Axioms of Probability Theory
All probabilities are between 0 and 1: 0 ≤ P(A) ≤ 1. Probability of truth and falsity: P(true) = 1, P(false) = 0. The probability of a disjunction is: P(A ∨ B) = P(A) + P(B) − P(A ∧ B).

Terminology
Conditional probability, joint probability, marginal probability.

Conditional Probability
P(A | B) is the probability of A given B. Assumes B is all and only the information known. Defined by: P(A | B) = P(A ∧ B) / P(B).

Independence
A and B are independent iff P(A | B) = P(A), or equivalently P(B | A) = P(B); these constraints are logically equivalent. Therefore, if A and B are independent: P(A | B) = P(A ∧ B) / P(B) = P(A), so P(A ∧ B) = P(A) P(B).
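To make the axioms and the independence test concrete, here is a small sketch that checks the disjunction rule, the conditional-probability definition, and the independence identity on a made-up joint distribution over two binary variables; the numbers are chosen purely for illustration.

    # A toy joint distribution over binary A and B (illustrative numbers).
    joint = {(True, True): 0.3, (True, False): 0.2,
             (False, True): 0.3, (False, False): 0.2}

    P_A = sum(p for (a, b), p in joint.items() if a)   # marginal P(A) = 0.5
    P_B = sum(p for (a, b), p in joint.items() if b)   # marginal P(B) = 0.6
    P_A_and_B = joint[(True, True)]
    P_A_or_B = P_A + P_B - P_A_and_B                   # disjunction rule
    P_A_given_B = P_A_and_B / P_B                      # conditional probability

    # Independence holds iff P(A|B) = P(A), equivalently P(A^B) = P(A)P(B).
    print(P_A_or_B, P_A_given_B, abs(P_A_and_B - P_A * P_B) < 1e-12)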

Conditional Independence
A & B are not independent here, since P(A | B) < P(A). But A & B are made independent by C: P(A | C) = P(A | B, C). [The slide illustrates both cases with Venn diagrams.]

What is "Markov" about MDPs?
Andrey Markov. "Markov" generally means that, conditioned on the present state, the future is independent of the past. For Markov decision processes, "Markov" means the transition outcome depends only on the current state and action: P(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, ..., s_0) = P(s_{t+1} | s_t, a_t).

Solving MDPs
In deterministic single-agent search problems, we want an optimal plan, or sequence of actions, from start to goal. In an MDP, we instead want an optimal policy π*: S → A. A policy π gives an action for each state; an optimal policy maximizes expected utility if followed. It defines a reflex agent.

Example: Optimal Policies
Optimal policy when R(s, a, s') = -0.03 for all non-terminal s. [The slide shows the optimal grid-world policies for living rewards R(s) = -0.01, -0.03, -0.4, and -2.0.]

Example: High-Low
Three card types: 2, 3, 4. Infinite deck, twice as many 2s. Start with a 3 showing. After each card, you say "high" or "low", and a new card is flipped. If you're right, you win the points shown on the new card. Ties are no-ops (no reward). If you're wrong, the game ends. Differences from expectimax problems: #1, you get rewards as you go; #2, you might play forever!
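Conditional independence can be checked the same way. The sketch below builds a hypothetical joint P(A, B, C) in which A and B are marginally dependent but independent given C; all the conditional probabilities are invented for illustration.

    from itertools import product

    # Generate A and B independently given C, so A ⊥ B | C by construction.
    pC = {True: 0.5, False: 0.5}
    pA_given_C = {True: 0.9, False: 0.1}
    pB_given_C = {True: 0.8, False: 0.2}

    joint = {(a, b, c): pC[c]
             * (pA_given_C[c] if a else 1 - pA_given_C[c])
             * (pB_given_C[c] if b else 1 - pB_given_C[c])
             for a, b, c in product([True, False], repeat=3)}

    def P(pred):  # probability of an event under the joint
        return sum(p for abc, p in joint.items() if pred(*abc))

    # Marginally dependent: P(A|B) = 0.74 != P(A) = 0.5
    print(P(lambda a, b, c: a and b) / P(lambda a, b, c: b),
          P(lambda a, b, c: a))
    # Conditionally independent: P(A|B, C=true) = P(A|C=true) = 0.9
    print(P(lambda a, b, c: a and b and c) / P(lambda a, b, c: b and c),
          P(lambda a, b, c: a and c) / P(lambda a, b, c: c))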

High-Low as an MDP
States: 2, 3, 4, done. Actions: High, Low. Model T(s, a, s'):
P(s'=4 | 4, Low) = 1/4
P(s'=3 | 4, Low) = 1/4
P(s'=2 | 4, Low) = 1/2
P(s'=done | 4, Low) = 0
P(s'=4 | 4, High) = 1/4
P(s'=3 | 4, High) = 0
P(s'=2 | 4, High) = 0
P(s'=done | 4, High) = 3/4
Rewards R(s, a, s'): the number shown on s' if the guess was correct; 0 otherwise. Start state: 3.

Search Tree: High-Low
[The slide shows the expectimax-style search tree from the start state 3, branching on High/Low, with transitions labeled by probability and reward, e.g., T = 0.5, R = 2; T = 0.25, R = 3; T = 0.25, R = 0; T = 0, R = 4.]

MDP Search Trees
Each MDP state gives an expectimax-like search tree: (s, a) is a q-state, s is a state, and (s, a, s') is called a transition, with T(s, a, s') = P(s' | s, a) and reward R(s, a, s').

Utilities of Sequences
In order to formalize the optimality of a policy, we need to understand utilities of sequences of rewards. We typically consider stationary preferences. Theorem: there are only two ways to define stationary utilities. Additive utility: U([r_0, r_1, r_2, ...]) = r_0 + r_1 + r_2 + ... Discounted utility: U([r_0, r_1, r_2, ...]) = r_0 + γ r_1 + γ² r_2 + ...

Infinite Utilities?!
Problem: infinite state sequences have infinite rewards. Solutions: (1) Finite horizon: terminate episodes after a fixed T steps (e.g., a lifetime); gives nonstationary policies (π depends on the time left). (2) Absorbing state: guarantee that for every policy a terminal state will eventually be reached (like "done" for High-Low). (3) Discounting: use 0 < γ < 1; smaller γ means a smaller "horizon" and a shorter-term focus.

Discounting
Typically we discount rewards by γ < 1 each time step. Sooner rewards have higher utility than later rewards. Discounting also helps the algorithms converge.
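The High-Low model is small enough to write out directly. The sketch below encodes T(s, a, s') and R(s, a, s') following the deck odds stated above (P(2) = 1/2, P(3) = P(4) = 1/4); the T/R function signatures are an assumed interface, not code from the course.

    # Card draw odds: twice as many 2s as 3s or 4s.
    DRAW = {2: 0.5, 3: 0.25, 4: 0.25}

    def T(s, a):
        """Return {s': prob} for guessing a ('high' or 'low') in state s."""
        out = {"done": 0.0}
        for card, p in DRAW.items():
            correct = card > s if a == "high" else card < s
            if card == s or correct:        # tie (no-op) or correct guess
                out[card] = out.get(card, 0.0) + p
            else:                           # wrong guess ends the game
                out["done"] += p
        return out

    def R(s, a, s2):
        """Number shown on s' if the guess was correct; 0 otherwise."""
        if s2 == "done" or s2 == s:
            return 0
        return s2

    # Reproduces the slide's numbers for state 4: Low keeps the game
    # alive with probability 1, High ends it with probability 3/4.
    print(T(4, "low"), T(4, "high"))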

Recap: Defining MDPs
Markov decision processes: states S; start state s_0; actions A; transitions P(s' | s, a), aka T(s, a, s'); rewards R(s, a, s') (and discount γ). MDP quantities so far: a policy π, a function that chooses an action for each state, and the utility (aka "return"), the sum of discounted rewards.

Optimal Utilities
Define the value of a state s: V*(s) = expected utility starting in s and acting optimally. Define the value of a q-state (s, a): Q*(s, a) = expected utility starting in s, taking action a, and thereafter acting optimally. Define the optimal policy: π*(s) = the optimal action from state s.

The Bellman Equations
The definition of optimal utility leads to a simple one-step lookahead relationship between optimal utility values:
V*(s) = max_a Q*(s, a)
Q*(s, a) = Σ_{s'} T(s, a, s') [R(s, a, s') + γ V*(s')]

Why Not Search Trees?
Why not solve with expectimax? Problems: this tree is usually infinite (why?); the same states appear over and over (why?); we would search once per state (why?). Idea: value iteration. Compute optimal values for all states all at once using successive approximation. It will be a bottom-up dynamic program similar in cost to memoization. Do all planning offline; no replanning is needed!

Value Estimates
Calculate estimates V_k*(s): the optimal value considering only the next k time steps (k rewards). As k → ∞, V_k approaches the optimal values. Why: if discounting, distant rewards become negligible; if terminal states are reachable from everywhere, the fraction of episodes not ending becomes negligible; otherwise we can get infinite expected utility, and then this approach actually won't work.

Value Iteration
Idea: start with V_0*(s) = 0, which we know is right (why?). Given V_i*, calculate the values for all states at depth i+1:
V_{i+1}(s) ← max_a Σ_{s'} T(s, a, s') [R(s, a, s') + γ V_i(s')]
This is called a value update or Bellman update. Repeat until convergence. Theorem: this will converge to the unique optimal values. Basic idea: the approximations get refined towards the optimal values. The policy may converge long before the values do.
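As a sketch of the Bellman relationship, the helpers below compute Q*(s, a) by one-step lookahead and V*(s) as the max over actions, against the same assumed T(s, a) → {s': p} and R(s, a, s') interface used in the High-Low sketch above.

    def q_value(V, s, a, T, R, gamma):
        """One-step lookahead: Q(s,a) under the value estimate V."""
        return sum(p * (R(s, a, s2) + gamma * V.get(s2, 0.0))
                   for s2, p in T(s, a).items())

    def v_from_q(V, s, actions, T, R, gamma):
        """V(s) = max over actions of Q(s,a); terminal states stay 0."""
        return max((q_value(V, s, a, T, R, gamma) for a in actions(s)),
                   default=0.0)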

Example: Bellman Updates
Example with γ = 0.9, living reward = 0, noise = 0.2. [The slide steps through one Bellman update on the grid world.]

Example: Value Iteration
[The slide shows the grid-world value estimates V_1 and V_2.] Information propagates outward from the terminal states, and eventually all states have correct value estimates.

Practice: Computing Actions
Which action should we choose from state s? Given the optimal q-values Q*: π*(s) = argmax_a Q*(s, a). Given the optimal values V*: π*(s) = argmax_a Σ_{s'} T(s, a, s') [R(s, a, s') + γ V*(s')]. Lesson: actions are easier to select from Q's!

Convergence
Define the max norm: ||V|| = max_s |V(s)|. Theorem: for any two approximations U and V, ||U_{i+1} − V_{i+1}|| ≤ γ ||U_i − V_i||; i.e., any distinct approximations must get closer to each other, so, in particular, any approximation must get closer to the true values, and value iteration converges to a unique, stable, optimal solution. Theorem: once the change in our approximation is small, it must also be close to correct.

Value Iteration Complexity
Problem size: |A| actions and |S| states. Each iteration: computation O(|A|·|S|²), space O(|S|). Number of iterations: can be exponential in the discount factor γ.
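The "actions are easier to select from Q" lesson shows up directly in code: the sketch below recovers a greedy policy first from a Q-table (a bare argmax) and then from V (which needs the model for a one-step lookahead). Both helpers and their interfaces are illustrative assumptions.

    def policy_from_Q(Q, states, actions):
        """From Q-values, a single argmax per state suffices."""
        return {s: max(actions(s), key=lambda a: Q[(s, a)]) for s in states}

    def policy_from_V(V, states, actions, T, R, gamma):
        """From V alone we must re-expand the model one step."""
        def lookahead(s, a):
            return sum(p * (R(s, a, s2) + gamma * V.get(s2, 0.0))
                       for s2, p in T(s, a).items())
        return {s: max(actions(s), key=lambda a: lookahead(s, a))
                for s in states}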

Bellman Equations for MDPs
Given an MDP ⟨S, A, T, R, s_0, γ⟩, define V*(s), the optimal value, as the maximum expected discounted reward from this state. V* should satisfy the following equation:
V*(s) = max_a Σ_{s'} T(s, a, s') [R(s, a, s') + γ V*(s')]

Bellman Backup
Given an estimate of the V* function (say V_n), the backup function at state s calculates a new estimate V_{n+1}:
Q_{n+1}(s, a) = the value of the strategy "execute action a in s, then execute π_n subsequently", where π_n = argmax_{a ∈ Ap(s)} Q_n(s, a)
V_{n+1}(s) = max_a Q_{n+1}(s, a)
[The slide works an example: successor values V_0 = 0, 1, 2 give Q_1(s, a_1) ≈ 2, Q_1(s, a_2) ≈ 6.1, Q_1(s, a_3) ≈ 6.5, so V_1(s) = 6.5 and the greedy action is a_3.]

Value Iteration [Bellman '57]
Assign an arbitrary assignment of V_0 to each state.
Repeat: for all states s, compute V_{n+1}(s) by a Bellman backup at s,
until max_s |V_{n+1}(s) − V_n(s)| < ε   (the residual at s).

Policy Computation
The optimal policy is stationary and time-independent for infinite/indefinite-horizon problems: π*(s) = argmax_a Q*(s, a).
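Putting the pieces together, here is a sketch of the slide's value-iteration loop with the residual stopping test, again over the assumed T/R interface; giving terminal states with no actions a value of 0 is a convention of this sketch.

    def value_iteration(states, actions, T, R, gamma, eps=1e-6):
        """Arbitrary V_0, Bellman backups at every state, stop when the
        max residual drops below eps."""
        V = {s: 0.0 for s in states}          # arbitrary initial assignment
        while True:
            V_new, residual = {}, 0.0
            for s in states:
                V_new[s] = max((sum(p * (R(s, a, s2) + gamma * V.get(s2, 0.0))
                                    for s2, p in T(s, a).items())
                                for a in actions(s)),
                               default=0.0)   # terminal states keep value 0
                residual = max(residual, abs(V_new[s] - V[s]))
            V = V_new
            if residual < eps:
                return V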

Asynchronous Value Iteration
States may be backed up in any order, instead of iteration by iteration. As long as all states are backed up infinitely often, asynchronous value iteration converges to optimal.

Asynch VI: Prioritized Sweeping
Why back up a state if the values of its successors are the same? Prefer backing up states whose successors had the most change. Keep a priority queue of (state, expected change in value) pairs and back up states in priority order. After backing up a state, update the priority queue for all of its predecessors.

Asynch VI: Real-Time Dynamic Programming [Barto, Bradtke, Singh '95]
Trial: simulate the greedy policy starting from the start state, performing a Bellman backup on each visited state. RTDP: repeat trials until the value function converges. [The slide shows a trial from s_0 toward the goal, computing Q_{n+1}(s_0, a) and the greedy action at each step.]
Properties: if all states are visited infinitely often, then V_n → V*. Advantage: anytime behavior; the more probable states are explored quickly. Disadvantage: complete convergence can be slow!
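A single RTDP trial, as described above, might look like the following sketch; it simulates the greedy policy from the start state and backs up each visited state. The max_steps cap and the reward-maximizing formulation are assumptions (the slide's figure appears to be cost-minimizing toward a goal), so treat this as illustrative only.

    import random

    def rtdp_trial(s0, goal, actions, T, R, gamma, V, max_steps=1000,
                   rng=random):
        """One RTDP trial: greedy simulation with backups along the way."""
        s = s0
        for _ in range(max_steps):          # cap the trial length for safety
            if s == goal:
                return
            # Bellman backup at s: best Q-value over the available actions.
            qs = {a: sum(p * (R(s, a, s2) + gamma * V.get(s2, 0.0))
                         for s2, p in T(s, a).items())
                  for a in actions(s)}
            a_greedy = max(qs, key=qs.get)
            V[s] = qs[a_greedy]
            # Simulate the greedy action to choose the next state to visit.
            succs, probs = zip(*T(s, a_greedy).items())
            s = rng.choices(succs, weights=probs)[0]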


Utilities for Fixed Policies
Another basic operation: compute the utility of a state s under a fixed (general, non-optimal) policy. Define the utility of a state s under a fixed policy π: V^π(s) = expected total discounted rewards (return) starting in s and following π. Recursive relation (one-step lookahead / Bellman equation):
V^π(s) = Σ_{s'} T(s, π(s), s') [R(s, π(s), s') + γ V^π(s')]

Policy Evaluation
How do we calculate the V's for a fixed policy? Idea one: modify the Bellman updates. Idea two: it's just a linear system; solve with Matlab (or whatever).

Policy Iteration
Problems with value iteration: considering all actions each iteration is slow; it takes |A| times longer than policy evaluation. But the policy often doesn't change each iteration, so time is wasted. Alternative to value iteration: Step 1, policy evaluation: calculate utilities for a fixed policy (not the optimal utilities!) until convergence (fast). Step 2, policy improvement: update the policy using one-step lookahead with the resulting converged (but not optimal!) utilities (slow but infrequent). Repeat the steps until the policy converges.

Policy Iteration (updates)
Policy evaluation: with the current policy π fixed, find the values with a simplified Bellman update, iterating until the values converge:
V^π_{i+1}(s) ← Σ_{s'} T(s, π(s), s') [R(s, π(s), s') + γ V^π_i(s')]
Policy improvement: with the utilities fixed, find the best action according to a one-step lookahead:
π_new(s) = argmax_a Σ_{s'} T(s, a, s') [R(s, a, s') + γ V^π(s')]

Policy Iteration Complexity
Problem size: |A| actions and |S| states. Each iteration: computation O(|S|³ + |A|·|S|²), space O(|S|). Number of iterations: unknown, but it can be faster in practice. Convergence is guaranteed.

Comparison
In value iteration, every pass (or "backup") updates both the utilities (explicitly, based on the current utilities) and the policy (possibly implicitly, based on the current policy). In policy iteration, several passes update the utilities with a frozen policy, and occasional passes update the policy. Hybrid approaches (asynchronous policy iteration): any sequence of partial updates to either policy entries or utilities will converge if every state is visited infinitely often.
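The two-step loop above translates into a short sketch, assuming the same T/R interface as the earlier snippets; the in-place (Gauss-Seidel) evaluation and the handling of terminal states are implementation choices of this sketch, not prescriptions from the slides.

    def policy_iteration(states, actions, T, R, gamma, eval_eps=1e-6):
        """Evaluate the current policy with simplified Bellman updates,
        then improve it greedily; stop when the policy is stable."""
        pi = {s: next(iter(actions(s)), None) for s in states}  # arbitrary start
        V = {s: 0.0 for s in states}
        while True:
            # Step 1: policy evaluation -- fixed pi, so no max over actions.
            while True:
                residual = 0.0
                for s in states:
                    if pi[s] is None:                  # terminal: no action
                        continue
                    v = sum(p * (R(s, pi[s], s2) + gamma * V.get(s2, 0.0))
                            for s2, p in T(s, pi[s]).items())
                    residual = max(residual, abs(v - V[s]))
                    V[s] = v                           # in-place update
                if residual < eval_eps:
                    break
            # Step 2: policy improvement -- one-step lookahead on fixed V.
            changed = False
            for s in states:
                acts = list(actions(s))
                if not acts:
                    continue
                best = max(acts, key=lambda a: sum(
                    p * (R(s, a, s2) + gamma * V.get(s2, 0.0))
                    for s2, p in T(s, a).items()))
                if best != pi[s]:
                    pi[s], changed = best, True
            if not changed:
                return pi, V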
