Outline. CS 188: Artificial Intelligence Spring Speeding Up Game Tree Search. Minimax Example. Alpha-Beta Pruning. Pruning


CS 188: Artificial Intelligence, Spring 2011
Lecture 8: Games, MDPs
2/14/2010
Pieter Abbeel, UC Berkeley
Many slides adapted from Dan Klein

Outline
- Zero-sum deterministic two-player games
  - Minimax
  - Evaluation functions for non-terminal states
  - Alpha-Beta pruning
- Stochastic games
  - Single player: expectimax
  - Two player: expectiminimax
- Non-zero sum
- Markov decision processes (MDPs)

Minimax Example
[Figure: minimax game tree]

Speeding Up Game Tree Search
- Evaluation functions for non-terminal states
- Pruning: do not search parts of the tree
- Alpha-Beta pruning does so without losing accuracy: $O(b^d) \to O(b^{d/2})$

Pruning
[Figure: pruning example tree]

Alpha-Beta Pruning
- General configuration
  - We're computing the MIN-VALUE at n
  - We're looping over n's children
  - n's value estimate is dropping
  - α is the best value that MAX can get at any choice point along the current path
  - If n becomes worse than α, MAX will avoid it, so we can stop considering n's other children
  - Define β similarly for MIN
[Figure: path through alternating MAX and MIN levels down to node n]
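The general configuration described above can be sketched in a few lines of Python. This is a minimal illustration, not the course's reference pseudocode: the tree encoding (a number is a terminal utility, a list is an internal node), the function names, the `visited` bookkeeping list, and the example tree are all assumptions made for this sketch.

```python
import math

def minimax(node, maximizing=True):
    """Plain minimax: a number is a terminal utility, a list holds children."""
    if not isinstance(node, list):
        return node
    values = [minimax(child, not maximizing) for child in node]
    return max(values) if maximizing else min(values)

def alphabeta(node, alpha=-math.inf, beta=math.inf, maximizing=True, visited=None):
    """Minimax value of `node`, skipping children that cannot change the result.

    alpha: best value MAX can already guarantee along the current path.
    beta:  best value MIN can already guarantee along the current path.
    visited: optional list that records every terminal actually evaluated.
    """
    if not isinstance(node, list):
        if visited is not None:
            visited.append(node)
        return node
    if maximizing:
        value = -math.inf
        for child in node:
            value = max(value, alphabeta(child, alpha, beta, False, visited))
            if value >= beta:   # MIN above would never allow this branch
                break
            alpha = max(alpha, value)
        return value
    value = math.inf
    for child in node:
        value = min(value, alphabeta(child, alpha, beta, True, visited))
        if value <= alpha:      # MAX above already has something at least as good
            break
        beta = min(beta, value)
    return value

# Hypothetical example: MAX at the root, three MIN nodes, nine terminals.
TREE = [[3, 12, 8], [2, 4, 6], [14, 5, 2]]
```

Calling `alphabeta(TREE, visited=seen)` with an empty list `seen` returns the same root value as `minimax(TREE)` while evaluating fewer terminals. The sketch cuts off on equality as well as on strictly worse values, which leaves the root value unchanged.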

Alpha-Beta Pruning Example
[Figure: worked example tree annotated with α and β values at each node. Recoverable labels: "Starting α/β", "Raising α", "Lowering β".]
- α is MAX's best alternative here or above
- β is MIN's best alternative here or above

Alpha-Beta Pseudocode
[Figure: pseudocode, not recoverable from the extraction]

Alpha-Beta Pruning Properties
- This pruning has no effect on the final result at the root
- Values of intermediate nodes might be wrong!
- Good child ordering improves the effectiveness of pruning
  - Heuristic: order by evaluation function or based on previous search
- With "perfect ordering":
  - Time complexity drops to $O(b^{m/2})$
  - Doubles solvable depth!
  - Full search of, e.g., chess is still hopeless
- This is a simple example of metareasoning (computing about what to compute)

Expectimax Search Trees
- What if we don't know what the result of an action will be? E.g.,
  - In solitaire, the next card is unknown
  - In minesweeper, mine locations
  - In Pacman, the ghosts act randomly
- Can do expectimax search to maximize average score
  - Chance nodes are like min nodes, except the outcome is uncertain
  - Calculate expected utilities
  - Max nodes as in minimax search
  - Chance nodes take the average (expectation) of the values of their children
- Later, we'll learn how to formalize the underlying problem as a Markov Decision Process
[Figure: tree with a max node above chance nodes]

Expectimax Pseudocode

    def value(s):
        if s is a max node: return maxValue(s)
        if s is an exp node: return expValue(s)
        if s is a terminal node: return evaluation(s)

    def maxValue(s):
        values = [value(s') for s' in successors(s)]
        return max(values)

    def expValue(s):
        values = [value(s') for s' in successors(s)]
        weights = [probability(s, s') for s' in successors(s)]
        return expectation(values, weights)

Expectimax Quantities
[Figure: not recoverable from the extraction]

Expectimax Pruning?
[Figure: not recoverable from the extraction]

Expectimax Search
- Chance nodes
  - Chance nodes are like min nodes, except the outcome is uncertain
  - Calculate expected utilities
  - Chance nodes average successor values (weighted)
- Each chance node has a probability distribution over its outcomes (called a model)
  - For now, assume we're given the model
- Utilities for terminal states
  - Static evaluation functions give us limited-depth search
[Figure labels: "1 search ply"; "Estimate of true expectimax value (which would require a lot of work to compute)"]

Expectimax for Pacman
- Notice that we've gotten away from thinking that the ghosts are trying to minimize Pacman's score
- Instead, they are now a part of the environment
- Pacman has a belief (distribution) over how they will act
- Quiz: Can we see minimax as a special case of expectimax?
- Quiz: What would Pacman's computation look like if we assumed that the ghosts were doing 1-ply minimax and taking the result 80% of the time, otherwise moving randomly?
- If you take this further, you end up calculating belief distributions over your opponents' belief distributions over your belief distributions, etc.
  - Can get unmanageable very quickly!
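The expectimax pseudocode above can be turned into a small runnable sketch. The node encoding is an assumption made for illustration: a terminal is a plain number, a max node is `("max", [children])`, and a chance node is `("exp", [(probability, child), ...])`, so the model (the outcome distribution) is stored directly in the tree.

```python
def value(node):
    """Expectimax value of a node in the hypothetical tuple encoding."""
    if isinstance(node, (int, float)):
        return node                      # terminal: utility / evaluation
    kind, children = node
    if kind == "max":
        return max(value(child) for child in children)
    if kind == "exp":
        # Weighted average of successor values under the given model.
        return sum(p * value(child) for p, child in children)
    raise ValueError(f"unknown node kind: {kind!r}")

# Hypothetical example: MAX chooses between two chance nodes.
TREE = ("max", [
    ("exp", [(0.5, 8), (0.5, 4)]),       # expected value 6.0
    ("exp", [(0.25, 20), (0.75, 0)]),    # expected value 5.0
])
```

Here `value(TREE)` picks the first chance node, even though the second one contains the single best outcome, because the decision is based on the expectation.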


Expectimax Utilities
- For minimax, the scale of the terminal function doesn't matter
  - We just want better states to have higher evaluations (get the ordering right)
  - We call this insensitivity to monotonic transformations
- For expectimax, we need magnitudes to be meaningful

Stochastic Two-Player
- E.g. backgammon
- Expectiminimax (!)
  - Environment is an extra player that moves after each agent
  - Chance nodes take expectations, otherwise like minimax

Stochastic Two-Player
- Dice rolls increase b: 21 possible rolls with 2 dice
  - Backgammon: about 20 legal moves
  - Depth 2 = $20 \times (21 \times 20)^3 \approx 1.2 \times 10^9$
- As depth increases, the probability of reaching a given search node shrinks
  - So the usefulness of search is diminished
  - So limiting depth is less damaging
  - But pruning is trickier
- TDGammon uses depth-2 search + a very good evaluation function + reinforcement learning: world-champion level play
- 1st AI world champion in any game!

Non-Zero-Sum Utilities
- Similar to minimax:
  - Terminals have utility tuples
  - Node values are also utility tuples
  - Each player maximizes its own utility and propagates (or backs up) nodes from children
  - Can give rise to cooperation and competition dynamically
[Figure: three-player game tree with leaf utility tuples (1,6,6), (7,1,2), (6,1,2), (7,2,1), (5,1,7), (1,5,2), (7,7,1), (5,2,5)]

Reinforcement Learning
- Basic idea:
  - Receive feedback in the form of rewards
  - Agent's utility is defined by the reward function
  - Must learn to act so as to maximize expected rewards
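The Expectimax Utilities point above (minimax only needs the ordering of terminal values, expectimax needs their magnitudes) can be checked with a short sketch. The leaf values, the squaring transformation, and the uniform chance model are assumptions chosen for illustration; squaring is monotonic here because all leaves are non-negative.

```python
def best_action(children, backup):
    """Index of the child whose backed-up score is largest.

    children: one list of terminal values per root action.
    backup:   how the opponent/chance layer summarizes a child's leaves.
    """
    scores = [backup(leaves) for leaves in children]
    return scores.index(max(scores))

def mean(values):
    """Uniform expectation, standing in for a chance node."""
    return sum(values) / len(values)

# Hypothetical terminal values under two root actions.
LEAVES = [[0, 40], [20, 30]]
# Monotonic transformation x -> x^2 (order-preserving on non-negative values).
SQUARED = [[x * x for x in leaves] for leaves in LEAVES]

minimax_before = best_action(LEAVES, min)      # MIN layer below the root
minimax_after = best_action(SQUARED, min)
expectimax_before = best_action(LEAVES, mean)  # chance layer below the root
expectimax_after = best_action(SQUARED, mean)
```

With these numbers the minimax choice is the same before and after squaring, while the expectimax choice flips from the second action to the first, because the averages 20 and 25 become 800 and 650.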


Grid World
- The agent lives in a grid
- Walls block the agent's path
- The agent's actions do not always go as planned:
  - 80% of the time, the action North takes the agent North (if there is no wall there)
  - 10% of the time, North takes the agent West; 10% East
  - If there is a wall in the direction the agent would have been taken, the agent stays put
- Small "living" reward each step
- Big rewards come at the end
- Goal: maximize sum of rewards

Grid Futures
[Figure: two panels, "Deterministic Grid World" and "Stochastic Grid World", each showing the actions E, N, S, W; in the stochastic panel the outcome is marked "?"]

Markov Decision Processes
- An MDP is defined by:
  - A set of states s ∈ S
  - A set of actions a ∈ A
  - A transition function T(s, a, s')
    - Probability that a from s leads to s', i.e., P(s' | s, a)
    - Also called the model
  - A reward function R(s, a, s')
    - Sometimes just R(s) or R(s')
  - A start state (or distribution)
  - Maybe a terminal state
- MDPs are a family of nondeterministic search problems
  - Reinforcement learning: MDPs where we don't know the transition or reward functions

What is Markov about MDPs?
- Andrey Markov
- "Markov" generally means that given the present state, the future and the past are independent
- For Markov decision processes, "Markov" means:
  $P(S_{t+1} = s' \mid S_t = s_t, A_t = a_t, S_{t-1} = s_{t-1}, A_{t-1} = a_{t-1}, \ldots, S_0 = s_0) = P(S_{t+1} = s' \mid S_t = s_t, A_t = a_t)$

Solving MDPs
- In deterministic single-agent search problems, we want an optimal plan, or sequence of actions, from start to a goal
- In an MDP, we want an optimal policy π*: S → A
  - A policy π gives an action for each state
  - An optimal policy maximizes expected utility if followed
  - Defines a reflex agent
[Figure: optimal policy when R(s, a, s') = -0.03 for all non-terminals s]

Example Optimal Policies
[Figure: optimal policies for several living rewards. Recoverable panel labels: R(s) = -0.03, R(s) = -0.4; the remaining labels are lost in the extraction.]
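The noisy movement rule from the Grid World slide above (80% intended direction, 10% to each side, stay put when blocked) can be written as a transition function T(s, a, ·). This is a sketch under assumptions: the 4x3 grid size, the wall position, the coordinate convention, and the choice to treat the grid boundary like a wall are all hypothetical and not taken from the slides.

```python
# Unit moves as (dx, dy); y grows to the North.
MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
# The two directions perpendicular to each intended action.
SIDEWAYS = {"N": ("W", "E"), "S": ("E", "W"), "E": ("N", "S"), "W": ("S", "N")}

def step(state, direction, width, height, walls):
    """Deterministic move; the agent stays put if the target is blocked."""
    dx, dy = MOVES[direction]
    target = (state[0] + dx, state[1] + dy)
    inside = 0 <= target[0] < width and 0 <= target[1] < height
    return target if inside and target not in walls else state

def transition(state, action, width=4, height=3, walls=frozenset({(1, 1)})):
    """Return {next_state: probability} for the 80/10/10 noise model.

    Grid size and wall location are illustrative defaults (assumptions).
    """
    side_a, side_b = SIDEWAYS[action]
    probs = {}
    for direction, p in ((action, 0.8), (side_a, 0.1), (side_b, 0.1)):
        nxt = step(state, direction, width, height, walls)
        probs[nxt] = probs.get(nxt, 0.0) + p   # merge outcomes that coincide
    return probs
```

For example, `transition((0, 0), "N")` spreads probability over the cell to the North, the cell to the East, and the current cell (the West move runs into the boundary).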

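Example: High-Low
- Three card types: 2, 3, 4
- Infinite deck, twice as many 2's
- Start with 3 showing
- After each card, you say "high" or "low"
- New card is flipped
- If you're right, you win the points shown on the new card
- Ties are no-ops
- If you're wrong, the game ends
- Differences from expectimax:
  - #1: get rewards as you go
  - #2: you might play forever!

High-Low as an MDP
- States: 2, 3, 4, done
- Actions: High, Low
- Model: T(s, a, s'):
  - P(s'=4 | 4, Low) = 1/4
  - P(s'=3 | 4, Low) = 1/4
  - P(s'=2 | 4, Low) = 1/2
  - P(s'=done | 4, Low) = 0
  - P(s'=4 | 4, High) = 1/4
  - P(s'=3 | 4, High) = 0
  - P(s'=2 | 4, High) = 0
  - P(s'=done | 4, High) = 3/4
- Rewards: R(s, a, s'):
  - Number shown on s' if s ≠ s' and a is correct
  - 0 otherwise
- Start: 3

Example: High-Low
[Figure: search tree with actions Low and High. Recoverable edge labels: T = 0.5, R = 2; T = 0.25, R = 3; T = 0, R = 4; T = 0.25, R = 0.]

MDP Search Trees
- Each MDP state gives an expectimax-like search tree
  - s is a state
  - (s, a) is a q-state
  - (s, a, s') is called a transition, with T(s, a, s') = P(s' | s, a) and reward R(s, a, s')

Utilities of Sequences
- In order to formalize optimality of a policy, we need to understand utilities of sequences of rewards
- Typically consider stationary preferences:
  $[r, r_0, r_1, r_2, \ldots] \succ [r, r_0', r_1', r_2', \ldots] \iff [r_0, r_1, r_2, \ldots] \succ [r_0', r_1', r_2', \ldots]$
- Theorem: only two ways to define stationary utilities
  - Additive utility: $U([r_0, r_1, r_2, \ldots]) = r_0 + r_1 + r_2 + \cdots$
  - Discounted utility: $U([r_0, r_1, r_2, \ldots]) = r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots$

Infinite Utilities?!
- Problem: infinite state sequences have infinite rewards
- Solutions:
  - Finite horizon:
    - Terminate episodes after a fixed T steps (e.g. life)
    - Gives nonstationary policies (π depends on time left)
  - Absorbing state: guarantee that for every policy, a terminal state will eventually be reached (like "done" for High-Low)
  - Discounting: for 0 < γ < 1,
    $U([r_0, \ldots, r_\infty]) = \sum_{t=0}^{\infty} \gamma^t r_t \le R_{\max} / (1 - \gamma)$
    - Smaller γ means a smaller "horizon": shorter-term focus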
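The High-Low model and the discounted utility above can both be written down directly. The sketch below derives T and R from the game rules rather than listing them by hand; the function names, the `(probability, next_state, reward)` triple format, and the use of exact fractions are assumptions made for illustration.

```python
from fractions import Fraction

# Infinite deck with twice as many 2's as 3's or 4's.
CARD_PROBS = {2: Fraction(1, 2), 3: Fraction(1, 4), 4: Fraction(1, 4)}

def high_low_outcomes(state, action):
    """List of (probability, next_state, reward) for a showing card and a guess.

    state:  the card currently showing (2, 3 or 4).
    action: "High" or "Low".
    """
    outcomes = {}
    for card, p in CARD_PROBS.items():
        if card == state:
            key = (card, 0)                      # tie: no-op, no reward
        elif (card > state) == (action == "High"):
            key = (card, card)                   # correct guess: win the new card's points
        else:
            key = ("done", 0)                    # wrong guess: game ends
        outcomes[key] = outcomes.get(key, 0) + p
    return [(p, nxt, reward) for (nxt, reward), p in outcomes.items()]

def discounted_return(rewards, gamma):
    """U([r_0, r_1, ...]) = r_0 + gamma * r_1 + gamma^2 * r_2 + ..."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))
```

For state 4 the generated probabilities agree with the eight entries listed on the slide, and `discounted_return([1, 1, 1], 0.5)` weights the three rewards by 1, 0.5 and 0.25.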


Discounting
- Typically discount rewards by γ < 1 each time step
  - Sooner rewards have higher utility than later rewards
  - Also helps the algorithms converge

Recap: Defining MDPs
- Markov decision processes:
  - States S
  - Start state s_0
  - Actions A
  - Transitions P(s' | s, a) (or T(s, a, s'))
  - Rewards R(s, a, s') (and discount γ)
- MDP quantities so far:
  - Policy = choice of action for each state
  - Utility (or return) = sum of discounted rewards

Optimal Utilities
- Fundamental operation: compute the values (optimal expectimax utilities) of states s
- Why? Optimal values define optimal policies!
- Define the value of a state s:
  V*(s) = expected utility starting in s and acting optimally
- Define the value of a q-state (s, a):
  Q*(s, a) = expected utility starting in s, taking action a and thereafter acting optimally
- Define the optimal policy:
  π*(s) = optimal action from state s

The Bellman Equations
- The definition of "optimal utility" leads to a simple one-step lookahead relationship amongst optimal utility values:
  optimal rewards = maximize over the first action and then follow the optimal policy
- Formally:
  $V^*(s) = \max_a Q^*(s, a)$
  $Q^*(s, a) = \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^*(s') \right]$
  $V^*(s) = \max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^*(s') \right]$

Value Estimates
- Calculate estimates V_k*(s)
  - Not the optimal value of s!
  - The optimal value considering only the next k time steps (k rewards)
  - As k → ∞, it approaches the optimal value
- Almost solution: recursion (i.e. expectimax)
- Correct solution: dynamic programming
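The Value Estimates slide above points at dynamic programming over the one-step lookahead relationship. A minimal sketch of that idea follows, computing the k-step estimates V_k by repeated Bellman backups starting from V_0 = 0. The dictionary encoding of the MDP, the function names, and the two-state example with γ = 0.5 are assumptions for illustration, not material from the slides.

```python
def q_value(mdp, values, state, action, gamma):
    """One-step lookahead: sum over s' of T(s,a,s') * [R(s,a,s') + gamma * V(s')]."""
    return sum(p * (reward + gamma * values[nxt])
               for p, nxt, reward in mdp[state][action])

def value_iteration(mdp, gamma, iterations):
    """Return V_k after `iterations` Bellman backups, starting from V_0 = 0."""
    values = {state: 0.0 for state in mdp}
    for _ in range(iterations):
        # Build V_{k+1} entirely from V_k (no in-place updates).
        values = {state: max(q_value(mdp, values, state, action, gamma)
                             for action in mdp[state])
                  for state in mdp}
    return values

def greedy_policy(mdp, values, gamma):
    """Pick, in every state, the action with the best one-step lookahead value."""
    return {state: max(mdp[state],
                       key=lambda action: q_value(mdp, values, state, action, gamma))
            for state in mdp}

# Hypothetical MDP: mdp[state][action] = [(probability, next_state, reward), ...]
MDP = {
    "A": {"stay": [(1.0, "A", 0.5)],
          "go":   [(0.5, "B", 0.0), (0.5, "A", 0.0)]},
    "B": {"stay": [(1.0, "B", 2.0)],
          "go":   [(1.0, "A", 0.0)]},
}
```

In this example V_1 only sees immediate rewards, so it prefers "stay" in A, while after many iterations the estimates settle near V*(B) = 4 and V*(A) = 4/3 and the greedy policy in A switches to "go". Each V_k is computed once per state and reused, which is what separates this from re-expanding the expectimax tree recursively.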
