Maximum Expected Utility. CS 188: Artificial Intelligence, Fall 2011. Preferences. MEU Principle. Rational Preferences. Utilities: Uncertain Outcomes



CS 188: Artificial Intelligence, Fall 2011
Lecture 8: Utilities / MDPs
9/20/2011
Dan Klein, UC Berkeley
Many slides over the course adapted from either Stuart Russell or Andrew Moore

Maximum Expected Utility
Why should we average utilities? Why not minimax?
Principle of maximum expected utility: A rational agent should choose the action which maximizes its expected utility, given its knowledge
Questions:
  Where do utilities come from?
  How do we know such utilities even exist?
  Why are we taking expectations of utilities (not, e.g., minimax)?
  What if our behavior can't be described by utilities?

Utilities: Uncertain Outcomes
Going to airport from home
[figure: outcome tree with branches "Get Double" / "Get Single" and "Oops" / "Whew"]

Preferences
An agent chooses among:
  Prizes: A, B, etc.
  Lotteries: situations with uncertain prizes
Notation:
  A ≻ B: A is preferred to B
  A ~ B: indifference between A and B

Rational Preferences
Preferences of a rational agent must obey constraints.
The axioms of rationality: orderability, transitivity, continuity, substitutability, monotonicity
Theorem: Rational preferences imply behavior describable as maximization of expected utility

MEU Principle
Theorem [Ramsey, 1931; von Neumann & Morgenstern, 1944]: Given any preferences satisfying these constraints, there exists a real-valued function U such that:
  U(A) ≥ U(B) ⇔ A ≽ B
  U([p_1, S_1; ...; p_n, S_n]) = Σ_i p_i U(S_i)
Maximum expected utility (MEU) principle: Choose the action that maximizes expected utility
Note: an agent can be entirely rational (consistent with MEU) without ever representing or manipulating utilities and probabilities
  E.g., a lookup table for perfect tic-tac-toe, a reflex vacuum cleaner
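The MEU principle and the "why not minimax?" question can be made concrete with a small sketch. Everything specific here is an assumption for illustration: the action names, the probabilities, and the utility values are hypothetical and do not come from the slides. Each action is modeled as a lottery, a list of (probability, utility) pairs.

```python
def expected_utility(lottery):
    """Expected utility of a lottery given as [(probability, utility), ...]."""
    return sum(p * u for p, u in lottery)


def meu_action(actions):
    """MEU principle: pick the action whose lottery has the highest expected utility."""
    return max(actions, key=lambda a: expected_utility(actions[a]))


def worst_case_action(actions):
    """Minimax-style alternative: pick the action with the best worst-case utility."""
    return max(actions, key=lambda a: min(u for _, u in actions[a]))


# Hypothetical numbers (not from the slides): two ways of going to the airport.
actions = {
    "leave_early": [(1.0, 0.7)],             # always on time, but a long wait
    "leave_late": [(0.9, 1.0), (0.1, 0.0)],  # usually ideal, occasionally a missed flight
}

meu_choice = meu_action(actions)
cautious_choice = worst_case_action(actions)
```

With these made-up numbers the two rules disagree: averaging favors `leave_late`, while worst-case reasoning favors `leave_early`, which is the tension the opening questions point at.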

Utility Scales
Normalized utilities: u+ = 1.0, u- = 0.0
Micromorts: one-millionth chance of death, useful for paying to reduce product risks, etc.
QALYs: quality-adjusted life years, useful for medical decisions involving substantial risk
Note: behavior is invariant under positive linear transformation
With deterministic prizes only (no lottery choices), only ordinal utility can be determined, i.e., total order on prizes

Human Utilities
Utilities map states to real numbers. Which numbers?
Standard approach to assessment of human utilities:
  Compare a state A to a standard lottery L_p between
    "best possible prize" u+ with probability p
    "worst possible catastrophe" u- with probability 1-p
  Adjust lottery probability p until A ~ L_p
  Resulting p is a utility in [0,1]

Money
Money does not behave as a utility function, but we can talk about the utility of having money (or being in debt)
Given a lottery L = [p, $X; (1-p), $Y]
  The expected monetary value EMV(L) is p*X + (1-p)*Y
  U(L) = p*U($X) + (1-p)*U($Y)
  Typically, U(L) < U(EMV(L)): why?
  In this sense, people are risk-averse
  When deep in debt, we are risk-prone
Utility curve: for what probability p am I indifferent between:
  Some sure outcome x
  A lottery [p, $M; (1-p), $0], M large

Example: Insurance
Consider the lottery [0.5, $1000; 0.5, $0]
  What is its expected monetary value? ($500)
  What is its certainty equivalent?
    Monetary value acceptable in lieu of lottery
    $400 for most people
  Difference of $100 is the insurance premium
    There's an insurance industry because people will pay to reduce their risk
    If everyone were risk-neutral, no insurance needed!

Example: Human Rationality?
Famous example of Allais (1953)
  A: [0.8, $4k; 0.2, $0]
  B: [1.0, $3k; 0.0, $0]
  C: [0.2, $4k; 0.8, $0]
  D: [0.25, $3k; 0.75, $0]
Most people prefer B > A, C > D
But if U($0) = 0, then
  B > A ⇒ U($3k) > 0.8 U($4k)
  C > D ⇒ 0.8 U($4k) > U($3k)

Non-Deterministic Search
How do you plan when your actions might fail?
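The insurance and Allais examples can be checked numerically. The sketch below is illustrative only: the utility function U(x) = sqrt(x) is an assumption chosen because it is concave (risk-averse), not something the slides specify, so its certainty equivalent differs from the "$400 for most people" figure. The helper names are hypothetical.

```python
import math


def expected_monetary_value(lottery):
    """EMV of a lottery given as [(probability, dollars), ...]."""
    return sum(p * x for p, x in lottery)


def expected_utility(lottery, utility):
    """U(L) = sum of p * U(x) over the outcomes of the lottery."""
    return sum(p * utility(x) for p, x in lottery)


def certainty_equivalent(lottery, utility, inverse_utility):
    """Sure amount whose utility equals the lottery's expected utility."""
    return inverse_utility(expected_utility(lottery, utility))


def square(u):
    """Inverse of sqrt on non-negative numbers."""
    return u * u


# Insurance lottery from the slides; U(x) = sqrt(x) is an assumed utility curve.
insurance = [(0.5, 1000.0), (0.5, 0.0)]
emv = expected_monetary_value(insurance)                  # 500.0
ce = certainty_equivalent(insurance, math.sqrt, square)   # approximately 250
premium = emv - ce                                        # what this agent would pay to avoid the risk

# Allais lotteries from the slides.
A = [(0.8, 4000.0), (0.2, 0.0)]
B = [(1.0, 3000.0), (0.0, 0.0)]
C = [(0.2, 4000.0), (0.8, 0.0)]
D = [(0.25, 3000.0), (0.75, 0.0)]


def shows_allais_pattern(utility):
    """True if this utility function prefers both B over A and C over D."""
    prefers_b = expected_utility(B, utility) > expected_utility(A, utility)
    prefers_c = expected_utility(C, utility) > expected_utility(D, utility)
    return prefers_b and prefers_c


# Three assumed utility curves with U(0) = 0: none reproduces the common human pattern.
candidates = [math.sqrt, math.log1p, lambda x: x]
pattern_found = any(shows_allais_pattern(u) for u in candidates)
```

The Allais result is not an accident of the chosen curves: EU(C) - EU(D) equals 0.25 times (EU(A) - EU(B)) whenever U($0) = 0, so the two preferences always have opposite signs for an expected-utility maximizer.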


Example: Grid World [DEMO Gridworld Intro]
The agent lives in a grid
Walls block the agent's path
The agent's actions do not always go as planned:
  80% of the time, the action North takes the agent North (if there is no wall there)
  10% of the time, North takes the agent West; 10% East
  If there is a wall in the direction the agent would have been taken, the agent stays put
Small "living" reward each step
Big rewards come at the end
Goal: maximize sum of rewards*

Action Results
[figure: Deterministic Grid World vs. Stochastic Grid World, showing the outcomes of actions E, N, S, W]

Markov Decision Processes
An MDP is defined by:
  A set of states s ∈ S
  A set of actions a ∈ A
  A transition function T(s, a, s')
    Prob that a from s leads to s', i.e., P(s' | s, a)
    Also called the model
  A reward function R(s, a, s')
    Sometimes just R(s) or R(s')
  A start state (or distribution)
  Maybe a terminal state
MDPs are a family of nondeterministic search problems
  One way to solve them is with expectimax search, but we'll have a new tool soon

What is Markov about MDPs?
Andrey Markov (1856-1922)
"Markov" generally means that given the present state, the future and the past are independent
For Markov decision processes, "Markov" means:
  P(S_{t+1} = s' | S_t = s_t, A_t = a_t, S_{t-1} = s_{t-1}, A_{t-1} = a_{t-1}, ..., S_0 = s_0) = P(S_{t+1} = s' | S_t = s_t, A_t = a_t)

Solving MDPs
In deterministic single-agent search problems, want an optimal plan, or sequence of actions, from start to a goal
In an MDP, we want an optimal policy π*: S → A
  A policy π gives an action for each state
  An optimal policy maximizes expected utility if followed
  Defines a reflex agent (if precomputed)
Optimal policy when R(s, a, s') = -0.03 for all non-terminals s [Demo]

Example Optimal Policies
[figure: optimal policies for different living rewards, including R(s) = -0.03 and R(s) = -0.4]
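The noisy movement rule of the grid world is exactly a transition function T(s, a, s'), which a short sketch can spell out. The grid size (4 by 3), the single wall at (1, 1), the coordinate convention, and all function names are assumptions for illustration; the slides only give the 80/10/10 rule and the stay-put behavior. Rewards and terminal states are left out.

```python
# Unit moves, with (x, y) coordinates and y increasing to the North (an assumed convention).
MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
# The two directions perpendicular to each intended action.
SIDEWAYS = {"N": ("W", "E"), "S": ("E", "W"), "E": ("N", "S"), "W": ("S", "N")}


def move(state, direction, width, height, walls):
    """Deterministic result of moving one cell; the agent stays put if blocked."""
    x, y = state
    dx, dy = MOVES[direction]
    nxt = (x + dx, y + dy)
    inside = 0 <= nxt[0] < width and 0 <= nxt[1] < height
    if not inside or nxt in walls:
        return state
    return nxt


def transition(state, action, width=4, height=3, walls=frozenset({(1, 1)})):
    """Return {next_state: probability}, i.e. T(s, a, .) for the noisy grid world.

    80% of the time the intended direction is taken, 10% each perpendicular one.
    The default grid layout is a hypothetical example, not taken from the slides.
    """
    left, right = SIDEWAYS[action]
    probs = {}
    for direction, p in ((action, 0.8), (left, 0.1), (right, 0.1)):
        nxt = move(state, direction, width, height, walls)
        probs[nxt] = probs.get(nxt, 0.0) + p
    return probs


corner = transition((0, 0), "N")      # slipping West hits the boundary, so the agent stays put
below_wall = transition((1, 0), "N")  # the intended move is blocked by the wall
```

Probabilities for outcomes that coincide (for example, several blocked moves that all leave the agent in place) are accumulated, so each returned distribution sums to 1.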

Example: High-Low
Three card types: 2, 3, 4
Infinite deck, twice as many 2's
Start with 3 showing
After each card, you guess the next card will be "high" or "low"
New card is flipped
If you're right, you win the points shown on the new card
Ties are no-ops
If you're wrong, game ends
How is this different from the chance games in last lecture?
  #1: get rewards as you go
  #2: you might play forever!
You can patch expectimax to deal with #1, but not #2

High-Low as an MDP
States: 2, 3, 4, done
Actions: High, Low
Model: T(s, a, s'):
  P(s'=4 | 4, Low) = 1/4
  P(s'=3 | 4, Low) = 1/4
  P(s'=2 | 4, Low) = 1/2
  P(s'=done | 4, Low) = 0
  P(s'=4 | 4, High) = 1/4
  P(s'=3 | 4, High) = 0
  P(s'=2 | 4, High) = 0
  P(s'=done | 4, High) = 3/4
Rewards: R(s, a, s'):
  Number shown on s' if the guess is right
  0 otherwise
Start: 3

High-Low: Outcome Tree
[figure: outcome tree with actions Low and High; branch labels T = 0.5, R = 2; T = 0.25, R = 3; T = 0, R = 4; T = 0.25, R = 0]

MDP Search Trees
Each MDP state gives an expectimax-like search tree
  s is a state
  (s, a) is a q-state
  (s, a, s') called a transition
  T(s, a, s') = P(s' | s, a)
  R(s, a, s')

Utilities of Sequences
What utility does a sequence of rewards have?
Formally, we generally assume stationary preferences:
  [r, r_0, r_1, r_2, ...] ≻ [r, r_0', r_1', r_2', ...] ⇔ [r_0, r_1, r_2, ...] ≻ [r_0', r_1', r_2', ...]
Theorem: only two ways to define stationary utilities
  Additive utility: U([r_0, r_1, r_2, ...]) = r_0 + r_1 + r_2 + ...
  Discounted utility: U([r_0, r_1, r_2, ...]) = r_0 + γ r_1 + γ^2 r_2 + ...

Infinite Utilities?!
Problem: infinite state sequences have infinite rewards
Solutions:
  Finite horizon:
    Terminate episodes after a fixed T steps (e.g. life)
    Gives nonstationary policies (π depends on time left)
  Absorbing state: guarantee that for every policy, a terminal state will eventually be reached (like "done" for High-Low)
  Discounting: for 0 < γ < 1
    U([r_0, ..., r_∞]) = Σ_{t=0}^{∞} γ^t r_t ≤ R_max / (1 - γ)
    Smaller γ means smaller "horizon": shorter-term focus
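The High-Low model table can be generated from the rules instead of being written out by hand. The sketch below is one possible encoding under stated assumptions: the card frequencies (2 twice as likely as 3 or 4) follow the "twice as many 2's" rule, a tie keeps the game going with zero reward, and the function names are hypothetical. `Fraction` is used so the probabilities match the slide's fractions exactly.

```python
from fractions import Fraction

# Infinite deck with twice as many 2's as 3's or 4's.
CARD_PROBS = {2: Fraction(1, 2), 3: Fraction(1, 4), 4: Fraction(1, 4)}
DONE = "done"


def transitions(state, action):
    """Return {next_state: probability} for guessing "High" or "Low" at `state`."""
    if state == DONE:
        return {DONE: Fraction(1)}  # assumed: "done" is absorbing
    out = {}
    for card, p in CARD_PROBS.items():
        if card == state:
            nxt = card   # tie: no-op, the game continues on the same number
        elif (card > state) == (action == "High"):
            nxt = card   # correct guess: the new card is now showing
        else:
            nxt = DONE   # wrong guess: game ends
        out[nxt] = out.get(nxt, Fraction(0)) + p
    return out


def reward(state, action, next_state):
    """Points on the new card after a correct guess; 0 for ties and for losing."""
    if next_state == DONE or next_state == state:
        return 0
    return next_state


low_from_4 = transitions(4, "Low")
high_from_4 = transitions(4, "High")
```

Outcomes with probability 0 are simply absent from the returned dictionaries, so `low_from_4` has no `"done"` entry and `high_from_4` has no entries for 2 or 3.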

Discounting
Typically discount rewards by γ < 1 each time step
  Sooner rewards have higher utility than later rewards
  Also helps the algorithms converge
Example: discount of 0.5
  U([1,2,3]) = 1*1 + 0.5*2 + 0.25*3
  U([1,2,3]) < U([3,2,1])

Recap: Defining MDPs
Markov decision processes:
  States S
  Start state s_0
  Actions A
  Transitions P(s' | s, a) (or T(s, a, s'))
  Rewards R(s, a, s') (and discount γ)
MDP quantities so far:
  Policy = Choice of action for each state
  Utility (or return) = expectimax value of state

Optimal Utilities
Fundamental operation: compute the values (optimal expectimax utilities) of states s
Why? Optimal values define optimal policies!
Define the value of a state s:
  V*(s) = expected utility starting in s and acting optimally
Define the value of a q-state (s, a):
  Q*(s, a) = expected utility starting out having taken action a from state s and (thereafter) acting optimally
Define the optimal policy:
  π*(s) = optimal action from state s
[DEMO Grid Values]

The Bellman Equations
Definition of "optimal utility" leads to a simple one-step lookahead relationship amongst optimal utility values:
  Optimal rewards = maximize over first action and then follow optimal policy
Formally:
  V*(s) = max_a Q*(s, a)
  Q*(s, a) = Σ_{s'} T(s, a, s') [R(s, a, s') + γ V*(s')]
  V*(s) = max_a Σ_{s'} T(s, a, s') [R(s, a, s') + γ V*(s')]

Why Not Search Trees?
Why not solve with expectimax?
Problems:
  This tree is usually infinite (why?)
  Same states appear over and over (why?)
  We would search once per state (why?)
Idea: Value iteration
  Compute optimal values for all states all at once using successive approximations
  Will be a bottom-up dynamic program similar in cost to memoization
  Do all planning offline, no replanning needed!
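The discounting example and the value iteration idea can be sketched in a few lines. This is a minimal illustration, not the course's implementation: the MDP encoding (a dictionary mapping each state and action to a list of (probability, next state, reward) triples), the three-state example, and the fixed iteration count are all assumptions made here.

```python
def discounted_utility(rewards, gamma):
    """U([r_0, r_1, ...]) = r_0 + gamma * r_1 + gamma^2 * r_2 + ..."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


def q_value(mdp, values, state, action, gamma):
    """One-step lookahead: sum over s' of T(s,a,s') * (R(s,a,s') + gamma * V(s'))."""
    return sum(p * (r + gamma * values[s2]) for p, s2, r in mdp[state][action])


def value_iteration(mdp, gamma, iterations=100):
    """Repeatedly apply the Bellman update to all states at once, starting from 0."""
    values = {s: 0.0 for s in mdp}
    for _ in range(iterations):
        values = {
            s: max((q_value(mdp, values, s, a, gamma) for a in mdp[s]), default=0.0)
            for s in mdp
        }
    return values


def greedy_policy(mdp, values, gamma):
    """Extract the action that maximizes the one-step lookahead in each non-terminal state."""
    return {
        s: max(mdp[s], key=lambda a: q_value(mdp, values, s, a, gamma))
        for s in mdp
        if mdp[s]
    }


# Discounting example from the slides, with a discount of 0.5.
u_123 = discounted_utility([1, 2, 3], 0.5)  # 1*1 + 0.5*2 + 0.25*3 = 2.75
u_321 = discounted_utility([3, 2, 1], 0.5)  # 3*1 + 0.5*2 + 0.25*1 = 4.25

# Hypothetical three-state MDP (not from the slides); "end" is terminal (no actions).
mdp = {
    "A": {"stay": [(1.0, "A", 1.0)], "go": [(0.8, "B", 1.0), (0.2, "end", 0.0)]},
    "B": {"stay": [(1.0, "B", 2.0)]},
    "end": {},
}
values = value_iteration(mdp, gamma=0.5)
policy = greedy_policy(mdp, values, gamma=0.5)
```

In this toy problem staying in B forever is worth 2 / (1 - 0.5) = 4, so from A the risky `go` action (worth 0.8 * (1 + 0.5 * 4) = 2.4) beats staying put (worth 2 if repeated forever). Each state is updated once per sweep regardless of how often it would reappear in a search tree, which is the point of the "Why Not Search Trees?" slide.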
