Announcements. Maximizing Expected Utility. Preferences. Rational Preferences. Introduction to Artificial Intelligence

Jason Charles
6 years ago
Introduction to Artificial Intelligence, V22.0472-001, Fall 2009. Lecture 8: Utilities

Announcements
- Will have Assignment 1 graded by Wed.
- Assignment 2 is up on the webpage. Due on Mon 19th October (2 weeks).

Rob Fergus, Dept. of Computer Science, Courant Institute, NYU. Many slides from Dan Klein, Stuart Russell or Andrew Moore.

Maximizing Expected Utility
- Principle of maximum expected utility: a rational agent should choose the action that maximizes its expected utility, given its knowledge.
- Where do utilities come from?
- How do we know such utilities even exist?
- Why are we taking expectations of utilities?
- What if our behavior can't be described by utilities?

Preferences
- An agent chooses among:
  - Outcomes: A, B, etc.
  - Lotteries: situations with uncertain prizes
- Notation: A > B means A is preferred to B; A ~ B means the agent is indifferent between A and B; [p, A; (1-p), B] is a lottery that yields A with probability p and B otherwise.

Rational Preferences
- We want some constraints on preferences before we call them rational, for example transitivity: (A > B) ∧ (B > C) ⇒ (A > C)
- An agent with intransitive preferences can be induced to give away all of its money:
  - If B > C, then an agent with C would pay (say) 1 cent to get B
  - If A > B, then an agent with B would pay (say) 1 cent to get A
  - If C > A, then an agent with A would pay (say) 1 cent to get C

Rational Preferences
- Preferences of a rational agent must obey constraints: the axioms of rationality.
- Theorem: rational preferences imply behavior describable as maximization of expected utility.
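The money-pump argument on the Rational Preferences slide can be traced step by step. The following is a minimal sketch, not part of the original slides; the function name `money_pump`, the dictionary encoding of preferences, and the 1-cent price per trade are illustrative assumptions.

```python
def money_pump(prefers, start_item, start_cents, rounds):
    """Simulate repeated trades against an agent with strict preferences.

    prefers maps the item the agent holds to the item it strictly prefers
    (hypothetical encoding). Each trade costs the agent 1 cent.
    """
    item, cents = start_item, start_cents
    for _ in range(rounds):
        if cents < 1:
            break  # the agent has nothing left to pay with
        item = prefers[item]  # agent swaps to the item it prefers
        cents -= 1            # and pays 1 cent for the privilege
    return item, cents


# Intransitive (cyclic) preferences from the slide: B > C, A > B, C > A.
CYCLIC = {"C": "B", "B": "A", "A": "C"}

# After three trades the agent holds C again but is 3 cents poorer.
item, cents = money_pump(CYCLIC, "C", 100, 3)
print(item, cents)
```

Running the loop for more rounds drains the agent completely, which is the sense in which intransitive preferences are called irrational.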

MEU Principle
- Theorem [Ramsey, 1931; von Neumann & Morgenstern, 1944]: given any preferences satisfying these axioms, there exists a real-valued function U such that:
  - U(A) ≥ U(B) if and only if A is preferred or indifferent to B
  - U([p_1, S_1; ...; p_n, S_n]) = Σ_i p_i U(S_i)
- Maximum expected utility (MEU) principle: choose the action that maximizes expected utility.
- Note: an agent can be entirely rational (consistent with MEU) without ever representing or manipulating utilities and probabilities; this is about behavior.
  - E.g., a lookup table for perfect tic-tac-toe

Human Utilities
- Utilities map states to real numbers. Which numbers?
- Standard approach to assessment of human utilities:
  - Compare a state A to a standard lottery L_p between
    - the best possible prize u+ with probability p
    - the worst possible catastrophe u- with probability 1-p
  - Adjust lottery probability p until A ~ L_p
  - The resulting p is a utility in [0,1]

Utility Scales
- Normalized utilities: u+ = 1.0, u- = 0.0
- Micromorts: one-millionth chance of death, useful for paying to reduce product risks, etc. Worth about $20.
- QALYs: quality-adjusted life years, useful for medical decisions involving substantial risk
- Note: behavior is invariant under positive linear transformation
- With deterministic prizes only (no lottery choices), only ordinal utility can be determined, i.e., a total order on prizes

Money
- Money does not behave as a utility function, but we can talk about the utility of having money (or being in debt)
- Given a lottery L = [p, $X; (1-p), $Y]:
  - The expected monetary value is EMV(L) = p*X + (1-p)*Y
  - U(L) = p*U($X) + (1-p)*U($Y)
  - Typically, U(L) < U(EMV(L)): people prefer a sure thing
  - In this sense, people are risk-averse
  - When deep in debt (or in Vegas), we are risk-prone
- Utility curve: for what probability p am I indifferent between:
  - Some sure outcome x
  - A lottery [p, $M; (1-p), $0]

Example: Insurance
- Because people ascribe different utilities to different amounts of money, insurance agreements can increase both parties' expected utility.
- John owns a car. His lottery: LJ = [0.8, $0; 0.2, -$200], i.e., a 20% chance of crashing.
- John is risk-averse. He does not want -$200!
- John's utility:

  Amount   UJ
  $0       0
  -$50     -150
  -$200    -1000

- Insurance company buys the risk: LI = [0.8, $50; 0.2, -$150], i.e., $50 revenue + John's LJ
- Insurer is risk-neutral: U(L) = U(EMV(L))
- UJ(LJ) = 0.2*UJ(-$200) = -200, while UJ(-$50) = -150, so John prefers paying $50 for insurance
- UI(LI) = U(0.8*50 + 0.2*(-150)) = U($10) > U($0), so the insurer also gains
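The insurance arithmetic can be checked numerically. This is a minimal sketch under stated assumptions: John's utility table is taken from the slide (with UJ(-$200) = -1000 inferred from 0.2*UJ(-$200) = -200), the insurer's utility is assumed to be the identity on dollars, and the helper names are hypothetical.

```python
def expected_utility(lottery, utility):
    """Expected utility of a lottery given as [(probability, amount), ...]."""
    return sum(p * utility(amount) for p, amount in lottery)


# John's utilities for the three amounts on the slide (risk-averse).
U_JOHN = {0: 0, -50: -150, -200: -1000}


def u_john(amount):
    return U_JOHN[amount]


def u_insurer(amount):
    # Assumption: a risk-neutral insurer values money linearly.
    return amount


L_JOHN = [(0.8, 0), (0.2, -200)]      # uninsured: 20% chance of losing $200
L_INSURER = [(0.8, 50), (0.2, -150)]  # premium of $50 plus John's risk

uninsured = expected_utility(L_JOHN, u_john)           # about -200
insured = u_john(-50)                                  # -150: pay the premium for sure
insurer_gain = expected_utility(L_INSURER, u_insurer)  # about 10, the EMV of LI

print(uninsured, insured, insurer_gain)
```

Both parties end up with higher expected utility than without the contract: John moves from about -200 to -150, and the insurer from 0 to about 10.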

Are Humans Rational?
- Famous example of Allais (1953):
  - A: [0.8, $4k; 0.2, $0]
  - B: [1.0, $3k; 0.0, $0]
  - C: [0.2, $4k; 0.8, $0]
  - D: [0.25, $3k; 0.75, $0]
- Most people prefer B > A, C > D
- But if U($0) = 0, then
  - B > A ⇒ U($3k) > 0.8 U($4k)
  - C > D ⇒ 0.8 U($4k) > U($3k)
- One explanation: people don't want to feel regret.

Reinforcement Learning
- Basic idea:
  - Receive feedback in the form of rewards
  - Agent's utility is defined by the reward function
  - Must learn to act so as to maximize expected rewards
  - Change the rewards, change the learned behavior

Grid World
- The agent lives in a grid
- Walls block the agent's path
- The agent's actions do not always go as planned:
  - 80% of the time, the action North takes the agent North (if there is no wall there)
  - 10% of the time, North takes the agent West; 10% East
  - If there is a wall in the direction the agent would have been taken, the agent stays put
  - Same for other directions
- Big rewards come at the end

Markov Decision Processes
- An MDP is defined by:
  - A set of states s ∈ S
  - A set of actions a ∈ A
  - A transition function T(s, a, s')
    - Probability that a from s leads to s', i.e., P(s' | s, a)
    - Also called the model
  - A reward function R(s, a, s')
    - Sometimes just R(s) or R(s')
  - A start state (or distribution)
  - Maybe a terminal state
- MDPs are a family of nondeterministic search problems
- Reinforcement learning: MDPs where we don't know the transition or reward functions

Solving MDPs
- In deterministic single-agent search problems, we want an optimal plan, or sequence of actions, from start to a goal
- In an MDP, we want an optimal policy π*(s)
  - A policy gives an action for each state
  - An optimal policy maximizes expected utility if followed
  - Defines a reflex agent
- [Figure: optimal policy when R(s, a, s') is constant for all non-terminal states s]

Example Optimal Policies
- [Figure: optimal grid-world policies for different per-step rewards; recoverable panel labels: R(s) = -0.03, R(s) = -0.4]
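The noisy Grid World dynamics can be written as a transition function T(s, a, s'). The following is a minimal sketch, not the course's own code; the (row, column) state encoding, the treatment of the grid boundary as a wall, and the names `MOVES`, `SLIPS` and `transition` are assumptions made for illustration.

```python
# Moves as (row delta, column delta); row 0 is the top row (assumed layout).
MOVES = {"North": (-1, 0), "South": (1, 0), "East": (0, 1), "West": (0, -1)}

# The two perpendicular directions the agent may slip into for each action.
SLIPS = {
    "North": ("West", "East"),
    "South": ("East", "West"),
    "East": ("North", "South"),
    "West": ("South", "North"),
}


def transition(state, action, n_rows, n_cols, walls=frozenset()):
    """Return {next_state: probability} for taking `action` in `state`.

    80% of the time the intended move happens, 10% each for the two
    perpendicular slips. A move into a wall (or off the grid, which this
    sketch treats like a wall) leaves the agent where it is.
    """
    outcomes = [(action, 0.8), (SLIPS[action][0], 0.1), (SLIPS[action][1], 0.1)]
    dist = {}
    for direction, prob in outcomes:
        d_row, d_col = MOVES[direction]
        nxt = (state[0] + d_row, state[1] + d_col)
        in_grid = 0 <= nxt[0] < n_rows and 0 <= nxt[1] < n_cols
        if not in_grid or nxt in walls:
            nxt = state  # blocked: the agent stays put
        dist[nxt] = dist.get(nxt, 0.0) + prob
    return dist


# Hypothetical 3x4 grid with a single wall cell at (1, 1).
print(transition((2, 0), "North", 3, 4, walls={(1, 1)}))
```

Returning a dictionary of successor probabilities keeps the model in the form P(s' | s, a) used in the MDP definition above.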

Example: High-Low
- Three card types: 2, 3, 4
- Infinite deck, twice as many 2's
- Start with 3 showing
- After each card, you say "high" or "low"
- New card is flipped
- If you're right, you win the points shown on the new card
- Ties are no-ops
- If you're wrong, the game ends
- Differences from expectimax:
  - #1: you get rewards as you go
  - #2: you might play forever!

High-Low
- States: 2, 3, 4, done
- Actions: High, Low
- Model T(s, a, s'):
  - P(s'=done | 4, High) = 3/4
  - P(s'=2 | 4, High) = 0
  - P(s'=3 | 4, High) = 0
  - P(s'=4 | 4, High) = 1/4
  - P(s'=done | 4, Low) = 0
  - P(s'=2 | 4, Low) = 1/2
  - P(s'=3 | 4, Low) = 1/4
  - P(s'=4 | 4, Low) = 1/4
- Rewards R(s, a, s'):
  - Number shown on s' if s ≠ s'
  - 0 otherwise
- Start: 3
- Note: could choose actions with search. How?

Example: High-Low
- [Figure: search tree rooted at the start state, branching on High and Low, with each outcome edge labeled by its transition probability T and reward R]

MDP Search Trees
- Each MDP state gives an expectimax-like search tree
  - s is a state
  - (s, a) is a q-state
  - (s, a, s') is called a transition, with T(s, a, s') = P(s' | s, a) and reward R(s, a, s')

Utilities of Sequences
- In order to formalize optimality of a policy, we need to understand utilities of sequences of rewards
- Typically consider stationary preferences: [r, r_0, r_1, ...] > [r, r_0', r_1', ...] if and only if [r_0, r_1, ...] > [r_0', r_1', ...]
- Theorem: only two ways to define stationary utilities
  - Additive utility: U([r_0, r_1, r_2, ...]) = r_0 + r_1 + r_2 + ...
  - Discounted utility: U([r_0, r_1, r_2, ...]) = r_0 + γ r_1 + γ^2 r_2 + ...
- Assuming that reward depends only on state for these slides!

Infinite Utilities?!
- Problem: infinite sequences with infinite rewards
- Solutions:
  - Finite horizon: terminate after a fixed T steps
    - Gives a nonstationary policy (π depends on time left)
  - Absorbing state(s): guarantee that for every policy, the agent will eventually "die" (like "done" for High-Low)
  - Discounting: for 0 < γ < 1, U([r_0, r_1, ...]) = Σ_t γ^t r_t ≤ R_max / (1 - γ)
    - Smaller γ means smaller "horizon": shorter-term focus
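The High-Low model can be generated from the deck composition rather than listed by hand. This is a minimal sketch assuming card probabilities 1/2, 1/4, 1/4 for the cards 2, 3, 4 (from "twice as many 2's"); the names `CARD_PROBS` and `high_low_transitions` are hypothetical.

```python
from fractions import Fraction

# Infinite deck with twice as many 2's as 3's or 4's.
CARD_PROBS = {2: Fraction(1, 2), 3: Fraction(1, 4), 4: Fraction(1, 4)}


def high_low_transitions(s, a):
    """Return {s': (probability, reward)} for showing card s and saying a.

    a is "High" or "Low". A tie is a no-op (same state, reward 0), a correct
    guess moves to the new card and pays its face value, and a wrong guess
    moves to the terminal state "done" with reward 0.
    """
    result = {}
    for card, p in CARD_PROBS.items():
        if card == s:
            nxt, reward = card, 0        # tie: nothing happens
        elif (card > s) == (a == "High"):
            nxt, reward = card, card     # right guess: win the points shown
        else:
            nxt, reward = "done", 0      # wrong guess: game over
        prev_p, _ = result.get(nxt, (Fraction(0), 0))
        result[nxt] = (prev_p + p, reward)
    return result


# Reproduces the slide's entries for state 4.
print(high_low_transitions(4, "High"))
print(high_low_transitions(4, "Low"))
```

Successor states with probability 0 (such as s'=2 after saying High on a 4) are simply absent from the returned dictionary.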

Discounting
- Typically discount rewards by γ < 1 each time step
  - Sooner rewards have higher utility than later rewards
  - Also helps the algorithms converge

Optimal Utilities
- Fundamental operation: compute the optimal utilities of states s
- Define the utility of a state s: V*(s) = expected return starting in s and acting optimally
- Define the utility of a q-state (s, a): Q*(s, a) = expected return starting in s, taking action a and thereafter acting optimally
- Define the optimal policy: π*(s) = optimal action from state s

The Bellman Equations
- The definition of utility leads to a simple relationship amongst optimal utility values: optimal rewards = maximize over the first action and then follow the optimal policy
- Formally:
  - V*(s) = max_a Q*(s, a)
  - Q*(s, a) = Σ_s' T(s, a, s') [R(s, a, s') + γ V*(s')]
  - V*(s) = max_a Σ_s' T(s, a, s') [R(s, a, s') + γ V*(s')]

Solving MDPs
- We want to find the optimal policy π*
- Proposal 1: modified expectimax search, starting from each state s: π*(s) = argmax_a Q*(s, a)

MDP Search Trees?
- Problems:
  - This tree is usually infinite (why?)
  - The same states appear over and over (why?)
  - There's actually one tree per state (why?)
- Ideas:
  - Compute to a finite depth (like expectimax)
  - Consider returns from sequences of increasing length
  - Cache values so we don't repeat work

Value Estimates
- Calculate estimates V_k*(s)
  - Not the optimal value of s!
  - The optimal value considering only the next k time steps (k rewards)
  - As k → ∞, it approaches the optimal value
- Why:
  - If discounting, distant rewards become negligible
  - If terminal states are reachable from everywhere, the fraction of episodes not ending becomes negligible
  - Otherwise, we can get infinite expected utility, and then this approach actually won't work
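The depth-limited value estimates V_k can be computed bottom-up with cached values, applying the Bellman backup k times. The following is a minimal sketch on the High-Low game, not the lecture's own code; the discount γ = 0.9 and all function names are assumptions chosen for illustration.

```python
CARD_PROBS = {2: 0.5, 3: 0.25, 4: 0.25}  # twice as many 2's in the deck
STATES = [2, 3, 4]
ACTIONS = ["High", "Low"]


def outcomes(s, a):
    """Yield (probability, reward, next_state) triples for state s, action a."""
    for card, p in CARD_PROBS.items():
        if card == s:
            yield p, 0, card       # tie: no-op
        elif (card > s) == (a == "High"):
            yield p, card, card    # right: win the points on the new card
        else:
            yield p, 0, "done"     # wrong: game ends


def q_value(s, a, V, gamma):
    """One-step lookahead: sum over s' of T * (R + gamma * V[s'])."""
    return sum(p * (r + gamma * V[s2]) for p, r, s2 in outcomes(s, a))


def value_estimates(k, gamma=0.9):
    """Return V_k: the best expected discounted return over the next k steps."""
    V = {s: 0.0 for s in STATES}
    V["done"] = 0.0                # terminal state never earns anything
    for _ in range(k):
        new_V = {s: max(q_value(s, a, V, gamma) for a in ACTIONS) for s in STATES}
        new_V["done"] = 0.0
        V = new_V                  # cache this depth for the next backup
    return V


def greedy_policy(V, gamma=0.9):
    """Pick, in each state, the action with the highest one-step lookahead."""
    return {s: max(ACTIONS, key=lambda a: q_value(s, a, V, gamma)) for s in STATES}


for k in (1, 5, 50):
    print(k, value_estimates(k))
print(greedy_policy(value_estimates(50)))
```

Because each V_k is built only from V_(k-1), repeated states are evaluated once per depth instead of once per tree node, which is the caching idea from the slide. With γ < 1 the printed estimates stop changing noticeably as k grows.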
