Announcements. CS 188: Artificial Intelligence Fall 2007. Preferences. Rational Preferences. MEU Principle. Project 2 (due 10/1)


CS 188: Artificial Intelligence, Fall 2007
Lecture 9: Utilities, 9/25/2007
Dan Klein, UC Berkeley

Announcements
- Project 2 (due 10/1)
- SVN groups available, email us to request one
- Midterm 10/16 in class
  - One side of a page cheat sheet allowed (provided you write it yourself)
  - Tell us NOW about conflicts!

Preferences
- An agent chooses among:
  - Prizes: A, B, etc.
  - Lotteries: situations with uncertain prizes
- Notation: [preference and lottery notation shown on slide]

Rational Preferences
- We want some constraints on preferences before we call them rational
- For example: an agent with intransitive preferences can be induced to give away all its money
  - If B > C, then an agent with C would pay (say) 1 cent to get B
  - If A > B, then an agent with B would pay (say) 1 cent to get A
  - If C > A, then an agent with A would pay (say) 1 cent to get C

Rational Preferences
- Preferences of a rational agent must obey constraints
- These constraints are the axioms of rationality

MEU Principle
- Theorem [Ramsey, 1931; von Neumann & Morgenstern, 1944]: given any preferences satisfying these constraints, there exists a real-valued function U such that:
  - U(A) >= U(B) exactly when A is preferred to B
  - U([p_1, S_1; ...; p_n, S_n]) = sum_i p_i U(S_i)
- Theorem: rational preferences imply behavior describable as maximization of expected utility
- Maximum expected utility (MEU) principle: choose the action that maximizes expected utility
- Note: an agent can be entirely rational (consistent with MEU) without ever representing or manipulating utilities and probabilities
  - E.g., a lookup table for perfect tic-tac-toe, a reflex vacuum cleaner
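To make the intransitivity argument concrete, here is a minimal Python sketch of the classic "money pump" (not from the slides; the prize names and one-cent charge are illustrative): an agent whose preferences cycle can be traded around the cycle indefinitely, paying at every step.

```python
# Illustrative only (not from the slides): an agent with intransitive
# preferences A > B, B > C, C > A can be "money pumped": it repeatedly
# pays a cent to trade up and ends each cycle holding the same prize, poorer.

prefers = [("B", "C"), ("A", "B"), ("C", "A")]  # (better, worse) pairs; note the cycle

def money_pump(prize, cash, cycles=3):
    for _ in range(cycles):
        for better, worse in prefers:
            if prize == worse:      # agent holds the dispreferred prize...
                cash -= 0.01        # ...so it pays 1 cent to trade up
                prize = better
    return prize, cash

prize, cash = money_pump("C", cash=1.00)
print(prize, round(cash, 2))        # -> C 0.91: same prize, 9 cents poorer after 3 cycles
```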

Human Utilities
- Utilities map states to real numbers. Which numbers?
- Standard approach to assessment of human utilities:
  - Compare a state A to a standard lottery L_p between
    - "best possible prize" u+ with probability p
    - "worst possible catastrophe" u- with probability 1-p
  - Adjust the lottery probability p until A ~ L_p
  - The resulting p is a utility in [0, 1]

Utility Scales
- Normalized utilities: u+ = 1.0, u- = 0.0
- Micromorts: one-millionth chance of death, useful for paying to reduce product risks, etc.
- QALYs: quality-adjusted life years, useful for medical decisions involving substantial risk
- Note: behavior is invariant under positive linear transformation
- With deterministic prizes only (no lottery choices), only ordinal utility can be determined, i.e., a total order on prizes

Example: Insurance
- Consider the lottery [0.5, $1000; 0.5, $0]
  - What is its expected monetary value? ($500)
  - What is its certainty equivalent? (Monetary value acceptable in lieu of the lottery; $400 for most people)
  - The difference of $100 is the insurance premium
    - There's an insurance industry because people will pay to reduce their risk
    - If everyone were risk-prone, no insurance would be needed!

Money
- Money does not behave as a utility function
- Given a lottery L:
  - Define its expected monetary value EMV(L)
  - Usually U(L) < U(EMV(L)), i.e., people are risk-averse
- Utility curve: for what probability p am I indifferent between:
  - A prize x
  - A lottery [p, $M; (1-p), $0] for large M?
- Typical empirical data, extrapolated with risk-prone behavior: [utility curve figure]

Example: Human Rationality?
- Famous example of Allais (1953)
  - A: [0.8, $4k; 0.2, $0]
  - B: [1.0, $3k; 0.0, $0]
  - C: [0.2, $4k; 0.8, $0]
  - D: [0.25, $3k; 0.75, $0]
- Most people prefer B > A, C > D
- But if U($0) = 0, then
  - B > A implies U($3k) > 0.8 U($4k)
  - C > D implies 0.8 U($4k) > U($3k)
- [DEMOS]

Reinforcement Learning
- Basic idea:
  - Receive feedback in the form of rewards
  - The agent's utility is defined by the reward function
  - Must learn to act so as to maximize expected rewards
  - Change the rewards, change the learned behavior
- Examples:
  - Playing a game: reward at the end for winning / losing
  - Vacuuming a house: reward for each piece of dirt picked up
  - Automated taxi: reward for each passenger delivered
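The certainty-equivalent and insurance-premium reasoning above can be checked numerically. The sketch below is illustrative only: it assumes a square-root utility function (one arbitrary risk-averse choice, not anything from the slides) and recovers the certainty equivalent by bisection.

```python
# Illustrative only: EMV vs. certainty equivalent under an assumed
# risk-averse (concave) utility. The sqrt utility is arbitrary, not from the slides.
import math

def emv(lottery):
    """Expected monetary value of a lottery given as (probability, dollars) pairs."""
    return sum(p * x for p, x in lottery)

def expected_utility(lottery, U):
    return sum(p * U(x) for p, x in lottery)

def certainty_equivalent(lottery, U, lo=0.0, hi=1e6):
    """Cash amount c with U(c) equal to the lottery's expected utility (bisection)."""
    target = expected_utility(lottery, U)
    for _ in range(100):
        mid = (lo + hi) / 2
        if U(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

U = math.sqrt                                # one assumed risk-averse utility
L = [(0.5, 1000.0), (0.5, 0.0)]              # the lottery from the slide
print(emv(L))                                # 500.0
print(round(certainty_equivalent(L, U)))     # 250 under sqrt; the slide quotes ~$400 for most people
# The gap between the EMV and the certainty equivalent is the insurance premium.
```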

Markov Decision Processes
- An MDP is defined by:
  - A set of states s in S
  - A set of actions a in A
  - A transition function T(s, a, s')
    - The probability that a from s leads to s', i.e., P(s' | s, a)
    - Also called the model
  - A reward function R(s, a, s')
    - Sometimes just R(s) or R(s')
  - A start state (or distribution)
  - Maybe a terminal state
- MDPs are a family of nondeterministic search problems
  - Reinforcement learning: MDPs where we don't know the transition or reward functions

Solving MDPs
- In deterministic single-agent search problems, we want an optimal plan, or sequence of actions, from the start to a goal
- In an MDP, we want an optimal policy π(s)
  - A policy gives an action for each state
  - An optimal policy maximizes expected utility if followed
  - Defines a reflex agent
- Optimal policy when R(s, a, s') = -0.03 for all non-terminals s [grid world figure]

Example Optimal Policies
[Figure: grid world policies shown for several different per-step rewards R(s)]

Example: High-Low
- Three card types: 2, 3, 4
- Infinite deck, twice as many 2's
- Start with 3 showing
- After each card, you say "high" or "low"
- A new card is flipped
  - If you're right, you win the points shown on the new card
  - Ties are no-ops
  - If you're wrong, the game ends
- Differences from expectimax:
  - #1: you get rewards as you go
  - #2: you might play forever!

High-Low as an MDP
- States: 2, 3, 4, done
- Actions: High, Low
- Model T(s, a, s'), e.g., with a 4 showing:
  - P(s'=done | 4, High) = 3/4
  - P(s'=2 | 4, High) = 0
  - P(s'=3 | 4, High) = 0
  - P(s'=4 | 4, High) = 1/4
  - P(s'=done | 4, Low) = 0
  - P(s'=2 | 4, Low) = 1/2
  - P(s'=3 | 4, Low) = 1/4
  - P(s'=4 | 4, Low) = 1/4
- Rewards R(s, a, s'): the number shown on s' if the guess was correct; 0 otherwise
- Start: 3
- Note: we could choose actions with search. How?

Example: High-Low
[Figure: search tree for High-Low showing High/Low branches with transition probabilities T and rewards R]
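For concreteness, the High-Low model above (as reconstructed for the state with a 4 showing) can be written down directly as tables. The sketch below is illustrative only and encodes just that one state.

```python
# Illustrative only: the High-Low model for the state where a 4 is showing,
# as reconstructed from the slide. Other states are analogous and omitted.

# T[(s, a)] maps next states s' to P(s' | s, a)
T = {
    (4, "High"): {"done": 3/4, 4: 1/4},        # a 2 or 3 comes up -> wrong -> game over; a 4 ties
    (4, "Low"):  {2: 1/2, 3: 1/4, 4: 1/4},     # every card is <= 4, so the game continues
}

def R(s, a, s_next):
    """Number shown on the new card if the guess was right; 0 on ties and losses."""
    if s_next == "done" or s_next == s:
        return 0
    return s_next

# Expected one-step reward of saying "Low" with a 4 showing:
q_low = sum(p * R(4, "Low", s2) for s2, p in T[(4, "Low")].items())
print(q_low)   # 0.5*2 + 0.25*3 + 0 (tie) = 1.75
```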

MDP Search Trees
- Each MDP state gives an expectimax-like search tree
  - s is a state
  - (s, a) is a q-state
  - (s, a, s') is called a transition, with T(s, a, s') = P(s' | s, a) and reward R(s, a, s')

Utilities of Sequences
- In order to formalize the optimality of a policy, we need to understand utilities of sequences of rewards
- Typically consider stationary preferences (assuming rewards depend only on state for these slides!)
- Theorem: there are only two ways to define stationary utilities
  - Additive utility: U([r_0, r_1, r_2, ...]) = r_0 + r_1 + r_2 + ...
  - Discounted utility: U([r_0, r_1, r_2, ...]) = r_0 + γ r_1 + γ² r_2 + ...

Infinite Utilities?!
- Problem: infinite sequences with infinite rewards
- Solutions:
  - Finite horizon: terminate after a fixed T steps; gives nonstationary policies (π depends on the time left)
  - Absorbing state(s): guarantee that for every policy the agent will eventually "die" (like "done" for High-Low)
  - Discounting: for 0 < γ < 1
    - Smaller γ means a smaller horizon: shorter-term focus

Discounting
- Typically discount rewards by γ < 1 each time step
  - Sooner rewards have higher utility than later rewards
  - Also helps the algorithms converge

Utilities of States / Policy Evaluation
- Fundamental operation: compute the utility of a state
- Define the utility of a state s under a fixed policy π:
  - V^π(s) = expected total discounted reward (return) starting in s and following π
- How do we calculate the V's for a fixed policy?
  - Idea one: turn the recursive equations into updates
    - Recursive relation (one-step lookahead): V^π(s) = sum over s' of T(s, π(s), s') [ R(s, π(s), s') + γ V^π(s') ]
  - Idea two: it's just a linear system, solve with Matlab (or whatever)
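"Idea one" is just repeated application of the one-step lookahead update. The sketch below is a minimal illustration on a made-up two-state MDP, not course code: it applies the update until the values settle.

```python
# Illustrative only: iterative policy evaluation on a made-up two-state MDP.
# Repeatedly applies V(s) <- sum_{s'} T(s, pi(s), s') * (R(s, pi(s), s') + gamma * V(s'))

gamma = 0.9
pi = {"a": "go", "b": "go"}                          # the fixed policy being evaluated

# Hypothetical model: T[(s, a)] -> {s': prob}, R[(s, a, s')] -> reward
T = {("a", "go"): {"b": 1.0}, ("b", "go"): {"a": 0.5, "end": 0.5}}
R = {("a", "go", "b"): 0.0, ("b", "go", "a"): 1.0, ("b", "go", "end"): 10.0}

V = {"a": 0.0, "b": 0.0, "end": 0.0}
for _ in range(100):                                 # enough sweeps for gamma = 0.9 to settle
    new_V = {"end": 0.0}                             # terminal state has value 0
    for s in ("a", "b"):
        new_V[s] = sum(p * (R[(s, pi[s], s2)] + gamma * V[s2])
                       for s2, p in T[(s, pi[s])].items())
    V = new_V

print({s: round(v, 2) for s, v in V.items()})        # fixed point of the update above
```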

Example: High-Low
- Policy: always say "high"
- Iterative updates: [DEMO]

Example: GridWorld
- Equivalent to doing a fixed-depth search and plugging in zeros at the leaves

Q-Functions
- To simplify things, introduce a q-value Q^π(s, a), for a state and action (a q-state) under a policy
- It is the utility of starting in state s, taking action a, then following π thereafter

Optimal Utilities
- Goal: calculate the optimal utility of each state
  - V*(s) = expected (discounted) rewards with optimal actions
- Why? Given the optimal utilities, MEU lets us compute the optimal policy

Practice: Computing Actions
- Which action should we choose from state s:
  - Given optimal q-values Q*?
  - Given optimal values V*?
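Both "computing actions" questions have short answers: with Q* we take an argmax directly, and with only V* we need a one-step lookahead through the model. The sketch below (with a made-up Q table) is illustrative only.

```python
# Illustrative only: computing actions from optimal values.

def action_from_q(Q, s, actions):
    """With optimal q-values Q*, the best action is just the argmax."""
    return max(actions, key=lambda a: Q[(s, a)])

def action_from_v(V, s, actions, T, R, gamma):
    """With only V*, do a one-step expectimax lookahead through the model."""
    def q(a):
        return sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in T[(s, a)].items())
    return max(actions, key=q)

# Tiny made-up example for the Q* case:
Q = {("s0", "left"): 1.0, ("s0", "right"): 2.5}
print(action_from_q(Q, "s0", ["left", "right"]))     # -> right
```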
