TDT4171 Artificial Intelligence Methods


1 TDT4171 Artificial Intelligence Methods. Lecture 7: Making Complex Decisions. Norwegian University of Science and Technology. Helge Langseth, IT-VEST, helgel@idi.ntnu.no

2 Outline: summary from last time; Chapter 17: Making Complex Decisions: setup for sequential problems; Markov Decision Processes (MDPs); bounded vs. infinite time; finding optimal strategies (value iteration); finding optimal strategies (policy iteration); partial observability and POMDPs; summary.

3 Summary from last time. Rational agents can always use utilities to make decisions. The MEU principle tells us how to behave. It can be quite laborious to elicit preference structures from domain experts; structured approaches are available. Value of Information helps focus information gathering for rational agents. Influence diagrams are extensions to BNs that let us make rational decisions. Note! Assignment is out today.

4 Chapter 17: Making Complex Decisions: learning goals. Understanding the relationship between: one-shot decisions; sequential decisions; decisions for a finite horizon; decisions for an infinite horizon. Being familiar with: Markov decision processes; value iteration; policy iteration. Know about: partially observable Markov decision processes.

5 Decision problems with an unbounded time horizon (setup for sequential problems). Examples of decision problems with an unbounded time horizon: fishing in the North Sea; robot navigation: find a path from the current position to a certain goal position.

6 Decision problems with an unbounded time horizon (setup for sequential problems). Examples: fishing in the North Sea; robot navigation: find a path from the current position to a certain goal position. Characteristics: at each step we are faced with the same type of decision; at each step we are given a certain reward (possibly negative), determined by the chosen decision and the state of the world; the outcome of a decision may be uncertain; the time horizon of the decision problem is unbounded.

7 What are we trying to obtain? (setup for sequential problems). To solve these decision problems we need a mapping from any state/action history σ_t = {S_0, A_0, S_1, A_1, S_2, A_2, ..., S_t} to A_t (the next action). π_t(σ_t) = a_t means: if you've seen σ_t at time t, then do a_t. We want to simplify, e.g., to say π_t(σ_t) = π_t(S_t) = π(S_t). (Figure: a possible representation of π(S_t).) Always sufficient?

8 What are we trying to obtain? (setup for sequential problems). To solve these decision problems we need a mapping from any state/action history σ_t = {S_0, A_0, S_1, A_1, S_2, A_2, ..., S_t} to A_t (the next action). π_t(σ_t) = a_t means: if you've seen σ_t at time t, then do a_t. We want to simplify, e.g., to say π_t(σ_t) = π_t(S_t) = π(S_t). We can think of this in two steps: find a utility function U_t(σ_t) representing how good it is to be in S_t; define π_t so that it maximizes the expected utility at each step. (Figures: a possible representation, U_t(σ_t), and the corresponding π_t(S_t).)

9 Relation to planning (setup for sequential problems). Solving such problems relates to solving a planning task. (Figure: find the shortest path in a maze; the golden square is the goal.)

10 Relation to planning (setup for sequential problems). Detail: near-wall behaviour. (Figure: the solution under ordinary planning.)

11 Relation to planning (setup for sequential problems). Detail: near-wall behaviour. (Figure: the solution under stochastic planning; the robot may fail to do an action, and there is a penalty for hitting walls.)

12 Markov Decision Processes (MDPs). The robot navigation problem can roughly be described as a loop: observe the state of the world; collect a (possibly negative) reward R (not the same as U!); decide on the next action and perform it. This can be modelled as a Markov decision process. (Figure: a dynamic Bayesian network with state nodes S_0, ..., S_{t-1}, S_t, S_{t+1}, action nodes A_0, ..., A_{t+1}, and reward nodes R_0, ..., R_{t+1}.) Note! The model adheres to the Markov assumption. In particular, {S_0, A_0, ..., S_{t-1}, A_{t-1}} ⊥ A_t | S_t, so π_t(σ_t) = π_t(S_t).

13 Markov Decision Processes (MDPs). The robot navigation problem can roughly be described as a loop: observe the state of the world; collect a (possibly negative) reward R (not the same as U!); decide on the next action and perform it. This can be modelled as a Markov decision process (DBN figure as above). Note also! R_t is determined by A_t and S_t. However, as R_t(s_t, a_t) = R_t(s_t) in the robot example, I will simplify the math accordingly.

14 The quantitative part of the MDP (MDPs). In order to specify the transition probabilities P(S_{t+1} | A_t, S_t) and the reward function R(S_t) we need some more information about the domain. The robot can move north, east, south, and west. For each move there is a fuel expenditure of 0. A move succeeds with probability 0.7; otherwise it moves in one of the other directions with equal probability. If it walks into a wall, the robot effectively stands still.

15 The quantitative part of the MDP, cont'd (MDPs). This gives the transition probabilities, shown on the slide as a table for P(S_{t+1} | north, S_t) over the grid cells (rows S_t, columns S_{t+1}). We say that a cell s is absorbing when P(s | A_t, s) = 1. (Figure: the grid of cells.)
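To make this concrete, here is a minimal sketch of how such a transition model can be encoded, in the same style as the lecture's code fragment further down. It assumes a 3x3 grid with the goal in the top-right corner; step_cost and goal_reward are illustrative values only, since the slide's concrete numbers are not reproduced here.

    % Sketch of the grid-world transition model described above (assumed 3x3 grid).
    rows = 3; cols = 3; nS = rows*cols; nA = 4;      % actions: 1=north, 2=east, 3=south, 4=west
    goal = sub2ind([rows cols], 1, cols);            % assumed goal cell
    dr = [-1 0 1 0]; dc = [0 1 0 -1];                % row/column offset of each direction
    P = zeros(nS, nS, nA);                           % P(s2, s1, a) = P(S_{t+1}=s2 | A_t=a, S_t=s1)
    for s = 1:nS
        [r, c] = ind2sub([rows cols], s);
        for a = 1:nA
            if s == goal
                P(s, s, a) = 1;                      % the goal is absorbing
                continue;
            end
            for d = 1:nA                             % direction actually taken
                p = 0.7*(d == a) + 0.1*(d ~= a);     % intended direction 0.7, each other one 0.1
                r2 = min(max(r + dr(d), 1), rows);   % walking into a wall: stay in place
                c2 = min(max(c + dc(d), 1), cols);
                s2 = sub2ind([rows cols], r2, c2);
                P(s2, s, a) = P(s2, s, a) + p;
            end
        end
    end
    step_cost = -0.1; goal_reward = 5;               % assumed reward values
    R = step_cost * ones(nS, 1); R(goal) = goal_reward;

Each column P(:, s, a) is then a proper distribution over next states, matching the table sketched on the slide; the later sketches below reuse P, R, nS, nA and goal from this fragment.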

16 MDPs in general (MDPs). In general, in a Markov decision process: the world is fully observable; the uncertainty in the system is due to non-deterministic actions; for each decision we get a reward (which may be negative) that may depend on the current world state and chosen action, but is independent of time (stationarity of the reward model). (DBN figure as above.)

17 Decision policies (MDPs). A decision policy for A_t is in general a function over the entire past, {S_0, A_0, S_1, A_1, S_2, A_2, ..., S_t}. (DBN figure as above.) However, from the conditional independence properties we see that the relevant past is reduced to S_t. This is similar to filtering and prediction in the dynamic models.

18 Types of strategies: a bounded time horizon (bounded vs. infinite time). The (approximated) North Sea fishing example over a five-year period; S_t is the amount of fish in the sea at the start of time period t, and A_t is the amount being fished during period t. (Figure: a DBN with nodes S_0, ..., S_4, A_0, ..., A_4, and R_0, ..., R_4.)

19 Types of strategies: a bounded time horizon (bounded vs. infinite time). The (approximated) North Sea fishing example over a five-year period; S_t is the amount of fish in the sea at the start of time period t, and A_t is the amount being fished during period t. Even if S_0 and S_4 are the same state: at t = 0 we may specify a conservative number to ensure that there is enough fish in the coming years; at t = 4 we have no concerns about the future, and specify a high volume. The optimal policy for A_t depends on the time step t!

20 Length of horizon vs. optimal strategy (bounded vs. infinite time). The optimal strategy changes as the time horizon increases. Let k be the length of the time horizon, and consider the robot navigation task. (Figure: R(S_t), the optimal strategy for a small k, and the optimal strategy for a large k.) For a small k and the given start position we have to accept the penalty cell in order to make it to the goal state in time; for k ≥ 6 we have time to take the long route. Non-stationarity again!

21 Types of strategies: an unbounded time horizon (bounded vs. infinite time). The (approximated) North Sea fishing example with an unbounded time horizon. (DBN figure as above.) The optimal policy for A_t depends on the current state and what may happen in the future. If two time steps, say year t = 0 and t = 4, are in the same state, then they have the same possibilities in the future. The optimal policies for A_0 and A_4 are the same! The strategy is said to be stationary.

22 The utility of an unbounded sequence: fixed horizon (bounded vs. infinite time). Consider only the rewards obtained in the first k states: U(s_0, s_1, s_2, ...) = R(s_0) + R(s_1) + ... + R(s_k) < ∞. But how do we choose k? For k = 0 we only care about the immediate reward, hence we pursue a very greedy strategy. As k approaches ∞, we put less and less focus on the initial behaviour.

23 Evaluating strategies (bounded vs. infinite time). Assume that the reward function is specified as in the figure, and imagine there is no terminal state and no uncertainty on the result of an action. Then the expected utility is infinite for both strategies shown: EU = ∞ and EU = ∞. But which one is better?

24 The utility of an unbounded sequence: discounted rewards (bounded vs. infinite time). Weigh rewards in the immediate future higher than rewards in the distant future: U(s_0, s_1, s_2, ...) = R(s_0) + γ R(s_1) + γ^2 R(s_2) + ..., where 0 ≤ γ ≤ 1.

25 The utility of an unbounded sequence: discounted rewards (bounded vs. infinite time). Weigh rewards in the immediate future higher than rewards in the distant future: U(s_0, s_1, s_2, ...) = R(s_0) + γ R(s_1) + γ^2 R(s_2) + ..., where 0 ≤ γ ≤ 1. Possible interpretations of the discounting factor γ: the decision process may terminate with probability (1 - γ) at any point in time, e.g. the robot breaking down; in economics, γ may be thought of as corresponding to an interest rate of r = (1/γ) - 1.

26 The utility of an unbounded sequence: discounted rewards (bounded vs. infinite time). Weigh rewards in the immediate future higher than rewards in the distant future: U(s_0, s_1, s_2, ...) = R(s_0) + γ R(s_1) + γ^2 R(s_2) + ..., where 0 ≤ γ ≤ 1. For γ = 0 we have a greedy strategy. With 0 < γ < 1 and R = max_t R(s_t) < ∞ we have U(s_0, s_1, s_2, ...) = Σ_{i=0}^∞ γ^i R(s_i) ≤ Σ_{i=0}^∞ γ^i R = R/(1 - γ) < ∞. For γ = 1 we have normal additive rewards.
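As a tiny numeric illustration of the bound above (the reward sequence here is made up purely for the example):

    % Discounted utility of a truncated reward sequence, compared to R/(1-gamma).
    gamma = 0.9;
    r = [1 0 2 1 0 3 1 1];                       % assumed rewards R(s_0), R(s_1), ...
    U = sum(gamma .^ (0:numel(r)-1) .* r);       % sum_i gamma^i * R(s_i)
    bound = max(r) / (1 - gamma);                % R/(1 - gamma) with R = max_t R(s_t)
    fprintf('U = %.3f, bound = %.3f\n', U, bound);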

27 Expected utilities (bounded vs. infinite time). The actions may be non-deterministic, so a strategy may only take you to a state with a certain probability. Strategies should be compared based on the expected rewards they can produce.

28 Expected utilities (bounded vs. infinite time). The actions may be non-deterministic, so a strategy may only take you to a state with a certain probability. Strategies should be compared based on the expected rewards they can produce. Starting in state s_0 and following strategy π, the expected reward in step i is Σ_{s_i} R(s_i) P(S_i = s_i | π, S_0 = s_0), and the expected discounted reward in step i is Σ_{s_i} γ^i R(s_i) P(S_i = s_i | π, S_0 = s_0).

29 Expected utilities (bounded vs. infinite time). Starting in state s_0 and following strategy π, the expected discounted reward in step i is Σ_{s_i} γ^i R(s_i) P(S_i = s_i | π, S_0 = s_0). The expected discounted reward of π is defined as (for a fixed N): U(s_0, π) = Σ_{i=0}^N Σ_{s_i} γ^i R(s_i) P(S_i = s_i | π, S_0 = s_0).

30 Expected utilities (bounded vs. infinite time). Starting in state s_0 and following strategy π, the expected discounted reward in step i is Σ_{s_i} γ^i R(s_i) P(S_i = s_i | π, S_0 = s_0). The expected discounted reward of π is defined as: U(s_0, π) = lim_{N→∞} Σ_{i=0}^N Σ_{s_i} γ^i R(s_i) P(S_i = s_i | π, S_0 = s_0).

31 Expected utilities (bounded vs. infinite time). Starting in state s_0 and following strategy π, the expected discounted reward in step i is Σ_{s_i} γ^i R(s_i) P(S_i = s_i | π, S_0 = s_0). A standard notation for U(s_0, π) is also: E[ Σ_{i=0}^∞ γ^i R(S_i) | π, S_0 = s_0 ].
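One way to read this expectation is as an average over simulated runs. The sketch below estimates U(s_0, π) by Monte Carlo simulation, reusing the P and R built in the grid-world sketch above; the policy, start state, horizon and number of runs are arbitrary choices made only for illustration.

    % Monte Carlo estimate of U(s0, pol): average the discounted return over many
    % simulated runs under a fixed policy (truncated at a finite horizon).
    gamma = 0.9; s0 = 1; nruns = 5000; horizon = 100;
    pol = randi(nA, nS, 1);                      % some fixed (here random) policy
    total = 0;
    for run = 1:nruns
        s = s0; G = 0;
        for i = 0:horizon
            G = G + gamma^i * R(s);              % gamma^i * R(s_i)
            cdf = cumsum(P(:, s, pol(s)));       % sample s_{i+1} ~ P(. | pol(s), s)
            cdf(end) = 1;                        % guard against round-off
            s = find(rand() <= cdf, 1);
        end
        total = total + G;
    end
    U_est = total / nruns;                       % estimate of U(s0, pol)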

32 A side-step: fix-point iterations (bounded vs. infinite time). Solve the equation x^2 - cos(x) = 0 on x ∈ [0, 1]. How can we proceed to find an approximate solution? Discuss with your neighbour for a couple of minutes...

33 A side-step: fix-point iterations (bounded vs. infinite time). Rather solve x = √cos(x) (same solution for x ∈ [0, 1]). Why is that easier?

34 A side-step: fix-point iterations, cont'd (bounded vs. infinite time). Solve iteratively: x_{i+1} ← √cos(x_i). An equation x = g(x) can be solved iteratively when |g'(x)| < 1. Code fragment:

    x = 0.5;                          % Initial value
    noiter = 16;                      % Number of iterations
    for iter = 1:noiter
        fprintf('Iter %d: %6.4f\n', iter, x);
        x = sqrt(cos(x));             % Do the update
    end

35 A side-step: fix-point iterations, cont'd (bounded vs. infinite time). Solve iteratively: x_{i+1} ← √cos(x_i). An equation x = g(x) can be solved iteratively when |g'(x)| < 1.

36 A side-step: fix-point iterations, cont'd (bounded vs. infinite time). Solve iteratively: x_{i+1} ← √cos(x_i). An equation x = g(x) can be solved iteratively when |g'(x)| < 1. Output: starting from x = 0.5, the printed iterates settle down after about ten iterations and stay at the fixed point x ≈ 0.82 for the remaining iterations.

37 A side-step: fix-point iterations in higher dimensions (bounded vs. infinite time). It can work in higher dimensions, too! We solve a pair of simultaneous (polynomial) equations of the form x = g_1(x, y), y = g_2(x, y) simply by iterating x_{i+1} ← g_1(x_i, y_i), y_{i+1} ← g_2(x_i, y_i) to obtain the solution (x, y). Note the indices: y_{i+1} is calculated using x_i even if x_{i+1} is already known.
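A runnable sketch of the same idea, using a hypothetical contractive pair of equations rather than the slide's polynomial system:

    % Jacobi-style fix-point iteration in two dimensions.
    % Hypothetical system (not the one on the slide): x = (y + 1)/3, y = (x + 2)/4.
    x = 0; y = 0;                                % initial guess
    for iter = 1:20
        xnew = (y + 1)/3;                        % uses the old y
        ynew = (x + 2)/4;                        % uses the old x, not xnew
        x = xnew; y = ynew;
    end
    fprintf('x = %6.4f, y = %6.4f\n', x, y);     % converges to (6/11, 7/11)

Both partial derivatives of the update are below 1 in absolute value, so the iteration contracts, mirroring the |g'(x)| < 1 condition in one dimension.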

38 Recap: what we are up to (bounded vs. infinite time). To solve these decision problems we need a mapping from any state/action history σ_t to A_t (the next action). We have simplified: π_t(σ_t) = π_t(S_t) (by the Markov assumption) = π(S_t) (by stationarity). We proceed in two steps: find a utility function U(S_t) representing how good it is to be in S_t; define π(S_t) to maximize the expected utility. (Figures: a possible representation, U(S_t), and the corresponding π(S_t).) We have defined U(S_t), but so far do not know how to calculate it. Fix-point iterations coming up next...

39 Finding optimal strategies (value iteration). The maximum expected utility of starting in state s_0 is: U*(s_0) = max_π U(s_0, π) = max_π E[ Σ_{i=0}^∞ γ^i R(S_i) | π, S_0 = s_0 ]. But how do we calculate this?

40 Finding optimal strategies (value iteration). The maximum expected utility of starting in state s_0 is: U*(s_0) = max_π U(s_0, π) = max_π E[ Σ_{i=0}^∞ γ^i R(S_i) | π, S_0 = s_0 ]. But how do we calculate this? In any state we choose the action maximizing the expected utility: π(s) = argmax_a Σ_{s'} P(s' | s, a) U*(s'). Thus, U*(s) = R(s) + γ max_a Σ_{s'} P(s' | s, a) U*(s').

41 Finding optimal strategies (value iteration). In any state we choose the action maximizing the expected utility: π(s) = argmax_a Σ_{s'} P(s' | s, a) U*(s'). Thus, U*(s) = R(s) + γ max_a Σ_{s'} P(s' | s, a) U*(s'). We now have n non-linear equations in n unknowns (n = number of states). A solution to these equations corresponds to U*.

42 Value iteration: fix-point iterations in value-space. Start with an initial guess at the utility function, and iteratively refine it using the idea of fix-point iterations. (Figures: the initial guess, after the first iteration, and after the second iteration.) The updating function: Û_{j+1}(s) ← R(s) + γ max_a Σ_{s'} P(s' | a, s) Û_j(s').

43 Value iteration: robot navigation. (Figure: the initial guess Û_0.) The corresponding optimal strategy is uninformative.

44 Value iteration: robot navigation. (Figures: the initial guess Û_0 and the first iteration Û_1.)

45 Value iteration: robot navigation. (Figures: the initial guess Û_0 and the first iteration Û_1.) For one cell, Û_1({·,·}) = R({·,·}) + γ max{ Σ_{s'} P(s' | north, {·,·}) Û_0(s'), ..., Σ_{s'} P(s' | west, {·,·}) Û_0(s') } = R({·,·}) + γ · max{0, 0, 0, 0}.

46 Value iteration: robot navigation. (Figures: the initial guess Û_0 and the first iteration Û_1.) For another cell, Û_1({·,·}) = R({·,·}) + γ max{ Σ_{s'} P(s' | north, {·,·}) Û_0(s'), ..., Σ_{s'} P(s' | west, {·,·}) Û_0(s') } = R({·,·}) + γ · max{0, 0, 0, 0} = 5.

47 Value iteration: robot navigation. (Figures: the initial guess Û_0 and the first iteration Û_1.) The corresponding optimal strategy is still rather silly.

48 Value iteration: robot navigation. (Figures: the initial guess Û_0, the first iteration Û_1, and the second iteration Û_2.)

49 Value iteration: robot navigation. (Figures: Û_0, Û_1, and Û_2.) Worked example on the slide: the Bellman update Û_2({·,·}) for one cell, taken as the maximum over the four actions.

50 Value iteration: robot navigation. (Figures: Û_0, Û_1, and Û_2.) Worked example on the slide: the Bellman update Û_2({·,·}) for a second cell, again as the maximum over the four actions.

51 Value iteration: robot navigation. (Figures: Û_0, Û_1, and Û_2.) The optimal strategy corresponding to Û_2.

52 Value iteration: robot navigation. (Figures: Û_0, Û_1, and Û_2.) The optimal strategy corresponding to the converged Û.

53 Value iteration: the impact of the discounting factor. (Figures: the Û function and the optimal strategy for γ = 0.9, and for a small γ. Captions: get to the goal; money accumulates; avoid setbacks; future less relevant.)

54 Value iteration: the algorithm. 1. Choose an ε > 0 to regulate the stopping criterion. 2. Let Û_0 be an initial estimate of the utility function (for example, initialized to zero for all states). 3. Set i := 0. 4. Repeat: let i := i + 1; for each state s in S do Û_i(s) := R(s) + γ max_a Σ_{s'} P(s' | a, s) Û_{i-1}(s'). 5. Until |Û_i(s) - Û_{i-1}(s)| < ε(1 - γ)/γ for all s.
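A direct transcription of this algorithm into code, in the style of the earlier code fragment (reusing P, R, nS and nA from the grid-world sketch; γ and ε are assumed values):

    % Value iteration, following the algorithm above.
    gamma = 0.9; epsilon = 1e-4;
    U = zeros(nS, 1);                            % initial estimate U_0
    while true
        Unew = zeros(nS, 1);
        for s = 1:nS
            q = zeros(nA, 1);
            for a = 1:nA
                q(a) = P(:, s, a)' * U;          % sum_{s'} P(s' | a, s) * U(s')
            end
            Unew(s) = R(s) + gamma * max(q);     % the Bellman update
        end
        delta = max(abs(Unew - U));
        U = Unew;
        if delta < epsilon*(1 - gamma)/gamma
            break;                               % stopping criterion from step 5
        end
    end

The greedy policy can then be read off as π(s) = argmax_a Σ_{s'} P(s' | a, s) U(s'), exactly as in the Bellman equation above.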

55 Value iteration: convergence. So the algorithm converges for this particular example, but does this hold in general? Yes! It can be proven that there is only one true utility function, and that value iteration is guaranteed to converge to it.
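A sketch of why this holds (the standard argument, not spelled out on the slide): write the update as an operator B, with (B Û)(s) = R(s) + γ max_a Σ_{s'} P(s' | a, s) Û(s'). For any two utility functions Û and Û',

    max_s |B Û(s) - B Û'(s)| ≤ γ max_s |Û(s) - Û'(s)|,

so for γ < 1 the operator B is a contraction. Just as in the one-dimensional case with |g'(x)| < 1, a contraction has exactly one fixed point (here U*), and the iterates Û_j converge to it from any starting point.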

56 Value iteration: demo. Deterministic actions vs. stochastic actions. (rl_sim demo: 8_big.maze)

57 Value iteration: efficiency. Value iteration converges, but is it efficient? It estimates the utilities of all states to the same accuracy, also those states that are rarely visited. Convergence is defined from the accuracy of the utility estimates, but the agent only cares about making optimal decisions. (Figure: the error in Û and the error in π as a function of the iteration number.)

58 Policy iteration (finding optimal strategies). Instead of updating the utility function, make an initial guess at the optimal policy and perform an iterative refinement of this guess: π_0 → Û_{π_0} → π_1 → Û_{π_1} → ... → π_m, alternating policy evaluation and policy updating.

59 Policy iteration (finding optimal strategies). Instead of updating the utility function, make an initial guess at the optimal policy and perform an iterative refinement of this guess, alternating policy evaluation and policy updating. The updating function: π_{i+1}(s) := argmax_a Σ_{s'} P(s' | a, s) Û_{π_i}(s').

60 Policy iteration (finding optimal strategies). Instead of updating the utility function, make an initial guess at the optimal policy and perform an iterative refinement of this guess, alternating policy evaluation and policy updating. The updating function: π_{i+1}(s) := argmax_a Σ_{s'} P(s' | a, s) Û_{π_i}(s'). The evaluation function: Û_{π_i}(s) = R(s) + γ Σ_{s'} P(s' | π_i(s), s) Û_{π_i}(s'), which defines a system of linear equations; the solution is U_{π_i}.

61 Policy iteration: the algorithm. 1. Let π_0 be an initial, randomly chosen policy. 2. Set i := 0. 3. Repeat: find the utility function Û_{π_i} corresponding to the policy π_i [policy evaluation]; let i := i + 1; for each s, π_i(s) := argmax_a Σ_{s'} P(s' | a, s) Û_{π_{i-1}}(s') [policy updating]. 4. Until π_i = π_{i-1}.
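The same algorithm as code (again reusing P, R, nS and nA from the grid-world sketch; γ is an assumed value). The policy-evaluation step solves the linear system from the previous slide directly:

    % Policy iteration, following the algorithm above.
    gamma = 0.9;
    pol = ones(nS, 1);                           % initial policy: action 1 in every state
    while true
        % Policy evaluation: solve U = R + gamma * Ppol * U as a linear system.
        Ppol = zeros(nS, nS);
        for s = 1:nS
            Ppol(s, :) = P(:, s, pol(s))';       % row s: next-state distribution under pol(s)
        end
        U = (eye(nS) - gamma * Ppol) \ R;
        % Policy updating: greedy with respect to U.
        newpol = pol;
        for s = 1:nS
            q = zeros(nA, 1);
            for a = 1:nA
                q(a) = P(:, s, a)' * U;
            end
            [~, newpol(s)] = max(q);
        end
        if isequal(newpol, pol)
            break;                               % the policy is stable
        end
        pol = newpol;
    end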

62 Value iteration vs. policy iteration: demo. (rl_sim demo: _smallest.maze)

63 Partial observability (partial observability and POMDPs). Partially Observable Markov Decision Process (POMDP): the agent does not observe the environment fully, so it does not know the state it is in. A POMDP is like an MDP, but has an observation model P(e_t | s_t) defining the probability that the agent obtains evidence e_t when in state s_t. The agent does not know which state it is in, so it makes no sense to talk about a policy π(s)!

64 Solving POMDPs (partial observability and POMDPs). Theorem: the optimal policy in a POMDP is a function π(b), where b is the belief state (probability distribution over states) of the agent. Hence, we can convert a POMDP into an MDP in belief-state space, where P(b_{t+1} | a_t, b_t) is the probability of the new belief state b_{t+1} given that the current belief state is b_t and the agent does a_t. If there are n states, b is an n-dimensional real-valued vector, so solving POMDPs is very hard (PSPACE-hard). The real world is a POMDP (with initially unknown transition and observation models).
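A sketch of one step of belief updating, the basic operation behind the belief-state MDP (reusing P, nS and goal from the grid-world sketch; the two-observation sensor model Pe below is entirely made up for the example):

    % One step of belief updating: b'(s') is proportional to P(e | s') * sum_s P(s' | a, s) b(s).
    Pe = [0.9*ones(1, nS); 0.1*ones(1, nS)];     % assumed P(e | s): observation 2 is rare...
    Pe(:, goal) = [0.2; 0.8];                    % ...except in the goal cell
    b = ones(nS, 1) / nS;                        % current belief state (uniform here)
    a = 1; e = 2;                                % chosen action and received evidence
    bnew = zeros(nS, 1);
    for sp = 1:nS
        pred = P(sp, :, a) * b;                  % sum_s P(sp | a, s) * b(s)
        bnew(sp) = Pe(e, sp) * pred;             % weight by the observation model
    end
    bnew = bnew / sum(bnew);                     % normalize: the new belief state

An optimal POMDP policy then maps such belief vectors to actions, as stated in the theorem.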

65 Summary. Sequential decision problems. Assumptions: stationarity, the Markov assumption, additive rewards, infinite horizon with discounting. Model class: Markov decision problems. Algorithms: value iteration / policy iteration. Intuitively, MDPs combine probabilistic models over time (filtering, prediction) with the maximum expected utility principle.
