TDT4171 Artificial Intelligence Methods


1 TDT4171 Artificial Intelligence Methods. Lecture 7: Making Complex Decisions. Norwegian University of Science and Technology. Helge Langseth, IT-VEST, helgel@idi.ntnu.no

2 Outline: summary from last time; Chapter 17: Making Complex Decisions: setup for sequential problems; Markov Decision Processes (MDPs); bounded vs. infinite time; finding optimal strategies (value iteration); finding optimal strategies (policy iteration); partial observability and POMDPs; summary.

3 Summary from last time. Rational agents can always use utilities to make decisions. The MEU principle tells us how to behave. It can be quite laborious to elicit preference structures from domain experts; structured approaches are available. Value of Information helps focus information gathering for rational agents. Influence diagrams are extensions to BNs that let us make rational decisions. Note! Assignment is out today.

4 Chapter 17: Making Complex Decisions: learning goals. Understanding the relationship between: one-shot decisions; sequential decisions; decisions for a finite horizon; decisions for an infinite horizon. Being familiar with: Markov decision processes; value iteration; policy iteration. Know about: partially observable Markov decision processes.

5 Decision problems with an unbounded time horizon (setup for sequential problems). Examples of decision problems with an unbounded time horizon: fishing in the North Sea; robot navigation: find a path from the current position to a certain goal position.

6 Decision problems with an unbounded time horizon (setup for sequential problems). Examples: fishing in the North Sea; robot navigation: find a path from the current position to a certain goal position. Characteristics: at each step we are faced with the same type of decision; at each step we are given a certain reward (possibly negative), determined by the chosen decision and the state of the world; the outcome of a decision may be uncertain; the time horizon of the decision problem is unbounded.

7 What are we trying to obtain? (setup for sequential problems). To solve these decision problems we need a mapping from any state/action history σ_t = {S_0, A_0, S_1, A_1, S_2, A_2, ..., S_t} to A_t (the next action). π_t(σ_t) = a_t means: if you've seen σ_t at time t, then do a_t. We want to simplify, e.g., to say π_t(σ_t) = π_t(S_t) = π(S_t). (Figure: a possible representation of π(S_t).) Always sufficient?

8 What are we trying to obtain? (setup for sequential problems). To solve these decision problems we need a mapping from any state/action history σ_t = {S_0, A_0, S_1, A_1, S_2, A_2, ..., S_t} to A_t (the next action). π_t(σ_t) = a_t means: if you've seen σ_t at time t, then do a_t. We want to simplify, e.g., to say π_t(σ_t) = π_t(S_t) = π(S_t). We can think of this in two steps: find a utility function U_t(σ_t) representing how good it is to be in S_t; define π_t so that it maximizes the expected utility at each step. (Figures: a possible representation, U_t(σ_t), and the corresponding π_t(S_t).)

9 Relation to planning (setup for sequential problems). Solving such problems relates to solving a planning task. (Figure: find the shortest path in a maze; the golden square is the goal.)

10 Relation to planning (setup for sequential problems). Detail: near-wall behaviour. (Figure: the solution under ordinary planning.)

11 Relation to planning (setup for sequential problems). Detail: near-wall behaviour. (Figure: the solution under stochastic planning; the robot may fail to do an action, and there is a penalty for hitting walls.)

12 Markov Decision Processes (MDPs). The robot navigation problem can roughly be described as a loop: observe the state of the world; collect a (possibly negative) reward R (not the same as U!); decide on the next action and perform it. This can be modelled as a Markov decision process. (Figure: a dynamic Bayesian network with state nodes S_0, ..., S_{t-1}, S_t, S_{t+1}, action nodes A_0, ..., A_{t+1}, and reward nodes R_0, ..., R_{t+1}.) Note! The model adheres to the Markov assumption. In particular, {S_0, A_0, ..., S_{t-1}, A_{t-1}} ⊥ A_t | S_t, so π_t(σ_t) = π_t(S_t).

13 Markov Decision Processes (MDPs). The robot navigation problem can roughly be described as a loop: observe the state of the world; collect a (possibly negative) reward R (not the same as U!); decide on the next action and perform it. This can be modelled as a Markov decision process (DBN figure as above). Note also! R_t is determined by A_t and S_t. However, as R_t(s_t, a_t) = R_t(s_t) in the robot example, I will simplify the math accordingly.

14 The quantitative part of the MDP (MDPs). In order to specify the transition probabilities P(S_{t+1} | A_t, S_t) and the reward function R(S_t) we need some more information about the domain. The robot can move north, east, south, and west. For each move there is a fuel expenditure of 0. A move succeeds with probability 0.7; otherwise it moves in one of the other directions with equal probability. If it walks into a wall, the robot effectively stands still.

15 The quantitative part of the MDP, cont'd (MDPs). This gives the transition probabilities, shown on the slide as a table for P(S_{t+1} | north, S_t) over the grid cells (rows S_t, columns S_{t+1}). We say that a cell s is absorbing when P(s | A_t, s) = 1. (Figure: the grid of cells.)
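To make this concrete, here is a minimal sketch of how such a transition model can be encoded, in the same style as the lecture's code fragment further down. It assumes a 3x3 grid with the goal in the top-right corner; step_cost and goal_reward are illustrative values only, since the slide's concrete numbers are not reproduced here.

    % Sketch of the grid-world transition model described above (assumed 3x3 grid).
    rows = 3; cols = 3; nS = rows*cols; nA = 4;      % actions: 1=north, 2=east, 3=south, 4=west
    goal = sub2ind([rows cols], 1, cols);            % assumed goal cell
    dr = [-1 0 1 0]; dc = [0 1 0 -1];                % row/column offset of each direction
    P = zeros(nS, nS, nA);                           % P(s2, s1, a) = P(S_{t+1}=s2 | A_t=a, S_t=s1)
    for s = 1:nS
        [r, c] = ind2sub([rows cols], s);
        for a = 1:nA
            if s == goal
                P(s, s, a) = 1;                      % the goal is absorbing
                continue;
            end
            for d = 1:nA                             % direction actually taken
                p = 0.7*(d == a) + 0.1*(d ~= a);     % intended direction 0.7, each other one 0.1
                r2 = min(max(r + dr(d), 1), rows);   % walking into a wall: stay in place
                c2 = min(max(c + dc(d), 1), cols);
                s2 = sub2ind([rows cols], r2, c2);
                P(s2, s, a) = P(s2, s, a) + p;
            end
        end
    end
    step_cost = -0.1; goal_reward = 5;               % assumed reward values
    R = step_cost * ones(nS, 1); R(goal) = goal_reward;

Each column P(:, s, a) is then a proper distribution over next states, matching the table sketched on the slide; the later sketches below reuse P, R, nS, nA and goal from this fragment.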

16 MDPs in general (MDPs). In general, in a Markov decision process: the world is fully observable; the uncertainty in the system is due to non-deterministic actions; for each decision we get a reward (which may be negative) that may depend on the current world state and chosen action, but is independent of time (stationarity of the reward model). (DBN figure as above.)

17 Decision policies (MDPs). A decision policy for A_t is in general a function over the entire past, {S_0, A_0, S_1, A_1, S_2, A_2, ..., S_t}. (DBN figure as above.) However, from the conditional independence properties we see that the relevant past is reduced to S_t. This is similar to filtering and prediction in the dynamic models.

18 Types of strategies: a bounded time horizon (bounded vs. infinite time). The (approximated) North Sea fishing example over a five-year period; S_t is the amount of fish in the sea at the start of time period t, and A_t is the amount being fished during period t. (Figure: a DBN with nodes S_0, ..., S_4, A_0, ..., A_4, and R_0, ..., R_4.)

19 Types of strategies: a bounded time horizon (bounded vs. infinite time). The (approximated) North Sea fishing example over a five-year period; S_t is the amount of fish in the sea at the start of time period t, and A_t is the amount being fished during period t. Even if S_0 and S_4 are the same state: at t = 0 we may specify a conservative number to ensure that there is enough fish in the coming years; at t = 4 we have no concerns about the future, and specify a high volume. The optimal policy for A_t depends on the time step t!

20 Length of horizon vs. optimal strategy (bounded vs. infinite time). The optimal strategy changes as the time horizon increases. Let k be the length of the time horizon, and consider the robot navigation task. (Figure: R(S_t), the optimal strategy for a small k, and the optimal strategy for a large k.) For a small k and the given start position we have to accept the penalty cell in order to make it to the goal state in time; for k ≥ 6 we have time to take the long route. Non-stationarity again!

21 Types of strategies: an unbounded time horizon (bounded vs. infinite time). The (approximated) North Sea fishing example with an unbounded time horizon. (DBN figure as above.) The optimal policy for A_t depends on the current state and what may happen in the future. If two time steps, say year t = 0 and t = 4, are in the same state, then they have the same possibilities in the future. The optimal policies for A_0 and A_4 are the same! The strategy is said to be stationary.

22 The utility of an unbounded sequence: fixed horizon (bounded vs. infinite time). Consider only the rewards obtained in the first k states: U(s_0, s_1, s_2, ...) = R(s_0) + R(s_1) + ... + R(s_k) < ∞. But how do we choose k? For k = 0 we only care about the immediate reward, hence we pursue a very greedy strategy. As k approaches ∞, we put less and less focus on the initial behaviour.

23 Evaluating strategies (bounded vs. infinite time). Assume that the reward function is specified as in the figure, and imagine there is no terminal state and no uncertainty on the result of an action. Then the expected utility is infinite for both strategies shown: EU = ∞ and EU = ∞. But which one is better?

24 The utility of an unbounded sequence: discounted rewards (bounded vs. infinite time). Weigh rewards in the immediate future higher than rewards in the distant future: U(s_0, s_1, s_2, ...) = R(s_0) + γ R(s_1) + γ^2 R(s_2) + ..., where 0 ≤ γ ≤ 1.

25 The utility of an unbounded sequence: discounted rewards (bounded vs. infinite time). Weigh rewards in the immediate future higher than rewards in the distant future: U(s_0, s_1, s_2, ...) = R(s_0) + γ R(s_1) + γ^2 R(s_2) + ..., where 0 ≤ γ ≤ 1. Possible interpretations of the discounting factor γ: the decision process may terminate with probability (1 - γ) at any point in time, e.g. the robot breaking down; in economics, γ may be thought of as corresponding to an interest rate of r = (1/γ) - 1.

26 The utility of an unbounded sequence: discounted rewards (bounded vs. infinite time). Weigh rewards in the immediate future higher than rewards in the distant future: U(s_0, s_1, s_2, ...) = R(s_0) + γ R(s_1) + γ^2 R(s_2) + ..., where 0 ≤ γ ≤ 1. For γ = 0 we have a greedy strategy. With 0 < γ < 1 and R = max_t R(s_t) < ∞ we have U(s_0, s_1, s_2, ...) = Σ_{i=0}^∞ γ^i R(s_i) ≤ Σ_{i=0}^∞ γ^i R = R/(1 - γ) < ∞. For γ = 1 we have normal additive rewards.
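As a tiny numeric illustration of the bound above (the reward sequence here is made up purely for the example):

    % Discounted utility of a truncated reward sequence, compared to R/(1-gamma).
    gamma = 0.9;
    r = [1 0 2 1 0 3 1 1];                       % assumed rewards R(s_0), R(s_1), ...
    U = sum(gamma .^ (0:numel(r)-1) .* r);       % sum_i gamma^i * R(s_i)
    bound = max(r) / (1 - gamma);                % R/(1 - gamma) with R = max_t R(s_t)
    fprintf('U = %.3f, bound = %.3f\n', U, bound);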

27 Expected utilities (bounded vs. infinite time). The actions may be non-deterministic, so a strategy may only take you to a state with a certain probability. Strategies should be compared based on the expected rewards they can produce.

28 Expected utilities (bounded vs. infinite time). The actions may be non-deterministic, so a strategy may only take you to a state with a certain probability. Strategies should be compared based on the expected rewards they can produce. Starting in state s_0 and following strategy π, the expected reward in step i is Σ_{s_i} R(s_i) P(S_i = s_i | π, S_0 = s_0), and the expected discounted reward in step i is Σ_{s_i} γ^i R(s_i) P(S_i = s_i | π, S_0 = s_0).

29 Expected utilities (bounded vs. infinite time). Starting in state s_0 and following strategy π, the expected discounted reward in step i is Σ_{s_i} γ^i R(s_i) P(S_i = s_i | π, S_0 = s_0). The expected discounted reward of π is defined as (for a fixed N): U(s_0, π) = Σ_{i=0}^N Σ_{s_i} γ^i R(s_i) P(S_i = s_i | π, S_0 = s_0).

30 Expected utilities (bounded vs. infinite time). Starting in state s_0 and following strategy π, the expected discounted reward in step i is Σ_{s_i} γ^i R(s_i) P(S_i = s_i | π, S_0 = s_0). The expected discounted reward of π is defined as: U(s_0, π) = lim_{N→∞} Σ_{i=0}^N Σ_{s_i} γ^i R(s_i) P(S_i = s_i | π, S_0 = s_0).

31 Expected utilities (bounded vs. infinite time). Starting in state s_0 and following strategy π, the expected discounted reward in step i is Σ_{s_i} γ^i R(s_i) P(S_i = s_i | π, S_0 = s_0). A standard notation for U(s_0, π) is also: E[ Σ_{i=0}^∞ γ^i R(S_i) | π, S_0 = s_0 ].
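One way to read this expectation is as an average over simulated runs. The sketch below estimates U(s_0, π) by Monte Carlo simulation, reusing the P and R built in the grid-world sketch above; the policy, start state, horizon and number of runs are arbitrary choices made only for illustration.

    % Monte Carlo estimate of U(s0, pol): average the discounted return over many
    % simulated runs under a fixed policy (truncated at a finite horizon).
    gamma = 0.9; s0 = 1; nruns = 5000; horizon = 100;
    pol = randi(nA, nS, 1);                      % some fixed (here random) policy
    total = 0;
    for run = 1:nruns
        s = s0; G = 0;
        for i = 0:horizon
            G = G + gamma^i * R(s);              % gamma^i * R(s_i)
            cdf = cumsum(P(:, s, pol(s)));       % sample s_{i+1} ~ P(. | pol(s), s)
            cdf(end) = 1;                        % guard against round-off
            s = find(rand() <= cdf, 1);
        end
        total = total + G;
    end
    U_est = total / nruns;                       % estimate of U(s0, pol)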

32 A side-step: fix-point iterations (bounded vs. infinite time). Solve the equation x^2 - cos(x) = 0 on x ∈ [0, 1]. How can we proceed to find an approximate solution? Discuss with your neighbour for a couple of minutes...

33 A side-step: fix-point iterations (bounded vs. infinite time). Rather solve x = √cos(x) (same solution for x ∈ [0, 1]). Why is that easier?

34 A side-step: fix-point iterations, cont'd (bounded vs. infinite time). Solve iteratively: x_{i+1} ← √cos(x_i). An equation x = g(x) can be solved iteratively when |g'(x)| < 1. Code fragment:

    x = 0.5;                          % Initial value
    noiter = 16;                      % Number of iterations
    for iter = 1:noiter
        fprintf('Iter %d: %6.4f\n', iter, x);
        x = sqrt(cos(x));             % Do the update
    end

35 A side-step: fix-point iterations, cont'd (bounded vs. infinite time). Solve iteratively: x_{i+1} ← √cos(x_i). An equation x = g(x) can be solved iteratively when |g'(x)| < 1.

36 A side-step: fix-point iterations, cont'd (bounded vs. infinite time). Solve iteratively: x_{i+1} ← √cos(x_i). An equation x = g(x) can be solved iteratively when |g'(x)| < 1. Output: starting from x = 0.5, the printed iterates settle down after about ten iterations and stay at the fixed point x ≈ 0.82 for the remaining iterations.

37 A side-step: fix-point iterations in higher dimensions (bounded vs. infinite time). It can work in higher dimensions, too! We solve a pair of simultaneous (polynomial) equations of the form x = g_1(x, y), y = g_2(x, y) simply by iterating x_{i+1} ← g_1(x_i, y_i), y_{i+1} ← g_2(x_i, y_i) to obtain the solution (x, y). Note the indices: y_{i+1} is calculated using x_i even if x_{i+1} is already known.
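A runnable sketch of the same idea, using a hypothetical contractive pair of equations rather than the slide's polynomial system:

    % Jacobi-style fix-point iteration in two dimensions.
    % Hypothetical system (not the one on the slide): x = (y + 1)/3, y = (x + 2)/4.
    x = 0; y = 0;                                % initial guess
    for iter = 1:20
        xnew = (y + 1)/3;                        % uses the old y
        ynew = (x + 2)/4;                        % uses the old x, not xnew
        x = xnew; y = ynew;
    end
    fprintf('x = %6.4f, y = %6.4f\n', x, y);     % converges to (6/11, 7/11)

Both partial derivatives of the update are below 1 in absolute value, so the iteration contracts, mirroring the |g'(x)| < 1 condition in one dimension.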

38 Recap: what we are up to (bounded vs. infinite time). To solve these decision problems we need a mapping from any state/action history σ_t to A_t (the next action). We have simplified: π_t(σ_t) = π_t(S_t) (by the Markov assumption) = π(S_t) (by stationarity). We proceed in two steps: find a utility function U(S_t) representing how good it is to be in S_t; define π(S_t) to maximize the expected utility. (Figures: a possible representation, U(S_t), and the corresponding π(S_t).) We have defined U(S_t), but so far do not know how to calculate it. Fix-point iterations coming up next...

39 Finding optimal strategies (value iteration). The maximum expected utility of starting in state s_0 is: U*(s_0) = max_π U(s_0, π) = max_π E[ Σ_{i=0}^∞ γ^i R(S_i) | π, S_0 = s_0 ]. But how do we calculate this?

40 Finding optimal strategies (value iteration). The maximum expected utility of starting in state s_0 is: U*(s_0) = max_π U(s_0, π) = max_π E[ Σ_{i=0}^∞ γ^i R(S_i) | π, S_0 = s_0 ]. But how do we calculate this? In any state we choose the action maximizing the expected utility: π(s) = argmax_a Σ_{s'} P(s' | s, a) U*(s'). Thus, U*(s) = R(s) + γ max_a Σ_{s'} P(s' | s, a) U*(s').

41 Finding optimal strategies (value iteration). In any state we choose the action maximizing the expected utility: π(s) = argmax_a Σ_{s'} P(s' | s, a) U*(s'). Thus, U*(s) = R(s) + γ max_a Σ_{s'} P(s' | s, a) U*(s'). We now have n non-linear equations in n unknowns (n = number of states). A solution to these equations corresponds to U*.

42 Value iteration: fix-point iterations in value-space. Start with an initial guess at the utility function, and iteratively refine it using the idea of fix-point iterations. (Figures: the initial guess, after the first iteration, and after the second iteration.) The updating function: Û_{j+1}(s) ← R(s) + γ max_a Σ_{s'} P(s' | a, s) Û_j(s').

43 Value iteration: robot navigation. (Figure: the initial guess Û_0.) The corresponding optimal strategy is uninformative.

44 Value iteration: robot navigation. (Figures: the initial guess Û_0 and the first iteration Û_1.)

45 Value iteration: robot navigation. (Figures: the initial guess Û_0 and the first iteration Û_1.) For one cell, Û_1({·,·}) = R({·,·}) + γ max{ Σ_{s'} P(s' | north, {·,·}) Û_0(s'), ..., Σ_{s'} P(s' | west, {·,·}) Û_0(s') } = R({·,·}) + γ · max{0, 0, 0, 0}.

46 Value iteration: robot navigation. (Figures: the initial guess Û_0 and the first iteration Û_1.) For another cell, Û_1({·,·}) = R({·,·}) + γ max{ Σ_{s'} P(s' | north, {·,·}) Û_0(s'), ..., Σ_{s'} P(s' | west, {·,·}) Û_0(s') } = R({·,·}) + γ · max{0, 0, 0, 0} = 5.

47 Value iteration: robot navigation. (Figures: the initial guess Û_0 and the first iteration Û_1.) The corresponding optimal strategy is still rather silly.

48 Value iteration: robot navigation. (Figures: the initial guess Û_0, the first iteration Û_1, and the second iteration Û_2.)

49 Value iteration: robot navigation. (Figures: Û_0, Û_1, and Û_2.) Worked example on the slide: the Bellman update Û_2({·,·}) for one cell, taken as the maximum over the four actions.

50 Value iteration: robot navigation. (Figures: Û_0, Û_1, and Û_2.) Worked example on the slide: the Bellman update Û_2({·,·}) for a second cell, again as the maximum over the four actions.

51 Value iteration: robot navigation. (Figures: Û_0, Û_1, and Û_2.) The optimal strategy corresponding to Û_2.

52 Value iteration: robot navigation. (Figures: Û_0, Û_1, and Û_2.) The optimal strategy corresponding to the converged Û.

53 Value iteration: the impact of the discounting factor. (Figures: the Û function and the optimal strategy for γ = 0.9, and for a small γ. Captions: get to the goal; money accumulates; avoid setbacks; future less relevant.)

54 Value iteration: the algorithm. 1. Choose an ε > 0 to regulate the stopping criterion. 2. Let Û_0 be an initial estimate of the utility function (for example, initialized to zero for all states). 3. Set i := 0. 4. Repeat: let i := i + 1; for each state s in S do Û_i(s) := R(s) + γ max_a Σ_{s'} P(s' | a, s) Û_{i-1}(s'). 5. Until |Û_i(s) - Û_{i-1}(s)| < ε(1 - γ)/γ for all s.
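A direct transcription of this algorithm into code, in the style of the earlier code fragment (reusing P, R, nS and nA from the grid-world sketch; γ and ε are assumed values):

    % Value iteration, following the algorithm above.
    gamma = 0.9; epsilon = 1e-4;
    U = zeros(nS, 1);                            % initial estimate U_0
    while true
        Unew = zeros(nS, 1);
        for s = 1:nS
            q = zeros(nA, 1);
            for a = 1:nA
                q(a) = P(:, s, a)' * U;          % sum_{s'} P(s' | a, s) * U(s')
            end
            Unew(s) = R(s) + gamma * max(q);     % the Bellman update
        end
        delta = max(abs(Unew - U));
        U = Unew;
        if delta < epsilon*(1 - gamma)/gamma
            break;                               % stopping criterion from step 5
        end
    end

The greedy policy can then be read off as π(s) = argmax_a Σ_{s'} P(s' | a, s) U(s'), exactly as in the Bellman equation above.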

55 Value iteration: convergence. So the algorithm converges for this particular example, but does this hold in general? Yes! It can be proven that there is only one true utility function, and that value iteration is guaranteed to converge to it.
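A sketch of why this holds (the standard argument, not spelled out on the slide): write the update as an operator B, with (B Û)(s) = R(s) + γ max_a Σ_{s'} P(s' | a, s) Û(s'). For any two utility functions Û and Û',

    max_s |B Û(s) - B Û'(s)| ≤ γ max_s |Û(s) - Û'(s)|,

so for γ < 1 the operator B is a contraction. Just as in the one-dimensional case with |g'(x)| < 1, a contraction has exactly one fixed point (here U*), and the iterates Û_j converge to it from any starting point.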

56 Value iteration: demo. Deterministic actions vs. stochastic actions. (rl_sim demo: 8_big.maze)

57 Value iteration: efficiency. Value iteration converges, but is it efficient? It estimates the utilities of all states to the same accuracy, also those states that are rarely visited. Convergence is defined from the accuracy of the utility estimates, but the agent only cares about making optimal decisions. (Figure: the error in Û and the error in π as a function of the iteration number.)

58 Policy iteration (finding optimal strategies). Instead of updating the utility function, make an initial guess at the optimal policy and perform an iterative refinement of this guess: π_0 → Û_{π_0} → π_1 → Û_{π_1} → ... → π_m, alternating policy evaluation and policy updating.

59 Policy iteration (finding optimal strategies). Instead of updating the utility function, make an initial guess at the optimal policy and perform an iterative refinement of this guess, alternating policy evaluation and policy updating. The updating function: π_{i+1}(s) := argmax_a Σ_{s'} P(s' | a, s) Û_{π_i}(s').

60 Policy iteration (finding optimal strategies). Instead of updating the utility function, make an initial guess at the optimal policy and perform an iterative refinement of this guess, alternating policy evaluation and policy updating. The updating function: π_{i+1}(s) := argmax_a Σ_{s'} P(s' | a, s) Û_{π_i}(s'). The evaluation function: Û_{π_i}(s) = R(s) + γ Σ_{s'} P(s' | π_i(s), s) Û_{π_i}(s'), which defines a system of linear equations; the solution is U_{π_i}.

61 Policy iteration: the algorithm. 1. Let π_0 be an initial, randomly chosen policy. 2. Set i := 0. 3. Repeat: find the utility function Û_{π_i} corresponding to the policy π_i [policy evaluation]; let i := i + 1; for each s, π_i(s) := argmax_a Σ_{s'} P(s' | a, s) Û_{π_{i-1}}(s') [policy updating]. 4. Until π_i = π_{i-1}.
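The same algorithm as code (again reusing P, R, nS and nA from the grid-world sketch; γ is an assumed value). The policy-evaluation step solves the linear system from the previous slide directly:

    % Policy iteration, following the algorithm above.
    gamma = 0.9;
    pol = ones(nS, 1);                           % initial policy: action 1 in every state
    while true
        % Policy evaluation: solve U = R + gamma * Ppol * U as a linear system.
        Ppol = zeros(nS, nS);
        for s = 1:nS
            Ppol(s, :) = P(:, s, pol(s))';       % row s: next-state distribution under pol(s)
        end
        U = (eye(nS) - gamma * Ppol) \ R;
        % Policy updating: greedy with respect to U.
        newpol = pol;
        for s = 1:nS
            q = zeros(nA, 1);
            for a = 1:nA
                q(a) = P(:, s, a)' * U;
            end
            [~, newpol(s)] = max(q);
        end
        if isequal(newpol, pol)
            break;                               % the policy is stable
        end
        pol = newpol;
    end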

62 Value iteration vs. policy iteration: demo. (rl_sim demo: _smallest.maze)

63 Partial observability (partial observability and POMDPs). Partially Observable Markov Decision Process (POMDP): the agent does not observe the environment fully, so it does not know the state it is in. A POMDP is like an MDP, but has an observation model P(e_t | s_t) defining the probability that the agent obtains evidence e_t when in state s_t. The agent does not know which state it is in, so it makes no sense to talk about a policy π(s)!

64 Solving POMDPs (partial observability and POMDPs). Theorem: the optimal policy in a POMDP is a function π(b), where b is the belief state (probability distribution over states) of the agent. Hence, we can convert a POMDP into an MDP in belief-state space, where P(b_{t+1} | a_t, b_t) is the probability of the new belief state b_{t+1} given that the current belief state is b_t and the agent does a_t. If there are n states, b is an n-dimensional real-valued vector, so solving POMDPs is very hard (PSPACE-hard). The real world is a POMDP (with initially unknown transition and observation models).
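A sketch of one step of belief updating, the basic operation behind the belief-state MDP (reusing P, nS and goal from the grid-world sketch; the two-observation sensor model Pe below is entirely made up for the example):

    % One step of belief updating: b'(s') is proportional to P(e | s') * sum_s P(s' | a, s) b(s).
    Pe = [0.9*ones(1, nS); 0.1*ones(1, nS)];     % assumed P(e | s): observation 2 is rare...
    Pe(:, goal) = [0.2; 0.8];                    % ...except in the goal cell
    b = ones(nS, 1) / nS;                        % current belief state (uniform here)
    a = 1; e = 2;                                % chosen action and received evidence
    bnew = zeros(nS, 1);
    for sp = 1:nS
        pred = P(sp, :, a) * b;                  % sum_s P(sp | a, s) * b(s)
        bnew(sp) = Pe(e, sp) * pred;             % weight by the observation model
    end
    bnew = bnew / sum(bnew);                     % normalize: the new belief state

An optimal POMDP policy then maps such belief vectors to actions, as stated in the theorem.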

65 Summary. Sequential decision problems. Assumptions: stationarity, the Markov assumption, additive rewards, infinite horizon with discounting. Model class: Markov decision problems. Algorithms: value iteration / policy iteration. Intuitively, MDPs combine probabilistic models over time (filtering, prediction) with the maximum expected utility principle.
