TDT4171 Artificial Intelligence Methods
1 TDT4171 Artificial Intelligence Methods. Lecture 7: Making Complex Decisions. Helge Langseth, Norwegian University of Science and Technology, IT-VEST, helgel@idi.ntnu.no
2 Outline: Summary from last time; Chapter 17: Making Complex Decisions; Setup for sequential problems; Markov Decision Processes (MDPs); Bounded vs. infinite time; Finding optimal strategies: Value iteration; Finding optimal strategies: Policy iteration; Partial observability and POMDPs; Summary.
3 Summary from last time. Rational agents can always use utilities to make decisions. The MEU principle tells us how to behave. It can be quite laborious to elicit preference structures from domain experts; structured approaches are available. Value of Information helps focus information gathering for rational agents. Influence diagrams are extensions to BNs that let us make rational decisions. Note! Assignment is out today.
4 Chapter 17: Making Complex Decisions. Learning goals: Understanding the relationship between one-shot decisions, sequential decisions, decisions for a finite horizon, and decisions for an infinite horizon. Being familiar with: Markov Decision Processes, value iteration, policy iteration. Know about: Partially Observable Markov Decision Processes.
5 Setup for sequential problems. Decision problems with an unbounded time horizon. Examples: fishing in the North Sea; robot navigation (find a path from the current position to a certain goal position).
6 Setup for sequential problems (cont'd). Characteristics: at each step we are faced with the same type of decision; at each step we are given a certain reward (possibly negative) determined by the chosen decision and the state of the world; the outcome of a decision may be uncertain; the time horizon of the decision problem is unbounded.
7 What are we trying to obtain? To solve these decision problems we need a mapping from any state/action history σ_t = {S_0, A_0, S_1, A_1, S_2, A_2, ..., S_t} to A_t (the next action). π_t(σ_t) = a_t means "If you've seen σ_t at time t, then do a_t." We want to simplify, e.g. to say π_t(σ_t) = π_t(S_t) = π(S_t). [Figure: a possible representation of π(S_t).] Always sufficient?
8 What are we trying to obtain? (cont'd) We can think of this in two steps: (1) find a utility function U_t(σ_t) representing how good it is to be in S_t; (2) define π_t so that it maximizes the expected utility at each step. [Figures: a possible representation of U_t(σ_t), and the corresponding π_t(S_t).]
9 Relation to planning. Solving such decision problems is related to solving a planning task. [Figure: find the shortest path in a maze; the golden square is the goal.]
10 Relation to planning (cont'd). Detail: near-wall behaviour. [Figure: the solution under ordinary planning.]
11 Relation to planning (cont'd). Detail: near-wall behaviour. [Figure: the solution under stochastic planning. The robot may fail to execute an action, and there is a penalty for hitting walls.]
12 Markov Decision Processes (MDPs). The robot navigation problem can roughly be described as a loop: observe the state of the world; collect a (possibly negative) reward R (not the same as U!); decide on the next action and perform it. This can be modelled as a Markov decision process. [Figure: DBN with states S_0, ..., S_t, S_{t+1}, actions A_0, ..., A_t, A_{t+1}, and rewards R_0, ..., R_t, R_{t+1}.] Note! The model adheres to the Markov assumption. In particular, {S_0, A_0, ..., S_{t-1}, A_{t-1}} ⊥ A_t | S_t, so π_t(σ_t) = π_t(S_t).
13 Markov Decision Processes (MDPs), cont'd. Note also! R_t is determined by A_t and S_t. However, as R_t(s_t, a_t) = R_t(s_t) in the robot example, I will simplify the math accordingly.
14 The quantitative part of the MDP. In order to specify the transition probabilities P(S_{t+1} | A_t, S_t) and the reward function R(S_t) we need some more information about the domain: The robot can move north, east, south, and west. For each move there is a fuel expenditure of 0.1. A move succeeds with probability 0.7; otherwise the robot moves in one of the other directions with equal probability (0.1 each). If it walks into a wall, the robot effectively stands still.
15 The quantitative part of the MDP (cont'd). This gives the transition probabilities, e.g. for P(S_{t+1} | north, S_t). [Table: the transition matrix over the grid cells; the numeric entries were lost in transcription.] We say that a state s is absorbing when P(s | A_t, s) = 1; the goal cell is absorbing. (A sketch of how such a transition model can be built follows below.)
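To make the transition model concrete, here is a minimal sketch in the same MATLAB style as the fix-point code later in the lecture. It builds P(s' | s, a) for a small grid under the dynamics above; the grid size, the state indexing, and all variable names are my own assumptions, and the absorbing goal cell would additionally have its rows overwritten so that it maps to itself with probability 1.

    % Sketch: transition model for an n-by-n grid, assuming the slide's
    % dynamics (0.7 success, 0.1 for each other direction, walls block).
    n = 3; nS = n*n;                    % states indexed s = (x-1)*n + y
    dirs = [0 1; 1 0; 0 -1; -1 0];      % north, east, south, west as (dx,dy)
    P = zeros(nS, nS, 4);               % P(s', s, a) = P(s' | s, a)
    for x = 1:n
      for y = 1:n
        s = (x-1)*n + y;
        for a = 1:4
          for d = 1:4
            pr = 0.1; if d == a, pr = 0.7; end
            nx = x + dirs(d,1); ny = y + dirs(d,2);
            if nx < 1 || nx > n || ny < 1 || ny > n   % wall: stay put
              nx = x; ny = y;
            end
            sp = (nx-1)*n + ny;
            P(sp, s, a) = P(sp, s, a) + pr;
          end
        end
      end
    end

Each (s, a) column then sums to 0.7 + 3 · 0.1 = 1, and moves that would leave the grid put their probability mass on staying put, matching the wall behaviour on the slide.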
16 MDPs in general. In a Markov decision process: the world is fully observable; the uncertainty in the system is due to non-deterministic actions; for each decision we get a reward (which may be negative), which may depend on the current world state and chosen action, but is independent of time (stationarity of the reward model). [Figure: the DBN from before.]
17 Decision policies. A decision policy for A_t is in general a function over the entire past, {S_0, A_0, S_1, A_1, S_2, A_2, ..., S_t}. However, from the conditional independence properties we see that the relevant past is reduced to S_t. This is similar to filtering and prediction in the dynamic models.
18 Types of strategies: a bounded time horizon. The (approximated) North Sea fishing example over a five-year period: S_t is the amount of fish in the sea at the start of time period t, and A_t is the amount being fished during period t. [Figure: DBN with S_0, ..., S_4, A_0, ..., A_4, and R_0, ..., R_4.]
19 Types of strategies: a bounded time horizon (cont'd). Even if S_0 and S_4 are the same state: at t = 0 we may specify a conservative number to ensure that there is enough fish in the coming years; at t = 4 we have no concerns about the future, and specify a high volume. The optimal policy for A_t depends on the time step t!
20 Length of horizon vs. optimal strategy. The optimal strategy changes as the time horizon increases. Let k be the length of the time horizon, and consider the robot navigation task. [Figure: R(S_t) over the grid, and the optimal strategies for small k and for large k.] For small k and the given start position we have to accept the penalty cell to make it to the goal state on time. For k ≥ 6 we have time to take the long route. Non-stationarity again!
21 Types of strategies: an unbounded time horizon. The (approximated) North Sea fishing example with an unbounded time horizon. [Figure: unbounded DBN.] The optimal policy for A_t depends on the current state and what may happen in the future. If two time steps, say year t = 0 and t = 4, are in the same state, then they have the same possibilities in the future. The optimal policies for A_0 and A_4 are the same! The strategy is said to be stationary.
22 The utility of an unbounded sequence: fixed horizon. Consider only the rewards obtained in the first k states: U(s_0, s_1, s_2, ...) = R(s_0) + R(s_1) + ... + R(s_k) < ∞. But how do we choose k? For k = 0 we only care about the immediate reward, hence we pursue a very greedy strategy. As k approaches ∞, we put less and less focus on the initial behaviour.
23 Evaluating strategies. Assume that the reward function is specified as shown. [Figure: reward function over the grid, and two candidate strategies.] Imagine there is no terminal state and no uncertainty in the result of an action. Then both strategies accumulate unbounded reward: EU_1 = ∞ and EU_2 = ∞. But which one is better?
24 The utility of an unbounded sequence: discounted rewards. Weigh rewards in the immediate future higher than rewards in the distant future: U(s_0, s_1, s_2, ...) = R(s_0) + γ R(s_1) + γ² R(s_2) + ..., where 0 ≤ γ ≤ 1.
25 The utility of an unbounded sequence: discounted rewards (cont'd). Possible interpretations of the discounting factor γ: the decision process may terminate with probability (1 − γ) at any point in time, e.g. the robot breaking down; in economics, γ may be thought of as corresponding to an interest rate r = (1/γ) − 1.
26 The utility of an unbounded sequence: discounted rewards (cont'd). For γ = 0 we have a greedy strategy. With 0 < γ < 1 and R_max = max_t R(s_t) < ∞ we have U(s_0, s_1, s_2, ...) = Σ_{i=0}^∞ γ^i R(s_i) ≤ Σ_{i=0}^∞ γ^i R_max = R_max / (1 − γ) < ∞. For γ = 1 we have normal additive rewards. (A quick numeric check of this bound follows below.)
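As a quick numeric sanity check of the geometric bound, here is a small sketch; the reward sequence is an arbitrary choice of my own.

    % Sketch: discounted return of a reward prefix, and the geometric bound.
    gamma = 0.9;
    R = [1 0 -1 2 1 0 1];                    % an arbitrary reward sequence
    U = sum(gamma.^(0:numel(R)-1) .* R);     % R(s_0) + gamma*R(s_1) + ...
    bound = max(abs(R)) / (1 - gamma);       % R_max / (1 - gamma)
    fprintf('U = %.3f <= bound = %.1f\n', U, bound);

Here U ≈ 2.836 while the bound is 20; the bound is loose but guarantees that the discounted utility of any unbounded sequence with bounded rewards is finite.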
27 Expected utilities. The actions may be non-deterministic, so a strategy may only take you to a state with a certain probability. Strategies should be compared based on the expected rewards they can produce.
28 Expected utilities (cont'd). Starting in state s_0 and following strategy π, the expected reward in step i is Σ_{s_i} R(s_i) P(S_i = s_i | π, S_0 = s_0); the expected discounted reward in step i is Σ_{s_i} γ^i R(s_i) P(S_i = s_i | π, S_0 = s_0).
29 Expected utilities (cont'd). The expected discounted reward of π is defined as (for a fixed N): U(s_0, π) = Σ_{i=0}^N ( Σ_{s_i} γ^i R(s_i) P(S_i = s_i | π, S_0 = s_0) ).
30 Expected utilities (cont'd). The expected discounted reward of π is defined as: U(s_0, π) = lim_{N→∞} Σ_{i=0}^N ( Σ_{s_i} γ^i R(s_i) P(S_i = s_i | π, S_0 = s_0) ).
31 Expected utilities (cont'd). A standard notation for U(s_0, π) is also: E[ Σ_{i=0}^∞ γ^i R(S_i) | π, S_0 = s_0 ]. (This expectation can also be approximated by simulation, as sketched below.)
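The expectation above can be approximated by Monte Carlo simulation. A minimal sketch, assuming the transition model P(s', s, a) from the earlier sketch, a reward vector R, and a stationary policy given as a vector of action indices (named pol to avoid shadowing MATLAB's built-in pi); mc_value, N, and horizon are my own names.

    % Monte Carlo estimate of U(s0, pol): average the discounted return
    % of N simulated runs, truncating the infinite sum at a finite horizon.
    function U = mc_value(P, R, pol, s0, gamma, N, horizon)
      U = 0;
      for run = 1:N
        s = s0; G = 0; g = 1;
        for t = 0:horizon
          G = G + g * R(s);                               % gamma^t R(s_t)
          s = find(rand() <= cumsum(P(:, s, pol(s))), 1); % sample next state
          g = g * gamma;
        end
        U = U + G / N;
      end
    end

Truncating at a finite horizon introduces an error of at most γ^(horizon+1) R_max / (1 − γ), which follows directly from the geometric bound two slides back.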
32 A side-step: fix-point iterations. Solve the equation x² − cos(x) = 0 on x ∈ [0, 1]. How can we proceed to find an approximate solution? Discuss with your neighbour for a couple of minutes...
33 A side-step: fix-point iterations (cont'd). Rather solve x = √(cos(x)) (same solution for x ∈ [0, 1]). Why is that easier?
34 A side-step: fix-point iterations (cont'd). Solve iteratively: x_{i+1} ← √(cos(x_i)). An equation x = g(x) can be solved iteratively when |g'(x)| < 1. Code fragment:

    noiter = 16;               % number of iterations (16 shown in the output)
    x = 0.5;                   % initial value
    for iter = 1:noiter
        fprintf('Iter %d: %6.4f\n', iter, x);
        x = sqrt(cos(x));      % do the update
    end
35 A side-step: fix-point iterations (cont'd). Solve iteratively: x_{i+1} ← √(cos(x_i)). An equation x = g(x) can be solved iteratively when |g'(x)| < 1. [Figure: plot of the iterates converging to the fixed point.]
36 A side-step: fix-point iterations (cont'd). Output: sixteen iterations are shown; the iterates settle quickly, and from around iteration 10 onwards the printed value no longer changes, having converged to the fixed point x ≈ 0.824.
37 A side-step: fix-point iterations (higher dims). It can work in higher dimensions, too! We solve a pair of equations x = g_1(x, y), y = g_2(x, y) simply by iterating x_{i+1} ← g_1(x_i, y_i), y_{i+1} ← g_2(x_i, y_i) until the pair converges. [The specific polynomial system and its solution on the slide were lost in transcription.] Note the indices! y_{i+1} is calculated using x_i even if x_{i+1} is already known.
38 Recap: what we are up to. To solve these decision problems we need a mapping from any state/action history σ_t to A_t (the next action). We have simplified: π_t(σ_t) = π_t(S_t) (Markov) = π(S_t) (stationarity). We proceed in two steps: (1) find a utility function U(S_t) representing how good it is to be in S_t; (2) define π(S_t) to maximize the expected utility. We have defined U(S_t), but so far do not know how to calculate it. Fix-point iterations coming up next...
39 Finding optimal strategies: value iteration. The maximum expected utility of starting in state s_0 is: U*(s_0) = max_π U(s_0, π) = max_π E[ Σ_{i=0}^∞ γ^i R(S_i) | π, S_0 = s_0 ]. But how do we calculate this?
40 Finding optimal strategies: value iteration (cont'd). In any state we choose the action maximizing the expected utility: π(s) = argmax_a Σ_{s'} P(s' | s, a) U*(s'). Thus, U*(s) = R(s) + γ max_a Σ_{s'} P(s' | s, a) U*(s').
41 Finding optimal strategies: value iteration (cont'd). We now have n non-linear equations in n unknowns (n = number of states). A solution to these equations corresponds to U*. (A tiny worked instance of these equations follows below.)
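To see what these equations look like, here is a tiny worked example of my own (not from the slides). Take two states with R(s_1) = 0, R(s_2) = 1 and γ = 1/2; action a moves from s_1 to s_2 with certainty, action b stays in s_1, and s_2 is absorbing. The Bellman equations are:

    U*(s_2) = 1 + (1/2) U*(s_2), so U*(s_2) = 2;
    U*(s_1) = 0 + (1/2) max{ U*(s_2), U*(s_1) } = (1/2) max{ 2, U*(s_1) },

which forces U*(s_1) = 1 (the alternative, U*(s_1) = (1/2) U*(s_1) = 0, is dominated by taking action a). The non-linearity comes entirely from the max over actions; once the policy is held fixed, the system becomes linear, which is exactly what policy iteration will exploit later in the lecture.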
42 Value iteration: fix-point iterations in value-space. Start with an initial guess at the utility function, and iteratively refine it using the idea of fix-point iterations. [Figures: the initial guess, after the first iteration, after the second iteration.] The updating function: Û_{j+1}(s) ← R(s) + γ max_a Σ_{s'} P(s' | a, s) Û_j(s').
43 Value iteration: robot navigation. Initial guess Û_0 (all utilities zero). The corresponding optimal strategy is uninformative. [Figure.]
44 Value iteration: robot navigation. First iteration Û_1. [Figure.]
45 Value iteration: robot navigation. Computing one cell of Û_1: Û_1(s) = R(s) + γ max{ Σ_{s'} P(s' | north, s) Û_0(s'), ..., Σ_{s'} P(s' | west, s) Û_0(s') } = R(s) + γ max{0, 0, 0, 0}.
46 Value iteration: robot navigation. Since Û_0 ≡ 0, the update reduces to Û_1(s) = R(s); for the cell worked on the slide this gives Û_1 = 5 (presumably the goal cell, whose reward is 5).
47 Value iteration: robot navigation. The optimal strategy corresponding to Û_1 is still rather silly. [Figure.]
48 Value iteration: robot navigation. Second iteration Û_2. [Figure.]
49 Value iteration: robot navigation. One worked cell of Û_2: the four action values Σ_{s'} P(s' | a, s) Û_1(s') are compared and the best one, 6.48, is kept; with γ = 0.9 and a step reward of −0.1 this gives Û_2(s) = −0.1 + 0.9 · 6.48 ≈ 5.7. [Several digits on the slide were lost in transcription.]
50 Value iteration: robot navigation. A second worked cell of Û_2, evaluated the same way over the four actions. [The digits on the slide were lost in transcription.]
51 Value iteration: robot navigation. The optimal strategy corresponding to Û_2. [Figure.]
53 Value iteration: the impact of the discounting factor. [Figures: the Û function and the optimal strategy for γ = 0.9 and for γ = 0.1.] With γ = 0.9: get to the goal; money accumulates. With γ = 0.1: avoid setbacks; the future is less relevant.
54 Value iteration: the algorithm. (A code sketch follows below.)
1. Choose an ε > 0 to regulate the stopping criterion.
2. Let Û_0 be an initial estimate of the utility function (for example, initialized to zero for all states).
3. Set i := 0.
4. Repeat: let i := i + 1; for each state s in S do Û_i(s) := R(s) + γ max_a Σ_{s'} P(s' | a, s) Û_{i−1}(s').
5. Until |Û_i(s) − Û_{i−1}(s)| < ε(1 − γ)/γ for all s.
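The algorithm translates almost line for line into code. A sketch in the same MATLAB style, assuming the P, R representation from the earlier transition-model sketch; value_iteration and tol are my own names, and pol again denotes the policy vector.

    % Value iteration: repeat Bellman updates until the utilities change
    % by less than tol*(1-gamma)/gamma, then read off the greedy policy.
    function [U, pol] = value_iteration(P, R, gamma, tol)
      [nS, ~, nA] = size(P);
      U = zeros(nS, 1);                   % step 2: initial estimate U_0 = 0
      while true                          % step 4
        Uold = U;
        for s = 1:nS
          q = zeros(nA, 1);
          for a = 1:nA
            q(a) = P(:, s, a)' * Uold;    % sum over s' of P(s'|a,s) U(s')
          end
          U(s) = R(s) + gamma * max(q);   % the Bellman update
        end
        if max(abs(U - Uold)) < tol * (1 - gamma) / gamma
          break;                          % step 5: stopping criterion
        end
      end
      pol = zeros(nS, 1);                 % extract the greedy policy
      for s = 1:nS
        q = zeros(nA, 1);
        for a = 1:nA, q(a) = P(:, s, a)' * U; end
        [~, pol(s)] = max(q);
      end
    end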
55 Value iteration: convergence. So the algorithm converges for this particular example, but does this hold in general? Yes! It can be proven that there is only one true utility function, and that value iteration is guaranteed to converge to this utility function.
56 Value iteration: demo. Deterministic actions vs. stochastic actions. rl_sim demo: 8_big.maze.
57 Value iteration: efficiency. Value iteration converges, but is it efficient? It estimates the utilities of all states with the same requirements towards accuracy, including states that are rarely visited. Convergence is defined from the accuracy of the utility estimates, but the agent only cares about making optimal decisions. [Figure: error in Û and error in π as a function of iteration; the policy stops changing long before the utilities have converged.]
58 Policy iteration. Instead of updating the utility function, make an initial guess at the optimal policy and perform an iterative refinement of this guess: π_0 → Û^{π_0} → π_1 → Û^{π_1} → ... → π_m, alternating policy evaluation and policy updating.
59 Policy iteration (cont'd). The updating function: π_{i+1}(s) := argmax_a Σ_{s'} P(s' | a, s) Û^{π_i}(s').
60 Policy iteration (cont'd). The evaluation function: Û^{π_i}(s) = R(s) + γ Σ_{s'} P(s' | π_i(s), s) Û^{π_i}(s'), which defines a system of linear equations; the solution is U^{π_i}.
61 Policy iteration: the algorithm. (A code sketch follows below.)
1. Let π_0 be an initial randomly chosen policy. Set i := 0.
2. Repeat: find the utility function Û^{π_i} corresponding to the policy π_i [policy evaluation]; let i := i + 1; for each s, π_i(s) := argmax_a Σ_{s'} P(s' | a, s) Û^{π_{i−1}}(s') [policy updating].
3. Until π_i = π_{i−1}.
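A corresponding sketch of policy iteration, again assuming the P, R representation from before. The policy-evaluation step solves the linear system (I − γ P_π) U = R directly with a backslash solve rather than iteratively; on ties the argmax silently picks the first maximizer, which is enough for this sketch.

    % Policy iteration: alternate exact policy evaluation (a linear solve)
    % with greedy policy updating, until the policy stops changing.
    function [pol, U] = policy_iteration(P, R, gamma)
      [nS, ~, nA] = size(P);
      pol = ones(nS, 1);                   % step 1: arbitrary initial policy
      while true
        Ppi = zeros(nS, nS);               % row s holds P(. | pol(s), s)
        for s = 1:nS, Ppi(s, :) = P(:, s, pol(s))'; end
        U = (eye(nS) - gamma * Ppi) \ R(:);% policy evaluation
        polOld = pol;
        for s = 1:nS                       % policy updating
          q = zeros(nA, 1);
          for a = 1:nA, q(a) = P(:, s, a)' * U; end
          [~, pol(s)] = max(q);
        end
        if isequal(pol, polOld), break; end% step 3: policy unchanged
      end
    end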
62 Value iteration vs. policy iteration: demo. rl_sim demo: _smallest.maze.
63 Partial observability. Partially Observable Markov Decision Process (POMDP): the agent does not observe the environment fully, so it does not know which state it is in. A POMDP is like an MDP, but has an observation model P(e_t | s_t) defining the probability that the agent obtains evidence e_t when in state s_t. Since the agent does not know which state it is in, it makes no sense to talk about a policy π(s)!
64 Solving POMDPs. Theorem: The optimal policy in a POMDP is a function π(b), where b is the belief state (a probability distribution over states) of the agent. Hence, we can convert a POMDP into an MDP in belief-state space, where P(b_{t+1} | a_t, b_t) is the probability of the new belief state b_{t+1} given that the current belief state is b_t and the agent does a_t. If there are n states, b is an n-dimensional real-valued vector, so solving POMDPs is very hard! (PSPACE-hard.) The real world is a POMDP (with initially unknown transition and observation models). (A sketch of the belief-state update follows below.)
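Although solving a POMDP is hard, maintaining the belief state itself is just Bayesian filtering. A minimal sketch, assuming the transition model P(s', s, a) from before and an observation matrix O with O(e, s) = P(e | s); both the matrix layout and the function name are my own choices.

    % One belief-state update: predict through the action, weight by the
    % observation likelihood, and renormalize.
    function b = belief_update(b, a, e, P, O)
      b = P(:, :, a) * b(:);      % predict: b'(s') = sum_s P(s'|s,a) b(s)
      b = O(e, :)' .* b;          % condition on evidence: multiply by P(e|s')
      b = b / sum(b);             % normalize to a probability distribution
    end

The belief-MDP transition P(b_{t+1} | a_t, b_t) mentioned above is then induced by this update: for each possible evidence e, the next belief is belief_update(b, a, e, P, O), reached with probability P(e | a, b).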
65 Summary: sequential decision problems. Assumptions: stationarity, the Markov assumption, additive rewards, infinite horizon with discounting. Model class: Markov decision processes. Algorithms: value iteration / policy iteration. Intuitively, MDPs combine probabilistic models over time (filtering, prediction) with the maximum expected utility principle.