MDPs and Value Iteration 2/20/17


Anthony Randall
6 years ago

1 MDPs and Value Iteration 2/20/17

2 Recall: State Space Search Problems
- A set of discrete states
- A distinguished start state
- A set of actions available to the agent in each state
- An action function that, given a state and an action, returns a new state
- A set of goal states, often specified as a function
- A way to measure solution quality

3 What if actions aren't perfect?
We might not know exactly which next state will result from an action. We can model this as a probability distribution over next states.

4 Search with Non-Deterministic Actions
- A set of discrete states
- A distinguished start state
- A set of actions available to the agent in each state
- An action function that, given a state and an action, returns a probability distribution over next states (rather than a single new state)
- A set of terminal states (replacing the set of goal states)
- A reward function that gives a utility for each state (replacing the measure of solution quality)

5 Markov Decision Processes (MDPs)
Named after the Markov property: if you know the state, then you know the transition probabilities. They depend only on the current state and action, not on how the agent got there.
- We still represent states and actions.
- Actions no longer lead to a single next state. Instead, they lead to one of several possible states, determined randomly.
- We're now working with utilities instead of goals. Expected utility works well for handling randomness.
- We need to plan for unintended consequences.
- Even an optimal agent may run forever!

6 State Space Search vs. MDPs

State Space Search                      MDPs
States: $S$                             States: $S$
Actions: $A_s$                          Actions: $A_s$
Transition function: $F(s, a) = s'$     Transition probabilities: $P(s' \mid s, a)$
Start $\in S$                           Start $\in S$
Goals $\subset S$                       Terminal $\subset S$
Action costs: $C(a)$                    State rewards: $R(s)$ (can also have costs: $C(a)$)

7 We can't rely on a single plan!
Actions might not have the outcome we expect, so our plans need to include contingencies for states we could end up in. Instead of searching for a plan, we devise a policy.
- A policy is a function that maps states to actions.
- For each state we could end up in, the policy tells us which action to take.

8 A simple example: Grid World
[Figure: grid world with a start cell and two end cells, one with reward +1 and one with reward -1]
If actions were deterministic, we could solve this with state space search.
- (3,2) would be a goal state
- (3,1) would be a dead end

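9 A simple example: Grid World
[Figure: grid world with a start cell and two end cells, one with reward +1 and one with reward -1]
Suppose instead that the move we try to make only works correctly 80% of the time. For each of the two perpendicular directions, there is a 10% chance we go that way instead, e.g. we try to go right but go up. If the resulting move is impossible, we stay in place.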
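The 80/10/10 movement rule above can be sketched as a function that returns a probability distribution over next states. This is only an illustration: the grid dimensions (4 by 3, with no interior walls), the (x, y) coordinate convention, and the names `move` and `transition` are assumptions, not taken from the slides.

```python
# Assumed layout: 4 columns by 3 rows, states are (x, y), no interior walls.
WIDTH, HEIGHT = 4, 3

MOVES = {"u": (0, 1), "d": (0, -1), "l": (-1, 0), "r": (1, 0)}
PERPENDICULAR = {"u": ("l", "r"), "d": ("l", "r"), "l": ("u", "d"), "r": ("u", "d")}


def move(state, direction):
    """Apply one move; stay in place if it would leave the grid."""
    x, y = state
    dx, dy = MOVES[direction]
    nx, ny = x + dx, y + dy
    if 0 <= nx < WIDTH and 0 <= ny < HEIGHT:
        return (nx, ny)
    return state


def transition(state, action):
    """Return {next_state: probability} for attempting `action` in `state`."""
    outcomes = [(action, 0.8)] + [(side, 0.1) for side in PERPENDICULAR[action]]
    dist = {}
    for direction, prob in outcomes:
        ns = move(state, direction)
        # Several outcomes can land on the same cell (e.g. bumping into an edge),
        # so probabilities are accumulated rather than overwritten.
        dist[ns] = dist.get(ns, 0.0) + prob
    return dist
```

For instance, `transition((0, 0), "r")` puts most of the probability on `(1, 0)`, with the rest split between `(0, 1)` and staying at `(0, 0)`.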

10 A simple example: Grid World
[Figure: grid world with a start cell and two end cells, one with reward +1 and one with reward -1]
Before, we had two equally good alternatives. Which path is better when actions are uncertain? What should we do if we find ourselves in (2,1)?

11 Discount Factor
Specifies how impatient the agent is.
- Key idea: reward now is better than reward later.
- Rewards in the future are exponentially decayed.
- Reward $t$ steps in the future is discounted by $\gamma^t$:
  $U = \sum_t \gamma^t R_t$
Why do we need a discount factor?
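A small sketch of the discounted sum may help here. The function name `discounted_utility` and the sample reward sequences are illustrative assumptions. One reason a discount factor helps: with $\gamma < 1$ and bounded rewards, the sum stays finite even if the agent runs forever.

```python
def discounted_utility(rewards, gamma):
    """Compute U = sum over t of gamma^t * R_t for a finite reward sequence."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


# The same +1 reward is worth less the later it arrives.
soon = discounted_utility([0, 1], gamma=0.9)
late = discounted_utility([0, 0, 0, 1], gamma=0.9)

# A reward of 1 on every step for a long time approaches 1 / (1 - gamma) = 10.
long_run = discounted_utility([1] * 1000, gamma=0.9)
```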

12 Value of a State
To come up with an optimal policy, we start by determining a value for each state. The value of a state is the reward now, plus the discounted future reward:
$V(s) = R(s) + \gamma \cdot [\text{future value}]$
Assume we'll do the best thing in the future.

13 Future Value
If we know the value of other states, we can calculate the expected value of each action:
$E(s, a) = \sum_{s'} P(s' \mid s, a) \, V(s')$
Future value is the expected value of the best action:
$\max_a E(s, a)$
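The two formulas above can be sketched directly in code. The state names, the action names, and the helper names `expected_value` and `future_value` are hypothetical, chosen only to make the arithmetic concrete.

```python
def expected_value(dist, values):
    """E(s, a): sum over next states s' of P(s' | s, a) * V(s')."""
    return sum(prob * values[ns] for ns, prob in dist.items())


def future_value(action_dists, values):
    """Expected value of the best action: max over a of E(s, a)."""
    return max(expected_value(dist, values) for dist in action_dists.values())


# Hypothetical values for three states and two actions available in some state s.
values = {"A": 0.0, "B": 1.0, "C": -1.0}
action_dists = {
    "safe": {"A": 1.0},             # always ends up in A
    "risky": {"B": 0.8, "C": 0.2},  # usually B, sometimes C
}

ev_safe = expected_value(action_dists["safe"], values)    # 1.0 * 0.0
ev_risky = expected_value(action_dists["risky"], values)  # 0.8 * 1.0 + 0.2 * -1.0
best = future_value(action_dists, values)
```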

14 Value Iteration
The value of state $s$ depends on the value of other states $s'$. The value of $s'$ may in turn depend on the value of $s$. We can iteratively approximate the values using dynamic programming.
- Initialize all values to the immediate rewards.
- Update values based on the best next state.
- Repeat until convergence (values don't change).

15 Value Iteration Pseudocode
values = {state : R(state) for each state}
until values don't change:
    prev = copy of values
    for each state s:
        best_EV = -infinity
        for each action a:
            EV = 0
            for each next state ns:
                EV += P(ns | s, a) * prev[ns]
            best_EV = max(EV, best_EV)
        values[s] = R(s) + gamma * best_EV
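A runnable sketch of the pseudocode above, under stated assumptions: the MDP is passed in as plain Python callables, terminal states keep their immediate reward instead of being updated (the pseudocode does not say how terminals are handled), and "values don't change" is approximated with a small tolerance. The three-cell corridor used to exercise it is hypothetical and is not the grid world from the slides.

```python
def value_iteration(states, actions, transition, reward, terminals,
                    gamma=0.9, tol=1e-9):
    """Return {state: value} after iterating until values stop changing."""
    values = {s: reward(s) for s in states}
    while True:
        prev = dict(values)  # update from a frozen copy of the last sweep
        for s in states:
            if s in terminals:
                continue  # assumption: terminal states keep their reward
            best_ev = float("-inf")
            for a in actions:
                ev = sum(p * prev[ns] for ns, p in transition(s, a).items())
                best_ev = max(ev, best_ev)
            values[s] = reward(s) + gamma * best_ev
        if all(abs(values[s] - prev[s]) <= tol for s in states):
            return values


# Hypothetical corridor: cells 0, 1, 2; cell 2 is terminal with reward +1.
def corridor_transition(s, a):
    """Deterministic moves left or right, clamped to the corridor."""
    ns = s + 1 if a == "r" else s - 1
    return {min(max(ns, 0), 2): 1.0}


def corridor_reward(s):
    return 1.0 if s == 2 else 0.0


corridor_values = value_iteration(
    [0, 1, 2], ["l", "r"], corridor_transition, corridor_reward, terminals={2}
)
```

In this corridor each step away from the terminal cell multiplies the value by `gamma`, so the values fall off geometrically with distance.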

16 Value Iteration on Grid World
discount = .9
$V(2,2) = 0 + \gamma \max[E((2,2),u), E((2,2),d), E((2,2),l), E((2,2),r)]$
$V(2,1) = 0 + \gamma \max[E((2,1),u), E((2,1),d), E((2,1),l), E((2,1),r)]$
$V(3,0) = 0 + \gamma \max[E((3,0),u), E((3,0),d), E((3,0),l), E((3,0),r)]$

17 Value Iteration on Grid World
discount = .9
V(2,2) = max[ , , , ]
V(2,1) = max[ , , , ]
V(3,0) = max[ , , , ]

18 Value Iteration on Grid World
discount = .9
Exercise: Continue value iteration.

19 What do we do with the values?
When values have converged, the optimal policy is to select the action with the highest expected value at each state.
What should we do if we find ourselves in (2,1)?
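Reading a policy off converged values can be sketched as follows. The function name `extract_policy`, the corridor MDP, and the value table are hypothetical stand-ins (the values are what a discount of 0.9 would give for that corridor); terminal states are skipped on the assumption that no action is needed there.

```python
def extract_policy(states, actions, transition, values, terminals):
    """Map each non-terminal state to the action with the highest expected value."""
    policy = {}
    for s in states:
        if s in terminals:
            continue  # assumption: no action is chosen in a terminal state

        def ev(a, s=s):
            return sum(p * values[ns] for ns, p in transition(s, a).items())

        policy[s] = max(actions, key=ev)
    return policy


# Hypothetical corridor: cells 0, 1, 2; cell 2 is terminal.
def corridor_transition(s, a):
    """Deterministic moves left or right, clamped to the corridor."""
    ns = s + 1 if a == "r" else s - 1
    return {min(max(ns, 0), 2): 1.0}


converged = {0: 0.81, 1: 0.9, 2: 1.0}  # assumed already-converged values
policy = extract_policy([0, 1, 2], ["l", "r"], corridor_transition,
                        converged, terminals={2})
```

The result is a plain dictionary from state to action, which matches the earlier description of a policy as a function that maps states to actions.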
