Probabilistic Robotics: Probabilistic Planning and MDPs


Virgil Eaton
5 years ago

1 Probabilistic Robotics: Probabilistic Planning and MDPs Slide credits: Wolfram Burgard, Dieter Fox, Cyrill Stachniss, Giorgio Grisetti, Maren Bennewitz, Christian Plagemann, Dirk Haehnel, Mike Montemerlo, Nick Roy, Kai Arras, Patrick Pfaff and others SA-1

2 Planning: Classical Situation [Figure: map with locations labeled "heaven" and "hell"] World deterministic, state observable.

3 MDP-Style Planning [Figure: map with locations labeled "heaven" and "hell"] Policy, also called universal plan or navigation function. World stochastic, state observable. [Koditschek 87, Barto et al. 89]

4 Stochastic, Partially Observable [Figure: map with locations labeled "heaven?" and "hell?", plus a sign] [Sondik 72] [Littman/Cassandra/Kaelbling 97]

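5 Stochastic, Partially Observable [Figure labels: two maps with "heaven" and "hell" swapped, each with a sign]

6 Stochastic, Partially Observable [Figure labels: start location, "heaven" and "hell" (partly marked "?"), signs; the two cases are marked 50% / 50%]

7 Stochastic, Partially Observable [Figure labels: two start locations, "heaven" and "hell" marked "?", signs; the two cases are marked 50% / 50%]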

8 A Quiz

# states         | sensors         | actions       | size of belief space?
3                | perfect         | deterministic | 3: s1, s2, s3
3                | perfect         | stochastic    | 3: s1, s2, s3
3                | abstract states | deterministic | 7: s1, s2, s3, s12, s13, s23, s123
3                | stochastic      | deterministic | 2-dim continuous: p(s=s1), p(s=s2)
3                | none            | stochastic    | 2-dim continuous: p(s=s1), p(s=s2)
1-dim continuous | stochastic      | deterministic | ∞-dim continuous
1-dim continuous | stochastic      | stochastic    | ∞-dim continuous
∞-dim continuous | stochastic      | stochastic    | aargh!

9 MDP Planning Solution for planning problems with noisy controls and perfect perception. Generates a universal plan (= policy).

10 What is the problem? Consider a non-deterministic robot/environment. Actions have the desired outcome with a probability less than 1. What is the best action for a robot under this constraint? Example: a mobile robot does not perform the desired action exactly. Uncertainty about performing actions!

11 Example (1) Bumping into a wall reflects the robot. Reward for free cells (travel cost). What is the best way to reach the cell labeled +1 without moving to the cell labeled -1?

12 Example (2) Deterministic Transition Model: move on the shortest path!

13 Example (3) But now consider the non-deterministic transition model (N / E / S / W): (desired action) What is now the best way?

14 Example (4) Use a longer path with a lower probability of moving to the cell labeled -1. This path has the highest overall utility!

15 Utility and Policy Compute a utility for every state: how useful is this state for the overall task? A policy is a complete mapping from states to actions ("In which state should I perform which action?").

16 Markov Decision Problem (MDP) Compute the optimal policy in an accessible, stochastic environment with a known transition model. Markov property: the transition probabilities depend only on the current state and not on the history of predecessor states. Not every decision problem is an MDP.

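17 Markov Decision Process (MDP) [Figure: example MDP drawn as a graph of states (including s3 and s4) with per-state rewards (r=1, r=0, r=-10, among others) and transition probabilities on the edges (0.7, 0.2, among others); the remaining labels are not recoverable]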

18 Markov Decision Process (MDP) Given: states $x$, actions $u$, transition probabilities $p(x' \mid u, x)$, reward / payoff function $r(x, u)$. Wanted: policy $\pi(x)$ that maximizes the future expected reward.
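The ingredients listed on this slide can be written down as a small data structure. The following is a minimal sketch for a finite MDP; the class name, the field names, and the two-state example are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass


@dataclass
class TabularMDP:
    """Finite MDP. All names here are illustrative assumptions."""
    states: list
    actions: list
    # transitions[(x, u)] maps each successor x' to p(x' | u, x)
    transitions: dict
    # rewards[(x, u)] is the payoff r(x, u)
    rewards: dict

    def is_valid(self):
        """Check that every p(. | u, x) sums to 1."""
        for x in self.states:
            for u in self.actions:
                total = sum(self.transitions[(x, u)].values())
                if abs(total - 1.0) > 1e-9:
                    return False
        return True


# Hypothetical example: "go" usually switches state but sometimes slips.
mdp = TabularMDP(
    states=["a", "b"],
    actions=["stay", "go"],
    transitions={
        ("a", "stay"): {"a": 1.0},
        ("a", "go"): {"a": 0.2, "b": 0.8},
        ("b", "stay"): {"b": 1.0},
        ("b", "go"): {"a": 0.8, "b": 0.2},
    },
    rewards={
        ("a", "stay"): 0.0, ("a", "go"): -1.0,
        ("b", "stay"): 1.0, ("b", "go"): -1.0,
    },
)
```

A policy for the fully observable case is then just a dictionary from states to actions.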

19 Rewards and Policies
Policy (general case): $\pi: z_{1:t-1}, u_{1:t-1} \rightarrow u_t$
Policy (fully observable case): $\pi: x_t \rightarrow u_t$
Expected cumulative payoff: $R_T = E\left[ \sum_{\tau=1}^{T} \gamma^{\tau} r_{t+\tau} \right]$
T=1: greedy policy
T>1: finite horizon case, typically no discount
T=∞: infinite-horizon case, finite reward if discount $\gamma < 1$
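To make the cumulative payoff concrete, here is a small sketch that evaluates it for sampled reward sequences. It assumes the payoff is the discounted sum of the rewards at steps 1..T, with discount factor gamma raised to the step index; the function names are hypothetical.

```python
def cumulative_payoff(rewards, gamma=1.0):
    """Discounted payoff of one run: sum over tau = 1..T of gamma**tau * r_tau."""
    return sum(gamma ** tau * r for tau, r in enumerate(rewards, start=1))


def expected_payoff(runs, gamma=1.0):
    """Monte Carlo estimate of the expectation: average over sampled runs."""
    return sum(cumulative_payoff(run, gamma) for run in runs) / len(runs)


# Finite horizon, no discount: the rewards simply add up.
undiscounted = cumulative_payoff([2.0, 3.0])

# With gamma < 1, later rewards count less, which keeps the sum bounded
# as the horizon grows (as long as the rewards themselves are bounded).
discounted = cumulative_payoff([1.0, 1.0, 1.0], gamma=0.5)
```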

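20 Policies contd.
Expected cumulative payoff of policy: $R_T^{\pi}(x_t) = E\left[ \sum_{\tau=1}^{T} \gamma^{\tau} r_{t+\tau} \mid u_{t+\tau} = \pi(z_{1:t+\tau-1}, u_{1:t+\tau-1}) \right]$
Optimal policy: $\pi^* = \operatorname{argmax}_{\pi} R_T^{\pi}(x_t)$
1-step optimal policy: $\pi_1(x) = \operatorname{argmax}_u r(x, u)$
Value function of 1-step optimal policy: $V_1(x) = \gamma \max_u r(x, u)$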

21 2-step Policies
Optimal policy: $\pi_2(x) = \operatorname{argmax}_u \left[ r(x, u) + \int V_1(x') \, p(x' \mid u, x) \, dx' \right]$
Value function: $V_2(x) = \gamma \max_u \left[ r(x, u) + \int V_1(x') \, p(x' \mid u, x) \, dx' \right]$

22 T-step Policies
Optimal policy: $\pi_T(x) = \operatorname{argmax}_u \left[ r(x, u) + \int V_{T-1}(x') \, p(x' \mid u, x) \, dx' \right]$
Value function: $V_T(x) = \gamma \max_u \left[ r(x, u) + \int V_{T-1}(x') \, p(x' \mid u, x) \, dx' \right]$

23 Infinite Horizon
Optimal policy: $V_{\infty}(x) = \gamma \max_u \left[ r(x, u) + \int V_{\infty}(x') \, p(x' \mid u, x) \, dx' \right]$
Bellman equation. Its fixed point yields the optimal policy. Necessary and sufficient condition.

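24 Value Iteration
for all x do
    $\hat{V}(x) \leftarrow r_{\min}$
endfor
repeat until convergence
    for all x do
        $\hat{V}(x) \leftarrow \gamma \max_u \left[ r(x, u) + \int \hat{V}(x') \, p(x' \mid u, x) \, dx' \right]$
    endfor
endrepeat
$\pi(x) = \operatorname{argmax}_u \left[ r(x, u) + \int \hat{V}(x') \, p(x' \mid u, x) \, dx' \right]$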
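For a finite state space the integral becomes a sum, and the value iteration loop can be sketched in a few lines of Python. This is an illustration under assumptions, not the slides' implementation: it places the discount factor outside the max, updates all states synchronously rather than in place, stops when the largest change drops below a tolerance, and uses a hypothetical two-state MDP.

```python
def value_iteration(states, actions, p, r, gamma=0.9, tol=1e-6):
    """p[(x, u)] -> {x': prob}, r[(x, u)] -> payoff. Returns (V, policy)."""

    def backup(V, x, u):
        # r(x, u) plus the expected value of the successor state
        return r[(x, u)] + sum(prob * V[x2] for x2, prob in p[(x, u)].items())

    r_min = min(r.values())
    V = {x: r_min for x in states}  # initialise with the smallest reward
    while True:
        V_new = {x: gamma * max(backup(V, x, u) for u in actions)
                 for x in states}
        delta = max(abs(V_new[x] - V[x]) for x in states)
        V = V_new
        if delta < tol:  # assumed convergence test
            break
    policy = {x: max(actions, key=lambda u: backup(V, x, u)) for x in states}
    return V, policy


# Hypothetical MDP: staying in "b" pays 1 per step, moving costs 1.
states = ["a", "b"]
actions = ["stay", "go"]
p = {("a", "stay"): {"a": 1.0}, ("a", "go"): {"a": 0.2, "b": 0.8},
     ("b", "stay"): {"b": 1.0}, ("b", "go"): {"a": 0.8, "b": 0.2}}
r = {("a", "stay"): 0.0, ("a", "go"): -1.0,
     ("b", "stay"): 1.0, ("b", "go"): -1.0}

V, policy = value_iteration(states, actions, p, r)
```

In this toy problem the resulting policy moves from "a" to "b" and then stays there, because the one-time travel cost is outweighed by the recurring reward.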

25 Value Iteration for Motion Planning

26 The optimal Policy Probability of reaching state j from state i with action a. Utility of state j. If we know the utilities, we can easily compute the optimal policy. The problem is to compute the correct utilities for all states.
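The two quantities named on this slide combine into an argmax over expected utilities. The formula itself did not survive in this transcript, so the symbols below are an assumption: $M^a_{ij}$ stands for the probability of reaching state $j$ from state $i$ with action $a$, and $U(j)$ for the utility of state $j$.

```latex
\pi^*(i) = \operatorname*{argmax}_a \sum_j M^a_{ij} \, U(j)
```

Read this way, the policy picks in each state the action whose successor states have the highest expected utility.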

27 The Utility (1) To compute the utility of a state we have to consider a tree of states. The utility of a state depends on the utility of all successor states. Not all utility functions can be used. The utility function must have the property of separability. E.g. additive utility functions: (R = reward function)

28 The Utility (2) The utility can be expressed similarly to the policy function: The reward R(i) is the utility of the state itself (without considering the successors).

29 Dynamic Programming This utility function is the basis for dynamic programming, a fast way to solve n-step decision problems. Naive solution: $O(|A|^n)$. Dynamic programming: $O(n |A| |S|)$. But what is the correct value of n? If the graph has loops: ???
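A quick calculation shows how large the gap between the two bounds is. The sketch below reads the bounds as |A|^n for the naive enumeration of action sequences and n * |A| * |S| for dynamic programming; the problem size (4 actions, 11 states, 20 steps) is a made-up example.

```python
def naive_cost(num_actions, n):
    """Number of n-step action sequences a naive search would enumerate."""
    return num_actions ** n


def dp_cost(num_actions, num_states, n):
    """Number of (step, state, action) backups done by dynamic programming."""
    return n * num_actions * num_states


# Hypothetical problem size, chosen only for illustration.
naive = naive_cost(num_actions=4, n=20)
dp = dp_cost(num_actions=4, num_states=11, n=20)
```

The naive count grows exponentially in n, while the dynamic programming count grows only linearly.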

30 Iterative Computation Idea: The Utility is computed iteratively: Optimal utility: Abort, if change in the utility is below a threshold.
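The update and stopping rule can be written out as follows. The slide's own formulas are missing from this transcript, so this is a reconstruction under assumed notation: $U_t(i)$ is the utility estimate of state $i$ after $t$ iterations, $R(i)$ is the reward of state $i$, $M^a_{ij}$ is the probability of reaching $j$ from $i$ with action $a$, and $\epsilon$ is a threshold chosen by the user.

```latex
U_{t+1}(i) = R(i) + \max_a \sum_j M^a_{ij} \, U_t(j),
\qquad
\text{stop when } \max_i \left| U_{t+1}(i) - U_t(i) \right| < \epsilon
```

The maximum-change test is only one possible reading of "change in the utility"; an RMS-based test would fit the text equally well.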

31 The Value Iteration Algorithm

32 Value Iteration Example Calculate utility of the center cell (desired action=north) u=10 u=5 r=1 u=-8 u=1 Transition Model State Space (u=utility, r=reward)

33 Value Iteration Example u=10 u=5 r=1 u=-8 u=1
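The center-cell update can be reproduced numerically, but only under assumptions, because the transition model and the cell layout are not recoverable from this transcript. The sketch assumes the neighbors are north u=10, west u=5, east u=-8, south u=1; that the desired action succeeds with probability 0.8 and the robot slips to each perpendicular side with probability 0.1; and that the utility is the cell reward plus the expected neighbor utility, without discounting.

```python
# Utilities and reward as printed on the slide; the compass positions
# assigned to them here are an ASSUMPTION.
neighbor_utility = {"north": 10.0, "west": 5.0, "east": -8.0, "south": 1.0}
center_reward = 1.0

# ASSUMED transition model: 0.8 for the desired direction,
# 0.1 for each of the two perpendicular directions.
P_DESIRED = 0.8
P_SIDE = 0.1
SIDES = {
    "north": ("west", "east"),
    "south": ("east", "west"),
    "east": ("north", "south"),
    "west": ("south", "north"),
}


def center_utility(action):
    """Reward of the center cell plus expected utility of its successor."""
    left, right = SIDES[action]
    expected = (P_DESIRED * neighbor_utility[action]
                + P_SIDE * neighbor_utility[left]
                + P_SIDE * neighbor_utility[right])
    return center_reward + expected


u_north = center_utility("north")  # 1 + 0.8*10 + 0.1*5 + 0.1*(-8)
```

Under these assumptions the desired action north gives the center cell a utility of about 8.7.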

34 Value Iteration: Example

35 Another Example Map Value Function and Plan

36 Another Example Map Value Function and Plan

37 From Utilities to Policies Value iteration computes the optimal utility function. The optimal policy can easily be computed from the optimal utility values. Value iteration is an optimal solution to the Markov Decision Problem!

38 Convergence ("close-enough") Different possibilities to detect convergence: RMS error (root mean square error), policy loss.

39 Convergence Criteria: RMS CLOSE-ENOUGH(U, U') in the algorithm can be formulated by:
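One way to phrase the RMS test in code is sketched below. It assumes utilities are stored in dictionaries keyed by state and that "close enough" means the RMS difference between two successive utility tables falls below a user-chosen threshold; the function names and the default threshold are hypothetical.

```python
import math


def rms_error(U, U_new):
    """Root mean square difference between two utility tables."""
    squared = sum((U_new[s] - U[s]) ** 2 for s in U)
    return math.sqrt(squared / len(U))


def close_enough(U, U_new, epsilon=1e-3):
    """True if the RMS change is below the (assumed) threshold epsilon."""
    return rms_error(U, U_new) < epsilon


# Two hypothetical utility tables over states "a" and "b".
before = {"a": 0.0, "b": 0.0}
after = {"a": 3.0, "b": 4.0}
error = rms_error(before, after)  # sqrt((9 + 16) / 2)
```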

40 Example: RMS-Convergence

41 Example: Value Iteration 1. The given environment.

42 Example: Value Iteration 1. The given environment. 2. Calculate Utilities.

43 Example: Value Iteration 1. The given environment. 2. Calculate Utilities. 3. Extract optimal policy.

44 Example: Value Iteration 1. The given environment. 2. Calculate Utilities. 3. Extract optimal policy. 4. Execute actions.

45 Example: Value Iteration The utilities. The optimal policy. (3,2) has a higher utility than (2,3). Why does the policy of (3,3) point to the left?

46 Example: Value Iteration The utilities. The optimal policy. (3,2) has a higher utility than (2,3). Why does the policy of (3,3) point to the left? Because the policy is not the gradient! It is:

47 Convergence of Policy and Utilities In practice, the policy converges faster than the utility values. Once the relations between the utilities are correct, the policy often does not change anymore (because of the argmax). Is there an algorithm to compute the optimal policy faster?

48 Policy Iteration Idea for faster convergence of the policy: 1. Start with an arbitrary policy. 2. Calculate the utilities based on the current policy. 3. Update the policy based on the policy formula. 4. Repeat steps 2 and 3 until the policy is stable.
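The four steps above can be sketched for a finite MDP as follows. This is an illustration under assumptions rather than the algorithm from the slides: step 2 is approximated by a fixed number of evaluation sweeps instead of solving a linear system, the discount factor is applied outside the bracket, and the two-state MDP is hypothetical.

```python
def policy_iteration(states, actions, p, r, gamma=0.9, eval_sweeps=200):
    """p[(x, u)] -> {x': prob}, r[(x, u)] -> payoff. Returns (policy, V)."""

    def q(V, x, u):
        # Discounted payoff of taking u in x, then following values V.
        expected = sum(prob * V[x2] for x2, prob in p[(x, u)].items())
        return gamma * (r[(x, u)] + expected)

    policy = {x: actions[0] for x in states}          # step 1
    while True:
        V = {x: 0.0 for x in states}
        for _ in range(eval_sweeps):                  # step 2 (approximate)
            V = {x: q(V, x, policy[x]) for x in states}
        new_policy = {x: max(actions, key=lambda u: q(V, x, u))
                      for x in states}                # step 3
        if new_policy == policy:                      # step 4: stable
            return policy, V
        policy = new_policy


# Hypothetical MDP: staying in "b" pays 1 per step, moving costs 1.
states = ["a", "b"]
actions = ["stay", "go"]
p = {("a", "stay"): {"a": 1.0}, ("a", "go"): {"a": 0.2, "b": 0.8},
     ("b", "stay"): {"b": 1.0}, ("b", "go"): {"a": 0.8, "b": 0.2}}
r = {("a", "stay"): 0.0, ("a", "go"): -1.0,
     ("b", "stay"): 1.0, ("b", "go"): -1.0}

policy, V = policy_iteration(states, actions, p, r)
```

In this toy problem the policy stabilizes after one improvement step, which illustrates why the policy can settle well before the utility values do.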

49 The Policy Iteration Algorithm Value Determination


POMDPs: Partially Observable Markov Decision Processes Advanced AI Wolfram Burgard Types of Planning Problems Classical Planning State observable Action Model Deterministic, accurate MDPs observable stochastic